How Did We Improve Our Document Translator Speed by 20X?
The first project after I joined this company was to help the team deliver a document translator. When I joined, the team had already built an MVP version: it worked, but it was very slow. Translating a 50-page Microsoft Docx file took more than 10 minutes, which was hardly acceptable in a production environment.
As an engineer with previous experience in NLP and deep learning, I found the root cause and offered to help fix the problem. First, let us go over the background of this project.
TL;DR
Background
Translating a 50-page Microsoft Docx file took more than 10 minutes, which was hardly acceptable in a production environment.
Fix
The fix reduces the network IO to the minimum, from O(n) to O(1), and takes advantage of batch processing on the GPU.
Result
The same 50-page Docx file took less than 30 seconds to process, compared to more than 10 minutes with the MVP version: a 20x speedup.
Takeaways
1. Developing a good machine learning model is hard, and productionizing a good model is even harder;
2. The gap between model development and backend development cannot be ignored, and filling this gap can significantly improve the final product.
Background
This document translator project consists of two parts:
- The machine translation service handles all translation jobs;
- The document parser service handles extracting content from a Docx file and generating a new Docx file from the translated results.
The machine translation service was a deep learning model built by the model team, and our backend team had already wrapped it as a RESTful API service running on a GPU instance.
Therefore, our team's scope was to build the document parser:
- extracting textual data from a Docx document and sending it to the machine translation service;
- using the translated results to generate a new Docx document with the same or similar formatting (fonts, headings, etc.).
The challenges on our end included:
- The fundamental data container in a Docx file is a set of XML files, and Microsoft uses its own XML schema for them. One typical example is that a single sentence is arbitrarily broken into multiple chunks wrapped in XML tags, without following the linguistic structure.
- The available open-source Docx extraction libraries each covered only part of the functionality we wanted, so we combined several libraries to solve our problems.
- Compared to extracting text from a document, generating the new document from the translated results was more challenging, since some translations have a different text direction from the source (e.g., from left-to-right to right-to-left).
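For the direction challenge specifically, here is a minimal sketch of flagging a WordprocessingML paragraph as right-to-left by adding a w:bidi element to its paragraph properties; the helper name and the use of lxml are illustrative assumptions, not our production code:

from lxml import etree

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def mark_paragraph_rtl(paragraph):
    # Ensure the paragraph has a <w:pPr> properties element as its first child,
    # then add <w:bidi/> so the paragraph is rendered right-to-left.
    pPr = paragraph.find(f'{{{W_NS}}}pPr')
    if pPr is None:
        pPr = etree.SubElement(paragraph, f'{{{W_NS}}}pPr')
        paragraph.insert(0, pPr)
    if pPr.find(f'{{{W_NS}}}bidi') is None:
        etree.SubElement(pPr, f'{{{W_NS}}}bidi')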
MVP solution
Due to the first challenge above, the MVP solution was: whenever we extract a chunk of text from the XML file, we send this chunk to the MT service; once we get the translation result, we insert it back into the original position, and then move on to the next chunk. In this way, the generated Docx keeps the same format as the original.
For example, a sentence like this: I had a nice lunch with Hummus.
is supposed to be wrapped as <p>I had a nice lunch with Hummus.</p>, so the extracted result matches the linguistic structure.
However, in a Docx, it may be wrapped as:
<p>I had a nice </p><p>lunch with Hummus.</p>
The extraction then yields two chunks:
I had a nice
and
lunch with Hummus.
Both are grammatically incomplete.
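To make the cost concrete, here is a minimal sketch of the MVP's per-chunk flow, assuming lxml-style text nodes and a single-sentence endpoint that returns a tgt field; the function name and endpoint shape are illustrative, not the actual MVP code:

import requests

def mvp_translate_document(text_nodes, url, lang_from='en', lang_to='fr'):
    # One synchronous HTTP request per extracted XML chunk: O(n) round trips.
    for node in text_nodes:
        if not node.text or not node.text.strip():
            continue
        resp = requests.post(url, json={'from_lang': lang_from,
                                        'to_lang': lang_to,
                                        'text': node.text})
        # Write the translation back into the same XML node, keeping the format.
        node.text = resp.json()['tgt']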
This solution has three obvious problems:
- The number of API requests is very high, considering how Microsoft breaks sentences into small chunks. One page might generate dozens of requests, and a 50-page document may result in hundreds or thousands of requests. This leads to a large amount of network IO, not to mention the accumulated network latency.
- We treated the MT service as a synchronous service, so all the processing was linear. Hence, we did not exploit the power of batch processing on the GPU.
- The extracted chunks do not follow the linguistic structure, which hurts the machine translation accuracy.
As a result, the MVP version of this service had very poor processing speed and poor translation quality.
The Fix
The fix involved two sides: the deep learning model side, and the doc parser side.
The deep learning side needed to update its IO interface:
- It needs to accept a list of text strings as input, as long as the list size is smaller than the batch size, so it can use batch processing on the GPU.
- Meanwhile, the output needs to be a list of text strings, keeping the same order as the input.
This fix is pretty straightforward, so we coordinated with the model team to handle it.
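For illustration, the batched request and response could look like the following; the field names (from_lang, to_lang, text, src, tgt) are assumptions carried through the rest of this post, not an official API specification:

# Assumed batched request body: one POST carries many sentences at once,
# up to the model's batch size.
request_body = {
    'from_lang': 'en',
    'to_lang': 'ar',
    'text': ['I had a nice lunch with Hummus.', 'See you tomorrow.'],
}

# Assumed batched response: one item per input sentence, in the same order.
response_body = [
    {'src': 'I had a nice lunch with Hummus.', 'tgt': '...'},
    {'src': 'See you tomorrow.', 'tgt': '...'},
]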
On the document parser side, the fix needs to follow the deep learning model's IO design:
1. We send batches of extracted strings according to the batch size limit;
2. To prepare the batched input, we need an algorithm that packs the extracted strings in order;
3. To extract strings from the Microsoft-customized XML, we need to extract the chunks and stitch them together according to the linguistic structure, then break them into sentences;
4. Finally, when we receive the output from the machine translation service, we need to put the translated results back with the original format if the text direction stays the same (e.g., from English to French), or with a similar format if the direction changes (e.g., from English to Arabic).
Step 2 echoes step 4: if we design a proper packing algorithm and data structure in step 2, then it becomes easier in step 4 to generate a document from the translated results.
The following code is a simplified demonstration of this packing algorithm:
1. We extract all the chunks from the Docx XML file at once, and then concatenate chunks to assemble complete sentences with a grammatically correct structure; a sketch of this stitching step is shown right below.
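Here is a minimal sketch of that stitching step, under the simplifying assumption that a sentence ends with '.', '!' or '?'. It keeps a mapping from each assembled sentence back to the XML elements it came from, which is what makes step 4 easy later; the helper name stitch_chunks is hypothetical:

SENTENCE_ENDINGS = ('.', '!', '?')

def stitch_chunks(elements):
    # Assemble complete sentences from Microsoft's arbitrary XML chunks,
    # remembering which elements each sentence came from so the translation
    # can be written back to the right places later.
    sentences, buffer_text, buffer_elements = [], '', []
    for element in elements:
        if not element.text:
            continue
        buffer_text += element.text
        buffer_elements.append(element)
        if buffer_text.rstrip().endswith(SENTENCE_ENDINGS):
            sentences.append((buffer_text, buffer_elements))
            buffer_text, buffer_elements = '', []
    if buffer_elements:  # trailing chunk without sentence-ending punctuation
        sentences.append((buffer_text, buffer_elements))
    return sentences  # list of (sentence_text, source_elements) pairs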
2. This batch_translator_core function prepares the input batches from the content extracted from a Docx file. It assembles the request body according to the data structure required by the machine translation service. Here each batch can take up to 1000 chunks.
import requests

def batch_translator_core(url, input_list, lang_from, lang_to, size=1000):
    # Split the extracted strings into batches of at most `size` items and
    # send one POST request per batch to the machine translation service.
    payload_slice = []
    for i in range(0, len(input_list), size):
        one_request = {'from_lang': lang_from, 'to_lang': lang_to, 'text': input_list[i:i + size]}
        payload_slice.append(one_request)
    response_list = [requests.post(url, json=load) for load in payload_slice]
    return response_list
3. This batch_translate function calls the above batch_translator_core, sends the requests to the machine translation API, and unpacks the output data from the API responses.
def batch_translate(seq, body_elements, lang_from='en', lang_to='ar', url='http://abc',
                    path='.//w:t', size=1000):
    # Collect the text-bearing nodes, either from a pre-assembled sequence or
    # directly from the document body via XPath. The `w:` prefix in `path` is
    # assumed to be registered on `body_elements` (as it is for python-docx
    # oxml elements).
    rec = seq if seq else body_elements.xpath(path)
    input_batch = [r.text for r in rec if r.text is not None and r.text.strip() != '']
    response_list = batch_translator_core(url, input_batch, lang_from, lang_to, size)
    response_list = [r.json() for r in response_list]
    # Flatten the per-batch responses and map each source string to its translation.
    flat_response_list = []
    for responses in response_list:
        flat_response_list += responses
    output_dict = {i['src']: i['tgt'] for i in flat_response_list}
    return output_dict
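As a usage sketch, assuming the python-docx library and a hypothetical service URL, the whole flow could be wired up like this (direction handling from step 4 is omitted for brevity):

from docx import Document

# Load the document, translate all text nodes in one batched pass, and write
# the translations back into the same nodes before saving a new file.
doc = Document('source.docx')
body = doc.element.body
translated = batch_translate(None, body, lang_from='en', lang_to='fr',
                             url='http://abc/translate')
for node in body.xpath('.//w:t'):
    if node.text in translated:
        node.text = translated[node.text]
doc.save('translated.docx')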
Result
With the IO fix on the machine translation service side and this packing algorithm on the document parser side, we observed a huge speed and accuracy improvement in the overall document translation service:
- The same 50-page Docx file took less than 30 seconds to process, compared to more than 10 minutes with the MVP version: a 20x speedup.
- Since we sent grammatically complete chunks, the machine translation accuracy was also improved.
To understand point 1 better, we can analyze the time complexity of the translation-request part. Let us assume that:
- The 95th percentile of a translation request takes 0.6 sec, and the longest translation request can take up to 2 sec;
- A 50-page Docx file has about 500 linguistic sentences, split into roughly 1000 XML chunks according to the Microsoft Docx XML schema.
Based on these two assumptions:
- The MVP version sends 1000 network requests to the translation API. Since these requests are synchronous, the computing time is linear, O(n). Hence, it takes at least 600 seconds to finish.
- The fix packs the 1000 XML chunks into 500 sentences, which fits into a single batch, so we can send all sentences in one network request. The computing time of batch processing on a GPU is effectively constant, O(1): the longest sentence in a batch determines the computing time, so processing one sentence and processing 500 sentences in one batch take similar time. Using the longest translation request time here, it takes about 2 sec to finish.
- The overall end-to-end time is reduced from more than 10 minutes to just under 30 seconds; the remaining time is spent extracting from and generating the Docx file (see the back-of-envelope calculation below).
- The fix remarkably reduces the network IO to the minimum, from O(n) to O(1), and takes advantage of batch processing on the GPU. Even though packing and unpacking batches add some overhead compared to the MVP version, this overhead is insignificant compared to the network IO.
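A quick back-of-envelope calculation using the assumed 0.6 sec p95 latency and 2 sec worst case from above; the ~28 sec attributed to Docx extraction and generation is an assumption backing out the observed end-to-end figure:

# MVP: 1000 synchronous requests at ~0.6 sec each (95th percentile).
mvp_seconds = 1000 * 0.6                      # 600 sec, i.e. at least 10 minutes

# Fix: a single batched request bounded by the slowest sentence (~2 sec),
# plus the Docx extraction/generation time (assumed to be the remaining
# ~28 sec of the observed <30 sec end-to-end figure).
fixed_seconds = 2 + 28                        # ~30 sec end to end

print(f"{mvp_seconds / fixed_seconds:.0f}x")  # ~20x speedup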
The lessons that can be taken from this improvement include:
- Developing a good machine learning model is hard, and productionizing a good model is even harder;
- The gap between model development and backend development cannot be ignored, and filling this gap can significantly improve the final product.