Tech

This is the index of the tech articles I have written, mainly about NLP and machine learning.


How Did We Improve An Algorithm Speed by 60 Times?
Recently, my mentee and I refactored a key algorithm in one of our projects and achieved a 60-fold speedup. Problem: We need to regenerate a Docx file based on the OCR result, aiming to match the original format. However, the OCR result does not contain the font size, and
A Rare libmagic Bug in Docker
We recently caught a rare bug. One of our services uploads files to our storage service; after uploading, it checks the file format and passes the extracted type to the next service. This file type inspection relies on python-magic. The bug occurs as it detected a
How Did We Save Our Video Classifier Cost by 84%
A recent project in our team was to improve the computing efficiency of a deep-learning-based video classifier. After carefully tuning the models and migrating to Kafka, we reduced the cost by 84%. Background: When we received the request, the model team had already developed a POC versio…
How Did We Improve Our Document Translator Speed by 20X?
My first project after joining this company was to help the team deliver a document translator. When I joined, the team had already built an MVP: it worked, but it was very slow. Translating a 50-page Microsoft Docx file took more than 10 minutes, which
Workarounds in BentoML 0.12
The following workarounds apply to BentoML 0.12. Given its rapid development, BentoML may resolve these issues in future releases, so these workarounds may only be useful for BentoML 0.12.1 or earlier versions.
Using Pandas as a Unified IO Tool
When I wrote my dissertation, I used Pandoc to convert my Markdown draft to the final PDF version. Pandoc is an extremely powerful and easy-to-use tool for file conversion, as shown in the following diagram. When we load data into Python, we have a similar need
Using CRF in Python
CRF (Conditional Random Fields) was a popular supervised learning method before deep learning took off, and it remains an easy-to-use and robust machine learning algorithm. We recently used it for NER (named entity recognition), and here is a brief summary of using CRF in Python.…
General Pipelines for Chinese NLP Engineering with Stanford NLP Software
With a focus on processing Chinese textual data, we have used different tools extensively, namely Jieba and Stanford NLP software. Each has its own advantages and drawbacks. To balance efficiency and accuracy, we eventually chose Stanford NLP software as our default toolset. Therefor…