You’ve been using Lilt and are impressed with its translation capabilities. But just how good is Lilt’s translation? If you have some reference documents handy and a Python environment, you can measure the translation quality of the Lilt platform yourself via the API, using an objective metric: the BLEU algorithm.
The BLEU Algorithm
Take a look at Wikipedia’s description of the BLEU algorithm:
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. [...]
Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. [...]
BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1.
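To make the idea concrete, here is a minimal sketch of a BLEU-style score for a single segment against a single reference. It is a simplified illustration, assuming whitespace tokenization and no smoothing, and it is not the implementation that `sacrebleu` uses. The function names `ngrams` and `simple_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one segment and one reference, on a 0 to 1 scale.

    Assumptions: whitespace tokenization, no smoothing, a single reference.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, one zero precision makes the whole score zero.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat sat on the mat today", "the cat sat on the mat"))
```

An identical candidate scores 1, while the candidate with one extra word scores lower because some of its n-grams do not appear in the reference.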
You will need:
- An active Lilt account with an API key
- A Python environment with sacrebleu (https://github.com/mjpost/sacreBLEU) installed
- A reference set of document(s) for which to calculate the average BLEU score across segments. See the section “Preparing the evaluation set” below for more information.
Preparing the evaluation set
Select one or more reference documents as your evaluation set. Each segment in your evaluation set must have at least one human reference translation; preferably, each segment has multiple reference translations, which increases the robustness of the BLEU metric.
Documents in the evaluation set:
- Should be representative of the type of text you usually translate
- Must have high-quality human reference translations
- Should contain 1,000 to 3,000 segments, as evaluation sets that are too small lead to unreliable metrics
- Must not have been uploaded to or translated with Lilt, or any competing engines you are evaluating against
Generating Lilt output translations
We will be comparing Lilt’s output translation against human reference translations. In this guide, we assume the use of the API.
First, we must draw a distinction between adapted and unadapted machine translation models, as BLEU expectations differ.
- Unadapted: The default models that are first created for a specific language pair when you create a project in Lilt using a default memory. They are pristine, in the sense that you have not done any translation with them or uploaded any TMX files. The expected BLEU scores for this unadapted model will be lower.
- Adapted: Given a project, if you have uploaded TMX files or translated and confirmed segments within the project, the base model will have adapted to those translations. The expected BLEU scores for this adapted model will be higher.
Decide which type of model to generate BLEU scores against. We recommend doing both and comparing the BLEU scores of an unadapted and adapted model to get a sense of the quality increase that adaptation provides.
Choose a project
First, ensure you have a project created in Lilt in the language pair of the document: https://lilt.com/docs/api#tag-Projects.
[Adapted models only]: Use a memory on which a reasonable number of documents with content similar to your reference document have been translated. Alternatively, use a project with a standard memory that has been updated with a TMX file containing segments similar to the content in the reference document. This memory should have been given sufficient time to adapt to the TMX file.
Run the translate endpoint on all segments in the reference document: https://lilt.com/docs/api#tag-Translate. No matter how you call the API, be sure you can later match the reference segments to the Lilt translation segments. This is essential during scoring.
You may also use batch translation instead of translating segment-by-segment. The translation output from Lilt, and therefore the resulting BLEU score, will be equivalent.
To do this, first upload a document: https://lilt.com/docs/api#operation--documents-post. Then, run pre-translation on it: https://lilt.com/docs/api#operation--documents-pretranslate-post and save the output segments.
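Whichever route you take, the bookkeeping that keeps each reference segment next to its Lilt translation can be sketched as follows. This is only an illustration: `translate_aligned` is a hypothetical helper, and `translate_segment` is a placeholder for your own call to the Lilt API (the request details are in the API documentation linked above and are not reproduced here).

```python
from typing import Callable, Iterable, List, Tuple


def translate_aligned(
    sources: Iterable[str],
    references: Iterable[str],
    translate_segment: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Translate each source segment and keep it next to its reference.

    Returns a list of (source, reference, output) tuples in input order.
    """
    sources = list(sources)
    references = list(references)
    if len(sources) != len(references):
        raise ValueError(
            f"{len(sources)} source segments but {len(references)} references"
        )

    rows = []
    for source, reference in zip(sources, references):
        # Placeholder: replace translate_segment with your Lilt API call.
        output = translate_segment(source)
        rows.append((source, reference, output))
    return rows


# Usage with a stand-in translator (upper-casing instead of a real API call).
rows = translate_aligned(["hola", "adiós"], ["hello", "goodbye"], str.upper)
for source, reference, output in rows:
    print(source, "|", reference, "|", output)
```

Keeping the three values together in one row until the files are written makes it hard for the output and reference segments to drift out of order.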
Calculating the BLEU score against the reference translation
First, you must format the output and reference translations so that they can be easily processed with the Python package `sacrebleu`. Both output and reference translations:
- Must be in plain text format and UTF-8 encoded
- Must have one segment per line
- Must perfectly align; that is, each segment in the reference file must match one-to-one on the same line as the corresponding segment in the translation output
It is possible to concatenate multiple output and reference translation files into a single file, provided that they fulfill the requirements above.
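The requirements above could be enforced with a small script like the one below. It is a sketch under assumptions: the helper names `write_segments`, `read_lines`, and `check_alignment` are invented for this example, and collapsing internal whitespace is one possible way to guarantee one segment per line.

```python
from pathlib import Path
from typing import List, Sequence


def write_segments(path: str, segments: Sequence[str]) -> int:
    """Write one segment per line as UTF-8 and return the number of lines.

    Line breaks inside a segment are collapsed to spaces so that every
    segment occupies exactly one line.
    """
    lines = [" ".join(segment.split()) for segment in segments]
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def check_alignment(output_path: str, reference_path: str) -> int:
    """Raise ValueError if the two files differ in line count."""
    outputs = read_lines(output_path)
    references = read_lines(reference_path)
    if len(outputs) != len(references):
        raise ValueError(
            f"{output_path} has {len(outputs)} lines, "
            f"{reference_path} has {len(references)}"
        )
    return len(outputs)
```

Note that an equal line count is a necessary check, not a proof of alignment: it will not detect two segments that were swapped.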
To calculate the BLEU score, run:
cat translated_segments.txt | sacrebleu [-tok zh] path/to/reference_segments.txt
Note that when running on Chinese or Japanese output, the optional flag [-tok zh] should be passed.
Details about advanced `sacrebleu` usage can be found at https://github.com/mjpost/sacreBLEU.
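If you would rather score from Python than from the command line, `sacrebleu` also exposes a `corpus_bleu` function. The sketch below wraps it in a hypothetical helper, `score_segments`, and assumes the segments are already loaded as lists of strings. Keep in mind that `sacrebleu` reports scores on a 0 to 100 scale rather than 0 to 1.

```python
from typing import Optional, Sequence


def score_segments(
    hypotheses: Sequence[str],
    reference_sets: Sequence[Sequence[str]],
    tokenize: Optional[str] = None,
) -> float:
    """Return the corpus-level BLEU score (0 to 100) computed by sacrebleu.

    reference_sets holds one list per reference translation; each list must
    have the same length and order as hypotheses.
    """
    for index, references in enumerate(reference_sets):
        if len(references) != len(hypotheses):
            raise ValueError(
                f"reference set {index} has {len(references)} segments, "
                f"expected {len(hypotheses)}"
            )

    # Third-party dependency (pip install sacrebleu), imported here so the
    # length check above works even before the package is installed.
    import sacrebleu

    # Only pass a tokenizer when one is requested, e.g. tokenize="zh".
    options = {"tokenize": tokenize} if tokenize else {}
    return sacrebleu.corpus_bleu(list(hypotheses),
                                 [list(refs) for refs in reference_sets],
                                 **options).score


if __name__ == "__main__":
    outputs = ["the cat sat on the mat", "it is raining today"]
    references = ["the cat sat on the mat", "it rains today"]
    print(score_segments(outputs, [references]))
```

Passing more than one list in `reference_sets` is how multiple human reference translations per segment are supplied.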
The BLEU score is not the only metric of translation quality, and it has its pitfalls. In particular, BLEU scores are computed against human reference translations, which differ from translator to translator. BLEU scores therefore give a general sense of how good a translation is, but they will never be a perfect assessment of translation quality.