Calculating BLEU Scores with the API
If you have a Python environment and some reference documents handy, you can use the API to measure Lilt's translation quality yourself, for both baseline and adapted models. This can be useful for evaluating your Lilt Memories, for example to determine:
- Which Memories perform best.
- Which Memories are best suited for a given translation project.
This article discusses the use of the BLEU algorithm as an objective metric for calculating translation quality.
The BLEU algorithm
Take a look at Wikipedia’s description of the BLEU algorithm:
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. [...]
Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. [...]
BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1.
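As a concrete illustration, the SacreBLEU package used later in this guide can score a single segment against its references. Note that SacreBLEU reports BLEU scaled to 0–100 rather than 0–1. The strings below are purely illustrative:

import sacrebleu

# One candidate (machine) translation and two human reference translations
# for the same source segment; the strings are made up for illustration.
hypothesis = "The cat sat on the mat."
references = [
    "The cat sat on the mat.",
    "There is a cat on the mat.",
]

# A hypothesis identical to one of the references scores 100 (i.e. 1.0 scaled);
# less similar hypotheses score lower.
score = sacrebleu.sentence_score(hypothesis, references)
print(score.score)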
Requirements
You will need:
- An active Lilt account with an API key.
- A Python environment with SacreBLEU installed (for example, via pip install sacrebleu).
- A reference set of one or more documents for which to calculate the average BLEU score across segments. See the section “Preparing the evaluation set” below for more information.
Procedure
Preparing the evaluation set
Select one or more reference documents as your evaluation set. Each segment in your evaluation set must have at least one human reference translation. Preferably, each segment has multiple reference translations to increase the robustness of the BLEU metric.
Documents in the evaluation set:
- must have high-quality human reference translations
- must not have been uploaded to or translated with Lilt, or any competing engines you are evaluating against
- should be representative of the type of text you usually translate
- should contain 1,000–3,000 segments, as evaluation sets that are too small lead to unreliable metrics
Generating Lilt output translations
We will compare Lilt's output translations against the human reference translations. This guide assumes you are using the API.
First, we must draw a distinction between adapted and unadapted machine translation models, as BLEU expectations differ.
- Unadapted: The default model that is created for a specific language pair when you create a project in Lilt using a default memory. It is pristine, in the sense that you have not done any translation with it or uploaded any TMX files. The expected BLEU scores for an unadapted model will be lower.
- Adapted: If you have uploaded TMX files or translated and confirmed segments within the project, the base model will have adapted to those translations. The expected BLEU scores for this adapted model will be higher.
Decide which type of model to generate BLEU scores against. We recommend doing both and comparing the BLEU scores of an unadapted and adapted model to get a sense of the quality increase that adaptation provides.
Choose a project
First, ensure you have a project created in Lilt in the language pair of the document: https://lilt.com/docs/api#tag-Projects
[Adapted models only]: Use a Lilt Memory on which a reasonable number of documents with content similar to your reference document have been translated. Alternatively, use a project with a standard Lilt Memory that has been updated with a TMX file containing segments similar to the content in the reference document. This Lilt Memory should have been given sufficient time to adapt to the TMX file.
Segment-by-segment
Run the translate endpoint on all segments in the reference document.
No matter how you call the API, be sure you can later match the reference segments to the Lilt translation segments. This is essential during scoring.
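Below is a minimal sketch of this loop using the requests library. The base URL, endpoint path, parameter names (memory_id, source), and basic-auth scheme are assumptions drawn from the public API reference, so verify them against https://lilt.com/docs/api before use; the API key, memory ID, and source segments are placeholders.

import requests

API_KEY = "YOUR_LILT_API_KEY"     # placeholder: your Lilt API key
MEMORY_ID = 12345                 # placeholder: the Memory attached to your project
BASE_URL = "https://lilt.com/2"   # assumed REST base URL; confirm in the API docs

def translate_segment(source_text):
    # Assumed endpoint and parameter names; adjust them to match the
    # translate endpoint documented at https://lilt.com/docs/api.
    response = requests.get(
        f"{BASE_URL}/translate",
        params={"memory_id": MEMORY_ID, "source": source_text},
        auth=(API_KEY, API_KEY),  # assumed: basic auth with the key in both fields
    )
    response.raise_for_status()
    return response.json()

# Translate the reference document's source segments in order, so that the
# n-th translation always corresponds to the n-th reference segment.
source_segments = ["First source sentence.", "Second source sentence."]  # placeholders
translations = [translate_segment(s) for s in source_segments]

Extract the translated text from each JSON response and store it in source order; the exact response structure is described in the API reference.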
Batch translation
You may also use batch translation instead of translating segment by segment. The translation output from Lilt, and therefore the resulting BLEU score, will be equivalent. Follow these steps to batch translate (a rough sketch follows the list):
- Upload a document.
- Run pre-translation on the document.
- Save the output segments.
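The sketch below outlines these three steps with the requests library. The document endpoint paths, parameters, and response fields are assumptions, not confirmed API details; map each step onto the Documents endpoints in the API reference (https://lilt.com/docs/api) before relying on it.

import requests

API_KEY = "YOUR_LILT_API_KEY"     # placeholder
PROJECT_ID = 123                  # placeholder: a project in the right language pair
BASE_URL = "https://lilt.com/2"   # assumed REST base URL; confirm in the API docs
AUTH = (API_KEY, API_KEY)         # assumed: basic auth with the key in both fields

# 1. Upload the source document to the project (assumed endpoint and parameters).
with open("reference_source.txt", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/documents/files",
        params={"project_id": PROJECT_ID, "name": "reference_source.txt"},
        data=f,
        auth=AUTH,
    )
upload.raise_for_status()
document_id = upload.json()["id"]  # assumed response field

# 2. Run pre-translation on the uploaded document (assumed endpoint).
requests.get(
    f"{BASE_URL}/documents/pretranslate",
    params={"id": document_id},
    auth=AUTH,
).raise_for_status()

# 3. Once pre-translation has finished, download the document's segments and
#    save them one per line, in source order (see the formatting step below).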
Calculating the BLEU score against the reference translation
First, you must format the output and reference translations so they can be easily processed with the Python package SacreBLEU. Both output and reference translations:
- must be in plain-text format and UTF-8 encoded
- must have one segment per line
- must perfectly align; that is, each segment in the reference file must match one-to-one on the same line as the corresponding segment in the translation output
It is possible to concatenate multiple output and reference translation files into a single file, provided they fulfill the requirements above.
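As a sketch, the following Python snippet writes aligned output and reference files in the required format; the segment lists are placeholders for the translations you collected and their human references.

# Write the collected segments to aligned, UTF-8 plain-text files: line N of
# translated_segments.txt must correspond to line N of reference_segments.txt.
translated_segments = ["Translated segment 1.", "Translated segment 2."]  # placeholders
reference_segments = ["Reference segment 1.", "Reference segment 2."]     # placeholders

assert len(translated_segments) == len(reference_segments)  # files must align line by line

with open("translated_segments.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(s.strip() for s in translated_segments) + "\n")

with open("reference_segments.txt", "w", encoding="utf-8") as ref:
    ref.write("\n".join(s.strip() for s in reference_segments) + "\n")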
To calculate the BLEU score, run:
cat translated_segments.txt | sacrebleu [-tok zh] path/to/reference_segments.txt
Note that when scoring Chinese output, the optional flag -tok zh should be passed so that SacreBLEU tokenizes at the character level; for Japanese output, use a Japanese tokenizer instead (for example, -tok ja-mecab, which requires the MeCab dependencies).
Details about advanced SacreBLEU usage can be found at: https://github.com/mjpost/sacreBLEU
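If you prefer to compute the score in Python rather than at the shell, SacreBLEU also exposes a library interface. A minimal sketch, using the file names from the command above:

import sacrebleu

# Read the aligned, UTF-8, one-segment-per-line files produced earlier.
with open("translated_segments.txt", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("reference_segments.txt", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# corpus_bleu takes the hypotheses plus a list of reference sets; with one
# reference translation per segment, that is a one-element outer list.
# For Chinese output, pass tokenize="zh" (the library analogue of -tok zh).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)  # corpus-level BLEU on the 0-100 scale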
Limitations
The BLEU score is not the only metric of translation quality, and it has its pitfalls. In particular, BLEU scores are computed against human reference translations, which differ from translator to translator. BLEU scores therefore give a general sense of how good a translation is, but they will never be a perfect assessment of translation quality.