
Calculating BLEU Scores with the API

If you have a Python environment and some reference documents handy, you can use the API to self-measure Lilt's translation quality for both baseline and adapted models. This can be useful for determining the performance of your Lilt Models in terms of:

  • Which Models perform best.
  • Which Models are best suited for a given translation project.

This article discusses the use of the BLEU algorithm as an objective metric for calculating translation quality.

The BLEU algorithm

Take a look at Wikipedia’s description of the BLEU algorithm:

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. [...]

Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. [...]

BLEU's output is always a number between 0 and 1. This value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1.
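To make the description above concrete, here is a minimal sketch of a sentence-level BLEU computation against a single reference. The function name `simple_bleu` and the whitespace tokenization are assumptions made for illustration. The sketch omits smoothing, multiple references, and corpus-level aggregation, so its numbers will not match SacreBLEU's output; it is meant for intuition only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Illustrative single-reference BLEU on whitespace tokens, in the 0..1 range."""
    cand = candidate.split()
    ref = reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Modified precision: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(simple_bleu("the cat sat on the mat", "the cat sat on a mat"))
```

An identical candidate and reference score 1, a candidate sharing no words with the reference scores 0, and the example in the `__main__` block lands in between at roughly 0.54.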


You will need:

  • An active Lilt account with an API key.
  • A Python environment with SacreBLEU installed.
  • A reference set of document(s) for which to calculate the average BLEU score across segments. See section “Preparing the evaluation set” below for more information.


Preparing the evaluation set

Select one or more reference documents as your evaluation set. Each segment in your evaluation set must have at least one human reference translation. Ideally, each segment has multiple reference translations, which makes the BLEU metric more robust.

Documents in the evaluation set:

  • must have high-quality human reference translations
  • must not have been uploaded to or translated with Lilt, or any competing engines you are evaluating against
  • should be representative of the type of text you usually translate
  • should contain 1000 – 3000 segments, because evaluation sets that are too small produce unreliable metrics

Generating Lilt output translations

We will be comparing Lilt’s output translation against human reference translations. In this guide, we assume the use of the API.

First, we must draw a distinction between adapted and unadapted machine translation models, as BLEU expectations differ.

  • Unadapted: The default models that are first created for a specific language pair when you create a project in Lilt using a default data source. They are pristine, in the sense that you have not done any translation with them or uploaded any TMX files. The expected BLEU scores for this unadapted model will be lower.
  • Adapted: If you have uploaded TMX files or translated and confirmed segments within the project, the base model will have adapted to those translations. The expected BLEU scores for this adapted model will be higher.

Decide which type of model to generate BLEU scores against. We recommend doing both and comparing the BLEU scores of an unadapted and adapted model to get a sense of the quality increase that adaptation provides.

Choose a project

First, ensure you have a project created in Lilt in the language pair of the document:

[Adapted models only]: Use a Lilt Data Source on which a reasonable number of documents with content similar to your reference document have been translated. Alternatively, use a project with a standard Lilt Data Source that has been updated with a TMX file containing segments similar to the content in the reference document. This Lilt Data Source should have been given sufficient time to adapt to the TMX file.


Run the translate endpoint on all segments in the reference document.

No matter how you call the API, be sure you can later match the reference segments to the Lilt translation segments. This is essential during scoring.
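One way to keep the two sides matched is to translate the segments in their original order and store each translation next to its source index. The sketch below assumes a hypothetical `translate_segment` callable that wraps whichever API call you use; it does not show an actual Lilt endpoint, and the function names are illustrative.

```python
from typing import Callable, Iterable, List, Tuple


def translate_in_order(
    source_segments: Iterable[str],
    translate_segment: Callable[[str], str],  # hypothetical wrapper around your API call
) -> List[Tuple[int, str, str]]:
    """Return (index, source, translation) rows in the original segment order."""
    rows = []
    for index, source in enumerate(source_segments):
        translation = translate_segment(source)
        # Collapse line breaks and repeated whitespace so that every
        # translation fits on exactly one line of the output file.
        rows.append((index, source, " ".join(translation.split())))
    return rows


def write_output(rows: List[Tuple[int, str, str]], path: str) -> None:
    """Write one translation per line, UTF-8 encoded, in row order."""
    with open(path, "w", encoding="utf-8") as handle:
        for _, _, translation in rows:
            handle.write(translation + "\n")
```

Because line N of the written file always corresponds to segment N of the input, the output can be scored directly against a reference file that lists its segments in the same order.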

Batch translation

You may also use batch translation instead of translating segment-by-segment. The translation output from Lilt, and therefore the resulting BLEU score, will be equivalent. Follow these steps to batch translate:

  1. Upload a document.
  2. Run pre-translation on the document.
  3. Save the output segments.

Calculating the BLEU score against the reference translation

First, you must format the output and reference translations so they can be easily processed with the Python package SacreBLEU. Both output and reference translations:

  • must be in plain text format and UTF8-encoded
  • must have one segment per line
  • must perfectly align; that is, each segment in the reference file must match one-to-one on the same line as the corresponding segment in the translation output

It is possible to concatenate multiple output and reference translation files into a single file, provided they fulfill the requirements above.
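Before scoring, it can help to check these requirements automatically. The following is a minimal sketch; the function names are illustrative, and treating an empty reference line as a problem is an assumption about what usually signals misalignment.

```python
from typing import List


def load_segments(path: str) -> List[str]:
    """Read one segment per line; raises UnicodeDecodeError if the file is not UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def find_alignment_problems(outputs: List[str], references: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the files look aligned."""
    problems = []
    if len(outputs) != len(references):
        problems.append(
            f"line count mismatch: {len(outputs)} output vs {len(references)} reference"
        )
    # zip stops at the shorter list, so this also works after a count mismatch.
    for line_number, (_, reference) in enumerate(zip(outputs, references), start=1):
        if not reference.strip():
            problems.append(f"empty reference segment on line {line_number}")
    return problems
```

A check like this only catches structural problems. It cannot tell whether line 10 of the output really is the translation of line 10 of the reference, so the ordering still has to be preserved when the translations are generated.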

To calculate the BLEU score, run:

cat translated_segments.txt | sacrebleu [-tok zh] path/to/reference_segments.txt

The square brackets mark the flag as optional and are not typed. Pass -tok zh when scoring Chinese or Japanese output, because these languages do not separate words with spaces.
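If you prefer to score from within Python, SacreBLEU also exposes a `corpus_bleu` function. The wrapper below is a minimal sketch: the name `corpus_bleu_score` is illustrative, and it assumes a single reference translation per segment. Note that SacreBLEU reports scores on a 0 to 100 scale rather than 0 to 1.

```python
from typing import Sequence


def corpus_bleu_score(
    outputs: Sequence[str],
    references: Sequence[str],
    tokenize: str = "13a",  # SacreBLEU's default tokenizer; use "zh" for Chinese
) -> float:
    """Corpus-level BLEU (0..100) for aligned outputs and single references."""
    if len(outputs) != len(references):
        raise ValueError(
            f"{len(outputs)} output segments vs {len(references)} reference segments"
        )
    # Imported here so the alignment check above runs even if the
    # package is missing (install it with: pip install sacrebleu).
    import sacrebleu

    # corpus_bleu expects a list of reference sets; with one reference
    # per segment that is a list containing a single list.
    result = sacrebleu.corpus_bleu(list(outputs), [list(references)], tokenize=tokenize)
    return result.score
```

Calling `corpus_bleu_score(outputs, references, tokenize="zh")` mirrors passing `-tok zh` on the command line.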

Details about advanced SacreBLEU usage can be found at:


The BLEU score is not the only metric of translation quality, and has limitations.

  • BLEU scores are computed against human reference translations, which differ from translator to translator, so they are not a fully objective assessment of translation quality.
  • BLEU scores fail to assess fluency, idiomatic expressions, and language subtleties which are essential for an accurate translation.