Translation quality is subjective—unlike, say, optical character recognition—but that characteristic doesn't preclude quantitative evaluation. The goal of any subjective evaluation should be to:
- Minimize the probability that random chance accounts for the outcome
- Maximize the agreement between raters in the experiment
These two goals are related. For example, suppose that ten people are rating dresses according to stylishness. You'd like to construct the evaluation such that the ten people are likely to agree on the level of stylishness. You'd also like to know that if you selected ten different people, you'd observe similar ratings.
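Inter-rater agreement can be quantified with a chance-corrected statistic such as Fleiss' kappa. The sketch below is a minimal, self-contained implementation under the assumption that every item (dress, or translated sentence) is rated by the same number of raters; the function name and data layout are illustrative, not from any particular library.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item Counters {category: count}.

    Assumes every item was rated by the same number of raters n.
    Returns 1.0 for perfect agreement, ~0.0 for chance-level agreement,
    and negative values for systematic disagreement.
    """
    N = len(ratings)                      # number of items
    n = sum(ratings[0].values())          # raters per item (assumed constant)
    categories = set()
    for r in ratings:
        categories |= set(r)
    # Overall proportion of assignments falling into each category.
    p = {c: sum(r.get(c, 0) for r in ratings) / (N * n) for c in categories}
    # Mean observed per-item agreement.
    P_bar = sum(
        (sum(cnt * cnt for cnt in r.values()) - n) / (n * (n - 1))
        for r in ratings
    ) / N
    # Expected agreement under chance.
    P_e = sum(v * v for v in p.values())
    return (P_bar - P_e) / (1 - P_e)
```

With two items, each rated "good" or "bad" unanimously by three raters, the statistic is 1.0; when the raters split, it drops toward or below zero.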
Ad hoc evaluation of translation quality—i.e., sampling a few sentences and having translators count errors—satisfies neither of these goals.
This guide explains how to select an MT system in a principled and repeatable way. There are three approaches, listed in increasing order of cost and time:
- Automatic quality evaluation — an algorithm determines or predicts the quality of the MT output. The most common use case is to compare the translation against a set of human-produced reference translations and calculate a score that reflects the quality. Automatic evaluation is cheap and fast.
- Human quality evaluation — experts (e.g., professional translators) examine the translated output and score its quality. Ratings can vary significantly across experts, so the evaluation should be designed to maximize inter-rater agreement. Human evaluation is costly and time-consuming.
- Human productivity evaluation — translators work with machine translation output to produce final translations. The key metrics are throughput (words translated per hour) and the quality of the final output.
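To make the first approach concrete, here is a minimal sketch of sentence-level BLEU, the most common reference-based automatic metric: modified n-gram precision combined with a brevity penalty. It uses simple add-one smoothing and whitespace tokenization for brevity; in practice you would use an established implementation such as sacreBLEU, which handles tokenization, multiple references, and corpus-level statistics.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU of a candidate against one reference.

    Geometric mean of 1..max_n n-gram precisions (add-one smoothed),
    scaled by a brevity penalty for candidates shorter than the reference.
    """
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngr, r_ngr = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngr & r_ngr).values())   # clipped n-gram matches
        total = max(sum(c_ngr.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec)
```

An exact match scores 1.0, and scores fall toward 0 as n-gram overlap with the reference decreases, which is what makes the metric cheap to run over thousands of sentences per candidate system.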
Start with an automatic evaluation to narrow the list of systems. Then use human evaluation to select the best system. Finally, conduct a human productivity evaluation to measure the impact of machine assistance on the translation workflow.