This article describes current best practices for human evaluation of translation quality. For the latest research on quality evaluation, see the annual reports from the Workshop on Machine Translation (WMT), the most recent of which was held in August 2016.
Current approaches to translation quality evaluation use the method of pairwise comparison. Consider the evaluation of clothing:
One question that you might ask is:
How good does Jeremy look in a green suit?
Raters could evaluate Jeremy on a scale of 1-10 for some definition of "good." You could also ask:
Does Jeremy look better than, worse than, or the same in a green suit vs. a pink suit?
Humans tend to render more consistent judgments when comparing two items than when rating a single item on an arbitrary scale. This holds for clothes and for translation alike. In the WMT 2007 evaluation campaign, inter-rater agreement was considerably higher for pairwise comparisons ("sentence ranking" in the table below) than for separate fluency and adequacy ratings:
In this table, a value of K = 1.0 indicates perfect agreement among raters. Fluency / adequacy judgments, which were popular in the early 2000s, have been almost entirely abandoned in favor of pairwise comparisons.
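To make the agreement statistic concrete, here is a minimal sketch of Cohen's kappa for two raters, the simplest member of the family of K statistics used in such tables. The judgment labels and rater data are hypothetical, purely for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters judging the same items.

    K = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance. K = 1.0 is perfect agreement.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n)
              for label in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pairwise judgments from two raters on six sentences:
a = ["B>C", "B>C", "B<C", "B=C", "B>C", "B<C"]
b = ["B>C", "B<C", "B<C", "B=C", "B>C", "B>C"]
kappa = cohens_kappa(a, b)  # agreement well above chance, far from perfect
```

Note that WMT reports multi-rater agreement, for which generalizations such as Fleiss' kappa are used; the two-rater case above shows the core idea.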
Let's continue the example from the automatic quality evaluation. We have two test sets (email and listing), each with source sentences and target references. We also have the output of the two systems B and C. Now we need:
- Bilingual human raters — the raters should be fluent in both languages, preferably with native proficiency in the target.
- A ranking interface — for collecting the human judgments.
For each source sentence, the ranking interface shows a target reference, and MT outputs with the system identity concealed. The ordering of the systems should be randomized across screens:
In our example, the interface would show only two system outputs: B and C. The pairwise comparison setup can be safely extended to a relative ranking of up to five system outputs.
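A minimal sketch of how a ranking interface might conceal system identity and randomize display order per screen. The function name and data layout are assumptions for illustration, not a description of any particular tool.

```python
import random

def make_screen(source, reference, outputs, rng=random):
    """Build one ranking screen: system outputs are shuffled and shown
    under anonymous labels so raters cannot identify the systems.

    outputs: dict mapping system id (e.g. "B", "C") to its translation.
    """
    systems = list(outputs.items())
    rng.shuffle(systems)  # fresh ordering on every screen
    # What the rater sees: anonymous labels only.
    display = {f"Output {i + 1}": text for i, (_, text) in enumerate(systems)}
    # Kept hidden from raters; used later to decode the judgments.
    key = {f"Output {i + 1}": sys_id for i, (sys_id, _) in enumerate(systems)}
    return {"source": source, "reference": reference,
            "display": display, "key": key}

screen = make_screen("Guten Morgen", "Good morning",
                     {"B": "Good morning", "C": "Morning good"})
```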
It is common to have several human raters independently score each source sentence.
Analyzing the Results
The ranking interface yields two kinds of data:
- Relative ordinal ranks
- Pairwise preferences — B > C, B < C, B = C, etc.
Let's use the relative ordinal ranks to create a side-by-side comparison, a format used at Google and in academic settings. We simply compute the average rank across all sentences in the test set for each system (B and C). We can create a table like this:
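The average-rank computation can be sketched in a few lines. The per-sentence ranks below are hypothetical (1 = best; lower average rank is better):

```python
from collections import defaultdict

def average_ranks(judgments):
    """judgments: list of (test_set, {system: rank}) entries,
    one per rated sentence. Returns average rank per (test_set, system)."""
    totals = defaultdict(lambda: [0, 0])  # (test_set, system) -> [sum, count]
    for test_set, ranks in judgments:
        for system, rank in ranks.items():
            totals[(test_set, system)][0] += rank
            totals[(test_set, system)][1] += 1
    return {key: s / c for key, (s, c) in totals.items()}

# Hypothetical per-sentence ranks for systems B and C:
judgments = [
    ("email",   {"B": 1, "C": 2}),
    ("email",   {"B": 1, "C": 2}),
    ("email",   {"B": 2, "C": 1}),
    ("listing", {"B": 1, "C": 2}),
    ("listing", {"B": 1, "C": 2}),
]
table = average_ranks(judgments)  # B has the lower average on both test sets
```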
We can see that system B is superior for both test sets.
To recap, we used an automatic quality evaluation to eliminate three of the five systems under consideration. Then we ran a human quality evaluation to differentiate the final two systems. The human quality evaluation was constructed to maximize inter-rater agreement, so we have confidence that, were we to re-run the evaluation with a different set of raters, we would observe the same outcome.
Advanced topic — The pairwise preferences can be used for more sophisticated statistical analysis. To learn more, refer to section 4.3.2 of our EMNLP 2014 paper.
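One simple analysis of pairwise preferences (not necessarily the method used in the paper) is a two-sided sign test: ties are discarded, and we ask how surprising the observed win counts would be if raters had no real preference. The counts below are hypothetical:

```python
from math import comb

def sign_test(wins_b, wins_c, ties=0):
    """Two-sided sign test on pairwise preferences.

    Ties are ignored, as is standard for the sign test. Returns the
    p-value under the null hypothesis that B and C are equally preferred.
    """
    n = wins_b + wins_c
    k = max(wins_b, wins_c)
    # P(X >= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: raters preferred B 38 times, C 22 times, 10 ties.
p_value = sign_test(38, 22, ties=10)
```

More sophisticated models (e.g. the one described in the paper) additionally account for per-rater reliability and rank multiple systems jointly, but the sign test is a reasonable first check.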