Word Prediction Accuracy (WPA)
Use the method below to calculate the WPA metrics in LILT.
Enable WPA writes
WPA is enabled by default to write data to MinIO/S3. However, if you are on a release prior to the Q2 2024 release and would like to enable it, add the configs below to customvalues.yaml and redeploy the app.
Read WPA stats
Use these steps to read the WPA stats generated using the method above:
- Fetch the update-neural pod.
- Use the command below to fetch the stats.
- --datastore (required): csv
- --base-path: Location to store the results. Should match the value of the wpaDataStoreBasePath values file parameter. It's possible to use local pod folders as the base path by referring to them as file://<path>
- --src_lang: Only results for this source language will be used. Each language is given as a two-letter code according to ISO 639-1, e.g., de for German.
- --tgt_lang: Only results for this target language will be used. The format is the same as --src_lang.
- Unadapted model %: Percentage of words/characters correctly predicted by the system when using an unadapted/pre-trained model.
- Adapted model %: Percentage of words/characters correctly predicted by the system when using an adapted/trained model.
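LILT's exact WPA formulation isn't shown here, but the two percentages can be thought of as a word-level match rate between model predictions and the text the translator actually produced. The helper below and its sample sentences are illustrative assumptions, not the production metric:

```python
def word_prediction_accuracy(predicted, reference):
    """Percent of positions where the predicted word matches the reference.
    A simplified stand-in for WPA; the production metric may differ."""
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)

reference = "the cat sat on the mat".split()
unadapted = "the cat sat on a mat".split()    # pre-training model output
adapted = "the cat sat on the mat".split()    # adapted model output

unadapted_pct = word_prediction_accuracy(unadapted, reference)  # ~83.3
adapted_pct = word_prediction_accuracy(adapted, reference)      # 100.0
```

An adapted model that has learned from a translator's corrections should score closer to 100% than the unadapted baseline.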
BLEU score
Take a look at Wikipedia's description of the BLEU algorithm. The BLEU score can be useful for determining the performance of your LILT Data Sources in terms of:
- Which Data Sources perform best.
- Which Data Sources are best suited for a given translation project.
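For intuition, BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below is a toy single-sentence implementation of that standard definition; for real evaluations use an established tool such as sacrebleu rather than this:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions (n=1..max_n)
    times a brevity penalty against the closest reference length."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            return 0.0
        # clip each n-gram count by its maximum count in any reference
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        precisions.append(clipped / sum(cand_ngrams.values()))
    if min(precisions) == 0:
        return 0.0
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to a reference scores 1.0; a candidate sharing no n-grams with any reference scores 0.0. This is why multiple reference translations per sentence help: they give valid alternative wordings a chance to match.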
Preparing the evaluation set
If you would like to calculate the BLEU score on your own evaluation dataset, use the method below to prepare the evaluation set. Select one or more reference documents as your evaluation set. Each sentence in your evaluation set must have at least one human reference translation. Preferably, each sentence has multiple reference translations to increase the robustness of the BLEU metric. Documents in the evaluation set:
- must have high-quality human reference translations
- must not have been uploaded to or translated with LILT, or any competing engines you are evaluating against
- must have sentences in source and target language as we need to translate the source using LILT for comparison
- should be representative of the type of text you usually translate
- should contain 1000–3000 sentences, as evaluation sets that are too small lead to unreliable metrics
- sentence size: Longer reference sentences can make it more challenging for machine translations to achieve high BLEU scores, as the machine translation may not match the reference sentence exactly in terms of word order or structure. Shorter reference sentences, on the other hand, may make it easier to achieve higher scores. It's important not to sacrifice diversity and naturalness for the sake of higher scores. In summary, there isn't a universally "good" reference sentence size; prioritize naturalness, diversity, and relevance to the evaluation goals. Using multiple reference translations and considering variations in translation quality is generally good practice. Ultimately, the choice of reference sentences should align with the specific context and objectives of the machine translation evaluation.
Running the evaluation test
This section provides instructions on how to capture the BLEU score by running the batch updater and inference on the segments present in the Segments table. In summary, it describes the steps to download the segments, split them into train and test sets, and run inference evaluation. The code is provided as a Docker image.
- Load the inference Docker file.
- Prepare the config and data directories: In a desired directory (henceforth referred to as /path/to/root/directory), create the two directories below.
Download segments from the database
- Connect to the database using kubectl proxy or a preferred method of your choice.
- Create a config with the values below.
- Copy the above config to this path: /path/to/root/directory/configs/download_user_data.yaml
- Use the Docker command below to download the segments.
The output of this should be the files below:
- meta_data.json contains the number of segments
- segments.SRC/TGT_LANG contains the source and target segments
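As a quick sanity check on the download output, something like the following can verify that the source and target files line up. Only the file names come from the docs above; the helper name and the structure of meta_data.json are assumptions:

```python
import json
from pathlib import Path

def check_download(data_dir, src_lang, tgt_lang):
    """Read meta_data.json and the two segments.<lang> files, and verify
    that source and target segment counts match (hypothetical helper)."""
    root = Path(data_dir)
    meta = json.loads((root / "meta_data.json").read_text())
    src = (root / f"segments.{src_lang}").read_text().splitlines()
    tgt = (root / f"segments.{tgt_lang}").read_text().splitlines()
    assert len(src) == len(tgt), "source/target segment counts differ"
    return len(src), meta
```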
Split segments data into train and test sets
Here TRAIN_SPLIT_PROPORTION is a floating-point number between 0 and 1 that indicates how much of the downloaded segments should be used for training. E.g., if set to 0.7, then 70% of the segments downloaded into /path/to/root/directory/data/db_segments will be used for training and the remaining 30% for testing. For the provided sample within sample_data/db_segments, using a TRAIN_SPLIT_PROPORTION of 0.7 will yield 7 train segments and 3 test segments, as provided within sample_data/db_segments/splits/{train,test}.
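The proportion logic described above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the tool's actual code (in particular, whether the real split shuffles or preserves order is an assumption):

```python
import random

def split_segments(segments, train_split_proportion, seed=0):
    """Partition segments into train/test sets by TRAIN_SPLIT_PROPORTION."""
    if not 0.0 < train_split_proportion < 1.0:
        raise ValueError("TRAIN_SPLIT_PROPORTION must be between 0 and 1")
    indices = list(range(len(segments)))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed
    n_train = round(len(segments) * train_split_proportion)
    train = [segments[i] for i in sorted(indices[:n_train])]
    test = [segments[i] for i in sorted(indices[n_train:])]
    return train, test

segments = [f"segment {i}" for i in range(10)]
train, test = split_segments(segments, 0.7)  # 7 train, 3 test segments
```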
Run inference evaluation
- Prepare the config to run inference evaluation.
- Run the Docker command below.
This command first creates a fresh memory and project to be used for the purpose of this evaluation. Their respective IDs are printed in the logs as below. If this memory and project need to be reused, the evaluation config needs to be updated with their IDs.
/path/to/root/directory/data/splits/train will be used to run a batch updater operation; once that completes, the data present within /path/to/root/directory/data/splits/test will be used to run a batch translate operation against the freshly updated memory. The results will be downloaded and the computed BLEU scores will be persisted in /path/to/root/directory/data/evaluation.lilt_api.json
Sample output
The inference will be run on all directories present within the splits directory, except for the ones designated for the batch updater. If the inference section in the config is updated to operations: [ "translate" ], then the train directory will no longer be used for the batch updater and inference will be run on it too. To prevent this, the train directory can be deleted from the splits directory so that inference is run only on test.
