Word Prediction Accuracy (WPA)
Use the following method to calculate WPA metrics in LILT.
Enable WPA writes
WPA is enabled by default and writes data to MinIO/S3. However, if you are on a release prior to the Q2 2024 release and would like to enable it, add the configs below to the custom values.yaml and redeploy the app.
update:
  onpremValues:
    config:
      # Configure WPA DataStore
      wpaDataStoreType: "csv"
      # Write updates for every 10 confirmed segments, change this as required
      wpaDataStoreBatchSize: 10
      # Path to store WPA results, change it as required
      wpaDataStoreBasePath: "s3://lilt/wpa/"
updatev2:
  onpremValues:
    config:
      # Configure WPA DataStore
      wpaDataStoreType: "csv"
      # Write updates for every 10 confirmed segments, change this as required
      wpaDataStoreBatchSize: 10
      # Path to store WPA results, change it as required
      wpaDataStoreBasePath: "s3://lilt/wpa/"
updatev3:
  onpremValues:
    config:
      # Configure WPA DataStore
      wpaDataStoreType: "csv"
      # Write updates for every 10 confirmed segments, change this as required
      wpaDataStoreBatchSize: 10
      # Path to store WPA results, change it as required
      wpaDataStoreBasePath: "s3://lilt/wpa/"
cd ~/install_dir
./install-lilt.sh
Read WPA stats
Use these steps to read the WPA stats generated by the method above:
- Fetch the update-neural pod name:
kubectl get pods -n lilt | grep update-neural
- Use the command below to fetch the stats (see the example invocation after the parameter list):
kubectl -n lilt exec <update-neural-pod-name> -- bash scripts/metrics/visualize_wpa.py --datastore csv --base-path s3://lilt/wpa --start_date <start-date yyyy-mm-dd> --end_date <end-date yyyy-mm-dd> --memory_id <memory-id>
Parameters:
- --datastore (required): csv
- --base-path: Location where the results are stored. Should match the value of the wpaDataStoreBasePath values-file parameter. It's possible to use local pod folders as the base path by referring to them as file://<path>
- --src_lang: Only results for this source language will be used. Each language is given as a two-letter code according to ISO 639-1, e.g., de for German. More details here.
- --tgt_lang: Only results for this target language will be used. The format is the same as --src_lang.
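For example, to fetch Q1 2024 stats for a German→English pair (a hypothetical invocation; substitute your own pod name, dates, and memory ID):
kubectl -n lilt exec update-neural-abc123 -- bash scripts/metrics/visualize_wpa.py --datastore csv --base-path s3://lilt/wpa --start_date 2024-01-01 --end_date 2024-03-31 --memory_id 123 --src_lang de --tgt_lang en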
Sample output
Backend v1
Number of segments: 10
Average prediction accuracy unadapted model: 81.4%
Average prediction accuracy adapted model: 82.6%
Difference: 1.2%
Average char prediction accuracy unadapted model: 77.0%
Average char prediction accuracy adapted model: 79.9%
Char difference: 2.9%
- Unadapted model %: Percentage of words/characters correctly predicted by the system when using an unadapted/pre-trained model.
- Adapted model %: Percentage of words/characters correctly predicted by the system when using an adapted/trained model. In the sample above, adaptation improves word-level accuracy from 81.4% to 82.6%, a difference of 1.2 percentage points.
BLEU score
Take a look at Wikipedia’s description of the BLEU algorithm.
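As a brief refresher (the standard corpus BLEU formulation used by sacreBLEU; not LILT-specific), the score combines modified n-gram precisions $p_n$, typically up to $N = 4$, with a brevity penalty $\mathrm{BP}$:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad \mathrm{BP} = \min\left(1,\ e^{1 - r/c}\right)$$

where $c$ is the total length of the candidate translation, $r$ is the effective reference length, and $w_n = 1/N$ are uniform weights.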
BLEU score can be useful for determining the performance of your LILT Data Sources in terms of:
- Which Data Sources perform best.
- Which Data Sources are best suited for a given translation project.
Preparing the evaluation set
If you would like to calculate the BLEU score on your own evaluation dataset, use the following method to prepare the evaluation set:
Select one or more reference documents as your evaluation set. Each sentence in your evaluation set must have at least one human reference translation. Preferably, each sentence has multiple reference translations to increase the robustness of the BLEU metric.
Documents in the evaluation set:
- must have high-quality human reference translations
- must not have been uploaded to or translated with LILT, or any competing engines you are evaluating against
- must have sentences in both the source and target language, since the source must be translated with LILT for comparison
- should be representative of the type of text you usually translate
- should contain 1,000 – 3,000 sentences, as evaluation sets that are too small lead to unreliable metrics
- sentence size: longer reference sentences can make it more challenging for machine translations to achieve high BLEU scores, as the machine translation may not match the reference exactly in word order or structure; shorter reference sentences may make it easier to achieve higher scores. It's important not to sacrifice diversity and naturalness for the sake of higher scores. In summary, there isn't a universally "good" reference sentence length, but it's essential to prioritize naturalness, diversity, and relevance to the evaluation goals. Using multiple reference translations and considering variations in translation quality is generally good practice. Ultimately, the choice of reference sentences should align with the specific context and objectives of the machine translation evaluation.
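As a quick sanity check on a prepared evaluation set (a sketch assuming plain-text files with one sentence per line; the file names are hypothetical):
# Source and reference line counts must match and should fall in the 1,000 – 3,000 range
wc -l eval.src eval.ref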
Running the evaluation test
This section describes how to capture a BLEU score by running the batch updater and inference on the segments present in the Segments table. In summary, the steps are: download the segments, split them into train and test sets, and run the inference evaluation. The code is provided as a Docker image.
- Load the inference Docker image
Docker:
docker load -i on_prem_inference_docker.tar
Containerd:
ctr image import on_prem_inference_ctr.tar
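To confirm the image loaded, you can list it (standard Docker/containerd commands; the exact image tag may differ):
Docker:
docker images | grep on-prem-inference
Containerd:
ctr image ls | grep on-prem-inference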
- Prepare the config and data directories: in a directory of your choice (henceforth referred to as /path/to/root/directory), create two directories as below
mkdir configs && mkdir data
- Download segments from the database
- Connect to the database using kubectl port-forward or another method of your choice
kubectl port-forward -n lilt service/mysql 3306:3306
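Optionally, verify the connection through the forwarded port (assumes a local mysql client is installed; use the same credentials as in the config below):
mysql -h 127.0.0.1 -P 3306 -u DB_USER_NAME -p -e "SELECT 1"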
- Create a config with the values below
db-connection:
  # DB username
  user: "DB_USER_NAME"
  # DB password
  password: "DB_PASS_WORD"
user-data:
  # Don't change data-output-path
  data-output-path: "/home/lilt/external_data/db_segments"
  # Replace with source and target language ISO codes
  language-pair: ["SRC_LANG", "TGT_LANG"]
  # Memory IDs for the above language pair
  memory-ids: []
  # Don't change data-type
  data-type: "segments"
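For reference, a filled-in config might look like this (hypothetical credentials, English→German pair):
db-connection:
  user: "root"
  password: "my-db-password"
user-data:
  data-output-path: "/home/lilt/external_data/db_segments"
  language-pair: ["en", "de"]
  memory-ids: []
  data-type: "segments"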
- Copy the above config to this path: /path/to/root/directory/configs/download_user_data.yaml
- Use the command below to download the segments
Docker:
docker run \
--volume /path/to/root/directory/configs/:/home/lilt/external_configs \
--volume /path/to/root/directory/data/:/home/lilt/external_data \
--network="host" \
on-prem-inference:1 \
bash -c "cd /home/lilt; source py_env/bin/activate; bash src/download_segments_data.py -c /home/lilt/external_configs/download_user_data.yaml;"
Containerd:
sudo ctr run -t --net-host \
--mount type=bind,src=/path/to/root/directory/configs/,dst=/home/lilt/external_configs,options=rbind:rw \
--mount type=bind,src=/path/to/root/directory/data,dst=/home/lilt/external_data,options=rbind:rw \
--rm docker.io/library/on-prem-inference:ubuntu on-prem-inference-ubuntu \
/bin/bash -c "cd /home/lilt; source py_env/bin/activate; bash src/download_segments_data.py -c /home/lilt/external_configs/download_user_data.yaml;"
The output of this should be the files below:
$ ls /path/to/root/directory/data/db_segments
meta_data.json segments.SRC_LANG segments.TGT_LANG
meta_data.json contains the number of segments; segments.SRC_LANG and segments.TGT_LANG contain the source and target segments.
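To quickly inspect the download (assuming one segment per line in the segments files):
cat /path/to/root/directory/data/db_segments/meta_data.json
wc -l /path/to/root/directory/data/db_segments/segments.*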
- Split the segments data into train and test sets
Docker:
docker run \
--volume /path/to/root/directory/configs/:/home/lilt/external_configs \
--volume /path/to/root/directory/data/:/home/lilt/external_data \
--network="host" \
on-prem-inference:1 \
bash -c "cd /home/lilt; source py_env/bin/activate; bash src/train_test_split.py -i /home/lilt/external_data/db_segments -o /home/lilt/external_data/splits -s SRC_LANG -t TGT_LANG -p TRAIN_SPLIT_PROPORTION;"
Containerd:
sudo ctr run -t --net-host \
--mount type=bind,src=/path/to/root/directory/configs/,dst=/home/lilt/external_configs,options=rbind:rw \
--mount type=bind,src=/path/to/root/directory/data,dst=/home/lilt/external_data,options=rbind:rw \
--rm docker.io/library/on-prem-inference:ubuntu on-prem-inference-ubuntu \
/bin/bash -c "cd /home/lilt; source py_env/bin/activate; bash src/train_test_split.py -i /home/lilt/external_data/db_segments -o /home/lilt/external_data/splits -s SRC_LANG -t TGT_LANG -p TRAIN_SPLIT_PROPORTION;"
Here TRAIN_SPLIT_PROPORTION is a floating-point number between 0 and 1 indicating how much of the downloaded segments to use for training. For example, if set to 0.7, then 70% of the segments downloaded into /path/to/root/directory/data/db_segments will be used for training and the remaining 30% will be used for testing. For the provided sample within sample_data/db_segments, a TRAIN_SPLIT_PROPORTION of 0.7 yields 7 train segments and 3 test segments, as provided within sample_data/db_segments/splits/{train,test}
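To verify the resulting split sizes (assuming one segment per line; the exact file names inside the split directories may differ):
wc -l /path/to/root/directory/data/splits/train/* /path/to/root/directory/data/splits/test/*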
- Run the inference evaluation
- Prepare the config to run the inference evaluation
languages:
  # Source language ISO code
  src-lang: "en"
  # Target language ISO code
  tgt-lang: "es"
db-connection:
  # DB username
  user: "DB_USER_NAME"
  # DB password
  password: "DB_PASS_WORD"
inference:
  # Don't change
  model-owners: [ "lilt_api" ]
  # Tasks to perform
  operations: [ "batch-updater", "translate" ]
  # Don't change
  metrics: [ "bleu" ]
  # TOKENIZER: use `zh` for target languages `zh` and `zt`,
  # `ja-mecab` for target language `ja`,
  # `ko-mecab` for target language `ko`,
  # `13a` for everything else
  sacrebleu-tokenizer: "TOKENIZER"
  # Source directory for batch updates
  batch-updater-source: "train"
  # Set to false to overwrite the existing results
  ignore-operation-if-output-present: true
lilt_api:
  # Lilt API key
  api-key: "API_KEY"
  # API endpoint, e.g. https://test.lilt.com/2
  api-endpoint: "API_ENDPOINT"
  # Uncomment to reuse an existing memory/project ID
  # memory-id:
  # project-id:
- Save the above config to /path/to/root/directory/configs/inference_evaluation.yaml, then run the command below
Docker:
docker run \
--volume /path/to/root/directory/configs/:/home/lilt/external_configs \
--volume /path/to/root/directory/data/:/home/lilt/external_data \
--network="host" \
on-prem-inference:1 \
bash -c "cd /home/lilt; source py_env/bin/activate; bash src/run_inference.py -c /home/lilt/external_configs/inference_evaluation.yaml -i /home/lilt/external_data/splits -m /home/lilt/external_data/evaluation.json"
Containerd:
sudo ctr run -t --net-host \
--mount type=bind,src=/path/to/root/directory/configs/,dst=/home/lilt/external_configs,options=rbind:rw \
--mount type=bind,src=/path/to/root/directory/data,dst=/home/lilt/external_data,options=rbind:rw \
--rm docker.io/library/on-prem-inference:ubuntu on-prem-inference-ubuntu \
/bin/bash -c "cd /home/lilt; source py_env/bin/activate; bash src/run_inference.py -c /home/lilt/external_configs/inference_evaluation.yaml -i /home/lilt/external_data/splits -m /home/lilt/external_data/evaluation.json"
This command first creates a fresh memory and project to be used for this evaluation. Their respective IDs are printed in the logs as below:
INFO:root:Using Memory Id: 1234
INFO:root:Using Project Id: 5678
If these memory and project IDs need to be reused on later runs, update the evaluation config with them:
lilt_api:
  memory-id: 1234
  project-id: 5678
Once the memory and project are created (or reused), the data in /path/to/root/directory/data/splits/train is used to run a batch updater operation. Once that completes, the data in /path/to/root/directory/data/splits/test is used to run a batch translate operation against the freshly updated memory. The results are downloaded, and the computed BLEU scores are persisted in /path/to/root/directory/data/evaluation.lilt_api.json
Sample output
INFO:root:Seeding meta_data.json files for data sources in /home/lilt/external_data/splits
INFO:root:Validating test
INFO:root:Validating train
INFO:root:Running Lilt API inference
INFO:root:engine URL: mysql://root:***@127.0.0.1:3306/lilt_dev?charset=utf8mb4
Handling connection for 3306
INFO:root:connection established: <sqlalchemy.engine.base.Connection object at 0x7f29b916b1c0>
INFO:root:Using Memory Id: 10
INFO:root:Using Project Id: 10
INFO:root:Generating a TMX file from the data present in train
INFO:root:Uploading /home/lilt/external_data/splits/train/train.tmx to memory 10 and running batch updater
Handling connection for 3306
INFO:root:Successfully uploaded /home/lilt/external_data/splits/train/train.tmx. Waiting till batch updater completes
INFO:root:Pretranslating document at path /home/lilt/external_data/splits/test/test.de
INFO:root:Creating a temp file /tmp/tmp3u_rqp_p/test.de.test.txt compatible with API usage
INFO:root:Requested pretranslation for document test.de.test.txt with id 12
INFO:root:Waiting for pretranslation on document test to complete
INFO:root:Downloading pretranslated document for test into /home/lilt/external_data/translations_out/lilt_api_10_10/test/translated.en
INFO:root:Computing BLEU scores
INFO:root:Metrics for lilt_api:
{
"bleu": {
"lilt_api_10_10": {
"test": [
79.98082588232835,
"90.7/82.5/76.5/71.4"
]
}
}
}
INFO:root:Saving metrics json in path: /home/lilt/external_data/evaluation.lilt_api.json
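In the saved JSON (which mirrors the logged structure above), the first array element is the corpus BLEU score and the slash-separated string holds the 1- through 4-gram precisions reported by sacreBLEU. Assuming jq is available, you can extract the scores with:
jq '.bleu' /path/to/root/directory/data/evaluation.lilt_api.json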
Inference will be run on every directory present within the splits directory except the one designated as the batch updater source. If the inference section in the config is updated to operations: [ "translate" ], then the train directory will no longer be used for the batch updater, and inference will be run on it too. To prevent this, delete the train directory from the splits directory so that inference is run only on test, as shown below.
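For example (adjust the path to your root directory):
rm -rf /path/to/root/directory/data/splits/train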