Word Prediction Accuracy (WPA)
Use the method below to calculate the WPA metrics in LILT.
Enable WPA writes
WPA is enabled by default to write data to MinIO/S3. However, if you are on a release prior to the Q2 2024 release and would like to enable it, add the configs below to customvalues.yaml and redeploy the app.
Read WPA stats
Use these steps to read the WPA stats generated using the method above:
- Fetch the update-neural pod.
- Use the command below to fetch the stats.
- --datastore (required): csv
- --base-path: Location to store the results. Should match the value of the wpaDataStoreBasePath values file parameter. It's possible to use local pod folders as the base path by referring to them as file://<path>
- --src_lang: Only results for this source language will be used. Each language is given as a two-letter code according to ISO 639-1, e.g., de for German.
- --tgt_lang: Only results for this target language will be used. The format is the same as --src_lang.
- Unadapted model %: Percentage of words/characters correctly predicted by the system when using an unadapted/pre-trained model.
- Adapted model %: Percentage of words/characters correctly predicted by the system when using an adapted/trained model.
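LILT's exact WPA formulation isn't shown here, but the two percentages can be thought of as a word-level match rate between model predictions and the text the translator actually produced. The helper below and its sample sentences are illustrative assumptions, not the production metric:

```python
def word_prediction_accuracy(predicted, reference):
    """Percent of positions where the predicted word matches the reference.
    A simplified stand-in for WPA; the production metric may differ."""
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)

reference = "the cat sat on the mat".split()
unadapted = "the cat sat on a mat".split()    # pre-training model output
adapted = "the cat sat on the mat".split()    # adapted model output

unadapted_pct = word_prediction_accuracy(unadapted, reference)  # ~83.3
adapted_pct = word_prediction_accuracy(adapted, reference)      # 100.0
```

An adapted model that has learned from a translator's corrections should score closer to 100% than the unadapted baseline.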
BLEU score
Take a look at Wikipedia's description of the BLEU algorithm. The BLEU score can be useful for determining the performance of your LILT Data Sources in terms of:
- Which Data Sources perform best.
- Which Data Sources are best suited for a given translation project.
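For intuition, BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below is a toy single-sentence implementation of that standard definition; for real evaluations use an established tool such as sacrebleu rather than this:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions (n=1..max_n)
    times a brevity penalty against the closest reference length."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            return 0.0
        # clip each n-gram count by its maximum count in any reference
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        precisions.append(clipped / sum(cand_ngrams.values()))
    if min(precisions) == 0:
        return 0.0
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to a reference scores 1.0; a candidate sharing no n-grams with any reference scores 0.0. This is why multiple reference translations per sentence help: they give valid alternative wordings a chance to match.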
Preparing the evaluation set
If you would like to calculate the BLEU score on your own evaluation dataset, use the method below to prepare the evaluation set. Select one or more reference documents as your evaluation set. Each sentence in your evaluation set must have at least one human reference translation. Preferably, each sentence has multiple reference translations to increase the robustness of the BLEU metric. Documents in the evaluation set:
- must have high-quality human reference translations
- must not have been uploaded to or translated with LILT, or any competing engines you are evaluating against
- must have sentences in source and target language as we need to translate the source using LILT for comparison
- should be representative of the type of text you usually translate
- should contain 1000–3000 sentences, as evaluation sets that are too small lead to unreliable metrics
- sentence size: Longer reference sentences can make it more challenging for machine translations to achieve high BLEU scores, as the machine translation may not match the reference sentence exactly in terms of word order or structure. Shorter reference sentences, on the other hand, may make it easier to achieve higher scores. It's important not to sacrifice diversity and naturalness for the sake of higher scores. In summary, there isn't a universally "good" reference sentence size; prioritize naturalness, diversity, and relevance to the evaluation goals. Using multiple reference translations and considering variations in translation quality is generally good practice. Ultimately, the choice of reference sentences should align with the specific context and objectives of the machine translation evaluation.
Running the evaluation test
This section provides instructions on how to capture the BLEU score by running the batch updater and inference on the segments present in the Segments table. In summary, it describes the steps to download the segments, split them into train and test sets, and run inference evaluation. The code is provided as a Docker image.
- Load the inference Docker file.
- Prepare the config and data directories: In a desired directory (henceforth referred to as /path/to/root/directory), create the two directories below.
Download segments from the database
- Connect to the database using kubectl proxy or a preferred method of your choice.
- Create a config with the values below.
- Copy the above config to this path: /path/to/root/directory/configs/download_user_data.yaml
- Use the Docker command below to download the segments.
The output of this should be the files below:
- meta_data.json contains the number of segments
- segments.SRC/TGT_LANG contains the source and target segments
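As a quick sanity check on the download output, something like the following can verify that the source and target files line up. Only the file names come from the docs above; the helper name and the structure of meta_data.json are assumptions:

```python
import json
from pathlib import Path

def check_download(data_dir, src_lang, tgt_lang):
    """Read meta_data.json and the two segments.<lang> files, and verify
    that source and target segment counts match (hypothetical helper)."""
    root = Path(data_dir)
    meta = json.loads((root / "meta_data.json").read_text())
    src = (root / f"segments.{src_lang}").read_text().splitlines()
    tgt = (root / f"segments.{tgt_lang}").read_text().splitlines()
    assert len(src) == len(tgt), "source/target segment counts differ"
    return len(src), meta
```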
Split segments data into train and test sets
Here TRAIN_SPLIT_PROPORTION is a floating-point number between 0 and 1 that indicates how much of the downloaded segments should be used for training. E.g., if set to 0.7, then 70% of the segments downloaded into /path/to/root/directory/data/db_segments will be used for training and the remaining 30% for testing. For the provided sample within sample_data/db_segments, using a TRAIN_SPLIT_PROPORTION of 0.7 will yield 7 train segments and 3 test segments, as provided within sample_data/db_segments/splits/{train,test}.
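The proportion logic described above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the tool's actual code (in particular, whether the real split shuffles or preserves order is an assumption):

```python
import random

def split_segments(segments, train_split_proportion, seed=0):
    """Partition segments into train/test sets by TRAIN_SPLIT_PROPORTION."""
    if not 0.0 < train_split_proportion < 1.0:
        raise ValueError("TRAIN_SPLIT_PROPORTION must be between 0 and 1")
    indices = list(range(len(segments)))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed
    n_train = round(len(segments) * train_split_proportion)
    train = [segments[i] for i in sorted(indices[:n_train])]
    test = [segments[i] for i in sorted(indices[n_train:])]
    return train, test

segments = [f"segment {i}" for i in range(10)]
train, test = split_segments(segments, 0.7)  # 7 train, 3 test segments
```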
Run inference evaluation
- Prepare the config to run inference evaluation.
- Run the Docker command below.
This command first creates a fresh memory and project to be used for the purpose of this evaluation. Their respective IDs are printed in the logs as below. If this memory and project need to be reused, the evaluation config needs to be updated with their IDs.
/path/to/root/directory/data/splits/train will be used to run a batch updater operation; once that completes, the data present within /path/to/root/directory/data/splits/test will be used to run a batch translate operation against the freshly updated memory. The results will be downloaded and the computed BLEU scores will be persisted in /path/to/root/directory/data/evaluation.lilt_api.json
Sample output
The inference will be run on all directories present within the splits directory, except for the ones designated for the batch updater. If the inference section in the config is updated to operations: [ "translate" ], then the train directory will no longer be used for the batch updater and inference will be run on it too. To prevent this, the train directory can be deleted from the splits directory so that inference is run only on test.
