Word Prediction Accuracy (WPA)

Use the method below to calculate the WPA metrics in LILT.

Enable WPA writes

WPA is enabled by default to write data to MinIO/S3. However, if you are running a release prior to the Q2 2024 release and would like to enable it, add the configs below to the custom values.yaml and redeploy the app:
update:
  onpremValues:
    config:
      # Configure WPA DataStore
      wpaDataStoreType: "csv"
      # Write updates for every 10 confirmed segments, change this as required
      wpaDataStoreBatchSize: 10
      # Path to store WPA results, change it as required
      wpaDataStoreBasePath: "s3://lilt/wpa/"
updatev2:
  onpremValues:
    config:
      # Configure WPA DataStore
      wpaDataStoreType: "csv"
      # Write updates for every 10 confirmed segments, change this as required
      wpaDataStoreBatchSize: 10
      # Path to store WPA results, change it as required
      wpaDataStoreBasePath: "s3://lilt/wpa/"
updatev3:
  onpremValues:
    config:
      # Configure WPA DataStore
      wpaDataStoreType: "csv"
      # Write updates for every 10 confirmed segments, change this as required
      wpaDataStoreBatchSize: 10
      # Path to store WPA results, change it as required
      wpaDataStoreBasePath: "s3://lilt/wpa/"
Then redeploy the app:
cd ~/install_dir
./install-lilt.sh
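
After the app restarts and a few segments have been confirmed, you can optionally verify that WPA files are being written under the configured wpaDataStoreBasePath. A minimal check, assuming the aws CLI is available and configured with your MinIO credentials (the endpoint URL below is a placeholder):

# List WPA result files under the configured base path
aws s3 ls s3://lilt/wpa/ --recursive --endpoint-url https://<your-minio-endpoint>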

Read WPA stats

Use these steps to read the WPA stats generated using the above method:
  1. Fetch the update-neural pod name using
    kubectl get pods -n lilt | grep update-neural
    
  2. Use the command below to fetch the stats; a filled-in example follows the parameter list
    kubectl -n lilt exec <update-neural-pod-name> -- bash scripts/metrics/visualize_wpa.py --datastore csv --base-path s3://lilt/wpa --start_date <start-date yyyy-mm-dd> --end_date <end-date yyyy-mm-dd> --memory_id <memory-id>
    
Parameters:
  • --datastore (required): csv
  • --base-path: Location where the results are stored. This should match the value of the wpaDataStoreBasePath parameter in the values file. It’s also possible to use a local folder inside the pod as the base path by referring to it as file://<path>.
  • --src_lang: Only results for this source language will be used. Each language is given as a two-letter code according to ISO 639-1, e.g., de for German.
  • --tgt_lang: Only results for this target language will be used. The format is the same as --src_lang.
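
For reference, a filled-in invocation might look like the following (the dates, memory ID, and language codes are illustrative placeholders):

  kubectl -n lilt exec <update-neural-pod-name> -- bash scripts/metrics/visualize_wpa.py --datastore csv --base-path s3://lilt/wpa --start_date 2024-01-01 --end_date 2024-03-31 --memory_id 1234 --src_lang de --tgt_lang en
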
Sample output
Backend v1
    Number of segments: 10
    Average prediction accuracy unadapted model: 81.4%
    Average prediction accuracy adapted model: 82.6%
    Difference: 1.2%
    Average char prediction accuracy unadapted model: 77.0%
    Average char prediction accuracy adapted model: 79.9%
    Char difference: 2.9%
  • Unadapted model %: Percentage of words/characters correctly predicted by the system when using an unadapted/pre-trained model.
  • Adapted model %: Percentage of words/characters correctly predicted by the system when using an adapted/trained model.

BLEU score

Take a look at Wikipedia’s description of the BLEU algorithm. The BLEU score can be useful for determining the performance of your LILT Data Sources in terms of:
  • Which Data Sources perform best.
  • Which Data Sources are best suited for a given translation project.
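
As a quick illustration of how a BLEU score is computed (the evaluation flow described below uses sacrebleu), you can score a plain-text hypothesis file against a reference file yourself, assuming the sacrebleu package is installed locally; the file names here are placeholders:

# hypotheses.txt holds one machine translation per line, aligned line-by-line with references.txt
sacrebleu references.txt -i hypotheses.txt --tokenize 13a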

Preparing the evaluation set

If you would like to calculate the BLEU score on your own sample evaluation dataset, use the method below to prepare the evaluation set. Select one or more reference documents as your evaluation set. Each sentence in your evaluation set must have at least one human reference translation. Preferably, each sentence has multiple reference translations to increase the robustness of the BLEU metric. Documents in the evaluation set:
  • must have high-quality human reference translations
  • must not have been uploaded to or translated with LILT, or any competing engines you are evaluating against
  • must have sentences in source and target language as we need to translate the source using LILT for comparison
  • should be representative of the type of text you usually translate
  • should contain 1000 – 3000 sentences, as too-small evaluation sets lead to unreliable metrics (a quick size check is sketched after this list)
  • sentence size: longer reference sentences can make it more challenging for machine translations to achieve high BLEU scores, since the machine translation may not match the reference exactly in word order or structure, while shorter reference sentences tend to make higher scores easier to achieve. It’s important not to sacrifice diversity and naturalness for the sake of higher scores. In summary, there isn’t a universally “good” reference sentence size; prioritize naturalness, diversity, and relevance to the evaluation goals. Using multiple reference translations and accounting for variation in translation quality is generally good practice. Ultimately, the choice of reference sentences should align with the specific context and objectives of the machine translation evaluation.
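
As a quick sanity check on evaluation-set size and average sentence length (the file names below are placeholders for your source and reference files):

# Number of sentences (lines) in the evaluation set
wc -l eval.src eval.ref
# Average reference sentence length in whitespace-separated tokens
awk '{ total += NF } END { if (NR > 0) print total / NR }' eval.ref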

Running the evaluation test

This section provides instructions on how to capture the BLEU score by running the batch updater and inference on the segments present in the Segments table. In summary, it describes the steps to download the segments, split them into train and test sets, and run inference evaluation. The code is provided as a Docker image.
  1. Load the inference docker file
    Docker:
    docker load -i on_prem_inference_docker.tar
    
    Containerd:
    ctr image import on_prem_inference_ctr.tar
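
    To confirm the image is available before proceeding (adjust the grep pattern if your image name or tag differs):
    docker images | grep on-prem-inference        # Docker
    sudo ctr images ls | grep on-prem-inference   # Containerd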
    
  2. Prepare the config and data directories: in a desired directory (henceforth referred to as /path/to/root/directory), create two directories as below
    mkdir configs && mkdir data
    
  3. Download segments from the database
    1. Connect to the database using kubectl port-forward or your preferred method
      kubectl port-forward -n lilt service/mysql 3306:3306
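
      Optionally, verify connectivity through the forwarded port, assuming a mysql client is installed on the machine running kubectl:
      mysql -h 127.0.0.1 -P 3306 -u DB_USER_NAME -p -e "SELECT 1"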
      
    2. Create a config with the values below
      db-connection:
        # DB username
        user: "DB_USER_NAME"
        # DB password
        password: "DB_PASS_WORD"
      user-data:
        # Don't change data-output-path 
        data-output-path: "/home/lilt/external_data/db_segments"
        # Replace with source and target language ISO code
        language-pair: ["SRC_LANG", "TGT_LANG"]
        # Memory ID for above lang-pair
        memory-ids: []
        # Don't change data-type
        data-type: "segments"
      
    3. Copy the above config to this path: /path/to/root/directory/configs/download_user_data.yaml
    4. Use the Docker or Containerd command below to download the segments
      Docker:
      docker run \
      --volume /path/to/root/directory/configs/:/home/lilt/external_configs \
      --volume /path/to/root/directory/data/:/home/lilt/external_data \
      --network="host" \
      on-prem-inference:1 \
      bash -c "cd /home/lilt; source py_env/bin/activate; bash src/download_segments_data.py -c /home/lilt/external_configs/download_user_data.yaml;"
      
      Containerd:
      sudo ctr run -t --net-host \
      --mount type=bind,src=/path/to/root/directory/configs/,dst=/home/lilt/external_configs,options=rbind:rw \
      --mount type=bind,src=/path/to/root/directory/data,dst=/home/lilt/external_data,options=rbind:rw \
      --rm docker.io/library/on-prem-inference:ubuntu on-prem-inference-ubuntu \
      /bin/bash -c "cd /home/lilt; source py_env/bin/activate; bash src/download_segments_data.py -c /home/lilt/external_configs/download_user_data.yaml;"
      
      The output of this should be the following files:
      $ ls /path/to/root/directory/data/db_segments
      meta_data.json  segments.SRC_LANG  segments.TGT_LANG
      
      meta_data.json contains the number of segments.
      segments.SRC_LANG and segments.TGT_LANG contain the source and target segments respectively.
  4. Split segments data into train and test sets
    Docker:
    docker run \
    --volume /path/to/root/directory/configs/:/home/lilt/external_configs \
    --volume /path/to/root/directory/data/:/home/lilt/external_data \
    --network="host" \
    on-prem-inference:1 \
    bash -c "cd /home/lilt; source py_env/bin/activate; bash src/train_test_split.py -i /home/lilt/external_data/db_segments -o /home/lilt/external_data/splits -s SRC_LANG -t TGT_LANG -p TRAIN_SPLIT_PROPORTION;"
    
    Containerd:
    sudo ctr run -t --net-host \
    --mount type=bind,src=/path/to/root/directory/configs/,dst=/home/lilt/external_configs,options=rbind:rw \
    --mount type=bind,src=/path/to/root/directory/data,dst=/home/lilt/external_data,options=rbind:rw \
    --rm docker.io/library/on-prem-inference:ubuntu on-prem-inference-ubuntu \
    /bin/bash -c "cd /home/lilt; source py_env/bin/activate; bash src/train_test_split.py -i /home/lilt/external_data/db_segments -o /home/lilt/external_data/splits -s SRC_LANG -t TGT_LANG -p TRAIN_SPLIT_PROPORTION;"
    
    Here TRAIN_SPLIT_PROPORTION is a floating-point number between 0 and 1 that indicates how much of the downloaded segments should be used for training. For example, if set to 0.7, then 70% of the segments downloaded into /path/to/root/directory/data/db_segments will be used for train and the remaining 30% will be used for test. For the provided sample within sample_data/db_segments, a TRAIN_SPLIT_PROPORTION of 0.7 will yield 7 train segments and 3 test segments, as provided within sample_data/db_segments/splits/{train,test}.
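    For example, to confirm the resulting line counts after the split (the exact file names under train/ and test/ may differ; adjust the globs as needed):
    wc -l /path/to/root/directory/data/splits/train/* /path/to/root/directory/data/splits/test/*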
  5. Run inference evaluation
    1. Prepare the config to run inference evaluation and save it as /path/to/root/directory/configs/inference_evaluation.yaml
      languages:
        # Source language ISO code
        src-lang: "en"
        # Target language ISO code
        tgt-lang: "es"
      db-connection:
        # DB username
        user: "DB_USER_NAME"
        # DB password
        password: "DB_PASS_WORD"
      inference:
        # Don't change
        model-owners: [ "lilt_api" ]
        # Tasks to perform
        operations: [ "batch-updater", "translate" ]
        # Don't change
        metrics: [ "bleu" ]
        # TOKENIZER: use `zh` for target languages `zh` and `zt`,
        # `ja-mecab` for target language `ja`
        # `ko-mecab` for target language `ko`
        # `13a` for everything else
        sacrebleu-tokenizer: "TOKENIZER"
        # Source directory for batch updates
        batch-updater-source: "train"
        # Set to false to overwrite the existing results
        ignore-operation-if-output-present: true
        lilt_api:
          # Lilt API key
          api-key: "API_KEY"
          # API endpoint, for eg. https://test.lilt.com/2
          api-endpoint: "API_ENDPOINT"
          # Uncomment to reuse existing memory/ project id
          # memory-id: 
          # project-id:
      
    2. Run the Docker or Containerd command below
      Docker:
      docker run \
      --volume /path/to/root/directory/configs/:/home/lilt/external_configs \
      --volume /path/to/root/directory/data/:/home/lilt/external_data \
      --network="host" \
      on-prem-inference:1 \
      bash -c "cd /home/lilt; source py_env/bin/activate; bash src/run_inference.py -c /home/lilt/external_configs/inference_evaluation.yaml -i /home/lilt/external_data/splits -m /home/lilt/external_data/evaluation.json"
      
      Containerd:
      sudo ctr run -t --net-host \
      --mount type=bind,src=/path/to/root/directory/configs/,dst=/home/lilt/external_configs,options=rbind:rw \
      --mount type=bind,src=/path/to/root/directory/data,dst=/home/lilt/external_data,options=rbind:rw \
      --rm docker.io/library/on-prem-inference:ubuntu on-prem-inference-ubuntu \
      /bin/bash -c "cd /home/lilt; source py_env/bin/activate; bash src/run_inference.py -c /home/lilt/external_configs/inference_evaluation.yaml -i /home/lilt/external_data/splits -m /home/lilt/external_data/evaluation.json"
      
      This command first creates a fresh memory and project to be used for this evaluation. Their respective IDs are printed in the logs as below:
      INFO:root:Using Memory Id: 1234
      INFO:root:Using Project Id: 5678
      
      If these memory and project IDs need to be reused, update the evaluation config with them:
      lilt_api:
        memory-id: 1234
        project-id: 5678
      
Once the memory and project are created or reused, the data present within /path/to/root/directory/data/splits/train will be used to run a batch updater operation. Once that completes, the data present within /path/to/root/directory/data/splits/test will be used to run a batch translate operation on the recently updated memory. The results will be downloaded and the computed BLEU scores will be persisted in /path/to/root/directory/data/evaluation.lilt_api.json

Sample output
INFO:root:Seeding meta_data.json files for data sources in /home/lilt/external_data/splits
INFO:root:Validating test
INFO:root:Validating train
INFO:root:Running Lilt API inference
INFO:root:engine URL: mysql://root:***@127.0.0.1:3306/lilt_dev?charset=utf8mb4
Handling connection for 3306
INFO:root:connection established: <sqlalchemy.engine.base.Connection object at 0x7f29b916b1c0>
INFO:root:Using Memory Id: 10
INFO:root:Using Project Id: 10
INFO:root:Generating a TMX file from the data present in train
INFO:root:Uploading /home/lilt/external_data/splits/train/train.tmx to memory 10 and running batch updater
Handling connection for 3306
INFO:root:Successfully uploaded /home/lilt/external_data/splits/train/train.tmx. Waiting till batch updater completes
INFO:root:Pretranslating document at path /home/lilt/external_data/splits/test/test.de
INFO:root:Creating a temp file /tmp/tmp3u_rqp_p/test.de.test.txt compatible with API usage
INFO:root:Requested pretranslation for document test.de.test.txt with id 12
INFO:root:Waiting for pretranslation on document test to complete
INFO:root:Downloading pretranslated document for test into /home/lilt/external_data/translations_out/lilt_api_10_10/test/translated.en
INFO:root:Computing BLEU scores
INFO:root:Metrics for lilt_api:
{
    "bleu": {
        "lilt_api_10_10": {
            "test": [
                79.98082588232835,
                "90.7/82.5/76.5/71.4"
            ]
        }
    }
}
INFO:root:Saving metrics json in path: /home/lilt/external_data/evaluation.lilt_api.json
The inference will be run on the directories present within the splits directory, except for the ones configured as the batch updater source. If the inference section in the config is updated to operations: [ "translate" ], then the train directory won’t be used for the batch updater anymore and inference will be run on it too. To prevent this, the train directory can be deleted from the splits directory so that inference is run only on test. The resulting metrics file can be inspected as shown below.
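
To inspect the persisted metrics file on the host after the run completes (python3 is assumed to be available; any JSON viewer works equally well):

python3 -m json.tool /path/to/root/directory/data/evaluation.lilt_api.json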