
Data Source Maintenance Best Practices

This article goes over several strategies project managers can use to leverage Data Sources when optimizing localization workflows.

TMX file sizes

TM files can be uploaded directly to Lilt if they are 200 MB or smaller. For files larger than 200 MB, you can zip the TMX file and upload it with the extension .tmx.zip. Note that TMX files are currently the only type of zipped memory file Lilt supports for import.
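The size check above can be scripted before upload. The following is a minimal sketch, not part of Lilt's tooling: the function name `prepare_tmx_for_upload` is hypothetical, and it assumes that "200 MB" means 200 × 1024 × 1024 bytes, which the article does not specify.

```python
import os
import zipfile

# Assumption: "200 MB" is read as 200 * 1024 * 1024 bytes; adjust if needed.
MAX_DIRECT_UPLOAD_BYTES = 200 * 1024 * 1024


def prepare_tmx_for_upload(tmx_path, limit=MAX_DIRECT_UPLOAD_BYTES):
    """Return the path to upload: the TMX itself, or a .tmx.zip archive of it."""
    if os.path.getsize(tmx_path) <= limit:
        return tmx_path  # small enough to upload directly

    zip_path = tmx_path + ".zip"  # memory.tmx -> memory.tmx.zip
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store only the file name inside the archive, not the full directory path.
        archive.write(tmx_path, arcname=os.path.basename(tmx_path))
    return zip_path
```

Calling `prepare_tmx_for_upload("memory.tmx")` returns either the original path or the path of the newly created `memory.tmx.zip`.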

For a list of supported file types, see the Supported File Formats article.

TB CSV file format

Before uploading CSV files, you’ll want to check the Data Source Settings for the following:

  • If you want the entries you import to be immediately visible to translators, set the Default TB Entry Status to Reviewed.

  • If you want the imported entries to be reviewed by a linguist before becoming available for use in the Lilt Translate Terminology Pane, set the Default TB Entry Status to Unreviewed or Draft.

When uploading CSV files, the columns should be structured with the first two columns as source text and target text. If a row is missing either of these, the row will not be imported. Termbase CSV files can optionally include a header row as the first row.

CSV Termbase example:

source text,target text,status,created date,updated date
hello,hola,272542,2021-04-09 15:25,2021-04-09 15:25
where,dónde,272542,2021-04-10 16:33,2021-04-10 16:33
fly,volar,272542,2021-04-11 10:01,2021-04-11 10:01
listen,escuchar,272542,2021-04-12 13:48,2021-04-12 13:48

[Image: CSV Termbase example as viewed in a spreadsheet application]

Note: If an entry's source text or target text contains commas, you will need to wrap it in quotes to let Lilt know where the text starts and ends. If you don't wrap the text in quotes, Lilt will assume the text ends at the first comma it finds.
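If the termbase is generated by a script, a CSV library can apply this quoting automatically. The sketch below uses Python's standard `csv` module. The sample rows and the function name `rows_to_csv` are illustrative assumptions, not data or tooling from Lilt.

```python
import csv
import io

# Illustrative rows only: the second entry contains commas in both fields.
rows = [
    ("source text", "target text"),
    ("hello, world", "hola, mundo"),
    ("fly", "volar"),
]


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
```

With `csv.QUOTE_MINIMAL`, fields containing a comma are wrapped in double quotes and all other fields are written unchanged.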

CSV files with header rows can be imported by using the Termbase (TB, with header) option when importing, as described in the Uploading TMX (Memory) Files article.

Termbase CSV files can have as many metadata columns as you would like, and individual rows can omit any number of metadata fields. Each metadata field needs to be separated by a comma. In the example above, the metadata columns are status, created date, and updated date. Metadata fields are useful for providing context to translators.

Before uploading, make sure there are not any undesired spaces in the CSV file, as all spaces are included in the imported data, even spaces directly after commas.
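A pre-upload check along these lines can catch stray spaces and incomplete rows. This is a minimal sketch under stated assumptions: the function name `clean_termbase_rows` is hypothetical, it trims whitespace around every field, and it reports rows that lack source or target text, since the article states such rows are not imported.

```python
import csv
import io


def clean_termbase_rows(csv_text, has_header=True):
    """Strip surrounding spaces from each field and flag incomplete rows.

    Returns (kept_rows, skipped_line_numbers). Line numbers are 1-based.
    """
    reader = csv.reader(io.StringIO(csv_text))
    kept, skipped = [], []
    for index, row in enumerate(reader):
        fields = [field.strip() for field in row]
        if has_header and index == 0:
            kept.append(fields)  # keep the header row as-is (trimmed)
            continue
        # A usable row needs non-empty source text and target text.
        if len(fields) < 2 or not fields[0] or not fields[1]:
            skipped.append(index + 1)
            continue
        kept.append(fields)
    return kept, skipped
```

The cleaned rows can then be written back out with `csv.writer` before uploading. Note that this sketch also trims spaces you may have intended to keep inside a field's leading or trailing position, so review the result for entries where such spaces are meaningful.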

TMX content clean-up

To avoid having to clean up TM entries in Lilt after uploading TMX files, it is best to clean up the files beforehand. If a Data Source contains polluted TMX files, translations can become inconsistent and productivity can drop. Look out for the following when cleaning your TMX files:

  • Sort through the older TM entries and remove entries that are outdated. This can be as easy as removing all entries that were last utilized before a given date.
  • Remove duplicate TM entries that have the same source text but different target text.

TMX file naming convention

TMX files evolve over time, so to better keep track of your TMX files, it is helpful to name them with informative labels such as the date, type of content, and name of the product or project. Whatever format you choose, maintaining a consistent naming convention will help project managers stay organized by being able to easily identify which projects are associated with which Data Sources.

Example naming convention:

  • [DATE]-[CONTENT TYPE]-[NAME]

In this example, [NAME] could be the product or the project. Examples using this naming convention:

  • 01012020-Legal-BankContracts
  • 01012020-Marketing-Devicev2.0
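Generating names from a small helper keeps the convention consistent across a team. The sketch below is illustrative: the function name `data_source_name` is hypothetical, and it assumes the date in the examples above is written as MMDDYYYY, which the article does not state explicitly.

```python
from datetime import date


def data_source_name(day, content_type, name):
    """Build a [DATE]-[CONTENT TYPE]-[NAME] label.

    Assumption: the date is formatted as MMDDYYYY to match the examples.
    """
    return f"{day.strftime('%m%d%Y')}-{content_type}-{name}"


label = data_source_name(date(2020, 1, 1), "Legal", "BankContracts")
```

Here `label` is `01012020-Legal-BankContracts`, matching the first example above.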

TM repetition management

Every segment that gets translated in Lilt Translate is added as a source/target TM entry within the associated Data Source. If the same source/target pair is processed multiple times, this will result in duplicate TM entries in the Project's associated Data Source. In the Lilt Translate Segment Context pane, duplicate entries are stacked to avoid confusion. The stack is indicated by a label such as "+1 identical result".

It is up to project managers to determine how often they want to remove duplicates, or whether they want to remove duplicates at all. On the Manage entries page of the Sources tab, entries are sorted by the date they were last updated. If you want to find duplicates for a specific entry, use the search bar to filter the results.

Archiving Projects to preserve TM entries

If you want to remove a Project from the Projects List, you can either archive the Project or delete the Project.

  • Archiving a Project will retain that Project’s TM entries within the associated Data Source. If you plan to reuse segments, archiving a Project is generally a better option than deleting the Project.
  • Deleting a Project will also delete that Project’s TM entries from within the associated Data Source. If you want to delete a Project, consider first downloading the TM entries using the Download memory as TMX option and downloading the TB entries using the Download termbase as TMX option. This will allow you to utilize the Project’s TM entries and TB entries in the future by importing the data into a Data Source.

TMX files and machine translation

When TMX files are uploaded, the Contextual AI foreground model associated with the Data Source learns from the document content immediately. Note that deleting documents from a Data Source does not affect the Contextual AI model (i.e. the Contextual AI model does not unlearn the deleted resources). However, there is a recency bias, meaning the most recent documents have a stronger influence on the translation output.

Every year or so, Lilt releases a new background Contextual AI model that is used to retrain all foreground Contextual AI models. Retraining does not take document recency into account. For this reason, it is generally good practice to remove old and incorrect translations from the TM. However, in the case that your Data Source has two TMX files that are very similar, but the newer document is only a subset of the older document, it is advisable to keep both so as not to throw away any additional data contained in the older document.

Uploading TMX to Concordance only

When uploading TMX files, you have the option to upload them for Concordance only. With this option, the TMX files are indexed for Concordance, but are not used to train the Contextual AI or returned as TM results. This can be useful for outdated TMX files that still contain context and information you want to make available to linguists under Concordance.
