Start Working With Data

This is a free add-on to the guide to Post-editing of Machine Translation, available for purchase on this website, although the information here is not limited to use with machine translation.

This post may also be read in conjunction with the other post on translation metadata, especially for its implications for quality and portability, and for the often predicted, and still unrealized, death of translation memories and of the tools used to produce and manage them.

The New Oil?

A now common refrain says that “Data is the new oil,” and, just like oil, data is of little use until it is refined into something profitable.

Most of the data available today, the data generally likened to oil, was created in the last two years, and only a tiny portion of it may consist of language and translation data.

In fact, if data really is the new oil, its owners should care for it and protect it as a precious resource. Managing data, however, can be problematic, time-consuming and costly, even for those who know how to do it.

For example, to harness the full potential of language and translation data, the following steps are essential:

  1. Datafication;
  2. Cleaning;
  3. Tagging;
  4. Transformation;
  5. Normalization.

Datafication

Datafication mainly consists of making all text available in a digital format suitable for text analytics. This includes scanning any paper sources and converting the output into such a format.

Once this is done, all the types and sources of data must be identified and documented. This goes well beyond storing all data in a single place: it means detailing how the data is produced (i.e. the data source), where and in which format it is stored, the content type, and the dates of creation and last update.
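As an illustration only, such an inventory could be kept as one small record per data source. The sketch below is in Python; the class name DataSourceRecord, its field names, the function inventory_to_json and the sample values are all assumptions made for this example, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class DataSourceRecord:
    """One inventory entry per data source (hypothetical field names)."""
    name: str
    produced_by: str        # how the data is produced, i.e. the data source
    storage_location: str   # where the data is stored
    file_format: str        # the format in which it is stored
    content_type: str
    created: date
    last_updated: date


def inventory_to_json(records):
    """Serialize the inventory to JSON, writing dates as YYYY-MM-DD."""
    return json.dumps(
        [asdict(record) for record in records],
        default=lambda value: value.isoformat(),  # called for date objects
        indent=2,
        ensure_ascii=False,
    )


# Sample entry with made-up values.
example = DataSourceRecord(
    name="Product manuals EN-IT",
    produced_by="CAT tool export",
    storage_location="/data/tm/manuals_en_it.tmx",
    file_format="TMX",
    content_type="technical documentation",
    created=date(2019, 5, 2),
    last_updated=date(2021, 3, 3),
)
print(inventory_to_json([example]))
```

A plain JSON file like this is easy to keep under version control and to extend with further fields as the inventory grows.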

Cleaning

Data that is inaccurate, damaged, corrupt, or flawed, and therefore unfit for analytics, must be rectified. When this is impossible, it should be removed from the working data set and stored separately for possible future rectification.

During cleaning, all kinds of errors should be picked up, from human errors to data corrupted by faulty applications, systems or storage.

The most common issues with language data are noise and formatting. Special attention must be paid to empty and/or untranslated segments, duplications, misspellings, typos, diacritics, encoding errors, punctuation errors, extra spaces, etc.
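A few of these checks can be sketched with the Python standard library alone. The example below is a minimal sketch: it assumes the data is a list of (source, target) segment pairs, and the function names normalize_text and clean_segments are hypothetical. It covers only extra spaces, inconsistently encoded diacritics, empty segments, untranslated segments and exact duplicates. Rejected pairs are set aside with a reason rather than discarded, so they can be reviewed later.

```python
import re
import unicodedata


def normalize_text(text):
    """Compose diacritics (NFC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_segments(pairs):
    """Split (source, target) pairs into kept and rejected lists."""
    kept, rejected, seen = [], [], set()
    for source, target in pairs:
        source, target = normalize_text(source), normalize_text(target)
        if not source or not target:
            rejected.append((source, target, "empty"))
        elif source == target:
            # Assumption: an identical target means "untranslated"; this
            # may be legitimate for names or codes, hence the review list.
            rejected.append((source, target, "untranslated"))
        elif (source, target) in seen:
            rejected.append((source, target, "duplicate"))
        else:
            seen.add((source, target))
            kept.append((source, target))
    return kept, rejected


sample = [
    ("Hello  world ", "Ciao mondo"),
    ("Hello world", "Ciao mondo"),
    ("Save", ""),
    ("OK", "OK"),
    ("Cafe\u0301", "Caff\u00e8"),  # "e" followed by a combining accent
]
kept, rejected = clean_segments(sample)
print(kept)
print([reason for _, _, reason in rejected])
```

Misspellings, typos and punctuation errors are language-specific and need dedicated checks that this sketch does not attempt.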

Most data preparation tasks can be performed using standard tools.

Tagging

To make the job of text analytics systems easier, data should be tagged, i.e. labelled with additional descriptive data. Each segment/record/block should be marked up with important and relevant data such as the author’s name, the date of creation or last update, the domain, the project, etc.
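As a sketch of what such tagging might look like in practice, the Python example below attaches the same descriptive data to every segment of a batch. The names TaggedSegment and tag_segments, the tag keys and the sample values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedSegment:
    """A translation segment plus descriptive tags (hypothetical layout)."""
    source: str
    target: str
    tags: dict = field(default_factory=dict)


def tag_segments(pairs, **tags):
    """Attach the same tags to every (source, target) pair in a batch."""
    # Each segment gets its own copy, so later edits to one segment's
    # tags do not leak into the others.
    return [TaggedSegment(source, target, dict(tags)) for source, target in pairs]


batch = tag_segments(
    [("Hello world", "Ciao mondo"), ("Save", "Salva")],
    author="J. Doe",
    last_updated="2021-03-03",
    domain="software",
    project="demo-ui",
)
print(batch[0])
```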

Transformation

Once the data is clean and tagged, it is necessary to convert it into the correct format for the analytics systems of choice to work with, discarding any elements which they will not read.
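The conversion itself depends entirely on the target system. Purely as an example, the Python sketch below assumes the analytics system reads JSON Lines (one JSON object per line) and recognizes only a known set of fields; the function name to_jsonl and the field names are hypothetical.

```python
import json


def to_jsonl(records, keep_fields):
    """Write one JSON object per line, keeping only the listed fields."""
    lines = []
    for record in records:
        slim = {key: record[key] for key in keep_fields if key in record}
        # ensure_ascii=False keeps accented characters readable.
        lines.append(json.dumps(slim, ensure_ascii=False))
    return "\n".join(lines)


records = [
    {"source": "Coffee", "target": "Caff\u00e8", "domain": "food", "internal_id": 17},
    {"source": "Save", "target": "Salva", "domain": "software", "internal_id": 18},
]
output = to_jsonl(records, keep_fields=["source", "target", "domain"])
print(output)
```

Listing the fields to keep, rather than the fields to drop, means anything unexpected in the input is discarded by default.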

Normalization

Analytics software usually expects dates, numbers, measures, formulas, codes, names, etc. to be presented in a uniform way, to avoid any misinterpretations. For example, dates should be presented in an eight-digit format, typically YYYY-MM-DD, the ISO 8601 calendar date format.
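For dates, a minimal Python sketch could look like the one below. It assumes the input formats are known in advance and that slashed or dotted dates are day-first; the function name normalize_date and the list KNOWN_FORMATS are hypothetical. A date such as 03/04/2021 is ambiguous on its own, so the expected order has to be decided per data source rather than guessed.

```python
from datetime import datetime

# Assumed input formats, tried in order; day-first is an assumption.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_date(text, formats=KNOWN_FORMATS):
    """Return the date as YYYY-MM-DD, or raise ValueError if unrecognized."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {text!r}")


print(normalize_date("31/12/2020"))
print(normalize_date("05.01.2021"))
print(normalize_date("2021-03-03"))
```

Raising an error for unrecognized input, instead of passing it through unchanged, keeps non-uniform values from slipping into the analytics system unnoticed.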

Is your data ready?