Data’s Worth

The Simile with Oil

Strange to say, but oil is a commodity, one of the world’s most important, but a commodity all the same. As such, it is fungible, that is, all units are equivalent or nearly so, with no regard to who produces them. They are therefore essentially indistinguishable and interchangeable.

Specifically, oil is a hard commodity, that is, a natural resource that must be mined through drilling. Once extracted, oil must be refined and separated into various types of products for final use. Refining and separation are what make crude oil really valuable.

Therefore, intuitively, it is not exactly true that data is the new oil.

A major MLV might, at best and very generously, be compared to Jett Rink, while no other LSP can be compared to any of the seven sisters, or even to Eni (itself a major player in the petroleum industry), especially when it comes to mining and refining technologies. In general, an SLV might be compared to the owner of an oil well or two.

On the contrary, GAFAM resemble the seven sisters. Like the big oil companies, GAFAM control the platforms through which they mine the data and have the technical and technological capabilities to refine and separate it and make a profit out of it. Generally, this profit does not come from selling the data, but from putting the platforms from which it is mined at the service of customers, so that these customers can directly benefit from the data for their own purposes. As with the big oil companies, the main source of income is not the raw material, but its primary derivatives. Just as companies in the plastics industry use petroleum derivatives to make sophisticated finished products for sale, marketing companies use data to sell profiling, marketing, and advertising services and more.

Use of Linguistic Data

In type, use and, more importantly, volume, linguistic data is simply not comparable to the data commonly referred to as the new oil: Big Data.

Yet, linguistic data is crucial for Natural Language Processing (NLP), a disruptive subfield of linguistics, computer science, and artificial intelligence pertaining more to machine learning than to software development. Today, virtually all messengers, personal assistants, and voice apps use NLP to power their linguistic interfaces, and cloud-based NLP solutions are increasingly in demand to reduce overall costs and improve scalability.

The data for NLP applications consists of properly curated linguistic datasets of written and spoken natural human language.

Curating specific datasets may take a lot of time. Luckily, given the growing importance of NLP, there are plenty of freely available datasets for a variety of applications, such as sentiment analysis (mostly through anonymized e-mails), voice recognition, and chatbots. Audio speech datasets are also freely available for virtual assistants, in-car navigation, and other voice-activated systems.

Not surprisingly, the global NLP market is expected to grow from USD 11.6 billion to USD 35.1 billion by 2026, at a CAGR of 20.3 percent.

NLP is a subset of AI just like machine translation is a subset of NLP. NLP and machine translation algorithms process linguistic data. Both require high-quality datasets for proper training and effective functioning.

Curation is crucial: annotation enhances and augments a corpus with higher-level information, helping machine learning algorithms build associations between actual and conveyed meaning and recognize patterns when presented with new data.
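To make the idea of annotation concrete, here is a minimal sketch of what a curated entry might look like: each raw sentence is augmented with token-level part-of-speech tags and a sentence-level intent label. The sentences, tag names, and intent labels below are invented for illustration and are not drawn from any standard corpus.

```python
# A raw corpus is just text; an annotated corpus attaches higher-level
# information that a learning algorithm can associate with surface forms.
raw_corpus = ["Book a flight to Rome", "Cancel my reservation"]

annotated_corpus = [
    {
        "text": "Book a flight to Rome",
        "tokens": [
            ("Book", "VERB"), ("a", "DET"), ("flight", "NOUN"),
            ("to", "ADP"), ("Rome", "PROPN"),
        ],
        "intent": "book_flight",
    },
    {
        "text": "Cancel my reservation",
        "tokens": [("Cancel", "VERB"), ("my", "PRON"), ("reservation", "NOUN")],
        "intent": "cancel_reservation",
    },
]

def intents(corpus):
    """Return the intent label attached to each annotated sentence."""
    return [entry["intent"] for entry in corpus]

print(intents(annotated_corpus))  # ['book_flight', 'cancel_reservation']
```

The annotation layer, not the raw text, is what lets an algorithm map a new, unseen sentence to a known intent.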

High quality is crucial because, due to the large volume of data required, even a tiny error in the training data could lead to large-scale errors in a machine learning system’s output.

While general-purpose NLP applications do not require specific data, machine translation models for specialized usage require a significant amount of high-quality, domain-specific training data.

Datasets for non-specialized NLP applications are usually drawn from the most diverse sources, as long as they are representative of the real world, even from IoT devices. These sources include conversational exchanges in movies and e-mail, news and scientific articles, journals, books, manuscripts, archival materials, electronic resources, audio and video, social media posts and messages, and so on.

Datasets for specialized NLP applications are harder to harvest through web scraping or crawling. It is much easier, and cheaper, to gather as much parallel language data as possible directly from its owners, typically translation buyers and providers.

On the other hand, going back to the simile with the oil industry, what would the owner of an oil well or two do with the oil extracted, especially without the necessary resources to refine and separate it? The best option is to try to sell it on a commodity market. Of course, this is not as straightforward as one may think: it would be necessary to call in a professional trader and accept the risks involved.

MT-savvy LSPs, on the contrary, always have the chance to exploit their own data to train the machine translation engine(s) of their choice. This, at least, is the received wisdom from the “Sly Foxes” out there, waiting for a chicken to pluck. To be clear, everyone may choose to give their money to whomever they wish, and then, maybe, complain of having been deceived. It is their loss, after all. In fact, training an NMT algorithm is no piece of cake, no more and no less than training an SMT algorithm was.

Anyway, a training dataset should be:

  1. Accurate–the values and metadata in it are correct, specific, and narrow.
  2. Complete–the data in it does not have any gaps.
  3. Up to date–the data is relevant to the intended performance or action of the algorithm.
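The three criteria above can be turned into simple automated sanity checks. The sketch below is purely illustrative: the field names (`source`, `target`, `domain`, `last_updated`) and the cutoff date are invented, not taken from any real pipeline.

```python
from datetime import date

def check_record(record, cutoff=date(2020, 1, 1)):
    """Flag a dataset record against three quality criteria:
    completeness (no gaps), accuracy (specific metadata present),
    and currency (not older than a cutoff)."""
    issues = []
    # Completeness: no empty source or target segment.
    if not record.get("source") or not record.get("target"):
        issues.append("incomplete")
    # Accuracy: metadata must be present and specific.
    if not record.get("domain"):
        issues.append("missing domain metadata")
    # Up to date: relevant to the intended use of the algorithm.
    if record.get("last_updated", date.min) < cutoff:
        issues.append("outdated")
    return issues

sample = {"source": "Hello", "target": "Ciao",
          "domain": "", "last_updated": date(2019, 5, 1)}
print(check_record(sample))  # ['missing domain metadata', 'outdated']
```

Checks like these catch only the mechanical part of quality; whether a translation is actually correct still requires human review.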

Of course, a high-quality dataset will have gone through thorough cleaning to achieve these features.

Cleaning is essential to obtaining high-quality datasets: it allows for identifying and eliminating errors and duplicates, and for removing outdated, incorrect, or simply irrelevant information.

When dealing with linguistic data, whether it comes from Web scraping or from TMs, the issues to check for are:

  1. Mechanical–repetitions, duplications and multiple matches (in TMs), wrong alignment and segmentation (in TMs), wrong encoding, and extra spaces.
  2. Linguistic–mistranslations, omissions, and lexical and morphological errors.
  3. Both–spelling in general, diacritics, and punctuation.

Formatting of dates, numbers, and formulas should also be normalized, together with letter case.
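The mechanical side of TM cleaning can be sketched in a few lines of Python: Unicode normalization, collapsing extra spaces, dropping empty or crudely misaligned pairs, and removing exact duplicates. The segment pairs and the length-ratio threshold below are invented for illustration; the linguistic side (mistranslations, omissions) is not modeled, as it still requires human review.

```python
import unicodedata

def clean_pairs(pairs):
    """Mechanically clean (source, target) TM segment pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Fix encoding inconsistencies and collapse extra whitespace.
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
        # Drop empty segments.
        if not src or not tgt:
            continue
        # Drop pairs with wildly different lengths, a crude proxy
        # for wrong alignment or segmentation.
        if len(src) > 3 * len(tgt) + 10 or len(tgt) > 3 * len(src) + 10:
            continue
        # Remove exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned

tm = [
    ("Hello  world", "Ciao mondo"),
    ("Hello world", "Ciao mondo"),   # duplicate after space collapsing
    ("", "Vuoto"),                   # empty source segment
]
print(clean_pairs(tm))  # [('Hello world', 'Ciao mondo')]
```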

Special issues are terminology conflicts and domain-specific data (usually inaccurate translations), both requiring manual treatment.

In general, data cleaning starts with the selection and removal of duplicates, and the correction of incomplete and corrupt data, and continues with annotation wherever possible.

Anonymization, through the identification and removal of personally identifiable information, is not always necessary. When it is, it is a tedious, time-consuming, supervised task involving a lot of manual work. On the other hand, the concern for personal data resembles the attention to gender issues and bias in general, as if removing pork or beef from menus were enough to boost sales in Muslim or Hindu countries. Of all data manipulation tasks, anonymization is the most demanding and costly, and should be run only when strictly necessary. In most cases, in typical translation-industry practice, which still heavily and largely involves humans, it is not.
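A toy example gives a sense of why even the easy part of anonymization is fiddly: rule-based patterns can catch e-mail addresses and phone-number-like strings, but names, addresses, and IDs are far harder and still demand supervision. The regular expressions and placeholders below are illustrative only, not a production approach.

```python
import re

# Rough patterns for two common PII types. Real-world variants
# (obfuscated e-mails, national number formats) would slip through.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def anonymize(text):
    """Replace e-mail addresses and phone-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(anonymize("Contact mario.rossi@example.com or +39 02 1234 5678."))
# Contact [EMAIL] or [PHONE].
```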

Anyway, however crucial clean data is for AI, and consequently for NLP and MT, cleaning is often overlooked and insubstantial, mostly because of the costs, the work and, obviously, the time involved. In fact, even in the utmost urgency of the pandemic, gathering and cleaning data in the SARS-CoV-2 domain and training the language models took several months.


Unfortunately, too many people still seem not to care for the language professionals who are the end users of raw machine translation output. Likewise, too many people in this industry have been focusing for much too long on marketing and sales only, catering to whatever buzzwords buyers (seem to) have in their heads to call the shots.

Profiting from Data

The growing hype around data and datafication is generating a new business issue: making money out of the data produced or gathered.

Contrary to what some people claim and would have us believe, linguistic data has no intrinsic economic value: as with other types of data, its value depends on its intended use. The value of business data, for example, comes from the insights that can be drawn from it.

For years, translation tool vendors have nurtured the idea that translation memories are assets, and virtually all LSPs have bought into it. This idea stemmed from the substantial discounts claimed for repetitions and full and fuzzy matches, which still allowed LSPs to monetize the data they gathered, owned, and leveraged. Over time, though, as more and more customers became familiar with these technologies, many joined the largest customers who had first demanded such discounts. Today the pressure on prices is such that these ‘recoveries’ are no longer enough to sustain it.

Indeed, the demand for data is strong, but the supply is just as strong, and mostly free. In fact, there are plenty of sites listing data repositories to pull data from for free. This is as true for strategic data–that is, the data used to get business insights and drive a company’s future–as it is for the text and voice data used in NLP applications. It is not true, however, for narrow, vertical data, which companies must continue to provide on their own. Still, many companies lack a data vision to guide them down the AI/ML path. Needless to say, most SMEs do not have a data science department, or even one or two data scientists in house, and this is a serious impediment to their seeing the value in data. ‘Ordinary’ data, anyway.

For this reason, some contend that a new breed of service companies is emerging whose core business is built around data. They deliver business intelligence from customers’ data to allow them to make informed decisions about new business efforts. The business is called Data Science as a Service (DSaaS) and consists of outsourcing data science. It is meant for companies suffering from the shortage of data scientists, and thus lacking in-house data science capabilities, to provide them with a way to leverage data science anyway.

The business and the relevant technology are still quite niche, and it is not yet clear whether DSaaS is just another phrase to play with, another fad, or the business of the near future.

If DSaaS is not just yet more crap, it will make the monetization of language data even more unlikely, because DSaaS companies are going to hoard data for major customers (typically big techs) and play it down to secure their market share. For LSPs to monetize their data, they will have to move fast now, although without any certainty. Also, to raise any interest, datasets will have to be both very narrow and sizeable, which excludes almost all SME LSPs.

Still willing to chase the data myth?

Author: Luigi Muzii
