In a dedicated chapter in The Sense of Style, Steven Pinker defines the ‘curse of knowledge’ as “a difficulty in imagining what it is like for someone else not to know something that you know,” and links it to information asymmetry.

In fact, this kind of cognitive bias favors better-informed parties over less-informed ones, and yet, according to Pinker, “sellers don’t take full advantage of this private knowledge. They price their assets as if their customers knew as much about their quality as they do.”

The Disease of an Entire Industry

Leveraging market-specific information can make a substantial difference in pricing, especially in highly segmented markets. In the translation industry, the larger an LSP, the more information it retains and can leverage.

However, this information is not necessarily related to business or financial aspects.

The translation industry is traditionally affected by information asymmetry, stemming from a signaling problem and leading to an imbalance in the parties’ ability to negotiate the terms and conditions of agreements.

This is one of the causes of the plunge in compensation over the last few years. Unfortunately, the lack of understanding of business dynamics among many industry players has only worsened the situation. In fact, the yearly growth rate of the translation industry is mostly due to the unparalleled growth in volumes, rather than to a higher awareness of the importance of translation and professional services, although that awareness has indeed grown. And, like it or not, this growth in volumes is mostly due to online machine translation engines, which make translation freely available, with all the drawbacks that entails.

Compensation has been dropping not because the industry has been growing steadily for years, but because the marginal cost of translation has been steadily approaching zero, while its marginal utility has not been increasing, at least not at the same rate.

Price is determined by both marginal utility and marginal cost, and this dynamic clearly explains not only why water is priced far lower than diamonds despite being essential, but also why quality is an expected feature of a good or service, one that is not linked to its selling price.
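The classic water–diamond paradox behind this claim can be sketched numerically. The utility curves below are purely hypothetical, illustrative shapes chosen only to show diminishing returns; none of the numbers come from real market data:

```python
def marginal_utility(total_utility, quantity):
    """Utility gained from consuming the last unit."""
    return total_utility(quantity) - total_utility(quantity - 1)

# Hypothetical diminishing-returns curves (illustrative numbers only):
# water is abundant and essential, so its utility saturates quickly;
# diamonds are scarce, so each extra unit still adds noticeable utility.
water_utility = lambda q: 100 * (1 - 0.5 ** q)
diamond_utility = lambda q: 10 * q ** 0.9

abundant_water = marginal_utility(water_utility, 20)   # near zero
scarce_diamond = marginal_utility(diamond_utility, 2)  # still substantial
```

With abundance, the last unit of water adds almost no utility, so its price stays low; the last diamond still adds a lot, which is why price tracks the marginal, not the total, value of a good.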

Marginal cost, marginal utility, and information asymmetry are usually not taught in translation schools, where, conversely, quality is a major topic, although not yet considered from a business perspective. This is one reason translation is supplied without qualitative differentiation across markets, which makes it a commodity.

Quality is also the third element beyond demand and supply, and in a huge, global, undifferentiated market there are always many low-quality suppliers, meeting demand with excess supply.

Just as with money, Gresham’s law also applies in the translation industry, with bad translators driving out the good.

There are small market niches where customers pay better fees because demand is very specific, challenging, and yet high, while supply is low, mostly because even the higher compensation on offer is still below what the required expertise would command. Translation remains, in any case, a low-income profession.

Translation used to be a long-tail business, but technology has been lowering marginal costs, so that revenues and income no longer correspond to talent and effort, and competition will become increasingly unstable and asymmetrical.

Data as Asset

Linguistic data (term bases, corpora, translation memories) is part of the information asymmetry. LSPs use it in negotiations to lock in their customers with the promise of higher quality and faster delivery, and to secure discounts from their vendors for repetitions and full and fuzzy matches.
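The vendor-side discount mechanism amounts to a weighted word count: each translation-memory match band is billed at a fraction of the full rate. The bands and rates below are illustrative assumptions, not industry standards:

```python
# Hypothetical discount grid: fraction of the full per-word rate paid
# for each match band (illustrative values, not industry standards).
DISCOUNT_BANDS = {
    "repetition": 0.10,   # pay 10% of the full rate
    "100%":       0.20,
    "95-99%":     0.50,
    "75-94%":     0.70,
    "new":        1.00,   # full rate
}

def weighted_cost(counts, full_rate_per_word):
    """Effective (discounted) job cost from a per-band word count."""
    return sum(words * DISCOUNT_BANDS[band] * full_rate_per_word
               for band, words in counts.items())

# A 10,000-word job, mostly leveraged from translation memory:
job = {"repetition": 2000, "100%": 1000, "95-99%": 500,
       "75-94%": 1500, "new": 5000}
cost = weighted_cost(job, full_rate_per_word=0.10)
```

Under these assumed rates, the 10,000-word job costs 670 instead of 1,000 at the full rate, which is precisely the leverage a translation memory gives the party that holds it.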

However, since the translation industry is highly fragmented and largely dominated by middlemen, who often add little or no value to the products they buy in from their vendors, linguistic data is an asset only at face value, i.e. for what was paid for it.

Assets are supposed to carry some value, but linguistic data does not necessarily carry an intrinsic value: in this case, value comes from its exploitation, which depends on the user’s ability.

In addition, to be a real asset and validate the investment, linguistic data must be large and reliable, i.e. essentially clean. Unfortunately, this data has usually been passing from hand to hand over the years, most often resulting in a damaged legacy, as well as one that is no longer the property of a sole owner.

Clean Data vs. Quality Data

In recent years, Statistical Machine Translation (SMT) has become interesting particularly for LSPs, mostly thanks to the availability of free DIY SMT engines or affordable online commercial engines, and of large, often massive, amounts of linguistic data to build customized engines.

However, contrary to expectations, assembling training data for an SMT engine to run smoothly and effectively can be costly.

In fact, the larger the data, the harder it is to clean, and even when data can be considered clean, a further distinction exists between clean data and quality data, with the latter including the former. The following table should help clarify this concept.

Clean Data
- Small number of trusted quality sources
- Domain relevance (restricted)
- No fewer than 1,000 segments
- Encoding consistency
- No empty segments
- No mechanical errors (diacritics, punctuation, capitalization, spelling)

Quality Data (clean data, plus)
- Actual data
- Standard-length sentences
- Terminological consistency
- Consistent writing style
- No mistakes or errors (syntax, grammar, spelling)
- Correct translation (exact words, morphology, no loans)

Software tools can be used to clean data for training purposes, but only a human being with a thorough understanding of the data can refine it for quality, i.e. to match the intended purpose and target audience with the preferred writing style and terminology.
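A minimal sketch of what such tool-based cleaning might look like for a TMX file, using only two of the mechanical rules from the table above (no empty segments, no wildly mismatched segment lengths). The length-ratio threshold is an arbitrary assumption, and real cleaning pipelines apply many more checks:

```python
import xml.etree.ElementTree as ET

def clean_tmx(tmx_text, max_len_ratio=3.0):
    """Drop translation units that fail basic mechanical checks.

    Rules (illustrative, thresholds arbitrary):
      1. no empty segments;
      2. segment lengths must not differ by more than max_len_ratio.
    Returns the cleaned TMX as a string.
    """
    root = ET.fromstring(tmx_text)
    body = root.find("body")
    for tu in list(body.findall("tu")):
        texts = [(tuv.findtext("seg") or "").strip() for tuv in tu.findall("tuv")]
        lengths = [len(t) for t in texts]
        if (not texts or min(lengths) == 0
                or max(lengths) > max_len_ratio * min(lengths)):
            body.remove(tu)
    return ET.tostring(root, encoding="unicode")
```

Checks like these scale to arbitrarily large memories; the judgment calls in the right-hand column of the table (style, terminology, translation correctness) do not.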

Unfortunately, most LSPs are not adequately staffed for such a task, nor even to manage a PEMT (post-edited machine translation) project.

Terminology and SMT

Linguistic data can be an asset, though, when clean, reliable, consistent, correct, and suitable.

Effective training data can be rapidly built from high-quality TMX files, which could be further refined with relevant term bases.

The scope of terminology work ranges from authoring to knowledge management, and from education and training to marketing applications, and it is always worth the investment, as terminology impacts virtually every business area of an organization. Today, with the increasing integration of corporate systems, terminology is the main information vehicle to name, label, and detail products, to prompt an action, to help the user, to support and drive a maintenance task, or to persuade, sell, or purchase.

Especially when oriented toward building controlled languages, terminology helps branding and improves communication.

A TMX file can also encompass multilingual glossaries, providing terms as parallel translation segments. In fact, feeding an SMT engine with small parallel segments can help attain better quality, as the engine can retrieve terms as partial translations when a certain number of those segments reach a required similarity percentage.
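As an illustration, glossary term pairs can be wrapped as single-term translation units in a TMX file. The element names follow the TMX 1.4 structure (tmx, header, body, tu, tuv, seg); the function name, language codes, and term pairs are invented for the sketch:

```python
import xml.etree.ElementTree as ET

def glossary_to_tmx(term_pairs, src_lang="en", tgt_lang="it"):
    """Wrap (source term, target term) pairs as TMX translation units."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", srclang=src_lang,
                  datatype="plaintext", segtype="phrase")
    body = ET.SubElement(tmx, "body")
    for src, tgt in term_pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv")
            tuv.set("xml:lang", lang)
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

The resulting file can be concatenated with the regular translation-memory TMX so that the terminology enters the training corpus as short, high-frequency parallel segments.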

Cleaning is fundamental in this respect, since training for terminology must also consist in removing any translation units with incorrect terminology from the training corpus and replacing them with units containing the correct terms.
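A toy sketch of such terminology-driven filtering: drop any unit whose target side uses a known incorrect rendering of a source term. The term lists and the example pair are hypothetical; a real pipeline would match on tokenized, lemmatized text rather than raw substrings:

```python
# Known incorrect target renderings per source term (hypothetical examples).
BANNED = {"browser": {"navigatore"}}

def keep_unit(src, tgt):
    """Return False if the unit translates a listed term incorrectly."""
    src_l, tgt_l = src.lower(), tgt.lower()
    for term, bad_renderings in BANNED.items():
        if term in src_l and any(bad in tgt_l for bad in bad_renderings):
            return False
    return True

corpus = [
    ("Open the browser.", "Apri il browser."),
    ("Close the browser.", "Chiudi il navigatore."),  # incorrect term
]
filtered = [pair for pair in corpus if keep_unit(*pair)]
```

The rejected units are then the natural candidates for correction and re-insertion, so the engine only ever sees the approved terminology.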

Sharing Data

Why, then, share linguistic data? After all, 50,000 segments can be worth up to € 150,000 in revenues…

Do not expect translation vendors, whether freelance translators or LSPs, to share their linguistic data with anyone, especially potential competitors. It is up to customers to claim the data they pay for, even indirectly, and to share it in their own interest.

Is there any better way to have this data cleaned or to acquire new, fresh and clean data?