The Quality Club

In his famous speech in West Berlin on June 26, 1963, U.S. President John F. Kennedy erroneously juxtaposed his ‘Ich bin ein Berliner’ and ‘civis Romanus sum’.

Why erroneously? Because the pride in being a Roman citizen came from the Roman code of law of 450 BC. The leges duodecim tabularum stated the rights and duties of the Roman citizen in a sufficiently comprehensive manner and, for this very reason, were accessible to everyone in the Forum.

Their simplicity and clarity guaranteed fairness and objectivity. And ultimately freedom.

Es gibt noch Richter in Berlin!

“There are still judges in Berlin!” is often attributed to Bertolt Brecht, but it actually comes from the legend of The Miller of Sanssouci.

Umberto Eco reassembled the real story from the reports of a trial that an Italian jurist and politician published in 1880. The story has it that the Count of Schmettau owned a water mill on the Potsdam hill that had been leased for generations to the same family. A Baron von Gersdof diverted waters to fill a fishpond he had built upstream of the mill, and the mill was left without water and could not work. The miller could no longer pay the rent to the Count, who dragged him before the feudal judge, who ordered the miller to pay. The miller could not pay, so the judge put the mill up for auction and the Baron bought it. Through her soldier nephew, the miller’s wife told the story to Prince Leopold of Brunswick, the king’s nephew. The Prince’s mother spoke to her brother Frederick the Great who, examining the deeds and seeing that the miller was the victim of an injustice, restored his rights and had him compensated for the damage with 1358 thalers, 11 groschen and 1 pfennig. The King also sent the judge to prison.

The miller appealed to his sovereign as Saul of Tarsus had appealed unto the Emperor some seventeen centuries earlier, claiming his right as a Roman citizen to a fair trial under indisputable rules.

Indisputable rules

Indisputable rules… Are there any in translation quality assessment?

Any assessment implies judgement, and quality assessment is still a major concern for translation buyers, whether they are at the end or at the beginning of the supply chain.

For at least the last three decades, brilliant scholars (and pundits) have devoted all their efforts to providing the industry with many different quality metrics, none of which, however, has proved conclusive. In fact, to measure something, you must know what it is and develop metrics to assess it. Unfortunately, all translation quality metrics, so far, have followed the same error-catching-and-assessment approach, i.e. counting the number and ‘weighing’ the size of errors. Also, despite their analytical claims, all those metrics rely on vague, blurry, and subjective criteria and are overly and unnecessarily complicated, resting on the intrinsically wrong assumption that buyers and providers share the same knowledge.

All this prevents translation buyers from having exact and unbiased data to assess their translation effort, budget it, and evaluate the product they receive. Also, being prone to subjective interpretation, these metrics make finding and adjusting the factors that need improvement very hard, thus failing the main goal of any assessment effort.
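To see how the error-catching approach lets subjectivity in, consider a minimal sketch of a weighted error count. The severity labels, the weights, and the per-1,000-words normalization are all assumptions made for illustration; they are not taken from any specific metric. Two reviewers who agree on every error but weigh severities differently end up with very different scores for the same translation.

```python
from collections import Counter

# Hypothetical severity weights: both reviewers see the same errors
# but weigh them differently.
REVIEWER_A = {"minor": 1, "major": 5, "critical": 10}
REVIEWER_B = {"minor": 1, "major": 3, "critical": 25}


def penalty_per_thousand(errors, weights, word_count):
    """Sum the weighted errors and normalize the total per 1,000 words."""
    counts = Counter(errors)
    total = sum(weights[severity] * n for severity, n in counts.items())
    return 1000 * total / word_count


# The same list of errors, found in the same 2,000-word translation.
errors = ["minor", "minor", "major", "critical"]
score_a = penalty_per_thousand(errors, REVIEWER_A, word_count=2000)
score_b = penalty_per_thousand(errors, REVIEWER_B, word_count=2000)

print(score_a, score_b)
```

In practice reviewers also disagree on which errors exist and how severe each one is, so the gap shown here is, if anything, an understatement.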

Don’t Bullshit a Bullshitter

For two decades or so, a new trend has been pervading academia, that of naming every discipline a science. But, if translation is a science, translation assessment should be as well. It would be interesting to know what the father of the scientific method would think of this. He would possibly also abjure his basic principle of measuring what is measurable and making measurable what is not so.

In fact, everyone in translation knows that no two people calculating the value of a translation quality metric will ever produce comparable results, although comparability is the basis for each customer’s primary request, reliability.

To properly use metrics, a quantitative approach is necessary. Measurement translates empirical observation into quantitative relationships, thus yielding hard facts.

Translation quality metrics mix quantitative statements with qualitative assessments. The qualitative approach is predominant, and expert linguists capable of weighing errors are in fact necessary.

Maybe, a quality assessment run can even produce a numeric score and this could even be a close representation of the overall translation, but “there’s no close in science: close didn’t put men on the moon”.

A lesson everyone in translation should have learnt from the beginning is not to bullshit a bullshitter, and yet quality and metrics are still amongst the favorite topics of the many bullshitters around, together with standards, automation, innovation, lean, agile, and more recently blockchain. What is going to be next? This is all serious stuff used to talk around the subject matter to people who should be very familiar with it and should know how to spot the bullshit.

For example, although the translation industry is terribly late on every side of the business front, many of its players and pundits are obsessed with innovation, which is pathetic wishful thinking, given the results, and with growth, which has been driven by M&A and is still measured by revenue alone. Ludicrously, some bullshitters are still ‘chasing unicorns’. By the way, Tesla, Spotify, Dropbox, Uber, Lyft, Airbnb, Slack and Pinterest are all publicly traded, and they all lose money, in some cases a lot of money, sometimes for years and years, long after having gone public. Not to mention Theranos, whose case should serve to curb at least some of the excesses of unicorn culture, including its penchant for hype and its tendency to overlook aggressive rule-bending and corner-cutting.

The get-rich-quick business model that Uber and most other gig-economy companies use is smash-and-grab, screw-the-workers, grow as fast as you can, operate at a loss, and cash out in an IPO. Today, some ruthless entrepreneurs are replacing IPOs with ICOs, which are much easier to set up and equally enrich the few people at the top without keeping any of the promises made to everyone else.

Quoting[*] David Heinemeier Hansson, “Nobody wants to be the one who says the emperor has no clothes. But that’s why it’s so critical to get the message out there”.

On the other hand, great bullshitters confidently tell people only what they want to hear, speaking in a manner that captivates the audience, combining manipulation and lying by omission.

The Club

The concepts around translation quality and the associated metrics are the club that young slaves wield to keep the dangerous heretics afar from the ivory towers where their masters sit.

The clubs that slaves wield may differ according to the vassalage of their masters. Alongside the slaves struggling for liberation and becoming masters themselves sooner or later, unfortunate wannabe slaves join the crusades, elbowing their way to win the consideration of their peers, while others mortgage their parents’ house to join some trade organization, pursue their way to visibility, and defeat irrelevance.

Bullshitters know, lurk and, when a favorable opportunity arises, strike.

On the other hand, the typical rent-seeking attitude of academics hinders any real change.

In his magnum opus, John Maynard Keynes wrote that, “The difficulty lies not so much in developing new ideas, but in escaping from the old ones”.

Although the attitude towards it has changed radically over the last decade, a major hindrance to full acceptance of machine translation is still quality assessment. In fact, while performance has generally been perceived as growing steadily and noticeably, users keep striving to reach an objective approach to measurable quality.

The problem with machine translation quality evaluation is human translation quality evaluation. Unlike machine translation, human translation is always an interpretation (Latins did not use traducere or transferre, but interpretare).

Indeed, human assessment is still based on accuracy, fluency and adequacy, the stuff a quality club is made of.

Unfortunately, the error-catching approach makes it necessary to run translation quality assessment at the sentence level. While this might work for fluency, adequacy assessment can be effective only at the document level, especially as errors become harder and harder to spot at the sentence level while remaining decisive for discriminating between the quality of different outputs. Also, accuracy assessment at the sentence level might be misleading, especially on samples, however meaningful.

Finally, in their century-old brilliant ‘scientific’ efforts, scholars—and pundits—have still not found a clear-cut method to describe, spot, evaluate, and weigh errors, thus leaving accuracy, fluency and adequacy largely prone to subjective interpretation. In fact, none of the metrics available addresses any of these features.

Nevertheless, when those scholars are reminded of possible alternatives, a shout of testudo! rises from their ranks.

As a matter of fact, nobody is claiming a new or novel approach to evaluation, but alternatives exist: checklists, which allow users to prevent errors rather than catch them, or predictive analysis, which allows buyers to know what they are going to spend and what it is worth.

These and other alternatives have been around for a very long time and have been dismissed for having ‘obvious’ limitations.

And if one is willing to expose himself in favor of a paradigm change, must he also detail all his arguments (and possibly proof) in advance to allow those who might feel ‘threatened’ by this outcast to prepare a defense?

What if the suggestion to move from error-catching evaluation to positive evaluation with checklists of desirable features is pertinent and workable? Should a possibly exciting discussion be avoided anyway to protect the status quo?

The time when great innovations are presented at major events is long over. Today, trade and even academic events are just catwalks. When any innovations appear, they are introduced through preprints (if they are the outcome of scientific research) or at special, dedicated corporate events, with plenty of fanfare and media support, and this is especially true in the translation industry despite the poor attention that trade media can draw. The most prominent bullshitters are possibly invited, and obviously well paid for the annoyance, to act as endorsers. Pecunia non olet.

Score Matters

Simplicity is a primary ingredient for reliability. The outcome of automatic metrics is a simple number, which is straightforward, though not easy to interpret properly. Simplicity is the reason for the success of automatic scoring.

When you get a score on a well-known scale, like the one your teachers used to rate you, you naturally tend to rely on that system, rather than on intricate evaluation metrics where a human variable—bias—is in the cards.

Also, however imperfect, automatic scoring metrics have largely proved effective and reliable for measuring engine performance, especially when used consistently. If anything, to unleash their full potential and unlock all their benefits, metrics should be easier to understand and use.

A major problem with automatic metrics, though, lies in the much-insisted-upon correlation with human judgment of translation quality, which follows the error-catching approach. As Kirti Vashee repeatedly wrote, the most popular automatic metric is BLEU, a string-matching algorithm measuring the similarity between two text strings, with no linguistic consideration. Translation quality evaluation following the typical human approach is hard because there is no absolute way to measure how correct a translation is, and because there can be as many correct answers as there are translators. Also, a correct translation may score poorly on BLEU simply because it uses different words from the reference.

Kirti Vashee also reminds us that, to get closer to the traditional translation quality evaluation approach based on contrastive analysis, a single-point score cannot prove reliable, and a more complex framework is necessary.
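The string-matching point can be illustrated with a minimal sketch of clipped n-gram precision, the building block of BLEU. This is a deliberate simplification and not the full metric: it omits the brevity penalty and the geometric mean over several n-gram orders. The function names and the example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference ("clipping"). Tokenization is a plain whitespace split.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "the contract ends next month"
literal = "the contract ends next month"
paraphrase = "the agreement expires in four weeks"

exact = clipped_precision(literal, reference)
loose = clipped_precision(paraphrase, reference)

print(exact, loose)
```

The paraphrase conveys roughly the same meaning, yet only one of its six words matches the reference, and none of its word pairs do. Nothing in the computation knows that "agreement" and "contract" are near-synonyms.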

Automatic quality estimation using predictive analytics might be a workable solution, if consistency is guaranteed in all respects. The data must be from the same source, post-editors must always be the same, and the topics should pertain to the same domain. If all these conditions are met, automatic quality estimation can suggest whether a possible MT output might be good from the start. In this perspective, automatic scoring will eventually help users assess MT output at the document level and waive the sentence-level assessment approach for good.

The same framework above might include a mechanism to combine translatability scores computed upstream on source text with automatic quality estimation scores and compare them with post-factum scores from a combination of checklist scores, correlation-and-dependence, and precision-and-recall scores.

Deviations from expectations are way better, in a path of continuous improvement, than apodictic evaluations in absolute value.
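A minimal sketch of what tracking deviations could look like follows. It assumes that an upstream step produces an expected score per document and that a post-factum step produces an observed score on the same 0-to-1 scale. The document names, the scores, and the tolerance are all hypothetical.

```python
def deviations(expected, observed):
    """Return observed minus expected for every document scored in both."""
    return {
        doc: round(observed[doc] - expected[doc], 3)
        for doc in expected
        if doc in observed
    }


def flag(devs, tolerance=0.1):
    """List the documents whose deviation exceeds the tolerance."""
    return sorted(doc for doc, d in devs.items() if abs(d) > tolerance)


# Hypothetical scores: expectations computed upstream, observations after delivery.
expected = {"doc1": 0.80, "doc2": 0.65, "doc3": 0.90}
observed = {"doc1": 0.78, "doc2": 0.40, "doc3": 0.93}

devs = deviations(expected, observed)
outliers = flag(devs)

print(devs, outliers)
```

Here the absolute scores matter less than the gap: the second document falls well short of its expectation, which points to where the process needs attention.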

Cet obscur objet du désir

The much-coveted innovation cannot come from giving the people what they want. Although there is no compelling evidence of Henry Ford ever saying, “If I had asked my customers what they wanted, they would have said a faster horse”, this still sounds like a statement of rare entrepreneurial wisdom. Not only because there is always a problem trying to figure out what people want by canvassing them, but mostly because very few customers will be able to envision the future: Indeed, vision should be a typical entrepreneurial ability. This explains Paul Valéry’s often misquoted and misattributed statement, “L’avenir est comme le reste: il n’est plus ce qu’il était” (“The future is like everything else: it is no longer what it used to be”), and makes Watts Wacker, Howard B. Means, and Jim Taylor seemingly address the translators’ community when writing in their Visionary’s Handbook that “the closer your vision gets to a provable truth, the more you are simply describing the present”.

It might be true that academic conferences are a place not for advocacy but for discussing solutions. However, if your ideas are your wealth, you won’t squander them in a detailed document just to see it possibly refused and find your work exploited elsewhere, later. It happens.

Anyway, it is impressive how things get clearer when you get involved. Recently, István Lengyel, formerly co-founder of Kilgray (now memoQ), in a blog post to introduce his new company, wrote that he has discovered that productivity is certainly not the primary driver and that what matters most is delivering projects on time, no matter the cost. This is something every industry veteran has learnt the hard way and could confirm.

István Lengyel also candidly acknowledges that it took him a while to realize that the key to successful integrations is not the technology itself but the business model.

In this perspective, though, the right approach to manage payments for small jobs is twofold, consisting of SLAs and minimum fees. Otherwise, this is just another face of the widely described and stigmatized smash-and-grab, get-rich-quick, screw-the-workers Uber-like gig-economy model. On the other hand, translation buyers who can afford the continuous localization paradigm can also afford to sign comprehensive service-level agreements with their vendors including clauses for any sub-vendors.

Conversely, this scenario will not apply to ‘transcreation’, which many stubbornly insist on considering the only possible future in translation, up to repudiating their long-time vision. If “words will be cost-free in 2020”, with almost every word having already been translated, all this is pointless.

In this case, it would be wise to carefully ponder what Daniel Marcu, Applied Science Director at Amazon, told Andrew Joscelyne in a recent interview for the TAUS blog, that the real question is not so much “is MT good enough?” but “how easy is it for people to use MT?” This is exactly where automatic metrics come into play. And where quality “frameworks” and associated old-style metrics have been failing. In fact, Daniel Marcu also said that there is a considerable lack of understanding in the field of translation evaluation and that what fundamentally matters is the impact of the technology in a specific situation.

Does the traditional approach to translation and translation quality still make sense, then?

Stop Making Sense

Seemingly, the idea is finally beginning to develop that many challenges in the language services industry are due to unequal access to high-quality information, that buyers, LSPs, and linguists operate at different levels and with different information. That’s information asymmetry.

Affordable and accessible information is pivotal to cultivating realistic expectations and making good decisions. Unfortunately, the increasing buzz of artificial intelligence in the news is illusory, and the business world, of which the translation industry is a part, albeit a minor one, has been thinking that technology alone, when successfully implemented, will solve all problems. This is why more and more people are concerned about the rise of automation and artificial intelligence. It may be true that AI is going to create 113 million new jobs while displacing 75 million old ones by 2022, but most people will perceive only the loss, as long as they do not see a contrasting reality.

For example, an article in the Guardian of May 29, 2019 unveiled the realities of producing Google Assistant. Behind the “magic” of its ability to interpret 26 languages is a huge team of linguists, working as subcontractors, who must tediously label the training data for it to work. They earn low wages. Behind Facebook’s content-moderating AI are thousands of content moderators; behind Amazon Alexa is a global team of transcribers; and behind Google Duplex are sometimes very human callers mimicking the AI that mimics humans.

If the translation industry has always relied on contingent workers, today an entire economy relies on contingent workers. And they are not necessarily highly educated. The artificial-intelligence industry is already running on invisible labor and the model is spreading to more and more businesses, even in the translation industry, also thanks to delusional pundits who put their minds and pens (or keyboards or voice assistants) to the service of a few tech oligarchs. And they are not done yet.

These people are no different from those who slur them and sell their unfortunate audience scenarios and solutions they know to be improbable, that might have worked just forty years ago and, in any case, for only a 1%. The sheer and bleak current reality is that more and more highly educated people are doing ghost work.

Grand View Research estimates that the translation segment of the AI market is expected to grow at a 14.6% CAGR over the next few years, to reach $983.3 million by 2022. GM Insights predicts a $1.5 billion machine translation market by 2024. AI is going to be ever more critical for business competitiveness. A future of mundane drudgery awaits us, be it evaluating translations sentence by sentence or cleaning language data.

Understanding Technology

Twenty-five years ago, translators and LSPs craved Windows/Office training, and yet they were not willing to buy any from specialized companies, asking for ad-hoc programs. Then came translation-memory tools, and again the demand was for customized training, even at a basic level. For the last few years, the demand has focused on TMSs, although to a much lesser extent.

Every time, armies of consultants met these demands and brought up new ones. There must be a reason if, to date, understanding technology is still a major issue in the translation industry despite the much ado. Just think of those who still express uncertainty and apprehension for the cloud.

Recently, Kirti Vashee has returned to the importance of understanding technology. The great din around data has caused some incompetent and unscrupulous ‘consultants’ to originate and feed the idiocy of translation big data. The two led to the birth of one of today’s myths in the translation industry: that anybody with a supply of translation memory data can easily develop and stand up a reliable MT system using one of the many available DIY solutions. Kirti Vashee also reminded us that a basic competence with machine learning technology, an understanding of the data needed to set up and tune an MT system and of the relevant preparation and optimization processes, and an understanding of the tools and metrics used to measure system performance are necessary. Also, solid IT skills at the system-engineering level would be necessary, but these can be taken for granted. After all, this is a high-tech industry as the usual suspects tirelessly repeat, isn’t it?

A few days before Daniel Marcu, Alon Lavie, now VP Language Technology at Unbabel, told Andrew Joscelyne that to reach human-quality translation a combination of technology, data and human skills is fundamental. He also pointed out that data must effectively represent the use case and that, although this exists to a degree in translation memories, well-organized data is necessary for full information access, which is not how data is sourced today. Finally, he admitted that we are still far from automatic error detection and, from what follows, this is due to the still predominant human approach.

[*] As in Dan Lyons’s Lab Rats.