Old Dog, New Tricks

Yoast is a Dutch search-optimization firm whose very popular WordPress plugin is described by GoDaddy and Forbes as one of the most powerful applications available to WordPress users.

Yoast SEO offers a very interesting “Readability analysis” that detected four major problems in the post prior to this one and suggested one key improvement.

The problems were:

  • A Flesch Reading Ease score of 40.8, meaning it may be difficult to read;
  • 18.9% of sentences using passive voice (the recommended maximum is 10%);
  • 4 sections exceeding 300 words in length with no subheadings;
  • 45% of sentences with over 20 words (the recommended maximum is 25%).

The suggested improvement concerns shortening sentences (to a maximum of 20 words), using less difficult words, and shortening paragraphs (to a maximum of 150 words).
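To make the figures above more concrete, here is a minimal Python sketch of two of these measures: the standard Flesch Reading Ease formula for English and the share of sentences longer than 20 words. It is not Yoast's implementation. The function names, the sentence splitter and the syllable counter are illustrative assumptions, and the syllable count is only a rough heuristic, so results will differ slightly from what the plugin reports.

```python
import re


def count_syllables(word: str) -> int:
    """Rough English estimate (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words_in(sentence: str) -> list:
    """Words are runs of letters and apostrophes."""
    return re.findall(r"[A-Za-z']+", sentence)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = split_sentences(text)
    words = [w for s in sentences for w in words_in(s)]
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence with words")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)


def long_sentence_share(text: str, max_words: int = 20) -> float:
    """Percentage of sentences with more than max_words words."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    long_count = sum(1 for s in sentences if len(words_in(s)) > max_words)
    return 100.0 * long_count / len(sentences)
```

Calling flesch_reading_ease(post_text) and long_sentence_share(post_text) on the text of a post gives figures comparable in kind, though not identical in value, to those in the list above.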

On the other hand, the software praised the use of transitions. Indeed, transitions improve readability by establishing a relationship between two sentences and making it apparent. Transitions also bridge memory lapses between paragraphs, helping to increase variety and significance.

For example, in music, memory time is supposed to be what turns a cantata into a nagging hit. That is to say, memory time might be what separates Monteverdi’s Magnificat from Macarena.

Transitions are also helpful in NMT, where decoders translate full sentences by focusing on sequences of words.

Improving Writing

The effectiveness of a translation depends primarily on the clarity, fairness and neatness of the source text. If the source is solid, simple and concise, even a few minor mistakes are tolerable.

For this reason, even in the 1970s and 1980s, rule-based machine-translation software could work successfully without requiring a lot of post-editing as long as technical terms and other noun phrases were pre-translated and added to the software’s dictionaries.

Indeed, this is also why readability formulas enjoyed a certain success in those years, and their period of glory lasted as long as RbMT was in vogue. They are now mostly ignored, but the time may soon come for a conscious, wise and limited use of them.

As a matter of fact, criticisms of readability formulas mostly address the use of sentence and paragraph length as a measure of difficulty, the arbitrariness of compensation factors, the number of words in samples and their selection criteria. Such criticisms are well founded, since word length is not, per se, a measure of difficulty, just as short, simple sentences do not necessarily make reading easier.

Still, Ernest Hemingway and Raymond Carver are both famous for their brutal minimalism. Hemingway followed four basic rules he received at the Kansas City Star when he was starting out, and he never abandoned them. The first rule read, “Use short sentences”. Later, he declared that “Eliminate every superfluous word” was the other basic rule he had learnt for the business of writing. Carver himself said that the hallmark of his prose was due to ongoing heavy editing and a quote from Ezra Pound printed on a 3×4 card above his desk. That quote became Carver’s First Commandment: “Fundamental accuracy of statement is the one sole morality of writing”.

Matching Expectations

If translation is craftsmanship and its quality depends on the customer’s subjective expectations, terminology and style are crucial. Therefore, a glossary and a style guide should be made available before any translation project begins, together with goals and expectations, to minimize misunderstandings, disputes and time lost to personal preferences.

What if the authors of source content followed the same terminology and rules set forth for translation?

While every company could benefit from implementing a controlled language, not every company can afford one, and not all companies need one. Indeed, it is not strictly necessary. Ambiguity and complexity can be reduced or eliminated with the accurate and consistent use of established terminology and a simple set of basic writing rules, like Hemingway’s.

Once you reach a certain effectiveness in writing, then you may plan to implement a controlled language.

In the process, improving your writing will also increase leverage in your content management system and lower the rate of errors, thus reducing time and cost for assistance.

At the same time, defining standardized terminology will help branding and user friendliness, and protect against confusion and the risk of undermining your own intellectual property rights or infringing those of competitors.

A company could also benefit from controlled authoring in tasks other than content production, for example in user assistance.

Refining and improving the authoring process is possible even in long-established settings because it is a disproven myth that older dogs cannot learn new tasks.

When dealing with quality, especially in multilingual projects, controlled authoring can help.

For example, according to the Yoast SEO WordPress plugin, Building a Localization Kit scored well for readability. Indeed, it was conceived, designed and written to this end, and with translatability in mind. It is deeply and inherently consistent in terminology and style, and this made it acceptable even from an SEO perspective. This means that the paper can be machine-translated and post-edited with very little effort. The same goes for the PEMT guide.

By the way, for a few years now, Word has also offered a function to perform readability tests and display the relevant scores, along with many specific grammar and style checks.

Rem tene verba sequentur, wrote Cato the Elder in his Orationes: Get the facts and words will come. Two thousand years later, in his ABC of Reading, Ezra Pound again wrote that “Incompetence will show in the use of too many words”.

A terminology tool and a grammar and style checker can help you make your writing inherently cohesive, and this, in turn, will make it more translatable. In fact, readability and translatability are deeply related, and the readability score may be used to compute a predictive quality score based on the vendor’s history in the same vertical.

Now, imagine a predictive quality score compared with an actual, post-factum score computed from content profiling and initial requirements (checklists), traditional translation “QA” (i.e. checking for machine-detectable errors in punctuation, numbers, inline tags, capitalization or extra spaces, missing translations or terminology inconsistencies), correlation and dependence, precision and recall and the increasingly venerated edit distance.
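As an illustration of two ingredients named above, the following Python sketch computes a character-level edit distance (Levenshtein) between a machine-translated segment and its post-edited version, and runs a few machine-detectable QA checks on a source/target pair. All function names and the choice of checks are assumptions made for this example; real QA tools cover many more cases, and edit distance is often computed at word level instead.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(mt_output: str, post_edited: str) -> float:
    """Edit distance divided by the longer length: 0.0 means no edits, 1.0 a full rewrite."""
    longest = max(len(mt_output), len(post_edited))
    return edit_distance(mt_output, post_edited) / longest if longest else 0.0


def qa_issues(source: str, target: str) -> list:
    """A few illustrative machine-detectable checks (hypothetical selection)."""
    issues = []
    if not target.strip():
        issues.append("missing translation")
    if re.search(r" {2,}", target):
        issues.append("extra spaces")
    # Compare the digit groups found in source and target, ignoring order.
    if sorted(re.findall(r"\d+", source)) != sorted(re.findall(r"\d+", target)):
        issues.append("number mismatch")
    return issues
```

A low normalized distance between raw and post-edited output suggests little post-editing effort, which is the kind of actual score that a predictive score could be compared against.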

According to Yoast SEO, Building a Localization Kit scores 46.3 on the Flesch Reading Ease test, meaning that it is best understood by college graduates, who are indeed its intended audience. Many of its sections are longer than 300 words and are not separated by any subheadings, while only 15.3% of sentences contain transition words and 14.9% contain passive voice. Its translatability, however fair, could therefore be improved. To verify all of the above, the Italian version of Building a Localization Kit is available for download and comparison.

Also, applications like Dave Landan’s StyleScorer could be a real breakthrough: by taking advantage of better and cheaper machine learning and deep learning platforms, they could score new documents against the style of established documents, thus helping customers reasonably predict the expected quality of translations.

Because, as Isaac Asimov wrote in Change! Seventy-One Glimpses of the Future, “Part of the inhumanity of the computer is that, once it is competently programmed and working smoothly, it is completely honest”.

Integrating the Workflow

To effectively integrate machine translation into a company’s workflow, the content development process must be reviewed, and possibly re-engineered, to incorporate any downstream tasks. For example, Automatic Post-Editing (APE) may be an option, but, as Matteo Negri recently explained, its present and future depend on the quality of machine translation output. This, in turn, depends on the data and skills required, which come at substantial cost.

It is true that, from a general perspective, SMT has been outperformed and replaced by NMT, which now represents the state of the art. However, it took twenty years to gain a few decimal points in BLEU scores and reduce PEMT effort by 10% in this shift, and it will most probably take a few more years to reduce it by another 5%.

Therefore, the claim that there will soon be no need for post-editing is over-optimistic at best.

Also, not only are NMT engines data-hungry, but the underlying infrastructure and knowledge required are also very demanding. This makes training and tuning NMT engines challenging and explains why developing MT engines that might eventually make post-editing, or even APE, pointless will long remain the prerogative of a few.

For this and many other reasons that have also been illustrated here, anyone wishing to integrate MT into their business workflow must keep their eyes wide open, primarily in choosing the engine, and never rely solely on automatic routing tools: every piece of content is different, so authoring, profiling and pre-assessing must be handled thoroughly and competently.