Developing Requirements and Compensation Schemes for PEMT Jobs

Post-editing of Machine Translation

PEMT

PEMT stands for post-editing of machine translation, and usually refers to the process of checking machine translation output for errors and making the corrections needed to improve it with minimum effort.

A Brief Introduction to MT

Machine translation did not start with Google Translate: it dates back almost 60 years, born of military needs.

The best-known event in machine translation history is the publication of the ALPAC report in late 1966, which branded machine translation as “too expensive, too time consuming, too inaccurate”, thus bringing research funding to an end in the United States for some twenty years, and sending the general public and the rest of the scientific community the clear message that MT was hopeless.

Since then, MT has always met with opposition from translators fearing it could take work away from them or harm their reputation.

Even today, despite Google Translate’s success, the ‘failure’ of MT is still repeated by many as an indisputable fact, and translation industry insiders are still debating a moot point.

The conclusions of the ALPAC report were mostly due to the large number of translators available at the time and the relatively small volume of texts to be translated, which made MT uneconomical.

Today, the volume of content requiring translation is growing steadily, together with the demand for translations of publishable quality, pushing translation beyond human scale: only a tiny fraction of the content that should be translated actually is.

Recent developments are turning MT into a genuine translation tool, pushing traditional translation buyers, LSPs, and translators to take a different perspective on what MT can do for them.

Applications and Rationale for MT

At present, the typical applications of MT are in knowledge bases, support articles, user assistance information, technical documentation, and instructions for use. This is the kind of material that manufacturers, and companies in general, make available to inform or support users. It is either specialized information affecting a very small segment of users or durable documentation with a long working life. In this area, PEMT and a combination of TMs and MT called MTM are used: when no fuzzy or full match is available in a translation memory, a machine translation suggestion is interactively proposed to the translator-editor, who can improve it using the translation memory for concordance.
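
The MTM logic can be sketched in a few lines of Python. This is a minimal illustration only: the tm/mt objects, their methods, and the 75% fuzzy threshold are hypothetical stand-ins, not the API of any actual tool.

    def suggest_translation(source, tm, mt, fuzzy_threshold=0.75):
        """Propose a TM match when one is good enough; otherwise fall
        back to an MT suggestion for the translator-editor to improve.

        tm.best_match() and mt.translate() are hypothetical interfaces
        standing in for a translation-memory lookup and an MT engine.
        """
        match = tm.best_match(source)  # e.g. returns (target, score) or None
        if match and match.score >= fuzzy_threshold:
            return match.target, f"TM match ({match.score:.0%})"
        # No fuzzy or full match: propose raw MT output; the TM stays
        # available for concordance searches during post-editing.
        return mt.translate(source), "MT suggestion"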

Another typical application of MT is in the field of intelligence, with text analytics or text mining, to extract information from text, in search, and with patent applications, where gisting is still a valid option. This usually affects large sporadic volumes of general text.

The main drivers for MT are utility, with readability and accuracy in first place; usability, with an eye to accessibility and problem solving, where gisting is the main interest; and informational intent, where consistency is the primary goal for safety purposes, while style becomes more easily achievable on very large volumes.

Methods

There are currently two approaches to machine translation: rule-based and data-driven (stochastic).

In rule-based machine translation, translation is based on linguistic information about the source and target languages, essentially retrieved from bilingual dictionaries and grammars for each language in the pair. The morphological, syntactic, and semantic analysis of the source text is needed to perform the translation, in a human-like approach typical of transfer systems. In interlingual systems, an intermediate artificial language is used to make adding languages simpler.

In the data-driven approach, translation is performed according to the probability that a string in the target language is the translation of a string in the source language. This probability is derived from the computational analysis of relevant bilingual corpora with an exhaustive search in each corpus.

Both methods can work for projects where MT is suitable. Furthermore, most commercial systems are now hybrids of some sort, using one approach for the main processing and the other for fine-tuning and cleanup:

  • Statistical machine translation smoothing tacked on after rule-based machine translation is completed;
  • Rule-based machine translation applied before and after statistical machine translation to normalize the source text and adjust the output.

Rule-based engines must be tailored to each specific pair of languages, and perform best with similar languages. They also need really good, exhaustive, and constantly updated dictionaries. Linguistic information must be set manually, especially for rule interactions, and often does not generalize to other languages.

Therefore, disambiguation remains a severe issue, as do the creation of new rules and the extension and adaptation of the lexicon. This makes RbMT engines hard to adapt to new domains and makes changes very costly.

Data-driven machine translation includes example-based and statistical machine translation. In example-based machine translation, translation is made by analogy. A bilingual corpus is used as a body of reference for similarities, and a combination of segments is made through best approximation, using algorithms closely resembling fuzzy matching algorithms.
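
A minimal sketch of this “best approximation” idea in Python, using the standard library’s difflib as the fuzzy-matching algorithm; the three-sentence corpus is purely illustrative.

    import difflib

    # Toy bilingual corpus (English -> Italian) standing in for the
    # body of reference described above.
    corpus = {
        "press the start button": "premere il pulsante di avvio",
        "press the stop button": "premere il pulsante di arresto",
        "close the main valve": "chiudere la valvola principale",
    }

    def closest_example(source):
        """Return the most similar corpus sentence, its translation,
        and the fuzzy-match score (the 'best approximation')."""
        best = max(corpus,
                   key=lambda s: difflib.SequenceMatcher(None, source, s).ratio())
        score = difflib.SequenceMatcher(None, source, best).ratio()
        return best, corpus[best], score

    print(closest_example("press the reset button"))
    # -> the closest sentence, its translation, and a score around 0.8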

In statistical machine translation, the analogical approach is integrated with a probabilistic analysis and a model-based validation, using an empirical strategy and the statistical assessment of word and phrase positioning within segments from the corpus. It is a typical brute-force computing application, in which millions of possible ways of putting smaller pieces of text together are explored.

While data-driven machine translation engines do not generally have to be tailored to a specific language pair, corpus creation can be costly and time-consuming, especially for users with limited knowledge and resources. Sentence alignment is crucial, since in parallel corpora a single sentence in one language can be found translated into several sentences in the other, and vice versa. Mistranslations or missed translations can occur due to errors in the reference corpora, to an abundance of different occurrences, or to the absence of occurrences. Also, results are unpredictable, since word order and collocations are pivotal, and an SMT engine can still perform differently according to language.

Data-driven machine translation engines perform best when translation models can be learned automatically from previously translated text. When building a model for a specific domain, data from sources other than the specific brand is then used to train the engine. Two models are commonly used in this respect: a translation model and a target-language model. In the translation model, words and word sequences in the source language are parsed to find the most likely corresponding words in the target language. In the target-language model, the most likely way of combining those corresponding target-language words is found.
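
In the classic noisy-channel formulation, the decoder looks for the target sentence t maximizing P(t|s) ∝ P(s|t) · P(t), where P(s|t) is the translation model and P(t) the target-language model. The toy Python sketch below uses hand-set, purely illustrative probabilities to show how the language model resolves a lexical ambiguity:

    # Illustrative models with hand-set probabilities (not real data).
    translation_model = {          # P(source word | target word)
        ("bank", "banca"): 0.6,    # "bank" as financial institution
        ("bank", "riva"): 0.4,     # "bank" as riverbank
    }
    language_model = {             # P(target phrase)
        "banca del fiume": 0.001,  # "financial bank of the river": implausible
        "riva del fiume": 0.2,     # "riverbank": fluent and frequent
    }

    def score(source_word, target_word, candidate):
        """Noisy-channel score: P(source | target) * P(target)."""
        return (translation_model[(source_word, target_word)]
                * language_model[candidate])

    for head, cand in [("banca", "banca del fiume"), ("riva", "riva del fiume")]:
        print(f"{cand}: {score('bank', head, cand):.4f}")
    # "riva del fiume" wins (0.0800 vs 0.0006): the language model picks
    # the fluent combination even though "banca" is likelier word for word.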

The analysis of parallel corpora is challenging, and it gets harder as corpora grow larger. In statistical machine translation, the primary challenge consists in scoring good translations higher than bad ones. This is why the largest corpus is not necessarily the best.

When translation memories are used instead of corpora, the primary issue is segmentation, for accuracy and matching, with the typical alignment problems.

Translation models handle word sequences. This implies a likelihood of reproducing a wrong interpretation if the model contains a wrong sequence. A sequence can be wrong because of dependencies between words that are hard to capture, or because an ambiguous word has many suitable translations depending on the words surrounding it.

MT SWOT

Speed, volumes, and consistency constitute the strengths in a SWOT analysis for MT, while weaknesses include complexity, error incidence, and the uncommon mix of skills, expertise, and understanding required.

On the other hand, MT offers a series of opportunities to users, especially for the least engaging, highly rewarding, and non-binding content, with many areas for improvement through the training and customization of engines, from building language data and defining and optimizing rules to writing habits, especially through the definition and adoption of controlled languages.

However, the need for skills and expertise, as well as the uncertainty in calculating the return on investment, can be a serious threat for unknowing users.

MT Usage

This is especially true when fully automatic processing of unrestricted texts is expected to yield high-quality output.

This impractical goal is at the foundation of the typical biased attitude towards MT, which claims it is always bad. In fact, this attitude reveals a rather modest knowledge of MT, and goes together with the same old jokes about silly mistakes, in a mixture of ignorance and fear, as if the same mistakes could not be found in human translations.

MT Quality Assessment

To overcome the harshness of human assessment, MT scientists and engineers developed automatic metrics for the practical assessment of MT quality. They are hard to interpret, though, especially when it comes to estimating the PEMT effort. To this end, the use of annotations on a score scale is gaining ground.

The most popular automated metric is BLEU. Scores are calculated by comparing machine translation output with a reference human translation: the higher the score, the closer the machine translation is to a human translation. Intelligibility and grammatical correctness are not taken into account; quality is considered a function of the proximity of the machine’s output to that of a human. NIST, METEOR, and F-Measure are derived from BLEU with some alterations, while WER is a metric measuring the difference between two sequences through approximate string matching; it is mostly used to measure the performance of speech recognition systems and gives no details about the nature of errors.
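
Both ideas are easy to see in code. Below is a deliberately simplified, self-contained sketch: a sentence-level BLEU with a single reference and crude smoothing, and a token-level WER. Production implementations (corpus-level BLEU, multiple references) differ in detail.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, max_n=4):
        """Geometric mean of modified n-gram precisions times a brevity
        penalty; crude smoothing avoids log(0) on short sentences."""
        precisions = []
        for n in range(1, max_n + 1):
            cand = Counter(ngrams(candidate, n))
            ref = Counter(ngrams(reference, n))
            overlap = sum(min(c, ref[g]) for g, c in cand.items())
            precisions.append(max(overlap, 1e-9) / max(sum(cand.values()), 1))
        bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    def wer(candidate, reference):
        """Word error rate: Levenshtein distance over tokens divided by
        reference length. Says how much differs, not what kind of error."""
        prev = list(range(len(candidate) + 1))
        for i, r in enumerate(reference, 1):
            curr = [i]
            for j, c in enumerate(candidate, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != c)))  # substitution
            prev = curr
        return prev[-1] / len(reference)

    mt = "the cat sat on on the mat".split()
    ref = "the cat sat on the mat".split()
    print(f"BLEU ~ {bleu(mt, ref):.2f}, WER ~ {wer(mt, ref):.2f}")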

Post-editing: Levels, Effort, Requirements, Specifications

The emergence of post-editing dates back to the late 1970s, with the implementation of machine translation at some international institutions and some large corporations. Since then, many definitions have been given, mostly focusing on the concept of improving the output through the correction of errors.

Today, three prerequisites are commonly placed on post-editing:

  1. To preserve the benefits deriving from speed and large volumes, post-editing throughput must be significantly higher than human translation throughput;
  2. To increase speed and reduce the likelihood of introducing new errors, post-editing must be less keyboard intensive than translation;
  3. To increase speed by reducing the need for search and validation, and contain costs by optimizing the use of human resources, post-editing must be less cognitively demanding than translation.

The PEMT effort is a function of several variables, first of all the MT method: the output of one method in one language combination cannot be compared with that of another. Performance is also affected by the exhaustiveness and accuracy of dictionaries and the suitability and customizability of rules in rule-based engines, and by the quality and amount of training data in statistical engines. Finally, technologies play their role too: different technologies can be used within more than one tool and produce different output.

What, then, determines the amount of PEMT effort? There is no standard way to measure it. So far, time has been recognized as the most crucial element in measuring PEMT effort, with the amount of editing having the highest impact, especially with regard to speed. This is determined by the amount and types of MT errors, but also by terminology, sentence structure, and punctuation in the source text. A substantial PEMT slowdown can also be caused by pausing and referencing, which are typical effects of the so-called cognitive effort.
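
One common proxy for the amount of editing is the edit distance between the raw MT output and its post-edited version. A minimal sketch, assuming word-level similarity from the standard library’s difflib is an acceptable approximation; the sample segments are invented:

    import difflib

    def edit_effort(raw_mt, post_edited):
        """Share of the raw MT output that had to change:
        0.0 = untouched, 1.0 = fully rewritten."""
        matcher = difflib.SequenceMatcher(None, raw_mt.split(), post_edited.split())
        return 1.0 - matcher.ratio()

    segments = [
        ("the engine translate the text quickly",
         "the engine translates the text quickly"),
        ("press the button red to start",
         "press the red button to start"),
    ]
    for raw, edited in segments:
        print(f"effort {edit_effort(raw, edited):.2f}: {edited}")

Note that such a score captures only the mechanical side of the effort; the time spent pausing and referencing, mentioned above, is invisible to it.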

In any case, regardless of the method, the MT output strongly depends on text suitability and domain restriction.

Errors differ according to method and technology. Rule-based engines typically produce incorrect words/terms, incorrect attachments, and disambiguation errors, while data-driven engines are prone to mechanical errors (capitalization and punctuation), fluency inconsistencies, wrong word order, and missing words.

Today, PEMT is usually restricted to three levels of effort:

  1. Gist, consisting in raw MT with virtually no corrections, for disposable content or the validation of automatic evaluation;
  2. Light, to make the translation understandable: adjusting mechanical errors (capitalization and punctuation), replacing unknown words (often misspelled in the source text), removing redundant words, and ignoring all stylistic issues;
  3. Heavy, to make the translation stylistically appropriate: fixing machine-induced meaning distortions, making grammatical and syntactic adjustments, checking terminology for untranslated terms that are most likely new terms, and partially or completely rewriting sentences for target-language fluency.

The level of PEMT effort to apply depends on user requirements, quality expectations, content perishability, content volume, text function and turn-around time.

To be worthwhile, a PEMT job should have the following features:

  • Source files should have been written or edited for machine translation, in a very plain and consistent language, with short and grammatically correct sentences, and straightforward word order;
  • A large, comprehensive, and established glossary should have been used;
  • Source files should contain no typos;
  • The MT engine should have been trained extensively and properly.

Thus, the best advice to translators and LSPs willing to take on a PEMT job is to stay away from projects lacking these features, to avoid being paid for iron while providing gold.

Therefore, a set of specifications must be defined. When a PEMT project comes to a translator or LSP, nothing can be done about the source text or the initial requirements. Any action that can be taken must be taken upstream, based on the relationship with the customer, starting by asking questions for guidance. These questions must investigate the MT method, whether the MT was run in house or outsourced, and the type of MT output: whether it is generic, from an untrained engine, or comes from a domain-restricted source and a trained engine. Quality guidelines should be asked for, together with assessment scores for the raw translation to use as acceptance thresholds.

The rationale for MT should be investigated to adjust the effort for increased throughput, faster turnaround time, reduced cost, or accuracy and consistency. In fact, the quality of the finished product will vary depending on the end users and their expectations. Finally, the PEMT effort will be different depending on whether the MT output is to be reprocessed or published.

In many cases, experienced MT users will be able to indicate the amount and type of PEMT, and ask for participation in the training of the engine.

In any case, it should never be forgotten that MT engines are not all equal, that raw output quality is not consistent from system to system or from language to language, and that MT error patterns are not consistent from segment to segment.

Post-editing Instructions

To prevent post-editors from being left to themselves, and thus from taking arbitrary decisions that undermine the achievement of MT goals, they must be given specific instructions. These instructions must be clear and concise, and direct post-editors on the amount of raw translation to retain, the extent of style edits, the amount and extent of checks, and the types of errors that can be found in the output. They should also cover tools and style, especially as to language conventions.

Similarly, to improve productivity, many ancillary tools can be provided, such as macros for global search & replace, to replace strings (possibly including format characters) or to deal with any repetitive actions.
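
Such macros can be as simple as a list of regular-expression rules applied to every segment. A hypothetical Python sketch; the three rules are invented examples, not a recommended set:

    import re

    # Each rule fixes one recurring, mechanical MT error.
    rules = [
        (re.compile(r"\bdatas\b"), "data"),          # recurring wrong plural
        (re.compile(r"\s+([,.;:!?])"), r"\1"),       # space before punctuation
        (re.compile(r"(\d+)\s*°\s*C\b"), r"\1 °C"),  # normalize unit spacing
    ]

    def apply_macros(segment):
        """Run every search & replace rule over one target segment."""
        for pattern, replacement in rules:
            segment = pattern.sub(replacement, segment)
        return segment

    print(apply_macros("The datas show 20 ° C ."))
    # -> "The data show 20 °C."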

Post-editors must know what to expect and how the errors they may encounter differ from those in a traditional translation. If the output is statistical, they should know not to be swayed by fluid sentences, and to be extra vigilant against a missed word that changes the meaning entirely. With rule-based machine translation, post-editors will ideally work in an environment where they can see which terms came directly from the glossary, so they will not need to check them; but they will need to spend extra time on sentence structure.

Information should also be provided about the expected quality level in the final product and the overall throughput. For light post-editing, this could be as high as 20,000 words per day.

The Ideal Post-editor

The ideal post-editor should have a working knowledge of the source language, an excellent command of the target language, and a specialized domain knowledge. Most importantly, the ideal post-editor should be able to comply with guidelines (from style guides to PEMT instructions) and show an unbiased attitude towards MT.

Post-editors can be monolingual or bilingual. Monolingual post-editors are experts in the domain, but not bilingual, and their ability is typically restricted to light PEMT of knowledge bases, support articles, user assistance information, technical documentation, instructions for use, etc. It is the same kind of editing accomplished in in-country reviews. Bilingual post-editors are typically professional translators with the language skills to understand the source language and write properly in the target language, with the domain knowledge to understand the content being translated, and trained to understand MT issues and fine-tune the MT engine.

Although post-editors do not necessarily need to undergo ad-hoc training, setting them to work with no training at all is a serious mistake, since post-editing is not the same as revision or editing, and MT output is definitely different from translation memory fuzzy matches.

Compensation of Post-editing Jobs

It will take a year or two more to build out a widely accepted and dominant compensation model. The final model will most probably be tied to productivity, as productivity is the metric that relates directly to profit margin.

Measuring productivity provides LSPs and post-editors with a simple means to determine a fair rate for MT post-editing.

So far, a common approach consists in paying PEMT at the same rate as high fuzzy matches, but PEMT and the post-editing of fuzzy matches are deeply different. Fuzzy matches are inherently correct segments requiring minor changes (possibly a term or two). MT output is not necessarily correct, and even light PEMT can eventually turn out heavy.

Therefore, the best approach seems to be paying a time-based fee for a PEMT effort determined by the amount of editing. On the other hand, if PEMT can be three times faster than human-only translation, there is justification for reducing compensation by 33%.

In any case, a fair compensation can be established based on the productivity gain and the reduced effort required to deliver the same quality of output. Since both can vary according to domain and language combination, compensation must be agreed on the basis of throughput rates calculated through a pilot project.
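
As a worked illustration of the arithmetic involved: assume the per-word PEMT rate is derived from the translation rate by splitting the measured productivity gain between client and post-editor. The 50/50 split and all figures below are assumptions for the example, not an industry standard:

    def pemt_rate(base_rate, base_speed, pemt_speed, shared_gain=0.5):
        """Derive a per-word PEMT rate from a per-word translation rate.

        base_speed, pemt_speed: words per hour measured in the pilot.
        shared_gain: fraction of the time saved that is passed on to
        the client; the rest stays with the post-editor as an incentive.
        """
        gain = max(pemt_speed / base_speed - 1.0, 0.0)  # 2.0 means 3x faster
        discount = shared_gain * gain / (1.0 + gain)    # fraction of base rate
        return base_rate * (1.0 - discount)

    # 0.10/word for translation; the pilot shows 400 w/h translating
    # vs 1,200 w/h post-editing (3x faster): the discount works out
    # to the 33% mentioned above, i.e. about 0.067/word.
    print(f"{pemt_rate(0.10, 400, 1200):.3f}")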

For any pilot project, rely on translators with previous PEMT experience or who are openly interested in working with PEMT. Also, before establishing production deadlines and compensation schemes, make sure post-editors are well aware of the scope and issues of each assignment by letting them gauge their own throughput capacity.

At the end of the pilot project, a compensation grid can be drawn up that takes into consideration the MT method, the type and quality of output, quality expectations, the PEMT type, and the relevant technical aspects.

Post-editors are essential for quality MT. When proper requirements are defined, far-reaching instructions are given, ad-hoc training is provided, and expectations are correctly set and communicated, PEMT will no longer be considered a minor, trivial, demeaning task.


Many thanks to Ana Guerberof Arenas, Kirti Vashee, and Jost Zetzsche for the invaluable contributions in their writings on the subject

See also the presentation on SlideShare