Content translation/Translation tools

Content translation is a Computer Assisted Translation (CAT) tool intended to make translating Wikipedia articles easier and better, and to collect a corpus of parallel text. It complements the existing Translate extension.

CAT tools generally work by breaking text into segments (individual sentences, titles, image captions etc), then applying to each segment a number of language components, including:

  • Machine translation (via external systems: Moses, Apertium, proprietary providers etc)
  • Translation memory (databases of previously translated segments)
  • Bilingual dictionaries and glossaries (internally and via external providers, including Wikidata)
  • Link conversion
  • Language proofing (spell checking, grammar checking, style checking etc)

Note that both machine translation and translation memory yield a candidate segment: a suggested translation which the translator can use and, if necessary, correct. Conceptually, machine translation provides an unreliable translation of the right segment, whereas translation memory provides a reliable translation of the wrong segment. An ideal candidate segment requires few edits to reach an acceptable translation.

Usefulness

The relative usefulness of different features depends on the translator's strength in the source and target languages:

  • A translator who knows both languages well will find a low-quality candidate segment worse than starting from scratch. Everyday vocabulary aids will be unnecessary and distracting; only technical terms will be useful.
  • A learner of the source language will find dictionaries useful, but needs less help expressing the translation.
  • A learner of the target language will appreciate even lower-quality candidate text, as it will contain useful fragments of text to emulate.
  • Typing speed matters. In particular, if the target segment is hard to input (e.g. obscure Chinese characters for many users), candidate text is relatively more useful.

Usefulness also depends on the quality of the available tools for the language pair in question:

  • In the best case, effective machine translation may be close to correct already. The translation process is then more like proofreading.
  • Low-quality machine translation (e.g. between less well-resourced language pairs) may be worse than useless. Translation memory suggestions may require fewer edits.
  • Dictionary suggestions may not work well unless at least approximate stemming is available in the source language.

Also important is language register: the level of formality in the text. In many languages, there is a much greater difference between formal and informal styles than seen in English.

  • General-purpose machine translation may output the wrong target language register.
  • External, general-purpose translation memory text may use the wrong target language register.
  • Changing language register often requires many edits, if many noun endings, verb inflections or vocabulary substitutions are required. Fuzzier translation memory text with the right target language register may actually require fewer edits.

Finally, the nature of the translation has big consequences for usefulness.

  • Candidate text is most useful for strict translation, where the source and target text are identical in meaning.
  • It is much less useful for free re-authoring, where the target text is more loosely based on the source text.
  • Parallel segments and even parallel paragraphs are not very helpful for writing a shortened summary article (e.g. when translating from a large Wikipedia to a small one).

In summary, different use cases need different things. By not over-generalising from the requirements of a few major language pairs, we ensure the system is as useful as possible to long-tail languages.

Machine translation (MT)

An MT system takes source text in one language, and outputs target text in another language. The output is generally not perfect – it is candidate text for a human translator to correct. MT quality depends on many factors, including: the language pair, subject domain, register, translation engine and source text quality. For content translation, the best measure of text quality is not whether the meaning is correct but whether few edits are required to fix it.
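As a rough illustration of the "few edits" measure, the sketch below counts word-level insertions, deletions and substitutions between a candidate segment and the translator's final text. The function name `word_edit_distance` is hypothetical and not part of content translation; a word-level Levenshtein distance is only one possible proxy for translator effort, and it does not model the fact that one large replacement is cheaper than several scattered ones.

```python
def word_edit_distance(candidate: str, final: str) -> int:
    """Count word-level insertions, deletions and substitutions
    needed to turn `candidate` into `final` (Levenshtein distance)."""
    a, b = candidate.split(), final.split()
    # previous[j] is the distance between the words seen so far in `a`
    # and the first j words of `b`.
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(
                previous[j] + 1,         # delete word_a
                current[j - 1] + 1,      # insert word_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]
```

Under this measure, a candidate that is missing a single "not" scores 1 even though its meaning is inverted, which matches the idea that edit count matters more than correctness of meaning.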

MT is asymmetrical: translating English to Chinese raises different challenges from translating Chinese to English. In general, better MT is available when there is a large parallel corpus, and where the languages are typologically similar (sharing word order, complexity of word endings etc).

Subject domain and language register affect translation quality. An MT system trained with/for one type of text may not perform well with another type. In many languages, the difference between formal and informal registers is much greater than in English. Therefore, a general-purpose translation engine may not produce good results for encyclopaedia article text. This is especially true because we need few edits, whereas register differences often change many words a little bit (e.g. suffixes on many nouns or verbs, or scattered vocabulary changes).

Google, Microsoft and Yandex all have online translation systems covering larger languages, with pay-to-use API access. Translation often uses English as an intermediate language, i.e. two-step machine translation (even between similar, well-resourced languages such as German and Dutch).

Moses is an open-source Statistical Machine Translation system and Apertium is an open-source Rule-based Machine Translation system. The barrier to creating custom models/rulesets for new languages, registers and domains is relatively low. Content translation is designed to support such 'small engines', as they may be the best answer for many language pairs and subject domains.

MT engines may not translate instantaneously; a content translation design goal is supporting a translation speed of a few seconds per sentence, without affecting the user responsiveness, via prefetching and caching.
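One way to hide a few seconds of engine latency is to request translations for upcoming segments in the background and cache the results. The sketch below is a minimal illustration of that idea, not the actual content translation implementation: `PrefetchingTranslator` and the `engine` callable are hypothetical names, and the class assumes it is driven from a single caller thread.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class PrefetchingTranslator:
    """Caches MT results and requests upcoming segments in the background."""

    def __init__(self, engine: Callable[[str], str], max_workers: int = 4):
        self._engine = engine  # hypothetical: any slow "source -> target" callable
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._cache: Dict[str, Future] = {}

    def prefetch(self, segments: List[str]) -> None:
        """Start translating segments that have not been requested yet."""
        for segment in segments:
            if segment not in self._cache:
                self._cache[segment] = self._pool.submit(self._engine, segment)

    def translate(self, segment: str) -> str:
        """Return the translation, waiting only if it is not ready yet."""
        self.prefetch([segment])
        return self._cache[segment].result()

    def close(self) -> None:
        self._pool.shutdown()
```

A caller could prefetch the next few segments whenever the translator moves to a new one, so that `translate` usually returns immediately.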

Translation memory (TM)

A translation memory (TM) is a database of parallel pairs of segments which have already been translated by a human. When translating a segment, the TM can be searched for exact matches (where the same source segment has been translated before), and for fuzzy matches (where a similar source segment has been translated before). Fuzzy matches happen because similar sentence patterns arise repeatedly; e.g. "In geometry, a pentagon is any five-sided polygon", "In geometry, a hexagon is any six-sided polygon", etc. The translator uses the target segment of the fuzzy match as candidate text, and modifies it so that its meaning matches the new source segment.
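Fuzzy matching can be sketched with Python's standard `difflib` module, comparing segments word by word. This is only an illustration under simple assumptions: the function name `best_tm_match`, the in-memory list of pairs and the 0.7 threshold are all hypothetical, and a real TM would use an index rather than a linear scan.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def best_tm_match(
    source: str,
    memory: List[Tuple[str, str]],
    threshold: float = 0.7,
) -> Optional[Tuple[float, str, str]]:
    """Return (score, tm_source, tm_target) for the most similar stored
    source segment, or None if nothing reaches the threshold."""
    query = source.lower().split()
    best = None
    for tm_source, tm_target in memory:
        # ratio() is 1.0 for identical word sequences and 0.0 for disjoint ones.
        score = SequenceMatcher(None, query, tm_source.lower().split()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, tm_source, tm_target)
    return best
```

With the pentagon sentence stored in the memory, a query for the hexagon sentence shares six of its eight words in order, so it scores 0.75 and the stored target segment is offered as candidate text.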

A project can use a single TM for all types of material, or many TMs organised according to subject domain, language register etc. TMs are weakly asymmetrical (e.g. English->German translations tend to be slightly different from German->English translations), but it is common to search segments translated in either direction.

Effective TM search generally requires word segmentation (and in many cases approximate stemming) in the source language. Other possible search techniques involve fuzzy word matching (e.g. based on edit distance), machine learning or more complete morphological analysis. TMs are most effective when a consistent sentence segmentation algorithm is used for all the source segments.

External TM APIs, like MyMemory, could be used in conjunction with internal TMs. TMs can be plaintext-only or can support markup.

Bilingual dictionaries and glossaries

A bilingual dictionary contains translations for individual words and phrases, generally with other information such as part-of-speech (e.g. verb, feminine noun) and disambiguators to distinguish between different meanings.

A bilingual glossary is a similar resource, but with less information for each entry. Glossaries are made for the purposes of computer-assisted translation, and are often subject-specific, created quickly, and less comprehensive, standardized and information-rich than dictionaries.

Effective glossary and dictionary search requires word segmentation, and in many cases approximate stemming, in the source language. Other possible search techniques involve fuzzy word matching (e.g. based on edit distance), machine learning or more complete morphological analysis.
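The following sketch shows how approximate stemming might feed a glossary lookup. It assumes English plural nouns only; the rule table `PLURAL_RULES` and the functions `candidate_stems` and `lookup` are illustrative names, not an existing API. The stemmer deliberately over-generates, and the glossary lookup filters out candidates that are not real entries.

```python
from typing import Dict, Set

# (suffix, replacements): illustrative rules for English plural nouns only.
PLURAL_RULES = [
    ("ves", ("fe", "f")),
    ("ies", ("y",)),
    ("es", ("", "e")),
    ("s", ("",)),
]


def candidate_stems(word: str) -> Set[str]:
    """Return a set of possible stems; some may not be real words."""
    word = word.lower()
    stems = {word}
    for suffix, replacements in PLURAL_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            base = word[: -len(suffix)]
            stems.update(base + replacement for replacement in replacements)
    return stems


def lookup(word: str, glossary: Dict[str, str]) -> Dict[str, str]:
    """Return glossary entries whose headword is a candidate stem of `word`."""
    return {
        stem: glossary[stem]
        for stem in candidate_stems(word)
        if stem in glossary
    }
```

For 'knives', the candidate set includes the correct stem 'knife' alongside non-words such as 'knive'; only 'knife' would be found in a glossary, so the spurious candidates cost little.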

Link conversion

Article links within source text can be automatically localized to a corresponding article in the target Wikipedia. For example, [[Mol]] in German can be localized to [[Mole_(unit)]] in English.
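A minimal sketch of link conversion over wikitext follows. It assumes a `title_map` dictionary from source titles to target titles is already available (where that mapping comes from is outside this sketch), and the names `LINK_PATTERN` and `convert_links` are hypothetical. Links with no known target are left unchanged, and link labels are not translated.

```python
import re
from typing import Dict

# Matches [[Title]] and [[Title|label]]; nested brackets are not handled.
LINK_PATTERN = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def convert_links(wikitext: str, title_map: Dict[str, str]) -> str:
    """Rewrite wiki links to point at the corresponding target-wiki titles."""

    def replace(match):
        source_title, label = match.group(1), match.group(2)
        target_title = title_map.get(source_title.strip())
        if target_title is None:
            return match.group(0)  # no known target: keep the link as is
        if label is None:
            return "[[" + target_title + "]]"
        return "[[" + target_title + "|" + label + "]]"

    return LINK_PATTERN.sub(replace, wikitext)
```

For example, with `{"Mol": "Mole_(unit)"}` as the map, `[[Mol]]` becomes `[[Mole_(unit)]]`.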

Language proofing

Spellcheckers (such as Hunspell) and grammar checkers (such as LanguageTool Grammar Check) perform checks on the translator's target text. They are usually highly language-specific, and their usefulness varies greatly between languages.

Grammar checking can be particularly useful in conjunction with statistical machine translation, whose output often contains chunks of good text interspersed with 'obvious' grammatical errors (e.g. failures to inflect adjectives according to the gender of the nouns that they modify).

Style checking ensures consistency of language features (e.g. so that "I'm" and "I am" are not mixed arbitrarily in the same document). This is far more important in certain languages where the language register differences are much wider than in English.
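A very simple consistency check of this kind could be sketched as follows. The pair list `VARIANT_PAIRS` and the function `mixed_variants` are illustrative assumptions; a real style checker would need language-specific rules rather than a fixed list of English contractions.

```python
import re
from typing import List, Sequence, Tuple

# Illustrative pairs of interchangeable forms that should not be mixed.
VARIANT_PAIRS = [("I'm", "I am"), ("don't", "do not"), ("it's", "it is")]


def mixed_variants(
    text: str,
    pairs: Sequence[Tuple[str, str]] = VARIANT_PAIRS,
) -> List[Tuple[str, str]]:
    """Return the pairs for which both forms occur in the same text."""

    def occurs(phrase: str) -> bool:
        # Match the phrase only when it is not part of a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None

    return [pair for pair in pairs if occurs(pair[0]) and occurs(pair[1])]
```

The function only reports that two forms are mixed; choosing which form suits the target register is left to the translator.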

Glossary

Candidate segment
Suggested text which acts as a starting point for a human's translation. A candidate segment is most useful if it takes few edits to make it acceptable.
Computer-assisted translation (CAT) tool
Software that assists a human in translating text. CAT tools may incorporate: sentence segmentation, translation memories, bilingual glossaries, machine translation engines, language proofing tools, translation workflow management, etc.
There are many CAT tools for translating documents, web content and software interfaces. Open-source examples include MediaWiki's Translate extension, Pootle and OmegaT. Proprietary examples include Wordfast, SDL Trados, Déjà Vu and Google Translator Toolkit.
A common rule of thumb is that CAT tools can speed up the translation process by 40% with no loss of quality; however, this depends greatly on the particular situation.
Domain
The subject matter of some text. This can affect the likely meaning of words; e.g. 'mole' may refer to an animal, a unit of measurement or a skin blemish.
Few edits
A good candidate segment takes few edits to make it acceptable. Edit distance, rather than correctness of meaning or grammar, is the essential measure of good candidate texts. A translator can replace one big chunk of text faster than several small chunks, so a candidate segment is less useful if several scattered changes are required.
For example, a segment with a missing 'not' may be a good candidate: the meaning may be drastically wrong, but a human can correct it with a one-word edit. Conversely, a segment with the wrong capitalization for the context may have correct meaning and grammar, but require several edits to fix.
Language register
The variations in language for different purposes and social environments. In many languages, the formal and informal registers are very different.
Machine translation, rule-based
Rule-based Machine Translation (RBMT) uses an extensive set of linguistic rules to analyse the grammatical structure of source text and transform it into target text.
Machine translation, statistical
Statistical machine translation (SMT) uses little or no linguistic knowledge about the source and target languages. Instead, automated analyses of large parallel corpora (say 500,000+ sentences) generate a mathematical model, which can then be applied to translate new sentences.
Sentence segmentation
Splitting of text into individual sentences, titles, captions etc. Sentence segmentation is weakly language-specific; a single straightforward mechanism works reasonably well for many languages (see Unicode UAX29), but can be improved on with language-specific rules.
Stemming
Finding the root form of words with grammatical inflections. E.g. swim/swims/swimming/swam/swum → swim. In some languages, a single root form can give rise to hundreds of inflected forms, so without stemming, text search is not very useful. Note that some forms can arise from more than one stem; e.g. in English the form 'spoke' has two possible stems, 'speak' (a verb) and 'spoke' (a noun, meaning a part of a wheel).
For content translation purposes, it is usually possible to use approximate stemming.
Stemming, approximate
Approximate stemming is similar to stemming, but less precise and easier to implement. For a given word, approximate stemming returns a set of possible candidate stems. With high probability, the set contains the correct stem; it may also contain incorrect stems. This is useful if no true stemmer is available, in contexts where false positives (spurious matches) can be tolerated.
Approximate stemming can often be implemented with simpler rules, and less lexical information, than true stemming. For example in English, approximate stemming of plural nouns can be performed with simple substitution/blacklisting. Applying an approximate stemming algorithm to the English word 'knives' might result in the set {'knife', 'knive'}, containing the correct form 'knife' and an incorrect guess, 'knive'.
As this example shows, the forms in the result set are not always correct, and often are not real words. This is tolerable for searching dictionaries, glossaries and translation memories, because the translator can easily recognise and ignore the occasional false positive suggestion.
Translation workflow management
Tools to manage the process of creating a translation. These may include text analysis, time estimation, task allocation, progress tracking, proofreader management, version management and source text update tracking.
Word segmentation
Splitting a text into individual words. This is relatively simple in non-agglutinative languages that use spaces between words, but highly complex in other cases (e.g. Chinese). Note that words are not always the basic unit of text; for example, effective glossary and dictionary search must support multiword expressions.
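Support for multiword expressions in glossary search can be sketched as a greedy longest-match scan. The example assumes a space-delimited language, lower-cased glossary keys and no punctuation handling; the function name `find_glossary_terms` and the `max_words` limit are hypothetical. For languages such as Chinese, the whitespace split would have to be replaced by a proper word segmenter.

```python
from typing import Dict, List, Tuple


def find_glossary_terms(
    text: str,
    glossary: Dict[str, str],
    max_words: int = 4,
) -> List[Tuple[str, str]]:
    """Find glossary entries in `text`, preferring the longest phrase
    that starts at each position."""
    words = text.lower().split()
    matches = []
    i = 0
    while i < len(words):
        # Try the longest phrase first, then progressively shorter ones.
        for length in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + length])
            if phrase in glossary:
                matches.append((phrase, glossary[phrase]))
                i += length
                break
        else:
            i += 1  # no entry starts at this word
    return matches
```

Because longer phrases are tried first, "machine translation" is reported as one term instead of as two separate entries for "machine" and "translation".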