Content translation/Language anomalies

Note that this is only a proposal, not something that is scheduled for implementation.

Language anomalies can be detected by using a recurrent neural net with word embeddings. This is a fairly straightforward task, and it can be used to detect strange language constructs. If the checked text comes from a machine translation engine or Content translation, the generated text can contain language anomalies, and these can then be detected. Examples of language anomalies are lack of agreement or wrong gender.

Figure: A simplistic model of a recurrent neural network for estimating language anomalies. The example sequence should give a rather low anomaly estimate if there are no errors, i.e. there is a high likelihood that the sequence is correct. The intentional error "form" will trigger a high likelihood of an anomaly.

Algorithm

Assume an ordinary setup of a recurrent neural network, possibly a bidirectional one, and most likely with gated recurrent units (GRUs), where an unknown word is guessed. This is the grammar model: a language model without explicit words, in which all words are represented as word embeddings. The sequence comes from the known wikitext to be checked. Even if the "unknown word" is really known to the network, the network still tries to guess its value given the grammar model. The guessed value is given as an embedding in the word space.
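A minimal sketch of such a grammar model, assuming PyTorch; the class and parameter names are illustrative, not part of the proposal. The word under test is masked out of the input sequence, and a bidirectional GRU guesses its embedding from the surrounding context:

    import torch
    import torch.nn as nn

    class GrammarModel(nn.Module):
        """Sketch of the grammar model: a bidirectional GRU that guesses
        the embedding of a masked word from its context."""

        def __init__(self, embedding_dim=300, hidden_dim=256):
            super().__init__()
            self.gru = nn.GRU(embedding_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
            # Project the concatenated forward/backward states back
            # into the word embedding space.
            self.project = nn.Linear(2 * hidden_dim, embedding_dim)

        def forward(self, context_embeddings, masked_index):
            # context_embeddings: (batch, seq_len, embedding_dim), with
            # the word under test zeroed out at masked_index.
            outputs, _ = self.gru(context_embeddings)
            # The hidden state at the masked position sees both directions.
            hidden_at_mask = outputs[:, masked_index, :]
            return self.project(hidden_at_mask)  # the guessed embedding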

Now take the "unknown word" (which is really known) and find its representation as a word embedding inside the word space. This word space is the dictionary model. If the dictionary embedding for the unknown word is close to the word embedding estimated by the grammar model, then the word is most likely correct. If the word is far from the estimated word embedding, then the word may be faulty, but the grammar model may also be wrong. The distance of the known word from the estimated one is an anomaly estimate for that specific word given the context.
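The proposal does not fix a particular distance measure, so as a sketch, the anomaly estimate could be the cosine distance between the guessed embedding and the dictionary embedding:

    import torch.nn.functional as F

    def anomaly_estimate(guessed_embedding, dictionary_embedding):
        # Cosine distance between the grammar model's guess and the
        # dictionary embedding of the word actually found in the text.
        similarity = F.cosine_similarity(guessed_embedding,
                                         dictionary_embedding, dim=-1)
        return 1.0 - similarity  # near 0: word fits; larger: anomaly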

It is possible to learn a better classifier by using contrastive learning, that is, training on known erroneous sequences, and thus learning a grammar space with a multimodal distribution. The example sequence in the figure uses the word "bird", but a fairly superficial check makes it clear that this specific embedding is just one mode of several. Replacing the rather simple distance measure with a better model, such as another neural network or a beam search, can lower the false positive rate considerably.
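A minimal sketch of such a contrastive setup, assuming training pairs where the same position holds the correct word and a known-erroneous word; the particular margin loss below is an assumption, not part of the proposal. It pulls the grammar model's guess toward the correct embedding and pushes it away from the erroneous one:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(guessed, correct_emb, erroneous_emb, margin=0.5):
        # The guess should be closer to the correct word's embedding
        # than to the embedding of the known error, by at least `margin`.
        pos = 1.0 - F.cosine_similarity(guessed, correct_emb, dim=-1)
        neg = 1.0 - F.cosine_similarity(guessed, erroneous_emb, dim=-1)
        return torch.clamp(pos - neg + margin, min=0.0).mean()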

Additional details

The current ORES framework creates a basic environment for such a tool; even if the output from the estimator does not give an overall quality score, it identifies locations within a text and gives estimates for those locations. The tool would simply emit a JSON-formatted report that can be used for decorating text within VisualEditor, such that words that might be anomalies get a colored curly underline, like when an editor is using a spellchecker in the browser. It would then be the editor's own choice whether the text should be changed. In Content translation's editor this would heavily colorize weird constructs, making the editor aware that the text most likely needs further editing.
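A hypothetical shape for such a report; none of the field names below come from an existing ORES schema, they only illustrate how a client like VisualEditor could find the spans to underline. "start" and "end" are character offsets into the wikitext, and "estimate" is the anomaly estimate for the word at that span:

    {
      "anomalies": [
        {"start": 118, "end": 122, "word": "form", "estimate": 0.87}
      ]
    }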

The output from an estimator for language anomalies like this can't be used as a quality measure for some kind of "correct translation"; it is a detector for weird or unlikely constructs. It might be correct to write that a horse could perform a senator's duties, even if the estimator would balk at the reference.

This kind of anomaly detection cannot be used for strongly agglutinative languages without some changes: instead of identifying an unknown word, an unknown fragment inside a word must be identified. Detecting correct fragment boundaries is difficult, and if they are set wrongly there will be false positives.
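As one possible adaptation, and purely as an assumption beyond the proposal, words could first be split into known fragments with a greedy longest-match segmenter, and the fragments scored instead of whole words; the vocabulary here is a toy example:

    def segment(word, vocabulary):
        # Greedy longest-match segmentation into known fragments.
        # Misplaced boundaries here are exactly what would cause the
        # false positives mentioned above.
        fragments, start = [], 0
        while start < len(word):
            for end in range(len(word), start, -1):  # longest match first
                if word[start:end] in vocabulary:
                    fragments.append(word[start:end])
                    start = end
                    break
            else:
                fragments.append(word[start])  # unknown character on its own
                start += 1
        return fragments

    # Finnish "talossanikin" ("in my house, too"):
    # segment("talossanikin", {"talo", "ssa", "ni", "kin"})
    # -> ["talo", "ssa", "ni", "kin"]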