Content translation/Product Definition/LinearDoc

In HTML, inline text annotations are represented in a tree structure. This structure is inconvenient for certain string operations, including taking substrings, reordering and performing plaintext-based searches and pattern matches. These operations are important for Content Translation, particularly when performing sentence segmentation and machine translation. LinearDoc is an alternative representation, where the tree structure has been flattened. A whole block of inline marked-up text is stored as a single array of text chunks. Each chunk contains some plaintext, and an array of the inline HTML tags which apply to it. It is then easy to take substrings (by slicing the array), and to perform searches and pattern matches (by matching on the plaintext, then applying the resulting search indexes to the marked-up text).

Representation of rich text in HTML


Here is an example Wikipedia-like chunk of rich text:

<a href="Conway">Conway</a> stated that young <a href="children">children</a>
<i>“understand <a href="Object_permanence">object permanence</a>.
<a href="Concealment">Concealed</a> <a href="Object">objects</a> feature in
their awareness.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span>
<b>(<a href="Nielsen">Nielsen</a> equivalence).</b>

A simplified representation of the DOM tree would be something like this:

elementNode 'a'
    textNode 'Conway'
textNode ' stated that young '
elementNode 'a'
    textNode 'children'
elementNode 'i'
    textNode '“understand '
    elementNode 'a'
        textNode 'object permanence'
    textNode '. '
    elementNode 'a'
        textNode 'Concealed'
    textNode ' '
    elementNode 'a'
        textNode 'objects'
    textNode 'feature in their awareness.”'
elementNode 'span'
    elementNode 'a'
        textNode '[5]'
textNode ' '
elementNode 'b'
    textNode '('
    elementNode 'a'
        textNode 'Nielsen
    textNode ' equivalence).'

There are a number of operations which are challenging to perform on either the serialised HTML or on the DOM tree, including:

  • Performing plaintext based searches and pattern matches
  • Taking substrings
  • Reordering text



Hard to detect sentence boundaries


The example text above contains three sentences (note that trailing whitespace is included in a sentence):

  1. '<a href="Conway">Conway</a> stated that young <a href="children">children</a> <i>“understand <a href="Object_permanence">object permanence</a>. '
  2. '<a href="Concealment">Concealed</a> <a href="Object">objects</a> feature in their awareness.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span> '
  3. '<b>(<a href="Nielsen">Nielsen</a> equivalence).</b>'

Unfortunately, it would be difficult to implement a sentence boundary detection algorithm that worked either on the HTML representation or the DOM tree:

  • For the HTML representation, complications include inline tag text (such as <span...> and <i>) and semi-irrelevant text (like the [5] marking a reference).
  • For the DOM representation, regex-like pattern matching across the text of several nodes is difficult.

Since the rules for sentence boundary detection are different for each language, hundreds of pieces of complex code would be needed to do this correctly.

Cannot manipulate sentences


In many sentences, like the first two in the example above, the start and end offset nodes are not siblings. This can happen in two ways:

  • "... Foo <i>bar. Baz...</i>" (One end of the sentence is contained in an inline node split by the sentence).
  • "... Foo </i> bar <i>baz. Foo..." (Both ends of the sentence are contained in an inline node split by the sentence).

To manipulate the sentences individually (e.g. to highlight one sentence, or to find all links in the current sentence), we would like to wrap each sentence in a span tag, or consider it as an independent HTML fragment, or equivalent. This is impossible without complex tag rebalancing (for the HTML representation) or involved tree surgery (for the DOM tree).

Difficult to translate text


For the same reasons, it is non-trivial to reorder words and phrases in an inline HTML tree whilst preserving annotation equivalence and structural validity. Machine translation typically changes the word order, so this is a problem.

The LinearDoc representation


The inline HTML shown above can be transformed into the following data structure:

	{text:'Conway', tags:['<a href="Conway">']},
	{text:' stated that young ', tags:[]},
	{text:'children', tags:['<a href="children">']},
	{text:' ', tags:[]},
	{text:'“understand ', tags:['<i>']},
	{text:'object permanence', tags:['<i>', '<a href="Object_permanence">']},
	{text:'. ', tags:['<i>']},
	{text:'Concealed', tags:['<i>', '<a href="Concealment">']},
	{text:' ', tags:['<i>']},
	{text:'objects', tags:['<i>', '<a href="Object">']},
	{text:' feature in their awareness.”, tags:['<i>']},
	{text:'', tags:[], inline:'<span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span>'},
	{text:' (', tags:['<b>']},
	{text:'Nielsen', tags:['<b>', '<a href="Nielsen">']},
	{text:' equivalence).', tags:['<b>']}

Unlike the HTML representation, there is no tree structure, only a flat array of text chunks. Each text chunk contains a piece of plaintext, together with the full list of tags that apply to it. This means that a range of italics, say, is not 'opened' and 'closed': instead the <i> tag is repeated next to each piece of plaintext to which it applies. The reference span (the span with typeof="mw:Extension/ref") does not appear in the plaintext at all; rather it is an extra annotation that applies to an empty string.

It is clear that the HTML representation can be recovered from the representation above. LinearDoc objects are essentially the same as this, but with helper methods and with enhancements to deal with certain minor limitations.



An important property of the LinearDoc is that any slice of the array structure is also a valid LinearDoc. This makes it easy to take substrings. (An individual chunk of text in the array can be split easily, by turning it into two consecutive chunks with the same annotations).

Also, concatenating two LinearDocs results in a valid LinearDoc. This, together with the substring property, means that text can be reordered easily whilst preserving annotations.

Finally, annotations are stored separately from the flow of plaintext. Therefore, plaintext offsets are trivially easy to map onto the LinearDoc. This means matching and searching can be applied to the plaintext, and the resulting offsets can then be applied to the annotated version.

Machine translation


LinearDoc works well with any machine translation engine that can translate plaintext and output text reordering information. The process is as follows:

  1. The source-language plaintext of the LinearDoc is sent to the MT engine.
  2. The MT engine gives equivalent target-language plaintext, together with text reordering information.
  3. LinearDoc uses the text reordering information to map text annotations from the source document to the target document.

This is important, because not all MT engines can translate HTML markup successfully, and many practical applications (including wiki pages) have specific markup requirements that a generic MT engine may not accommodate.

Some MT engines, such as Moses and Bing Translator, can provide text reordering information directly. For other engines, such as Apertium, it is sometimes possible to derive the reordering information indirectly, e.g. by modifying the case of source text and observing corresponding changes in the translation:

Big white dog -> Ci mawr gwyn
BIG white dog -> Ci MAWR gwyn
Big WHITE dog -> Ci mawr GWYN
Big white DOG -> CI mawr gwyn

mapping = [ 2, 0, 1 ]

Where reordering information is missing or incomplete, LinearDoc will fall back gracefully to plaintext formatting on a token-by-token basis. This is cleaner than many MT engines, which can output unbalanced HTML tags in certain circumstances.

Similar data structures


LinearDoc was inspired by the document model of VisualEditor, which uses a similar linear structure to allow diff storage and text splicing.

The GramTrans MT system uses a similar approach called "style tags".[1][2]