Content translation/Product Definition/Dictionaries

Aim: Provide a reliable dictionary backend and API for CX

Freely licensed multilingual dictionaries are not widely available. DICT protocol based dictionaries are packaged in GNU/Linux distros, but their quality varies a lot. There are many websites that let users look up the meaning of words, but they are designed with human readers in mind and do not provide software-consumable (structured) data. Wiktionary has a lot of dictionary content, but its data is not well structured. Considering all these real-world problems, the dictionary backend of CX is designed to be flexible enough to support multiple backend providers while exposing a general dictionary lookup REST API.

Dict protocol

Pros

  1. Widely accepted dictionary protocol. Many desktop and web clients exist; the default dictionary clients in GNOME/KDE/macOS support this protocol
  2. Readily available packaged dictionaries in Debian
  3. Dict servers do fast lookups on the available dictionaries, and clients do not have any performance overhead - see the performance testing results below
  4. Supports the following search strategies (see the protocol sketch after this list):
    1. exact      Match headwords exactly
    2. prefix     Match prefixes
    3. nprefix    Match prefixes (skip, count)
    4. substring  Match substring occurring anywhere in a headword
    5. suffix     Match suffixes
    6. re         POSIX 1003.2 (modern) regular expressions
    7. regexp     Old (basic) regular expressions
    8. soundex    Match using SOUNDEX algorithm
    9. lev        Match headwords within Levenshtein distance one
    10. word       Match separate words within headwords
    11. first      Match the first word within headwords
    12. last       Match the last word within headwords
    • This is arguably a Con: the performance and security implications of a set of rarely-used search strategies all need investigating.
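
To make the protocol concrete, the following is a minimal sketch of an RFC 2229 lookup in TypeScript/Node.js. It assumes a dictd server on localhost:2628 and the FreeDict fd-eng-spa database; dictMatch is a hypothetical helper, and a real service would use a proper DICT client library with error handling.

// Minimal sketch of an RFC 2229 (DICT) lookup, assuming a dictd server on
// localhost:2628 and the hypothetical choice of the "fd-eng-spa" database.
import * as net from 'net';

function dictMatch(word: string, strategy: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const socket = net.createConnection(2628, 'localhost');
    let response = '';
    let sent = false;

    socket.setEncoding('utf8');
    socket.on('data', (chunk) => {
      response += chunk.toString();
      // Wait for the 220 banner, then send a MATCH followed by QUIT.
      if (!sent && response.startsWith('220')) {
        sent = true;
        socket.write(`MATCH fd-eng-spa ${strategy} "${word}"\r\nQUIT\r\n`);
      }
    });
    socket.on('end', () => resolve(response));
    socket.on('error', reject);
  });
}

// e.g. a prefix search; the raw "152 ..." response lists matching headwords.
dictMatch('pen', 'prefix').then(console.log).catch(console.error);

The same conversation with DEFINE instead of MATCH returns full definitions rather than matching headwords.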

Cons

  1. Available dictionaries vary a lot in quality. We might need to handpick dictionaries. This can be mitigated by using alternate dictionary providers depending on availability for a given language pair
  2. The protocol is optimised around a human searching for a single dictionary word/phrase, with a response that is parsed by the human's eyeballs

Performance, availability, load testing

Simulation: 100 concurrent users hitting the REST API (https://gerrit.wikimedia.org/r/#/c/134074/) for 2 minutes, with 2 seconds between requests.

$ siege -d2 -c100 -t 2m http://localhost:8000/dictionary/pen/en/en
**SIEGE 3.0.5
** Preparing 100 concurrent users for battle.
The server is now under siege...
Lifting the server siege...      done.

Transactions:                  11884 hits
Availability:                 100.00 %
Elapsed time:                 119.50 secs
Data transferred:              56.92 MB
Response time:                  0.00 secs
Transaction rate:              99.45 trans/sec
Throughput:                     0.48 MB/sec
Concurrency:                    0.09
Successful transactions:       11884
Failed transactions:               0
Longest transaction:            0.06
Shortest transaction:           0.00

JSON Dictionary file

Convert the dictionary sources to JSON format (offline) and write code that does the lookup on the JSON.
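
A rough sketch of both steps in TypeScript/Node.js; the tab-separated source file, the JSON layout and the file names are hypothetical and only illustrate the shape of the offline conversion and the runtime lookup:

// Sketch only: the "headword<TAB>translation" source format and the file
// names are assumptions made for illustration.
import * as fs from 'fs';

// Offline step: turn the source lines into { headword: [translations] }.
function convert(sourcePath: string, jsonPath: string): void {
  const entries: Record<string, string[]> = {};
  for (const line of fs.readFileSync(sourcePath, 'utf8').split('\n')) {
    const [headword, translation] = line.split('\t');
    if (!headword || !translation) {
      continue;
    }
    (entries[headword.toLowerCase()] ??= []).push(translation.trim());
  }
  fs.writeFileSync(jsonPath, JSON.stringify(entries));
}

convert('eng-spa.tsv', 'eng-spa.json');

// Runtime step: load the JSON once and answer lookups from memory.
const dictionary: Record<string, string[]> =
  JSON.parse(fs.readFileSync('eng-spa.json', 'utf8'));

function lookup(word: string): string[] {
  return dictionary[word.toLowerCase()] ?? [];
}

console.log(lookup('egg')); // e.g. [ 'huevo' ]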

Pros

  • Immediate:
    1. Deployment is simpler (no separate dictd service)
    2. Coding is simpler (no need for robust dictd client code)
    3. Runtime is simpler (no possibility of dictd protocol/state issues at runtime)
  • Future:
    1. Not restricted to dictd dictionaries (can use exports from terminology resources in TBX format etc).
    2. Can search every word in a paragraph at once (e.g. to highlight matches in the source text; see the sketch after this list)
    3. Can do things the dictd protocol doesn't support well (e.g. better word morphology support in searches)
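
As a sketch of the paragraph-at-once idea mentioned above, with a tiny inline dictionary and deliberately naive tokenisation (both purely illustrative):

// Sketch of the "search every word in a paragraph at once" idea; the inline
// dictionary data is invented for the example.
const dictionary: Record<string, string[]> = {
  egg: ['huevo'],
  table: ['mesa'],
};

function lookupParagraph(paragraph: string): Record<string, string[]> {
  const matches: Record<string, string[]> = {};
  // Any run of letters counts as a word; real code would tokenise properly.
  for (const token of paragraph.toLowerCase().match(/\p{L}+/gu) ?? []) {
    if (dictionary[token] && !(token in matches)) {
      matches[token] = dictionary[token];
    }
  }
  return matches;
}

// e.g. { egg: [ 'huevo' ], table: [ 'mesa' ] } - enough for the front end to
// highlight the source words that have dictionary entries.
console.log(lookupParagraph('The egg is on the table.'));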

Cons

  • Immediate:
    1. Need to extract the data (but we need the code to do this in any case).
      • Each dictionary needs data mining separately to get good word correspondences
      • The extraction process is quite simple (a few lines of code)
      • The size is quite small (~240K of uncompressed JSON for 6000 word pairs)
      • The data quality will vary in many ways (number of headwords, subject, richness of information etc)
      • Caution is required before assuming information will be useful to the user (does a translator really need to know what simple words like "you" mean?)
  • Future:
    1. Big memory consumption for the in-memory representation of a dictionary. The English-English Webster dictionary is 39 MB uncompressed. Node.js is not recommended for memory- or CPU-intensive operations because it is single threaded and can block requests for a few milliseconds to seconds, depending on the volume of data we have. If performance becomes a problem we can optimise with any HTTP-based approach (even non-Node if we like).
    2. Need to re-implement search strategies. Any lookup beyond the native JSON key lookup adds to the response time. But in many cases we will need better search than the dictd files support, especially when the source language is not English: e.g. better word conjugation support.

API

API URL: dictionary/word/sourceLanguage/targetLanguage

Example: http://cxserver.wmflabs.org/dictionary/egg/en/es

{
  "source": "egg",
  "translations": [
    {
      "phrase": "huevo",
      "info": "",
      "sources": [
        "fd-eng-spa"
      ]
    }
  ]
}

If the backend cannot provide structured information like this, the output will be:

{
  "source": "cat",
  "freetext": [
    {
      "text": "cat /kæt/\r\n Katze <f>",
      "sources": [
        [
          "fd-eng-deu"
        ]
      ]
    }
  ]
}
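
A consumer therefore has to handle both shapes. Below is a hedged TypeScript sketch of such a client; the interface names are invented here, the field names follow the two examples above, and fetch assumes Node 18+ or a browser:

// Sketch of a client for /dictionary/:word/:sourceLanguage/:targetLanguage;
// the interface names are invented, the fields follow the examples above.
interface Translation { phrase: string; info: string; sources: string[]; }
interface FreeText { text: string; sources: string[][]; }
interface DictionaryResponse {
  source: string;
  translations?: Translation[];
  freetext?: FreeText[];
}

async function getTranslations(word: string, from: string, to: string): Promise<string[]> {
  const url = `http://cxserver.wmflabs.org/dictionary/${word}/${from}/${to}`;
  const response: DictionaryResponse = await (await fetch(url)).json();

  if (response.translations) {
    // Structured case: collect the translated phrases.
    return response.translations.map((t) => t.phrase);
  }
  // Unstructured case: fall back to the raw definition text.
  return (response.freetext ?? []).map((f) => f.text);
}

getTranslations('egg', 'en', 'es').then(console.log); // e.g. [ 'huevo' ]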

Sister projects

Look up a word on Wiktionary or Wikipedia via interwiki links. For instance, names of people and places are not always present in a classic dictionary. Should Content Translation be able to query the APIs of these two projects (or the Wikidata sitelinks) to assist with the transliteration routine?
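
As a sketch of what such a query could look like, here is an interlanguage-link lookup through the standard MediaWiki action API (the helper name is invented; whether CX should do this at all is the open question above):

// Sketch: ask the source Wikipedia for the interlanguage link of a title,
// which often covers proper nouns missing from classic dictionaries.
async function interwikiLookup(title: string, from: string, to: string): Promise<string | undefined> {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'langlinks',
    titles: title,
    lllang: to,
    format: 'json',
    formatversion: '2',
  });
  const url = `https://${from}.wikipedia.org/w/api.php?${params}`;
  const data = await (await fetch(url)).json();
  const page = data.query?.pages?.[0];
  // The target-language article title, if the source wiki has one linked.
  return page?.langlinks?.[0]?.title;
}

interwikiLookup('Tower of London', 'en', 'es').then(console.log); // e.g. "Torre de Londres"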