Content translation/Product Definition/Dictionaries
Aim: Provide a reliable dictionary back end and API for CX
Free-licensed multilingual dictionaries are not widely available. DICT protocol based dictionaries are packaged in GNU/Linux distros, but their quality varies a lot. There are many websites that let users look up the meaning of words, but they are designed with human readers in mind and do not provide software-consumable (structured) data. Wiktionary has a lot of dictionary content, but its data is not well structured. Considering all these real-world problems, the dictionary backend of CX is flexible enough to support multiple backend providers while exposing a general dictionary lookup REST API, as sketched below.
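The pluggable-backend idea could look roughly like the following Node.js sketch. This is not the actual cxserver code: the provider names, the stubbed lookup implementations and the per-language-pair registry are all illustrative assumptions.

// Sketch of a pluggable dictionary backend: every provider exposes the same
// lookup(word, from, to, callback) interface and a registry picks one per
// language pair. Providers are stubbed here purely for illustration.
const providers = {
    dictd: {
        // DICT protocol backend (stub)
        lookup: (word, from, to, callback) => callback([{ phrase: 'huevo', info: '', sources: ['fd-eng-spa'] }])
    },
    jsondict: {
        // Offline JSON file backend (stub)
        lookup: (word, from, to, callback) => callback([])
    }
};

// Preferred provider per language pair (illustrative).
const registry = { 'en-es': 'dictd', 'en-de': 'jsondict' };

function lookup(word, from, to, callback) {
    const name = registry[from + '-' + to];
    if (!name) {
        return callback(null); // no dictionary available for this pair
    }
    providers[name].lookup(word, from, to, callback);
}

lookup('egg', 'en', 'es', (translations) => console.log(translations));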
Dict protocol
Pros
- Widely accepted dictionary protocol, with lots of desktop and web clients. The default dictionary clients in GNOME/KDE/macOS support this protocol
- Readily available packaged dictionaries in Debian
- Dict servers do fast lookups on the available dictionaries, and clients do not have any performance overhead - see the performance testing results below
- Supports the following search strategies (a lookup sketch using one of them follows this list)
- exact: Match headwords exactly
- prefix: Match prefixes
- nprefix: Match prefixes (skip, count)
- substring: Match substring occurring anywhere in a headword
- suffix: Match suffixes
- re: POSIX 1003.2 (modern) regular expressions
- regexp: Old (basic) regular expressions
- soundex: Match using the SOUNDEX algorithm
- lev: Match headwords within Levenshtein distance one
- word: Match separate words within headwords
- first: Match the first word within headwords
- last: Match the last word within headwords
- This is arguably a Con: the performance and security implications of a set of rarely-used search strategies all need investigating.
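For illustration, here is a minimal Node.js sketch that drives one of these strategies over the DICT protocol (RFC 2229) with a MATCH command. The server address and the fd-eng-spa database name are assumptions, not part of any deployed setup.

// Minimal DICT protocol client: MATCH <database> <strategy> <word>.
// Assumes a dictd server listening on localhost:2628 (not guaranteed).
const net = require('net');

function dictMatch(word, strategy, database, callback) {
    const socket = net.connect(2628, 'localhost');
    let response = '';
    socket.on('connect', () => {
        socket.write('MATCH ' + database + ' ' + strategy + ' "' + word + '"\r\n');
    });
    socket.on('data', (chunk) => {
        response += chunk.toString('utf8');
        // 250 closes a successful transaction; 5xx codes signal errors (RFC 2229).
        if (/^250 /m.test(response) || /^5\d\d /m.test(response)) {
            socket.end('QUIT\r\n');
            callback(response);
        }
    });
}

// e.g. a fuzzy lookup with the Levenshtein ('lev') strategy
dictMatch('pen', 'lev', 'fd-eng-spa', (raw) => console.log(raw));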
Cons
- Available dictionaries vary a lot in quality. We might need to handpick dictionaries. This can be solved by using alternate dictionary providers depending on availability for each language pair
- The protocol is optimised around a human searching for a single dictionary word/phrase, with a response that is parsed by the human's eyeballs
Performance, availability, load testing
Simulation: 100 concurrent users hitting the REST API (https://gerrit.wikimedia.org/r/#/c/134074/) for 2 minutes, with 2 seconds between requests.
$ siege -d2 -c100 -t 2m http://localhost:8000/dictionary/pen/en/en
** SIEGE 3.0.5
** Preparing 100 concurrent users for battle.
The server is now under siege...
Lifting the server siege...   done.

Transactions:                  11884 hits
Availability:                 100.00 %
Elapsed time:                 119.50 secs
Data transferred:              56.92 MB
Response time:                  0.00 secs
Transaction rate:              99.45 trans/sec
Throughput:                     0.48 MB/sec
Concurrency:                    0.09
Successful transactions:       11884
Failed transactions:               0
Longest transaction:            0.06
Shortest transaction:           0.00
JSON Dictionary file
Convert the dictionary sources to JSON format (offline) and write code that does lookups on the JSON, e.g. the sketch below.
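A minimal sketch of that lookup, assuming the converted file is called dict-en-es.json and maps lower-cased headwords to arrays of translations (both the file name and the shape are assumptions):

// Load the pre-converted dictionary once and answer lookups from memory.
const fs = require('fs');

const dictionary = JSON.parse(fs.readFileSync('dict-en-es.json', 'utf8'));

function lookup(word) {
    // Native object key lookup; no dictd round trip is involved.
    return dictionary[word.toLowerCase()] || [];
}

console.log(lookup('egg')); // e.g. [ 'huevo' ]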
Pros
- Immediate:
- Deployment is simpler (no separate dictd service)
- Coding is simpler (no need for robust dictd client code)
- Runtime is simpler (no possibility of dictd protocol/state issues at runtime)
- Future:
- Not restricted to dictd dictionaries (can use exports from terminology resources in TBX format etc).
- Can search every word in a paragraph at once (e.g. to highlight matches in the source text)
- Can do things the dictd protocol doesn't support well (e.g. better word morphology support in searches)
Cons
- Immediate:
- Need to extract the data (but we need the code to do this in any case).
- Each dictionary needs data mining separately to get good word correspondences
- The extraction process is quite simple (a few lines of code)
- The size is quite small (~240K of uncompressed JSON for 6000 word pairs)
- The data quality will vary in many ways (number of headwords, subject, richness of information etc)
- Caution is required before assuming information will be useful to the user (does a translator really need to know what simple words like "you" mean?)
- Future:
- Big memory consumption for the in-memory representation of a dictionary. The English-English Webster dictionary is 39 MB uncompressed. Node.js is not recommended for memory- or CPU-intensive operations because it is single-threaded and can block requests for anywhere from a few milliseconds to seconds depending on the volume of data we have. If performance becomes a problem we can optimise with any HTTP-based approach (even non-Node if we like).
- Need to re-implement search strategies. If the lookup is not a native JSON key lookup, the response takes longer. But in many cases we will need better search than the dictd files support, especially when the source language is not English: e.g. better word conjugation support (a prefix-strategy sketch follows this list).
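As an example, the dictd 'prefix' strategy re-implemented over JSON headwords might look like the following sketch (the dictionary object is the one loaded in the lookup sketch above):

// Re-implementing the dictd 'prefix' strategy over JSON headwords.
function prefixMatch(prefix, dictionary) {
    prefix = prefix.toLowerCase();
    return Object.keys(dictionary).filter(function (headword) {
        return headword.indexOf(prefix) === 0;
    });
}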
API
API URL: dictionary/word/sourceLanguage/targetLanguage
Example http://cxserver.wmflabs.org/dictionary/egg/en/es
{
"source": "egg",
"translations": [
{
"phrase": "huevo",
"info": "",
"sources": [
"fd-eng-spa"
]
}
]
}
If the backend cannot provide structured information like this, the output will be:
{
"source": "cat",
"freetext": [
{
"text": "cat /kæt/\r\n Katze <f>",
"sources": [
[
"fd-eng-deu"
]
]
}
]
}
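A consumer of the API has to handle both response shapes. A minimal Node.js sketch (how the result is rendered is illustrative, not part of the API):

// Fetch a dictionary entry from cxserver and normalise both response shapes.
const http = require('http');

function getDictionaryEntry(word, from, to, callback) {
    const url = 'http://cxserver.wmflabs.org/dictionary/' +
        encodeURIComponent(word) + '/' + from + '/' + to;
    http.get(url, (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => {
            const result = JSON.parse(body);
            if (result.translations) {
                // Structured case: list of { phrase, info, sources }
                callback(result.translations.map((t) => t.phrase));
            } else if (result.freetext) {
                // Unstructured case: raw dictd article text
                callback(result.freetext.map((f) => f.text));
            } else {
                callback([]);
            }
        });
    });
}

getDictionaryEntry('egg', 'en', 'es', (phrases) => console.log(phrases));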
Sister projects
Look up a word at Wiktionary or Wikipedia via interwiki. For instance, names of people and places are not always present in a classic dictionary. Should content translation be able to query the APIs of these two projects (or Wikidata sitelinks) to assist with the transliteration routine?