Help:Extension:Translate/Translation memories/da
The Translate extension translation memory supports ElasticSearch. This page aims to guide you in installing ElasticSearch, and exploring its specifications in deeper detail.
Unlike other translation aids, for instance external machine translation services, the translation memory is constantly updated by new translations in your wiki. Advanced search across translations is also available at Special:SearchTranslations if you choose to use ElasticSearch.
Comparison
The database backend is used by default: it has no dependencies and doesn't need configuration. The database backend can't be shared among multiple wikis and it does not scale to large amounts of translated content. Hence we also support ElasticSearch as a backend. It is also possible to use another wiki's translation memory if their web API is open. Unlike ElasticSearch, remote backends are not updated with translations from the current wiki.
Feature | Database | Remote API | ElasticSearch |
---|---|---|---|
Enabled by default | Yes | No | No |
Can have multiple sources | No | Yes | Yes |
Updated with local translations | Yes | No | Yes |
Accesses database directly | Yes | No | No |
Access to source | Editor | Link | Local editor or link |
Can be shared as an API service | Yes | Yes | Yes |
Performance | Does not scale well | Unknown | Reasonable |
Requirements
ElasticSearch backend
ElasticSearch is relatively easy to set up. If it is not available in your distribution packages, you can get it from their website. You will also need to get the Elastica extension. Finally, please see puppet/modules/elasticsearch/files/elasticsearch.yml for specific configuration needed by Translate.
The bootstrap script will create necessary schemas. If you are using ElasticSearch backend with multiple wikis, they will share the translation memory by default, unless you set the index parameter in the configuration.
When upgrading to the next major version of ElasticSearch (e.g. upgrading from 2.x to 5.x), it is highly recommended to read the release notes and the documentation regarding the upgrade process.
Installation
After putting the requirements in place, installation requires you to tweak the configuration and then execute the bootstrap.
Configuration
All translation aids, including translation memories, are configured with the $wgTranslateTranslationServices configuration variable.
The primary translation memory backend must use the key TTMServer.
The primary backend receives translation updates and is used by Special:SearchTranslations.
Example configuration of TTMServers:
Default configuration |
---|
$wgTranslateTranslationServices['TTMServer'] = array(
'database' => false, // Passed to wfGetDB
'cutoff' => 0.75,
'type' => 'ttmserver',
'public' => false,
);
|
Remote API configuration |
$wgTranslateTranslationServices['example'] = array(
'url' => 'http://example.com/w/api.php',
'displayname' => 'example.com',
'cutoff' => 0.75,
'timeout' => 3,
'type' => 'ttmserver',
'class' => 'RemoteTTMServer',
);
|
ElasticSearch backend configuration |
In this case the single back-end service will be used both for reads & writes.
$wgTranslateTranslationServices['TTMServer'] = array(
'type' => 'ttmserver',
'class' => 'ElasticSearchTTMServer',
'cutoff' => 0.75,
/*
* See http://elastica.io/getting-started/installation.html
* See https://github.com/ruflin/Elastica/blob/8.x/src/Client.php
'config' => This will be passed to \Elastica\Client
*/
);
|
ElasticSearch multiple backends configuration (supported by MLEB 2017.04, dropped in MLEB 2023.10) |
// Defines the default service used for read operations
// Allows switching quickly to another backend
// 'mirrors' configuration option is no longer supported since MLEB 2023.10
$wgTranslateTranslationDefaultService = 'cluster1';
$wgTranslateTranslationServices['cluster1'] = array(
'type' => 'ttmserver',
'class' => 'ElasticSearchTTMServer',
'cutoff' => 0.75,
/*
* Defines the list of services to replicate writes to.
* Only "writable" services are allowed here.
*/
'mirrors' => [ 'cluster2' ],
'config' => [ 'servers' => [ 'host' => 'elastic1001.cluster1.mynet' ] ]
);
$wgTranslateTranslationServices['cluster2'] = array(
'type' => 'ttmserver',
'class' => 'ElasticSearchTTMServer',
'cutoff' => 0.75,
/*
* if "cluster2" is defined as the default service it will start to replicate writes to "cluster1".
*/
'mirrors' => [ 'cluster1' ],
'config' => [ 'servers' => [ 'host' => 'elastic2001.cluster2.mynet' ] ]
);
|
ElasticSearch multiple services with single readable service using writable configuration (supported by MLEB 2023.04)
|
With writable configuration the following rules are enforced:
- If a service is marked as writable, the mirrors configuration will not be allowed.
// Three services configured with one being readable and the others being writable.
$wgTranslateTranslationServices['dc0'] = [
'type' => 'ttmserver',
'class' => 'ElasticSearchTTMServer',
'cutoff' => 0.75,
// Default service cannot be marked as write-only
];
$wgTranslateTranslationServices['dc1'] = [
'type' => 'ttmserver',
'class' => 'ElasticSearchTTMServer',
'cutoff' => 0.75,
// Marks this service as write-only
'writable' => true,
];
$wgTranslateTranslationServices['dc2'] = [
'type' => 'ttmserver',
'class' => 'ElasticSearchTTMServer',
'cutoff' => 0.75,
'writable' => true
];
$wgTranslateTranslationDefaultService = 'dc0';
|
Possible keys and values are:
Key | Applies to | Description |
---|---|---|
config | ElasticSearch | Configuration passed to \Elastica\Client (see the ElasticSearch examples above). |
cutoff | All | Minimum threshold for matching suggestions. Only the few best suggestions are shown, even if more are above the threshold. |
database | Local | If you want to store the translation memory in a different location, you can specify the database name here. You also have to configure MediaWiki's load balancer so that it knows how to connect to that database. |
displayname | Remote | The text shown in the tooltip when hovering over the suggestion source link (the bullets). |
index | ElasticSearch | The index to use in ElasticSearch. Default: ttmserver. |
public | All | Whether this TTMServer can be queried through this wiki's api.php. |
replicas | ElasticSearch | If you are running a cluster, you can increase the number of replicas. Default: 0. |
shards | ElasticSearch | How many shards to use. Default: 5. |
timeout | Remote | How long to wait, in seconds, for a response from the remote service. |
type | All | Type of the TTMServer in terms of the results format. |
url | Remote | URL to the api.php of the remote TTMServer. |
use_wikimedia_extra | ElasticSearch | Boolean. When the extra plugin is deployed, you can disable dynamic scripting on Elastic v1.x. This plugin is now mandatory for Elastic 2.x clusters. |
mirrors (DEPRECATED since MLEB 2023.04) | Writable services | Array of strings defining the list of services to replicate writes to, which keeps multiple TTM services up to date. Useful for fast switch-overs or to reduce downtime during planned maintenance operations (added in MLEB 2017.04). Cannot be used together with the newer writable configuration. |
writable (added in MLEB 2023.04) | Write-only services | Boolean. Set for a service that is write-only. The default service ($wgTranslateTranslationDefaultService) cannot be marked as write-only. If none of the configured translation memory services is marked as writable, all services are considered readable and writable. See task T322284. |
Use the key TTMServer as the array index to $wgTranslateTranslationServices if you want the translation memory to be updated with new translations. Remote TTMServers cannot be used for that, because they cannot be updated. As of MLEB 2017.04, the key TTMServer can be configured with the configuration variable $wgTranslateTranslationDefaultService.
Support for the Solr backend was dropped in MLEB 2019.10, in October 2019. Currently only MySQL is supported for the database backend.
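The rules stated for the writable key can be checked mechanically before deploying a configuration. The following is only a sketch in Python, not part of the Translate extension: the function names and the dictionary layout are hypothetical, and the checks mirror only what this page states (a writable service cannot use mirrors, the default service cannot be write-only, and without any writable flag all services are both readable and writable).

```python
def validate_services(services, default_service):
    """Return a list of rule violations for a hypothetical services mapping.

    `services` maps a service name to a dict of its settings, loosely
    mirroring $wgTranslateTranslationServices. This is an illustration,
    not the extension's own validation code.
    """
    errors = []
    if default_service not in services:
        errors.append(f"default service '{default_service}' is not configured")
    for name, conf in services.items():
        writable = conf.get("writable", False)
        # Rule: 'mirrors' is not allowed on a service marked as writable.
        if writable and "mirrors" in conf:
            errors.append(f"'{name}': 'mirrors' cannot be combined with 'writable'")
        # Rule: the default service cannot be marked as write-only.
        if writable and name == default_service:
            errors.append(f"'{name}': the default service cannot be write-only")
    return errors


def all_read_write(services):
    """True when no service is marked writable, so all are readable and writable."""
    return not any(conf.get("writable", False) for conf in services.values())
```

With the three-service example above (dc0 as default, dc1 and dc2 marked writable), `validate_services` returns an empty list, while choosing dc1 as the default would be reported as a violation.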
Bootstrap
Once you have chosen ElasticSearch and set up the requirements and configuration, run ttmserver-export.php to bootstrap the translation memory.
Bootstrapping is also required when changing translation memory backend.
If you are using a shared translation memory backend for multiple wikis, you'll need to bootstrap each of them separately.
Sites with lots of translations should consider using multiple threads with the --thread parameter to speed up the process.
The time depends heavily on how complete the message group completion stats are (incomplete ones will be calculated during the bootstrap).
New translations are automatically added by a hook.
New sources (message definitions) are added when the first translation is created.
Bootstrap does the following things, which don't happen otherwise:
- adding and updating the translation memory schema;
- populating the translation memory with existing translations;
- cleaning up unused translation entries by emptying and re-populating the translation memory.
When the translation of a message is updated, the previous translation is removed from the translation memory. However, when translations are updated against a new definition, a new entry is added but the old definition and its old translations remain in the database until purged. When a message changes definition or is removed from all message groups, nothing happens immediately. Saving a translation as fuzzy does not add a new translation nor delete an old one in the translation memory.
TTMServer API
If you would like to implement your own TTMServer service, here are the specifications.
Query parameters:
Your service must accept the following parameters:
Key | Value |
---|---|
format | json |
action | ttmserver |
service | Optional service identifier if there are multiple shared translation memories. If not provided, the default service is assumed. |
sourcelanguage | Language code as used in MediaWiki, see IETF language tags and ISO 639. |
targetlanguage | Language code as used in MediaWiki, see IETF language tags and ISO 639. |
text | Source text in the source language. |
Your service must return a JSON object that has the key ttmserver with an array of objects. Those objects must contain the following data:
Key | Value |
---|---|
source | Original source text. |
target | Translation suggestion. |
context | Local identifier for the source, optional. |
location | URL to the page where the suggestion can be seen in use. |
quality | Decimal number in the range [0..1] describing the quality of the suggestion. 1 means a perfect match. |
Example:
- URL: http://translatewiki.net/w/api.php?action=ttmserver&sourcelanguage=en&targetlanguage=fi&text=january&format=jsonfm
- Result:
{
"ttmserver": [
{
"source": "January",
"target": "tammikuu",
"context": "Wikimedia:Messages\\x5b'January'\\x5d\/en",
"location": "https:\/\/translatewiki.net\/wiki\/Wikimedia:Messages%5Cx5b%27January%27%5Cx5d\/fi",
"quality": 0.85714285714286
},
{
"source": "January",
"target": "tammikuu",
"context": "Mantis:S month january\/en",
"location": "https:\/\/translatewiki.net\/wiki\/Mantis:S_month_january\/fi",
"quality": 0.85714285714286
},
{
"source": "January",
"target": "Tammikuu",
"context": "FUDforum:Month 1\/en",
"location": "https:\/\/translatewiki.net\/wiki\/FUDforum:Month_1\/fi",
"quality": 0.85714285714286
},
{
"source": "January",
"target": "tammikuun",
"context": "MediaWiki:January-gen\/en",
"location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January-gen\/fi",
"quality": 0.85714285714286
},
{
"source": "January",
"target": "tammikuu",
"context": "MediaWiki:January\/en",
"location": "https:\/\/translatewiki.net\/wiki\/MediaWiki:January\/fi",
"quality": 0.85714285714286
}
]
}
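A client for this API can be quite small. The sketch below, in Python, builds a query URL from the parameters listed above and parses a response of the shape shown in the example. The function names, the `cutoff` argument and the sorting step are illustrative assumptions, not part of the specification. It requests format=json rather than the jsonfm used in the example URL, since jsonfm is meant for reading in a browser.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(api_url, text, source_language, target_language, service=None):
    """Build a TTMServer query URL from the documented parameters."""
    params = {
        "action": "ttmserver",
        "format": "json",
        "sourcelanguage": source_language,
        "targetlanguage": target_language,
        "text": text,
    }
    if service is not None:
        # Optional: only needed when several translation memories are shared.
        params["service"] = service
    return api_url + "?" + urlencode(params)


def parse_suggestions(payload, cutoff=0.0):
    """Parse a JSON response and return suggestions, best quality first.

    The client-side cutoff is an assumption for illustration only.
    """
    data = json.loads(payload)
    suggestions = [s for s in data.get("ttmserver", []) if s["quality"] >= cutoff]
    return sorted(suggestions, key=lambda s: s["quality"], reverse=True)


def fetch_suggestions(api_url, text, source_language, target_language, timeout=3):
    """Query a remote TTMServer over HTTP (performs a network request)."""
    url = build_query_url(api_url, text, source_language, target_language)
    with urlopen(url, timeout=timeout) as response:
        return parse_suggestions(response.read().decode("utf-8"))
```

For instance, `build_query_url("https://example.org/w/api.php", "january", "en", "fi")` yields a URL with the same parameters as the example above, where example.org stands in for a real wiki.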
Database backend
The backend contains three tables: translate_tms, translate_tmt and translate_tmf. Those correspond to sources, targets and fulltext. You can find the table definitions in sql/translate_tm.sql. The sources contain all the message definitions. Even though they are usually all in the same language, say English, the language of the text is also stored for the rare cases where this is not true.
Each entry has a unique id and two extra fields, length and context. Length is used as the first pass filter, so that when querying we don't need to compare the text we're searching with every entry in the database. The context stores the title of the page where the text comes from, for example MediaWiki:Jan/en. From this information we can link the suggestions back to MediaWiki:Jan/de, which makes it possible for translators to quickly fix things, or just to determine where that kind of translation was used.
The second pass of filtering comes from the fulltext search. The definitions are mingled with an ad hoc algorithm. First the text is segmented into segments (words) with MediaWiki's Language::segmentByWord. If there are enough segments, we strip basically everything that is not word letters and normalize the case. Then we take the first ten unique words that are at least 5 bytes long (5 letters in English, but even shorter words for languages with multibyte code points). Those words are then stored in the fulltext index for further filtering for longer strings.
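The word selection described above can be sketched as follows. This is an illustration in Python, not the extension's PHP code: whitespace splitting stands in for Language::segmentByWord, the regular expression is one possible reading of "everything that is not word letters", and the `min_segments` threshold is a hypothetical value, because this page does not say how many segments count as "enough".

```python
import re

MIN_BYTES = 5      # minimum word length in bytes, as stated above
MAX_WORDS = 10     # number of unique words kept, as stated above
MIN_SEGMENTS = 10  # hypothetical: the page only says "enough segments"


def fulltext_terms(text, min_segments=MIN_SEGMENTS):
    """Return the words that would be stored in the fulltext index."""
    # Stand-in for MediaWiki's Language::segmentByWord (assumption).
    segments = text.split()
    if len(segments) < min_segments:
        return []
    terms = []
    for segment in segments:
        # Strip everything that is not a letter, then normalize the case.
        word = re.sub(r"[\W\d_]+", "", segment).lower()
        # Length is measured in UTF-8 bytes, so multibyte letters count more.
        if len(word.encode("utf-8")) < MIN_BYTES:
            continue
        if word not in terms:
            terms.append(word)
        if len(terms) == MAX_WORDS:
            break
    return terms
```

Measuring the length in bytes is why a three-letter word such as "äää" (6 bytes in UTF-8) passes the filter while the four-letter "abcd" does not.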
When we have filtered the list of candidates, we fetch the matching targets from the targets table. Then we apply the Levenshtein edit distance algorithm to do the final filtering and ranking. Let's define:
- E: the edit distance
- S: the text we are searching suggestions for
- Tc: the suggestion text
- To: the original text which Tc is a translation of
The quality of suggestion Tc is calculated as E/min(length(Tc),length(To)). As written, this ratio is 0 for identical strings, whereas the quality field of the API uses 1 for a perfect match; the example response above (0.857..., which equals 1 - 1/7 for "january" against "January") suggests that the reported quality is the complement of this ratio. Depending on the length of the strings, we use either PHP's native levenshtein function or, if either of the strings is longer than 255 bytes, a PHP implementation of the Levenshtein algorithm.[1] It has not been tested whether the native implementation of levenshtein handles multibyte characters correctly. This might be another weak point when the source language is not English (the others being the fulltext search and segmentation).[2]
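The ranking step can be illustrated with a short Python sketch. It rests on several assumptions that this page does not state outright: E is taken between S and To, the reported quality is 1 minus the ratio given above (which reproduces the 0.857... of the API example), and the result is clamped at 0. The function names are hypothetical. Python strings are sequences of code points, so the multibyte concern about PHP's byte-based levenshtein does not arise here.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming, counted in Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def quality(search_text, original_text, suggestion_text):
    """Quality of a suggestion: 1 - E / min(length(Tc), length(To)).

    Assumption: E is the distance between S and To, and the result is the
    complement of the ratio given in the text, clamped to [0, 1].
    """
    e = levenshtein(search_text, original_text)
    shortest = min(len(suggestion_text), len(original_text))
    if shortest == 0:
        return 0.0
    return max(0.0, 1.0 - e / shortest)


def rank(search_text, candidates, cutoff=0.75):
    """Score (To, Tc) pairs, drop those below the cutoff, best first."""
    scored = [(quality(search_text, to, tc), tc) for to, tc in candidates]
    kept = [pair for pair in scored if pair[0] >= cutoff]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Under these assumptions, `quality("january", "January", "tammikuu")` gives 1 - 1/7, matching the quality value in the example response.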