Content translation/Deployments/How-to/TPA

This is a how-to document for updating the Template Parameter Alignment (TPA) database in cxserver.

Connect to stat100x

ssh -N stat100X -L 8880:127.0.0.1:8880

Open http://localhost:8880/

This opens JupyterHub, which requires your LDAP password to log in.
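If the page does not load, it can help to confirm that the SSH tunnel is actually listening before suspecting JupyterHub itself. The following is a minimal sketch, not part of the documented procedure; the helper name `tunnel_is_up` is hypothetical, and the port 8880 is taken from the ssh command above.

```python
import socket


def tunnel_is_up(host: str = "127.0.0.1", port: int = 8880, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: it only checks that the local end of the tunnel
    is listening, not that JupyterHub is healthy behind it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

Calling `tunnel_is_up()` on your local machine while the ssh command is running should return True; a False result suggests the tunnel is not established.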

Starting notebook

Make sure to check the Kerberos authentication timeout first. The default is currently set to 48 hours.

klist

Extend it by running kinit:

kinit
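The ticket check can also be done from a notebook cell before starting a long run. The sketch below assumes the host uses MIT Kerberos, where `klist -s` prints nothing and exits with status 0 only when a readable, unexpired credentials cache exists. The helper name `has_valid_ticket` is hypothetical.

```python
import subprocess


def has_valid_ticket(command=("klist", "-s")) -> bool:
    """Return True if `command` exits with status 0.

    Assumption: with MIT Kerberos, `klist -s` exits 0 only when the
    credentials cache can be read and has not expired.
    """
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The command is not installed on this machine.
        return False
    return result.returncode == 0
```

If `has_valid_ticket()` returns False, run kinit in a terminal before launching the notebooks. Note that this only reports whether a ticket is currently valid, not how long it has left; klist without options shows the expiry time.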

Running scripts

  1. Open a terminal and clone: https://gitlab.wikimedia.org/dsaez/templatesAlignment
  2. Update config.json with the language pairs for which template parameter alignments need to be generated.
  3. Run all notebooks in order.
  4. 00ExtractNamedTempates.ipynb overwrites its existing output files each time it runs, so save the JSON files it produces (e.g. templates-articles_xx.json and templates-summary_xx.json) in another directory to avoid losing data. For large languages such as en, these files can be reused if the process is run again within a few days, which saves time.
  5. While running 02alignmentsSpark.ipynb, make sure that the Wikidata partition is up to date.
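Saving the extracted JSON files before re-running the first notebook can be scripted. The following is a minimal sketch under stated assumptions: the function name `backup_outputs`, the timestamped folder layout, and the glob patterns (derived from the example file names above) are all illustrative, not part of the templatesAlignment repository.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_outputs(
    src_dir,
    backup_root,
    patterns=("templates-articles_*.json", "templates-summary_*.json"),
):
    """Copy matching JSON files into a new timestamped folder.

    Hypothetical helper: the patterns are guessed from the example file
    names and may need adjusting. Returns the list of copied paths.
    """
    src = Path(src_dir)
    files = sorted(f for pattern in patterns for f in src.glob(pattern))
    if not files:
        # Nothing to save, so do not create an empty backup folder.
        return []
    dest = Path(backup_root) / datetime.now().strftime("%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    # copy2 keeps modification times, which helps judge how old a file is.
    return [Path(shutil.copy2(f, dest / f.name)) for f in files]
```

For example, `backup_outputs(".", "../tpa-backups")` run from the notebook directory would copy the current outputs aside. The originals are left in place, so they can still be reused for a large language within the next few days.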

Updating database

Run scripts/prepare-template-mapping.sh from cxserver, pointing it at all the files generated by the process.

This produces a new templatemapping.db in the same folder. Use the sqldiff command (available in the sqlite3-tools package on Linux) to see the differences between the old and new databases.

Copy it to config/templatemapping.db and submit a patch for review. The database can be opened with the sqlite command to check the number of template parameters updated.

e.g. sqlite> select count(*) from templates where source_lang='en' and target_lang='vec';
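The same count can be obtained from Python, which makes it easy to compare the old and new databases for several language pairs. This is a sketch only: it relies on the table name `templates` and the columns `source_lang` and `target_lang` shown in the query above, and the function name `count_mappings` is hypothetical.

```python
import sqlite3


def count_mappings(db_path, source_lang, target_lang):
    """Count rows in `templates` for one language pair.

    Assumes the schema implied by the query above. Note that
    sqlite3.connect creates an empty file if db_path does not exist,
    in which case the query fails because the table is missing.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM templates "
            "WHERE source_lang = ? AND target_lang = ?",
            (source_lang, target_lang),
        ).fetchone()
    finally:
        conn.close()
    return row[0]
```

Calling `count_mappings("config/templatemapping.db", "en", "vec")` on both the old and the new file gives a quick sanity check on how many entries changed for that pair, alongside the fuller picture from sqldiff.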

Notes

1. The fastText_multilingual module is available at: https://github.com/babylonhealth/fastText_multilingual

2. `03ProduceAlignments.py` requires https://github.com/facebookresearch/fastText/tree/master/python instead of the version provided by pip.

Also see