Content translation/Deployments/How-to/TPA

This is a how-to document for updating the Template Parameter Alignment (TPA) database in cxserver.

Connect to stat100X

ssh -N stat100X -L 8880:127.0.0.1:8880

Open http://localhost:8880/ in a browser.

This opens JupyterHub, which requires your LDAP password to log in.
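If the page does not load, it can help to confirm that the SSH tunnel is still listening before debugging JupyterHub itself. The following is a minimal sketch, assuming the tunnel uses local port 8880 as in the ssh command above; the function name `tunnel_is_up` is hypothetical and only checks that something accepts TCP connections on that port.

```python
import socket


def tunnel_is_up(host="127.0.0.1", port=8880, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if tunnel_is_up():
        print("tunnel up")
    else:
        print("tunnel down: is the ssh command still running?")
```

A successful check only shows that the port is open; it does not prove that JupyterHub is the service answering on it.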

Starting notebook

First check when your Kerberos ticket expires. The default ticket lifetime is currently 48 hours.

klist

Extend it by running kinit:

kinit
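The same check can be done from a notebook cell before starting a long run. This is a sketch that assumes the MIT Kerberos `klist`, whose `-s` flag prints nothing and exits with a non-zero status when the credentials cache is missing or expired; the function name `has_valid_ticket` is hypothetical.

```python
import subprocess


def has_valid_ticket(command=("klist", "-s")):
    """Return True if the given command exits with status 0.

    Assumption: with MIT Kerberos, `klist -s` is silent and exits non-zero
    when the credentials cache cannot be read or has expired.
    """
    try:
        return subprocess.run(list(command), check=False).returncode == 0
    except FileNotFoundError:
        # klist is not installed on this machine.
        return False


if __name__ == "__main__":
    if has_valid_ticket():
        print("Kerberos ticket is valid")
    else:
        print("No valid Kerberos ticket: run kinit in a terminal")
```

kinit prompts for a password, so it is simpler to run it in a terminal than to call it from the notebook.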

Running scripts

Open a terminal and clone https://gitlab.wikimedia.org/dsaez/templatesAlignment

Update config.json with the language pairs for which template parameter alignments should be generated.

Run all notebooks in order.

00ExtractNamedTempates.ipynb overwrites its existing output files when it is run again, so save the JSON files it produces (e.g. templates-articles_xx.json and templates-summary_xx.json) in another directory to avoid losing data. For large languages such as en, these files can be reused if the process is run again within a few days, which saves time.
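One way to keep those files safe is to copy them into a dated folder before rerunning the notebook. The sketch below assumes the output files sit directly in the working directory and follow the two name patterns mentioned above; the function name `backup_outputs` and the dated folder layout are hypothetical choices, not part of the templatesAlignment repository.

```python
import shutil
from datetime import date
from pathlib import Path

# File name patterns taken from the examples above; xx is the language code.
PATTERNS = ("templates-articles_*.json", "templates-summary_*.json")


def backup_outputs(work_dir, backup_root):
    """Copy the extraction outputs into backup_root/<today's date>/.

    Returns the target directory and the list of copied file names.
    """
    work_dir = Path(work_dir)
    target = Path(backup_root) / date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in PATTERNS:
        for src in sorted(work_dir.glob(pattern)):
            shutil.copy2(src, target / src.name)  # copy2 keeps timestamps
            copied.append(src.name)
    return target, copied


# Example call (paths are placeholders):
# backup_outputs("templatesAlignment", "tpa-backups")
```

Keeping the file timestamps makes it easier to judge later whether a saved file for a large language is still recent enough to reuse.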

When running 02alignmentsSpark.ipynb, make sure that the Wikidata partition is up to date.

Updating database

Run scripts/prepare-template-mapping.sh from cxserver, pointing it at all the files generated by the process above.

This produces a new templatemapping.db in the same folder. Use the sqldiff command (available in the sqlite3-tools package on Linux) to see the differences between the old and new databases.

Copy it to config/templatemapping.db and submit a patch for review. The database can be opened with the sqlite3 command to check the number of template parameters updated.

eg: sqlite> select count(*) from templates where source_lang='en' and target_lang='vec';
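To check every language pair at once rather than one query at a time, the same count can be grouped by pair and compared between the old and new databases. This is a sketch using Python's built-in sqlite3 module; it assumes only the `templates` table and its `source_lang` and `target_lang` columns shown in the query above, and the function names are hypothetical.

```python
import sqlite3
from contextlib import closing


def pair_counts(db_path):
    """Map (source_lang, target_lang) to the number of rows in `templates`."""
    query = (
        "SELECT source_lang, target_lang, COUNT(*) "
        "FROM templates GROUP BY source_lang, target_lang"
    )
    with closing(sqlite3.connect(db_path)) as conn:
        return {(src, tgt): n for src, tgt, n in conn.execute(query)}


def changed_pairs(old_db, new_db):
    """Return {pair: (old_count, new_count)} for pairs whose counts differ."""
    old, new = pair_counts(old_db), pair_counts(new_db)
    return {
        pair: (old.get(pair, 0), new.get(pair, 0))
        for pair in sorted(set(old) | set(new))
        if old.get(pair, 0) != new.get(pair, 0)
    }


# Example call (paths are placeholders):
# changed_pairs("config/templatemapping.db", "templatemapping.db")
```

This only compares row counts, so a pair whose rows changed without changing in number will not show up; sqldiff remains the tool for a row-level comparison.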

Notes

1. 02alignmentsSpark.ipynb needs the fastText_multilingual module, available at https://github.com/babylonhealth/fastText_multilingual, to be installed manually in the conda environment.

a. Find the conda environment directory using conda list (the first line of its output shows the environment path).

b. Copy the module into the environment manually, e.g. to /home/kartik/.conda/envs/2023-06-08T01.31.46_kartik/lib/python3.10/site-packages/fastText_multilingual

2. 03ProduceAlignments.py requires the fastText Python module from https://github.com/facebookresearch/fastText/tree/master/python instead of the version provided by pip.

3. 03ProduceAlignments.py might throw the error IndexError: list index out of range when a language has no {{Cite web}} template, or the template is not linked to Wikidata. Try fixing the Wikidata entry. If that is not possible, skip that language.
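The skip described in note 3 can be handled in a loop so that one failing language does not stop the whole run. This is only a sketch of the pattern: `produce` is a hypothetical stand-in for whatever per-language step raises the error, and 03ProduceAlignments.py may be organised differently.

```python
def run_per_language(languages, produce):
    """Call produce(lang) for each language, skipping those that raise IndexError.

    `produce` is a hypothetical callable standing in for the per-language
    alignment step; it is not a function from the templatesAlignment code.
    """
    done, skipped = [], []
    for lang in languages:
        try:
            produce(lang)
        except IndexError as err:
            # Typically means no {{Cite web}} template is linked on Wikidata.
            print(f"skipping {lang}: {err}")
            skipped.append(lang)
        else:
            done.append(lang)
    return done, skipped
```

Keeping the list of skipped languages makes it easy to revisit them after the Wikidata entries have been fixed.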

Useful resources

All about Conda environments: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda

Issues related to Kerberos access: https://wikitech.wikimedia.org/wiki/SWAP#Access_and_infrastructure

Jupyter at Wikitech contains useful information: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter