Content translation/Deployments/How-to/TPA
This is a how-to document for updating the Template Parameter Alignment (TPA) database in cxserver.
Connect to stat100x
ssh -N stat100X -L 8880:127.0.0.1:8880
Open http://localhost:8880/
This opens JupyterHub, which requires your LDAP password to log in.
Starting notebook
Make sure to check the Kerberos authentication timeout first. The default is currently 48 hours.
klist
Extend it by running kinit:
kinit
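The ticket check can also be done from a notebook cell before starting a long run. The following is a minimal sketch, assuming the host uses MIT Kerberos, where klist -s prints nothing and reports a missing or expired ticket through its exit status. The function name has_valid_kerberos_ticket is hypothetical and not part of the templatesAlignment repository.

```python
import subprocess


def has_valid_kerberos_ticket(run=subprocess.run):
    """Return True if `klist -s` exits with status 0.

    Assumption: MIT Kerberos klist, where -s runs silently and exits
    non-zero when the credentials cache is unreadable or expired.
    The `run` parameter exists only so the check can be tested without
    a real Kerberos setup.
    """
    result = run(["klist", "-s"], check=False)
    return result.returncode == 0
```

If the function returns False, run kinit in a terminal before starting the notebooks.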
Running scripts
- Open a terminal and clone:
https://gitlab.wikimedia.org/dsaez/templatesAlignment
- Update config.json with the language pairs for which template parameter alignments need to be generated.
- Run all notebooks in order.
00ExtractNamedTempates.ipynb overwrites its existing output files if it is run again, so save the generated JSON files (e.g. templates-articles_xx.json and templates-summary_xx.json) in another directory to avoid losing data. For large languages such as en, these files can be reused if the process is run again within a few days, which saves time.
- While running 02alignmentsSpark.ipynb, make sure that the Wikidata partition is up to date.
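Saving the JSON files produced by the first notebook can be scripted. The following is a minimal sketch, assuming the files are written to a single working directory and follow the templates-articles_xx.json and templates-summary_xx.json naming mentioned above. The function name backup_outputs and the dated backup layout are hypothetical choices, not part of the repository.

```python
import shutil
from datetime import date
from pathlib import Path


def backup_outputs(
    work_dir,
    backup_root,
    patterns=("templates-articles_*.json", "templates-summary_*.json"),
):
    """Copy matching JSON outputs into backup_root/<today> and return the new paths.

    Assumption: work_dir is wherever the notebook wrote its output files.
    """
    work_dir = Path(work_dir)
    target = Path(backup_root) / date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in patterns:
        for src in sorted(work_dir.glob(pattern)):
            # copy2 preserves modification times, which helps when deciding
            # whether a saved file is recent enough to reuse.
            copied.append(Path(shutil.copy2(src, target / src.name)))
    return copied
```

Calling backup_outputs(".", "/path/to/backups") after the first notebook finishes keeps a dated copy that a later run cannot overwrite. The backup path here is a placeholder.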
Updating database
Run scripts/prepare-template-mapping.sh from cxserver, pointing it at all the files generated by the process above.
This writes an updated templatemapping.db in the same folder. Use the sqldiff command (available in the sqlite3-tools package on Linux) to see the differences between the old and new databases.
Copy it to config/templatemapping.db and submit a patch for review. The database can be opened with the sqlite command to check the number of template parameters updated.
e.g. sqlite> select count(*) from templates where source_lang='en' and target_lang='vec';
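The same count can be collected for every language pair at once and compared between the old and new databases. The following is a minimal sketch using Python's built-in sqlite3 module. It relies only on the templates table and its source_lang and target_lang columns, as used in the query above. The function names count_by_pair and changed_pairs are hypothetical.

```python
import sqlite3


def count_by_pair(db_path):
    """Return {(source_lang, target_lang): row_count} for the templates table."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT source_lang, target_lang, COUNT(*) FROM templates "
            "GROUP BY source_lang, target_lang"
        ).fetchall()
    finally:
        conn.close()
    return {(src, tgt): n for src, tgt, n in rows}


def changed_pairs(old_db, new_db):
    """Return {pair: (old_count, new_count)} for pairs whose counts differ."""
    old = count_by_pair(old_db)
    new = count_by_pair(new_db)
    return {
        pair: (old.get(pair, 0), new.get(pair, 0))
        for pair in sorted(set(old) | set(new))
        if old.get(pair, 0) != new.get(pair, 0)
    }
```

For example, changed_pairs("config/templatemapping.db", "templatemapping.db") lists the pairs whose row counts changed. This is a quick sanity check before submitting the patch and does not replace sqldiff, which compares the row contents as well.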
Notes
1. 02alignmentsSpark.ipynb needs the fastText_multilingual module to be installed manually in the conda environment. The module is available at: https://github.com/babylonhealth/fastText_multilingual
a. Find the conda environment directory using conda list (the first line of its output shows the environment path)
b. Copy the module into the environment manually, e.g. /home/kartik/.conda/envs/2023-06-08T01.31.46_kartik/lib/python3.10/site-packages/fastText_multilingual
2. 03ProduceAlignments.py requires the fastText Python module from https://github.com/facebookresearch/fastText/tree/master/python instead of the version provided by pip.
3. 03ProduceAlignments.py might throw the error IndexError: list index out of range when a language has no {{Cite web}} template, or when that template is not linked to Wikidata. Try fixing the Wikidata entry. If that is not possible, skip that language.
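Skipping a failing language as described in note 3 can be handled with a small wrapper. The following is a minimal sketch. It does not reflect how 03ProduceAlignments.py is actually structured. The callable produce stands in for whatever per-language step raises the error, and both it and run_for_languages are hypothetical names.

```python
def run_for_languages(languages, produce):
    """Call produce(lang) for each language and skip those raising IndexError.

    Returns (done, skipped) so the skipped languages can be checked on
    Wikidata and retried later.
    """
    done, skipped = [], []
    for lang in languages:
        try:
            produce(lang)
        except IndexError as err:
            # The failure described in note 3: no usable {{Cite web}} mapping.
            print(f"Skipping {lang}: {err}")
            skipped.append(lang)
        else:
            done.append(lang)
    return done, skipped
```

Only IndexError is caught, so any other failure still stops the run and is not silently hidden.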
Useful resources
- All about Conda environments: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda
- Issues related to Kerberos access: https://wikitech.wikimedia.org/wiki/SWAP#Access_and_infrastructure
- Jupyter at Wikitech contains useful information: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter