Aisha Khatun


Learning is my passion. Everything else just falls into place.

About me

I am an ML and NLP enthusiast from Bangladesh. I love working with data and drawing insights from it. I did my Bachelor's in Computer Science and Engineering at Shahjalal University of Science and Technology, Bangladesh, and my Master's in Computer Science at the University of Waterloo, Canada. After graduation, I worked as a Machine Learning Engineer for about a year before joining the Wikimedia Foundation as a Data Analyst and Researcher, performing several roles along the way.

N.B: This is my personal wiki page.

My work

  • I work with the Research Team as a Research Data Scientist (NLP) to develop copyediting as a structured task. To maintain and raise the standard of Wikipedia articles, it is important to ensure articles don't have typos, spelling mistakes, or grammatical errors. While there are ongoing efforts to automatically detect "commonly misspelled" words on English Wikipedia, most other languages are left behind. We intend to find ways to detect errors in articles across all languages in an automated fashion.
  • Previously, I worked with the Search and Analytics team to find ways to scale the Wikidata Query Service by analyzing the queries being made. The analysis results are in my User:AKhatun subpages and on the Phabricator work board (WDQS Analysis).
  • I worked on the Abstract Wikimedia project to identify central Scribunto modules across all the wikis. This work leads to the creation of a central repository of functions to be used in a language-independent manner in the future. See our work on Phabricator and GitHub.

Contact me

Outreachy Round 21 Internship Work

Overview

I am an Outreachy intern with the Wikimedia Foundation. My internship runs from 1 Dec, 2020 to 2 March, 2021. I am working on an initial step of the Abstract Wikipedia project - a project to make Wikipedia reach millions more readers by storing information in a more language-independent manner.

I have been blogging throughout my internship here and will continue to do so. More details about the specifics of my work can be found here:

Project Description

T263678

The Abstract Wikipedia initiative will make it possible to generate Wikipedia articles with a combination of community authored programming functions on a "wiki of functions" and the data and lexicographic (dictionary, grammar, etc.) knowledge on Wikidata.
Today the way community authored programming functions are used on different language editions of Wikipedia involves a lot of copying and pasting. If someone wants to calculate the age of someone for a biography in their native language, they may need to first go to English Wikipedia for example, find the community authored programming function that calculates ages in English, then copy and paste it to their non-English Wikipedia. This process is error prone, can lead to code duplication, and worse, improvements to functions on one language edition may not ever make their way to other language editions.
Wouldn't it be easier if all of these functions were instead available centrally and people didn't have to go through this manual process?
This Outreachy task is about an important first step: finding the different community authored functions that are out there and helping to prioritize which ones would be good candidates for centralizing for Abstract Wikipedia and its centralized wiki of functions.

Mentor

Adam Baso

Task partner

Liudmila (Jade) Kalina

Blog posts

  1. Internship progress
  2. Getting started with Outreachy
  3. Struggle and Grow
  4. What is Abstract Wikipedia
  5. Modifying Expectations
  6. Future Goals

Outreachy Internship Updates

Week 1 (1-7 Dec, 2020)

Created a Wikitech account. Connected Phabricator to Wikitech and MediaWiki. Set up 2FA everywhere. Created and set a committed identity on my MediaWiki and Meta-Wiki user pages. Read about Gerrit, set up MediaWiki-Docker, ran the PHP unit tests, and set up git-review, etc., following the How to become a MediaWiki Hacker page.

Joined the required mailing lists and IRC channels (these channels have been so much help). Read the very awesome paper on Abstract Wikipedia. Also read the Wikimedia Engineering Architecture Principles, the Movement Strategy, and the WMF values and guiding principles. Started reading Wikipedia @ 20 and Debugging Teams.

Wrote my first blog post for people looking to intern through Outreachy and join Open Source: Blog.

Challenges and lessons

  • Learned about IRC and how it works. It is a very efficient way to connect with the open source community through various channels.
  • Got more familiar with Docker
  • Ran into issues when creating the committed identity and setting it up on my user page. Through this I learned about templates in wikis and how to navigate around wikis to find more useful templates.

Week 2 (8-14 Dec, 2020)

There were several new things I had to read about and understand this week, among them Toolforge, version control in Toolforge, the Grid, cron jobs, the MediaWiki databases, and the MediaWiki API.

I created a tool and tested out various things to get comfortable working in this environment. I also experimented with local Python, working in Toolforge through the terminal, and PAWS Jupyter notebooks to figure out which way suits me best. I ended up working in PAWS, as it can connect to the databases easily and export my finished code as Python scripts to our GitHub repo. I also tested running jobs on the Grid and setting up dummy cron jobs, and poked around the database a little from the terminal, PAWS, and locally.
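
For reference, a minimal sketch of what such a replica connection from PAWS or Toolforge can look like, assuming pymysql and the standard ~/replica.my.cnf credentials file; the replica host, database, and query below are illustrative rather than the exact ones used in the project:

```python
# Minimal sketch: query a wiki replica from PAWS/Toolforge with pymysql.
# Assumes the standard ~/replica.my.cnf credentials file; the replica host,
# database, and query below are illustrative.
import os
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",   # illustrative replica host
    database="enwiki_p",                              # "_p" suffix = public replica
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
)
try:
    with conn.cursor() as cur:
        # Count pages in the Scribunto Module namespace (namespace id 828)
        cur.execute("SELECT COUNT(*) FROM page WHERE page_namespace = 828")
        print(cur.fetchone()[0])
finally:
    conn.close()
```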

Challenges and lessons

  • Understanding how Toolforge and the Grid work took me some time. Navigating the database was also a bit intimidating, but that was kept for a later, more in-depth exploration.
  • In order to connect to Toolforge, I set up my keys and learned more about SSH. I also learned about scp file transfer and was able to copy files over to my local environment for easier access. Later we will start working to and from databases, which will be a smoother experience since we can connect to the database from a local PC as well.
  • Jade and I discussed how to get started with the task at hand and divided it into two initial tasks: creating a parser to get all wiki page links, and collecting the contents of all Scribunto modules using the API, the DB, or both. See our tasks in Phabricator T263678.

Week 3 (15-21 Dec, 2020)

Started creating scripts to collect the contents of Scribunto modules across all wikis. Set up the script on the Grid to run every day as a cron job. For now it collects all the contents fresh every day. Since the API can miss some pages, I collected the page list (id, title) from the database as well to check against the page list collected from the API. Note that the DB does not contain wiki contents; contents are only returned by the API.
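
As a rough illustration of the API side (not the project's actual script, which lives in our GitHub repo), listing the Module-namespace pages of a single wiki with `requests` might look like this; the wiki URL is just an example:

```python
# Sketch: list every page in the Module namespace (ns 828) of one wiki through
# the MediaWiki API, following the API's continuation tokens.
import requests

API = "https://en.wikipedia.org/w/api.php"   # illustrative wiki

def iter_module_pages(session):
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": 828,   # Scribunto Module namespace
        "aplimit": "max",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params).json()
        yield from data["query"]["allpages"]   # dicts with pageid, ns, title
        if "continue" not in data:
            break
        params.update(data["continue"])        # carry the continuation token forward

with requests.Session() as s:
    pages = list(iter_module_pages(s))
    print(len(pages), "Module pages found")
```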

Next I compared the page lists from the DB and the API and found some inconsistencies in both places. I had a bunch of issues when loading from CSV files due to various symbols in the contents (commas, quotes), and also some broken rows caused by multiple crons writing to the same file at the same time. These are to be fixed next.

Wrote my second blog post about my struggles: Blog.

Challenges and lessons

  • To make the 'content fetcher' work I had to trial the number of cron jobs I could run while not running out of memory in Python. I ended up running 16 cron jobs (the max limit) and dividing all the wikis into 16 parts. Each cron job fetches contents from ~50 wikis and writes to disk frequently to avoid running out of memory.
  • While trying to compare two large CSV files in pandas (page id and wiki only, still large!), I ran into tons of memory issues. I had to make very careful decisions about how to write my code after getting lots of out-of-memory errors; a sketch of the comparison is after this list. See more about how I handled these in my internship progress blog.
  • I discovered the race condition happening on my files (quite obviously) while trying to read them. I will solve this next by switching from files to a database for the writes.
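
The sketch referenced in the list above: a memory-conscious way to compare the two page lists, loading only the needed columns and letting an outer merge flag the mismatches. File and column names are illustrative.

```python
# Sketch: compare the API and DB page lists without loading whole CSVs.
# File names and columns are illustrative.
import pandas as pd

cols = ["wiki", "page_id"]
dtypes = {"wiki": "category", "page_id": "int64"}
api_pages = pd.read_csv("api_pages.csv", usecols=cols, dtype=dtypes)
db_pages = pd.read_csv("db_pages.csv", usecols=cols, dtype=dtypes)

# An outer merge with indicator=True labels each row 'left_only' (API only),
# 'right_only' (DB only), or 'both'.
merged = api_pages.merge(db_pages, on=cols, how="outer", indicator=True)
api_only = merged[merged["_merge"] == "left_only"]
db_only = merged[merged["_merge"] == "right_only"]
print(len(api_only), "pages only in the API list;", len(db_only), "only in the DB list")
```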

Week 4 (22-28 Dec, 2020)

This week I finalised my content fetcher. All code was transitioned to take input from and write output to the database. I cleaned and tested the database transition, fixed some more memory errors, and divided the cron jobs further to take advantage of parallel processing. Due to some large wikis (e.g. enwiki, frwiki), some jobs take up to 60 minutes. Cron job re-arrangement was also necessary to fix some of the memory issues, probably caused by the large content of individual pages. Another script was set up to fetch page ids from the database and fetch their content through the API; a sketch of that call is below. Using the database made updating, searching, and deleting the collected information much easier and more robust. Some analysis was done on pages found only through the API or only in the DB. Pages that have no content or were missed multiple times (due to not having content or not being Scribunto modules) were deleted. Task done here.
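
The content-by-page-id call sketched below is illustrative (the real script batches ids and loops over all wikis); it assumes `requests` and uses the standard `prop=revisions` query:

```python
# Sketch: fetch the source of pages by page id through the MediaWiki API.
# The wiki URL and page id are illustrative; real requests batch up to 50 ids.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_contents(pageids):
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "pageids": "|".join(str(i) for i in pageids),
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API, params=params).json()
    return {
        page["pageid"]: page["revisions"][0]["slots"]["main"]["content"]
        for page in data["query"]["pages"]
        if "revisions" in page   # pages with no content are skipped
    }

contents = fetch_contents([12345])   # hypothetical page id
print(list(contents))
```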

Finally, I started analyzing the database (work progress here). From the database analysis I will select the relevant page information that we will use for data analysis later on.

Challenges and lessons

  • Working to and from the database is a big relief, and data manipulation became much easier. But it could only be done from Toolforge. To fix this I set up SSH tunneling when working locally. I still had to change the code to make it compatible with Toolforge before commits; to fix this more permanently, a task has been opened.
  • Setting up the cron jobs again was a challenge, as I had to wait an hour every time to check how long each job took. I set up timing on the scripts for that, then tried to divide the heavier scripts into multiple parts.

Week 5 (29 Dec, 2020-4 Jan, 2021)

Almost done exploring the databases this week (work progress). I explored all the tables, tried to understand what they hold and how the information may be useful to us. Spent some time finding out what pagelinks, langlinks, iwlinks, and templatelinks are and how they differ, to make sure they don't overlap. Details are in my internship progress blog. I set up queries to fetch data from all the wiki databases, save it in the user database, and set up cron jobs for the same. Unexpectedly, fetching from the DB is taking much longer than fetching data from the API did. Segmentation faults and memory errors had to be solved with `LIMIT OFFSET` queries, which made the queries even slower but able to fit into memory (chunksize in the dataframe failed me); a sketch of this paging is below.
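
A minimal sketch of the `LIMIT OFFSET` paging, assuming an open connection `conn`; the table, columns, and chunk size are illustrative:

```python
# Sketch: page a heavy query with LIMIT/OFFSET so each chunk fits in memory.
# Table, columns, and chunk size are illustrative; `conn` is an open DB connection.
import pandas as pd

def fetch_in_chunks(conn, chunk_size=50_000):
    offset = 0
    while True:
        query = (
            "SELECT pl_from, COUNT(*) AS n_links FROM pagelinks "
            f"GROUP BY pl_from LIMIT {chunk_size} OFFSET {offset}"
        )
        chunk = pd.read_sql(query, conn)
        if chunk.empty:          # no more rows left
            break
        yield chunk
        offset += chunk_size

# results = pd.concat(fetch_in_chunks(conn), ignore_index=True)
```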

Wrote my third blog post introducing Abstract Wikipedia and our work: Blog.

Challenges and lessons

  • Typically pandas chunksize gives a generator that fetches data from the DB with read_sql in chunks. But sometimes pandas tries to load the entire result into memory and hands out chunks from there (causing segmentation faults). In the end I used `LIMIT OFFSET` in SQL instead of the dataframe chunksize. Slow, but it works. A time-memory tradeoff.
  • I couldn't create views or even temporary tables in read-only mode. MySQL also doesn't support the WITH clause, so in some places I had to repeat a sub-query within a query, where the sub-query itself was quite expensive (joining 4 tables!). Queries ran very slowly in this case.
  • I was able to fix minor bugs and get things working locally, but when trying to diagnose memory issues, race conditions, deadlocks on the DB, etc., I had to turn to Toolforge. I couldn't rapidly prototype on Toolforge, plus sometimes it gets REALLY slow.
  • When running the cron jobs, I ran out of the number of open connections I could have. So I had to reduce the number of crons running and pair them up based on their runtimes so that no single cron would take too long to finish.
  • Another issue I ran into was that I created Python generators conditionally. Basically I had a flag that said either yield or do something else. But with the yield keyword the function becomes a generator, so it failed to do the other things, and the worst part was that it wouldn't raise any errors. I used the VS Code debugger and whatnot, finally caught the error, and modified my code accordingly; a minimal reproduction is below this list.
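
A minimal reproduction of that generator pitfall (illustrative names, not the project code):

```python
# A single `yield` anywhere in a function turns the whole function into a
# generator, so calling it only builds a generator object: the "else" branch
# never runs and no error is raised.
def process(rows, as_stream=True):
    if as_stream:
        for row in rows:
            yield row
    else:
        print("saving", len(rows), "rows")   # silently never executed

process([1, 2, 3], as_stream=False)          # no output, no error

# Fix: keep the generator in its own function and dispatch from a normal one.
def stream(rows):
    yield from rows

def process_fixed(rows, as_stream=True):
    if as_stream:
        return stream(rows)
    print("saving", len(rows), "rows")

process_fixed([1, 2, 3], as_stream=False)    # prints as expected
```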

Week 6 (5-11 Jan, 2021)

Finalized collecting data from the databases. Incorporated my mentor's suggestion about query optimization and decided not to use iwlinks for now, as it was taking multiple days to run and thus slowing down Toolforge for other purposes. Introduced better exception handling and error reporting, and edited the code so that queries are retried a few times before failing. Also set up 'fetch missed contents' scripts.

Finished collecting pageviews of pages that transclude modules using the PHP and REST APIs. Both threw errors of their own; I ended up using the PHP API.

Challenges and lessons

  • While handling MySQL errors I learned to handle errors better with efficient use of the try-except-finally block, especially the `finally` block. I also learned about deadlocks in queries and set up retries to handle those as well; a sketch of the retry loop is below this list.
  • By default MySQL does not do case-sensitive searches with the `WHERE` clause. `LIKE` seems to do case-sensitive searches, but its results were not consistent between scripts and the terminal for some reason. I read up on collations in MySQL tables but could not accomplish much with that information. At last I stumbled upon the simplest solution: `WHERE col1 = BINARY col2`. Notice that this performs much faster than `WHERE BINARY col1 = col2`.
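
The retry loop mentioned in the first bullet, as a rough sketch assuming pymysql; the error codes, wait time, and query handling are illustrative:

```python
# Sketch: retry a query a few times on deadlock / lost-connection errors,
# always releasing the cursor in `finally`.
import time
import pymysql

def run_with_retries(conn, query, retries=3, wait=60):
    for attempt in range(retries):
        cur = conn.cursor()
        try:
            cur.execute(query)
            return cur.fetchall()
        except pymysql.MySQLError as exc:
            code = exc.args[0]
            # 1213 = deadlock, 2013 = lost connection (MySQL error codes)
            if code in (1213, 2013) and attempt < retries - 1:
                time.sleep(wait)   # back off, then retry
                continue
            raise
        finally:
            cur.close()            # runs whether the query succeeded or not
```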

Week 7 (12-18 Jan, 2021)

This week I started data analysis of the various numeric data in the user tables. There were lots of nulls, which I analysed by looking into the respective wiki database tables, and concluded that certain columns in our user database need a default value of 0. The next most important observation is that the data is HIGHLY skewed. The only way to visualize anything is to plot on a log scale, as sketched below. So I viewed the data in small intervals, and for each column I tried to find some very basic initial heuristic to identify which modules are important.
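
For illustration, viewing one such heavily skewed count column on a log-log scale might look like this (synthetic data standing in for the real column):

```python
# Synthetic heavy-tailed counts standing in for a skewed column such as the
# number of pages a module is transcluded in.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.pareto(a=1.2, size=300_000).astype(int)

bins = np.logspace(0, np.log10(values.max() + 1), 50)   # log-spaced bins
plt.hist(values + 1, bins=bins)                          # +1 so zeros are plottable
plt.xscale("log")
plt.yscale("log")
plt.xlabel("transclusion count + 1")
plt.ylabel("number of modules")
plt.show()
```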

As for pageviews, the script is still running to fetch 60 days of pageviews for all pages. Since it is taking multiple days to run, I changed the script to fetch weekly instead of daily for later runs, but I think I might have to change it to fetch monthly instead.

Wrote my fourth blog about Modifying Expectations: blog

Challenges and lessons

  • When all the cron jobs ran, I got some `deadlock` and `lost connection` errors. To fix this I set up my code to loop through a few retries before throwing errors, but the error was not getting caught. The problem was with `pd.read_sql`, which was eating up the error. I ended up making several changes to use `cur.execute` instead and build the dataframe afterwards (see the sketch after this list). I also changed my error handling to use nested `try-except` blocks, which finally made the retries work.
  • Data fetching seems to be a continuous task that will need changes and corrections the deeper we get into the data analysis and figure out better ways to fetch data.
  • With this large amount of data and its highly skewed nature, most of the data analysis becomes manual (which I would otherwise do with interactive plots to discover insights more easily).
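
The `cur.execute` change referenced in the first bullet above, roughly sketched (assuming a pymysql-style connection):

```python
# Sketch: run the query through a cursor (so MySQL errors surface and can be
# retried) and build the dataframe afterwards, instead of using pd.read_sql.
import pandas as pd

def query_to_df(conn, query):
    with conn.cursor() as cur:
        cur.execute(query)                           # errors propagate from here
        rows = cur.fetchall()
        columns = [desc[0] for desc in cur.description]
    return pd.DataFrame(rows, columns=columns)
```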

Week 8 (19-25 Jan, 2021)

Continued with data analysis for the numeric columns. Took some time to analyse transcluded_in and transclusions in more depth. For each column tried to find some heuristic values to determine which modules may be more important. Did some source code analysis as well.

Page protection seems to have some new values that I couldn't understand. Learned from an answer on IRC that recent pages have page protection values in terms of user rights, i.e. a user with `X` rights can edit/move pages with `X` edit/move protection. Read up and tried to understand this more. Concluded that this isn't universal across wikis, so I stuck to the old page protection values, as those are the majority.

Challenges and lessons

  • Refactored code to move around where the connection and cursor objects are opened and closed. I had to ensure the connection and cursor were closed before the script went into a 1-minute sleep in order to retry a query.
  • Checked that the `mysql connection lost` error does not occur in PAWS, and Jade tested it locally, where it works too! Created a bug report in Phab T272822 and asked on IRC.

Week 9 (26 Jan-1 Feb, 2021)

Closed the db-fetching task in Phabricator (T270492) after fixing bug T272822. The issue with the `mysql connection lost` errors was that we were running on the `web` cluster instead of the `analytics` cluster, as defaulted by the Toolforge Python library. Tested by running all the scripts; everything seems to work now.

Finished the data analysis, created a PDF summary, and shared it with others. Merged the db-fetcher into the develop branch and incorporated feedback from Jade.

Wrote my fifth blog post on my Future Goals.

Challenges and lessons

  • Fixed the pageviews script to get proper page titles. I noticed that API and database page titles differ (Module:a or b vs a_or_b, for example), so I fetched page titles from the page table using the page id instead of manual intervention.
  • Refactored the pageviews script to fetch monthly data. Cleared the whole Scripts table and re-fetched everything due to a schema change (certain columns' default value changed from NULL to 0). This also meant re-starting the pageviews script that had been running for 15+ days.

Weeks 10-11 (2-15 Feb, 2021)

Started applying the findings from the data analysis to find important modules. As evident from my data analysis, the data we have is heavily skewed. We want pages with more transclusions, more page links, etc. to count as important modules, but the number of such modules is very low and gets lost in the 99.99999...th percentile. To fix this I changed the distribution slightly and regarded the percentile a value falls in as its score; see the details of how I did that in the Phabricator task T272003, and a sketch below. Finally, a module score is calculated as a weighted sum of the feature scores. The weights can be altered by the user, for example to prioritize the number of lang links over transclusions.
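
A toy sketch of the scoring idea, with illustrative feature names and weights; the actual scheme and weights are documented in T272003:

```python
# Toy sketch: each feature value becomes the percentile it falls in, and the
# module score is a weighted sum of those per-feature scores.
import pandas as pd

def score_modules(df, weights):
    scores = pd.DataFrame(index=df.index)
    for feature, weight in weights.items():
        # rank(pct=True) = fraction of modules with a value <= this one
        scores[feature] = df[feature].rank(pct=True) * weight
    return scores.sum(axis=1)

features = pd.DataFrame({
    "transcluded_in": [2, 150, 0, 9000],
    "langlinks":      [0, 4, 0, 60],
})
weights = {"transcluded_in": 0.7, "langlinks": 0.3}   # illustrative weights
print(score_modules(features, weights))
```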

Weeks 12-13 (16 Feb-2 Mar, 2021)

After our weekly discussion, Jade and I decided to split our tasks once again. I continued working on the similarity analysis that Jade had started, while Jade started building the web interface. Week 12 was a hectic week: I was buried in tons of experiments and had to come up with a way to find modules similar to each other. Jade had started by using Levenshtein distances as features and performing DBSCAN clustering on them. This approach was problematic in that the Levenshtein distance is too slow to compute and takes `n x n` memory to build the distance matrix, which wouldn't be possible with our ~300k modules.

To fix this I started out fresh and looked for other ways to build features. After a lot of experiments (see details in the Phabricator task T270827) I decided to go with FastText word embeddings as features and the OPTICS clustering algorithm (sketched below). Next, I dealt with the high amount of noise that the algorithm detects by tuning it a bit, creating some pseudo-clusters from the noise, and finding ways to relate the clusters themselves. All of this is documented in a PDF uploaded to the Phabricator task.
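
A condensed sketch of that pipeline, assuming gensim's FastText and scikit-learn's OPTICS; the tokenization, hyperparameters, and toy module sources are illustrative, and the real experiments are documented in T270827:

```python
# Condensed sketch: embed each module's source with FastText, average the token
# vectors into one vector per module, then cluster with OPTICS (noise = label -1).
import numpy as np
from gensim.models import FastText
from sklearn.cluster import OPTICS

sources = {   # toy module sources
    "Module:Example_a": "local p = {} function p.age(frame) return 1 end return p",
    "Module:Example_b": "local p = {} function p.age(frame) return 2 end return p",
    "Module:Other":     "local t = {} t.render = function() return 'x' end return t",
}
tokenized = [src.split() for src in sources.values()]

# Train a small FastText model on the module sources themselves
model = FastText(sentences=tokenized, vector_size=50, window=5, min_count=1, epochs=10)

# One vector per module: the mean of its token vectors
X = np.array([np.mean([model.wv[tok] for tok in toks], axis=0) for toks in tokenized])

labels = OPTICS(min_samples=2).fit_predict(X)
print(dict(zip(sources, labels)))
```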

Final Update

All of our work is now accessible through the web interface of our tool: abstract-wiki-ds.toolforge.org. You can select some or all wiki projects, some or all languages, and give your own weights (or use the defaults) to generate a list of 'important modules' based on their scores. Click on any module to get a list of modules similar to it. Now users can easily start the process of merging modules and move towards a more language-independent Wikipedia - Abstract Wikipedia!




Committed identity: 272c9b4c61b72b18be37261522c41fc2f8687bbdae7e364207943454f2ec695d6d43d23602ed2eb2d24babffd47434292e58fb301b65cfbf46c12a334544c6fb is a SHA-512 commitment to this user's real-life identity.