User:Omegat/OPW Report
Weekly progress report for FOSS OPW 2014, Round 9 project.
Community Bonding Period
Landing and meeting with mentors
After being selected as an OPW intern, my mentor gave me small introductory tasks to become more familiar with the API. The tasks were indeed very useful, as they will help me debug errors and failures when I create family files for sites using older wiki versions. My mentor has guided me really well throughout, giving me good hints to start a particular task while also letting me learn by exploring on my own!
Communication with mentors
I have been communicating with my mentors over the IRC channel #pywikibot and Google Hangouts. As I will be available during most of the internship period, I will seek their guidance and opinions almost daily. There has been no problem in communication so far. Time zones do not concern me much, as I share the same time zone as one of my mentors and there is enough overlapping time with my other mentor.
Lessons learnt
- Became familiar with the Phabricator-style portal for tasks and the project board.
- Determined which feature was added in which MW version.
- Modified/improved test files to skip/accommodate tests for features which are not present in a particular MW version.
Project plan and Deliverables for first half
The timeline for the project can be found here. As discussed with my mentor, family files will be submitted in batches and not individually. Meanwhile, I will keep debugging alongside and submit code changes regularly.
Phabricator and Gerrit
Weekly reports
Week 1 (Dec 13 - Dec 21)
- Created the inter-wiki.xls file.
- Added entries for all the sites on the IWM and filled in the details in the columns.
- Generated family files of the sites listed on IWM.
- Ran unit tests on the first half of the list.
Week 2 (Dec 22 - Dec 29)
- Worked on Wikia failures this week.
- Determined the version in which new mode was added.
- Analysed the Wikia Search failure.
- Fixed search errors and the paraminfo error (due to the new mode).
- Did error analysis of the sites to detect common errors recurring for wikis using the same version.
Week 3 (Dec 30 - Jan 5)
- Analysed the errors given by sites on the IWM. Created a bug for a common error. (T85667)
- Worked on token errors generated when the user doesn't have the necessary tokens. (Wikia failure)
- Had trouble with GitHub; took time to resolve a rebase problem.
- Revised the inter-wiki spreadsheet as it contained errors.
Week 4 (Jan 6 - Jan 14)
- Determined the wiki engine being used by the non-MediaWiki wiki sites on the IWM.
- Analysed each wiki engine's API to understand how it works.
- Tried to find the endpoints of the wiki engines and the versions being used by different sites.
Week 5 (Jan 15 - Jan 22)
- Read up on the XML-RPC documentation for JSPWiki.
- Looked for XML-RPC implementations among the wikis on the IWM (but couldn't find any).
- Worked on determining the site type (whether MW is used, and the version) if its family file does not exist.
Week 6 (Jan 23 - Jan 30)
- Submitted a patch for the detection of the site type.
- Wrote a script to run detect_site_type on all the IWM sites.
- Fixed bugs and errors arising from the script.
- Discussed an alternative way to detect an MW site, as a few sites on the IWM do not expose the "generator" (a rough sketch of the generator-based check follows this list).
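For reference, here is a rough sketch of the generator-based check, using the requests library directly rather than pywikibot; the helper name and the error handling are my own illustration, not the submitted detect_site_type code:

import requests

def probe_mediawiki(api_url):
    """Return the MediaWiki version string from siteinfo, or None."""
    params = {'action': 'query', 'meta': 'siteinfo', 'format': 'json'}
    try:
        reply = requests.get(api_url, params=params, timeout=30)
        general = reply.json()['query']['general']
    except (requests.RequestException, ValueError, KeyError):
        return None
    # siteinfo reports e.g. "MediaWiki 1.24.1"; sites without the
    # "generator" entry need the alternative detection discussed above.
    generator = general.get('generator', '')
    if generator.startswith('MediaWiki '):
        return generator.split(' ', 1)[1]
    return None

print(probe_mediawiki('https://www.mediawiki.org/w/api.php'))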
Week 7 (Jan 31 - Feb 6)
- Set up a wikia GitHub branch with Travis builds to track the status of Wikia-related fixes.
- Continued working on testing the detection code on IWM sites.
Week 8 (Feb 7 - Feb 15)
- Was sick for most of the week.
- Created a Phabricator task for further discussion of detect_site_type. (T88601)
- Fixed the encoding error arising from inappropriate error handling while retrieving data. (T88928)
- Wrote the code for an alternative implementation of detect_site_type.
Week 9 (Feb 16 - Feb 22)
- Discussed the ValueError problem of unpacking the exception with my mentor. This is being fixed in the following Gerrit change: https://gerrit.wikimedia.org/r/#/c/191042/
- We now call the detect_site_type method load_site, as it returns a Site object without needing the family files. AutoFamily is used to create an object of the Family class and, using that, a Site object is created (see the sketch after this list).
- Wrote a unittest for this method. https://gerrit.wikimedia.org/r/#/c/186339/
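A minimal sketch of that idea, assuming pywikibot's AutoFamily helper accepts a family code and a URL; the function name and defaults below are illustrative, not the code under review:

import pywikibot
from pywikibot.family import AutoFamily

def load_site_sketch(url, code='en'):
    """Return a Site for ``url`` without a family file on disk."""
    family = AutoFamily(code, url)                # one-off Family object
    return pywikibot.Site(code=code, fam=family)  # Site built against it

site = load_site_sketch('https://www.mediawiki.org/w/api.php')
print(site)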
Week 10 (Feb 23 - Feb 28)
- Had exams this week, hence couldn't work much.
Week 11 (March 1 - March 8)
- Worked on the load_site method and its unittest.
- Fixed errors raised by the unittest on the InterWiki Map (IWM).
- Wrote the Site class (a subclass of BaseSite) for non-MW sites upon discussion with my mentors. https://gerrit.wikimedia.org/r/#/c/196141/
Details
The First Week
For the first half of my internship, I am supposed to extend PyWikiBot (PWB) support to the sites listed on the InterWiki Map (IWM). PyWikiBot's support for sites using the MediaWiki engine is based around the creation of family files. I used a script that automatically generates these family files. Once this was done, I ran the test suite for each of these sites. Since the IWM consists of sites other than those using MediaWiki software, family files for many sites could not be created. For some, it seemed that the API was not enabled, and others gave various HTTP errors. I created a spreadsheet containing a detailed analysis of the IWM list, which can be found here.
An important skill needed for this part of my assignment is a good knowledge of the API and the ability to test sites running different MediaWiki versions. I was required to determine the version in which new API properties or changes were introduced. This analysis is necessary to debug the errors generated by running the test suite for different family files. Hence, besides extending support to the IWM list, this part of my assignment also aims at making the unittests more robust so that they tolerate the functionality of different MediaWiki versions.
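As a concrete illustration of what "tolerating" a version means in a test, here is a minimal sketch; it assumes that site.version() returns the MediaWiki version string and that MediaWikiVersion from pywikibot.tools compares such strings, and the test body itself is invented:

import unittest

import pywikibot
from pywikibot.tools import MediaWikiVersion

class NewFeatureTests(unittest.TestCase):

    def setUp(self):
        self.site = pywikibot.Site('en', 'wikipedia')   # placeholder site

    def test_new_feature(self):
        # Skip instead of failing on wikis that predate the feature.
        if MediaWikiVersion(self.site.version()) < MediaWikiVersion('1.25wmf4'):
            self.skipTest('feature requires MediaWiki 1.25wmf4 or later')
        # ... exercise the version-dependent feature here ...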
Wikia Failures
One of the major failures on Wikia is api_tests.TestParamInfo.test_new_mode. This error arises because 'the new mode' to fetch API parameter information was added in version 1.25wmf4, and hence all wikis before 1.25wmf4 will not pass this test. It took me quite some time and help to come to this conclusion. Usually, to detect in which version a particular feature was added, I analyse sites running different versions and test the API. If the response generated by the API gives an error code or an 'unrecognized value for parameter' warning, this means that the particular version doesn't support the feature. My initial hunch was that the new mode to obtain paraminfo was added in 1.25alpha, as translatewiki (given in the links below) handled the new mode correctly. I used a different approach to check that 1.25alpha is indeed the earliest release:
MediaWikiVersion('1.25alpha') < MediaWikiVersion('1.25wmf12')
To my horror, this returned False even though 1.25alpha was definitely released before 1.25wmf12. So I decided that 1.25wmf1 should be taken as the version in which new_param_info was added. This too was wrong!
Then my mentor suggested that this is more complicated than the way I was analysing it. If a wiki is running the code from git, it could be using code from before or after the release date of a version. The git hash of the code in use can be found via siteinfo: http://aniwiki.net/api.php?action=query&meta=siteinfo says it is using git-hash="8aaf4468411b77b3bff93302f8da5815744673ff" and git-branch="master".
And hence, my mentor found two wikis based on '1.25alpha' that did not work with the new mode! For example:
http://aniwiki.net/api.php?action=paraminfo&modules=query+infolink does not work even though the version is 1.25alpha. My mentor saved the day again!
Just as a side note: the new_mode can be accessed via the API as:
https://translatewiki.net/w/api.php?action=paraminfo&modules=query+info
While the old way (which has been deprecated since version 1.25wmf4) can be accessed as:
https://translatewiki.net/w/api.php?action=paraminfo&querymodules=info
So test_new_mode must be skipped for all sites using an MW version below 1.25wmf4, which has been done in this changeset.
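To check by hand which form a given wiki understands, a quick probe along the following lines can be used (plain requests rather than pywikibot; the helper is mine and only illustrative):

import requests

def paraminfo_modes(api_url):
    """Report whether the new and the old paraminfo requests are answered."""
    modes = {}
    for label, extra in (('new', {'modules': 'query+info'}),
                         ('old', {'querymodules': 'info'})):
        params = dict(extra, action='paraminfo', format='json')
        data = requests.get(api_url, params=params, timeout=30).json()
        info = data.get('paraminfo', {})
        # A wiki that understands the request returns a non-empty module list.
        modes[label] = bool(info.get('modules') or info.get('querymodules'))
    return modes

print(paraminfo_modes('https://translatewiki.net/w/api.php'))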
Another error which took us some time to resolve is the test error in site_tests.SiteUserTestCase.testSearch. In my initial attempt to resolve this bug, I added a parameter _search_disabled (which, if True, will skip the test) and a method _is_search_disabled (to update this parameter). That changeset, however, needs a lot more analysis before it can be merged. My mentor suggested that since Wikia has its own search API called "Wiki Search", we can skip this test by detecting this extension. Another interesting thing I learnt from this exercise is that some settings can be manually overridden; using $wgDisableSearchUpdate, Wikia has disabled the MW API search. Hence, using mysite.has_extension("Wikia Search"), the test has been skipped in this changeset.
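The skip itself boils down to one condition. Here is a sketch of the pattern, where the test class and the searched term are invented but has_extension is the Site method actually used in the changeset:

import unittest

import pywikibot

class SearchTests(unittest.TestCase):

    def setUp(self):
        self.site = pywikibot.Site('en', 'wikipedia')   # placeholder site

    def testSearch(self):
        # Wikia replaces the core search, so the MW search API is disabled.
        if self.site.has_extension('Wikia Search'):
            self.skipTest('Wikia disables the MediaWiki search API')
        results = list(self.site.search('wiki', total=5))
        self.assertLessEqual(len(results), 5)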
Errors for families using MediaWiki
In this post, I would like to talk about specific errors and the possible bugs causing them, as discussed with my mentors. As given in the spreadsheet, the errors in beachpedia and choralwiki are the most recurring. One of the recurring errors is page_tests.TestLinkObject.testHashCmp. With respect to translatewiki (which is nothing but the i18n site), this error meant that the i18n family file would also give the same error, and it turns out that it did! Even more to my horror, the link_tests, which I thought pertained only to the new family files I had created, gave errors on enwp (English Wikipedia) too! Upon analysing the code, I realized that those errors are family independent, as the site and code of the family are set manually for these particular tests. But despite that, the errors were not being reported by other developers! (Oh dang!)
To this, my mentor suggested that generating family files can break site.interwiki. So it turns out to be a parsing problem. My mentor created a task for this in Phabricator, T85658. Just a small note on Phabricator, which is the bug and project repository for MediaWiki: it is a great tool for collaborating on projects and maintaining tasks. I personally find it great to work with. :) Back to this queer problem: the method _get_path_regex returns the regular expression matching the path after the domain of the site. According to the traceback, it was unable to create a path for this site, which must be manually set in the family file. Upon checking the code, I realized it does not support families that contain multiple sites/entries. This method returns one common regular expression, and that expression must match all the sites. Not only should the regular expression match all the sites, it should also ensure that no other site matches it (an illustration of this requirement follows below). With this bug resolved, hopefully the link_tests and testHashCmp will not give errors.
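As a toy illustration of that requirement (not the actual _get_path_regex fix), a combined expression can be built by escaping each site's path and joining the alternatives; the example paths are made up:

import re

def combined_path_regex(paths):
    """One anchored regex that matches every given path and nothing else."""
    alternatives = '|'.join(re.escape(p) for p in sorted(set(paths)))
    return re.compile('^(?:' + alternatives + ')$')

family_paths = ['/wiki/$1', '/w/index.php']       # made-up example paths
pattern = combined_path_regex(family_paths)
print(bool(pattern.match('/wiki/$1')))            # True: path of a family site
print(bool(pattern.match('/other/index.php')))    # False: not in this family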
Another error I would like to talk about is specific to a wikia called doomwikia, which gave pagegenerator_tests errors based on RecentChangesPageGenerator and NewPagesGenerator. As suggested by my mentor, this error was being caused by there being very little data in NewPages and RecentChanges. Once again, his intuition was correct! I confirmed this to be the cause of the error by following these links:
http://ru.doomrus.wikia.com/wiki/Special:RecentChanges
http://doom.wikia.com/wiki/Special:RecentChanges
This clearly shows that the recent changes in the Russian doomwiki don't contain any data, while for the English doomwiki they do. The English doomwiki doesn't give these errors. This bug (T85667) still needs to be resolved, though.
Encoding-Decoding
Hello again! It has been a long time since I blogged. But instead of catching up on my work over the last month, I'd like to talk about something I worked on more recently. Two sites on the IWM failed to decode the data retrieved via HttpRequest using the following code snippet:
import pywikibot.comms.http

request = pywikibot.comms.http.fetch(url)   # url: the page to retrieve
data = request.content                      # decodes using request.encoding
It turns out that content uses request.encoding to determine the encoding format used by the site. The encoding method, however, also checks whether the decoding takes place without errors, and raises an error if a UnicodeError exception occurs during decoding. According to the traceback in the discussion on Phabricator (T88928), a UnicodeDecodeError exception occurred during decoding.
So if we take a look at request.raw, then request.raw[10980:11000] returns "'Verein f\xfcr Compute" (for the text "Verein für Compute"). As the Latin-1 code table shows, '\xfc' is the ISO/IEC 8859-1 (Latin-1) encoding of "ü". So even though the HTTP response header states that the encoding is "utf-8", this one character is encoded in Latin-1. My mentor contacted the site administrator and requested that they fix their output so that it only contains UTF-8 characters. All the characters are now encoded in UTF-8, which means that if you now do request.raw[10980:11000], you will get something like "'Verein f\xc3\xbcr Compute". Here, "\xc3\xbc" is the UTF-8 encoding of "ü". :)
On our end, we worked around this bug by using request.raw.decode(request.header_encoding, "replace"). Now the decode method uses the encoding which we pass to it directly via request.header_encoding, and in case of any UnicodeDecodeError the erroneous characters are replaced by the replacement character U+FFFD (�), thanks to the "replace" argument.
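Here is a standalone illustration of the difference, with a byte string containing the same stray Latin-1 byte as above (the surrounding text is made up):

# A mostly-UTF-8 byte string with one stray Latin-1 byte (0xfc = "ü").
raw = b'Verein f\xfcr Computer'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as error:
    print('strict decoding fails:', error)

print(raw.decode('utf-8', 'replace'))   # 'Verein f\ufffdr Computer'
print(raw.decode('latin-1'))            # 'Verein für Computer'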
Wrap Up Report
OPW Round 9 has finally come to an end. I learnt a lot during my internship, working towards making PWB more independent of family files and providing PWB support for other interfaces. I was able to implement the first half of the project, but the second half is far from over. Working on PWB has helped me gain an intricate understanding of how to build support for APIs; it is much more complicated than I had thought. But at least I have started on the second half of the project (and the basic support for non-MW sites has been merged!), and I intend to continue working on the project to finish what I have started. I also fixed many bugs that I came across, and it was fun to work on those small parts of the modules and functions. I would also like to thank my mentors, xzise and jayvdb, for guiding me whenever I got stuck. It wouldn't have been possible without them. In the end, it was a beautiful experience. I will keep contributing to PWB. PWB for life! Cheers!