WikidataEntitySuggester/Proposal

Wikidata Entity Suggester

Public URL: Entity Suggester
Bugzilla report: Entity Suggester : Bug #46555, Entity Selector sort : Bug #45351
Mailing list thread: wikitech-l archive link
Announcement: proposal announcement
Extension page: Extension:WikidataEntitySuggester
Documentation: WikidataEntitySuggester
Progress Reports: WikidataEntitySuggester/Progress

Name and contact information

Name: Nilesh Chakraborty
Email: nilesh@nileshc.com
IRC or IM networks/handle(s):
- jabber: nilesh@nileshc.com
- freenode nick: nileshc
Location: Kolkata, India (UTC+05:30)
Typical working hours: 15:00–24:00 IST (9:30–18:30 UTC)

Synopsis

Summary

Wikidata authors spend a considerable amount of time finding the right properties and the values for them. This project is meant to make that task easier. It has three goals: (i) suggesting properties relevant to the context (which depends on the item being edited), (ii) suggesting values for the recommended properties or for a new property that the author starts with, and (iii) making the sorting mechanism of the entity selector smarter so that more relevant properties appear at the top. A collaborative filtering approach will be used to suggest the properties and do the sorting. To suggest values, a separate approach (collaborative filtering, complex SQL queries) will have to be used for each type of property.

Benefit to Wikidata

This project will make adding a new item to Wikidata easier and much more efficient for authors, since they will receive real-time recommendations for properties and values instead of having to come up with all the properties themselves every time. The ordering of properties under an item will also be improved.

Project overview and implementation ideas

I will describe my initial ideas for each of the three objectives. (Please see the last section on this page for info on a prototype I'm building.)

Suggesting properties :
1. Write a map-reduce job with Apache Hadoop to parse the latest wikidatawiki pages-meta-current dump and extract user-item pairs (here, item-property pairs) with the required metadata. This will be used to train the recommendation engine. Let's call this Dataset 1.
2. Feed the user-item pairs and required metadata (if any) in Dataset 1 into Myrrix, where 'user' means a Wikidata item (e.g. New York City) and 'item' means a property. Using collaborative filtering, properties will be recommended for items.
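As a rough illustration of step 2, the sketch below recommends properties from item-property pairs using plain co-occurrence counts. It is a simplified stand-in for the idea of collaborative filtering, not Myrrix's model or API; the function names and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(pairs):
    """Count how often two properties appear together on the same item."""
    props_by_item = defaultdict(set)
    for item, prop in pairs:
        props_by_item[item].add(prop)

    cooc = defaultdict(Counter)
    for props in props_by_item.values():
        for a, b in combinations(sorted(props), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def suggest_properties(cooc, existing, top_n=3):
    """Rank properties missing from an item by summed co-occurrence
    with the properties the item already has."""
    scores = Counter()
    for prop in existing:
        for other, count in cooc.get(prop, {}).items():
            if other not in existing:
                scores[other] += count
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [prop for prop, _ in ranked[:top_n]]


# Hypothetical (item, property) pairs standing in for Dataset 1.
sample_pairs = [
    ("city_a", "country"), ("city_a", "population"), ("city_a", "mayor"),
    ("city_b", "country"), ("city_b", "population"),
    ("city_c", "country"), ("city_c", "mayor"),
    ("person_a", "place of birth"), ("person_a", "country of citizenship"),
]

cooc = build_cooccurrence(sample_pairs)
print(suggest_properties(cooc, {"country", "population"}))  # ['mayor']
```

A real recommender would also weight properties by how common they are overall, which this sketch deliberately leaves out.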
Suggesting values :
1. This will need an individual approach for each property or type of property. At first I will write a map-reduce job that parses the latest wikidatawiki pages-meta-current dump, as in the first objective, to yield a different kind of dataset with the unnecessary info stripped off (details will be decided after some more experimentation). Let's call this Dataset 2. It will most probably be fed into an SQL database. (I'll investigate the possibility of a NoSQL one too, but it's unlikely, since JOINs and complex queries may have to be performed on this data.)
  Let's consider a few examples now:
  1. Place-oriented properties like place of birth and country of citizenship: suggest values based on values that have already been entered, i.e. a person's place of birth is very likely to lie within one of their countries of citizenship.
  2. Relationship-related properties such as Father/Son: if the item being edited is already listed as the Father or Son of another item, the corresponding Father or Son value can easily be suggested. The same holds for aunts, uncles, spouses etc.
  3. Properties like alma mater, occupation, field of work, employer and any Music-related property (performer, composer, producer, record label etc.): I think it'll be a good idea to use collaborative filtering to suggest values for properties like these, since they often overlap or show "A is a member of this, so probably A is also a member of that"-style characteristics.
  4. Product and Literature based properties should respond well to a collaborative filtering approach too. I need to experiment on individual types of properties to see which method fits.
2. Investigate using a collaborative filtering method similar to the one described under "Suggesting properties".
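The relationship case in example 2 can be sketched as a simple lookup over existing statements. This is only an illustration under assumptions: the (subject, property, value) triple layout, the property names and the inverse table are hypothetical, and a real version would query Dataset 2 instead of an in-memory list.

```python
# Hypothetical table mapping a property to the property that should
# appear on the item at the other end of the statement.
INVERSE_PROPERTY = {
    "father": "child",
    "mother": "child",
    "spouse": "spouse",
}


def suggest_inverse_values(statements, item, inverse=INVERSE_PROPERTY):
    """Suggest (property, value) pairs for `item` based on statements
    in which `item` already appears as the value."""
    # Statements the item already has, so they are not suggested again.
    existing = {(prop, value) for subj, prop, value in statements if subj == item}

    suggestions = []
    for subj, prop, value in statements:
        if value != item or prop not in inverse:
            continue
        candidate = (inverse[prop], subj)
        if candidate not in existing and candidate not in suggestions:
            suggestions.append(candidate)
    return suggestions


# Hypothetical statements as (subject, property, value) triples.
sample_statements = [
    ("son", "father", "dad"),
    ("daughter", "father", "dad"),
    ("dad", "child", "son"),      # already recorded on the edited item
    ("wife", "spouse", "dad"),
]

print(suggest_inverse_values(sample_statements, "dad"))
# [('child', 'daughter'), ('spouse', 'wife')]
```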
Improving the sorting order of properties on an item page : This can be done with a similar collaborative filtering approach, treating Wikidata item-property pairs as user-item pairs and recommending "items" to "users".
Sorting the order of items in search results, however, will require some other heuristics.

Deliverables

Project goals and prime deliverables

Build an entity suggester module, trainable on existing datasets, that suggests properties for statements.
Add support for suggesting values (for a selected few types of properties).
Make the sorting mechanism of the entity selector smarter so that more relevant properties appear at the top.

Tentative roadmap/timeline

The following is a breakdown of how I wish to complete this project. Each stage, ranging from a few days to more than a week, has a tentative time period or deadline and a specific task to be done:

Time Period	Tasks
Up to May 21	Research : Decide on the best implementation method for each objective. Work out more implementation details and map-reduce job details, and get familiar with the MediaWiki source code and with how to integrate new functionality or Java .jar files into Wikidata. I will have my university exams from May 21st to June 11th. I will be available during that period, but will not be able to start coding. During this time we can hold the necessary discussions and work out more details about my objectives.
June 11 - June 20	Set up Apache Hadoop on the MediaWiki vagrant server. Write the mapper and reducer for objective 1 to parse the Wikidata dump.
June 21 - June 25	Feed user-item pairs from the output of the reducer into Myrrix and try out recommendations. Ask mentors and the community about documentation regarding integration into Wikidata. Write a simple JavaScript recommender client that will be inserted into the item page. It will call the Myrrix REST API via AJAX and retrieve recommendations.
June 26 - June 30	Test usability with mentors and the community. Write documentation. Make bug fixes.
July 1 - July 7	Write another map-reduce job to parse the data dump and start adding support for suggesting values, beginning with Music related properties.
July 8 - July 11	Test usability. Fix bugs. Write documentation. Consult with mentors regarding accuracy of suggestions.
July 12 - July 20	Add support for suggesting values for Product and Literature related properties.
July 21 - July 23	Polish code, fix bugs if any.
July 24 - July 28	Add support for relationship and Place oriented properties. Plan for other kinds of properties.
July 29 - August 6	Buffer week. Implement support for other properties if planned.
August 7 - August 16	Experiment and decide the best methods for smart sorting of properties and items.
August 17 - August 22	Implement smart sorting of properties.
August 23 - August 28	Implement smart sorting of items.
August 29 - September 6	Finish integration with the Wikidata frontend.
September 6 - September 13	Write documentation. Check bug posts and fix them. Test usability.
September 14 - September 22	Finishing touches, last minute polishing, bug fixes, writing unit tests, improving documentation if needed.
Post GSoC	Optimize recommendation scripts if needed. Add support for more properties for value suggestion. Try making the property/item sorter more intelligent.

About you

I am a third-year undergraduate student of computer science, pursuing my B.Tech degree. In short, I love programming and it's pretty much what I do all day, if I'm not on occasion busy doing something else! I have unending enthusiasm for working on anything related to big data, data mining, machine learning and recommendation engines, and I enjoy researching those topics because I'm passionate about them.

Finding the idea of building an entity suggester for Wikidata on the MediaWiki GSoC ideas page was serendipity, if nothing else. If I could build something that makes the job easier for Wikidata authors and lets them become more efficient, it would be nothing short of fabulous. Since I have thorough experience with recommendation engines (both Apache Mahout and Myrrix), I believe that I can use my skills to the fullest and make the entity suggester quite possibly "the most awesomest wiki enhancement ever". :-)

Participation

I will make a weekly or bi-weekly post on my blog at nileshc.com about my progress on the project, the status of the milestones etc., and communicate with my mentor and the community via the wikitech-l mailing list. I will set up a page under my User page and make the same post there too. I will maintain documentation, HOW-TOs etc. on the Entity Suggester's wiki page. I'll post my monthly and weekly reports on this page.

Honestly, though, I'm not much of a blogger and prefer to focus on the work itself, with only a moderate amount of interaction.

I will preferably use this gerrit repository to track the source code.

Past open source experience

Honestly, I do not have a lot of published open source code. I am currently working on a Facebook friend-suggester that recommends friends based on the semantic similarity of their interests. Previously, as part of a team of 4, I worked on building an online interactive social college magazine from scratch (using Java EE/JSF, Websphere and DB2 server) and designed the database schema for it. Unfortunately, it was never completed. The database schema and use-case diagrams I designed are available here.

Any other info

Byrial has written a few C programs that have turned out to be really helpful to me. Please check out this link: http://www.wikidata.org/wiki/User:Byrial. Unfortunately, the last couple of Wikidata dumps seem to break those C programs, so I wrote my own Hadoop MapReduce scripts in Python.
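To give a feel for what such a script can look like, here is a minimal mapper/reducer sketch in the Hadoop Streaming style (lines in, tab-separated lines out). It is not the actual prototype code: it assumes, purely for illustration, an input with one JSON object per line carrying hypothetical "id" and "claims" fields, which is not the real dump format, and the function names are made up.

```python
import json
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'item<TAB>property' line per claim."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entity = json.loads(line)
        except ValueError:
            continue  # skip malformed records
        if not isinstance(entity, dict) or "id" not in entity:
            continue
        for prop in entity.get("claims", []):
            yield f"{entity['id']}\t{prop}"


def reduce_lines(sorted_lines):
    """Reducer: collapse sorted mapper output into 'item<TAB>p1,p2,...'.

    Like a streaming reducer, it relies on the input being sorted by key.
    """
    keyed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for item_id, group in groupby(keyed, key=lambda kv: kv[0]):
        props = sorted({prop for _, prop in group})
        yield f"{item_id}\t{','.join(props)}"


# Hypothetical input records in the assumed one-JSON-object-per-line format.
sample_lines = [
    '{"id": "Q1", "claims": ["P31", "P17"]}',
    "not json",
    '{"id": "Q2", "claims": ["P31"]}',
    "",
]

mapped = list(map_lines(sample_lines))
reduced = list(reduce_lines(sorted(mapped)))
print(reduced)  # ['Q1\tP17,P31', 'Q2\tP31']
```

Under Hadoop Streaming the two functions would live in separate scripts reading from standard input, with the framework doing the sort between them.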

This GitHub repo was the initial place where I began prototyping. I've moved my code to Gerrit. Please check the entity suggester's extension page which I'll update once the project matures.