Authorship Tracking
- Public URL: [1]
- Bugzilla report: [2]
- Announcement: wikitech-l
Name and contact information
Name: Michael Shavlovsky
Email: mshavlov@ucsc.edu
IRC or IM networks/handle(s): mshavlov on freenode
Location: San Francisco Bay Area.
Typical working hours: Always :-).
Synopsis
We propose to implement authorship tracking for the text of Wikipedia. The goal is to annotate every word of Wikipedia content with the revision in which it was inserted, and the author who created it.
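As a purely illustrative example (the representation and names below are ours, for this page only, and not the project's actual data model), the annotation can be pictured as attaching an origin to every token:

```python
# Hypothetical per-token authorship annotation: each token carries
# the revision where it first appeared and that revision's author.
annotated_text = [
    ("Authorship", 1001, "Alice"),  # inserted in revision 1001 by Alice
    ("tracking",   1001, "Alice"),
    ("algorithms", 1047, "Bob"),    # added later, in revision 1047
]
```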
We have been developing robust and efficient algorithms for computing this authorship information. The algorithms compare each new revision of a page with all previous revisions, and attribute any new content in the latest revision to its earliest plausible match in previous content. In this way, if content is deleted (e.g. by a vandal, or in the course of a dispute) and later re-inserted, the content is still correctly attributed to its original author. To achieve an efficient implementation, the algorithm keeps a specially-encoded summary of the history of a wiki page. The size of this summary is proportional to the amount of change the page has undergone; since we drop information on content that has been absent for more than 90 days and more than 100 edits, the summary is on average about 10 times the size of a typical revision. When a user creates a new revision, the algorithm:
- Reads the page summary
- Computes the authorship for the new revision, and stores it
- Stores an updated summary of the history, which now also includes the new revision.
The process takes about one second of processing time per revision, including the time to serialize and deserialize the summary, which is generally the dominant cost.
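To make the flow concrete, here is a heavily simplified sketch of the per-revision update, assuming a generic key-value store. It attributes single tokens by first occurrence only, whereas the real algorithm matches longer runs of content, handles deletion and re-insertion, and prunes old content from the summary; all names below are placeholders, not the project's actual code.

```python
import json

class KVStore:
    """Minimal in-memory stand-in for the eventual text/key-value store."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

def process_new_revision(store, page_id, rev_id, author, rev_text):
    # 1. Read and deserialize the page summary; in practice the
    #    (de)serialization is the dominant per-revision cost.
    summary = json.loads(store.get(("summary", page_id)) or "{}")
    origins = summary.setdefault("origins", {})  # token -> [rev_id, author]

    # 2. Attribute each token to its earliest plausible origin;
    #    tokens never seen before are credited to this revision.
    authorship = []
    for token in rev_text.split():
        origin = origins.setdefault(token, [rev_id, author])
        authorship.append([token, origin[0], origin[1]])
    store.put(("authorship", rev_id), json.dumps(authorship))

    # 3. Store the updated summary, which now includes the new revision.
    #    (The real algorithm also drops content that has been absent
    #    for more than 90 days and more than 100 edits.)
    store.put(("summary", page_id), json.dumps(summary))

store = KVStore()
process_new_revision(store, 42, 1001, "Alice", "authorship tracking")
process_new_revision(store, 42, 1047, "Bob", "authorship tracking algorithms")
```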
The algorithm code is already available and working. What we propose to do in this Summer of Code project is to make the algorithm run on the actual Wikipedia: integrating it with the production environment, text store, temporary database tables, and whatever else is required to make it work for as many language editions of Wikipedia as possible or desired.
Detailed Information
- Code repository
- Detailed description of the algorithm. The paper has been accepted at the WWW 2013 conference.
- Demo (this is a demo of the algorithm, not of the system to be built).
Deliverables
- Determine the implementation location: whether it is preferable to run the algorithm with access to the main Wikipedia database (faster, simpler), or to keep the process fully isolated and rely on the MediaWiki API.
- Design and implement data access. Currently, the algorithm accesses the database directly. Design a data access layer if necessary.
- Design and implement storage for authorship metadata. The metadata needs to be stored in a text-store or a key-value store; determine implementation and code it.
- Design and implement a public API. The proposed project does not include visualization of the authorship information; rather, the information will be made available to all via an API, so that innovative visualizations can be developed. An API to expose the information needs to be designed (probably easy, keyed on page_id and/or revision_id; see the sketch after this list).
- Document the API.
- Test, deploy, and iterate.
- Improve attribution [optional and ongoing]. As remarked in the paper, there are various ways of defining the earliest plausible attribution, some of which use information on the statistical frequency of terms in the whole Wikipedia, or in the page. Once everything works and we have a first version running, we will consider, if time permits, how to improve the content attribution. This process can also continue after the Summer of Code, via contributions from us and from others to the open code base.
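As a strawman for the API deliverable above, a response keyed on revision_id might look like the following; the endpoint name, parameters, and field names are all assumptions to be settled during the project.

```python
# Hypothetical request:  GET /authorship?revision_id=555
# A possible JSON response, mapping every token of the revision
# to the revision and author it is attributed to:
example_response = {
    "page_id": 42,
    "revision_id": 555,
    "tokens": [
        {"text": "Authorship", "origin_rev": 1001, "author": "Alice"},
        {"text": "tracking",   "origin_rev": 1001, "author": "Alice"},
        {"text": "algorithms", "origin_rev": 1047, "author": "Bob"},
    ],
}
```

Keeping the response token-aligned with the revision text would let visualization tools color or blame-map a page without re-parsing the wikitext themselves.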
About you
I am a Ph.D. student in Computer Science with a passion for making great ideas a reality. I like to see how my research contributes to people's lives. This project will give me a chance to implement and put into practice our algorithms: from theory to actual impact.
Participation
I plan to work on a regular basis, have weekly discussions with my mentor(s), and participate in IRC conversations. I like to set weekly goals and iterate quickly.
Past open source experience
I have some experience developing MediaWiki extensions.
- BlameMaps
- CrowdRanker: I am a contributor to the CrowdRanker project.
Any other info
The development of the algorithm was joint work with Luca de Alfaro at UCSC. Luca is a co-author of WikiTrust, a Wikipedia reputation system. Several users relied on WikiTrust to provide information on content origin in Wikipedia. However, WikiTrust is not ideal for this, for several reasons:
- WikiTrust is quite heavy computationally.
- WikiTrust is written in a way that makes its integration with the Wikimedia production environment difficult.
- The content tracking features of WikiTrust were an afterthought: WikiTrust was not optimized for providing accurate authorship information.
The current authorship tracking project started as a way to provide the most widely used output of WikiTrust in a form that is more precise and more efficient, with code that is easier to understand and maintain. All the code is written in Python, and it has already been tested for integration with small MediaWiki installations; this SoC project is really about making the leap to an implementation that can be useful on the real Wikipedia.
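For the fully isolated deployment option listed in the deliverables, the process would pull revisions through the standard MediaWiki web API rather than reading the database. A minimal sketch, using the real api.php query parameters but omitting error handling and continuation:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title, limit=50):
    """Fetch the oldest `limit` revisions of a page, with text and
    author, through the standard MediaWiki API."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|user|timestamp|content",
        "rvlimit": limit,
        "rvdir": "newer",  # oldest first: the order the algorithm consumes
        "format": "json",
    }
    data = requests.get(API_URL, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [])
```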