Wikimedia Developer Summit/2017/tools-for-curating

Copy of https://etherpad.wikimedia.org/p/devsummit17-tools-for-curating


SESSION OVERVIEW

=====
  • Title: Tools for curating and organizing editing work: what has been done and where to go next
  • Day & Time: Monday, January 9, 2017 (1:30 – 2:40 p.m.)
  • Room: Chapel Hill
  • Facilitator(s):
  • Note-Taker(s): quiddity, Pau Giner
  • Topics for discussion:
    • CollaborationKit
    • Wikipedia Requests
    • Translate extension
    • Bots
    • Templates/Categories
      • Ambox templates on pages
    • Recommendation API
    • Specialpage Maintenance reports
    • SuggestBot


  • Chronology:
  • On the larger wikis, it's often a problem of finding things that haven't already been done. There are a lot of existing tools to help direct editors towards requests for assistance.
  • A process based on templates often results in the work being done much later. Sometimes the work is already complete, but nobody removed the tag(s) or knew they were supposed to.
  • This lets tasks that have some overlap get tracked in multiple locations. (E.g. WikiProjects Biography and Medicine)
  • I would like to merge the automated tasks and the human-suggested tasks so that they are tracked in the same location, making it easier to keep track of what needs to be done and what has been done.


  • What do you think are technical concerns with this proposal?
  • Q: There are some other entry-points... is the existing problem that they're not well-maintained? What's the problem with the existing human maintained ones?
  • A: There are multiple forking, duplicative, or abandoned processes. Lots of wheel-reinvention. We need a centralized service.
  • E.g. WP:Requested articles, but they're in various subpages, and can't easily be embedded or extracted or manipulated.
  • Q: One problem will be getting official WMF resources, because there are so many competing requests for resources.
  • A: I'm used to working with limited resources, on long-term scales. I'm trying to focus on technical-implementation concerns here.
  • Q: [?] How can we incorporate the massive backlog into this, without making it overwhelming? [?]
  • A: We could use a question-and-answer system that clears up the junk data, to help editors easily answer "Is this problem already solved?"
  • A: We could show the tags to editors during the save window, asking "did you resolve this issue within your edit?"
  • Q: Are the tags currently considered part of the metadata?
  • A: The templates often add the article to a category. However, the categories are a folksonomy and not well-organized or maintained. They are usually divided into sub-categories by year and month (e.g. https://en.wikipedia.org/wiki/Category:Articles_needing_cleanup_from_March_2008), with no easy way to do intersections.
  • A: Need a database table for article-quality issues.
  • PageAssessments now adds some of this. Could be extended to help with this.
  • Q: How do you see this overlapping with campaigns and other things in editathons? E.g. language team creating a "to-do list" for translation requests, and personal tasks.
  • A: Yes, should work on a system that can handle a generalized and re-usable set of requests, instead of per-type.
  • A: Looking at https://wpx.wmflabs.org/w/index.php/WPX:WikiProject_Dogs - this uses https://www.mediawiki.org/wiki/Extension:CollaborationKit - it gives a random sampling on the top level, and then a full list in the sub-pages, e.g. https://wpx.wmflabs.org/w/index.php/WPX:WikiProject_Dogs/List_1
  • What do you think the relationship could be between the translation campaigns and CollaborationKit?
  • Yes, convergence would be good.
  • (Note: https://www.mediawiki.org/wiki/Content_translation/Product_Definition/Campaigns and https://phabricator.wikimedia.org/T96147 ("[Epic] ContentTranslation - Lists of articles to translate")
  • 3 step process:
    • I can keep a list of articles I personally want to write
    • I want this list to be sharable, e.g. organizing a campaign
    • I want to track progress, and visualize and coordinate)
  • Perhaps CX can work with CollaborationKit, already.
  • 2 main aspects: There's the data-collection and storing, and there's how we want to present the lists to editors. We next need to work on a centralized database for this.
  • QUESTION: Not sure if new database, or new namespace, or other???
  • Need ability to easily Add, and easily Annotate, requests.
  • Q: A similar issue is happening with Education extension with lists.
  • A: Different groups may "collide" but these aspects can be adjusted (e.g., tweaking queries to exclude undesired entries from automated lists, such as porn stars in a women's biography worklist).
  • Q: What are the next steps? We seem to have a profusion of disjointed and dispersed update-lists and task-lists, how can we centralize?
  • A: We need to figure that out. Research is a starting point. From there, identify needs: the types of tasks and the types of users that benefit from these activities.
  • Q: The wishlist #22 is "An editor dashboard" - a better editor-homepage, which is usually the watchlist at the moment. We should think about closely integrating these ideas. (There are more and more tools where data is stored on labs or offwiki, which can cause problems.) https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Editing#CW2016-R022
  • Thoughts: Database needed, because multiple keys. Access by Lua - Provide API. Only CollaborationKit can edit the content, but anything can query it. Not tied to database tables, distinguish API methods from database methods. What faces the user can be more stable than the database table, which can be altered as needed.
  • Current WikiProject X design: All templates. CollaborationKit: ContentModel (JSON), but displayed as simple text list. Want to move to something better. Allow annotations of the automated entries. E.g. one source might be Wikidata. Should be able to annotate the proposed article with a note saying the topic isn't notable enough for a Wikipedia article.
  • Automated results may be transient, only being stored when people annotate them.
  • Might want to store different kinds of metadata for different request-types: e.g. potential plagiarism might have different requirements than a potential new article, or than a definite orphan. It's like Phabricator, where tasks have various transient states, and many metadata aspects, some of which will be unused.
  • Schema-wise, it will be frustrating to have to enable sorting by all the database types.
  • Page-assessments ...[?]
  • It's not clear what needs to be indexed.
  • A database table with an API. But how to surface it (e.g., a specialised namespace, or a structured namespace that fits the database)? How do people interface with individual tasks on wiki?
  • Q: Are there expected performance issues related to having database tables mapped to the structured information?
  • A: Probably not. Audience were not familiar with the particularities of Wikipedia DBs.
  • Q: Are tasks associated with an article?
  • A: Yes, but article does not need to exist.
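
The "database table for article-quality issues" raised in the discussion could look roughly like the following. This is a minimal Python sketch under stated assumptions, not an actual MediaWiki or PageAssessments schema: the table names (article_issue, project_page), their columns, and the sample rows are all hypothetical. It only illustrates that an intersection such as "pages in a given WikiProject with a given issue", which is awkward across month-by-month cleanup categories, becomes a single join.

```python
import sqlite3

# Hypothetical schema: one row per (page, issue) rather than one dated
# category per month, plus a table of WikiProject membership.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article_issue (
        page         TEXT NOT NULL,
        issue        TEXT NOT NULL,
        tagged_month TEXT NOT NULL,   -- e.g. '2008-03'
        PRIMARY KEY (page, issue)
    );
    CREATE TABLE project_page (
        project TEXT NOT NULL,
        page    TEXT NOT NULL,
        PRIMARY KEY (project, page)
    );
""")

# Made-up sample data; a page may belong to several projects.
conn.executemany("INSERT INTO article_issue VALUES (?, ?, ?)", [
    ("Example A", "cleanup", "2008-03"),
    ("Example B", "cleanup", "2009-11"),
    ("Example C", "unreferenced", "2008-03"),
])
conn.executemany("INSERT INTO project_page VALUES (?, ?)", [
    ("Biography", "Example A"),
    ("Medicine", "Example A"),
    ("Medicine", "Example B"),
    ("Biography", "Example C"),
])


def pages_with_issue(conn, project, issue):
    """Return (page, tagged_month) for pages in `project` tagged with `issue`."""
    return conn.execute(
        """
        SELECT i.page, i.tagged_month
        FROM article_issue AS i
        JOIN project_page AS p ON p.page = i.page
        WHERE p.project = ? AND i.issue = ?
        ORDER BY i.tagged_month, i.page
        """,
        (project, issue),
    ).fetchall()


medicine_cleanup = pages_with_issue(conn, "Medicine", "cleanup")
biography_cleanup = pages_with_issue(conn, "Biography", "cleanup")
```

Because the month is a column rather than part of a category name, the same query covers every month at once, and a page shared by two projects shows up in both worklists without being tracked twice.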

General consensus

The general consensus:

  • There is a database table.
  • Requests and metadata around requests are stored in the database table.
  • User-generated requests are recorded into the table by way of a namespace in each wiki, "Task:" or something similar. The pages would be JSON objects, with each key–value pair corresponding to one in the database table.
  • Automated tasks (from Wikidata queries, etc.) are not recorded in the table unless they have been annotated.
  • Lists of tasks are automatically generated based on queries, with results coming from the database table and automated data services.
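
The consensus points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a proposed implementation: the task table, its columns (title, task_type, note, source), and the helper names save_task_page, annotate, and build_list are all hypothetical, and an in-memory SQLite database stands in for whatever storage a wiki would actually use.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task (
        title     TEXT PRIMARY KEY,       -- target article; it need not exist yet
        task_type TEXT NOT NULL,
        note      TEXT NOT NULL DEFAULT '',
        source    TEXT NOT NULL           -- 'user' or the name of an automated service
    )
""")

COLUMNS = ("title", "task_type", "note", "source")


def save_task_page(conn, page_json):
    """Mirror a JSON task page into the table, one key per column."""
    data = json.loads(page_json)
    unknown = set(data) - set(COLUMNS)
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    row = {"note": "", "source": "user", **data}
    conn.execute(
        "INSERT OR REPLACE INTO task (title, task_type, note, source) "
        "VALUES (:title, :task_type, :note, :source)",
        row,
    )


def annotate(conn, suggestion, note):
    """Persist an automated suggestion only once someone annotates it."""
    save_task_page(conn, json.dumps({**suggestion, "note": note}))


def build_list(conn, automated):
    """Merge stored tasks with transient automated suggestions."""
    merged = {
        title: {"title": title, "task_type": task_type, "note": note, "source": source}
        for title, task_type, note, source in conn.execute(
            "SELECT title, task_type, note, source FROM task"
        )
    }
    for suggestion in automated:
        # Stored (annotated) rows win over the raw automated entry.
        merged.setdefault(suggestion["title"], {"note": "", **suggestion})
    return sorted(merged.values(), key=lambda t: t["title"])


# A user-written task page, plus two suggestions from a made-up automated source.
save_task_page(conn, '{"title": "Example article", "task_type": "requested-article"}')
suggestions = [
    {"title": "Example topic", "task_type": "requested-article", "source": "wikidata-query"},
    {"title": "Another topic", "task_type": "requested-article", "source": "wikidata-query"},
]
annotate(conn, suggestions[0], "Probably not notable enough for an article")
worklist = build_list(conn, suggestions)
stored_count = conn.execute("SELECT COUNT(*) FROM task").fetchone()[0]
```

Here the worklist contains all three tasks, but only two rows are stored: the user-written request and the annotated suggestion. The un-annotated suggestion stays transient and would simply disappear if the automated source stopped returning it.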

A lot of details need to be filled in, but proper user research should be done first to ascertain the actual needs of editors.