Technical decision making/Decision records/T274181

Structured Data Across Wikimedia Architecture (SDAW)

The Structured Data Across Wikimedia Architecture (SDAW) team was able to exit the process early. The Technical Decision Forum process helped the team conclude that the proposal was larger in scale than originally anticipated. With a clearer understanding of their decision proposal, the SDAW team will work on a more focused framework to bring to the Technical Decision Forum. The feedback the SDAW team received is as follows:

Question:
Was the problem clearly stated?

Response	Percentage
Strongly agree	25.0%
Agree	54.2%
Neutral	16.7%
Disagree	4.2%
Strongly disagree	0.0%

Feedback

The Problem statement talks about tagging articles, paragraphs and even sentences with relevant language-independent Wikidata concepts. Leaving aside the fact that there exist concepts that are not language-independent (e.g. https://en.wikipedia.org/wiki/Saudade), I'd like to focus on the fact that tagging single sentences is a completely different scale of problem than tagging articles (or even paragraphs). Regardless of implementation, the amount of computing resources required for tagging paragraphs or sentences will be orders of magnitude more than for tagging articles. Could this problem be broken down more? Perhaps starting with just tagging articles and scaling up from there? From my understanding a big part of the gains will come from tagging paragraphs (e.g. an intro/summary being an answer to a question), so maybe paragraphs can fit in the initial plan as well. But adding sentences to start with sounds a bit too much to me.

I have read the "What" section at least 5 times now. I finally realized that "The first most useful type of metadata is the topic of the content." is the closest thing to an explanation of the goal of this body of work that I can understand. The size of the grant funding the work is irrelevant (as is the fact of an earmarked grant being involved at all). This as the lead of the section obfuscates rather than informs. It is still not clear to me after several re-reads if extracting/identifying/cataloging "sections" is part of the work expected, or if instead this will build on other existing structural decomposition of articles that already somehow exists. There is a paragraph on "Structuring content into discrete sections" but it does not contain statements of proposed action. Instead it merely states that this might be a nice thing for external reusers of content and other workflows without establishing a concrete basis for those statements. I would personally expect the What to be statements in active voice about a problem domain and the high level course of action to be explored. Ideally this would also be written in inverted pyramid/journalistic style so that one does not have to hunt for the important ideas within other tangentially related prose.

As stated, it sounds like the problem is the presence of the grant itself.

I asked members of my team to review the SDAW document, and one of my teammates has several questions (posted below verbatim):

"Looking through the presentation, I'm a bit confused by the application of this idea. I see that they are trying to apply section-level concepts to the lead section, which (ideally) summarizes every section in the article. shouldn't this just reflect article-level concepts, and if so, couldn't they use existing ways of describing a topic (eg category tree - although this might be a useful way of replacing that; existing descriptors on Wikidata)? and how would this work with abstracting out references, since again leads are more typically supported only in body? also how does link analysis interact with project-determined linking standards like enwp's MOS:LINK - eg something linked in an earlier section may not be linked again later? they argue that it minimizes bias vs machine learning - I don't agree that it would"
Questions I had are the following:

   1. Will there be a difference between tagging cited and un-cited content? Will there be a preference for getting tagged information from cited content?
   2. Will people be able to access the website of the citation?

The Problem Statement seems fairly thorough and well-written. It doesn't delve too deeply into specific technical details, but I assume that is completely standard and acceptable for these types of documents. I'm also not entirely clear on how the two additional goals (increased readership from underserved markets, increased editors from emerging/mobile markets) specifically tie into this project, but again I'm not sure that matters.

The problem statement is clear and well articulated. I understand that we are in a What and Why stage now. At the same time, I look forward to a clear "what does done look like" including success metrics to help illustrate what happens next with this decision.

The problem statement is understood, but the third paragraph, talking specifically about discrete sections and sentences, might be out of scope for this project, and is being worked on, researched, scoped, and examined by other teams. The process and technical decisions involved in creating the knowledge store are independent of this decision record, and that should be clarified.

While the problem is clear, it might have benefited the decision to limit the scope. The current scope makes this a large-scale effort, resulting in many teams being involved. Overall it looks well defined; my only question is whether the "what" part also involves exposing the structured content to consumers (e.g. some sort of structured content API) or whether the scope for now is just generating the data. Do we expect some sort of product integration with the Android/iOS apps?

The scope is unclear. On one hand it is pretty much open ended: structured tagging for content. On the other hand it says the first task is section topics. Should this be evaluated for the larger or smaller scope? Some clarification is in order; maybe first do the overall thing only and then later a more specific focus area? It's easy to imagine that structured data may help with many things, and enable things we haven't even thought about. However, the examples listed here are vague and it would be nice to have a bit more detail about why or how we think it helps the things mentioned. I think it would help to connect these to planned/expected focus areas within this project. For example, section topic modeling can help (among other things) Section Translation, which facilitates translation and knowledge parity.

Regarding the structure of the first section, I think there’s enough overlap between the phrasing of the first and second questions (“What is the problem or opportunity?” and “What does the future look like if this is achieved?”) that the 3rd and 4th paragraphs could be placed in the second section, leaving the first section to outline the decision statement and its scope. Speaking of scope, it’s not clear if this decision record covers all milestones mentioned in the roadmap or only M1. If it’s the former, then I’d expect community consultation, editing, and moderation to be mentioned explicitly. Regarding the “What does the future look like if this is achieved?” section: We do know that exposing structured data in a machine-readable format increases average pageviews per day. See https://www.mediawiki.org/wiki/Reading/Search_Engine_Optimization/sameAs_test. Regarding the “What happens if we do nothing?” section: What happens if we fail to meet the goals of the grant?

Looks good, I like the callout about breaking an article into sections as its own separate win. It's helpful to see the distinction between building a system for serving content modularly and the specific application of this system, that is tagging sections with topics. If one byproduct is clients/API layers never again needing to parse html to extract sections, then this system has much more potential than just providing quick facts, and that is a selling point that could be hit a little harder.

The Problem Statement seems pretty broad to us. We appreciate the late addition of the Summary paragraph at the top. The general tagging using Wikidata concepts seems generally clear, especially accompanied by examples from the linked documents.

We still have a fairly vague understanding of how tagging Wikipedia articles with topical metadata is going to help achieve the goals of increasing the number of readers, especially from underserved communities, and increasing the number of contributors and editors, especially from emerging markets and on mobile. We wonder if this could be elaborated a bit more. We wonder if some further examples could be provided to demonstrate the potential future applications benefiting from this work. "What does the future look like if this is achieved?" seems relatively short compared to the problem statement. Finally, it might not exactly fit the Decision Statement Overview document, but WMDE is curious to hear more details on how it is planned to use Wikidata in applications like "question answering" or "providing quick facts".

The Platform Engineering team struggled a lot with the “What” section as it didn’t provide a lot of technical details. Beyond the grant requirement, members of the team wanted more detail on how it would be implemented, to better comment on the other sections and to fully understand the size of this project and whether it should be broken up into sections. There was also some feeling that this document may be retrofitting a decision that has already been made. Open question: Will the topic be the only metadata implemented in the context of this 3-year project? Or is the scope broader?

While the problem statement is pretty clear, the problem and project described are both very large, and it is hard to be sure.

Question:
Does the solution support the Foundation goals?

Response	Percentage
Strongly agree	33.3%
Agree	50.0%
Neutral	16.7%
Disagree	0.0%
Strongly disagree	0.0%

Feedback

Why is the earmarked grant information in the What section but not in the Why section?

Since we're asked specifically about the MTP and the 2030 strategy, it would be nice to link directly to the objectives that this project aligns with. I don't know if the "objectives it supports" listed in the document are from either of them.

It can be hinted by the text itself, but it is not entirely clear until you read the additional background links.

The "on-site search [that] is a significant improvement over the current state" does not seem to match the scope defined earlier. Per my understanding of this document this (SDAW) is more a platform improvement project, and while it will showcase what it enables with concrete examples, search feels too important and too complex to be an explicit goal.

There is general excitement about this: our ability to collect annotations such as these is highly important to building successful and inclusive ML and information retrieval / recommender system technologies, especially for cross-lingual work. However, two main questions appeared around community needs: Has the community asked for this (section-level structured data)? That is, is there a community of volunteers who are excited to give constructive feedback, add topics when the tool exists, build tools to make it easier etc.? Otherwise, it feels like we're building a system that will generate a massive amount of backlog / potential for inaccuracies and therefore frustration without any plan to handle that (e.g., Article Feedback Tool [1]). Related to that is the question (from the supplementary slide-deck) about how users would interact with “the meta”: Has anyone studied usage of the RelatedArticles [2] extension? This project gave editors the ability to override CirrusSearch's "related articles" recommendations (which are automatically generated). Why annotations on the section-level? Alternative candidates would be page-level annotations where one can identify a clear need from the community (see all the ad-hoc technologies such as WikiProjects + PageAssessments [3] that have been developed to track these things). While this can be more complicated because you're building on existing technologies and not trying to disrupt existing workflows, it's much more likely to get community buy-in and would serve a clear need.

 [1] https://en.wikipedia.org/wiki/Wikipedia:Article_Feedback_Tool
 [2] https://www.mediawiki.org/wiki/Extension:RelatedArticles
 [3] https://www.mediawiki.org/wiki/Extension:PageAssessments

I feel like the "how" in this area gets a bit lost. They are all worded like objectives to me. Maybe just a simple prefix of "Objective: {asdf}" and "How: {asdf}" could clear this up, or add a specific reference to topics here.

We generally understand how solving this problem supports Wikimedia goals. We could see some more clarity brought to the "Increase impact of knowledge with data" part. From the text itself it has not been entirely clear to us what Objectives are meant here: whether this is primarily targeting Big Tech organisations (OKAPI?) that would be enabled to provide Wikimedia content to new audiences via their products, or whether the main target is emerging markets. Table 1 in the public SDAW grant text (I don't have access to the Google Doc linked from the Decision Statement Overview, so I am relying on https://commons.wikimedia.org/wiki/File:2020_Structured_Data_Across_Wikimedia_proposal.pdf) to a degree clarified this point for us, but we think being more explicit in the Statement Overview would still be beneficial.

Better clarity on the impact of this project would be helpful. Open question: Why should we prioritize this over other efforts? And how does it map to our OKRs?

As written this is great. It seems like there's also an opportunity for increasing Knowledge Equity through this work, at least as a second-order effect via gap analysis of the resultant corpus, which might be good to highlight.

Question:
Have the right stakeholders been identified, and do they hold the right roles?

Response	Percentage
Strongly agree	16.7%
Agree	37.5%
Neutral	25.0%
Disagree	16.7%
Strongly disagree	4.2%

Feedback

The amount of data that will probably be generated by this effort is substantial enough to ask that the SRE Data Persistence team (aka DBA) also be included in the RACI table. Granted, this is already alluded to by the "SRE" line under the "Consulted" column where "Hardware and System needs" are mentioned, but I only represent the Service Operations team of SRE; the rest of the SRE teams need to be represented too, especially the Data Persistence team.

Due to my difficulty in understanding the scope of the project from the What section I am unable to fairly evaluate the Who section. That being said, at a high level there does seem to be consulted and informed inclusion for a broad range of engineering roles within the Foundation and WMDE. My main question is if there should be at least Informed (if not Consulted) status for some community sub-groups, but this is currently difficult to ascertain from the stated brief. I also do not see any explicit inclusion of UX and Design Research roles, but again it is unclear from the What if this includes enough externally facing interactions to warrant that.

Please add Tyler Cipriani/Release Engineering to the Informed list. There will be (potential) impact on the team but they don't need to block on the decision, similar to Product Analytics: https://docs.google.com/document/d/1shoJwPeDy7W9_c9n6QfERNXx5uUFsYhtZfj9m-fy7ow/edit?disco=AAAALgo5ep4. For the next question ("follow up") you can simply @mention me highlighting the addition to the document.

Data integration for this project will be a fundamental component. I think Consulted for Data Engineering is the correct role, since core platform will be doing the actual work. However, I'd appreciate if our team was involved in the architecture design decisions earlier rather than later. (Maybe that is what 'consulted' means :) )

There seems to be some overlap with past proposals, particularly this annotation service: https://phabricator.wikimedia.org/T149667, https://meta.wikimedia.org/wiki/Grants:IdeaLab/Amazing_Article_Annotations. To what degree was prior art consulted and were those use cases accounted for?

I had a couple of ideas for integration with the Wikipedia Library that can be explored once the project is complete and some features are implemented in the Library. Might be nice to keep us informed on progress.

We'd prefer if the Growth team was consulted since the focus of the team is building structured editing experiences with topic-based discovery, and some of the architecture choices described here have significant product impact and would affect that work. (Already brought this up within the Decision Statement Overview document, just re-stating here since the form asked about it.)

This RACI looks comprehensive and widely vetted. It's not unheard of for an unidentified stakeholder to surface later on. Still, this looks like a good start. UPDATE: One exception is that the Language team is not in the RACI of a decision that mentions translation and localization.

This decision document seems to have evolved into a decision document about the Knowledge Store (Phoenix) rather than a decision document about SDAW work. The Phoenix/Content Store is an evolving piece of infrastructure that is under the leadership and direction of the Architecture team as part of moving the organization towards a better, more sustainable architecture. As such, we need to start working with the potential consumers, but we also need to be very careful not to jump on the "deliver it too quickly" bandwagon and have a decision document that basically talks about how to implement Phoenix into production using product teams. It feels like the problem definition is specific to SDAW's needs, but the solution is touching on general architectural work that serves multiple product teams. Instead, the document should concentrate on what problems the SDAW team needs to resolve in order to move forward on their own part so we can help them design a series of actually iterative implementations that will help them focus their product and hash out the challenges.

Translation, localisation and underserved communities were mentioned multiple times, yet the Language team is not listed in the stakeholders. Their work with Section Translation seems directly relevant to the first task of section topic modeling, and collaborating with them would make it more likely that there are concrete benefits for translation. Also, clarification of the scope could help to clarify the list of stakeholders. I imagine that different stakeholders would be involved at different times, depending on what the focus is at the time.

Regarding other teams I cannot say much, but for the Research team (the one I am representing) there might be an opportunity to be more involved than currently reflected in the document ("May be interested in using the new data in research."). This could be in the form of consultation or support around the development of the technical solutions for automatically identifying annotations in sections (or sentences, etc.). These problems are hard, especially when aiming to do that across all languages. In fact, the Research team has been developing machine learning models for similar problems (in a language-agnostic way) in the past year, such as annotating articles with topics [1] or generating recommendations for links to add to articles [2] as part of the add-a-link structured task [3]. Even if choosing simpler non-ML methods (e.g. by just working with the existing blue links) there will still be problems around biases simply from reproducing existing bias in the data (for example, we know that there are systematic biases in the way articles on women are linked to other articles). We are actively working on these problems and would be happy to share our insights and support your work.

 [1] https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification
 [2] https://meta.wikimedia.org/wiki/Research:Link_recommendation_model_for_add-a-link_structured_task
 [3] https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link

Consider adding Readers Web (RW) to Informed. RW maintains the Vector and Minerva skins, which are used by the majority of our users. If there are to be any changes made to the structure of the HTML output by the parser, then it’d be good for RW to know about them ahead of time. Consider moving WMDE to Consulted. New ways of consuming content from Wikidata/Wikibase could drive improvement of APIs, of which WMDE are domain experts, which will benefit everyone. The above said, there do seem to be a lot of teams in the RACI. Should this be taken as a signal to decrease the scope of the decision?

I know we're not supposed to get into the hows of this, but is it safe to assume this section topic tagging will not involve any sort of metadata embedded directly in wikitext? I noticed mentions of that approach in the linked context documents. If not, it seems like you'll want to include Parsoid, and client-side teams on how to work with the new embedded tags for article display and editing.

In the version of the Statement Overview I am reading while filling this in, WMDE is listed as an Informed party in the RACI table. WMDE disagrees with this categorization, as we see our involvement as required in making the decision, or decisions on aspects related to Wikidata, which seems to be an essential building block for this endeavor. Therefore we believe we should be a Consulted party in this Decision Making Process. WMDE currently owns and maintains the majority of structured data infrastructure at Wikimedia, i.e. Wikidata as a structured data repository, the related editing processes and workflows, the structured data source for Commons, etc., and other areas related to Wikidata, including links between Wikipedia articles and Wikidata concepts (sitelinks, which seem like an essential building block for SDAW), descriptions for mobile apps, and so on. Making any changes to this infrastructure would require our involvement. We also anticipate the work of the WMF teams might require some changes to how Wikidata data is served to "client apps", what Wikidata APIs provide, how to build software on top of Wikidata/Wikibase, etc. Those topics are something we would be eager to work on to enable your success, but we would need to be actively involved to e.g. understand your needs and requirements, discuss limitations, tradeoffs etc. One-way communication, which the "Informed" level implies, is not going to be sufficient to allow Wikidata to serve the SDAW initiative well. Therefore we would see WMDE as a "Consulted" actor in this decision making process.

Because the document did not provide technical details, the team was unable to confidently say which teams should be included, as they do not yet have clarity on the “how” of this project.

As far as I understand, SRE Service Ops should probably be "responsible" and not just "consulted". The project is very likely to require new hardware being purchased and provisioned, and they would be the ones shepherding that process and doing that work. Also, it seems like multiple possible paths forward involve the creation and productionization of a new storage backend. In such a case, SRE will need to be involved in the design phase, as well as help to resolve many implementation details, and will need to feel confident in going on-call for said service. This is a substantial amount of work. Re: the final question: I am of course happy to be contacted if my feedback is unclear, or if the decision team / Forum chairs disagree. But I don't feel a need for it if there is agreement.