Wikimedia Technical Documentation Team/Doc metrics/Assessment
This page documents the outcomes of the testing and assessment of v0 experimental technical documentation metrics, completed in FY 2024-25 Q3.
tl;dr
- We need more data, more accurate data, and easier-to-use data sources. Platforms where we publish tech docs lack much of the data infrastructure that exists for sister projects like Wikipedia. However, we can't just try to replicate the ecosystem of tools and dashboards that Wikipedias have, because tech docs have different quality criteria than encyclopedia articles. The biggest areas where we lack key data for assessing tech docs are:
- Connections between docs and code (#CodeConnection)
- Consistent format and style (#ConsistentFormat)
- We can't use data to measure two of the most crucial elements of tech docs quality: topic coverage and accuracy. Measuring those requires human analysis.
- Overall, the test metrics turned out to be of mixed utility and mixed validity.
- The collection-level standardized scores did not consistently align with the human-assessed quality of the collections. We should only standardize scores if we have a larger volume of input data elements.
- Some of the doc attributes we measured were not great indicators, or were misleading for certain metrics.
- Some of the doc attributes we measured were not bad, but turned out to be superfluous given the utility of other signals for the same metric.
- Some of the doc attributes we measured were useful, and we identified some combinations of data that do yield actionable insights. As a next step, we will be implementing those in a prototype.
User testing focus areas
Our user testing of the metrics focused on assessing:
- Utility of collection-level standard metrics:
- Do testers who look at the standard metrics at the collection level make the same conclusions?
- Do the metrics lead testers to identify the same areas for further investigation? (This doesn't mean the data is right, it just means we all interpret it in the same way.)
- Validity of metrics output:
- Do the metrics lead testers to make the same conclusions they would make based on manual/human assessment re: which docs need which improvements?
- How and where do the metrics-based conclusions align/diverge from human assessment of the same docs?
Our user testing contributed to the critiques, conclusions, and areas for future work documented on this page.
Critiques of specific metrics
Overview of how Tricia ranks the metrics based on the test outcomes:
Metric | Utility | Validity |
---|---|---|
#Succinct | Relatively good | Mixed |
#Developers | Relatively good | Relatively good |
#CollectionOrientation | Could be good | Relatively good |
#CodeConnection | Could be great with improved validity | Not great |
#ConsistentStructure | Not great | Unclear |
#ConsistentFormat | Not great | Not great |
#Freshness | Not great | Unclear |
Succinct
🏆 Utility: relatively good
🌀 Validity: mixed
The REST_API collection was the only collection with a positive score for Succinct. Three out of the five test collections scored -1. After digging into the data, it seems the doc attributes we used to measure succinctness were mixed in their usefulness:
- Headings below level 3 influenced the score, especially for long pages. However, this attribute didn't seem to be a more useful indicator than page length. Instead, it just brought the score down too drastically for pages that would already have been flagged by other measures like number of sections or length in bytes.
- The doc attribute "See also length", which measured whether a "See Also" section had more than 6 links, influenced the score, but didn't turn out to be a very useful signal. It mostly gave false negatives for overall succinctness of a page, and it gave the same negative score regardless of whether a See Also section had 7 links or 100 (lack of nuance).
- The most useful doc attribute in this metric was Section count, followed by Page length. Combined, these two signals provided useful information about the amount of content on a page, and whether it was broken up into reasonable chunks. A ratio of number of sections to page length could be an even better attribute to measure (see the sketch below). Unfortunately, the utility of these two datapoints was somewhat obscured by the other, relatively useless or misleading doc attributes that made up this metric.
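To make the sections-to-length idea concrete, here is a minimal sketch of how such a combined signal could be computed from the two useful attributes; the thresholds and score values are illustrative assumptions, not the weights used in the v0 test.

```python
# Hypothetical sketch: combine section count and page length (bytes) into a
# single "chunking" signal for the Succinct metric. Thresholds and score values
# are illustrative assumptions, not the weights used in the v0 test.

def succinct_signal(section_count: int, page_bytes: int) -> int:
    """Return a rough Succinct score contribution for one page."""
    if page_bytes == 0:
        return 0
    # Average section size: smaller values suggest content is broken into
    # reasonable chunks; very large values suggest walls of text.
    bytes_per_section = page_bytes / max(section_count, 1)
    if page_bytes > 50_000 and bytes_per_section > 5_000:
        return -10   # long page with few sections: likely not succinct
    if bytes_per_section <= 3_000:
        return 10    # content is chunked into small sections
    return 0         # neutral: no strong signal either way

print(succinct_signal(section_count=4, page_bytes=60_000))   # -10
print(succinct_signal(section_count=25, page_bytes=60_000))  # 10
```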
Quotes from feedback:
"Once on specific pages, I generally agreed that the problems pointed out by the metrics are worth addressing, but [compared to other issues] didn't find them very important. I found myself more interested in the quality of the instruction or writing, readability, and informational value."
Developers
🏆 Utility: relatively good
🏆 Validity: relatively good
This metric was made up of two types of signals: those assessing code-related content on the page, and those assessing page relevance/popularity based on traffic and revisions.
Validity was impacted by inaccurate data and unavailable data for incoming links. The metric's validity was also impacted by the fact that pageviews and revisions are not entirely reliable indicators of the relevance or utility of a page.
In general, only the collections based on product/technology as their unifying concept scored positive for the Developers metric. The workflow and doc type collections were neutral or negative. Despite that, this metric seemed to work relatively well for the "Local development setup" collection. The top-scoring pages were the most relevant pages for this user task. The collection's overall score was dragged down by the inclusion of a sub-collection that has a single maintainer (CLI) and one for low-priority docs (Mediawiki-Vagrant). See more about that below, in the CollectionOrientation section.
- We checked and confirmed this isn't part of the doc type design flaw, since the Developers metric included various attributes that aren't conditioned on doc type.
For the REST API collection, this metric also seemed to work relatively well. Even with the missing and inaccurate link data, Incoming links scores accurately reflected the relevance, specificity, and popularity of pages. This was true both at the collection level and across collections, at the page level (more about that below).
Page | Code sample present on page | Code samples in more than one language | If landing page: contact info for owner/maintainer? | Incoming links | More than one edit in the last 30 days? | Majority of edits by single maintainer | Percentage of page watchers who visited in the last 30 days | Total score for Developers metric |
---|---|---|---|---|---|---|---|---|
Codex | 30 | 20 | 0 | 10 | 20 | 0 | 10 | 90 |
Localisation | 0 | 0 | 50 | 10 | 20 | 0 | 0 | 80 |
Local_development_quickstart | 30 | 20 | 0 | 0 | 20 | 0 | 10 | 80 |
Manual:Messages_API | 30 | 20 | 0 | 10 | 20 | 0 | 0 | 80 |
Manual:Extension.json/Schema | 30 | 20 | 0 | 10 | 0 | 0 | 10 | 70 |
API:REST_API | 30 | 0 | 0 | 10 | 20 | 0 | 10 | 70 |
ResourceLoader/ES6 | 30 | 20 | 0 | 10 | 0 | 0 | 10 | 70 |
Manual:FAQ | 30 | 20 | 0 | 0 | 20 | 0 | 0 | 70 |
ResourceLoader/Package_files | 30 | 20 | 0 | 10 | 0 | 0 | 10 | 70 |
Manual:Page_content_models | 30 | 20 | 0 | 0 | 20 | 0 | 0 | 70 |
Note: a zero score is not necessarily bad. To learn more about how to interpret the scores in the table above, see the User Guide and Reference doc.
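To make the table easier to interpret, here is a minimal sketch of how the per-attribute scores in a row combine into the total for the Developers metric; the attribute names mirror the table columns, and straight summation is an assumption inferred from the totals shown above.

```python
# Minimal sketch: per-attribute scores for one page summed into the Developers
# metric total. Attribute names mirror the table columns; straight summation is
# an assumption inferred from the totals shown above.

codex_scores = {
    "code_sample_present": 30,
    "code_samples_multi_language": 20,
    "landing_page_contact_info": 0,
    "incoming_links": 10,
    "recent_edits": 20,            # more than one edit in the last 30 days
    "single_maintainer_majority": 0,
    "watchers_visited_recently": 10,
}

developers_total = sum(codex_scores.values())
print(developers_total)  # 90, matching the Codex row in the table
```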
Looking only at the pages with the lowest scores for the traffic/edits signals in this metric (pageviews, editor variety, revisions), the data presents these five pages as having the lowest overall scores:
- Manual:Text_table
- Database_field_prefixes
- Manual:Skins
- API:Presenting_Wikidata_knowledge
- Continuous_integration/Tutorials/Debian_packaging
When comparing how effectively the revision signals enabled us to differentiate pages, the editor variety signal was more useful. We should still consider implementing the revisions signal, but try lengthening the time period to 90 days instead of 30.
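As a rough sketch of what a 90-day window could look like, revision timestamps can be pulled from the MediaWiki Action API and filtered client-side; the helper below is illustrative, omits pagination and error handling, and is not the process we used to build the test dataset.

```python
# Sketch: count recent revisions of a mediawiki.org page using the MediaWiki
# Action API (action=query, prop=revisions). The 90-day window is the proposed
# adjustment; pagination and error handling are omitted for brevity.
from datetime import datetime, timedelta, timezone
import requests

API = "https://www.mediawiki.org/w/api.php"

def recent_revision_count(title: str, days: int = 90) -> int:
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp",
        "rvlimit": "max",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    revisions = page.get("revisions", [])
    return sum(
        1 for rev in revisions
        if datetime.fromisoformat(rev["timestamp"].replace("Z", "+00:00")) >= cutoff
    )

print(recent_revision_count("API:REST_API", days=90))
```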
Looking only at the pages with the lowest scores for the content signals, the data presents these pages as having the lowest overall scores:
- Database_field_prefixes
- Cli/guide
- Manual:Skins
- API:Presenting_Wikidata_knowledge
- Continuous_integration/Tutorials/Debian_packaging
In both of the above lists, human assessment of the pages indicates only two of the pages may actually need attention:
- Cli/guide: transcluded in its parent/landing page, so the page itself is never visited directly.
- Continuous_integration/Tutorials/Debian_packaging is a stub that could probably be integrated into other docs.
Despite its shortcomings, the Incoming links score for those two pages is the only signal that successfully differentiates them from their low-scoring peers. So, the combination of these content and traffic/revisions signals seems to be valid and useful as long as we include Incoming links scores. For all the other pages, which didn't receive a negative Incoming links score, it's true that they haven't been updated in a long time, but we can't confidently conclude from this data that they aren't relevant for developers.
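For reference, a minimal sketch of counting on-wiki incoming links via the Action API (prop=linkshere); as noted under Inaccurate data below, this link data is incomplete, so treat it as a rough signal only.

```python
# Sketch: count on-wiki incoming links for a page via the Action API
# (prop=linkshere). Link data is known to be incomplete (see "Inaccurate data"
# below), so this is a rough signal only. Pagination is omitted.
import requests

API = "https://www.mediawiki.org/w/api.php"

def incoming_link_count(title: str) -> int:
    params = {
        "action": "query",
        "prop": "linkshere",
        "titles": title,
        "lhlimit": "max",
        "lhnamespace": 0,      # main namespace only (an assumption for this sketch)
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return len(page.get("linkshere", []))

for title in ["Cli/guide", "Continuous_integration/Tutorials/Debian_packaging"]:
    print(title, incoming_link_count(title))
```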
CollectionOrientation
🌀 Utility: could be good
🏆 Validity: relatively good
CollectionOrientation seems very difficult to assess without looking across pages as a collection. We only looked at whether there was a navigation menu, and/or sections that provided hooks to next logical pages. Despite that, this metric seemed to reflect the reality that pages in product/technology-focused doc collections are more likely to have navigation elements that support moving through the collection as a group of related pages.
The test doc collections we curated based on developer tasks/workflows and doc type scored lowest for the CollectionOrientation metric. This makes sense, because pages on mediawiki.org have mostly been created and maintained based on the product or technology they describe, or the team that owns those systems. For example, the task-oriented "Local development setup" doc collection scored among the lowest of the collections for this metric. When you drill down into the page-level data, it appears that this is primarily due to these pages not currently being curated as a collection on-wiki:
- Only two of the pages in the collection have navigation menus, and one of those is a landing page. The only navigational supports between most of these pages and sub-collections are links in wiki page content.
- Two sub-collections (Mediawiki-Vagrant and Mediawiki CLI docs) scored low for "Collection score from section types" because many of the pages had neither a "See Also" nor a "Next Steps" section. Metrics testers agreed that adding these section types could help improve the navigability between pages.
Even if these pages should not be considered part of one collection, the takeaway here is that they're only using page structure (aka subpaging) to indicate their relatedness, and they are not presented anywhere as part of a larger collection. This is somewhat useful information that could inform documentation improvements, like creating a cross-collection landing page or navigation template for these pages. However, getting information like this isn't currently feasible at scale because we don't (yet) have doc collections for other user journeys or tasks.
Quotes from feedback:
User journey collection concepts are tricky:
"I’d say that local dev environments, as a user journey collection, doesn’t need its own nav because once you choose a local dev environment, you don’t need to be able to easily get to the docs for another one. There’s a main landing page, then you go off to whichever tool you’ve chosen and just use that. There are some other collections in this list that could use their own local nav, so in that respect, this score was correct in highlighting that."
Pages using subpaging in place of a nav menu:
"This isn’t entirely bad, but yes, adding a simple navigation template would make these pages easier to use. However, the nav template should only be for the [sub-collection], not for any other pages in the local dev setup collection. User journey collections like these should probably only very rarely have their own nav templates. Maybe omit this score for this type of collection?"
Nav menu was a valid signal, but its utility is still an open question, and it's currently not feasible to capture automatically.
CodeConnection
🌀 Utility: could be great with improved validity
💔 Validity: not great
This metric needs more datapoints. Only two (Code sample automation and Links to code repos from landing pages) don't seem like enough for assessing this, especially when that assessment was conditioned on doc type (see: Design flaws below). The output values for the test set are either 0 or 30, but they only provide insight within the same doc type buckets (landing page vs. other), because we only recorded data for one of the questions depending on the doc type. This means that we only have one datapoint per page for this metric. Clearly not enough! (Details in the Reference doc).
Non-landing pages that scored positively for this metric are those that have code samples in source control, or generated from code that is in source control. In our test dataset, 94 of these pages had code samples, but only 8 pages had code samples that were in source control or generated from code:
Page | Doc type |
---|---|
API:Allmessages | Reference |
API:Parsing wikitext | Reference |
Extension:Examples | Tutorial |
Extension:BoilerPlate | Tutorial |
API:Picture of the day viewer | Tutorial |
API:Nearby places viewer | Tutorial |
API:Article ideas generator | Tutorial |
API:Holidays viewer | Tutorial |
Doc type | Pages with code samples | Pages without code samples | Total pages |
---|---|---|---|
How-to or user guide | 33 | 7 | 40 |
Unclear, or a mix of multiple types | 26 | 8 | 34 |
Tutorial | 21 | 2 | 23 |
Reference | 10 | 9 | 19 |
Overview or concepts | 4 | 3 | 7 |
Grand Total | 94 | 29 | 123 |
ConsistentStructure
💔 Utility: not great
❓ Validity: unclear
- The REST API collection was the only collection that had an overall positive score for this metric.
- The doc attribute "See also length", which measured whether a "See Also" section had more than 6 links, influenced this metric score, but didn't turn out to be a very useful signal. It mostly gave false negatives for overall consistent structure on a page, and it gave the same negative score regardless of whether a See Also section had 7 links or 100 (lack of nuance).
- Navigation menu coverage was a somewhat useful indicator, but is not currently feasible to identify automatically, since it depends on being able to auto-identify navigation elements on the page.
ConsistentFormat
💔 Utility: not great
💔 Validity: not great
- MW Tutorials was the only collection with a positive score for this metric. However, this turns out to be mostly due to unbalanced inputs for this metric: it only measured two doc attributes, and one of them was conditioned on doc type.
- The MW Tutorials collection had slightly more landing pages in it than other collections, so even though other collections also scored generally well for consistent headings, the boost from landing pages artificially inflated the score for this collection.
- This metric did successfully reflect that two pages in the REST API collection had heading issues, so that’s great!
- Need more datapoints. The two we used are far from sufficient for assessing this, and the most valid one was based on human assessment of heading consistency. This would be very difficult to assess automatically.
- Measuring anything about formatting seems very fraught in a world where so much can be changed by user preferences and device settings.
Freshness
💔 Utility: not great
❓ Validity: unclear
This metric is based on 6 doc attributes, which include some of the most easily-accessible data (revisions and pageviews). However, that data is hard to interpret as a reliable indicator, and we have minimal ability to control it. For examples, see revisions.
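For context, the pageview portion of that data is easy to retrieve; the following is a minimal sketch using the Wikimedia Pageviews REST API, with a placeholder title and date range. Retrieving the data is not the hard part; interpreting it is.

```python
# Sketch: fetch daily pageviews for a mediawiki.org page from the Wikimedia
# Pageviews REST API and sum them. The title and date range are placeholders;
# as noted above, this data is easy to get but hard to interpret as a reliable
# freshness or relevance indicator.
from urllib.parse import quote
import requests

def pageview_total(title: str, start: str, end: str) -> int:
    """Sum daily user pageviews between start and end (YYYYMMDD)."""
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"www.mediawiki.org/all-access/user/{quote(title, safe='')}/daily/{start}/{end}"
    )
    resp = requests.get(url, headers={"User-Agent": "doc-metrics-sketch"})
    return sum(item["views"] for item in resp.json().get("items", []))

print(pageview_total("API:REST_API", "20250101", "20250131"))
```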
Design flaws
Conditioning input data on doc type
Some doc attributes are only applicable to certain doc types, and doc type influences the meaning of certain attribute values. However, many of our docs don't have a clear doc type, or are a mix of doc types. In our data gathering process, we only assessed certain attributes if we could identify a page as being (for example) a landing page. Since so many pages have unclear doc types, this means that the presence or lack of clearly-identifiable landing pages in a collection influences scores. Our (flawed) data input process was conditioned on doc type in the following way (see the sketch after these lists):
If the page is identified as doc_type = landing page, gather data about:
- Landing page layout (lp_layout)
- Link to code repo (lp_repo)
- Contact info for maintainer (lp_maintainer)
If the page is identified as some doc_type other than landing page, gather data about:
- Presence and attributes of code samples on the page (code_samples, code_samples_multi, code_samples_auto)
- Navigation and section elements (nav, page_in_nav, next_steps)
- Content format (list_or_table)
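A condensed sketch of that conditional, using the field names from the test dataset; in the actual test the values were entered manually per page, so the helper below is purely illustrative.

```python
# Condensed sketch of the doc-type conditional described above, using the field
# names from the test dataset. In the test itself these values were recorded
# manually per page; this helper just shows which fields get gathered when.

LANDING_PAGE_FIELDS = ["lp_layout", "lp_repo", "lp_maintainer"]
OTHER_PAGE_FIELDS = [
    "code_samples", "code_samples_multi", "code_samples_auto",
    "nav", "page_in_nav", "next_steps", "list_or_table",
]

def fields_to_record(doc_type: str) -> list[str]:
    """Return the fields gathered for a page, based on its identified doc type."""
    if doc_type == "landing page":
        return LANDING_PAGE_FIELDS
    # Pages with unclear or mixed doc types fall through to this branch,
    # which is one source of the imbalance discussed below.
    return OTHER_PAGE_FIELDS

print(fields_to_record("landing page"))  # ['lp_layout', 'lp_repo', 'lp_maintainer']
print(fields_to_record("tutorial"))      # the code/navigation/format fields
```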
Example impact: collections that have more landing pages will inherently be able to score higher for the CollectionOrientation metric, because they can get points for having a content grid, but other page types can't. Maybe this isn't a design flaw, because landing pages themselves help orient the reader to a collection?
Scope of impact: our test dataset of 140 pages had only 17 pages coded as landing pages, leaving 123 pages that were not influenced by this conditional.
For a full list of which fields in the input data depended on doc type, see the Field Reference.
Excluding some pages from test input data
The test dataset contains, for some collections, fewer pages than the full collection contents as reflected in the PagePiles. In some cases, this discrepancy comes from reasonable exclusions, like removing translation pages, redirects, or docs not published on mediawiki.org from the test set. However, in the case of the ResourceLoader collection, we didn't have the capacity to include the individual configuration variable documentation pages in the test. Because we also excluded JS documentation published on doc.wikimedia.org, these metrics may not accurately reflect the content, nor the user experience, of the ResourceLoader collection. The other 4 collections in the test set are not as impacted by this flaw.
Over-representing some doc best practices across metrics
In general, the metrics computations award score increases for docs that use best practices, but they don't penalize docs for NOT using them. For example: not every type of content can be represented as a table or list, but those types of formatting can improve doc readability if used when appropriate. This nuance is captured by metrics scoring that rewards Tables and lists, but doesn't penalize their absence.
While the same reward-not-penalty logic applies for Navigation: layout grid, that best practice is even more specific: it's only relevant for landing pages, and is really more of a suggestion than a firm best practice. A layout grid is not the only way to provide navigation and orientation to a collection, it's just one way that generally works well. Because of this, and because layout grids are easy to identify, we ended up including it as a signal in three different metrics. Even though we didn't penalize pages for not using a layout grid, the relatively small number of attributes we were assessing overall means that this attribute was over-emphasized -- especially for something that is really just a suggestion!
Scope of impact: ConsistentFormat, CollectionOrientation, ConsistentStructure metrics all award points if a page coded as a landing page has a layout grid. Since our test dataset of 140 pages had only 17 pages coded as landing pages, this has minimal impact on the test.
Considerations for future work
We manually generated the test dataset because implementing data pipelines to extract the required data would involve significant engineering effort. We wanted to validate the utility of the doc attributes and metrics we measured before putting resources into implementing more robust and large-scale solutions. The following sections summarize the different types of challenges likely to be involved in automating the generation of these tech docs metrics.
Page content analysis requires complex parsing
Most of the doc attributes we assessed for this test were content attributes. Because we were generating the test data manually, we could use our human cognitive superpowers to look at pages and quickly and easily record values for attributes like title length, consistency of headings, and presence of special page sections. Extracting page content metrics automatically, at scale, would require engineering effort to determine the appropriate parsing techniques, data sources, etc, and may require substantial computational resources. (Not as substantial as our production Wikipedia article quality models, but still worth considering).
We could consider using NLP tools to generate a one-off dataset of textual characteristics for a large number of technical documentation pages -- but then we'd still have to figure out how to translate that information to actionable docs data. This project has been a first step in identifying how we might do that.
We could consider building an ML model to assess technical documentation quality. The approach TBurmeister took in defining weights for different attributes (features) would be easy to adapt to a linear regression model. However, the process of generating training data may be too prohibitive, and the quantity of technical documentation pages too small, to justify this approach.
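For illustration only, a minimal sketch of that idea, assuming scikit-learn and entirely invented attribute values and quality labels; the hard part, as noted, is producing real training labels at sufficient scale.

```python
# Minimal sketch of the linear-regression idea: fit attribute weights against a
# hypothetical human-assigned quality score per page. The data below is invented
# purely to show the shape of the approach; generating real training labels is
# the hard part noted above.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: section_count, page_kb, incoming_links, has_code_samples
X = np.array([
    [12, 45, 30, 1],
    [3, 8, 2, 0],
    [25, 120, 55, 1],
    [7, 20, 10, 1],
])
# Hypothetical human quality ratings (0-100) for the same pages
y = np.array([70, 40, 55, 80])

model = LinearRegression().fit(X, y)
print(dict(zip(["section_count", "page_kb", "incoming_links", "has_code_samples"],
               model.coef_.round(2))))
```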
Content assessment is subjective and contextual
Many of the doc attributes we measured are not very strong or reliable indicators of the doc characteristics we care about -- but they're all we have, other than page-by-page manual review by humans. (See: collection audit).
See the Field reference for details of these challenges for each doc attribute we used in the metrics test.
Mediawiki.org is a multilingual wiki, but to keep the scope of this test manageable, we only assessed English versions of pages. Future implementation work should take into account which metrics can and should be provided for pages in multiple languages -- while also balancing that with the reality that the translation rate for our technical content is generally very low:
- None of the pages in our test dataset were marked "DoNotTranslate". Roughly half of the pages had translation markup added to them.
- Of those 74 pages, only 4 had translations more than 50% complete for all languages shown in the languages menu.
Making scores accurate and useful is hard
If we standardize the outputs for each metrics category, then we're able to compare across categories, and it's easier to see meaning in the data when looking across collections and metrics. However, standardizing the scores may remove too much nuance.
If we don't average the raw scores for a collection based on the scores of the pages, then collection size impacts the scores. However, averaging scores across the collection's pages means that how we define the collections impacts the data.
- Example: Both the normalized and raw scores for the Local dev setup collection were lower than other collections. Drilling down into the page-level scores shows that many of the 0 scores for this metric came from pages in a specific sub-collection. If that entire sub-collection was considered a separate collection, or excluded from this collection for any reason, the score would be significantly different.
- Example: the REST API collection was the only collection with a positive score for Succinct. 3 out of the 5 collections scored -1. We could conclude from this that most of our docs could be more succinct. However, the REST API collection is the smallest and most tightly-scoped collection in our test set. That may have influenced the score. It's hard to tell if the pages in this collection are actually more succinct than others, or if pages in small collections with clearly-defined conceptual boundaries are more likely to be succinct. The collection concept in this case reflects a documentation space that is relatively clear and well-bounded, which probably makes it easier to write clear and succinct documentation. This may not be a design flaw...just something to keep in mind.
It may have worked better to normalize the scores at the page level, and then only use averages at the collection level.
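A minimal sketch of that alternative ordering, using invented scores: standardize at the page level (here with z-scores across all pages), then average per collection.

```python
# Sketch of the alternative ordering suggested above: standardize scores at the
# page level (z-scores across all pages), then average per collection. The
# numbers are invented; fmean and pstdev come from the standard library.
from statistics import fmean, pstdev

raw_page_scores = {
    "REST_API": [90, 80, 70],
    "Local_dev_setup": [80, 20, 0, 10],
}

all_scores = [s for scores in raw_page_scores.values() for s in scores]
mean, stdev = fmean(all_scores), pstdev(all_scores)

def zscore(value: float) -> float:
    return (value - mean) / stdev if stdev else 0.0

collection_scores = {
    name: fmean(zscore(s) for s in scores)
    for name, scores in raw_page_scores.items()
}
print(collection_scores)
```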
Even though this effort was the largest tech docs assessment project ever undertaken for Wikimedia tech docs (to our knowledge!), we still only assessed a small number of doc attributes. Consequently, each attribute we chose to use could have a big, potentially unbalanced impact on the metric score for a given page and collection. The more data we could get and use, the more robust metrics like this are likely to be.
Think carefully about measuring things we can't change
Testers highlighted how some of the metrics included doc health indicators that aren't things we can control about the docs. Especially when gathering this data requires significant time and effort, we should carefully consider if that investment is worthwhile when we know we can't take specific actions to change the metric.
Examples:
- Were more than 50% of edits made by the top editor?
- More than one edit in the last 30 days?
- Percentage of page watchers who visited in the last 30 days
Technical documentation exists on multiple platforms
This test only assessed pages on mediawiki.org. To more accurately assess the quality of our technical documentation, we should take into account the full range of wikis and platforms (Gerrit, Github, GitLab, static sites, PAWS notebooks...) where people publish Wikimedia technical documentation. Readers and doc maintainers have to navigate this complexity, so excluding it from doc metrics creates an inaccurate picture of usability, findability, and other key quality indicators. Doc collections should capture all the pages a user would need to consult for a given task, topic, etc., regardless of where the page is published. Accurately modeling the landscape of our content can help us identify areas where collection contents are very scattered, so we can improve the developer experience by working to consolidate it.
Inaccurate data
The following known issues impact the accuracy of data we used in the test metrics calculations:
- phab:T353660: Links that use Special:MyLanguage don’t appear in Special:WhatLinksHere. This impacts the Developers metric score for incoming links.
Unavailable data
Coverage
Coverage is one of the most important elements of technical documentation quality. It reflects whether a doc, or collection, includes all the information it should, or whether there are content gaps. Coverage is especially important in technical documentation due to the constantly-changing nature of most codebases. If the docs don't cover what the code actually does, their utility decreases dramatically.
Attributes of docs stored with code
We currently have no easy way to assess (or even identify, at scale) technical documentation stored in our code repositories. A prerequisite for doing this would be a comprehensive and well-defined list of which code repositories we should include in an assessment. Then -- since docs stored with code are usually in markdown -- analyzing that content would require us to implement different parsing than for on-wiki docs. Finally, docs stored with code are published in a different setting with different navigation options and UI elements, so assessing their attributes would likely require different quality criteria than for on-wiki docs.
Referrer data
Our ability to measure the relevance of individual tech docs is somewhat impacted by the lack of data from external incoming links. Especially since developers report using search engines as one of their primary ways of finding technical information (per Developer Satisfaction Survey/2025, results forthcoming as of March 2025), it would be useful to be able to identify which tech docs receive the most pageviews from those sources. It would also be useful to identify incoming links from code repositories, since those links can be a strong indicator that content is related to and aligned with actual code. Unavailable data:
- Referer or clickstream data is not available for technical wikis unless the referer is another wiki page, so we can't use that to assess incoming clicks from the many places where people store source code.
- For privacy reasons, pageview tables and the tools that use them limit referer info to coarse buckets. The referer_class field can be: none (null, empty or '-'), unknown (domain extraction failed), internal (domain is a Wikimedia project), external (search engine) (domain is one of google, yahoo, bing, yandex, baidu, duckduckgo), or external (any other).
For more details, see Links from code repos to wiki pages and Incoming links.
Doc types
- Pages don't consistently align with clear doc types.
- When we can identify a clear doc type, we often have no consistent way to associate that information with the page. Categories work well for this, but have not been consistently used across our technical wikis, and can't be used for docs that are stored off-wiki.
- Doc type can change the meaning of certain metrics. For example: a very long landing page probably needs improvement, but a very long reference doc is usually fine.
- Being able to assess attributes by doc type is useful. For example: if we can identify all the tutorials, we can check for specific attributes they should have, like code samples.
(Design flaw?) The way we implemented scoring for some categories may have resulted in unfair calculations for collections that have fewer landing pages.
- Example: Collections that have more landing pages will inherently be able to score higher for the CollectionOrientation metric, because they can get points for having a content grid, but other page types can't. Maybe this isn't a design flaw, because landing pages themselves help orient the reader to a collection?
Doc collections
Without human curation to group pages into collections, the main data source we can use to assess the quality of related documentation is page structure. Categories are generally not used, or used very inconsistently, on our technical wikis. Page structure is neither a reliable nor a complete reflection of the content that exists for a given topic, technology, doc type, or developer workflow.
Doc collections provide a way for us to curate groupings of pages that reflect how people navigate and experience our documentation. (See more examples and thinking about doc collections from past projects.)
- For this project, we manually curated 5 collections as PagePiles so we could generate collection-level metrics. Without that type of curation effort, only page-level metrics would be possible.
- It's generally more difficult to analyze and extract meaning from only page-level metrics.
- Some key metrics indicators can only be assessed if we have collections:
- Consistency of page contents and structure across pages in the collection
- Findability: do navigation menus and/or links capture all pages in the collection, or are there orphan pages?
- Findability: do pages within a collection link to each other? If not, is this because of content sharding and duplication, lack of nav menu coverage, or some other reason? (See the sketch after this list.)
- Do all pages of a given doc type meet quality criteria for that type? E.g. tutorials have steps and ideally code samples; landing pages have contact info and links to the primary code repo, etc.
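As a sketch of the interlinking check above, outgoing links for each page in a curated collection can be pulled from the Action API (prop=links) and intersected with the collection's membership; the example titles below are illustrative, and pagination is omitted.

```python
# Sketch: given a curated collection (list of titles), check which pages link to
# other pages in the same collection, using the Action API (prop=links).
# A page with no in-collection links is a candidate orphan within the collection.
# Pagination and error handling are omitted; titles below are illustrative.
import requests

API = "https://www.mediawiki.org/w/api.php"

def outgoing_links(title: str) -> set[str]:
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": "max",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return {link["title"] for link in page.get("links", [])}

collection = ["Local development quickstart", "MediaWiki-Vagrant", "Cli/guide"]
for title in collection:
    linked = outgoing_links(title) & set(collection)
    print(f"{title}: links to {len(linked)} other page(s) in the collection")
```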
The test collections we curated based on a user journey / developer task and doc type scored lowest for the CollectionOrientation metric. This makes sense, because pages on mediawiki.org have mostly been curated into groupings based on the product or technology they describe, or the team that owns those systems, rather than these other organizing principles. The collection-level standardized scores did not consistently align with the human-assessed quality of the collections. We should only standardize scores if we have a larger volume of input data elements.
- For example: the ResourceLoader collection is generally in good condition, but its scores were mediocre. The REST API collection is high quality, and its score reflected that. The "Developing extensions" collection contains a larger number of pages with a range of quality, but it scored higher than the ResourceLoader collection and closer to the REST API collection.
Curating task-based collections of content, and navigation paths to and through them, is the type of work that technical writers are uniquely well-placed to undertake. It requires cross-product expertise and familiarity with both user journeys and the documentation landscape. However, it's also time-intensive and complex due to widely varying levels of page specificity/content coverage, and complicated one-to-many relationships between pages, collections, and developer tasks. Collections would be difficult to scale for a wiki-wide, automated implementation of doc metrics, but could be worthwhile for a few key developer journeys.
PagePiles are not likely to be a sustainable long-term method for us to curate documentation collections. Reasons:
- It's not really the intended use case for the tool.
- Changing the contents of collections requires merging PagePiles into a new one, with a new ID. So, no persistent identifiers for a collection, which means tracking collection metrics over time is extra-complicated.
- It's not simple to access and explore collection contents through other interfaces; ideally, collection membership should be stored as page metadata, or be accessible from the page itself.
Next steps
Metrics to implement as prototype
Doc attribute measurement | Type of signal | Metric |
---|---|---|
Section count | Content | Succinct |
Page size in bytes | Content | Succinct |
Percentage of page watchers who visited in the last 30 days | Traffic/edits (popularity) | Developers |
More than 50% of edits made by the top editor? (SPOF test) | Traffic/edits (popularity) | Developers |
More than one edit in the last 30 days? | Traffic/edits (popularity) | Developers |
Links to code repos from wiki pages | Content relevance | Developers |
Presence of code samples on page | Content relevance | Developers |
Incoming links from the same wiki | Content relevance | Developers |
Details and implementation notes for the prototype are at Wikimedia_Technical_Documentation_Team/Doc_metrics/Prototype.
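A compact sketch of how the prototype could combine the attributes listed above into page-level scores; the weights and thresholds here are placeholders, and the real choices belong in the Prototype doc linked above.

```python
# Compact sketch of combining the prototype attributes listed above into page
# scores. Weights and thresholds are placeholders; the real values belong in the
# Prototype doc linked above.

def succinct_score(page: dict) -> int:
    score = 0
    score += 10 if page["section_count"] >= 3 else 0
    score -= 10 if page["page_bytes"] > 50_000 else 0
    return score

def developers_score(page: dict) -> int:
    score = 0
    score += 10 if page["watcher_visit_pct"] >= 10 else 0
    score -= 10 if page["top_editor_share"] > 0.5 else 0   # SPOF test
    score += 10 if page["edits_last_30d"] > 1 else 0
    score += 10 if page["links_to_code_repos"] else 0
    score += 10 if page["has_code_samples"] else 0
    score += 10 if page["incoming_links"] > 0 else 0
    return score

example = {
    "section_count": 8, "page_bytes": 22_000, "watcher_visit_pct": 15,
    "top_editor_share": 0.4, "edits_last_30d": 3, "links_to_code_repos": True,
    "has_code_samples": True, "incoming_links": 12,
}
print(succinct_score(example), developers_score(example))  # 10 50
```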
Related follow-up work
- Pursue implementation of doc linting to help doc maintainers make pages more succinct and user-friendly.
- One opportunity in this area is the current warnings that MediaWiki gives editors for pages over a certain byte size. The message says "WARNING: This page is ## KB long; consider shortening it to better serve readers." Could that warning link to guidance that helps editors shorten pages and add structure?
- Define requirements for curating doc collections over time and clarify use cases. Investigate opportunities to create more collection-based navigation aids for key user journeys.
- Engage with the community on the topic of subpaging and best practices for grouping / structuring docs. Highlight the benefits and drawbacks of different approaches, then update the Style Guide and Docs Toolkit with recommendations.
- Consider implementing some standard categorization for navigation templates, and for pages that have a clear doc type, like tutorials. This metadata can enable some automated page content analysis that is otherwise currently infeasible.
- Continue to seek out opportunities to automate code samples, especially in tutorials, since we already have examples of that being feasible and working well. Consider updating Coding conventions pages to recommend best practices for connecting docs and code.