Wikimedia Technical Documentation Team/Doc metrics/Design


This page describes the design process and methodology the Tech Docs team used to define a draft set of v0 metric definitions for MediaWiki core technical documentation. This design process drew on findings from multiple types of research.

The main outcomes of this design phase were:

  • Map available data signals to quality dimensions.
  • Define v0 metrics to test on sample doc collections.

Requirements & metrics criteria


To select a set of metrics, TBurmeister_(WMF) has completed (or will complete) research, design, and outreach project phases to define metrics criteria and elicit requirements based on the following three areas:

  1. Project scope and team requirements
  2. Industry best practices and scientific literature, focusing on any commonly-accepted frameworks for measuring technical documentation quality
  3. Technical community goals and needs related to understanding and measuring the quality of technical documentation (both as doc users and doc maintainers).

The resulting criteria reflect, and were inspired by, the evaluation criteria WMF is using for its Essential Metrics and the criteria the WMF Research team defined in their work on Knowledge Gap Metrics.

Project requirements


Metrics should be succinct


We should limit ourselves to five or fewer metrics, so that the metrics are relatively easy to understand, compare, and maintain. This criterion also helps us avoid scope creep, since we must complete any metrics definition and implementation work during FY24-25.

Metrics should be universal and consistent


Regardless of the organizing principle we use to group technical documentation pages or files into collections, the same set of metrics should help us assess doc quality across the collections. In other words: metrics must only measure characteristics of technical documentation that are universally applicable across the different facets we might use to inspect groups of documentation (for example: collections based on topics, doc types, audiences, and workflows).

Metrics should not be overly specific. For example, we should not define a metric based on how many of our tutorials have runnable code. Even though that's a useful piece of information (and we should fix tutorials that don't have runnable code!), it doesn't apply to all types of documentation, and not all audiences or workflows require tutorials.

Metrics should be relevant and actionable


Metrics should reflect actual doc quality, not just things we happen to be able to measure about docs. When we know we've improved a doc or a collection, the metrics should reflect that.

"A raw metric is close enough to the system so that it can reflect an actual core aspect of it. It doesn’t use complex formulas. A cooked metric tends to ignore the underlying system to reflect a business concept, which needs a complex formula to be expressed." - User:MForns_(WMF) in Metrics from Below presentation

Metrics are a signal, not a goal


Most documentation metrics can have different interpretations based on context. We can't draw reliable conclusions only based on one type of data, or one type of metric.

"To properly interpret metrics, you need to know the content. For example, low time-on-page could mean that the search serves the wrong page… or that the answer to the most common question is in the first sentence. Metrics also can’t tell you whether a problem is due to navigation or content…You can get actionable information for improving documentation without analytics or metrics." - Write the Docs

"Often, there is not a one-to-one correlation between a metric and a specific [doc] change, but rather numerous metrics inform a change or an overarching goal, such as delivering the right content to the right user at the right time."[1]

We want our work to be driven by user outcomes, not by metrics themselves. When we say "drive docs work with data", perhaps we really mean "validate docs work and support docs decisions with data", not "only go where quantitative data tells us to go". The primary user outcomes for these metrics revolve around enabling efficient decision-making around technical documentation efforts:

  • Anyone (not just tech writers) can use metrics to help identify doc collections that may need improvement, and what type of improvements are most needed.
    • This helps validate that we're working on what needs doing, and supports decisions about where to allocate time and effort.
  • Anyone (not just tech writers) can use metrics to assess whether doc collections we've worked on are showing improvement.
    • This helps validate that what we're doing is working, and helps us assess the impact of completed work (retrospective).
  • Tech writers, code maintainers, and product owners will be better able to understand:
    • Are our docs improving in ways that matter for our community and users?
    • Are our docs up-to-date / correct / evolving in step with the software they document, or are they behind?
    • Are our docs meeting users' needs?
    • How are our docs impacting developer (and maintainer) experience?

Most importantly, we have to remember that we're trying to analyze a massively complex system, and we'll likely only be able to capture small glimpses of the full picture, even if we can verify some types of progress towards our documentation goals.

Industry and academic literature criteria


Strimling (2019) was one of the few academic sources TBurmeister reviewed that explicitly addressed criteria for tech docs metrics:

  • "The definition must be from the readers’ point of view: Because it is the readers alone who determine if the document we give them is high quality or not, any definition of documentation quality must come from the readers’ perspective. Writers can come up with any number of quality attributes that they think are important, but, at the end of the day, what they think is not as important as what the readers think."
  • "The definition must be clear and unequivocal: Both readers and writers have to “be on the same page” when it comes to what makes a document high quality. Misunderstandings of what readers actually want from the documentation are a recipe for unhappy readers."
  • "The definition must cover all possible aspects of quality: Quality is a multidimensional concept, and we must be sure that any attempt to define it is as comprehensive as possible. A definition that emphasizes one dimension over another, or leaves one out altogether, cannot be considered to be a usable definition."
  • "The definition must have solid empirical backing: To be considered a valid definition of documentation quality, serious research must be done to give it the proper theoretical underpinnings. Years of experience or anecdotal evidence can act as a starting point, but if we are serious about our professionalism and our documentation, we need more."[2]

Community criteria


[We will continue to gather community criteria during a more robust feedback cycle planned for November-December 2024, so this section is incomplete]

First, TBurmeister reviewed user surveys, past feedback, and documentation of prior work and community conversations around measuring doc health. Then, TBurmeister had exploratory conversations to try to understand technical community members' priorities and mindset in approaching documentation metrics as a tool, and as a quality indicator. She asked questions like:

  • As a developer of MediaWiki software, who has worked on or written many pages of documentation: how would you tell an enthusiastic new doc contributor to direct their efforts to have impact? How do you know which documentation needs work?
  • If you woke up one day feeling like improving docs, how would you decide what to work on? What data would you consider?

[This phase is ongoing, so this section is incomplete]

Metrics selection process


Disclaimer: This process was more scrappy than rigorously scientific, though it was scientific enough to generate a useful analysis and (seemingly!) valid data. Some of the data elements and doc characteristics were expanded or combined after the initial metrics categorization activity, and some of the mappings between data elements, doc characteristics, and metrics categories were necessarily fuzzy. This is a first best effort to see what we can learn, and what might be useful!

Compile list of technical documentation characteristics


TBurmeister created a list of documentation-related content and code characteristics, drawn from a variety of sources.

Map doc characteristics to metrics categories


TBurmeister designed a card-sorting activity where members of the Tech Docs team independently grouped documentation characteristics into metrics categories. We started with a seed set of four example metrics categories, but each participant could (and did) add additional categories. Each doc characteristic could be placed into more than one metrics category. For example:

  • If docs are organized alongside related pages, they are more likely to be findable.
  • If docs avoid duplication, they are more likely to be usable and maintainable.

→ See the full list of technical documentation characteristics (Google spreadsheet) we used in our design exercise.
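For illustration, one participant's card sort can be thought of as a mapping from each doc characteristic to the metrics categories they placed it in, with some characteristics landing in more than one category. The sketch below is hypothetical (it is not the actual tooling used for the exercise); the first two placements mirror the examples above, and the rest are illustrative assumptions.

```python
from collections import Counter

# One participant's (hypothetical) card sort: doc characteristic -> metrics categories.
# A characteristic can be placed into more than one category.
card_sort = {
    "Organized alongside related pages": ["Findable"],
    "Avoid duplication": ["Usable", "Maintainable"],
    "Avoid walls of text (succinctness)": ["Usable", "Scannable"],
    "Update when code changes": ["Accurate", "Maintainable"],
}

# Tally how many characteristics this participant placed in each category.
category_counts = Counter(
    category for categories in card_sort.values() for category in categories
)
print(category_counts.most_common())
```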

Map data elements to doc characteristics


After compiling the list of doc characteristics, TBurmeister mapped low-level content and code data elements (the raw data we can measure) to these characteristics. The example below illustrates these mappings. Note how the same raw data element can be an indicator for different types of documentation characteristics, depending on its aspect:

Raw data element | Data aspect | Documentation characteristic(s)
Navigation template | Is there a navigation template on the page? | Orient the reader within a collection; Organized alongside related pages
Navigation template | How long is the navigation template? | Avoid walls of text (succinctness)
Incoming links | Are there many incoming links from other wiki pages? | Avoid duplication; Use consistent organization and structure
Incoming links | Are there very few incoming links from other wiki pages? | Align with real developer tasks and problems (relevance); Use consistent organization and structure
Incoming links | Are there any incoming links from code repos? | Align with real developer tasks and problems (relevance); Update when code changes

→ See the full mapping here (Google spreadsheet).
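One way to represent these mapping rows programmatically is as a flat list of (data element, aspect, characteristic) records, mirroring the spreadsheet layout. This is a hypothetical sketch, not the team's actual tooling; the class and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MappingRow:
    """One row of the mapping: a raw data element, the aspect of it being
    measured, and the documentation characteristic it signals."""
    data_element: str
    aspect: str
    characteristic: str

# A few rows from the example table above.
mappings = [
    MappingRow("Navigation template", "Is there a navigation template on the page?",
               "Orient the reader within a collection"),
    MappingRow("Navigation template", "How long is the navigation template?",
               "Avoid walls of text (succinctness)"),
    MappingRow("Incoming links", "Are there any incoming links from code repos?",
               "Update when code changes"),
]

# The same raw data element can signal different characteristics,
# depending on which aspect of it we measure.
for row in mappings:
    print(f"{row.data_element} [{row.aspect}] -> {row.characteristic}")
```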

Score the metrics mappings for each doc characteristic


Our card-sorting exercise helped us align each data element + doc characteristic mapping to a metric, but this is where things got interesting:

  • The card-sorting activity gave us 9 metrics categories (too many!).
  • We didn't necessarily agree on which data elements were the most salient indicators for a given metric.
  • Some data elements are easy to measure, but aren't a very strong signal for doc characteristics we care about.
  • Some data elements might be a great signal, but are too hard to measure.
  • Some data elements are relevant for so many doc characteristics, they seem to lose their salience for metrics.
  • Some data elements have many different aspects, each of which may be relevant for a given metric or doc characteristic in different ways.
    • Example: the use of tables to format information can be a positive indicator for consistent formatting, and making written content more succinct. However, tables can be a negative indicator for usability of pages on mobile.
    • The way TBurmeister dealt with this complexity resulted in a methodological flaw: some really important doc characteristics are underrepresented in the data if they only have one aspect or interpretation.
  • Most metrics relate to each other, and some might be directly correlated.

Analyze and refine initial metrics


To make some sense of this beautiful complexity, TBurmeister did the following:

  • Scored the number of times each data element + doc characteristic pair was categorized into each metric category.
  • Dropped one metrics category ("Assessable") that had fewer than 3 data elements coded into it.
  • Analyzed the data elements coded across the remaining 8 metrics categories (a sketch of this tallying appears below).
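The tally itself is straightforward to reproduce. Below is a minimal, hypothetical sketch: the CSV file name and column names are assumptions, not the team's actual spreadsheet layout.

```python
import csv
from collections import Counter, defaultdict

# Hypothetical export of the coding spreadsheet: one row per coded judgment.
# Assumed columns: data_element, aspect, characteristic, metric_category
with open("card_sort_coding.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Score: how many data element + characteristic rows were coded into each category?
pair_counts = Counter(row["metric_category"] for row in rows)

# Which distinct data elements were coded into each category?
elements_per_category = defaultdict(set)
for row in rows:
    elements_per_category[row["metric_category"]].add(row["data_element"])

# Drop categories with fewer than 3 data elements coded into them (e.g. "Assessable").
MIN_ELEMENTS = 3
kept = {cat for cat, elements in elements_per_category.items() if len(elements) >= MIN_ELEMENTS}
dropped = sorted(set(elements_per_category) - kept)

print("Coded rows per category:", dict(pair_counts.most_common()))
print("Kept:", sorted(kept), "| Dropped as too sparse:", dropped)
```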
 

Eliminate outliers


The categories that make up the smallest percentages in the data were:

  • Scannable
  • Complete
  • Inclusive

Because Complete and Inclusive appear in many of the sources researched, TBurmeister didn't want to eliminate them without closer analysis. She was comfortable eliminating Scannable after confirming that the data elements coded there are covered by other categories. "Scannable" may be more of a doc characteristic than a metric category, but as long as we're measuring the underlying aspects of content that make it scannable (for example, as part of elements like "consistent structure and format" and "avoid walls of text"), that should be sufficient.

Analyze co-occurrence


Since one of our criteria for these metrics is that they should be succinct (limited to 5 or fewer), the next steps were to look for overlaps and co-occurrence of the data elements across metric categories. We don't need to define different metrics for measuring the same set of underlying, raw data. However, before we eliminate categories like Scannable, Complete, and Inclusive, we should understand how the data elements we mapped to them are (or are not) represented in the other metrics categories.

The following table shows the likely correlations or co-occurrences of data elements across metrics categories, based on counts of the number of times a data element coded in one category was also coded in another category (a sketch of how such counts can be computed follows the table):

(metrics category) | Relevant | Inclusive | Scannable | Usable | Findable | Accurate | Complete | Maintainable
Relevant | — | 0 | 3 | 12 | 24 | 23 | 14 | 15
Inclusive | 0 | — | 0 | 22 | 0 | 0 | 0 | 7
Scannable | 3 | 0 | — | 13 | 7 | 3 | 6 | 9
Usable | 12 | 22 | 13 | — | 20 | 14 | 12 | 38
Findable | 24 | 0 | 7 | 20 | — | 17 | 17 | 25
Accurate | 23 | 0 | 3 | 14 | 17 | — | 14 | 16
Complete | 14 | 0 | 6 | 12 | 17 | 14 | — | 18
Maintainable | 15 | 7 | 9 | 38 | 25 | 16 | 18 | —

Tip: You can see a color-coded version of the above chart in the Google spreadsheet.
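As a rough illustration, counts like these can be produced by checking, for each data element, which pairs of categories it was coded into. The sketch below assumes the same hypothetical CSV export as the earlier sketch; the team's actual spreadsheet calculations may count slightly differently (for example, per data element + aspect row rather than per unique data element).

```python
import csv
from collections import defaultdict
from itertools import combinations

# Hypothetical export: one row per coded judgment.
categories_per_element = defaultdict(set)   # data element -> metric categories it was coded into
with open("card_sort_coding.csv", newline="") as f:
    for row in csv.DictReader(f):
        categories_per_element[row["data_element"]].add(row["metric_category"])

# Count how often two categories share a data element.
co_occurrence = defaultdict(int)
for categories in categories_per_element.values():
    for a, b in combinations(sorted(categories), 2):
        co_occurrence[(a, b)] += 1

# Highest-overlap category pairs first.
for (a, b), count in sorted(co_occurrence.items(), key=lambda item: -item[1]):
    print(f"{a} x {b}: {count}")
```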

Analysis: "Maintainable"

Based on the high scores for Maintainable x Usable (38) and Maintainable x Findable (25), TBurmeister concluded that measuring Usable and Findable should also cover Maintainable: if we make the docs more usable and findable, they'll also be more maintainable. In the future, we can validate this assertion by comparing changes in Usability and Findability metrics with trends in revision data.

The next highest scores are for Findable x Relevant, and Accurate x Relevant. These are harder to interpret without looking at how we coded individual data elements:

Analysis: "Findable" vs "Relevant"

Similarity score: 3.875

Data elements relating to page revisions and incoming links account for most of the overlap.

  • Incoming links: we can account for this by considering that people will usually ensure that relevant content is findable by adding links to it, including in code files and navigation templates.
  • Page revisions: Frequency and recency of edits were coded as potential indicators of both findability and relevance. This makes sense, because people can't edit what they can't find, and they probably won't edit content that's irrelevant.

Overall, these two metrics categories don't seem similar enough to justify eliminating one of them without further investigation.

Analysis: "Accurate" vs "Relevant"

Similarity score: 1.375

Doc characteristic | SUM of Accurate | SUM of Relevant
Freshness | 27 | 20
Align with real developer tasks and problems | 6 | 19
Connection to code | 7 | 3

"Number of unique & repeat editors" is the data element coded most strongly for both of these metrics categories. The only data element coded higher for "Relevant" is "Code samples". The popularity of revision-related data elements for these metrics reflects a couple assumptions:

  • People are more likely to edit content that is relevant to them, so that content is more likely to be accurate.
  • If docs are updated, they will be accurate, and if they're accurate, they will be relevant.

Adding in the layer of documentation characteristics to the analysis, it's unsurprising that "Freshness" is a salient doc characteristic for both categories (Accurate (27); Relevant (20)). The data elements most associated with Freshness for these metrics categories are listed below (a sketch of how the revision-based signals could be pulled from the wiki follows the list):

Editing-related elements:
  • Time since original publishing
  • Number of unique & repeat editors
  • Last revision date
  • Frequency of edits
  • Number of edits in recent time
  • # of active editors in given time period
Content-related elements:
  • Draft (Template)
  • Historical, Outdated, Archived (Template)
  • Lint errors in generated docs
  • TODO or FIXME in text
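Most of the editing-related elements above can be derived from revision metadata that the MediaWiki Action API already exposes (action=query with prop=revisions). The sketch below is illustrative only: the sample size, the derived fields, and their exact definitions are assumptions, not the team's agreed metric definitions.

```python
from datetime import datetime, timezone
import requests

API = "https://www.mediawiki.org/w/api.php"

def freshness_signals(title, sample_size=50):
    """Fetch recent revision metadata for a wiki page and derive a few
    of the editing-related freshness signals listed above."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user",
        "rvlimit": sample_size,
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    revisions = page.get("revisions", [])
    timestamps = [datetime.fromisoformat(r["timestamp"].replace("Z", "+00:00"))
                  for r in revisions]
    now = datetime.now(timezone.utc)
    return {
        "last_revision_date": max(timestamps).date().isoformat() if timestamps else None,
        "days_since_last_edit": (now - max(timestamps)).days if timestamps else None,
        "edits_in_sample": len(revisions),
        "unique_editors_in_sample": len({r["user"] for r in revisions if "user" in r}),
    }

print(freshness_signals("Wikimedia Technical Documentation Team/Doc metrics"))
```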

The next highest-ranked doc characteristics, and their corresponding data elements, are:

  • Align with real developer tasks and problems (Accurate (6); Relevant (19)):
    • Code sample
    • Outgoing links to wikis (few)
    • Number of unique & repeat editors
    • Most revisions
  • Connection to code (Accurate (7); Relevant (3)):
    • Links to code repos from wiki pages, and vice versa

For tables highlighting this data, see Appendix: Tech Docs Metrics, or play around with the raw data in a pivot table on the "Relevant v Accurate" tab in the Google spreadsheet.

Conclusions:

  • Revision metadata and code elements are the most useful for measuring Accuracy and Relevance.  
    • Code samples make docs more Relevant
    • Links to/from code reflect Accuracy more than Relevance
  • Lint errors in auto-generated docs, or templates like Draft, Historical, etc. on-wiki, are the best content elements for measuring Freshness. But since we have a number of easily-accessible revision data sources, we should probably just rely on that data, or consider combining the two types of data.
  • While Freshness is the doc characteristic most strongly correlated with Accuracy, we can't use it as a substitute. Docs may be frequently updated by bots, for translation, or various other reasons that don't actually correspond to making their content accurate, correct, and complete.

Analysis: "Inclusive", "Usable"

The Inclusive metric category shows the strongest overlap with the Usable category, and only small overlap with the Maintainable category. Since Usable and Maintainable overlap strongly, it may be safe to eliminate both Inclusive and Maintainable as categories. Below is a summary of the top data elements coded both for Inclusive and Usable.

Data element | Data aspect | Documentation characteristic(s) | Metrics categories
Headings | How long are the headings? | Use consistent format and style | Inclusive, Usable, Maintainable
Headings | How long are the headings? | Avoid walls of text (succinctness) | Inclusive, Usable
Headings | How long are the headings? | Are translatable / translated | Inclusive, Usable
Headings | How many headings are on the page? | Use consistent organization and structure | Scannable, Usable, Complete, Maintainable
Headings | How deep is the heading structure? | Use consistent organization and structure | Scannable, Usable, Complete, Maintainable
Headings | How deep is the heading structure? | Avoid walls of text (succinctness) | Inclusive, Usable, Maintainable

It seems that for the headings data element, the correlation between Usable and Inclusive is aligned with the documentation characteristic of avoiding walls of text and being succinct, which makes the content easier to read, easier to skim, and places less cognitive load on the reader (and translator!). The quantity and depth of headings are more related to doc structure, which – if deployed consistently – makes docs more scannable and easier to maintain, because it's easier to quickly assess the information they contain.


The above analysis led to the observation that, despite overlapping raw data elements, certain documentation characteristics seemed more salient for the Inclusive and Usable metrics categories. The table below shows the top documentation characteristics for Inclusive (the rows with nonzero Inclusive counts), sorted in descending order by Usable:

Top documentation characteristics for Inclusive, sorted in descending order by top characteristics for Usable
Doc characteristic | SUM of Inclusive | SUM of Usable
Are succinct; avoid walls of text | 5 | 41
Use consistent organization and structure | 0 | 20
Are readable on mobile devices, with all information visible | 6 | 18
Use consistent format and style | 5 | 16
Orient the reader within a collection; Are organized alongside related pages | 0 | 11
Are translatable / translated | 8 | 10
Freshness | 0 | 10
Avoid duplication | 0 | 9
Align with real developer tasks and problems | 0 | 9
Follow accessibility guidelines | 4 | 6
Provide working examples | 0 | 2
Provide information in varied formats (images, text, diagrams) | 0 | 2
Use terminology that aligns with user search terms | 0 | 1
Implement content types (i.e. Diataxis) | 0 | 1
Connection to code | 0 | 1
Grand Total | 26 | 157

Based on the above, TBurmeister concluded that, as long as we make sure to include the most salient doc characteristics for Inclusive (those with nonzero Inclusive counts in the table above), along with their raw data elements, in our Usable metric, we can safely say that Usable encompasses Inclusive, without needing a separate metric.

Methodological shortcomings


This process was not meant to be scientific, though TBurmeister tried to use some score-based weighting to determine alignment between raw data elements and doc characteristics, and how those correspond to high-level metrics categories.

The line between what is a documentation characteristic and what is a metric category is sometimes neither clear nor useful. It seemed necessary to have an additional layer of abstraction between the raw data elements and the high-level metrics, which align more with strategic goals, like "make the docs more accurate". However, certain areas, like freshness vs. accuracy, showed doc characteristics so closely aligned with a high-level metric that the distinction and the categorization of individual concepts came into question. So, this isn't the most perfect of ontologies, but it does allow us to make a bit more sense of what would otherwise be an overwhelming gap between low-level, raw data and high-level, abstract concepts.

One specific mistake / flaw that TBurmeister realized too late in the process: some data elements have many different aspects, each of which may be relevant for a given metric or doc characteristic in different ways. In the coding process, each data element + aspect pair had its own line. As a result, data elements that can be interpreted multiple ways, or that are relevant for different doc characteristics, are overrepresented in the data. This also means that some really important raw data elements, and the doc characteristics they reflect, are underrepresented in the data, just because they're relatively straightforward and have only one aspect. A clarifying example (with a small sketch after it):

  • "Headings" as a raw data element has 6 rows in the coding spreadsheet: one for each data element + aspect + doc characteristic triple.
  • A page containing a "Draft" template is a raw data element that has only 1 row in the coding spreadsheet, despite being a really useful signal of doc freshness. Because it is single-valued and aligns (strongly!) with just one doc characteristic, the system TBurmeister used to map and weight all the different data points undervalues it. ☹️
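This imbalance is easy to surface from the coding data itself: counting rows per raw data element shows which elements accumulate weight simply because they have many aspects. (Same hypothetical CSV export as in the earlier sketches.)

```python
import csv
from collections import Counter

with open("card_sort_coding.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Rows per raw data element: multi-aspect elements (e.g. "Headings") accumulate
# many rows, while single-aspect elements (e.g. a "Draft" template) get just one,
# regardless of how strong a signal they actually are.
rows_per_element = Counter(row["data_element"] for row in rows)
for element, count in rows_per_element.most_common():
    print(f"{element}: {count} row(s)")
```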

Conclusions and discussion

  • Eliminate "Maintainable" metric category. Plan future work to compare trends in Usability and Findability metrics with trends in revision data to confirm that docs are more maintainable as they become more usable and findable.
  • Keep Relevant, Accurate, but exclude Complete.
  • Revision metadata and code elements are the most useful for measuring Accuracy and Relevance.  
    • Code samples make docs more Relevant
    • Links to/from code reflect Accuracy more than Relevance

We can't only measure revisions / freshness.

  • Lint errors in auto-generated docs, or templates like Draft, Historical, etc. on-wiki, are the best content elements for measuring Freshness. But since we have a number of easily-accessible revision data sources, we should probably just rely on that data, or consider combining the two types of data.
  • While Freshness is the doc characteristic most strongly correlated with Accuracy, we can't use it as a substitute for Accuracy. Docs may be frequently updated by bots, for translation, or various other reasons that don't actually correspond to making their content accurate, correct, and complete.
  • TODO: can we pick one or two content data elements to compare with revision data, to get an idea of how much divergence there is here, and/or whether that divergence is limited to certain collections? (See the sketch below for one way to pull these content signals.)
  • Accuracy is hard to measure. The closest proxy we have is Freshness.
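One lightweight way to approach the TODO above is to scan page wikitext for content-side staleness markers and compare them against the revision-based freshness signals. The sketch below uses the MediaWiki Action API's parse endpoint; the template names and regexes are examples, not a vetted list, and the exact comparison method is left open.

```python
import re
import requests

API = "https://www.mediawiki.org/w/api.php"

# Content-side staleness signals to compare against revision recency.
# Template names here are illustrative; the exact templates vary by wiki.
STALENESS_PATTERNS = {
    "draft_template": re.compile(r"\{\{\s*Draft", re.IGNORECASE),
    "historical_or_outdated_template": re.compile(r"\{\{\s*(Historical|Outdated|Archived)", re.IGNORECASE),
    "todo_or_fixme": re.compile(r"\b(TODO|FIXME)\b"),
}

def content_staleness_signals(title):
    """Fetch a page's wikitext and flag simple content-level staleness markers."""
    params = {"action": "parse", "page": title, "prop": "wikitext", "format": "json"}
    wikitext = requests.get(API, params=params).json()["parse"]["wikitext"]["*"]
    return {name: bool(pattern.search(wikitext)) for name, pattern in STALENESS_PATTERNS.items()}

# These flags could then be compared against the revision-based freshness
# signals to see where "recently edited" and "content looks stale" diverge.
print(content_staleness_signals("Wikimedia Technical Documentation Team/Doc metrics"))
```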

Readability scores didn't make the cut! But doc characteristics that relate to them, like succinctness and consistent style, did.

"Findability" sometimes seemed to refer to the findability of a page in general, and other times it referred to the findability of information within a page or within a collection. For example, headings were coded as relevant for Findability, which makes sense if we consider how headings are part of consistent page structure, which makes information easier to locate within a page. But, unlike menus or even page titles, headings have minimal impact on whether the page itself is findable.

CSS and tables: their value as signals depends on the details and the context. They may be too hard to use, and the added complexity of being neither a purely negative nor a purely positive indicator adds to the cost.

Proposed v0 metrics and data to investigate


Because this page is already so long, this juicy content is at Doc_metrics/v0.

References

  1. Gilhooly, Megan (2020). "Product answers: Engineering content for freshness". Intercom (Jan. 2020). https://www.stc.org/intercom/2020/01/product-answers-engineering-content-for-freshness
  2. Strimling, Yoel (2019). "Beyond Accuracy: What Documentation Quality Means to Readers". Technical Communication (Washington) 66: 7–29.

Data and additional tables


Appendix (Google doc)

Raw data, charts, graphs (Google spreadsheet)