User:TBurmeister (WMF)/Measuring page length

This page gathers resources and describes what I learned when I discovered that it's not so straightforward to quantify the amount of text on a wiki page.

Motivation and context

My use case is to assess the readability and usability of written content in our technical documentation, as part of the Doc metrics project. I'm interested in the number of characters and words on a page, and in the structure of the page (number of sections), because those content attributes are strong indicators of readability and quality[1].

In trying to define how to score docs for a "page length" metric (e.g. if Page_length < 20k bytes then 1, else 0), I realized that bytes as a measure of page length were inconsistent in ways I hadn't expected. I had a misconception that "page size" would refer to how a human reader would assess the amount of content on the page, not the size in bytes of the page as stored in a database. Realizing the difference between "page size" and "page length" led me to question whether page size in bytes is actually a reliable proxy for measuring the amount of content on a page, and/or the length of the page.
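As a minimal sketch of the kind of threshold score described above: the function name is hypothetical, and reading "20k" as 20,000 bytes is an assumption rather than a settled metric definition.

```python
# Hypothetical binary "page length" score.
# Assumption: "20k bytes" is read as 20,000 bytes.
PAGE_LENGTH_THRESHOLD_BYTES = 20_000


def page_length_score(page_length_bytes: int) -> int:
    """Return 1 if the page is under the threshold, else 0."""
    return 1 if page_length_bytes < PAGE_LENGTH_THRESHOLD_BYTES else 0


# Byte sizes taken from the comparison tables later on this page.
print(page_length_score(9_861))   # API:REST_API
print(page_length_score(56_623))  # ResourceLoader/Architecture
```

A score like this is only as meaningful as its input, which is why the rest of this page looks at what "page length in bytes" actually measures.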

What I investigated

Which data sources provide data for page size or page length?

Do the available data sources align or diverge in their calculations of page size / page length?

How do different types of page content and formatting influence page size and page length measurements? Are these interactions consistent or unpredictable?

Do available data sources reflect how a human would assess the length of a page? Does the data (whether bytes, number of sections, number of characters, or whatever) actually correlate with whether a page is "text-heavy" or "very long"?

Examples of page size in bytes not reflecting actual page length

Since certain types of technical content (like Reference pages) are often formatted in a way that generates longer pages, I didn't want to compare reference content with non-reference content. In general, it's acceptable for reference docs to be long, because we assume that users will employ a targeted search or skimming approach to quickly locate the information they need. At first, I thought these content type differences could account for most discrepancies, but then I found other examples that weren't reference docs:

  • Manual:FAQ is significantly longer than Manual:Hooks. As one measure: the "print version" link for each page generates 38 print pages for Manual:FAQ, and 23 print pages for Manual:Hooks. However, Manual:Hooks is larger in bytes (177k) than Manual:FAQ (89k).
  • API:REST API/Reference is 59 print pages and 65k bytes.

This required me to clarify the requirements of my use case: did it matter that bytes and characters don't have a 1-to-1 correlation for some non-English languages, or that (in general) the byte length of a string may not match what a human would expect? Did it matter that some data sources didn't count lists or tables as part of the "text" in measuring page length?

Ultimately, my goal is to identify which pages in a collection of technical documentation have "walls of text", and/or a structure that is so lengthy or complex that it's likely to impact developer experience. So, it might actually be acceptable for me to use a data source like wikipedia:Wikipedia:Prosesize, which excludes non-paragraph content from its calculation of "what is prose". Lists are a good way to add structure to content, and to break up walls of text. So, a page length measurement that excludes them could be acceptable, if my main goal is to find unstructured, large chunks of prose.

However, that leaves out the many instances of code examples that are an essential part of technical documentation. Should code samples be included in how we assess the length of a page? Their formatting usually causes pages to be longer, even though the amount of content on each line is much less than that of prose. Do long code samples hinder readability just as much as long paragraphs? Is there a line limit at which we should instead just link to example files stored in source control, instead of putting code in wiki pages? More importantly (for this project): how do existing data sources for measuring page size and length handle the presence of code samples? Is it different enough that this should impact which data sources we use for doc metrics?

Since most of our technical documentation is in English, the issue of byte length not reflecting character counts for some languages is not always a concern. However, on mediawiki.org translations are part of our technical documentation, so this issue is relevant in that context.

Regardless of the language the page is written in: byte size doesn't always accurately reflect complexity or length of a page if it uses many templates or other fancy wikitext to generate textual content.

Ideally, I'd like to leave all of this calculation of doc quality based on content attributes up to an ML model, but that isn't yet feasible[2]. At the very least, this deep dive has deepened my understanding of which content features would be relevant, if we were to design/train a content quality model specifically for assessing technical documentation.

Types of page length measurements

Page length as bytes

This is the standard method of measuring page length in most MediaWiki built-in metrics tools and the corresponding database tables. Consequently, page length in bytes is available in a variety of different tools and dashboards.
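One way to read this value programmatically is the Action API's `prop=info` query, which returns a `length` field in bytes. The sketch below builds such a request URL and parses a response. The sample response is hand-written and trimmed for illustration (it is not real API output), and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def build_info_url(title: str) -> str:
    """Build an Action API URL that requests basic page info for one title."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def page_length_from_response(response_text: str) -> int:
    """Extract the byte length of the first page in a prop=info response."""
    data = json.loads(response_text)
    return data["query"]["pages"][0]["length"]


# Hand-written sample shaped like a formatversion=2 response (trimmed).
sample_response = json.dumps(
    {"query": {"pages": [{"title": "API:REST API", "length": 9861}]}}
)

print(build_info_url("API:REST API"))
print(page_length_from_response(sample_response))
```

The parsing is kept separate from the network call so that it can be checked without hitting the live wiki.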

Complications of page length as bytes

  • Not all languages have a one-to-one correspondence between characters and bytes like English does. This has real implications for languages like Hebrew and Russian. See, for example, T275319: Change $wgMaxArticleSize limit from byte-based to character-based.
  • Bytes capture the wikitext that generates various types of page elements that may not correspond to text. For example, a page with many different types of non-text elements, like TODO example, can have just as many bytes as a very text-heavy page, like TODO example.
    • Pages with many code samples: compare API:Nearby places viewer (22,058 page bytes, printable version = 8 pages, 2 images, 2 buttons, more code than text) with Writing an extension for deployment (26,202 bytes, printable version = 6 pages, 1 slides PDF, 2 form input fields, mostly text, no code). The latter, a shorter but more text-heavy and dense page, has a larger page size in bytes than the tutorial, which is longer (even with sections collapsed) but contains mostly code samples.
    • Pages with transclusions or templates:
      • Manual:FAQ is significantly longer than Manual:Hooks. As one measure: the "print version" link for each page generates 38 print pages for Manual:FAQ and 23 print pages for Manual:Hooks. However, Manual:Hooks is larger in bytes (177k) than Manual:FAQ (89k).
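The first complication above, characters versus bytes, can be illustrated with a short sketch. It assumes UTF-8 encoding, and the sample words are arbitrary.

```python
def char_and_byte_counts(text: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


samples = {
    "English": "Hello",
    "Russian": "Привет",
    "Hebrew": "שלום",
}

for language, text in samples.items():
    chars, utf8_bytes = char_and_byte_counts(text)
    print(f"{language}: {chars} characters, {utf8_bytes} bytes")
```

Basic Latin letters take one byte each in UTF-8, while Cyrillic and Hebrew letters take two, so the same byte budget holds roughly half as many characters in those scripts.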

Tentative conclusion: We need a measure of page length that corresponds to the content as the human reader experiences it when viewing the page, which requires using the rendered HTML.
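A minimal sketch of what measuring the rendered HTML could look like, using only Python's standard library. This is an illustration, not how any of the tools discussed on this page work. It counts every visible text node (headings, paragraphs, lists, code), and the sample HTML is made up.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def rendered_text_stats(html: str) -> dict:
    """Return character and word counts for all visible text in the HTML."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so layout-only newlines are not counted.
    text = " ".join(" ".join(parser.chunks).split())
    return {"chars": len(text), "words": len(text.split())}


sample_html = """
<h2>Setup</h2>
<p>Run the example script.</p>
<pre>print("hello")</pre>
<ul><li>Check the output</li></ul>
<style>.note { color: red; }</style>
"""

print(rendered_text_stats(sample_html))
```

A real measurement would also need to decide what to do with navigation, tables of contents, and reference lists, which this sketch counts like any other text.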

Page length as prose size

XTools - XTools/Page History displays "prose size" in bytes, along with the number of characters and words in the page sections the tool considers to be "prose".

prosesize:

  • Based on Parsoid HTML for each wiki page.
  • "counts the text within <p> tags in the HTML source of the document, which corresponds almost exactly to the definition of "readable prose". This method is not perfect, however, and may include text which isn't prose, or exclude text which is (e.g. in {{cquote}}, or prose written in bullet-point form)."[4]
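To make the "text within <p> tags" idea concrete, here is a rough sketch in Python. It is not the prosesize implementation. It assumes well-formed HTML in which every <p> is explicitly closed, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect only the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._p_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._p_depth += 1

    def handle_endtag(self, tag):
        # Assumption: every <p> has a matching </p>.
        if tag == "p" and self._p_depth > 0:
            self._p_depth -= 1

    def handle_data(self, data):
        if self._p_depth > 0:
            self.chunks.append(data)


def prose_text(html: str) -> str:
    """Return whitespace-normalized text found inside <p> elements."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def prose_stats(html: str) -> dict:
    """Return byte, character, and word counts for paragraph text only."""
    text = prose_text(html)
    return {
        "bytes": len(text.encode("utf-8")),
        "chars": len(text),
        "words": len(text.split()),
    }


sample_html = """
<p>Extensions add features to MediaWiki.</p>
<ul><li>This list item is skipped.</li></ul>
<pre>echo "code is skipped too";</pre>
<p>Each one has a setup file.</p>
"""

print(prose_stats(sample_html))
```

In this sample the list item and the code block contribute nothing to the count, which is exactly the behavior that matters for list-heavy and code-heavy technical docs.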

Complications of using prose size for understanding page length

  • Different tools and data sources use varying definitions for what is "text" on a page, and they may vary in how they parse the wikitext or HTML. As a result, different data sources report different numbers for character or word counts, depending on their parsing strategy and on how they identify what counts as "text" to be measured.
    • Example: Wikimedia tutorials actually has 264 words if we count all the text on the page, but prosesize and XTools capture only 7 words because they exclude the layout grid and icon text.

Number of sections

XTools: XTools/Page History includes in its "prose" section a count of the number of sections in the page, but it's unclear from the docs whether that is limited to just the "prose" sections of the page, like the character and word count metrics, or if it's just in this section for convenience and not actually limited to those same "prose-only" constraints.

  • Section count includes the lead section, so all pages will have at least one section.
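A rough sketch of counting sections from wikitext, with the lead section included as described above. This is an assumption about how such a count could be computed, not XTools' actual method, and it would miscount heading-like lines inside code blocks.

```python
import re

# Matches wikitext heading lines such as "== Title ==" (levels 1 to 6).
HEADING_RE = re.compile(r"^(={1,6})(.+?)\1[ \t]*$", re.MULTILINE)


def count_sections(wikitext: str) -> int:
    """Count heading lines, plus one for the lead section."""
    headings = sum(1 for _ in HEADING_RE.finditer(wikitext))
    return headings + 1


sample_wikitext = """Lead paragraph.

== Usage ==
Text.

=== Options ===
More text.

== See also ==
"""

print(count_sections(sample_wikitext))
```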

Example comparing different measurements

This section uses real tech docs to explore the differences between how these tools compute their statistics, and which data points are available.

In the tables below, "prose" refers to the definitions used by XTools and prosesize to define what counts as prose (see above).

Comparison of statistics for page: ResourceLoader/Architecture

| Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
|---|---|---|---|---|---|---|
| XTools | 56,623 | 24,822 | 24,817 (prose only) | 3,979 (prose only) | | 49 (not prose only?) |
| prosesize | | 24,834 | | 3,991 (prose only) | | |
| MediaWiki | 56,623 | | | | | |
| HTML page text pasted into word processor (LibreOffice), with code samples, TOC, nav, and References removed but section headers remaining | | | 38,973 (excluding spaces) | 7,142 | | |
| Expresso | | | too long | too long | too long | |

Ratio of prose bytes (24,822) to page bytes (56,623): .438

Ratio of prose chars (24,817) to chars per LibreOffice (38,973): .637

Ratio of chars per LibreOffice (38,973) to page bytes (56,623): .688

As a human reviewer, I find the main content sections of this page to be relatively readable, though the content type (overview / concepts) requires more attention than other types of docs: it's not as easy to skim the page and get an understanding, so I have to actually read to understand. The UX and readability could be improved by separating content that explains the architecture from content that explains why the architecture was designed a given way. The "annex" sections could be moved to a separate page. The "debugging" section, and information like "Beware of the below reserved names", should be moved to a location where users are more likely to expect it and encounter it in the course of completing development tasks (rather than learning about how the system works, which is the primary purpose of this page).

Comparison of statistics for page: ResourceLoader/Migration_guide_(users)

| Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
|---|---|---|---|---|---|---|
| XTools | 79,408 | 32,710 | 32,695 (prose only) | 4,990 (prose only) | | 112 (not prose only?) |
| prosesize | | 32,811 | | 5,047 (prose only) | | |
| MediaWiki | 79,408 | | | | | |
| HTML page text pasted into word processor (LibreOffice), with code samples, TOC, nav, and References removed but section headers remaining | | | 34,105 (excluding spaces) | 5,972 | | |
| Expresso | | | too long | too long | | |

Ratio of prose bytes (32,710) to page bytes (79,408): .412

Ratio of prose chars (32,695) to chars per LibreOffice (34,105): .96 (interesting! This is because most of the non-prose content was code samples, so prosesize and I removed the same content from our counts)

Ratio of chars per LibreOffice (34,105) to page bytes (79,408): .429

Comparison of statistics for page: API:REST_API

| Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
|---|---|---|---|---|---|---|
| XTools | 9,861 | 3,971 | 3,953 (prose only) | 620 (prose only) | | 17 (not prose only?) |
| prosesize | | 4,008 | | 634 (prose only) | | |
| MediaWiki | 9,861 | | | | | |
| HTML page text pasted into word processor (LibreOffice), with code samples, TOC, nav, and References removed but section headers remaining | | | 3,630 (excluding spaces) | 673 | | |
| Expresso | | | 4,347 | 692 | 38 | |

Ratio of prose bytes (3,971) to page bytes (9,861): .403

Ratio of prose chars (3,953) to chars per LibreOffice (3,630): 1.089 (TODO: anomaly! investigate)

Ratio of chars per LibreOffice (3,630) to page bytes (9,861): .368 (this ratio itself is mostly meaningless; what matters is whether it's consistent across docs with varying content types and formatting)

As a human reviewer, I find this page to be well-structured, and easy to both skim and read.

Comparison of statistics for page: Manual:Developing_extensions

| Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
|---|---|---|---|---|---|---|
| XTools | 32,430 | 9,296 | 9,296 (prose only) | 1,470 (prose only) | | 27 (not prose only?) |
| prosesize | | 9,374 | | 1,504 (prose only) | | |
| MediaWiki | 32,430 | | | | | |
| HTML page text pasted into word processor (LibreOffice), with code samples, TOC, nav, and References removed but section headers remaining | | | 16,119 (excluding spaces) | 2,893 | | |
| Expresso | | | | | | |

Ratio of prose bytes (9,296) to page bytes (32,430): .287

Ratio of prose chars (9,296) to chars per LibreOffice (16,119): .577

Ratio of chars per LibreOffice (16,119) to page bytes (32,430): .497 (this ratio itself is mostly meaningless; what matters is whether it's consistent across docs with varying content types and formatting)

As a human reviewer, I find this page very information-heavy, but not overwhelming. The variety of formatting and the content structure make it manageable to at least skim, but it could be improved by reducing content duplication (e.g. with Localisation docs) and by splitting some content into new or other pages.

Comparison of statistics for page: Writing_an_extension_for_deployment

| Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
|---|---|---|---|---|---|---|
| XTools | 26,202 | 3,084 | 3,084 (prose only) | 485 (prose only) | | 10 (not prose only?) |
| prosesize | | 3,107 | | 494 (prose only) | | |
| MediaWiki | 26,202 | | | | | |
| HTML page text pasted into word processor (LibreOffice), with code samples, TOC, nav, and References removed but section headers remaining | | | 11,635 (excluding spaces) | 2,147 | | |
| Expresso | | | 14,053 | 2,202 | 124 | |

Ratio of prose bytes (3,084) to page bytes (26,202): .118

Ratio of prose chars (3,084) to chars per LibreOffice (11,635): .265

Ratio of chars per LibreOffice (11,635) to page bytes (26,202): .444 (this ratio itself is mostly meaningless; what matters is whether it's consistent across docs with varying content types and formatting)

As a human reviewer, I find this page to be very overwhelming and text-heavy, despite its use of lists. So, that may mean that excluding lists from what counts as "prose" is actually not a good idea for assessing readability or page UX.
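Since what matters is whether the chars-to-bytes ratio is consistent across pages, the five ratios reported above can be recomputed side by side. This sketch uses the page bytes and LibreOffice character counts from the tables. Using the population standard deviation as the measure of spread is an illustrative choice, not part of the original analysis.

```python
from statistics import mean, pstdev

# (page bytes, chars excluding spaces per LibreOffice), from the tables above.
pages = {
    "ResourceLoader/Architecture": (56_623, 38_973),
    "ResourceLoader/Migration_guide_(users)": (79_408, 34_105),
    "API:REST_API": (9_861, 3_630),
    "Manual:Developing_extensions": (32_430, 16_119),
    "Writing_an_extension_for_deployment": (26_202, 11_635),
}

ratios = {
    title: chars / page_bytes for title, (page_bytes, chars) in pages.items()
}

for title, ratio in ratios.items():
    print(f"{title}: {ratio:.3f}")

values = list(ratios.values())
print(f"mean: {mean(values):.3f}")
print(f"population standard deviation: {pstdev(values):.3f}")
print(f"range: {min(values):.3f} to {max(values):.3f}")
```

With only five pages this is a sanity check rather than a statistical result, but it makes the spread between the lowest and highest ratio easy to see.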

Conclusions and next steps

  • Differences between XTools and prosesize calculation of prose and words are minimal enough to ignore.
  • My experiments indicate that the way XTools and prosesize define "prose" may be too restrictive to accurately capture the length and complexity of many technical documents, because those docs are more likely than encyclopedia articles to use lists, code samples, and other non-paragraph formats, which, while not technically prose, do still contribute to the length of a page and the cognitive burden of using it.
    • Need to investigate further whether certain types of tech docs content are especially poorly represented by bytes and/or by prose bytes. It seems that the more structured a page is, the less "prose bytes" represents its real length.
      • TODO: Check this for reference docs and docs with many code samples, like tutorials.
      • TODO: Are there actually examples of a page being very long to human eyes but small in bytes? Maybe something with much page content generated by templates?
  • TODO: Is bytes consistently aligned enough with character count that we can rely on bytes to reflect the complexity of content on a page?
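The first conclusion above can be checked against the numbers in the comparison tables. This sketch assumes that the first prosesize figure in each table is its prose size in bytes and the second is its word count. The relative difference is computed against the XTools value.

```python
# (XTools, prosesize) pairs copied from the comparison tables above.
# Assumption: the first prosesize figure is prose bytes, the second is words.
prose_bytes = {
    "ResourceLoader/Architecture": (24_822, 24_834),
    "ResourceLoader/Migration_guide_(users)": (32_710, 32_811),
    "API:REST_API": (3_971, 4_008),
    "Manual:Developing_extensions": (9_296, 9_374),
    "Writing_an_extension_for_deployment": (3_084, 3_107),
}
prose_words = {
    "ResourceLoader/Architecture": (3_979, 3_991),
    "ResourceLoader/Migration_guide_(users)": (4_990, 5_047),
    "API:REST_API": (620, 634),
    "Manual:Developing_extensions": (1_470, 1_504),
    "Writing_an_extension_for_deployment": (485, 494),
}


def relative_difference(xtools_value: int, prosesize_value: int) -> float:
    """Difference between the two tools as a fraction of the XTools value."""
    return abs(prosesize_value - xtools_value) / xtools_value


for title in prose_bytes:
    bytes_diff = relative_difference(*prose_bytes[title])
    words_diff = relative_difference(*prose_words[title])
    print(f"{title}: bytes {bytes_diff:.2%}, words {words_diff:.2%}")
```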

References

  1. https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language-agnostic_Wikipedia_article_quality
  2. Page length is a feature used to predict article or revision quality in many of our ML models. Ideally we could just use that instead of computing page length ourselves to then calculate tech doc metrics based on it, but page quality ML models are only intended for use on Wikipedias, not on technical wikis (so far). The criteria for quality in the tech docs context also differ from the encyclopedic context, so the models may not be applicable even if they were available for content on mediawiki.org.
  3. https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/blob/main/wikipedia_prosesize/src/lib.rs#L70
  4. https://en.wikipedia.org/wiki/Wikipedia:Prosesize
  5. XTools/Page History#Prose