User:TBurmeister (WMF)/Measuring page length
This page gathers resources and describes what I learned when I discovered that it's not so straightforward to quantify the amount of text on a wiki page.
Motivation and context
My use case is to assess the readability and usability of written content in our technical documentation, as part of the Doc metrics project. I'm interested in the number of characters and words on a page, and in the structure of the page (number of sections), because those content attributes are a strong indicator of readability and quality[1].
In trying to define how to score docs for a "page length" metric (i.e. if page_length < 20k bytes then 1, else 0), I realized that bytes as a measure of page length was inconsistent in ways I hadn't expected. I had a misconception that "page size" would refer to how a human reader would assess the amount of content on the page, not the size in bytes of the page as stored in a database. Realizing the difference between "page size" and "page length" led me to question whether page size in bytes is actually a reliable proxy for measuring the amount of content on a page, and/or the length of the page.
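Stated as code, the rule I started from is trivial (a sketch; the 20k threshold is just my initial working cutoff, not an established standard):

```python
def page_length_score(length_bytes: int, threshold: int = 20_000) -> int:
    """Score 1 if the page's byte length is under the threshold, else 0."""
    return 1 if length_bytes < threshold else 0

print(page_length_score(12_345))  # 1
print(page_length_score(56_623))  # 0
```

The rest of this page is about why `length_bytes` turns out to be a shakier input to that function than I expected.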
What I investigated
Which data sources provide data for page size or page length?
Do the available data sources align or diverge in their calculations of page size / page length?
How do different types of page content and formatting influence page size and page length measurements? Are these interactions consistent or unpredictable?
Do available data sources reflect how a human would assess the length of a page? Does the data (whether bytes, number of sections, number of characters, or whatever) actually correlate with whether a page is "text-heavy" or "very long"?
Examples of page size in bytes not reflecting actual page length
Since certain types of technical content (like Reference pages) are often formatted in a way that generates longer pages, I didn't want to compare reference content with non-reference content. In general, it's acceptable for reference docs to be long, because we assume that users will employ a targeted search or skimming approach to quickly locate the information they need. At first, I thought these content type differences could account for most discrepancies, but then I found other examples that weren't reference docs:
- Manual:FAQ is significantly longer than Manual:Hooks. As one measure: the "print version" link for each page generates 38 print pages for Manual:FAQ, and 23 print pages for Manual:Hooks. However, Manual:Hooks is larger in bytes (177k) than Manual:FAQ (89k).
- API:REST API/Reference is 59 print pages and 65k bytes.
This required me to clarify the requirements of my use case: did it matter that bytes and characters don't have a 1-to-1 correlation for some non-English languages, or that (in general) the byte length of a string may not match what a human would expect? Did it matter that some data sources didn't count lists or tables as part of the "text" in measuring page length?
Ultimately, my goal is to identify which pages in a collection of technical documentation have "walls of text", and/or a structure so lengthy or complex that it's likely to impact developer experience. So, it might actually be acceptable for me to use a data source like wikipedia:Wikipedia:Prosesize, which excludes non-paragraph content from its calculation of "what is prose". Lists are a good way to add structure to content and to break up walls of text. So, a page length measurement that excludes them could be acceptable, if my main goal is to find un-structured, large chunks of prose.
However, that leaves out the many instances of code examples that are an essential part of technical documentation. Should code samples be included in how we assess the length of a page? Their formatting usually causes pages to be longer, even though the amount of content on each line is much less than that of prose. Do long code samples hinder readability just as much as long paragraphs? Is there a line limit at which we should instead just link to example files stored in source control, instead of putting code in wiki pages? More importantly (for this project): how do existing data sources for measuring page size and length handle the presence of code samples? Is it different enough that this should impact which data sources we use for doc metrics?
Since most of our technical documentation is in English, the issue of byte length not reflecting character counts for some languages is not always a concern. However, on mediawiki.org translations are part of our technical documentation, so this issue is relevant in that context.
- The Hungarian translation Help:Magic words/hu is 118,694 bytes at 70% translated. The English version of the same page is 107,602 bytes.
- XTools "Largest Pages" provides a useful way to compare the different sizes in page bytes across translations for a given page: https://xtools.wmcloud.org/largestpages/www.mediawiki.org/12?include_pattern=%25Magic_words%25 (keep in mind that this is inherently flawed as a measure due to inconsistent translation completion percentages across language version)
Regardless of the language a page is written in, byte size doesn't always accurately reflect the complexity or length of a page if it uses many templates or other fancy wikitext to generate textual content.
Ideally, I'd like to leave all of this calculation of doc quality based on content attributes up to an ML model, but that isn't yet feasible[2]. At the very least, this deep dive has deepened my understanding of which content features would be relevant, if we were to design/train a content quality model specifically for assessing technical documentation.
Types of page length measurements
Page length as bytes
This is the standard method of measuring page length in most MediaWiki built-in metrics tools and the corresponding database tables. Consequently, you can access page length in bytes in a variety of different tools and dashboards:
- MediaWiki Page Information
- Action API module API:Info
- According to the API:Info notes section, the built-in MediaWiki Page Information tool uses a separate module, but "much of the information it returns overlaps the API:Info module".
- Wikistats: https://meta.wikimedia.org/wiki/Research:Wikistats_metrics/Bytes
- XTools - XTools/Page History provides page size in bytes (the same as all of the above sources) and also "prose size" in bytes.
- prosesize: Size (in bytes) of the text within readable prose sections on the page (calculated via the Rust string:len method).[3][4]
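All of these sources ultimately report the same stored byte length. As a sketch, here is how to build an API:Info query for it; the sample response below is trimmed to the relevant fields, and its length value is illustrative, not a live figure:

```python
import json
from urllib.parse import urlencode

# Build an API:Info query for the byte length of a page on mediawiki.org.
params = {
    "action": "query",
    "prop": "info",
    "titles": "Manual:FAQ",
    "format": "json",
    "formatversion": "2",
}
url = "https://www.mediawiki.org/w/api.php?" + urlencode(params)

# A trimmed response of the shape API:Info returns; the length value here
# is illustrative, not a live figure.
sample = json.loads('{"query": {"pages": [{"title": "Manual:FAQ", "length": 91136}]}}')
page = sample["query"]["pages"][0]
print(page["title"], page["length"])  # "length" = page size in bytes
```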
Complications of page length as bytes
- Not all languages have a one-to-one correspondence between characters and bytes like English does. This has real implications for languages like Hebrew and Russian. See, for example, ⚓ T275319 Change $wgMaxArticleSize limit from byte-based to character-based.
- Byte counts capture the wikitext that generates various types of page elements that may not correspond to text. For example, a page with many different types of non-text elements, like TODO example, can have just as many bytes as a very text-heavy page, like TODO example.
- Pages with many code samples, e.g. API:Nearby places viewer (22,058 page bytes, printable version = 8 pages, 2 images, 2 buttons, more code than text) compared to Writing an extension for deployment (26,202 bytes, printable version = 6 pages, 1 slides pdf, 2 form input fields, mostly text, no code). The latter, a shorter but more text-heavy and dense page, has a larger page size in bytes than the tutorial, which is longer (even with sections collapsed) but contains mostly code samples.
- Pages with transclusions or templates:
- API:REST API/Conditional requests is nearly the same size in bytes as API:Allmessages (2,093 and 2,045 bytes, respectively). However, API:Allmessages is objectively longer with 10 sections of template-generated example code and API documentation.
- As noted above, Manual:FAQ is significantly longer than Manual:Hooks (38 vs. 23 print pages via each page's "print version" link), yet Manual:Hooks is larger in bytes (177k vs. 89k).
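The character-vs-byte discrepancy in the first bullet is easy to see directly: ASCII text is one byte per character in UTF-8, while Cyrillic and Hebrew letters are two. A minimal demonstration:

```python
# Compare character count to UTF-8 byte count for different scripts.
samples = {
    "English": "Magic words",
    "Russian": "Волшебные слова",
    "Hebrew": "מילות קסם",
}
for language, text in samples.items():
    chars = len(text)
    size_in_bytes = len(text.encode("utf-8"))
    print(f"{language}: {chars} chars, {size_in_bytes} bytes")
```

For the English string the two counts match; for the Russian and Hebrew strings the byte count is nearly double the character count, which is why a byte-based size limit or metric treats those languages differently.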
Tentative conclusion: We need a measure of page length that corresponds to the content as the human reader experiences it when viewing the page, which requires using the rendered HTML.
Page length as prose size
XTools - XTools/Page History displays "prose size" in bytes, along with the number of characters and words in the page sections the tool considers to be "prose".
- The algorithm used to calculate prose was "inspired by toolforge:prosesize", but it appears not to yield exactly the same results. See the explanation in https://blog.legoktm.com/2023/02/25/measuring-the-length-of-wikipedia-articles.html
- The definition of prose excludes templates and content created by TemplateStyles, Math, and Cite[5].
- Based on Parsoid HTML for each wiki page.
- "counts the text within <p> tags in the HTML source of the document, which corresponds almost exactly to the definition of "readable prose". This method is not perfect, however, and may include text which isn't prose, or exclude text which is (e.g. in {{cquote}}, or prose written in bullet-point form)."[4]
Complications of using prose size for understanding page length
- Different tools and data sources use varying definitions of what counts as "text" on a page, and they may vary in how they parse the wikitext or HTML. As a result, different data sources report different numbers for character or word counts, depending on their parsing strategy and on how they identify what counts as "text" to be measured.
- Example: Wikimedia tutorials actually has 264 words if we count all the text on the page, but prosesize and XTools capture only 7 words because they exclude the layout grid and icon text.
Number of sections
XTools: XTools/Page History includes a count of the page's sections in its "prose" section, but it's unclear from the docs whether that count is limited to just the "prose" sections of the page (like the character and word count metrics), or whether it appears there only for convenience and isn't actually limited to the same "prose-only" constraints.
- Section count includes the lead section, so all pages will have at least one section.
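One way to cross-check that count is the Action API's parse module with prop=sections, which lists a page's headed sections. Note that it does not list the lead section, so a count that includes the lead (as XTools' does) will be one higher. A sketch; the response below is trimmed, with hypothetical section names:

```python
import json

# A trimmed action=parse&prop=sections response; section names are hypothetical.
sample = json.loads("""
{"parse": {"title": "Example page", "sections": [
    {"toclevel": 1, "line": "Installation"},
    {"toclevel": 1, "line": "Configuration"},
    {"toclevel": 2, "line": "Permissions"}
]}}
""")

# action=parse lists only headed sections; add one for the lead section
# to match a count that includes it.
section_count = len(sample["parse"]["sections"]) + 1
print(section_count)  # 4
```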
Example comparing different measurements
This section uses real tech docs to explore how these tools differ in computing their statistics, and which data points each makes available.
Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
---|---|---|---|---|---|---|
XTools | 56,623 | 24,822 | 24,817 (prose only) | 3,979 (prose only) | | 49 (not prose only?) |
prosesize | | 24,834 | | 3,991 (prose only) | | |
MediaWiki | 56,623 | | | | | |
HTML page text pasted into word processor (LibreOffice); code samples, TOC, nav, and References removed but section headers remain | | | 38,973 (excluding spaces) | 7,142 | | |
Expresso | | | too long | too long | too long | |
Ratio of prose bytes (24,822) to page bytes (56,623): .438
Ratio of prose chars (24,817) to chars per LibreOffice (38,973): .636
Ratio of chars per LibreOffice (38,973) to page bytes (56,623): .688
As a human reviewer, I find the main content sections of this page to be relatively readable, though the content type (overview / concepts) requires more attention than other types of docs -- it's not as easy to skim the page and get an understanding; I have to actually read to understand. The UX and readability could be improved by separating content that explains the architecture from content that explains why the architecture was designed a given way. The "annex" sections could be moved to a separate page. The "debugging" section, and information like "Beware of the below reserved names", should be moved to a location where users are more likely to expect and encounter it in the course of completing development tasks (rather than learning about how the system works, which is the primary purpose of this page).
Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
---|---|---|---|---|---|---|
XTools | 79,408 | 32,710 | 32,695 (prose only) | 4,990 (prose only) | | 112 (not prose only?) |
prosesize | | 32,811 | | 5,047 (prose only) | | |
MediaWiki | 79,408 | | | | | |
HTML page text pasted into word processor (LibreOffice); code samples, TOC, nav, and References removed but section headers remain | | | 34,105 (excluding spaces) | 5,972 | | |
Expresso | | | too long | too long | | |
Ratio of prose bytes (32,710) to page bytes (79,408): .412
Ratio of prose chars (32,695) to chars per LibreOffice (34,105): .96 (interesting! this is because most of the non-prose content was code samples, so prosesize and I removed the same content from our counts)
Ratio of chars per LibreOffice (34,105) to page bytes (79,408): .429
Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
---|---|---|---|---|---|---|
XTools | 9,861 | 3,971 | 3,953 (prose only) | 620 (prose only) | | 17 (not prose only?) |
prosesize | | 4,008 | | 634 (prose only) | | |
MediaWiki | 9,861 | | | | | |
HTML page text pasted into word processor (LibreOffice); code samples, TOC, nav, and References removed but section headers remain | | | 3,630 (excluding spaces) | 673 | | |
Expresso | | | 4,347 | 692 | 38 | |
Ratio of prose bytes (3,971) to page bytes (9,861): .403
Ratio of prose chars (3,953) to chars per LibreOffice (3,630): 1.089 (TODO: anomaly! investigate)
Ratio of chars per LibreOffice (3,630) to page bytes (9,861): .368 (this ratio itself is mostly meaningless, what matters is if it's consistent across docs with varying content types and formatting)
As a human reviewer, I find this page to be well-structured, and easy to both skim and read.
Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
---|---|---|---|---|---|---|
XTools | 32,430 | 9,296 | 9,296 (prose only) | 1,470 (prose only) | | 27 (not prose only?) |
prosesize | | 9,374 | | 1,504 (prose only) | | |
MediaWiki | 32,430 | | | | | |
HTML page text pasted into word processor (LibreOffice); code samples, TOC, nav, and References removed but section headers remain | | | 16,119 (excluding spaces) | 2,893 | | |
Expresso | | | | | | |
Ratio of prose bytes (9,296) to page bytes (32,430): .287
Ratio of prose chars (9,296) to chars per LibreOffice (16,119): .576
Ratio of chars per LibreOffice (16,119) to page bytes (32,430): .497 (this ratio itself is mostly meaningless, what matters is if it's consistent across docs with varying content types and formatting)
As a human reviewer, I find this page very information-heavy, but not overwhelming. The variety of formatting and the content structure makes it manageable to at least skim, but it could be improved by reducing content duplication (i.e. with Localisation docs) and by splitting some content into new or other pages.
Tool/data source | Bytes | Prose bytes | Chars | Words | Sentences | Sections |
---|---|---|---|---|---|---|
XTools | 26,202 | 3,084 | 3,084 (prose only) | 485 (prose only) | | 10 (not prose only?) |
prosesize | | 3,107 | | 494 (prose only) | | |
MediaWiki | 26,202 | | | | | |
HTML page text pasted into word processor (LibreOffice); code samples, TOC, nav, and References removed but section headers remain | | | 11,635 (excluding spaces) | 2,147 | | |
Expresso | | | 14,053 | 2,202 | 124 | |
Ratio of prose bytes (3,084) to page bytes (26,202): .117
Ratio of prose chars (3,084) to chars per LibreOffice (11,635): .265
Ratio of chars per LibreOffice (11,635) to page bytes (26,202): .444 (this ratio itself is mostly meaningless, what matters is if it's consistent across docs with varying content types and formatting)
As a human reviewer, I find this page to be very overwhelming and text-heavy, despite its use of lists. So, that may mean that excluding lists from what counts as "prose" is actually not a good idea for assessing readability or page UX.
Conclusions and next steps
- Differences between the XTools and prosesize calculations of prose and words are minimal enough to ignore.
- My experiments indicate that the way XTools and prosesize define "prose" may be too restrictive to accurately capture the length and complexity of many technical documents, because those docs are more likely than encyclopedia articles to use lists, code samples, and other non-paragraph formats, which -- while not technically prose -- do still contribute to the length of a page and the cognitive burden of using it.
- Need to investigate further if certain types of tech docs content are especially poorly-represented by bytes and/or by prose bytes. It seems that the more structured a page is, the less "prose bytes" represents its real length.
- TODO: Check this for reference docs and docs with many code samples, like tutorials.
- TODO: Are there actually examples of a page being very long to human eyes but small in bytes? Maybe something with much page content generated by templates?
- TODO: Is bytes consistently aligned enough with character count that we can rely on bytes to reflect the complexity of content on a page?
Related resources
References
- ↑ https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language-agnostic_Wikipedia_article_quality
- ↑ Page length is a feature used to predict article or revision quality in many of our ML models. Ideally we could just use that instead of computing page length ourselves to then calculate tech doc metrics based on it, but page quality ML models are only intended for use on Wikipedias, not on technical wikis (so far). The criteria for quality in the tech docs context also differ from the encyclopedic context, so the models may not be applicable even if they were available for content on mediawiki.org.
- ↑ https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/blob/main/wikipedia_prosesize/src/lib.rs#L70
- ↑ 4.0 4.1 https://en.wikipedia.org/wiki/Wikipedia:Prosesize
- ↑ XTools/Page History#Prose