Extension:DiscussionTools/How it works

This page documents the internals of DiscussionTools for developers of the extension and tools that build on top of it, like JS gadgets or SQL queries.

To learn about common reasons why DiscussionTools might not work as expected on a specific page, see Why can't I reply to this comment?.

Parser

Most DiscussionTools features rely on the talk page comment parser introduced in this extension (no relation to the MediaWiki wikitext parser).

The comment parser takes as input the HTML rendering of the discussion page (produced by either Parsoid or the old wikitext parser), and gives as output a representation of the comments and threads on the page.

Note that DiscussionTools does not deal with the wikitext at all, only with HTML.

Data structures

DiscussionTools recognizes two kinds of items: headings and comments. Other content on the page is not included in the representation.

Headings and comments form a tree structure. Comments can be top-level (represented as replies to headings), or be replies to other comments. Headings can be top-level, or be sub-headings (represented as replies to other headings). A thread is a heading together with its tree of replies.

Each item has the following properties:

ID and name, which are used to identify the item in different contexts
Range, referencing the HTML DOM nodes where it was detected. The range may begin or end in the middle of an element, and may span multiple elements in different parent nodes.
Indentation level (always 0 for headings, 1 for top-level comments, 2+ for replies)
References to parent item and reply items

Comments additionally have:

Signature ranges, as above, referencing the HTML DOM nodes of signatures
Author name
Date and time

Headings additionally have:

Heading level (1-6)
Whether it is a placeholder heading, used when comment items appear before the first heading on the page

Most of this information (notably excluding the ranges / content) is stored in the database used for permalinks. Some of it is also encoded back into the HTML in the formatter, as described below.
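
The properties listed above can be pictured as a small class hierarchy. The following is only a sketch: the class and field names are illustrative and are not taken from the extension's code, and the DOM range is replaced by a plain pair of strings.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(eq=False)
class ThreadItem:
    """Properties shared by headings and comments (hypothetical names)."""
    id: str
    name: str
    range: tuple[str, str]  # stand-in for a DOM range (start, end)
    level: int              # 0 for headings, 1 for top-level comments, 2+ for replies
    parent: Optional[ThreadItem] = field(default=None, repr=False)
    replies: list[ThreadItem] = field(default_factory=list)

    def add_reply(self, item: ThreadItem) -> None:
        """Attach `item` as a reply and keep the parent reference in sync."""
        item.parent = self
        self.replies.append(item)


@dataclass(eq=False)
class CommentItem(ThreadItem):
    signature_ranges: list[tuple[str, str]] = field(default_factory=list)
    author: str = ""
    timestamp: Optional[datetime] = None


@dataclass(eq=False)
class HeadingItem(ThreadItem):
    heading_level: int = 2     # 1-6
    placeholder: bool = False  # True when comments appear before the first heading
```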

Example

Below is an example discussion, and the comment parser's representation of it (pseudocode):

A

B. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

    C.

    C. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

        D. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

            E. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

            F. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

    G. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

H. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

    I. Matma Rex (talk) 00:09, 24 June 2021 (UTC)

[
  HeadingItem( { level: 0, range: (h2: A), replies: [
    CommentItem( { level: 1, range: (p: B), replies: [
      CommentItem( { level: 2, range: (li: C, li: C), replies: [
        CommentItem( { level: 3, range: (li: D), replies: [
          CommentItem( { level: 4, range: (li: E), replies: [] } ),
          CommentItem( { level: 4, range: (li: F), replies: [] } ),
        ] } ),
      ] } ),
      CommentItem( { level: 2, range: (li: G), replies: [] } ),
    ] } ),
    CommentItem( { level: 1, range: (p: H), replies: [
      CommentItem( { level: 2, range: (li: I), replies: [] } ),
    ] } ),
  ] } )
]

Parsing algorithm

Detecting comments

The first step in obtaining the above is to find the comments and headings that exist on the page.

For each text node in the DOM, excluding those inside blockquotes etc.:
- If its text contains a timestamp formatted according to the wiki's language, and
- if that text node is preceded by a signature (a link to a user page, user talk page, or user contributions),
- then output a comment with the following properties:
  - Range beginning at the first "leaf" node following the previous comment, heading, or start of document; and ending at the end of the "paragraph" containing the signature
  - Indentation level computed as the minimum of the indentation of the beginning and end of the range
  - Signature range from the first detected link to the end of the timestamp
  - Author name parsed from the signature
  - Date and time parsed from the timestamp
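
The detection loop above can be sketched roughly as follows. This is a simplification under several assumptions: the DOM is flattened into a list of nodes in document order, the timestamp pattern covers only the default English format, the link prefixes are hard-coded, and indentation and the exclusion of blockquotes are ignored. `Node`, `detect_comments` and both regular expressions are hypothetical names, not part of the extension.

```python
import re
from typing import NamedTuple

# Assumption: English-style timestamp such as "00:09, 24 June 2021 (UTC)".
TIMESTAMP_RE = re.compile(r"\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)")
# Assumption: signature links point at User:, User talk: or Special:Contributions/.
USER_LINK_RE = re.compile(
    r"^/wiki/(?:User:|User_talk:|Special:Contributions/)([^/#?]+)"
)


class Node(NamedTuple):
    kind: str   # "text", "link" or "heading"
    value: str  # text content, or the href for links


def detect_comments(nodes):
    """Return (start_index, end_index, author) for each detected comment."""
    comments = []
    start = 0      # first node after the previous comment or heading
    author = None  # author taken from the most recent signature link
    for i, node in enumerate(nodes):
        if node.kind == "heading":
            start, author = i + 1, None
        elif node.kind == "link":
            match = USER_LINK_RE.match(node.value)
            if match:
                author = match.group(1).replace("_", " ")
        elif node.kind == "text" and TIMESTAMP_RE.search(node.value):
            if author is not None:  # timestamp is preceded by a signature
                comments.append((start, i, author))
                start, author = i + 1, None
    return comments
```

A timestamp that is not preceded by a user link is skipped, which mirrors the requirement that both conditions hold.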

Parsing timestamps

Timestamps are parsed by an algorithm that reverses the steps taken by MediaWiki to output them. Only timestamps that exactly match MediaWiki's date formats are accepted, to guarantee that they can be parsed unambiguously. DST timezones and language variants are supported.
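
As an illustration of strict matching, the sketch below accepts only one format, assumed here to be the default English one ("H:i, j F Y" followed by "(UTC)"). The real parser derives its pattern from the wiki's configured date format and handles other languages and timezones; `parse_timestamp` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Two-digit hour and minute, day without a leading zero, full month name.
STRICT_RE = re.compile(r"(\d{2}):(\d{2}), ([1-9]\d?) ([A-Z][a-z]+) (\d{4}) \(UTC\)")


def parse_timestamp(text):
    """Return an aware datetime, or None if no exactly formatted timestamp is found."""
    match = STRICT_RE.search(text)
    if match is None:
        return None
    hour, minute, day, month_name, year = match.groups()
    month = MONTHS.get(month_name)
    if month is None:
        return None
    try:
        return datetime(int(year), month, int(day), int(hour), int(minute),
                        tzinfo=timezone.utc)
    except ValueError:  # e.g. "31 June" or hour 25
        return None
```

Rejecting anything that deviates from the format, rather than guessing, is what keeps the result unambiguous.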

Threading comments

Comments are assigned as replies to other comments depending on the indentation level.
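
One common way to turn a flat, ordered list of items with indentation levels into a tree is to keep a stack of currently open ancestors. The sketch below uses that technique; it is not necessarily how the extension implements threading, it treats every heading as level 0, and it does not create placeholder headings. `Item` and `build_threads` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    label: str
    level: int  # 0 for headings, 1+ for comments
    replies: list = field(default_factory=list)


def build_threads(flat_items):
    """Nest items (given in document order) by indentation level."""
    threads = []
    stack = []  # chain of open ancestors, shallowest first
    for item in flat_items:
        # Close every open item that is not shallower than this one.
        while stack and stack[-1].level >= item.level:
            stack.pop()
        if stack:
            stack[-1].replies.append(item)
        else:
            threads.append(item)
        stack.append(item)
    return threads
```

Feeding it the items from the example above (A at level 0, B at 1, C at 2, D at 3, E and F at 4, G at 2, H at 1, I at 2) yields the same nesting as the pseudocode representation.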

Assigning ID and name

Item IDs and names are computed based only on the author, date and time, and thread structure. They do not depend on the text of the comment or the heading. This allows identical IDs/names to be assigned to the same comment even if it is modified in later revisions of the page, or the same heading even if it is renamed, and to be identical when language variants are in use.

Item IDs are unique within the revision being parsed. If two items were to be otherwise indistinguishable, they are numbered sequentially.

Item names are consistent across all pages and revisions where the item might appear, even when it's moved or changed.
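
The difference between names and IDs can be sketched as follows. The string formats here are assumptions modelled on the example name h-Admin-20231223222800 shown later on this page; in particular, replacing spaces with underscores and appending the parent's name to form an ID are illustrative choices, and `item_name` and `assign_ids` are hypothetical helpers.

```python
from collections import Counter
from datetime import datetime


def item_name(prefix, author, when):
    """Name built only from author and timestamp, e.g. 'c-Matma_Rex-20210624000900'."""
    return f"{prefix}-{author.replace(' ', '_')}-{when:%Y%m%d%H%M%S}"


def assign_ids(names_with_parents):
    """Assign one ID per item for a single revision.

    names_with_parents: list of (name, parent_name or None) in document order.
    """
    seen = Counter()
    ids = []
    for name, parent_name in names_with_parents:
        # Thread structure (here: the parent's name) is part of the ID.
        base = name if parent_name is None else f"{name}-{parent_name}"
        seen[base] += 1
        # Otherwise indistinguishable items are numbered sequentially.
        ids.append(base if seen[base] == 1 else f"{base}-{seen[base] - 1}")
    return ids
```

Because neither function looks at the comment text, editing a comment leaves both its name and its ID unchanged in this sketch.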

Indistinguishable items

As a result of the assignment logic above, when multiple comments or headings have the same author, date and time, they will be assigned the same ID (but only if they're in different revisions or pages) and the same name (possible even within a single revision). This is rare, but it does happen.

Some discussion features will treat those comments as if they were the same comment, which may be surprising if they look obviously different to a human. See details below.

To really identify a single item, you must use the revision ID plus the item ID.

Reply tool

Adding reply links

The formatter inserts reply links into the DOM in PHP, as well as comment start and end markers.

Care is taken not to insert them in invalid places, like inside a <style> or a <br> tag.

Item properties from the comment tree data structure are included as a JSON data attribute on the reply links. Together with the markers, they are later used in JS code to reconstruct the comment tree without running the comment parser.

We use markers instead of directly storing the range to allow some compatibility with other extensions and gadgets that modify the client-side DOM.

Inserting the reply widget

The modifier inserts the reply widget into the DOM in JS, as if the reply widget were a new reply to the comment.

The DOM tree is suitably rearranged to ensure correct indentation level of the reply (wrapper nodes are added, and other nodes may be moved around).

The reply is added below all existing replies to the given comment (and replies to them), with indentation level of the given comment plus 1.
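
In terms of the comment tree, "below all existing replies (and replies to them)" means that the new reply is placed after the last descendant of the target comment. A minimal sketch of that rule, working on the tree only and ignoring the DOM rearrangement (`Comment`, `last_descendant` and `add_reply` are hypothetical names):

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Comment:
    label: str
    level: int
    replies: list = field(default_factory=list)


def last_descendant(comment):
    """Follow the last reply at each depth; returns the comment itself if it has none."""
    while comment.replies:
        comment = comment.replies[-1]
    return comment


def add_reply(comment, label):
    """Return (item after which the reply is placed, the new reply)."""
    anchor = last_descendant(comment)
    reply = Comment(label, comment.level + 1)
    comment.replies.append(reply)
    return anchor, reply
```

In the example discussion above, a reply to C would be placed after F (the last reply in C's subtree) at indentation level 3.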

Saving comments

Saving comments uses the same modifier algorithm, implemented in PHP. The contents of each paragraph in the reply are inserted inside a list item node. Then the HTML is converted back to wikitext using Parsoid, which is saved as a new revision of the page.

When replying in wikitext mode, each line of wikitext is added inside a list item node as a transclusion. Parsoid includes the wikitext unchanged in its output.

Why not wikitext

Saving comments does not operate directly on wikitext, but rather uses HTML throughout the process and Parsoid to convert it. This has some benefits and drawbacks.

Benefits:

We do not need to maintain a whole separate comment parser and modifier that would implement a similar algorithm for wikitext.
The reply widget and the actual reply are placed on the page in the same way, so the "preview" will always match the final result.
We can more easily recognize "frames" around the content, such as barnstar/wikilove messages, and add replies outside of them, regardless of the markup they use.
It better handles edge cases where a single line of wikitext contains fragments of multiple comments (occasionally occurring when the page was previously edited using visual editor).
We will not need to make major changes once multi-line list items are introduced in wikitext.
Comments transcluded from other pages can usually be detected and replied to.

Drawbacks:

Any Parsoid bugs affect the reply tool and potentially cause content corruption. This was a significant issue at the beginning of the project, but since then we've developed a tool to detect issues, and the Parsing team has been fixing them. (One remaining issue is that Parsoid incorrectly handles pages that contained fostered content in HTML (T240280). The reply tool will refuse to edit such pages.)
Parsoid's handling of HTML comments and whitespace has been unintuitive and it required a lot of effort to get it to produce reasonable wikitext.
Comments marked as template-generated but not transcluded from other pages usually can't be replied to.

Transcluded comments

When running the comment parser on Parsoid HTML, we can use the information about comment ranges from our comment parser and information about template-generated content from Parsoid HTML to determine whether a comment visible on the page has been transcluded from a different page, and post the reply there.

Parallel implementations

Most of the comment parser, modifier, and data structure code has two implementations: in PHP and JavaScript. It is a historical accident, as the tools were first prototyped in JS to make it easy to test them with live content on Wikipedia, and then reimplemented in PHP to improve performance (particularly to avoid fetching and sending the full page's Parsoid HTML when saving replies). But once we had them, we kept them both: it helps avoid bugs by comparing the two implementations and allows some client-side actions to happen without consulting the server, e.g. inserting the reply widget.

New topic tool

Unlike the reply tool, the new topic tool saves the comment as wikitext, using the existing APIs to add a new section to a page. In visual mode the comment is converted to wikitext first.

Conceptually, in our data structure, adding a new topic thread is the same as adding a new heading and then adding a top-level comment as a reply to that heading. The interface code reuses much of the reply tool by putting that concept into reality. It seemed like a good idea at the time.

Notifications

Subscribing

Users can subscribe to receive notifications about new replies in a topic. We currently only allow subscribing to level 2 headings that have comments directly underneath (not in sub-sections – this may change: T275943, T298617#7695392).

The model could theoretically support subscribing to notifications about replies to any comment or heading. However, it would require much more complexity in the user interface (particularly in managing subscriptions when multiple subscriptions with different states could overlap), so we gave up on it.

Each subscription has the following properties:

Subscription item name (sub_item field in SQL). This is a concatenation of the username and timestamp of the first comment under that heading. This is used when generating notifications. Example data: h-Admin-20231223222800
Subscription link target, that is the page title (sub_namespace and sub_title) and section title (sub_section) where this item appeared when the subscription was created. This is not used when generating notifications, and may not match where the thread actually appears (if it was archived, or renamed). It's only intended to be used as a human-readable label when managing subscriptions (not implemented yet).
State (sub_state), subscribed (1) or unsubscribed (0). Currently unused but intended to be used for unsubscribing from automatic subscriptions.
User who is subscribed (sub_user, which corresponds to the user_id)
Time when this subscription was created (sub_created)
Time when a notification about the item was last sent (sub_notified, which can be null)

This data is stored in a database table called discussiontools_subscription.
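
For tools that query this table, the sketch below recreates it in an in-memory SQLite database and looks up the subscribers of an item name. The column names come from the list above; the column types, the sample rows and the `subscribers` helper are assumptions made for illustration, and the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE discussiontools_subscription (
        sub_item      TEXT    NOT NULL,
        sub_namespace INTEGER NOT NULL,
        sub_title     TEXT    NOT NULL,
        sub_section   TEXT    NOT NULL,
        sub_state     INTEGER NOT NULL,
        sub_user      INTEGER NOT NULL,
        sub_created   TEXT    NOT NULL,
        sub_notified  TEXT
    )
    """
)
# Hypothetical sample data: three users, one of them unsubscribed (state 0).
conn.executemany(
    "INSERT INTO discussiontools_subscription VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("h-Admin-20231223222800", 1, "Example", "Topic", 1, 1, "20231224000000", None),
        ("h-Admin-20231223222800", 1, "Example", "Topic", 0, 2, "20231224000000", None),
        ("h-Admin-20231223222800", 1, "Example/Archive_1", "Topic", 1, 3, "20240101000000", None),
    ],
)


def subscribers(conn, item_name):
    """User IDs actively subscribed to an item name, regardless of page or section."""
    cursor = conn.execute(
        "SELECT sub_user FROM discussiontools_subscription "
        "WHERE sub_item = ? AND sub_state = 1 ORDER BY sub_user",
        (item_name,),
    )
    return [row[0] for row in cursor]
```

The query filters only on sub_item and sub_state, reflecting that the link target columns are not used when generating notifications.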

Generating notifications

Echo separates the concepts of events and notifications. A single event can result in notifications sent to many users, depending on its user locators (to include users) and user filters (to exclude them).

Whenever an edit to a talk page is saved, Echo compares the previous and new page revision to generate its events, e.g. mentions. DiscussionTools extends this mechanism, and compares the previous and new comment trees to find new comments and generate events for them.

Each event has the following properties:

(built-in in Echo) Page title
(built-in in Echo) Agent (user who caused the event, by leaving the comment)
(built-in in Echo) Section title
(built-in in Echo) Page revision
Subscription item name. A locator is used to include all users subscribed to it in the notifications. Note that we ignore the page title and section title here, and users will still get notifications if the section was renamed or archived to a different page.
New comment's ID and name. The ID is used to show a direct link to the comment. The name is intended to be used in the future to allow linking to the comment if it has been archived to a different page.
New comment's content, a snippet of which is shown in the notifications
List of users who were mentioned in the comment

This data is stored in one of Echo's database tables; however, only the title and agent can be queried directly. Everything else is in a serialized blob.

We generate an event for every new talk page comment, regardless of whether anyone is subscribed to the thread it's in. We generate notifications only for subscribed users.

If the edit would result in an Echo event related to talk pages (that is: mention, mention-summary, or edit-user-talk) as well as a DiscussionTools comment event, we avoid sending double notifications by using a filter to exclude the users who were mentioned and, if the edit was to a user talk page, its owner. Instead we enhance the Echo event with the comment's ID and name to show a direct link to the new comment (rather than just a section where it was added) and the comment's content to show a snippet (unless Echo provided one).
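
The flow described in this section can be sketched as two steps: find the comments that are new in the saved revision, then combine a locator (subscribed users) with a filter (mentioned users and the talk page owner). This is a simplification that assumes comments are compared by name and represents events as plain dictionaries; it does not use Echo's actual API, and all names in it are hypothetical.

```python
def find_new_comments(old_comments, new_comments):
    """Comments whose name did not appear in the previous revision (assumed comparison)."""
    old_names = {comment["name"] for comment in old_comments}
    return [comment for comment in new_comments if comment["name"] not in old_names]


def notification_recipients(event, subscriptions, talk_page_owner=None):
    """Apply the locator and filter described above.

    subscriptions maps a subscription item name to the set of subscribed users.
    """
    # Locator: everyone subscribed to the thread's item name.
    included = set(subscriptions.get(event["subscription_item"], ()))
    # Filter: users already notified through a mention or a user talk page edit.
    excluded = set(event["mentioned"])
    if talk_page_owner is not None:
        excluded.add(talk_page_owner)
    return included - excluded
```

An event is still produced for a thread with no subscribers; the recipient set is simply empty in that case.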

Tracking topics

Sections you subscribe to are identified by the author, date and time of the oldest comment (this is their item name). This allows for sections to be moved, renamed, or archived/unarchived, without losing the subscriptions. It also allows subscriptions to be handled consistently for sections that are transcluded on different pages (e.g. some wikis' village pumps are set up like that).

There are some scenarios where the item name will change, and the connection between the subscriptions and the topic is lost:

If the level 2 heading is changed to a level 3 or any other, e.g. because the section is merged in as a sub-section of an existing discussion
If an older comment is added to the section, e.g. if a malformed quote is added in a new comment or if an older section is merged into it (we currently ignore comments in sub-sections, but this may change: T298617#7695392)

If you subscribe to a heading whose item name is indistinguishable from another's, everything behaves as if you had subscribed to both – e.g. you'll get notifications for both of them, and unsubscribing from one will also unsubscribe you from the other. This is necessary to handle transcluded sections mentioned above.
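
The way a section's item name follows its oldest comment can be sketched like this. The output format is modelled on the example h-Admin-20231223222800 given earlier; replacing spaces with underscores and returning None for a heading without comments are assumptions, and `heading_name` is a hypothetical helper.

```python
from datetime import datetime


def heading_name(comments):
    """comments: list of (author, datetime) for comments directly under the heading."""
    if not comments:
        return None
    author, when = min(comments, key=lambda comment: comment[1])
    return f"h-{author.replace(' ', '_')}-{when:%Y%m%d%H%M%S}"
```

Neither the page title nor the heading text is an input, so moving or renaming the section leaves the name unchanged, while adding a comment with an older timestamp changes it.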

Usability enhancements

The changes provided by this feature are intended to make talk pages look more clearly like places where people are commenting.

When enabled, the HTML markup contains two versions of the markup, and CSS classes are used to toggle which one is visible. This increases the HTML size, but avoids splitting the parser cache, and allows the changes to be disabled without reloading the page (this is used by mobile "Read as wiki page" button).

Metadata related to discussion activity is shown for each topic: link to and date of latest comment, number of comments and number of people in discussion. It is computed based on the structure from the comment parser, and is only shown in sections that contain at least one discussion comment (we currently ignore comments in sub-sections, but this may change: T298617#7695392). Only the data in the comments is considered, not historical information (e.g. someone who fixed a typo in a comment, but didn't leave any comments, is not counted in "people in discussion").

The changes are only applied in the "Talk" and "User talk" namespaces, to avoid unexpected formatting in namespaces that mix content and discussion (e.g. "Wikipedia:" namespace in many projects).

Permanent links

When discussion topics get archived, or moved to a different page for some other reason, or when discussion comments are moved to a different place on the page, the normal links to comments break. This affects links in our own notifications, as well as internal links using comment IDs used in wikitext discussions (of the form [[Talk:Blah#c-...]]).

To solve this problem, you can use permanent links of the form [[Special:GoToComment/c-...]]. The special page will redirect to the current location of the comment. This only works when $wgDiscussionToolsEnablePermalinksBackend is enabled.

Edge cases

In some rare cases we might not be able to redirect to the "current location" for the comment (or heading).

It might not be visible anywhere, because:

It never existed in the first place, and the link is wrong
It has been removed from the page (removing comments outright is rare, but occasionally done for off-topic, redundant or offensive comments)
It has been edited in a way that breaks our identification (e.g. deleting and re-adding a signature)
It was supposed to be archived/moved by cut-and-paste, but something went wrong with the second half of the operation and it was just deleted
The discussion item database hasn't been populated for that page.

Or it might be visible in more than one place, because:

It is actually two indistinguishable comments on different pages (with the same author, date and time)
It was supposed to be archived/moved by cut-and-paste, but something went wrong with the first half of the operation and it was just copied
It might be transcluded on multiple pages (we ignore transclusions when looking for the redirect target, so this doesn't lead to the scenario below)

Or it might be older than the permalinks feature, which only has data about comments added after it was deployed (unless we back-fill the data for older comments; this hasn't been decided yet).

In these cases the permanent link will instead redirect to Special:FindComment, which displays as much information as possible to help you figure out what happened:

A link to the most recent revision containing the comment on each page where the comment has ever appeared (unless the page has been deleted or the revision has been hidden)
A note if it's no longer in the latest revision of the page
A note if it has only been transcluded there
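
The decision between redirecting and falling back to Special:FindComment can be sketched as follows. This is an interpretation of the cases listed above, not the extension's code: each known location is a dictionary with hypothetical keys, and a redirect happens only when exactly one non-transcluded location in a latest revision remains.

```python
def resolve_permalink(locations):
    """Decide what a permanent link should do.

    locations: list of dicts with keys 'page', 'in_latest_revision', 'transcluded'.
    Returns ("redirect", page) or ("find_comment", list of pages to show).
    """
    # Transclusions are ignored when looking for the redirect target.
    current = [
        location
        for location in locations
        if location["in_latest_revision"] and not location["transcluded"]
    ]
    if len(current) == 1:
        return ("redirect", current[0]["page"])
    # Not visible anywhere, or visible in more than one place.
    return ("find_comment", [location["page"] for location in locations])
```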

Discussion item database

This feature is backed by a database of comment metadata. For every comment ID and name that has ever appeared on wiki pages (ever since the feature was enabled), we store every page title on which it appeared, and the oldest and newest revision in which it appeared. This information is generated entirely from the wikitext and there's no API to edit it (like the pagelinks table).

The data is generated/updated as a part of refreshlinks jobs. Under normal circumstances these updates are small (just recording the comments that have been added or removed on a page since the last edit or template refresh). However, right after the feature is enabled, the relevant database tables are empty; any refreshlinks job will cause the information about all comments on the page to be generated. To populate the data of all pages after enabling the feature, the persistRevisionThreadItems.php maintenance script must be run. Otherwise, it will be populated when talk pages get edited directly, or indirectly as part of template changes, which may cause overload if a template used on many talk pages is edited (phab:T334258).

The database also includes some additional information:

Discussion items (discussiontools_items table)
For each item:
- Item name
- Comment timestamp (if a comment)
- Comment author (if a comment)
- Pages in which the item appeared (discussiontools_item_pages table)
  For each page:
  - Page ID
  - Oldest revision in which the item appeared (discussiontools_item_revisions table)
  - Newest revision in which the item appeared (discussiontools_item_revisions table)
    For each of those two revisions:
    - Item ID (discussiontools_item_ids table)
    - Revision ID
    - Parent item ID in the same revision
    - Page ID from which the item is transcluded (if any)
    - Indentation level (if a comment)
    - Heading level (if a heading)

This is a subset of the comment parser data structure, notably excluding the content of the comment/heading. We will use it to improve the performance of features that previously needed to run the comment parser repeatedly on old revisions (e.g. checking for new comments while the user is replying, generating notifications). It may also become useful in the future to measure talk page usage (e.g. how many people comment in topics, or how long it takes until topics are archived).

To save disk space, only data about the oldest and newest revisions of items is kept in the discussiontools_item_revisions table. After the page is edited and the comment appears in a newer revision of the page, the row for the older revision (previously newest) is deleted.
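
The retention rule can be sketched with a dictionary standing in for the table: for each item and page, only the oldest and the newest revision in which the item appeared are kept. `record_appearance` and the key layout are hypothetical and do not reflect the actual schema.

```python
def record_appearance(rows, item_name, page_id, rev_id):
    """Record that an item appeared in a revision, keeping only two revisions.

    rows maps (item_name, page_id) to [oldest_rev_id, newest_rev_id].
    """
    key = (item_name, page_id)
    if key not in rows:
        rows[key] = [rev_id, rev_id]
        return
    oldest, newest = rows[key]
    # The previous "newest" revision is dropped unless it is also the oldest.
    rows[key] = [min(oldest, rev_id), max(newest, rev_id)]
```

After an item has appeared in revisions 10, 11 and 12 of a page, only 10 and 12 remain recorded.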

You can find a few examples in the PHPUnit integration tests of the extension.

Simple example
Archived section
Indistinguishable comments
Transcluded section
Changed comment indentation
Changed heading level

Each directory contains a MediaWiki dump that can be imported into your wiki, and JSON dumps of the database tables produced by importing it.