Seminar Report: Mathematical Information Retrieval - Improving the Software of Wikipedia




As part of a university project at the University of Göttingen, I contributed to the development of the MediaWiki software. Although I was initially unfamiliar with the MediaWiki codebase, my background in PHP development helped me navigate the project. This involvement was a valuable opportunity to deepen my understanding of software engineering practices while making tangible contributions to an established open-source platform.



Wikipedia, online since January 15, 2001, is known as an online encyclopedia with over 60 million articles, of which 2,874,092 are in the German language. Since January 25th, 2002, its underlying infrastructure has been running MediaWiki as a PHP backend to provide the services offered by Wikipedia. MediaWiki was initially developed by Magnus Manske for use on Wikipedia and was further improved by Lee Daniel Crocker. Nowadays, the open-source project is managed by the Wikimedia Foundation, and its use extends beyond Wikipedia, serving as a backend for thousands of similar online services. [1]

Written in PHP, it achieves scalability through caching and database duplication. Notably, its interface is available in over 400 languages and has over 1000 configuration settings. Structurally, MediaWiki consists of a core program and over 1800 extensions. [1]

This project aimed to contribute to open-source software by identifying and solving a problem in MediaWiki's codebase or one of its many extensions.



For this project, I followed a basic roadmap. First, I installed MediaWiki on a local machine and familiarized myself with the codebase. Second, I selected an issue from MediaWiki's issue tracker and proposed a solution. I then sought community agreement on the proposed solution, fixed the problem, and submitted the change to Gerrit, the code review system that MediaWiki uses on top of Git. Finally, administrators reviewed and, hopefully, accepted the change.

To install MediaWiki locally, the community provides comprehensive tutorials for different methods, such as XAMPP or Docker. In general, a local web server such as Apache and an SQL database server (e.g., MySQL) are required.

The Problem


The issue I aimed to address concerns text formatting within Wikipedia. On the "edit" subpage, users can modify articles using LaTeX macros to present formatted text or mathematical formulas, which are subsequently rendered by the browser. Although most standard LaTeX commands are available in this editor, certain commands, such as \overarc{text}, have been notably absent since 2011. The overarc is used as a notation for repeating decimals in Spain and some Latin American countries, and it also serves as a notation for named geometric arcs. Its usage is specialized, but it is relevant in specific fields, which is why it is available in LaTeX.

The issue is tracked in Phabricator as T32215. [2]

Because the translation of LaTeX commands is part of the Math extension, this extension also had to be installed on top of the MediaWiki core.

To rectify this issue, it is essential to understand the sequence of translation steps that lead to the visualization in the client's browser. Technically, the formatting is expressed in MathML, an XML-based markup language that structures formatting instructions for rendering by browsers. Although MathML is part of the HTML5 standard, mainstream browsers initially lagged in fully supporting it. For a LaTeX command on Wikipedia, PHP code generates a MathML block, which is sent to the client as part of the HTML document. The client's browser then renders this block. A MathML block representing an overarc over a sample text looks similar to Figure 1.

Figure 1: MathML example

The browser then renders this block as shown in Figure 2.

Figure 2: Overarc Visualization
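
The MathML block described above can be sketched in a few lines. The example uses Python purely for illustration (the Math extension itself is written in PHP). The function name overarc_mathml, the sample text "AB", and the exact element layout are assumptions, not the markup the extension actually emits.

```python
import xml.etree.ElementTree as ET

ARC = "\u23dc"  # U+23DC, the arc character discussed in this report


def overarc_mathml(text: str) -> str:
    """Return a MathML string that places an arc over `text` (illustrative only)."""
    math = ET.Element("math", {"xmlns": "http://www.w3.org/1998/Math/MathML"})
    mover = ET.SubElement(math, "mover")
    # First child of <mover>: the base, one <mi> per letter, grouped in an <mrow>.
    base = ET.SubElement(mover, "mrow")
    for letter in text:
        ET.SubElement(base, "mi").text = letter
    # Second child of <mover>: the overscript, here the arc as an operator.
    ET.SubElement(mover, "mo", {"stretchy": "true"}).text = ARC
    return ET.tostring(math, encoding="unicode")


print(overarc_mathml("AB"))
```

The resulting string can be pasted into an HTML page to check how a given browser and font stretch the arc over the base.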

Currently, a workaround exists in the form of a template that generates an HTML span element with a text decoration resembling an overarc. The template is purely presentational: it uses the CSS property border-radius and relies on browser support for elliptical borders, which may not be available in all common browsers. In addition, this method cannot be used within standard LaTeX syntax. Additional information about the overarc template can be found on Wikipedia. [3]

In the past, MathML was not widely supported across web browsers. While Firefox has supported MathML since 2006 and Safari since 2016, other widely used browsers (Chrome, Edge, and Opera) only implemented MathML in early 2023.

As a result, MediaWiki relied on rendered images to display LaTeX-style math. This limitation stemmed from inconsistent MathML support among browsers, as documented by resources such as caniuse.com, which tracks browser support for web technologies.

Figure 3: MathML Browser Support[4]



In constructing the MathML block, it was crucial to find a Unicode character that could serve as the overarc symbol. This required a detailed examination of the Unicode character tables. After several candidates had been explored, the character U+23DC was identified.

This character not only represents an arc but also stretches appropriately across multiple characters in common browsers, fulfilling the requirements for this MathML rendering. Notably, not all Unicode characters can be stretched natively over multiple characters. It is hard to determine this systematically; however, according to Unicode Standard 15.0, the range from U+23DC to U+23E1 is intended for horizontal brackets and should therefore be horizontally stretchable. [5] Because rendering also depends on the browser and the font, and the standard is not implemented consistently in all cases, finding a suitable character was more a matter of trial and error than a strictly scientific problem.
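
The character range mentioned above can be inspected with Python's standard unicodedata module. This is a small sketch for orientation only. The helper name horizontal_brackets is an assumption, and the listing says nothing about how a particular browser or font stretches these characters.

```python
import unicodedata


def horizontal_brackets():
    """List (code point, Unicode name, UTF-8 bytes) for U+23DC through U+23E1."""
    rows = []
    for cp in range(0x23DC, 0x23E1 + 1):
        char = chr(cp)
        rows.append((f"U+{cp:04X}", unicodedata.name(char), char.encode("utf-8")))
    return rows


for code_point, name, utf8 in horizontal_brackets():
    print(code_point, name, utf8)
```

The first row shows that U+23DC is named TOP PARENTHESIS and is encoded in UTF-8 as the three bytes E2 8F 9C. The code point identifies the character in Unicode, while UTF-8 is only one way of encoding it.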

A rough understanding of MathML's structure and key elements was also required. MathML structures mathematical expressions hierarchically within a <math> root element, using container elements such as <mrow>, <mfrac>, and <msqrt> to organize and group mathematical components. The components themselves are token elements: <mi> for identifiers, <mn> for numbers, <mo> for operators, and <mtext> for non-mathematical text.

Additionally, MathML accommodates complex structures such as underscripts, overscripts, and matrices through elements like <munder>, <mover>, <munderover>, and <mtable>. Attributes like accent in <mover> and rowspan in <mtd> offer further customization. This format enables precise representation of mathematical expressions, crucial for rendering and processing in web applications and documents. [6]
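
To make the hierarchy concrete, the following Python sketch parses a small MathML fragment and prints its element tree. The fragment (x = 1/3) and the helper name outline are chosen for illustration and do not come from the Math extension.

```python
import xml.etree.ElementTree as ET

# Sample expression x = 1/3: an <mrow> groups an identifier, an operator and a fraction.
SAMPLE = (
    "<math>"
    "<mrow><mi>x</mi><mo>=</mo>"
    "<mfrac><mn>1</mn><mn>3</mn></mfrac>"
    "</mrow>"
    "</math>"
)


def outline(element, depth=0):
    """Return one indented line per element, showing its tag and text content."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (f": {text}" if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(ET.fromstring(SAMPLE))))
```

The printed outline shows the container elements (<mrow>, <mfrac>) as inner nodes and the token elements (<mi>, <mo>, <mn>) as leaves.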

Because I was unfamiliar with the MediaWiki codebase, pinpointing the sections that had to be modified or extended to implement the feature was a significant challenge. I overcame it by searching the code for functions whose behavior resembled what was intended for \overarc{}. One such example is \overbrace{}, which shares much of the desired behavior but places the brace Unicode character over the letters instead of the arc character used for the overarc. Given the size of the project and the separation between the core program and its extensions, this was far from trivial. The fundamental questions were where to search, where the translation takes place, and which extensions are required for the task.



Once the issue was fixed, the code changes were submitted to Gerrit, MediaWiki's code review system. This ensures that the modifications are documented in the related issue tracker [2] and integrated into the project's codebase. Gerrit facilitates the review process, allowing other developers to inspect the changes, provide feedback, and ensure the quality and integrity of the code. After the change was merged into the main branch, it was rolled out to the wikipedia.de domain.



Since implementing the overarc only required calling the same function used for \overbrace{} and similar commands with different parameters, the question arose whether these parameters could be supplied directly on the edit page so that the function could be called directly. Technically, this would certainly be possible. However, the goal is to support syntactically correct LaTeX expressions on the edit page, and such a generic function does not exist in LaTeX. Nevertheless, other missing commands, such as \underarc{}, could be added with very little effort.

It was noted that \overparen{} is also mapped to the U+23DC symbol in the unicode-math package, whereas \overarc{} comes from the arcs package. The two commands nevertheless share the same functionality. Since unicode-math effectively serves as the standard for LaTeX math support, it should also be possible to use \overparen{} on Wikipedia. It should therefore be implemented with the same functionality as \overarc{} or as an alias for it.
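
The idea of one shared function driven by different parameters, including an alias for \overparen{}, can be sketched as follows. This is a hypothetical Python illustration, not the Math extension's PHP code. The names MACROS, ALIASES, and render are invented for this example, and the choice of U+23DD for \underarc{} is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: LaTeX command -> (MathML element, stretchy symbol).
MACROS = {
    r"\overbrace": ("mover", "\u23de"),    # TOP CURLY BRACKET
    r"\underbrace": ("munder", "\u23df"),  # BOTTOM CURLY BRACKET
    r"\overarc": ("mover", "\u23dc"),      # TOP PARENTHESIS
    r"\underarc": ("munder", "\u23dd"),    # BOTTOM PARENTHESIS (assumed choice)
}

# Commands that should behave exactly like an existing entry.
ALIASES = {r"\overparen": r"\overarc"}


def render(command: str, text: str) -> str:
    """Render `command` applied to `text` as a MathML fragment (illustrative only)."""
    command = ALIASES.get(command, command)
    tag, symbol = MACROS[command]
    script = ET.Element(tag)
    ET.SubElement(script, "mi").text = text    # base
    ET.SubElement(script, "mo").text = symbol  # over- or underscript
    return ET.tostring(script, encoding="unicode")


print(render(r"\overarc", "AB"))
print(render(r"\overparen", "AB"))
print(render(r"\underarc", "AB"))
```

In this layout, adding a new command means adding one table entry, while the edit page still accepts only command names that exist in LaTeX.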


  1. Manual:What is MediaWiki? on MediaWiki. Retrieved on 2024-03-31.
  2. T32215 Not possible to use \overarc on Phabricator. Retrieved on 2024-03-31.
  3. Template:Overarc on Wikipedia. Retrieved on 2024-03-31.
  4. MathML on Can I use... Support tables for HTML5, CSS3, etc. Retrieved on 2024-03-31.
  5. Unicode Standard 15.0, Page 886, Section 22.7 Technical Symbols. Retrieved on 2024-04-18.
  6. Math Markup Language (Section 3.4) on W3C. Retrieved on 2024-03-31.