Context

The Dashboard currently gets statistics on references added by fetching the features data supplied by Wikimedia's article quality machine learning models and comparing the values for one revision with those for the previous revision. This reference counting method is constrained by the availability of the article quality machine learning models, which are not accessible for the majority of Wikipedia language editions, including the Spanish language version.
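
In rough terms, that existing approach boils down to subtracting the reference-count feature of the parent revision from that of the new revision. The sketch below is illustrative only (not the Dashboard's actual code), and the feature key name is an assumption:

  # Illustrative sketch: diff the reference-count feature of two revisions.
  # It assumes the articlequality features data for each revision has already
  # been fetched into a dict, and that the reference count lives under a key
  # like "feature.wikitext.revision.ref_tags" (the exact key is an assumption).
  def references_added(current_features: dict, parent_features: dict) -> int:
      key = "feature.wikitext.revision.ref_tags"
      return current_features.get(key, 0) - parent_features.get(key, 0)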

The main goal of the internship is to develop a performant alternative implementation for counting references added that does not depend on articlequality features data and works for every language version of Wikipedia (and ideally for every wiki).

Why build a new API?

One promising route for getting reference counts for every wiki would be to co-opt data from another existing API that works across languages. Originally, the misalignment API was suggested: https://misalignment.wmcloud.org/api/v1/quality-revid-features?lang=es&revid=144495297.
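
For illustration, querying that endpoint for a single Spanish Wikipedia revision could look roughly like this (the response structure is deliberately not assumed here; the snippet just prints whatever JSON comes back):

  import requests

  # Ask the misalignment API for the quality features of one eswiki revision,
  # using the endpoint and parameters from the URL above.
  resp = requests.get(
      "https://misalignment.wmcloud.org/api/v1/quality-revid-features",
      params={"lang": "es", "revid": 144495297},
      timeout=30,
  )
  resp.raise_for_status()
  print(resp.json())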

However, after a bit more research it turned out that the existing misalignment API was designed as a research prototype and is intended to be removed eventually. More importantly, it is hosted in a shared space where someone could one day delete it without realizing it is in use, so it is not a great place to rely on for sustaining the Dashboard long-term. In addition, the logic we need around references is relatively simple, while the existing misalignment API does a number of other things that slow it down or could cause errors.

Because of that, we came up with the idea of building a new reference-counter API hosted on Toolforge. Toolforge protects against the accidental-deletion problem and gives wikiedu folks easier access. Building a new API is also a good opportunity to simplify things further so it's easier to maintain, faster, and less likely to fail inexplicably.

Reference-counter API

Possible approaches

There are two main ways for an API to count references in a given wiki revision:

  1. working with wikitext, or
  2. working with the parsed HTML of a page.

Traditionally, it's been a lot easier to access a page's wikitext than its HTML (background). To sum up, the Wikimedia Foundation has provided public dumps of the content of all wikis, and these dumps come in wikitext format. There are severe drawbacks to working with the XML dumps containing articles in wikitext, because MediaWiki translates wikitext into the HTML that is then displayed to readers. Thus, some elements contained in the HTML version of an article are not readily available in the wikitext version, for example due to the use of templates. This means that, if only the wikitext is parsed, researchers might miss important content that is displayed to readers. This is especially important for the reference counting process since, for example, some templates are used to add references.

Wikitext

Wikitext can be accessed through the MediaWiki APIs (example: https://en.wikipedia.org/w/api.php?action=parse&oldid=1188344986&prop=wikitext&format=json).
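
A minimal sketch of fetching a revision's wikitext that way (using formatversion=2 so the wikitext comes back as a plain string):

  import requests

  def fetch_wikitext(revid: int, wiki: str = "en.wikipedia.org") -> str:
      """Fetch the wikitext of a single revision via the MediaWiki parse API."""
      resp = requests.get(
          f"https://{wiki}/w/api.php",
          params={
              "action": "parse",
              "oldid": revid,
              "prop": "wikitext",
              "format": "json",
              "formatversion": 2,
          },
          timeout=30,
      )
      resp.raise_for_status()
      return resp.json()["parse"]["wikitext"]

  print(fetch_wikitext(1188344986)[:200])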

Over time, the research community has developed many tools to help folks who want to use the dumps; these are essentially tools that work with wikitext. For example, the misalignment API mentioned above does work with wikitext (see this code). While working with wikitext is convenient in some respects, identifying references in wikitext is not trivial, so finding a perfect algorithm for it will not be either.

For example, the misalignment API counts references (see this code) using the ref tag. It counts both the self-closing tag <ref/> and the container version (<ref> and </ref>). However, that’s definitely not perfect: for example, it doesn’t take shortened footnote templates into account, which are another way of adding references.
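
A rough sketch of that kind of counting (not the misalignment API's actual code, just a regex-based approximation of the same idea):

  import re

  # Match either the self-closing form (<ref ... />) or the opening tag of the
  # container form (<ref ...>); closing </ref> tags and <references /> are not
  # matched, so each reference is counted once.
  REF_TAG = re.compile(r"<ref\b[^>]*/>|<ref\b[^>]*>", re.IGNORECASE)

  def count_ref_tags(wikitext: str) -> int:
      return len(REF_TAG.findall(wikitext))

  sample = 'Text.<ref>A book.</ref> More text.<ref name="b" /> End.'
  print(count_ref_tags(sample))  # -> 2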

Identifying shortened footnote templates is particularly hard because they are wiki-specific, varying across languages and projects. While it is possible to build wiki-specific lists of templates, and we could use some tools for it (like this or this), the endeavor is challenging both in terms of initial setup and ongoing maintenance.
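
As a sketch of what a wiki-specific approach could look like (the template names below are just illustrative examples, not a complete list for any wiki):

  import re

  # Hypothetical, incomplete list of shortened footnote templates for one wiki;
  # every wiki would need its own curated list, which is the maintenance burden
  # described above.
  SHORTENED_FOOTNOTE_TEMPLATES = {"sfn", "sfnp"}

  # Capture the template name right after "{{", up to the first "|" or "}}".
  TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")

  def count_footnote_templates(wikitext: str) -> int:
      names = (m.group(1).lower() for m in TEMPLATE_NAME.finditer(wikitext))
      return sum(1 for name in names if name in SHORTENED_FOOTNOTE_TEMPLATES)

  sample = "Fact.{{sfn|Smith|2020|p=3}} Another fact.{{Sfn|Doe|1999}}"
  print(count_footnote_templates(sample))  # -> 2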

Parsed HTML

The parsed HTML is available from the MediaWiki APIs too (example: https://en.wikipedia.org/w/api.php?action=parse&oldid=1188344986&prop=text&format=json).
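
A minimal sketch of fetching a revision's parsed HTML the same way (again with formatversion=2, so the rendered HTML arrives as a single string):

  import requests

  resp = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={
          "action": "parse",
          "oldid": 1188344986,
          "prop": "text",
          "format": "json",
          "formatversion": 2,
      },
      timeout=30,
  )
  resp.raise_for_status()
  # With formatversion=2 the rendered HTML is a plain string under "parse" -> "text".
  html = resp.json()["parse"]["text"]
  print(len(html))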