Main goal

I want to store stats for each course by time period (for example, week-by-week stats), and use revision data only while the update for a given time period is running, so that we can remove the Revisions table altogether. This will dramatically reduce the storage requirements of the system and remove one of the major database performance bottlenecks.
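As a rough illustration of the idea, here is a minimal sketch in plain Ruby. The Revision struct, the method names, the 7-day period, and the two stats fields are all hypothetical and chosen for illustration; they are not the system's actual models or schema. The sketch shows that once each period has its own stored stats row, course totals can be derived from those rows without keeping the revisions themselves.

```ruby
require 'date'

# Hypothetical, simplified stand-in for a revision row; not the real model.
Revision = Struct.new(:date, :characters, keyword_init: true)

# Splits [course_start, course_end] into consecutive periods of `days` days.
# The last period is truncated so it never runs past the course end.
def timeslices(course_start, course_end, days: 7)
  slices = []
  slice_start = course_start
  while slice_start <= course_end
    slice_end = [slice_start + days - 1, course_end].min
    slices << (slice_start..slice_end)
    slice_start += days
  end
  slices
end

# Computes the stats row for one period from the revisions that fall in it.
# In the proposed design, these revisions would be held only during the update.
def stats_for(slice, revisions)
  in_slice = revisions.select { |rev| slice.cover?(rev.date) }
  {
    start: slice.begin,
    revision_count: in_slice.size,
    character_sum: in_slice.sum(&:characters)
  }
end

# Course-level totals are derived from the stored per-period rows alone.
def course_totals(rows)
  {
    revision_count: rows.sum { |row| row[:revision_count] },
    character_sum: rows.sum { |row| row[:character_sum] }
  }
end
```

With this shape, re-running the update for one period only replaces that period's row, and the course totals are recomputed by summing the rows.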

Current status of things

What revision data is

Currently the system has a revisions table, where we save all the revision data. Revision data comprises:

  1. Revision data strictly speaking, imported through the RevisionImporter class. Ultimately, it fetches wiki revision data from an endpoint that provides SQL query results from a replica wiki database on wmflabs (see Replica class).
  2. Revision scores data, imported through the RevisionScoreImporter class. Behind the scenes, it uses the Lift Wing and reference-counter APIs to get values such as wp10 or the reference count.
  3. Revision metadata, imported through the PlagiabotImporter class. It gets the suspected plagiarism data from an external Toolforge tool, and updates the ithenticate_id field in the revisions table.

How revision data is imported

There are two main processes that modify the revisions table in different ways: ScheduleCourseUpdatesWorker and ConstantUpdateWorker.

ScheduleCourseUpdatesWorker

Course updates pull in revisions and articles for courses that are ongoing or still within the update window. They are sorted into queues depending on how long they run, with short courses having their own queue and very long ones their own as well. The worker runs every 5 minutes.
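The queue sorting described above can be sketched as follows. This is an illustrative sketch only: the thresholds, the queue names, and the choice of a duration in seconds as the sorting measure are assumptions. The text does not state the actual cutoffs, nor exactly how "how long they run" is measured.

```ruby
# Hypothetical thresholds, in seconds; the real cutoffs are not given here.
SHORT_MAX_SECONDS = 60
LONG_MIN_SECONDS = 3600

# Hypothetical queue names, for illustration only.
# Picks a queue from a single duration measure, so that short jobs are not
# stuck waiting behind very long ones.
def queue_for(duration_seconds)
  return 'short_update' if duration_seconds <= SHORT_MAX_SECONDS
  return 'long_update' if duration_seconds >= LONG_MIN_SECONDS

  'medium_update'
end
```

Keeping separate queues means a handful of very long updates cannot delay the many short ones scheduled in the same 5-minute cycle.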

Ultimately, it uses the UpdateCourseStats class, which takes the following actions related to revisions:

  1. Import revision data strictly speaking.
  2. Import revision scores data.
  3. Delete revisions that are in limbo. See update_article_status.
  4. Update the summary field in revision records. See update_wikidata_stats.

This process also updates Article records (see import_revisions_slice) and ArticlesCourse records (see the update_from_course class method).

Note: the UpdateCourseStats class fetches data from two main places: Revisions and Uploads. Revisions are imported incrementally, starting from the last existing revision, whereas Uploads are retrieved for the entire course each time (from start to end).
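The difference between the two fetch strategies can be sketched as below. The method name and its parameters are hypothetical, and this is not how UpdateCourseStats is actually written. It only illustrates that an incremental source resumes from the latest stored record, while a full source always restarts from the course start.

```ruby
require 'date'

# Returns the date range to fetch for one update run.
# - incremental (like Revisions): resume from the latest stored record, if any.
# - full (like Uploads): always cover the whole course, from start to end.
def fetch_window(course_start, course_end, latest_stored: nil, incremental: true)
  from = incremental && latest_stored ? latest_stored : course_start
  (from..course_end)
end

course_start = Date.new(2024, 1, 1)
course_end = Date.new(2024, 3, 31)
last_revision = Date.new(2024, 2, 10)

revisions_window = fetch_window(course_start, course_end, latest_stored: last_revision)
uploads_window = fetch_window(course_start, course_end, incremental: false)
```

Here revisions_window starts at the date of the last stored revision, while uploads_window always starts at the course start.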

ConstantUpdateWorker