I want to store stats for each course by time period (for example, week-by-week stats), and only use revision data during the update for a given time period, so that we can remove the Revisions table altogether. This will dramatically reduce the storage requirements of the system and remove one of the major database performance bottlenecks.
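As a rough illustration of the idea, the sketch below folds revisions into week-by-week stats so that only the aggregates need to be kept. The `Revision` struct, its fields, and the stat names (`revision_count`, `characters_added`, `article_count`) are hypothetical and chosen only for this example; they are not the system's actual schema.

```ruby
require 'date'

# Hypothetical, simplified revision: only the fields this sketch needs.
Revision = Struct.new(:date, :characters, :article_id)

# Returns the Monday that starts the week containing `date`.
def week_start(date)
  date - ((date.wday + 6) % 7)
end

# Folds revisions into one stats hash per week, keyed by the week's Monday.
# Once a week has been aggregated, its revisions no longer need to be stored.
def weekly_stats(revisions)
  revisions.group_by { |rev| week_start(rev.date) }.sort.to_h do |week, revs|
    [week, {
      revision_count: revs.size,
      # Assumption for this sketch: only positive character diffs count as "added".
      characters_added: revs.sum { |rev| [rev.characters, 0].max },
      article_count: revs.map(&:article_id).uniq.size
    }]
  end
end

revisions = [
  Revision.new(Date.new(2024, 1, 1), 120, 1),
  Revision.new(Date.new(2024, 1, 3), -30, 1),
  Revision.new(Date.new(2024, 1, 9), 500, 2)
]

weekly_stats(revisions).each do |week, stats|
  puts "#{week}: #{stats}"
end
```

With this shape, an update for a given week only needs the revisions that fall inside that week, and they can be discarded once the week's row is written.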
Currently the system has a `revisions` data table, where we save all the revision data. Revision data involves:

- Data imported by the `RevisionImporter` class. Ultimately, it fetches wiki revision data from an endpoint that provides SQL query results from a replica wiki database on wmflabs (see the `Replica` class).
- Data imported by the `RevisionScoreImporter` class. Behind the scenes, it uses the Lift Wing and reference-counter APIs to get values such as wp10 or the references count.
- Data imported by the `PlagiabotImporter` class. It gets the suspected plagiarism data from an external Toolforge tool, and updates the `ithenticate_id` field in the data table.

There are two main processes that modify the revisions table in different ways: `ScheduleCourseUpdatesWorker` and `ConstantUpdateWorker`.
Course updates pull in revisions and articles for courses that are ongoing or still within the update window. Courses are sorted into queues depending on how long they run: short courses have their own queue, and so do very long ones. The process runs every 5 minutes.
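The queue assignment described above could look roughly like the following sketch. The `ScheduledCourse` struct, the day thresholds, and the queue names are assumptions made for illustration; the text only says that short and very long courses each get their own queue.

```ruby
require 'date'

# Hypothetical stand-in for a course record.
ScheduledCourse = Struct.new(:slug, :start_date, :end_date)

# Assumed thresholds; the real cut-offs are not given in the text.
SHORT_COURSE_MAX_DAYS = 7
LONG_COURSE_MIN_DAYS = 365

# Picks an update queue based on how long the course runs.
# Queue names here are illustrative assumptions.
def queue_for(course)
  length_in_days = (course.end_date - course.start_date).to_i
  if length_in_days <= SHORT_COURSE_MAX_DAYS
    'short_update'
  elsif length_in_days >= LONG_COURSE_MIN_DAYS
    'long_update'
  else
    'medium_update'
  end
end

editathon = ScheduledCourse.new('editathon', Date.new(2024, 3, 1), Date.new(2024, 3, 2))
puts queue_for(editathon)
```

Keeping short and very long courses in separate queues means a handful of slow, long-running updates cannot hold up the many quick ones.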
Ultimately, it uses the `UpdateCourseStats` class, which takes the following actions related to revisions:

- The step implemented in `update_article_status`.
- The step involving the `summary` field in revisions records. See `update_wikidata_stats`.

Additionally, this process also updates `Article` records (see `import_revisions_slice`) and `ArticlesCourse` records (see the `update_from_course` class method).
Note: the `UpdateCourseStats` class fetches data from two main places: Revisions and Uploads. While Revisions are imported from the last existing revision, Uploads are retrieved for the entire course each time (from start to end).
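The difference between the two fetch strategies in the note can be sketched as follows. The `CourseWindow` struct and both helper methods are hypothetical names for this example, and falling back to the course start when no revision exists yet is an assumption.

```ruby
require 'date'

# Hypothetical stand-in for the course dates that bound an update.
CourseWindow = Struct.new(:start_date, :end_date)

# Revisions: incremental. Start from the last revision already imported,
# or from the course start if none exists yet (assumed fallback).
def revision_import_window(course, latest_revision_date)
  from = latest_revision_date || course.start_date
  [from, course.end_date]
end

# Uploads: always the full course range, from start to end.
def upload_import_window(course)
  [course.start_date, course.end_date]
end

course = CourseWindow.new(Date.new(2024, 1, 15), Date.new(2024, 5, 15))
puts revision_import_window(course, Date.new(2024, 3, 1)).inspect
puts upload_import_window(course).inspect
```

The incremental window is what makes per-period stats feasible for revisions: each update only needs the slice of data it has not seen yet.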