User:Erik Zachte/Wikistats 1

From Meta, a Wikimedia project coordination wiki
This is an archived version of this page, as edited by Erik Zachte (talk | contribs) at 08:11, 24 January 2020. It may differ significantly from the current version.

Jan 2020: I am the author of Wikistats 1: reports and data about the Wikimedia wikis, their authors and usage. I started this in 2003 (less than a year after I discovered Wikipedia and became an instant convert). I continued expanding and maintaining these scripts in my 10+ years as Wikimedia Data Analyst (Sep 2008 - Jan 2019).

Wikistats 1 is now gradually being replaced by Wikistats 2, a complete overhaul done by the WMF Analytics Team. One of the purposes of this article is to grasp where we stand in the transition to Wikistats 2.

I also intend to collect info about Wikistats 1 here for reference and easy access (even I have difficulty finding some data files I created). Whenever an opinion is included, it will be my personal view, and of course may contain my personal biases. Therefore this cannot be considered official documentation.

Data

An often heard complaint about Wikistats 1 was that the raw data weren't available for external processing. Actually many were/are available, but were hard to find, and often undocumented. I'll provide some pointers here.

Page views per project, per wiki, per month

Jan 2020: For page views per wiki, our archives and the counts derived from these archives are still updated and refreshed daily (hurray!).

  • Hourly page views (from two sources) have all been packed into one yearly tar file
These data have been patched on at least two occasions, in order to correct for massive under-counting (up to 40% of messages were not counted for +/- 8 months).
  • Csv files on multiple aggregation levels have all been packed into one zip file. This huge file is still refreshed daily (Jan 2020: size 122 MB).

In the zip file there is a separate folder for each project (wikipedia, wiktionary etc). Each folder contains counts per language since 2008 (some counts started later, e.g. to WMF mobile site). Counts have been broken down by day, week, month, day of week, overall total per language. A white list of genuine language codes is used to remove cruft.
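As a minimal sketch of how one might get an overview of such an archive: the function below groups the csv members of a zip file by their top-level folder (one folder per project, as described above). The demo archive and its file names are invented for illustration; the real file names and column layouts are not documented here, so the sketch deliberately does not parse the csv contents.

```python
import io
import zipfile
from collections import defaultdict


def csv_files_per_project(zip_source):
    """Group the csv members of a zip archive by their top-level folder."""
    per_project = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything that is not a csv file.
            if name.endswith("/") or not name.lower().endswith(".csv"):
                continue
            folder, _, filename = name.partition("/")
            if not filename:
                # File sits at the top level, outside any project folder.
                folder, filename = "", name
            per_project[folder].append(filename)
    return {project: sorted(files) for project, files in per_project.items()}


def build_demo_archive():
    """Build a tiny in-memory zip; all member names here are hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("wikipedia/demo_per_month_en.csv", "2019-12,100\n")
        archive.writestr("wikipedia/demo_per_month_nl.csv", "2019-12,10\n")
        archive.writestr("wiktionary/demo_per_month_en.csv", "2019-12,5\n")
        archive.writestr("wiktionary/readme.txt", "not a csv file\n")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for project, files in csv_files_per_project(build_demo_archive()).items():
        print(project, files)
```

To inspect a downloaded copy, pass its local path instead of the in-memory demo archive.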

Note: file projectviews_per_month_all_projects_html.csv is an exception, in that it contains html code snippets, which are reused in the last step of the batch process for the highest level overview report.

Defunct: File projectviews_per_hour_all.csv only contains counts till 2017 (and none even for Wikipedia).

Defunct: Files projectviews_per_month_popular_wikis_normalized_[yyyy][mm].csv were generated for all projects into one file, which was stored in the folder for Wikipedia data. These were only meant to be used in the original Monthly Report Card, and are no longer updated.

Page views per article, per hour

(see also section Reports below for Wikistats 2 alternative)

There are two sets of public data files about page views per article: one published per hour, and one aggregated into monthly chunks but still with hourly granularity (and extrapolations for missing hours).

I consider this the most important data stream from Wikistats 1. It is not about Wikimedia projects per se. It is about what the world at large did seek to learn, in our age. I see it as complementary to the Twitter Archive at the Library of Congress. If only we had such a treasure trove for data archaeologists from e.g. WW II, it would be used by many scholars. Its importance will grow in coming decades, as the data age and ripen. BTW it was a community project (shout-out to Mathias Schindler) that I took over, as it was better to keep it going on Wikimedia servers.

There is some redundancy for those data files, as hourly and monthly files are both publicly available (albeit on the same server, which therefore forms a single point of failure). Who wants to download and archive 720 hourly dumps when there is an aggregate version with no granularity lost, and less than one percent of the cumulative size of those 720?!

So the data gathering and publication are OK, and quite robust. But as for long-term preservation, I'm not so sure, with a single copy on dumps.wikimedia.org. That's why I started to back up to HDFS, with its much better redundancy and fail-over. That HDFS backup part of my script is broken now. Dan replied on Phabricator that he wants to take care of this. Thanks much, Dan :-)

These data files go back to 2008, when page view counts became available. Please be aware that the page view definition changed in 2015, when, among other things, requests by bots were no longer included in the data, and mobile traffic got fully counted.

Data collected from database dumps

These data have been collected up till Jan 2019. Data archive is here.

Reports

Page views per project, per wiki, per month

Wikistats 1 partial screenshot of page views report


Wikistats 1 reported on page views for nearly all Wikimedia wikis in many variations. These reports are still generated daily, but are no longer published.[1]
Older versions of the reports (March 2019) are online: example, list of all reports.

Levels of reporting in these reports:

  • Separate reports for mobile and non-mobile traffic, and also one for the overall total.
  • Separate reports for raw counts and normalized counts
A normalized version of the monthly counts was introduced to make it easier to compare months. Time and again people got confused when we had a 'drop in page views' in February. I used to say "Remember, February usually has almost 10% fewer days than January" (usual reply: "I got it, of course"). These normalized data were also used in the Monthly Report Card. These were less suited for external communication, but imho much better for internal discussion. Hence both versions of the reports coexisted peacefully.
  • Counts per project, per wiki
    • Monthly counts (with rounded numbers for brevity)
    • Monthly secondary metrics
      • month over month (MoM) growth for that wiki (the color of the cell also conveyed rate of decline/growth)
      • percentage of overall views that went to this wiki
      • ranking for this wiki in this month
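The normalization and the secondary metrics listed above can be sketched as follows. This is an illustration only, not the original perl code: the exact normalization Wikistats 1 used is not stated here, so scaling every month to 30 days is an assumption, and all function names are hypothetical.

```python
import calendar


def normalize_to_30_days(views, year, month):
    """Scale a monthly total to a 30-day month (assumed normalization)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return views * 30 / days_in_month


def month_over_month_growth(previous, current):
    """Return MoM growth as a percentage, or None without a baseline."""
    if not previous:
        return None
    return 100.0 * (current - previous) / previous


def share_and_rank(views_per_wiki):
    """Per wiki: percentage of overall views, and rank (1 = most viewed)."""
    total = sum(views_per_wiki.values())
    ordered = sorted(views_per_wiki, key=views_per_wiki.get, reverse=True)
    return {
        wiki: {
            "share_pct": 100.0 * views_per_wiki[wiki] / total if total else 0.0,
            "rank": position,
        }
        for position, wiki in enumerate(ordered, start=1)
    }


if __name__ == "__main__":
    # Same daily traffic in both months: 100 views per day (made-up numbers).
    january, february = 3100, 2800
    print(month_over_month_growth(january, february))
    print(month_over_month_growth(
        normalize_to_30_days(january, 2019, 1),
        normalize_to_30_days(february, 2019, 2),
    ))
    print(share_and_rank({"en": 60, "de": 30, "nl": 10}))
```

With identical daily traffic, the raw counts show an apparent February drop of nearly 10%, while the normalized counts show no change, which is the confusion the normalized reports were meant to avoid.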

Input for these reports is still publicly available and refreshed daily, see section Data above.

Page views per article, per hour

(see also section Data above)

WMF hosts a query tool which produces charts about page views for one or several articles. This is based on Wikistats 2 data. It is a widely applauded product (and I would say rightly so), made by the Analytics Team and others. Please be aware that the page view definition changed in 2015, when, among other things, requests by bots were no longer included in the data, and mobile traffic got fully counted. This tool only reports on counts from 2015 onwards.

example

Reports based on data collected from database dumps

to be added

Mailing list stats


No longer updated on the WMF site (reports can't be published, again merely an rsync issue).
At the end of 2019 WMF staff asked me about an update, so I switched these back to my own server, where they came from originally.
But eventually these reports may become obsolete, when mailing lists are migrated/reorganized (under consideration).

Surveys

Supporter scores on Wikistats 1 dump based reports (2016)
Two surveys were held in 2015/2016 to provide a summary of what Wikistats 1 entails, and to ask for feedback on what our editor community cared about.

Visualizations

I use the term visualization (aka viz.) mostly for data renderings that go beyond simple bar or line charts.

Main viz's were these:


See introduction, animation.
Status: last refresh of data was in July 2018. Redoing this with data from Wikistats 2 should be doable.

Wikipedia Views Visualized


Monthly pageviews per country/region, or per wikipedia language. See visualization, documentation.
Status: last monthly update: September 2018. Data collection scripts are rather complex, require quite a lot of metadata, and also use a complex data structure.

Wikipedias active editors per million speakers


See visualizations
Status: data are from August 2018. Refreshing this with input from Wikistats 2 should be doable, and meta data from Wikistats 1 could be reused.

Wikipedia edits on a random day


See introduction, animation (use keys 1-6 to change metric, H for help).
Status: data were collected (with a custom built script) only once, on July 29, 2011.

Tools

  • Wikistats 1 was written mostly in Perl.
  • Older bar charts were HTML only, generated with Perl. Newer line charts were rendered with R.
  • Visualizations were done with JavaScript and HTML5, plus some external libraries.

Notes/References

  1. This still is a matter of debate between me and the Analytics Team [1] (Jan 23, 2020)

See also

This older overview of Wikistats 1
