Wikimedia Meta-Wiki

User:Erik Zachte/Wikistats 1

From Meta, a Wikimedia project coordination wiki
This is an archived version of this page, as edited by Erik Zachte (talk | contribs) at 17:31, 23 January 2020 (+ image). It may differ significantly from the current version.

Jan 2020: I was the author of the Wikistats 1 scripts, which I started in 2003 and continued to expand during my 10+ years as Wikimedia Data Analyst (Sep 2008 - Jan 2019).

I intend to collect some info about Wikistats 1 here, for reference and easy access (even I have difficulty finding some data files these days). Whenever an opinion is included, it will be my personal view, and of course may contain my personal biases. Therefore this cannot be considered official documentation.

An often-heard complaint about Wikistats 1 was that the raw data weren't available for external processing. Actually many were/are available, but they were hard to find and often undocumented. I'll provide some pointers here.

Page views per wiki per month, on all Wikimedia projects

Reports

Wikistats 1 partial screenshot of page views report


Wikistats 1 reported on page views for nearly all Wikimedia wikis in many variations. These reports are still generated daily, but are no longer published.[1] Older versions of the reports (March 2019) are online: example, list of all reports.

Levels of reporting in these reports:

  • Separate reports for mobile and non-mobile traffic, and also one for the overall total.
  • Separate reports for raw counts and normalized counts
A normalized version of the monthly counts was introduced to make months easier to compare. Time and again people got confused when we had a 'drop in page views' in February. I used to say "Remember, February usually has almost 10% fewer days than January" (usual reply: "I got it, of course"). These normalized data were also used in the Monthly Report Card. They were less suited for external communication, but imho much better for internal discussion. Hence both versions of the reports coexisted peacefully.
  • Counts per project, per wiki
    • Monthly counts (with rounded numbers for brevity)
    • Monthly secondary metrics
      • month over month (MoM) growth for that wiki (the color of the cell also conveyed rate of decline/growth)
      • percentage of overall views that went to this wiki
      • ranking for this wiki in this month
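
The normalization and the secondary metrics listed above can be sketched in a few lines of Python. This is an illustration only, not the original Perl code: the function names are hypothetical, and the choice of a 30-day target month is an assumption (the text above only says that counts were normalized to make months comparable).

```python
import calendar


def normalize_monthly_count(raw_count, year, month, target_days=30):
    """Scale a raw monthly count to a month of target_days days.

    target_days=30 is an assumption made for this sketch.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    return raw_count * target_days / days_in_month


def secondary_metrics(current, previous):
    """Compute per-wiki secondary metrics for one month.

    current, previous: dicts mapping wiki code -> monthly views.
    Returns a dict mapping wiki code -> (mom_growth_pct, share_pct, rank).
    mom_growth_pct is None when there is no usable previous count.
    """
    total = sum(current.values())
    # Rank 1 is the wiki with the most views this month.
    ranked = sorted(current, key=current.get, reverse=True)
    metrics = {}
    for rank, wiki in enumerate(ranked, start=1):
        prev = previous.get(wiki)
        mom = (current[wiki] - prev) / prev * 100 if prev else None
        share = current[wiki] / total * 100 if total else 0.0
        metrics[wiki] = (mom, share, rank)
    return metrics
```

With made-up numbers, a February with 2,800 views and a January with 3,100 views both normalize to 3,000, which is exactly the kind of apparent 'drop' the normalized reports were meant to remove.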

Data

Jan 2020: For page views per wiki, our archives and counts from these archives are still updated and refreshed daily (hurray!).

  • Hourly page views (from two sources) have all been packed into one tar file per year
These data have been patched on at least two occasions to correct for massive under-reporting (up to 40% of messages were not counted for roughly 8 months).
  • Csv files on many aggregation levels have all been packed into one zip file. This huge file is still refreshed daily (Jan 2020: size 122 MB).

In the zip file there is a separate folder for each project (wikipedia, wiktionary etc). Each folder contains counts since 2008 per language (some counts started later, e.g. mobile site). Counts have been broken down by day, week, month, and day of week, plus an overall total per language. A whitelist of genuine language codes is used to remove cruft.
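
For external processing, a first step might be to list which csv files each project folder holds. The sketch below assumes only the layout described above (one top-level folder per project); the archive built in `build_demo_archive` and its file names are made up for illustration and are not the real file names.

```python
import io
import zipfile
from collections import defaultdict


def csv_files_per_project(zip_source):
    """Map each top-level folder in the archive to the csv files inside it.

    zip_source may be a path or a binary file-like object.
    """
    files = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything that is not a csv file.
            if name.endswith("/") or not name.lower().endswith(".csv"):
                continue
            folder, _, rest = name.partition("/")
            if rest:  # ignore csv files that sit outside any project folder
                files[folder].append(rest)
    return {folder: sorted(names) for folder, names in files.items()}


def build_demo_archive():
    """Build a tiny in-memory zip with hypothetical names, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("wikipedia/per_month.csv", "en,2019-01,100\n")
        archive.writestr("wikipedia/per_day.csv", "en,2019-01-01,3\n")
        archive.writestr("wiktionary/per_month.csv", "en,2019-01,10\n")
        archive.writestr("readme.txt", "not a csv file\n")
    buffer.seek(0)
    return buffer
```

Calling `csv_files_per_project(build_demo_archive())` returns one entry per project folder; the same function can be pointed at a downloaded copy of the real zip file.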

Note: file projectviews_per_month_all_projects_html.csv is an exception, in that it contains html code snippets, which are reused in the last step of the batch process for the highest level overview report.

File projectviews_per_hour_all.csv only contains counts till 2017 (and none even for Wikipedia).

Files projectviews_per_month_popular_wikis_normalized_[yyyy][mm].csv were generated for all projects in one file, and stored with the Wikipedia data. They were only meant to be used in the original Monthly Report Card.

    tbc

    Page views per article, per hour

    We have two versions of pageviews per article, and again two versions of the latter

    I consider this the most important data stream from Wikistats 1. It is not about Wikimedia projects per se. It is about what the world at large did seek to learn, in our age. I see it as complementary to the Twitter Archive at the Library of Congress. If only we had such a treasure trove for data archaeologists from e.g. WW II, it would be used by many scholars. Its importance will grow in coming decades, as the data age and ripen. BTW it was a community project (shout-out to Mathias Schindler) that I took over, as it was better to keep it going on Wikimedia servers.

    There is some redundancy for those data files, as hourly and monthly files are both publicly available (albeit on the same server, which is then a single point of failure). Who would want to download and archive 720 hourly dumps when there is an aggregate version, with less than one percent in overall size, with no granularity lost?!
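
A minimal sketch of how hourly counts can be folded into one record per article without losing hourly detail. The in-memory layout used here is purely an illustration and does not reproduce the format of the actual monthly files; all function names are hypothetical.

```python
from collections import defaultdict


def merge_hourly(hourly_dumps):
    """Fold hourly dumps into one record per article title.

    hourly_dumps: iterable of ((day, hour), {title: views}) pairs.
    Returns {title: {(day, hour): views}}; hours with zero views are
    simply omitted, so they can still be recovered as 0.
    """
    merged = defaultdict(dict)
    for (day, hour), counts in hourly_dumps:
        for title, views in counts.items():
            if views:
                merged[title][(day, hour)] = views
    return dict(merged)


def monthly_total(merged, title):
    """Total views for a title over the whole merged period."""
    return sum(merged.get(title, {}).values())


def hourly_views(merged, title, day, hour):
    """Recover the original hourly count from the merged record."""
    return merged.get(title, {}).get((day, hour), 0)
```

Each title is stored once instead of once per hourly file, and `hourly_views` shows that the per-hour numbers remain recoverable from the aggregate.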

    So the data gathering and publication is OK. But as for long term preservation, I'm not so sure about that part, with a single copy on dumps.wikimedia.org. That's why I started to back up to HDFS, with its much better redundancy and fail-over. That HDFS backup part of my script is broken now. Dan replied on Phabricator that he wants to take care of this. Thanks much, Dan :-)

    Stats collected from database dumps

    Reports

    tbc

    Data

    These data have been collected up till Jan 2019. Data archive is here.

    Mailing list stats

    No longer updated on WMF site. Yet WMF staff asked me about an update, so I switched it back to my own server, where it ran originally. But eventually it may become obsolete, when mailing lists are migrated/reorganized.

    Surveys

    Supporter scores on Wikistats 1 dump based reports (2016)
    Two surveys were held in 2015/2016 to provide a summary of what Wikistats 1 entails, and to ask for feedback on what our editor community cared about.

    Visualizations

    I use the term visualization (aka viz.) mostly for data renderings that go beyond simple bar or line charts.

    Main viz's were these:


    See introduction, animation.
    Status: last refresh of data was in July 2018. Redoing this with data from Wikistats 2 should be doable.

    Wikipedia Views Visualized


    Monthly pageviews per country/region, or per wikipedia language. See visualization, documentation.
    Status: last monthly update: September 2018. Data collection scripts are rather complex, require quite a lot of metadata, and also use a complex data structure.

    Wikipedias active editors per million speakers


    See visualizations
    Status: data are from August 2018. Refreshing this with input from Wikistats 2 should be doable, and meta data from Wikistats 1 could be reused.

    Wikipedia edits on a random day


    See introduction, animation (use keys 1-6 to change metric, H for help).
    Status: data were collected (with a custom built script) only once, on July 29, 2011.

    Tools

    • Wikistats 1 has been written mostly in Perl.
    • Older bar charts were HTML only, generated with Perl. Newer line charts were rendered with R.
    • Visualizations were done with JavaScript, HTML5, and some external libraries.

    Notes/References

    1. This is still a matter of debate between me and the Analytics Team [1] (Jan 23, 2020)
