Historical Rubygem Download Data

Historical Rubygem download data is used to display download charts on project pages as well as for the upcoming calculation of trending projects.

To keep database resource usage in check, we persist only weekly stats, with Sunday as the cutoff day.

The download data is synced from the Rubygems.org API throughout each day, so the numbers do not share a common cutoff time and will always be slightly skewed. For example, one project might get its stats updated early in the day and another towards the end of the day, meaning the number for the latter has effectively accumulated most of an extra day of downloads. As "incorrect" as this may be, for the purposes of illustrating historical trends in gem downloads we consider this "good enough" ;)

For several gems that were around long before the Rubyforge to Gemcutter transition, there is a massive spike in download numbers in November 2012. This does not seem to be an issue with our dataset; the numbers simply increased significantly day over day at that point. If you know what happened, please get in touch with us! :) We suspect that historical Rubyforge download stats might have been retrofitted at that point, but we have not yet found anybody to confirm this :)

Where does the data come from?

The historical data has been assembled from two sources:

  • Ruby Toolbox database backups (for the period from late 2010 up to 2013)
  • The Bestgems.org API (from 2013 onwards). Thanks a lot to the project's creator xmisao for assembling and providing this data!

If you have access to even older historical gem download data, please get in touch!

While the Bestgems.org dataset has mostly daily stats available for its entire history, the Ruby Toolbox's historical data was quite patchy. The site used to sync download data only every few days for categorized projects, and only once every two weeks for uncategorized projects.

To get a continuous weekly number, we interpolated the missing values from the surrounding known ones, assuming linear day-to-day growth across multi-day gaps. Finally, to reduce storage requirements, the dataset was reduced to one value per week (on Sundays).
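
As a rough illustration of the interpolation and weekly reduction described above, here is a minimal Python sketch. It is not the code that was actually used to build the dataset: the function names `interpolate_daily` and `keep_sundays`, the dictionary-based data layout, and the rounding to whole downloads are all assumptions made for the example.

```python
from datetime import date, timedelta


def interpolate_daily(samples):
    """Fill multi-day gaps between known samples with linear growth.

    `samples` maps a date to the total download count seen on that day
    (hypothetical layout, chosen for this example only).
    """
    days = sorted(samples)
    if not days:
        return {}
    filled = {}
    for start, end in zip(days, days[1:]):
        gap = (end - start).days
        # Assume the same number of downloads on every day of the gap.
        step = (samples[end] - samples[start]) / gap
        for offset in range(gap):
            day = start + timedelta(days=offset)
            filled[day] = round(samples[start] + step * offset)
    # The last known sample has no successor, so copy it over as-is.
    filled[days[-1]] = samples[days[-1]]
    return filled


def keep_sundays(daily):
    """Reduce daily values to one value per week, taken on Sundays."""
    # date.weekday() returns 6 for Sunday.
    return {day: total for day, total in daily.items() if day.weekday() == 6}


# Two made-up samples taken a week apart, both on a Thursday.
samples = {date(2012, 1, 5): 1000, date(2012, 1, 12): 1700}
weekly = keep_sundays(interpolate_daily(samples))
print(weekly)  # {datetime.date(2012, 1, 8): 1300}
```

In this made-up example the gap spans seven days with 700 downloads, so the sketch assumes 100 downloads per day and assigns the Sunday in between (January 8) a total of 1300.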

If you'd like to run additional analytics on top of the dataset, we recommend using our production database exports.

See also: