The top open dataset repositories on GitHub include a variety of data, freely available for use by researchers, practitioners, and students alike.
Over the past several months we have looked at a number of top GitHub repository collections.
This post will be a bit different, in that we are looking at the top open dataset repositories that GitHub has to offer. The post was inspired by the GitHub Open Data Showcase, which is good but not very large. Ideally, I would like to make a list of the top open datasets on GitHub, period. However, this gets tricky: searching for "open data," or any variant of the term, returns far more than datasets on a site set up with the explicit goal of sharing open source projects and their data.
I decided to take the offerings in this showcase that were not explicitly noted as being out of date, add 3 additional strictly-dataset repos with the highest numbers of stars I could find from a simple search, rank them all by star count, and present them here. We have found at KDnuggets that datasets are one of the most sought-after pieces of the data science puzzle for many readers, and hopefully this fresh batch (at least, fresh from our perspective) is of use to some of our readers.
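For readers who want to repeat the exercise, the selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for this post: the function name `rank_by_stars` is hypothetical, the sample repository names and star counts are invented placeholders, and the showcase list is assumed to have already had its out-of-date entries removed. The `full_name` and `stargazers_count` keys mirror the fields GitHub's REST API returns for a repository.

```python
from typing import Dict, List


def rank_by_stars(showcase: List[Dict], search_results: List[Dict], extra: int = 3) -> List[Dict]:
    """Merge showcase repos with the top `extra` search hits, ranked by stars."""
    seen = {repo["full_name"] for repo in showcase}
    # Keep only search hits not already in the showcase, most-starred first.
    candidates = sorted(
        (r for r in search_results if r["full_name"] not in seen),
        key=lambda r: r["stargazers_count"],
        reverse=True,
    )
    combined = showcase + candidates[:extra]
    # Final ranking: every selected repo ordered by star count, highest first.
    return sorted(combined, key=lambda r: r["stargazers_count"], reverse=True)


# Placeholder data: names and star counts are invented for illustration only.
showcase = [
    {"full_name": "example/showcase-a", "stargazers_count": 120},
    {"full_name": "example/showcase-b", "stargazers_count": 45},
]
search_results = [
    {"full_name": "example/search-a", "stargazers_count": 900},
    {"full_name": "example/showcase-a", "stargazers_count": 120},
    {"full_name": "example/search-b", "stargazers_count": 60},
    {"full_name": "example/search-c", "stargazers_count": 30},
    {"full_name": "example/search-d", "stargazers_count": 10},
]

ranked = rank_by_stars(showcase, search_results)
for repo in ranked:
    print(repo["stargazers_count"], repo["full_name"])
```

Since star counts change constantly, any ranking produced this way is only a snapshot of the moment it was run.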
We are currently conducting our latest Annual KDnuggets Analytics Software Poll, so the particular percentages from last year may change, but we know that open source tools have been used by 73% of data scientists in the past 12 months. While this number reflects software, and not data, it is easy to surmise that open data is a heavily relied-upon commodity in data science and related data-oriented disciplines for research, practice, and production alike, for myriad reasons.
So here they are, the open dataset repos with the highest number of stars as of the time of writing.
Brought to us by Xiaming (Sammy) Chen, this seems to be the undisputed leader of the open dataset collections available on GitHub. This curated list is organized by topics such as biology, sports, museums, and natural language, and appears to include several hundred datasets. Most are free, but a disclaimer at the top of the list notes that some are not.