A GitHub for Data?

Author

Derek Willis

Published

July 31, 2010

Clay Johnson, late of Sunlight Labs and now writing at the splendidly-named InfoVegan, says that what the “Open Data” movement needs is a better way to store data on the Web. Something like a GitHub for data:

Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?

Technically, there’s not much reason why this couldn’t happen. Sure, some government datasets are very large, and some come in arcane and oddball formats, but these are technical problems that can be overcome. The biggest issue, for data-driven app contests and pretty much any other use of government data, is not that data is hard to store on the Web. It’s that data is hard to understand, no matter where you get it.

In a sense, a GitHub for data could help solve this problem, too, because you can write documentation, and many GitHub projects have excellent documentation. But there are also projects with very limited documentation – heck, some of them are mine. The biggest obstacle to better apps is that so few people really understand the data and its pitfalls. I’d like to see what Clay wants to see, too, but right now I’m more interested in:

gitdata install census-2010

If the person executing that command is, say, Paul Overberg.

That’s not to say that I’m in favor of a situation where only those with expertise have access to data. What I’m saying is that the very process Clay describes as a hassle:

A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.

is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.
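
To make that concrete, here is a rough sketch of the kind of thing an import step tends to surface. It is only an illustration: the sample rows, the column names, and the `profile_columns` helper are all made up for this example and do not come from any real census file. The idea is simply to count blanks and values that won't parse as numbers in each column before trusting any of them.

```python
import csv
import io
from collections import Counter


def profile_columns(rows):
    """Summarize each column of an iterable of dict rows.

    Hypothetical helper: counts blank cells, cells that do not parse
    as numbers, and how often each non-blank value appears.
    """
    profile = {}
    for row in rows:
        for name, value in row.items():
            if name is None:
                # csv.DictReader files surplus fields under the key None;
                # this sketch skips them rather than guessing a column.
                continue
            stats = profile.setdefault(
                name, {"blank": 0, "non_numeric": 0, "values": Counter()}
            )
            text = (value or "").strip()
            if not text:
                stats["blank"] += 1
                continue
            stats["values"][text] += 1
            try:
                # Tolerate thousands separators such as "6,392,017".
                float(text.replace(",", ""))
            except ValueError:
                stats["non_numeric"] += 1
    return profile


# Made-up sample standing in for a downloaded file.
SAMPLE = """state,population,median_income
AL,4779736,42081
AK,710231,(X)
AZ,"6,392,017",
"""

profile = profile_columns(csv.DictReader(io.StringIO(SAMPLE)))
for name, stats in profile.items():
    print(name, stats["blank"], stats["non_numeric"], len(stats["values"]))
```

In this invented sample, the income column turns out to hold one blank and one placeholder code, and the population column mixes plain digits with comma-formatted numbers. Neither quirk would necessarily be mentioned in the documentation, and both would break a naive conversion to a database.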