the case for open source data visualization

When I was in graduate school, the most closely studied part of the scientific publications we read was not the results, but the methods sections. (It was also, incidentally, often the hardest section to write for one’s own publications.) Methods sections are wonderful because they allow you to verify that someone else’s work is correct — by reproducing it yourself. But more importantly, methods sections allow you to build upon the work of others. They are the open source code of science.

Unfortunately, for all but a small fraction of data visualizations on the web, there are no methods sections being published. This is a shame, because it slows the free flow of ideas and prevents the creative extension of other people’s work.

Three conditions must be met for a data visualization to be considered open and reproducible:

  • Open Tools: The software tool used for the visualization must be freely available. Thankfully, many of the most powerful visualization software tools, languages, and frameworks are now open source, such as Processing, Prefuse, ActionScript, and R.
  • Open Code (or Methods): The actual code, script, and/or series of steps taken to generate the visualization must be published. (For example, Lee Byron released his code for a walkability heatmap of San Francisco.)
  • Open Data: The data that is visualized should also be available in the same washed and scrubbed format that was used for the visualization. Ideally, any code used to clean up the data would also be shared.

I grade some of the web’s existing data visualization sites using these criteria.

  • The New York Times routinely creates stunning graphics (like a visualization of 22 years of box office receipts), but we are left to guess how they were created. Grade: D
  • VisualComplexity, a graphics gallery of mostly complex networks (like genome networks), has pretty images but neither data nor visualization code. Grade: D
  • IBM’s ManyEyes has gorgeous visualizations (some of which are made with the Prefuse toolkit), and while the data is made available, the source code for the visualization is not. Grade: C
  • Processing’s exhibition page highlights several extraordinary visualizations created with its open-source framework. But unfortunately, no source code is available from the visual artists. Grade: C
  • The R Graphics Gallery does make source code for graphics available, but in more than half of the cases, no data is available. Grade: B
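
The three conditions can be turned into a simple checklist. The sketch below is only an illustration: the `Openness` record, the `grade` function, and especially the mapping from "number of conditions met" to a letter are assumptions made for this example. The grades above were assigned by hand and may weigh the criteria differently, so the example uses hypothetical cases rather than the sites listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Openness:
    """Which of the three conditions a visualization meets."""
    open_tools: bool
    open_code: bool
    open_data: bool


# Assumed mapping from the number of conditions met to a letter grade.
# The post does not define this scale; it is chosen purely for illustration.
GRADE_BY_COUNT = {0: "D", 1: "C", 2: "B", 3: "A"}


def grade(v: Openness) -> str:
    """Return a letter grade based on how many conditions are met."""
    met = sum([v.open_tools, v.open_code, v.open_data])
    return GRADE_BY_COUNT[met]


# Hypothetical examples, not the sites graded above.
examples = {
    "nothing published": Openness(False, False, False),
    "data only": Openness(False, False, True),
    "open tool and code, no data": Openness(True, True, False),
    "fully reproducible": Openness(True, True, True),
}

for name, v in examples.items():
    print(f"{name}: {grade(v)}")
```

A checklist like this treats the three conditions as equally important. A site that publishes data for only some of its graphics would need a finer-grained score than a yes/no flag can express.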

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.
