the rise of the data web

The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky” into a vibrant ecosystem of data where documents, and human actors, will play an ever smaller role. As others have noted, we’ve reached a tipping point in history: …

the seven secrets of successful data scientists

At O’Reilly’s “Making Data Work” seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data. What follows is a blog-ified and amended version of that talk, originally entitled “Secrets of Successful Data Scientists.” 1. Choose The…

mining the tar sands of big data

The consequence of sensor networks, cloud computing, and machine learning is that the data landscape is broadening: data is abundant, cheap, and more valuable than ever. It’s a rich, renewable resource that will shape how we live in the decades ahead, long after the last barrel has been squeezed from the tar sands of Athabasca.

four lessons for building a petabyte platform

In this post I’ll share some of the thinking behind our choices for the Big Data stack that powers our petabyte platform, which consists of three layers: (i) a processing and storage substrate based around Hadoop and HBase, (ii) an analytics engine that mixes R, Python, and Pig, and (iii) a visualization console and data API built principally in JavaScript. Read…

how xml threatens big data

Confessions from a Massive, Nightmarish Data Project

Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like Genbank. The problem was that the repositories, all with millions of records, each had their own format. It sounded like a…

the data singularity, part ii: human-sizing big data

“There are no more promising or important targets for basic scientific research than understanding how human minds… solve problems and make decisions effectively.” Herbert Simon

In my previous post, I discussed the forces behind what I’m calling The Data Singularity. My basic thesis is that as information-generating processes become more frictionless, as humans have…

the data singularity is here

Originally published March 8, 2010 at Dataspora. In this blog post I’ll attempt to sketch the forces behind what I’m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences. In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around…

Knuth’s reservoir sampling in Python and Perl

Algorithms that perform calculations on evolving data streams, but in fixed memory, have increasing relevance in the age of Big Data. The reservoir sampling algorithm outputs a sample of N lines from a file of undetermined size. It does so in a single pass, using memory proportional to N. These two features – (i) a constant…
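
The single-pass, fixed-memory idea in this excerpt can be sketched in a few lines of Python. This is a minimal illustration of the classic replace-at-random variant (often called Algorithm R), not necessarily the code from the full post; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], n: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to n items chosen uniformly at random from stream.

    The stream is read once and only n items are held in memory.
    (Hypothetical helper for illustration.)
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Item i (0-based) survives with probability n / (i + 1):
            # pick a slot in [0, i] and keep the item only if the slot
            # falls inside the reservoir.
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = item
    return reservoir
```

To sample lines from a file, a file object can be passed directly as the stream, for example `reservoir_sample(open("data.txt"), 10)`, where `data.txt` is a placeholder path. If the stream holds fewer than N items, the sketch simply returns all of them.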

the three sexy skills of data geeks

(I originally penned this on May 27, 2009 and published on the Dataspora blog.) Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly: “The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process…

color: the cinderella of dataviz

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.” — Envisioning Information, Edward Tufte, Graphics Press, 1990

Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software…