m. e. driscoll

the data singularity, part ii: human-sizing big data

“There are no more promising or important targets for basic scientific research than understanding how human minds… solve problems and make decisions effectively.”

Herbert Simon

In my previous post, I discussed the forces behind what I’m calling The Data Singularity. My basic thesis is that as information-generating processes become more frictionless, as humans are excised from information read-write loops, the velocity and volume of data in the world are increasing, and at an exponential rate.

But where do we go from here? What are the consequences of living in an age where every datum is stored? Where are the bottlenecks, pain points, and opportunities? Which technologies are addressing these?

The upshot is this: a new class of tools is evolving for Big Data because traditional approaches can’t scale up. But these tools share a common goal: scaling down data, and making it human-sized. That’s the “reduce” part of MapReduce, the single statistic from an analysis, or the hundred-pixel line drawn from one hundred million events.

What’s happening today isn’t entirely new, though. There were echoes of it decades ago, when surveillance satellites first began scanning the globe.

VI. How Satellite Data Paralyzed the CIA

In the early 1970s the CIA began relying more heavily on global satellite reconnaissance imagery for its intelligence operations. But according to one history, this massive, rich data didn’t accelerate the pace of US intelligence: it slowed it down.

Why? Because confronted with this firehose, CIA leaders attempted to analyze every image, chase every half-formed hypothesis, simply because it was possible. The few good leads were washed out by the many mediocre. The CIA didn’t adjust their decision-making to this new scale, and they were drowned by it.

Many organizations are at a similar inflection point now, with access to massive, rich data about their customers or products. And, like the CIA in the 1970s, they find themselves paralyzed by the possibilities.

VII. People Still Pull the Big Levers

That Big Data paralyzes human decision-makers matters, because humans still make the big decisions. When someone praises a company as being “data-driven”, I’d like to imagine that this is literally true: that the company is nothing more than a few server racks blinking & humming away, slinging bits and earning money.

But no such company exists. What “data-driven” really means is that the executives & employees use data as inputs for making decisions. Companies may be data-fueled, but they’re people-driven.

VIII. Human-sizing Big Data: Filter & Crunch 

All of the analytics in the world won’t matter if it remains inaccessible to the people driving an organization — the human decision-makers.

We have processes all around us acting as data amplifiers, recording events at a pace & scale that we can’t comprehend. But this has created a disequilibrium: our capacity to create data is vastly outstripping our ability to consume it. Analytics is the act of taking Big Data streams and human-sizing them for our small data brains.

We can reduce data by either filtering it, which sifts through but does not alter data, or by crunching it, reducing many data points to a few.
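The distinction between the two can be sketched in a few lines of Python. This is only an illustration: the event records, the field names `user` and `ms`, and the helper names `filter_events` and `crunch_events` are all invented for the example.

```python
from statistics import mean

# Hypothetical event records, invented purely for illustration.
events = [
    {"user": "a", "ms": 120},
    {"user": "b", "ms": 340},
    {"user": "a", "ms": 95},
    {"user": "c", "ms": 410},
    {"user": "b", "ms": 60},
]


def filter_events(records, max_ms):
    """Filtering: keep a subset of the records; each survivor is unaltered."""
    return [r for r in records if r["ms"] <= max_ms]


def crunch_events(records):
    """Crunching: reduce many records to a handful of summary numbers."""
    latencies = [r["ms"] for r in records]
    return {
        "count": len(latencies),
        "mean_ms": mean(latencies),
        "max_ms": max(latencies),
    }


fast = filter_events(events, max_ms=200)  # fewer records, same shape
summary = crunch_events(events)           # one small dict, new shape
```

The filtered result is still made of the original records, just fewer of them, while the crunched result is a new, much smaller object that no longer contains any individual record.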

Google and Facebook are Filters. Many consumer web technologies might be viewed as powerful filters. Google is a relevance filter for 20 billion web pages. Facebook is a social filter for baby photos. FourSquare is a geo-social filter for hipster bars. Amazon is a filter for retail products, combining search with a powerful recommendation engine.

Wikipedia is a Natural Language Cruncher. Crunching data is harder than filtering it. Perhaps the toughest nut to crack involves processing natural language: if you read a thousand web pages about the Gutenberg Bible, how would you describe it in a few paragraphs? Wikipedia is a human-powered natural language cruncher, powered by its army of mechanical turks, whose collective actions even reveal news trends.

Crunch the Past to Predict the Future. Crunching of quantitative data is at the heart of many prediction tasks: the National Weather Service aggregates weather station measurements into forecasts, Fair Isaac calculates a score of credit-worthiness by examining your credit history, and a sports contest might be construed as an algorithm (operating on a sequence of individually played points) to predict the best team or athlete.

Number crunching has its more banal forms, as well, in the kind of sums and averages found in your phone or utility bill. These are necessary, but predictive algorithms, the kind involved in weather forecasting, will continue to grow in importance. For at a certain scale of data, exact reporting becomes an insurmountable task: we can only hope to have probabilistic answers.
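One simple form a probabilistic answer can take is an estimate computed from a random sample, reported together with its uncertainty. The sketch below assumes a synthetic population of numbers and a hypothetical helper, `approximate_mean`; it is not drawn from any particular tool, and real systems use far more elaborate methods.

```python
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample instead of a full scan.

    Returns the estimate and its standard error, so the answer carries
    its own uncertainty.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / sample_size ** 0.5
    return estimate, stderr


# Synthetic stand-in for a data set too large to report on exactly.
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

estimate, stderr = approximate_mean(population, sample_size=1_000)
exact = statistics.fmean(population)  # affordable here only because the data is small
```

The sample touches one percent of the data, and the standard error says roughly how far the estimate is likely to sit from the exact value.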

IX. Business Intelligence is Dead: New Tools for a New Era

That our traditional tools don’t operate at scale was highlighted by Tim O’Reilly recently, when he declared “Business intelligence as we knew it is dead.”

A new class of tools is emerging along the Big Data stack, in three areas: (1) storage & computation, (2) analytics, and (3) dashboards & visualization.

These tools will disrupt and attack many of the traditional Business Intelligence firms, ranging from tool-makers like SAS and SPSS, to relational database vendors like Oracle, to custom hardware providers.

X. Collaborating with Big Data: Analytics is a Social Process

In the same talk in which Tim O’Reilly proclaimed the death of BI “as we knew it”, he also highlighted a new initiative by Greenplum called Chorus (Greenplum is a Dataspora client, but I confess I’ve only seen a limited preview).

The animating spirit of Chorus is that analytics is not only about data, models, and visualizations; it’s also about the people who work on these various pieces. One of the reasons I love Box.net is the layer of social information that’s overlaid onto my files: appended notes, access statistics from collaborators, automatic notifications when a change is made.

Chorus is a vision to do this with Big Data; it allows, for instance, an analyst to link a data visualization to an underlying data source, include the R code that created the visualization, and append a note about a recent change to it.

As the Big Data stack matures, tools that help manage the workflow from data to analytics to visualizations, and ultimately to decisions, will be critical. Someday, creating and sharing a data analysis through a web dashboard should be as easy as writing a blog post. Until that day, there’s plenty of work to keep us data scientists well-employed.

Originally published on May 27, 2010 on the Dataspora blog.
