
“There are no more promising or important targets for basic scientific research than understanding how human minds… solve problems and make decisions effectively.”
Herbert Simon
In my previous post, I discussed the forces behind what I’m calling The Data Singularity. My basic thesis is that as information-generating processes become more frictionless, as humans are excised from information read-write loops, the velocity and volume of data in the world are increasing at an exponential rate.
But where do we go from here? What are the consequences of living in an age where every datum is stored? Where are the bottlenecks, pain points, and opportunities? Which technologies are addressing these?
The upshot is this: a new class of tools is evolving for Big Data because traditional approaches can’t scale up. But these tools share a common goal: scaling data down, making it human-sized. That’s the “reduce” part of MapReduce, the single statistic from an analysis, or the hundred-pixel line summarizing one hundred million events.
What’s happening today isn’t entirely new, though. There were echoes of it decades ago, when surveillance satellites first began scanning the globe.
VI. How Satellite Data Paralyzed the CIA
Beginning in the early 1970s, the CIA came to rely increasingly on global satellite reconnaissance imagery for its intelligence operations. But according to one history, this massive, rich data didn’t accelerate the pace of US intelligence: it slowed it down.
Why? Because confronted with this firehose, CIA leaders attempted to analyze every image and chase every half-formed hypothesis, simply because it was possible. The few good leads were washed out by the many mediocre ones. The CIA didn’t adjust its decision-making to this new scale, and it was drowned by it.
Many organizations are at a similar inflection point now, with access to massive, rich data about their customers or products. And, like the CIA in the 1970s, they find themselves paralyzed by the possibilities.
VII. People Still Pull the Big Levers
That Big Data paralyzes human decision-makers matters, because humans still make the big decisions. When someone praises a company as being “data-driven”, I’d like to imagine that this is literally true: that the company is nothing more than a few server racks blinking & humming away, slinging bits and earning money.
But no such company exists. What “data-driven” really means is that the executives & employees use data as inputs for making decisions. Companies may be data-fueled, but they’re people-driven.
VIII. Human-sizing Big Data: Filter & Crunch
All of the analytics in the world won’t matter if the results remain inaccessible to the people driving an organization: the human decision-makers.
We have processes all around us acting as data amplifiers, recording events at a pace & scale that we can’t comprehend. But this has created a disequilibrium: our capacity to create data is vastly outstripping our ability to consume it. Analytics is the act of taking Big Data streams and human-sizing them for our small data brains.
We can reduce data either by filtering it, which sifts out a subset without altering it, or by crunching it, which collapses many data points into a few.
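To make the distinction concrete, here’s a minimal sketch in base R, using a small hypothetical table of web events (the column names are mine): filtering selects rows without changing them, while crunching reduces many rows to a few summary numbers.

    # Hypothetical event log: one row per page view
    events <- data.frame(
      user    = c("alice", "bob", "alice", "carol", "bob"),
      page    = c("/home", "/pricing", "/docs", "/home", "/home"),
      seconds = c(12, 45, 8, 30, 5)
    )

    # Filtering: sift out the rows we care about, leaving them unaltered
    home_views <- subset(events, page == "/home")

    # Crunching: reduce many data points to a few summary statistics
    time_by_page  <- aggregate(seconds ~ page, data = events, FUN = sum)
    views_by_user <- table(events$user)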
Google and Facebook are Filters. Many consumer web technologies might be viewed as powerful filters. Google is a relevance filter for 20 billion web pages. Facebook is a social filter for baby photos. FourSquare is a geo-social filter for hipster bars. Amazon is a filter for retail products, combining search with a powerful recommendation engine.
Wikipedia is a Natural Language Cruncher. Crunching data is harder than filtering it. Perhaps the toughest nut to crack involves processing natural language: if you read a thousand web pages about the Gutenberg Bible, how would you describe it in a few paragraphs? Wikipedia is a human-powered natural language cruncher, driven by its army of mechanical turks, whose collective actions even reveal news trends.
Crunch the Past to Predict the Future. Crunching of quantitative data is at the heart of many prediction tasks: the National Weather Service aggregates weather station measurements into forecasts, Fair Isaac calculates a score of credit-worthiness by examining your credit history, and a sports contest might be construed as an algorithm, operating on a sequence of individually played points, to predict the best team or athlete.
Number crunching has its more banal forms as well, in the kind of sums and averages found in your phone or utility bill. These are necessary, but predictive algorithms (the kind involved in weather forecasting) will continue to grow in importance. For at a certain scale of data, exact reporting becomes an insurmountable task: we can only hope to have probabilistic answers.
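As a toy illustration of that last point, using nothing but simulated data in base R: an exact average must touch every record, while a modest random sample yields an estimate, with quantified uncertainty, at a fraction of the cost.

    set.seed(1)
    # Pretend this vector is too large to scan on every query
    call_minutes <- rexp(1e7, rate = 1/12)

    exact_mean <- mean(call_minutes)       # exact, but touches all 10 million records

    s <- sample(call_minutes, 10000)       # probabilistic: a 10,000-record sample
    estimate  <- mean(s)
    std_error <- sd(s) / sqrt(length(s))   # rough 95% interval: estimate +/- 2 * std_error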
IX. Business Intelligence is Dead: New Tools for a New Era
That our traditional tools don’t operate at scale was highlighted by Tim O’Reilly recently, when he declared “Business intelligence as we knew it is dead.”
A new class of tools is emerging along the Big Data stack, in three areas: (1) storage & computation, (2) analytics, and (3) dashboards & visualization.
These tools will disrupt and attack many of the traditional Business Intelligence firms, ranging from tool-makers like SAS and SPSS, to relational database vendors like Oracle, to custom hardware providers.
- 1. Storage & Computation: Mixed Platforms, not Monolithic Databases. At the lowest level of storage & computation, Big Data is driving the success of cloud computing platforms like Amazon’s Elastic Compute Cloud (a massive, virtualized commodity-hardware grid) as an alternative to the Big Iron sold by hardware makers. Big Data has also catalyzed widespread adoption of the distributed, fault-tolerant Hadoop platform, an open-source implementation of Google’s MapReduce and distributed file system that was developed at Yahoo and is now commercially supported by Cloudera.
A bit further up the stack, relational databases are suffering: newer commercial entrants in this space, such as Greenplum, Aster Data, Vertica, and Netezza, offer parallelized relational systems that operate at greater scale and lower cost than Oracle and Teradata. Many open-source, non-relational data stores, with a colorful constellation of names such as HBase, MongoDB, CouchDB, Cassandra, and Voldemort, have gained traction for high-traffic, content-driven web sites.
SQL & NoSQL are Complementary, Not Antagonistic. While some may view storage technologies as antagonistic, either-or choices, the truth is that most Big Data-driven companies use a mixture of tools in complementary ways. Hadoop is often used for batch-processing and transformation of log data that is fed to more structured data stores, such as a distributed RDBMS, in backend systems. Non-relational data stores are in turn ideal for front-facing, high-performance web applications, where queries return a bolus of data related to a single key (often a product, user, or page identifier). All of these pieces form an information platform: an ecosystem of APIs working together.
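That division of labor can be mimicked in miniature. The sketch below is purely illustrative, with R standing in for both stages and an R environment playing the part of the key-value store: a batch job crunches raw logs into per-key summaries, which the front end then serves with a single lookup.

    # Batch stage: crunch raw page-view logs into one summary row per user
    logs <- data.frame(
      user  = c("alice", "bob", "alice", "bob", "bob"),
      bytes = c(512, 2048, 128, 4096, 1024)
    )
    summary_by_user <- aggregate(bytes ~ user, data = logs, FUN = sum)

    # Serving stage: load the summaries into a key-value store (an R environment here),
    # so a front-end query for one user returns its bolus of data in a single lookup
    kv <- new.env()
    for (i in seq_len(nrow(summary_by_user))) {
      assign(as.character(summary_by_user$user[i]), summary_by_user[i, ], envir = kv)
    }
    get("bob", envir = kv)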
- 2. Analytics: There Are No Turnkey Solutions. Imagine if any piece of data you ever wanted was within a query’s reach: what would you do with it? We’re fast approaching this scenario, and making data meaningful is the bottleneck. But unlike storing data, where use cases & technologies are common and becoming commoditized, the ways that firms filter and crunch their data vary widely.
This reflects the range of analytics needs that firms have: for example, a financial firm may need low-latency, continuous analysis of data streams, while an online retailer or pharmaceutical firm can tolerate 24-hour delays for analysis.
Scaling Up Analytics is Hard. R, my favorite analytics tool, is fantastic for modeling either aggregated data sets or samples of data that can fit in memory, but methods for deploying R in a large-scale data environment are still nascent. One promising approach is Saptarshi Guha’s RHIPE, which combines R with Hadoop (see the slides from his March presentation at the Bay Area R Users Group). Another MapReduce-based framework for large-scale data analysis is the Apache Mahout project.
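RHIPE’s own interface is a topic for another post, but the underlying map-reduce pattern is easy to sketch in plain R: map a partial summary over chunks of the data, then reduce the partials into a global answer. Computing a mean this way, for instance:

    # Split a large vector into chunks, as if distributed across nodes
    x <- rnorm(1e6)
    chunks <- split(x, cut(seq_along(x), 10))

    # Map: each "node" emits a partial (sum, count) pair for its chunk
    partials <- lapply(chunks, function(chunk) c(sum = sum(chunk), n = length(chunk)))

    # Reduce: combine the partials into a global mean
    totals      <- Reduce(`+`, partials)
    global_mean <- totals["sum"] / totals["n"]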
Learn, Then Apply: But Stay Close to the Data. In general, there are two pieces in any analytics pipeline: (i) learning, or the training of a model with historical data, and (ii) prediction, or the application of a model to new data. On the learning side, it’s been said that more data beats better algorithms, and this is certainly true for many classification problems. Training a model is a computationally intensive task, and the development of methods that can train on massive data sets is an area of active research.
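In R, these two halves map naturally onto model fitting and predict(); here’s a minimal sketch with simulated churn data (all the variable names are hypothetical):

    set.seed(42)
    # (i) Learning: train a model on historical data
    history <- data.frame(
      visits = rpois(1000, 5),
      tenure = runif(1000, 0, 36)
    )
    history$churned <- rbinom(1000, 1, plogis(-0.3 * history$visits + 0.05 * history$tenure))

    model <- glm(churned ~ visits + tenure, data = history, family = binomial)

    # (ii) Prediction: apply the trained model to new data as it arrives
    new_users  <- data.frame(visits = c(1, 8), tenure = c(2, 30))
    churn_risk <- predict(model, newdata = new_users, type = "response")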
On the application/prediction side of modeling, the challenges often revolve around deployment: how do we get the model to the data? (The reverse, pushing data to the model, is more expensive.) To address the need to port models across different environments, PMML (the Predictive Model Markup Language) has been developed, and it is supported by a range of database vendors.
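R can already emit PMML for common model types via the pmml package, with saveXML() from the XML package writing the result to disk; assuming both packages are installed, the export is a one-liner:

    library(pmml)   # assumes the pmml and XML packages are installed
    library(XML)

    # Fit a model in R, then serialize it to PMML for deployment elsewhere
    fit <- lm(dist ~ speed, data = cars)           # 'cars' ships with base R
    saveXML(pmml(fit), file = "braking_model.xml")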
The meme of “in-database analytics” is resonating because, given data’s increasing heft, efficient analytics means keeping the training & execution of models close to where the data lives.
As it will be several years before either open-source or commercial analytics tools are mature here, the most successful Big Data modelers will be those data scientists who can build and glue together their own methods, tailored for individual environments and needs.
- 3. Dashboards & Visualization: Why “I See” is a Synonym for “I Understand”. The most visible way in which Big Data is disrupting old tools is by changing the way we look at data. The ultimate end-point for most data analysis is a human decision-maker, whose highest bandwidth channel is his or her eyeballs. To take optimal advantage of the human visual system, dashboards and data visualization must be well-designed, and until recently, tools that achieved even a minimal standard of competence were rare.
Visual Literacy is on the Rise. But a new set of visualization tools and packages, as well as growing popular interest in data visualization (catalyzed by the books of Edward Tufte, blogs like Nathan Yau’s FlowingData, and talks at TED conferences), is changing this. As I’ve written about before, there are two distinct kinds of data visualization pathways: (i) exploratory, a highly interactive path whereby a data scientist may permute through dozens or even hundreds of views of a data set to understand its shape or fit to a hypothesized model, and (ii) narrative, a more constrained path whereby only one or a few views of the data are presented.
Exploring Data Requires Fast, Frequent Feedback. For the exploratory path, desktop tools are ideal. The open-source language R has several outstanding visualization packages, including ggplot2 and lattice (based on William Cleveland’s Trellis). Two solid commercial products for exploratory visualization are SpotFire and Tableau (the latter of which has been praised by the hard-to-please Stephen Few).
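That speed matters because exploration is iterative; with ggplot2, for instance, permuting through views of the same data set is a matter of swapping a line (the diamonds data set ships with the package):

    library(ggplot2)

    # Three quick passes over the same data, each a different view of its shape
    p <- ggplot(diamonds, aes(x = carat, y = price))
    p + geom_point(alpha = 0.1)                                       # raw scatter
    p + geom_point(alpha = 0.1) + scale_x_log10() + scale_y_log10()   # log-log view
    p + geom_point(alpha = 0.1) + facet_wrap(~ cut)                   # small multiples by cut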
Sharing Visualizations: Web Dashboards Are Ideal. Ultimately, however, visualizations need to be shared beyond a single user, to an audience. Web-driven dashboards are an ideal form for sharing narrative visualizations, allowing navigation along defined axes of the data. The challenge is moving visualizations from the desktop to the web. Tableau has this capacity, but with R the process is less straightforward. One promising route is via Jeff Horner’s RApache tool, which embeds R inside an Apache server (I’ve used it for my MLB Pitch F/X tool, and Jeroen Ooms uses it to power his ggplot2 web app).
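The RApache wiring itself is beyond the scope of this post, but the R side of a simple dashboard can reduce to rendering a chart to a file the web server can hand out; a minimal sketch, with a hypothetical path under the server’s document root:

    library(ggplot2)

    # Render a chart to a PNG under the web server's document root (hypothetical path);
    # a dashboard page can then embed it with an ordinary <img> tag
    p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
    ggsave("/var/www/dashboard/mpg_vs_weight.png", plot = p, width = 6, height = 4)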
The major limitation of R-driven web graphics is that achieving interactivity within the graphic itself is difficult, as R’s graphics model is focused on static graphics. There are, however, several routes for achieving highly interactive, web-based data visualizations, whether by using JavaScript, HTML5’s Canvas, or Flash. Two in particular are: (i) Ben Fry’s Processing, an expressive language for vector animation, which recently added JavaScript as one of its implementations, and (ii) the Protovis framework out of Stanford, a JavaScript graphing toolkit whose conceptual integrity and expressive flexibility were inspired (like ggplot2) by Wilkinson’s grammar of graphics.
X. Collaborating with Big Data: Analytics is a Social Process
In the same talk in which Tim O’Reilly proclaimed the death of BI “as we knew it”, he also highlighted a new initiative by Greenplum called Chorus (Greenplum is a Dataspora client, but I confess I’ve only seen a limited preview).
The animating spirit of Chorus is that analytics is not only about data, models, and visualizations; it’s also about the people who work on these various pieces. One of the reasons I love Box.net is the layer of social information that’s overlaid onto my files: appended notes, access statistics from collaborators, automatic notifications when a change is made.
Chorus is a vision of doing this with Big Data; it allows, for instance, an analyst to link a data visualization to an underlying data source, include the R code that created the visualization, and append a note about a recent change to it.
As the Big Data stack matures, tools that help manage the workflow from data to analytics to visualizations, and ultimately to decisions, will be critical. Someday, creating and sharing a data analysis through a web dashboard should be as easy as writing a blog post. Until that day, there’s plenty of work to keep us data scientists well-employed.
Originally published on May 27, 2010, on the Dataspora blog.