beyond hadoop: fast queries from big data


There’s an unspoken truth lurking behind the scourge of Big Data and the heralding of Hadoop as its savior:

While Hadoop shines as a processing platform, it is awkward as a query tool.

Hive was developed by the folks at Facebook in 2008, as a means of providing an easy-to-use, SQL-like query language that would compile to MapReduce code. A year later, Hive was responsible for 95% of the Hadoop jobs run on Facebook’s servers. This is consistent with another observation made by Cloudera’s Jeff Hammerbacher: when Hive is installed on a client’s Hadoop cluster, its overall usage increases tenfold.

That data-heavy businesses can achieve visibility into the terabytes of logs that they generate is, at a primary level, a major step forward. Before the Hadoop era, this was difficult to impossible without a major engineering investment. Thus Hadoop has solved the challenge of economically processing data at scale. Hive has solved the challenge of hand-writing Hadoop queries.

But there remains a painful challenge that Hive and Hadoop does not solve for: speed.

A Powerful But Lumbering Elephant

Hadoop does not respond anywhere close to “human time”, a term that describes response thresholds acceptable to a human user, typically on the order of seconds. Larry Ellison and his marketing mavens invoke a similar theme when pitching their wares as “analytics at the speed of thought.”

Nonetheless, this sluggishness is not the fault of Hive or Hadoop per se. If a business user asks a question about a year’s worth of data with Hive, a set of MapReduce jobs will dutifully scan and process, in parallel, terabytes of data to obtain the answer. It’s neither the commodity hardware that most Hadoop clusters use nor some of its IO indulgences while executing processes, that are to blame. These are the low-order performance bits.

And while Hadoop jobs do have a fairly constant overhead – with a lower bound in the range of 15 seconds – this is often considered trivial within the context of the minutes or hours that most full jobs are expected to take.

The higher-order bits affecting query performance are: (i) the size of the data being scanned, (ii) the nature of storage, e.g. whether it is kept on disk or in memory, and (iii) the degree of parallelization.

An Emerging Design Pattern: Distill, then Store

As a result, a common design pattern is emerging among data-heavy firms: Hadoop is used as a pre-processing tool to generate summarized data cubes, which is then loaded into an in-memory, parallelized database – be it Oracle Exalytics, Netezza, Greenplum or even Microsoft SQL Server. Occasionally, a traditional database query layer can be bypassed altogether, and summary data cubes can be loaded directly into a desktop analytics tool such as Qlikview, Spotfire, or Tableau.

At my start-up Metamarkets, we have embraced this design pattern and the role that Hadoop plays in preparing data for fast queries. Our particular bag of tricks is best described by the three principles of Druid:

  • Distill: We roll data data up the coarsest grain at which a user might have reasonable interest. Put simply, it is rare that one is concerned with individual events at one-second time frames. Rolling up to groups of events, with a select set of dimensions and at minutely or hourly granularity, can distill raw data’s footprint down to 1/100th of its original size.
  • Distribute: While this summarized data is spread across multiple nodes in our cluster, the queries against this data are also distributed and parallelized. In our quest to break into the “human time” threshold, we have increased this parallelization to as many as 1000 cores, allowing each query to hit a large percentage of nodes on our cluster. In our experience, CPUs are rarely the bottleneck for systems serving human clients, even for a cluster serving hundreds of users concurrently.
  • Keep in Memory: We share Curt Monash’s sentiment that traditional databases will eventually end up in RAM , as memory costs continue to fall. In-memory analytics are popular because they are fast, often 100x to 1000x faster than disk. This dramatic performance kick is what makes Qlikview such a popular desktop tool.

The end result of these three techniques, each of which independently delivers between a 10 and 1000-fold improvement, is a platform that can run in seconds what previously took minutes or even hours in Hive.

This approach, for which we know we are not alone in pursuing, achieves performance that exceeds or matches any of the big box retailers at a considerably lower price point.

The commoditization wave that began with massive data processing, initiated by Hadoop, is migrating upwards towards query architectures. Thus the competitive differentiators are shifting away from large-scale data management and towards what might be called Big Analytics, where the next battle for profits will be fought.

(reblogged from a version I wrote at the Metamarkets blog).

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.

Leave a Reply

%d bloggers like this: