beyond hadoop: fast queries from big data


There’s an unspoken truth lurking behind the scourge of Big Data and the heralding of Hadoop as its savior:

While Hadoop shines as a processing platform, it is awkward as a query tool.

Hive was developed by the folks at Facebook in 2008, as a means of providing an easy-to-use, SQL-like query language that would compile to MapReduce code. A year later, Hive was responsible for 95% of the Hadoop jobs run on Facebook’s servers. This is consistent with another observation made by Cloudera’s Jeff Hammerbacher: when Hive is installed on a client’s Hadoop cluster, its overall usage increases tenfold.

That data-heavy businesses can achieve visibility into the terabytes of logs that they generate is, at a primary level, a major step forward. Before the Hadoop era, this was difficult to impossible without a major engineering investment. Thus Hadoop has solved the challenge of economically processing data at scale. Hive has solved the challenge of hand-writing Hadoop queries.

But there remains a painful challenge that Hive and Hadoop does not solve for: speed.

A Powerful But Lumbering Elephant

Hadoop does not respond anywhere close to “human time”, a term that describes response thresholds acceptable to a human user, typically on the order of seconds. Larry Ellison and his marketing mavens invoke a similar theme when pitching their wares as “analytics at the speed of thought.”

Nonetheless, this sluggishness is not the fault of Hive or Hadoop per se. If a business user asks a question about a year’s worth of data with Hive, a set of MapReduce jobs will dutifully scan and process, in parallel, terabytes of data to obtain the answer. It’s neither the commodity hardware that most Hadoop clusters use nor some of its IO indulgences while executing processes, that are to blame. These are the low-order performance bits.

And while Hadoop jobs do have a fairly constant overhead – with a lower bound in the range of 15 seconds – this is often considered trivial within the context of the minutes or hours that most full jobs are expected to take.

The higher-order bits affecting query performance are: (i) the size of the data being scanned, (ii) the nature of storage, e.g. whether it is kept on disk or in memory, and (iii) the degree of parallelization.

An Emerging Design Pattern: Distill, then Store

As a result, a common design pattern is emerging among data-heavy firms: Hadoop is used as a pre-processing tool to generate summarized data cubes, which is then loaded into an in-memory, parallelized database – be it Oracle Exalytics, Netezza, Greenplum or even Microsoft SQL Server. Occasionally, a traditional database query layer can be bypassed altogether, and summary data cubes can be loaded directly into a desktop analytics tool such as Qlikview, Spotfire, or Tableau.

At my start-up Metamarkets, we have embraced this design pattern and the role that Hadoop plays in preparing data for fast queries. Our particular bag of tricks is best described by the three principles of Druid:

  • Distill: We roll data data up the coarsest grain at which a user might have reasonable interest. Put simply, it is rare that one is concerned with individual events at one-second time frames. Rolling up to groups of events, with a select set of dimensions and at minutely or hourly granularity, can distill raw data’s footprint down to 1/100th of its original size.
  • Distribute: While this summarized data is spread across multiple nodes in our cluster, the queries against this data are also distributed and parallelized. In our quest to break into the “human time” threshold, we have increased this parallelization to as many as 1000 cores, allowing each query to hit a large percentage of nodes on our cluster. In our experience, CPUs are rarely the bottleneck for systems serving human clients, even for a cluster serving hundreds of users concurrently.
  • Keep in Memory: We share Curt Monash’s sentiment that traditional databases will eventually end up in RAM , as memory costs continue to fall. In-memory analytics are popular because they are fast, often 100x to 1000x faster than disk. This dramatic performance kick is what makes Qlikview such a popular desktop tool.

The end result of these three techniques, each of which independently delivers between a 10 and 1000-fold improvement, is a platform that can run in seconds what previously took minutes or even hours in Hive.

This approach, for which we know we are not alone in pursuing, achieves performance that exceeds or matches any of the big box retailers at a considerably lower price point.

The commoditization wave that began with massive data processing, initiated by Hadoop, is migrating upwards towards query architectures. Thus the competitive differentiators are shifting away from large-scale data management and towards what might be called Big Analytics, where the next battle for profits will be fought.

(reblogged from a version I wrote at the Metamarkets blog).

how Oracle, the Goliath of data, could stumble

 This week’s Oracle World was bracketed by two events. First: the unveiling of Oracle Exalytics, a beefy in-memory appliance dedicated to large-scale analytics, during Larry Ellison’s opening keynote. Second: the undressing of Oracle’s cloud computing initiatives by Marc Benioff, SalesForce’s CEO, and the unceremonious cancellation of his keynote on Wednesday morning.

Both events highlight that when it comes to Big Data, analytics and cloud computing, Oracle is on the wrong side of history.

Startups don’t use Oracle

To glimpse the future of the data stack, Oracle need look no further than its own backyard, to what Silicon Valley start-ups are embracing: the distributed processing ecosystem of Hadoop, NoSQL data stores like MongoDB, and cloud platforms like Amazon’s web services.  As Marc Andreessen said last week, “Not a single one of our startups uses Oracle.”

The truth is, Oracle can’t support the kind of technology stacks embraced by startups — open-source software, elastic architectures, commodity hardware grids — because it cannibalizes revenue from their existing lines of business.

“I don’t care if our commodity X86 business goes to zero,” Ellison said in Oracle’s last earnings call, “We don’t make money selling that.”

This commoditization wave has sent others, including HP, fleeing from hardware, but it has driven Oracle into the breach as a big box retailer: they’re attempting to capture higher margins on sales of SPARC architectures.

But history is not on Oracle’s side.  Here are four realities that Oracle must face to maintain its unassailable position as the world’s leading data firm:

#1: The future of data is distributed

“Lots of little servers everywhere, lots of little databases everywhere. Your information got hopelessly fragmented in the process.” – from Matthew Symonds book Softwar (p. 38).

This is how Larry Ellison described the technology landscape of the 1990s, and his personal jihad against complexity has deepened Oracle’s distrust of distributed computing.

But the tide of data isn’t turning back, and the scale is too large to contain in any box; Big Data, on the scale of hundreds of terabytes to petabytes, must be distributed across “lots of little servers.” The most viable tool available today for processing and persisting Big Data is Hadoop.

Whether at the data layer — or a level above, at analytics — firms must adapt to this distributed reality and build tools that enable parallelized, many-to-many migration of data between nodes on Hadoop and those on their own platforms.

#2: The future of computing is elastic

Metal server boxes don’t bend or expand; they are inelastic, both physically and economically.  In contrast, the needs of businesses are highly elastic; as companies grow, they shouldn’t have to unpack and install boxes to meet their compute needs, any more than they should install generators for more electricity.

Computing is a utility, compute cycles are fungible, and firms want to pay for what they need, when it’s needed, like electricity.

The ability to scale storage and compute capacity up or down, within minutes, is liberating for individuals and cost-effective for organizations, but it is impossible with a “cloud in a box.”  It is only enabled by a true cloud computing infrastructure, with virtualization and dynamic provisioning from a common pool of resources.

#3: The future of applications isn’t the desktop

Despite Oracle having developed the first pure network computer in 1996 (or perhaps because of this), far too many of Oracle’s supporting business applications are delivered via the desktop, rather than via web browsers.

By comparison, Cloudera has created a rich web-based application for managing and monitoring all aspects of Hadoop clusters; Amazon Web Services has a fully-featured web console for interacting with its offerings; and Salesforce’s products are almost exclusively web-driven.

The expressivity afforded by web browsers has risen dramatically in the last two years, particularly with the emergence of Javascript as the lingua franca of web application development, and improvements in Javascript engines.

The same trend from desktop to browser also extends into mobile devices.  An increasingly large fraction of computing occurs on smart phones and tablets, and forward-thinking firms, like Dropbox, have built applications that cater to this reality.

#4: The future of analytics is visual

The decades of disappointment with business intelligence tools isn’t due only to their lack of brains (such that they’ve now fled to the fresh moniker of “business analytics”), but also the absence of beauty. Data is beautiful, as any reader of Edward Tufte can attest.

When visualized thoughtfully and artfully, data has an almost hymnal power to persuade decision makers.  And when exploring data of high complexity and dimensionality, the kind that lives in Oracle’s databases, tools that accelerate the “mean time to pretty chart” are essential.

In addition, analytics tool users are right to expect a smooth user experience on a par with other tools, whether photo editing or word processing, when they are creating and exploring data visualizations.

Yet amidst all of Oracle’s presentations and marketing materials about big data and analytics, one finds not a single dashboard or visualization to stir the senses.

While Spotfire and Tableau are notable exceptions to this critique, on the whole, the tools that dot the Oracle landscape lack either brains or beauty.

Enterprises will be slow to wake up to these realities, and Oracle will continue to profit handsomely from their slumber.

Fin: Oracle is ripe for attack by data services

The opportunities abound to chip away at the massive market share that Oracle now holds, providing data services to start-ups who won’t buy Oracle’s capital intensive boxes, and helping medium-sized businesses migrate to flexible, cost-effective, cloud-based alternatives.

(An earlier version of post was published as a guest column at GigaOm).

the secret guild of silicon valley

The governors of the guild of St. Luke, Jan de Bray

A couple of weeks ago, I was drinking beer in San Francisco with friends when someone quipped:

“You have too many hipsters, you won’t scale like that. Hire some fat guys who know C++.” 

It’s funny, but it got me thinking.  Who are the “fat guys who know C++”, or as someone else put it, “the guys with neckbeards, who keep Google’s servers running”? And why is it that if you encounter one, it’s like pulling on a thread, and they all seem to know each other?

The reason is because the top engineers in Silicon Valley, whether they realize it or not, are part of a secret Guild.  They are a confraternity of craftsmen who share a set of traits:

  • Their craft is creating software
  • Their tools of choice are C, C++, and Java – not Javascript or PHP
  • They wear ironic t-shirts, and that is the outer limit of their fashion sense
  • They’re not hipsters who live in the Mission or even in the city; they live near a CalTrain stop, somewhere on the Peninsula
  • They meet for Game Night on Thursdays to play Settlers of Catan
  • They are passive, logical, and Spock-like

They aren’t interested in tweeting, blogging, or giving talks at conferences.  They care about building and shipping code.  They’re more likely to be found in IRC chat rooms, filing JIRAs for Apache projects, or spinning out Github repos in their spare time.

They are part of a nomadic band of software tradesmen, who have mentored one another over the last four decades in Silicon Valley, and they have quietly, steadily built the infrastructure behind the world’s most successful companies.  When they leave – as they have places like Netscape, Sun, and Yahoo – the firms they leave behind wither and die.

If you want to build a technology company, you’ll need to hire them, but you’ll never find a member of the Guild through a recruiter.  They are being cold-called, cold-emailed, and cold-LinkedIn-messaged on a daily basis by recruiters, but their response will be similarly cold.

A true member of the Guild is only ever an IM away from a new job at Facebook, Google, or the long archipelago of start-ups their fellow members are busy building.  Outwardly successful companies that fail to draw engineers from the Guild will struggle with the performance and stability of their technology – as LinkedIn did in its early days and as Twitter did until recently.

It’s rare for an entrepreneur or executive to earn membership in the Guild, for that requires a path of apprenticeship that few have the talent or stamina for.  But it’s possible to earn the respect of the Guild, and to convince its members that your company is a hall where they can gather daily to mentor and develop their craft.

It begins with having an engineering-led culture, where technology decisions are made on their technical merits, never on personal grounds.  It also means allowing craftsmen to solve problems by creating new tools, rather than with just a labored application of the old.  These are values that Google and Facebook, two veritable Guild halls of the Valley, tout to any engineer who asks.

Finally, the implicit compact that the Guild makes with a company is that their efforts will not be in vain.  The most powerfully attractive force for the Guild is the promise of building a product that will get into the happy hands of hundreds, thousands, or millions.  This is the coveted currency that even companies that have struggled to build an engineering reputation, like foursquare, can offer. 

The Guild of Silicon Valley is largely invisible, but their affiliations have determined the rise and fall of technology giants.  The start-ups who recognize the unsung talents of its members today will be tomorrow’s success stories.

[ Addendum:  George E.P. Box said “All models are wrong.  Some models are useful.”  While my tongue-in-cheek model of the anti-hipster Guild of Engineers has angered those who interpret it literally, my rhetorical goal is to make a point:  that the hard work of engineering isn’t glamorous, and is often invisible to the media or the reigning pop culture of start-ups you’ll find in San Francisco.  If you want to build a successful technology company, you would do well to target the experienced  folks who have been honing their craft in the trenches of Silicon Valley for the last few decades, and those whom they’ve mentored. ]

node.js and the javascript age

Three months ago, we decided to tear down the framework we were using for our dashboard, Python’s Django, and rebuild it entirely in server-side Javascript, using node.js. (If there is ever a time in a start-ups life to remodel parts of your infrastructure, it’s early on, when your range of motion is highest.)

This decision was driven by a realization: the LAMP stack is dead. In the two decades since its birth, there have been fundamental shifts in the web’s make-up of content, protocols, servers, and clients. Together, these mark three ages of the web.

Read Full Post at Metamarkets

the rise of the data web


The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky”into an vibrant ecosystem of data where documents — and human actors — will play an ever smaller role.

As others have noted, we’ve reached a tipping point in history: more data is being manufactured by machines — servers, cell phones, GPS-enabled cars — than by people. The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.

Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext. Similarly, we’ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.

The web we experience will continue to be dominated by documents — e-mail, blogs, and news. And while many sites are data-centric — Google maps,, and Yahoo finance — it’s the web that we can’t see that surging with data. It’s not about us, it’s about servers in the cloud mediating entire pipelines of data, only occasionally surfacing in a browser.

But the web’s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data. As we build out the data web, we ought to embrace standards that mirror data’s form in its natural habitats — as programmatic data structures, relational tables, or key-value pairs — while taking advantage of data’s stream-like nature. Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.

Sacred “Words & Enthusiasm” vs Meaningless Utterances

Documents and data are different.  The table below reflects my thin grasp of the fissure lines, as a step towards arguing why we ought to design around them.

Documents are made of “words and enthusiasm”: sonnets, cake recipes, blog posts, Supreme Court rulings, and dictionary definitions. Their core stuffing is text. Their structure is unpredictable and irregular — even fractal.

Data are not created but collected (something given, not something made): city temperatures, stock prices, web visitors, and home runs. They are observations in time and space, with periodic and predictable structure. Data are re-orderable and divisible: you can relay city temperatures in any order, but you can’t rearrange a Shakespearian sonnet without muddling its meaning. Some documents are so meaningful as to be considered sacred.

Data are, in this regard, meaningless on their own; they do not signify, they simply are. These data are the utterances of the spimes that surround us.

Documents as Trees, Data as Streams

The argument for shifting away from markup languages as data formats is not just practical, it’s philosophical: it’s about pivoting our conception away from the dominant metaphor of documents — trees — towards one far more suitable for data — streams.

Trees are rooted and finite: you can’t chop up a tree and easily put it back together again (while XML has made concessions to document fragments, it is not a natural fit).

Streams can be split, sampled, and filtered. The divisibility of data streams lends itself to parallelism in a way that document trees do not. The stream paradigm conceives of data as extending infinitely forward in time. The Twitter data stream has no end: it ought to have no end tag.

Conceiving of data as streams moves us out of the realm of static objects and into the realm of signal processing. This is the domain of the living: where the web is not an archive but an organism, reacting in real-time.

XML Considered Harmful for Data

XML is a poor language for data because it solves the wrong problems — those of documents — while leaving many of data’s unique issues unaddressed.   But many promising alternatives exist — microformats like JSONThrift, and even SQLite’s file format – as I will detail in a my next post.

(Originally published August 20, 2009 on the Dataspora blog.)

the seven secrets of successful data scientists

At O’Reilly’s “Making Data Work” seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.

What follows is a blog-ified and amended version of that talk, originally entitled “Secrets of Successful Data Scientists.”

1. Choose The Right-Sized Tool

Or, as I like to say, you don’t need a chainsaw to cut butter.

If you’ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I’ve just endorsed that dull knife called Excel).

In fact, Excel’s and Emacs’ program-by-example keyboard macros can be fantastic tool for quick and dirty data clean-up. 

Alternatively, if you’ve got 600 million lines of data and you need something simple, piping together a several Unix tools (cut, uniq, sort) with a dash of Perl one-liner foo may get you there.

But don’t confuse this kind of data exploration, where the goal is to size up the data, with building proper data plumbing, where you want robustness and maintainability. Perl and bash scripts are nice for the former, but can be a nightmare for building data pipelines.

When you’re data gets very large, so big it can’t fit reasonably on your laptop (in 2010, that’s north of a terabyte), then you’re in Hadoop, parallelized database , or overpriced Big Iron territory.

So, when it comes to choosing tools: scale them up as you need, and focus on getting results first.

2. Compress Everything

We live in an IO-bound world, where the dominant bottlenecks to data flow are disk read-speed and network bandwidth.

As I was writing this, I was downloading an uncompressed CSV file via a web API. Uncompressed, it was 257MB, ZIP-compressed: 9MB.

Compression gives you a 6-8x bump out of the gate. When moving or crunching data of a certain heft, compress everything, always: it will save you time and money.

That said, because compression can render data difficult to introspect, I don’t recommend compressing TBs of data into a single tarball, but rather splitting it up, as I discuss next.

3. Split Up Your Data

“Monolithic” is a bad word in software development.

It’s also, in my experience, a bad word when it comes to data.

The real world is partitioned – whether as zip codes, states, hours, or top-level web domains – and your data should be too. Respect the grain of your data, because eventually you’ll need to use it to shard your database or distribute it across your file system.

Even more, it’s this splitting up of data that enables the parallel execution in Hadoop and commercial data platforms (such as Greenplum, Aster, and Netezza).

Splitting is part of a larger design pattern succinctly identified in a paper by Hadley Wickham as:  split, apply, combine .

This is, in my mind, a more lucid formulation of “map, reduce” to include key selection (“split”) as a distinct step before any map/apply.

4. Sample Your Data

Let’s say hypothetically you’ve got 200 GBs of data from your portmanteau of a start-up, FaceLink. Someone wants to know if more people visit on Mondays or Fridays, what do you do?

Before you wonder “if only I had 64 GB of RAM on my MacBook Pro”, or fire up a Hadoop streaming job, try this: look at a 10k sample of data.

It’s easy to visually inspect, or pull into R and plot.

Sampling allows you to quickly iterate your approach, and work around edge cases (say, pesky unescaped line terminators), before running a many-hour job on the full monty.

That said, sampling can bite you if you’re not careful: when data is skewed, which it always is, it can be hard to estimate joint-distributions – comparing the means of California vs Alaska, for example, if your sample is dominated by Californians (an issue that statistics, that sexy skill, can address).

5. Smart Borrows, But Genius Uses Open Source

Before you create something new out of whole cloth, pause and consider that someone else may have already seen it, solved it, and open-sourced it.

A Google Code Search may find turn up a regular expression for that obscure data format.

The open source community allows you, if not to stand on the shoulders of giants, to at least rely on the gruntwork of fellow geeks.

6. Keep Your Head in the Cloud

This past week, an engineer friend was just thinking about buying a dream desktop: a high RAM, multi-core box to run machine learning code over TBs of data.

I told him it was a terrible idea.

Why? Because the data he wants to work on isn’t local, it’s on an Amazon EC2 cluster. It’d take hours to download those TBs over a cable connection.

If you want to compute locally, pull down a sample. But if your data is in the cloud, that’s where your tools and code should be.

7. Don’t Be Clever

I once heard Brewster Kahle discuss managing the Internet Archive’s many-petabyte data platform: “everytime one of our engineers comes to me with a new, ingenious and clever idea for managing our data, I have a response: ‘You’re fired.’”

Hyperbole aside, his point is well-taken: cleverness doesn’t scale.

When dealing with big data, embrace standards and use commonly available tools. Most of all, keep it simple, because simplicity scales.

I know of a firm that, several years ago, decided to fork one part of Hadoop because they had a more clever approach. Today, they are several versions behind the latest release, and devoting time & energy to back-porting changes.

Cleverness rarely pays off. Focus your precious programmer-hours on the problems that are unsolved, not simply unoptimized.