what to feed the mythical machine learning beast?

One of the holy grails of machine learning is the creation of a system that can “read the web” and learn from it, as Isaac Newton read Euclid’s Elements and taught himself geometry.

Imagine a mythical beast that could speed-read one-hundred million pages per second, consuming every Wikipedia entry, every scientific article on arxiv.org, every out-of-copyright scanned book, and beyond just indexing that information, could actually reason with it.

Building an intelligent machine isn’t the hard part.  It’s building a learning machine, one that mirrors the magic by which a teenager learns to drive a car, play chess, or do calculus in a period of a few dozen hours – that’s the magic we haven’t yet figured out.

But I wonder if some of our challenges in creating this mythical learning machine lie with what we’re trying to feed the beast.  After all, the web of documents was written for human consumption.  Natural language is a lossy compression algorithm; it maps the massive varieties of our experiences into semantic text.  A high-frequency sensory stream of sights, sounds, and experiences gets hashed into “cold sidewalks are slippery.”

To that end, if we want machines to reason about our world, let’s stop giving them our digested cud of content.  Let’s provide them direct experience, through the sensor streams that our instrumented planet is emitting: weather stations, transit networks, electrical grids, smartphones, Fitbits, and GPS devices. With that data, machines might begin to intuit relationships between weather and sidewalk slips – in forms that our own human minds cannot comprehend.

It’s data, not documents, that the mythical machine learning beast will eat.

beyond hadoop: fast queries from big data


There’s an unspoken truth lurking behind the scourge of Big Data and the heralding of Hadoop as its savior:

While Hadoop shines as a processing platform, it is awkward as a query tool.

Hive was developed by the folks at Facebook in 2008, as a means of providing an easy-to-use, SQL-like query language that would compile to MapReduce code. A year later, Hive was responsible for 95% of the Hadoop jobs run on Facebook’s servers. This is consistent with another observation made by Cloudera’s Jeff Hammerbacher: when Hive is installed on a client’s Hadoop cluster, its overall usage increases tenfold.
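
To make that contrast concrete, here is a minimal sketch – in plain Python, with a hypothetical log format and field names, not Facebook’s actual code – of the sort of aggregation that Hive lets you express as a single SQL-like statement, but which otherwise requires hand-written map and reduce steps:

```python
# A toy illustration of the map/reduce steps Hive generates for you.
# In HiveQL, the whole job collapses to one declarative statement:
#   SELECT dt, COUNT(*) AS events FROM logs GROUP BY dt;

from collections import defaultdict

def mapper(line):
    """Emit a (date, 1) pair for each raw log line (tab-separated, date in column 0)."""
    fields = line.rstrip("\n").split("\t")
    yield fields[0], 1

def reducer(key, values):
    """Sum the per-line counts for a single date."""
    return key, sum(values)

def run_job(lines):
    """Simulate the shuffle/sort phase Hadoop performs between map and reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

if __name__ == "__main__":
    sample = ["2011-10-01\t/index.html", "2011-10-01\t/about.html", "2011-10-02\t/index.html"]
    print(run_job(sample))  # {'2011-10-01': 2, '2011-10-02': 1}
```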

That data-heavy businesses can achieve visibility into the terabytes of logs they generate is, at a basic level, a major step forward. Before the Hadoop era, this was difficult, if not impossible, without a major engineering investment. Hadoop has thus solved the challenge of economically processing data at scale, and Hive has removed the need to hand-write Hadoop jobs.

But there remains a painful challenge that Hive and Hadoop do not solve: speed.

A Powerful But Lumbering Elephant

Hadoop does not respond anywhere close to “human time”, a term that describes response thresholds acceptable to a human user, typically on the order of seconds. Larry Ellison and his marketing mavens invoke a similar theme when pitching their wares as “analytics at the speed of thought.”

To be fair, this sluggishness is not the fault of Hive or Hadoop per se. If a business user asks a question about a year’s worth of data with Hive, a set of MapReduce jobs will dutifully scan and process, in parallel, terabytes of data to obtain the answer. Neither the commodity hardware that most Hadoop clusters run on nor Hadoop’s I/O indulgences during job execution are to blame; these are the low-order performance bits.

And while Hadoop jobs do have a fairly constant overhead – with a lower bound in the range of 15 seconds – this is often considered trivial within the context of the minutes or hours that most full jobs are expected to take.

The higher-order bits affecting query performance are: (i) the size of the data being scanned, (ii) the nature of storage, e.g. whether it is kept on disk or in memory, and (iii) the degree of parallelization.
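
To see why these are the higher-order bits, consider a rough back-of-envelope model of query latency; the throughput figures below are illustrative assumptions, not benchmarks:

```python
# Rough model: latency is dominated by how much data is scanned, how fast the
# storage medium can feed it, and how many nodes scan in parallel.
# Fixed per-job overhead (e.g. Hadoop's ~15-second floor) is ignored here.

def scan_seconds(data_gb, gb_per_sec_per_node, nodes):
    """Time to scan the data, assuming a perfectly parallel scan."""
    return data_gb / (gb_per_sec_per_node * nodes)

# Scanning 1 TB from disk (~0.1 GB/s per node) across 20 nodes:
print(scan_seconds(1000, 0.1, 20))  # 500.0 seconds -- minutes, not "human time"

# Distill to 10 GB, hold it in memory (~5 GB/s per node), same 20 nodes:
print(scan_seconds(10, 5.0, 20))    # 0.1 seconds
```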

An Emerging Design Pattern: Distill, then Store

As a result, a common design pattern is emerging among data-heavy firms: Hadoop is used as a pre-processing tool to generate summarized data cubes, which are then loaded into an in-memory, parallelized database – be it Oracle Exalytics, Netezza, Greenplum, or even Microsoft SQL Server. Occasionally, a traditional database query layer can be bypassed altogether, and summary data cubes can be loaded directly into a desktop analytics tool such as Qlikview, Spotfire, or Tableau.

At my start-up Metamarkets, we have embraced this design pattern and the role that Hadoop plays in preparing data for fast queries. Our particular bag of tricks is best described by the three principles of Druid:

  • Distill: We roll data up to the coarsest grain at which a user might have a reasonable interest. Put simply, it is rare that one is concerned with individual events at one-second time frames. Rolling up to groups of events, with a select set of dimensions and at minutely or hourly granularity, can distill raw data’s footprint down to 1/100th of its original size (a minimal sketch of this roll-up appears after this list).
  • Distribute: Just as this summarized data is spread across multiple nodes in our cluster, the queries against it are also distributed and parallelized. In our quest to break under the “human time” threshold, we have increased this parallelization to as many as 1000 cores, allowing each query to hit a large percentage of the nodes in our cluster. In our experience, CPUs are rarely the bottleneck for systems serving human clients, even for a cluster serving hundreds of users concurrently.
  • Keep in Memory: We share Curt Monash’s sentiment that traditional databases will eventually end up in RAM, as memory costs continue to fall. In-memory analytics are popular because they are fast, often 100x to 1000x faster than disk. This dramatic performance kick is what makes Qlikview such a popular desktop tool.
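
As a rough illustration of the distill step – only a sketch, with hypothetical field names rather than our actual schema – rolling per-second events up to minutely buckets over a select pair of dimensions looks something like this:

```python
# Toy roll-up: collapse individual events into (time bucket, dimensions) rows
# with summed metrics. Field names (publisher, country, revenue) are hypothetical.

from collections import defaultdict

def roll_up(events, dimensions=("publisher", "country"), granularity_secs=60):
    """Aggregate raw events to the given time granularity and dimension set."""
    cube = defaultdict(lambda: {"impressions": 0, "revenue": 0.0})
    for event in events:
        bucket = event["timestamp"] - (event["timestamp"] % granularity_secs)
        key = (bucket,) + tuple(event[d] for d in dimensions)
        cube[key]["impressions"] += 1
        cube[key]["revenue"] += event["revenue"]
    return dict(cube)

raw = [
    {"timestamp": 1317430801, "publisher": "acme.com", "country": "US", "revenue": 0.02},
    {"timestamp": 1317430815, "publisher": "acme.com", "country": "US", "revenue": 0.03},
    {"timestamp": 1317430862, "publisher": "acme.com", "country": "US", "revenue": 0.01},
]
print(roll_up(raw))  # only two rows survive: one per minutely bucket
```

The same principle, applied at scale with a select set of dimensions and hourly granularity, is what distills a raw footprint down to 1/100th of its original size.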

The end result of these three techniques, each of which independently delivers a 10- to 1000-fold improvement, is a platform that can run in seconds what previously took minutes or even hours in Hive.

This approach, which we know we are not alone in pursuing, matches or exceeds the performance of any of the big-box retailers at a considerably lower price point.

The commoditization wave that Hadoop initiated in massive data processing is migrating upwards towards query architectures. Thus the competitive differentiators are shifting away from large-scale data management and towards what might be called Big Analytics, where the next battle for profits will be fought.

(reblogged from a version I wrote at the Metamarkets blog).

how Oracle, the Goliath of data, could stumble

This week’s Oracle OpenWorld was bracketed by two events. First: the unveiling of Oracle Exalytics, a beefy in-memory appliance dedicated to large-scale analytics, during Larry Ellison’s opening keynote. Second: the undressing of Oracle’s cloud computing initiatives by Marc Benioff, Salesforce’s CEO, and the unceremonious cancellation of his keynote on Wednesday morning.

Both events highlight that when it comes to Big Data, analytics and cloud computing, Oracle is on the wrong side of history.

Startups don’t use Oracle

To glimpse the future of the data stack, Oracle need look no further than its own backyard, to what Silicon Valley start-ups are embracing: the distributed processing ecosystem of Hadoop, NoSQL data stores like MongoDB, and cloud platforms like Amazon’s web services.  As Marc Andreessen said last week, “Not a single one of our startups uses Oracle.”

The truth is, Oracle can’t support the kind of technology stacks embraced by startups — open-source software, elastic architectures, commodity hardware grids — because doing so would cannibalize revenue from its existing lines of business.

“I don’t care if our commodity X86 business goes to zero,” Ellison said in Oracle’s last earnings call, “We don’t make money selling that.”

This commoditization wave has sent others, including HP, fleeing from hardware, but it has driven Oracle into the breach as a big-box retailer: it is attempting to capture higher margins on sales of its SPARC architectures.

But history is not on Oracle’s side.  Here are four realities that Oracle must face to maintain its position as the world’s leading data firm:

#1: The future of data is distributed

“Lots of little servers everywhere, lots of little databases everywhere. Your information got hopelessly fragmented in the process.” – from Matthew Symonds’ book Softwar (p. 38).

This is how Larry Ellison described the technology landscape of the 1990s, and his personal jihad against complexity has deepened Oracle’s distrust of distributed computing.

But the tide of data isn’t turning back, and the scale is too large to contain in any box; Big Data, on the scale of hundreds of terabytes to petabytes, must be distributed across “lots of little servers.” The most viable tool available today for processing and persisting Big Data is Hadoop.

Whether at the data layer — or a level above, at analytics — firms must adapt to this distributed reality and build tools that enable parallelized, many-to-many migration of data between nodes on Hadoop and those on their own platforms.

#2: The future of computing is elastic

Metal server boxes don’t bend or expand; they are inelastic, both physically and economically.  In contrast, the needs of businesses are highly elastic; as companies grow, they shouldn’t have to unpack and install boxes to meet their compute needs, any more than they should install generators for more electricity.

Computing is a utility, compute cycles are fungible, and firms want to pay for what they need, when it’s needed, like electricity.

The ability to scale storage and compute capacity up or down, within minutes, is liberating for individuals and cost-effective for organizations, but it is impossible with a “cloud in a box.”  It is only enabled by a true cloud computing infrastructure, with virtualization and dynamic provisioning from a common pool of resources.

#3: The future of applications isn’t the desktop

Despite Oracle having developed the first pure network computer in 1996 (or perhaps because of this), far too many of Oracle’s supporting business applications are delivered via the desktop, rather than via web browsers.

By comparison, Cloudera has created a rich web-based application for managing and monitoring all aspects of Hadoop clusters; Amazon Web Services has a fully-featured web console for interacting with its offerings; and Salesforce’s products are almost exclusively web-driven.

The expressivity afforded by web browsers has risen dramatically in the last two years, particularly with the emergence of Javascript as the lingua franca of web application development, and improvements in Javascript engines.

The same trend from desktop to browser also extends to mobile devices.  An increasingly large fraction of computing occurs on smartphones and tablets, and forward-thinking firms, like Dropbox, have built applications that cater to this reality.

#4: The future of analytics is visual

The decades of disappointment with business intelligence tools aren’t due only to their lack of brains (such that they’ve now fled to the fresh moniker of “business analytics”), but also to their absence of beauty. Data is beautiful, as any reader of Edward Tufte can attest.

When visualized thoughtfully and artfully, data has an almost hymnal power to persuade decision makers.  And when exploring data of high complexity and dimensionality, the kind that lives in Oracle’s databases, tools that accelerate the “mean time to pretty chart” are essential.

In addition, analytics tool users are right to expect a smooth user experience on a par with other tools, whether photo editing or word processing, when they are creating and exploring data visualizations.

Yet amidst all of Oracle’s presentations and marketing materials about big data and analytics, one finds not a single dashboard or visualization to stir the senses.

While Spotfire and Tableau are notable exceptions to this critique, on the whole, the tools that dot the Oracle landscape lack either brains or beauty.

Enterprises will be slow to wake up to these realities, and Oracle will continue to profit handsomely from their slumber.

Fin: Oracle is ripe for attack by data services

Opportunities abound to chip away at the massive market share that Oracle now holds: providing data services to start-ups that won’t buy Oracle’s capital-intensive boxes, and helping medium-sized businesses migrate to flexible, cost-effective, cloud-based alternatives.

(An earlier version of this post was published as a guest column at GigaOm.)

the secret guild of silicon valley

The Governors of the Guild of St. Luke, Jan de Bray

A couple of weeks ago, I was drinking beer in San Francisco with friends when someone quipped:

“You have too many hipsters, you won’t scale like that. Hire some fat guys who know C++.” 

It’s funny, but it got me thinking.  Who are the “fat guys who know C++”, or as someone else put it, “the guys with neckbeards, who keep Google’s servers running”? And why is it that if you encounter one, it’s like pulling on a thread, and they all seem to know each other?

The reason is that the top engineers in Silicon Valley, whether they realize it or not, are part of a secret Guild.  They are a confraternity of craftsmen who share a set of traits:

  • Their craft is creating software
  • Their tools of choice are C, C++, and Java – not Javascript or PHP
  • They wear ironic t-shirts, and that is the outer limit of their fashion sense
  • They’re not hipsters who live in the Mission or even in the city; they live near a CalTrain stop, somewhere on the Peninsula
  • They meet for Game Night on Thursdays to play Settlers of Catan
  • They are passive, logical, and Spock-like

They aren’t interested in tweeting, blogging, or giving talks at conferences.  They care about building and shipping code.  They’re more likely to be found in IRC chat rooms, filing JIRAs for Apache projects, or spinning out Github repos in their spare time.

They are part of a nomadic band of software tradesmen, who have mentored one another over the last four decades in Silicon Valley, and they have quietly, steadily built the infrastructure behind the world’s most successful companies.  When they leave – as they have left places like Netscape, Sun, and Yahoo – the firms they leave behind wither and die.

If you want to build a technology company, you’ll need to hire them, but you’ll never find a member of the Guild through a recruiter.  They are being cold-called, cold-emailed, and cold-LinkedIn-messaged on a daily basis by recruiters, but their response will be similarly cold.

A true member of the Guild is only ever an IM away from a new job at Facebook, Google, or the long archipelago of start-ups their fellow members are busy building.  Outwardly successful companies that fail to draw engineers from the Guild will struggle with the performance and stability of their technology – as LinkedIn did in its early days and as Twitter did until recently.

It’s rare for an entrepreneur or executive to earn membership in the Guild, for that requires a path of apprenticeship that few have the talent or stamina for.  But it’s possible to earn the respect of the Guild, and to convince its members that your company is a hall where they can gather daily to mentor and develop their craft.

It begins with having an engineering-led culture, where technology decisions are made on their technical merits, never on personal grounds.  It also means allowing craftsmen to solve problems by creating new tools, rather than through a labored application of the old.  These are values that Google and Facebook, two veritable Guild halls of the Valley, tout to any engineer who asks.

Finally, the implicit compact that the Guild makes with a company is that their efforts will not be in vain.  The most powerfully attractive force for the Guild is the promise of building a product that will get into the happy hands of hundreds, thousands, or millions.  This is the coveted currency that even companies that have struggled to build an engineering reputation, like foursquare, can offer. 

The Guild of Silicon Valley is largely invisible, but its members’ affiliations have determined the rise and fall of technology giants.  The start-ups that recognize the unsung talents of its members today will be tomorrow’s success stories.

[ Addendum:  George E.P. Box said “All models are wrong.  Some models are useful.”  While my tongue-in-cheek model of the anti-hipster Guild of Engineers has angered those who interpret it literally, my rhetorical goal is to make a point:  that the hard work of engineering isn’t glamorous, and is often invisible to the media or the reigning pop culture of start-ups you’ll find in San Francisco.  If you want to build a successful technology company, you would do well to target the experienced  folks who have been honing their craft in the trenches of Silicon Valley for the last few decades, and those whom they’ve mentored. ]

node.js and the javascript age

Three months ago, we decided to tear down the framework we were using for our dashboard, Python’s Django, and rebuild it entirely in server-side Javascript, using node.js. (If there is ever a time in a start-up’s life to remodel parts of your infrastructure, it’s early on, when your range of motion is highest.)

This decision was driven by a realization: the LAMP stack is dead. In the two decades since its birth, there have been fundamental shifts in the web’s make-up of content, protocols, servers, and clients. Together, these mark three ages of the web.

Read Full Post at Metamarkets