Design Principles for Data Pipelines

(Image: ‘Tower of Babel’ by Pieter Bruegel the Elder, 1563)

Underinvestment in and misunderstanding of ETL is a silent killer in organizations.  It’s why reports are often delayed, why answers across systems rarely agree, and why more than 50% of corporate business intelligence initiatives fail.

ETL is hard because data is messy.  Even the most common attribute of data, time, has thousands of accepted dialects: “Sat Mar 1 10:12:53 PST,” “2014-03-01 18:12:53 +00:00” and “1393697578” are all equivalent.  And there’s a growing chorus of other sources with even less consistency:  geo-coordinates, user agent strings, country codes, and currencies. Each new data type is a layer of bricks in our collective, digital tower of Babel.
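
To make this concrete, here's a minimal Python sketch that normalizes those three dialects into UTC. The format strings, the fixed PST offset, and the assumed year are simplifying assumptions for illustration, not a production-grade parser:

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # assumes PST = UTC-8, ignoring daylight saving

def to_utc(raw: str) -> datetime:
    """Best-effort normalization of a few common timestamp dialects to UTC."""
    if raw.isdigit():                               # Unix epoch seconds, e.g. "1393697578"
        return datetime.fromtimestamp(int(raw), tz=timezone.utc)
    try:                                            # ISO-ish, e.g. "2014-03-01 18:12:53 +00:00"
        return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z").astimezone(timezone.utc)
    except ValueError:                              # ctime-ish, e.g. "Sat Mar 1 10:12:53 PST"
        naive = datetime.strptime(raw.replace(" PST", ""), "%a %b %d %H:%M:%S")
        return naive.replace(year=2014, tzinfo=PST).astimezone(timezone.utc)

for raw in ("Sat Mar 1 10:12:53 PST", "2014-03-01 18:12:53 +00:00", "1393697578"):
    print(f"{raw!r:35} -> {to_utc(raw).isoformat()}")
```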

It’s no wonder that a CIO recently confessed to me that he’d spent tens of millions of dollars a year on the reliable, repeatable transformation of data – and that some of Silicon Valley’s smartest minds (Joe Hellerstein and Jeff Heer’s Trifacta, Wes McKinney’s Datapad) are tackling this challenge.

As someone who has spent much of my career wrestling ETL’s demons, I offer five secrets for keeping cool inside data’s infernos:

1. Stay Close To The Source

Journalists know that to get the truth, go to primary sources. The same is true for ETL. The closer you are to the origin of the data, the fewer dependencies you will have on filtered or intermediate versions, and the less chance that something will break.

Beyond fidelity, closeness to data sources conveys speed: tapping into a raw source means data feeds can be processed and acted upon in minutes, not hours or days.

The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands.

2. Avoid Processed Data

Like food, data is best when it’s fresh, you know the source, and it’s minimally processed.  This last piece is key: in order to crunch huge quantities of data, one common approach is sampling.  Twitter provides a spritzer stream that is a <1% sample of all tweets; in my world of programmatic advertising, many marketplaces provide a 1% feed of bid requests to buyers.

Sampling can be great for rapid prototyping (and Knuth’s reservoir sampling algorithm is both beautiful and useful), but in my real-world experience, I’ve rarely found a sampling approach that didn’t backfire on me at some point. Critical population metrics – like maxes, mins, and frequency counts – become impossible to recover once a data stream has been put through the shredder.
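
For the curious, here is a minimal sketch of that reservoir sampling idea (Algorithm R): hold a uniform random sample of k items from a stream whose length you don't know in advance.

```python
import random

def reservoir_sample(stream, k):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # item i survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
```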

In an era when bandwidth is cheap and computing resources are vast, you may choose to summarize or sketch your data – but don’t sample it.
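
To illustrate summarizing rather than sampling: a single constant-space pass can keep exact population metrics – the very ones a sample destroys. This is only a sketch; real pipelines would also add probabilistic sketches (e.g., HyperLogLog for distinct counts):

```python
def summarize(stream):
    """One pass, constant space: exact count, min, max, and mean of a numeric stream."""
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in stream:
        count += 1
        total += x
        lo, hi = min(lo, x), max(hi, x)
    return {"count": count, "min": lo, "max": hi, "mean": total / count if count else None}

print(summarize([3.2, 0.5, 11.8, 7.1]))
```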

3. Embrace (And Enforce) Standards

In the early days of the railroads, as many as a dozen distinct track gauges – with widths between the inside rails ranging from 2 feet to nearly 10 feet – had proliferated across North America, Europe, Africa, and Asia. Because non-interoperable trains and carriages made continuous transport across regions difficult, a standard width was eventually adopted at the suggestion of a British civil engineer named George Stephenson. Today, approximately 60% of the world’s lines use this gauge.

Just as with the railroads two centuries before, systems that embrace and enforce standards will succeed, and those that invent their own proprietary binary formats will suffer.
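
In practice, enforcement can be as simple as rejecting nonconforming records at the pipeline boundary. Here's a minimal sketch that insists on JSON records carrying an ISO 8601 timestamp; the record shape and the "ts" field name are assumptions for illustration:

```python
import json
from datetime import datetime

def validate_record(line: str) -> dict:
    """Accept a record only if it is valid JSON with an ISO 8601 'ts' field."""
    record = json.loads(line)                  # raises ValueError on malformed JSON
    datetime.fromisoformat(record["ts"])       # raises ValueError on a nonstandard timestamp
    return record

print(validate_record('{"ts": "2014-03-01T18:12:53+00:00", "event": "bid"}'))
```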

4. Let Business Questions Drive Data Collection

Too many organizations, upon recognizing that they’ve got data challenges, decide to undertake a grand data-unification project. Noble in intention, cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform (today, it’s usually a Hadoop cluster). The implicit assumption is that “once we have all the data, we can answer any question we’d like.” It’s an approach that’s doomed to fail.  There’s always more data available than can be collected, so the choice of what to crunch can only be made in the context of business questions.

Laying down ETL pipe is wrist-straining work, so avoid building pipelines and drilling data wells where no business inquiry will ever visit.

5. Less Data Extraction, More API Action

Sometimes working with the nuts and bolts of data is a necessity, but for a growing class of problems it’s possible to get out of the data handling business entirely.  Take contact lists, email, and digital documents: for years, IT departments suffered through migrations of these assets from silo to silo. Today, cloud applications like Salesforce, Gmail, and Box make this somebody else’s problem.

A maturing ecosystem of SaaS applications exposes APIs for acting on – without extracting – cloud-managed data.  These interfaces will allow developers and organizations to focus less on handling data and more on the activities of their core businesses.
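
As a sketch of what acting-without-extracting can look like, here is a small example that updates a single record in place over HTTP rather than exporting and reloading a dataset. The endpoint, payload, and authentication scheme are hypothetical – this is not any particular vendor's API:

```python
import json
import urllib.request

def update_contact(base_url: str, contact_id: str, fields: dict, token: str) -> int:
    """PATCH one record in place via a (hypothetical) REST endpoint; returns the HTTP status."""
    req = urllib.request.Request(
        url=f"{base_url}/contacts/{contact_id}",
        data=json.dumps(fields).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="PATCH",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# e.g. update_contact("https://api.example.com/v1", "123", {"title": "CTO"}, token="...")
```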

(An earlier version of this essay appeared on February 27, 2014 in AdExchanger).

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.
