I was recently listening to Mike Maples interview Andy Rachleff about the search for product-market fit, which is fitting, since Andy coined the term. Andy recounted a pearl of wisdom that Scott Cook had once given him: when doing customer research, savor the surprises. A few years into founding Intuit, Scott discovered that while Quicken was created for personal finance, half its users were businesses. Why? Because most small businesses lacked formal accounting expertise, they preferred the simpler software.
Savoring surprises is a simple but powerful framing, because it forces reflection. It’s a great question for a job interview (“When you first got to Google, what surprised you?”) or at a cocktail hour (“What surprised you most about Tokyo?”). Who would want to hire or hang out with someone who answers “Nothing”?
Surprises are the bits of data we don’t expect. It is cognitively taxing to retain these bits, rather than burying them to confirm what we think we already know. Savoring surprise is at the heart of the beginner’s mindset. And it is the essence of learning and discovery.
In September 1928, a scientist returned from a two-week vacation and found a mold had contaminated his bacterial culture, and unexpectedly, killed the bacteria around it. Alexander Fleming savored this surprise, rather than ignoring it, and it ultimately led to his discovery of penicillin. As he later put it, “One sometimes finds what one is not looking for.”
[A]ll the great business disrupters of the past decade—Amazon, Google, Microsoft, Apple, Tesla, Uber, Airbnb, Netflix—they are all running Systems of Observation against the data flows they are privileged to access or host, and then feeding them into Systems of Intelligence to extract insights from them.
Thinking is not something that goes on entirely, or even mostly, inside people’s heads. Little intellectual work is accomplished with our eyes and ears closed. Most cognition is done as a kind of interaction with cognitive tools, pencils and paper, calculators, and, increasingly, computer-based intellectual supports and information systems. Neither is cognition mostly accomplished alone with a computer. It occurs as a process in systems containing many people and many cognitive tools. Since the beginning of science, diagrams, mathematical notations, and writing have been essential tools of the scientist. Now we have powerful interactive analytic tools, such as MATLAB, Maple, Mathematica, and S-PLUS, together with databases. The entire fields of genomics and proteomics are built on computer storage and analytic tools.
Colin Ware. Information Visualization: Perception for Design.
ETL is hard because data is messy. Even the most common attribute of data, time, has thousands of accepted dialects: “Sat Mar 1 10:12:53 PST,” “2014-03-01 18:12:53 +00:00” and “1393697573” are all equivalent. And there’s a growing chorus of other sources with even less consistency: geo-coordinates, user agent strings, country codes, and currencies. Each new data type is a layer of bricks in our collective, digital tower of Babel.
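To make the mess concrete, here is a minimal Python sketch that normalizes those dialects of the same instant to epoch seconds. It is a toy under stated assumptions: the `parse_any` name and the hand-rolled `TZ` table are mine (a real pipeline would use a proper tz database), and I've appended the year to the ctime-style form, which omits it.

```python
from datetime import datetime, timezone, timedelta

# Toy timezone table: real code should resolve zone names via a tz database,
# since abbreviations like "PST" are ambiguous across regions.
TZ = {"PST": timezone(timedelta(hours=-8)), "UTC": timezone.utc}

def parse_any(raw: str) -> int:
    """Best-effort parse of a timestamp string into Unix epoch seconds."""
    raw = raw.strip()
    if raw.isdigit():                     # already epoch seconds, e.g. "1393697573"
        return int(raw)
    try:                                  # ISO-ish: "2014-03-01 18:12:53 +00:00"
        return int(datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z").timestamp())
    except ValueError:
        pass
    # ctime-ish with a named zone and year: "Sat Mar 1 10:12:53 PST 2014"
    *rest, zone, year = raw.split()
    naive = datetime.strptime(" ".join(rest) + " " + year, "%a %b %d %H:%M:%S %Y")
    return int(naive.replace(tzinfo=TZ[zone]).timestamp())
```

All three forms come out as the same integer, which is exactly the property an ETL layer has to guarantee before anything downstream can join on time.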
It’s no wonder that a CIO recently confessed to me that he’d spent tens of millions of dollars a year on the reliable, repeatable transformation of data – and that some of Silicon Valley’s smartest minds (Joe Hellerstein and Jeff Heer’s Trifacta, Wes McKinney’s Datapad) are tackling this challenge.
Having spent much of my career wrestling ETL’s demons, I offer five secrets for keeping cool inside data’s infernos:
1. Stay Close To The Source
Journalists know that to get the truth, you go to primary sources. The same is true for ETL. The closer you are to the origin of the data, the fewer dependencies you will have on filtered or intermediate versions, and the less chance that something will break.
Beyond fidelity, closeness to data sources brings speed: tapping into a raw source means data feeds can be processed and acted upon in minutes, not hours or days.
The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands.
2. Avoid Processed Data
Like food, data is best when it’s fresh, you know the source, and it’s minimally processed. This last piece is key: in order to crunch huge quantities of data, one common approach is sampling. Twitter provides a spritzer stream that is a <1% sample of all tweets; in my world of programmatic advertising, many marketplaces provide a 1% feed of bid requests to buyers.
Sampling can be great for rapid prototyping (and Knuth’s reservoir sampling algorithm is both beautiful and useful), but in my real-world experience, I’ve rarely found a sampling approach that didn’t backfire on me at some point. Critical population metrics – like maxes, mins, and frequency counts – become impossible to recover once a data stream has been put through the shredder.
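For the curious, Knuth’s reservoir sampling (Algorithm R) fits in a few lines of Python; elegant as it is, the caveat above applies to whatever falls out of it. The function name is mine.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:                   # item i replaces a slot with probability k/(i+1)
                reservoir[j] = item
    return reservoir
```

Every element of the stream ends up in the sample with equal probability, in one pass and O(k) memory. But the stream’s true max, min, and frequency counts are gone once only the sample remains.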
In an era when bandwidth is cheap and computing resources are vast, you may choose to summarize or sketch your data – but don’t sample it.
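A summary, unlike a sample, preserves exactly the population metrics the shredder destroys. A minimal sketch (the class and field names are my own illustration) of an exact, constant-memory streaming summary:

```python
from dataclasses import dataclass
import math

@dataclass
class StreamSummary:
    """Exact count/min/max/sum in O(1) memory: the metrics a sample cannot recover."""
    count: int = 0
    total: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, x: float) -> None:
        self.count += 1
        self.total += x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
```

For richer questions (distinct counts, quantiles, heavy hitters), the same one-pass pattern extends to probabilistic sketches like HyperLogLog and Count-Min, which trade a bounded error for fixed memory rather than throwing records away.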
3. Embrace (And Enforce) Standards
In the early days of the railroads, as many as a dozen distinct track gauges, with inside-rail widths ranging from 2 feet to nearly 10 feet, had proliferated across North America, Europe, Africa and Asia. Because incompatible gauges made trains and carriages non-interoperable and blocked continuous transport across regions, a standard width was eventually adopted at the suggestion of a British civil engineer named George Stephenson. Today, approximately 60% of the world’s lines use this gauge.
Just as with the railroads two centuries before, systems that embrace and enforce standards will succeed, and those that invent their own proprietary binary formats will suffer.
4. Let Business Questions Drive Data Collection
Too many organizations, upon recognizing that they’ve got data challenges, decide to undertake a grand data-unification project. Noble in intention, cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform (today, it’s usually a Hadoop cluster). The implicit assumption is that “once we have all the data, we can answer any question we’d like.” It’s an approach that’s doomed to fail. There’s always more data available than can be collected, so the choice of what to collect and crunch can only be made in the context of business questions.
Laying down ETL pipe is wrist-straining work, so avoid building pipelines and drilling data wells where no business inquiry will ever visit.
5. Less Data Extraction, More API Action
Sometimes working with the nuts and bolts of data is a necessity, but for a growing class of problems it’s possible to get out of the data handling business entirely. Take contact lists, email, and digital documents: for years, IT departments suffered through migrations of these assets from silo to silo. Today, cloud applications like Salesforce, Gmail, and Box make this somebody else’s problem.
A maturing ecosystem of SaaS applications exposes APIs for acting on – without extracting – cloud-managed data. These interfaces will allow developers and organizations to focus less on handling data and more on the activities of their core businesses.
(An earlier version of this essay appeared on February 27, 2014 in AdExchanger).
In the last year, the data scientist has been called “the sexiest job of the 21st century.” But if data is the new oil, and data scientists are its petrochemical high priests, who are the oil riggers? Who are the roughnecks doing the dirty work to get data pipelines flowing, unpacking bytes, transforming formats, loading databases?
They are the data engineers, and their brawny skills are more critical than ever. As the era of Big Data pivots from research to development, from theoretical blueprints to concrete infrastructure, the notional demand for data science is being dwarfed by the true need for data engineering.
A stark but recurring reality in the business world is this: when it comes to working with data, statistics and mathematics are rarely the rate-limiting elements in moving the needle of value. Most firms’ unwashed masses of data sit far lower on Maslow’s hierarchy at the level of basic nurture and shelter. What is needed for this data isn’t philosophy, religion, or science – what’s needed is basic, scalable infrastructure.
It’s the data engineers who can build this infrastructure, and they represent the true talent shortage of Silicon Valley and beyond. Their unsexy but critical skills include crafting Hadoop pipelines, programming job schedulers, and parsing broad classes of data – timestamps, currencies, lat & long coordinates – which are the screws, bolts, and ball bearings in the industrial age of data.
Let us now praise these unsung heroes, the data engineers, who are building the invisible but essential digital underground.
Consumer startups like Facebook, Twitter, Pinterest, and even Dropbox are built by founders who wanted to “make something cool” for their own benefit. Their teams intuitively understand what works because they are their own target audience: young, tech-savvy people looking for better ways to connect, share, and organize their digital stuff.
When it comes to buyer psychology, corporations are not people
By contrast, the challenge for enterprise startups is that corporations are not really people (their legal personhood aside) — and certainly not our people.
When you’re hungry for lunch, you go and buy a sandwich for a few dollars. When an enterprise is hungry for lunch, it solicits bids from multiple catering companies, negotiates for weeks to months, and signs a contract for a few million dollars.
This gap between the psychology of enterprises and the startups that sell to them is a challenge that consumer startups do not face. Worse, early team members in startups have limited enterprise experience; they are a poor fit for the process-orientation and risk-aversion (or to put it more kindly, risk-balancing) that is rewarded at the higher levels of corporate environments.
Less Goldilocks, more Dunder-Mifflin
Lacking this enterprise DNA, younger startups often build their sales processes in the image of how startups buy rather than how enterprises buy. When startups seek to purchase a software solution, they favor simple, scalable pricing: click a box, swipe a credit card, and start running. Hence the canonical three-column SaaS pricing page (call it Goldilocks pricing) that you see at many SaaS companies—where the middle column invariably feels “just right.”
But large enterprise buyers are less adventure-embracing Goldilocks, and more The Office’s Dunder Mifflin. They require more than three sizes of self-serve, they don’t do click-through contracts, and they rarely pay with credit cards. The reasons are both economic and cultural. Economically, as buying decisions grow larger, the costs of sales — product customization, negotiated contracts, and invoicing — become marginal relative to deal size. Culturally, Fortune 500 companies expect to have a relationship.
As Box CEO Aaron Levie recently told me, “Look, when Coca Cola writes you a big check, they want to meet you in person.”
Silicon Valley IT is not enterprise IT
Startups also often underestimate the importance of professional services and training for enterprises. They believe every company has a cadre of engineers smart enough to set up and tailor an application, and business users who can quickly figure it out — whether it be Google Analytics, HubSpot, or Expensify — and get up and running.
But this is not the case in most enterprises. The success of firms like Red Hat, MySQL AB, and more recently, Cloudera, testifies to the enormous value that lies in integration and support, even when the underlying software – whether Linux, MySQL, or Hadoop – is free and open-source.
Seasoned sales executives: The “growth hackers” of enterprise startups
As the venture investing pendulum swings back towards enterprise technology companies, founders and venture capitalists will need to augment their teams with sales executives who can nimbly step around the often woolly, sometimes mammoth challenges of contract negotiations, channel partnerships, and client services engagements. These experienced leaders will be the “growth hackers” of the enterprise realm.
“On a scale of 1-10 of impatience, the best entrepreneurs are an 11.” – Tom Stemberg, Founder of Staples
Curiosity and impatience make for great founder traits, but they often pull in different directions.
Curiosity compels you to sit and study a problem, to voraciously consume every article and reference you can find to wrap your head around a big idea or an imagined future (self-driving cars, space elevators, or self-destructing sexts).
Impatience gets you up out of your chair to do something about it: hire, fundraise, sell, and evangelize.
Curiosity is for academics, impatience for executives, but start-up founders need to be both dreamers and doers, straddling the world of ideas and realities.
Robert Oppenheimer, the American Prometheus behind the first atomic bomb, was a dreamer – but he was also impatient. His colleague Murray Gell-Mann said he lacked the ability to sit still:
“Germans call it ‘Sitzfleisch’, ‘sitting flesh’ when you sit on a chair. As far as I know, he never wrote a long paper or did a long calculation, anything of that kind. He didn’t have the patience for that… [But] he inspired other people to do things, and his influence was fantastic.”
Impatience is the very opposite of Sitzfleisch, and without it, the Manhattan Project would have yielded nothing more than chalk dust.
Curiosity is what drew Steve Jobs to sit in on calligraphy classes at Reed; inspired Larry Ellison to study chip design at U. Chicago; compelled Bill Gates to cram for economics courses at Harvard; lured Larry and Sergey to pursue computer science Ph.D.s at Stanford.
Impatience is what drove them all to drop out and start Apple, Oracle, Microsoft, and Google.
Silicon Valley’s cult of the drop-out pays homage to impatience – who has time for school when you’re building a billion-dollar business? – but gives short shrift to curiosity, which is the heart of innovation.
Nothing fires a healthy impatience more than the desire to see a big idea, born of deep curiosity, brought to life. As Steve Jobs said, “remembering that you are going to die” is a great motivator.
This is a phrase that has stuck with me since Tim O’Reilly uttered some form of it two years ago. Tim was talking about online cartography, saying it’s not the maps that matter: it’s getting to our destination. Maps are a half-step short of that goal. And in a world of navigational algorithms and self-driving cars, maps become less useful as tools.
Likewise, data visualization is a halfway house: a stopping place on the path from data to decision.
The explosion in interest in data visualization over the last couple of years — witness the popularity of blogs like FlowingData, DataVisualization.ch; companies like Tableau, Visual.ly, and Chart.io; and the maturation of Feltron-like infographics as mass media — is a powerful and important trend. We are long overdue to make the leap into a post-spreadsheet era, and human brains are far better equipped to process pictures than mind-numbing columns of figures.
But data visualizations still require human analysts to react and kick off another action, if they are to be useful.
Worse, too much data visualization can prompt decision fatigue. An interactive visualization of my weight, BMI, and body fat index is nice — but I’ve never logged into my scale’s online dashboard. A “hey, stop eating so much” text alert, or a vibrating wrist-band to get me moving, is better. The best user interfaces don’t make us think, they help us act.
As the planet becomes more fully instrumented and online, from cars to cash registers to coffee pots, we find ourselves swimming in a rising tide of digital data. We can and should seek refuge in the harbor of data visualization, where analysts surface and explore insights with choropleths, tree maps, Sankey diagrams, and other species of story-telling shapes.
But the real revolution is at work in the digital underground, with decision chains of algorithms exchanging data, silently singing each to each, surfacing only occasionally with actions: an equity trade, a digital ad, a left turn, a dimmed street light. These digital undergrounds, teeming with artificial life, are found on Wall Street, in online media, within warehouses, and across electrical grids.
These algorithms don’t require data visualizations; they consume those mind-numbing columns of figures in milliseconds. This is the realm of mathematics and statistics, machine learning and signal processing, and the hackers of these algorithms are the econometricians, neuroscientists, and applied physicists called data scientists. If visualization is the light side of data science, machine learning is its dark side: black box models whose mechanisms aren’t easily visualized or interpretable, except that they work. Renaissance Technologies hasn’t conquered Wall St. with pretty pictures; they’ve done it with better trades.
Likewise, the winners in the Big Data era will focus less on bar charts and more on actions: helping businesses set prices, cities move citizens, and people be healthier.