data visualization is a halfway house


(Image credit: A.Koblin for RadioHead)

This is a phrase that has stuck with me since Tim O’Reilly uttered some form of it two years ago.  Tim was talking about online cartography, saying it’s not the maps that matter: it’s getting to our destination.  Maps are a half-step short of that goal.  And in a world of navigational algorithms and self-driving cars, maps become less useful as tools.

Likewise, data visualization is a halfway house: a stopping place on the path from data to decision.

The explosion of interest in data visualization over the last couple of years – witness the popularity of visualization blogs, the rise of companies like Tableau, and the maturation of Feltron-like infographics as mass media – is a powerful and important trend.  We are long overdue to make the leap into a post-spreadsheet era, and human brains are far better equipped to process pictures than mind-numbing columns of figures.

But data visualizations still require human analysts to react and kick off another action, if they are to be useful.

Worse, too much data visualization can prompt decision fatigue.  An interactive visualization of my weight, BMI, and body fat index is nice – but I’ve never logged into my scale’s online dashboard.  A “hey, stop eating so much” text alert, or a vibrating wristband to get me moving, is better.  The best user interfaces don’t make us think; they help us act.

As the planet becomes more fully instrumented and online, from cars to cash registers to coffee pots, we find ourselves swimming in a rising tide of digital data.  We can and should seek refuge in the harbor of data visualization, where analysts surface and explore insights with choropleths, tree maps, Sankey diagrams, and other species of story-telling shapes.

But the real revolution is at work in the digital underground, with decision chains of algorithms exchanging data, silently singing each to each, surfacing only occasionally with actions: an equity trade, a digital ad, a left turn, a dimmed street light.  These digital undergrounds, teeming with artificial life, are found on Wall Street, in online media, within warehouses, and across electrical grids.

These algorithms don’t require data visualizations; they consume those mind-numbing columns of figures in milliseconds.  This is the realm of mathematics and statistics, machine learning and signal processing, and the hackers of these algorithms are the econometricians, neuroscientists, and applied physicists called data scientists.  If visualization is the light side of data science, machine learning is its dark side:  black box models whose mechanisms aren’t easily visualized or interpreted, except that they work.  Renaissance Technologies hasn’t conquered Wall St. with pretty pictures; it has done so with better trades.

Likewise, the winners in the Big Data era will focus less on bar charts and more on  actions: helping businesses set prices, cities move citizens, and people be healthier.

eight golden rules of interface design

As we dedicate an increasing fraction of our time to interacting with software – from airport check-in terminals and parking meters to desktop and mobile applications – digital interface design is becoming as important as physical architecture in improving our experience of the world.

Here are Professor Ben Shneiderman’s Eight Golden Rules for optimally designing that experience (drawn from his classic text, Designing the User Interface):

1 Strive for consistency.
Consistent sequences of actions should be required in similar situations; identical terminology should be used in prompts, menus, and help screens; and consistent commands should be employed throughout.

2 Enable frequent users to use shortcuts.
As the frequency of use increases, so do the user’s desires to reduce the number of interactions and to increase the pace of interaction. Abbreviations, function keys, hidden commands, and macro facilities are very helpful to an expert user.

3 Offer informative feedback.
For every operator action, there should be some system feedback. For frequent and minor actions, the response can be modest, while for infrequent and major actions, the response should be more substantial.

4 Design dialog to yield closure.
Sequences of actions should be organized into groups with a beginning, middle, and end. The informative feedback at the completion of a group of actions gives the operators the satisfaction of accomplishment, a sense of relief, the signal to drop contingency plans and options from their minds, and an indication that the way is clear to prepare for the next group of actions.

5 Offer simple error handling.
As much as possible, design the system so the user cannot make a serious error. If an error is made, the system should be able to detect the error and offer simple, comprehensible mechanisms for handling the error.

6 Permit easy reversal of actions.
This feature relieves anxiety, since the user knows that errors can be undone; it thus encourages exploration of unfamiliar options. The units of reversibility may be a single action, a data entry, or a complete group of actions.

7 Support internal locus of control.
Experienced operators strongly desire the sense that they are in charge of the system and that the system responds to their actions. Design the system to make users the initiators of actions rather than the responders.

8 Reduce short-term memory load.
The limitation of human information processing in short-term memory requires that displays be kept simple, multiple page displays be consolidated, window-motion frequency be reduced, and sufficient training time be allotted for codes, mnemonics, and sequences of actions. 
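
Rule 6 is perhaps the easiest of the eight to express in code.  What follows is only a minimal sketch, assuming a toy text buffer: the class name UndoableText and its methods are hypothetical and do not come from Shneiderman’s text.  Each action saves the prior state, so a single action is the unit of reversibility.

```python
class UndoableText:
    """Toy text buffer whose edits can be reversed one action at a time.

    Hypothetical illustration of "permit easy reversal of actions".
    """

    def __init__(self, text=""):
        self.text = text
        self._history = []  # earlier states, most recent last

    def append(self, fragment):
        # Save the current state before changing it.
        self._history.append(self.text)
        self.text += fragment

    def clear(self):
        # Even a destructive action stays reversible.
        self._history.append(self.text)
        self.text = ""

    def undo(self):
        """Restore the previous state; return False if there is nothing to undo."""
        if not self._history:
            return False
        self.text = self._history.pop()
        return True
```

Because clear() is recorded like any other action, a user who wipes the buffer by mistake can get everything back with one undo(), which is the anxiety relief the rule describes.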

the rise of the technical VC

Silicon Valley’s first big bang of innovation occurred in 1957, when eight engineers left Shockley Transistor to form Fairchild Semiconductor.  Back then, the idea of engineers being entrusted as founders of a business was heretical.  Forty-one firms were asked to invest, but “none of them were interested”, according to Arthur Rock.

The idea that engineers without MBAs can be successful founders has changed, but what about engineers acting as investors?  In my experience, the majority of investment professionals on Sand Hill Road are still non-technical.

But that is changing, in two ways.  

First, several young prominent venture capitalists who have technical degrees are rising to the top of their profession.  Folks such as Kevin Efrusy (MSEE and BSEE from Stanford) and Jeremy Levine (CS degree from Duke) are ranked #9 and #10, respectively, on this year’s Midas List of top investors.  And at #1 this year is Jim Breyer, who earned a CS degree from Stanford, and having just turned 50 is still youthful by VC standards.

Secondly, as technical founders have made their fortunes, many of them have joined the investing class.  Marc Andreessen and Reid Hoffman, two successful technical founders turned investors, were the second and third top investors in 2012.

And the Midas List doesn’t cover the funding arena where the influence of technical founders is greatest: angel investing.  Many of the world’s most successful non-professional investors – Jeff Bezos, Max Levchin, Andy Bechtolsheim, Paul Graham, Bill Joy, and Marc Benioff – have, with their spare change and spare time, outperformed entire funds.

Silicon Valley’s venture capital community is undergoing the same “revenge of the nerds” phenomenon that its businesses underwent in the 1960s and 70s.  Technical founders are launching companies, earning returns, and then spotting new start-ups to invest in – increasingly without needing surrogates carrying MBAs.

Or perhaps more accurately, whereas the technical class was previously seen as serving the business class, now it is the business class that serves the technical class.  Mark Zuckerberg’s having a controlling share of Facebook is testament to this new reality.

The rise of the technical VC is part of a larger macro-trend that Marc Andreessen cogently captured in five words: software is eating the world.  

One vertical after another – from media and travel to (soon, I hope) health care and education – is being transformed by information technology.  Those who conceive, develop, and understand software are the new masters of the universe.  And everyone else – lawyers, bankers, janitors – is their servant.

CEOs and VCs are learning to code not because their curiosity inspires it, but because their careers depend on it.

dna dating

A recent start-up is attempting to build a better dating engine using Big Data and algorithms.  But what mix of data could best be used to algorithmically identify an optimal mate?  Photos, favorite albums, and religious beliefs are a start.

But how about DNA?

A couple of years ago at SciFoo, Toby Segaran, Meredith Carpenter, and I brainstormed about creating a start-up that would do just this.  We dubbed it GeneHarmony.

Here’s how it would work: to become a member, you submit a saliva sample to our genomics facility, which sequences all of your genetic quirks (since most of us share DNA which is 99.6% similar, we need only sequence the differences).

Once sequenced, your genome would be scanned against all other members, with a focus on genes that are known to be predictive of mate compatibility, and return a rank-ordered list of potential dates.

The principle of “opposites attract” is mirrored at the DNA level. Studies show that individuals who are genetically dissimilar are significantly more likely to marry (the inverse of “why you shouldn’t marry your cousin.”)
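
To make the matching step concrete, here is a rough sketch of how a rank-ordered list might be produced.  Everything in it is an assumption for illustration: the marker names, the idea of scoring by the fraction of differing markers, and the function names are hypothetical, and a real compatibility model would be far more involved.

```python
def dissimilarity(a, b):
    """Fraction of shared marker positions at which two members differ.

    `a` and `b` map hypothetical marker names to observed variants.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return sum(a[m] != b[m] for m in shared) / len(shared)


def rank_matches(member, others):
    """Return (name, score) pairs, most genetically dissimilar first."""
    scores = [(name, dissimilarity(member, markers))
              for name, markers in others.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up members and markers, purely for demonstration.
me = {"m1": "A", "m2": "G", "m3": "T"}
candidates = {
    "x": {"m1": "A", "m2": "G", "m3": "T"},  # identical on every marker
    "y": {"m1": "C", "m2": "G", "m3": "A"},  # differs on two of three
    "z": {"m1": "C", "m2": "T", "m3": "A"},  # differs on all three
}
ranking = rank_matches(me, candidates)
```

Under this toy scoring, candidate "z" would top the list and "x" would land at the bottom.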

So much of mating is an elaborate system to uncover genetic signals. Many factors which are considered attractive – facial symmetry, body shape, intelligence, body odor – are ways in which humans tell suitors “I have good genes.”  DNA dating could cut through these perceptual inefficiencies and get right to the genetic point.

Even better, members’ experiences could be tracked and fed back into the genetic database to create better dating models.  One could even tune the parameters depending on the kind of relationship sought: are you a 22 year-old thrill-seeker looking for fun, or an aging bachelor seeking marriage and stability?

Of course, the privacy issues raised by such a service are massive. What if the site were used to settle a paternity lawsuit?  Or to target advertisements?  Facebook’s privacy issues appear trivial by comparison.

And yet, for most of us, selecting a partner is the most consequential decision of our lives.  Why shouldn’t we leverage all of the science and technology we have to improve that choice?

the data science debate: domain expertise or machine learning?


(L to R:  Mike Driscoll, Drew Conway, DJ Patil, Amy Heineike, Pete Skomoroch, Pete Warden, Toby Segaran. Credit: O’Reilly)

This past Tuesday evening at Strata I moderated an Oxford-style debate between six of the top data scientists in Silicon Valley and beyond. The motion debated was:

“In data science, domain expertise is more important than machine learning skill.”

The topic emerged from conversations over dinner the previous night, with Kaggle’s Jeremy Howard, LinkedIn’s Monica Rogati, and some pre-debate musings of Google’s Hal Varian.

To constrain the question, we added an additional clarification: which of these would you favor more in hiring your company’s first data scientist?

Arguing in favor of the motion (i.e., favoring domain expertise) were:

  • Drew Conway, Ph.D. Candidate at NYU, Data Scientist at IA Ventures  
  • DJ Patil, Data Scientist in Residence at Greylock Partners  
  • Amy Heineike, Director of Mathematics at Quid

Weighing in against the motion (i.e., favoring machine learning skills) were:

  • Pete Skomoroch
  • Pete Warden
  • Toby Segaran

When the Strata audience was initially polled, the vote was 53 to 40 in favor of domain expertise.  Then the debate began with comments from the audience.

The Audience:  s/MachineLearning/DomainExpertise is Easy 

We heard from Daniel Tunkelang, who argued in favor of domain expertise, stating that it was easier to learn statistics and machine learning than to acquire a lifetime of expertise and intuition (perhaps it comes easy to Dr. Tunkelang, but I’m not sure how many who have attempted to consume the Elements of Statistical Learning on their own would agree).

Claudia Perlich, a three-time winner of the KDD Cup competition, stood up and shared how she had won contests in domains as varied as “breast cancer, movie prediction, and sales performance – and I can tell you I knew next to nothing about those things when I started.”

The panelists were then asked to weigh in with their thoughts.

The Panelists:  Our Opponents Have Made Our Points for Us  

Drew Conway, whose popular Data Science Venn Diagram includes “substantive expertise” as one of its components (and truth be told, “math & statistics knowledge”) advocated that asking good questions is the most critical element in a data science project.  And the ability to ask good questions requires domain understanding.

Toby Segaran relayed a story about work I had done using social network analysis for modeling telco customer churn.  He went on to say that, “Mike, a domain expert in almost nothing, actually outperformed the domain experts.”  (ed. note: Thanks for the backhanded compliment, Toby 🙂 ).

DJ Patil read from the original LinkedIn Data Science job posting, arguing that machine learning skills were not even mentioned.  Rather they were seeking those who had curiosity and the ability to rapidly acquire domain expertise in the area of social network analysis.  He cited their hire of a theoretical physicist from Stanford, Jonathan Goldman – who did the initial groundbreaking work on the People You May Know algorithm – as evidence that machine learning skills were not important.

Pete Skomoroch fired back that, “since machine learning and physics are both just mathematics,” Jonathan was actually just a machine learning expert by another name.  Those skills, said Skomoroch, helped him tackle and ultimately succeed in a domain in which he had little prior expertise.

Pete Warden, arguing for machine learning skills, cited his own experience at JetPac, his new travel site, where identifying high quality user photos was a high priority.  They hosted a competition on Kaggle, the machine learning contest platform, and in three weeks had built a quality ranking algorithm for just $5,000.

Amy Heineike then retorted that Pete Warden had actually made the case against himself.  In outsourcing their machine learning, she claimed, they underscored the importance of the one thing they could not outsource: their own domain expertise.

Toby Segaran agreed that company founders have excellent domain expertise: that is why they started their companies.  But when hiring a first data scientist, they need to hire for what they don’t have:  machine learning skills.  (Zing!)

Pete Skomoroch ended the debate with a rhetorical question, asking the audience to consider the most successful companies in recent years: was it human intuition or analytics driving them?

The Verdict:  Let Us All Now Hail Our Machine Learning Overlords

In the end, the audience was polled again, and the results were tabulated in parallel by the panel (using what I like to call ManReduce).  The verdict: 52 for domain expertise, 55 for machine learning.

Like any good debate topic, there is merit on both sides of the domain expertise versus machine learning proposition.  As Hal Varian said when we asked him before the panel: “it depends on the structure of the problem.”  And in fairness to the debate panelists, they did not choose their positions: we assigned teams fifteen minutes before we went on stage.

One of the conclusions reached was that, when a problem is well-structured (or to Drew Conway’s point, when a good question is posed), it is much easier for machine learning to succeed.  Kaggle’s strength as a contest platform is that domain experts have already framed the problem:  they choose the features of the data to use (feature engineering or “feature creation”, as Monica Rogati calls it) as well as the criteria for success. This is the first, hardest step in any data science project.  After this, machine learners can step in and develop the best algorithms for classifying and predicting new data (or, less usefully, explaining old data).

Thus who you decide to hire as your first data scientist – a domain expert or a machine learner – might be as simple as this: could you currently prepare your data for a Kaggle competition?  If so, then hire a machine learner.  If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.

(Thanks to O’Reilly Media, and Strata organizers Edd Dumbill and Alistair Croll – who suggested the Oxford Debate format –  for hosting a terrific conference).

start-ups belong in cities


Last Saturday, I woke up and walked down to my favorite coffee shop in San Francisco, Sightglass Coffee in SoMa.

I met up with a couple of entrepreneurs pitching an amazing idea, and while ordering some mind-buzzingly-good drip coffee, ran into a mentor of mine.

I write this because, while these interactions could have happened in the suburbs of Silicon Valley – whether the Coupa Cafe in Palo Alto or Red Rock in Mountain View – they are quintessentially enabled by four qualities of a city like San Francisco:

  •  neighborhoods that mix commerce and living, that “serve more than one primary function”
  •  blocks that are walkable, short and broken up with alleyways and side streets
  •  buildings which are a diversity of the old and new, luxury and low-rent
  •  people who are prevalent and sufficiently concentrated

These four qualities enable the unique vibrancy of urban neighborhoods, and were laid out by Jane Jacobs in her magnum opus “The Death and Life of Great American Cities.”

I know that, for historical reasons, technology start-ups began in Silicon Valley.  But there is something tragic about watching 22-year-old software engineers waiting on city corners for buses to take them to work in the suburbs.

Especially when San Francisco is undergoing a renaissance of technology firms, driven by a few forces:

  1. anchor firms like Twitter, Splunk, Dropbox, Zynga, and Square
  2. flourishing start-up neighborhoods like SoMa, and now Dogpatch
  3. early stage VC firms with a strong SF presence, like True Ventures and OATV

To that end, I’m a huge supporter of the initiative which is helping strengthen the community of hackers, entrepreneurs, and firms who recognize the unique advantages that a city provides.

To the aspiring young engineers thinking of coming West, I say: don’t settle for a bagels and Wi-Fi ride to an office park or even a campus.  Come to a Great American City: we have amazing start-ups for you to join.

To the entrepreneurs and their investors: curse the overpriced rents in San Francisco but recognize that efficient markets sometimes express a valid point.  Where there is high value there is high cost, and as Richard Florida has observed, the world’s brightest and most talented people flock to cities.  So invest in them, their start-ups and the cities – San Francisco, New York, Chicago, London, Beijing – where they want to live.

the coal mining of the information age

“If I were starting a NoSQL-in-the-enterprise startup, I would focus on ETL. ETL is a mess, and is a precursor for any fancy uses of data.” @jaykreps

“@jaykreps ETL is the coal mining of the information age: dirty, important work that fuels the economy.” @peteskomoroch

One of the largest obstacles facing companies who seek to derive value from data isn’t data’s size.  It’s data’s dirtiness.

It’s been said before: 80% of the effort that goes into a data science project is extracting, transforming, and loading (ETL’ing) data into a system where it can be analyzed.

This challenge is not simply a consequence of poorly structured data: free-form text records are now mostly rare.  Yet there remains bewildering variety within well-structured, regular data.

Take the basic dimension of time, an attribute that nearly every data set contains.  A date can be expressed as a POSIX-style or ISO 8601 string or as a Unix epoch integer, among myriad other forms:

  •   Sat Dec 10 10:37:13 PST
  •   2011-12-11T18:37:13.0+0000
  •   1323599850
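
As a taste of the ETL work involved, here is a hedged sketch that coerces those three shapes into a single UTC representation using only Python’s standard library.  The helper name to_utc, the fixed offset of eight hours for PST, and the default year (the first sample carries none) are all assumptions made for illustration; note too that the three samples need not denote the same instant.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PST" always means a fixed UTC-8 offset.
PST = timezone(timedelta(hours=-8))


def to_utc(value, default_year=2011):
    """Normalize one of the three sample shapes to an aware UTC datetime."""
    if isinstance(value, int):
        # Unix epoch seconds.
        return datetime.fromtimestamp(value, tz=timezone.utc)
    try:
        # ISO 8601 with fractional seconds and a numeric UTC offset.
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")
        return parsed.astimezone(timezone.utc)
    except ValueError:
        pass
    # Date-command style string with no year: supply the assumed one.
    parsed = datetime.strptime(f"{default_year} {value}",
                               "%Y %a %b %d %H:%M:%S PST")
    return parsed.replace(tzinfo=PST).astimezone(timezone.utc)


samples = ["Sat Dec 10 10:37:13 PST", "2011-12-11T18:37:13.0+0000", 1323599850]
normalized = [to_utc(s).isoformat() for s in samples]
```

Even this small helper has to guess at a year and a time zone, which is exactly the kind of hidden decision that makes ETL expensive.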

And dates are just the beginning.  There are country codes, currency symbols, geospatial coordinates, and language indicators.  Beyond the data itself, there is how it is delimited and encoded (including XML, the clamshell plastic packaging of data formats).

Data platform businesses create value by reducing the friction of data flow among participants.  They do this with standards.  The financial services industry, the most mature of data verticals, has defined symbologies for equities and other tradeable instruments.  Consumer goods have UPC barcodes.  Governments have national postal codes.

The manufacturing world has long appreciated the value of interchangeable parts, in lowering the costs of creating everything from electronics to airplanes.  Historically, standards arise in one of two ways: through the de jure recommendation of a consortium, or through the de facto adoption of a market leader’s schema.

As the industrial revolution of data continues to unfold, we need data platforms and standards bodies to facilitate “interchangeable data”.  These will accelerate the growth of a new breed of data-driven applications and services.  Clean coal mining may be a fantasy, but clean data mining may yet be possible.

why everyone should be a medical data donor


What happens to your medical records when you die?  Gil Elbaz thinks you ought to donate them to science, a thought he shared with a technology audience this past week.

It’s a fascinating idea.  But why wait until you’re dead?  In the age of the quantified self, why shouldn’t you be able to give your DNA sequence, your diet, and your disease diagnoses to science while you’re alive?  Unlike your organs, you can donate your data away and yet still keep it.

We have companies collecting vast swaths of data about our buying, browsing, and clicking habits to sell us more stuff.  But when it comes to understanding what behaviors keep us healthy, it’s a rocky landscape of HIPAA-regulated, technologically-challenged health insurers and providers.  We collect so much data about what makes us click, yet so little about what makes us tick.

There are pockets of hope.  Sites such as PatientsLikeMe – which as of this writing has 122,640 patients and over a thousand conditions – are green sprouts in a bottom-up, democratizing data movement for health.

Nearly eight out of ten people on planet Earth now own a mobile phone.  These phones send so-called “heartbeat” data to cell towers every few seconds.  Imagine if, instead, we had the true heartbeat data of the humans carrying those phones.  A simple cardiac signal can betray a host of health issues, from stress and aging to a warning of impending stroke or heart attack.

I know that I’m not alone in being willing to give my data to medical science.  If the Fitbit or Jawbone UP had a checkbox that read “donate my data”, and the receiving institution was a trusted one, it could be the beginning of a valuable data bank.  If the Red Cross can convince us to stick needles in our arms to give blood, certainly we can endure bracelets on our wrists to give data.

lies, damned lies, and social media statistics


Social media statistics – shares, retweets, and likes – reflect content’s value the way a funhouse mirror reflects one’s looks: grotesquely.  As the web lines its halls with social mirrors, these distortions are influencing the content we create and consume.

One need look no further than the headlines at Hacker News for a gallery of the grotesque:  “N Reasons…”, “Why X is Wrong”, “Free Y”, and “How Z.. Cancer”.  Many of these stories are explicitly crafted to achieve fifteen seconds of fame.

I plead guilty to this seduction – with @jkottke telling me off as proof – because it’s tempting to believe that metrics are an honest measure of value.  They’re not.

Social Media Statistics are Biased

Hacker News readers are not a representative audience. Because of the frenzied frequency with which they flood the voting booths of cyberspace, their influence is outsized – and perversely enough, in inverse proportion to their attention spans.

We need a balance against these biases.  A retweet from @timoreilly means more than one from @lolz69.  Klout has attempted, with some ignominy, to measure online influence. If we weighted retweet counts by influence, we might have a better measure of an article’s impact.

Time matters too. All content is a zero until someone reacts, so we need to gauge the speed of +1s or shares, not just the total.
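
Both corrections are easy to sketch.  The snippet below is illustrative only: the Share record, the influence table, and the one-hour window are assumptions of mine, and it does not use Klout’s or any other real scoring method.

```python
from dataclasses import dataclass


@dataclass
class Share:
    user: str
    minutes_after_publish: float


def weighted_share_score(shares, influence, default_influence=1.0):
    """Sum of shares, each weighted by the sharer's influence."""
    return sum(influence.get(s.user, default_influence) for s in shares)


def share_velocity(shares, window_minutes=60.0):
    """Shares per minute during the first `window_minutes` after publishing."""
    early = [s for s in shares if s.minutes_after_publish <= window_minutes]
    return len(early) / window_minutes


# Made-up weights and share events, purely for demonstration.
influence = {"timoreilly": 50.0, "lolz69": 0.5}
shares = [Share("timoreilly", 5), Share("lolz69", 10), Share("someone_else", 120)]

score = weighted_share_score(shares, influence)
velocity = share_velocity(shares)
```

With these invented numbers the raw count is three shares, but nearly all of the weighted score comes from a single influential sharer, and only two of the three arrive within the first hour.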

And positive feedback loops are everywhere.  We end up reading and sharing the same few dozen articles every day, not because these are always the most valuable, but because once they’ve bubbled up into the meme pool, they get recirculated and amplified.

Be a First Follower

The strongest signal of quality should be the content itself, not its number of shares or comments.  If you keep an open mind, you’ll encounter that joy of discovery once so integral to the web.  Lovely gems still lurk out there.  

Being the first follower takes a smidgeon of bravery.  So ignore what other people think and share something no one else has.  You’ll be a democratizing force.

Connect with People, Don’t Collect Them

Few of us share our ideas, photographs, and experiences online solely to collect followers.  We do so to convince, to delight, to connect with people.  

If you’re a creator, never confuse numbers with the value of your creative output.  Resist the urge to chase some earlier success.  If you create something of lasting value, which has staying power after the initial spasms of interest have passed, you will engage with your audience in a way that few metrics reveal.

Blogging to boost your follower count is like launching a start-up to build your bank balance:  it rarely works.  Instead, focus passionately on creating value, and the rest will come.