the data science debate: domain expertise or machine learning?


(L to R:  Mike Driscoll, Drew Conway, DJ Patil, Amy Heineike, Pete Skomoroch, Pete Warden, Toby Segaran. Credit: O’Reilly – Link to Video)

This past Tuesday evening at Strata I moderated an Oxford-style debate between six of the top data scientists in Silicon Valley and beyond. The motion debated was:

“In data science, domain expertise is more important than machine learning skill.”

The topic emerged from conversations over dinner the previous night with Kaggle’s Jeremy Howard and LinkedIn’s Monica Rogati, and from some pre-debate musings by Google’s Hal Varian.

To constrain the question, we added an additional clarification: which of these would you favor more in hiring your company’s first data scientist?

Arguing in favor of the motion (i.e., favoring domain expertise) were:

  • Drew Conway, Ph.D. Candidate at NYU, Data Scientist at IA Ventures  
  • DJ Patil, Data Scientist in Residence at Greylock Partners  
  • Amy Heineike, Director of Mathematics at Quid

Weighing in against the motion (i.e., favoring machine learning skills) were:

  • Pete Skomoroch
  • Pete Warden
  • Toby Segaran

When the Strata audience was initially polled, the vote was 53 to 40 in favor of domain expertise.  Then the debate began with comments from the audience.

The Audience:  s/MachineLearning/DomainExpertise is Easy 

We heard from Daniel Tunkelang, who argued in favor of domain expertise, stating that it was easier to learn statistics and machine learning than to acquire a lifetime of expertise and intuition (perhaps it comes easy to Dr. Tunkelang, but I’m not sure how many who have attempted to consume the Elements of Statistical Learning on their own would agree).

Claudia Perlich, a three-time winner of the KDD Cup competition, stood up and shared how she had won contests in domains as varied as “breast cancer, movie prediction, and sales performance – and I can tell you I knew next to nothing about those things when I started.”

The panelists were then asked to weigh in with their thoughts.

The Panelists:  Our Opponents Have Made Our Points for Us  

Drew Conway, whose popular Data Science Venn Diagram includes “substantive expertise” as one of its components (alongside, truth be told, “math & statistics knowledge”), argued that asking good questions is the most critical element in a data science project, and that the ability to ask good questions requires domain understanding.

Toby Segaran relayed a story about work I had done using social network analysis for modeling telco customer churn.  He went on to say that, “Mike, a domain expert in almost nothing, actually outperformed the domain experts.”  (ed. note: Thanks for the backhanded compliment, Toby 🙂 ).

DJ Patil read from the original LinkedIn Data Science job posting, arguing that machine learning skills were not even mentioned.  Rather they were seeking those who had curiosity and the ability to rapidly acquire domain expertise in the area of social network analysis.  He cited their hire of a theoretical physicist from Stanford, Jonathan Goldman – who did the initial groundbreaking work on the People You May Know algorithm – as evidence that machine learning skills were not important.

Pete Skomoroch fired back that, “since machine learning and physics are both just mathematics,” Jonathan was actually just a machine learning expert by another name.  Those skills, said Skomoroch, helped him tackle and ultimately succeed in a domain in which he had little prior expertise.

Pete Warden, arguing for machine learning skills, cited his own experience at JetPac, his new travel site, where identifying high quality user photos was a high priority.  They hosted a competition on Kaggle, the machine learning contest platform, and in three weeks had built a quality ranking algorithm for just $5,000.

Amy Heineike then retorted that Pete Warden had actually made the case against himself.  In outsourcing their machine learning, she claimed, they underscored the importance of the one thing they could not outsource: their own domain expertise.

Toby Segaran agreed that company founders have excellent domain expertise: that is why they started their companies.  But when hiring a first data scientist, they need to hire for what they don’t have:  machine learning skills.  (Zing!)

Pete Skomoroch ended the debate with a rhetorical question, asking the audience to consider the most successful companies in recent years: was it human intuition or analytics driving them?

The Verdict:  Let Us All Now Hail Our Machine Learning Overlords

In the end, the audience was polled again, with the results tabulated in parallel by the panel (using what I like to call ManReduce).  The verdict: 52 for domain expertise, 55 for machine learning.

Like any good debate topic, there is merit on both sides of the domain expertise versus machine learning proposition.  As Hal Varian said when we asked him before the panel: “it depends on the structure of the problem.”  And in fairness to the debate panelists, they did not choose their positions: we assigned teams fifteen minutes before we went on stage.

One of the conclusions reached was that, when a problem is well-structured (or to Drew Conway’s point, when a good question is posed), it is much easier for machine learning to succeed.  Kaggle’s strength as a contest platform is that domain experts have already framed the problem:  they choose the features of the data to use (feature engineering or “feature creation”, as Monica Rogati calls it) as well as the criteria for success. This is the first, hardest step in any data science project.  After this, machine learners can step in and develop the best algorithms for classifying and predicting new data (or, less usefully, explaining old data).

Thus, whom you decide to hire as your first data scientist – a domain expert or a machine learner – might come down to a simple question: could you currently prepare your data for a Kaggle competition?  If so, then hire a machine learner.  If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.

(Thanks to O’Reilly Media, and Strata organizers Edd Dumbill and Alistair Croll – who suggested the Oxford Debate format –  for hosting a terrific conference).

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.
