the three sexy skills of data geeks


(I originally penned this on May 27, 2009 and published on the Dataspora blog.)

Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:

“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

While prepping for tonight’s talk at the Google I/O Ignite event, I was inspired by this quote to muse about how sex appeal and statistics might go together, so I chose to mash up a few scatter plots with Andy Warhol’s Marilyn Monroe.

Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more to do with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent, areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it.)

Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn. It’s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was only recently developed in 2004). I expect to be on its learning curve my entire life. This being the case, people who possess a solid grasp of modern statistics are rare.   And yet problems that require its application continue to multiply.  The text that I was exposed to in graduate school and find to be an unparalleled survey is Hastie, Tibshirani, and Friedman’s Elements of Statistical Learning.

Skill #2: Data Munging (Suffering). The second critical skill mentioned above is “data munging.” Among data geek circles (you can find us with a Twitter search for #rstats), this refers to the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy. At best it’s inconsistently delimited or packed into an unnecessarily complex XML schema. At worst, it’s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.

A good data munger excels at turning coffee into regular expressions and parsers, implemented in a high-level scripting language of choice (often Perl, Python, even JavaScript). This is problem solving with programming, and quite different from statistics. An aspiration towards elegance (in the form of a perfect XSLT filter, for example) is rarely rewarded, and often punished. A decade ago, I thought that the world’s data would soon be well-structured, and that my talent for syntactical incantations of regular expressions would become moot. I was wrong. (Perhaps there’s an analogy with the paper industry: the growing volume of data means we’ll likely need more regular expressions before we need fewer.)
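To make the munging step concrete, here is a minimal sketch in Python of cleaning up inconsistently delimited records with a regular expression. The sample lines, the three-field layout, and the parse_line helper are all made up for illustration; real data will be messier.

```python
import re

# Hypothetical messy input: the same three fields, delimited four different ways.
RAW_LINES = [
    "alice, 34 ,NYC",
    "bob;27;  SF",
    "carol\t41\tBoston",
    "",                    # blank line, to be skipped
    "dave|n/a|Austin",     # age is missing
]

# Split on a comma, semicolon, tab, or pipe, swallowing surrounding whitespace.
DELIMITER = re.compile(r"\s*[,;\t|]\s*")


def parse_line(line):
    """Return (name, age, city), or None if the line is unusable."""
    line = line.strip()
    if not line:
        return None
    fields = DELIMITER.split(line)
    if len(fields) != 3:
        return None
    name, age, city = fields
    # Keep the row, but record a non-numeric age as missing.
    age = int(age) if age.isdigit() else None
    return name, age, city


records = [r for r in map(parse_line, RAW_LINES) if r is not None]

for record in records:
    print(record)
```

Note that nothing here is elegant: the delimiter pattern simply grows a new character every time the data surprises you, which is more or less the point.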

Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).
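As a rough sketch of what slicing and dicing looks like once the data is well-structured, the following uses Python’s built-in sqlite3 module with an in-memory database. The visits table, its columns, and the rows are invented for the example; any SQL store would do.

```python
import sqlite3

# An in-memory database stands in for a persistent data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, city TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("alice", "NYC", 12.0),
        ("bob", "SF", 30.0),
        ("carol", "NYC", 18.0),
        ("dave", "SF", 10.0),
        ("erin", "Austin", 5.0),
    ],
)

# Dice: let SQL do the grouping and aggregation.
rows = conn.execute(
    "SELECT city, COUNT(*), AVG(minutes) "
    "FROM visits GROUP BY city ORDER BY city"
).fetchall()

# Slice: filter the aggregated rows in the scripting language.
long_stay_cities = [city for city, n, avg_minutes in rows if avg_minutes >= 15]

print(rows)
print(long_stay_cities)
conn.close()
```

The division of labor is a judgment call: here SQL handles the aggregation and Python handles the final filter, but either step could move to the other side.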

And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like 96 nodes of Postgres, snow and Rmpi, Hadoop and MapReduce, and on Amazon EC2 to boot.

Skill #3: Visualization (Storytelling). This third and last skill that Professor Varian refers to is the easiest to believe one has.  Most of us have had exposure to basic chart-making widgets of Excel (and to date myself, tools like Harvard Graphics). But a little knowledge is a dangerous thing: these software tools are often insufficient when faced with the visualization of large, multivariate data sets.

Here it’s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals. The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst’s understanding of the data. These may consist of scatter plot matrices and histograms, where labels and colors are minimally set by default. Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team.
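An exploratory graphic can be as bare as a histogram printed to the terminal. The sketch below assumes no plotting library at all; the text_histogram helper, the sample values, and the bin width are hypothetical, and a real analysis would more likely reach for R or a plotting package.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return the lines of a horizontal text histogram with fixed-width bins."""
    # Assign each value to a bin index, then count values per bin.
    counts = Counter(int(v // bin_width) for v in values)
    lines = []
    # Walk every bin between the lowest and highest, including empty ones,
    # so that gaps in the data stay visible.
    for b in range(min(counts), max(counts) + 1):
        lo = b * bin_width
        lines.append(f"{lo:>4}-{lo + bin_width:<4} {'#' * counts[b]}")
    return lines


# Hypothetical sample with one outlier.
SAMPLE = [3, 7, 12, 14, 15, 18, 21, 22, 25, 48]
histogram = text_histogram(SAMPLE, bin_width=10)
print("\n".join(histogram))
```

There are no titles, colors, or legends, which is fine for an audience of one: the empty bin and the lone value beyond it are enough to prompt a question about the data.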

The second kind of data visualization is intended to communicate to a wider audience, and its goal is to visually advocate for a hypothesis. While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill, with separate tools. (R is excellent for static visualizations, but cannot compete with the kinds of rich interactive visualizations that tools like Processing and Flare make possible.) Luckily, successful collaboration often occurs between data analysts and designers, the occasional fracas notwithstanding.

The ability to visualize and communicate data is critical: even with good data and rigorous statistical techniques, the results of an analysis will not convince if they are poorly visualized, whether they support an academic discovery or a business proposal.

Put All Three Skills Together: Sexy. Thus with the Age of Data upon us, those who can model, munge, and visually communicate data (call us statisticians or data geeks) are a hot commodity. I grew up before the age of geek chic, when the computer whizzes were social pariahs, and feature-length movies were dedicated to nerds seeking revenge. But in the last decade, Steve Jobs became an icon, the Internet became cool, and an entire generation of tech kids grew up well adjusted. They even built the social web to prove it. I believe the same could happen to statistics and data geeks too.

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.
