At O’Reilly’s “Making Data Work” seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.
What follows is a blog-ified and amended version of that talk, originally entitled “Secrets of Successful Data Scientists.”
1. Choose The Right-Sized Tool
Or, as I like to say, you don’t need a chainsaw to cut butter.
If you’ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I’ve just endorsed that dull knife called Excel).
In fact, Excel’s and Emacs’ program-by-example keyboard macros can be a fantastic tool for quick and dirty data clean-up.
Alternatively, if you’ve got 600 million lines of data and you need something simple, piping together several Unix tools (cut, uniq, sort) with a dash of Perl one-liner foo may get you there.
But don’t confuse this kind of data exploration, where the goal is to size up the data, with building proper data plumbing, where you want robustness and maintainability. Perl and bash scripts are nice for the former, but can be a nightmare for building data pipelines.
So, when it comes to choosing tools: scale them up as you need, and focus on getting results first.
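For a sense of how small a right-sized tool can be, here is a minimal Python sketch of the cut | sort | uniq -c style of exploration. The function name `count_column` and the sample rows are made up for illustration; swap in a real file handle as needed.

```python
import csv
import io
from collections import Counter

def count_column(lines, column, delimiter=","):
    """Tally the values in one column, roughly `cut | sort | uniq -c`."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > column:  # skip blank or short rows
            counts[row[column]] += 1
    return counts

# Hypothetical sample standing in for an open file.
sample = io.StringIO("ca,mon\nak,fri\nca,fri\n")
for value, n in count_column(sample, column=0).most_common():
    print(n, value)
```

Because it streams one row at a time, the same function works on a file far larger than memory, as long as the number of distinct values stays modest.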
2. Compress Everything
We live in an IO-bound world, where the dominant bottlenecks to data flow are disk read-speed and network bandwidth.
As I was writing this, I was downloading an uncompressed CSV file via a web API. Uncompressed, it was 257MB; ZIP-compressed, 9MB.
That particular file shrank by more than 28x, but even on less repetitive data, compression tends to give you a 6-8x bump out of the gate. When moving or crunching data of a certain heft, compress everything, always: it will save you time and money.
That said, because compression can render data difficult to introspect, I don’t recommend compressing TBs of data into a single tarball, but rather splitting it up, as I discuss next.
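In practice, compressing everything rarely means changing much code. As a rough sketch, using Python's standard gzip module (the file name and the repetitive sample rows are invented for illustration), you can write and read a compressed CSV as a stream without ever materializing the uncompressed file:

```python
import csv
import gzip
import os
import tempfile

def write_gzipped_csv(path, rows):
    """Write rows to a gzip-compressed CSV in text mode."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

def read_gzipped_csv(path):
    """Stream rows back out, decompressing on the fly."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        yield from csv.reader(f)

# Hypothetical, highly repetitive data; real ratios depend on your data.
rows = [["state", "visits"]] + [["CA", str(i)] for i in range(10000)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "visits.csv.gz")
    write_gzipped_csv(path, rows)
    compressed_size = os.path.getsize(path)
    # The csv writer ends each row with "\r\n", hence the + 2.
    raw_size = sum(len(",".join(r)) + 2 for r in rows)
    round_trip = list(read_gzipped_csv(path))

print(raw_size, "bytes raw ->", compressed_size, "bytes gzipped")
```

The exact ratio here says nothing about your own files; the point is only that the reading and writing code looks almost identical to the uncompressed version.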
3. Split Up Your Data
“Monolithic” is a bad word in software development.
It’s also, in my experience, a bad word when it comes to data.
The real world is partitioned – whether as zip codes, states, hours, or top-level web domains – and your data should be too. Respect the grain of your data, because eventually you’ll need to use it to shard your database or distribute it across your file system.
Even more, it’s this splitting up of data that enables the parallel execution in Hadoop and commercial data platforms (such as Greenplum, Aster, and Netezza).
Splitting is part of a larger design pattern succinctly identified in a paper by Hadley Wickham as: split, apply, combine.
This is, in my mind, a more lucid formulation of “map, reduce,” one that treats key selection (“split”) as a distinct step before any map/apply.
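To make the three steps concrete, here is a minimal sketch in plain Python. The helper `split_apply_combine` and the `visits` records are hypothetical, not taken from Wickham's paper or from any particular library.

```python
from collections import defaultdict

def split_apply_combine(records, key, apply):
    """Group records by key, run `apply` on each group, collect the results."""
    groups = defaultdict(list)
    for record in records:                 # split: choose a key per record
        groups[key(record)].append(record)
    # apply: one independent call per group; combine: gather into a dict
    return {k: apply(group) for k, group in groups.items()}

# Hypothetical records for illustration.
visits = [
    {"state": "CA", "seconds": 30},
    {"state": "CA", "seconds": 50},
    {"state": "AK", "seconds": 10},
]

mean_seconds = split_apply_combine(
    visits,
    key=lambda r: r["state"],
    apply=lambda group: sum(r["seconds"] for r in group) / len(group),
)
print(mean_seconds)
```

Each group is processed independently of the others, which is what lets a partitioned data set be farmed out to separate workers.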
4. Sample Your Data
Let’s say, hypothetically, you’ve got 200 GB of data from your portmanteau of a start-up, FaceLink. Someone wants to know if more people visit on Mondays or Fridays. What do you do?
Before you wonder “if only I had 64 GB of RAM on my MacBook Pro”, or fire up a Hadoop streaming job, try this: look at a 10k sample of data.
It’s easy to visually inspect, or pull into R and plot.
Sampling allows you to quickly iterate your approach, and work around edge cases (say, pesky unescaped line terminators), before running a many-hour job on the full monty.
That said, sampling can bite you if you’re not careful: when data is skewed, which it always is, it can be hard to estimate joint distributions. Comparing the means of California vs. Alaska, for example, is unreliable if your sample is dominated by Californians (an issue that statistics, that sexy skill, can address).
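One way to pull a 10k sample without loading the full file is reservoir sampling, which keeps a uniform random sample in a single pass. This is a sketch of the standard technique rather than anything specific to the talk; the function name `reservoir_sample` and the fake log lines are my own.

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = line       # replace with probability k / (i + 1)
    return sample

# Hypothetical stand-in for iterating over a huge log file.
fake_log = (f"visit {n}" for n in range(1_000_000))
sample = reservoir_sample(fake_log, k=10_000, seed=42)
print(len(sample))
```

Note that a uniform sample like this does nothing about the skew problem: if Alaskans are rare in the full data, they will be rare in the sample too, so a per-group (stratified) sample may be the better fit for comparisons across groups.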
5. Smart Borrows, But Genius Uses Open Source
Before you create something new out of whole cloth, pause and consider that someone else may have already seen it, solved it, and open-sourced it.
A Google Code Search may turn up a regular expression for that obscure data format.
The open source community allows you, if not to stand on the shoulders of giants, to at least rely on the gruntwork of fellow geeks.
6. Keep Your Head in the Cloud
This past week, an engineer friend was thinking about buying a dream desktop: a high-RAM, multi-core box to run machine learning code over TBs of data.
I told him it was a terrible idea.
Why? Because the data he wants to work on isn’t local; it’s on an Amazon EC2 cluster. It’d take hours to download those TBs over a cable connection.
If you want to compute locally, pull down a sample. But if your data is in the cloud, that’s where your tools and code should be.
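A quick back-of-the-envelope check makes the download cost concrete. The 20 Mbit/s figure below is purely an assumed cable speed (plug in your own), and a terabyte is taken as 10^12 bytes.

```python
def transfer_hours(terabytes, megabits_per_second):
    """Hours needed to move the data at a sustained line rate."""
    bits = terabytes * 1e12 * 8                    # decimal terabytes to bits
    seconds = bits / (megabits_per_second * 1e6)   # megabits/s to bits/s
    return seconds / 3600

# Assumed numbers for illustration: 1 TB over a 20 Mbit/s connection.
print(round(transfer_hours(1, 20), 1))  # -> 111.1
```

Under those assumptions a single terabyte takes more than four days, before any protocol overhead or dropped connections, so "hours" is the optimistic case.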
7. Don’t Be Clever
I once heard Brewster Kahle discuss managing the Internet Archive’s many-petabyte data platform: “Every time one of our engineers comes to me with a new, ingenious and clever idea for managing our data, I have a response: ‘You’re fired.’”
Hyperbole aside, his point is well-taken: cleverness doesn’t scale.
When dealing with big data, embrace standards and use commonly available tools. Most of all, keep it simple, because simplicity scales.
I know of a firm that, several years ago, decided to fork one part of Hadoop because they had a more clever approach. Today, they are several versions behind the latest release, and devoting time & energy to back-porting changes.
Cleverness rarely pays off. Focus your precious programmer-hours on the problems that are unsolved, not simply unoptimized.