“If I were starting a NoSQL-in-the-enterprise startup, I would focus on ETL. ETL is a mess, and is a precursor for any fancy uses of data.” – @jaykreps
“@jaykreps ETL is the coal mining of the information age: dirty, important work that fuels the economy.” – @peteskomoroch
One of the largest obstacles facing companies that seek to derive value from data isn’t data’s size. It’s data’s dirtiness.
It’s been said before: 80% of the effort that goes into a data science project is extracting, transforming, and loading (ETL’ing) data into a system where it can be analyzed.
This challenge is not simply a consequence of poorly structured data: free-form text records are now fairly rare. Yet there remains bewildering variety within well-structured, regular data.
Take the basic dimension of time, an attribute that nearly every data set contains. A date can be expressed as a POSIX or ISO 8601 string or as a Unix epoch integer, among myriad other forms:
- Sat Dec 10 10:37:13 PST
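To make the variety concrete, here is a minimal Python sketch of the kind of normalization an ETL step might perform on timestamps. Everything in it is an illustrative assumption rather than a standard: the function name `normalize_timestamp`, the tiny `TZ_OFFSETS_HOURS` table, the choice to treat naive ISO 8601 strings as UTC, and the sample values (a year is appended to the `date`-style string purely for illustration).

```python
from datetime import datetime, timedelta, timezone

# Assumption: a tiny, hand-picked abbreviation table. Abbreviations such as
# "PST" are ambiguous in general, so a real pipeline needs an explicit policy.
TZ_OFFSETS_HOURS = {"UTC": 0, "GMT": 0, "PST": -8}


def normalize_timestamp(value):
    """Convert a few common timestamp forms to a timezone-aware UTC datetime."""
    # Unix epoch seconds, given as a number.
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)

    text = value.strip()

    # ISO 8601 text, e.g. "2011-12-10T10:37:13-08:00".
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        parsed = None
    if parsed is not None:
        if parsed.tzinfo is None:
            # Assumption: naive ISO strings are treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)

    # Unix `date`-style text, e.g. "Sat Dec 10 10:37:13 PST 2011".
    parts = text.split()
    if len(parts) == 6 and parts[4] in TZ_OFFSETS_HOURS:
        _weekday, month, day, clock, zone, year = parts
        naive = datetime.strptime(
            f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y"
        )
        offset = timezone(timedelta(hours=TZ_OFFSETS_HOURS[zone]))
        return naive.replace(tzinfo=offset).astimezone(timezone.utc)

    raise ValueError(f"unrecognized timestamp format: {value!r}")


# Three hypothetical spellings of the same instant.
samples = [
    1323542233,
    "2011-12-10T10:37:13-08:00",
    "Sat Dec 10 10:37:13 PST 2011",
]
normalized = [normalize_timestamp(s) for s in samples]
```

All three samples normalize to the same UTC instant. A string with no year, like the example in the list above, is rejected with a `ValueError` rather than guessed at, which is usually the safer default in a cleaning step.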
And dates are just the beginning. There are country codes, currency symbols, geospatial coordinates, and language indicators. Beyond the data itself, there is how it is delimited and encoded (including XML, the clamshell plastic packaging of data formats).
Data platform businesses create value by reducing the friction of data flow among participants. They do this with standards. The financial services industry, the most mature of data verticals, has defined symbologies for equities and other tradeable instruments. Consumer goods have UPC barcodes. Governments have national postal codes.
The manufacturing world has long appreciated the value of interchangeable parts in lowering the cost of creating everything from electronics to airplanes. Historically, standards arise in one of two ways: through the de jure recommendation of a consortium, or through the de facto adoption of a market leader’s schema.
As the industrial revolution of data continues to unfold, we need data platforms and standards bodies to facilitate “interchangeable data”. These will accelerate the growth of a new breed of data-driven applications and services. Clean coal mining may be a fantasy, but clean data mining may yet be possible.