Confessions from a Massive, Nightmarish Data Project
Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like GenBank. The problem was that the repositories, each with millions of records, all had their own formats. It sounded like a massive, nightmarish data interoperability project. And an ideal fit for a hot new technology: XML.
So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (“taxon” or “species”? attribute or element?). At night I dreamt in ontologies. It was perfect.
Then reality struck. The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required holding entire document trees in RAM.
We had a deadline to meet (and, mon dieu, a 35-hour work-week). So we changed course. We hacked our Perl scripts to emit a flat tab-delimited format — “TabML” — which was bulk-loaded into Oracle. It wasn’t elegant, but it was fast and it worked.
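For flavor, here’s a minimal sketch of the kind of conversion script we hacked together (reconstructed from memory; the element names record, accession, taxon, and sequence are hypothetical). It stream-parses the XML with Perl’s XML::Twig, so the full document tree never sits in RAM, and emits one tab-delimited line per record for the bulk loader:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Stream-parse the XML so the whole document tree never sits in RAM;
# emit one tab-delimited line per record for the bulk loader.
XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ($twig, $rec) = @_;
            print join("\t",
                $rec->first_child_text('accession'),
                $rec->first_child_text('taxon'),
                $rec->first_child_text('sequence'),
            ), "\n";
            $twig->purge;    # discard the subtree we just handled
        },
    },
)->parsefile($ARGV[0]);
```

From there, a bulk loader such as Oracle’s SQL*Loader can ingest the flat file directly, no XSLT required.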
Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov.
In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity. I then generalize these lessons into three rules that advocate a more liberal approach to data.
Three Reasons Why XML Fails for Big Data
I. XML Spawns Data Bureaucracy
In its natural habitat, data lives in relational databases or as data structures in programs. The common import and export formats of these environments do not resemble XML, so much effort is dedicated to making XML fit. When more time is spent on inter-converting data — serializing, parsing, translating — than on using it, you’ve created a data bureaucracy.
II. Yes, Size Matters for Data
Size matters for data in a way it does not for documents. Documents are intended for human consumption and have human-sized upper bounds (a lifetime’s worth of reading fits on a thumb drive). Data designed for machine consumption is bounded only by bandwidth and storage.
XML’s expansiveness — for even when compressed, the genie must be let out of the bottle at some point — imposes memory, storage, and CPU costs.
III. Complexity Carries a Cost
I never fail to sigh when I open a data file and discover an army of tags, several ranks deep, surrounding the data I need. XML’s complexity imposes costs without commensurate benefits, specifically:
- In-line, element-by-element tagging is redundant. Far preferable is stating the data model separately and using a lightweight delimiter (such as a comma or a tab), as the example after this list shows.
- Text tags are purported to be self-documenting, but textual meaning is a slippery thing: it’s rare that one can be sure of a tag’s data type without consulting its DTD (in a separate document).
- End-tags support nested structures (such as an aside (within an aside)). But to facilitate data exchange, flattened structures are preferable, and arbitrary levels of nesting are best used sparingly.
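To make the redundancy concrete, compare a hypothetical genomics record both ways. The tagged version restates the data model inside every record; the delimited version states it once, in a header line:

```
<record>
  <accession>AB000001</accession>
  <taxon>Homo sapiens</taxon>
  <seqLength>1020</seqLength>
</record>

accession	taxon	seqLength
AB000001	Homo sapiens	1020
```

In this toy example the XML weighs roughly four times as much as the delimited row, and the gap only widens with volume: the header line is paid for once, while the tags are paid for on every record.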
XML’s complexity inflicts misery on both sides of the data divide: on the publishing side, developers struggle to comply with the latest edicts of a fussy standards group, while on the consuming side, data suitors labor to unravel that XML format into something they can use.
Three Rules for XML Rebels
I. Stop Inventing New Formats (as Tim Bray said in 2006)
Before you call for “an XML format for X”, let me tell you a story about LaTeX and MathML. (And while these are document formats, there’s a lesson here for data).
The LaTeX typesetting system is the lingua franca for composing scientific documents. As the million-plus LaTeX-formatted articles on arXiv.org attest, it is spoken by scientists worldwide.
MathML, on the other hand, is a markup language for mathematics recommended by the W3C. If you’re a scientist looking to use MathML, you have two choices: (i) find a program to convert LaTeX, which you already know, to MathML 3.0, or (ii) familiarize yourself with this handy 354-page spec and code it yourself.
Two years ago, Mike Adams thought of a third way: why not just let people use LaTeX directly in WordPress? So he wrote a plug-in that did it. The applause was deafening.
Spoken languages are strengthened by usage, not by imperial fiat, and data formats are no different. Far better to evolve and adapt the standards we already have (as JSON and SQLite’s file format do) than to fabricate new ones from whole cloth. As Jon Udell says, “good-enough solutions [that are] here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.”
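To illustrate, the hypothetical record from earlier needs no new standard at all; plain JSON, already parseable by off-the-shelf libraries in every major language, does fine:

```json
{ "accession": "AB000001", "taxon": "Homo sapiens", "seqLength": 1020 }
```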
II. Obey The Fifteen Minute Rule
Interviewed several years ago, James Clark stated, “If a technology is too complicated, no matter how wonderful it is and how easy it makes a user’s life, it won’t be adopted on a wide scale.”
Accordingly, if you absolutely must develop a new API, language, or format, it should satisfy a simple rule: a person of reasonable ability should be able to get from zero to ‘Hello World’ in fifteen minutes. (This does not preclude complex languages or formats per se; it does require that any additional complexity not be sui generis, but built on some existing foundation.)
Despite a noble vision for the semantic web, the barriers to adopting the W3C’s proposals for linked data are too high. The beauty of the original HTML standard was that it was dead simple. The flaw of RDF is that it is too hard.
III. Embrace Lazy Data Modeling
Lazy data modeling works like lazy evaluation: defer the work until its result is actually needed. The right schema for data depends on future use cases, in as-yet-undeveloped applications. Instead of trying to guess the future, we can store the data “as-is” and deal with its transformation when (and if) a necessary use case arises. As Michael Franklin and colleagues note, “the most scarce resource available for semantic integration is human attention.”
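As a sketch of what lazy modeling can look like in practice (a toy example; the database and table names are hypothetical): land the raw records verbatim in SQLite, and defer any parsing or schema decisions until an application actually asks for them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Land the raw records untouched -- no schema guessing up front.
my $dbh = DBI->connect("dbi:SQLite:dbname=raw.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE IF NOT EXISTS raw_records (line TEXT)");

my $insert = $dbh->prepare("INSERT INTO raw_records (line) VALUES (?)");
while (my $line = <>) {
    chomp $line;
    $insert->execute($line);    # store as-is; transform later, if ever
}
$dbh->commit;
```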
This liberal view also reduces barriers to data sharing, barriers that threaten initiatives like Data.gov. The US Census Bureau shouldn’t expend resources publishing in XML if it has a good-enough format available right now.
For the data geeks in the trenches, who are building the next generation of data services, the laws of economics hold fast: there are unlimited opportunities in the face of one limited resource, time (which also explains why data geeks seem to get no sleep).
XML’s unfulfilled promise for data testifies that formats can create friction. The easier it is for data to be shared and consumed, the more quickly we’ll realize our visions for smarter businesses and better governments.
(25-Aug-2009 Update: Read a response from open gov advocates at Sunlight Labs).