How XML Threatens Big Data


Confessions from a Massive, Nightmarish Data Project

Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like GenBank. The problem was that the repositories, each holding millions of records, all had their own formats. It sounded like a massive, nightmarish data interoperability project. And an ideal fit for a hot new technology: XML.

So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (“taxon” or “species”? attribute or element?). At night I dreamt in ontologies. It was perfect.

Then reality struck. The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM.

We had a deadline to meet (and, mon dieu, a 35-hour work week). So we changed course. We hacked our Perl scripts to emit a flat tab-delimited format — “TabML” — which was bulk-loaded into Oracle. It wasn’t elegant, but it was fast and it worked.
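
For the curious, here is a minimal sketch of that kind of conversion, written in Python rather than our original Perl: stream the XML and emit one tab-delimited row per record, so that nothing larger than a single record sits in memory. The element and field names (“entry”, “accession”, “taxon”, “sequence”) are hypothetical stand-ins, not the real schemas we worked with.

    import sys
    import xml.etree.ElementTree as ET

    FIELDS = ["accession", "taxon", "sequence"]    # hypothetical field names

    def xml_to_tabml(xml_path, out=sys.stdout):
        out.write("\t".join(FIELDS) + "\n")        # state the data model once, as a header
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "entry":                # one record per (hypothetical) entry element
                row = [(elem.findtext(field) or "") for field in FIELDS]
                out.write("\t".join(row) + "\n")
                elem.clear()                       # drop the subtree we just consumed

The flat file this emits is exactly the sort of thing a database bulk loader is happy to swallow, which is where the speed came from.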

Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov.

In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity. Finally, I generalize to three rules that advocate a more liberal approach to data.

Three Reasons Why XML Fails for Big Data

I. XML Spawns Data Bureaucracy

In its natural habitat, data lives in relational databases or as data structures in programs. The common import and export formats of these environments do not resemble XML, so much effort is dedicated to making XML fit. When more time is spent on inter-converting data — serializing, parsing, translating — than in using it, you’ve created a data bureaucracy.

Indeed, it was what Doug Crockford called “impedance mismatch inefficiencies” that sparked him to create JSON – standardizing JavaScript’s object notation as a portable data container.
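
A toy example shows why (the record below is invented): JSON maps directly onto the data structures a program already holds in memory, so moving data in and out is a single call each way, not a translation layer.

    import json

    record = {"accession": "AB000001", "taxon": "Homo sapiens", "length": 1191}
    wire = json.dumps(record)    # serialize: one call
    back = json.loads(wire)      # parse: one call, straight back into a native dict
    assert back == record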

II. Yes, Size Matters for Data

Size matters for data in a way it does not for documents. Documents are intended for human consumption and have human-sized upper bounds (a lifetime’s worth of reading fits on a thumb drive). Data designed for machine consumption is bounded only by bandwidth and storage.

XML’s expansiveness — for even when compressed, the genie must be let out of the bottle at some point — imposes memory, storage, and CPU costs.

III. Complexity Carries a Cost

I never fail to sigh when I open a data file and discover an army of tags, several ranks deep, surrounding the data I need. XML’s complexity imposes costs without commensurate benefits, specifically:

  • In-line, element-by-element tagging is redundant. Far preferable is stating the data model separately, and using a lightweight delimiter (such as a comma or a tab); a small illustration follows this list.
  • Text tags are purported to be self-documenting, but textual meaning is a slippery thing: it’s rare that one can be sure of a tag’s data type without consulting its DTD (in a separate document).
  • End-tags support nested structures (such as an aside (within (an aside))). But to facilitate data exchange, flattened-out structures are preferable, and arbitrary levels of nesting are best used sparingly.
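
To make the first point concrete, here is a tiny, hypothetical comparison (field names and values are invented): the same record carried by per-element tags versus a lightweight delimiter, with the data model stated once in a header.

    xml_row = ("<entry><accession>AB000001</accession>"
               "<taxon>9606</taxon><length>1191</length></entry>")
    header  = "accession\ttaxon\tlength"    # the data model, stated once per file
    tab_row = "AB000001\t9606\t1191"        # the data, one row per record

    print(len(xml_row), len(tab_row))       # the markup, not the data, is most of the bytes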

XML’s complexity inflicts misery on both sides of the data divide: on the publishing side, developers struggle to comply with the latest edicts of a fussy standards group, while on the consuming side, data suitors labor to unravel that XML format into something they can use.

Three Rules for XML Rebels

I. Stop Inventing New Formats (as Tim Bray said in 2006)

Before you call for “an XML format for X”, let me tell you a story about LaTeX and MathML. (And while these are document formats, there’s a lesson here for data).

The LaTeX typesetting system is the lingua franca for composing scientific documents. As the one-million plus LaTeX-formatted articles on arXiv.org attest, it is spoken by scientists worldwide.

MathML, on the other hand, is a markup language for mathematics recommended by the W3C. If you’re a scientist looking to use MathML, you have two choices: (i) find a program to convert LaTeX, which you already know, to MathML 3.0 or (ii) familiarize yourself with this handy 354-page spec and code it yourself.

Two years ago, Mike Adams thought of a third way: why not just let people use LaTeX directly in WordPress? So he wrote a plug-in that did it. The applause was deafening.

Spoken languages are strengthened by usage, not by imperial fiat, and data formats are no different. Far better to evolve and adapt the standards we already have (as JSON and SQLite’s file format do) than to fabricate new ones from whole cloth. As Jon Udell says, “good-enough solutions [that are] here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.”

II. Obey The Fifteen Minute Rule

Interviewed several years ago, James Clark stated “If a technology is too complicated, no matter how wonderful it is and how easy it makes a user’s life, it won’t be adopted on a wide scale.”

Accordingly, if you absolutely must develop a new API, language, or format, it should satisfy a simple rule: a person of reasonable ability should be able to get from zero to ‘Hello World’ in fifteen minutes. (This does not preclude complex languages or formats, per se: it does require that additional complexity not be sui generis, but built on some existing foundation, for example.)
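
For contrast, here is roughly what passing the rule feels like: a newcomer’s first minutes with a format like JSON need only a few self-explanatory lines (the sample data is invented).

    import json

    records = json.loads('[{"name": "hello", "value": 1}]')   # any JSON text will do
    print(records[0]["name"])    # zero to ‘Hello World’, with minutes to spare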

Despite a noble vision for the semantic web, the barriers to adopting the W3C’s proposals for linked data are too high. The beauty of the original HTML standard was that it was dead simple. The flaw of RDF is that it is too hard.

III. Embrace Lazy Data Modeling

To keep data bureaucracy to a minimum, several Big Data thinkers have advocated a more catholic approach to data: building data stores that accommodate a broad range of data types and formats.

Lazy data modeling is similar to lazy evaluation. The right schema for data depends on future use cases, in as-yet-undeveloped applications. Instead of trying to guess the future, we can store the data “as-is” — and deal with its transformation when (and if) a necessary use case arises. As Michael Franklin and colleagues note: “the most scarce resource available for semantic integration is human attention.”
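
A minimal sketch of what lazy data modeling can look like in practice (the table, field names, and query are invented for illustration): land each record as-is, and write a transformation only when a concrete question shows up.

    import json
    import sqlite3

    db = sqlite3.connect("staging.db")
    db.execute("CREATE TABLE IF NOT EXISTS raw (source TEXT, payload TEXT)")

    def land(source, record):
        # store the record exactly as received; no upfront schema design
        db.execute("INSERT INTO raw VALUES (?, ?)", (source, json.dumps(record)))

    def taxa_by_source():
        # written later, when (and if) someone actually asks this question
        rows = db.execute("SELECT source, payload FROM raw").fetchall()
        return {(src, json.loads(payload).get("taxon")) for src, payload in rows}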

This liberal view also reduces barriers for data sharing, barriers which threaten initiatives like Data.gov. The US Census Bureau shouldn’t expend resources to publish in XML if they have a good-enough format available right now.

For the data geeks in the trenches, who are building the next generation of data services, the laws of economics hold fast: there are unlimited opportunities in the face of one limited resource, time. (Which also explains why data geeks seem to get no sleep).

XML’s unfulfilled promise for data testifies that formats can create friction. The easier it is for data to be shared and consumed, the more quickly we’ll realize our visions for smarter businesses and better governments.

(25-Aug-2009 Update: Read a response from open gov advocates at Sunlight Labs).

COMMENTS

60 Responses to “How XML Threatens Big Data”

  1. The Rise of the Data Web : Dataspora Blog on August 23rd, 2009

    […] How XML Threatens Big Data […]

  2. Is XML bad for big data? on August 23rd, 2009

    […] Driscoll continues his attack against XML for Big Data. He points out three reasons why XML and Big Data are strange […]

  3. Ian Davis on August 23rd, 2009

    I’m interested to hear your thoughts on linked data and rdf in this regard. To me the problem xml creates is a new data model for every schema whereas all rdf has a single data model.

  4. Chris Davis on August 23rd, 2009

    Is RDF really that hard? On one hand, yes I do find the RDF/XML format to be not exactly user friendly. However, what I’ve found to be tremendously helpful are the different formats of RDF. For example, I’ve become a big fan of the N-Triples format since it allows me to just dump out statements in the form of “subject predicate object”. No nesting or “army of tags”. In other words, I just generate files containing only three columns. This sounds similar to the “TabML” format you created. For me, this format definitely passed the fifteen minute test, and has proven to be much easier to read than RDF/XML.

  5. David Knell on August 23rd, 2009

    It’s not just MathML – the W3C are also responsible for VoiceXML, used for defining interactions with IVRs. The problem would appear to be that they’re designing standards for problems which are outside of their core competency, and building them on inappropriate foundations – such as XML.

    –Dave

  6. Andrew Wooster on August 23rd, 2009

    VoiceXML was designed by people who were actually building next-gen IVR systems, such as Tellme. It was very much within the core competency of the people building it, but may not necessarily be the easiest or the most flexible platform due to its complexity.

  7. Carolus Holman on August 23rd, 2009

    I thought I was going insane. I have been plunging into the APIs of Google Analytics, YouTube and Twitter this past week. Funny thing, I too could see the data I needed in the XML document. Since I haven’t worked with XML for a long time, I was thinking about how easy it was supposed to be. Early on, with Visual Studio 2003, one could derive a schema from the document with a few simple clicks; now, with multiple namespaces, most software tools choke, and the answer was — gag — break the XML into separate files, strip the namespaces, and then process the documents.

    My solution to this madness was XML to LINQ to SQL. SQL Server being the final resting place for my data. I will have to commend MS for creating LINQ, though I can’t say whether or not they actually created it or purchased another company.

    I appreciate your article, it’s good to see (well good in a bad way) that even seasoned veterans have frustrations with this stuff.

  8. Gary on August 23rd, 2009

    I work for the Census Bureau; the US Census Bureau is a bureau, not a department, dammit!

    “Far preferable is stating the data model separately” is an interesting statement. Web pages are “properly” split into data (*ml), behavior (js) and formatting (css). Why shouldn’t data sets?

  9. Carolus Holman on August 23rd, 2009

    One other comment, I suppose talking about MS on the blog may garner some flames. I just use the best tool I can understand for the job.

    I am interested in the Open Source world I just find that entry into it is usually walled up behind esoteric terminology. If anyone has some primers on how to get started with Open Source source analytics, data modeling etc. without the need for writing your tools please let me know.

    Thanks.

  10. Nestor on August 23rd, 2009

    It was your fault to choose XML. XML does not fail. Stupid programmers/decision makers fail. XML stands for “eXtensible Markup Language”; it is for marking, that is, for structuring data with no structure or with complex, not regular structure. Texts, nested data structures… But not data with regular structure like tables.

  11. How XML Threatens Big Data : Dataspora Blog « Netcrema – creme de la social news via digg + delicious + stumpleupon + reddit on August 23rd, 2009

    […] How XML Threatens Big Data : Dataspora Blogdataspora.com […]

  12. Robin on August 23rd, 2009

    …”the repositories, all with millions of records, “…
    …”The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM”…

    Sounds to me like you forgot to research your technology of choice and got burned as a result. Any reason why you couldn’t use a stream-based processing API, like SAX?

    Now, because of your own bad design decisions, you’re attacking XML. Nice. You even get as far as admitting that XML was the wrong choice of tech *in this case*, but don’t admit it’s better in other circumstances, like data interchange.

    I just don’t understand why certain programmers are so “anti-xml”. It’s like a carpenter being “anti-hammer”…

  13. Michael E. Driscoll on August 23rd, 2009

    @ChrisDavis – Point taken on the greater simplicity of RDF’s non-XML variant, perhaps it will gain adoption. But, for example, this Gene Ontology RDF project I came across looked like a lot of pain to implement, and appears inactive.

    @Gary – Census Dept –> Census Bureau – corrected!

    @Nestor – Indeed, I was foolish here to choose XML :), but genomics data is rather complex — and not particularly amenable, at first glance, to TabML.

    @Robin – I am not anti-hammer, I believe XML has its place: for documents, not data, nor even data interchange. (I actually did use a stream-based processing API — James Clark’s expat — but it was still comparably slow).

  14. William Morgan on August 23rd, 2009

    Long before Mike Adams, many people, including myself, have been enabling LaTeX math markup to the web, either by converting it to images or by translating it into MathML. I wrote a library in 2005; the original itex2MML on which it was based dates from at least 2001.

    Equation markup is not really a data format, so I’m not sure it belongs in the argument above. That said, since MathML still isn’t supported by major browsers more than 12 years since its inception, I consider it a failed standard, and I regret having bought into it. (Read the MathML spec some time for a great example of data bureaucracy in action.)

  15. Mike Bergman on August 23rd, 2009

    I agree that RDF and linked data are difficult and complex to author. That, however, does not make them poor candidates for the canonical representation for underlying data models and schema.

    I also agree that catholic approaches to data formats are appropriate. You may want to see my ‘Structs’: Naïve Data Formats and the ABox posting, where I argue that it is fine (and to be expected!) to use whatever data struct you like depending on your purpose.

    Data format purists are oh, so boring. XML has its place, as does JSON, BibTeK, CSV and RDF/N3. Go for it!

  16. Bob on August 23rd, 2009

    I disagree. If you’re going to present data to the public, it’s much better to have it all in a format that is immediately available to everyone. Otherwise there will be lots of duplicated work.

    Also, compare this to RFC, the standards of the internet. They are still written in 7bit ASCII. The first one was written in 1969 and is still readable as-is in every reader program on the planet. Can you say the same of *any* document format other than ASCII?

    XML is aiming to be the equivalent for data. In 40 years I will be able to use my standard python (or cobra or whatever language I’ll use) XML library and just process it.

    What am I supposed to do with a .XLS document? who will be able to read *that* in 40 years?

    What if I need data that’s in excel format, and some word documents, and a CSV-file, and a colon-separated data file (ala /etc/passwd)? Data hell.

    That being said, I prefer yaml, but it has a 1:1 mapping to XML, so it should be safe too.

  17. wial on August 23rd, 2009

    ETL tools like open source Pentaho Kettle are the way to go in this problem space, aren’t they?

  18. Richard Durr on August 23rd, 2009

    Maybe portable objects are the solution. Refer:
    http://gagne.homedns.org/~tgagne/contrib/EarlyHistoryST.html#4

  19. Rich Morin on August 23rd, 2009

    I have had great success with using a subset of YAML (basically, JSON) as a way to encode intermediate data files in a human-readable form. IMHO, It’s much easier to read than XML. It’s also a much better match than XML for the fundamental data structures (eg, scalars, lists, hashes) of the languages I use.

  20. Edward on August 23rd, 2009

    Interesting to see what people with lots of data do. The weather and satellite community passes around large datasets as NetCDF files (eg http://en.wikipedia.org/wiki/NetCDF). This is a self-describing format, engineered to be compact. Only need an 8-bit integer … no problem. I am guessing that in this world Big Data has never given XML much attention …

  21. Bob Foster on August 23rd, 2009

    A few years ago I was thinking we’d have to wait until the entire generation of programmers who grew up in the great wave of XML hype died out before we would regain our common sense. Your comments make me more hopeful.

    (No, Robin, being anti-XML isn’t like a carpenter being anti-hammer. It’s like a carpenter who doesn’t want to carry a 50-pound nail gun air compressor up a ladder to fix a loose shingle.)

  22. Dan Brickley (danbri) ’s status on Monday, 24-Aug-09 11:07:10 UTC – Identi.ca on August 24th, 2009

    […] xml critique in http://dataspora.com/blog/xml-and-big-data/ – too casually dismisses tech (rdf) that allows syntax-level […]

  23. Dan Brickley on August 24th, 2009

    For me, one of the great appeals of RDF is that it allows different parties to take quite radically different choices regarding concrete syntax, while allowing their data to be re-integrated later. You might use SQL, others might use XML; that’s just fine. RDF has db-to-rdf mapping tools that take tabular data and either map it into RDF triples, or convert RDF SPARQL queries into SQL on the fly. For XML, we have the GRDDL spec which says how you can document non-RDF XML using XSLT transformations. And so on. RDF (and related technologies) just encourage you to do a bit of documenting – eg saying which classes you have are mutually disjoint, which properties take single values or are uniquely identifying, etc. But it doesn’t force a syntax on you – whether xml-based or otherwise. I’m not arguing that it’s painless, just that one of the goals of RDF is to allow just the kind of diversity you argue for, but trying at the same time to minimise the data-fragmentation cost of everyone doing it their own way…

  24. James on August 24th, 2009

    I’d also add in that XML was not designed for (generic) data, it was for documents. Hence the breakage in unicode (text is a strict subset of Unicode), mixed content (argh… we need a schema to parse!?!), namespace design,….

  25. Jeff H. on August 24th, 2009

    Wow!

    Your biotech experience paralleled mine circa 2000. I was working for a large discount brick-n-mortar bookstore which had jumped online just a few years earlier. A giant big-box discount retailer was coming on-line and needed book fulfillment, and they settled on our company. They’d deliver us XML book orders, and we’d send them back XML order status updates. Fortunately, I was pretty adept at scouring CPAN and ran across XML::Twig, an event based XML parser, so we could both consume the huge orders expressed in gigabyte files and produce order status files of the same size.

    Only the big box retailer’s database couldn’t load them.

  26. Guy Murphy on August 24th, 2009

    Why didn’t you use a SAX parser? You wouldn’t have had to load the full tree into memory, and your slow memory hog… well, wouldn’t have been.

    If you had been using event based parsing I suspect your project would have behaved quite differently.

  27. Eric N. on August 24th, 2009

    I strongly agree with your assessment of XML – in fact, with the support it got from major vendors, it severely impeded most data integration efforts for bioinformatics and genomics, and possibly even innovation within pharmaceutical companies. However, regarding RDF you are incorrect to assume it is too hard to use: most examples in bioinformatics show RDF (i.e., N3) is no more complicated than JSON. I propose an open challenge to illustrate this – game?

  28. XML and Data on August 24th, 2009

    […] said, it was refreshing to read that someone else is apprehensive about XML as a “big data” […]

  29. XML y grandes volúmenes de datos « Javier Aroche on August 24th, 2009

    […] y grandes volúmenes de datos How xml threatens big data. Básicamente XML no es para grandes volúmenes de datos (por el tamaño y la complejidad para […]

  30. Peter B on August 24th, 2009

    I believe XML has a place for data interchange: those cases where there are many producers and few consumers, and the benefits of easily validating user-submitted data feeds outweigh the pain.

    The example I’m thinking of? Submitting data to property listing websites. Most sites use delimited formats and having worked with many I would swap them for XML any day. XML can be validated (oh so I made a mistake there!), which leads to the specification being concrete (hopefully – I hate guessing ambiguous clauses); copes with line breaks in the data; and has supported tools available.

    I guess this is outside your ‘big dataset’ situation, but I feel XML is the best format for these situations.

  31. Roy Hayward on August 24th, 2009

    Awesome! I have always said, “XML makes your big data bigger. No not more data, just more space.”

    Of course this gets me dirty looks from all of the XML fans I work with.

    And the other two questions I keep having as our XML fans tout it as a “human readable” format: (1) aren’t I building this to integrate two computer systems? (2) since I can read EDI with my un-aided eye/brain, does that make me …. borg?

    I think XML has its place. But I think that it is like a data format celebrity. From the attention it gets one would think XML had done an interview on Oprah or something.

  32. SDC on August 24th, 2009

    The classic quip from Slashdot sums it up well, I think: ‘XML is like violence. If a little doesn’t work, try using some more’.

    All joking aside I agree 100%. XML is overly verbose, strains at the limits of human readability (I can usually get the gist of JSON; XML is so full of ‘non-data’ it’s a pain to read), and generally is just good for exchanging bits of info or for documents.

  33. Konrad Förstner (konrad) ’s status on Tuesday, 25-Aug-09 11:55:09 UTC – Identi.ca on August 25th, 2009
  34. Rick J. Wagner on August 25th, 2009

    Nice article. You’re right, XML is not appropriate for big data tasks. Some of us need reminders of this once in a while….

    Rick

  35. Jeffrey Stewart on August 25th, 2009

    I have two reactions to this article. First, just because you have a hammer does not mean that every problem is a nail. XML is not the end-all, be-all. It is the right solution for the right problems. And large datasets that do not require interaction with other systems are not the right problem.

    My second comment is you cannot judge things in isolation. You can only judge them in comparison to the alternatives. When you take a look at what was available before XML came on the scene, you find more expensive tools, more complexity in representation, less interoperability and thus less ubiquity.

    Good comments. Good article. But tell us what your alternatives are and offer comparative judgments.

  36. Robert Miner on August 25th, 2009

    In my view, you have misunderstood the purpose of MathML, and hence your criticism that it was unneeded and shouldn’t have been created is erroneous. MathML is intended to encode mathematics in a granular, explicit way suitable for machine processing. It isn’t intended for authors to use directly. Thus it is unsurprising that LaTeX (which was designed to optimize hand authoring in ASCII editors) is popular with researchers in contexts like WordPress. However, it is a mistake to conclude MathML is unsuccessful or unused.

    Most major STEM publishers now use XML-based workflows using MathML for their journals. LaTeX, as a Turing-complete programming language, is ill-suited to the kind of standardization and validation required in a high-volume publishing operation, and as a consequence, most of the 1 million LaTeX preprints on the arXiv that make it into print are converted to XML+MathML.

    Similarly, MathML is widely supported by math-aware software. By virtue of being standard (which LaTeX isn’t) and explicit and low-level (which LaTeX also isn’t), MathML is well-suited as an import/export format. The design team for the Math Input Panel, which is being introduced as a new accessory in Windows 7, drew the same conclusion. The Math Input Panel does handwriting recognition of mathematics, for use in other applications. Its only data format? MathML.

    Another area where MathML excels is in accessibility. In MathML, the logical equation structure is explicitly encoded, which greatly facilitates voice rendering. MathML has been incorporated by reference into the DAISY digital talking book specification, which will soon be a mandatory format for textbook publishers to produce in the US.

    My point here is not that MathML is better than LaTeX — LaTeX is great for the uses it was designed for. However, offering MathML as an example of an XML language that wasn’t needed and isn’t used is incorrect. It was designed to do a better job than the existing alternatives (including LaTeX) in terms of accessibility, validation, and explicit encoding of logical expression structure, among other things. In those areas, it is successful and widely used, as the three examples above indicate.

    I share the frustration expressed by William Morgan above that MathML has not succeeded in browsers. But I don’t think that invalidates the whole standard. Of course, a major factor was that MathML is XML, and XML hasn’t succeeded in browsers either. So to that extent, I certainly agree with you that XML is not the panacea it was originally claimed to be.

  37. GarryJR on August 25th, 2009

    well said. Excellent talking points which will help with some headaches we are having RIGHT NOW.

  38. Larry R on August 26th, 2009

    I believe that your problem is not with XML, it is where you are storing it. I would agree that if you use a relational database with hierarchical data then you spend an incredible amount of processing power trying to do all the conversions. However, if you use a database that is designed specifically for the storage of XML data then you will find that it performs very nicely.

    I think that if you explore some databases like Mark Logic or XHive you would find that not only is it easy to store, you would also get lots of great search features. So in short, it’s the storage, not the data.

  39. Cactus Acide » » L’observatoire du neuromancien 08/26/2009 on August 26th, 2009

    […] How XML Threatens Big Data : Dataspora Blog […]

  40. A great post on XML and big data « Small Business+Phoenix+Software on August 26th, 2009

    […] big data I’ve had the pleasure of working in healthcare, XML,  and big data, but this article also reminds me a LOT of my past data warehouse projects…especially the data bureaucracy […]

  41. Michael Kay on August 27th, 2009

    I think you’ve revealed why your project failed when you claim “In its natural habitat, data lives in relational databases”.

    No it doesn’t. Relational databases are a highly artificial habitat. But anyone whose mindset is so conditioned by relational thinking is going to fail if they try to force-fit a technology that needs a different approach.

  42. Paddy3118 on August 27th, 2009

    The good:
    XML is not a binary format.
    The bad:
    In a lot of cases, it is as readable as binary ;-)

    In “the good old days”, a good data format was easily AWKable, and scripts were much easier to write to help it on its way. (Unfortunately, in those same days, a lot of the data was in proprietary binary formats).

    – Paddy.

  43. Structured Methods › links for 2009-08-27 on August 27th, 2009

    […] How XML Threatens Big Data : Dataspora Blog Michael E. Driscoll writes: Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov. […]

  44. Michael Nielsen » Biweekly links for 08/28/2009 on August 28th, 2009

    […] How XML Threatens Big Data : Dataspora Blog […]

  45. Stephen on August 29th, 2009

    I’ve been using delimited files forever. But one customer was doing business with Germany, and the double S character used the ASCII tilde character, which happened to be the delimiter. That kind of experience moved me to use the more complicated CSV format with optional double quotes, which can also be quoted. The result is a format that can transmit any 2-D data. My C libraries even support newlines in fields. And a nice side effect is that most spread sheets can import them. And, my library supports an optional “first line description”. That is, the first line of data is optionally column names. And, my library does not require the entire data set to be read into RAM. This “first line” bit essentially gives the advantages of a separate description with the advantages of keeping a single file. It allows import and export from SQL database tables. It allows human readable presentation. And, despite being entirely dynamic (no size limits), it’s reasonably fast. At least it’s not the bottleneck. Well, it was developed on a 386…

    XML is much more complex, and can represent more dimensions than 2. But since 2-D data and even 1-D data is the norm, one might expect XML to be confined to unusual cases. Or, one might expect spreadsheets to import XML. We’ve already seen word processors, etc., support XML…

  46. Stephen on August 29th, 2009

    To some extent, XML is a repeat of the Word Processor joke. Here’s the original:

    The great thing about Word Processors is that with them, you can easily cut and paste text, and move things around. The bad thing about Word Processors is that you can easily cut and paste text, and move things around all day long.

    XML is so flexible that one is tempted to instrument the crap out of the data.

    But i have seen database tables set up where a table column has the identifier for the data. I’ve even seen such tables with an identifier, a type, and then several columns of various types. So each row has only one column of real data. All the description for the data is redundant. And, i’ve even seen 2-D data set up this way. Now, most database tables aren’t set up this way. It’s slow, and the data size is huge. But when you have new column types periodically, and you can’t predict what they might be, what do you do? I call this form ultra-normalised. And, i’ve had to denormalize such data. It turned out to be 7 dimensional.

  47. Ryan Schneider on August 30th, 2009

    Sorry but I’m going to have to disagree with this article, mostly because the reasons the author gives for why XML failed for his big data.

    1) He said one of the things he spent his days doing was “writing parsers”. This shows ignorance of XML and the tools; nobody writes their own parser. It would be the equivalent of writing your own queue or map structure;

    2) XSLT is a poor choice for any XML files over a couple hundred kilobytes. SAX is the preferred method for fast and low memory XML processing. I use thousands of SAX filters chained together to process gigabytes of XML data with no issues.

    3) The XML functions of Oracle do suck, for lack of a better term. If we stick XML in a database it is done only as storage, no queries on it. We have an entire patented system to store/index XML data and do searches over it; sorry can’t go into much more detail than that.

  48. Ramblings along the narrow way » Blog Archive » links for 2009-09-01 on September 1st, 2009

    […] How XML Threatens Big Data : Dataspora Blog If a technology is too complicated, no matter how wonderful it is and how easy it makes a user’s life, it won’t be adopted on a wide scale. (tags: data database scalability programming software xml) […]

  49. Coast to Coast Bio Podcast » Blog Archive » Episode 26: Google, PLoS and NCBI get into bed together on September 3rd, 2009

    […] How XML threatens Big Data […]

  50. Kut Cagle on September 3rd, 2009
  51. Stephen Green on September 4th, 2009

    But maybe the quote –

    “James Clark stated ‘If a technology is too complicated, no matter how wonderful it is and how easy it makes a user’s life, it won’t be adopted on a wide scale.”

    – does say it all. People in some countries don’t like the fact that the English language is the modern equivalent of a ‘lingua franca’ but they still tend to have to use it in business, etc. The best language to speak isn’t the one most convenient to you, the speaker, but the one best understood by all your hearers. If the audience is wide then the most widely adopted language is the best to be speaking. If you are only speaking to yourself then your own language may work fine: hence storing data which only your own software will ‘see’ directly will be fine in ‘TabML’, JSON, whatever works well for you. When that data has to be shared with many diverse systems you need a ‘lingua franca’, and XML seems to be just that. Horses for courses.

  52. Anti-Hammer on September 23rd, 2009

    So, sorry if I skipped about 100 comments just to add my own, but it seems to me this is the same old quasi-religious debate that’s been going on since XML first came out in the late 90s.

    I’ll summarize

    1) I herd XML was tha shizznit
    2) I implemented it badly (really REALLY badly)
    3) Pick an excuse that you heard from somebody smart
    (XML is “too verbose” is the most common and least true…..)

    4) Conclude firmly that XML suxxors and tell EVERYBODY

    Seriously, people. Don’t blame your own shortcomings on the technology. It’s not XML’s fault that you don’t like it. It’s yours.

  53. harborpirate on September 23rd, 2009

    @Ryan Schneider:
    “1) He said one of the things he spent is days doing was “writing parsers”. This shows ignorance of XML and the tools, nobody writes their own parser; it would be the equivalent of writing your own queue or map structure”

    Nobody writes their own parser NOW. In 2000 they did. XML Parsers of the time were at best, primitive; at worst, slow and broken.

    I recall parsing XML in Java around that time was an absolutely miserable task. The language core did not include XML parsing, and third party libraries were just a morass asking to suck up entire days evaluating them only to find that they were fundamentally broken and/or slow as paint drying. Eventually I broke down and wrote my own, which, though incomplete, was sufficient to the task and orders of magnitude faster than the third party monstrosities.

    I use XML when appropriate, but I’ve learned through experience just how dangerous it can be when evangelized on a project by the uninformed.

    XML is just like any other technology or format: Good at some things, terrible at others. The key is in recognizing the latter and avoiding it like a plague. There is no question, XML used in the wrong circumstances is a project killer.

  54. FriendlyPrimate on October 2nd, 2009

    Wow….I’m with you Michael. I was a fan of XML for years. But I slowly started becoming disillusioned with it as I tried to force it to do things like map to data in relational databases and Java models. Then I found out about JSON, and I fell in love with it. Less verbose, easier/faster to parse, easier to map to common data structures, easier to read, etc… Now XML looks down-right ugly every time I see it.
    XML has a huge head start, but I really do believe that JSON is going to start catching up.

  55. anon_anon on October 3rd, 2009

    Loading an XML tree in memory is not necessarily a memory hog; have you heard of vtd-xml?

  56. Bill Conniff on October 12th, 2009

    New technologies are being developed to address the size and performance issues of XML. Xponent software’s XMLMax loads any size XML into a treeview using at most 20MB of memory and can do XSL transformations within the same memory limit. The CAX XML parser is a pull parser that can look backward through all parsed XML, thereby enabling any XML transformation with a fast pull parser and without memory constraints. Many vendors are offering better support for large XML. Native XML databases solve some or all of the problems you mention in some scenarios.

  57. Joseph Turian on November 2nd, 2009

    I agree with many of your points. However, XML is the lesser of evils when doing a MySQL dump.

  58. Jewel Ward on March 2nd, 2010

    In case you want to follow up on this idea, there is a symposium on XML for the long-haul; the CFP went out this past week.

    http://balisage.net/longhaul/

    Call for Participation: International Symposium on XML for the Long Haul
    Issues in the Long-term preservation of XML

    Monday 2 August 2010
    Hotel Europa, Montréal, Canada

    Chair: Michael Sperberg-McQueen, Black Mesa Technologies

  59. Jason Price on July 23rd, 2010

    I agree with many points you have stated. I deal with many different data vendors on a daily basis and am responsible for the strategy and development to incorporate outside content with internal content for our intelligence teams.

    XML has its place in small data transactions but when you have to load big data, I cringe when a vendor tells me it is in XML format.

  60. Terry Camerlengo on October 13th, 2010

    I agree with every single word in this article. I work in genomics and early on in my bioinformatics career worked on a project where we converted flat-tabular data into an XML format for storage in an XML database. Fun project, but the round-trips to and from XML were a nightmare… XML was a ridiculously silly format for large genomic datasets. It added complexity, and it was slow. That was 5 years ago and I have never considered XML as a storage medium since, and neither has the industry; there are no bioinformatic programs and standard formats for persisting genomics data that make use of XML. At least none that I know of. Great article.

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.

