“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.” — Envisioning Information, Edward Tufte, Graphics Press, 1990

Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.
Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.
While obsessing over how to implement color in the Dataspora Labs PitchFX viewer, I began with a basic motivating question:
Why use color in data graphics?
If our data are simple, a single color is sufficient, even preferable. For example, below is a scatter plot of 287 pitches thrown by the major league pitcher Oscar Villarreal in 2008. With just two dimensions of data to describe — the x and y location in the strike zone — black and white is sufficient. In fact, this scatter plot is a perfectly lossless representation of the data set (assuming no data points perfectly overlap).
Fig 1. Location of Pitches (Villarreal, HOU, 2008)
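For readers who want to reproduce this kind of plot, here is a minimal sketch in R using the lattice package (the toolkit behind the figures in this post, per the Methods section). The data frame name `pitches` and its location columns `px` and `pz` are hypothetical placeholders, not the actual dataset.

```r
# Minimal black-and-white scatter plot of pitch location.
# `pitches`, `px`, and `pz` are assumed/hypothetical names.
library(lattice)

xyplot(pz ~ px, data = pitches,
       pch = 1, col = "black",              # open circles, a single color
       xlab = "Horizontal location (ft)",
       ylab = "Vertical location (ft)",
       main = "Location of Pitches")
```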

But what if we’d like to know more: for instance, what kinds of pitches (curveballs, fastballs) landed where? Or their speed? Visualizations live in two dimensions, but the world they describe is rarely so confined.
The defining challenge of data visualization is projecting high-dimensional data onto a low-dimensional canvas. (As a rule, one should never do the reverse: visualize more dimensions than exist in the data.)
Getting back to our pitching example, if we want to layer another dimension of data — pitch type — into our plot, we have several methods at our disposal:
- plotting symbols – vary the glyphs we use (circles, triangles, etc.)
- small multiples – vary extra dimensions in space, creating a series of smaller plots
- color – color our data, encoding extra dimensions inside a color space
Which technique you employ depends on the nature of the data and the medium of your canvas. I will describe all three by way of example.
Multivariate Method I: Vary Your Plotting Symbols
Fig 2. Location and Pitch Type (Villarreal, HOU, 2008)

In this plot, I’ve layered the categorical dimension of pitch type into our plot by using four different plotting symbols.
I consider this visualization an abject failure. In fact, the prize for my most despised graphs in graduate school goes to bacterial growth curves rendered this way. These graphs make our heads hurt because (i) distinguishing glyphs demands extra attention (versus what academics call 'pre-attentively processed' cues like color), and (ii) even after we visually decode the symbols, we face yet another step: mapping symbols to their semantic categories. (Admittedly this can be improved with Chernoff faces or other iconic symbols, where the categorical mapping is self-evident.)
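For completeness, here is how such a symbol-per-category plot might be produced with lattice. As before, the `pitches` data frame and its columns are assumed names, and the four plotting characters are arbitrary choices.

```r
# Method I sketch: one glyph per pitch type via lattice's groups= argument.
library(lattice)

xyplot(pz ~ px, data = pitches,
       groups = pitch_type,
       par.settings = list(superpose.symbol = list(
           pch = c(1, 2, 3, 4),              # circle, triangle, plus, cross
           col = "black")),
       auto.key = list(space = "right"),     # legend maps glyphs back to categories
       main = "Location and Pitch Type")
```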
Multivariate Method II: Small Multiples on a Canvas
Folding additional dimensions into a partitioned canvas has a distinguished pedigree in information graphics. It has been employed everywhere from Galileo's sunspot illustrations to William Cleveland's trellis plots. And as Scott McCloud's unexpected tour de force on comics makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.
In the plot below, the four types of pitches that Oscar throws are splintered out horizontally. By reducing our plot sizes, we've given up some resolution in positional information. But in return, patterns that were invisible in our first plot and obscured in our second (by its varied symbols) are now made clear: Oscar throws his fastballs low, but his sliders high.
Fig 3: Location and Pitch Type (Villarreal, HOU, 2008)

Multiplying plots in space works especially well on printed media, which can hold more than ten times as many dots per square inch as a screen. Both columns and rows can be used to lattice over additional dimensions, the result being a matrix of scatter plots (in R, see the ‘splom‘ function).
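A sketch of both ideas in lattice follows: conditioning with `|` yields one panel per pitch type, and splom() produces a scatter plot matrix. Again, the `pitches` data frame and its column names are assumptions.

```r
# Method II sketch: small multiples via lattice conditioning.
library(lattice)

xyplot(pz ~ px | pitch_type, data = pitches,
       pch = 1, col = "black",
       layout = c(4, 1),                     # four panels in a single row
       main = "Location by Pitch Type")

# A scatter plot matrix over several numeric columns (names assumed):
splom(~ pitches[, c("px", "pz", "speed")], pch = 1, col = "black")
```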
Multivariate Method III: Color Your Data
So why bother with color?
First, compared to most print media, computer displays have fewer units of space but a broader color gamut, so color is a compensatory strength.
For multi-dimensional data, color can convey additional dimensions inside a unit of space — and can do so instantly. Color differences can be detected within 200 ms, before you’re even conscious of paying attention (the ‘pre-attentive’ concept I mentioned earlier).
But the most important reason to use color in multivariate graphics is that color is itself multidimensional. Our perceptual color space — however you slice it — is three-dimensional.
In the example below, I've used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I've chosen is a divergent palette that moves along one dimension (think of it as the 'redness-blueness' dimension) in the CIELUV color space, while maintaining a constant level of luminosity.
Fig 4. Location, Pitch Type, and Velocity (Villarreal, HOU, 2008)


Holding luminosity constant is important, because luminosity (similar to brightness) determines a color’s visual impact. Bright colors pop, and dark colors recede. A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.
I chose only seven gradations of color, so I'm downsampling (in a lossy way) our speed data – but finer segmentation of the color ramp is unlikely to be perceptible.
I’ve also chosen to use filled circles as my plotting symbol, as opposed to the open circles in all my previous plots. This is done to improve the perception of each pitch’s speed via its color: small patches of color are less perceptible. But a consequence of this choice — compounded by our choice to work with a series of smaller plots — is that more points overlap. We’ve further degraded some of our positional information. However, in our last step, we attempt to recover some of this.
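A sketch of this kind of palette in R: hcl() works in the HCL space, a polar form of CIELUV, so holding l (luminance) and c (chroma) fixed while sweeping the hue from blue toward red moves along a single 'blueness-redness' dimension. The specific hue endpoints, chroma, and luminance values here are illustrative assumptions, not the exact palette behind Figure 4.

```r
# Seven-step, constant-luminance palette from blue to red, with speed binned to match.
library(lattice)

pitches$speed_bin <- cut(pitches$speed, breaks = 7)      # 7 lossy gradations of speed
palette7 <- hcl(h = seq(260, 0, length.out = 7),         # hue: blue (260) -> red (0)
                c = 80, l = 60)                          # constant chroma and luminance

xyplot(pz ~ px | pitch_type, data = pitches,
       groups = speed_bin,
       par.settings = list(superpose.symbol = list(
           pch = 16,                                     # filled circles carry more color
           col = palette7)),
       layout = c(4, 1),
       main = "Location, Pitch Type, and Velocity")
```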
Now I’ve finally brought color to bear on this visualization, but I’ve only encoded a single dimension — speed. Which leads to another question:
If color is three-dimensional, can I encode three dimensions with it?
In theory, yes. Colin Ware researched this exact question. In practice, it’s difficult. It turns out that asking observers to assess the amount of ‘redness’, ‘blueness’, and ‘greenness’ of points is possible, but not intuitive (I suspect it’s somewhat like parsing symbols).
Another complicating factor is that a nontrivial fraction of the population has some form of color blindness. This effectively reduces their color perception to two dimensions.
And finally, the truth is that our sensation of color is not equal along all dimensions; it's thought that the closely related 'red' and 'green' receptors emerged via duplication of a single long-wavelength receptor (useful for telling ripe from unripe fruit, according to one just-so story).
Because of the high prevalence of dichromacy in the population, and because of the challenge of encoding three dimensions in color, I feel color is best used to encode no more than two dimensions of data.
So, for my last example of our pitching plot data, I will introduce luminosity as a means of encoding the local density of points (using a kernel density estimator). This allows us to recover some of the data lost by increasing the sizes of our plotting symbols.
Fig 5. Location, Pitch Type, Velocity, and Density (Villarreal, HOU, 2008)


Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis to denote speed, and luminosity varying along the other to denote local density.
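A sketch of how such a two-dimensional palette might be assembled: a 2-D kernel density estimate (MASS::kde2d) gives each pitch a local density, which is binned and mapped to luminance, while binned speed is mapped to hue as before. The grid size, bin counts, and lightness range are assumptions; note that lattice needs a panel function with subscripts so per-point colors are subset correctly within each panel.

```r
# Hue encodes speed; luminance encodes local pitch density.
library(MASS)       # kde2d: 2-D kernel density estimator
library(lattice)

dens <- kde2d(pitches$px, pitches$pz, n = 50)            # density on a 50 x 50 grid
ix   <- findInterval(pitches$px, dens$x)                 # nearest grid cell per pitch
iy   <- findInterval(pitches$pz, dens$y)
local_dens <- dens$z[cbind(ix, iy)]

hues <- seq(260, 0, length.out = 7)                      # blue -> red for 7 speed bins
lums <- seq(85, 45, length.out = 5)                      # lighter -> darker as density rises
speed_bin <- cut(pitches$speed, breaks = 7, labels = FALSE)
dens_bin  <- cut(local_dens,    breaks = 5, labels = FALSE)
cols <- hcl(h = hues[speed_bin], c = 80, l = lums[dens_bin])

xyplot(pz ~ px | pitch_type, data = pitches, layout = c(4, 1),
       panel = function(x, y, subscripts, ...) {
           panel.xyplot(x, y, pch = 16, col = cols[subscripts])
       },
       main = "Location, Pitch Type, Velocity, and Density")
```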
One final point about using luminosity. Observing colors in a data visualization involves overloading, in the programming sense. We rely on cognitive functions that were developed for one purpose (perceiving lions) and use them for another (perceiving lines).
Since we can overload color any way we want, whenever possible, we should choose mappings that are natural. Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth. Likewise, when sampling from the color space, we might as well choose colors found in nature. These are the palettes our eyes were gazing at for the millions of years before #FF0000 showed up.
Color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high dimensional data.
FutureMan Asks: What about Animation?
This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data. I've purposely neglected one very powerful tool: motion. The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization. But packing information into a time-varying data structure has to be done by someone (you or me), and in my view this remains a significant challenge. Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatter plots of the static world) are still a ways off, but frameworks like Processing and Prefuse are a promising start towards their development.
Methods
The final product of these five-dimensional pitch plots — for all available data for the 2008 season — can be explored via the Django-driven PitchFX web tool at Dataspora Labs.
All of the visualizations here were developed using R and the Lattice graphics package. (Of note, Hadley Wickham is developing ggplot2, a bold rewrite of the R graphics system based on a grammar of graphics.)