mining the tar sands of big data

The consequence of sensor networks, cloud computing, and machine learning is that the data landscape is broadening: data is abundant, cheap, and more valuable than ever. It’s a rich, renewable resource that will shape how we live in the decades ahead, long after the last barrel has been squeezed from the tar sands of Athabasca.

Read Full Post at GigaOm.

four lessons for building a petabyte platform

In this post I’ll share some of the thinking behind our choices for the Big Data stack that powers our petabyte platform, which consists of three layers: (i) a processing and storage substrate based around Hadoop and HBase, (ii) an analytics engine that mixes R, Python, and Pig, and (iii) a visualization console and data API built principally in Javascript.

Read Full Post at Metamarkets.

how xml threatens big data


Confessions from a Massive, Nightmarish Data Project

Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like Genbank. The problem was that each repository, with millions of records apiece, had its own format. It sounded like a massive, nightmarish data interoperability project. And an ideal fit for a hot new technology: XML.

So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (“taxon” or “species”? attribute or element?). At night I dreamt in ontologies. It was perfect.

Then reality struck. The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM.

We had a deadline to meet (and, mon dieu, a 35 hour work-week). So we changed course. We hacked our Perl scripts to emit a flat tab-delimited format — “TabML” — which was bulk loaded into Oracle. It wasn’t elegant, but it was fast and it worked.
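The original hack was in Perl, but the idea translates directly. Here is a minimal Python sketch of the same approach, with a made-up record layout: stream the XML (so no full document tree sits in RAM, unlike the XSLT pipeline) and emit flat, tab-delimited rows ready for bulk loading.

```python
import io
import xml.etree.ElementTree as ET

def xml_to_tab(xml_stream, fields, record_tag):
    """Stream XML records and emit tab-delimited lines ('TabML').

    iterparse holds only one record subtree in memory at a time,
    avoiding the full-document-tree cost of XSLT-style processing.
    """
    lines = []
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == record_tag:
            row = [(elem.findtext(f) or "") for f in fields]
            lines.append("\t".join(row))
            elem.clear()  # free the subtree we just consumed
    return lines

# Hypothetical genomics records, for illustration only.
xml_doc = b"""<genes>
  <gene><id>BRCA1</id><taxon>9606</taxon></gene>
  <gene><id>TP53</id><taxon>9606</taxon></gene>
</genes>"""

rows = xml_to_tab(io.BytesIO(xml_doc), ["id", "taxon"], "gene")
# rows == ["BRCA1\t9606", "TP53\t9606"]
```

Streaming keeps memory flat regardless of file size, and the flat output is exactly what a bulk loader wants.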

Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov.

In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity. Finally, I generalize to three rules that advocate a more liberal approach to data.

Three Reasons Why XML Fails for Big Data

I. XML Spawns Data Bureaucracy

In its natural habitat, data lives in relational databases or as data structures in programs. The common import and export formats of these environments do not resemble XML, so much effort is dedicated to making XML fit. When more time is spent on inter-converting data — serializing, parsing, translating — than on using it, you’ve created a data bureaucracy.

Indeed, it was what Doug Crockford called “impedance mismatch inefficiencies” that sparked him to create JSON – standardizing Javascript’s object notation as a portable data container.
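Crockford’s point is easy to demonstrate: a native data structure round-trips through JSON with no schema ceremony. A toy example in Python, with invented field names:

```python
import json

# A native data structure round-trips through JSON with no DTDs,
# no hand-written parsers, and no element-vs-attribute decisions.
record = {"id": "BRCA1", "taxon": 9606, "synonyms": ["RNF53", "BRCC1"]}

wire = json.dumps(record)     # serialize to a portable string
restored = json.loads(wire)   # parse it straight back into a dict

# Types survive the trip: the int stays an int, the list stays a list.
assert restored == record
```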

II. Yes, Size Matters for Data

Size matters for data in a way it does not for documents. Documents are intended for human consumption and have human-sized upper bounds (a lifetime’s worth of reading fits on a thumb drive). Data designed for machine consumption is bounded only by bandwidth and storage.

XML’s expansiveness — for even when compressed, the genie must be let out of the bottle at some point — imposes memory, storage, and CPU costs.
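A back-of-the-envelope illustration of that bulk, using a made-up two-field record:

```python
# The same two-field record, marked up versus delimited.
xml_row = "<gene><id>BRCA1</id><taxon>9606</taxon></gene>"
tab_row = "BRCA1\t9606"

# Element-by-element tagging inflates this record more than four-fold;
# at millions of rows the markup, not the data, dominates storage,
# bandwidth, and parse time.
overhead = len(xml_row) / len(tab_row)
```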

III. Complexity Carries a Cost

I never fail to sigh when I open a data file and discover an army of tags, several ranks deep, surrounding the data I need. XML’s complexity imposes costs without commensurate benefits, specifically:

  • In-line, element-by-element tagging is redundant. Far preferable is stating the data model separately, and using a lightweight delimiter (such as a comma or a tab).
  • Text tags are purported to be self-documenting, but textual meaning is a slippery thing: it’s rare that one can be sure of a tag’s data type without consulting its DTD (in a separate document).
  • End-tags support nested structures (such as an aside (within an aside)). But to facilitate data exchange, flattened structures are preferable, and arbitrary levels of nesting are best used sparingly.
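The first point is easy to make concrete: state the model once in a header row, and let a lightweight delimiter carry each record. A small Python sketch with invented fields:

```python
import csv
import io

# The data model is stated once, in a header row; each record then
# needs only a lightweight delimiter, not per-field tags.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["id", "taxon", "symbol"])       # schema, stated once
writer.writerow(["NM_007294", "9606", "BRCA1"])
writer.writerow(["NM_000546", "9606", "TP53"])

# Readers recover named fields without any inline markup.
buf.seek(0)
records = list(csv.DictReader(buf, delimiter="\t"))
# records[0]["symbol"] == "BRCA1"
```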

XML’s complexity inflicts misery on both sides of the data divide: on the publishing side, developers struggle to comply with the latest edicts of a fussy standards group, while on the consuming side, data suitors labor to unravel that XML format into something they can use.

Three Rules for XML Rebels

I. Stop Inventing New Formats (as Tim Bray said in 2006)

Before you call for “an XML format for X”, let me tell you a story about LaTeX and MathML. (And while these are document formats, there’s a lesson here for data).

The LaTeX typesetting system is the lingua franca for composing scientific documents. As the one-million-plus LaTeX-formatted articles on arXiv.org attest, it is spoken by scientists worldwide.

MathML, on the other hand, is a markup language for mathematics recommended by the W3C. If you’re a scientist looking to use MathML, you have two choices: (i) find a program to convert LaTeX, which you already know, to MathML 3.0, or (ii) familiarize yourself with this handy 354-page spec and code it yourself.

Two years ago, Mike Adams thought of a third way: why not just let people use LaTeX directly in WordPress? So he wrote a plug-in that did it. The applause was deafening.

Spoken languages are strengthened by usage, not by imperial fiat, and data formats are no different. Far better to evolve and adapt the standards we already have (as JSON and SQLite’s file format do) than to fabricate new ones from whole cloth. As Jon Udell says, “good-enough solutions [that are] here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.”

II. Obey The Fifteen Minute Rule

Interviewed several years ago, James Clark stated “If a technology is too complicated, no matter how wonderful it is and how easy it makes a user’s life, it won’t be adopted on a wide scale.”

Accordingly, if you absolutely must develop a new API, language, or format, it should satisfy a simple rule: a person of reasonable ability should be able to get from zero to ‘Hello World’ in fifteen minutes. (This does not preclude complex languages or formats, per se: it does require that additional complexity not be sui generis, but built on some existing foundation, for example.)

Despite a noble vision for the semantic web, the barriers to adopting the W3C’s proposals for linked data are too high. The beauty of the original HTML standard was that it was dead simple. The flaw of RDF is that it is too hard.

III. Embrace Lazy Data Modeling

To keep data bureaucracy to a minimum, several Big Data thinkers have advocated a more catholic approach to data: building data stores that accommodate a broad range of data types and formats.

Lazy data modeling is similar to lazy evaluation. The right schema for data depends on future use cases, in as-yet-undeveloped applications. Instead of trying to guess the future, we can store the data “as-is” — and deal with its transformation when (and if) a necessary use case arises. As Michael Franklin and colleagues note: “the most scarce resource available for semantic integration is human attention.”
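A minimal sketch of lazy modeling, using hypothetical log records: keep the raw lines as-is, and project out fields only when a downstream application asks for them.

```python
# Store records "as-is"; defer schema decisions until a use case arrives.
raw_store = [
    "2009-08-23|GET|/blog/xml-and-big-data|200",
    "2009-08-23|GET|/feed|304",
]

def project(lines, wanted):
    """Lazily apply a schema: pull out only the fields a new
    application actually needs, when (and if) it needs them."""
    field_names = ["date", "method", "path", "status"]
    for line in lines:
        rec = dict(zip(field_names, line.split("|")))
        yield {k: rec[k] for k in wanted}

# A later application decides it only cares about paths and statuses;
# no up-front modeling effort was spent on the fields it ignores.
view = list(project(raw_store, ["path", "status"]))
# view[0] == {"path": "/blog/xml-and-big-data", "status": "200"}
```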

This liberal view also reduces barriers for data sharing, barriers which threaten initiatives like Data.gov. The US Census Bureau shouldn’t expend resources to publish in XML if they have a good-enough format available right now.

For the data geeks in the trenches, who are building the next generation of data services, the laws of economics hold fast: there are unlimited opportunities in the face of one limited resource, time. (Which also explains why data geeks seem to get no sleep).

XML’s unfulfilled promise for data testifies that formats can create friction. The easier it is for data to be shared and consumed, the more quickly we’ll realize our visions for smarter businesses and better governments.

(25-Aug-2009 Update: Read a response from open gov advocates at Sunlight Labs).

COMMENTS

60 Responses to “How XML Threatens Big Data”

  1. The Rise of the Data Web : Dataspora Blog on August 23rd, 2009

    […] How XML Threatens Big Data […]

  2. Is XML bad for big data? on August 23rd, 2009

    […] Driscoll continues his attack against XML for Big Data. He points out three reasons why XML and Big Data are strange […]

  3. Ian Davis on August 23rd, 2009

    I’m interested to hear your thoughts on linked data and rdf in this regard. To me the problem xml creates is a new data model for every schema whereas all rdf has a single data model.

  4. Chris Davis on August 23rd, 2009

    Is RDF really that hard? On one hand, yes I do find the RDF/XML format to be not exactly user friendly. However, what I’ve found to be tremendously helpful are the different formats of RDF. For example, I’ve become a big fan of the N-Triples format since it allows me to just dump out statements in the form of “subject predicate object”. No nesting or “army of tags”. In other words, I just generate files containing only three columns. This sounds similar to the “TabML” format you created. For me, this format definitely passed the fifteen minute test, and has proven to be much easier to read than RDF/XML.

  5. David Knell on August 23rd, 2009

    It’s not just MathML – the W3C are also responsible for VoiceXML, used for defining interactions with IVRs. The problem would appear to be that they’re designing standards for problems which are outside of their core competency, and building them on inappropriate foundations – such as XML.

    –Dave

  6. Andrew Wooster on August 23rd, 2009

    VoiceXML was designed by people who were actually building next-gen IVR systems, such as Tellme. It was very much within the core competency of the people building it, but may not necessarily be the easiest or the most flexible platform due to its complexity.

  7. Carolus Holman on August 23rd, 2009

    I thought I was going insane. I have been plunging into the API’s of Google Analytics, YouTube and Twitter this past week. Funny thing, I too could see the data I needed in the XML document. Since I haven’t worked with XML for a long time, I was thinking about how easy it was supposed to be, early on with Visual Studio 2003, one could derive a schema from the document with a few simple clicks, now with multiple namespaces, most software tools choke, and the answer was—gag–break the XML into separate files strip the namespaces and then process the documents.

    My solution to this madness was XML to LINQ to SQL. SQL Server being the final resting place for my data. I will have to commend MS for creating LINQ, though I can’t say whether or not they actually created it or purchased another company.

    I appreciate your article, it’s good to see (well good in a bad way) that even seasoned veterans have frustrations with this stuff.

  8. Gary on August 23rd, 2009

    I work for the Census Bureau; the US Census Bureau is a bureau, not a department, dammit!

    “Far preferable is stating the data model separately” is an interesting statement. Web pages are “properly” split into data (*ml), behavior (js) and formatting (css). Why shouldn’t data sets?

  9. Carolus Holman on August 23rd, 2009

    One other comment, I suppose talking about MS on the blog may garner some flames. I just use the best tool I can understand for the job.

    I am interested in the Open Source world I just find that entry into it is usually walled up behind esoteric terminology. If anyone has some primers on how to get started with Open Source source analytics, data modeling etc. without the need for writing your tools please let me know.

    Thanks.

  10. Nestor on August 23rd, 2009

    It was your fault to choose XML. XML does not fail. Stupid programmers/decision makers fail. XML stands for “eXtensible Markup Language”: it is for marking, that is, for structuring data with no structure or with complex, irregular structure. Texts, nested data structures… But not data with regular structure like tables.

  11. How XML Threatens Big Data : Dataspora Blog « Netcrema – creme de la social news via digg + delicious + stumpleupon + reddit on August 23rd, 2009

    […] How XML Threatens Big Data : Dataspora Blogdataspora.com […]

  12. Robin on August 23rd, 2009

    …”the repositories, all with millions of records, “…
    …”The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM”…

    Sounds to me like you forgot to research your technology of choice and got burned as a result. Any reason why you couldn’t use a stream-based processing API, like SAX?

    Now, because of your own bad design decisions, you’re attacking XML. Nice. You even get as far as admitting that XML was the wrong choice of tech *in this case*, but don’t admit it’s better in other circumstances, like data interchange.

    I just don’t understand why certain programmers are so “anti-xml”. It’s like a carpenter being “anti-hammer”…

  13. Michael E. Driscoll on August 23rd, 2009

    @ChrisDavis – Point taken on the greater simplicity of RDF’s non-XML variant, perhaps it will gain adoption. But, for example, this Gene Ontology RDF project I came across looked like a lot of pain to implement, and appears inactive.

    @Gary – Census Dept –> Census Bureau – corrected!

    @Nestor – Indeed, I was foolish here to choose XML :), but genomics data is rather complex — and not particularly amenable, at first glance, to TabML.

    @Robin – I am not anti-hammer, I believe XML has its place: for documents, not data, nor even data interchange. (I actually did use a stream-based processing API — James Clark’s expat — but it was still comparably slow).

  14. William Morgan on August 23rd, 2009

    Long before Mike Adams, many people, including myself, have been enabling LaTeX math markup to the web, either by converting it to images or by translating it into MathML. I wrote a library in 2005; the original itex2MML on which it was based dates from at least 2001.

    Equation markup is not really a data format, so I’m not sure it belongs in the argument above. That said, since MathML still isn’t supported by major browsers more than 12 years since its inception, I consider it a failed standard, and I regret having bought into it. (Read the MathML spec some time for a great example of data bureaucracy in action.)

  15. Mike Bergman on August 23rd, 2009

    I agree that RDF and linked data are difficult and complex to author. That, however, does not make them poor candidates for the canonical representation for underlying data models and schema.

    I also agree that catholic approaches to data formats are appropriate. You may want to see my ‘Structs’: Naïve Data Formats and the ABox posting, where I argue that it is fine (and to be expected!) to use whatever data struct you like depending on your purpose.

    Data format purists are oh, so boring. XML has its place, as does JSON, BibTeK, CSV and RDF/N3. Go for it!

  16. Bob on August 23rd, 2009

    I disagree. If you’re going to present data to the public, it’s much better to have it all in a format that is immediately available to everyone. Otherwise there will be lots of duplicated work.

    Also, compare this to RFC, the standards of the internet. They are still written in 7bit ASCII. The first one was written in 1969 and is still readable as-is in every reader program on the planet. Can you say the same of *any* document format other than ASCII?

    XML is aiming to be the equivalent for data. In 40 years I will be able to use my standard python (or cobra or whatever language I’ll use) XML library and just process it.

    What am I supposed to do with a .XLS document? who will be able to read *that* in 40 years?

    What if I need data that’s in excel format, and some word documents, and a CSV-file, and a colon-separated data file (ala /etc/passwd)? Data hell.

    That being said, I prefer yaml, but it has a 1:1 mapping to XML, so it should be safe too.

  17. wial on August 23rd, 2009

    ETL tools like open source Pentaho Kettle are the way to go in this problem space, aren’t they?

  18. Richard Durr on August 23rd, 2009

    Maybe portable objects are the solution. Refer:
    http://gagne.homedns.org/~tgagne/contrib/EarlyHistoryST.html#4

  19. Rich Morin on August 23rd, 2009

    I have had great success with using a subset of YAML (basically, JSON) as a way to encode intermediate data files in a human-readable form. IMHO, It’s much easier to read than XML. It’s also a much better match than XML for the fundamental data structures (eg, scalars, lists, hashes) of the languages I use.

  20. Edward on August 23rd, 2009

    Interesting to see what people with lots of data do. The weather and satellite community passes around large datasets as NetCDF files (eg http://en.wikipedia.org/wiki/NetCDF). This is a self-describing format, engineered to be compact. Only need an 8-bit integer … no problem. I am guessing that in this world Big Data has never given XML much attention …

  21. Bob Foster on August 23rd, 2009

    A few years ago I was thinking we’d have to wait until the entire generation of programmers who grew up in the great wave of XML hype died out before we would regain our common sense. Your comments make me more hopeful.

    (No, Robin, being anti-XML isn’t like a carpenter being anti-hammer. It’s like a carpenter who doesn’t want to carry a 50-pound nail gun air compressor up a ladder to fix a loose shingle.)

  22. Dan Brickley (danbri) ’s status on Monday, 24-Aug-09 11:07:10 UTC – Identi.ca on August 24th, 2009

    […] xml critique in http://dataspora.com/blog/xml-and-big-data/ – too casually dismisses tech (rdf) that allows syntax-level […]

  23. Dan Brickley on August 24th, 2009

    For me, one of the great appeals of RDF, is that it allows different parties to take quite radically different choices regarding concrete syntax, while allowing their data to be re-integrated later. You might use SQL, others might use XML; that’s just fine. RDF has db-to-rdf mapping tools that take tabular data and either map into into RDF triples, or convert RDF SPARQL queries into SQL on the fly. For XML, we have the GRDDL spec which says how you can document non-RDF XML using XSLT transformations. And so on. RDF (and related technologies) just encourage you to do a bit of documenting – eg saying which classes you have are mutually disjoint, which properties take single values or are uniquely identifying, etc. But it doesn’t force a syntax on you – whether xml-based or otherwise. I’m not arguing that it’s painless, just that one of the goals of RDF is to allow just the kind of diversity you argue for, but trying at the same time to minimise the data-fragmentation cost of everyone doing it their own way…

  24. James on August 24th, 2009

    I’d also add in that XML was not designed for (generic) data, it was for documents. Hence the breakage in unicode (text is a strict subset of Unicode), mixed content (argh… we need a schema to parse!?!), namespace design,….

  25. Jeff H. on August 24th, 2009

    Wow!

    Your biotech experience paralleled mine circa 2000. I was working for a large discount brick-n-mortar bookstore which had jumped online just a few years earlier. A giant big-box discount retailer was coming on-line and needed book fulfillment, and they settled on our company. They’d deliver us XML book orders, and we’d send them back XML order status updates. Fortunately, I was pretty adept at scouring CPAN and ran across XML::Twig, an event based XML parser, so we could both consume the huge orders expressed in gigabyte files and produce order status files of the same size.

    Only the big box retailer’s database couldn’t load them.

  26. Guy Murphy on August 24th, 2009

    Why didn’t you use a SAX parser? You wouldn’t have had to load the full tree into memory, and your slow memory hog… well, wouldn’t have been.

    If you had been using event based parsing I suspect your project would have behaved quite differently.

  27. Eric N. on August 24th, 2009

    I strongly agree with your assessment of XML – in fact, with the support it got from major vendors, it severely impeded most data integration efforts for bioinformatics and genomics, and possibly even innovation within pharmaceutical companies. However, regarding RDF you are incorrect to assume it is too hard to use: most examples in bioinformatics show RDF (i.e., N3) is no more complicated than JSON. I propose an open challenge to illustrate this – game?

  28. XML and Data on August 24th, 2009

    […] said, it was refreshing to read that someone else is apprehensive about XML as a “big data” […]

  29. XML y grandes volúmenes de datos « Javier Aroche on August 24th, 2009

    […] y grandes volúmenes de datos How xml threatens big data. Básicamente XML no es para grandes volúmenes de datos (por el tamaño y la complejidad para […]

  30. Peter B on August 24th, 2009

    I believe XML has a place for data interchange: those where there are many producers and few consumers, and the benefits of easily validating user-submitted datafeeds outweighs the pain.

    The example I’m thinking of? Submitting data to property listing websites. Most sites use delimited formats and having worked with many I would swap them for XML any day. XML can be validated (oh so I made a mistake there!), which leads to the specification being concrete (hopefully – I hate guessing ambiguous clauses); copes with line breaks in the data; and has supported tools available.

    I guess this is outside your ‘big dataset’ situation, but I feel XML is the best format for these situations.

  31. Roy Hayward on August 24th, 2009

    Awesome! I have always said, “XML makes your big data bigger. No not more data, just more space.”

    Of course this gets me dirty looks from all of the XML fans I work with.

    And the other two questions I keep having as our XML fans tout it as a “human readable” format: (1) aren’t I building this to integrate two computer systems? (2) since I can read EDI with my un-aided eye/brain, does that make me …. borg?

    I think XML has its place. But I think that it is like a data format celebrity. From the attention it gets one would think XML had done an interview with Oprah or something.

  32. SDC on August 24th, 2009

    The classic quip from Slashdot sums it up well, I think: ‘XML is like violence. If a little doesn’t work, try using some more’.

    All joking aside I agree 100%. XML is overly verbose, strains at the limits of human readability (I can usually get the jist of JSON, XML is so full of ‘non-data’ it’s a pain to read), and generally is just good for exchanging bits of info or for documents.

  33. Konrad Förstner (konrad) ’s status on Tuesday, 25-Aug-09 11:55:09 UTC – Identi.ca on August 25th, 2009
  34. Rick J. Wagner on August 25th, 2009

    Nice article. You’re right, XML is not appropriate for big data tasks. Some of us need reminders of this once in a while….

    Rick

  35. Jeffrey Stewart on August 25th, 2009

    I have two reactions to this article. First, just because you have a hammer, that does not mean that every problem is a nail. XML is not the be-all and end-all. It is the right solution for the right problems. And large datasets that do not require interaction with other systems are not the right problem.

    My second comment is you cannot judge things in isolation. You can only judge them in comparison to the alternatives. When you take a look at what was available before XML came on the scene, you find more expensive tools, more complexity in representation, less interoperability and thus less ubiquity.

    Good comments. Good article. But tell what your alternatives are and comparative judgments.

  36. Robert Miner on August 25th, 2009

    In my view, you have misunderstood the purpose of MathML, and hence your criticism that it was unneeded and shouldn’t have been created is erroneous. MathML is intended to encode mathematics in a granular, explicit way suitable for machine processing. It isn’t intended for authors to use directly. Thus it is unsurprising that LaTeX (which was designed to optimize hand authoring in ASCII editors) is popular with researchers in contexts like WordPress. However, it is a mistake to conclude MathML is unsuccessful or unused.

    Most major STEM publishers now use XML-based workflows with MathML for their journals. LaTeX, as a Turing-complete programming language, is ill-suited to the kind of standardization and validation required in a high-volume publishing operation, and as a consequence, most of the 1 million LaTeX preprints on the arXiv that make it into print are converted to XML+MathML.

    Similarly, MathML is widely supported by math-aware software. By virtue of being standard (which LaTeX isn’t) and explicit and low-level (which LaTeX also isn’t), MathML is well-suited as an import/export format. The design team for the Math Input Panel, which is being introduced as a new accessory in Windows 7, drew the same conclusion. The Math Input Panel does handwriting recognition of mathematics, for use in other applications. Its only data format? MathML.

    Another area where MathML excels is in accessibility. In MathML, the logical equation structure is explicitly encoded, which greatly facilitates voice rendering. MathML has been incorporated by reference into the DAISY digital talking book specification, which will soon be a mandatory format for textbook publishers to produce in the US.

    My point here is not that MathML is better than LaTeX — LaTeX is great for the uses it was designed for. However, offering MathML as an example of an XML language that wasn’t needed and isn’t used is incorrect. It was designed to do a better job than the existing alternatives (including LaTeX) in terms of accessibility, validation, and explicit encoding of logical expression structure, among other things. In those areas, it is successful and widely used, as the three examples above indicate.

    I share the frustration expressed by William Morgan above that MathML has not succeeded in browsers. But I don’t think that invalidates the whole standard. Of course, a major factor was that MathML is XML, and XML hasn’t succeed in browsers either. So to that extent, I certainly agree with you that XML is not the panacea it was originally claimed to be.

  37. GarryJR on August 25th, 2009

    well said. Excellent talking points which will help with some headaches we are having RIGHT NOW.

  38. Larry R on August 26th, 2009

    I believe that your problem is not with XML , it is where you are storing it. I would agree that if you use a relational database with hierarchical data then you spend an incredible amount of processing power trying to do all the conversions. However, if you use a database that is designed specifically for the storage of XML data then you will find that it performs very nicely.

    I think that if you explore some databases like Mark Logic or XHive you would find that not only is it easy to store, you would also get lots of great search features. So in short its the storage not the data.

  39. Cactus Acide » » L’observatoire du neuromancien 08/26/2009 on August 26th, 2009

    […] How XML Threatens Big Data : Dataspora Blog […]

  40. A great post on XML and big data « Small Business+Phoenix+Software on August 26th, 2009

    […] big data I’ve had the pleasure of working in healthcare, XML,  and big data, but this article also reminds me a LOT of my past data warehouse projects…especially the data bureaucracy […]

  41. Michael Kay on August 27th, 2009

    I think you’ve revealed why your project failed when you claim “In its natural habitat, data lives in relational databases”.

    No it doesn’t. Relational databases are a highly artificial habitat. But anyone whose mindset is so conditioned by relational thinking is going to fail if they try to force-fit a technology that needs a different approach.

  42. Paddy3118 on August 27th, 2009

    The good:
    XML is not a binary format.
    The bad:
    In a lot of cases, it is as readable as binary ;-)

    In “the good old days”, a good data format was easily AWKable, and scripts were much easier to write to help it on its way. (Unfortunately, in those same days, a lot of the data was in proprietary binary formats).

    – Paddy.

  43. Structured Methods › links for 2009-08-27 on August 27th, 2009

    […] How XML Threatens Big Data : Dataspora Blog Michael E. Driscoll writes: Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov. […]

  44. Michael Nielsen » Biweekly links for 08/28/2009 on August 28th, 2009

    […] How XML Threatens Big Data : Dataspora Blog […]

  45. Stephen on August 29th, 2009

    I’ve been using delimited files forever. But one customer was doing business with Germany, and the double S character used the ASCII tilde character, which happened to be the delimiter. That kind of experience moved me to use the more complicated CSV format with optional double quotes, which can also be quoted. The result is a format that can transmit any 2-D data. My C libraries even support newlines in fields. And a nice side effect is that most spread sheets can import them. And, my library supports an optional “first line description”. That is, the first line of data is optionally column names. And, my library does not require the entire data set to be read into RAM. This “first line” bit essentially gives the advantages of a separate description with the advantages of keeping a single file. It allows import and export from SQL database tables. It allows human readable presentation. And, despite being entirely dynamic (no size limits), it’s reasonably fast. At least it’s not the bottleneck. Well, it was developed on a 386…

    XML is much more complex, and can represent more dimensions than 2. But since 2-D data and even 1-D data is the norm, one might expect XML to be confined to unusual cases. Or, one might expect spreadsheets to import XML. We’ve already seen word processors, etc., support XML…

  46. Stephen on August 29th, 2009

    To some extent, XML is a repeat of the Word Processor joke. Here’s the original:

    The great thing about Word Processors is that with them, you can easily cut and paste text, and move things around. The bad thing about Word Processors is that you can easily cut and paste text, and move things around all day long.

    XML is so flexible that one is tempted to instrument the crap out of the data.

    But i have seen database tables set up where a table column has the identifier for the data. I’ve even seen such tables with an identifier, a type, and then several columns of various types. So each row has only one column of real data. All the description for the data is redundant. And, i’ve even seen 2-D data set up this way. Now, most database tables aren’t set up this way. It’s slow, and the data size is huge. But when you have new column types periodically, and you can’t predict what they might be, what do you do? I call this form ultra-normalised. And, i’ve had to denormalize such data. It turned out to be 7 dimensional.

  47. Ryan Schneider on August 30th, 2009

    Sorry but I’m going to have to disagree with this article, mostly because the reasons the author gives for why XML failed for his big data.

    1) He said one of the things he spent his days doing was “writing parsers”. This shows ignorance of XML and the tools; nobody writes their own parser: it would be the equivalent of writing your own queue or map structure;

    2) XSLT is a poor choice for any XML files over a couple hundred kilobytes. SAX is the preferred method for fast and low memory XML processing. I use thousands of SAX filters chained together to process gigabytes of XML data with no issues.

    3) The XML functions of Oracle do suck, for lack of a better term. If we stick XML in a database it is done only as storage, no queries on it. We have an entire patented system to store/index XML data and do searches over it; sorry can’t go into much more detail than that.

  48. Ramblings along the narrow way » Blog Archive » links for 2009-09-01 on September 1st, 2009

    […] How XML Threatens Big Data : Dataspora Blog If a technology is too complicated, no matter how wonderful it is and how easy it makes a user’s life, it won’t be adopted on a wide scale. (tags: data database scalability programming software xml) […]

  49. Coast to Coast Bio Podcast » Blog Archive » Episode 26: Google, PLoS and NCBI get into bed together on September 3rd, 2009

    […] How XML threatens Big Data […]

  50. Kut Cagle on September 3rd, 2009
  51. Stephen Green on September 4th, 2009

    But maybe the quote –

    “James Clark stated ‘If a technology is too complicated, no matter how wonderful it is and how easy it makes a user’s life, it won’t be adopted on a wide scale.’”

    – does say it all. People in some countries don’t like the fact that the English language is the modern equivalent of a ‘lingua franca’ but they still tend to have to use it in business, etc. The best language to speak isn’t the one most convenient to you, the speaker, but the one best understood by all your hearers. If the audience is wide then the most widely adopted language is the best to be speaking. If you are only speaking to yourself then your own language may work fine: hence storing data which only your own software will ‘see’ directly will be fine in ‘TabML’, JSON, whatever works well for you. When that data has to be shared with many diverse systems you need a ‘lingua franca’ and XML seems to be just that. Horses for courses.

  52. Anti-Hammer on September 23rd, 2009

    So, sorry if I skipped about 100 comments just to add my own, but it seems to me this is the same old quasi-religious debate that’s been going on since XML first came out in the late 90s.

    I’ll summarize

    1) I herd XML was tha shizznit
    2) I implemented it badly (really REALLY badly)
    3) Pick an excuse that you heard from somebody smart
    (XML is “too verbose” is the most common and least true…..)

    4) Conclude firmly that XML suxxors and tell EVERYBODY

    Seriously, people. Don’t blame your own shortcomings on the technology. It’s not XML’s fault that you don’t like it. It’s yours.

  53. harborpirate on September 23rd, 2009

    @Ryan Schneider:
    “1) He said one of the things he spent his days doing was “writing parsers”. This shows ignorance of XML and the tools, nobody writes their own parser; it would be the equivalent of writing your own queue or map structure”

    Nobody writes their own parser NOW. In 2000 they did. XML Parsers of the time were at best, primitive; at worst, slow and broken.

    I recall parsing XML in Java around that time was an absolutely miserable task. The language core did not include XML parsing, and third party libraries were just a morass asking to suck up entire days evaluating them only to find that they were fundamentally broken and/or slow as paint drying. Eventually I broke down and wrote my own, which, though incomplete, was sufficient to the task and orders of magnitude faster than the third party monstrosities.

    I use XML when appropriate, but I’ve learned through experience just how dangerous it can be when evangelized on a project by the uninformed.

    XML is just like any other technology or format: Good at some things, terrible at others. The key is in recognizing the latter and avoiding it like a plague. There is no question, XML used in the wrong circumstances is a project killer.

  54. FriendlyPrimate on October 2nd, 2009

    Wow….I’m with you Michael. I was a fan of XML for years. But I slowly started becoming disillusioned with it as I tried to force it to do things like map to data in relational databases and Java models. Then I found out about JSON, and I fell in love with it. Less verbose, easier/faster to parse, easier to map to common data structures, easier to read, etc… Now XML looks downright ugly every time I see it.
    XML has a huge head start, but I really do believe that JSON is going to start catching up.

  55. anon_anon on October 3rd, 2009

    loading an XML tree into memory is not necessarily a memory hog; have you heard of vtd-xml?

  56. Bill Conniff on October 12th, 2009

    New technologies are being developed to address the size and performance issues of xml. Xponent software’s XMLMax loads any size xml into a treeview using at most 20MB of memory and can do XSL transformations within the same memory limit. The CAX xml parser is a pull parser that can look backward through all parsed xml, thereby enabling any xml transformation with a fast pull parser and without memory constraints. Many vendors are offering better support for large xml. Native xml databases solve some or all of the problems you mention in some scenarios.

  57. Joseph Turian on November 2nd, 2009

    I agree with many of your points. However, XML is the lesser of evils when doing a MySQL dump.

  58. Jewel Ward on March 2nd, 2010

    In case you want to follow up on this idea, there is a symposium on XML for the long-haul; the CFP went out this past week.

    http://balisage.net/longhaul/

    Call for Participation: International Symposium on XML for the Long Haul
    Issues in the Long-term preservation of XML

    Monday 2 August 2010
    Hotel Europa, Montréal, Canada

    Chair: Michael Sperberg-McQueen, Black Mesa Technologies

  59. Jason Price on July 23rd, 2010

    I agree with many points you have stated. I deal with many different data vendors on a daily basis and am responsible for the strategy and development to incorporate outside content with internal content for our intelligence teams.

    XML has its place in small data transactions but when you have to load big data, I cringe when a vendor tells me it is in XML format.

  60. Terry Camerlengo on October 13th, 2010

    I agree with every single word in this article. I work in genomics and early on in my bioinformatics career worked on a project where we converted flat-tabular data into an XML format for storage in an XML database. Fun project, but the round-trips to and from XML were a nightmare. XML was a ridiculously silly format for large genomic datasets. It added complexity, and it was slow. That was 5 years ago and I have never considered XML as a storage medium since, and neither has the industry; there are no bioinformatic programs or standard formats for persisting genomics data that make use of XML. At least none that I know of. Great article.

the data singularity, part ii: human-sizing big data

image

“There are no more promising or important targets for basic scientific research than understanding how human minds… solve problems and make decisions effectively.”

Herbert Simon

In my previous post, I discussed the forces behind what I’m calling The Data Singularity. My basic thesis is that as information generating processes become more frictionless — as humans have been excised from information read-write loops — the velocity and volume of data in the world is increasing, and at an exponential rate.

But where do we go from here? What are the consequences of living in an age where every datum is stored? Where are the bottlenecks, pain points, and opportunities? Which technologies are addressing these?

The upshot is this: a new class of tools is evolving for Big Data because traditional approaches can’t scale up. But these tools share a common goal: scaling data down, and making it human-sized. That’s the “reduce” part of MapReduce, the single statistic from an analysis, or the hundred-pixel line summarizing one hundred million events.

What’s happening today isn’t entirely new, though. There were echoes of it decades ago, when surveillance satellites first began scanning the globe.

VI. How Satellite Data Paralyzed the CIA

Beginning in the early 1970s the CIA began relying more on global satellite reconnaissance imagery for its intelligence operations. But according to one history, this massive, rich data didn’t accelerate the pace of US intelligence: it slowed it down.

Why? Because confronted with this firehose, CIA leaders attempted to analyze every image, chase every half-formed hypothesis, simply because it was possible. The few good leads were washed out by the many mediocre. The CIA didn’t adjust their decision-making to this new scale, and they were drowned by it.

Many organizations are at a similar inflection point now, with access to massive, rich data about their customers or products. And, like the CIA in the 1970s, they find themselves paralyzed by the possibilities.

VII. People Still Pull the Big Levers

That Big Data paralyzes human decision-makers matters, because humans still make the big decisions. When someone praises a company as being “data-driven”, I’d like to imagine that this is literally true: that the company is nothing more than a few server racks blinking & humming away, slinging bits and earning money.

But no such company exists. What “data-driven” really means is that the executives & employees use data as inputs for making decisions. Companies may be data-fueled, but they’re people-driven.

VIII. Human-sizing Big Data: Filter & Crunch 

All of the analytics in the world won’t matter if they remain inaccessible to the people driving an organization — the human decision-makers.

We have processes all around us acting as data amplifiers, recording events at a pace & scale that we can’t comprehend. But this has created a disequilibrium: our capacity to create data is vastly outstripping our ability to consume it. Analytics is the act of taking Big Data streams and human-sizing them for our small data brains.

We can reduce data by either filtering it, which sifts through but does not alter data, or by crunching it, reducing many data points to a few.
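As a toy sketch in Python (the event values here are invented for illustration), the two kinds of reduction look like this:

```python
# A stream of hypothetical event measurements
events = [12, 95, 7, 88, 3, 91]

# Filtering: sift through the data without altering it
big_events = [e for e in events if e > 50]
print(big_events)  # → [95, 88, 91]

# Crunching: reduce many data points to a single statistic
average = sum(events) / float(len(events))
print(average)  # prints the mean of the six events
```

Both operations leave a decision-maker with something human-sized: three interesting events, or one number.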

Google and Facebook are Filters. Many consumer web technologies might be viewed as powerful filters. Google is a relevance filter for 20 billion web pages. Facebook is a social filter for baby photos. FourSquare is a geo-social filter for hipster bars. Amazon is a filter for retail products, combining search with a powerful recommendation engine.

Wikipedia is a Natural Language Cruncher. Crunching data is harder than filtering it. Perhaps the toughest nut to crack involves processing natural language: if you read a thousand web pages about the Gutenberg Bible, how would you describe it in a few paragraphs? Wikipedia is a human-powered natural language cruncher, driven by its army of mechanical turks, whose collective actions even reveal news trends.

Crunch the Past to Predict the Future. Crunching of quantitative data is at the heart of many prediction tasks: the National Weather Service aggregates weather station measurements into forecasts, Fair Isaac calculates a score of credit-worthiness by examining your credit history, and a sports contest might be construed as an algorithm — operating on a sequence of individually played points — to predict the best team or athlete.

Number crunching has its more banal forms, as well, in the kind of sums and averages found in your phone or utility bill. These are necessary, but predictive algorithms — the kind involved in weather forecasting — will continue to grow in importance. For at a certain scale of data, exact reporting becomes an insurmountable task: we can only hope to have probabilistic answers.

IX. Business Intelligence is Dead: New Tools for a New Era

That our traditional tools don’t operate at scale was highlighted by Tim O’Reilly recently, when he declared “Business intelligence as we knew it is dead.”

A new class of tools is emerging along the Big Data stack, in three areas: (1) storage & computation, (2) analytics, and (3) dashboards & visualization.

These tools will disrupt and attack many of the traditional Business Intelligence firms, ranging from tool-makers like SAS and SPSS, to relational database vendors like Oracle, to custom hardware providers.

  • 1. Storage & Computation: Mixed Platforms, not Monolithic Databases. At the lowest level of storage & computation, Big Data is driving the success of cloud computing platforms like Amazon’s Elastic Compute Cloud — a massive, virtualized commodity-hardware grid — as an alternative to the Big Iron sold by hardware makers. Big Data has also catalyzed widespread adoption of the distributed, fault-tolerant Hadoop platform — an open-source implementation of Google’s MapReduce and GFS that was developed at Yahoo, and is now commercially supported by Cloudera.

    A bit further up the stack, relational databases are suffering: newer commercial entrants in this space — such as Greenplum, Aster Data, Vertica, and Netezza — offer parallelized relational systems that operate at greater scale and lower cost than Oracle and Teradata. Many open-source, non-relational data stores — with a colorful constellation of names such as HBase, MongoDB, CouchDB, Cassandra, and Voldemort — have gained traction for high-traffic, content-driven web sites.

    SQL & NoSQL are Complementary, Not Antagonistic. While some may view storage technologies as antagonistic, either-or choices, the truth is that most Big Data-driven companies use a mixture of tools in complementary ways. Hadoop is often used for batch-processing and transformation of log data that is fed to more structured data stores, such as a distributed RDBMS, in backend systems. Non-relational data stores are in turn ideal for front-facing, high-performance web applications, where queries return a bolus of data related to a single key — often a product, user, or page identifier. All of these pieces working together form an information platform: an ecosystem of APIs working together.

  • 2. Analytics: There Are No Turnkey Solutions. Imagine if any piece of data you ever wanted was within a query’s reach: what would you do with it? We’re fast approaching this scenario, and making data meaningful is the bottleneck. But unlike storing data — where use cases & technologies are common and becoming commoditized — the ways that firms filter and crunch their data varies widely.

    This reflects the range of analytics needs that firms have: for example, a financial firm may need low-latency, continuous analysis of data streams, while an online retailer or pharmaceutical firm can tolerate 24-hour delays for analysis.

    Scaling Up Analytics is Hard. R, my favorite analytics tool, is fantastic for modeling either aggregated data sets or samples of data that can fit in memory, but methods for deploying R in a large-scale data environment are still nascent. One promising approach is Saptarshi Guha’s RHIPE, which combines R with Hadoop (see the slides from his March presentation at the Bay Area R Users Group). Another MapReduce-based framework for large-scale data analysis is the Apache Mahout project.

    Learn, Then Apply: But Stay Close to the Data. In general, there are two pieces in any analytics pipeline: (i) learning, or the training of a model with historical data, and (ii) prediction, or the application of a model to new data. On the learning side, it’s been said that more data beats better algorithms, and this is certainly true for many classification problems. Training a model is a computationally intensive task, and the development of methods that can train on massive data sets is an area of active research.

    On the application/prediction side of modeling, the challenges often revolve around deployment: how do we get the model to the data? (The reverse, pushing data to the model, is more expensive.) To address the need to port models across different environments, PMML (the Predictive Model Markup Language) has been developed, and it is supported by a range of database vendors.

    The meme of “in-database analytics” is resonating because, given data’s increasing heft, efficient analytics keeps the training & execution of models close to where the data lives.

    As it will be several years before either open-source or commercial analytics tools are mature here, the most successful Big Data modelers will be those data scientists who can build and glue together their own methods, tailored for individual environments and needs.

  • 3. Dashboards & Visualization: Why “I See” is a Synonym for “I Understand”. The most visible way in which Big Data is disrupting old tools is by changing the way we look at data. The ultimate end-point for most data analysis is a human decision-maker, whose highest bandwidth channel is his or her eyeballs. To take optimal advantage of the human visual system, dashboards and data visualization must be well-designed, and until recently, tools that achieved even a minimal standard of competence were rare.

    Visual Literacy is on the Rise. But a new set of visualization tools and packages, as well as growing popular interest in data visualization — catalyzed by the books of Edward Tufte, blogs like Nathan Yau’s FlowingData, and talks at TED conferences — are changing this. As I’ve written about before, there are two distinct kinds of data visualization pathways: (i) exploratory, a highly interactive path whereby a data scientist may permute through dozens or even hundreds of views of a data set to understand its shape or fit to a hypothesized model, and (ii) narrative, a more constrained path whereby only one or several views of the data are presented.

    Exploring Data Requires Fast, Frequent Feedback. For the exploratory path, desktop tools are ideal. The open-source language R has several outstanding visualization packages, including ggplot2 and lattice (based on William Cleveland’s trellis). Two solid commercial products for exploratory visualization are Spotfire and Tableau (the latter of which has been praised by the hard-to-please Stephen Few).

    Sharing Visualizations: Web Dashboards Are Ideal. Ultimately, however, visualizations need to be shared beyond a single user, to an audience. Web-driven dashboards are an ideal form for sharing narrative visualizations, by allowing navigation along defined axes of the data. The challenge is moving visualizations from the desktop to the web. Tableau has this capacity, but with R the process is less straightforward. One promising route is via Jeff Horner’s RApache tool, which embeds R inside an Apache server (and which I’ve used for my MLB Pitch F/X tool, and Jeroen Ooms uses to power his ggplot2 web app).

    The major limitation of R-driven web graphics is that achieving interactivity within the graphic itself is difficult, as R’s graphics model is focused on static graphics. There are, however, several routes for achieving highly interactive, web-based data visualizations, whether by using JavaScript, HTML5’s Canvas, or Flash. Two in particular are: (i) Ben Fry’s Processing, an expressive language for vector animation, which recently added JavaScript as one of its implementations, and (ii) the Protovis framework out of Stanford: a JavaScript graphing toolkit whose conceptual integrity and expressive flexibility were inspired (like ggplot2) by Wilkinson’s grammar of graphics.

X.  Collaborating with Big Data: Analytics is a Social Process

In the same talk that Tim O’Reilly proclaimed the death of BI “as we knew it”, he also highlighted a new initiative by Greenplum called Chorus (Greenplum is a Dataspora client, but I confess I’ve only seen a limited preview).

The animating spirit of Chorus is that analytics is not only about data, models, and visualizations — it’s also about the people who work on these various pieces. One of the reasons I love Box.net is the layer of social information that’s overlaid onto my files: appended notes, access statistics from collaborators, automatic notifications when a change is made.

Chorus is a vision to do this with Big Data; it allows, for instance, an analyst to link a data visualization to an underlying data source, include the R code that created the visualization, and append a note about a recent change to it.

As the Big Data stack matures, tools that help manage the workflow from data to analytics to visualizations, and ultimately to decisions, will be critical. Someday, creating and sharing a data analysis through a web dashboard should be as easy as writing a blog post. Until that day, there’s plenty of work to keep us data scientists well-employed.

I originally published on May 27, 2010 on the Dataspora blog.

the data singularity is here

image

Originally published March 8, 2010 at Dataspora.

In this blog post I’ll attempt to sketch the forces behind what I’m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.

In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren’t even at the terminal node of action. International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.

Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage — all of which are dropping exponentially.

The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on). The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.

So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.

But before I discuss these consequences, I’d like to expand on the premise. The world wasn’t always drowning in this data deluge, so how did we get here?

I. Data at the Speed of Speech

For most of human history, information traveled no faster than the sound of the human voice. The origin of human language was the original singularity: it marked the birth of a non-biological information channel, distinct from our DNA.

But despite this achievement, the production of information — whether farmers’ almanacs or merchants’ ledgers — was still constrained by the costs of ink and parchment and the write-speed of the human hand.

All 70,000 volumes of the Library of Alexandria, the collected body of human knowledge in antiquity, could fit on two thumb drives today.

Thus the transmission and production of data, when it was done at all, was painstaking in form, small in scale, and occurred between people.

People --> People

II. Data at the Speed of Light

With the telegraph, for the first time, data flowed at the speed of light.

In the late 18th century, the first substantive telegraph line connected Paris to a city 210 kilometers to its north, using optical semaphores rather than electrical currents to communicate. Yet while data hopped between stations at light speed, it had to be routed by human operators at each station.

Centuries earlier, the printing press dramatically reduced the production costs of information. Still, human authors transmitted their hand-drafted manuscripts to typesetters, who set type with fonts optimally designed for human eyes.

III. Programmable Looms and Reading Machines

Punch cards represented the movement of data away from human-readable, anthropocentric substrates, onto a medium designed principally for consumption by machines.

Punch cards were developed in early 18th-century France to control industrial looms.

Now, machines were the final terminus of data transmission. This act of communicating with our machines, programming them, was at the heart of Charles Babbage’s Analytical Engine, which came more than a century later.

People --> Machines

IV. Phonographs and Recording Machines

Developing on the other side of the communication spectrum were machines that excelled at writing and storing data.

The modern rotating disk drive feels inspired less by punch cards than by Thomas Edison’s cylinder machines, better known as phonographs.

The human voice was a natural data format, and if early pioneers had a vision for the modern human-machine interface, I imagine it would have been to program machines by voice. It’s a vision that still eludes us.

By the middle of the 20th century, a slew of semiconductor technologies emerged to close the loop of data generation: we had machines that produced digital data, and machines that continuously consumed it, without human intervention.

Machines --> Machines

These technologies also sparked the beginning of a less-celebrated, but equally important exponential curve: the falling cost of data storage.

V. Listening to the Pulse of the Planet

The exponential drop in data storage costs has meant that logging historical data about a process, or billions of processes, is economically feasible.

I conjecture that the largest share of data on the planet sits in log files; these are the EKGs of the server farms that manage our cell phones, our e-mail accounts, and every other facet of our online existence — and which consume 3% of the US energy budget.

Ubiquitous networking and cheap bandwidth have meant these pools of storage are no longer isolated on individual sensors, phones, or servers, but form the tributaries feeding an ocean of data in the Cloud.

And yet, funneling these massive volumes of data creates enormous technological pressures, against which companies struggle. So why keep the data?

Because inside these log files, amidst the myriad conversations recorded between machines, lies the pulse of their customers.

Collectively, these logs reveal the pulse of the planet — flight delays, package shipments, job losses, and human sentiments.

And as I’ll discuss in my next post, those who can extract a meaningful signal from this thunderous cacophony — the analysts, statisticians, and data scientists — are uniquely positioned to change the world.

Knuth’s reservoir sampling in Python and Perl

Algorithms that perform calculations on evolving data streams, but in fixed memory, have increasing relevance in the age of Big Data. The reservoir sampling algorithm outputs a sample of N lines from a file of undetermined size. It does so in a single pass, using memory proportional to N.

These two features – (i) a constant memory footprint and (ii) a capacity to operate on files of indeterminate size – make it ideal for working with very large data sets common to event processing.

While it has likely been multiply discovered and implemented, like many algorithms, it was codified in Knuth’s The Art of Computer Programming.

The trick of this algorithm is to first fill up the sample buffer, and afterwards, to probabilistically replace it with additional lines of input.

Python version

#!/usr/bin/python
import sys
import random

# read the sample size N, and optionally a file (defaulting to stdin)
if len(sys.argv) == 3:
    infile = open(sys.argv[2], 'r')
elif len(sys.argv) == 2:
    infile = sys.stdin
else:
    sys.exit("Usage:  python samplen.py <lines> <?file>")

N = int(sys.argv[1])
sample = []

for i, line in enumerate(infile):
    if i < N:
        # fill the reservoir with the first N lines
        sample.append(line)
    elif random.random() < N / float(i + 1):
        # line i+1 is sampled with probability N/(i+1),
        # replacing a random element of the reservoir
        replace = random.randint(0, len(sample) - 1)
        sample[replace] = line

for line in sample:
    sys.stdout.write(line)

Perl version

#!/usr/bin/perl -sw

$IN = 'STDIN' if (@ARGV == 1);
open($IN, '<'.$ARGV[1]) if (@ARGV == 2);
die "Usage:  perl samplen.pl <lines> <?file>\n" if (!defined($IN));

$N = $ARGV[0];
@sample = ();

while (<$IN>) {
    if ($. <= $N) {
        # fill the reservoir with the first N lines
        $sample[$.-1] = $_;
    } elsif (($. > $N) && (rand() < $N/$.)) {
        # line $. is sampled with probability N/$.,
        # replacing a random element of the reservoir
        $replace = int(rand(@sample));
        $sample[$replace] = $_;
    }
}

print foreach (@sample);
close($IN);

For example, imagine we are to sample 5 lines randomly from a 6-line file. Call i the line number of the input, and N the size of the sample desired. For the first 5 lines (where i <= N), our sample fills entirely. (For the non-Perl hackers: the current line number i is held by the variable $., just as the special variable $_ holds the current line value).

It’s at successive lines of input that the probabilistic sampling starts: the 6th line has a 5/6 (N/i) chance of being sampled, and if chosen, it replaces one of the 5 previously chosen lines, each with 1/5 probability. Each earlier line is thus evicted with probability (5/6 × 1/5) = 1/6, leaving it a 5/6 chance of remaining in the sample. Thus all 6 lines have an equal chance of being sampled.

In general, as more lines are seen, the chance that any additional line is chosen for the sample falls; but the chance that any previously chosen line could be replaced grows. These two balance such that the probability for any given line of input to be sampled is identical.
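This equal-probability property is easy to check empirically. The sketch below is my own restatement of the algorithm above as a function (names are mine), with a seeded generator so the run is repeatable; it draws a 5-line sample from a 6-line stream many times and confirms each line appears about 5/6 of the time:

```python
import random
from collections import Counter

def reservoir_sample(stream, n, rng):
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)                  # fill the reservoir
        elif rng.random() < n / float(i + 1):
            sample[rng.randrange(n)] = item      # probabilistic replacement
    return sample

rng = random.Random(42)
counts = Counter()
trials = 60000
for _ in range(trials):
    counts.update(reservoir_sample(range(6), 5, rng))

for line in range(6):
    # each line should be sampled with probability ~5/6
    assert abs(counts[line] / float(trials) - 5.0 / 6.0) < 0.01
```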

A more sophisticated variation of this algorithm supports weighted sampling, where each line’s chance of selection is proportional to an associated weight.
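One well-known weighted variant is the A-Res scheme of Efraimidis and Spirakis: give each item a key u^(1/w), where u is uniform on (0,1) and w is the item’s weight, then keep the N items with the largest keys. A minimal Python sketch (the names here are mine, not from the original post):

```python
import heapq
import random

def weighted_reservoir_sample(stream, n, rng):
    # keep the n items with the largest keys u ** (1/w)
    heap = []  # min-heap of (key, item): the smallest key is evicted first
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < n:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# an item's chance of selection is proportional to its weight:
# here "a" should win the single slot roughly 100/102 of the time
rng = random.Random(7)
print(weighted_reservoir_sample([("a", 100.0), ("b", 1.0), ("c", 1.0)], 1, rng))
```

Like the unweighted version, this runs in a single pass with memory proportional to N.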

the three sexy skills of data geeks

image

(I originally penned this on May 27, 2009 and published on the Dataspora blog.)

Hal Varian, Google’s Chief Economist, was interviewed a few months ago, and said the following in the McKinsey Quarterly:

“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

In prepping for tonight’s talk at the Google IO Ignite event, this quote inspired me to muse about how sex appeal and statistics might go together, so I chose to mash up a few scatter plots with Andy Warhol’s Marilyn Monroe.

Statisticians’ sex appeal has little to do with their lascivious leanings (ahem, BedPost), and more to do with the scarcity of their skills. I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization. (In parentheses next to each, I’ve put the salient character trait needed to acquire it.)

Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn. It’s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was only recently developed in 2004). I expect to be on its learning curve my entire life. This being the case, people who possess a solid grasp of modern statistics are rare.   And yet problems that require its application continue to multiply.  The text that I was exposed to in graduate school and find to be an unparalleled survey is Hastie, Tibshirani, and Friedman’s Elements of Statistical Learning.

Skill #2: Data Munging (Suffering). The second critical skill mentioned above is “data munging.” Among data geek circles (you can find us with a Twitter search for #rstats), this refers to the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy. At best it’s inconsistently delimited or packed into an unnecessarily complex XML schema. At worst, it’s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.

A good data munger excels at turning coffee into regular expressions and parsers, implemented in a high-level scripting language of choice (often Perl, Python, even Javascript). This is problem solving with programming, and quite different from statistics. An aspiration towards elegance — in the form of a perfect XSLT filter, for example — is rarely rewarded, and often punished. A decade ago, I thought that the world’s data would soon be well-structured, and my talent for syntactical incantations of regular expressions would be a moot skill. I was wrong. (Perhaps there’s an analogy with the paper industry: the growing volume of data means we’ll likely need more regular expressions before we need fewer.)
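As a toy illustration (the records here are invented), this is the kind of normalization a munger writes daily: a few lines of Python collapsing inconsistently delimited lines into clean fields.

```python
import re

# Messy records: commas, tabs, and stray spaces all used as delimiters
raw_lines = [
    "GENE1,  0.52\tpresent",
    "GENE2\t0.91   present",
    "GENE3 ,0.07,absent",
]

rows = []
for line in raw_lines:
    # split on any run of commas, tabs, or spaces
    fields = [f for f in re.split(r"[,\t ]+", line.strip()) if f]
    rows.append((fields[0], float(fields[1]), fields[2]))

print(rows)
# → [('GENE1', 0.52, 'present'), ('GENE2', 0.91, 'present'), ('GENE3', 0.07, 'absent')]
```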

Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).
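Slicing and dicing of this sort can be sketched in a few lines of Python with the standard-library sqlite3 module (the pitches table and its values here are made up for illustration):

```python
import sqlite3

# Build a small in-memory table of pitches (hypothetical schema/values).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pitches (pitch_type TEXT, speed REAL)')
conn.executemany('INSERT INTO pitches VALUES (?, ?)',
                 [('FF', 91.0), ('SL', 84.0), ('FF', 93.0), ('CU', 77.5)])

# Slice and dice: mean speed per pitch type, fastest first.
rows = conn.execute("""
    SELECT pitch_type, ROUND(AVG(speed), 1) AS avg_speed
    FROM pitches GROUP BY pitch_type ORDER BY avg_speed DESC
""").fetchall()
print(rows)  # [('FF', 92.0), ('SL', 84.0), ('CU', 77.5)]
```

The same aggregation could equally be done with cut, sort, and awk at a Unix prompt; the point is fluency with whichever tool gets the slice out fastest.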

And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like 96 nodes of Postgres, snow and Rmpi, or Hadoop and MapReduce, running on Amazon EC2 to boot.

Skill #3: Visualization (Storytelling). This third and last skill that Professor Varian refers to is the easiest to believe one already has. Most of us have had exposure to the basic chart-making widgets of Excel (and, to date myself, tools like Harvard Graphics). But a little knowledge is a dangerous thing: these tools are often insufficient when faced with visualizing large, multivariate data sets.

Here it’s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals. The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst’s understanding of the data. These may consist of scatter plot matrices and histograms, where labels and colors are minimally set by default. Their goal is to help develop a hypothesis about the data, and their audience typically numbers one or a small team.

The second kind of data visualization is intended to communicate to a wider audience, with the goal of visually advocating for a hypothesis. While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill, with separate tools.  (R is excellent for static visualizations, but cannot compete with the rich interactive visualizations that tools like Processing and Flare make possible). Luckily, successful collaboration often occurs between data analysts and designers, the occasional fracas notwithstanding.

The ability to visualize and communicate data is critical, because even with good data and rigorous statistical techniques, if the results of an analysis are poorly visualized, they will not convince: whether it’s an academic discovery or a business proposal.

Put All Three Skills Together: Sexy. Thus with the Age of Data upon us, those who can model, munge, and visually communicate data — call us statisticians or data geeks — are a hot commodity.  I grew up before the age of geek chic, when the computer whizzes were social pariahs, and feature-length movies were dedicated to nerds seeking revenge.  But in the last decade, Steve Jobs became an icon, the Internet became cool, and an entire generation of tech kids grew up well adjusted.  They even built the social web to prove it.   I believe the same could happen to statisticians and data geeks too.

color: the cinderella of dataviz

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”  — Envisioning Information, Edward Tufte, Graphics Press, 1990   


Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, color, used wisely, is unrivaled as a visualization tool.

Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.

While obsessing about how to implement color on Dataspora Labs’ PitchFX viewer, I began with a basic motivating question:

Why use color in data graphics?

If our data are simple, a single color is sufficient, even preferable. For example, below is a scatter plot of 287 pitches thrown by the major league pitcher Oscar Villarreal in 2008. With just two dimensions of data to describe — the x and y location in the strike zone — black and white is sufficient. In fact, this scatter plot is a perfectly lossless representation of the data set (assuming no data points perfectly overlap).

Fig 1. Location of Pitches (Villarreal, HOU, 2008)

Simple black and white scatter plot

But what if we’d like to know more: for instance, what kinds of pitches (curveballs, fastballs) landed where? Or their speed?  Visualizations live in two dimensions, but the world they describe is rarely so confined.

The defining challenge of data visualization is projecting high dimensional data onto a low dimensional canvas. (As a rule, one should never do the reverse: visualize more dimensions than exist in the data).

Getting back to our pitching example, if we want to layer another dimension of data — pitch type — into our plot, we have several methods at our disposal:

  1. plotting symbols – vary the glyphs that we use (circles, triangles, etc.),
  2. small multiples – vary extra dimensions in space, creating a series of smaller plots, and
  3. color – color our data, encoding extra dimensions inside a color space.

Which techniques you employ depends on the nature of the data and the medium of your canvas. I will describe all three by way of example.

Multivariate Method I:  Vary Your Plotting Symbols

Fig 2. Location and Pitch Type (Villarreal, HOU, 2008)

Scatterplot with varied plotting symbols.

In this plot, I’ve layered the categorical dimension of pitch type into our plot by using four different plotting symbols.

I consider this visualization an abject failure.  In fact, the prize for my most despised graphs in graduate school goes to bacterial growth curves rendered this way. These graphs make our heads hurt because (i) distinguishing glyphs demands extra attention (versus what academics call ‘pre-attentively processed’ cues like color), and (ii) even after we visually decode the symbols, we have yet another step: mapping symbols to their semantic categories. (Admittedly this can be improved with Chernoff faces or other iconic symbols, where the categorical mapping is self-evident).

Multivariate Method II:  Small Multiples on a Canvas

Folding additional dimensions into a partitioned canvas has a distinguished pedigree in information graphics. It has been employed everywhere from Galileo’s sunspot illustrations to William Cleveland’s trellis plots. And as Scott McCloud’s unexpected tour de force on comics makes clear, panels of pictures possess a narrative power that a single, undivided canvas lacks.

In the plot below, the four types of pitches that Oscar throws are splintered horizontally. By reducing our plot sizes, we’ve given up some resolution in positional information. But in return, patterns that were invisible in our first plot, and obscured in our second (by varied symbols), are now made clear (Oscar throws his fastballs low, but his sliders high).

Fig 3:  Location and Pitch Type (Villarreal, HOU, 2008)

black and white strip plot

Multiplying plots in space works especially well in printed media, which can hold more than ten times as many dots per square inch as a screen. Both columns and rows can be used to lattice over additional dimensions, the result being a matrix of scatter plots (in R, see the ‘splom’ function).

Multivariate Method III: Color Your Data

So why bother with color?

First, as compared to most print media, computer displays have fewer units of space, but a broader color gamut. So color is a compensatory strength.

For multi-dimensional data, color can convey additional dimensions inside a unit of space — and can do so instantly. Color differences can be detected within 200 ms, before you’re even conscious of paying attention (the ‘pre-attentive’ concept I mentioned earlier).

But the most important reason to use color in multivariate graphics is that color is itself multidimensional. Our perceptual color space — however you slice it — is three-dimensional.

In the example below, I’ve used color as a means of encoding a fourth dimension of our pitching data: the speed of pitches thrown. The palette I’ve chosen is a divergent palette that moves along one dimension (think of it as the ‘redness-blueness’ dimension) in the CIELUV color space, while maintaining a constant level of luminosity.

Fig 4. Location, Pitch Type, and Velocity (Villarreal, HOU, 2008)

isoluminant, diverging color ramp

color strip plot

Holding luminosity constant is important, because luminosity (similar to brightness) determines a color’s visual impact. Bright colors pop, and dark colors recede. A color ramp that varies luminosity along with hue will highlight data points as an artifact of color choice.
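As a rough sketch of the idea (not the R code behind the actual figures), the following Python builds an isoluminant diverging ramp by interpolating only the chromatic a and b channels of CIELAB while pinning luminosity, then converting each step to an sRGB hex value. The blue and red endpoints, and all names here, are illustrative choices:

```python
import math

def lab_to_hex(L, a, b):
    """Convert a CIELAB color (D65 white point) to an sRGB hex string."""
    # CIELAB -> CIEXYZ
    fy = (L + 16) / 116.0
    fx = fy + a / 500.0
    fz = fy - b / 200.0
    def finv(t):
        return t ** 3 if t ** 3 > 0.008856 else (t - 16.0 / 116.0) / 7.787
    X, Y, Z = 0.95047 * finv(fx), 1.00000 * finv(fy), 1.08883 * finv(fz)
    # CIEXYZ -> linear sRGB
    channels = (3.2406 * X - 1.5372 * Y - 0.4986 * Z,
                -0.9689 * X + 1.8758 * Y + 0.0415 * Z,
                0.0557 * X - 0.2040 * Y + 1.0570 * Z)
    def gamma(c):
        c = min(max(c, 0.0), 1.0)  # clip out-of-gamut channels
        return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
    return '#%02X%02X%02X' % tuple(round(gamma(c) * 255) for c in channels)

def diverging_palette(n=7, L=50):
    """n isoluminant steps from a blue to a red of equal luminosity L."""
    steps = [i / (n - 1) for i in range(n)]
    return [lab_to_hex(L, -48 + t * 112, -48 + t * 112) for t in steps]

print(diverging_palette())
```

Because L is held fixed at every step, no color in the ramp pops or recedes relative to its neighbors; only its chroma changes.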

I chose only seven gradations of color, so I’m downsampling (in a lossy way) our speed data – but further segmentation of our color ramp is not likely to be perceptible.

I’ve also chosen to use filled circles as my plotting symbol, as opposed to the open circles in all my previous plots. This is done to improve the perception of each pitch’s speed via its color: small patches of color are less perceptible. But a consequence of this choice — compounded by our choice to work with a series of smaller plots — is that more points overlap. We’ve further degraded some of our positional information. However, in our last step, we attempt to recover some of this.

Now I’ve finally brought color to bear on this visualization, but I’ve only encoded a single dimension — speed. Which leads to another question:

If color is three-dimensional, can I encode three dimensions with it?

In theory, yes. Colin Ware researched this exact question. In practice, it’s difficult. It turns out that asking observers to assess the amount of ‘redness’, ‘blueness’, and ‘greenness’ of points is possible, but not intuitive (I suspect it’s somewhat like parsing symbols).

Another complicating factor is that a nontrivial fraction of the population has some form of color blindness. This effectively reduces their color perception to two dimensions.

And finally, the truth is that our sensation of color is not equal along all dimensions; it’s thought that the closely related ‘red’ and ‘green’ receptors emerged via duplication of a single long-wavelength receptor (useful for telling ripe from unripe fruits, according to one just-so story).

Because of the high level of dichromacy in the population, and because of the challenge of encoding three dimensions in color, I feel color is best used to encode no more than two dimensions of data.

So, for my last example of our pitching plot data, I will introduce luminosity as a means of encoding the local density of points (using a kernel density estimator). This allows us to recover some of the data lost by increasing the sizes of our plotting symbols.
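A one-dimensional toy version of this density-to-luminosity mapping (the real plots use a two-dimensional estimator in R; the function names and luminosity range here are illustrative assumptions) might look like:

```python
import math

def kde(points, query, bandwidth=0.5):
    """Gaussian kernel density estimate at each query location (1-D toy)."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((q - p) / bandwidth) ** 2)
                       for p in points)
            for q in query]

def density_to_luminosity(densities, l_min=30, l_max=80):
    """Scale densities so the densest region gets l_min (darkest)."""
    peak = max(densities)
    return [l_max - (d / peak) * (l_max - l_min) for d in densities]

# A cluster of pitches near 0 and a lone pitch at 3:
dens = kde([0.0, 0.1, -0.1, 3.0], [0.0, 3.0])
lum = density_to_luminosity(dens)
# the cluster is denser, so it is rendered darker (lower luminosity)
```

Each point’s luminosity then becomes the second channel of the two-dimensional palette, alongside the redness-blueness channel carrying speed.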

Fig 5. Location, Pitch Type, Velocity, and Density (Villarreal, HOU, 2008)

two-dimensional color palette

multivariate color strip plot

Here we have effectively employed a two-dimensional color palette, with blueness-redness varying along one axis for speed, and luminosity varying in the other to denote local density.

One final point about using luminosity. Observing colors in a data visualization involves overloading, in the programming sense. We rely on cognitive functions that were developed for one purpose (perceiving lions) and use them for another (perceiving lines).

Since we can overload color any way we want, whenever possible, we should choose mappings that are natural. Mapping pitch density to luminosity feels right because the darker shadows in our pitch plots imply depth. Likewise, when sampling from the color space, we might as well choose colors found in nature. These are the palettes our eyes were gazing at for the millions of years before #FF0000 showed up.

Color, used thoughtfully and responsibly, can be an incredibly valuable tool in visualizing high dimensional data.

FutureMan Asks: What about Animation?

This discussion has focused on using static graphics in general, and color in particular, as a means of visualizing multivariate data. I’ve purposely neglected one very powerful tool: motion. The ability to animate graphics multiplies by several orders of magnitude the amount of information that can be packed into a visualization. But packing information into a time-varying data structure has to be done by someone (you or me), and in my view this remains a significant challenge. Canonical forms of animated visualizations (equivalent to the histograms, box plots, and scatterplots of the static world) are still a ways off, but frameworks like Processing and Prefuse are a promising start towards their development.

Methods

The final product of these five-dimensional pitch plots — for all available data for the 2008 season — can be explored via the PitchFX Django-driven web tool at Dataspora labs.

All of the visualizations here were developed using R and the Lattice graphics package.  (Of note, Hadley Wickham is developing ggplot2, a bold re-write of the R graphics system based on a grammar of graphics).


Comments

9 Responses to “Color: The Cinderella of dataviz”

  1. Joshua Reich on March 13th, 2009

    Great post Michael.

In the world of computer animation (mostly of yore, but still sometimes today), there is a common phrase, ‘coder colors.’ When left to their own devices, software people tend to choose colors that programmatically explore the RGB tuple space.

While you don’t explicitly mention it, the RGB space, while perfectly logical for designing computer monitors or building CCDs, maps to neither the structure of the retina nor how humans perceive color, and thus is not ideal for data representation. Few humans are skilled enough to pick harmonious colors directly as RGB tuples, yet most software systems default to this method.

This makes some historical sense, in that no additional computation was required to translate RGB pixels into monitor signals – it was up to the developer to add their own colorspace transforms. But today, the cost of applying a simple linear matrix transform to these tuples is inconsequential, yet many packages still provide only a default RGB space.

    R is great here in that the base package provides hsv() and hcl() in addition to rgb(). And many of the programmatic techniques that would otherwise result in ‘coder colors’ in RGB turn out fine in these other color spaces.

  2. Michael E. Driscoll on March 13th, 2009

    Josh – I did not want to bring our dear readers down the rabbit hole of color spaces, but I couldn’t agree with you more w.r.t. RGB. Our actual perceptual color space is not a perfect cube — but I suspect the same engineers who brought us function keys F1 through F12 were also behind choosing these ’system colors’. We are only now slowly shrugging off those frozen accidents — and our machines are no longer visually shrieking at us.

  3. Edward Tufte on March 15th, 2009

    Dear Mike Driscoll,

    This is an interesting exploration. Some suggestions to try:

    Report some real findings about the baseball pitching to demonstrate that the displays have produced something interesting.

    Use a much larger data matrix (100 or 500 pitches).

    Make dots smaller.

    Make tick marks much smaller. On this idea, see Visual Explanations on Smallest Effective Differences.

    Try color patches (ala Ware) instead of dots.

    See Bill Cleveland’s two excellent books on data displays and do some Cleveland versions of these data.

    Take Ware and Cleveland’s work more seriously.

Don’t give up color’s third dimension because some viewers (4% of men, 1% of women) are color deficient. That’s way too much to give up; instead design all out and then afterwards see if it is possible to gently accommodate color deficiencies by color value or saturation (in HSV space).

    Try hue, saturation, and value for three dimensions,

    Use gray for all those black grid lines; eliminate as many lines as possible.

    Eliminate and lighten up gray boxes.

    Try colored letters instead of dots to ID changeup, fastball, sinker (S? N?), slider (S? L?) on a larger common plot.

    Make graph labels more informative.

    Don’t write in first person history of what you did; main subject and main verb of each sentence should be about the graphics and baseball (see how sparklines are presented in Beautiful Evidence; 14 pages and not a single “I”).

    Best, ET

  4. Abhishek Tiwari on March 24th, 2009

    Dear Michael,
    Excellent post as well as blog, I just posted a small article on this entry. I hope my readers will find their way to this blog.
    Thanks,
    Abhishek

  5. Maureen Stone on March 24th, 2009

    Michael,

    An interesting exploration. One weakness I see in the visualization, however, is the mapping from speed to color. A monotonically ordered set of values is most naturally mapped to a change in saturation or lightness, not to an interpolation between two colors. If you used saturation for speed, you would then vary lightness to indicate your other quantitative variable (density of pitches).

    However, I suggest you reconsider using color for speed and instead, use color to indicate the type of pitch. You can then use distinct colors (red, green, blue, etc.) to label the types, as labeling is the most effective use of color. The distinct colors would also let you use rings instead of disks, which makes it easier to estimate the density of the data. It could also be combined with a mapping by letter or symbol, to aid those with less than perfect color vision.

    Use the small multiples for quantized speed ranges (Question: are raw speed values the most interesting, or would it be better to have average, plus, minus?). Then you can combine lightness and saturation to indicate pitch density (as in the Brewer ramps).

    Or, maybe it would make more sense for your audience to quantize pitch density and map speed to the ramp. Either way, my intuition is that you would see more interesting patterns in your data than trying to use color for both quantitative dimensions.

    Keep up the good work.

    Maureen

  6. The Importance of color in data visualization on Datavisualization.ch on March 24th, 2009

    […] Michael E. Driscoll over at Data Evolution comes the original article “Color: The Cinderella of dataviz” about the lack of focus on color in visualizations. It’s an elusive read for anybody […]

  7. Mike Williamson on April 12th, 2009

    Hi Michael,

    I originally saw this presentation as you gave it at the “Use R” group last Wed. I then installed “colorspace” on R and played around with it for a little while these past couple days. I have 2 questions, if you don’t mind, since I am curious how you may have handled the same problems I am having:

    1) In order to automatically generate any decent color palette automatically, regardless of gradations, I need to use the “mixcolor” function.
    As this function says in its manual, it mixes colors “additively”. I am not great with color recognition, but it is either this additive mixing, or the fact that it is in fact mixing in the RGB scheme, regardless of what colorspace I put in for “where”, that is messing up the mixing. If I try something similar to what you did for your baseball stuff, and I generate it using mixcolor, I will get a MUCH brighter luminosity than what you show there, so that the “grey” between blue & red is nearly white.
    Is there a way in “R” using mixcolor or whatever to preserve the luminosity when blending colors?

    2) It is clear that the colorspace package is not really “comfortable” with generating colors in anything other than the RGB scheme. I say this because if I try to mixcolor in with colors in anything other than the RGB scheme, it is possible that the mixed color will generate “NA”s for the values. (Specifically, I only tested LAB, LUV, and their polar versions.)
    While I like what everyone is saying that these other color schemes are better for the human eye, and it makes total sense, I have more “fear” of generating a color key with NAs in it (which will simply not plot) than I do of having a color scheme that is less than ideal. Am I doing something wrong, or do others have this problem with mix color? I suppose the completely reasonable solution is to just create a better mixcolor function, has anyone done this so I don’t recreate the wheel?

    Thanks!
    Mike

  8. Michael E. Driscoll on April 15th, 2009

    Hi Mike –

    I’d have to see your code to understand what’s happening, but a few thoughts:

    (1) The LUV and LAB colorspaces separate chromaticity (the u and v coordinates) from luminosity: so luminosity is held constant when you create a mixture of two different chromas. Specifically, here is the code for creating a 2D palette:

    ## builds a 2d palette mixing 2 hues (col1, col2)
    ## and across two luminosities (l1, l2)
    ## returns C, a matrix of the hex RGB values
    library(colorspace)
    plot2d <- function(col1, col2, l1, l2, m, n, ...) {
      C <- matrix(data = NA, ncol = m, nrow = n)
      alpha <- seq(0, 1, length.out = m)
      lum <- seq(l1, l2, length.out = n)
      for (i in 1:n) {
        c1 <- LAB(lum[i], coords(col1)[2], coords(col1)[3])
        c2 <- LAB(lum[i], coords(col2)[2], coords(col2)[3])
        for (j in 1:m) {
          c <- mixcolor(alpha[j], c1, c2)
          C[i, j] <- hex(c, fixup = TRUE)
        }
      }
      return(C)
    }

    (2) Once you make or mix colors in the LAB or LUV space, you need to cast them back into RGB using the ‘hex’ function, but you must include the ‘fixup=TRUE’ parameter in your call to avoid getting NAs in your result. From the documentation for ‘hex’:

    fixup: Should the color be corrected to a valid RGB value before
    correction? The default is to convert out-of-gamut colors to
    the string ‘”NA”‘.

    E.g. write,

    library(colorspace)
    ## 50% mixture of blue and red
    red <- LAB(50,64,64)
    blue <- LAB(50,-48,-48)
    gray <- mixcolor(0.50, red, blue)
    rgbgray <- hex(gray, fixup=TRUE)

    I also thought I’d point folks to the “Building Web Dashboards with R” talk that you reference:

    http://files.meetup.com/1225993/Dataspora_Building_Web_Dashboards_with_R.pdf

  9. O’Reilly Radar on May 4th, 2009

    Big Data: SSD’s, R, and Linked Data Streams…

    If you haven’t seen it, I recommend you watch Andy Bechtolsheim’s keynote at the recent Mysqlconf. We covered SSD’s in our just published report on Big Data management technologies. Since then, we’ve gotten additional signals from our network of al…

how google and facebook are using R

(March 26th Update: Video now available) 

Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled “The R and Science of Predictive Analytics”, co-located with the Predictive Analytics World conference here in SF.

The panel comprised four recognized R users from industry:

  • Bo Cowgill, Google
  • Itamar Rosenn, Facebook
  • David Smith, Revolution Computing
  • Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)

The panelists were asked to explain how they use R for predictive analytics within their firms, its strengths and weaknesses as a tool, and provide a case study. What follows is my summary with comments.

Panel Introduction

I began by describing R as a programming language with strengths in three areas: (i) data manipulation, (ii) statistics, and (iii) data visualization.

What sets it apart from other data analysis tools? It was developed by statisticians, it’s free software, and it is extensible via user-developed packages — there are nearly 2000 of them as of today at the Comprehensive R Archive Network or CRAN.

Many of these packages can be used for predictive analytics. Jim highlighted Max Kuhn’s caret package, which provides a wrapper for accessing dozens of classification and regression models, from neural networks to naive Bayes.

Bo Cowgill, Google

R is the most popular statistical package at Google, according to Bo Cowgill, and indeed Google is a donor to the R Foundation. He remarked that “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” Nonetheless, he’s optimistic: as the R developer community has expanded, R’s documentation has improved and its performance has gained ground.

One theme that Bo first brought up, but which was echoed by others, was that while Google uses R for data exploration and model prototyping, it is not typically used in production: in Bo’s group, R is typically run in a desktop environment.

The typical workflow that Bo thus described for using R was: (i) pulling data with some external tool, (ii) loading it into R, (iii) performing analysis and modeling within R, (iv) implementing a resulting model in Python or C++ for a production environment.

Itamar Rosenn, Facebook

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar’s team used recursive partitioning (via the rpart package) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.

David Smith, Revolution Computing

David’s firm, Revolution Computing, not only uses R, but R is their core business. David said that “we are to R what Red Hat is to Linux”. His firm addresses some of the pain points of using R, such as (i) supporting older versions of the software and (ii) providing parallel computing in R through their ParallelR suite.

David showcased how one of their life sciences clients used R to classify genomic data through use of the randomForest package, and how the analysis of classification trees could be easily parallelized using their ‘foreach’ package.

He also mentioned that several firms they have worked with do use R in production environments, whereby a particular script is exposed on a server, and a client calls it with some data to return a result (several ways exist to set up R in a client-server manner, such as Rserve, rApache, and Biocep).

David evangelizes and educates about R at the Revolutions blog .

Jim Porzak, The Generations Network

Jim (who also co-chairs the R Users Group) gave a brief overview of his PAW talk on using R for marketing analytics. In particular, Jim has used the flexclust package to cluster customer survey data for Sun Microsystems, applying the resulting profiles to identify high-value sales leads.

During the Q & A session, the panelists were asked several questions.

How do you work around R’s memory limitations? (R workspaces are stored in RAM, and thus their size is limited)

Three responses were given (including one from the audience):

(i) use R’s database connectivity (e.g. RMySQL) and pull in only slices of your data, (ii) downsample your data (do you really need a billion data points to test your model?), or (iii) run your scripts on a RAM-obsessed colleague’s machine, or fire up a virtual server on Amazon’s compute cloud — for up to 15 Gigs.

What’s the general ramp-up process for groups wanting to use R?

Itamar and Bo both indicated that within their groups, almost everyone arrived having learned R in their university studies. Jim Porzak led an R tutorial within his last firm using an internal slide deck.

How easy is it for developers who are not statisticians to learn R?

The consensus seemed to be that R is a difficult language to achieve competency in, vis-a-vis Python, Perl, or other high-level scripting languages. Jim emphasized, however, that he is not a statistician – nor were any of our panelists. (As a non-statistician R user myself, I will say this — a consequence of learning R is an improved grasp of statistics. Knowing statistics is a necessary prerequisite for understanding R’s features, from its data types to its modeling syntax).

How well does R interface with other tools and languages?

There are several packages on CRAN for importing and exporting data to and from Matlab (RMatlab), Splus, SAS, Excel, and other tools. In addition, there are interfaces for running R within Python (RPy) and Java (rJava).

The panelists mentioned that they typically run R within a GUI, either R Commander or Rattle. (Aside: I run R exclusively in emacs using ESS — incidentally, one of its authors was panelist David Smith).

A video of the event is now available courtesy of Ron Fredericks and LectureMaker.

COMMENTS

  1. timothy vogel on April 30th, 2009

    Jim emphasized, however, that he is not a statistician – nor were any of our panelists. (As a non-statistician R user myself, I will say this — a consequence of learning R is an improved grasp of statistics. Knowing statistics is a necessary prerequisite for understanding R’s features, from its data types to its modeling syntax).

    R is like any other statistical package; it can help you gain inference if you know what you’re doing. Otherwise, it will produce output much like any poorly written program that doesn’t actually accomplish the task at hand.

    As a 30-year statistician from a top-10 graduate program, I am increasingly distressed by the dominance computer scientists are gaining in the “analytics field” merely for their increased access to the platforms involved and higher-than-average keyboard skills.

    flexclust in R is like FASTCLUS in SAS and Minitab’s old cluster procedure. SPSS, Matlab, Pstat, and Mathematica all have analogs. The truly disturbing aspect of this dynamic is that good statisticians are quite likely to give comp sci types their due, but the comp sci types are trying to corner the analytics market via their sheer advantage vis-a-vis platform expertise/access!

  2. Michael Wexler on February 22nd, 2009

    Great post! The all-in-memory problem will continue to hold back R’s utility, but there are some great efforts afoot to fix this, everything from parallelization to new ways to store the data in a memory-mapped-to-disk approach (ala S, SPSS, SAS, etc.) For example, see http://www.r-project.org/conferences/useR-2007/program/posters/adler.pdf as a promising approach.

    I don’t believe sampling is always the right answer; unless you understand the underlying distribution, your sampling could cause you to miss some subtle effects. I look forward to these new approaches which can handle all the data at once…

    For those interested in more hints and tips around using R that I’ve collected in my journey from novice to dangerous novice, please see http://www.nettakeaway.com/tp/?s=R where I review some GUIs, collected links, tips about coming from an SPSS environment, etc.

the case for open source data visualization

When I was in graduate school, the most closely studied part of the scientific publications we read was not the results, but the methods sections. (It was also, incidentally, often the hardest section to write for one’s own publications.) Methods sections are wonderful because they allow you to verify that someone else’s work is correct — by reproducing it yourself. But more importantly, methods sections allow you to build upon the work of others. They are the open source code of science.

Unfortunately, for all but a small fraction of data visualizations on the web, there are no methods sections being published. This is a shame, because it slows the free flow of ideas and prevents the creative extension of other people’s work.

Three conditions must be met for a data visualization to be considered open and reproducible:

  • Open Tools — The software tool used for the visualization must be freely available. Thankfully, many of the most powerful visualization software tools, languages, and frameworks are now open source, such as Processing, Prefuse, Actionscript, and R.
  • Open Code (or Methods) — The actual code, script, and/or series of steps taken to generate the visualization must be published. (For example, Lee Byron released his code for a walkability heatmap of San Francisco.)
  • Open Data — The data which is visualized should also be available in the same washed and scrubbed format that was used for the visualization. Ideally any code used to clean up the data might also be shared.

I grade some of the web’s existing data visualization sites using these criteria.

  • The New York Times routinely creates stunning graphics (like a visualization of 22 years of box office receipts), but we are left to guess how they were created. Grade: D
  • VisualComplexity, a graphics gallery of mostly complex networks (like genome networks), has pretty images but neither data nor visualization code. Grade: D
  • IBM’s ManyEyes has gorgeous visualizations (some of which are made with the Prefuse toolkit), and while the data is made available, the source code for the visualization is not. Grade: C
  • Processing’s exhibition page highlights several extraordinary visualizations created with its open-source framework. But unfortunately, no source code is available from the visual artists. Grade: C
  • The R Graphics Gallery does make source code for graphics available, but in more than half of the cases, no data is available. Grade: B