The Rise of the Data Web


The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky” into a vibrant ecosystem of data where documents, and human actors, will play an ever smaller role.

As others have noted, we’ve reached a tipping point in history: more data is being manufactured by machines — servers, cell phones, GPS-enabled cars — than by people. The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.

Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext. We have likewise industrialized the collection of data and spliced the human steps out of many data flows, such that data entry clerks may soon be as rare as typesetters.

The web we experience will continue to be dominated by documents: e-mail, blogs, and news. And while many sites are data-centric, such as Google Maps and Yahoo Finance, it’s the web that we can’t see that is surging with data. It’s not about us; it’s about servers in the cloud mediating entire pipelines of data, only occasionally surfacing in a browser.

But the web’s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data. As we build out the data web, we ought to embrace standards that mirror data’s form in its natural habitats — as programmatic data structures, relational tables, or key-value pairs — while taking advantage of data’s stream-like nature. Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.

Sacred “Words & Enthusiasm” vs Meaningless Utterances

Documents and data are different. The table below reflects my thin grasp of the fissure lines, as a step towards arguing why we ought to design around them.

Documents are made of “words and enthusiasm”: sonnets, cake recipes, blog posts, Supreme Court rulings, and dictionary definitions. Their core stuffing is text. Their structure is unpredictable and irregular — even fractal.

Data are not created but collected (something given, not something made): city temperatures, stock prices, web visitors, and home runs. They are observations in time and space, with periodic and predictable structure. Data are re-orderable and divisible: you can relay city temperatures in any order, but you can’t rearrange a Shakespearian sonnet without muddling its meaning. Some documents are so meaningful as to be considered sacred.
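To make "re-orderable and divisible" concrete, here is a minimal sketch in Python. The city readings and the `mean_temp` helper are invented for illustration. It only shows that a summary of the data survives both shuffling and chunking.

```python
import random

# Hypothetical observations: (city, temperature in degrees Celsius).
readings = [
    ("Oslo", 4.0),
    ("Cairo", 31.0),
    ("Lima", 19.0),
    ("Tokyo", 16.0),
]

def mean_temp(rows):
    """Average temperature over any collection of (city, temp) rows."""
    rows = list(rows)
    return sum(temp for _, temp in rows) / len(rows)

# Re-orderable: shuffling the rows leaves the aggregate unchanged.
shuffled = readings[:]
random.Random(0).shuffle(shuffled)

# Divisible: summarize two chunks separately, then combine the summaries.
left, right = readings[:2], readings[2:]
combined = (
    mean_temp(left) * len(left) + mean_temp(right) * len(right)
) / len(readings)

print(mean_temp(readings), mean_temp(shuffled), combined)
```

All three printed values agree. Shuffling the lines of a sonnet offers no such guarantee.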

Data are, in this regard, meaningless on their own; they do not signify, they simply are. These data are the utterances of the spimes that surround us.

Documents as Trees, Data as Streams

The argument for shifting away from markup languages as data formats is not just practical, it’s philosophical: it’s about pivoting our conception away from the dominant metaphor of documents — trees — towards one far more suitable for data — streams.

Trees are rooted and finite: you can’t chop up a tree and easily put it back together again (while XML has made concessions to document fragments, it is not a natural fit).

Streams can be split, sampled, and filtered. The divisibility of data streams lends itself to parallelism in a way that document trees do not. The stream paradigm conceives of data as extending infinitely forward in time. The Twitter data stream has no end: it ought to have no end tag.
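As a rough sketch of what splitting, sampling, and filtering look like in practice, Python generators model an endless stream reasonably well. The `sensor_stream` source and the helper names below are hypothetical, made up for this illustration. A real feed would arrive over a network rather than from a random number generator.

```python
import itertools
import random

def sensor_stream(seed=0):
    """Hypothetical unbounded source: yields (tick, value) forever."""
    rng = random.Random(seed)
    for tick in itertools.count():
        yield tick, rng.random()

def filtered(stream, threshold):
    """Filter: keep only readings whose value exceeds the threshold."""
    return (item for item in stream if item[1] > threshold)

def sampled(stream, every):
    """Sample: pass through every n-th item of the stream."""
    return itertools.islice(stream, 0, None, every)

def shard(stream, index, count):
    """Split: keep the slice of the stream whose tick falls in this shard."""
    return (item for item in stream if item[0] % count == index)

# Stages compose lazily; nothing is computed until items are requested.
pipeline = sampled(filtered(sensor_stream(), 0.5), every=3)
first_five = list(itertools.islice(pipeline, 5))

# One of two shards that separate workers could consume in parallel.
even_head = list(itertools.islice(shard(sensor_stream(), 0, 2), 3))
```

Nothing here ever asks for the end of the stream. Each consumer takes as much as it needs with `itertools.islice` and stops.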

Conceiving of data as streams moves us out of the realm of static objects and into the realm of signal processing. This is the domain of the living: where the web is not an archive but an organism, reacting in real-time.

XML Considered Harmful for Data

XML is a poor language for data because it solves the wrong problems (those of documents) while leaving many of data’s unique issues unaddressed. But many promising alternatives exist (microformats like JSON, Thrift, and even SQLite’s file format), as I will detail in my next post.
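One small illustration of the mismatch, offered as a sketch rather than a benchmark: cut a transmission off partway and see what each format leaves you with. The records are invented, and I am assuming newline-delimited JSON (one record per line) as the stand-in alternative. Incremental XML parsers do exist, so this shows the default tree-oriented path, not a hard limit of XML.

```python
import json
import xml.etree.ElementTree as ET

# The same three hypothetical readings in two serializations.
xml_doc = (
    "<readings>"
    "<reading city='Oslo' temp='4.0'/>"
    "<reading city='Cairo' temp='31.0'/>"
    "<reading city='Lima' temp='19.0'/>"
    "</readings>"
)
json_lines = (
    '{"city": "Oslo", "temp": 4.0}\n'
    '{"city": "Cairo", "temp": 31.0}\n'
    '{"city": "Lima", "temp": 19.0}\n'
)

def parse_xml(text):
    """Return parsed readings, or None if the tree is not well-formed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return [(r.get("city"), float(r.get("temp"))) for r in root]

def parse_json_lines(text):
    """Parse each complete line independently, skipping a cut-off tail."""
    records = []
    for line in text.splitlines():
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        records.append((obj["city"], obj["temp"]))
    return records

# Simulate a transmission cut off partway through the last record.
cut_xml = xml_doc[:-25]
cut_json = json_lines[:-10]

print(parse_xml(cut_xml))          # the whole tree is rejected
print(parse_json_lines(cut_json))  # the complete records survive
```

The tree parser needs its closing tag before it will hand anything over. The line-oriented format degrades one record at a time.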

(Originally published August 20, 2009 on the Dataspora blog.)

Published by Michael Driscoll

Founder @RillData. Previously @Metamarkets. Investor @DCVC. Lapsed computational biologist.
