One of the holy grails of machine learning is the creation of a system that can “read the web” and learn from it, as Isaac Newton read Euclid’s Elements and taught himself geometry.
Imagine a mythical beast that could speed-read one hundred million pages per second, consuming every Wikipedia entry, every scientific article on arxiv.org, every out-of-copyright scanned book, and that, beyond just indexing that information, could actually reason with it.
Building an intelligent machine isn’t the hard part. It’s building a learning machine, one that mirrors the magic by which a teenager learns to drive a car, play chess, or do calculus in a period of a few dozen hours – that’s the magic that we haven’t yet figured out.
But I wonder if some of our challenges in creating this mythical learning machine lie with what we’re trying to feed the beast. After all, the web of documents was written for human consumption. Natural language is a lossy compression algorithm; it maps the massive varieties of our experiences into semantic text. A high-frequency sensory stream of sights, sounds, and experiences gets hashed into “cold sidewalks are slippery.”
To that end, if we want machines to reason about our world, let’s stop giving them our digested cud of content. Let’s provide them direct experience, via the sensor streams that our instrumented planet is emitting from weather stations, transit networks, electrical grids, smartphones, Fitbits, and GPS devices. With that data, machines might begin to intuit relationships between weather and sidewalk slips – in forms beyond what our own human minds can comprehend.
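As a toy illustration of the idea, consider a machine handed raw readings rather than the sentence “cold sidewalks are slippery.” The sketch below is entirely hypothetical: the sensor feed is simulated, the feature names and slip reports are invented, and even a trivial conditional-rate comparison is enough to rediscover the buried rule from data alone.

```python
# Hypothetical sketch: a machine given raw sensor readings, not prose
# summaries, can surface "cold sidewalks are slippery" from the data itself.
# All readings below are synthetic and illustrative; no real feeds are used.
import random

random.seed(0)

# Simulate a year of hourly readings: air temperature (Celsius) from a
# weather station, whether the pavement was wet, and whether a slip
# incident was reported nearby.
readings = []
for _ in range(8760):
    temp_c = random.uniform(-15, 30)
    wet = random.random() < 0.3
    # Ground truth the machine is supposed to rediscover:
    # slips cluster when it is wet and at or below freezing.
    slip_prob = 0.4 if (wet and temp_c <= 0) else 0.02
    slip = random.random() < slip_prob
    readings.append((temp_c, wet, slip))

def slip_rate(rows):
    """Fraction of readings in which a slip was reported."""
    return sum(r[2] for r in rows) / len(rows)

freezing_wet = [r for r in readings if r[1] and r[0] <= 0]
other = [r for r in readings if not (r[1] and r[0] <= 0)]

print(f"slip rate, wet & freezing: {slip_rate(freezing_wet):.2f}")
print(f"slip rate, otherwise:      {slip_rate(other):.2f}")
```

A real system would of course ingest live feeds and fit far richer models, but the point stands: the relationship lives in the sensor stream, not in any document that summarizes it.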
It’s data, not documents, that the mythical machine learning beast will eat.