The Age of Data

24 Nov 2010

The last decade has been about data. Computers and the internet have made available unfathomable amounts of data on just about anything (albeit a lot of it of questionable quality). More than 10 years ago, Google changed how we search through a very small part of it by understanding the importance of statistical analysis. They called it PageRank.

Companies like Google, Facebook and LinkedIn build their businesses around collecting data about their users and selling it to their customers. But the value is not in the data itself; it is in the relations, the correlations hidden within the terabytes. It is visible every time they “magically” suggest other people we might know or when Amazon displays a different book, equally interesting. iTunes Genius and Netflix work similarly.
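To make the idea of "relations hidden in the data" a little more concrete, here is a minimal sketch of one simple way such suggestions can be produced: counting which items turn up together. It is not how Amazon, Netflix or anyone else named above actually does it. The purchase histories and the function names `co_occurrence` and `suggest` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, invented for this example.
baskets = [
    {"Book A", "Book B"},
    {"Book A", "Book B", "Book C"},
    {"Book A", "Book C"},
    {"Book B", "Book D"},
]


def co_occurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def suggest(item, baskets, top_n=2):
    """Return up to top_n items most often seen together with `item`."""
    counts = co_occurrence(baskets)
    related = [(other, n) for (first, other), n in counts.items() if first == item]
    # Most frequent first; ties are broken alphabetically so the result is stable.
    related.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in related[:top_n]]


print(suggest("Book A", baskets))
```

With only four baskets the result is trivial, but the same counting applied to millions of histories is where the seemingly magical suggestions start to appear. Real systems presumably use far more refined statistics than raw counts.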

Like the beauty of fractals, which the late Benoit Mandelbrot helped describe, and which we cannot comprehend without the billions of calculations possible on computers, we need help making sense of the seemingly disconnected bits of information that are everywhere.

Hal Varian, chief economist at Google, put it like this:

I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.

Rockstar statistician Hans Rosling has made the important point that it is not data we lack. It is an understanding of it. His remarkable appearances on TED made it clear that our beliefs about the world can be very far from the truth (YouTube list). He founded Gapminder to visualize complex or large datasets in order to help us understand, and brilliantly so.

Two years ago Wired published a story called The End of Theory, arguing that data is changing the scientific method. We no longer depend on first having an idea for a theory. Rather, computers will be programmed to look for patterns in the sea of data and help us find correlations. Statistics is increasingly becoming the foundation of knowledge. From data visualizations in newspapers to research in medicine and physics, data science is coming.

Google recently released Google Refine, based on software by Freebase, that makes it easy to work with messy data, clean it up and convert it to a variety of formats. Good data is the basis for useful statistics, and Refine makes getting there a lot easier than juggling regular expressions.

There are now several sources online (IBM’s Many Eyes, Gapminder and Freebase to name a few), each with huge datasets on diverse subjects, ready for analysis. Around the world, governments are opening their databases to the public, allowing anyone to find new relations in the data or make it accessible to a greater audience.

Understanding the data we have will help us understand the world, and statistics, data parsing and communication are how to do it. To end with a quote from Hans Rosling:

The seemingly impossible is possible. We can have a good world.
