Is data the modern oracle, the oil that will power the next industrial revolution—or just another round of business hype?
Of course it's true that there is more of the stuff, more information in forms that computers can collect and process, than ever in human history. Even trying to quantify it is a fool's errand, when yesterday's "biggest dataset in the world" becomes today's portable hard drive. But there is more to it than size. After years of talking to people who use big data in fields from dating apps to the search for the Higgs boson, I managed to reverse-engineer my analysis into a handy acronym: DATA.
D is for dimensions, or diverse, or different datasets. By combining very different types of information, we can get new insights. Brain scans alone are informative, but combine them with health records, postcodes and weather reports, and you can test a hypothesis that vitamin D intake affects the progression of multiple sclerosis, for example.
A is for automatic. We do so many things through our digital devices, whether phones, computers or wearables, that collecting data is now the default. Every time you touch into a transport system, or pay with a bank card, or connect to a wifi network, you're adding to somebody's database. Much of the processing of that data is also automatic, invisible, opaque.
T is for time. Data streams into the databases almost in real time, making it easy to spot emerging patterns, and then to project that timeline into the future. That goes beyond obvious things like traffic flows: add "sentiment analysis" of our social media activity to sales records and weather forecasts, and you can predict the first big barbecue weekend of the year.
A is for AI, artificial intelligence. That's what spots the patterns in the tsunami of numbers. Yes, computers can calculate faster and more accurately than any human, but by using machine learning they do far more. Through trial and error, software modeled on aspects of how humans learn can sort images like brain scans (male/female or healthy/diseased) or more complex documents like job applications.
And this is where the dilemmas start to emerge.
The idea is that, unlike a biased human recruiter, a hiring algorithm will go on objective data. It won't take into account categories of human prejudice like race or gender. And if any disgruntled applicant disputes your hiring decision, you can claim that you followed procedure to the letter.
Even if it turns out that the algorithm got it wrong, and the new employee runs off with all the company's cash, at least you won't have to carry the can. You followed procedure, didn't you? Is it your fault if this candidate was the 1 percent, the exception that proves the rule is probabilistic, not absolute?
But what if you are the other 1 percent, the applicant whose scores are lousy, for reasons over which you have no control, but who would make the best employee if somebody would just give you the chance?
Say you live in the wrong part of town, too far from the workplace. Or you've had a lot of time off sick lately. Or your friends tagged you in a Facebook photo with a jokey reference to smoking weed.