Machine learning used to take place behind the scenes: Amazon mined your clicks and purchases for recommendations, Google mined your searches for ad placement, and Facebook mined your social network to choose which posts to show you. But now machine learning is on the front pages of newspapers, and the subject of heated debate. Learning algorithms drive cars, translate speech, and win at Jeopardy! What can and can’t they do? Are they the beginning of the end of privacy, work, even the human race? This growing awareness is welcome, because machine learning is a major force shaping our future, and we need to come to grips with it. Unfortunately, several misconceptions have grown up around it, and dispelling them is the first step. Let’s take a quick tour of the main ones:
Machine learning is just summarizing data. In reality, the main purpose of machine learning is to predict the future. Knowing the movies you watched in the past is only a means to figuring out which ones you’d like to watch next. Your credit record is a guide to whether you’ll pay your bills on time. Like robot scientists, learning algorithms formulate hypotheses, refine them, and only believe them when their predictions come true. Learning algorithms are not yet as smart as scientists, but they’re millions of times faster.
Learning algorithms just discover correlations between pairs of events. This is the impression you get from most mentions of machine learning in the media. In one famous example, an increase in Google searches for “flu” is an early sign that it’s spreading. That’s all well and good, but most learning algorithms discover much richer forms of knowledge, such as the rule “If a mole has irregular shape and color and is growing, then it may be skin cancer.”
Machine learning can only discover correlations, not causal relationships. In fact, one of the most popular types of machine learning consists of trying out different actions and observing their consequences — the essence of causal discovery. For example, an e-commerce site can try many different ways of presenting a product and choose the one that leads to the most purchases. You’ve probably participated in thousands of these experiments without knowing it. And causal relationships can be discovered even in some situations where experiments are out of the question, and all the computer can do is look at past data.
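To make the e-commerce example concrete, here is a minimal sketch of one simple way a site could try out presentations and learn from the consequences: an epsilon-greedy strategy. It is an illustration, not a description of what any particular site does. The function name, the purchase rates, and the visitor count are all made up, and the "visitors" are simulated with random numbers.

```python
import random

def run_epsilon_greedy(true_rates, n_visitors=20000, epsilon=0.1, seed=0):
    """Simulate showing one of several product presentations to each visitor.

    true_rates holds hypothetical purchase probabilities, one per presentation.
    The algorithm never sees them; it only observes who buys.
    """
    rng = random.Random(seed)
    n_variants = len(true_rates)
    shows = [0] * n_variants      # how often each presentation was shown
    purchases = [0] * n_variants  # purchases observed for each presentation

    for _ in range(n_visitors):
        if rng.random() < epsilon or 0 in shows:
            # Explore: try a presentation at random.
            variant = rng.randrange(n_variants)
        else:
            # Exploit: show the presentation with the best observed rate so far.
            variant = max(range(n_variants), key=lambda i: purchases[i] / shows[i])
        shows[variant] += 1
        # Observe the consequence of the action (simulated purchase).
        if rng.random() < true_rates[variant]:
            purchases[variant] += 1

    estimates = [p / s if s else 0.0 for p, s in zip(purchases, shows)]
    best = max(range(n_variants), key=lambda i: estimates[i])
    return best, estimates, shows

best, estimates, shows = run_epsilon_greedy([0.02, 0.10, 0.04])
print("best presentation:", best)
print("estimated purchase rates:", [round(e, 3) for e in estimates])
print("times shown:", shows)
```

Because the program chooses which presentation each visitor sees, a difference in purchase rates reflects the effect of the presentation itself rather than a mere correlation in past data.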
Machine learning can’t predict previously unseen events, a.k.a. “black swans.” If something has never happened before, its predicted probability must be zero — what else could it be? On the contrary, machine learning is the art of predicting rare events with high accuracy. If A is one of the causes of B and B is one of the causes of C, A can lead to C, even if we’ve never seen it happen before. Every day, spam filters correctly flag freshly concocted spam emails. Black swans like the housing crash of 2008 were in fact widely predicted — just not by the flawed risk models most banks were using at the time.
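The spam example can be sketched in a few lines. The toy filter below is a naive Bayes classifier with add-one smoothing, which is one common textbook approach and not necessarily what any real spam filter uses. The training messages are invented for illustration. The point is that a message the filter has never seen still gets a sensible, nonzero probability, because it is built from words the filter has seen.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs, with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab

def spam_probability(text, model):
    """Return the estimated probability that text is spam."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing: a word never seen with this label still gets
            # a small nonzero probability (the extra +1 covers unknown words).
            count = word_counts[label][word]
            score += math.log((count + 1) / (n_words + len(vocab) + 1))
        log_score[label] = score
    # Turn the two log scores into a probability.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])

# Invented training data, for illustration only.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap pills free shipping", "spam"),
    ("meeting moved to noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
    ("please review the report", "ham"),
]
model = train(training)

fresh_spam = "claim free money tomorrow"  # this exact message was never seen
print(spam_probability(fresh_spam, model))
```

The freshly concocted message scores well above 0.5, even though it appears nowhere in the training data.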
The more data you have, the more likely you are to hallucinate patterns. Supposedly, the more phone records the NSA looks at, the more likely it is to flag an innocent as a potential terrorist because he accidentally matched a terrorist detection rule. Mining more attributes of the same entities can indeed increase the risk of hallucination, but machine learning experts are very good at keeping it to a minimum.
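The risk from mining more attributes of the same entities is easy to see in a small simulation. The sketch below assumes 20 entities, a target that is pure random noise, and up to 1,000 attributes that are also pure noise; all of these numbers are arbitrary choices for illustration. Nothing here is truly related to anything else, yet the strongest correlation found keeps rising as more attributes are examined.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strongest_spurious_correlation(n_entities=20, attribute_counts=(10, 100, 1000), seed=0):
    """Return {k: largest |correlation| among the first k random attributes}."""
    rng = random.Random(seed)
    # The target is noise, so every correlation found with it is a hallucination.
    target = [rng.gauss(0, 1) for _ in range(n_entities)]
    correlations = []
    for _ in range(max(attribute_counts)):
        attribute = [rng.gauss(0, 1) for _ in range(n_entities)]
        correlations.append(abs(pearson(attribute, target)))
    return {k: max(correlations[:k]) for k in attribute_counts}

results = strongest_spurious_correlation()
for k, r in results.items():
    print(f"{k:5d} attributes: strongest correlation with noise = {r:.2f}")
```

A learner that simply kept the best-looking attribute would be fooled here, which is why checks such as testing on held-out data matter when many attributes are mined.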