In the burgeoning field of computer science known as machine learning, engineers often refer to the artificial intelligences they create as “black box” systems: Once a machine learning engine has been trained from a collection of example data to perform anything from facial recognition to malware detection, it can take in queries—Whose face is that? Is this app safe?—and spit out answers without anyone, not even its creators, fully understanding the mechanics of the decision-making inside that box.
But researchers are increasingly proving that even when the inner workings of those machine learning engines are inscrutable, they aren’t exactly secret. In fact, they’ve found that the guts of those black boxes can be reverse-engineered and even fully reproduced—stolen, as one group of researchers puts it—with the very same methods used to create them.
In a paper they released earlier this month titled “Stealing Machine Learning Models via Prediction APIs,” a team of computer scientists at Cornell Tech, the Swiss institute EPFL in Lausanne, and the University of North Carolina detail how they were able to reverse engineer machine learning-trained AIs based only on sending them queries and analyzing the responses. By training their own AI with the target AI’s output, they found they could produce software that was able to predict with near-100% accuracy the responses of the AI they’d cloned, sometimes after a few thousand or even just hundreds of queries.
“You’re taking this black box and through this very narrow interface, you can reconstruct its internals, reverse engineering the box,” says Ari Juels, a Cornell Tech professor who worked on the project. “In some cases, you can actually do a perfect reconstruction.”
The trick, they point out, could be used against services offered by companies like Amazon, Google, Microsoft, and BigML that allow users to upload data into machine learning engines and publish or share the resulting model online, in some cases with a pay-by-the-query business model. The researchers’ method, which they call an extraction attack, could duplicate AI engines meant to be proprietary, or in some cases even recreate the sensitive private data an AI has been trained with. “Once you’ve recovered the model for yourself, you don’t have to pay for it, and you can also get serious privacy breaches,” says Florian Tramer, the EPFL researcher who worked on the AI-stealing project before taking a position at Stanford.
In other cases, the technique might allow hackers to reverse engineer and then defeat machine-learning-based security systems meant to filter spam and malware, Tramer adds. “After a few hours’ work…you’d end up with an extracted model you could then evade if it were used on a production system.”
The researchers’ technique works by essentially using machine learning itself to reverse engineer machine learning software. To take a simple example, a machine-learning-trained spam filter might put out a simple spam or not-spam judgment of a given email, along with a “confidence value” that reveals how likely it is to be correct in its decision. That answer can be interpreted as a point on either side of a boundary that represents the AI’s decision threshold, and the confidence value shows its distance from that boundary. Repeatedly trying test emails against that filter reveals the precise line that defines that boundary. The technique can be scaled up to far more complex, multidimensional models that give precise answers rather than mere yes-or-no responses. (The trick even works when the target machine learning engine doesn’t provide those confidence values, the researchers say, but requires tens or hundreds of times more queries.)
The researchers tested their attack against two services: Amazon’s machine learning platform and the online machine learning service BigML. They tried reverse engineering AI models built on those platforms from a series of common data sets.