Machine learning and artificial intelligence have arrived in the data center, changing the face of the hyperscale server farm as racks begin to fill with ASICs, GPUs, FPGAs and supercomputers.
These technologies provide more computing horsepower to train machine learning systems, a process that involves crunching enormous amounts of data. The end goal is to create smarter applications and improve the services you already use every day.
“Artificial intelligence is now powering things like your Facebook News Feed,” said Jay Parikh, Global Head of Engineering and Infrastructure for Facebook. “It is helping us serve better ads. It is also helping make the site safer for people that use Facebook on a daily basis.”
“Machine Learning is transforming how developers build intelligent applications that benefit customers and consumers, and we’re excited to see the possibilities come to life,” said Norm Jouppi, Distinguished Hardware Engineer at Google.
Much of the computing power behind these services will be delivered from the cloud. As a result, cloud builders are adopting hardware acceleration techniques long common in high performance computing (HPC), which are now making their way into the hyperscale computing ecosystem.
The race to leverage machine learning is led by the industry’s marquee names, including Google, Facebook and IBM. As usual, the battlefield runs through the data center, with implications for the major cloud platforms and chipmakers like Intel and NVIDIA.
Neural networks are computing systems that emulate the learning process of the human brain to solve new challenges, a process that requires lots of computing horsepower. That’s why the leading players in the field have moved beyond traditional CPU-driven servers and are now building systems that accelerate the work. In some cases, they’re creating their own chips.
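To give a sense of why this work is so compute-hungry, here is a deliberately tiny sketch of the training loop at the heart of any neural network: a forward pass, an error measurement, and a weight update, repeated thousands of times. This toy two-layer network learning the XOR function is purely illustrative and bears no relation to the systems Google or Facebook run; production models repeat the same loop over billions of weights, which is exactly the arithmetic that GPUs and ASICs accelerate.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic toy problem a single-layer network cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights for a 2 -> 4 -> 1 network, randomly initialized.
W1 = rng.normal(size=(2, 4))
W2 = rng.normal(size=(4, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for step in range(5000):
    # Forward pass: compute the network's current predictions.
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)

    # Mean squared error between predictions and targets.
    losses.append(np.mean((out - y) ** 2))

    # Backward pass: propagate the error back to each weight matrix.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent: nudge the weights to reduce the error.
    W2 -= 1.0 * (h.T @ d_out)
    W1 -= 1.0 * (X.T @ d_h)
```

Every iteration is dominated by the matrix multiplications in the forward and backward passes; it is those multiplications, scaled up enormously, that specialized hardware is built to speed up.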
Last week Google revealed the Tensor Processing Unit (TPU), a custom ASIC tailored for TensorFlow, an open source software library for machine learning that was developed by Google. An ASIC (Application-Specific Integrated Circuit) is a chip designed to perform a specific task. Recent examples of ASICs include the custom chips used in bitcoin mining. Google has used its TPUs to squeeze more operations per second into the silicon.
“We’ve been running TPUs inside our data centers for more than a year, and have found them to deliver an order of magnitude better-optimized performance per watt for machine learning,” writes Norm Jouppi, Distinguished Hardware Engineer, on the Google blog. “This is roughly equivalent to fast-forwarding technology about seven years into the future (three generations of Moore’s Law).”
A board with a TPU fits into a hard disk drive slot in a data center rack. Google used its TPU infrastructure to power AlphaGo, the software program that defeated world Go champion Lee Sedol in a match. Go is a complex board game in which human players had long maintained an edge over computers, which overtook humans years ago in games like chess and “Jeopardy.” The complexities of Go presented a challenge to artificial intelligence technology, but the extra power supplied by TPUs helped Google’s program solve more difficult computational challenges and defeat Sedol.
“Our goal is to lead the industry on machine learning and make that innovation available to our customers,” writes Jouppi. “Building TPUs into our infrastructure stack will allow us to bring the power of Google to developers across software like TensorFlow and Cloud Machine Learning with advanced acceleration capabilities.”
Facebook’s AI Lab is using GPUs to bring more horsepower to bear on data-crunching for its artificial intelligence (AI) and machine learning platform.
“We’ve been investing a lot in our artificial intelligence technology,” said Parikh.