Apache Hadoop, the open source software framework at the heart of big data, is a batch computing engine. It is not well-suited to the online, interactive data processing required for truly real-time data insights. Or is it? Doug Cutting, creator of Hadoop, founder of the Apache Hadoop Project and chief architect at Cloudera, says he believes Hadoop has a future beyond batch.
“I think batch has its place,” Cutting says. “If you’re moving bulk amounts of data and you need to really analyze everything, that’s not about interactive. But the combination of batch and online computation is what I think people will really appreciate.”
“I really see Hadoop becoming the kernel of the mainstream data processing system that businesses will be using,” he adds.
Speaking at the O’Reilly Strata Conference + Hadoop World in New York City, Cutting explains his thoughts on the core themes of the Hadoop stack and where it’s heading.
“Hadoop is known as a batch computing engine and indeed that’s where we started, with MapReduce,” Cutting says. “MapReduce is a wonderful tool. It’s a simple programming metaphor that has found many applications. There are books on how to implement a variety of algorithms on MapReduce.”
MapReduce is a programming model, designed by Google, for batch processing massive datasets in parallel on a distributed cluster. It breaks an input into many smaller sub-problems and distributes them to nodes that process them in parallel, then reassembles the answers to those sub-problems to form the output.
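The split, process and reassemble flow described above can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group every emitted value by its key, across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a separate node;
    # here the chunks are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big", "data batch"]))
```

Each chunk is mapped independently of the others, which is what lets a framework hand the chunks to different machines and only bring the results together at the reduce step.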
“It’s also very efficient,” Cutting says. “It permits you to move your computation to your data, so you’re not copying data around as you’re processing it. It also forms a shared platform. Building a distributed system is a complicated process, not something you can do overnight. So we don’t want to have to re-implement it again and again. MapReduce has proved itself a solid foundation. We’ve seen the development of many tools on top of it such as Pig and Hive.”
“But, of course, this platform is not just for batch computing,” he adds. “It’s a much more general platform, I believe.”
To illustrate this, Cutting lays out what he considers the two core themes of Hadoop as it exists today, together with a few other things that he considers matters of “style.”
First and foremost, he says, the Hadoop platform is defined by its scalability. It works just fine on small datasets stored in memory, but is capable of scaling massively to handle huge datasets.
“A big component of scalability that we don’t hear a lot talked about is affordability,” he says. “We run on commodity hardware because it allows you to scale further. If you can buy 10 times the amount of storage per dollar, then you can store 10 times the amount of data per dollar. So affordability is key, and that’s why we use commodity hardware, because it is the most affordable platform.”
Just as important, he notes, Hadoop is open source.
“Similarly, open source software is very affordable,” he adds. “The core platform that folks develop their applications against is free. You may pay vendors, but you pay vendors for the value they deliver, you don’t keep paying them year after year even though you’re not getting anything fundamentally new from them. Vendors need to earn your trust and earn your confidence by providing you with value over time.”
Beyond that, he says, there are what he considers elements of Hadoop’s style.
“There’s this notion that you don’t need to constrain your data with a strict schema at the time you load it,” he says. “Rather, you can afford to save your data in a raw form and then, as you use it, project it to various schemas. We call this schema on read.
“Another popular theme in the big data space is that oftentimes simply having more data is a better way to understand your problem than to have a more clever algorithm.”
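The schema-on-read idea Cutting describes, storing raw data and projecting it onto a schema only when it is used, can be illustrated with a small Python sketch. Everything here is hypothetical and chosen for the example: the sample records, the `read_with_schema` helper and the two schemas are assumptions, not part of Hadoop or any tool in its stack.

```python
import json

# Raw records are stored exactly as they arrived; no schema was enforced at load time.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "country": "BR"}',
    '{"user": "li", "amount": "5", "device": "mobile"}',
]


def read_with_schema(raw_lines, schema):
    """Project raw JSON lines onto a schema at read time.

    `schema` maps a field name to a converter function. Fields missing
    from a record come back as None; fields not in the schema are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two different views of the same raw data, each defined only when it is needed.
billing_schema = {"user": str, "amount": float}
geo_schema = {"user": str, "country": str}

billing_rows = list(read_with_schema(RAW_EVENTS, billing_schema))
geo_rows = list(read_with_schema(RAW_EVENTS, geo_schema))

if __name__ == "__main__":
    print(billing_rows)
    print(geo_rows)
```

The same stored records serve both views, and adding a third view later would require no reloading or migration of the raw data.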