Apache Hadoop and Apache Spark are complex technologies, and many organizations misunderstand how to use the two architectures together. Investing in both technologies enables a broad set of big data analytics and application development use cases.
Niru Anisetti is program director on the product management team for Spark offerings and next-generation big data platforms at IBM, and Rohan Vaidyanathan is senior offering manager at IBM and a leading light on the IBM Cloud Data Services team. Anisetti is an award-winning product specialist with a background in software engineering and product development and experience in both Hadoop and Spark. While working in the big data space during the past three years, Vaidyanathan has witnessed an explosion in the number and variety of organizations adopting big data technologies such as Hadoop and Spark. He has also observed the recent trend toward using data services in the cloud. We recently spoke to Anisetti and Vaidyanathan to explore some of the common misconceptions about Hadoop and Spark and to understand the unique strengths of using the two architectures together.
Rohan Vaidyanathan: Many companies that have already invested in Hadoop, Spark or both absolutely know what they are doing. But a large group of organizations is on the verge of adopting big data, and a few key misconceptions often cause problems as these organizations start to define their new big data architecture.
For example, many articles online talk about Spark as a successor to Hadoop, or even as a replacement for Hadoop. Spark can execute a job 10 to 100 times faster than Hadoop or, to be precise, MapReduce. But there’s much more to Spark than just the runtime aspects of a cluster.
Niru Anisetti: My short response is, “no.” Many companies that we have started engaging with around Spark are just exploring analytics with their sample data. IBM’s best estimate is that around 90 percent of organizations are struggling to find analytics solutions with a good return on investment (ROI) and are still in the planning stage. We need to do better at dispelling some of the myths and misunderstandings if we’re going to help clients move forward on their big data journeys.
Anisetti: I like to use the analogy of a car. Spark is like a high-performance engine; it powers the work that you want to do with your data, and it can be bolted to all kinds of different chassis: data platforms such as object storage, IBM Cloudant or Hadoop. Hadoop can provide one of the possible storage layers that fuel the Spark engine with data.
Vaidyanathan: The key point is that Spark has no notion of storage within it. If you’re a data scientist using Spark in a Jupyter notebook to do some ad hoc analysis on a small data set residing in an object store, that’s fine. But what happens when you discover some exciting new way of gaining insight into that data, and you want to operationalize it on a massive scale with huge data sets and thousands of users? You need a data platform to ingest the data, store it, manage it and keep it secure. And you also need a robust framework for data governance to help you maintain quality and provide traceability.
Spark doesn’t provide those broader capabilities; it’s purely an engine for high-speed distributed data processing. Of course, Spark is an incredibly exciting technology and has all sorts of cool use cases, from stream processing to machine learning to real-time analytics, which is why we’re using it as an engine for more than 25 IBM products. But most real-world use cases also require additional capabilities such as governance, which means you need more than Spark on its own.
Vaidyanathan: Exactly. Hadoop is a broad ecosystem of open source components that aims to address almost every aspect of working with big data.