Is Spark better than Hadoop MapReduce?
- by 7wData
For anyone who gets into the Big Data world, the terms Big Data and Hadoop become synonyms. As they learn the ecosystem along with the tools and their workings, people become more aware of what big data actually means, and what role Hadoop has in the big data ecosystem.
According to Wikipedia, “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate”.
To put it in simple terms, as the size of data increases, the usual processing methods take too long or prove too costly.
Hadoop was created in 2005 by Doug Cutting, who was inspired by Google’s white papers on GFS and MapReduce. Hadoop is an open source software framework for distributed storage and distributed processing of very large data sets. In other words, it is designed to reduce the cost and time of processing large data sets.
Hadoop, with its distributed file system (HDFS) and distributed processing model (MapReduce), became the de facto standard in big data computing. The term ‘Hadoop’ refers not only to the base modules, but also to the ecosystem of other software packages that can be used along with Hadoop.
As time went on, data generation exploded, and so did the need to process large amounts of data. This eventually generated a variety of needs in big data computing, not all of which could be satisfied by Hadoop.
Most data analysis is iterative in nature. While iterative processing can be done in MapReduce, the data must be read again for each iteration of the process. Under normal circumstances this would be fine, but reading hundreds of gigabytes or a few terabytes of data takes time, and people are not patient.
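To make the cost of re-reading concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes a toy iterative job, estimating a mean by gradient descent, in which every iteration needs a full pass over a file on disk. The names `write_sample`, `DiskDataset` and `iterative_mean` are hypothetical and exist only for this illustration.

```python
import csv
import os
import tempfile


def write_sample(path, n=1000):
    """Write n rows with values 0..9 repeating (true mean is 4.5)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(n):
            writer.writerow([i % 10])


class DiskDataset:
    """Hypothetical dataset that re-reads its file on every pass
    and counts how many full scans were performed."""

    def __init__(self, path):
        self.path = path
        self.scans = 0

    def __iter__(self):
        self.scans += 1  # one more full read of the file
        with open(self.path, newline="") as f:
            for row in csv.reader(f):
                yield float(row[0])


def iterative_mean(dataset, iterations=20, learning_rate=0.5):
    """Estimate the mean by gradient descent.
    Each iteration requires one full pass over the dataset."""
    estimate = 0.0
    for _ in range(iterations):
        gradient_sum, count = 0.0, 0
        for x in dataset:
            gradient_sum += estimate - x
            count += 1
        estimate -= learning_rate * (gradient_sum / count)
    return estimate


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "values.csv")
    write_sample(path)
    data = DiskDataset(path)
    print("estimate:", iterative_mean(data, iterations=20))
    print("full scans of the file:", data.scans)
```

With 20 iterations the file is scanned 20 times. On a small file that is harmless, but when each scan covers terabytes, the repeated reads come to dominate the running time.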
Many people consider data analytics to be an art rather than a science. In any art, the creator creates a small piece of the puzzle and attaches it to the bigger one to witness its growth. Loosely translated, data analysts want to see the results of each step before proceeding to the next one. In other words, a lot of data analytics is interactive in nature. Traditionally, interactive analytics is done through SQL: analysts write queries which operate on data in databases. Although Hadoop had equivalents (Hive and Pig), these proved time-consuming, as each query takes a long time to process the data.
Both these hurdles led to the birth of Spark, a new processing model that facilitates iterative programming and interactive analytics. Spark provides an in-memory primitive that loads data into memory and lets it be queried repeatedly. This makes Spark well suited to a lot of data analytics and machine learning algorithms.
Note that Spark only defines the distributed processing model. It does not address data storage, and it still relies on Hadoop (HDFS) to store the data efficiently in a distributed way.
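The load-once, query-many idea can be sketched in a few lines of plain Python. This is a toy, single-machine illustration and not Spark's API: Spark keeps partitions of a dataset in memory across a cluster, whereas the hypothetical `CachedDataset` class and `load_events` loader below simply hold a list in one process.

```python
def load_events():
    """Hypothetical expensive loader; stands in for reading from disk."""
    countries = ["DE", "FR", "US"]
    for i in range(9):
        yield {"id": i, "country": countries[i % 3], "amount": 10 * i}


class CachedDataset:
    """Loads data on first use, keeps it in memory, and answers
    any number of queries without going back to the source."""

    def __init__(self, loader):
        self._loader = loader
        self._rows = None
        self.loads = 0  # how many times the source was actually read

    def rows(self):
        if self._rows is None:
            self._rows = list(self._loader())
            self.loads += 1
        return self._rows

    def query(self, predicate):
        """Return all cached rows that satisfy the predicate."""
        return [row for row in self.rows() if predicate(row)]


if __name__ == "__main__":
    events = CachedDataset(load_events)
    german = events.query(lambda e: e["country"] == "DE")
    large = events.query(lambda e: e["amount"] > 50)
    print(len(german), "events from DE")
    print(len(large), "events with amount > 50")
    print("source was read", events.loads, "time(s)")
```

However many queries an analyst runs, the source is read only once; every later query works against the copy in memory, which is what makes both interactive exploration and iterative algorithms fast.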
Spark is setting the big data ecosystem on hyperdrive. It promises to be 10-100 times faster than MapReduce. Many think this could be the end of MapReduce.