Big Data Processing 101: The What, Why, and How

First came Apache Lucene, which was, and still is, a free, downloadable, full-text search library. It analyzes text and builds an index that maps each term to every location where it appears. When a term is searched for, Lucene already knows all the places where it occurs, which makes the search far faster and more efficient than scanning the text anew for every query. It also laid the foundation for an alternative method of Big Data processing. Doug Cutting created Lucene in 1999 and open-sourced it through Apache in 2001. (The Apache Software Foundation is an open source software community.)
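
To make the idea concrete, here is a minimal sketch of the inverted-index concept behind Lucene, written in plain Python; the names are illustrative, not Lucene's actual API:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document IDs where it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "big data processing with Hadoop",
    2: "stream processing with Flink",
}
index = build_inverted_index(docs)
print(index["processing"])  # {1, 2} -- lookup is immediate, no rescanning
```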

In 2002, after Lucene became popular, Mike Cafarella joined Doug to improve on it, and together they built the web crawler Nutch on top of Lucene. They later pulled Nutch's storage and processing components out into what became Hadoop, alongside the MapReduce programming model (published by Google in 2004 and shared under its Open Patent Non-Assertion Pledge). Using these concepts, Doug began working with Yahoo in 2006 to build a search engine comparable to Google's. The shared and combined concepts made Hadoop a leader in search engine technology. The fact that Apache Hadoop is free and compatible with most common computer systems certainly helped it gain popularity, as did the fact that other software programs are also compatible with Hadoop, allowing for greater freedom in the search process.


Hadoop, by itself, can operate on a single machine, which is useful for experimentation, but it normally runs in a cluster configuration. A cluster can range from a few nodes to a few thousand. Hadoop's efficiency comes from running batch processes in parallel: rather than moving data across a network to a specific processing node, a large problem is divided into smaller, more easily solved problems, each handled where the data lives, and the combined results provide the final answer. Hadoop also allows for efficient, cost-effective storage of very large datasets. Doug Cutting and Mike Cafarella developed the underlying systems and framework in Java, then adapted Nutch to run on top of it. One benefit of the new system was that the machines could monitor themselves for failures, rather than requiring a person to watch them around the clock to make sure nothing dropped out.
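
The divide-and-conquer pattern behind MapReduce can be illustrated with a toy word count, sketched here in plain Python rather than Hadoop's actual (Java) API:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "node" turns its chunk of text into (word, 1) pairs.
def map_phase(chunk):
    return [(word, 1) for word in chunk.lower().split()]

# Shuffle phase: group pairs by key so each reducer sees one word's counts.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for word, count in chain.from_iterable(mapped_pairs):
        groups[word].append(count)
    return groups

# Reduce phase: combine the partial results into the final answer.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

chunks = ["big data big ideas", "big clusters process data"]  # one chunk per node
print(reduce_phase(shuffle(map(map_phase, chunks))))
# {'big': 3, 'data': 2, 'ideas': 1, 'clusters': 1, 'process': 1}
```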


It took a few years for Yahoo to completely transfer its web index to Hadoop, but this slow process gave the company time for intelligent decision making, including the decision to create a "research grid" for its Data Scientists. The grid started with a few dozen nodes and, as Hadoop's technology evolved and the Data Scientists continued to add more data, grew to several hundred. Many other well-known websites have adopted Hadoop as well.

Spark is fast becoming another popular system for Big Data processing. It can run on top of Hadoop or work as a standalone processing engine. When paired with Hadoop, Spark's engine replaces the MapReduce layer, which can lead to a variety of alternative processing scenarios mixing algorithms and tools from the two systems. Cloudera is one example of a business replacing Hadoop's MapReduce with Spark. As a standalone processor, Spark does not come with its own distributed storage layer, but it can use Hadoop's distributed file system (HDFS).
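
As a sketch of what that looks like in practice, here is the classic word count in PySpark; the file path and app name are illustrative, and the input could just as well be an HDFS URI:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (illustrative app name).
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read lines of text (a path like hdfs://... would also work).
lines = spark.sparkContext.textFile("input.txt")

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```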


Spark differs from Hadoop and Google's MapReduce model in that it keeps intermediate data in memory, which speeds up processing considerably. As an alternative system, Spark can circumvent MapReduce's imposed linear dataflow, providing a more flexible processing pipeline.
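
The in-memory advantage shows up most clearly in iterative jobs, where the same dataset is reused across passes. A minimal PySpark sketch, with an illustrative dataset and loop:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IterativeDemo").getOrCreate()

data = spark.sparkContext.parallelize(range(1_000_000))
data.cache()  # keep the RDD in memory instead of recomputing it each pass

# Each iteration reuses the cached data; MapReduce would write
# intermediate results to disk between comparable passes.
for threshold in (10, 100, 1000):
    print(threshold, data.filter(lambda x: x < threshold).count())

spark.stop()
```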

Apache Flink is an engine that processes streaming data. Spark, by way of comparison, operates in batch mode (its streaming support processes data in small micro-batches) and cannot operate on individual rows as efficiently as Flink can.
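
A toy illustration of the distinction, in plain Python rather than either engine's API: a batch job waits for the whole dataset before producing an answer, while a streaming job updates its answer as each record arrives.

```python
import time

def arriving_records():
    """Simulate records trickling in over time."""
    for i in range(3):
        time.sleep(0.1)
        yield i

def batch_count(records):
    # Batch style: wait for the whole dataset, then compute once.
    return len(list(records))

def streaming_count(records):
    # Streaming style: update the answer as each record arrives.
    count = 0
    for _ in records:
        count += 1
        yield count

print(batch_count(arriving_records()))   # 3, only after the stream ends
for total in streaming_count(arriving_records()):
    print(total)                         # 1, 2, 3 -- a result per record
```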
