Big Data Processing 101: The What, Why, and How

First came Apache Lucene, which was, and still is, a free, downloadable, full-text search library. It analyzes plain text to build an index: a structure that maps each term to every location where it appears, “remembering” where each term lives. When a term is searched for, Lucene immediately knows all the places it exists, which makes searching much faster and more efficient than scanning the text anew for every query. It also laid the foundation for an alternative approach to Big Data processing. Doug Cutting created Lucene in 1999 and made it free through Apache in 2001. (The Apache Software Foundation is a community that develops open source software.)
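The structure at the heart of this approach is an inverted index. The following is a minimal plain-Java sketch of the idea, not Lucene’s actual API (the class name and the naive whitespace tokenization are illustrative assumptions); Lucene’s real index adds tokenization, scoring, compressed postings lists, and much more.

import java.util.*;

// A toy inverted index: maps each term to the IDs of documents containing it.
public class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index a document: record this document's ID under each of its terms.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Search: the index already "remembers" where the term occurs,
    // so lookup is a single map access rather than a full re-scan.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "Hadoop grew out of Lucene");
        idx.add(2, "Lucene builds an inverted index");
        System.out.println(idx.search("lucene")); // prints [1, 2]
    }
}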

In 2002, after Lucene became popular, Doug was joined by Mike Cafarella, and together they built Nutch, an open source web crawler based on Lucene. They later pulled Nutch’s storage and processing components out into a separate framework, Hadoop, pairing them with the MapReduce programming model (published by Google in 2004, and shared per the Open Patent Non-Assertion Pledge). Using these concepts, Doug began working with Yahoo in 2006 to build a “search engine” comparable to Google’s. The shared and combined concepts made Hadoop a leader in search technology. The fact that Apache Hadoop is free and compatible with most common computer systems certainly helped it gain popularity, as did the fact that other software programs are compatible with Hadoop, allowing for greater freedom in the search process.

Hadoop, by itself, can operate on a single machine, which is useful for experimentation, but it normally runs in a cluster configuration ranging from a few nodes to a few thousand. Hadoop’s efficiency comes from running batch processes in parallel: rather than moving data through a network to a specific processing node, it divides a large problem into smaller, more easily solved problems, solves them where the data lives, and then combines the results into a final answer. Hadoop also allows for the efficient and cost-effective storage of large datasets. Doug Cutting and Mike Cafarella developed the underlying systems and framework in Java, and then adapted Nutch to work on top of it. One benefit of the new system was that the computers could monitor themselves, as opposed to having a person monitor them 24/7 to ensure the system didn’t drop out.
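To make the divide-and-combine idea concrete, here is a minimal sketch of the classic word-count job in Hadoop’s Java MapReduce API. The class names and the input/output paths taken from the command line are illustrative choices, not anything prescribed by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node tokenizes its own slice of the input
  // and emits (word, 1) pairs -- the "smaller problems".
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: combine the partial counts into the final answer.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}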

It took a few years for Yahoo to completely transfer its web index to Hadoop, but this slow process gave the company time for intelligent decision making, including the decision to create a “research grid” for its Data Scientists. The grid started with a few dozen nodes and, as Hadoop’s technology evolved and the Data Scientists continued to add more data, grew to several hundred. Many other well-known websites went on to adopt Hadoop as well.

Spark is fast becoming another popular system for Big Data processing. It can run on top of Hadoop, where it replaces the MapReduce layer while working with the rest of Hadoop’s software, or it can work as a standalone processing engine. This opens up a variety of alternative processing scenarios, mixing algorithms and tools from the two systems; Cloudera is one example of a business offering Spark as a replacement for Hadoop’s MapReduce. As a standalone processor, Spark does not come with its own distributed storage layer, but it can use the Hadoop Distributed File System (HDFS).
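As an illustration of Spark running over Hadoop’s storage, the sketch below performs the same word count as the MapReduce job above, but with Spark’s Java API, reading from and writing to HDFS. The HDFS paths and application name are assumptions for the example, and the job would be submitted to a cluster with spark-submit.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read input from Hadoop's distributed file system (path is an assumption).
    JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

    // The whole map/shuffle/reduce pipeline in three transformations.
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile("hdfs:///data/output");
    sc.stop();
  }
}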

Spark is different from Hadoop and Google’s MapReduce model because it keeps intermediate data in memory rather than writing it to disk between steps, which speeds up processing time. As an alternative system, Spark can also circumvent MapReduce’s imposed linear dataflow (a fixed map-then-reduce sequence), in turn providing more flexible, multi-stage processing pipelines.
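Spark’s in-memory advantage shows up most clearly when a dataset is reused. The toy sketch below (class name, app name, and local master setting are my own choices for a runnable demo) uses cache() to keep a computed dataset in memory, so a second action reuses it instead of recomputing the pipeline from scratch.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCachingSketch {
  public static void main(String[] args) {
    // local[*] runs the job in-process for demonstration purposes.
    SparkConf conf = new SparkConf()
        .setAppName("SparkCachingSketch")
        .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Integer> numbers =
        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));

    // cache() asks Spark to keep the result in memory once computed.
    JavaRDD<Integer> squares = numbers.map(n -> n * n).cache();

    // The first action materializes the squares and caches them.
    long count = squares.count();

    // The second action reuses the in-memory result instead of recomputing --
    // reuse that MapReduce's linear, disk-based dataflow does not offer.
    int total = squares.reduce(Integer::sum);

    System.out.println(count + " values, sum of squares = " + total);
    sc.stop();
  }
}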

Apache Flink is an engine that processes streaming data natively, handling each record as it arrives. Spark, by way of comparison, has historically handled streams in batch mode (as a series of small batches), so it cannot operate on individual rows as efficiently as Flink can.
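Here is a minimal sketch of that streaming style using Flink’s DataStream Java API. The socket source host and port are assumptions for the example; a production job would read from a durable source such as Kafka.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkStreamingSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Read a live stream of text lines from a socket (host/port are assumptions).
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    // Each record is processed as it arrives -- row by row,
    // rather than being collected into a batch first.
    lines
        .filter(line -> !line.isEmpty())
        .map(String::toUpperCase)
        .print();

    env.execute("Flink Streaming Sketch");
  }
}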

 


