Big Data Processing 101: The What, Why, and How

First came Apache Lucene, a free, downloadable, full-text search library that is still widely used. Lucene analyzes ordinary text and builds an inverted index, which maps each term to every location where it appears. When a term is searched for, Lucene already knows all the places where that term exists, making the search far faster and more efficient than scanning the text anew each time. It also laid the foundation for an alternative approach to Big Data processing. Doug Cutting created Lucene in 1999 and made it freely available through Apache in 2001. (The Apache Software Foundation is an open-source software community.)
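The index-then-search pattern is easy to see in code. Below is a minimal sketch against the Lucene Java API, assuming a recent Lucene release; the in-memory directory, the "body" field name, and the sample text are placeholders for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index (placeholder)
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing: Lucene tokenizes the text and records, for each term,
        // where it occurs -- the inverted index described above.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body",
                    "Lucene maps each term to every location where it appears",
                    Field.Store.YES));
            writer.addDocument(doc);
        }

        // Searching: the query is answered from the index directly,
        // with no rescan of the original text.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("term");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("matching documents: " + hits.scoreDocs.length);
        }
    }
}
```

Because the lookup goes straight to the index entry for the term, search time does not depend on rereading the documents themselves.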

In 2002, after Lucene became popular, Doug was joined by Mike Cafarella to improve it. Together they built Nutch, a web crawler based on Lucene, and later pulled Nutch's storage and processing components out into Hadoop, along with the MapReduce programming model (developed by Google in 2004 and shared per the Open Patent Non-Assertion Pledge). Using these concepts, Doug began working with Yahoo in 2006 to build a search engine comparable to Google's. These shared and combined concepts made Hadoop a leader among search platforms. The fact that Apache Hadoop is free and compatible with most common computer systems certainly helped it gain popularity, as did the fact that other software programs are compatible with Hadoop, allowing greater freedom in the search process.


Hadoop, by itself, can run on a single machine, which is useful for experimentation, but it normally runs in a cluster configuration, anywhere from a few nodes to a few thousand. Hadoop's efficiency comes from running batch processes in parallel. Rather than moving data across the network to a specific processing node, it divides a large problem into smaller, more easily solved ones, solves those where the data resides, and then combines the results into a final answer. Hadoop also allows for efficient, cost-effective storage of large datasets. Doug Cutting and Mike Cafarella developed the underlying systems and framework in Java, and then adapted Nutch to work on top of it. One benefit of the new system is that the computers monitor themselves, detecting and recovering from node failures, instead of requiring a person to watch them 24/7 to make sure the system doesn't drop out.
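This divide-and-combine pattern is easiest to see in Hadoop's canonical word-count example. The sketch below uses the standard org.apache.hadoop.mapreduce API; the input and output paths are supplied as command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map step: each mapper handles one split of the input (a "smaller problem"),
    // emitting a (word, 1) pair for every token it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: combine the partial counts into the final answer.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mappers run in parallel where the data lives, and Hadoop handles shuffling the intermediate pairs to the reducers that produce the combined result.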


It took a few years for Yahoo to completely transfer its web index to Hadoop, but the slow migration gave the company time for intelligent decision making, including the decision to create a "research grid" for its Data Scientists. This grid started with a few dozen nodes and, as Hadoop's technology evolved and the Data Scientists continued to add more data, grew to several hundred. Many other well-known websites have adopted Hadoop as well.

Spark is fast becoming another popular system for Big Data processing. Spark can run on top of Hadoop, or it can work as a standalone processing engine. When the two are combined, Spark's engine replaces the MapReduce layer of Hadoop's stack, which opens up a variety of alternative processing scenarios that mix algorithms and tools from the two systems. Cloudera is one example of a vendor replacing Hadoop's MapReduce with Spark. As a standalone processor, Spark does not come with its own distributed storage layer, but it can use Hadoop's distributed file system (HDFS).

Spark differs from Hadoop and Google's MapReduce model in that it keeps working data in memory across processing steps, which dramatically speeds up processing. As an alternative system, Spark can also circumvent MapReduce's imposed linear dataflow, in turn providing a more flexible processing pipeline.
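That in-memory reuse can be sketched with Spark's Java API. This is a minimal, hypothetical example: the HDFS input path is a placeholder, and cache() marks the dataset to be kept in memory so the two downstream computations do not re-read the input:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The input path is a placeholder; Spark can read from HDFS,
        // the local filesystem, S3, and other sources.
        JavaRDD<String> words = sc.textFile("hdfs:///data/input.txt")
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .cache(); // keep the tokenized dataset in memory

        // Two separate computations reuse the cached dataset instead of
        // re-reading the input -- the in-memory reuse described above.
        long total = words.count();
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);

        System.out.println("total words: " + total);
        counts.take(10).forEach(t -> System.out.println(t._1 + " -> " + t._2));

        sc.stop();
    }
}
```

In a pure MapReduce pipeline, each of those two computations would be a separate job reading the data from disk; here the second reuses what is already in memory.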


Apache Flink is an engine that processes streaming data, operating on each record as it arrives. Spark, by way of comparison, operates in batch mode and cannot operate on individual rows as efficiently as Flink can.
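A minimal sketch of that record-at-a-time style, using Flink's DataStream API: the socket source, host, and port are placeholders (any unbounded source, such as Kafka, works the same way), and the running counts update with every incoming line rather than at batch boundaries:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source: read lines from a local socket.
        env.socketTextStream("localhost", 9999)
                // Each incoming line is processed immediately, row by row.
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT)) // needed due to type erasure
                .keyBy(t -> t.f0) // group the stream by word
                .sum(1)           // running count, updated per record
                .print();

        env.execute("streaming word count");
    }
}
```

The job runs continuously: there is no fixed input to finish, only a stream whose aggregates are updated as each record arrives.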

