Big Data Processing 101: The What, Why, and How

First came Apache Lucene, which was, and still is, a free, downloadable, full-text search library. It analyzes ordinary text to build an index. The index maps each term to the locations where it appears, “remembering” them. When a term is searched for, Lucene can immediately look up every place that term occurs. This makes searching much faster and more efficient than scanning the text anew each time. It also laid the foundation for an alternative method of Big Data processing. Doug Cutting created Lucene in 1999 and made it freely available through Apache in 2001. (The Apache Software Foundation is a community that develops open source software.)
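The idea of an index that “remembers” where each term appears can be sketched in a few lines of Python. This is only an illustration of the general technique (an inverted index), not Lucene’s actual implementation or API; the function names `build_index` and `search` and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict
import re


def build_index(documents):
    """Map each lowercase term to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Split the text into word tokens and record where each one sits.
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].append((doc_id, position))
    return index


def search(index, term):
    """Look a term up directly instead of rescanning every document."""
    return index.get(term.lower(), [])


# Hypothetical sample data, used only for illustration.
documents = {
    "doc1": "Lucene builds an index of terms",
    "doc2": "The index remembers where terms appear",
}
index = build_index(documents)
print(search(index, "index"))  # [('doc1', 3), ('doc2', 1)]
```

The text is scanned once, when the index is built. Each later search is a single dictionary lookup, which is why the search does not have to seek the term out anew every time.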

In 2002, after Lucene became popular, Doug was joined by Mike Cafarella to build Nutch, a web crawler based on Lucene. They later split Nutch’s distributed processing and storage components into a separate project, Hadoop, which implements the MapReduce programming model (described by Google in 2004, and shared per the Open Patent Non-Assertion Pledge). Using these concepts, Doug began working with Yahoo in 2006 to build a search engine comparable to Google’s. The shared and combined concepts made Hadoop a popular foundation for search engines. The fact that Apache Hadoop is free and runs on most common computer systems certainly helped it gain popularity, as did the fact that other software programs are compatible with Hadoop, allowing for greater freedom in the search process.

Hadoop, by itself, can operate on a single machine. This can be useful for experimentation, but normally Hadoop runs in a cluster configuration. A cluster can range from a few nodes to a few thousand nodes. Hadoop’s efficiency comes from running batch processes in parallel. Rather than moving data across a network to a specific processing node, Hadoop divides large problems into smaller, more easily solved problems and processes them on the nodes where the data is stored. The smaller problems are solved, and the combined results provide a final answer to the large problem. Hadoop also allows for the efficient and cost-effective storage of large datasets. Doug Cutting and Mike Cafarella developed the underlying systems and framework using Java, and then adapted Nutch to work on top of it. One benefit of the new system was that it allowed the computers to monitor themselves, rather than having a person watch them 24/7 to ensure the system doesn’t drop out.
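The divide, solve, and combine pattern described above can be sketched with a word count, a common teaching example for MapReduce. This is a single-process Python illustration under stated assumptions, not Hadoop’s Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the input chunks are hypothetical, and a real cluster would run the map step on many nodes at once.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.lower().split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Combine each key's values into one final count."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks):
    # Each chunk is mapped independently of the others; this independence
    # is what would let a cluster process the chunks in parallel.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


# Hypothetical input, already split into chunks (one per node, conceptually).
chunks = ["big data big clusters", "data moves to clusters", "big results"]
print(word_count(chunks))
```

Each small problem (counting words in one chunk) is solved on its own, and the reduce step combines the partial results into the answer for the whole dataset.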

It took a few years for Yahoo to completely transfer its web index to Hadoop, but this slow process gave the company time for intelligent decision making, including the decision to create a “research grid” for its Data Scientists. This grid started with a few dozen nodes and, as Hadoop’s technology evolved, grew to several hundred as the Data Scientists continued to add more data. Several other well-known websites also use Hadoop.

Spark is fast becoming another popular system for Big Data processing. Spark can run on top of Hadoop (helping it to work faster), or it can work as a standalone processing engine. When the two are combined, Spark’s processing engine replaces Hadoop’s MapReduce component while the rest of Hadoop’s software stays in place. This, in turn, can lead to a variety of alternative processing scenarios, which may include a mixture of algorithms and tools from the two systems. Cloudera is one example of a business replacing Hadoop’s MapReduce with Spark. As a standalone processor, Spark does not come with its own distributed storage layer, but can use Hadoop’s distributed file system (HDFS).

Spark differs from Hadoop’s implementation of Google’s MapReduce model in that it keeps working data in memory rather than writing it to disk between steps, which speeds up processing. As an alternative system, Spark can circumvent the linear dataflow that MapReduce imposes, in turn providing a more flexible data processing system.

Apache Flink is an engine that processes streaming data. Spark, by way of comparison, operates in batch mode (its streaming support groups incoming records into small batches), and cannot operate on individual rows as efficiently as Flink can.
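The difference between handling rows one at a time and handling them in batches can be sketched in plain Python. This does not use the Flink or Spark APIs; the two functions and the sample readings are hypothetical and only show when results become available under each style.

```python
def process_per_record(records):
    """Streaming style: emit an updated running total after every record."""
    total = 0
    for value in records:
        total += value
        yield total


def process_micro_batches(records, batch_size):
    """Batch style: buffer records and emit a total only when a batch closes."""
    total = 0
    batch = []
    for value in records:
        batch.append(value)
        if len(batch) == batch_size:
            total += sum(batch)
            batch = []
            yield total
    if batch:  # flush a final, partially filled batch
        total += sum(batch)
        yield total


# Hypothetical stream of incoming values.
readings = [3, 1, 4, 1, 5]
print(list(process_per_record(readings)))                   # [3, 4, 8, 9, 14]
print(list(process_micro_batches(readings, batch_size=2)))  # [4, 9, 14]
```

Both versions arrive at the same final total, but the per-record version produces a result as soon as each row arrives, while the batched version has to wait for a batch to fill before anything is reported.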

