Top Streaming Technologies for Data Lakes and Real-Time Data Blog

Top Streaming Technologies for Data Lakes and Real-Time Data

by 7wData
October 4, 2016

More than ever, streaming technologies are at the forefront of the Hadoop ecosystem. As the prevalence and volume of real-time data continues to increase, the velocity of development and change in technology will likely do the same. However, as the number and complexity of streaming technologies grow, consumers of Hadoop must face an increasing number of choices with increasingly blurred delineation of functionality.

While it would be hard to encompass all information about streaming ingest in one page (or even one book), this post is meant to provide a basic overview of the ways in which various Hadoop technologies fit into the data lake and provide a jumping-off point for further exploration. Notably, there are many, many more technologies than are shown in the diagram below -- these are just a few of the most common.

The first point to make when considering streaming in the data lake is that although many of the available technologies are incredibly flexible and can be used in multiple contexts, a well-executed data lake provides strict rules and processes around ingestion. For example, Kafka and Flume both allow connections directly into Hive and HBase, and Spark can ingest and process data without ever writing to disk. This functionality is robust, but also compromises the original, unaltered data, which is a main principle of data lake architecture. Because of this, we generally restrict the ways in which data flows through the system. Data must be ingested, written to a raw landing zone where it can be held, and copied to another zone for processing and enrichment.

Flume and Kafka are the two most well-established messaging systems in use today. An extremely simple analysis of these products is below.

Kafka is the newer of the two technologies but is quickly gaining traction as a robust, scalable and fault-tolerant messaging system. Whereas Flume can be thought of as a pipe between two points, Kafka is more of a broadcast, making data “topics” available to any subscribers who have permission to listen in. This makes Kafka, as a whole, more scalable than Flume, and also provides mechanisms for fault tolerance and redundancy of data. If one Kafka agent goes down, another will re-broadcast the topic. Where Kafka does fall short is in commercial support. Currently, Cloudera includes Kafka, but MapR and Hortonworks do not. Additionally, Kafka does not include built-in connectors to other Hadoop products. Some have been pre-written, but in general, you can’t expect the same level of “out-of-the-box” connectivity as Flume.

Flume has historically been the only choice for streaming ingest and as such, is well-established in the Hadoop ecosystem and is supported in all commercial Hadoop distributions. For large, enterprise-wide Hadoop deployments, this is an attractive, or even essential feature and may be reason enough to choose it. Despite its age, Flume has largely stayed fresh with emerging Hadoop technologies. Flume is a push-to-client system and operates between two endpoints rather than as a broadcast for any consumer to plug into.

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Top Streaming Technologies for Data Lakes and Real-Time Data

Leave a Reply Cancel reply

Upcoming Events

MarkLogic World | Amsterdam

Knowledge Graph — The Ultimate Center of Excellence

From Text to Value: Pairing Text Analytics and Generative AI

Bringing Data Closer to Decision Makers with Data Fabric

Categories

Tags

You Might Be Interested In

How AI is Transforming Business Intelligence

Want Better Cities? Here, 6,000 Years of Data Oughta Help

Many Businesses Using AI Without Realizing It

Recent Jobs

Senior Cloud Engineer (AWS, Snowflake)

IT Engineer

Data Engineer

Applications Developer

Do You Want to Share Your Story?

Join our community

Our Services

Company

Work With Us

Follow Us

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.

Top Streaming Technologies for Data Lakes and Real-Time Data

Leave a Reply Cancel reply

Upcoming Events

Categories

Tags

You Might Be Interested In

Recent Jobs

Do You Want to Share Your Story?

Join our community

Our Services

Company

Work With Us

Follow Us

Get the 3 STEPS

To Drive Analytics Adoption And manage change

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.

To Drive Analytics Adoption
And manage change