Building a cognitive data lake with ODPi-compliant Hadoop

For today’s data scientists and data engineers, the data lake is a concept that is both intriguing and often misunderstood. While there are many good resources about data lakes on IBM.com and other websites, there is also a lot of hype and spin. As a result, it can be difficult to get a clear understanding of the challenges, opportunities and methods that can help companies build data lakes that deliver real business advantage.

We recently listened in on a fascinating conversation between John Mertic, Director of Program Management at the ODPi, and Neil Stokes, Worldwide Analytics Architect Leader at IBM. Putting the data lake into the broader context of today’s IT industry trends, they discussed the importance of open, interoperable data and analytics platforms in solving both traditional analytics and cognitive computing challenges.

Here are the top five things we learned from Neil and John:

1. Data lakes need to be defined by consumption patterns, not data types

There’s a school of thought that defines a data lake as a platform or set of tools for storing and analyzing large volumes of unstructured data. This definition implies that data lakes do a fundamentally different job from systems that manage and analyze other types of information, such as traditional relational database data.

Neil argues that this is a misconception. There is no such thing as “unstructured data” – there is only data whose structure has not yet been parsed. Even if you are analyzing tweets or Facebook posts, you have metadata about when and by whom the text was written, and the text itself will contain semantic patterns from which you can infer meaning. For example, a tweet that includes certain words or hashtags can be understood as referring to specific topics, themes or sentiments. If this data were completely unstructured, trying to analyze it would be a fruitless exercise, because without structure, language has no meaning.
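To make Neil's point concrete, here is a minimal Python sketch of how structure can be recovered from a single social media post. The record, its field names and the `parse_structure` helper are all hypothetical and chosen only for illustration; they do not reflect any real social media API or anything Neil described.

```python
import re
from datetime import datetime

# Hypothetical tweet-like record; the field names are illustrative only.
post = {
    "author": "example_user",
    "created_at": "2016-11-02T09:30:00+00:00",
    "text": "Loving the new #Hadoop release, great for our #datalake!",
}

def parse_structure(record):
    """Pull out the structure that is already implicit in a 'unstructured' post."""
    text = record["text"]
    return {
        # Metadata: who wrote the text, and when.
        "author": record["author"],
        "created_at": datetime.fromisoformat(record["created_at"]),
        # Semantic hints: hashtags point to topics or themes.
        "hashtags": [tag.lower() for tag in re.findall(r"#(\w+)", text)],
        "word_count": len(text.split()),
    }

parsed = parse_structure(post)
print(parsed["hashtags"])
```

Even this toy example yields an author, a timestamp and a list of topics, which is already enough to join the post against other data sets in the lake.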

Since the line between “structured data” and “unstructured data” is blurred, there is no reason to think that a data lake should include some types of data and exclude others. In fact, the value of the data lake concept is that it should allow you to store any kind of data, and analyze it for any purpose.

For this reason, it makes much more sense to define data lakes in terms of consumption patterns. What is the organization trying to achieve? What kinds of data will it need to analyze to meet these objectives? And therefore, what kind of analytics infrastructure does it need to build to support that analysis? Every data lake will be different, depending on what data the organization has, and what it decides to do with it.

2. It’s not all about Hadoop

In consequence, a related assumption – that “data lakes are built on Apache Hadoop” – is equally questionable.

Certainly, we should not underestimate the importance of Hadoop to the design of most data lakes. As a general-purpose platform that can handle almost any type of data that you can throw at it, Hadoop is almost certainly going to play an important role.

However, most data lakes are likely to be built using a combination of many different tools. A traditional data warehouse could be just as much of a cornerstone of such architectures as Hadoop is. Each of these tools needs to be able to work in harmony with its peers in order to build flexible data pipelines that can deliver whatever the business needs.

The ability to build data pipelines between tools depends on how well those tools interoperate. Historically, Hadoop has been difficult to integrate reliably with other tools, because it consists of a collection of independently developed open source projects that evolve at different speeds. In the past, this greatly increased the risk of compatibility issues, particularly with third-party tools.
