Building a cognitive data lake with ODPi-compliant Hadoop

For today’s data scientists and data engineers, the data lake is a concept that is both intriguing and often misunderstood. While there are many good resources about data lakes on IBM.com and other websites, there is also a lot of hype and spin. As a result, it can be difficult to get a clear understanding of the challenges, opportunities and methods that can help companies build data lakes that deliver real business advantage.

We recently listened in on a fascinating conversation between John Mertic, Director of Program Management at the ODPi, and Neil Stokes, Worldwide Analytics Architect Leader at IBM. Putting the data lake into the broader context of today’s IT industry trends, they discussed the importance of open, interoperable data and analytics platforms in solving both traditional analytics and cognitive computing challenges.

Here are the top five things we learned from Neil and John:

1. Data lakes need to be defined by consumption patterns, not data types

There’s a school of thought that defines a data lake as a platform or set of tools for storing and analyzing large volumes of unstructured data. This definition implies that data lakes do a fundamentally different job from systems that manage and analyze other types of information, such as traditional relational database data.

Neil argues that this is a misconception. There is no such thing as “unstructured data” – there is only data whose structure has not yet been parsed. Even if you are analyzing tweets or Facebook posts, you have metadata about when and by whom the text was written, and the text itself will contain semantic patterns from which you can infer meaning. For example, a tweet that includes certain words or hashtags can be understood as referring to specific topics, themes or sentiments. If this data were completely unstructured, trying to analyze it would be a fruitless exercise, because without structure, language has no meaning.
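To make the point concrete, here is a minimal Python sketch of how much structure can be recovered from a single social media post. It is an illustration only: the record layout, the field names (`author`, `created_at`, `text`) and the tiny `POSITIVE_WORDS` list are assumptions invented for this example, not the schema of any real social media API or anything Neil described.

```python
import re
from datetime import datetime

# Hypothetical record: the field names are illustrative assumptions,
# not the schema of any real social media API.
post = {
    "author": "example_user",
    "created_at": "2017-03-01T09:30:00+00:00",
    "text": "Loving the new #Hadoop release, great for our #datalake!",
}

# Hypothetical word list used as a toy sentiment signal.
POSITIVE_WORDS = {"loving", "great", "good"}


def parse_post(record):
    """Recover the structure that is already latent in a 'free text' post."""
    text = record["text"]
    # Hashtags act as explicit topic markers embedded in the text.
    hashtags = [tag.lower() for tag in re.findall(r"#(\w+)", text)]
    # Individual words give a crude hint about sentiment.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "author": record["author"],
        "created_at": datetime.fromisoformat(record["created_at"]),
        "hashtags": hashtags,
        "positive_hits": sorted(words & POSITIVE_WORDS),
    }


parsed = parse_post(post)
print(parsed["hashtags"])       # topics inferred from hashtags
print(parsed["positive_hits"])  # words suggesting positive sentiment
```

Even this toy parser yields an author, a timestamp, topics and a rough sentiment signal, which is exactly the kind of structure that the “unstructured” label hides. A real pipeline would use far more robust text analytics than a regular expression and a word list.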

Since the line between “structured data” and “unstructured data” is blurred, there is no reason to think that a data lake should include some types of data and exclude others. In fact, the value of the data lake concept is that it should allow you to store any kind of data, and analyze it for any purpose.

For this reason, it makes much more sense to define data lakes in terms of consumption patterns. What is the organization trying to achieve? What kinds of data will it need to analyze to meet these objectives? And therefore, what kind of analytics infrastructure does it need to build to support that analysis? Every data lake will be different, depending on what data the organization has, and what it decides to do with it.

2. It’s not all about Hadoop

In consequence, a related assumption – that “data lakes are built on Apache Hadoop” – is equally questionable.

Certainly, we should not underestimate the importance of Hadoop to the design of most data lakes. As a general-purpose platform that can handle almost any type of data that you can throw at it, Hadoop is almost certainly going to play an important role.

However, most data lakes are likely to be built using a combination of many different tools. A traditional data warehouse could be just as much of a cornerstone of such architectures as Hadoop is. Each of these tools needs to be able to work in harmony with its peers in order to build flexible data pipelines that can deliver whatever the business needs.

The ability to build data pipelines between tools depends on the ability of those tools to interoperate with each other. Historically, Hadoop has been a difficult platform to integrate reliably with other tools, because it consists of a collection of independently developed open source projects that evolve at different speeds. In the past, this greatly increased the risk of compatibility issues, particularly with third-party tools.
