Building a cognitive data lake with ODPi-compliant Hadoop
- by 7wData
For today’s data scientists and data engineers, the data lake is a concept that is both intriguing and often misunderstood. While there are many good resources about data lakes on IBM.com and other websites, there is also a lot of hype and spin. As a result, it can be difficult to get a clear understanding of the challenges, opportunities and methods that can help companies build data lakes that deliver real business advantage.
We recently listened in on a fascinating conversation between John Mertic, Director of Program Management at the ODPi, and Neil Stokes, Worldwide Analytics Architect Leader at IBM. Putting the data lake into the broader context of today’s IT industry trends, they discussed the importance of open, interoperable data and analytics platforms in solving both traditional analytics and cognitive computing challenges.
Here are the top five things we learned from Neil and John:
1. Data lakes need to be defined by consumption patterns, not data types
There’s a school of thought that defines a data lake as a platform or set of tools for storing and analyzing large volumes of unstructured data. This definition implies that data lakes do a fundamentally different job from systems that manage and analyze other types of information, such as traditional relational database data.
Neil argues that this is a misconception. There is no such thing as “unstructured data” – there is only data whose structure has not yet been parsed. Even if you are analyzing tweets or Facebook posts, you have metadata about when and by whom the text was written, and the text itself will contain semantic patterns from which you can infer meaning. For example, a tweet that includes certain words or hashtags can be understood as referring to specific topics, themes or sentiments. If this data were completely unstructured, trying to analyze it would be a fruitless exercise, because without structure, language has no meaning.
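As a rough illustration of the point above, the following Python sketch pulls structure out of a single social media post. It is not taken from the conversation: the record layout (`author`, `created_at`, `text`), the `ParsedPost` class and the `parse_post` function are hypothetical names chosen for this example, and the regular expressions are a deliberate simplification of how hashtags and mentions are really tokenized.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Simplified patterns: a hashtag or mention is '#' or '@' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


@dataclass
class ParsedPost:
    """Structure recovered from a post: its metadata plus tokens found in the text."""
    author: str
    created_at: str
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)


def parse_post(raw: Dict[str, str]) -> ParsedPost:
    """Combine the metadata that accompanies a post with patterns parsed from its text."""
    text = raw["text"]
    return ParsedPost(
        author=raw["author"],
        created_at=raw["created_at"],
        hashtags=[h.lower() for h in HASHTAG.findall(text)],
        mentions=[m.lower() for m in MENTION.findall(text)],
    )


# Hypothetical sample record, invented for illustration only.
post = {
    "author": "example_user",
    "created_at": "2017-01-01T12:00:00Z",
    "text": "Loving the new #DataLake setup with #Hadoop, thanks @example_team!",
}

parsed = parse_post(post)
print(parsed.hashtags)
print(parsed.mentions)
```

Even this small amount of parsing turns free text into fields that can be filtered, grouped or joined with other data, which is the sense in which the structure was there all along and simply had not been parsed yet.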
Since the line between “structured data” and “unstructured data” is blurred, there is no reason to think that a data lake should include some types of data and exclude others. In fact, the value of the data lake concept is that it should allow you to store any kind of data, and analyze it for any purpose.
For this reason, it makes much more sense to define data lakes in terms of consumption patterns. What is the organization trying to achieve? What kinds of data will it need to analyze to meet these objectives? And therefore, what kind of analytics infrastructure does it need to build to support that analysis? Every data lake will be different, depending on what data the organization has, and what it decides to do with it.
2. It’s not all about Hadoop
In consequence, a related assumption – that “data lakes are built on Apache Hadoop” – is equally questionable.
Certainly, we should not underestimate the importance of Hadoop to the design of most data lakes. As a general-purpose platform that can handle almost any type of data that you can throw at it, Hadoop is almost certainly going to play an important role.
However, most data lakes are likely to be built using a combination of many different tools. A traditional data warehouse could be just as much of a cornerstone of such architectures as Hadoop is. Each of these tools needs to be able to work in harmony with its peers in order to build flexible data pipelines that can deliver whatever the business needs.
The ability to build data pipelines between tools depends on how well those tools interoperate. Historically, Hadoop has been difficult to integrate reliably with other tools, because it consists of a collection of independently developed open source projects that evolve at different speeds. In the past, this greatly increased the risk of compatibility issues with third-party tools.