Charting the data lake: Model normalization patterns for data lakes

The data lake can be considered the consolidation point for all of the data that is of value across different aspects of the enterprise. A typical data lake is likely to contain a significant range of different types of data repository. These repositories are likely to fill a number of different roles to meet the needs of different users, for example as a landing zone for raw data, as data scientist sandboxes, as a data warehouse, or as data marts.

The other key question in building out the data lake repositories is what level of schema standardization or consistency is desirable or even necessary. It is a valid choice for an organization to decide that there is no need to enforce any degree of schema standardization across the data lake. In this case, the expectation is that whatever virtualization layer is in place can guide the different users through the array of different structures, the duplication, and the differing terminology. In other cases, the decision is taken that at least some parts of the data lake need to comply with some degree of standardization in the database schemas, even where those databases are still doing a range of different jobs and so may need to be structured differently.

The diagram below shows such a typical collection of different data structures combined together within a data lake.

In the landing zone, the focus is on initially ingesting the data in a raw format as it comes into the data lake. The landing zone may also be extended to enable some initial processing of the data, making it more generally useful across the data lake. The data scientist sandboxes are typically ad hoc, issue-specific, very flat structures containing many repeating groups of data. The data warehouse typically needs to store data in a reasonably flexible format for various downstream uses, whereas the data marts are often made up of a lot of aggregated data focused on a specific business issue or a specific group of users.
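As a rough illustration of how the same data can take different shapes in different repositories, the sketch below derives a flat sandbox structure with a repeating group and an aggregated mart structure from the same raw records. The record layout, field names, and function names are all hypothetical and are not taken from any particular data lake product.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive in the landing zone.
raw_transactions = [
    {"customer_id": "C1", "region": "EMEA", "amount": 120.0},
    {"customer_id": "C1", "region": "EMEA", "amount": 80.0},
    {"customer_id": "C2", "region": "APAC", "amount": 200.0},
]


def to_sandbox_rows(transactions):
    """Build one flat row per customer, with amounts held as a repeating group."""
    rows = {}
    for tx in transactions:
        row = rows.setdefault(
            tx["customer_id"],
            {"customer_id": tx["customer_id"], "region": tx["region"], "amounts": []},
        )
        row["amounts"].append(tx["amount"])
    return list(rows.values())


def to_mart_rows(transactions):
    """Build aggregated totals per region, as a data mart might hold them."""
    totals = defaultdict(float)
    for tx in transactions:
        totals[tx["region"]] += tx["amount"]
    return dict(totals)


sandbox_rows = to_sandbox_rows(raw_transactions)
mart_rows = to_mart_rows(raw_transactions)
```

Both outputs describe the same transactions, but the sandbox form keeps the detail in a single wide row per customer, while the mart form keeps only the totals needed for one business question.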

A key potential role of a data model is therefore to enable a degree of standardization across such a disparate set of repositories so that, where possible, the same or similar structures and terminology are used to assist with the subsequent understanding and navigation of the data lake. The need to address potentially different characteristics across these data repositories, in terms of how the data is stored and accessed, is where the different normalization patterns come in.
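One simple way to picture shared terminology is a mapping from repository-specific field names to the terms defined in a common model. The sketch below assumes a hypothetical vocabulary and hypothetical records; a real data model would carry far more than field names.

```python
# Hypothetical shared vocabulary: repository-specific field name -> standard term.
STANDARD_TERMS = {
    "cust_no": "customer_id",
    "CustomerNumber": "customer_id",
    "acct_bal": "account_balance",
    "Balance": "account_balance",
}


def standardize(record):
    """Return a copy of the record with known field names mapped to standard terms.

    Field names that are not in the vocabulary are passed through unchanged.
    """
    return {STANDARD_TERMS.get(name, name): value for name, value in record.items()}


# Two hypothetical repositories describing the same customer with different names.
warehouse_record = {"cust_no": "C1", "acct_bal": 250.0}
mart_record = {"CustomerNumber": "C1", "Balance": 250.0, "segment": "retail"}

standard_warehouse = standardize(warehouse_record)
standard_mart = standardize(mart_record)
```

After mapping, both records expose customer_id and account_balance, even though the mart record keeps an extra field and so still has a different structure.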

The data models traditionally used in the construction of data warehouse structures would often start with a reasonably high degree of normalization, typically one that provides the flexibility needed to represent the various business needs effectively. When these models are then transformed from such a logical, platform-independent format into a more platform-specific format, varying degrees of denormalization take place in order to produce physical models that perform well in the specific physical environment.
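The move from a normalized logical structure to a denormalized physical one can be sketched as a join that copies the parent attributes onto every detail row. The entities, keys, and field names below are hypothetical, and the degree of denormalization shown is only one of many possible choices.

```python
# Hypothetical normalized structures: each fact is stored once and linked by key.
customers = {"C1": {"name": "Acme Ltd", "region": "EMEA"}}
accounts = {"A1": {"customer_id": "C1", "account_type": "current"}}
transactions = [
    {"transaction_id": "T1", "account_id": "A1", "amount": 120.0},
    {"transaction_id": "T2", "account_id": "A1", "amount": 80.0},
]


def denormalize(transactions, accounts, customers):
    """Produce one wide row per transaction with account and customer data repeated."""
    wide_rows = []
    for tx in transactions:
        account = accounts[tx["account_id"]]
        customer = customers[account["customer_id"]]
        wide_rows.append(
            {
                "transaction_id": tx["transaction_id"],
                "amount": tx["amount"],
                "account_id": tx["account_id"],
                "account_type": account["account_type"],
                "customer_id": account["customer_id"],
                "customer_name": customer["name"],
                "region": customer["region"],
            }
        )
    return wide_rows


wide_rows = denormalize(transactions, accounts, customers)
```

The wide rows duplicate the customer and account attributes, which costs storage and makes updates harder, but lets a reader scan a single structure without performing joins at query time.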

The focus on denormalization becomes critical in the context of the data lake and specifically in terms of any associated Hadoop/HDFS data structures.
