Charting the data lake: Model normalization patterns for data lakes

The data lake can be considered the consolidation point for all of the data that is of value across the different parts of the enterprise. A typical data lake is likely to contain a significant range of different types of data repositories, and these repositories are likely to be required to play a number of different roles to address the needs of different users.

The other key question in building out the data lake repositories is to what degree standardization or consistency of schema is desirable, or even necessary. It is a valid choice for an organization to decide that there is no need to enforce any standardization of schema across the data lake. In this case, the expectation is that whatever virtualization layer is in place is capable of guiding the different users through the array of different structures, the duplication, and the differing terminology. In other cases, the decision is taken that at least some parts of the data lake need to comply with some degree of standardization in the database schemas, even where those databases are doing a range of different jobs and so may need to be structured differently.


The diagram below shows such a typical collection of different data structures combined together within a data lake.

In the landing zone, the focus is on initially ingesting the data as it arrives in the data lake in raw format. The landing zone may also be extended to enable some initial processing of the data, making it more generally useful across the data lake. The data scientist sandboxes are typically ad hoc, issue-specific, very flat structures containing many repeating groups of data. The data warehouse typically needs to store data in a reasonably flexible format for various downstream uses, whereas the data marts are often made up of aggregated data focused on a specific business issue or a specific group of users.
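The "very flat structures containing many repeating groups" typical of a sandbox can be sketched as below. This is purely illustrative (the column names are invented for the example): one wide record carries numbered column groups, and a small helper unpivots it into the normalized, one-row-per-group form a warehouse would prefer.

```python
# Illustrative only: a wide, sandbox-style record with repeating
# column groups (product_1/qty_1, product_2/qty_2, ...), and a helper
# that normalizes it into one row per group instance (1NF).

sandbox_row = {
    "customer": "Acme Corp",
    "product_1": "widget", "qty_1": 5,
    "product_2": "gadget", "qty_2": 2,
}

def normalize_repeating_groups(row, group_fields=("product", "qty")):
    """Unpivot numbered column groups into one row per group instance."""
    out = []
    i = 1
    # Keep consuming groups while the next numbered column exists.
    while f"{group_fields[0]}_{i}" in row:
        out.append({
            "customer": row["customer"],
            **{f: row[f"{f}_{i}"] for f in group_fields},
        })
        i += 1
    return out

rows = normalize_repeating_groups(sandbox_row)
# rows == [{'customer': 'Acme Corp', 'product': 'widget', 'qty': 5},
#          {'customer': 'Acme Corp', 'product': 'gadget', 'qty': 2}]
```

The same trade-off runs in both directions: the sandbox favours the wide form for quick, join-free exploration, while the warehouse favours the normalized form for flexibility.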

A key potential role of a data model, then, is to enable a degree of standardization across such a disparate set of repositories, so that, where possible, the same or similar structures and terminology are used to assist with the subsequent understanding and navigation of the data lake. The need to address the potentially different characteristics of these repositories, in terms of how the data is stored and accessed, is where the different normalization patterns come in.


The data models traditionally used in the construction of data warehouse structures would often start with a reasonably high degree of normalization, typically whatever degree provides the flexibility needed to represent the various business requirements effectively. When these models are then transformed from such a logical, platform-independent format into a more platform-specific one, varying degrees of denormalization take place in order to produce physical models that are performant in the specific physical environment.

The focus on denormalization becomes critical in the context of the data lake and specifically in terms of any associated Hadoop/HDFS data structures.
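As a minimal sketch of that denormalization step (the table and column names here are illustrative, not from the article): a normalized customer/order pair of structures is flattened into single wide records, the read-optimized, join-free layout that suits an HDFS-style append-only store.

```python
# Normalized form: customer attributes stored once, referenced by key.
customers = {
    1: {"customer_id": 1, "name": "Acme Corp", "country": "DE"},
    2: {"customer_id": 2, "name": "Globex", "country": "US"},
}

orders = [
    {"order_id": 100, "customer_id": 1, "amount": 250.0},
    {"order_id": 101, "customer_id": 2, "amount": 90.0},
    {"order_id": 102, "customer_id": 1, "amount": 40.0},
]

def denormalize(orders, customers):
    """Join each order with its customer, duplicating the customer
    attributes onto every order row (a classic read-optimized layout)."""
    flat = []
    for order in orders:
        cust = customers[order["customer_id"]]
        flat.append({
            **order,
            "customer_name": cust["name"],
            "customer_country": cust["country"],
        })
    return flat

flat_rows = denormalize(orders, customers)
# Each flat row now redundantly carries the customer columns,
# trading extra storage for the elimination of read-time joins.
```

The redundancy is deliberate: in a write-once, scan-heavy environment such as HDFS, repeating the customer attributes on every row is usually cheaper than performing the join at every read.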
