Charting the data lake: Model normalization patterns for data lakes
- by 7wData
The data lake can be considered the consolidation point for all of the data that is of value across the different parts of the enterprise. A typical data lake is likely to contain a significant range of different types of data repositories, and these repositories are likely to be required to play a number of different roles for different users, for example:
The other key question in building out the data lake repositories is to what level some standardization or consistency of schema is desirable or even necessary. It is a valid choice for an organization to decide that there is no need to enforce any degree of schema standardization across the data lake; in this case, the expectation is that whatever virtualization layer is in place can guide the different users through the array of different structures, the duplication, and the differing terminology. In other cases, the decision is taken that at least some parts of the data lake need to comply with some degree of standardization in their database schemas, even where those databases are still doing a range of different jobs and so may need to be structured differently.
The diagram below shows such a typical collection of different data structures combined together within a data lake.
In the landing zone, the focus is on initially ingesting the data as it arrives into the data lake in a raw format. The landing zone may also be extended to enable some initial processing of the data, making it more generally useful across the data lake. The data scientist sandboxes are typically ad hoc, issue-specific, very flat structures containing many repeating groups of data. The data warehouse typically needs to store data in a reasonably flexible format for various downstream uses, whereas the data marts are often made up of heavily aggregated data focused on a specific business issue or a specific group of users.
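The contrast between these zones can be sketched with the same order data in three shapes. This is a minimal, hypothetical illustration; the field and table names are assumptions, not taken from the article:

```python
# Hypothetical sketch: the same order data shaped for three data lake zones.
# All names and fields here are illustrative assumptions.

# Landing zone: raw records, ingested as-is (items still packed in a string).
raw = [
    {"order_id": 1, "customer": "Acme", "region": "EU", "items": "widget:2;bolt:10"},
    {"order_id": 2, "customer": "Acme", "region": "EU", "items": "widget:1"},
]

# Data scientist sandbox: a very flat structure with repeating groups
# (customer and region repeated on every item row).
sandbox = [
    {"order_id": o["order_id"], "customer": o["customer"], "region": o["region"],
     "product": item.split(":")[0], "qty": int(item.split(":")[1])}
    for o in raw
    for item in o["items"].split(";")
]

# Data mart: aggregated around one business question (units sold per product).
mart = {}
for row in sandbox:
    mart[row["product"]] = mart.get(row["product"], 0) + row["qty"]

print(mart)  # {'widget': 3, 'bolt': 10}
```

Each shape serves its audience: the raw form preserves fidelity, the flat form suits exploratory analysis, and the aggregate answers one business question cheaply.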
A key potential role of a data model is therefore to enable a degree of standardization across such a disparate set of repositories, so that, where possible, the same or similar structures and terminology are used to assist with the subsequent understanding and navigation of the data lake. The need to address potentially different characteristics across these repositories, in terms of how the data is stored and accessed, is where the different normalization patterns come in.
The data models traditionally used in the construction of data warehouse structures would often start with a reasonably high degree of normalization, typically the degree that provides the flexibility needed to represent the various business needs effectively. When these models are then transformed from such a logical, platform-independent format into a more platform-specific format, varying degrees of denormalization take place in order to produce physical models that are performant in the specific physical environment.
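One common denormalization step in that logical-to-physical transformation is pre-joining reference data onto fact rows. A minimal sketch, with hypothetical table and column names:

```python
# Hypothetical sketch: a normalized pair of structures (logical model)
# collapsed into one denormalized, query-ready structure (physical model).
# Names are assumptions for illustration only.

customers = {  # normalized: each customer's attributes stored exactly once
    "C1": {"name": "Acme", "segment": "Enterprise"},
}
orders = [     # normalized: orders reference customers by key
    {"order_id": 1, "customer_id": "C1", "amount": 250.0},
    {"order_id": 2, "customer_id": "C1", "amount": 75.0},
]

# Denormalization: copy customer attributes onto each order row so that
# downstream reads need no join at query time.
denormalized = [{**o, **customers[o["customer_id"]]} for o in orders]

print(denormalized[0])
# {'order_id': 1, 'customer_id': 'C1', 'amount': 250.0,
#  'name': 'Acme', 'segment': 'Enterprise'}
```

The trade-off is the classic one: the denormalized form duplicates customer attributes on every order, buying read performance at the cost of storage and update complexity.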
The focus on denormalization becomes critical in the context of the data lake and specifically in terms of any associated Hadoop/HDFS data structures.
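In HDFS-style storage, that denormalization often surfaces as flat, self-contained records, for example newline-delimited JSON or Parquet files in which every row carries all the attributes a scan needs. A small sketch under that assumption (the format choice and field names are illustrative):

```python
import json

# Hypothetical sketch: pre-joined rows stored as flat, self-contained
# newline-delimited JSON records, a common pattern for HDFS-resident data.
# Field names are illustrative assumptions.

rows = [
    {"order_id": 1, "customer": "Acme", "segment": "Enterprise", "amount": 250.0},
    {"order_id": 2, "customer": "Acme", "segment": "Enterprise", "amount": 75.0},
]

# Each record is complete on its own: reading the file back requires no
# join against a separate customer table.
ndjson = "\n".join(json.dumps(r) for r in rows)
restored = [json.loads(line) for line in ndjson.splitlines()]
assert restored == rows
```

Because joins are expensive in large-scale scan-oriented processing, this repetition-for-read-speed trade is usually the right one for Hadoop/HDFS structures.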