This is the final article in a three-part series exploring what it takes to build a data lake capable of meeting all the requirements of a truly enterprise-scale data management platform. While earlier installments focused on enterprise-scale data management in Hadoop, data onboarding into the data lake, and security, this article focuses on two things: integrating the data lake within the broader enterprise IT landscape, and data governance.
As more lakes are deployed, we see patterns emerge for how data lakes are positioned relative to existing databases, data warehouses, analytic appliances, and enterprise applications in larger organizations.
Some data lakes are deployed from the outset as centralized system-of-record data platforms, serving other systems in an enterprise-scale, data-as-a-service model. As a centralized data lake builds momentum, collecting more data and attracting more use cases and users, its value grows as users collaborate on improving and reusing the data.
Other projects start at the edge of the organization to deliver data and meet the analytic needs of a specific business group. A localized data lake often expands to support multiple teams, or spawns additional separate data lake instances to support other groups who want the same improved data access the first group received.
Regardless of which pattern the data lake follows as it lands and expands, its increasing role in the organization brings with it new requirements for enterprise readiness.
To be enterprise-ready, the data lake needs to support a set of capabilities that allow it to be integrated within the company’s overall data management strategy and IT applications and data flow landscape.
Here are some requirements to keep in mind:
In addition to streamlining the integration of your data lake, you must prepare the lake to support a broad and expanding community of business users.
As more users begin working with the data lake, whether directly or through downstream applications and reporting or analytic systems, the importance of strong data governance grows. Data governance is the final dimension of enterprise readiness.
By bringing together what are typically hundreds of diverse data sets in one large repository and giving users unprecedented direct access to that data, data lakes create new governance challenges and opportunities.
The challenges have to do with ensuring that data governance policies and procedures exist and are enforced in the lake. Enterprise-ready data governance in the data lake starts with a clear definition of who owns or has custodial responsibility for each data asset as it enters the lake and as it is maintained and enhanced through the data lake process.
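To make the ownership requirement concrete, here is a minimal sketch of how a catalog might record an owner or custodian for each data asset and flag the ones that have neither. The `DataAsset` fields, the stage labels, and the `unassigned_assets` helper are all hypothetical names chosen for illustration; they do not come from any particular data lake product or catalog tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataAsset:
    """Hypothetical catalog entry for one data set in the lake."""
    name: str
    stage: str                        # illustrative labels, e.g. "raw" or "curated"
    owner: Optional[str] = None       # accountable business owner, if assigned
    custodian: Optional[str] = None   # team with custodial responsibility, if assigned


def unassigned_assets(assets: List[DataAsset]) -> List[str]:
    """Return the names of assets that have neither an owner nor a custodian."""
    return [a.name for a in assets if not (a.owner or a.custodian)]


# Example catalog: the same data set can appear at more than one stage,
# with responsibility recorded separately for each stage.
catalog = [
    DataAsset("customer_orders", "raw", owner="sales-ops"),
    DataAsset("customer_orders", "curated", custodian="data-engineering"),
    DataAsset("web_clickstream", "raw"),
]

print(unassigned_assets(catalog))  # ['web_clickstream']
```

A check like this could run whenever a data set is onboarded or promoted to a new stage, so that no asset moves through the lake without someone responsible for it.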