The Big Data revolution has redefined the way enterprises work; data underpins everything. Not only have open-source tools such as Apache Hadoop and Spark made vast quantities of data easier to collect, process, and store in real time, but business intelligence (BI) and data visualization tools have begun to help us scratch the surface of analyzing and transforming that data to inform core business decisions.
Yet for all that Big Data and BI technology have evolved, enterprises are still dealing with such massive, constantly compounding volumes of data that finding the right points to analyze feels like diving for needles in a never-ending haystack. The solution? Redesign the haystack.
Enter data lakes, a new type of cloud-based enterprise architecture that structures data in a more scalable way, making it easier to experiment with and more open to exploration and manipulation rather than locked into rigid schemas and silos. Nasry Angel, an Enterprise Architecture Researcher at Forrester Research, explained why enterprises are embracing data lake architectures.
"It sounds cliché, but when you think about an effective modern data environment, it's a lot more experimental," said Angel. "You need to be able to learn fast and fail fast. In the past, managing data, especially in a warehouse, was all about quality, down to the decimal point; making sure everything was completely accurate and true. It's called chasing a single version of the truth. Then generating a pixel-perfect report and blasting it out to 5,000 users.
"Nowadays, it's a more scientific process. You walk in with a hypothesis about the data you want to test and you want to be able to play with the data, mix and match, to try out different things before you go and productize something."
What's In a Data Lake?
A data lake is a storage repository. Unlike a data warehouse or "data mart," however, a data lake is distributed across multiple nodes rather than held in the fixed, structured, schema-dependent environment of a data warehouse, Angel explained (see infographic below).
"A data lake allows you to apply a schema when you write the data versus a data warehouse that requires you to do a schema on read. So, essentially, a data warehouse requires you to model the data before you understand its context, which doesn't really make sense," said Angel.
Infographic source: JustOne Database, Inc.
"Typically, in a warehouse, you have IT professionals coming up with what they think are the best data models, and they're not the eventual users of the data. You can quickly see how that hinders productivity and business value," he added. "Ultimately, you and the business users need to be the ones that make decisions about the structure of data, and, in a data lake, you can first explore and figure out what's there and then figure out a schema to best organize it."
Data lakes are typically built on Hadoop, and enterprise Hadoop distributions such as Hortonworks and MapR offer data lake architectures. Businesses can also build data lakes on Infrastructure-as-a-Service (IaaS) clouds, including Amazon Web Services (AWS) and Microsoft Azure. Amazon's Elastic Compute Cloud (EC2) supports data lakes, while Microsoft has a dedicated Azure Data Lake platform to store and analyze real-time data. Angel said data lakes are maturing to the point where businesses can begin investing in them with reasonable confidence.
"A few years back, Hadoop was all the rage. Now we're getting to a point where Hadoop is commoditized," said Angel. "The question is not if Hadoop but when, and what you're going to do with it. What types of applications are you going to build on top of Hadoop once you've gotten the data into a common place like a data lake? At this point, it's about using the data to develop applications to meet your specific business needs.