Data Lakes

Data Lakes, Explained

The Big Data revolution has redefined the way enterprises work; data underpins everything. Not only have open-source tools such as Apache Hadoop and Spark made vast quantities of data easier to collect, process, and store in real time, but business intelligence (BI) and data visualization tools have begun to help us scratch the surface of analyzing and transforming that data to inform core business decisions.

Yet despite how far Big Data and BI technology have evolved, we're still dealing with such massive volumes of constantly compounding data that finding the right points to analyze feels like searching for needles in a never-ending haystack. The solution? Redesign the haystack.

Enter data lakes, a new type of cloud-based enterprise architecture that structures data in a more scalable way, making it easier to experiment with and more open to exploration and manipulation rather than locked into rigid schemas and silos. Nasry Angel, an Enterprise Architecture Researcher at Forrester Research, explained why enterprises are embracing data lake architectures.

"It sounds cliché, but when you think about an effective modern data environment, it's a lot more experimental," said Angel. "You need to be able to learn fast and fail fast. In the past, managing data, especially in a warehouse, was all about quality, down to the decimal point; making sure everything was completely accurate and true. It's called chasing a single version of the truth. Then generating a pixel-perfect report and blasting it out to 5,000 users.

"Nowadays, it's a more scientific process. You walk in with a hypothesis about the data you want to test and you want to be able to play with the data, mix and match, to try out different things before you go and productize something."

What's In a Data Lake?

A data lake is a storage repository. Unlike a data warehouse or "data mart," however, Angel explained, a data lake is distributed over multiple nodes rather than held in the fixed, structured environment of a warehouse that relies on predefined schemas (see infographic below).

"A data lake allows you to apply a schema when you read the data versus a data warehouse that requires you to do a schema on write. So, essentially, a data warehouse requires you to model the data before you understand its context, which doesn't really make sense," said Angel.

Source: JustOne Database, Inc.

"Typically, in a warehouse, you have IT professionals coming up with what they think are the best data models, and they're not the eventual users of the data. You can quickly see how that hinders productivity and business value," he added. "Ultimately, you and the business users need to be the ones that make decisions about the structure of data, and, in a data lake, you can first explore and figure out what's there and then figure out a schema to best organize it."
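The explore-first, structure-later workflow Angel describes can be illustrated with a small sketch. This is not tied to Hadoop or any particular data lake product; it uses only the Python standard library, and the sample records, field names, and the helper functions `explore` and `read_with_schema` are hypothetical, chosen purely for illustration. Raw records are kept as they arrived, inspected to see which fields exist, and only then given a schema at the moment they are read.

```python
import json
from collections import Counter

# Hypothetical raw events, stored as-is with no upfront data model
# (illustrative data only, not from the article).
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "country": "US"}',
    '{"user": "b2", "amount": "5.00"}',
    '{"user": "c3", "amount": "12.50", "country": "FR", "coupon": "SPRING"}',
]


def explore(lines):
    """Count how often each field appears across the raw records."""
    counts = Counter()
    for line in lines:
        counts.update(json.loads(line).keys())
    return counts


def read_with_schema(lines, schema):
    """Apply a schema while reading: keep only the chosen fields, cast
    their values, and use None where a record lacks a field."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows


# Step 1: explore what is actually in the raw data.
field_counts = explore(RAW_LINES)

# Step 2: decide on a schema after exploring (an assumed choice here).
schema = {"user": str, "amount": float, "country": str}

# Step 3: apply that schema at read time and analyze.
rows = read_with_schema(RAW_LINES, schema)
total = sum(row["amount"] for row in rows)
```

Because the raw lines are never rewritten, a different team could later read the same records with a different schema (for example, one that includes `coupon`) without reloading or remodeling the data.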

Data lakes are typically built on Hadoop, and enterprise Hadoop distributions such as Hortonworks and MapR offer data lake architectures. Businesses can also build data lakes by using Infrastructure-as-a-Service (IaaS) clouds including Amazon Web Services (AWS) and Microsoft Azure. Amazon's Elastic Compute Cloud (EC2) supports data lakes while Microsoft has a dedicated Azure Data Lake platform to store and analyze real-time data. Angel said data lakes are maturing to the point within the Big Data space where businesses can begin investing in them with reasonable confidence.

"A few years back, Hadoop was all the rage. Now we're getting to a point where Hadoop is commoditized," said Angel. "The question is not if Hadoop but when, and what you're going to do with it. What types of applications are you going to build on top of Hadoop once you've gotten the data into a common place like a data lake? At this point, it's about using the data to develop applications to meet your specific business needs."
