
Not up for a data lake? Analyze in place

On the surface, data lakes seem like a smooth idea. Instead of planning and building complex integrations and carefully constructed data models for analytics, you simply copy everything by default to commodity storage running HDFS (or Gluster or the like) -- and worry about schemas and so on only when you decide what job you want to run.
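To make that "worry about schemas later" workflow concrete, here is a minimal schema-on-read sketch in PySpark; the HDFS path and field names are hypothetical, not taken from any particular system.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw events were dumped into the lake as-is; the schema is inferred only now,
# when we finally know which job we want to run.
events = spark.read.json("hdfs:///lake/raw/clickstream/")  # hypothetical path
events.printSchema()
events.groupBy("page").count().show()  # "page" is a hypothetical field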

Yet making a full copy of your data simply to analyze it still gives people -- particularly those in authority -- pause. The infrastructure for a data lake may cost pennies on the dollar compared to a SAN, but it’s a chunk of change nonetheless, and there are many security, synchronization, and regulatory issues to consider.

That’s why I think this year’s buzzphrase will be “analyze in place” -- that is, using distributed computing technologies to do the analysis without copying all of your data onto another storage medium. To be truthful, “analyze in place” is not a 100 percent accurate description. The idea is to analyze the data “in memory” somewhere else, but in collaboration with the existing system rather than replacing it entirely.

The core challenges for analyzing in place will be load and latency, predicate pushdown, the state of your source systems, and security -- each covered below. First, though, the upside.

The chief benefit of analyzing in place is to avoid an incredible feat of social engineering. (My department, my data, my system, my zone of control, and my accountability, so no, you can’t have a full copy because I said so and I have the authority to say no and if I don’t I’ll slow-walk you to oblivion because I’ve outlasted kids like you for decades.) Also, you get security in context, simpler operations (not having another storage system to administer), and more.


There are plenty of good reasons to use a distributed file system -- and, frankly, SANs were a big fat farce foisted upon us all. However, "I just want to analyze the data I already store elsewhere" may not always be one of those reasons.

There’s no way around load and latency costs. If I can analyze terabytes of data in seconds but can move only a gigabyte at a time to my Spark cluster, then the total operation time is those seconds plus the copy time. Also, you still have to pick the data out of the source system.
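To see how lopsided that can get, here is a back-of-the-envelope calculation in Python; every number in it is an illustrative assumption, not a measurement.

# Rough end-to-end time when copying the data over before analyzing it.
data_gb = 2_000          # assumed size of the data to analyze (2 TB)
network_gb_per_s = 1.0   # assumed effective transfer rate to the Spark cluster
analysis_s = 30          # assumed in-memory analysis time once the data arrives

copy_s = data_gb / network_gb_per_s
total_s = copy_s + analysis_s
print(f"copy: {copy_s/60:.0f} min, analysis: {analysis_s} s, total: {total_s/60:.1f} min")
# The copy dominates: over half an hour of transfer to enable 30 seconds of analysis.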

We use the phrase "predicate pushdown" to mean that the analytics system passes the "where" clause along to the source system. If you don't have an index on the relevant column in your RDBMS, the time to get the data to your analytics system will be roughly the time it takes for your hair to fall out.

Right now, if you’re doing this in Spark, you need to balance predicate pushdown (that is, network optimization) against overloading your source system (that is, execution costs). There is no magic here: we're doing a giant query against the source system and copying the result into another cluster’s memory for further analysis.
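As an illustration of what that balance looks like in code, here is a small PySpark sketch reading from a JDBC source; the connection details, table, and column names are placeholders, not a recommendation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analyze-in-place").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db:5432/sales")  # placeholder URL
          .option("dbtable", "orders")
          .option("user", "analyst")
          .option("password", "change-me")
          .load())

# This filter is a candidate for predicate pushdown: Spark turns it into a WHERE
# clause executed by the source database, so only matching rows cross the network --
# at the cost of making the source system do that work.
recent = orders.filter("order_date >= '2017-01-01'")
recent.groupBy("region").count().show()

# explain() shows whether the predicate was actually pushed; look for PushedFilters
# in the physical plan's scan node.
recent.explain()

Whether the filter gets pushed, and whether the source can serve it from an index, is exactly the tradeoff described above.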


Sometimes to make this work you may have to shore up the source system. That may take time -- and by the way, it's not a sexy big data project. It may be a fatter server for the Oracle team. As a guy with a sales role for a consultancy, I get a headache from this because I hear “long sales cycle.” It may also be costlier than the lake approach; it will definitely be costlier in terms of labor. The deal is that in many cases, analyze in place is the only approach that can work.

Handling security when you have permission to use the analytics system, permission to use a source system -- and potentially multiple source systems -- is complicated. I can see the FSM-awful Kerberos rules now! Up until now, big data projects have tended to skirt around this by getting a copy and designing a new security system around it, or by simply saying, "pay no attention to the flat security model we got an exception for."


In the brave new world of analyze in place, we’ll use terms like “federated” and “multitiered” to cover up “painful” and “complex,” but there is another word: “necessary.” Our organizational zones of control exist for a reason.

 


