On the surface, data lakes seem like a smooth idea. Instead of planning and building complex integrations and carefully constructed data models for analytics, you simply copy everything by default to commodity storage running HDFS (or Gluster or the like) -- and worry about schemas and so on when you decide what job you want to run.
Yet making a full copy of your data simply to analyze it still gives people -- particularly those in authority -- pause. The infrastructure for a data lake may cost pennies on the dollar compared to a SAN, but it’s a chunk of change nonetheless, and there are many security, synchronization, and regulatory issues to consider.
That’s why I think this year’s buzzphrase will be “analyze in place” -- that is, using distributed computing technologies to do the analysis without copying all of your data onto another storage medium. To be truthful, “analyze in place” is not a 100 percent accurate description. The idea is to analyze the data “in memory” somewhere else, but in collaboration with the existing system rather than replacing it entirely.
The core challenges for analyzing in place will be load and latency, the capacity of the source system, and security.
The chief benefit of analyzing in place is to avoid an incredible feat of social engineering. (My department, my data, my system, my zone of control, and my accountability, so no, you can’t have a full copy because I said so and I have the authority to say no and if I don’t I’ll slow-walk you to oblivion because I’ve outlasted kids like you for decades.) Also, you get security in context, simpler operations (not having another storage system to administer), and more.
There are plenty of good reasons to use a distributed file system -- and, frankly, SANs were a big fat farce upon us all. However, "I just want to analyze the data I already store elsewhere" may not always be one of those reasons.
There’s no way around load and latency cost. If I can analyze terabytes of data in seconds, but can move only a gigabyte at a time to my Spark cluster, then the total operation time is those seconds plus the copy time. Also, you still have to pick the data out of the source system.
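To put rough numbers on that, here is a back-of-the-envelope sketch. The function name and the figures (2TB pulled at 1GB per second, 30 seconds of analysis) are made up for illustration; plug in your own.

```python
def total_time_seconds(data_gb, transfer_gb_per_s, analysis_s):
    """Estimate end-to-end time: copy the data over, then analyze it."""
    copy_s = data_gb / transfer_gb_per_s  # time spent moving data to the cluster
    return copy_s + analysis_s            # the fast analysis does not hide the copy


# Hypothetical numbers: 2,000 GB at 1 GB/s, then 30 s of in-memory analysis.
estimate = total_time_seconds(data_gb=2000, transfer_gb_per_s=1.0, analysis_s=30)
print(estimate)  # 2030.0 seconds, almost all of it copy time
```

With these assumed numbers the analysis is about 1.5 percent of the wall-clock time, which is why cutting the amount of data moved matters more than speeding up the cluster.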
We use the phrase "predicate pushdown" for this. Essentially, your analytics system passes the "where" clause along to the source system, so only the matching rows come back over the wire. In practice it also means that if you don’t have an index on your RDBMS, then the time to get the data to your analytics system will be equal to the time it takes for your hair to fall out.
Right now, if you’re doing this in Spark, you need to balance the predicate pushdown (that is, network optimization) against the risk of overloading your source system (execution costs). There is no magic here: we’re running a giant query against the source system and copying the results into another cluster’s memory for further analysis.
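The trade-off is easier to see in miniature. The sketch below is not Spark; it uses Python's built-in sqlite3 module as a stand-in source system, with a hypothetical orders table, to compare pulling everything and filtering locally against handing the "where" clause to the source.

```python
import sqlite3

# Stand-in "source system": an in-memory SQLite table (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
source.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east" if i % 4 == 0 else "west", float(i)) for i in range(1000)],
)
# Without this index the source has to scan the whole table for every pushed-down filter.
source.execute("CREATE INDEX idx_orders_region ON orders (region)")


def fetch_without_pushdown(conn, region):
    """Move every row to the analytics side, then filter there."""
    rows = conn.execute("SELECT id, region, amount FROM orders").fetchall()
    return len(rows), [r for r in rows if r[1] == region]


def fetch_with_pushdown(conn, region):
    """Send the predicate to the source; only matching rows are moved."""
    rows = conn.execute(
        "SELECT id, region, amount FROM orders WHERE region = ?", (region,)
    ).fetchall()
    return len(rows), rows


moved_all, east_local = fetch_without_pushdown(source, "east")
moved_filtered, east_pushed = fetch_with_pushdown(source, "east")
print(moved_all, moved_filtered)  # 1000 250
```

Both calls produce the same 250 rows, but the second moves a quarter of the data. The cost has not vanished; it has shifted to the source system, which now does the filtering work.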
Sometimes to make this work you may have to shore up the source system. That may take time -- and by the way, it's not a sexy big data project. It may be a fatter server for the Oracle team. As a guy with a sales role for a consultancy, I get a headache from this because I hear “long sales cycle.” It may also be costlier than the lake approach; it will definitely be costlier in terms of labor. The deal is that in many cases, analyze in place is the only approach that can work.
Handling security is complicated when you need permission to use the analytics system and permission to use a source system -- and potentially multiple source systems. I can see the FSM-awful Kerberos rules now! Up until now, big data projects have tended to skirt around this by getting a copy and designing a new security system around it, or by simply saying, “pay no attention to the flat security model we got an exception for.”
In the brave new world of analyze in place, we’ll use terms like “federated” and “multitiered” to cover up “painful” and “complex,” but there is another word: “necessary.” Our organizational zones of control exist for a reason.