One would obviously expect Hadoop to dominate the discussions at the recent Strata & Hadoop World conference in San Jose, CA. But much of the buzz this year was around Apache Spark, and how Spark might fit into the data management strategies of many organizations.
Arno Candel, chief architect at H2O.ai, shared his observations with Information Management on what conference attendees were most interested in, and how those needs are influencing his company's go-to-market strategies.
Information Management: What are the most common themes that you heard among conference attendees and how do those themes align with what you expected?
Arno Candel: Many of the people I spoke with were interested in how Spark can, or would, fit into their overall data management and analytics strategy. While we at H2O.ai have been seeing increasing interest in Spark, which was one of the reasons we built Sparkling Water, our Spark API, I've always thought of Strata as a Hadoop conference – it is, after all, merged with Hadoop World.
It’s now clear that data storage is essentially a solved problem, while in-memory analytics and machine learning are driving most of the ongoing work in the field. We see ourselves as very much aligned with this trend.
IM: What are the most common data challenges that attendees are facing?
AC: Turning data into actionable insights has been, and remains, a key challenge for many organizations. Everyone has been told that they need to store more and more information in data stores like Hadoop, but there is often a lack of a plan for the “day after.” What do organizations do once they’ve stored all their data in a data lake? They realize that they need some kind of analytics strategy, but aren’t sure exactly what that should look like.
In addition, there is a huge problem with regard to data cleansing: much of the data that organizations have stored is messy, has missing values, and so on, and organizations need to find a way to deal with that.
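To make the cleansing problem Candel describes concrete, here is a minimal sketch of one common approach to missing values, using pandas; the library, column names, and imputation choices (median for numeric columns, a sentinel for categorical ones) are illustrative assumptions, not anything prescribed in the interview.

```python
import numpy as np
import pandas as pd

# Hypothetical "messy" dataset of the kind described above:
# numeric gaps and a missing categorical label.
df = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan],
    "income": [52000, 61000, np.nan, 48000],
    "segment": ["a", "b", None, "a"],
})

# One common cleansing step: impute numeric columns with the median
# and fill categorical columns with an explicit sentinel value.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna("unknown")

# After these steps no missing values remain.
assert not df.isna().any().any()
```

This is only one of many possible strategies; depending on the analysis, dropping incomplete rows or using model-based imputation may be more appropriate.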