Syncsort’s Paige Roberts caught up with Steve Sarsfield from Hewlett Packard Enterprise (HPE) at the latest Strata. Steve is the product marketing manager for HPE Big Data Software, focused on their Vertica for SQL on Hadoop product. Steve is also a notable name in the arena of data quality and governance, and authored the book The Data Governance Imperative. Enjoy some keen industry insight in this interview between Paige and Steve.
So, we’re at Strata, and you’re a Vertica person. What do you feel the intersection is for Hadoop and Vertica?
HPE and Hadoop really intersect quite a bit when it comes to some of the innovations that we’re working on, and we’re showing some great ones [here at Strata]. One of them is our big data reference architectures, which we’ve designed to work in partnership with Hadoop, specifically HDFS and YARN. These reference architectures allow you to use YARN labels to separate compute from storage. So if you want to make that dynamic within the organization, you can use YARN labels to specify how much compute and how much storage you want to use for any job.
The second part is that we have HPE Vertica for SQL on Hadoop. That is a product that allows you to install our Vertica engine directly into the Hadoop cluster and perform SQL queries on Hadoop. It’s 100% TPC-DS compliant, fully ANSI SQL compliant and can be installed either in the Hadoop cluster or separately as a Vertica cluster. It’s a high-performance engine, and we’re happy to show that off here at Strata, too.
Syncsort and Vertica have been pretty tight over the years. What do you see as the synergies? What makes it such a good partnership?
Our strength is in providing very fast analytics for massive amounts of data. We focus all of our effort, from the way we store data to the way we compress columns, so that the analysis happens fast. What Syncsort brings to the table is the basic concept of getting the data into the database. That’s really important, because although we ingest data, we don’t have that completely covered. If you have complex data or particularly tricky data, we rely on our partnerships like Syncsort. I think that’s a really important component, especially in today’s age when there are so many different file formats and unstructured data and a lot of options when it comes to storing data. We need a partner like you guys to do it.
This is a question I’ve been asking everyone to get different perspectives. What do you think Hadoop is for? It’s a “make you think” question.
Hadoop is a general term that describes many projects that are going on in the open source community. Hadoop, and specifically HDFS, is primarily for storing data at a very low cost. Companies gather data without really being sure what it’s good for or what value it has, and they need some low-cost place to put it. Hadoop, or at least the HDFS component of Hadoop, is a really good place for that. The whole Hadoop community is based on the fact that more and more data is coming at us. However, what we aren’t seeing is IT budgets growing by a lot. What I hear is data volumes growing by 25 to 50 percent, or more in certain companies, but IT budgets are growing by about 4 percent. So companies are looking for ways to store data at a low cost, and that’s one of the functions Hadoop does well. The other thing is data discovery: understanding what data you have, and getting into the data to see if there’s any value there. Those two components are what I think it’s for. Beyond that, it’s pretty exciting to see all the other things that the Hadoop community is incubating. There are countless projects that help companies manage big data.
What do you think of Spark?
Spark is really exciting technology. It seems like something that will be really powerful in the future.