Paul Yip runs IBM’s worldwide product strategy for Apache Hadoop and Apache Spark. Yip has spent more than four years in the trenches helping customers deploy and manage big data solutions. Prior to this role, he worked in product management for Hadoop and technical presales roles, specializing in database management, data warehousing and online transaction processing (OLTP) solutions. Yip has authored three books for IBM Press, and he is a prolific contributor to the IBM developerWorks community.
IBM is extending Big SQL, which was formerly exclusive to the IBM Hadoop Platform, to the Hortonworks Data Platform (HDP). I recently asked Yip, one of the early proponents of the Big SQL on Hortonworks project, to give us some insight into what this transition means for the industry and what benefits it brings.
In the early years of Hadoop, vendors engaged in a lot of competitive posturing, which resulted in a fragmented platform with multiple combinations of components that were not particularly interoperable. As an industry, we realized that this situation was unhealthy.
The Open Data Platform initiative (ODPi), which we helped originate last year, is about Hadoop vendors, system integrators and customers establishing standards for greater compatibility and skills reuse across Hadoop platforms. This effort allows vendors to focus on innovation for Hadoop while reducing the cost of porting and compatibility testing. Customers benefit too, because they spend less time retraining staff if they move between ODPi environments. Having the current versions of both IBM Open Platform (IOP) and HDP certified under ODPi made this work much easier for us.
Aside from that, the number-one request we hear from customers who have deployed Hortonworks is that they want data to be more accessible with SQL. They want more performance and more concurrent access than ever, so we're making that happen. We're simply responding to market demand: Hortonworks obviously has a significant presence in the market, so supporting it makes sense for us.
For years, there has been the promise, or at least the hope, that Apache Hive would be the way to do SQL on Hadoop. And yet today the ecosystem contains at least 23 other SQL engines that I'm aware of, which is a clear indicator that the market is still young and undecided. There is general recognition that SQL on Hadoop is still a problem in need of a proper solution.
We asked ourselves: what would be the one killer feature? Well, what if we could support existing SQL syntax nuances from IBM DB2, IBM PureData System for Analytics powered by Netezza technology and Oracle Database all at once? In Big SQL, that’s exactly what we did.
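To make the kind of dialect nuance involved concrete, here is a hypothetical sketch (the table and column names are invented for illustration, and this is not a statement of Big SQL's documented function list): `NVL` is an Oracle and Netezza function for null substitution, while the ANSI-standard equivalent is `COALESCE`. An engine that accepts only the standard form would force every query using the vendor form to be rewritten.

```sql
-- Oracle/Netezza style, as it might appear in an existing warehouse query
-- (hypothetical table "orders" with a nullable "discount" column):
SELECT order_id, NVL(discount, 0) AS discount
FROM orders;

-- The ANSI-standard rewrite a migration would otherwise require:
SELECT order_id, COALESCE(discount, 0) AS discount
FROM orders;
```

Multiply a handful of such functions across thousands of warehouse queries, and the cost of a manual rewrite becomes clear; absorbing those vendor-specific forms directly is the point of the syntax-compatibility feature.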
However, none of these engines really tackles the big problem of running complex data warehousing queries on Hadoop, and that key workload is exactly what we're targeting with Big SQL. Given the low cost per gigabyte that characterizes Hadoop, customers are looking for ways to use Hadoop to complement their more costly traditional data warehouse technologies. When those warehouses start to reach capacity, customers would normally buy more capacity, which can be very expensive. What they want to do instead is offload some of those workloads onto Hadoop; but if they do that, they need to rewrite many of their SQL queries just to get them to work, let alone perform.