Opening up Big SQL for all: An interview with Paul Yip

Paul Yip runs IBM’s worldwide product strategy for Apache Hadoop and Apache Spark. Yip has spent more than four years in the trenches helping customers deploy and manage big data solutions. Prior to this role, he worked in Hadoop product management and in technical presales, specializing in database management, data warehousing and online transaction processing (OLTP) solutions. Yip has authored three books for IBM Press, and he is a prolific contributor to the IBM developerWorks community.

IBM is extending Big SQL, which was formerly exclusive to the IBM Hadoop Platform, to the Hortonworks Data Platform (HDP). I recently asked Yip, one of the early proponents of the Big SQL on Hortonworks project, for some insight into what this transition means for the industry and what benefits it brings.

In the early years of Hadoop, different vendors engaged in a lot of posturing, which resulted in a fragmented platform with multiple combinations of components that were not particularly interoperable. As an industry, we realized that this situation was generally unhealthy.

The Open Data Platform initiative (ODPi) we helped originate last year is about Hadoop vendors, system integrators and customers establishing standards that improve compatibility and skills reuse across Hadoop platforms. This effort allows vendors to focus on innovation for Hadoop while reducing the cost of porting and testing for compatibility. Customers benefit too because they spend less time retraining staff if they move between ODPi environments. Having the current versions of both IBM Open Platform (IOP) and HDP certified under ODPi made this work much easier for us.

Aside from that, the number-one issue we hear from customers who have deployed Hortonworks is that they want data to be more accessible through SQL. They want more performance and more concurrent access than ever. So we’re making that happen; we’re simply responding to market demand. Hortonworks obviously has a significant presence in the market, so our response makes sense for us.

For years, there has been the promise, or at least the hope, that Apache Hive would be the way to do SQL on Hadoop. And yet, today, I’m aware of at least 23 other SQL engines in the ecosystem, which is a clear indicator that the market is still young and undecided. There is general recognition that SQL on Hadoop is still a problem in need of a proper solution.

We asked ourselves: what would be the one killer feature? Well, what if we could support existing SQL syntax nuances from IBM DB2, IBM PureData System for Analytics powered by Netezza technology and Oracle Database all at once? In Big SQL, that’s exactly what we did.

On top of that, I can think of at least six different types of workloads that SQL can be used for, and each of these tools is typically designed for one or more of them. For example, Hive was built for Hadoop MapReduce and is good for large-scale queries. Large scale means a lot of data spread across a lot of nodes, but it doesn’t mean the queries are necessarily complex or deeply analytical. Apache Phoenix is good for key-based lookups, inserts, updates and deletes because the underlying engine is Apache HBase. Spark SQL is good for data scientists who explore and manipulate data ad hoc. Apache Drill is good for discovery workloads, especially with JavaScript Object Notation (JSON) data stores.

However, none of these engines really tackles the big problem of complex data warehousing queries on Hadoop, and that key workload is exactly what we’re targeting with Big SQL. With the low cost per gigabyte that characterizes Hadoop, customers are looking for ways to build out Hadoop alongside their more costly traditional data warehouse technologies. When those warehouses start to reach capacity, they’d normally buy more capacity, which can be very expensive. What they want to do now is offload some of those workloads into Hadoop; but if they do that, they will need to rewrite many of their SQL queries just to get them to work, let alone perform.
