Opening up Big SQL for all: An interview with Paul Yip

Paul Yip runs IBM’s worldwide product strategy for Apache Hadoop and Apache Spark. Yip has spent more than four years in the trenches helping customers deploy and manage big data solutions. Prior to this role, he worked in Hadoop product management and in technical presales, specializing in database management, data warehousing and online transaction processing (OLTP) solutions. Yip has authored three books for IBM Press, and he is a prolific contributor to the IBM developerWorks community.

IBM is extending Big SQL, formerly exclusive to the IBM Hadoop platform, to the Hortonworks Data Platform (HDP). I recently asked Yip, one of the early proponents of the Big SQL on Hortonworks project, for his insight into what this move means for the industry and what benefits it brings.

In the early years of Hadoop, a lot of posturing from different vendors resulted in a fragmented platform with multiple combinations of components that were not particularly interoperable. As an industry, we realized that this situation was generally unhealthy.

The Open Data Platform initiative (ODPi), which we helped originate last year, is about Hadoop vendors, system integrators and customers establishing standards for more compatibility and skills reuse across Hadoop platforms. This effort allows vendors to focus on innovation for Hadoop while reducing the cost of porting and compatibility testing. Customers benefit too, because they spend less time retraining staff if they move between ODPi environments. Having the current versions of both IBM Open Platform (IOP) and HDP certified under ODPi made this much easier for us.

Aside from that, the number-one issue we hear from customers who have deployed Hortonworks is that they want their data to be more accessible through SQL. They want more performance and more concurrent access than ever, so we’re making that happen. We’re simply responding to market demand: Hortonworks obviously has a significant presence in the market, so this move makes sense for us.

For years, there has been the promise, or at least the hope, that Apache Hive would be the way to do SQL on Hadoop. And yet today there are at least 23 other SQL engines in the ecosystem that I’m aware of, which is a clear indicator that the market is still young and undecided. There is general recognition that SQL on Hadoop remains a problem in need of a proper solution.

We asked ourselves: what would be the one killer feature? Well, what if we could support the existing SQL syntax nuances of IBM DB2, IBM PureData System for Analytics (powered by Netezza technology) and Oracle Database all at once? In Big SQL, that’s exactly what we did.
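
To make that concrete, here is a minimal sketch of what dialect compatibility buys you, written in Python against the ibm_db driver (Big SQL is reachable through IBM’s standard DB2 client drivers). The host, port, credentials and sales table below are hypothetical placeholders; exact connection details vary by cluster.

```python
# A minimal sketch, assuming Big SQL is reached through IBM's DB2-compatible
# client driver (the ibm_db Python package). Host, port, credentials and
# the sales table are hypothetical placeholders.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=bigsql;HOSTNAME=bigsql-head.example.com;PORT=32051;"
    "PROTOCOL=TCPIP;UID=bigsql;PWD=secret", "", "")

# Oracle/Netezza-flavored constructs (NVL, DECODE) are submitted as-is;
# the point of the compatibility layer is that such queries should not
# need rewriting to run against data in Hadoop.
stmt = ibm_db.exec_immediate(conn, """
    SELECT cust_id,
           NVL(region, 'UNKNOWN') AS region,
           DECODE(status, 'A', 'active', 'inactive') AS status_label
    FROM sales
""")

row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)

ibm_db.close(conn)
```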

On top of that, I can think of at least six different types of workloads that SQL can be used for, and each of these tools is typically designed for one or more of them. For example, Hive was built for Hadoop MapReduce and is good for large-scale queries. Large scale means a lot of data spread across a lot of nodes, but it doesn’t mean the queries are necessarily complex or deeply analytical. Apache Phoenix is good for key-based lookups, inserts, updates and deletes because the underlying engine is Apache HBase. Spark SQL is good for data scientists who explore and manipulate data ad hoc. Apache Drill is good for discovery workloads, especially with JavaScript Object Notation (JSON) data stores.
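
As a quick illustration of the ad hoc exploration workload attributed to Spark SQL above, here is a hedged PySpark sketch; the events.json file and its fields are hypothetical.

```python
# A minimal PySpark sketch of ad hoc SQL exploration. The events.json
# file and its event_type field are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-exploration").getOrCreate()

# Load semi-structured data, expose it as a SQL view, and poke at it
# interactively -- the exploratory pattern described above.
events = spark.read.json("events.json")
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").show()

spark.stop()
```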

However, none of these engines really tackles the big problem of complex data warehousing queries on Hadoop, and that key workload is what we’re targeting with Big SQL. With the low cost per gigabyte that characterizes Hadoop, customers are looking for ways to build Hadoop around their more costly traditional data warehouse technologies. When those warehouses start to reach capacity, customers would normally buy more capacity, which can be very expensive. What they want to do now is offload some of the workloads onto Hadoop; but if they do that, they will need to rewrite many of their SQL queries just to get them to work, let alone perform.
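
To see what that rewrite burden looks like in practice, here is a small, purely illustrative pair of queries: the same report as it might exist in an Oracle warehouse, and the form Hive’s dialect would typically force it into. The table and column names are hypothetical.

```python
# Purely illustrative: the same "top 10 customers" report in two dialects.
# Table and column names are hypothetical.

# As it might exist in an Oracle warehouse (ROWNUM, DECODE):
oracle_style = """
SELECT * FROM (
    SELECT cust_id,
           DECODE(status, 'A', 'active', 'inactive') AS status_label,
           revenue
    FROM sales
    ORDER BY revenue DESC
) WHERE ROWNUM <= 10
"""

# The rewrite Hive would typically require (CASE, LIMIT):
hive_rewrite = """
SELECT cust_id,
       CASE WHEN status = 'A' THEN 'active' ELSE 'inactive' END AS status_label,
       revenue
FROM sales
ORDER BY revenue DESC
LIMIT 10
"""

print(oracle_style)
print(hive_rewrite)
```

Multiply that across hundreds of warehouse queries and the migration cost becomes clear; closing that gap is exactly where Big SQL is aimed.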
