To accelerate the value of Big Data, many products have been developed to make data managed in Hadoop easier to access and analyze through SQL. First there was Hive, which provides a SQL abstraction layer by converting SQL queries into MapReduce jobs. More recently, Cloudera announced Impala, which bypasses MapReduce to enable interactive queries on data stored in Hadoop using the same variant of SQL that Hive uses. And today, EMC Greenplum announced Pivotal HD, the only high-performing, true SQL query engine on top of Hadoop. Don’t be confused by these approaches; they share a common thread: leveraging Hadoop as a Big Data platform for running SQL queries. What sets Pivotal HD apart is that it provides a single scalable, flexible, and cost-effective data platform for all of your analytic needs.
I spoke with Greenplum Chief Scientist Milind Bhandarkar, who explained this breakthrough SQL interface to Hadoop.
1. How does Pivotal HD provide a true, high-performing SQL interface to Hadoop?
Through a feature called Advanced Database Services (HAWQ), we deliver the only fully functional SQL query engine for data stored in HDFS. Using the Greenplum Database as the SQL engine, queries are fully supported and optimized when accessing and analyzing data managed in a Hadoop cluster. For example, Advanced Database Services supports complex joins, in-database analytics (MADlib), transactions, and more, compared with the limited SQL support offered by Hive, Hadapt, and Impala. The advantage is not only better performance but also data-worker productivity, since you can use virtually any BI tool to run your analysis on top of Hadoop.
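As an illustration of the kind of ANSI SQL that a full query engine handles but early versions of Hive could not run unmodified, consider a correlated subquery. The schema and data below are hypothetical, and SQLite stands in for the SQL engine purely for illustration:

```python
# Illustrative sketch only: a correlated subquery -- one of the ANSI SQL
# constructs early HiveQL did not support. Schema and data are hypothetical;
# SQLite stands in for the SQL engine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (symbol TEXT, price REAL, ts INTEGER);
INSERT INTO trades VALUES
  ('ACME', 10.0, 1), ('ACME', 10.5, 2), ('ACME', 25.0, 3),
  ('INIT',  5.0, 1), ('INIT',  5.1, 2);
""")

# Find trades priced more than 1.5x the average for their own symbol.
# The inner SELECT references t.symbol from the outer query, which makes
# the subquery correlated.
rows = conn.execute("""
    SELECT symbol, price FROM trades t
    WHERE price > 1.5 * (SELECT AVG(price)
                         FROM trades
                         WHERE symbol = t.symbol)
""").fetchall()
print(rows)  # -> [('ACME', 25.0)]
```

Workarounds in early Hive typically meant rewriting such queries as explicit joins against pre-aggregated tables, which is exactly the kind of script rewriting a full SQL engine avoids.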
The benchmark below consistently shows a dramatic drop in response time for queries executed in Pivotal HD compared to Hive.
2. Can you describe the architecture of Advanced Database Services (HAWQ)?
The architecture allows you to store native Greenplum Database files directly in HDFS. Rather than maintaining a separate storage layer for your data warehouse, you can now consolidate all of your data in Hadoop. The database or data warehouse schema stays the same, but the underlying native database files are stored in Hadoop. The advantage of this architecture is that you manage only one copy of the data, whether it is accessed through SQL or Hadoop.
3. Can you provide an example use case for Pivotal HD?
A common use case we see is data warehouse customers who want to leverage existing SQL applications against data managed in Hadoop. For example, a financial customer of ours keeps six months of stock transaction data in a Greenplum data warehouse and older historical data in Hadoop. Each day, an eleven-page SQL script runs on the data warehouse to quickly detect trading irregularities. On occasion, an ad hoc request comes in and the customer must run the same analysis on the historical data in Hadoop. Prior to Pivotal HD, the customer had to convert every data warehouse SQL script into a Hive script because of Hive's SQL limitations. With Pivotal HD, the SQL scripts do not change, as both the data warehouse data and the Hadoop data are consolidated and managed in HDFS, with full ANSI SQL accessibility.
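The customer's actual script is not public, but the kind of ANSI SQL an irregularity-detection query might use, such as a window function over per-symbol trade history, can be sketched as follows. The schema, data, and deviation threshold are all hypothetical, and SQLite again stands in for the warehouse engine:

```python
# Hypothetical sketch of an irregularity-detection query using an ANSI
# window function (PARTITION BY ... ORDER BY ... with a ROWS frame).
# Not the customer's actual script; schema and threshold are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (symbol TEXT, ts INTEGER, price REAL);
INSERT INTO trades VALUES
  ('ACME', 1, 10.0), ('ACME', 2, 10.2), ('ACME', 3, 30.0),
  ('ACME', 4, 10.1);
""")

# Flag trades whose price deviates more than 50% from the average of the
# two preceding trades for the same symbol.
rows = conn.execute("""
    SELECT symbol, ts, price FROM (
        SELECT symbol, ts, price,
               AVG(price) OVER (
                   PARTITION BY symbol ORDER BY ts
                   ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
               ) AS prev_avg
        FROM trades
    )
    WHERE prev_avg IS NOT NULL
      AND ABS(price - prev_avg) > 0.5 * prev_avg
""").fetchall()
print(rows)  # -> [('ACME', 3, 30.0)]
```

Window functions of this kind were among the ANSI SQL features absent from Hive's dialect at the time, which is why such scripts had to be rewritten before they could run against data in Hadoop.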
Click here to watch the Pivotal HD launch event, featuring a live demonstration and speakers Scott Yara, co-founder of Greenplum, and Harper Reed, CTO of the 2012 Obama campaign.