Everything we do generates events – click on a mobile ad, pay with a credit card, tweet, measure heart rate, accelerate on the gas pedal, etc. What if an organization can feed these events into predictive models as soon as the event happens to quickly and more accurately make decisions that generate more revenue, lower costs, minimize risk, and improve the quality of care? You would need deep and fast analytics provided by Big Data platforms such as Pivotal HD 2.0 announced yesterday.
Pivotal HD 2.0 brings an in-memory, SQL database to Hadoop through seamless integration with Pivotal GemFire XD, enabling you to combine real-time data with historical data managed in HDFS. Closed loop analytics, operational BI, and high-speed data ingest are now possible in a single OLTP/OLAP platform without any ETL processing required. Use cases are ones that are time sensitive in nature. For example, telecom companies are at the forefront of applying real-time Big Data analytics to network traffic. The “store first, analyze second” method does not make sense for rapidly shifting traffic that requires immediate action when issues arise.
I spoke with Senior Director of Engineering at Pivotal Makarand Gokhale to explain the value in bringing OLTP to a traditional batch processing Hadoop.
1. Real-time solutions for Hadoop can mean many things- performing interactive queries, real-time event processing, and fast data ingest. How would you describe Pivotal HD’s real-time data services for Hadoop?
Pivotal HAWQ was one of the most groundbreaking technologies entering the Hadoop ecosystem last year through its ability to execute complete ANSI SQL on large-scale datasets managed in Pivotal HD. This was great news for SQL users – organizations heavily reliant on SQL applications and common BI tools such as Tableau and MicroStrategy can leverage these investments to access and analyze new data sets managed in Hadoop.
Similarly, RainStor, a leading enterprise database known for its efficient data compression and built-in security, also enables organizations to run ANSI SQL queries against data in Hadoop – highly compressed data. Due to the reduced footprint from extreme data compression (typically 90%+ less), RainStor enables users to run analytics on Hadoop much more efficiently. In fact, there are many instances where queries run significantly faster with a reduced footprint plus some filtering capabilities that figure out what not to read. This allows customers to minimize infrastructure costs and maximize insight for data analysis on larger data sets.
Serving some of the largest telecommunications and financial services organizations, RainStor enables customers to readily query and analyze petabytes of data instead of archiving data sets to tape and then having to reload it whenever it is needed for analysis. RainStor chose to partner with EMC Isilon scale-out NAS for its storage layer to manage these petabyte-scale data environments even more efficiently. Using Isilon, the compute and storage for Hadoop workload is decoupled, enabling organizations to balance CPU and storage capacity optimally as data volumes and number of queries grow.
Furthermore, not only are organizations able to run any Hadoop distribution of choice with RainStor-Isilon, but you can also run multiple distributions of Hadoop against the same compressed data. For example, a single copy of the data managed in Rainstor-Isilon can service Marketing’s Pivotal HD environment, Finance’s Cloudera environment, and HR’s Apache Hadoop environment.
To summarize, running RainStor and Hadoop on EMC Isilon, you achieve:
Flexible Architecture Running Hadoop on NAS and DAS together: Companies leverage DAS local storage for hot data where performance is critical and use Isilon for mass data storage. With RainStor’s compression, you efficiently move more data across the network, essentially creating an I/O multiplier.
Built-in Security and Reliability: Data is securely stored with built-in encryption, and data masking in addition to user authentication and authorization. Carrying very little overhead, you benefit from EMC Isilon FlexProtect, which provides a reliable, highly available Big Data environment.
Improved Query Speed: Data is queried using a variety of tools including standard SQL, BI tools Hive, Pig and MapReduce. With built-in filtering, queries speed-up by a factor of 2-10X compared to Hive on HDFS/DAS.
Compliant WORM Solution: For absolute retention and protection of business critical data, including stringent SEC 17a-4 requirements, you leverage EMC Isilon’s SmartLock in addition to RainStor’s built-in immutable data retention capabilities.
I spoke to Jyothi Swaroop, Director of Product Marketing at Rainstor, to explain the value of deploying EMC Isilon with RainStor and Hadoop.
1. RainStor is known in the industry as an enterprise database architected for Big Data. Can you please explain how this technology evolved and what needs it addresses in the market?
To accelerate the value of Big Data, many products have been developed to make data managed in Hadoop much easier to access and analyze through SQL. First there was Hive, which provides a SQL query abstraction layer by converting SQL queries into MapReduce jobs. More recently, Cloudera announced Impala which bypasses MapReduce to enable interactive queries on data stored in Hadoop using the same variant of SQL that Hive uses. And today, EMC Greenplum announced Pivotal HD, the only high performing, true SQL query engine on top of Hadoop. Don’t be confused by these approaches, as there is a common thread – to leverage Hadoop as a Big Data platform for running SQL queries. The major difference with Pivotal HD is that now there is a single, scalable, flexible, and cost-effective data platform for all of your analytic needs.
I spoke with Greenplum Chief Scientist Milind Bhandarkar to explain this breakthrough SQL interface to Hadoop.
1. How does Pivotal HD provide a true, high performing SQL interface to Hadoop?