Pivotal Big Data Suite: Eliminating the Tax On A Growing Hadoop Cluster

The promise of Big Data is about analyzing more data to gain unprecedented insight, but Hadoop pricing can place serious constraints on the amount of data that can actually be stored for analysis.  Each time a node is added to a Hadoop cluster to increase storage capacity, you are charged for it.  Because this pricing model runs counter to the philosophy of Big Data, Pivotal has removed the tax on storing data in Hadoop with its announcement of Pivotal Big Data Suite.

Through a Pivotal Big Data Suite subscription, customers store as much data as they want in fully supported Pivotal HD, paying only for value-added services on a per-core basis – Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, and HAWQ.  The significance of this new consumption model is that customers can now store as much Big Data as they want and be charged only for the value they extract from it.


*Calculate your savings with Pivotal Big Data Suite compared to traditional Enterprise Data Warehouse technologies.

Additionally, Pivotal Big Data Suite removes the guesswork associated with the diverse data processing needs of Big Data.  With a flexible subscription to your choice of real-time, interactive, and batch processing technologies, organizations are no longer locked into a specific technology because of a contract.  At any point in time, as Big Data applications grow and Data Warehouse applications shrink, you can spin licenses up or down across the value added services without incurring additional costs.  This pooled approach eliminates the need to procure new technologies, a process that leads to delayed projects, additional costs, and more data silos.

I spoke with Michael Cucchi, Senior Director of Product Marketing at Pivotal, to explain how Pivotal Big Data Suite radically redefines the economics of Big Data so organizations can achieve the Data Lake dream.

1. What Big Data challenges does Big Data Suite address and why?

When we introduced Business Data Lake last year, the industry confirmed that we had the right vision – include real-time, interactive, and batch data ingest and processing capabilities supported by data management technologies such as in-memory, MPP, and HDFS. The challenge for customers was how to get started with the Data Lake journey and how much budget to allocate across the breadth of data management technologies that comprise a Data Lake. Also, as data processing requirements change over time, customers want to protect IT investments and not be locked into any specific technology.

Although Pivotal has always provided enterprise-class technologies to support Business Data Lakes, customers were still challenged with how much to invest in Pivotal Greenplum Database for MPP analytical processing versus Pivotal HAWQ for interactive SQL access to HDFS versus Pivotal GemFire for real time, in-memory database processing, etc. To take these pain points off the table, Big Data Suite offers customers a flexible, multi-year subscription to Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, HAWQ, and Pivotal HD. It includes unlimited use of Pivotal HD through a paid subscription to the value added services – Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, and HAWQ.

The significance of this new consumption model is that customers can now store as much Big Data as they want in HDFS, but only be charged for the value they extract from the data.  As an example, a customer could buy 1,000 cores worth of Big Data Suite, and for the first year use 80% of cores dedicated to Pivotal Greenplum Database and 20% of cores dedicated to HAWQ. Over the years, as data and insight start to expand in HDFS, the customer can spin down the use of Pivotal Greenplum Database, and spin up the use of HAWQ without having to pay anything extra as long as the cores don’t exceed 1,000.
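The pooled-core example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Pivotal's licensing logic: the function name `fits_subscription` and the later-year split of 300/700 cores are assumptions made up for the sketch, while the 1,000-core pool and the 80%/20% first-year split come from the example.

```python
# Hypothetical model of a pooled, per-core subscription.
# Product names come from the article; the function and the
# later-year split are illustrative assumptions.

def fits_subscription(subscribed_cores, allocation):
    """Return True if the cores assigned across services stay within the pool."""
    if any(cores < 0 for cores in allocation.values()):
        raise ValueError("core counts must be non-negative")
    return sum(allocation.values()) <= subscribed_cores


SUBSCRIBED_CORES = 1000

# Year one: 80% Greenplum Database, 20% HAWQ (from the example above).
year_one = {"Greenplum Database": 800, "HAWQ": 200}

# A later year: usage shifts toward HAWQ (assumed split for illustration).
later_year = {"Greenplum Database": 300, "HAWQ": 700}

for label, allocation in (("year one", year_one), ("later year", later_year)):
    used = sum(allocation.values())
    status = "within" if fits_subscription(SUBSCRIBED_CORES, allocation) else "over"
    print(f"{label}: {used} of {SUBSCRIBED_CORES} cores used ({status} the subscription)")
```

Both splits stay inside the same 1,000-core pool, which is the point of the model: the mix can change without changing the bill.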

2.  What was the impetus in providing unlimited use of Pivotal HD in the Big Data Suite?

Data grows 60% per year, yet IT budgets grow 3-5% per year. Hadoop pricing does not fit limited IT budgets, as vendors charge by terabyte or node. Each time you want to add more data to your Data Lake to increase capacity, you are charged for it. We are telling customers that if they invest in Pivotal, they can grow their Data Lake or expand the HDFS footprint without being taxed for it.  This allows customers to focus on more important aspects such as data analysis and operationalization through analytical database, SQL query, and in-memory technologies.
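To see how quickly those two growth rates diverge, here is a minimal sketch that compounds them side by side. The 60% and 3-5% rates are the ones quoted above; the five-year horizon, the starting values normalized to 1.0, and the helper name `compound` are assumptions for illustration.

```python
# Compare compounded data growth with compounded budget growth.
# Rates are from the interview; horizon and normalization are assumptions.

def compound(start, annual_rate, years):
    """Value after `years` of growth at `annual_rate`, compounded yearly."""
    return start * (1 + annual_rate) ** years


DATA_GROWTH = 0.60          # data grows 60% per year
BUDGET_GROWTH_LOW = 0.03    # IT budgets grow 3-5% per year
BUDGET_GROWTH_HIGH = 0.05

for year in range(1, 6):
    data = compound(1.0, DATA_GROWTH, year)
    budget_low = compound(1.0, BUDGET_GROWTH_LOW, year)
    budget_high = compound(1.0, BUDGET_GROWTH_HIGH, year)
    print(f"year {year}: data x{data:.2f}, budget x{budget_low:.2f} to x{budget_high:.2f}")
```

Under these assumptions, data volume is roughly ten times its starting size after five years, while the budget has grown by less than 30%.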

3.  It sounds like Pivotal Big Data Suite brings all data management technologies in line with Hadoop economics?

Yes, with Big Data Suite, we are aggressively cutting the price of Greenplum (Analytics Data Warehouse) and GemFire (In-memory data grid system) to be in line with the cost economics of Hadoop.

4.  How does Big Data Suite address Data Lake strategies?

Big Data Suite fulfills the data management needs of a Data Lake. And because each organization will have different data processing needs over time, we have designed a flexible pricing model for Big Data Suite whereby you can mix and match technologies at any point in time.

For example, a Data Lake for a Telecommunications organization will look different from a Data Lake for a Healthcare organization. The Telco may have immediate real time requirements, whereas the Healthcare Payor may have immediate interactive SQL access to HDFS requirements, but prioritize real time capabilities for next year. If customers standardize with other Hadoop vendors, they may end up purchasing multi-vendor technologies for real time, interactive, and batch processing over time simply because of pricing, creating more data silos. With Pivotal, we remove these silos with the Big Data Suite flexible consumption model approach.

5.  Who are the ideal candidates for the Big Data Suite?

Big Data Suite is ideal for any organization since we believe a flexible subscription model is the smart way to grow a Data Lake. I confirmed this approach with our Data Science team – when they experiment with new sets of data to solve a problem, the data processing requirements are unknown until the solution is operationalized. One use case may require an analytical database technology, while another may require interactive SQL access to HDFS. Therefore, the Data Lake must offer data processing options or a toolkit to address diverse use cases without creating additional data silos.

Calculate your savings with Pivotal Big Data Suite compared to data management in an Enterprise Data Warehouse.

VCE Vblock: Converging Big Data Investments To Drive More Value

As Big Data continues to demonstrate real business value, organizations are looking to leverage this high value data across different applications and use cases. The uptake is also driving organizations to transition from siloed Big Data sandboxes, to enterprise architectures where they are mandated to address mission-critical availability and performance, security and privacy, provisioning of new services, and interoperability with the rest of the enterprise infrastructure.

Sandbox or experimental Hadoop on commodity hardware with direct attached storage (DAS) makes it difficult to address such challenges for several reasons – difficult to replicate data across applications and data centers, lack of IT oversight and visibility into the data, lack of multi-tenancy and virtualization, difficult to streamline upgrades and migrate technology components, and more. As a result, VCE, a leader in converged or integrated infrastructures, is receiving an increased number of requests on how to evolve Hadoop implementations reliant on DAS to being deployed on VCE Vblock Systems – an enterprise-class infrastructure that combines server, shared storage, network devices, virtualization, and management in a pre-integrated stack.

Formed by Cisco and EMC, with investments from VMware and Intel, VCE enables organizations to rapidly deploy business services on demand and at scale – all without triggering an explosion in capital and operating expenses. According to IDC’s recent report, organizations around the world spent over $3.3 billion on converged systems in 2012, and IDC forecast this spending to increase by 20% in 2013 and again in 2014. In fact, IDC calculated that Vblock Systems infrastructure resulted in a return on investment of 294% over a three-year period and 435% over a five-year period compared to data on traditional infrastructure, due to fast deployments, simplified operations, improved business-support agility, cost savings, and staff freed to launch new applications, extend services, and improve user/customer satisfaction.

I spoke with Julianna DeLua from VCE Product Management to discuss how VCE’s Big Data solution enables organizations to extract more value from Big Data investments.



1.  Why are organizations interested in deploying Hadoop and Big Data applications on converged or integrated infrastructures such as Vblock?


You Asked, Rackspace Listened – New Big Data Hosting Options

Running Hadoop on bare metal may fit some use cases, but many organizations have the types of data workloads that demand more storage than compute resources. In order to get the most efficient utilization for these types of Hadoop workloads, separating the compute and storage resources makes sense and is a configuration users are asking for, e.g. EMC Isilon storage for Hadoop.  In response to diverse Big Data needs such as mixed Hadoop workloads, hybrid cloud models, and heterogeneous data layers, Rackspace recently delivered new Big Data hosting options whereby users now have more choices – run Hadoop on Rackspace managed dedicated servers, spin up Hadoop on the public cloud, or configure your own private cloud.


I spoke with Sean Anderson, Product Marketing Manager for Data Solutions at Rackspace, to talk about one particular new offering called ‘Managed Big Data Platform’ whereby customers can design the optimal configuration for their data, and leave the management details to Rackspace.

1.  The Big Data lifecycle will go through various stages, with each stage imposing different requirements. From what you are seeing with your customers, can you explain this lifecycle and what value Rackspace brings to support this lifecycle?


EMC Isilon For Hadoop – No Ingest Necessary

In traditional Hadoop environments, the entire data set must be ingested (and three or more copies of each block made) before any analysis can begin. Once analysis is complete, results must then be exported. What’s the significance of this? COST. Ingest and export are tedious, time-consuming processes, and maintaining multiple copies of the data adds to the expense. With EMC Isilon HDFS, the entire data set can start to be analyzed immediately without the need to replicate it, and the results are also available immediately to NFS and SMB clients.

If you don’t already own Isilon for your Hadoop environment, it is worth exploring the multitude of benefits Isilon brings over HDFS running on compute hosts. If you are already an Isilon customer, Isilon requires no data movement and instead offers in-place analytics on data, eliminating the need to build a specialty Hadoop storage infrastructure.

Ryan Peterson, Director of Solutions Architecture at Isilon, likes to say that Isilon dedupes Hadoop since Isilon satisfies Hadoop’s need to see multiple copies of the same data without having to actually copy it. In fact, with the latest release of Isilon’s OneFS 7.1 today, a new feature called Smart Dedupe can reduce the storage further by approximately 30%. Ryan Peterson now refers to this as Hadoop Dedupe Dedupe. The first ‘Dedupe’ removes 3x replication, and the second ‘Dedupe’ reduces storage by 30%. Clever!

I sat down with Ryan Peterson to walk us through Hadoop Dedupe Dedupe:

In a traditional Hadoop deployment, data loss resulting from hardware failure is handled by replicating each block of data at least three times (3X by default), resulting in at least 4 data copies – existing primary storage plus 3 Hadoop storage copies.

Isilon for Hadoop turns this paradigm upside down because if existing primary data is NOT already on Isilon, then only 2.2 copies of data are required to protect against data loss due to hardware failure. The first copy is the existing primary data NOT on Isilon, and the second copy is on Isilon. Isilon’s N+M RAID-like distributed parity scheme makes 1.2 copies while providing high availability and resiliency to protect from data loss due to hardware failure (i.e. nodes and disks).

If primary data is already on Isilon there’s no need for a separate Hadoop storage infrastructure in the first place, and only 1.2 data copies are made instead of 4. With the upcoming release of Isilon’s de-duplication feature, the storage requirements will go down further by approximately 30%.

So if customers have 300TB of raw data, they will need 900TB of new storage to run their Hadoop cluster. However, if they already have this data on Isilon, they will not need any new storage, and the data will occupy only about 252TB of raw capacity (300TB at 1.2 copies, reduced by roughly 30% through de-duplication) because data in primary is de-duped and they can run Hadoop directly on that data.
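The storage arithmetic in this walkthrough can be reproduced with a short sketch. The 3X replication, the 1.2 copies for parity protection, and the approximate 30% de-duplication savings are the figures quoted above; the function names are hypothetical, and applying the 30% savings to the whole protected footprint is an assumption inferred from the 252TB figure.

```python
# Rough storage footprint comparison using the figures quoted in the article.
# Function names are hypothetical; applying dedupe to the protected footprint
# is an assumption that reproduces the article's 252 TB figure.

HDFS_REPLICATION = 3    # copies per block in a traditional Hadoop deployment
ISILON_OVERHEAD = 1.2   # copies under Isilon's N+M parity protection
DEDUPE_SAVINGS = 0.30   # approximate savings from de-duplication


def das_hadoop_storage_tb(data_tb, replication=HDFS_REPLICATION):
    """New storage needed to hold `data_tb` on a DAS-based Hadoop cluster."""
    return data_tb * replication


def isilon_storage_tb(data_tb, overhead=ISILON_OVERHEAD, dedupe_savings=DEDUPE_SAVINGS):
    """Raw capacity used when `data_tb` already lives on Isilon."""
    return data_tb * overhead * (1 - dedupe_savings)


data_tb = 300
print(f"DAS Hadoop: {das_hadoop_storage_tb(data_tb):.0f} TB of new storage")
print(f"Isilon:     {isilon_storage_tb(data_tb):.0f} TB of raw capacity")
```

Changing `data_tb` or the savings estimate shows how sensitive the comparison is to the de-duplication rate, which the article gives only as an approximation.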

Wait a minute, is this Hadoop Dedupe Dedupe Dedupe?