Hadoop-as-a-Service: An On-Premise Promise?

Hadoop-as-a-Service (HaaS) generally refers to Hadoop in the cloud, a handy alternative to on-premise Hadoop deployments for organizations whose overwhelmed data center administrators need to incorporate Hadoop but don’t have the resources to do so. What if there were also a promising way to build and maintain Hadoop clusters on-premise that could equally be called HaaS? The EMC Hybrid Cloud (EHC) enables just this: Hadoop in the hybrid cloud.

EHC, announced at EMC World 2014, is a new end-to-end reference architecture based on a Software-Defined Data Center architecture comprising technologies from across the EMC federation of companies: EMC II storage and data protection, Pivotal CF Platform-as-a-Service (PaaS) and the Pivotal Big Data Suite, VMware cloud management and virtualization solutions, and VMware vCloud Hybrid Service. EHC’s Hadoop-as-a-Service, the underpinnings of a Virtual Data Lake, was demonstrated at last week’s VMworld 2014 San Francisco.

EHC leverages these tight integrations across the Federation so that customers can extend their existing investments for automated provisioning & self-service, automated monitoring, secure multi-tenancy, chargeback, and elasticity to address the requirements of IT, developers, and lines of business. I asked Ian Breitner, Global Solutions Marketing Director for Big Data, to explain why EMC’s approach to HaaS should be considered over other Hadoop cloud offerings.

1.  In your opinion, what are the key characteristics of HaaS?

Before we delve into this I want to define what I mean by Hadoop for this post. Hadoop means the original framework for large-scale data processing on a cluster of commodity components. Originally it comprised a set of utilities and tools, a file system (HDFS), a resource scheduler (YARN) and an analytics engine (MapReduce) designed to process large amounts of unstructured data efficiently.

For me the key to providing anything ‘aaS’ is to provide it as a utility. Basically, as a consumer of Hadoop, I want the service rapidly provisioned, I want access available when I want it, and I want to pay only for what I consume. And by the way, it needs to be relatively inexpensive. A number of activities need to occur before I am able to consume Hadoop. As a consumer I neither care nor need to know about them, but for the organization providing me the Hadoop utility they are important: the provision of a self-service portal, metering and chargeback mechanisms, tenant isolation, a policy management framework, and management and monitoring tools.
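The metering and chargeback mechanism mentioned above can be pictured with a minimal sketch. It assumes usage is metered per tenant in vCPU-hours and GB-hours; the rates, the `UsageRecord` type and the `chargeback` function are hypothetical names chosen for illustration and are not part of EHC or any EMC product.

```python
from dataclasses import dataclass

# Hypothetical rates, chosen only for illustration; not actual EHC pricing.
RATE_PER_VCPU_HOUR = 0.05
RATE_PER_GB_HOUR = 0.001


@dataclass
class UsageRecord:
    """One metered slice of Hadoop usage for a tenant (hypothetical shape)."""
    tenant: str
    vcpu_hours: float
    gb_hours: float


def chargeback(records):
    """Sum metered usage per tenant and convert it to a charge."""
    totals = {}
    for r in records:
        cost = r.vcpu_hours * RATE_PER_VCPU_HOUR + r.gb_hours * RATE_PER_GB_HOUR
        totals[r.tenant] = totals.get(r.tenant, 0.0) + cost
    return totals


records = [
    UsageRecord("marketing", vcpu_hours=400, gb_hours=10_000),
    UsageRecord("sales", vcpu_hours=100, gb_hours=2_000),
    UsageRecord("marketing", vcpu_hours=200, gb_hours=5_000),
]
bill = chargeback(records)
for tenant, amount in sorted(bill.items()):
    print(f"{tenant}: {amount:.2f}")
```

Each tenant is billed only for the capacity it actually consumed, which is the utility behavior described above; a real deployment would draw these records from its monitoring tools rather than from a hard-coded list.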

2.  What is the value of HaaS over bare-metal deployment?

Having a HaaS model means that I, as the consumer of Hadoop, can purchase what I need, when I need it, and only for the duration of its use. This is far more attractive than going down the “bare metal” route. There are also benefits to having an ‘aaS’ model where the equipment being used can be re-allocated to other workloads when not being used by Hadoop workloads.

Deploying Hadoop on bare metal (perhaps it is more accurate to say on dedicated hardware) requires capital investment, datacenter floor space, HVAC, power, and a variety of technical skills (meaning additional staff). As a consumer of Hadoop, I now have to worry about managing these additional items. If I need to grow my Hadoop cluster, I have to invest additional funds to expand the cluster and its associated items, and there is a high likelihood that the hardware will be under-utilized.

3.  EMC first introduced a methodology for HaaS with the EMC Hadoop Starter Kit (HSK). How does EHC provide a more complete solution for HaaS?

HSK allows you to get started with a HaaS offering using the VMware Big Data Extensions to create virtualized Hadoop deployments. But many of the parts required to provide this in a utility model are missing. EHC, however, is another animal (see diagram below). EHC includes all the components needed to create a utility model and provide ‘aaS’ offerings. Among the items that come with EHC are the vCAC blueprints required to deploy Hadoop, and these can be used to create the service catalog that allows a self-service model to be deployed.

[Diagram: EPC2.0 solution guide overview]

4.  The term HaaS is still evolving, but the industry generally refers to HaaS as a replacement for on-premise Hadoop, with providers such as Amazon Web Services accounting for nearly 85% of global HaaS market revenue in 2013.  What makes EHC a better choice over HaaS providers such as AWS?

There are a number of items that I would like to address here. The first is the perception that an ‘aaS’ offering must come from a Service Provider. IT departments are perfectly capable of providing a similar utility model, especially with the EHC solutions available from EMC. The major issue for IT was, and still is, the budget constraints within which they need to operate. They could not afford the skilled staff required to create the infrastructure, and they also had capital constraints. This meant that organizations like sales and marketing needed to find other ways to achieve their goals, and offerings like AWS EMR were and are attractive.

The issues with using AWS for Hadoop workloads come once these workloads go from prototype and test to production and the data sets grow. With this growth come increasing costs, and eventually the marketing or sales organization will say to IT “go and run this for us”. Now what? By choosing to use EHC and running HaaS, the consumers have access to a utility computing model that can meet their needs and that, at the same time, provides IT with the infrastructure to deliver the services. And as a bonus it is also possible to elastically expand into a Service Provider offering for those occasional workloads needing additional temporary capacity.

5.  Who are the ideal candidates for EHC? HSK?

Those organizations that want to run a Hadoop POC or learn how they might apply this new analytics model to their unstructured or semi-structured data are ideal candidates for the HSK, especially if they already own Isilon, since expanding the existing platform is easy and transparent.

Those organizations that want to provide HaaS to their internal customers are ideal candidates for EHC; typically these would be Enterprise customers. With EHC, IT organizations can broker services from private and public clouds, enabling visibility and control over the best location to run business applications. For example, you can push your EHC HaaS deployment to vCloud Air with ease when needed.

Organizations that have started to use, or are already using, AWS EMR are also candidates for running HaaS on EHC.

EMC Hadoop Starter Kit ViPR Edition: Creating a Smarter Data Lake

Pivotal HD offers a wide variety of data processing technologies for Hadoop: real-time, interactive, and batch. Add EMC Isilon scale-out NAS as integrated data storage to Pivotal HD and you have a shared data repository with multi-protocol support, including HDFS, to service a wide variety of data processing requests. This smells like a Data Lake to me: a general-purpose data storage and processing resource center where Big Data applications can develop and evolve. Add EMC ViPR software-defined storage to the mix and you have the smartest Data Lake in town, one that supports additional protocols and hardware and automatically adapts to changing workload demands to optimize application performance.

EMC Hadoop Starter Kit, ViPR Edition, now makes it easier to deploy this ‘smart’ Data Lake with Pivotal HD and other Hadoop distributions such as Cloudera and Hortonworks. Simply download this step-by-step guide and you can quickly deploy a Hadoop or a Big Data analytics environment, configuring Hadoop to utilize ViPR for HDFS, with Isilon hosting the Object/HDFS data service.  Although in this guide Isilon is the storage array that ViPR deploys objects to, other storage platforms are also supported – EMC VNX, NetApp, OpenStack Swift and Amazon S3.

I asked the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, to explain why every organization should use this starter kit to optimize its IT infrastructure for Hadoop deployments.

1.  The original EMC Hadoop Starter Kit released last year was a huge success.  Why did you create ViPR Edition?


Pivotal Big Data Suite: Eliminating the Tax On A Growing Hadoop Cluster

The promise of Big Data is about analyzing more data to gain unprecedented insight, but Hadoop pricing can place serious constraints on the amount of data that can actually be stored for analysis.  Each time a node is added to a Hadoop cluster to increase storage capacity, you are charged for it.  Because this pricing model runs counter to the philosophy of Big Data, Pivotal has removed the tax on storing data in Hadoop with its announcement of Pivotal Big Data Suite.

Through a Pivotal Big Data Suite subscription, customers can store as much data as they want in fully supported Pivotal HD, paying only for value-added services, priced per core: Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, and HAWQ.   The significance of this new consumption model is that customers are charged only for the value they extract from Big Data, not for the volume they store.
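To make the difference between the two pricing models concrete, here is a small arithmetic sketch. All prices and cluster sizes are hypothetical placeholders rather than Pivotal list prices, and the function names are invented for this illustration.

```python
# Hypothetical figures for illustration only; not actual vendor pricing.
PRICE_PER_NODE = 4000      # annual charge per Hadoop node (per-node model)
PRICE_PER_CORE = 1000      # annual charge per core of value-added services
ANALYTICS_CORES = 32       # cores licensed for analytics, held constant


def per_node_cost(storage_nodes, price_per_node=PRICE_PER_NODE):
    """Per-node model: the bill grows with every node added for storage."""
    return storage_nodes * price_per_node


def per_core_cost(analytics_cores, price_per_core=PRICE_PER_CORE):
    """Per-core model: the bill depends only on the analytics cores in use."""
    return analytics_cores * price_per_core


# Grow the cluster for storage while the analytics footprint stays the same.
for nodes in (10, 20, 40):
    print(
        f"{nodes} nodes: per-node = {per_node_cost(nodes)}, "
        f"per-core = {per_core_cost(ANALYTICS_CORES)}"
    )
```

Under these assumed numbers the per-node bill doubles each time the cluster doubles, while the per-core bill stays flat as long as the analytics footprint does not change.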


*Calculate your savings with Pivotal Big Data Suite compared to traditional Enterprise Data Warehouse technologies.

Additionally, Pivotal Big Data Suite removes the mind games associated with the diverse data processing needs of Big Data.  With a flexible subscription to your choice of real-time, interactive, and batch processing technologies, organizations are no longer locked into a specific technology because of a contract.  At any point in time, as Big Data applications grow and Data Warehouse applications shrink, you can spin licenses up or down across the value-added services without incurring additional costs.  This pooled approach eliminates the need to procure new technologies, a step that typically results in delayed projects, additional costs, and more data silos.

I asked Michael Cucchi, Senior Director of Product Marketing at Pivotal, to explain how Pivotal Big Data Suite radically redefines the economics of Big Data so organizations can achieve the Data Lake dream.

1. What Big Data challenges does Big Data Suite address and why?


RSA and Pivotal: Laying the Foundation for a Wider Big Data Strategy

Building from years of security expertise, RSA was able to exploit Big Data to better detect, investigate, and understand threats with its RSA Security Analytics platform launched last year. Similarly, Pivotal leveraged its world-class Data Science team in conjunction with its Big Data platform to deliver Pivotal Network Intelligence for enhanced threat detection using statistical and machine learning techniques on Big Data. Utilizing both RSA Security Analytics and Pivotal Network Intelligence together, customers were able to identify and isolate potential threats faster than competing solutions for better risk mitigation.

As a natural next step, RSA and Pivotal last week announced the availability of the Big Data for Security Analytics reference architecture, solidifying a partnership that brings together the leaders in Security Analytics and Big Data/Data Science. Together, RSA and Pivotal will not only enhance the overall Security Analytics strategy, but also provide a foundation for a broader ‘IT Data Lake’ strategy to help organizations gain better ROI from these IT investments.

RSA’s reference architecture utilizes Pivotal HD, enabling security teams to gain access to a scalable platform with rich analytic capabilities from Pivotal tools and the Hadoop ecosystem to experiment and gain further visibility around enterprise security and threat detection. Moreover, the combined Pivotal and RSA platform allows organizations to leverage the collected data for non-security use cases such as capacity planning, mean-time-to-repair analysis, downtime impact analysis, shadow IT detection, and more.



Distributed architecture allows for enterprise scalability and deployment

I spoke with Jonathan Kingsepp, Director of Federation EVP Solutions at Pivotal, to discuss how the RSA-Pivotal partnership allows customers to gain much wider benefits across their organization.

1.  What are the technology components of this new RSA-Pivotal reference architecture?
