Hadoop-as-a-Service: An On-Premise Promise?

Hadoop-as-a-Service (HaaS) generally refers to Hadoop in the cloud, a handy alternative to on-premise Hadoop deployments for organizations whose overwhelmed data center administrators need to incorporate Hadoop but don’t have the resources to do so. What if there were also a promising way to build and maintain Hadoop clusters on-premise that could likewise be called HaaS? The EMC Hybrid Cloud (EHC) enables just this: Hadoop in the hybrid cloud.

EHC, announced at EMC World 2014, is a new end-to-end reference architecture based on a Software-Defined Data Center architecture comprising technologies from across the EMC federation of companies: EMC II storage and data protection, Pivotal CF Platform-as-a-Service (PaaS) and the Pivotal Big Data Suite, VMware cloud management and virtualization solutions, and VMware vCloud Hybrid Service. EHC’s Hadoop-as-a-Service, the underpinnings of a Virtual Data Lake, was demonstrated at last week’s VMworld 2014 San Francisco.

EHC leverages these tight integrations across the Federation so that customers can extend their existing investments for automated provisioning and self-service, automated monitoring, secure multi-tenancy, chargeback, and elasticity to address the requirements of IT, developers, and lines of business. I spoke with Ian Breitner, Global Solutions Marketing Director for Big Data, to learn why EMC’s approach to HaaS should be considered over other Hadoop cloud offerings.

1.  In your opinion, what are the key characteristics of HaaS?

Before we delve into this, I want to define what I mean by Hadoop for this post. Hadoop means the original framework for large-scale data processing on a cluster of commodity components. Originally it comprised a set of utilities and tools, a file system (HDFS), a resource scheduler (YARN), and an analytics engine (MapReduce) designed to process large amounts of unstructured data in an efficient fashion.

For me the key to providing anything ‘aaS’ is to provide it as a utility. Basically, as a consumer of Hadoop, I want to have the service rapidly provisioned and access available when I want it, and to pay only for what I consume; and by the way, it needs to be relatively inexpensive. A number of activities need to occur before I am able to consume Hadoop. As a consumer I don’t care or need to know about them, but for the organization providing me the Hadoop utility they are important: the provision of a self-service portal, metering and chargeback mechanisms, tenant isolation, a policy management framework, and management and monitoring tools.

2.  What is the value of HaaS over a bare-metal deployment?

Having a HaaS model means that I, as the consumer of Hadoop, can purchase what I need, when I need it, and only for the duration of its use. This is far more attractive than going down the “bare metal” route. There are also benefits to having an ‘aaS’ model where the equipment being used can be re-allocated to other workloads when not being used by Hadoop workloads.

Deploying Hadoop on bare metal (perhaps it is more accurate to say on dedicated hardware) requires capital investment, datacenter floor space, HVAC, power, and a variety of technical skills (meaning additional staff). As a consumer of Hadoop, I now have to worry about managing these additional items. If I need to grow my Hadoop cluster, I have to invest additional funds to expand the cluster and its associated items, and there is a high likelihood of under-utilization of the hardware.

3.  EMC first introduced a methodology for HaaS with the EMC Hadoop Starter Kit (HSK). How does EHC provide a more complete solution for HaaS?

HSK allows you to get started with a HaaS offering using the VMware Big Data Extensions to create virtualized Hadoop deployments. But there are many missing parts that would be required to provide this in a utility model. EHC, however, is another animal (see diagram below). EHC includes all the components to create a utility model and provide ‘aaS’ offerings. Among the items that come with EHC are the required vCAC blueprints to deploy Hadoop, and these can be used to create the service catalog that allows a self-service model to be deployed.

[Diagram: EPC 2.0 solution guide overview]

4.  The term HaaS is still evolving, but the industry generally refers to HaaS as a replacement for on-premise Hadoop, with providers such as Amazon Web Services accounting for nearly 85% of global HaaS market revenue in 2013. What makes EHC a better choice than HaaS providers such as AWS?

There are a number of items that I would like to address here. The first is the perception that an ‘aaS’ offering must come from a Service Provider. IT departments are perfectly capable of providing a similar utility model, especially with the EHC solutions available from EMC. The major issue for IT was, and still is, the budget constraints within which they need to operate. They could not afford the skilled staff required to create the infrastructure, and they also had capital constraints. This meant that organizations like sales and marketing needed to find other ways to achieve their goals, so offerings like AWS EMR were and are attractive.

The issues in using AWS for Hadoop workloads come once these workloads go from prototype and test to production and the data sets grow. With this growth come increasing costs, and eventually the marketing or sales organization will say to IT, “go and run this for us”. Now what? By choosing to use EHC and running HaaS, the consumers have access to a utility computing model that can meet their needs, and at the same time IT gets the infrastructure to deliver the services. As a bonus, it is also possible to elastically expand into a Service Provider offering for those occasional workloads needing additional temporary capacity.

5.  Who are the ideal candidates for EHC? HSK?

Organizations that want to run a Hadoop POC, or learn how they might apply this new analytics model to their unstructured or semi-structured data, are ideal candidates for the HSK, especially if they already own Isilon, since expanding the existing platform is easy and transparent.

Organizations that want to provide HaaS to their internal customers are ideal candidates for EHC; typically these would be Enterprise customers. With EHC, IT organizations can broker services from private and public clouds, enabling visibility and control over the best location to run business applications. For example, you can push your EHC HaaS deployment to vCloud Air with ease when needed.

Organizations that have started to use or are using AWS EMR are also candidates for EHC to run HaaS.

Pivotal Big Data Suite: Eliminating the Tax On A Growing Hadoop Cluster

The promise of Big Data is about analyzing more data to gain unprecedented insight, but Hadoop pricing can place serious constraints on the amount of data that can actually be stored for analysis. Each time a node is added to a Hadoop cluster to increase storage capacity, you are charged for it. Because this pricing model runs counter to the philosophy of Big Data, Pivotal has removed the tax on storing data in Hadoop with its announcement of Pivotal Big Data Suite.

Through a Pivotal Big Data Suite subscription, customers store as much data as they want in fully supported Pivotal HD, paying per core only for value-added services: Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, and HAWQ. The significance of this new consumption model is that customers can now store as much Big Data as they want, but are charged only for the value they extract from it.
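To make the difference between the two pricing models concrete, here is a minimal sketch in Python. It is only an illustration: the `Cluster` type, both cost functions, and every price and cluster size are hypothetical placeholders, not actual Pivotal or Hadoop vendor pricing.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    storage_nodes: int   # Hadoop nodes added for storage capacity
    service_cores: int   # cores running value-added services


def per_node_cost(cluster: Cluster, price_per_node: float) -> float:
    """Per-node model: every node added to the cluster is charged."""
    return cluster.storage_nodes * price_per_node


def per_core_cost(cluster: Cluster, price_per_core: float) -> float:
    """Per-core model: only cores running value-added services are charged."""
    return cluster.service_cores * price_per_core


# Hypothetical clusters: storage grows fourfold, analytics cores stay the same.
small = Cluster(storage_nodes=10, service_cores=64)
large = Cluster(storage_nodes=40, service_cores=64)

# Hypothetical unit prices, chosen only to show the shape of each model.
for name, cluster in (("small", small), ("large", large)):
    print(name, per_node_cost(cluster, 100.0), per_core_cost(cluster, 50.0))
```

Under these assumptions the per-node bill grows in step with storage, while the per-core bill stays flat until more analytics cores are put to work.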


*Calculate your savings with Pivotal Big Data Suite compared to traditional Enterprise Data Warehouse technologies.

Additionally, Pivotal Big Data Suite removes the mind games associated with the diverse data processing needs of Big Data. With a flexible subscription to your choice of real-time, interactive, and batch processing technologies, organizations are no longer locked into a specific technology because of a contract. At any point in time, as Big Data applications grow and Data Warehouse applications shrink, you can spin licenses up or down across the value-added services without incurring additional costs. This pooled approach eliminates the need to procure new technologies, a process that results in delayed projects, additional costs, and more data silos.
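The pooled approach can be pictured as a fixed budget of subscribed cores that is reassigned between services as needs change. The sketch below is a hypothetical model of that idea, not a Pivotal licensing API; the `CorePool` class, the service labels, and the core counts are all assumptions made for illustration.

```python
class CorePool:
    """A fixed pool of subscribed cores shared by several services."""

    def __init__(self, total_cores: int) -> None:
        self.total_cores = total_cores
        self.allocation: dict[str, int] = {}

    def allocated(self) -> int:
        """Cores currently assigned across all services."""
        return sum(self.allocation.values())

    def free(self) -> int:
        """Cores in the pool that are not assigned to any service."""
        return self.total_cores - self.allocated()

    def set_cores(self, service: str, cores: int) -> None:
        """Assign `cores` to `service`, refusing to exceed the pool size."""
        if cores < 0:
            raise ValueError("cores must be non-negative")
        others = self.allocated() - self.allocation.get(service, 0)
        if others + cores > self.total_cores:
            raise ValueError("allocation exceeds the subscribed pool")
        self.allocation[service] = cores


# Hypothetical subscription of 100 cores, initially weighted to the warehouse.
pool = CorePool(total_cores=100)
pool.set_cores("data_warehouse", 80)
pool.set_cores("sql_on_hadoop", 20)

# Later the warehouse shrinks and the Hadoop workload grows: same pool size.
pool.set_cores("data_warehouse", 50)
pool.set_cores("sql_on_hadoop", 50)
```

The design point is that the subscribed total never changes during the shift; only the split between services does, which is why no new procurement step is modeled.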

I spoke with Michael Cucchi, Senior Director of Product Marketing at Pivotal, to learn how Pivotal Big Data Suite radically redefines the economics of Big Data so organizations can achieve the Data Lake dream.

1. What Big Data challenges does Big Data Suite address and why?


RSA and Pivotal: Laying the Foundation for a Wider Big Data Strategy

Building from years of security expertise, RSA was able to exploit Big Data to better detect, investigate, and understand threats with its RSA Security Analytics platform launched last year. Similarly, Pivotal leveraged its world-class Data Science team in conjunction with its Big Data platform to deliver Pivotal Network Intelligence for enhanced threat detection using statistical and machine learning techniques on Big Data. Utilizing both RSA Security Analytics and Pivotal Network Intelligence together, customers were able to identify and isolate potential threats faster than competing solutions for better risk mitigation.

As a natural next step, RSA and Pivotal last week announced the availability of the Big Data for Security Analytics reference architecture, solidifying a partnership that brings together the leaders in Security Analytics and Big Data/Data science. RSA and Pivotal will not only enhance the overall Security Analytics strategy, but also provide a foundation for a broader ‘IT Data Lake’ strategy to help organizations gain better ROI from these IT investments.

RSA’s reference architecture utilizes Pivotal HD, enabling security teams to gain access to a scalable platform with rich analytic capabilities from Pivotal tools and the Hadoop ecosystem to experiment and gain further visibility around enterprise security and threat detection. Moreover, the combined Pivotal and RSA platform allows organizations to leverage the collected data for non-security use cases such as capacity planning, mean-time-to-repair analysis, downtime impact analysis, shadow IT detection, and more.



Distributed architecture allows for enterprise scalability and deployment

I spoke with Jonathan Kingsepp, Director of Federation EVP Solutions at Pivotal, to discuss how the RSA-Pivotal partnership allows customers to gain much wider benefits across their organization.

1.  What are the technology components of this new RSA-Pivotal reference architecture?


Alpine Data Labs – Making Predictive Analytics Pervasive and Persuasive

Big Data has exposed the need for deeper data insights through predictive analytic techniques such as data mining, machine learning, and modeling. The interesting thing to note is that predictive analytics has been around for a long time, used by a select few, in select organizations. Its value has always been recognized and applauded, but its true potential never fully realized due to lack of widespread adoption, as well as issues around data accessibility, performance, statistical expertise, business sponsorship, cost, and more. In fact, nearly 90 percent of organizations that do employ predictive analytic software agree that it has given them a competitive advantage, according to a new survey.

The advent of Big Data has driven the uptake of predictive analytics due to the curiosity of very capable Data Scientists, along with new tools and technologies from companies such as Alpine Data Labs.  Alpine Data Labs provides next generation predictive analytics to address legacy issues and meet the new demands of Big Data. But more importantly, Alpine Data Labs is mainstream-oriented whereby business users, not just statisticians and Data Scientists, are compelled to mine data.


Backed by $16M in Series B funding, Alpine Data Labs is gaining serious momentum in the Big Data analytics startup space, offering zero coding for creating and deploying complex predictive models on Hadoop. I spoke with Alpine Data Labs CEO Joe Otto about their game-changing approach to predictive analytics for Big Data.

1.  Let’s first talk about leading predictive analytics incumbents such as SAS, IBM SPSS, and other analytics vendors who got their start years ago with desktop and server software designed for data mining and advanced analytics. How has Alpine Data Labs overcome the issues around these incumbent technologies and addressed the new needs of Big Data?
