Don’t Accept The Status Quo For Hadoop

Hadoop is Everywhere – 99% of companies will deploy or pilot Hadoop within 18-24 months, according to IDC.  These environments will largely be based on standalone servers, resulting in added management tasks as data is spread across many disk spindles throughout the data center.  With Hadoop clusters quickly expanding, organizations are starting to experience the typical growing pains of adolescence.  This raises the question: should the DAS server configuration remain the accepted status quo for Hadoop deployments?


Whether you are getting started with Hadoop or growing your Hadoop deployment, EMC provides a long-term solution for Hadoop through shared storage and VMs, delivering distinct value to the business in lower TCO and faster time-to-results.  I spoke with EMC Technical Advisory Architect Chris Harrold about why organizations are now turning to EMC to help transition Hadoop environments into adulthood.

1.  Almost every Hadoop deployment is based around the accepted configuration of standalone servers with DAS.   What issues have you seen with this configuration among your customers?

These environments are growing rapidly.  As a result, the ability to support them starts to degrade quickly at larger scale.  Servers with DAS tend to be more difficult to manage because they have components that can fail internally and require more babysitting than an enterprise-class platform.

For example, it’s no trivial task to expand this environment: you have to acquire the servers, then rack, power, and configure them all.  Even adding a sandbox or test environment is difficult in this standalone server model.

There is also a steep learning curve with Hadoop, not only for the analytics component but also for simply getting data in and out of Hadoop in a DAS environment.

2.  Hadoop was designed to run on a DAS architecture where compute and storage are tightly coupled.  Why does EMC believe that decoupling compute and storage, through shared storage and virtualized compute resources, is a better architecture?  How does this architecture address the issues you mentioned above?

When Hadoop was first introduced in 2006, shared or enterprise storage was an expensive, high-end resource; it was therefore very difficult to design something like Hadoop on shared storage.  Since then, shared storage has become more affordable.   At the same time, networking speeds have become faster, so it is now more feasible to decouple compute and storage. By deploying Hadoop on a shared storage model, you eliminate the manageability issues of DAS and gain enterprise-class features such as virtualization, SANs, and scale-out NAS.

Also, deploying large-scale standalone architectures is really a legacy approach, as many enterprises have moved away from this to a shared architecture.  As Hadoop is becoming a key component of data architectures, it will be challenging to maintain standalone servers since many enterprises have evolved to virtualized, shared environments.

EMC is enabling organizations to leverage enterprise storage and virtualization to quickly and easily deploy and manage growing Hadoop environments.  By utilizing enterprise technologies with Hadoop, you also gain benefits such as ease of data import/export, data protection, and security.

3.  What are the components of the EMC recommended architecture for Hadoop?

We provide choice in how organizations architect Hadoop environments.  For shared storage, we provide EMC Isilon HDFS-enabled storage, which is certified with all major Hadoop distributions.  We also provide Software-Defined Storage through EMC ViPR HDFS/Object storage, which is certified with all major storage arrays and Hadoop distributions.

For pre-integrated compute and shared storage, the EMC Data Computing Appliance provides an optimized architecture utilizing Pivotal HD and EMC Isilon.

For an integrated infrastructure approach, we partner with VCE vBlock to provide choice of compute, storage, and networking technologies from VMware, Cisco, and EMC to optimize any major Hadoop distribution deployment.

4.  Who are the ideal candidates for the EMC architecture for Hadoop and why?

All organizations benefit from this architecture.  Whether you are just getting started with Hadoop or have a large-scale deployment, we make it easy to rapidly deploy and manage a Hadoop environment.  In fact, utilizing EMC storage, you can analytics-enable data already living in storage arrays; you don’t have to copy or move data to a separate standalone Hadoop DAS environment.   We have several step-by-step guides that walk you through the process of configuring your Hadoop environment for HDFS-enabled storage.
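As an illustrative sketch of what “HDFS-enabling” shared storage involves, a Hadoop client can be pointed at an Isilon cluster’s HDFS endpoint through its core-site.xml; the hostname below is a hypothetical placeholder, so follow the EMC step-by-step guides for the actual values and procedure:

```xml
<!-- core-site.xml (sketch): point the Hadoop default filesystem at an
     Isilon SmartConnect zone instead of a local HDFS NameNode.
     "isilon.example.com" is a hypothetical hostname. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://isilon.example.com:8020</value>
  </property>
</configuration>
```

With a setting like this in place, standard Hadoop jobs and `hadoop fs` commands resolve paths against the shared storage rather than local DAS disks.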

5. Although Hadoop is everywhere, with IDC estimating that 99% of companies will deploy or pilot Hadoop in the next 18-24 months, gaining ROI from the deployment is a challenge due to a lack of skills – both identifying the right opportunity and then executing on it.  How does EMC address this issue?

Yes, this is a huge problem in the industry, especially the lack of Data Science skills.  EMC addresses the skills shortage through our services across the EMC Federation.  Pivotal Data Labs provides access to some of the best minds in Data Science to help organizations identify opportunities and execute utilizing the latest Big Data technologies and techniques.  The EMC Vision Workshop creates a strategic Big Data blueprint for organizations to continuously identify Big Data use cases based on the organization’s business initiatives and implementation feasibility.  The workshop has become a huge success because it creates the needed organizational alignment: lines of business continuously working, communicating, and collaborating with IT to identify the right Big Data use cases.

Cloudera Enterprise and EMC Isilon: Filling In The Hadoop Gaps

As Hadoop becomes the central component of enterprise data architectures, the open source community and technology vendors have built a large Big Data ecosystem of Hadoop platform capabilities to fill in the gaps in enterprise application requirements. For data processing, we have seen MapReduce batch processing supplemented with additional engines such as Apache Hive, Apache Solr, and Apache Spark to fill in the gaps for SQL access, search, and streaming.  For data storage, direct-attached storage (DAS) has been the common deployment configuration for Hadoop; however, the market is now looking to supplement DAS deployments with enterprise storage. Why take this approach? Organizations can HDFS-enable valuable data already managed in enterprise storage without having to copy or move that data to a separate Hadoop DAS environment.
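To make the batch-processing model concrete, the MapReduce pattern that these newer engines supplement can be sketched in plain Python. This is an illustrative toy word count (the canonical Hadoop example), not actual Hadoop code:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

records = ["the quick brown fox", "the lazy dog", "The fox"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
# counts["the"] == 3, counts["fox"] == 2
```

Engines like Hive and Spark let users express the same logic declaratively (in SQL or dataflow APIs) instead of writing map and reduce functions by hand, which is a large part of their appeal.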


As a leader in enterprise storage, EMC has partnered with Hadoop vendors such as Cloudera to ensure customers can fill in the Hadoop gaps through HDFS enabled storage such as EMC Isilon. In addition to providing data protection, efficient storage utilization, and ease of import/export through multi-protocol support, EMC Isilon and Cloudera together allow organizations to quickly and easily take on new, analytic workloads.   With the announcement of Cloudera Enterprise certified with EMC Isilon for HDFS storage, I wanted to take the opportunity to speak with Cloudera’s Chief Strategy Officer Mike Olson about the partnership and how he sees the Hadoop ecosystem evolving over the next several years.

1.  The industry has different terminologies for enterprise data architectures centered around Hadoop. EMC refers to this next generation data architecture as a Data Lake and Cloudera as Enterprise Data Hub. What is the common thread?

The two are closely related. At Cloudera, we think of a data hub as an engineered system designed to analyze and process data in place, so it needn’t be moved to be used. The most common use of the “data lake” term is around existing large repositories (and Isilon is an excellent example), where data is collected and managed at scale, but where historically it’s had to be piped out of the lake to be used. By layering Cloudera Enterprise right on top of Isilon as a storage substrate, we layer a hub on the lake – we let you keep your data where it lives, and put the processing where you need it.

2.  Cloudera leads the Hadoop market. What does EMC Isilon bring to the table for your customers?

Best-of-breed engineered storage solutions, of course; manageability, operability, credibility and a tremendous record of success in the enterprise as well. And, of course, a substantial market presence. The data stored in Isilon systems today is more valuable if we can deliver big data analytics and processing on it, without requiring it to be migrated to separate big data infrastructure.

3.  What are the ideal use cases for a Cloudera-Isilon deployment?

We don’t see any practical difference in the use cases that matter. The processing and analytic workloads for big data apply whether data is in native HDFS managed by Apache Hadoop, or in Isilon. The real question is what the enterprise’s requirements and standards around its storage infrastructure are. Companies that choose the benefits of Isilon now get the benefits of Cloudera as well.

4.  SMB and NFS are examples of protocols that have been around for generations. Will HDFS stand the test of time, or will it be replaced by another protocol to support, for example, real-time applications or the Internet of Things?

Software evolves continually, but HDFS is a long-term player. SMB and NFS are more scalable and more performant today than they were ten or twenty years ago, and I’m confident that you’ll see HDFS evolve as well.

5.  MapReduce provides an excellent alternative to traditional data warehouse batch processing requirements. Other open source data processing engines for Hadoop, such as Hive, Spark, and Apache HBase, provide additional capabilities to meet enterprise application requirements.   How do you see this data processing ecosystem evolving over the next 5 years?

It’ll be faster, more powerful, more capable and more real-time. The pace of innovation in the last ten years has been breathtaking, in terms of data analysis and transformation. The open source ecosystem and traditional vendors are doing amazing things. That’ll continue – there is so much value in the data that there’s a huge reward for that innovation.

Big Data Pains & Gains From A Real Life CIO

What does it take to make CIO Magazine’s Top 100 List? A Big Data victory is one ingredient.
Michael Cucchi, Sr. Director of Product Marketing at Pivotal, had the privilege of speaking with one of the winners – EMC CIO Vic Bhagat – about the pains and gains of EMC’s Big Data initiative. I have put together a summary of this interview below.  EMC IT’s approach to Big Data is exactly what the EMC Federation enables organizations to do: first, collect any and all data in a Data Lake; next, deploy the right analytic tool that your people know how to use to analyze the data; and finally, learn agile development so you can take those insights and build applications rapidly.

1. Why is Big Data important to your business?

Continue reading

Hadoop-as-a-Service: An On-Premise Promise?

Hadoop-as-a-Service (HaaS) generally refers to Hadoop in the cloud, a handy alternative to on-premise Hadoop deployments for organizations whose overwhelmed data center administrators need to incorporate Hadoop but don’t have the resources to do so. But what if there were also a promising option to build and maintain Hadoop clusters on-premise, also referred to as HaaS? The EMC Hybrid Cloud (EHC) enables just this – Hadoop in the hybrid cloud.

EHC, announced at EMC World 2014, is a new end-to-end reference architecture based on a Software-Defined Data Center comprising technologies from across the EMC federation of companies: EMC II storage and data protection, Pivotal CF Platform-as-a-Service (PaaS) and the Pivotal Big Data Suite, VMware cloud management and virtualization solutions, and VMware vCloud Hybrid Service. EHC’s Hadoop-as-a-Service was demonstrated at last week’s VMworld 2014 in San Francisco – the underpinnings of a Virtual Data Lake:

EHC leverages these tight integrations across the Federation so that customers can extend their existing investments to gain automated provisioning and self-service, automated monitoring, secure multi-tenancy, chargeback, and elasticity, addressing the requirements of IT, developers, and lines of business. I spoke with Ian Breitner, Global Solutions Marketing Director for Big Data, about why EMC’s approach to HaaS should be considered over other Hadoop cloud offerings.

1.  In your opinion, what are the key characteristics of HaaS?

Continue reading