Cloudera Enterprise and EMC Isilon: Filling In The Hadoop Gaps

As Hadoop becomes the central component of enterprise data architectures, the open source community and technology vendors have built a large Big Data ecosystem of Hadoop platform capabilities to fill in the gaps in enterprise application requirements. For data processing, we have seen MapReduce batch processing supplemented with engines such as Apache Hive, Apache Solr, and Apache Spark to fill the gaps in SQL access, search, and stream processing. For data storage, direct attached storage (DAS) has been the common deployment configuration for Hadoop; however, the market is now looking to supplement DAS deployments with enterprise storage. Why take this approach? Organizations can HDFS-enable valuable data already managed in enterprise storage without having to copy or move that data to a separate Hadoop DAS environment.


As a leader in enterprise storage, EMC has partnered with Hadoop vendors such as Cloudera to ensure customers can fill in the Hadoop gaps through HDFS-enabled storage such as EMC Isilon. In addition to providing data protection, efficient storage utilization, and easy import/export through multi-protocol support, EMC Isilon and Cloudera together allow organizations to quickly and easily take on new analytic workloads. With the announcement that Cloudera Enterprise is certified with EMC Isilon for HDFS storage, I wanted to take the opportunity to speak with Cloudera’s Chief Strategy Officer Mike Olson about the partnership and how he sees the Hadoop ecosystem evolving over the next several years.

1.  The industry has different terminologies for enterprise data architectures centered on Hadoop. EMC refers to this next-generation data architecture as a Data Lake; Cloudera calls it an Enterprise Data Hub. What is the common thread?

The two are closely related. At Cloudera, we think of a data hub as an engineered system designed to analyze and process data in place, so it needn’t be moved to be used. The most common use of the “data lake” term is around existing large repositories (and Isilon is an excellent example), where data is collected and managed at scale, but where historically it’s had to be piped out of the lake to be used. By layering Cloudera Enterprise right on top of Isilon as a storage substrate, we layer a hub on the lake – we let you keep your data where it lives, and put the processing where you need it.
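A concrete way to picture “keep your data where it lives”: once Cloudera Enterprise uses Isilon as its storage substrate, any client can address the Isilon cluster as an ordinary HDFS namespace. Below is a minimal sketch against the standard WebHDFS REST API, assuming WebHDFS is enabled on the Isilon side; the hostname, port, user, and path are all illustrative.

```python
import json
import urllib.request

ISILON = "isilon.example.com"  # hypothetical SmartConnect name for the HDFS zone
PORT = 50070                   # Apache default WebHDFS port; Isilon's may differ
USER = "hdfs"                  # Hadoop user to issue the request as

def list_status(path):
    """List a directory that lives on Isilon via the standard WebHDFS REST API."""
    url = (f"http://{ISILON}:{PORT}/webhdfs/v1{path}"
           f"?op=LISTSTATUS&user.name={USER}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["FileStatuses"]["FileStatus"]

# The data is read in place on Isilon; nothing was copied into a separate
# DAS-based Hadoop cluster first.
for entry in list_status("/data/logs"):
    print(entry["pathSuffix"], entry["length"])
```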

2.  Cloudera leads the Hadoop market. What does EMC Isilon bring to the table for your customers?

Best-of-breed engineered storage solutions, of course; manageability, operability, credibility and a tremendous record of success in the enterprise as well. And, of course, a substantial market presence. The data stored in Isilon systems today is more valuable if we can deliver big data analytics and processing on it, without requiring it to be migrated to separate big data infrastructure.

3.  What are the ideal use cases for a Cloudera-Isilon deployment?

We don’t see any practical difference in the use cases that matter. The processing and analytic workloads for big data apply whether data is in native HDFS managed by Apache Hadoop, or in Isilon. The real question is what the enterprise’s requirements and standards around its storage infrastructure are. Companies that choose the benefits of Isilon now get the benefits of Cloudera as well.

4.  SMB and NFS are examples of protocols that have been around for generations. Will HDFS stand the test of time, or will it be replaced with another protocol to support, for example, real-time applications or the Internet of Things?

Software evolves continually, but HDFS is a long-term player. SMB and NFS are more scalable and more performant today than they were ten or twenty years ago, and I’m confident that you’ll see HDFS evolve as well.
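Isilon’s multi-protocol support is what makes that coexistence tangible today: the same bytes can be served over NFS to a legacy application and over HDFS to a Hadoop job, with no copy in between. A rough sketch of that round trip, assuming a hypothetical NFS mount at /mnt/isilon and WebHDFS enabled on the cluster (all names and paths are illustrative):

```python
import hashlib
import urllib.request

ISILON = "isilon.example.com"  # hypothetical SmartConnect name
NFS_MOUNT = "/mnt/isilon"      # hypothetical NFS mount of the same file system

def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Read the file over NFS, the way a traditional application would.
with open(f"{NFS_MOUNT}/data/events.csv", "rb") as f:
    via_nfs = md5(f.read())

# Read the very same file over WebHDFS, the way a Hadoop client would.
url = f"http://{ISILON}:50070/webhdfs/v1/data/events.csv?op=OPEN&user.name=hdfs"
with urllib.request.urlopen(url) as resp:
    via_hdfs = md5(resp.read())

assert via_nfs == via_hdfs  # one copy of the data, two protocols
```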

5.  MapReduce provides an excellent alternative to traditional data warehouse batch processing requirements. Other open source data processing frameworks for Hadoop, such as Hive, Spark, and Apache HBase, provide additional capabilities to meet enterprise application requirements. How do you see this data processing ecosystem evolving over the next five years?

It’ll be faster, more powerful, more capable and more real-time. The pace of innovation in the last ten years has been breathtaking, in terms of data analysis and transformation. The open source ecosystem and traditional vendors are doing amazing things. That’ll continue – there is so much value in the data that there’s a huge reward for that innovation.

VCE Vblock: Converging Big Data Investments To Drive More Value

As Big Data continues to demonstrate real business value, organizations are looking to leverage this high-value data across different applications and use cases. The uptake is also driving organizations to transition from siloed Big Data sandboxes to enterprise architectures, where they are mandated to address mission-critical availability and performance, security and privacy, provisioning of new services, and interoperability with the rest of the enterprise infrastructure.

Sandbox or experimental Hadoop on commodity hardware with direct attached storage (DAS) makes it difficult to address such challenges for several reasons: it is hard to replicate data across applications and data centers, IT lacks oversight of and visibility into the data, there is no multi-tenancy or virtualization, and upgrades and migrations of technology components are difficult to streamline. As a result, VCE, a leader in converged (or integrated) infrastructure, is receiving an increasing number of requests about how to evolve Hadoop implementations from DAS to VCE Vblock Systems – an enterprise-class infrastructure that combines servers, shared storage, network devices, virtualization, and management in a pre-integrated stack.

Formed by Cisco and EMC, with investments from VMware and Intel, VCE enables organizations to rapidly deploy business services on demand and at scale – all without triggering an explosion in capital and operating expenses. According to a recent IDC report, organizations around the world spent over $3.3 billion on converged systems in 2012, and IDC forecast that spending would increase by 20% in 2013 and again in 2014. In fact, IDC calculated that Vblock Systems delivered a return on investment of 294% over three years and 435% over five years compared with traditional infrastructure, thanks to fast deployments, simplified operations, improved business-support agility, cost savings, and staff freed up to launch new applications, extend services, and improve user and customer satisfaction.

I spoke with Julianna DeLua from VCE Product Management to discuss how VCE’s Big Data solution enables organizations to extract more value from Big Data investments.

1.  Why are organizations interested in deploying Hadoop and Big Data applications on converged or integrated infrastructures such as Vblock?

Continue reading

Want to Explore Hadoop, But No Tour Guide? EMC Just Published a Step-By-Step Guide

Are you a VMware vSphere customer? Do you also own EMC Isilon? If you said yes to both, I have great news for you – you have all the ingredients for the EMC Hadoop Starter Kit (HSK). In just a few short hours you can spin up a virtualized Hadoop cluster by downloading the HSK step-by-step guide and following along. Watch the demo below of HSK being used to deploy Hadoop:

Now you don’t have to imagine what Hadoop tastes like, because this starter kit is designed to help you experiment and discover the potential of Hadoop within your organization. Whether you are new to Hadoop or an experienced Hadoop user, you will want to take advantage of this turnkey solution for the following reasons:

-Rapid provisioning – From the creation of virtual Hadoop nodes to starting up the Hadoop services on the cluster, much of the Hadoop cluster deployment can be automated, requiring little expertise on the user’s part (a quick post-deployment sanity check is sketched after this list).

-High availability – HA protection can be provided through the virtualization platform to protect the single points of failure in the Hadoop system, such as the NameNode and JobTracker virtual machines.

-Elasticity – Hadoop capacity can be scaled up and down on demand in a virtual environment, thus allowing the same physical infrastructure to be shared among Hadoop and other applications.

-Multi-tenancy – Different tenants running Hadoop can be isolated in separate VMs, providing stronger VM-grade resource and security isolation.

-Portability – Use any Hadoop distribution – open source Apache Hadoop, Pivotal HD, Cloudera, or Hortonworks – throughout the Big Data application lifecycle with zero data migration.
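If the kit’s automation succeeded, a quick sanity check confirms the cluster is healthy before you load any data. Here is a minimal sketch that polls the NameNode’s standard /jmx metrics endpoint; the hostname is hypothetical, and 50070 is the Apache-default NameNode web port of this Hadoop generation, so adjust both to your deployment.

```python
import json
import urllib.request

# Hypothetical address of the virtual NameNode created by the starter kit.
NAMENODE = "http://namenode.example.com:50070"

# Hadoop daemons of this era expose their metrics as JSON under /jmx.
query = "Hadoop:service=NameNode,name=NameNodeInfo"
with urllib.request.urlopen(f"{NAMENODE}/jmx?qry={query}") as resp:
    info = json.load(resp)["beans"][0]

live = json.loads(info["LiveNodes"])  # LiveNodes is itself a JSON-encoded string
print(f"HDFS version:   {info['Version']}")
print(f"Live datanodes: {len(live)}")
```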

I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect in the EMC Office of the CTO, to explain why every organization that uses VMware vSphere and EMC Isilon should use this starter kit for Big Data projects.

1.  Why did you create the starter kit and what are the best use cases for this starter kit?

Continue reading

Not Sure Which Hadoop Distro To Use? EMC Isilon Provides Big Data Portability

95% of clients starting Hadoop projects don’t have an established use case; therefore, selecting the right distribution will probably be a shot in the dark. You may start off with Hortonworks for a dev/test environment, but then realize that Pivotal HD is a better choice for enterprise-class deployment. The good news is that if you start your Hadoop project using EMC Isilon scale-out NAS, you have zero data migration when moving from one Hadoop distribution to another. In fact, you can run multiple Hadoop distributions against the same data – no duplication of data required.
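Under the hood, that portability is a client-configuration detail rather than a data operation: every major Hadoop 2.x distribution reads the same fs.defaultFS property, so “switching” distributions means pointing a new compute cluster at the unchanged Isilon namespace. A minimal sketch of that shared snippet, where the SmartConnect zone name is hypothetical and 8020 is the conventional HDFS RPC port:

```python
# Minimal core-site.xml fragment understood by every Hadoop 2.x distribution.
# "isilon.example.com" is a hypothetical SmartConnect zone name; verify the
# HDFS port configured on your Isilon cluster (8020 is the convention).
CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://isilon.example.com:8020</value>
  </property>
</configuration>
"""

# Whether the compute side runs Apache Hadoop, Pivotal HD, Cloudera, or
# Hortonworks, dropping this into the client configuration points it at the
# same Isilon-resident data: no copy, no migration.
with open("core-site.xml", "w") as f:
    f.write(CORE_SITE)
```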

All this makes sense to me. Utilize Isilon scale-out NAS as the native storage layer for Hadoop, and the entire Hadoop environment becomes more flexible. But wait, there’s more. Using Isilon storage with Hadoop instead of a traditional DAS configuration also makes the Hadoop environment easier and faster to deploy, more reliable, and, in some cases, lower in TCO than DAS.

Decoupling the Hadoop compute and storage layers may lead you to believe there is a performance hit. Not true. You can expect up to 100 GB/s of concurrent throughput on the Hadoop storage layer with Isilon. Additionally, by off-loading storage-related HDFS overhead to Isilon, Hadoop compute farms can be better utilized for running analysis jobs instead of managing local storage.

You may think I am biased towards Isilon because I do Big Data Marketing for EMC. Not true. I genuinely believe Isilon is a better choice for Hadoop than traditional DAS for the reasons listed in the table below and based on my interview with Ryan Peterson, Director of Solutions Architecture at Isilon.

[Table: Isilon vs. traditional DAS for Hadoop deployments]

1.  What Hadoop distributions does Isilon support?

Continue reading