Cloudera Enterprise and EMC Isilon: Filling In The Hadoop Gaps

As Hadoop becomes the central component of enterprise data architectures, the open source community and technology vendors have built a large Big Data ecosystem of Hadoop platform capabilities to fill in the gaps of enterprise application requirements. For data processing, MapReduce batch processing is being supplemented with technologies such as Apache Hive, Apache Solr, and Apache Spark to fill in the gaps for SQL access, search, and streaming. For data storage, direct attached storage (DAS) has been the common deployment configuration for Hadoop; however, the market is now looking to supplement DAS deployments with enterprise storage. Why take this approach? Organizations can HDFS-enable valuable data already managed in enterprise storage without having to copy or move it to a separate Hadoop DAS environment.

As a leader in enterprise storage, EMC has partnered with Hadoop vendors such as Cloudera to ensure customers can fill in the Hadoop gaps through HDFS-enabled storage such as EMC Isilon. In addition to providing data protection, efficient storage utilization, and ease of import/export through multi-protocol support, EMC Isilon and Cloudera together allow organizations to quickly and easily take on new analytic workloads. With the announcement of Cloudera Enterprise certified with EMC Isilon for HDFS storage, I wanted to take the opportunity to speak with Cloudera’s Chief Strategy Officer Mike Olson about the partnership and how he sees the Hadoop ecosystem evolving over the next several years.

1.  The industry uses different terms for enterprise data architectures centered around Hadoop. EMC refers to this next generation data architecture as a Data Lake, and Cloudera refers to it as an Enterprise Data Hub. What is the common thread?

The two are closely related. At Cloudera, we think of a data hub as an engineered system designed to analyze and process data in place, so it needn’t be moved to be used. The most common use of the “data lake” term is around existing large repositories (and Isilon is an excellent example), where data is collected and managed at scale, but where historically it’s had to be piped out of the lake to be used. By layering Cloudera Enterprise right on top of Isilon as a storage substrate, we layer a hub on the lake – we let you keep your data where it lives, and put the processing where you need it.

2.  Cloudera leads the Hadoop market. What does EMC Isilon bring to the table for your customers?

Best-of-breed engineered storage solutions, of course; manageability, operability, credibility and a tremendous record of success in the enterprise as well. And, of course, a substantial market presence. The data stored in Isilon systems today is more valuable if we can deliver big data analytics and processing on it, without requiring it to be migrated to separate big data infrastructure.

3.  What are the ideal use cases for a Cloudera-Isilon deployment?

We don’t see any practical difference in the use cases that matter. The processing and analytic workloads for big data apply whether data is in native HDFS managed by Apache Hadoop, or in Isilon. The real question is what the enterprise’s requirements and standards around its storage infrastructure are. Companies that choose the benefits of Isilon now get the benefits of Cloudera as well.

4.  SMB and NFS are examples of protocols that have been around for generations. Will HDFS stand the test of time, or will it be replaced by another protocol that supports, for example, real-time applications or the Internet of Things?

Software evolves continually, but HDFS is a long-term player. SMB and NFS are more scalable and more performant today than they were ten or twenty years ago, and I’m confident that you’ll see HDFS evolve as well.

5.  MapReduce provides an excellent alternative to traditional data warehouse batch processing. Other open source technologies for Hadoop, such as Hive, Spark, and Apache HBase, provide additional capabilities to meet enterprise application requirements. How do you see this data processing ecosystem evolving in the next 5 years?

It’ll be faster, more powerful, more capable and more real-time. The pace of innovation in the last ten years has been breathtaking, in terms of data analysis and transformation. The open source ecosystem and traditional vendors are doing amazing things. That’ll continue – there is so much value in the data that there’s a huge reward for that innovation.

EMC Hadoop Starter Kit ViPR Edition: Creating a Smarter Data Lake

Pivotal HD offers a wide variety of data processing technologies for Hadoop – real-time, interactive, and batch. Add EMC Isilon scale-out NAS as integrated data storage and you have a shared data repository with multi-protocol support, including HDFS, to service a wide variety of data processing requests. This smells like a Data Lake to me – a general-purpose data storage and processing resource center where Big Data applications can develop and evolve. Add EMC ViPR software-defined storage to the mix and you have the smartest Data Lake in town, one that supports additional protocols and hardware and automatically adapts to changing workload demands to optimize application performance.

EMC Hadoop Starter Kit, ViPR Edition, now makes it easier to deploy this ‘smart’ Data Lake with Pivotal HD and other Hadoop distributions such as Cloudera and Hortonworks. Simply download this step-by-step guide and you can quickly deploy a Hadoop or a Big Data analytics environment, configuring Hadoop to utilize ViPR for HDFS, with Isilon hosting the Object/HDFS data service.  Although in this guide Isilon is the storage array that ViPR deploys objects to, other storage platforms are also supported – EMC VNX, NetApp, OpenStack Swift and Amazon S3.

I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, about why every organization should use this starter kit to optimize its IT infrastructure for Hadoop deployments.

1.  The original EMC Hadoop Starter Kit released last year was a huge success.  Why did you create ViPR Edition?

Pivotal Big Data Suite: Eliminating the Tax On A Growing Hadoop Cluster

The promise of Big Data is about analyzing more data to gain unprecedented insight, but Hadoop pricing can place serious constraints on the amount of data that can actually be stored for analysis. Each time a node is added to a Hadoop cluster to increase storage capacity, you are charged for it. Because this pricing model runs counter to the philosophy of Big Data, Pivotal has removed the tax on storing data in Hadoop with its announcement of Pivotal Big Data Suite.

Through a Pivotal Big Data Suite subscription, customers store as much data as they want in fully supported Pivotal HD and pay only, per core, for value-added services: Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, and HAWQ. The significance of this new consumption model is that customers can now store as much Big Data as they want and are charged only for the value they extract from it.
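To make the difference between the two pricing models concrete, here is a minimal sketch of the arithmetic. Every number in it is hypothetical and chosen only for illustration; none of the prices or core counts come from Pivotal, and the function names are my own. It simply contrasts a bill that grows with every storage node against a bill tied only to the cores running value-added services.

```python
# Illustrative only: all prices and core counts below are hypothetical,
# not actual Pivotal or Hadoop vendor pricing.

def per_node_cost(storage_nodes: int, price_per_node: float) -> float:
    """Annual cost when every node added to the cluster is charged for."""
    return storage_nodes * price_per_node


def per_core_cost(service_cores: int, price_per_core: float) -> float:
    """Annual cost when only cores running value-added services are charged for."""
    return service_cores * price_per_core


HYPOTHETICAL_PRICE_PER_NODE = 4000.0   # assumed figure
HYPOTHETICAL_PRICE_PER_CORE = 1000.0   # assumed figure
SERVICE_CORES = 64                     # assumed analytics footprint, held constant

for nodes in (10, 50, 100):
    node_bill = per_node_cost(nodes, HYPOTHETICAL_PRICE_PER_NODE)
    core_bill = per_core_cost(SERVICE_CORES, HYPOTHETICAL_PRICE_PER_CORE)
    print(f"{nodes:>3} storage nodes: per-node model {node_bill:>9.0f}, "
          f"per-core model {core_bill:>9.0f}")
```

Under these assumed figures, the per-node bill rises each time storage is added, while the per-core bill stays flat until more cores are put to work on analytics.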

*Calculate your savings with Pivotal Big Data Suite compared to traditional Enterprise Data Warehouse technologies.

Additionally, Pivotal Big Data Suite removes the mind games associated with the diverse data processing needs of Big Data. With a flexible subscription to your choice of real-time, interactive, and batch processing technologies, organizations are no longer locked into a specific technology because of a contract. At any point in time, as Big Data applications grow and Data Warehouse applications shrink, you can spin licenses up or down across the value-added services without incurring additional costs. This pooled approach eliminates the need to procure new technologies, a process that leads to delayed projects, additional costs, and more data silos.

I spoke with Michael Cucchi, Senior Director of Product Marketing at Pivotal, about how Pivotal Big Data Suite radically redefines the economics of Big Data so organizations can achieve the Data Lake dream.

1. What Big Data challenges does Big Data Suite address and why?

VCE Vblock: Converging Big Data Investments To Drive More Value

As Big Data continues to demonstrate real business value, organizations are looking to leverage this high-value data across different applications and use cases. This uptake is also driving organizations to move from siloed Big Data sandboxes to enterprise architectures that must address mission-critical availability and performance, security and privacy, provisioning of new services, and interoperability with the rest of the enterprise infrastructure.

Sandbox or experimental Hadoop on commodity hardware with direct attached storage (DAS) makes it difficult to address such challenges for several reasons: data is hard to replicate across applications and data centers, IT lacks oversight of and visibility into the data, multi-tenancy and virtualization are missing, upgrades and technology migrations are hard to streamline, and more. As a result, VCE, a leader in converged or integrated infrastructures, is receiving an increasing number of requests for guidance on moving DAS-based Hadoop implementations to VCE Vblock Systems, an enterprise-class infrastructure that combines servers, shared storage, network devices, virtualization, and management in a pre-integrated stack.

Formed by Cisco and EMC, with investments from VMware and Intel, VCE enables organizations to rapidly deploy business services on demand and at scale – all without triggering an explosion in capital and operating expenses. According to a recent IDC report, organizations around the world spent over $3.3 billion on converged systems in 2012, and IDC forecast this spending to increase by 20% in 2013 and again in 2014. In fact, IDC calculated that Vblock Systems infrastructure delivered a return on investment of 294% over a three-year period and 435% over a five-year period compared to data on traditional infrastructure, due to fast deployments, simplified operations, improved business-support agility, cost savings, and staff freed to launch new applications, extend services, and improve user/customer satisfaction.

I spoke with Julianna DeLua from VCE Product Management to discuss how VCE’s Big Data solution enables organizations to extract more value from Big Data investments.

1.  Why are organizations interested in deploying Hadoop and Big Data applications on converged or integrated infrastructures such as Vblock?
