As Hadoop becomes the central component of enterprise data architectures, the open source community and technology vendors have built a large Big Data ecosystem of Hadoop platform capabilities to fill the gaps in enterprise application requirements. For data processing, we have seen MapReduce batch processing being supplemented with additional data processing technologies such as Apache Hive, Apache Solr, and Apache Spark to fill the gaps for SQL access, search, and streaming. For data storage, direct-attached storage (DAS) has been the common deployment configuration for Hadoop; however, the market is now looking to supplement DAS deployments with enterprise storage. Why take this approach? Organizations can HDFS-enable valuable data already managed in enterprise storage without having to copy or move this data to a separate Hadoop DAS environment.
As a leader in enterprise storage, EMC has partnered with Hadoop vendors such as Cloudera to ensure customers can fill the Hadoop gaps through HDFS-enabled storage such as EMC Isilon. In addition to providing data protection, efficient storage utilization, and ease of import/export through multi-protocol support, EMC Isilon and Cloudera together allow organizations to quickly and easily take on new analytic workloads. With the announcement of Cloudera Enterprise certified with EMC Isilon for HDFS storage, I wanted to take the opportunity to speak with Cloudera’s Chief Strategy Officer Mike Olson about the partnership and how he sees the Hadoop ecosystem evolving over the next several years.
1. The industry uses different terms for enterprise data architectures centered around Hadoop. EMC refers to this next-generation data architecture as a Data Lake, while Cloudera calls it an Enterprise Data Hub. What is the common thread?
What does it take to make CIO Magazine’s Top 100 List? A Big Data victory is one way to get there.
Michael Cucchi, Sr Director of Product Marketing at Pivotal, had the privilege to speak with one of the winners – EMC CIO Vic Bhagat. I have put together a summary of this interview below, covering the pains and gains of EMC’s Big Data initiative. EMC IT’s approach to Big Data is exactly what the EVP Federation enables organizations to do – first collect any and all data in a Data Lake, then deploy the right analytic tools that your people know how to use to analyze the data, and finally learn agile development so you can take those insights and build applications rapidly.
1. Why is Big Data important to your business?
Hadoop-as-a-Service (HaaS) generally refers to Hadoop in the cloud, a handy alternative to on-premises Hadoop deployments for organizations whose overwhelmed data center administrators need to incorporate Hadoop but don’t have the resources to do so. What if there were also a promising option to successfully build and maintain Hadoop clusters on-premises, also referred to as HaaS? The EMC Hybrid Cloud (EHC) enables just this – Hadoop in the hybrid cloud.
EHC, announced at EMC World 2014, is a new end-to-end reference architecture that is based on a Software-Defined Data Center architecture comprising technologies from across the EMC federation of companies: EMC II storage and data protection, Pivotal CF Platform-as-a-Service (PaaS) and the Pivotal Big Data Suite, VMware cloud management and virtualization solutions, and VMware vCloud Hybrid Service. EHC’s Hadoop-as-a-Service was demonstrated at last week’s VMworld 2014 San Francisco – the underpinnings of a Virtual Data Lake.
EHC leverages these tight integrations across the Federation so that customers can extend their existing investments for automated provisioning and self-service, automated monitoring, secure multi-tenancy, chargeback, and elasticity to address the requirements of IT, developers, and lines of business. I spoke with Ian Breitner, Global Solutions Marketing Director for Big Data, about why EMC’s approach to HaaS should be considered over other Hadoop cloud offerings.
1. In your opinion, what are the key characteristics of HaaS?
Pivotal HD offers a wide variety of data processing technologies for Hadoop – real-time, interactive, and batch. Add EMC Isilon scale-out NAS as integrated data storage for Pivotal HD and you have a shared data repository with multi-protocol support, including HDFS, to service a wide variety of data processing requests. This smells like a Data Lake to me – a general-purpose data storage and processing resource center where Big Data applications can develop and evolve. Add EMC ViPR software-defined storage to the mix and you have the smartest Data Lake in town, one that supports additional protocols and hardware and automatically adapts to changing workload demands to optimize application performance.
The EMC Hadoop Starter Kit, ViPR Edition, now makes it easier to deploy this ‘smart’ Data Lake with Pivotal HD and other Hadoop distributions such as Cloudera and Hortonworks. Simply download this step-by-step guide and you can quickly deploy a Hadoop or Big Data analytics environment, configuring Hadoop to utilize ViPR for HDFS, with Isilon hosting the Object/HDFS data service. Although in this guide Isilon is the storage array that ViPR deploys objects to, other storage platforms are also supported – EMC VNX, NetApp, OpenStack Swift, and Amazon S3.
I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, about why every organization should use this starter kit to optimize its IT infrastructure for Hadoop deployments.
1. The original EMC Hadoop Starter Kit released last year was a huge success. Why did you create ViPR Edition?