All Paths Lead To A Federation Data Lake

Is your organization constrained by 2nd platform data warehouse technologies, with limited or no budget to move toward 3rd platform agile technologies such as a Data Lake? As an EMC customer, you have the advantage of leveraging existing EMC investments to develop a Federation Data Lake at minimal cost. Additionally, the Federation Data Lake will generate healthy returns, as it is packaged with the expertise needed to immediately execute on data lake use cases such as data warehouse ETL offloading and archiving.

With the release of William Schmarzo’s Five Tactics to Modernize Your Existing Data Warehouse, I wanted to explore whether the Dean of Big Data views these data warehouse modernization tactics, or paths, as ultimately leading to a Federation Data Lake.

1.  What is a Data Lake and who should care?

The data lake is a modern approach to data analytics that takes advantage of the processing and cost advantages of Hadoop. It allows you to store all of the data that you think might be important in a central repository, as is. Leaving the data in its raw form is key, since you don’t need a pre-determined schema, or ‘schema on load’. Schema on load is a data warehousing process that optimizes the data for a query, but also strips the data of information that could be useful for analysis. This flexibility allows the data lake to feed all downstream applications, such as a data warehouse, analytic sandboxes, and other analytic environments.
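To make the difference from schema on load concrete, here is a minimal Python sketch of applying a schema at read time instead. It is an illustration only, not part of the Federation Data Lake itself: the sample events, the field names, and the `read_with_schema` helper are all hypothetical, and a real data lake would keep the raw records in Hadoop rather than in a Python list.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (no schema applied on load).
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.99", "channel": "web", "coupon": "SPRING"}',
    '{"user": "b42", "amount": "5.00", "channel": "store"}',
    '{"user": "a17", "amount": "7.50", "channel": "web", "referrer": "email"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select and cast only the fields a consumer needs.

    Fields missing from a raw record come back as None; the raw data is never modified.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# A reporting-style consumer only needs user and amount.
sales_schema = {"user": str, "amount": float}
sales = list(read_with_schema(RAW_EVENTS, sales_schema))

# An exploratory consumer reads fields that a fixed load-time schema might have dropped.
marketing_schema = {"user": str, "coupon": str, "referrer": str}
marketing = list(read_with_schema(RAW_EVENTS, marketing_schema))

total_sales = sum(row["amount"] for row in sales)
```

Both consumers read the same raw records, and each decides for itself which fields matter. A load-time schema that kept only `user` and `amount` would have discarded the coupon and referrer fields before the second consumer could ask for them.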

Everybody should care, but especially the data warehousing and data science teams. It provides a line of demarcation between the data warehouse team, which is production/SLA driven, and the data science team, which is ad-hoc/exploratory driven. There is a natural point of friction between these teams, since the nature of data science tools such as SAS negatively affects data warehouse SLAs. With a data lake, the data science team can freely access the data they need without affecting data warehouse SLAs.

The other benefit a data lake provides for a data warehouse team is ETL offload. The data lake can perform large-scale, complex ETL processing, freeing up resources in the expensive data warehouse. I’m working with a large hospital right now on this ETL offload use case, as their data warehouse costs are continually rising because they have to add more resources to prevent ETL processing from negatively affecting reporting windows.

2.  What is the Federation Data Lake solution?

Through the testing of different storage and processing technologies, the Federation Data Lake provides a technology reference architecture, with services, that spans the Federation – EMC II, Pivotal, and VMware.

It is a package that really helps customers accelerate the modernization of their data warehouse environment into a data lake – not only through a proven architecture but also with global services to assist with the migration.

3.  Who are the ideal candidates for the Federation Data Lake and why?

The ideal candidate is any large data warehouse organization that is having trouble meeting ETL windows or is maxing out on resources. The perception is that a Data Lake is a data science tool, but it is also a great tool for data warehouse teams for ETL processing. You can see 20-50X savings when you move ETL processing from an expensive data warehouse to a low-cost Data Lake.

4.  One of the biggest barriers to getting value from Big Data or a Data Lake is the skills shortage. How does the Federation Data Lake address this issue?

Federation Data Lake addresses this issue in three ways. First, by putting together a technology reference architecture, we accelerate the development of a Data Lake. Second, by packaging up expertise through EMC Global Services, we help customers get started quickly by identifying the use cases that have the most business impact and creating subsequent project plans for execution. Finally, the EMC Big Data curriculum is aligned with the Federation Data Lake in order to train executives, business leaders, and data scientists to successfully identify use cases and execute on them. For example, we train users how to use new technologies such as Hadoop as a more modern, powerful, and agile approach to ETL processing.

5.  Gartner says beware of data lake fallacy, citing ‘Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured’. How does the Federation Data Lake address this issue?

My issue with Gartner’s comment is that they are taking the concept of a Data Lake and picking it apart, whereas EMC approaches the concept of a Data Lake as a means to solve technical and business problems. For example, we absolutely believe you need data governance, and it should not be ignored in a data lake environment. EMC Global Services helps organizations with their data governance strategy by identifying the business processes that will be supported by the Data Lake. For example, a business process may use POS data, which will be highly governed, social media data, which may be lightly governed, and market intelligence data, which may need no governance.

Cloudera Enterprise and EMC Isilon: Filling In The Hadoop Gaps

As Hadoop becomes the central component of enterprise data architectures, the open source community and technology vendors have built a large Big Data ecosystem of Hadoop platform capabilities to fill in the gaps of enterprise application requirements. For data processing, we have seen MapReduce batch processing supplemented with additional data processing techniques such as Apache Hive, Apache Solr, and Apache Spark to fill in the gaps for SQL access, search, and streaming. For data storage, direct attached storage (DAS) has been the common deployment configuration for Hadoop; however, the market is now looking to supplement DAS deployment with enterprise storage. Why take this approach? Organizations can HDFS-enable valuable data already managed in enterprise storage without having to copy or move this data to a separate Hadoop DAS environment.

As a leader in enterprise storage, EMC has partnered with Hadoop vendors such as Cloudera to ensure customers can fill in the Hadoop gaps through HDFS-enabled storage such as EMC Isilon. In addition to providing data protection, efficient storage utilization, and ease of import/export through multi-protocol support, EMC Isilon and Cloudera together allow organizations to quickly and easily take on new analytic workloads. With the announcement of Cloudera Enterprise certified with EMC Isilon for HDFS storage, I wanted to take the opportunity to speak with Cloudera’s Chief Strategy Officer Mike Olson about the partnership and how he sees the Hadoop ecosystem evolving over the next several years.

1.  The industry has different terminologies for enterprise data architectures centered around Hadoop. EMC refers to this next generation data architecture as a Data Lake, and Cloudera refers to it as an Enterprise Data Hub. What is the common thread?

The two are closely related. At Cloudera, we think of a data hub as an engineered system designed to analyze and process data in place, so it needn’t be moved to be used. The most common use of the “data lake” term is around existing large repositories (and Isilon is an excellent example), where data is collected and managed at scale, but where historically it’s had to be piped out of the lake to be used. By layering Cloudera Enterprise right on top of Isilon as a storage substrate, we layer a hub on the lake – we let you keep your data where it lives, and put the processing where you need it.

2.  Cloudera leads the Hadoop market. What does EMC Isilon bring to the table for your customers?

Best-of-breed engineered storage solutions, of course; manageability, operability, credibility and a tremendous record of success in the enterprise as well. And, of course, a substantial market presence. The data stored in Isilon systems today is more valuable if we can deliver big data analytics and processing on it, without requiring it to be migrated to separate big data infrastructure.

3.  What are the ideal use cases for a Cloudera-Isilon deployment?

We don’t see any practical difference in the use cases that matter. The processing and analytic workloads for big data apply whether data is in native HDFS managed by Apache Hadoop, or in Isilon. The real question is what the enterprise’s requirements and standards around its storage infrastructure are. Companies that choose the benefits of Isilon now get the benefits of Cloudera as well.

4.  SMB and NFS are examples of protocols that have been around for generations. Will HDFS stand the test of time, or will it be replaced by another protocol that supports, for example, real-time applications or applications for the Internet of Things?

Software evolves continually, but HDFS is a long-term player. SMB and NFS are more scalable and more performant today than they were ten or twenty years ago, and I’m confident that you’ll see HDFS evolve as well.

5.  MapReduce provides an excellent alternative to traditional data warehouse batch processing requirements. Other open source data processing techniques for Hadoop, such as Hive, Spark, and Apache HBase, provide additional capabilities to meet enterprise application requirements. How do you see this data processing ecosystem evolving in the next 5 years?

It’ll be faster, more powerful, more capable and more real-time. The pace of innovation in the last ten years has been breathtaking, in terms of data analysis and transformation. The open source ecosystem and traditional vendors are doing amazing things. That’ll continue – there is so much value in the data that there’s a huge reward for that innovation.

Big Data Pains & Gains From A Real Life CIO

What does it take to make CIO Magazine’s Top 100 List? A Big Data victory is one of the things that helps.
Michael Cucchi, Sr Director of Product Marketing at Pivotal, had the privilege of speaking with one of the winners – EMC CIO Vic Bhagat. I have put together a summary below of their interview, which covers the pains and gains of EMC’s Big Data initiative. EMC IT’s approach to Big Data is exactly what the EVP Federation enables organizations to do – first collect any and all data in a Data Lake, then deploy the right analytic tool that your people know how to use to analyze the data, and finally learn agile development so you can take those insights and build applications rapidly.

1. Why is Big Data important to your business?

Pivotal Big Data Suite: Eliminating the Tax On A Growing Hadoop Cluster

The promise of Big Data is about analyzing more data to gain unprecedented insight, but Hadoop pricing can place serious constraints on the amount of data that can actually be stored for analysis. Each time a node is added to a Hadoop cluster to increase storage capacity, you are charged for it. Because this pricing model runs counter to the philosophy of Big Data, Pivotal has removed the tax on storing data in Hadoop with its announcement of Pivotal Big Data Suite.

Through a Pivotal Big Data Suite subscription, customers can store as much data as they want in fully supported Pivotal HD, paying only for value added services per core – Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, and HAWQ. The significance of this new consumption model is that customers can now store as much Big Data as they want, but are charged only for the value they extract from Big Data.
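As a rough illustration of the two consumption models described above, the Python sketch below compares a per-node charge with a per-core charge for value added services. Every number in it is a made-up placeholder rather than actual Pivotal or Hadoop vendor pricing, and the function names are hypothetical. The only point is the shape of the two curves as a cluster grows.

```python
def per_node_cost(nodes, price_per_node):
    """Per-node model: every node added for storage capacity adds to the bill."""
    return nodes * price_per_node


def per_core_cost(service_cores, price_per_core):
    """Per-core model: only cores running value added services are charged."""
    return service_cores * price_per_core


# Hypothetical placeholder prices, NOT real vendor pricing.
NODE_PRICE = 1000
CORE_PRICE = 500

# Assume the analytics services stay on 16 cores while storage grows.
SERVICE_CORES = 16

for nodes in (10, 20, 40):
    print(
        f"{nodes} nodes: per-node = {per_node_cost(nodes, NODE_PRICE)}, "
        f"per-core = {per_core_cost(SERVICE_CORES, CORE_PRICE)}"
    )
```

Under these assumptions, the per-node bill grows with every node added for capacity, while the per-core bill changes only when the number of cores running value added services changes.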

*Calculate your savings with Pivotal Big Data Suite compared to traditional Enterprise Data Warehouse technologies.

Additionally, Pivotal Big Data Suite removes the mind games associated with the diverse data processing needs of Big Data. With a flexible subscription to your choice of real-time, interactive, and batch processing technologies, organizations are no longer locked into a specific technology because of a contract. At any point in time, as Big Data applications grow and Data Warehouse applications shrink, you can spin licenses up or down across the value added services without incurring additional costs. This pooled approach eliminates the need to procure new technologies, a process that results in delayed projects, additional costs, and more data silos.

I spoke with Michael Cucchi, Senior Director of Product Marketing at Pivotal, about how Pivotal Big Data Suite radically redefines the economics of Big Data so organizations can achieve the Data Lake dream.

1. What Big Data challenges does Big Data Suite address and why?
