All Paths Lead To A Federation Data Lake

Is your organization constrained by 2nd platform data warehouse technologies, with limited or no budget to move toward 3rd platform agile technologies such as a Data Lake? As an EMC customer, you can leverage existing EMC investments to develop a Federation Data Lake at minimal cost. Additionally, the Federation Data Lake will generate healthy returns, as it is packaged with the expertise needed to immediately execute on data lake use cases such as data warehouse ETL offloading and archiving.


With the release of William Schmarzo’s Five Tactics to Modernize Your Existing Data Warehouse, I wanted to explore whether the Dean of Big Data views data warehouse modernization tactics or paths as ultimately leading to a Federation Data Lake.

1.  What is a Data Lake and who should care?

The data lake is a modern approach to data analytics that takes advantage of the processing and cost advantages of Hadoop. It allows you to store all of the data that you think might be important in a central repository as is. Leaving the data in its raw form is key since you don’t need a pre-determined schema or ‘schema on load’. Schema on load is a data warehousing process that optimizes a query, but also strips the data of information that could be useful for analysis. This flexibility then allows the data lake to feed all downstream applications such as a data warehouse, analytic sandboxes, and other analytic environments.
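To make the schema-on-load point concrete, here is a minimal Python sketch, not anything from the Federation Data Lake itself. The sample events, the column names, and the helper functions `load_with_schema` and `read_with_schema` are all hypothetical, chosen only to illustrate how a fixed load-time schema discards fields that raw storage would keep.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"user": "a17", "amount": 42.5, "channel": "web", "coupon": "SPRING"}',
    '{"user": "b02", "amount": 13.0, "channel": "store"}',
    '{"user": "a17", "amount": 7.25, "channel": "web", "referrer": "email"}',
]

# Schema on load: the columns are fixed before any data is stored (assumed column set).
WAREHOUSE_COLUMNS = ("user", "amount")


def load_with_schema(lines, columns=WAREHOUSE_COLUMNS):
    """Keep only the predefined columns; every other field is dropped at load time."""
    return [{c: json.loads(line).get(c) for c in columns} for line in lines]


def read_with_schema(lines, columns):
    """Parse the raw lines at query time and project whichever columns this question needs."""
    return [{c: json.loads(line).get(c) for c in columns} for line in lines]


# The warehouse copy can no longer answer questions about coupons or channels.
warehouse_rows = load_with_schema(RAW_EVENTS)

# The raw copy can, because the schema is chosen when the data is read.
lake_rows = read_with_schema(RAW_EVENTS, ("user", "channel", "coupon"))
```

In this sketch the two functions do the same parsing; the only difference is when the column list is decided. That timing is the whole point: fields dropped at load time are gone for every later question, while the raw copy stays available for questions nobody anticipated.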

Everybody should care, but especially the data warehousing and data science teams. It provides a line of demarcation between the data warehouse team, which is production/SLA driven, and the data science team, which is ad-hoc/exploratory driven. There is a natural point of friction between these teams, since the workloads generated by data science tools such as SAS negatively affect data warehouse SLAs. With a data lake, the data science team can freely access the data they need without affecting data warehouse SLAs.

The other benefit a data lake provides for a data warehouse team is ETL offload. The data lake can perform large-scale, complex ETL processing, freeing up resources in the expensive data warehouse. I’m working with a large hospital right now on this ETL offload use case, as their data warehouse costs are continually rising because they have to keep adding resources to prevent ETL processing from negatively affecting reporting windows.

2.  What is the Federation Data Lake solution?

Through the testing of different storage and processing technologies, the Federation Data Lake provides a technology reference architecture, with services, that spans the Federation: EMC II, Pivotal, and VMware.

It is a package that really helps customers accelerate the modernization of their data warehouse environment into a data lake – not only through a proven architecture but also with global services to assist with the migration.

3.  Who are the ideal candidates for the Federation Data Lake and why?

The ideal candidate is any large data warehouse organization that is having trouble meeting ETL windows or is maxing out on resources. The perception is that a Data Lake is a data science tool, but it is also a great tool for data warehouse teams for ETL processing. Moving ETL processing from an expensive data warehouse to a low-cost Data Lake yields savings of 20-50X.

4.  One of the biggest barriers to getting value from Big Data or a Data Lake is the skills shortage. How does the Federation Data Lake address this issue?

Federation Data Lake addresses this issue in 3 ways. First, by putting together a technology reference architecture, we accelerate the development of a Data Lake. Second, by packaging up expertise through EMC Global Services, we help customers get started quickly by identifying the use cases that have the most business impact and creating project plans for execution. Finally, the EMC Big Data curriculum is aligned with the Federation Data Lake in order to train executives, business leaders, and data scientists to successfully identify use cases and execute on them. For example, we train users to use new technologies such as Hadoop as a more modern, powerful, and agile approach to ETL processing.

5.  Gartner says to beware of the data lake fallacy, stating that ‘Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured’. How does the Federation Data Lake address this issue?

My issue with Gartner’s comment is that they are taking the concept of a Data Lake and picking it apart, whereas EMC approaches the concept of a Data Lake as a means to solve technical and business problems. For example, we absolutely believe you need data governance and it should not be ignored in a data lake environment. EMC Global Services helps organizations with their data governance strategy by identifying the business processes that will be supported by the Data Lake. A business process may use POS data, which will be highly governed, social media data, which may be lightly governed, and market intelligence data, which may need no governance.

A Novel Idea: Practical Advice From A Big Data Practitioner

Big Data: Understanding How Data Powers Big Business is yet another Big Data book to hit the market. What makes this book unique? It offers practical advice and hands-on exercises, so that you finish the book with a Big Data action plan unique to your business. I asked the author, EMC’s own preeminent Big Data expert William Schmarzo, to explain the goals of his book and why organizations grappling with Big Data should pick it up.


1.  What makes you a Big Data expert in providing practical advice for developing Big Data strategies?


Dear BI Users: Your Hadoop SQL Wish Has Finally Come True

To accelerate the value of Big Data, many products have been developed to make data managed in Hadoop much easier to access and analyze through SQL. First there was Hive, which provides a SQL query abstraction layer by converting SQL queries into MapReduce jobs. More recently, Cloudera announced Impala, which bypasses MapReduce to enable interactive queries on data stored in Hadoop using the same variant of SQL that Hive uses. And today, EMC Greenplum announced Pivotal HD, the only high-performing, true SQL query engine on top of Hadoop. Don’t be confused by these approaches, as there is a common thread: leveraging Hadoop as a Big Data platform for running SQL queries. The major difference with Pivotal HD is that now there is a single, scalable, flexible, and cost-effective data platform for all of your analytic needs.



I asked Greenplum Chief Scientist Milind Bhandarkar to explain this breakthrough SQL interface to Hadoop.

1. How does Pivotal HD provide a true, high-performing SQL interface to Hadoop?


Want To Become A Data Scientist? EMC Can Train You in 5 Days.

Everyone agrees that there is a shortage of Data Scientists. If it is not addressed soon, Big Data breakthroughs in areas such as healthcare, renewable energy, and the public sector will decelerate. I am proud to say that EMC is doing its part to solve the problem by fostering Data Science development with training and certification, hands-on expertise, web events, internships, and more. For example, EMC Education Services offers a 5-day Data Science and Big Data Analytics training and certification, designed to enable immediate and effective participation in big data and other analytics projects.

As a Big Data citizen, I want to motivate those thinking about moving into the world of Data Science to take action and get trained. I met with Barry Heller, a developer for EMC’s Data Science curriculum, who draws on his extensive education and past experience as an EMC Data Scientist for curriculum development. If Barry’s story resonates and you relate in some way, I hope it inspires you to start a career in Data Science.

1) How many people have completed the EMC Data Science and Big Data Analytics training since its creation early this year?
