All Paths Lead To A Federation Data Lake

Is your organization constrained by 2nd platform data warehouse technologies, with limited or no budget to move toward 3rd platform agile technologies such as a Data Lake? As an EMC customer, you have the advantage of leveraging existing EMC investments to develop a Federation Data Lake at minimal cost. Additionally, the Federation Data Lake will generate healthy returns, as it is packaged with the expertise needed to immediately execute on data lake use cases such as data warehouse ETL offloading and archiving.

With the release of William Schmarzo’s Five Tactics to Modernize Your Existing Data Warehouse, I wanted to explore whether the Dean of Big Data sees these data warehouse modernization tactics, or paths, as ultimately leading to a Federation Data Lake.

1.  What is a Data Lake and who should care?

The data lake is a modern approach to data analytics that takes advantage of the processing and cost advantages of Hadoop. It allows you to store all of the data that you think might be important in a central repository, as is. Leaving the data in its raw form is key, since you don’t need a predetermined schema, or ‘schema on load’. Schema on load is a data warehousing process that optimizes queries but also strips the data of information that could be useful for analysis. This flexibility allows the data lake to feed all downstream applications, such as a data warehouse, analytic sandboxes, and other analytic environments.
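To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration: the sample records, field names, and both helper functions are hypothetical, and a real data lake would do this with Hadoop tooling rather than in-memory lists. The first function mimics schema on load by keeping only the columns a warehouse table defines; the second keeps the raw records and applies a schema only when the data is read.

```python
import json

# Raw events stored in the lake exactly as received (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.99", "channel": "web", "coupon": "SPRING"}',
    '{"user": "b42", "amount": "5.00", "channel": "store"}',
]


def load_with_fixed_schema(lines):
    """Schema on load: keep only the columns the warehouse table defines."""
    columns = ("user", "amount")  # hypothetical warehouse table definition
    return [{c: json.loads(line).get(c) for c in columns} for line in lines]


def read_with_schema(lines, fields):
    """Schema on read: parse the raw record and pick fields at query time."""
    return [{f: json.loads(line).get(f) for f in fields} for line in lines]


# The warehouse copy has lost the coupon field for good.
warehouse_rows = load_with_fixed_schema(RAW_EVENTS)

# The raw copy can still answer a question nobody planned for.
coupon_view = read_with_schema(RAW_EVENTS, ("user", "coupon"))

print(warehouse_rows)
print(coupon_view)
```

Because the raw lines are never rewritten, a later analysis can ask for `channel` or `coupon` without reloading anything, which is the flexibility described above.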

Everybody should care, but especially the data warehousing and data science teams. The data lake provides a line of demarcation between the data warehouse team, which is production/SLA driven, and the data science team, which is ad hoc/exploratory driven. There is a natural point of friction between these teams, since data science tools such as SAS tend to negatively affect data warehouse SLAs. With a data lake, the data science team can freely access the data they need without affecting data warehouse SLAs.

The other benefit a data lake provides for a data warehouse team is ETL offload. The data lake can perform large-scale, complex ETL processing, freeing up resources in the expensive data warehouse. I’m working with a large hospital right now on this ETL offload use case, as their data warehouse costs are continually rising because they have to keep adding resources to prevent ETL processing from negatively affecting reporting windows.

2.  What is the Federation Data Lake solution?

Through the testing of different storage and processing technologies, the Federation Data Lake provides a technology reference architecture, with services, that spans the Federation – EMC II, Pivotal, and VMware.

It is a package that really helps customers accelerate the modernization of their data warehouse environment into a data lake – not only through a proven architecture but also with global services to assist with the migration.

3.  Who are the ideal candidates for the Federation Data Lake and why?

The ideal candidate is any large data warehouse organization having trouble meeting ETL windows or maxing out on resources. The perception is that a Data Lake is a data science tool, but it is also a great tool for data warehouse teams for ETL processing. It is a 20-50X savings when you move ETL processing from an expensive data warehouse to a low-cost Data Lake.

4.  One of the biggest barriers to getting value from Big Data or a Data Lake is the skills shortage. How does the Federation Data Lake address this issue?

The Federation Data Lake addresses this issue in three ways. First, by putting together a technology reference architecture, we accelerate the development of a Data Lake. Second, by packaging up expertise through EMC Global Services, we help customers get started quickly by identifying the use cases that have the most business impact and creating the subsequent project plans for execution. Finally, the EMC Big Data curriculum is aligned with the Federation Data Lake in order to train executives, business leaders, and data scientists to successfully identify use cases and execute on them. For example, we train users how to use new technologies such as Hadoop as a more modern, powerful, and agile approach to ETL processing.

5.  Gartner says beware of data lake fallacy, citing ‘Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured’. How does the Federation Data Lake address this issue?

My issue with Gartner’s comment is that they are taking the concept of a Data Lake and picking it apart, whereas EMC approaches the concept of a Data Lake as a means to solve technical and business problems. For example, we absolutely believe you need data governance and that it should not be ignored in a data lake environment. EMC Global Services helps organizations with their data governance strategy by identifying the business processes that will be supported by the Data Lake. For example, a business process may use POS data, which will be highly governed, social media data, which may be lightly governed, and market intelligence data, which may need no governance.

Don’t Accept The Status Quo For Hadoop

Hadoop is Everywhere – 99% of companies will deploy/pilot Hadoop in 18-24 months, according to IDC.  These environments will largely be based around standalone servers, resulting in added management tasks because data is spread out across many disk spindles across the data center.  With Hadoop clusters quickly expanding, organizations are starting to experience the typical growing pains one can compare to adolescence.  This raises the question: should DAS server configuration be the accepted status quo for Hadoop deployments?


Whether you are getting started with Hadoop or growing your Hadoop deployment, EMC provides a long-term solution for Hadoop through shared storage and VMs, delivering distinct value to the business in lower TCO and faster time-to-results.  I spoke with EMC Technical Advisory Architect Chris Harrold to learn why organizations are now turning to EMC to help transition Hadoop environments into adulthood.

1.  Almost every Hadoop deployment is based around the accepted configuration of standalone servers with DAS.   What have you seen as issues with this configuration with your customers?

These environments are growing rapidly.  As a result, the ability to support them starts to degrade pretty quickly when you get to a larger scale.  Servers with DAS tend to be more difficult to support because they have components that can fail internally and require more babysitting than an enterprise-class platform.

For example, it’s no trivial task to expand this environment, as you have to acquire the servers, stand them all up, configure, rack, and power them.  Just to add a sandbox or test environment is difficult in this standalone server model.

There is also a steep learning curve with Hadoop not only in terms of the analytics component but also just to simply get data in and out of Hadoop in a DAS environment.

2.  Hadoop was designed to run on a DAS architecture where compute and storage are tightly coupled.  Why does EMC believe decoupling storage and compute through shared storage and virtualized compute resources is a better architecture?  How does this architecture address the issues you mentioned above?

When Hadoop was first introduced to the market in the mid-2000s, shared or enterprise storage was a high-end commodity; therefore it was very difficult to design something like Hadoop on shared storage.  Since then, shared storage has become more affordable.   At the same time, networking speeds have become faster, so it is now more feasible to decouple compute and storage. By deploying Hadoop on a shared storage model, you eliminate the manageability issues of DAS and gain the benefits of enterprise-class features such as virtualization, SANs, and scale-out NAS.
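A rough way to see what decoupling buys is to count servers under each model. The sketch below is a simplified sizing exercise, not an EMC sizing method: the function names and every number in it (capacity per node, cores per node, the workload) are hypothetical, and it ignores replication overhead and growth headroom.

```python
import math


def coupled_nodes(storage_tb, compute_cores, tb_per_node, cores_per_node):
    """DAS model: each server adds disk and CPU together, so the larger
    of the two requirements determines how many servers are needed."""
    by_storage = math.ceil(storage_tb / tb_per_node)
    by_compute = math.ceil(compute_cores / cores_per_node)
    return max(by_storage, by_compute)


def decoupled_nodes(compute_cores, cores_per_node):
    """Shared-storage model: compute servers are sized for CPU only,
    because capacity is added on the storage platform instead."""
    return math.ceil(compute_cores / cores_per_node)


# Hypothetical workload: 960 TB of data but only 256 cores of processing.
das_servers = coupled_nodes(960, 256, tb_per_node=48, cores_per_node=32)
shared_servers = decoupled_nodes(256, cores_per_node=32)

print(f"DAS servers needed: {das_servers}")
print(f"Compute servers with shared storage: {shared_servers}")
```

In this made-up case the DAS cluster needs 20 servers just to hold the data, while the decoupled design needs 8 compute servers plus whatever shared storage capacity is purchased separately.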

Also, deploying large-scale standalone architectures is really a legacy approach, as many enterprises have moved away from this to a shared architecture.  As Hadoop is becoming a key component of data architectures, it will be challenging to maintain standalone servers since many enterprises have evolved to virtualized, shared environments.

EMC is enabling organizations to leverage enterprise storage and virtualization to quickly and easily deploy and manage growing Hadoop environments.  By utilizing enterprise technologies with Hadoop, you also gain benefits such as ease of data import/export, data protection, and security.

3.  What are the components of the EMC recommended architecture for Hadoop?

We provide choice in how organizations architect Hadoop environments.  For shared storage, we provide EMC Isilon HDFS-enabled storage, which is certified with all major Hadoop distributions.  We also provide Software Defined Storage through EMC ViPR HDFS/Object storage, which is certified with all major storage arrays and Hadoop distributions.

For pre-integrated compute and shared storage, the EMC Data Computing Appliance provides an optimized architecture utilizing Pivotal HD and EMC Isilon.

For an integrated infrastructure approach, we partner with VCE vBlock to provide choice of compute, storage, and networking technologies from VMware, Cisco, and EMC to optimize any major Hadoop distribution deployment.

4.  Who are the ideal candidates for the EMC architecture for Hadoop and why?

All organizations benefit from this architecture.  Whether you are just getting started with Hadoop or have a large-scale deployment, we make it easy to rapidly deploy and manage a Hadoop environment.  In fact, utilizing EMC storage, you can analytics-enable data already living in storage arrays.  You don’t have to copy or move data to a separate standalone Hadoop DAS environment.   We have several step-by-step guides to walk you through the process of easily configuring your Hadoop environment for HDFS-enabled storage.
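As a flavor of what such configuration involves, the sketch below generates a minimal Hadoop core-site.xml that points the cluster's default file system at an external HDFS endpoint. `fs.defaultFS` is the standard Hadoop property for this; the host name `isilon.example.com`, the port 8020, and the `build_core_site` helper are assumptions for illustration, and the actual guides cover more steps than this one setting.

```python
import xml.etree.ElementTree as ET


def build_core_site(hdfs_host, port=8020):
    """Return core-site.xml text that sets fs.defaultFS to the given
    HDFS endpoint (host and port here are illustrative placeholders)."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{hdfs_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical address of the HDFS-enabled storage.
core_site_xml = build_core_site("isilon.example.com")
print(core_site_xml)
```

With the default file system pointing at the shared storage, Hadoop jobs read and write the data where it already lives instead of first copying it onto local disks.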

5. Although Hadoop is everywhere, with IDC estimating that 99% of companies will deploy/pilot Hadoop in the next 18-24 months, gaining ROI from the deployment is a challenge due to a lack of skills – identifying the right opportunity and then executing.  How does EMC address this issue?

Yes, this is a huge problem in the industry, especially the lack of Data Science skills.  EMC addresses the skills shortage through our services across the EMC Federation.  Pivotal Data Labs provides access to some of the best minds in Data Science to help organizations identify opportunities and execute utilizing the latest Big Data technologies and techniques.  The EMC Vision Workshop creates a strategic Big Data blueprint for organizations to continuously identify Big Data use cases based on the organization’s business initiatives and implementation feasibility.  This has become a huge success, as the EMC Vision Workshop creates the needed organizational alignment – Lines of Business continuously working, communicating, and collaborating with IT in order to identify the right Big Data use cases.

EMC Hadoop Starter Kit: Creating a Smarter Data Lake

Pivotal HD offers a wide variety of data processing technologies for Hadoop – real-time, interactive, and batch. Add EMC Isilon scale-out NAS as integrated data storage for Pivotal HD and you have a shared data repository with multi-protocol support, including HDFS, to service a wide variety of data processing requests. This smells like a Data Lake to me – a general-purpose data storage and processing resource center where Big Data applications can develop and evolve. Add EMC ViPR software defined storage to the mix and you have the smartest Data Lake in town, one that supports additional protocols/hardware and automatically adapts to changing workload demands to optimize application performance.

EMC Hadoop Starter Kit, ViPR Edition, now makes it easier to deploy this ‘smart’ Data Lake with Pivotal HD and other Hadoop distributions such as Cloudera and Hortonworks. Simply download this step-by-step guide and you can quickly deploy a Hadoop or a Big Data analytics environment, configuring Hadoop to utilize ViPR for HDFS, with Isilon hosting the Object/HDFS data service.  Although in this guide Isilon is the storage array that ViPR deploys objects to, other storage platforms are also supported – EMC VNX, NetApp, OpenStack Swift and Amazon S3.

I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, to learn why every organization should use this starter kit to optimize their IT infrastructure for Hadoop deployments.

1.  The original EMC Hadoop Starter Kit released last year was a huge success.  Why did you create ViPR Edition?

EMC and RainStor Optimize Interactive SQL on Hadoop

Pivotal HAWQ was one of the most groundbreaking technologies entering the Hadoop ecosystem last year through its ability to execute complete ANSI SQL on large-scale datasets managed in Pivotal HD. This was great news for SQL users – organizations heavily reliant on SQL applications and common BI tools such as Tableau and MicroStrategy can leverage these investments to access and analyze new data sets managed in Hadoop.

Similarly, RainStor, a leading enterprise database known for its efficient data compression and built-in security, also enables organizations to run ANSI SQL queries against data in Hadoop – highly compressed data.  Due to the reduced footprint from extreme data compression (typically 90%+ less), RainStor enables users to run analytics on Hadoop much more efficiently.  In fact, there are many instances where queries run significantly faster with a reduced footprint plus some filtering capabilities that figure out what not to read.  This allows customers to minimize infrastructure costs and maximize insight for data analysis on larger data sets.
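The arithmetic behind that claim is simple enough to sketch. The snippet below is only a back-of-the-envelope illustration: the 500 TB data set is hypothetical, the 90% figure is the typical reduction quoted above, and both function names are made up for this example. It shows the stored footprint after compression and how much raw data each byte read from storage effectively represents.

```python
def compressed_size_tb(raw_tb, reduction_pct):
    """Storage footprint after shrinking the data by reduction_pct percent."""
    return raw_tb * (100 - reduction_pct) / 100


def io_multiplier(reduction_pct):
    """Raw data represented by each byte read or sent over the network."""
    return 100 / (100 - reduction_pct)


# Hypothetical 500 TB raw data set at the typical 90% reduction.
footprint_tb = compressed_size_tb(500, 90)
multiplier = io_multiplier(90)

print(f"Stored footprint: {footprint_tb} TB")
print(f"Each byte read covers about {multiplier}x as much raw data")
```

At a 90% reduction, 500 TB of raw data occupies 50 TB, so a query scanning the full set moves one tenth of the bytes, before any filtering that skips data entirely.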

Serving some of the largest telecommunications and financial services organizations, RainStor enables customers to readily query and analyze petabytes of data instead of archiving data sets to tape and then having to reload them whenever they are needed for analysis. RainStor chose to partner with EMC Isilon scale-out NAS for its storage layer to manage these petabyte-scale data environments even more efficiently. Using Isilon, the compute and storage for Hadoop workloads are decoupled, enabling organizations to balance CPU and storage capacity optimally as data volumes and the number of queries grow.


Furthermore, not only can you run any Hadoop distribution of your choice with RainStor-Isilon, but you can also run multiple distributions of Hadoop against the same compressed data. For example, a single copy of the data managed in RainStor-Isilon can service Marketing’s Pivotal HD environment, Finance’s Cloudera environment, and HR’s Apache Hadoop environment.

To summarize, running RainStor and Hadoop on EMC Isilon, you achieve:

  • Flexible Architecture Running Hadoop on NAS and DAS together: Companies leverage DAS local storage for hot data where performance is critical and use Isilon for mass data storage. With RainStor’s compression, you efficiently move more data across the network, essentially creating an I/O multiplier.
  • Built-in Security and Reliability: Data is securely stored with built-in encryption and data masking, in addition to user authentication and authorization. EMC Isilon FlexProtect, which carries very little overhead, provides a reliable, highly available Big Data environment.
  • Improved Query Speed: Data is queried using a variety of tools, including standard SQL, BI tools, Hive, Pig, and MapReduce. With built-in filtering, queries speed up by a factor of 2-10X compared to Hive on HDFS/DAS.
  • Compliant WORM Solution: For absolute retention and protection of business critical data, including stringent SEC 17a-4 requirements, you leverage EMC Isilon’s SmartLock in addition to RainStor’s built-in immutable data retention capabilities.

I spoke to Jyothi Swaroop, Director of Product Marketing at RainStor, to learn about the value of deploying EMC Isilon with RainStor and Hadoop.

1.  RainStor is known in the industry as an enterprise database architected for Big Data. Can you please explain how this technology evolved and what needs it addresses in the market?
