Posts Tagged ‘NAS’

Shared Infrastructure for Big Data: Separating Hadoop Compute and Storage

Anant Chintamaneni

The decoupling of compute and storage has been one of the big takeaways and themes for Hadoop in 2015. BlueData has written some blog posts about the topic this year, and many of our new customers have cited this as a key initiative in their organization. And, as indicated in this tweet from Gartner’s Merv Adrian earlier this year, it’s been a major topic of discussion at industry events:

[Image: Merv Adrian (Gartner) tweet at Strata + Hadoop World]

Last week I presented a webinar session with Chris Harrold, CTO for EMC’s Big Data solutions, where we discussed shared infrastructure for Big Data and the opportunity to separate Hadoop compute from storage. We had several hundred people sign up for the webinar, and there was great interaction in the Q&A chat panel throughout the session. This turnout provides additional validation of the interest in this topic – it’s a clear indication that the market is looking for fresh ideas to cross the chasm with Big Data infrastructure.

Here’s a recap of some of the topics we discussed in the webinar (you can also view the on-demand replay here).

Traditional Big Data assumptions = #1 reason for complexity, cost, and stalled projects.  

The herd mentality of believing that the only path to Big Data (and in particular Hadoop) is the way it was deployed at early adopters like Yahoo, Facebook, or LinkedIn has left scorched earth for many an enterprise.

[Image: Thomas Edison quote on finding a way to do it better]

The Big Data ecosystem has made Hadoop synonymous with:

  • Dedicated physical servers (“just get a bunch of commodity servers, load them up with Hadoop, and you can be like a Yahoo or a Facebook”);
  • Hadoop compute and storage on the same physical machine (the buzz word is “data locality” – “you gotta have it otherwise it’s not Hadoop”);
  • Hadoop has to be on direct attached storage [DAS] (“local computation and storage” and “HDFS requires local disks” are traditional Hadoop assumptions).

If you’ve been living by these principles, following them as “the only way” to do Hadoop, and are getting ready to throw in the towel on Hadoop … it’s time to STOP and challenge these fundamental assumptions.

Yes, there is a better way.


Big Data storage innovation

Hadoop can run on containers or VMs. The new reality is that you can use virtual machines or containers as your Hadoop nodes rather than physical servers. This saves you the time of racking, stacking, and networking those physical servers. You don’t need to wait for a new server to be ordered and provisioned, or fight deployment issues caused by whatever was left on a repurposed server before it was handed over to you.

With software-defined infrastructure like virtual machines or containers, you get a pristine, clean environment that enables predictable deployments – while also delivering greater speed and cost savings. During the webinar, Chris highlighted a virtualized Hadoop deployment at Adobe. He explained how they were able to quickly increase the number of Hadoop worker nodes from 32 to 64 to 128 in a matter of days – with significantly better performance than physical servers at a fraction of the cost.

Most data centers are fully virtualized. Why wouldn’t you virtualize Hadoop? As a matter of fact, all of the “Quick Start” options from Hadoop vendors run on a VM or (more recently) on containers (whether local or in the cloud). Companies like Netflix have built an awesome service based on virtualized Hadoop clusters that run in a public cloud. The requirement for on-premises Hadoop to run on a physical server is outdated.

The concept of data locality is overblown. It’s time to finally debunk this myth. Data locality is a silent killer that impedes Hadoop adoption in the enterprise. Copying terabytes of existing enterprise data onto physical servers with local disks, and then having to balance and re-balance the data every time a server fails, is operationally complex and expensive. As a matter of fact, it only gets worse as you scale your clusters up. Internet giants like Yahoo used this approach circa 2005 because those were the days of slow 1Gbps networks.

Today, networks are much faster and 10Gbps networks are commonplace. Studies from U.C. Berkeley AMPLab and newer Hadoop reference architectures have shown that you can get better I/O performance with compute/storage separation. And your organization will benefit from simpler operational models, where you can scale and manage your compute and storage systems independently. Ironically, the dirty little secret is that even with compute/storage co-location, you are not guaranteed data locality in many common Hadoop scenarios. Ask the Big Data team at Facebook and they will tell you that only 30% of their Hadoop tasks run on servers where the data is local.
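To put rough numbers behind the network argument above (the disk and network figures here are typical ballpark values for illustration, not measurements from the webinar), consider how long it takes to move a standard 128 MB HDFS block:

\[
t_{\text{local disk}} \approx \frac{128\ \text{MB}}{100\ \text{MB/s}} \approx 1.3\ \text{s}
\qquad
t_{\text{1 Gbps}} \approx \frac{128\ \text{MB}}{125\ \text{MB/s}} \approx 1.0\ \text{s}
\qquad
t_{\text{10 Gbps}} \approx \frac{128\ \text{MB}}{1{,}250\ \text{MB/s}} \approx 0.1\ \text{s}
\]

On an oversubscribed 1Gbps fabric shared by many concurrent tasks, remote reads really were a constraint; on a 10Gbps network, a remote block can be delivered roughly an order of magnitude faster than a single local spindle can read it, which is why the locality penalty largely disappears.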

HDFS does not require local disks. This is another one of those traditional Hadoop tenets that is no longer valid: local direct attached storage (DAS) is not required for Hadoop. The Hadoop Distributed File System (HDFS) is as much a distributed file system protocol as it is an implementation. Running HDFS on local disks is one such implementation approach, and DAS made sense for internet companies like Yahoo and Facebook – since their primary initial use case was collecting clickstream/log data.

However, most enterprises today have terabytes of Big Data from multiple sources (audio, video, text, etc.) that already reside in shared storage systems such as EMC Isilon. The data protection that enterprise-grade shared storage provides is a key consideration for these enterprises. And the need to move and duplicate this data for Hadoop deployments (with the 3x replication required for traditional DAS-based HDFS) can be a significant stumbling block.

BlueData and EMC Isilon enable an HDFS interface that can accelerate time-to-insights by leveraging your data in place and bringing it to the Hadoop compute processes – rather than waiting weeks or months to copy the data onto local disks. If you’re interested in more on this topic, you can refer to the session on HDFS virtualization that my colleague Tom Phelan (co-founder and chief architect at BlueData) presented at Strata + Hadoop World in New York this fall.
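To make the point that HDFS is a protocol (and not a statement about where the disks live) concrete, here is a minimal sketch using the standard Hadoop FileSystem API. The compute-side code is identical whether the hdfs:// endpoint is a traditional NameNode in front of local disks or an HDFS interface exposed by a shared storage system; the hostname and path below are placeholders, not values from the webinar.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsProtocolExample {
    public static void main(String[] args) throws Exception {
        // Point the client at whichever system answers the HDFS protocol.
        // This hostname is a placeholder; it could be a classic NameNode or a
        // shared-storage HDFS endpoint (for example, an Isilon SmartConnect name).
        URI endpoint = URI.create("hdfs://shared-storage.example.com:8020");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(endpoint, conf);

        // The job code reads data the same way in either case: HDFS here is a
        // wire protocol and an API, not a requirement for local disks.
        Path path = new Path("/data/clickstream/part-00000");
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[4096];
            int bytesRead = in.read(buffer);
            System.out.println("Read " + bytesRead + " bytes over the HDFS protocol");
        }
    }
}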

What about performance with Hadoop compute/storage separation?

Any time a new infrastructure approach is introduced for applications (and many of you have seen this movie before, when virtualization was introduced in the early 2000s), the number one question asked is “What about performance?” To address this question during our recent webinar session, Chris and I shared some detailed performance data from customer deployments as well as from independent studies.

In particular, I want to specifically highlight some performance results from a BlueData customer that compared physical Hadoop performance against a virtualized Hadoop cluster (running on the BlueData software platform) using the industry standard Intel HiBench performance benchmark:

[Chart: Performance with Virtualized Hadoop]

  • Enhanced DFSIO: Generates a large number of reads and writes (read-specific or write-specific)
  • TeraSort: Sort dataset generated by TeraGen (balanced read-write)

Testing by this customer (a U.S. Federal Agency lab) revealed that, across the HiBench micro-workloads investigated, the BlueData EPIC software platform enabled performance in a virtualized environment that is comparable or superior to that on bare-metal. As indicated in the chart above, Hadoop jobs that were I/O intensive such as Enhanced DFSIO were shown to run 50-60% faster; balanced read-write operations were shown to run almost 200% faster.

While the numbers may vary for other customers based on hardware specifications (e.g. CPU, memory, disk types), the use of BlueData’s IOBoost (application-aware caching) technology – combined with the use of multiple virtual clusters per physical server – contributed to the significant performance advantage over physical Hadoop.
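For readers who want to run a similar comparison in their own environment, here is a minimal sketch of driving the TeraGen/TeraSort pair programmatically. It assumes the example classes shipped in the Hadoop 2.x hadoop-mapreduce-examples artifact; the row count and paths are illustrative only and are not the settings used in the customer benchmark.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class TeraSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Generate 10 million 100-byte rows (roughly 1 GB of input data).
        int genStatus = ToolRunner.run(conf, new TeraGen(),
                new String[] {"10000000", "/benchmarks/teragen"});

        // Sort the generated dataset: the balanced read/write workload
        // referenced in the HiBench results above.
        int sortStatus = ToolRunner.run(conf, new TeraSort(),
                new String[] {"/benchmarks/teragen", "/benchmarks/terasort"});

        System.exit(genStatus != 0 ? genStatus : sortStatus);
    }
}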

Performance with DAS

This same U.S. Federal Agency lab ran the same HiBench benchmark to compare the performance of a DAS-based HDFS system against enterprise-grade NFS (with EMC Isilon), utilizing the BlueData DataTap technology. DataTap brings data from any file system (e.g. HDFS on DAS, NFS, object storage) to Hadoop compute (e.g. MapReduce) by virtualizing the HDFS protocol.

Across the board, the tests showed that enterprise-grade NFS delivered performance superior to the DAS-based HDFS system. This comparison validates that the network is not the bottleneck (a 10Gbps network was used) and that the 3x replication in a DAS-based HDFS adds overhead.
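A simplified way to see where that overhead comes from (this sketch ignores checksums, pipeline acknowledgements, and the shared storage system’s own protection mechanics, which vary by product): ingesting a dataset of size D into 3x-replicated, DAS-based HDFS generates roughly three times the disk writes and at least two additional copies over the network as each block is forwarded through the replication pipeline, whereas only a single logical copy is sent to the shared storage system.

\[
\text{DAS-based HDFS (3x replication):}\quad W_{\text{disk}} \approx 3D,\qquad W_{\text{network}} \gtrsim 2D
\]
\[
\text{Shared storage (NFS or HDFS interface):}\quad W_{\text{network}} \approx D,\qquad W_{\text{disk}} \approx (1+\varepsilon)\,D
\]

Here \(\varepsilon\) is the storage system’s protection overhead, which is typically far smaller than the two extra full copies required by 3x replication.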

A new approach will add value at every stage of the Big Data journey.

Big Data is a journey and many of you continue to persevere through it, while many others are just getting started.

Irrespective of where you are in this journey, the new approach to Hadoop that we described in the webinar session (e.g. leveraging containers and virtualization, separating compute and storage, using shared instead of local storage) will dramatically accelerate outcomes and deliver significant additional Big Data value.

[Image: Added value for virtualized Hadoop]

For those of you just getting started with Big Data and in the prototyping phase, shared infrastructure built on the foundation of compute/storage separation will enable different teams to evaluate different Big Data ecosystem products and build use case prototypes – all while sharing a centralized data set versus copying and moving data around. And by running Hadoop on Docker containers, your data scientists and developers can spin up instant virtual clusters on-demand with self-service.

If you have a specific use case in production, a shared infrastructure model will allow you to simplify your dev/test environment and eliminate the need to duplicate data from production. You can simplify management, reduce costs, and improve utilization as you begin to scale your deployment.

And finally, if you need a true multi-tenant environment for your enterprise-scale Hadoop deployment, there is simply no alternative to using some form of virtualization and shared storage to deliver the agility and efficiency of a Big Data-as-a-Service experience on-premises.

Key takeaways and next steps for your Big Data journey.

In closing, Chris and I shared some final thoughts and takeaways at the end of the webinar session:

  • Big Data is a journey: future-proof your infrastructure
  • Compute and storage separation enables greater agility for all Big Data stakeholders
  • Don’t make your infrastructure decisions based on the “data locality” myth

We expect these trends to continue into next year and beyond: we’ll see more Big Data deployments leveraging shared storage, more deployments using containers and virtual machines, and more enterprises decoupling Hadoop compute from storage. As you plan your Big Data strategy for 2016, I’d encourage you to challenge the traditional assumptions and embrace a new approach to Hadoop.

Finally, I want to personally thank everyone who made time to attend our recent webinar session. You can view the on-demand replay. Feel free to ping me (@AnantCman) or Chris Harrold (@CHarrold303) on Twitter if you have any additional questions.

EMC and RainStor Optimize Interactive SQL on Hadoop

Mona Patel

Senior Manager, Big Data Solutions Marketing at EMC
Mona Patel is a Senior Manager for Big Data Marketing at EMC Corporation. With over 15 years of working with data at The Department of Water and Power, Air Touch Communications, Oracle, and MicroStrategy, Mona decided to grow her career at EMC, a leader in Big Data.

Pivotal HAWQ was one of the most groundbreaking technologies to enter the Hadoop ecosystem last year, thanks to its ability to execute complete ANSI SQL on large-scale datasets managed in Pivotal HD. This was great news for SQL users – organizations heavily reliant on SQL applications and common BI tools such as Tableau and MicroStrategy can leverage these investments to access and analyze new data sets managed in Hadoop.
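As a concrete illustration of what this means for SQL users: because HAWQ is derived from the PostgreSQL/Greenplum stack and speaks a PostgreSQL-compatible wire protocol, existing JDBC-based applications can query Hadoop-resident data the way they would query any relational database. The sketch below makes that assumption and uses the standard PostgreSQL JDBC driver; the host, database, credentials, and table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqQueryExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for a HAWQ master host.
        String url = "jdbc:postgresql://hawq-master.example.com:5432/analytics";

        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "changeme");
             Statement stmt = conn.createStatement();
             // Ordinary ANSI SQL against a table whose data lives in Hadoop.
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, COUNT(*) AS events " +
                 "FROM clickstream GROUP BY region ORDER BY events DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getLong("events"));
            }
        }
    }
}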

Similarly, RainStor, a leading enterprise database known for its efficient data compression and built-in security, also enables organizations to run ANSI SQL queries against data in Hadoop – highly compressed data. Due to the reduced footprint from extreme data compression (typically 90%+ less), RainStor enables users to run analytics on Hadoop much more efficiently. In fact, there are many instances where queries run significantly faster thanks to the reduced footprint, plus filtering capabilities that figure out what not to read. This allows customers to minimize infrastructure costs and maximize insight when analyzing larger data sets.
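To make the footprint argument concrete (illustrative arithmetic only, assuming the roughly 90% compression figure quoted above and the standard 3x replication of DAS-based HDFS): 1 PB of raw source data compresses to about 100 TB in RainStor, while the same 1 PB stored uncompressed in a conventional DAS-based HDFS cluster would consume about 3 PB of raw disk once replicated.

\[
\underbrace{1\ \text{PB} \times (1 - 0.9)}_{\text{RainStor, compressed}} = 100\ \text{TB}
\qquad\text{vs.}\qquad
\underbrace{1\ \text{PB} \times 3}_{\text{DAS-based HDFS, 3x replication}} = 3\ \text{PB}
\]

That is roughly a 30x difference in raw capacity, before accounting for Isilon’s own, comparatively modest, protection overhead on the compressed copy.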

Serving some of the largest telecommunications and financial services organizations, RainStor enables customers to readily query and analyze petabytes of data instead of archiving data sets to tape and then having to reload them whenever they are needed for analysis. RainStor chose to partner with EMC Isilon scale-out NAS for its storage layer to manage these petabyte-scale data environments even more efficiently. Using Isilon, compute and storage for Hadoop workloads are decoupled, enabling organizations to balance CPU and storage capacity optimally as data volumes and query counts grow.

[Image: RainStor]

Furthermore, not only can organizations run any Hadoop distribution of choice with RainStor-Isilon, they can also run multiple distributions of Hadoop against the same compressed data. For example, a single copy of the data managed in RainStor-Isilon can serve Marketing’s Pivotal HD environment, Finance’s Cloudera environment, and HR’s Apache Hadoop environment.

To summarize, by running RainStor and Hadoop on EMC Isilon, you achieve:

  • Flexible Architecture Running Hadoop on NAS and DAS Together: Companies leverage DAS local storage for hot data where performance is critical and use Isilon for mass data storage. With RainStor’s compression, you efficiently move more data across the network, essentially creating an I/O multiplier.
  • Built-in Security and Reliability: Data is securely stored with built-in encryption and data masking, in addition to user authentication and authorization. With very little overhead, you also benefit from EMC Isilon FlexProtect, which provides a reliable, highly available Big Data environment.
  • Improved Query Speed: Data can be queried using a variety of tools, including standard SQL, BI tools, Hive, Pig, and MapReduce. With built-in filtering, queries speed up by a factor of 2-10x compared to Hive on HDFS/DAS.
  • Compliant WORM Solution: For absolute retention and protection of business-critical data, including stringent SEC 17a-4 requirements, you can leverage EMC Isilon SmartLock in addition to RainStor’s built-in immutable data retention capabilities.

I spoke with Jyothi Swaroop, Director of Product Marketing at RainStor, about the value of deploying EMC Isilon with RainStor and Hadoop.

1.  RainStor is known in the industry as an enterprise database architected for Big Data. Can you please explain how this technology evolved and what needs it addresses in the market?

(more…)

Want to Explore Hadoop, But No Tour Guide?

Mona Patel

Senior Manager, Big Data Solutions Marketing at EMC

Are you a VMware vSphere customer? Do you also own EMC Isilon? If you said yes to both, I have great news for you – you have all the ingredients for the EMC Hadoop Starter Kit (HSK). In just a few short hours you can spin up a virtualized Hadoop cluster by downloading the HSK step-by-step guide. You can also watch a demo of HSK being used to deploy Hadoop.

Now you don’t have to imagine what Hadoop tastes like because this starter kit is designed to help you execute and discover the potential of Hadoop within your organization. Whether you are new to Hadoop or an experienced Hadoop user, you will want to take advantage of this turnkey solution for the following reasons:

-Rapid provisioning – From the creation of virtual Hadoop nodes to starting up the Hadoop services on the cluster, much of the Hadoop cluster deployment can be automated, requiring little expertise on the user’s part.

-High availability – HA protection can be provided through the virtualization platform to protect the single points of failure in the Hadoop system, such as NameNode and JobTracker Virtual Machines.

-Elasticity – Hadoop capacity can be scaled up and down on demand in a virtual environment, thus allowing the same physical infrastructure to be shared among Hadoop and other applications.

-Multi-tenancy – Different tenants running Hadoop can be isolated in separate VMs, providing stronger VM-grade resource and security isolation.

-Portability – Use any Hadoop distribution throughout the Big Data application lifecycle with zero data migration – Apache Open Source, Pivotal HD, Cloudera, Hortonworks.

I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, about why every organization that uses VMware vSphere and EMC Isilon should use this starter kit for Big Data projects.

1.  Why did you create the starter kit and what are the best use cases for this starter kit?

(more…)

Converging Big Data and The Enterprise with EMC Isilon

Mona Patel

Senior Manager, Big Data Solutions Marketing at EMC

Experimenting with and gaining insight from Big Data first and foremost requires collecting massive amounts of unstructured, file-based data. However, without efficient storage, IT will be forced to expire these growing mounds of Big Data rather than retain them. Fortunately, Isilon OneFS is an intelligent operating system for scale-out NAS storage, designed to match data sets with the appropriate tier of storage based on business value and thereby lower the total cost of Big Data. Not only that, the OneFS 7.0 release scales to over 15PB and delivers a 25% increase in single file system throughput.

As with any product release, there are so many new features to digest and understand that some go unnoticed.  Fortunately, Nick Kirsch, Vice President and Chief Technology Officer at EMC Isilon, summarizes some of the key features of Isilon OneFS 7.0 and why they are incredibly important to drive Big Data applications.

1) You have been at Isilon for 10 years focusing on bringing innovative, high quality products to market. What keeps you here?

(more…)
