
Posts Tagged ‘big data’

Shared Infrastructure for Big Data: Separating Hadoop Compute and Storage

Anant Chintamaneni

The decoupling of compute and storage for Hadoop has been one of the big takeaways and themes for Hadoop in 2015. BlueData has written some blog posts about the topic this year, and many of our new customers have cited this as a key initiative in their organization. And, as indicated in this tweet from Gartner’s Merv Adrian earlier this year, it’s been a major topic of discussion at industry events:

[Image: tweet from Gartner’s Merv Adrian at Strata + Hadoop World]

Last week I presented a webinar session with Chris Harrold, CTO for EMC’s Big Data solutions, where we discussed shared infrastructure for Big Data and the opportunity to separate Hadoop compute from storage. We had several hundred people sign up for the webinar, and there was great interaction in the Q&A chat panel throughout the session. This turnout provides additional validation of the interest in this topic – it’s a clear indication that the market is looking for fresh ideas to cross the chasm with Big Data infrastructure.

Here’s a recap of some of the topics we discussed in the webinar (you can also view the on-demand replay here).

Traditional Big Data assumptions = #1 reason for complexity, cost, and stalled projects.  

The herd mentality of believing that the only path to Big Data (and in particular Hadoop) is the way it was deployed at early adopters like Yahoo, Facebook, or LinkedIn has left scorched earth for many an enterprise.

[Image: Thomas Edison quote about finding a way to do it better]

The Big Data ecosystem has made Hadoop synonymous with:

  • Dedicated physical servers (“just get a bunch of commodity servers, load them up with Hadoop, and you can be like a Yahoo or a Facebook”);
  • Hadoop compute and storage on the same physical machine (the buzzword is “data locality” – “you gotta have it, otherwise it’s not Hadoop”);
  • Hadoop on direct-attached storage (DAS) (“local computation and storage” and “HDFS requires local disks” are traditional Hadoop assumptions).

If you’ve been living by these principles, following them as “the only way” to do Hadoop, and are getting ready to throw in the towel on Hadoop … it’s time to STOP and challenge these fundamental assumptions.

Yes, there is a better way.


Big Data storage innovation

Hadoop can run on containers or VMs. The new reality is that you can use virtual machines or containers as your Hadoop nodes rather than physical servers. This saves you the time spent racking, stacking, and networking physical servers. You don’t need to wait for a new server to be ordered and provisioned, or fight deployment issues caused by whatever was left on a repurposed server before it was handed over to you.

With software-defined infrastructure like virtual machines or containers, you get a pristine and clean environment that enables predictable deployments – while also delivering greater speed and cost savings. During the webinar, Chris highlighted a virtualized Hadoop deployment at Adobe. He explained how they were able to quickly increase the number of Hadoop worker nodes from 32 to 64 to 128 in a matter of days – with significantly better performance than physical servers, at a fraction of the cost.

Most data centers are fully virtualized. Why wouldn’t you virtualize Hadoop? As a matter of fact, all of the “Quick Start” options from Hadoop vendors run on a VM or (more recently) on containers (whether local or in the cloud). Companies like Netflix have built an awesome service based on virtualized Hadoop clusters that run in a public cloud. The requirement for on-premises Hadoop to run on a physical server is outdated.
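To make this concrete, here’s a minimal sketch (my own illustration, not the BlueData platform) of standing up a disposable single-node Hadoop environment in a Docker container. It assumes Docker is installed and uses the community sequenceiq/hadoop-docker image that was popular for single-node experiments; the image name and in-container paths are assumptions you should adapt.

```python
# Minimal sketch: spin up a disposable single-node Hadoop environment in a
# container. Assumes Docker is installed; the community sequenceiq image
# and its paths are assumptions -- substitute your own image if needed.
import subprocess

IMAGE = "sequenceiq/hadoop-docker:2.7.1"  # community single-node image (assumed)

# Pull the image and start a container with the bundled bootstrap script
# in daemon mode, so the Hadoop services keep running in the background.
subprocess.run(["docker", "pull", IMAGE], check=True)
subprocess.run(
    ["docker", "run", "-d", "--name", "hadoop-sandbox",
     IMAGE, "/etc/bootstrap.sh", "-d"],
    check=True,
)

# Smoke test: list the HDFS root directory inside the container.
subprocess.run(
    ["docker", "exec", "hadoop-sandbox",
     "/usr/local/hadoop/bin/hdfs", "dfs", "-ls", "/"],
    check=True,
)
```

Tearing the whole environment down is a single `docker rm -f hadoop-sandbox` – compare that with decommissioning a physical server.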

The concept of data locality is overblown. It’s time to finally debunk this myth. Data locality is a silent killer that impedes Hadoop adoption in the enterprise. Copying terabytes of existing enterprise data onto physical servers with local disks, and then having to balance and re-balance that data every time a server fails, is operationally complex and expensive. And it only gets worse as you scale your clusters. Internet giants like Yahoo used this approach circa 2005 because those were the days of slow 1Gbps networks.

Today, networks are much faster and 10Gbps networks are commonplace. Studies from U.C. Berkeley AMPLab and newer Hadoop reference architectures have shown that you can get better I/O performance with compute/storage separation. And your organization will benefit from simpler operational models, where you can scale and manage your compute and storage systems independently. Ironically, the dirty little secret is that even with compute/storage co-location, you are not guaranteed data locality in many common Hadoop scenarios. Ask the Big Data team at Facebook and they will tell you that only 30% of their Hadoop tasks run on servers where the data is local.
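To see why the bottleneck has moved, here’s some back-of-envelope math (illustrative numbers of my own, not figures from the webinar or the studies cited above) comparing how long it takes to stream a 1TB dataset over a 10Gbps link versus reading it from a single local disk:

```python
# Back-of-envelope sketch of the "data locality" math. All numbers below
# are illustrative assumptions, not benchmark results.
DATASET_GB = 1000                    # a 1 TB working set

network_gbps = 10                    # 10 Gigabit Ethernet
network_gb_per_s = network_gbps / 8  # ~1.25 GB/s per link

disk_mb_per_s = 150                  # rough sequential throughput of one SATA disk
disk_gb_per_s = disk_mb_per_s / 1000

print(f"10Gbps network:    {DATASET_GB / network_gb_per_s / 60:.1f} min to stream 1 TB")
print(f"Single local disk: {DATASET_GB / disk_gb_per_s / 60:.1f} min to read 1 TB")
# ~13 minutes over the wire vs. ~111 minutes from one local spindle:
# with modern networks, the remote read is no longer the obvious bottleneck.
```

The exact numbers vary with hardware, but the shape of the result is why a design assumption from the 1Gbps era deserves to be challenged.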

HDFS does not require local disks. This is another one of those traditional Hadoop tenets that is no longer valid: local direct-attached storage (DAS) is not required for Hadoop. The Hadoop Distributed File System (HDFS) is as much a distributed file system protocol as it is an implementation. Running HDFS on local disks is one such implementation approach, and DAS made sense for internet companies like Yahoo and Facebook – since their primary initial use case was collecting clickstream and log data.

However, most enterprises today have terabytes of Big Data from multiple sources (audio, video, text, etc.) that already reside in shared storage systems such as EMC Isilon. The data protection that enterprise-grade shared storage provides is a key consideration for these enterprises. And the need to move and duplicate this data for Hadoop deployments (with the 3x replication required for traditional DAS-based HDFS) can be a significant stumbling block.
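A quick sketch of the capacity math behind that stumbling block (the shared-storage protection overhead factor below is my own assumption for illustration, not an Isilon specification):

```python
# Raw-capacity overhead of 3x-replicated DAS HDFS vs. shared storage.
# The 1.3x shared-storage protection overhead is an assumed, illustrative
# figure for an erasure-coded system, not a vendor spec.
logical_tb = 100                 # data you actually want to analyze

das_hdfs_raw = logical_tb * 3.0  # HDFS default replication factor of 3
shared_raw = logical_tb * 1.3    # assumed protection overhead on shared storage

print(f"DAS-based HDFS raw capacity: {das_hdfs_raw:.0f} TB")
print(f"Shared storage raw capacity: {shared_raw:.0f} TB")
# And with DAS you must first *copy* the 100 TB into the cluster;
# with in-place access on shared storage, the data never moves at all.
```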

BlueData and EMC Isilon enable an HDFS interface that can accelerate time-to-insight by leveraging your data in place and bringing it to the Hadoop compute processes – rather than waiting weeks or months to copy the data onto local disks. If you’re interested in more on this topic, you can refer to the session on HDFS virtualization that my colleague Tom Phelan (Co-Founder and Chief Architect at BlueData) presented at Strata + Hadoop World in New York this fall.
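As a minimal illustration of “HDFS as a protocol,” here’s a generic sketch using the community Python `hdfs` package (a WebHDFS client) – not BlueData’s implementation, and the endpoint, user, and paths are all hypothetical. The point is that client code is identical whether the bytes live on local DAS, an Isilon cluster, or a virtualized HDFS layer.

```python
# Sketch: a client reads data through an HDFS-compatible (WebHDFS) interface
# without caring what physically stores it. Uses the community `hdfs`
# package; the endpoint, user, and paths below are hypothetical.
from hdfs import InsecureClient

client = InsecureClient("http://hdfs-endpoint.example.com:50070", user="analyst")

# List a directory and read the first kilobyte of a file -- the same code
# whether the storage behind the protocol is DAS or enterprise shared storage.
print(client.list("/data/clickstream"))
with client.read("/data/clickstream/part-00000", length=1024) as reader:
    first_chunk = reader.read()
print(first_chunk[:80])
```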

What about performance with Hadoop compute/storage separation?

Any time a new infrastructure approach is introduced for applications (and many of you have seen this movie before, when virtualization was introduced in the early 2000s), the number one question asked is “What about performance?” To address this question during our recent webinar session, Chris and I shared detailed performance data from customer deployments as well as from independent studies.

In particular, I want to highlight some performance results from a BlueData customer that compared physical Hadoop performance against a virtualized Hadoop cluster (running on the BlueData software platform) using the industry-standard Intel HiBench performance benchmark:

[Chart: virtualized Hadoop vs. bare-metal performance on HiBench workloads]

  • Enhanced DFSIO: Generates a large number of reads and writes (read-specific or write-specific)
  • TeraSort: Sorts the dataset generated by TeraGen (balanced read/write) – see the sketch below for a way to try a similar workload yourself
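For those who want to get a feel for a balanced read/write workload on their own cluster, here’s a rough sketch that drives TeraGen and TeraSort from the stock Hadoop examples jar (not Intel HiBench itself). It assumes `hadoop` is on your PATH and HADOOP_HOME is set; the row count and HDFS paths are arbitrary.

```python
# Sketch: run TeraGen + TeraSort from the stock MapReduce examples jar and
# time the result. Assumes `hadoop` is on PATH and HADOOP_HOME is set;
# paths and sizes below are arbitrary illustrations.
import glob
import os
import subprocess
import time

examples_jar = glob.glob(os.path.expandvars(
    "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar"))[0]

ROWS = 10 * 1000 * 1000  # 100-byte rows => roughly 1 GB of input

start = time.time()
subprocess.run(["hadoop", "jar", examples_jar, "teragen",
                str(ROWS), "/bench/in"], check=True)
subprocess.run(["hadoop", "jar", examples_jar, "terasort",
                "/bench/in", "/bench/out"], check=True)
print(f"teragen + terasort took {time.time() - start:.0f}s")
```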

Testing by this customer (a U.S. Federal Agency lab) revealed that, across the HiBench micro-workloads investigated, the BlueData EPIC software platform enabled performance in a virtualized environment that is comparable or superior to that on bare-metal. As indicated in the chart above, Hadoop jobs that were I/O intensive such as Enhanced DFSIO were shown to run 50-60% faster; balanced read-write operations were shown to run almost 200% faster.

While the numbers may vary for other customers based on hardware specifications (e.g. CPU, memory, disk types), the use of BlueData’s IOBoost (application-aware caching) technology – combined with the use of multiple virtual clusters per physical server – contributed to the significant performance advantage over physical Hadoop.

Performance with DAS vs. enterprise NFS

This same U.S. Federal Agency lab executed the same HiBench benchmark to compare the performance of a DAS-based HDFS system against enterprise-grade NFS (with EMC Isilon), utilizing the BlueData DataTap technology. DataTap brings data from any file system (e.g. DAS-based HDFS, NFS, object storage) to Hadoop compute (e.g. MapReduce) by virtualizing the HDFS protocol.

Across the board, the tests showed that enterprise-grade NFS delivered superior performance compared to the DAS-based HDFS system. This comparison validates that the network is not the bottleneck (a 10Gbps network was used) and that the 3x replication in DAS-based HDFS adds overhead.

A new approach will add value at every stage of the Big Data journey.

Big Data is a journey and many of you continue to persevere through it, while many others are just getting started.

Irrespective of where you are in this journey, the new approach to Hadoop that we described in the webinar session (e.g. leveraging containers and virtualization, separating compute and storage, using shared instead of local storage) will dramatically accelerate outcomes and deliver significant additional Big Data value.

[Image: added value of virtualized Hadoop at every stage of the Big Data journey]

For those of you just getting started with Big Data and in the prototyping phase, shared infrastructure built on the foundation of compute/storage separation will enable different teams to evaluate different Big Data ecosystem products and build use case prototypes – all while sharing a centralized data set versus copying and moving data around. And by running Hadoop on Docker containers, your data scientists and developers can spin up instant virtual clusters on-demand with self-service.

If you have a specific use case in production, a shared infrastructure model will allow you to simplify your dev/test environment and eliminate the need to duplicate data from production. You can simplify management, reduce costs, and improve utilization as you begin to scale your deployment.

And finally, if you need a true multi-tenant environment for your enterprise-scale Hadoop deployment, there is simply no alternative to using some form of virtualization and shared storage to deliver the agility and efficiency of a Big Data-as-a-Service experience on-premises.

Key takeaways and next steps for your Big Data journey.

In closing, Chris and I shared some final thoughts and takeaways at the end of the webinar session:

  • Big Data is a journey: future-proof your infrastructure
  • Compute and storage separation enables greater agility for all Big Data stakeholders
  • Don’t make your infrastructure decisions based on the “data locality” myth

We expect these trends to continue into next year and beyond: we’ll see more Big Data deployments leveraging shared storage, more deployments using containers and virtual machines, and more enterprises decoupling Hadoop compute from storage. As you plan your Big Data strategy for 2016, I’d encourage you to challenge the traditional assumptions and embrace a new approach to Hadoop.

Finally, I want to personally thank everyone who made time to attend our recent webinar session. You can view the on-demand replay. Feel free to ping me (@AnantCman) or Chris Harrold (@CHarrold303) on Twitter if you have any additional questions.

The Digital Revolution Pecha Kucha Style

Arseny Chernov

Impressions from the inaugural Strata+Hadoop World Singapore 2015

There’s never a second chance to make a first impression, right? Well, the first session at the inaugural O’Reilly Strata+Hadoop World Singapore conference in my hometown left a remarkable impression on me.

In the evening of December 1st, 2015, I jumped out of a cab in the heart of the undeniably smartest city on the planet and whizzed by the 3-storey tall, 60-meter (200 feet) wide “Big Picture” screen at Suntec Convention Centre – the largest high-definition TV wall in the world:

[Photo: the Suntec “Big Picture” digital TV wall, Singapore]

I picked up my EMC Exhibitor Badge and rushed straight into the only conference session of that day, the pre-keynote “PechaKucha Night”.

Wait, went straight into… what?

 

(more…)

Can Kids Give Us A Lesson Or Two On Big Data Analytics?

Mona Patel

Senior Manager, Big Data Solutions Marketing at EMC
Mona Patel is a Senior Manager for Big Data Marketing at EMC Corporation. With over 15 years of experience working with data at the Department of Water and Power, AirTouch Communications, Oracle, and MicroStrategy, Mona decided to grow her career at EMC, a leader in Big Data.


Math scores across U.S. grade schools have dropped this year, according to newly published NAEP 2015 test scores. According to the STEM (Science, Technology, Engineering, and Mathematics) Education Coalition, U.S. 15-year-olds ranked 21st in science test scores among 34 developed nations. These are dismal stats that have many concerned, since STEM education plays a critical role in U.S. competitiveness and future economic prosperity.

The good news is that by 2020, the demand for STEM professionals will add over 1 million STEM jobs to the U.S. workforce. STEM jobs offer higher job security and higher yearly income than other fields. Even better, in STEM occupations job postings outnumber applications by 1.9 to 1. Clearly, getting our kids interested in and motivated about STEM careers makes sense from many angles.

With so many engineers, data scientists, and technologists of our own, EMC is particularly passionate about and committed to the STEM movement, and we wanted to help.

How can we motivate kids to take an interest in STEM, and how can we measure the impact of EMC’s STEM initiative? This was the challenge EMC Presales took on – and met with great success and personal reward. For example, in just one event connecting EMC with approximately 300 seventh and eighth graders on topics such as big data analytics, interest in math careers increased by 7%. In fact, the more excited the kids became and the more they used their imagination about the possibilities of big data analytics, the more questions they asked – giving EMC a few lessons to learn ourselves on the topic. Watch the video below to learn more.

I spoke with David Dietrich, Director of Technical Marketing for Big Data Solutions at EMC, to understand how the topic of big data analytics was able to change the sentiment of STEM, and in turn, how utilizing big data analytics was able to measure the effectiveness of STEM.

Q: David, why is STEM important and why did EMC become involved?

(more…)

Break the cycle of deploying unwieldy Hadoop infrastructure

Chris Harrold

CTO, Big Data Solutions at EMC
Chris is responsible for the development of large-scale analytics solutions for EMC customers around emerging analytics platform technologies. Currently, he is focused on EMC Business Data Lake Solutions and delivering this solution to key EMC customer accounts.

We are in a new data-driven age. With the rise in adoption of big data analytics as a decision-making tool comes the need to accelerate time-to-insights and deliver faster innovation informed by these new data-driven insights.

You know what? That’s a lot of mumbo-jumbo. Let’s boil it down to the real issue for IT: the tools that analysts and data science professionals need were not really designed to be enterprise-friendly, and they can be unwieldy to deploy and manage. Specifically, I’m talking about Hadoop. Anything that requires the provisioning and configuration of a multitude of identical physical servers is always going to be the enemy of speed and reliability – more so when those servers operate as a stand-alone, single-instance solution, without any link to the rest of the IT ecosystem (the whole point of shared nothing). Shared nothing may work for experimentation, but it is a terrible thing to build a business on and to support as an IT operations person.

How do I know this? Because I have been that guy for 25 years!

In order to bridge the gap between data science experimentation and IT operational stability, new approaches are needed to provide operational resiliency without compromising the ability to rapidly deploy new analytical tools and solutions. This speed in deployment is essential to support the needs of developers and data scientists. But the complexity and unwieldy nature of traditional Hadoop infrastructure is a major barrier to success for big data analytics projects.

Consider these questions and see if they sound familiar:

(more…)
