Posts Tagged ‘big data’

Revealing the secret to speed and flexibility for data analytics

William Geller

Data Analytics Product Marketing at Dell EMC
William Geller has been involved in new technology and data science for over 15 years, with experience launching and marketing new products for both startups and enterprises around the world. William is the Principal Product Marketing lead for Data Analytics in the Solutions Marketing division of CPSD. Prior to joining Dell EMC, he worked for numerous startups in healthcare IT, social network analytics, and cybersecurity. He holds a VMware VCP 4.0 accreditation, a BS in Electrical Engineering from Drexel University, and an MBA from Babson College. You can find him on Twitter at @williamgeller.

Most companies recognize that data analytics offers opportunities to raise productivity, improve decision making, and gain competitive advantage. Unfortunately, the majority of initiatives fail to move beyond the experimental stage, or analytic insights are never operationalized back into the business as intended. The causes range from inaccessible, siloed data, to the time invested in continually gathering data before performing analytics, to long lead times for resources from IT. Recently, Enterprise Strategy Group (ESG) reviewed the Dell EMC Analytic Insights Module, which is engineered to smooth out these friction points in the data analytics lifecycle. It's delivered on Dell EMC Native Hybrid Cloud, combining a self-service data analytics experience with cloud-native application development. (more…)

Getting started on your data analytics journey

Jean Marie Martini

Director, Data Analytics Portfolio Messaging and Strategy at Dell EMC
Jean Marie Martini is a Senior Consultant for messaging and strategy across the data analytics portfolio at Dell EMC. Martini has been involved in data analytics for over ten years and today focuses on communicating the value of Dell EMC solutions that enable customers to begin their data analytics journey, remain competitive along the way, and drive the insights that will transform their organizations into data-driven businesses. You can follow Martini on Twitter @martinij.

The data analytics journey begins with an understanding of use cases and solutions that can help an organization unlock the value of its data. This is the focus of two new Dell EMC resources.

In the course of my work with the Dell EMC data analytics program, I often talk with customers who are focused on extracting value from enormous amounts of data. That was certainly the case at the recent Strata + Hadoop World conference in San Jose. The conference center was filled with people looking for innovative ways to unlock the business value embedded in the data they capture from the Internet of Things, social media, corporate systems and countless other sources.

While each organization comes at the problem from a different industry, everyone shares the goal of using data analytics to gain business insights and capitalize on the digital transformation that is under way. People understand that their enterprise data warehouses and data lakes hold the keys to closer customer relationships, operational efficiencies and competitive advantages. The question then becomes, "How do you get there?"

This topic is explored in two new Dell EMC resources for organizations looking to capitalize on data for analytics. One of these assets is a white paper that explores how companies in different industries are turning to data analytics, data lakes, and the Apache™ Hadoop® platform for data collection, management and analysis.

In this paper, titled "Leveraging Data Analytics to Gain Competitive Advantage in Your Industry," we highlight examples of diverse industry-specific and cross-industry use cases for data analytics solutions. These use cases are based on the collective experiences of Dell EMC and our partners Intel, Cloudera, and Hortonworks.

The second asset is a brochure that drills down into solutions for organizations that are ready to begin their data analytics journeys. This brochure, titled “Power New Possibilities: Solutions for Your Data Analytics Journey,” explains the capabilities and benefits of the Dell EMC options for organizations on this path.

As for those options, Dell EMC has your needs covered no matter where you are in your data analytics journey. These offerings, summarized in the brochure, include solutions for getting started with Hadoop, building a data lake for analytics, extending your analytics capabilities, and enabling and accelerating your journey.

Regardless of the path you’re on, Dell EMC can help your organization move forward with confidence. We can help you gain hands-on experience across many solutions, from initial briefings through a proof of concept and into a full production environment that leverages validated solutions and proven reference architectures.

We can also help you with the essential initial steps of aligning the goals of IT and the business to address a use case that will deliver measurable business value. For example, you might choose a marketing analytics solution that uses predictive modeling to help your sales team target the right customer at the right time. That's a use case that we've put into action at Dell EMC. (Read the case study.)

While different organizations will target different needs, the key is to begin with a use case that will showcase the power of data analytics and generate measurable results — the return on information. From that starting point, you can grow over time into an organization that is truly data-driven and poised for success in the digital economy.

For a closer look at the ways that Dell EMC can help your organization unlock the value of your data, visit DellEMC.com/BigData.

Shared Infrastructure for Big Data: Separating Hadoop Compute and Storage

Anant Chintamaneni


The decoupling of compute and storage for Hadoop has been one of the big takeaways and themes for Hadoop in 2015. BlueData has written several blog posts about the topic this year, and many of our new customers have cited it as a key initiative in their organizations. And, as indicated in this tweet from Gartner's Merv Adrian earlier this year, it's been a major topic of discussion at industry events:

[Tweet from Gartner's Merv Adrian at Strata + Hadoop]

Last week I presented a webinar session with Chris Harrold, CTO for EMC's Big Data solutions, where we discussed shared infrastructure for Big Data and the opportunity to separate Hadoop compute from storage. Several hundred people signed up for the webinar, and there was great interaction in the Q&A chat panel throughout the session. That turnout is a clear indication that the market is looking for fresh ideas to cross the chasm with Big Data infrastructure.

Here's a recap of some of the topics we discussed in the webinar (you can also view the on-demand replay here).

Traditional Big Data assumptions = #1 reason for complexity, cost, and stalled projects.  

The herd mentality of believing that the only path to Big Data (and Hadoop in particular) is the way it was deployed at early adopters like Yahoo, Facebook, or LinkedIn has left scorched earth for many an enterprise.

[Image: Thomas Edison quote, "There's a way to do it better - find it."]

The Big Data ecosystem has made Hadoop synonymous with:

  • Dedicated physical servers (“just get a bunch of commodity servers, load them up with Hadoop, and you can be like a Yahoo or a Facebook”);
  • Hadoop compute and storage on the same physical machine (the buzzword is "data locality" – "you gotta have it, otherwise it's not Hadoop");
  • Hadoop has to be on direct attached storage (DAS) ("local computation and storage" and "HDFS requires local disks" are traditional Hadoop assumptions).

If you’ve been living by these principles, following them as “the only way” to do Hadoop, and are getting ready to throw in the towel on Hadoop … it’s time to STOP and challenge these fundamental assumptions.

Yes, there is a better way.


[Image: Big Data storage innovation]

Hadoop can run on containers or VMs. The new reality is that you can use virtual machines or containers as your Hadoop nodes rather than physical servers. This saves the time spent racking, stacking, and networking physical servers. You don't need to wait for a new server to be ordered and provisioned, or fight deployment issues caused by whatever was left on a repurposed server before it was handed over to you.

With software-defined infrastructure like virtual machines or containers, you get a pristine, clean environment that enables predictable deployments – while also delivering greater speed and cost savings. During the webinar, Chris highlighted a virtualized Hadoop deployment at Adobe. He explained how the team was able to quickly grow the number of Hadoop worker nodes from 32 to 64 to 128 in a matter of days – with significantly better performance than physical servers at a fraction of the cost.

Most data centers are fully virtualized. Why wouldn't you virtualize Hadoop? As a matter of fact, all of the "Quick Start" options from Hadoop vendors run on a VM or (more recently) on containers, whether local or in the cloud. Companies like Netflix have built an awesome service based on virtualized Hadoop clusters running in a public cloud. The requirement for on-premises Hadoop to run on a physical server is outdated.
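To make this concrete, here's a minimal sketch of what "containers as Hadoop nodes" can look like, using the Docker SDK for Python. The image name, network, and node count are hypothetical placeholders, not any vendor's actual tooling:

```python
# Sketch: spin up a few containerized Hadoop worker nodes with docker-py.
# "example/hadoop-worker" is a hypothetical image; substitute your own.
import docker

client = docker.from_env()

# A user-defined bridge network lets the containers resolve each other by name.
client.networks.create("hadoop-net", driver="bridge")

workers = []
for i in range(3):
    container = client.containers.run(
        "example/hadoop-worker",       # hypothetical Hadoop worker image
        name=f"hadoop-worker-{i}",
        network="hadoop-net",
        detach=True,
    )
    workers.append(container)

print("Running workers:", [c.name for c in workers])
```

Compare that with ordering, racking, and imaging three physical servers: the containers above come up in seconds, on a clean, repeatable base image.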

The concept of data locality is overblown. It's time to finally debunk this myth. Data locality is a silent killer that impedes Hadoop adoption in the enterprise. Copying terabytes of existing enterprise data onto physical servers with local disks, and then having to balance and re-balance the data every time a server fails, is operationally complex and expensive. In fact, it only gets worse as you scale your clusters. The internet giants like Yahoo used this approach circa 2005 because those were the days of slow 1Gbps networks.

Today, networks are much faster and 10Gbps networks are commonplace. Studies from U.C. Berkeley AMPLab and newer Hadoop reference architectures have shown that you can get better I/O performance with compute/storage separation. And your organization will benefit from simpler operational models, where you can scale and manage your compute and storage systems independently. Ironically, the dirty little secret is that even with compute/storage co-location, you are not guaranteed data locality in many common Hadoop scenarios. Ask the Big Data team at Facebook and they will tell you that only about 30% of their Hadoop tasks run on servers where the data is local.
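A quick back-of-envelope calculation shows why the network stopped being the constraint. The figures below are illustrative assumptions (a 12-disk DAS node and typical sequential SATA throughput), not benchmark results:

```python
# Back-of-envelope: can a 10Gbps link feed a worker as fast as local disks?
NETWORK_GBPS = 10                                # modern 10GbE link
network_gb_per_sec = NETWORK_GBPS / 8            # ~1.25 GB/s

DISKS_PER_NODE = 12                              # assumed DAS layout
DISK_MB_PER_SEC = 100                            # typical sequential SATA rate
local_gb_per_sec = DISKS_PER_NODE * DISK_MB_PER_SEC / 1000   # ~1.2 GB/s

print(f"network: {network_gb_per_sec:.2f} GB/s, "
      f"local disks: {local_gb_per_sec:.2f} GB/s")
# Circa-2005 1Gbps networks moved only ~0.125 GB/s, an order of magnitude
# less; that is the world in which the data locality rule was born.
```

Under these assumptions, a 10GbE pipe delivers remote data roughly as fast as a typical local disk array can read it.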

HDFS does not require local disks. This is another one of those traditional Hadoop tenets that is no longer valid: local direct attached storage (DAS) is not required for Hadoop. The Hadoop Distributed File System (HDFS) is as much a distributed file system protocol as it is an implementation. Running HDFS on local disks is one implementation approach, and DAS made sense for internet companies like Yahoo and Facebook, since their primary initial use case was collecting clickstream and log data.

However, most enterprises today have terabytes of Big Data from multiple sources (audio, video, text, etc.) that already reside in shared storage systems such as EMC Isilon. The data protection that enterprise-grade shared storage provides is a key consideration for these enterprises. And the need to move and duplicate this data for Hadoop deployments (with the 3x replication required for traditional DAS-based HDFS) can be a significant stumbling block.
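To put the replication overhead in perspective, here is some illustrative arithmetic. The ~1.2x protection overhead assumed for the shared system is an example figure for erasure-coding-style protection, not a specification of any particular product:

```python
# Raw capacity needed to hold one dataset, under two protection schemes.
dataset_tb = 100
hdfs_replication = 3          # default HDFS replication factor
shared_overhead = 1.2         # assumed erasure-coding-style overhead

print(f"DAS-based HDFS: {dataset_tb * hdfs_replication:.0f} TB raw")
print(f"Shared storage: {dataset_tb * shared_overhead:.0f} TB raw")
# 300 TB vs ~120 TB for the same 100 TB of data, before counting the extra
# copy created when the data is duplicated from its source into the cluster.
```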

BlueData and EMC Isilon enable an HDFS interface that can accelerate time-to-insight by leveraging your data in place and bringing it to the Hadoop compute processes – rather than waiting weeks or months to copy the data onto local disks. If you're interested in more on this topic, you can refer to the session on HDFS virtualization that my colleague Tom Phelan (Co-Founder and Chief Architect at BlueData) presented at Strata + Hadoop World in New York this fall.
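As a generic illustration of the read-in-place idea (this is not the DataTap or Isilon API; the endpoint and paths are placeholders), a client that speaks the HDFS protocol can read directly from a shared storage namespace, for example with pyarrow:

```python
# Sketch: read data in place over the HDFS protocol instead of copying it
# onto local disks first. Requires a local libhdfs/Hadoop client install.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="shared-storage.example.com", port=8020)

# List the dataset directly in the shared namespace...
for info in hdfs.get_file_info(fs.FileSelector("/data/clickstream")):
    print(info.path, info.size)

# ...and stream a file to the compute process, no bulk copy required.
with hdfs.open_input_stream("/data/clickstream/part-00000.csv") as f:
    first_chunk = f.read(1024)
```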

What about performance with Hadoop compute/storage separation?

Any time a new infrastructure approach is introduced for applications (and many of you have seen this movie before, when virtualization was introduced in the early 2000s), the number one question asked is "What about performance?" To address this question during our recent webinar session, Chris and I shared detailed performance data from customer deployments as well as from independent studies.

In particular, I want to highlight some performance results from a BlueData customer that compared physical Hadoop performance against a virtualized Hadoop cluster (running on the BlueData software platform) using the industry-standard Intel HiBench performance benchmark:

[Chart: Performance with virtualized Hadoop]

  • Enhanced DFSIO: Generates a large number of reads and writes (read-specific or write-specific)
  • TeraSort: Sorts a dataset generated by TeraGen (balanced read-write); a minimal timing sketch follows below
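For a sense of what such a run looks like, here is a minimal timing sketch using the stock MapReduce examples jar that ships with Hadoop. The jar path and data sizes are placeholders, and HiBench itself wraps runs like these with far more instrumentation:

```python
# Sketch: time a TeraGen/TeraSort pair via the Hadoop examples jar.
import subprocess
import time

EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"  # assumed path

def timed(args):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.time()
    subprocess.run(args, check=True)
    return time.time() - start

# TeraGen writes 100-byte rows, so 10 million rows is roughly 1 GB.
gen_secs = timed(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                  "10000000", "/bench/teragen"])
sort_secs = timed(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                   "/bench/teragen", "/bench/terasort"])
print(f"teragen: {gen_secs:.1f}s, terasort: {sort_secs:.1f}s")
```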

Testing by this customer (a U.S. Federal Agency lab) revealed that, across the HiBench micro-workloads investigated, the BlueData EPIC software platform enabled performance in a virtualized environment comparable or superior to bare-metal. As indicated in the chart above, I/O-intensive Hadoop jobs such as Enhanced DFSIO ran 50-60% faster; balanced read-write operations ran almost 200% faster.

While the numbers may vary for other customers based on hardware specifications (e.g. CPU, memory, disk types), the use of BlueData's IOBoost (application-aware caching) technology – combined with the use of multiple virtual clusters per physical server – contributed to the significant performance advantage over physical Hadoop.

Performance with DAS

This same U.S. Federal Agency lab ran the same HiBench benchmark to compare the performance of a DAS-based HDFS system against enterprise-grade NFS (with EMC Isilon) accessed via the BlueData DataTap technology. DataTap brings data from any file system (e.g. DAS-based HDFS, NFS, object storage) to Hadoop compute (e.g. MapReduce) by virtualizing the HDFS protocol.

Across the board, the tests showed that enterprise-grade NFS delivered superior performance compared to the DAS-based HDFS system. This comparison also validates that the network is not the bottleneck (a 10Gbps network was used) and that the 3x replication in DAS-based HDFS adds overhead.

A new approach will add value at every stage of the Big Data journey.

Big Data is a journey and many of you continue to persevere through it, while many others are just getting started.

Irrespective of where you are in this journey, the new approach to Hadoop that we described in the webinar session (e.g. leveraging containers and virtualization, separating compute and storage, using shared instead of local storage) will dramatically accelerate outcomes and deliver significant additional Big Data value.

[Image: Added value of virtualized Hadoop at each stage of the journey]

For those of you just getting started with Big Data and still in the prototyping phase, shared infrastructure built on the foundation of compute/storage separation will enable different teams to evaluate different Big Data ecosystem products and build use-case prototypes – all while sharing a centralized data set rather than copying and moving data around. And by running Hadoop on Docker containers, your data scientists and developers can spin up instant virtual clusters on demand with self-service.

If you have a specific use case in production, a shared infrastructure model will allow you to simplify your dev/test environment and eliminate the need to duplicate data from production. You can simplify management, reduce costs, and improve utilization as you begin to scale your deployment.

And finally, if you need a true multi-tenant environment for your enterprise-scale Hadoop deployment, there is simply no alternative to using some form of virtualization and shared storage to deliver the agility and efficiency of a Big Data-as-a-Service experience on-premises.

Key takeaways and next steps for your Big Data journey.

In closing, Chris and I shared some final thoughts and takeaways at the end of the webinar session:

  • Big Data is a journey: future-proof your infrastructure
  • Compute and storage separation enables greater agility for all Big Data stakeholders
  • Don’t make your infrastructure decisions based on the “data locality” myth

We expect these trends to continue into next year and beyond: we’ll see more Big Data deployments leveraging shared storage, more deployments using containers and virtual machines, and more enterprises decoupling Hadoop compute from storage. As you plan your Big Data strategy for 2016, I’d encourage you to challenge the traditional assumptions and embrace a new approach to do Hadoop.

Finally, I want to personally thank everyone who made time to attend our recent webinar session. You can view the on-demand replay. Feel free to ping me (@AnantCman) or Chris Harrold (@CHarrold303) on Twitter if you have any additional questions.

The Digital Revolution Pecha Kucha Style

Arseny Chernov


Impressions from the inaugural Strata+Hadoop World Singapore 2015

There’s never a second chance to make a first impression, right? Well, the first session at the inaugural O’Reilly Strata+Hadoop World Singapore conference in my hometown left a remarkable impression on me.

On the evening of December 1st, 2015, I jumped out of a cab in the heart of the undeniably smartest city on the planet, whizzed past the 3-storey-tall, 60-meter (200-foot) wide "Big Picture" screen at the Suntec Convention Centre – the largest high-definition TV wall in the world – and picked up my EMC Exhibitor Badge.

[Image: The Suntec "Big Picture" digital TV wall in Singapore]

I rushed straight into the only conference session of that day, the pre-keynote "PechaKucha Night".

Wait, went straight into… what?

(more…)
