
Strata+Hadoop San Jose 2016

Erin K. Banks, Portfolio Marketing Director at EMC
Erin K. Banks has been in the IT industry for almost 20 years. She is the Portfolio Marketing Director for Big Data at EMC. Previously she worked at Juniper Networks in Technical Marketing for the Security Business Unit. She has also worked at VMware and EMC as an SE in the Federal Division, focused on virtualization and security. She holds both CISSP and CISA accreditations. Erin has a BS in Electrical Engineering and is an author, blogger, and avid runner. You can find her on social media at @Banksek.

We are happy to announce that we will be at Strata+Hadoop San Jose 2016 next week (March 29 – 31) at the San Jose Convention Center. We will be in booth #1431 with theater presentations covering our entire Big Data portfolio and discussing how we can help you on your Big Data analytics journey. These adorable EMC elephant phone holder booth giveaways will be there too. Make sure you stop by, say hi, and attend one of our presentations, or just talk to us about your analytics environment and how you want to get the most value out of your data; we can help you with that.

We also have two sessions during the conference:

Tame That Beast: How to Bring Operations, Governance, and Reliability to Hadoop, from Keith Manthey (@KeithManthey). This session is on Wednesday at 1:50 – 2:30 PM in 210 B/F.

and

Developing a Big Data Business Strategy, from Bill Schmarzo (@schmarzo). This session is on Wednesday at 11:00 – 11:40 AM in LL21 E/F.

We look forward to seeing you there!

By the way… here is a picture of our booth so you know what to look for when you visit.

 

Big Data Conversation with Steve Jones from Capgemini

Erin K. Banks, Portfolio Marketing Director at EMC

Last year at a San Jose Earthquakes event, I met Steve Jones, Global VP for Big Data and Analytics at Capgemini. “Is it football, is it soccer, will that discussion ever end?”

In any case, we had a very lively discussion regarding the sport and our individual visions of the world. Steve is honest, fair, and a great conversationalist. I knew that I wanted to pick his brain further regarding his thoughts around Big Data and share them with you.

Below is the Big Data conversation that I had with Steve. I hoped his answers would provide greater insight into Big Data and how he sees its impact on business. And I was not disappointed. I hope you will enjoy his thoughts just as much as I do.

 

EB: How does your company define Big Data?

SJ: It’s not really about “Big” or “Fast”; it’s really about a shift away from single-schema data approaches towards ‘schema on need’ and the integration of insight to the point of action. Volumes are one challenge, but the real challenge is the mental shift away from data warehouses toward more flexible, insight-driven solutions.

EB: Why did you make the decision to focus on Big Data?

SJ: Capgemini was one of the first big systems integrators to look at what Big Data really meant operationally. Back in 2011 we were already talking about how companies needed to change the way they looked at governing data in a Big Data world. We chose back then to concentrate on driving the transformation towards this new world, as we saw it as crucial to helping our clients deliver long-term value.

EB: What impact is the change to Big Data having for your clients?

SJ: The first impact is on how we deliver information projects: being able to use multiple types of analytical engines on the same sets of data means we can solve more problems without requiring a new technology stack and another data silo. The other is being able to deliver specific pieces of insight much faster. What new technologies are doing is enabling the agile and DevOps practices that have become the normal way of working for application development to be applied to information projects.

EB: How do you see the Big Data market changing in the future?

SJ: The big shift we are seeing now is the rise of fast data, with technologies like Spark, Storm, and grid databases such as GemFire; the combination of Big and Fast helps companies actually react as something is happening, and even anticipate it in advance. This means that insights are being integrated directly back into operational processes and having to work at application speeds rather than traditional BI speeds.

EB: What is the biggest myth about Big Data?

SJ: That there isn’t any governance, and that it’s about replacing a data warehouse. The reality is that today very few information decisions are driven from a data warehouse; Excel, systems information, and local data marts drive most decisions. Taking proper control of your data requires you to recognize the full landscape.

EB: What business questions has Big Data helped you answer?

SJ: Big Data has helped us answer a huge range of questions for clients, some that couldn’t have been answered before due to technology or cost. But really it’s helped us answer a much more important general question: “How can I get new insights faster?” Specific insights are great, but a new way of working that delivers continual value is better.

EB: What advice would you give to someone embarking on a Big Data project?

SJ:

1) Start with the view that you are going to replace the substrate of your entire data landscape.

2) Think about governance not as being about quality but as being about enabling collaboration.

3) Minimize your number of different technologies. The lesson of Google, Facebook, and Amazon is that you don’t need a huge variety of technologies that do the same thing; you need a few things that each do one thing well.

You can follow more of what Steve Jones is thinking about Big Data on Twitter at @mosesjones.

Shared Infrastructure for Big Data: Separating Hadoop Compute and Storage

Anant Chintamaneni

Anant Chintamaneni

Anant Chintamaneni

Latest posts by Anant Chintamaneni (see all)

The decoupling of compute and storage for Hadoop has been one of the big takeaways and themes for Hadoop in 2015. BlueData has written some blog posts about the topic this year, and many of our new customers have cited this as a key initiative in their organization. And, as indicated in this tweet from Gartner’s Merv Adrian earlier this year, it’s been a major topic of discussion at industry events:

Tweet from Gartner’s Merv Adrian at Strata + Hadoop World

Last week I presented a webinar session with Chris Harrold, CTO for EMC’s Big Data solutions, where we discussed shared infrastructure for Big Data and the opportunity to separate Hadoop compute from storage. We had several hundred people sign up for the webinar, and there was great interaction in the Q&A chat panel throughout the session. This turnout is further validation of the interest in this topic – it’s a clear indication that the market is looking for fresh ideas to cross the chasm with Big Data infrastructure.

Here’s a recap of some of the topics we discussed in the webinar (you can also view the on-demand replay here).

Traditional Big Data assumptions = #1 reason for complexity, cost, and stalled projects.  

The herd mentality of believing that the only path to Big Data (and in particular Hadoop) is the way it was deployed at early adopters like Yahoo, Facebook, or LinkedIn has left scorched earth for many an enterprise.


The Big Data ecosystem has made Hadoop synonymous with:

  • Dedicated physical servers (“just get a bunch of commodity servers, load them up with Hadoop, and you can be like a Yahoo or a Facebook”);
  • Hadoop compute and storage on the same physical machine (the buzzword is “data locality” – “you gotta have it otherwise it’s not Hadoop”);
  • Hadoop has to be on direct attached storage [DAS] (“local computation and storage” and “HDFS requires local disks” are traditional Hadoop assumptions).

If you’ve been living by these principles, following them as “the only way” to do Hadoop, and are getting ready to throw in the towel on Hadoop … it’s time to STOP and challenge these fundamental assumptions.

Yes, there is a better way.


Big Data storage innovation

Hadoop can run on containers or VMs. The new reality is that you can use virtual machines or containers as your Hadoop nodes rather than physical servers. This saves you the time of racking, stacking, and networking those physical servers. You don’t need to wait for a new server to be ordered and provisioned, or fight deployment issues caused by whatever existed on a repurposed server before it was handed over to you.

With software-defined infrastructure like virtual machines or containers, you get a pristine, clean environment that enables predictable deployments – while also delivering greater speed and cost savings. During the webinar, Chris highlighted a virtualized Hadoop deployment at Adobe. He explained how they were able to quickly increase the number of Hadoop worker nodes from 32 to 64 to 128 in a matter of days – with significantly better performance than physical servers at a fraction of the cost.

Most data centers are fully virtualized. Why wouldn’t you virtualize Hadoop? As a matter of fact, all of the “Quick Start” options from Hadoop vendors run on a VM or (more recently) on containers (whether local or in the cloud). Companies like Netflix have built an awesome service based on virtualized Hadoop clusters that run in a public cloud. The requirement for on-premises Hadoop to run on a physical server is outdated.
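To make the containerized approach concrete, here is a minimal sketch of spinning up a handful of containerized Hadoop worker nodes with the Docker CLI driven from Python. The image name, container names, and node count are placeholder assumptions rather than any specific product’s artifacts; a platform such as BlueData EPIC automates this kind of provisioning (plus networking and cluster configuration) for you.

    import subprocess

    # Placeholder image name: substitute your own single-node Hadoop build
    # or a vendor "Quick Start" image.
    IMAGE = "example/hadoop-worker:2.7"

    def launch_worker(index):
        """Start one containerized Hadoop worker node via the Docker CLI."""
        name = "hadoop-worker-%d" % index
        subprocess.check_call(["docker", "run", "-d", "--name", name, IMAGE])
        print("started %s" % name)

    # Scaling from 4 to 64 nodes is a one-line change here, not a
    # procurement cycle for new physical servers.
    for i in range(4):
        launch_worker(i)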

The concept of data locality is overblown. It’s time to finally debunk this myth. Data locality is a silent killer that impedes Hadoop adoption in the enterprise. Copying terabytes of existing enterprise data onto physical servers with local disks, and then having to balance and re-balance the data every time a server fails, is operationally complex and expensive. As a matter of fact, it only gets worse as you scale your clusters up. Internet giants like Yahoo used this approach circa 2005 because those were the days of slow 1Gbps networks.

Today, networks are much faster and 10Gbps networks are commonplace. Studies from U.C. Berkeley AMPLab and newer Hadoop reference architectures have shown that you can get better I/O performance with compute/storage separation. And your organization will benefit from simpler operational models, where you can scale and manage your compute and storage systems independently. Ironically, the dirty little secret is that even with compute/storage co-location, you are not guaranteed data locality in many common Hadoop scenarios. Ask the Big Data team at Facebook and they will tell you that only 30% of their Hadoop tasks run on servers where the data is local.

HDFS does not require local disks. This is another one of those traditional Hadoop tenets that is no longer valid: local direct attached storage (DAS) is not required for Hadoop. The Hadoop Distributed File System (HDFS) is as much a distributed file system protocol as it is an implementation. Running HDFS on local disks is one such implementation approach, and DAS made sense for internet companies like Yahoo and Facebook – since their primary initial use case was collecting clickstream/log data.

However, most enterprises today have terabytes of Big Data from multiple sources (audio, video, text, etc.) that already resides in shared storage systems such as EMC Isilon. The data protection that enterprise-grade shared storage provides is a key consideration for these enterprises. And the need to move and duplicate this data for Hadoop deployments (with the 3x replication required for traditional DAS-based HDFS) can be a significant stumbling block.

BlueData and EMC Isilon enable an HDFS interface that can accelerate time-to-insight by leveraging your data in place and bringing it to the Hadoop compute processes – rather than waiting weeks or months to copy the data onto local disks. If you’re interested in more on this topic, you can refer to the session on HDFS virtualization that my colleague Tom Phelan (Co-Founder and Chief Architect at BlueData) presented at Strata + Hadoop World in New York this fall.
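As a rough illustration of what “leveraging your data in place” means from the compute side, the sketch below points standard Hadoop filesystem commands and a stock example job at a remote HDFS-compatible endpoint instead of local disks. The hostname, port, paths, and jar location are placeholder assumptions; the actual endpoint exposed by an Isilon or BlueData deployment comes from your own environment.

    import subprocess

    # Placeholder endpoint for an HDFS-compatible shared-storage system.
    REMOTE_FS = "hdfs://isilon.example.com:8020"
    EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"  # path depends on your distribution

    # Browse an existing dataset in place -- no bulk copy onto local DAS first.
    subprocess.check_call(["hadoop", "fs", "-ls", REMOTE_FS + "/data/clickstream"])

    # Run a stock MapReduce example directly against the shared storage.
    subprocess.check_call([
        "hadoop", "jar", EXAMPLES_JAR, "wordcount",
        REMOTE_FS + "/data/clickstream",
        REMOTE_FS + "/tmp/wordcount-out",
    ])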

What about performance with Hadoop compute/storage separation?

Any time a new infrastructure approach is introduced for applications (and many of you have seen this movie before, when virtualization was introduced in the early 2000s), the number one question asked is “What about performance?” To address this question during our recent webinar session, Chris and I shared some detailed performance data from customer deployments as well as from independent studies.

In particular, I want to highlight some performance results from a BlueData customer that compared physical Hadoop performance against a virtualized Hadoop cluster (running on the BlueData software platform) using the industry-standard Intel HiBench performance benchmark:

Performance with Virtualized Hadoop

  • Enhanced DFSIO: Generates a large number of reads and writes (read-specific or write-specific)
  • TeraSort: Sorts the dataset generated by TeraGen (balanced read-write)

Testing by this customer (a U.S. Federal Agency lab) revealed that, across the HiBench micro-workloads investigated, the BlueData EPIC software platform enabled performance in a virtualized environment that is comparable or superior to bare-metal. As indicated in the chart above, I/O-intensive Hadoop jobs such as Enhanced DFSIO were shown to run 50-60% faster; balanced read-write operations were shown to run almost 200% faster.
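If you want to run a comparable TeraSort-style test in your own environment, here is a minimal sketch of the TeraGen/TeraSort/TeraValidate sequence from the stock Hadoop examples. The jar path, row count, and output paths are placeholder assumptions; the full HiBench suite wraps these workloads (plus Enhanced DFSIO and others) with its own drivers and reporting.

    import subprocess

    EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"  # path depends on your distribution
    ROWS = "10000000"  # 10 million 100-byte rows, roughly 1 GB; scale up for a real test

    # Generate the input dataset ...
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teragen", ROWS, "/bench/tera-in"])

    # ... sort it (the balanced read/write workload described above) ...
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "terasort", "/bench/tera-in", "/bench/tera-out"])

    # ... and check that the output is correctly sorted.
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teravalidate", "/bench/tera-out", "/bench/tera-report"])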

While the numbers may vary for other customers based on hardware specifications (e.g. CPU, memory, disk types), the use of BlueData’s IOBoost (application-aware caching) technology – combined with the use of multiple virtual clusters per physical server – contributed to the significant performance advantage over physical Hadoop.

Performance with DAS

This same U.S. Federal Agency lab executed the same HiBench benchmark to compare the performance of a DAS-based HDFS system against enterprise-grade shared NFS (with EMC Isilon), utilizing the BlueData DataTap technology. DataTap brings data from any file system (e.g. DAS-based HDFS, NFS, object storage) to Hadoop compute (e.g. MapReduce) by virtualizing the HDFS protocol.

All the tests across the board showed that enterprise-grade NFS delivered superior performance compared to the DAS-based HDFS system. This performance comparison validates that the network is not the bottleneck (a 10Gbps network was used) and that the 3x replication in DAS-based HDFS adds overhead.
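A simple way to reproduce this kind of comparison yourself, sketched below under assumed placeholder endpoints, is to run an identical job against each storage backend and time the runs. The URIs here are illustrative, not BlueData syntax; DataTap’s HDFS-protocol virtualization is what lets the same job point at either backend without code changes.

    import subprocess, time

    EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"  # path depends on your distribution

    # Placeholder URIs: local DAS-backed HDFS vs. an HDFS-compatible shared-storage endpoint.
    BACKENDS = {
        "das-hdfs": "hdfs://das-namenode.example.com:8020/bench",
        "shared-isilon": "hdfs://isilon.example.com:8020/bench",
    }

    # Assumes TeraGen input has already been staged under <base>/tera-in on each backend.
    for label, base in BACKENDS.items():
        start = time.time()
        subprocess.check_call([
            "hadoop", "jar", EXAMPLES_JAR, "terasort",
            base + "/tera-in", base + "/tera-out-" + label,
        ])
        print("%s: %.1f seconds" % (label, time.time() - start))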

A new approach will add value at every stage of the Big Data journey.

Big Data is a journey and many of you continue to persevere through it, while many others are just getting started.

Irrespective of where you are in this journey, the new approach to Hadoop that we described in the webinar session (e.g. leveraging containers and virtualization, separating compute and storage, using shared instead of local storage) will dramatically accelerate outcomes and deliver significant additional Big Data value.

Added Value for virtualized Hadoop

For those of you just getting started with Big Data and in the prototyping phase, shared infrastructure built on the foundation of compute/storage separation will enable different teams to evaluate different Big Data ecosystem products and build use case prototypes – all while sharing a centralized data set versus copying and moving data around. And by running Hadoop on Docker containers, your data scientists and developers can spin up instant virtual clusters on-demand with self-service.

If you have a specific use case in production, a shared infrastructure model will allow you to simplify your dev/test environment and eliminate the need to duplicate data from production. You can simplify management, reduce costs, and improve utilization as you begin to scale your deployment.

And finally, if you need a true multi-tenant environment for your enterprise-scale Hadoop deployment, there is simply no alternative to using some form of virtualization and shared storage to deliver the agility and efficiency of a Big Data-as-a-Service experience on-premises.

Key takeaways and next steps for your Big Data journey.

In closing, Chris and I shared some final thoughts and takeaways at the end of the webinar session:

  • Big Data is a journey: future-proof your infrastructure
  • Compute and storage separation enables greater agility for all Big Data stakeholders
  • Don’t make your infrastructure decisions based on the “data locality” myth

We expect these trends to continue into next year and beyond: we’ll see more Big Data deployments leveraging shared storage, more deployments using containers and virtual machines, and more enterprises decoupling Hadoop compute from storage. As you plan your Big Data strategy for 2016, I’d encourage you to challenge the traditional assumptions and embrace a new approach to Hadoop.

Finally, I want to personally thank everyone who made time to attend our recent webinar session. You can view the on-demand replay. Feel free to ping me (@AnantCman) or Chris Harrold (@CHarrold303) on Twitter if you have any additional questions.

The Digital Revolution Pecha Kucha Style

Arseny Chernov

Impressions from the inaugural Strata+Hadoop World Singapore 2015

There’s never a second chance to make a first impression, right? Well, the first session at the inaugural O’Reilly Strata+Hadoop World Singapore conference in my hometown left a remarkable impression on me.

On the evening of December 1st, 2015, I jumped out of a cab in the heart of the undeniably smartest city on the planet, whizzed by the 3-storey-tall, 60-meter (200 feet) wide “Big Picture” screen at the Suntec Convention Centre – the largest high-definition TV wall in the world – and picked up my EMC Exhibitor Badge. I rushed straight in to the only conference session of that day, the pre-keynote “PechaKucha Night”.

Wait, went straight in to… what?

 

