Posts Tagged ‘scale out’

Don’t Accept The Status Quo For Hadoop

Mona Patel

Senior Manager, Big Data Solutions Marketing at EMC
Mona Patel is a Senior Manager for Big Data Marketing at EMC Corporation. With over 15 years of working with data at The Department of Water and Power, Air Touch Communications, Oracle, and MicroStrategy, Mona decided to grow her career at EMC, a leader in Big Data.

Hadoop is Everywhere – according to IDC, 99% of companies will deploy or pilot Hadoop in the next 18-24 months. These environments will largely be built on standalone servers with direct-attached storage (DAS), adding management overhead as data spreads across many disk spindles throughout the data center. With Hadoop clusters expanding quickly, organizations are starting to experience the typical growing pains of adolescence. This raises the question: should the DAS server configuration remain the accepted status quo for Hadoop deployments?


Whether you are getting started with Hadoop or growing your Hadoop deployment, EMC provides a long-term solution through shared storage and VMs, delivering distinct business value in lower TCO and faster time-to-results. I spoke with EMC Technical Advisory Architect Chris Harrold about why organizations are now turning to EMC to help transition their Hadoop environments into adulthood.

1.  Almost every Hadoop deployment is based around the accepted configuration of standalone servers with DAS. What issues have you seen with this configuration among your customers?

(more…)

EMC and RainStor Optimize Interactive SQL on Hadoop


Pivotal HAWQ was one of the most groundbreaking technologies to enter the Hadoop ecosystem last year, thanks to its ability to execute complete ANSI SQL on large-scale datasets managed in Pivotal HD. This was great news for SQL users – organizations heavily reliant on SQL applications and common BI tools such as Tableau and MicroStrategy can leverage these investments to access and analyze new data sets managed in Hadoop.
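For SQL users, connecting to HAWQ looks much like connecting to any other relational database. Here is a minimal sketch of a JDBC query – the host, database, table, and credentials are hypothetical placeholders – relying on the fact that HAWQ is Greenplum-derived and so can typically be reached with the stock PostgreSQL JDBC driver:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, database, table, and credentials -- substitute your own.
        // HAWQ is Greenplum-derived, so the standard PostgreSQL JDBC driver
        // is typically all a SQL client or BI tool needs to connect.
        String url = "jdbc:postgresql://hawq-master.example.com:5432/analytics";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "changeme");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(revenue) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
            }
        }
    }
}
```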

Similarly, RainStor, a leading enterprise database known for its efficient data compression and built-in security, also enables organizations to run ANSI SQL queries against highly compressed data in Hadoop. Due to the reduced footprint from extreme data compression (typically 90%+ less), RainStor enables users to run analytics on Hadoop much more efficiently. In fact, in many instances queries run significantly faster thanks to the reduced footprint plus built-in filtering that works out which data not to read. This allows customers to minimize infrastructure costs and maximize insight when analyzing larger data sets.
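RainStor's internals are proprietary, but the idea behind "working out which data not to read" is easy to illustrate. The sketch below is a generic illustration of block pruning, not RainStor's actual implementation: cheap min/max metadata is kept per compressed block, and any block whose value range cannot satisfy the filter is skipped without ever being read or decompressed:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only -- not RainStor's actual internals. Keep cheap
 * min/max metadata per compressed block, and skip any block whose value
 * range cannot satisfy the query filter, so most blocks are never read
 * or decompressed at all.
 */
public class BlockPruner {

    static final class BlockMeta {
        final long minTs, maxTs;   // min/max timestamp stored in the block
        final String path;         // where the compressed block lives
        BlockMeta(long minTs, long maxTs, String path) {
            this.minTs = minTs; this.maxTs = maxTs; this.path = path;
        }
    }

    /** Return only the blocks whose [minTs, maxTs] range overlaps [from, to]. */
    static List<BlockMeta> prune(List<BlockMeta> blocks, long from, long to) {
        List<BlockMeta> survivors = new ArrayList<>();
        for (BlockMeta b : blocks) {
            if (b.maxTs >= from && b.minTs <= to) {
                survivors.add(b);
            }
        }
        return survivors;
    }
}
```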

Serving some of the largest telecommunications and financial services organizations, RainStor enables customers to readily query and analyze petabytes of data instead of archiving data sets to tape and then having to reload them whenever they are needed for analysis. RainStor chose to partner with EMC Isilon scale-out NAS for its storage layer to manage these petabyte-scale data environments even more efficiently. With Isilon, the compute and storage tiers of a Hadoop workload are decoupled, enabling organizations to balance CPU and storage capacity optimally as data volumes and query counts grow.
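In practice, decoupling compute from storage is largely a matter of pointing the Hadoop file system at Isilon rather than at local disks. A minimal sketch, assuming a hypothetical SmartConnect zone name (OneFS serves the HDFS protocol itself, conventionally on port 8020):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical SmartConnect zone name; OneFS answers HDFS requests
        // directly, so compute nodes point at the Isilon cluster rather
        // than at DataNodes on local disk.
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}
```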


Furthermore, organizations can not only run the Hadoop distribution of their choice with RainStor-Isilon, but also run multiple distributions of Hadoop against the same compressed data. For example, a single copy of the data managed in RainStor-Isilon can service Marketing's Pivotal HD environment, Finance's Cloudera environment, and HR's Apache Hadoop environment.

To summarize, by running RainStor and Hadoop on EMC Isilon, you achieve:

  • Flexible Architecture Running Hadoop on NAS and DAS together: Companies leverage DAS local storage for hot data where performance is critical and use Isilon for mass data storage. With RainStor’s compression, you efficiently move more data across the network, essentially creating an I/O multiplier.
  • Built-in Security and Reliability: Data is securely stored with built-in encryption and data masking, in addition to user authentication and authorization. With very little overhead, EMC Isilon FlexProtect provides a reliable, highly available Big Data environment.
  • Improved Query Speed: Data is queried using a variety of tools, including standard SQL, BI tools, Hive, Pig, and MapReduce. With built-in filtering, queries speed up by a factor of 2-10x compared to Hive on HDFS/DAS.
  • Compliant WORM Solution: For absolute retention and protection of business critical data, including stringent SEC 17a-4 requirements, you leverage EMC Isilon’s SmartLock in addition to RainStor’s built-in immutable data retention capabilities.

I spoke with Jyothi Swaroop, Director of Product Marketing at RainStor, about the value of deploying EMC Isilon with RainStor and Hadoop.

1.  RainStor is known in the industry as an enterprise database architected for Big Data. Can you please explain how this technology evolved and what needs it addresses in the market?

(more…)

EMC Isilon For Hadoop – No Ingest Necessary


In traditional Hadoop environments, the entire data set must be ingested (and three or more copies of each block made) before any analysis can begin. Once analysis is complete, results must then be exported. What's the significance of this? COST. Ingest and export are tedious, time-consuming processes, compounded by having to maintain multiple copies of the data. With EMC Isilon HDFS, analysis of the entire data set can begin immediately, without replication, and the results are just as immediately available to NFS and SMB clients.
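To make the "no ingest" point concrete, here is a small sketch using the standard Hadoop FileSystem API, with a hypothetical cluster name and file path: a file that some application dropped onto Isilon over NFS or SMB is immediately readable over HDFS, because OneFS presents the same file system over every protocol:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadInPlace {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical Isilon SmartConnect zone name.
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");
        // Hypothetical path: a log file an application wrote over NFS or
        // SMB. No ingest step -- the same bytes are served over HDFS.
        Path logFile = new Path("/data/logs/app.log");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(logFile), StandardCharsets.UTF_8))) {
            System.out.println("First line, read via HDFS: " + reader.readLine());
        }
    }
}
```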

If you don’t already own Isilon for your Hadoop environment, it is worth exploring the multitude of benefits Isilon brings over HDFS running on compute hosts. If you are already an Isilon customer, no data movement is required: Isilon offers in-place analytics on your data, eliminating the need to build a specialty Hadoop storage infrastructure.

Ryan Peterson, Director of Solutions Architecture at Isilon, likes to say that Isilon dedupes Hadoop, since Isilon satisfies Hadoop’s need to see multiple copies of the same data without actually having to copy it. In fact, with the latest release of Isilon’s OneFS 7.1, a new feature called SmartDedupe can reduce storage by approximately another 30%. Ryan now refers to this as Hadoop Dedupe Dedupe: the first ‘Dedupe’ removes 3x replication, and the second ‘Dedupe’ reduces storage by 30%. Clever!

I sat down with Ryan Peterson to walk us through Hadoop Dedupe Dedupe:

In a traditional Hadoop deployment, data loss from hardware failure is handled by replicating each block of data a minimum of three times (3x by default), resulting in at least four copies of the data – the existing copy on primary storage plus three Hadoop storage copies.

Isilon for Hadoop turns this paradigm upside down: if the existing primary data is NOT already on Isilon, only 2.2 copies of the data are required to protect against data loss from hardware failure. The first copy is the existing primary data not on Isilon, and the second is on Isilon, where the N+M RAID-like distributed parity scheme stores the equivalent of 1.2 copies while providing high availability and resiliency against hardware failure (i.e. nodes and disks).

If primary data is already on Isilon, there’s no need for a separate Hadoop storage infrastructure in the first place, and only 1.2 data copies are made instead of 4. With the upcoming release of Isilon’s de-duplication feature, the storage requirements will go down by approximately another 30%.

So if customers have 300TB of raw data, they will need 900TB of new storage to run a traditional Hadoop cluster. However, if they already have this data on Isilon, they will not need any new storage and will consume only about 252TB (300TB x 1.2 protection overhead x 0.7 after dedupe), because the primary data is de-duplicated and Hadoop runs directly against it.
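The arithmetic is worth spelling out. A trivial sketch using the approximate ratios quoted above:

```java
public class FootprintMath {
    public static void main(String[] args) {
        double rawTb = 300.0;

        // Traditional DAS cluster: HDFS keeps three full copies of each block.
        double dasTb = rawTb * 3.0;            // = 900 TB of new storage

        // Isilon: ~1.2x footprint from N+M distributed parity, then roughly
        // another 30% back from SmartDedupe (both are the approximate
        // ratios quoted in the post, not guarantees).
        double isilonTb = rawTb * 1.2 * 0.7;   // = 252 TB, no new storage

        System.out.printf("DAS: %.0f TB vs. Isilon: %.0f TB%n", dasTb, isilonTb);
    }
}
```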

Wait a minute, is this Hadoop Dedupe Dedupe Dedupe?

Want to Explore Hadoop, But No Tour Guide?


Are you a VMware vSphere customer? Do you also own EMC Isilon? If you said yes to both, I have great news for you – you have all the ingredients for the EMC Hadoop Starter Kit (HSK). In just a few short hours you can spin up a virtualized Hadoop cluster by downloading the HSK step-by-step guide. Watch the demo below of HSK being used to deploy Hadoop:

Now you don’t have to imagine what Hadoop tastes like because this starter kit is designed to help you execute and discover the potential of Hadoop within your organization. Whether you are new to Hadoop or an experienced Hadoop user, you will want to take advantage of this turnkey solution for the following reasons:

-Rapid provisioning – From the creation of virtual Hadoop nodes to starting up the Hadoop services on the cluster, much of the Hadoop cluster deployment can be automated, requiring little expertise on the user’s part.

-High availability – HA protection can be provided through the virtualization platform to protect the single points of failure in the Hadoop system, such as NameNode and JobTracker Virtual Machines.

-Elasticity – Hadoop capacity can be scaled up and down on demand in a virtual environment, thus allowing the same physical infrastructure to be shared among Hadoop and other applications.

-Multi-tenancy – Different tenants running Hadoop can be isolated in separate VMs, providing stronger VM-grade resource and security isolation.

-Portability – Use any Hadoop distribution throughout the Big Data application lifecycle with zero data migration – Apache Open Source, Pivotal HD, Cloudera, Hortonworks.

I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, about why every organization that uses VMware vSphere and EMC Isilon should use it for Big Data projects.

1.  Why did you create the starter kit and what are the best use cases for this starter kit?

(more…)
