EMC and RainStor Optimize Interactive SQL on Hadoop

Pivotal HAWQ was one of the most groundbreaking technologies to enter the Hadoop ecosystem last year, thanks to its ability to execute full ANSI SQL on large-scale datasets managed in Pivotal HD. This was great news for SQL users: organizations heavily reliant on SQL applications and common BI tools such as Tableau and MicroStrategy can leverage those investments to access and analyze new data sets managed in Hadoop.
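
To make that concrete, here is a minimal sketch of what reusing existing SQL skills looks like in practice. Because HAWQ derives from Greenplum/PostgreSQL, a standard PostgreSQL driver such as psycopg2 can typically connect to it; the host, credentials, and the sales table below are hypothetical placeholders.

```python
# Hypothetical example: running ordinary ANSI SQL against data managed in
# Pivotal HD, through HAWQ's PostgreSQL-compatible interface.
import psycopg2

conn = psycopg2.connect(
    host="hawq-master.example.com",  # hypothetical HAWQ master host
    port=5432,
    dbname="analytics",
    user="analyst",
)
cur = conn.cursor()

# Standard SQL, and therefore standard BI tooling, works unchanged even
# though the underlying data lives in Hadoop.
cur.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    WHERE order_date >= DATE '2013-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```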

Similarly, RainStor, a leading enterprise database known for its efficient data compression and built-in security, enables organizations to run ANSI SQL queries against highly compressed data in Hadoop. Because extreme compression shrinks the data footprint by 90% or more, RainStor lets users run analytics on Hadoop far more efficiently. In many cases queries actually run significantly faster, since the reduced footprint is combined with filtering capabilities that determine which data does not need to be read at all. This allows customers to minimize infrastructure costs while maximizing insight from analysis of larger data sets.
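
To illustrate the idea behind that filtering (a simplified sketch, not RainStor’s actual implementation), suppose each compressed partition carries lightweight min/max metadata per column. A query predicate can then skip whole partitions without ever decompressing them:

```python
# Simplified sketch of metadata-based filtering: each compressed partition
# records the min/max values it contains, so a query can decide what NOT
# to read before touching any data.
partitions = [
    {"file": "part-0001.rs", "min_ts": "2013-01-01", "max_ts": "2013-03-31"},
    {"file": "part-0002.rs", "min_ts": "2013-04-01", "max_ts": "2013-06-30"},
    {"file": "part-0003.rs", "min_ts": "2013-07-01", "max_ts": "2013-09-30"},
]

def partitions_to_scan(parts, lo, hi):
    """Keep only partitions whose [min_ts, max_ts] range overlaps the predicate."""
    return [p for p in parts if p["max_ts"] >= lo and p["min_ts"] <= hi]

# A query filtered to Q2 2013 decompresses one partition instead of three.
print(partitions_to_scan(partitions, "2013-04-01", "2013-06-30"))
```

Combined with a 90%+ smaller footprint, skipping irrelevant partitions is what makes queries over compressed data competitive with, and often faster than, scans over uncompressed data.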

Serving some of the largest telecommunications and financial services organizations, RainStor enables customers to readily query and analyze petabytes of data instead of archiving data sets to tape and reloading them whenever they are needed for analysis. RainStor chose to partner with EMC Isilon scale-out NAS as its storage layer to manage these petabyte-scale data environments even more efficiently. With Isilon, the compute and storage for Hadoop workloads are decoupled, enabling organizations to balance CPU and storage capacity optimally as data volumes and query counts grow.


Furthermore, organizations running RainStor on Isilon can not only use the Hadoop distribution of their choice, but also run multiple Hadoop distributions against the same compressed data. For example, a single copy of the data managed in RainStor on Isilon can serve Marketing’s Pivotal HD environment, Finance’s Cloudera environment, and HR’s Apache Hadoop environment, as the sketch below illustrates.
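
From the client side, the idea looks roughly like the following, using the Python hdfs (WebHDFS) package and assuming the Isilon cluster exposes a WebHDFS-compatible endpoint; the endpoint, service accounts, and path are hypothetical placeholders.

```python
# Hypothetical illustration: clients belonging to different Hadoop
# distributions all resolve the same Isilon-backed namespace, so a single
# copy of the compressed data serves every environment.
from hdfs import InsecureClient

ISILON_ENDPOINT = "http://isilon-smartconnect.example.com:8082"  # hypothetical

# Marketing's Pivotal HD jobs and Finance's Cloudera jobs use their own
# service accounts, but read exactly the same files.
marketing = InsecureClient(ISILON_ENDPOINT, user="pivotal_svc")
finance = InsecureClient(ISILON_ENDPOINT, user="cloudera_svc")

shared_path = "/rainstor/compressed/cdrs"
print(marketing.list(shared_path))
print(finance.list(shared_path))  # same single copy of the data
```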

To summarize, by running RainStor and Hadoop on EMC Isilon, you achieve:

  • Flexible Architecture Running Hadoop on NAS and DAS Together: Companies leverage local DAS storage for hot data where performance is critical and use Isilon for mass data storage. With RainStor’s compression, every byte moved across the network carries far more data, essentially creating an I/O multiplier.
  • Built-in Security and Reliability: Data is securely stored with built-in encryption and data masking, in addition to user authentication and authorization. EMC Isilon FlexProtect adds very little overhead while providing a reliable, highly available Big Data environment.
  • Improved Query Speed: Data can be queried using a variety of tools, including standard SQL, BI tools, Hive, Pig, and MapReduce. With built-in filtering, queries speed up by a factor of 2-10x compared to Hive on HDFS/DAS.
  • Compliant WORM Solution: For absolute retention and protection of business-critical data, including stringent SEC 17a-4 requirements, you can leverage EMC Isilon’s SmartLock in addition to RainStor’s built-in immutable data retention capabilities.

I spoke with Jyothi Swaroop, Director of Product Marketing at RainStor, about the value of deploying EMC Isilon with RainStor and Hadoop.

1.  RainStor is known in the industry as an enterprise database architected for Big Data. Can you please explain how this technology evolved and what needs it addresses in the market?


VCE Vblock: Converging Big Data Investments To Drive More Value

As Big Data continues to demonstrate real business value, organizations are looking to leverage this high-value data across different applications and use cases. This uptake is also driving organizations to transition from siloed Big Data sandboxes to enterprise architectures, where they are mandated to address mission-critical availability and performance, security and privacy, provisioning of new services, and interoperability with the rest of the enterprise infrastructure.

Sandbox or experimental Hadoop on commodity hardware with direct-attached storage (DAS) makes it difficult to address such challenges for several reasons: data is hard to replicate across applications and data centers, IT lacks oversight of and visibility into the data, multi-tenancy and virtualization are missing, and upgrades and technology migrations are difficult to streamline, among others. As a result, VCE, a leader in converged or integrated infrastructures, is receiving an increasing number of requests on how to evolve DAS-reliant Hadoop implementations to run on VCE Vblock Systems: an enterprise-class infrastructure that combines servers, shared storage, network devices, virtualization, and management in a pre-integrated stack.

Formed by Cisco and EMC, with investments from VMware and Intel, VCE enables organizations to rapidly deploy business services on demand and at scale, all without triggering an explosion in capital and operating expenses. According to a recent IDC report, organizations around the world spent over $3.3 billion on converged systems in 2012, and IDC forecasts that spending will increase by 20% in 2013 and again in 2014. In fact, IDC calculated that Vblock Systems delivered a return on investment of 294% over three years and 435% over five years compared to traditional infrastructure, thanks to fast deployments, simplified operations, improved business-support agility, and cost savings, freeing staff to launch new applications, extend services, and improve user and customer satisfaction.

I spoke with Julianna DeLua from VCE Product Management to discuss how VCE’s Big Data solution enables organizations to extract more value from Big Data investments.


1.  Why are organizations interested in deploying Hadoop and Big Data applications on converged or integrated infrastructures such as Vblock?


You Asked, Rackspace Listened – New Big Data Hosting Options

Running Hadoop on bare metal may fit some use cases, but many organizations have data workloads that demand more storage than compute resources. To get the most efficient utilization for these types of Hadoop workloads, separating compute and storage resources makes sense and is a configuration users are asking for, e.g., EMC Isilon storage for Hadoop. In response to diverse Big Data needs such as mixed Hadoop workloads, hybrid cloud models, and heterogeneous data layers, Rackspace recently delivered new Big Data hosting options that give users more choice: run Hadoop on Rackspace-managed dedicated servers, spin up Hadoop in the public cloud, or configure your own private cloud.


I spoke with Sean Anderson, Product Marketing Manager for Data Solutions at Rackspace, about one new offering in particular, the ‘Managed Big Data Platform’, which lets customers design the optimal configuration for their data and leave the management details to Rackspace.

1.  The Big Data lifecycle goes through various stages, each imposing different requirements. From what you are seeing with your customers, can you explain this lifecycle and the value Rackspace brings in supporting it?


EMC Isilon For Hadoop – No Ingest Necessary

In traditional Hadoop environments, the entire data set must be ingested (and three or more copies of each block made) before any analysis can begin. Once analysis is complete, results must then be exported. What’s the significance of this? COST. Ingest and export are tedious, time-consuming processes, and maintaining multiple copies of the data adds further expense. With EMC Isilon HDFS, analysis of the entire data set can begin immediately without replicating it, and the results are immediately available to NFS and SMB clients as well, as the sketch below illustrates.
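
Here is a hedged sketch of that multiprotocol flow: a file written over an NFS mount is immediately readable through HDFS, with no ingest or export step in between. The mount point and endpoint are hypothetical placeholders, and WebHDFS access via the Python hdfs package is assumed.

```python
# Hypothetical illustration of in-place, multiprotocol access on Isilon:
# write once over NFS, read immediately over HDFS. No copies are made.
from hdfs import InsecureClient

# An application writes results to an Isilon NFS mount...
nfs_path = "/mnt/isilon/analytics/results.csv"  # hypothetical mount point
with open(nfs_path, "w") as f:
    f.write("region,revenue\nwest,1200000\neast,980000\n")

# ...and a Hadoop client sees the same bytes via HDFS right away.
client = InsecureClient("http://isilon-smartconnect.example.com:8082", user="hdfs")
with client.read("/analytics/results.csv") as reader:
    print(reader.read().decode())
```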

If you don’t already own Isilon for your Hadoop environment, it is worth exploring the multitude of benefits Isilon brings over HDFS running on compute hosts. If you are already an Isilon customer, no data movement is required at all: Isilon offers in-place analytics on your data, eliminating the need to build a specialty Hadoop storage infrastructure.

Ryan Peterson, Director of Solutions Architecture at Isilon, likes to say that Isilon dedupes Hadoop, since Isilon satisfies Hadoop’s need to see multiple copies of the same data without actually making those copies. In fact, the latest release of Isilon’s OneFS, version 7.1, adds a feature called SmartDedupe that can reduce storage by approximately another 30%. Ryan now refers to this as Hadoop Dedupe Dedupe: the first ‘Dedupe’ removes the 3x replication, and the second ‘Dedupe’ reduces storage by 30%. Clever!

I sat down with Ryan Peterson, who walked us through Hadoop Dedupe Dedupe:

In a traditional Hadoop deployment, data loss from hardware failure is handled by replicating each block of data a minimum of three times (3x by default), resulting in at least 4 copies of the data: the existing primary storage copy plus 3 Hadoop storage copies.

Isilon for Hadoop turns this paradigm upside down: if existing primary data is NOT already on Isilon, then only 2.2 copies of the data are required to protect against data loss from hardware failure. The first copy is the existing primary data NOT on Isilon, and the second is on Isilon, where the N+M RAID-like distributed parity scheme stores the equivalent of 1.2 copies while providing high availability and resiliency against hardware failures (i.e., nodes and disks).

If primary data is already on Isilon, there’s no need for a separate Hadoop storage infrastructure in the first place, and only 1.2 data copies are made instead of 4. With Isilon’s de-duplication feature, storage requirements go down by approximately another 30%.

So if customers have 300TB of raw data, a traditional cluster requires 900TB of new storage just for the Hadoop copies. However, if that data is already on Isilon, they need no new storage at all: the 300TB occupies roughly 360TB with Isilon’s 1.2x protection overhead, de-duplication trims that by about 30% to approximately 252TB, and Hadoop runs directly against that data.
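
Worked through in Python, using the approximate figures above:

```python
# Storage required for 300TB of raw data under each scheme
# (figures are the approximate ones quoted above).
raw_tb = 300

# Traditional Hadoop on DAS: 3x block replication.
das_storage = raw_tb * 3                         # 900 TB of new storage

# Data already on Isilon: N+M parity adds ~1.2x overhead instead of 3x.
isilon_protected = raw_tb * 1.2                  # 360 TB, no separate silo

# SmartDedupe removes roughly another 30%.
isilon_deduped = isilon_protected * (1 - 0.30)   # ~252 TB

print(f"Traditional DAS cluster:      {das_storage:.0f} TB")
print(f"On Isilon (1.2x protection):  {isilon_protected:.0f} TB")
print(f"On Isilon after ~30% dedupe:  {isilon_deduped:.0f} TB")
```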

Wait a minute, is this Hadoop Dedupe Dedupe Dedupe?