EMC Hadoop Starter Kit ViPR Edition: Creating a Smarter Data Lake

Pivotal HD offers a wide variety of data processing technologies for Hadoop – real-time, interactive, and batch. Add EMC Isilon scale-out NAS as integrated data storage to Pivotal HD and you have a shared data repository with multi-protocol support, including HDFS, to service a wide variety of data processing requests. This smells like a Data Lake to me – a general-purpose data storage and processing resource center where Big Data applications can develop and evolve. Add EMC ViPR software-defined storage to the mix and you have the smartest Data Lake in town, one that supports additional protocols and hardware and automatically adapts to changing workload demands to optimize application performance.

EMC Hadoop Starter Kit, ViPR Edition, now makes it easier to deploy this ‘smart’ Data Lake with Pivotal HD and other Hadoop distributions such as Cloudera and Hortonworks. Simply download the step-by-step guide and you can quickly deploy a Hadoop Big Data analytics environment, configuring Hadoop to use ViPR for HDFS, with Isilon hosting the Object/HDFS data service. Although Isilon is the storage array that ViPR deploys objects to in this guide, other storage platforms are also supported – EMC VNX, NetApp, OpenStack Swift, and Amazon S3.
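If you want a quick sanity check once the guide’s configuration is in place, a short program against the standard Hadoop FileSystem API is enough to confirm that the cluster resolves its file system through ViPR. This is a minimal sketch, not taken from the guide itself; the endpoint host and port are placeholders that your ViPR deployment would replace.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViPRHdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder endpoint: in a ViPR deployment, the default file system
        // points at the ViPR data service rather than a dedicated NameNode.
        conf.set("fs.defaultFS", "hdfs://vipr-data-service.example.com:9000");

        // Any client written against the HDFS API works unchanged.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

Because ViPR exposes a Hadoop-compatible file system, nothing in this client code is ViPR-specific – the same program runs against a vanilla HDFS cluster, which is exactly the point.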

I spoke with the creator of this starter kit, James F. Ruddy, Principal Architect for the EMC Office of the CTO, to explain why every organization should use this starter kit to optimize its IT infrastructure for Hadoop deployments.

1.  The original EMC Hadoop Starter Kit released last year was a huge success.  Why did you create ViPR Edition?

Organizations that deploy Hadoop as a dedicated environment create yet another data silo. This guide enables customers to minimize data silos by deploying any of the three most popular Hadoop distributions (Pivotal, Cloudera, Hortonworks) on EMC ViPR software-defined storage, letting organizations leverage existing investments in storage platforms and infrastructure for Big Data analytics. Massive amounts of data already live in these storage platforms, and ViPR enables analytics on those arrays without having to create a separate, dedicated Hadoop environment.

2.  What are the best use cases for HSK ViPR Edition?

First, you can instantly deploy a Big Data repository by utilizing existing enterprise storage capacity as a “Data Lake” on top of which to enable analytics.

Second, you can reduce the growth in dedicated Hadoop environments, since large volumes of unstructured data already living in EMC storage or third-party arrays such as NetApp can now be exploited through Hadoop programs.

Third, you can eliminate the need to keep multiple copies of the same data for different types of applications, thanks to ViPR’s support for multiple protocols and mixed workloads. ViPR enables dual-mode access to the data under its management, so object-based workloads and analytics applications can manipulate the same data: ViPR provides S3, Swift, and Atmos API support as well as HDFS API access.
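To make dual-mode access concrete, the sketch below writes an object through an S3-compatible API and then reads the same bytes back through the HDFS API. Everything here is an assumption for illustration – the endpoints, credentials, bucket name, and the way the bucket maps into the file system namespace – not ViPR-specific API detail.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class DualModeAccessSketch {
    public static void main(String[] args) throws Exception {
        // 1) Object side: write through the S3-compatible API.
        //    Endpoint and credentials are placeholders for a ViPR deployment.
        AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        s3.setEndpoint("http://vipr.example.com:9020");
        s3.putObject("analytics-bucket", "events/2014-03-17.log",
                "event_id,timestamp\n42,1395064849\n");

        // 2) Analytics side: read the same data through the HDFS API.
        //    The bucket-to-path mapping shown here is a hypothetical example.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://vipr-data-service.example.com:9000");
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in =
                     fs.open(new Path("/analytics-bucket/events/2014-03-17.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```

The design point is that no copy or ETL step sits between the two halves: an object written by one application is immediately visible to a MapReduce job or any other HDFS client.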

3.  So what are the prerequisites for HSK ViPR Edition?

The guides are designed to enable the use of ViPR as a Hadoop-compatible file system that resides as object storage on top of an existing ViPR-supported file storage array. So to start, you need a file storage array that you can deploy ViPR data services in front of. For the compute side, you need physical or virtual machines to run the Hadoop cluster – anywhere from one node to many can be used. The guides walk you through the automated deployment tools available for each distribution and show how to use the native management tools to integrate ViPR HDFS services.

 

Pivotal Big Data Suite: Eliminating the Tax On A Growing Hadoop Cluster

The promise of Big Data is analyzing more data to gain unprecedented insight, but Hadoop pricing can place serious constraints on the amount of data that can actually be stored for analysis. Each time a node is added to a Hadoop cluster to increase storage capacity, you are charged for it. Because this pricing model is counterintuitive to the philosophy of Big Data, Pivotal has removed the tax on storing data in Hadoop with its announcement of the Pivotal Big Data Suite.

Through a Pivotal Big Data Suite subscription, customers store as much data as they want in fully supported Pivotal HD, paying only for value-added services per core – Pivotal Greenplum Database, GemFire, SQLFire, GemFire XD, and HAWQ. The significance of this new consumption model is that customers can now store as much Big Data as they want while being charged only for the value they extract from it.
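A toy calculation shows why this matters. The figures below are entirely hypothetical – Pivotal’s actual prices are not quoted in this post – but the shape of the arithmetic is the point: cost tracks the cores running value-added services, not the nodes storing data.

```java
public class BigDataSuiteCostSketch {
    public static void main(String[] args) {
        // All figures are hypothetical, for illustration only.
        int hadoopNodes = 50;                // Pivotal HD storage nodes: not charged
        int valueAddedCores = 64;            // cores running HAWQ, GemFire XD, etc.
        double pricePerCorePerYear = 1000.0; // assumed subscription price

        double annualCost = valueAddedCores * pricePerCorePerYear;
        System.out.printf("%d Hadoop nodes, %d value-added cores: $%,.2f/year%n",
                hadoopNodes, valueAddedCores, annualCost);

        // Double the storage tier: the subscription cost does not move,
        // because only value-added cores are billed.
        hadoopNodes *= 2;
        System.out.printf("%d Hadoop nodes, %d value-added cores: $%,.2f/year%n",
                hadoopNodes, valueAddedCores, annualCost);
    }
}
```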

[Diagram: Pivotal Big Data Suite]

*Calculate your savings with Pivotal Big Data Suite compared to traditional Enterprise Data Warehouse technologies.

Additionally, Pivotal Big Data Suite removes the mind games associated with the diverse data processing needs of Big Data. With a flexible subscription covering your choice of real-time, interactive, and batch processing technologies, organizations are no longer locked into a specific technology by a contract. At any point in time, as Big Data applications grow and Data Warehouse applications shrink, you can spin licenses up or down across the value-added services without incurring additional costs. This pooled approach eliminates the need to procure new technologies – a process that results in delayed projects, additional costs, and more data silos.

I spoke with Michael Cucchi, Senior Director of Product Marketing at Pivotal, to explain how Pivotal Big Data Suite radically redefines the economics of Big Data so organizations can achieve the Data Lake dream.

1. What Big Data challenges does Big Data Suite address and why?

Continue reading

Pivotal HD 2.0: Hadoop Gets Real-Time

Everything we do generates events – click on a mobile ad, pay with a credit card, tweet, measure heart rate, accelerate on the gas pedal, etc. What if an organization could feed these events into predictive models as soon as they happen, making faster and more accurate decisions that generate more revenue, lower costs, minimize risk, and improve the quality of care? You would need the deep and fast analytics provided by Big Data platforms such as Pivotal HD 2.0, announced yesterday.

Pivotal HD 2.0 brings an in-memory SQL database to Hadoop through seamless integration with Pivotal GemFire XD, enabling you to combine real-time data with historical data managed in HDFS. Closed-loop analytics, operational BI, and high-speed data ingest are now possible in a single OLTP/OLAP platform without any ETL processing required. The best use cases are time-sensitive ones. For example, telecom companies are at the forefront of applying real-time Big Data analytics to network traffic: the “store first, analyze second” method does not make sense for rapidly shifting traffic that requires immediate action when issues arise.
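From an application’s point of view, the real-time side looks like ordinary SQL over JDBC. The sketch below follows GemFire XD’s thin-client conventions – the driver class, URL form, and default port 1527 are stated here as assumptions – and the network_events table is hypothetical and presumed to already exist.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class GemFireXDSketch {
    public static void main(String[] args) throws Exception {
        // Assumed thin-client driver and URL, per GemFire XD conventions.
        Class.forName("com.pivotal.gemfirexd.jdbc.ClientDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:gemfirexd://locator.example.com:1527/")) {

            // OLTP side: ingest an event the moment it happens.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO network_events (cell_id, latency_ms) VALUES (?, ?)")) {
                insert.setInt(1, 42);    // hypothetical cell tower id
                insert.setInt(2, 180);   // observed latency in ms
                insert.executeUpdate();
            }

            // OLAP side: query the same table immediately, with no ETL in between.
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT cell_id, AVG(latency_ms) FROM network_events GROUP BY cell_id");
                 ResultSet rs = query.executeQuery()) {
                while (rs.next()) {
                    System.out.println("cell " + rs.getInt(1)
                            + ": avg latency " + rs.getDouble(2) + " ms");
                }
            }
        }
    }
}
```

In a deployment where the table is backed by an HDFS store, the same rows that served the transactional insert would also accumulate in HDFS for historical, batch-style analysis.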

[Image: GemFire]

I spoke with Makarand Gokhale, Senior Director of Engineering at Pivotal, to explain the value of bringing OLTP to Hadoop’s traditionally batch-oriented processing.

1. Real-time solutions for Hadoop can mean many things – performing interactive queries, real-time event processing, and fast data ingest. How would you describe Pivotal HD’s real-time data services for Hadoop?

Continue reading

Can Big Data Shape A Better Future? Quid is Paving the Way

World hunger, political conflict, business competition, and other complex problems cannot be solved with mathematical algorithms measuring probabilities alone. However, by combining human intelligence with the best artificial intelligence, the company Quid has built software that experts are calling the world’s first augmented intelligence platform. Using the superior speed and storage capacity of computation, Quid accelerates the process by which human beings typically acquire the deep pattern recognition of expertise. The software does more than run simple prediction algorithms; it allows users to interact with data in an immersive, visual environment to better understand the world at high resolution so that they can ultimately shape and change it.

Founded in 2010, Quid is addressing a new class of problems to help organizations make strategic decisions around business innovation, public relations, foreign policy, human welfare, and more. Through advanced visualizations that interpret massive amounts of diverse internal and publicly accessible external data sets, Quid tells a unique and compelling story about the complexity of our world – trends, comparisons, multi-dimensional relationships, etc. – to change the direction of decision making.

For Quid, it’s not about man battling it out with machines, but rather man working with machines to enter a new level of complex problem solving. For example, military intelligence may one day be able to change the direction of future conflicts by working with Quid software to analyze millions of data points from war logs and reports, news articles, and social media about the most recent casualties of war. Intelligence teams plugged into Quid would be able to see the war unfold as it happens across multiple data dimensions and uncover the mathematical patterns hidden in the data that are shaping the direction of the conflict.

[Image: Physics_explore]


I spoke with Quid Co-founder and CTO Sean Gourley to explain how Quid is helping organizations leverage Big Data and augmented intelligence to tackle the Bigger Problems they face in a fast-moving world.

1.  Quid applies Data Intelligence to Big Data – a very different concept than applying Data Science to Big Data. Please explain.

Continue reading