Want To Build A Data Science Team? EMC Offers a Holistic Approach

Mona Patel
Mona Patel is a Senior Manager for Big Data Marketing at EMC Corporation. With over 15 years of working with data at The Department of Water and Power, Air Touch Communications, Oracle, and MicroStrategy, Mona decided to grow her career at EMC, a leader in Big Data.

Many of our customers invest in big data solutions to target their sales prospects better, explore advanced medical research, and make their internal processes more efficient. The biggest obstacle to getting these initiatives out of the gate is the shortage of big data skills within their own firms and across the industry.

To address this skills gap, EMC has developed a thorough data science and big data analytics curriculum for our customers. EMC was one of the first companies to offer data science education with rigorous, live instruction using free and open source tools. As of today, more than 10,000 customers, partners, and college students have attended the training.


I spoke with EMC’s David Dietrich, who leads this unique program, to discuss his approach to data science education, which differs from more traditional product-oriented education. What I found most interesting is that, in addition to his work at EMC, David has also helped design big data analytics curricula for Babson College and other universities. More recently, David published a book, Data Science and Big Data Analytics, to help further develop data science skills and expertise in the industry.

1.  Why is EMC pushing so hard to educate and develop data scientists?

As an information company, we’re extremely attuned to the value of big data, which is exploding both in sheer volume and in the ways organizations in virtually every field and industry are using it to solve critical problems. When EMC acquired our first big data company, Greenplum, several years ago, we quickly became aware that there was a shortage of people who had the data science and business skills to help companies utilize big data.

2.  How is EMC taking a holistic approach to data science education?

We recognize that learning how to use big data technology alone does not ensure success. Senior management must make sure that appropriate people and processes are in place to drive the change and innovation necessary for valuable big data results to occur. To help companies on their journey, we offer courses for data scientists, who execute big data projects, and business executives who sponsor, run and manage them.

Our goal is to educate all levels of an organization so that data scientists and business people understand one another. That way, the organization is able to roll out big data projects with greater adoption and success. In addition to offering courses to our customers, we also work closely with universities and educational institutions to help them develop their own curriculum and programs.

3.  Please describe some of the important skills for aspiring data scientists.

Working in strategy and analytics for the past 20 years, I’ve always been drawn to experimenting with data to solve problems, which is exactly the mindset you need to tackle big data. Companies often ask me how to go about using massive amounts of structured and unstructured data to solve business problems. How do they know what to choose and what to ignore? How do they know which algorithms to apply? Our courses encourage a culture of experimentation that leads to answering these questions. We teach our students how to test an idea with data, measure it quantitatively, learn from it, and iterate. This test-and-learn mindset is critical to becoming a talented data scientist and a data-driven organization.

4.  What are some of the challenges with evolving into a data-driven organization?

There can be a substantial divide between data scientists and business people who manage and work with them on big data projects. Many business people lack the technical background to understand how the algorithms apply to the problem and how to test ideas with data. And some data scientists may not understand the business context. We’re trying to educate each side so they can get a clearer picture and drive toward common goals. Once you bridge that gap, you can start driving real change, and solving old problems with big data or new information sources that were once unusable.

5.  What should companies expect after they have successfully made the leap to big data?

We’re educating them in how to train and staff a big data team, as well as build processes to be effective and successful. With this approach, companies can more effectively define the business problem, acquire the right data sets, experiment, communicate the results, and finally, operationalize the new processes.

EMC CIO Takes On Big Data Problems With Big Data Analytics

Mona Patel

Every second of every day, IT generates enormous amounts of data around operational activity – system behavior, application performance, user actions, security activity, and more. Instead of viewing this data explosion as a Big Data problem, IT views it as an opportunity to use Big Data solutions such as IT Operations Analytics to improve the quality of its services.


For example, 75% of IT professionals surveyed recently said they believe that IT Operations Analytics can transform data into relevant insights and actionable plans for improvement. I spoke with EMC CIO Vic Bhagat, who described how EMC is embracing Big Data for IT Operations Analytics to solve critical problems affecting EMC IT Operations and customers.

1.  What are the biggest problems faced by IT Operations Management at EMC and how were these problems addressed before the world of Big Data?

IT generates enormous amounts of data when monitoring complex, rapidly growing and changing IT infrastructures and applications. The challenge for IT Operations Management is to leverage this data to build an adaptive system that is more proactive and less reactive. The more the system can learn from the data, the better it can identify variances and problem areas in a timely manner, helping IT fix issues before they negatively impact the business through downtime or poor performance.

In the past, we relied on traditional business intelligence and data warehousing systems to gain intelligence or insight based on historical trends. Now, with analytics, we can uncover important variables and modify them to predict an outcome. And, the more data we collect at a detailed level, the more accurate we can be.

2.  How does Big Data analytics change the game to address these problems more effectively?

It cuts down the time to gain insight. The most heavily used word after ‘selfie’ is now ‘data lake’. Everyone wants to build a data lake since it provides the right architecture and capabilities to cut down the cycle time in deriving newer, predictive insight, and then continuously integrating these results back into our business processes and decision-making. At EMC, we are moving away from data warehouses to a data lake architecture enabling us to not only gain faster insight, but also gain newer insight by bringing together and analyzing both structured and unstructured data.

For example, in a data warehouse you manage structured data such as part numbers, bay numbers, disk numbers, chassis numbers, and more. In a data lake you can manage all of this structured data in addition to unstructured data such as user manuals for each system and component. Let’s now apply this data lake solution to a use case – we continuously monitor the health of a customer’s infrastructure with our call home systems. We can now leverage a data lake with more data sets to not only make more accurate component failure predictions, but we can also provide the relevant information needed from user manuals to fix the problem in a timely manner so the customer experiences no downtime.
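The call home use case above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not a description of EMC's actual system: the `Reading` fields, the `MANUALS` excerpts, the part numbers, and the error threshold are all hypothetical, and a simple threshold stands in for a trained failure-prediction model. The point is just that structured readings and unstructured manual text can be queried side by side.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One structured call-home record (all fields are hypothetical)."""
    part_number: str
    bay: int
    media_errors: int


# Hypothetical unstructured data: user manual excerpts kept as plain text.
MANUALS = {
    "disk-replacement.txt": "To replace a disk, release the DISK-100 carrier latch.",
    "fan-service.txt": "Fan module FAN-20 is hot-swappable.",
}


def at_risk(readings, error_threshold=50):
    """Stand-in for a predictive model: flag parts with many media errors."""
    return [r for r in readings if r.media_errors >= error_threshold]


def manual_passages(part_number, manuals=MANUALS):
    """Return the names of manual files whose text mentions the part number."""
    return sorted(name for name, text in manuals.items() if part_number in text)


readings = [Reading("DISK-100", 3, 72), Reading("FAN-20", 1, 0)]

# Pair each at-risk component with the manual excerpts needed to fix it.
tickets = [
    (r.part_number, r.bay, manual_passages(r.part_number))
    for r in at_risk(readings)
]
```

In this toy data only the disk in bay 3 crosses the threshold, so `tickets` holds a single entry that points at the disk replacement excerpt.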

3.  What is EMC’s IT Operations Analytics solution leveraging Big Data technologies and techniques?

We are leveraging the entire Pivotal Big Data Suite to ingest and store all of the structured and unstructured data – Pivotal GemFire XD, Pivotal HD, Pivotal HAWQ, and Pivotal Greenplum Database. Our Data Scientists are then able to apply advanced analytic techniques to the data they need using their choice of tools: MADlib, R, and Python. This Big Data environment will be part of a wider business data lake strategy, where all enterprise data will be managed, accessed, and used equally by all business applications, not just IT Operations. Only a few legacy or specialized applications will stand alone.

4. What benefits has EMC gained from this Big Data solution?

The benefits are enormous, on both the business and the technical side. Building predictive models and predicting imminent system failure reduces downtime and the number of alerts, and it enables us to identify the real issues faster, shortening the cycle for making decisions and taking corrective action. This improves our performance, our productivity, and the value we gain from Big Data.

But we are only scratching the surface. The more we can optimize our Big Data environment so that it is elastic and accessible, the faster and more precise Data Scientists will be in solving problems. For example, we can now predict MS Exchange outages two hours in advance.

5. One of the biggest barriers to getting value from Big Data is the skills shortage. How does EMC IT Operations address this issue?

EMC had the foresight to build Centers of Excellence (COE) around the globe, producing the expertise and skills needed to transition into the realm of Data Science. We are fortunate to leverage talent within the company, but also leverage the COE to attract and acquire new Data Science talent outside the company.

6. What books are you currently reading on your Kindle or if you are still paper based like me, what books are stacked on your nightstand?

I’m Kindle based, so I read periodicals such as Techmeme and Engadget. Since we are a company that is data and digital driven, I am reading a book called ‘Leading Digital’. I want to help lead this digital revolution at EMC, and this book provides great examples of how digital makes significant changes in how a company operates and kills bureaucracy.

Federation Security Analytics: A Data Science Approach

Mona Patel

Too many alerts with little to no context: that is the state of today’s information security landscape. For example, it’s common for an enterprise that has been breached to have received an alert from a security tool, only to have it lost in the noise of many other threats coming in at the rate of hundreds per day. To add to the flurry of alerts, security threats are constantly changing and getting more complex due to a changing and complex IT environment, making it difficult to map out a single attack across all of the different infrastructure touch points. And as security teams and tools get wise to the tactics, the threats will continue to evolve to thwart them.

The key is to develop a security analytics infrastructure facilitating data science techniques that can evolve as the threats evolve. Additionally, taking a Data Science approach to security threats aims to reduce the flurry of alerts and to provide more context for each alert, so that alerts can be prioritized, analyzed for root cause more efficiently, and quickly resolved. This is the goal of Federation Security Analytics, as it combines the technology power of a Data Lake with proven Data Science applications to:

-See and understand everything happening in your environment

-Detect and prioritize the most advanced attacks, including long and slow attacks that happen over time

-Investigate and remediate incidents with unprecedented precision and speed


I spoke with RSA Senior Manager David Mitchell to discuss how Federation Security Analytics can better spot today’s attacks, plus provide an adaptable infrastructure to protect the organization as attacks evolve and become more sophisticated.

1.  What are the biggest problems faced by Security Operations Centers and how does traditional SIEM fail to address these challenges?

Real threats today are more advanced and targeted, some aimed at locating specific information through an individual or use case. They are also constantly changing, targeting an environment that is not owned by the enterprise. Applications in the cloud, public networks, and mobile devices now contribute to threats outside a well-defined enterprise perimeter. The perimeter is now more porous; therefore, traditional SIEM tools that are signature or perimeter-based cannot effectively identify many of today’s attacks.

2.  How does Big Data with Data Science change the game to address these problems more effectively?

It does two things – it allows you to collect everything through an engineered big data infrastructure and enrich this data to identify high-value, high-risk assets. Once you have determined what your high-value, high-risk assets are, you prioritize them and collect everything around those assets – logs, network packets, endpoint data, and more.

To spot an advanced threat or a threat that has not advertised itself (an unknown threat), you cannot use traditional signature-based techniques, since you cannot create a signature of a threat that has never existed before. Using security data science you can remove the hay in the environment, extract information that does not make sense, and flag it to determine if a threat is real. This approach can also reduce the flurry of alerts and false positives, and it provides more context for each alert so that alerts can be prioritized, analyzed for root cause more efficiently, and quickly resolved.

3.  What is the Federation Security Analytics solution and what makes it unique?

Federation Security Analytics includes technology and expertise across RSA, Pivotal, and EMC II. It is unique in that it is not just a suggested architecture, but an engineered and field-tested solution that enables you to simultaneously collect required security data, analyze it, and create alerts. You can reliably collect all your data and send real-time alerts without that work being affected by analysts interacting with the data, and vice versa. The solution is also packaged with services from the RSA Global Services organization to install and configure it in your environment.

4.  Can you describe a use case addressed by Federation Security Analytics?

Covert channel activity is one use case. These are long-running advanced persistent threats that are difficult to detect. Detecting them requires monitoring inbound and outbound connections, identifying internal hosts with strange outbound communication patterns (beaconing), and spotting the external hosts that are most likely to be compromised (high-risk suspicious domains). Being able to detect beaconing and suspicious domains will then allow you to identify the source of the attack. From this point, analysts can immediately pivot to identify the users that are under attack. The method for uncovering covert channel activity or malicious behavior requires the collection and analysis of multiple pieces of data over an extended time period so you can identify normal behavior and apply a weighted probability risk score to all subsequent behaviors.
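To make the idea of beaconing concrete, here is a minimal Python sketch. It is not the method used by Federation Security Analytics, which the interview does not describe in detail. It assumes one simple heuristic: outbound connections from a host to a domain that arrive at very regular intervals look more like automated beaconing than human activity. The function name, the event format, the `min_connections` cutoff, and the example hosts and domains are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def beacon_scores(events, min_connections=5):
    """Score each (internal_host, external_domain) pair by timing regularity.

    events: iterable of (host, domain, timestamp_in_seconds) tuples.
    Returns a dict mapping pair -> score in [0, 1]; 1.0 means the gaps
    between connections are perfectly regular, 0.0 means highly irregular.
    """
    times = defaultdict(list)
    for host, domain, ts in events:
        times[(host, domain)].append(ts)

    scores = {}
    for pair, stamps in times.items():
        if len(stamps) < min_connections:
            continue  # too few observations to judge regularity
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        average_gap = mean(gaps)
        if average_gap == 0:
            continue  # all connections share one timestamp; nothing to measure
        # Coefficient of variation: spread of the gaps relative to their mean.
        variation = pstdev(gaps) / average_gap
        scores[pair] = max(0.0, 1.0 - variation)
    return scores


# Hypothetical traffic: one host calls out every 300 seconds, another browses.
events = [("10.0.0.5", "updates.example.net", t) for t in range(0, 3600, 300)]
events += [("10.0.0.7", "news.example.org", t) for t in (3, 40, 41, 900, 2500, 2510)]
scores = beacon_scores(events)
```

Here the host that connects every 300 seconds scores 1.0, while the irregular browsing pattern scores 0.0. A production system would presumably combine a signal like this with others, such as domain reputation, into the kind of weighted risk score described above.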

5.  One of the biggest barriers to getting value from Big Data is the skills shortage. How does EMC address this issue?

Because this is an engineered solution, it removes the need for the infrastructure skills otherwise required to architect a reliable, high-performing big data system that provides visibility to all of the data that is collected, analysis of both real-time and historical data, and generation of actionable results through real-time and long and slow alerting.

It also reduces the need for up-front security Data Science skills, as the solution provides three security analytics applications, or use cases, using Data Science techniques out of the box. Through the community and security expertise from RSA, we will continue to develop and provide additional use cases to our customers. As threats continue to evolve, the enterprise can be better positioned to adapt to changes in threat strategy, as well as easily scale and modify its infrastructure without having to reinvent the solution.

All Paths Lead To A Federation Data Lake

Mona Patel

Is your organization constrained by 2nd platform data warehouse technologies, with limited or no budget to move toward 3rd platform agile technologies such as a Data Lake? As an EMC customer you have the advantage of leveraging existing EMC investments to develop a Federation Data Lake at minimal cost. Additionally, the Federation Data Lake will generate healthy returns, as it is packaged with the expertise needed to immediately execute on data lake use cases such as data warehouse ETL offloading and archiving.


With the release of William Schmarzo’s Five Tactics to Modernize Your Existing Data Warehouse, I wanted to explore whether the Dean of Big Data views data warehouse modernization tactics, or paths, as ultimately leading to a Federation Data Lake.

1.  What is a Data Lake and who should care?
