Scaling the Data Scientist Dr. Ira Cohen, Chief Data Scientist, HP Software


2 Data Science Office @ HPSW

HP-Software and Data Science

HP-Software products collect huge amounts of IT data

Customers want us to transform the data to actionable information

• System Monitoring
• App Monitoring
• Events
• Security events
• Incidents
• Defects
• Logs
• Changes
• Configuration
• Test data
• Requirements
• Network data

“Big Data & Predictive Analytics: The Future of IT Management”, Mike Gualtieri, Forrester

Need

Expertise
• Expertise in machine learning
• Expertise in the products domain

Infrastructure
• Data platforms
• Development tools

A tale of two worlds

Data Scientists
• Few
• Limited domain knowledge
• Tools: R, MATLAB, Mahout, KNIME, Weka, SAS, …

Developers/SMEs
• Plentiful
• Limited data science knowledge
• Tools: IDEs, Excel

Our solution

Developer → Data analytics specialist

How?
• Training
• Mentoring
• Community
• Data infrastructure
• New dev tool

Training: Practical Machine Learning
• 4-day training
• Commitment to complete first project

Data
• Big data foundations
• Problem definition

Processing
• Attribute construction
• Transformations

Filtering
• Attribute selection
• Dimensionality reduction

Learning
• Supervised
• Unsupervised

Testing
• Validation methods
• Accuracy measures

Practical Machine Learning: Ohad Assulin, Efrat Egozi Levi, Ira Cohen
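The Testing stage of the outline (validation methods and accuracy measures) can be illustrated with a small sketch. This is not material from the course; it assumes k-fold cross-validation as the validation method and accuracy, precision, and recall as the measures. The function names and the threshold "learner" are hypothetical and exist only to make the example runnable.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def accuracy_measures(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def cross_validate(X, y, fit, predict, k=5):
    """Train on k-1 folds, test on the held-out fold, collect the measures."""
    results = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        y_pred = [predict(model, X[i]) for i in held_out]
        results.append(accuracy_measures([y[i] for i in held_out], y_pred))
    return results

# Hypothetical learner: predict 1 when the single attribute exceeds
# the mean of that attribute in the training data.
def fit_threshold(X_train, y_train):
    return sum(x[0] for x in X_train) / len(X_train)

def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0

# Synthetic data, for illustration only.
X = [[float(v)] for v in range(20)]
y = [1 if v >= 10 else 0 for v in range(20)]
scores = cross_validate(X, y, fit_threshold, predict_threshold, k=5)
mean_accuracy = sum(s["accuracy"] for s in scores) / len(scores)
```

Averaging the per-fold measures gives one estimate of how the model behaves on data it did not see during training, which is the point of a validation method.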

• Automatic Event Prioritization: Anat Levinger & Roy Wallerstein
• Automatic Vulnerability Categorization: Barak Raz & Ben Feher
• Classifying Security Events: Yoni Roit & Omer Weissman
• Early detection of anomalous behavior in IT systems: Yonatan Ben Simhon & Yaneeve Shekel
• Cloud Delivery Optimization (CDO): Ran, Levi
• URL to Action Classification: Boaz Shor & Eyal Kenigsberg
• Predictive Analytics in Release Management: Sigalit Sade
• Sales Pipeline Early Warning: Gabriel, Alvarado
• Pushing My Buttons: Gil Zieder, Ofer Eliassaf, Boris Kozorovitzky

The process @ work

As a Pusher or DevOps of a project you would like to know if the given change set is safe to push into the production branch.

Data
• Problem definition
• 9 open source projects, 8806 individual commits
• Get labels of “good” or “bad” commit by running tests after each commit
• “good” – tests pass, “bad” – tests fail

Processing
• Attribute construction
• Normalization
• 80 attributes per commit
• Source control, previous commits, and code complexity based attributes, e.g., average change frequency, previous commit state, cyclomatic complexity

Filtering
• Attribute selection
• Rank based attribute selection

Learning
• Supervised
• Classification
• Classification algorithms: K-NN, SVM, Decision Tree, Random Forest, …

Testing
• Minimize false negatives
• 87% accuracy with K-NN
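The stages above (normalization, rank based attribute selection, K-NN classification, and a check on false negatives) can be sketched in a few lines. This is a minimal illustration, not the project's code. It makes several assumptions: attributes are ranked by the gap between the class means, a false negative is a "bad" commit predicted as "good", and the data is a synthetic, perfectly separable toy set with three attributes rather than the 80 attributes and 8806 commits described in the talk.

```python
import math

def min_max_normalize(rows):
    """Scale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def rank_attributes(rows, labels):
    """Rank attribute indices by the gap between class means, largest first.

    The ranking criterion is an assumption; the talk does not name one.
    """
    good = [r for r, label in zip(rows, labels) if label == "good"]
    bad = [r for r, label in zip(rows, labels) if label == "bad"]

    def gap(j):
        mean_good = sum(r[j] for r in good) / len(good)
        mean_bad = sum(r[j] for r in bad) / len(bad)
        return abs(mean_good - mean_bad)

    return sorted(range(len(rows[0])), key=gap, reverse=True)

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def false_negative_rate(y_true, y_pred):
    """Share of truly bad commits predicted good (assumed definition)."""
    bad_total = sum(1 for t in y_true if t == "bad")
    missed = sum(1 for t, p in zip(y_true, y_pred) if t == "bad" and p == "good")
    return missed / bad_total if bad_total else 0.0

# Synthetic toy commits (not the data from the talk).
# Columns: change frequency, cyclomatic complexity, an uninformative counter.
labels = ["good"] * 30 + ["bad"] * 30
rows = [
    [(1.0 if label == "good" else 4.0) + (i % 5) * 0.5,
     (5.0 if label == "good" else 12.0) + (i % 4),
     float(i % 7)]
    for i, label in enumerate(labels)
]

normalized = min_max_normalize(rows)
top = rank_attributes(normalized, labels)[:2]   # keep the two best-ranked attributes
selected = [[row[j] for j in top] for row in normalized]

# Even-indexed commits train the classifier, odd-indexed commits test it.
train = [i for i in range(len(rows)) if i % 2 == 0]
test = [i for i in range(len(rows)) if i % 2 == 1]
predictions = [
    knn_predict([selected[i] for i in train], [labels[i] for i in train], selected[i])
    for i in test
]
fnr = false_negative_rate([labels[i] for i in test], predictions)
```

Tracking the false negative rate separately from overall accuracy matters here because a bad commit that is waved through is the costly mistake when deciding whether a change set is safe to push.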

Analytic specialist program: Results

• > 70 developers trained (before: 4)
• > 30 new capabilities since April 2013 (before: 1)
• 1 data scientist per 10 new capabilities (before: 1:1)
• Development time reduced by 70% (before: 12 months)

Can we do better?
• Yes. From months to days!
• How?
  – Create a simple tool for analytic specialists
  – Automate the data scientist as much as possible

Project Titan

Titan: Demo

Scaling the data scientist

Analytic specialists
• Develop using standard machine learning
• Use simplified tool

Data Scientist
• Provides expert advice
• Develops new types of machine learning solutions