
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments



Page 1

How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments

Robert L. Grossman
University of Chicago and Open Data Group

O'Reilly Strata Conference, March 30, 2016

rgrossman.com  @bobgrossman

Page 2

Introduction to AnalyticOps

Page 3

Software Development

Quality Assurance

Operations

DevOps

The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*
*Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.

Page 4

Analytic Modeling

Quality Assurance

Analytic Operations

AnalyticOps

The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.

Page 5

Analytic Modeling

Quality Assurance

Analytic Operations

AnalyticOps

The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.

•  Software
•  Model
•  Data

Page 6

Analytic strategy and planning

Analytic models & algorithms

Analytic operations

Analytic Infrastructure

*Source: Robert L. Grossman, The Strategy and Practice of Analytics, O'Reilly, 2016, to appear.

Page 7

A Problem

There are platforms and tools for managing and processing big data (Hadoop) and for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services.

Data scientists developing analytic models & algorithms

Analytic infrastructure

Enterprise IT deploying analytics into products, services, and operations

Deploying analytics

Page 8

More Problems

Data scientists developing analytic models & algorithms

Analytic infrastructure

Enterprise IT deploying analytics into products, services, and operations

Deploying analytics

Monitoring operational analytics

ETL and data marts for the modelers

Page 9

Case Study 1: Scoring Engines for Critical Systems

Page 10

Life Cycle of a Predictive Model

Analytic modeling:
•  Select the analytic problem & approach
•  Get and clean the data; exploratory data analysis
•  Build the model in the dev/modeling environment

Analytic operations (the model is deployed, and performance data flows back):
•  Deploy the model in operational systems with a scoring application
•  Scale up the deployment
•  Monitor performance and employ a champion-challenger methodology to develop an improved model
•  Retire the model and deploy the improved model

Page 11

The same life cycle, with the modeling stages labeled "ModelDev" and the operational stages (deploy, scale up, monitor with performance data, retire) labeled "AnalyticOps."
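The champion-challenger step in this life cycle amounts to scoring the incumbent (champion) model and a candidate (challenger) model on the same recent production data and promoting the challenger only if it is clearly better. A minimal sketch of that comparison, assuming scikit-learn-style models and AUC as the metric (both are illustrative choices, not from the talk):

from sklearn.metrics import roc_auc_score

def auc(model, features, labels):
    """Evaluate a model on recent production data (assumes a scikit-learn-style
    predict_proba interface; AUC is an illustrative choice of metric)."""
    return roc_auc_score(labels, model.predict_proba(features)[:, 1])

def select_champion(champion, challenger, features, labels, margin=0.01):
    """Promote the challenger only if it beats the champion by a clear margin."""
    champ, chall = auc(champion, features, labels), auc(challenger, features, labels)
    return challenger if chall > champ + margin else champion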

Page 12

Differences Between the Modeling and Deployment Environments

•  Typically, modelers use specialized languages such as SAS, SPSS, or R.

•  Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, or C++.

•  This can result in significant effort moving the model from the modeling environment to the deployment environment; one common workaround is sketched below.
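When a full scoring engine is not available, a common workaround is to export only the fitted parameters from the modeling environment and re-implement the scoring formula in the deployment language. A minimal sketch of that pattern in Python, for a logistic regression whose coefficients were written out as JSON (the file name and field layout are illustrative assumptions):

import json, math

# Coefficients exported from the modeling environment (e.g., an R or SAS logistic
# regression) as JSON; the file name and layout here are illustrative assumptions.
with open("churn_model_coefficients.json") as f:
    model = json.load(f)   # {"intercept": -1.2, "coefficients": {"tenure": -0.03, ...}}

def score(record):
    """Re-implementation of the logistic scoring formula in the deployment language."""
    z = model["intercept"] + sum(
        coef * record.get(name, 0.0)
        for name, coef in model["coefficients"].items())
    return 1.0 / (1.0 + math.exp(-z))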

Page 13

Ways to Deploy Models into Products/Services/Operations

•  Export and import tables of scores.
•  Export and import tables of parameters.
•  Have the product/service interact with the model as a web or message service (a minimal sketch follows this list).
•  Import the models into a database.
•  Embed the model into a product or service.
•  Push code.

How quickly can the model be updated?
•  Model parameters?
•  New features?
•  New pre- & post-processing?
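The web- or message-service option above can be sketched as a small HTTP endpoint that wraps the model. This sketch uses Flask; the parameter file, its layout, and the endpoint name are illustrative assumptions:

import json
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model_v7.json") as f:   # hypothetical exported parameter file
    params = json.load(f)          # e.g. {"bias": -1.0, "weights": {"age": 0.02, ...}}

@app.route("/score", methods=["POST"])
def score():
    record = request.get_json()
    z = params["bias"] + sum(w * record.get(k, 0.0) for k, w in params["weights"].items())
    return jsonify({"score": z})

if __name__ == "__main__":
    app.run(port=8080)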

Page 14

What Is a Scoring Engine?

•  A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.

•  A model interchange format is a format that supports the export of a model by one application and the import of that model by another application.

•  Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.

•  Scoring engines are integrated once, but let applications update models as quickly as reading a model interchange format file (see the sketch below).
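The "integrate once, update by reading a file" property can be sketched as a scoring component that reloads its model whenever a new interchange document appears. Plain JSON stands in for PMML or PFA here, and the class is illustrative rather than any particular product's API:

import json, os

class ScoringEngine:
    """Illustrative scoring engine: integrated into the application once, with
    models swapped simply by dropping a new interchange file in place."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.mtime = None
        self.reload_if_changed()

    def reload_if_changed(self):
        mtime = os.path.getmtime(self.model_path)
        if mtime != self.mtime:              # a new model file has been published
            with open(self.model_path) as f:
                self.model = json.load(f)
            self.mtime = mtime

    def score(self, record):
        self.reload_if_changed()
        return self.model["bias"] + sum(
            w * record.get(k, 0.0) for k, w in self.model["weights"].items())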

Page 15

Deploying analytic models: from analytic algorithms & models to analytic operations, running on the analytic infrastructure.

Model Producer → export model (PMML & PFA) → import model → Model Consumer

Page 16

Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*

*This case study describes work by the NCI Genomic Data Commons project and the University of Chicago Center for Data Intensive Science.

Page 17

AnalyticOps for the Genomic Data Commons

TCGA dataset: 1.54 PB consisting of 577,878 files, covering 14,052 cases (patients), 42 cancer types, and 29 primary sites.

2.5+ PB of cancer genomics data

Bionimbus data commons technology running multiple community-developed variant-calling pipelines: over 12,000 cores and 10 PB of raw storage in 18+ racks, running for months.

Page 18

DevOps

•  Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code").

•  The requirement to reduce the time to deploy code created tools for continuous integration and testing.

Page 19

ModelDev / AnalyticOps

•  Use virtualization/containers, infrastructure automation, and scale-out to support large-scale analytics.

•  Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.

Page 20

Genomic Data Commons (GDC) Files Vary Over 9 Orders of Magnitude in Size

Page 21

GDC Pipelines Are Complex and Are Mostly Written by Others

Page 22

Computations for a Single Genome Can Take Over a Week

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 23

System Loads Vary Significantly

Page 24

Ten Factors Affecting AnalyticOps

•  Model quality (confusion matrix; see the sketch below)
•  Data quality (six dimensions)
•  Lack of ground truth
•  Software errors
•  Workflow with monitoring
•  Scheduling
•  Bottlenecks, stragglers, hot spots, etc.
•  Analytic configuration problems (DMS*)
•  System failures
•  Human errors

*DMS = data-model-system
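For the first factor, model quality is typically tracked as a confusion matrix over whatever labeled or ground-truth data is available, with summary rates trended on the AnalyticOps dashboard. A minimal sketch (the labels and counts are placeholders):

from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) pairs for any discrete label set."""
    return Counter(zip(y_true, y_pred))

cm = confusion_matrix(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"])
tp, fn = cm[("pos", "pos")], cm[("pos", "neg")]
recall = tp / (tp + fn)   # 2/3 here; trend this over time on the dashboard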

Page 25

Monitor Data Quality and Model Performance and Summarize with Dashboards

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 26

AnalyticOps Dashboard

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 27

Data Quality: Batch Effects Can Be Significant

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 28

Model Quality: Differences in Three Somatic Mutation Detection Algorithms

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 29

Often Software Must Be Written So That It Can Be Run Efficiently in Automated Environments

•  Generally, community software in bioinformatics is designed to be run manually over local clusters.

•  Example: we patched one piece of software over 400 times so that it could run over 12,000 genomes. Although only 3.3% of genomes had problems, those problems required significant manual effort.

•  AnalyticOps requires operating the software in automated environments.

Page 30

Decide What Not to Compute

[Figure: histogram of VarScan processing rate, with Rate (GB/hour) on the x-axis (0.0 to 2.0) and Frequency on the y-axis (0 to 1,200).]

Manage these cases carefully.
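One way to decide what not to compute is to triage jobs by their observed or estimated processing rate and route the slow tail to manual handling instead of letting it block the automated pipeline. A sketch with an illustrative cutoff:

def triage_jobs(jobs, min_rate_gb_per_hour=0.1):
    """Split jobs into an automated queue and a slow tail to manage by hand.
    The 0.1 GB/hour cutoff is an illustrative assumption, not a recommendation."""
    run, review = [], []
    for job in jobs:   # job: {"name": ..., "size_gb": ..., "est_hours": ...}
        rate = job["size_gb"] / job["est_hours"]
        (run if rate >= min_rate_gb_per_hour else review).append(job)
    return run, review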

Page 31

Model Expected Performance

[Figure: processing time vs. tumor BAM size (GB).]

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
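Expected processing time can be modeled as a simple function of input size (here, tumor BAM size), and jobs that run far beyond the prediction can be flagged as stragglers. A least-squares sketch with numpy; the data points are placeholders, not GDC measurements:

import numpy as np

# Historical (BAM size in GB, processing hours) pairs; placeholder values.
sizes = np.array([20.0, 45.0, 80.0, 150.0, 310.0])
hours = np.array([5.0, 11.0, 19.0, 40.0, 78.0])

slope, intercept = np.polyfit(sizes, hours, 1)   # hours ≈ slope * GB + intercept

def expected_hours(size_gb):
    return slope * size_gb + intercept

def is_straggler(size_gb, observed_hours, factor=2.0):
    """Flag jobs that run more than `factor` times their expected duration."""
    return observed_hours > factor * expected_hours(size_gb)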

Page 32

Case Study 3: Deploying Gaussian Process Models to the Industrial Internet*

*Thanks to the DMG PMML and PFA Working Groups.

Page 33

Portable Format for Analytics (PFA) Standard

www.dmg.org

Page 34

PFA Is Based Upon Defining Primitives for Analytic Models

•  What would a standard look like that...
   – Defines primitives for data transformations, data aggregations, and statistical and analytic models.
   – Supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
   – Is extensible.
   – Is "safe" to deploy in enterprise IT operational environments.

•  This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).

Page 35

Benefits of PFA

•  PFA is based upon JSON and Avro and integrates easily into modern big data environments.

•  PFA allows models to be easily chained and composed.

•  PFA allows developers and users of analytic systems to pre-process the inputs to models and post-process their outputs.

•  PFA is easily integrated with Storm, Akka, and other streaming environments.

•  PFA can be used to integrate multiple tools and applications within an analytic ecosystem.

Page 36

Gaussian Process Model

Page 37

Example of a PFA model

input: {type: array, items: double}
output: {type: array, items: double}

cells:
  table:
    type:
      type: array
      items:
        type: record
        name: GP
        fields:
          - {name: x, type: {type: array, items: double}}
          - {name: to, type: {type: array, items: double}}
          - {name: sigma, type: {type: array, items: double}}
    init:
      - {x: [  0,   0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
      - {x: [  0,  36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
      - {x: [  0,  72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
      ...
      - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}

action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}

The input and output of the scoring engine are expressed as Avro schemas.

Page 38

Example of a PFA model (continued; same PFA document as on the previous slide)

The cells section holds the Gaussian Process model parameters: a type (also given as an Avro schema) and a value (given as JSON, truncated here).

Page 39

Example of a PFA model (continued; same PFA document as above)

The action section is the calling method, with its parameters expressed as JSON:
•  input: get the interpolation point from the input
•  {cell: table}: get the model parameters from the table
•  null: no explicit Kriging weight (universal Kriging)
•  {fcn: ...}: the kernel function

Page 40

Example of a PFA model

•  This appears declarative, but it is a function call.
   – The fourth parameter is another function: m.kernel.rbf (the radial basis kernel, a.k.a. squared exponential).
   – m.kernel.rbf was intended for SVMs, but is reusable anywhere.
   – One argument (gamma) is pre-applied so that the function fits the signature expected by model.reg.gaussianProcess.

•  Any kernel function could be used, including user-defined functions written in PFA "code."

•  The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.

model.reg.gaussianProcess:
  - input
  - {cell: table}
  - null
  - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
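For completeness, a PFA document like the one above can be executed from Python. The sketch below assumes the open-source Titus library (Open Data Group's Python PFA implementation) and uses a deliberately tiny PFA document rather than the Gaussian Process model, whose parameter table is truncated on the slide; check the calls shown (PFAEngine.fromYaml, engine.action) against the documentation of the version you install.

from titus.genpy import PFAEngine   # assumes the Titus PFA scoring engine is installed

# A deliberately tiny PFA document: add 10 to each input value.
pfa = """
input: double
output: double
action:
  - {+: [input, 10]}
"""

engine, = PFAEngine.fromYaml(pfa)   # a single PFA document may define several engines
print(engine.action(3.14))          # expected output: 13.14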

Page 41

Summary  

Page 42

Ten AnalyticOps Rules

1.  Team a modeler, a software engineer, and a systems engineer.
2.  Instrument and monitor the analytics, software, and systems, and populate an AnalyticOps dashboard.
3.  Use an automated testing and deployment environment to improve model quality.
4.  Use scoring engines with languages such as PFA & PMML.
5.  Put in place a data quality program.
6.  For complex workloads, use workflows and schedulers (even if you think you don't need them initially) and model the scale-up.
7.  Optimize the end-to-end performance of the AnalyticOps, not individual analytics.
8.  Distinguish scores from actions.
9.  Identify and eliminate performance hot spots, system stragglers, etc.
10. Invest in root cause analysis of AnalyticOps problems.

Page 43

Questions?

rgrossman.com  @bobgrossman