
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments



Page 1

How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments

Robert L. Grossman
University of Chicago and Open Data Group

O'Reilly Strata Conference, March 30, 2016

rgrossman.com  @bobgrossman

Page 2

Introduction to AnalyticOps

Page 3

Software Development

Quality Assurance

Operations

DevOps

The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*
*Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.

Page 4

Analytic Modeling

Quality Assurance

Analytic Operations

AnalyticOps

The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.

Page 5

Analytic Modeling

Quality Assurance

Analytic Operations

AnalyticOps

The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.

•  Software
•  Model
•  Data

Page 6

Analytic strategy and planning

Analytic models & algorithms

Analytic operations

Analytic Infrastructure

*Source: Robert L. Grossman, The Strategy and Practice of Analytics, O'Reilly, 2016, to appear.

Page 7

A Problem

There are platforms and tools for managing and processing big data (Hadoop) and for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services.

Data scientists developing analytic models & algorithms

Analytic infrastructure

Enterprise IT deploying analytics into products, services, and operations

Deploying analytics

Page 8

More Problems

Data scientists developing analytic models & algorithms

Analytic infrastructure

Enterprise IT deploying analytics into products, services, and operations

Deploying analytics

Monitoring operational analytics

ETL and data marts for the modelers

Page 9

Case Study 1: Scoring Engines for Critical Systems

Page 10

Life Cycle of a Predictive Model

Analytic modeling:
•  Select the analytic problem & approach
•  Get and clean the data; exploratory data analysis
•  Build the model in the dev/modeling environment

Analytic operations (the model is deployed, and performance data flows back):
•  Deploy the model in operational systems with a scoring application
•  Scale up the deployment
•  Monitor performance and employ a champion-challenger methodology to develop an improved model
•  Retire the model and deploy the improved model

Page 11

The same life cycle, with the modeling stages labeled "ModelDev" and the operational stages (deploy, scale up, monitor with performance data, retire) labeled "AnalyticOps."
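The champion-challenger step in this life cycle amounts to scoring the incumbent (champion) model and a candidate (challenger) model on the same recent production data and promoting the challenger only if it is clearly better. A minimal sketch of that comparison, assuming scikit-learn-style models and AUC as the metric (both are illustrative choices, not from the talk):

from sklearn.metrics import roc_auc_score

def auc(model, features, labels):
    """Evaluate a model on recent production data (assumes a scikit-learn-style
    predict_proba interface; AUC is an illustrative choice of metric)."""
    return roc_auc_score(labels, model.predict_proba(features)[:, 1])

def select_champion(champion, challenger, features, labels, margin=0.01):
    """Promote the challenger only if it beats the champion by a clear margin."""
    champ, chall = auc(champion, features, labels), auc(challenger, features, labels)
    return challenger if chall > champ + margin else champion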

Page 12

Differences Between the Modeling and Deployment Environments

•  Typically, modelers use specialized languages such as SAS, SPSS, or R.

•  Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, or C++.

•  This can result in significant effort moving the model from the modeling environment to the deployment environment; one common workaround is sketched below.
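When a full scoring engine is not available, a common workaround is to export only the fitted parameters from the modeling environment and re-implement the scoring formula in the deployment language. A minimal sketch of that pattern in Python, for a logistic regression whose coefficients were written out as JSON (the file name and field layout are illustrative assumptions):

import json, math

# Coefficients exported from the modeling environment (e.g., an R or SAS logistic
# regression) as JSON; the file name and layout here are illustrative assumptions.
with open("churn_model_coefficients.json") as f:
    model = json.load(f)   # {"intercept": -1.2, "coefficients": {"tenure": -0.03, ...}}

def score(record):
    """Re-implementation of the logistic scoring formula in the deployment language."""
    z = model["intercept"] + sum(
        coef * record.get(name, 0.0)
        for name, coef in model["coefficients"].items())
    return 1.0 / (1.0 + math.exp(-z))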

Page 13

Ways to Deploy Models into Products/Services/Operations

•  Export and import tables of scores.
•  Export and import tables of parameters.
•  Have the product/service interact with the model as a web or message service (a minimal sketch follows this list).
•  Import the models into a database.
•  Embed the model into a product or service.
•  Push code.

How quickly can the model be updated?
•  Model parameters?
•  New features?
•  New pre- & post-processing?
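The web- or message-service option above can be sketched as a small HTTP endpoint that wraps the model. This sketch uses Flask; the parameter file, its layout, and the endpoint name are illustrative assumptions:

import json
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model_v7.json") as f:   # hypothetical exported parameter file
    params = json.load(f)          # e.g. {"bias": -1.0, "weights": {"age": 0.02, ...}}

@app.route("/score", methods=["POST"])
def score():
    record = request.get_json()
    z = params["bias"] + sum(w * record.get(k, 0.0) for k, w in params["weights"].items())
    return jsonify({"score": z})

if __name__ == "__main__":
    app.run(port=8080)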

Page 14

What Is a Scoring Engine?

•  A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.

•  A model interchange format is a format that supports the export of a model by one application and the import of that model by another application.

•  Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.

•  Scoring engines are integrated once, but let applications update models as quickly as reading a model interchange format file (see the sketch below).
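The "integrate once, update by reading a file" property can be sketched as a scoring component that reloads its model whenever a new interchange document appears. Plain JSON stands in for PMML or PFA here, and the class is illustrative rather than any particular product's API:

import json, os

class ScoringEngine:
    """Illustrative scoring engine: integrated into the application once, with
    models swapped simply by dropping a new interchange file in place."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.mtime = None
        self.reload_if_changed()

    def reload_if_changed(self):
        mtime = os.path.getmtime(self.model_path)
        if mtime != self.mtime:              # a new model file has been published
            with open(self.model_path) as f:
                self.model = json.load(f)
            self.mtime = mtime

    def score(self, record):
        self.reload_if_changed()
        return self.model["bias"] + sum(
            w * record.get(k, 0.0) for k, w in self.model["weights"].items())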

Page 15

Deploying analytic models: from analytic algorithms & models to analytic operations, running on the analytic infrastructure.

Model Producer → export model (PMML & PFA) → import model → Model Consumer

Page 16

Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*

*This case study describes work by the NCI Genomic Data Commons project and the University of Chicago Center for Data Intensive Science.

Page 17

AnalyticOps for the Genomic Data Commons

TCGA dataset: 1.54 PB consisting of 577,878 files, covering 14,052 cases (patients), 42 cancer types, and 29 primary sites.

2.5+ PB of cancer genomics data

Bionimbus data commons technology running multiple community-developed variant-calling pipelines: over 12,000 cores and 10 PB of raw storage in 18+ racks, running for months.

Page 18

DevOps

•  Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code").

•  The requirement to reduce the time to deploy code created tools for continuous integration and testing.

Page 19

ModelDev / AnalyticOps

•  Use virtualization/containers, infrastructure automation, and scale-out to support large-scale analytics.

•  Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.

Page 20

Genomic Data Commons (GDC) Files Vary Over 9 Orders of Magnitude in Size

Page 21

GDC Pipelines Are Complex and Are Mostly Written by Others

Page 22

Computations for a Single Genome Can Take Over a Week

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 23

System Loads Vary Significantly

Page 24

Ten Factors Affecting AnalyticOps

•  Model quality (confusion matrix; see the sketch below)
•  Data quality (six dimensions)
•  Lack of ground truth
•  Software errors
•  Workflow with monitoring
•  Scheduling
•  Bottlenecks, stragglers, hot spots, etc.
•  Analytic configuration problems (DMS*)
•  System failures
•  Human errors

*DMS = data-model-system
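For the first factor, model quality is typically tracked as a confusion matrix over whatever labeled or ground-truth data is available, with summary rates trended on the AnalyticOps dashboard. A minimal sketch (the labels and counts are placeholders):

from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) pairs for any discrete label set."""
    return Counter(zip(y_true, y_pred))

cm = confusion_matrix(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"])
tp, fn = cm[("pos", "pos")], cm[("pos", "neg")]
recall = tp / (tp + fn)   # 2/3 here; trend this over time on the dashboard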

Page 25

Monitor Data Quality and Model Performance and Summarize with Dashboards

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 26

AnalyticOps Dashboard

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 27

Data Quality: Batch Effects Can Be Significant

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 28

Model Quality: Differences in Three Somatic Mutation Detection Algorithms

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.

Page 29

Often Software Must Be Written So That It Can Be Run Efficiently in Automated Environments

•  Generally, community software in bioinformatics is designed to be run manually over local clusters.

•  Example: we patched one piece of software over 400 times so that it could run over 12,000 genomes. Although only 3.3% of genomes had problems, those problems required significant manual effort.

•  AnalyticOps requires operating the software in automated environments.

Page 30

Decide What Not to Compute

[Figure: histogram of VarScan processing rate, with Rate (GB/hour) on the x-axis (0.0 to 2.0) and Frequency on the y-axis (0 to 1,200).]

Manage these cases carefully.
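One way to decide what not to compute is to triage jobs by their observed or estimated processing rate and route the slow tail to manual handling instead of letting it block the automated pipeline. A sketch with an illustrative cutoff:

def triage_jobs(jobs, min_rate_gb_per_hour=0.1):
    """Split jobs into an automated queue and a slow tail to manage by hand.
    The 0.1 GB/hour cutoff is an illustrative assumption, not a recommendation."""
    run, review = [], []
    for job in jobs:   # job: {"name": ..., "size_gb": ..., "est_hours": ...}
        rate = job["size_gb"] / job["est_hours"]
        (run if rate >= min_rate_gb_per_hour else review).append(job)
    return run, review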

Page 31

Model Expected Performance

[Figure: processing time vs. tumor BAM size (GB).]

Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
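Expected processing time can be modeled as a simple function of input size (here, tumor BAM size), and jobs that run far beyond the prediction can be flagged as stragglers. A least-squares sketch with numpy; the data points are placeholders, not GDC measurements:

import numpy as np

# Historical (BAM size in GB, processing hours) pairs; placeholder values.
sizes = np.array([20.0, 45.0, 80.0, 150.0, 310.0])
hours = np.array([5.0, 11.0, 19.0, 40.0, 78.0])

slope, intercept = np.polyfit(sizes, hours, 1)   # hours ≈ slope * GB + intercept

def expected_hours(size_gb):
    return slope * size_gb + intercept

def is_straggler(size_gb, observed_hours, factor=2.0):
    """Flag jobs that run more than `factor` times their expected duration."""
    return observed_hours > factor * expected_hours(size_gb)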

Page 32

Case Study 3: Deploying Gaussian Process Models to the Industrial Internet*

*Thanks to the DMG PMML and PFA Working Groups.

Page 33

Portable Format for Analytics (PFA) Standard

www.dmg.org

Page 34

PFA Is Based Upon Defining Primitives for Analytic Models

•  What would a standard look like that...
   – Defines primitives for data transformations, data aggregations, and statistical and analytic models.
   – Supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
   – Is extensible.
   – Is "safe" to deploy in enterprise IT operational environments.

•  This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).

Page 35

Benefits of PFA

•  PFA is based upon JSON and Avro and integrates easily into modern big data environments.

•  PFA allows models to be easily chained and composed.

•  PFA allows developers and users of analytic systems to pre-process the inputs to models and post-process their outputs.

•  PFA is easily integrated with Storm, Akka, and other streaming environments.

•  PFA can be used to integrate multiple tools and applications within an analytic ecosystem.

Page 36

Gaussian Process Model

Page 37

Example of a PFA model

input: {type: array, items: double}
output: {type: array, items: double}

cells:
  table:
    type:
      type: array
      items:
        type: record
        name: GP
        fields:
          - {name: x, type: {type: array, items: double}}
          - {name: to, type: {type: array, items: double}}
          - {name: sigma, type: {type: array, items: double}}
    init:
      - {x: [  0,   0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
      - {x: [  0,  36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
      - {x: [  0,  72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
      ...
      - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}

action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}

The input and output of the scoring engine are expressed as Avro schemas.

Page 38

Example of a PFA model (continued; same PFA document as on the previous slide)

The cells section holds the Gaussian Process model parameters: a type (also given as an Avro schema) and a value (given as JSON, truncated here).

Page 39

Example of a PFA model (continued; same PFA document as above)

The action section is the calling method, with its parameters expressed as JSON:
•  input: get the interpolation point from the input
•  {cell: table}: get the model parameters from the table
•  null: no explicit Kriging weight (universal Kriging)
•  {fcn: ...}: the kernel function

Page 40

Example of a PFA model

•  This appears declarative, but it is a function call.
   – The fourth parameter is another function: m.kernel.rbf (the radial basis kernel, a.k.a. squared exponential).
   – m.kernel.rbf was intended for SVMs, but is reusable anywhere.
   – One argument (gamma) is pre-applied so that the function fits the signature expected by model.reg.gaussianProcess.

•  Any kernel function could be used, including user-defined functions written in PFA "code."

•  The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.

model.reg.gaussianProcess:
  - input
  - {cell: table}
  - null
  - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
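For completeness, a PFA document like the one above can be executed from Python. The sketch below assumes the open-source Titus library (Open Data Group's Python PFA implementation) and uses a deliberately tiny PFA document rather than the Gaussian Process model, whose parameter table is truncated on the slide; check the calls shown (PFAEngine.fromYaml, engine.action) against the documentation of the version you install.

from titus.genpy import PFAEngine   # assumes the Titus PFA scoring engine is installed

# A deliberately tiny PFA document: add 10 to each input value.
pfa = """
input: double
output: double
action:
  - {+: [input, 10]}
"""

engine, = PFAEngine.fromYaml(pfa)   # a single PFA document may define several engines
print(engine.action(3.14))          # expected output: 13.14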

Page 41

Summary  

Page 42

Ten AnalyticOps Rules

1.  Team a modeler, a software engineer, and a systems engineer.
2.  Instrument and monitor the analytics, software, and systems, and populate an AnalyticOps dashboard.
3.  Use an automated testing and deployment environment to improve model quality.
4.  Use scoring engines with languages such as PFA & PMML.
5.  Put in place a data quality program.
6.  For complex workloads, use workflows and schedulers (even if you think you don't need them initially) and model the scale-up.
7.  Optimize the end-to-end performance of the AnalyticOps, not individual analytics.
8.  Distinguish scores from actions.
9.  Identify and eliminate performance hot spots, system stragglers, etc.
10. Invest in root cause analysis of AnalyticOps problems.

Page 43

Questions?

rgrossman.com  @bobgrossman