AnalyticOps
DevOps for Data Science
Ryan Krebs, Matt von Rohr

SDS2018 CI CD Pipeline€¦ · Current Data Science Production Challenges •IT sent software to re-implement and deploy ... - Ron Bodkin 2016’’ DELIVERY EXCELLENCE DATA SCIENTIST




The „Model“ Assembly Line

• Ransom Eli Olds
• Ford Motor Company – „Moving Conveyor“
• Current Data Science Production Challenges
• Dockerized Model Management

https://en.wikipedia.org/wiki/Ford_Model_T


© 2018 Teradata

Machine Learning Silos

• Ops: In Operations we outsource our analytics to a 3rd party vendor.
• IT: IT has its own AWS Kubernetes clusters running Dockerized Python models.
• HR: HR exports from QlikView, runs ML models on laptops in R, then distributes results in Excel.
• Finance: The Finance data science team has a deep learning model written in Scala using the On-Prem DEV Hadoop cluster.


Current Data Science Production Challenges

IT
• IT is sent software to re-implement and deploy
• Ad-hoc process

Analytics
• Data scientist sits in an analytic silo
• Custom datasets
• Variety of modelling techniques and technologies
• Focus on the trained model's historical performance

Business
• Business reviews reports on models
• Multiple stakeholders and objectives

Artefacts handed over: Trained Models, Performance Reports


• Inconsistent: Data used in Analytics and Production?; Multiple reporting tools
• Manual: Model training; Custom DS reports; IT re-writes; Meetings to approve
• Opaque: How were models trained? Approvals?
• Slow: All of the above!

• Consistency: Unit testing to ensure correct data ingest; Templated model reports
• Automation: CI model builds; Auto-generated reports; Trained models are production-ready software; UI gathers approvals
• Transparency: VCS; Model metadata; UI surfaces metadata
• Agility: All of the above!
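One way to read "unit testing to ensure correct data ingest" is a small schema check that runs in CI before any model is trained. The sketch below is illustrative only: the schema, the column names, and the `validate_rows` helper are hypothetical and do not come from the slides.

```python
# Hypothetical schema for illustration: column name -> expected Python type.
EXPECTED_SCHEMA = {"customer_id": int, "amount": float, "country": str}


def validate_rows(rows, schema=EXPECTED_SCHEMA):
    """Return a list of problems found; an empty list means the ingest passed."""
    problems = []
    for i, row in enumerate(rows):
        missing = set(schema) - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue  # type checks are meaningless if columns are absent
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                problems.append(
                    f"row {i}: {col} should be {typ.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return problems


if __name__ == "__main__":
    good = [{"customer_id": 1, "amount": 2.5, "country": "DK"}]
    bad = [{"customer_id": "1", "amount": 2.5, "country": "DK"}]
    print(validate_rows(good))
    print(validate_rows(bad))
```

A CI job could fail the build whenever the returned list is non-empty, so that analytics and production are checked against the same data contract.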


AnalyticOps: Simplified
Dockerized Model Management

• Model Metadata
• Scoring Services
• Champion/Challenger Automation
• Business and Data Science Approvals
• Auditability
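The champion/challenger, approvals, and auditability bullets can be combined in a single promotion rule. The following is a minimal sketch under stated assumptions, not the presenters' implementation: `ModelRecord`, `select_champion`, the approval names, and the "higher metric is better" convention are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumption: both a business and a data science sign-off are required.
REQUIRED_APPROVALS = frozenset({"business", "data_science"})


@dataclass
class ModelRecord:
    name: str
    version: str
    metric: float  # assumed: higher is better (e.g. detection rate)
    approvals: set = field(default_factory=set)


def select_champion(champion, challenger, min_gain=0.0, audit_log=None):
    """Promote the challenger only if it is approved and beats the champion."""
    approved = REQUIRED_APPROVALS <= challenger.approvals
    better = challenger.metric > champion.metric + min_gain
    promoted = approved and better
    if audit_log is not None:
        # Record every decision, including rejections, for auditability.
        audit_log.append({
            "champion": f"{champion.name}:{champion.version}",
            "challenger": f"{challenger.name}:{challenger.version}",
            "approved": approved,
            "better": better,
            "promoted": promoted,
        })
    return challenger if promoted else champion


if __name__ == "__main__":
    log = []
    live = ModelRecord("fraud", "1.0", 0.80, {"business", "data_science"})
    candidate = ModelRecord("fraud", "1.1", 0.85, {"business"})
    print(select_champion(live, candidate, audit_log=log).version)
    candidate.approvals.add("data_science")
    print(select_champion(live, candidate, audit_log=log).version)
```

Logging rejected challengers as well as promotions keeps the decision trail complete, which is the point of the auditability bullet.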


Keys to success – IN THEORY

- Ron Bodkin 2016

The Approach
• Software Engineering
• Data Science
• Business
• AnalyticOps

The Team
• DevOps Engineer
• Delivery Excellence
• Data Scientist
• Systems Architect
• Business Expert
• Software Engineer+


Hybrid Team: Unicorn vs Chimera

Unicorn
• Hard to find
• Expensive
• Hard to retain and inefficient

Chimera
• Statistician + a little bit of a DE
• Consultant + a little bit of a DS
• BA + a little bit of a Developer
• ETL Dev + a little bit of Statistics


AnalyticOps: Potential Components

Data scientist making models
The business using a trained model
Value

Exploration
• Data Wrangling
• DS Lab
• Model scripting (untrained models)
• Testing, Training, Model Evaluation
• Version Control
• Dependency Management

Automation
• Software unit tests
• Model Training
• Storage of trained models
• Model Evaluation
• Model Business Approval/Report Creation
• Comparison vs current Live model (Champion/Challenger)

Consumption
• Real-time model scoring engines
• Automatic deployment of trained model artefacts
• Dashboards and forecasts updated using new models
• Model performance monitoring
• Model output logging

Involving: Analysts, Data Scientists, Engineers, Dev Ops, Business Stakeholders
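"Storage of trained models" together with "Version Control" suggests keeping each trained artefact next to metadata that says how it was produced. The sketch below shows one possible layout using only the Python standard library. The file naming, the metadata fields, and the `store_model`/`load_model` helpers are assumptions for illustration, and `pickle` should only ever be used to load artefacts from a trusted store.

```python
import hashlib
import json
import pickle
import time
from pathlib import Path


def store_model(model, directory, name, version, params, vcs_revision):
    """Write the pickled model plus a JSON metadata file; return the metadata."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    payload = pickle.dumps(model)
    (directory / f"{name}-{version}.pkl").write_bytes(payload)
    metadata = {
        "name": name,
        "version": version,
        "params": params,              # training parameters, for transparency
        "vcs_revision": vcs_revision,  # e.g. the commit the model was built from
        "sha256": hashlib.sha256(payload).hexdigest(),
        "stored_at": time.time(),
    }
    (directory / f"{name}-{version}.json").write_text(json.dumps(metadata, indent=2))
    return metadata


def load_model(directory, name, version):
    """Load a model and verify it against the checksum recorded at store time."""
    directory = Path(directory)
    metadata = json.loads((directory / f"{name}-{version}.json").read_text())
    payload = (directory / f"{name}-{version}.pkl").read_bytes()
    if hashlib.sha256(payload).hexdigest() != metadata["sha256"]:
        raise ValueError("artefact does not match its recorded checksum")
    return pickle.loads(payload), metadata
```

The checksum lets a deployment step confirm that the artefact it is about to serve is the one that was evaluated and approved.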


Case 1: Production in 3 months leads to considerable savings
AnalyticOps and Deep Learning to fight fraud at Danske Bank

Situation
• All banks have an obligation to protect their customers from fraudsters using advanced techniques to break systems

Problem
• To revolutionize a major bank and fight fraud within a bank’s strictly regulated procedures and existing transactional data ecosystem

Solution
• Integrated teams working together towards production
• Following the bank’s existing standards, procedures & blueprint

Impact
• Instant adoption of several algorithms to fight fraud attempts in real time: improvement of detection rate by 35%
• Fast delivery from design to production in 12 weeks within an agile framework


Case 2: Smart quoting for Insurance
A Machine Learning platform for real-time insurance quotes

Situation
• New types of Machine Learning proven to provide better outcomes than traditional approaches for insurance risk

Problem
• An insurance company wanted to build a real-time ML system able to respond to quote requests in real time, blending old and new ML techniques

Solution
• Building an Analytics Ops layer that supports multi-language ML and is able to serve such models during a real-time process in less than a second

[Architecture diagram. Components: CDL; Current Data Storage; Message Queue System; Bridge; Pre-processing; Post-processing; Persistency Layer; Machine Learning Models in Prod; Scoring Engine (feature enrich, scoring, logging); Real Time/Batch Data. Workflow steps across Development and Production: Wrangle; Model (Dev, Test, Validate); Catalog Model; Promote to Scoring (packaging); Promote model to production; Run Production Pipeline & update Models.]
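The Scoring Engine in the diagram is labelled with three steps: feature enrich, scoring, and logging. A minimal sketch of that request path follows. It assumes a simple linear model and an in-memory lookup table; the `enrich`, `score`, and `handle` functions and all field names are hypothetical, and a real engine would sit behind the message queue and persistency layer named in the diagram.

```python
import json
import logging
import time

logger = logging.getLogger("scoring")


def enrich(request, lookup):
    """Feature enrich: merge stored attributes for this customer into the request."""
    features = dict(request)
    features.update(lookup.get(request["customer_id"], {}))
    return features


def score(features, weights, bias=0.0):
    """Scoring: a linear model stands in for whatever model is in production."""
    return bias + sum(w * float(features.get(name, 0.0)) for name, w in weights.items())


def handle(request, lookup, weights):
    """Run enrich -> score -> log for one request and return the response."""
    start = time.perf_counter()
    features = enrich(request, lookup)
    result = {"customer_id": request["customer_id"], "score": score(features, weights)}
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Logging: record inputs, output, and latency for later monitoring.
    logger.info(json.dumps({"features": features, "result": result,
                            "elapsed_ms": round(elapsed_ms, 3)}))
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    lookup = {"c1": {"claims": 1}}
    weights = {"age": 0.5, "claims": 2.0}
    print(handle({"customer_id": "c1", "age": 40}, lookup, weights))
```

Logging the latency alongside each score makes it possible to check the sub-second response target named in the Solution against real traffic.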
