End-to-End Data Science and Machine Learning for Telcos: Telstra's Use Case — Animesh Singh, Tim Osborne, Adam Makarucha — Think 2020, Session 6123 — May 2020 / © 2020 IBM Corporation


Page 1:

End-to-End Data Science and Machine Learning for Telcos: Telstra's Use Case — Animesh Singh, Tim Osborne, Adam Makarucha

Think 2020 / DOC ID / May, 2020 / © 2020 IBM Corporation

Session 6123

Page 2:

Page 3:

CODAIT

Improving the Enterprise AI Lifecycle in Open Source

Center for Open Source Data & AI Technologies


CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.

We contribute to and advocate for the open-source technologies that are foundational to IBM’s AI offerings.

30+ open-source developers!

Page 4:

Enterprise Machine Learning


Page 5:

The Machine Learning Lifecycle


Page 6:

*Source: Hidden Technical Debt in Machine Learning Systems

Perception

Page 7:

Perception

Page 8:

*Source: Hidden Technical Debt in Machine Learning Systems

In reality… ML code is a tiny part of the overall platform

Page 9:

And the ML workflow spans teams …

Page 10:

[End-to-end ML workflow diagram, spanning edge and cloud:]

Data ingestion and prep: data cleansing, data analysis & transformation, data validation, data splitting

Model creation: building a model, model validation, training at scale, training optimization

Model deployment: rollout, deploying, serving, monitoring & logging, explainability, finetune & improvements

Cross-cutting: dataflow and workflow orchestration, marketplace (AI Hub), data consistency (versioning), feature engineering

And is much more complex…

Page 11:

End-to-end ML on Kubernetes?

First, can you become an expert in ...

●  Containers
●  Packaging
●  Kubernetes service endpoints
●  Persistent volumes
●  Scaling
●  Immutable deployments
●  GPUs, drivers & the GPL
●  Cloud APIs
●  DevOps
●  ...

Page 12:

We need a platform. Enter Kubeflow

[Workflow diagram: Prepared Data → Prepared and Analyzed Data; Untrained Model → Trained Model → Deployed Model]

Page 13:

•  End-to-end ML platform on Kubernetes, focused on multiple aspects of the model lifecycle

•  Originated at Google, and has grown to have a large community of developers

•  Google, IBM, Cisco, RedHat, Intel, Microsoft and others contributing

•  IBM is the 2nd-largest contributor in terms of overall commits, with IBM maintainers (committers/reviewers) in Katib (HPO + training), Kubeflow Serving, Manifests, Pipelines, etc.

[Layered architecture diagram (* not all components shown): Libraries and CLIs - focus on end users; Systems - combine multiple services; Low-level APIs / services (single function). Components shown include Arena, kfctl, kubectl, fairing, katib, pipelines, notebooks, kube bench, TFJob, PyTorchJob, Jupyter CR, Seldon CR, Pipelines CR, Argo, Study Job, MPI CR, Spark Job, Model DB, TFX, Metadata, Orchestration, IAM and Scheduling; the legend distinguishes components developed by Kubeflow from those developed outside Kubeflow.]

Kubeflow: https://github.com/kubeflow

Page 14:

Kubeflow components by area:

•  Jupyter Notebooks
•  Workflow Building: Kale, Fairing
•  Pipelines: KF Pipelines, TFX, Airflow, +
•  Tools: HP Tuning, Tensorboard
•  Serving: KFServing, Seldon Core, TFServing, +
•  Training Operators: Tensorflow, Pytorch, XGBoost, +
•  Metadata
•  Prometheus
•  Data Management: Versioning, Reproducibility, Secure Sharing

Page 15:

Develop (Kubeflow Jupyter Notebooks)

Data Scientist

-  Self-service Jupyter Notebooks provide faster model experimentation

-  Simplified configuration of CPU/GPU, RAM, Persistent Volumes

-  Faster model creation with training operators, TFX, magics, workflow automation (Kale, Fairing)

-  Simplify access to external data sources (using stored secrets)

-  Easier protection, faster restoration & sharing of “complete” notebooks

IT Operator

-  Profile Controller, Istio, Dex enable secure RBAC to notebooks, data & resources

-  Smaller base container images for notebooks, fewer crashes, faster to recover

Page 16:

Distributed Model Training and HPO (TFJob, PyTorch Job, MPI Job, Katib, …)

Addresses one of the key goals for the model builder persona: distributed model training and hyperparameter optimization (HPO) for TensorFlow, PyTorch, etc. Common problems in HP optimization:

–  Overfitting

–  Wrong metrics

–  Too few hyperparameters

Katib: a fully open source, Kubernetes-native hyperparameter tuning service

–  Inspired by Google Vizier

–  Framework agnostic

–  Extensible algorithms

–  Simple integration with other Kubeflow components

Kubeflow also supports distributed MPI-based training using Horovod.
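As a concrete illustration of what submitting such a training job looks like (a minimal sketch, not taken from the talk), the following creates a TFJob custom resource with the official Kubernetes Python client; the namespace, job name, image and training command are placeholder assumptions.

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "dist-train-demo", "namespace": "user-namespace"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",              # TFJob expects the container to be named "tensorflow"
                    "image": "example/train:latest",   # placeholder training image
                    "args": ["python", "/opt/train.py", "--epochs", "5"],
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }]}},
            }
        }
    },
}

# TFJob is a custom resource, so it is created through the CustomObjectsApi.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="user-namespace",
    plural="tfjobs", body=tfjob)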

Page 17:

KFServing

●  Founded by Google, Seldon, IBM, Bloomberg and Microsoft
●  Part of the Kubeflow project
●  Focus on 80% of use cases - single model rollout and update
●  KFServing 1.0 goals:
    ○  Serverless ML inference
    ○  Canary rollouts
    ○  Model explanations
    ○  Optional pre/post processing
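For illustration (assumed details, not from the slides), here is a minimal sketch of deploying a trained model as a KFServing InferenceService custom resource, using the v1alpha2 API current around Kubeflow 1.0, again via the Kubernetes Python client; the name, namespace and storage URI are placeholders.

from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "flowers-sample", "namespace": "user-namespace"},
    "spec": {
        "default": {
            "predictor": {
                # The serving runtime pulls the model artifacts from object storage.
                "tensorflow": {"storageUri": "s3://models/flowers/0001"}
            }
        }
        # A "canary" predictor plus canaryTrafficPercent can be added here for the
        # gradual rollouts mentioned in the KFServing 1.0 goals above.
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org", version="v1alpha2",
    namespace="user-namespace", plural="inferenceservices",
    body=inference_service)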

Page 18:

Kubeflow Pipelines

§  Containerized implementations of ML tasks
    §  Pre-built components: just provide params or code snippets (e.g. training code)
    §  Create your own components from code or libraries (see the sketch after the example below)
    §  Use any runtime, framework, data types
    §  Attach k8s objects - volumes, secrets
§  Specification of the sequence of steps
    §  Specified via Python DSL
    §  Inferred from data dependencies on input/output
§  Input parameters
    §  A "Run" = pipeline invoked w/ specific parameters
    §  Can be cloned with different parameters
§  Schedules
    §  Invoke a single run or create a recurring scheduled pipeline

Page 19:

Define Pipeline with Python SDK

@dsl.pipeline(name='TaxiCabClassificationPipelineExample')
def taxi_cab_classification(output_dir, project,
                            train_data='gs://bucket/train.csv',
                            evaluation_data='gs://bucket/eval.csv',
                            target='tips', learning_rate=0.1,
                            hidden_layer_size='100,50', steps=3000):
    tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
    preprocess = PreprocessOp(train_data, evaluation_data, tfdv.output["schema"],
                              project, output_dir)
    training = DnnTrainerOp(preprocess.output, tfdv.schema, learning_rate,
                            hidden_layer_size, steps, target, output_dir)
    tfma = TfmaOp(training.output, evaluation_data, tfdv.schema, project, output_dir)
    deploy = TfServingDeployerOp(training.output)

Compile and Submit Pipeline Run

kfp.compiler.Compiler().compile(taxi_cab_classification, 'tfx.tar.gz')
experiment = client.create_experiment('tfx')
run = client.run_pipeline(experiment.id, 'tfx_run', 'tfx.tar.gz',
                          params={'output_dir': 'gs://dpa22', 'project': 'my-project-33'})
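The example above wires together pre-built components. For the "create your own components" path mentioned on the previous page, here is a hedged sketch using the kfp SDK's lightweight Python components; the function, pipeline name and base image are illustrative, not from the talk.

import kfp
from kfp import dsl
from kfp.components import func_to_container_op


def add(a: float, b: float) -> float:
    # Stand-in step; any preprocessing or training code could live here.
    return a + b


# Wrap the plain Python function in a container image so it can run as a pipeline step.
add_op = func_to_container_op(add, base_image='python:3.7')


@dsl.pipeline(name='lightweight-component-example')
def add_pipeline(a: float = 1.0, b: float = 2.0):
    first = add_op(a, b)
    # The second step's order is inferred from its data dependency on first.output.
    add_op(first.output, b)


kfp.compiler.Compiler().compile(add_pipeline, 'add_pipeline.tar.gz')

The compiled archive can then be uploaded through the Pipelines UI or submitted with kfp.Client(), exactly as above.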

Page 20:

From Single Apps to Complete Platform

Individual Applications
•  Dec 2017: Introduce Kubeflow - JupyterHub, TFJob, TFServing
•  May 2018: Kubeflow 0.1 - Argo, Ambassador, Seldon
•  Aug 2018: Kubeflow 0.2 - Katib (HP tuning), Kubebench, PyTorch
•  Oct 2018: Kubeflow 0.3 - kfctl.sh, TFJob v1alpha2
•  Jan 2019: Kubeflow 0.4 - Pipelines, JupyterHub UI refresh, TFJob/PyTorch beta
•  April 2019: Kubeflow 0.5 - KFServing, Fairing, Jupyter WebApp + CR

Connecting Apps and Metadata
•  Jul 2019: Kubeflow 0.6 - Metadata, Kustomize, multi-user support
•  Sep 2019: Contributor Summit
•  November 2019: Kubeflow 0.7 - Pipelines+, KFServing v0.2, kfctl refactor

Productionisation & Hardening
•  March 2020: Kubeflow 1.0 - production-ready stable components

Page 21:

Telstra AI Lab - (TAIL) - Configuration

•  Kubernetes – 1.15

•  Spectrum Scale CSI Driver

•  MetalLB for Load Balancing

•  Istio 1.3.1 for ingress

•  Kubeflow – 1.0.1

•  Jupyter Notebook images are IBM's multi-architecture PowerAI images (https://hub.docker.com/r/ibmcom/powerai/tags)

Page 22:

Telstra AI Lab - (TAIL)

Mixed-architecture cluster: 2x IBM Power9 AC922 nodes, 4x Cisco Intel nodes

Page 23:

Telstra AI Lab - (TAIL)

237.6 TFLOPS of single-precision GPU performance

Page 24:

Telstra AI Lab - (TAIL): Compute

4x NVLink’ed Nvidia V100 GPUs

4x PCIe Nvidia V100 GPUs

64x Power 9 Cores

68x Intel Cores

Page 25:

Telstra AI Lab - (TAIL): AC922

[Diagram: AC922 node with two Power9 CPUs linked to the GPUs over NVLink at 150 GB/s]

Large Model Support: able to train models that exceed GPU memory.

Distributed Deep Learning: linear scaling for deep learning training across multiple GPU-enabled nodes.

Supports open source DL frameworks: TensorFlow, PyTorch and Caffe are all supported and optimized.

Page 26:

Telstra AI Lab - (TAIL): Configuration

•  Taint nodes + node selector: the GPU nodes only do data science (see the sketch below)
•  Kubeflow running on x86
    •  Can be used to run other components, such as databases, microservices, etc.
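A minimal sketch of what such a pod spec looks like with the Kubernetes Python client; the label key, taint values, host alias and image tag are illustrative assumptions, not Telstra's actual configuration.

from kubernetes import client

pod_spec = client.V1PodSpec(
    # Land the notebook on the ppc64le (Power) nodes ...
    node_selector={"kubernetes.io/arch": "ppc64le"},
    # ... and tolerate the taint that keeps everything else off them.
    tolerations=[client.V1Toleration(
        key="dedicated", operator="Equal", value="data-science",
        effect="NoSchedule")],
    containers=[client.V1Container(
        name="notebook",
        image="ibmcom/powerai:latest",  # pick a tag from the registry listed above
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}))],
    # Map enterprise-internal host names that are not in cluster DNS.
    host_aliases=[client.V1HostAlias(ip="10.0.0.10",
                                     hostnames=["repo.internal.example.com"])],
)

The "Successes" slide below notes that these options (node selector, tolerations, hostAliases) were exposed through the notebook server configuration.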

Page 27:

Telstra AI Lab - (TAIL): Challenges

•  Enterprise proxy and internal host names
    •  Running a squid proxy that routes to the enterprise proxy to enable access to docker.io, github.com, pypi.org, etc.
    •  Configure hostAliases in notebooks
•  Getting data into the cluster
    •  Provisioned a Minio object storage instance in each user namespace, accessible via the Kubeflow endpoint (see the sketch below)
•  User over-provisioning of cores / PVCs
    •  Locked defaults and created reasonable limits
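As an illustration of the Minio point above, a hedged sketch of pulling a dataset from a per-namespace Minio instance inside a notebook; the endpoint, credentials, bucket and object names are placeholders.

from minio import Minio

# In-cluster service endpoint for the namespace's Minio instance (placeholder).
minio_client = Minio("minio-service.user-namespace.svc.cluster.local:9000",
                     access_key="minio",      # normally injected from a Kubernetes
                     secret_key="minio123",   # secret rather than hard-coded
                     secure=False)

# Download a training file onto the notebook's persistent volume.
minio_client.fget_object("datasets", "calls/train.csv", "/home/jovyan/train.csv")

# See what else is available in the bucket.
for obj in minio_client.list_objects("datasets", recursive=True):
    print(obj.object_name, obj.size)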

Page 28:

Telstra AI Lab - (TAIL): Successes

•  Easy to select the Power platform with configuration options in the notebook server

•  Added open source code to enable node selector, tolerations, and hostAliases

•  Using Kubeflow-Kale to simplify pipelining of code
    •  Significantly simplifies the adoption of pipelines and conversion of code
    •  First instance: code conversion took ~1 day, optimisation of the code another 2 weeks
•  Significant performance improvements thanks to the available compute and software tools
    •  First use case went from a run time of 15 hours down to 2 hours

Page 29:

Telstra AI Lab - (TAIL) – Future state

•  Red Hat OpenShift – 4.3

•  GPU Operator

•  Kubeflow Operator

•  Extending the compute

•  Integrate feature stores and streaming technologies
•  Integrate with CI/CD tools (Tekton Pipelines)