End-to-End Data Science and Machine Learning for Telcos: Telstra's Use Case — Animesh Singh, Tim Osborne, Adam Makarucha — Think 2020, Session 6123 — May 2020 / © 2020 IBM Corporation


Page 1:

End-to-End Data Science and Machine Learning for Telcos: Telstra's Use Case — Animesh Singh, Tim Osborne, Adam Makarucha

Think 2020 / DOC ID / May, 2020 / © 2020 IBM Corporation

Session 6123

Page 2:

Page 3:

CODAIT

Improving the Enterprise AI Lifecycle in Open Source

Center for Open Source Data & AI Technologies


CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.

We contribute to and advocate for the open-source technologies that are foundational to IBM’s AI offerings.

30+ open-source developers!

Page 4:

Enterprise Machine Learning


Page 5:

The Machine Learning Lifecycle


Page 6:

*Source: Hidden Technical Debt in Machine Learning Systems

Perception

Page 7:

Perception

Page 8:

*Source: Hidden Technical Debt in Machine Learning Systems

In reality… ML code is a tiny part of the overall platform

Page 9:

And the ML workflow spans teams …

Page 10:

[End-to-end ML workflow diagram, spanning edge and cloud:]

Data ingestion and prep: data cleansing, data analysis & transformation, data validation, data splitting

Model creation: building a model, model validation, training at scale, training optimization

Model deployment: rollout, deploying, serving, monitoring & logging, explainability, finetune & improvements

Cross-cutting: dataflow and workflow orchestration, marketplace (AI Hub), data consistency (versioning), feature engineering

And is much more complex…

Page 11:

End-to-end ML on Kubernetes?

First, can you become an expert in ...

●  Containers
●  Packaging
●  Kubernetes service endpoints
●  Persistent volumes
●  Scaling
●  Immutable deployments
●  GPUs, drivers & the GPL
●  Cloud APIs
●  DevOps
●  ...

Page 12:

We need a platform. Enter Kubeflow

[Workflow diagram: Prepared Data → Prepared and Analyzed Data; Untrained Model → Trained Model → Deployed Model]

Page 13:

•  End-to-end ML platform on Kubernetes, focused on multiple aspects of the model lifecycle

•  Originated at Google, and has grown to have a large community of developers

•  Google, IBM, Cisco, RedHat, Intel, Microsoft and others contributing

•  IBM is the 2nd-largest contributor in terms of overall commits, with IBM maintainers (committers/reviewers) in Katib (HPO + training), Kubeflow Serving, Manifests, Pipelines, etc.

[Layered architecture diagram (* not all components shown): Libraries and CLIs - focus on end users; Systems - combine multiple services; Low-level APIs / services (single function). Components shown include Arena, kfctl, kubectl, fairing, katib, pipelines, notebooks, kube bench, TFJob, PyTorchJob, Jupyter CR, Seldon CR, Pipelines CR, Argo, Study Job, MPI CR, Spark Job, Model DB, TFX, Metadata, Orchestration, IAM and Scheduling; the legend distinguishes components developed by Kubeflow from those developed outside Kubeflow.]

Kubeflow: https://github.com/kubeflow

Page 14:

Kubeflow components by area:

•  Jupyter Notebooks
•  Workflow Building: Kale, Fairing
•  Pipelines: KF Pipelines, TFX, Airflow, +
•  Tools: HP Tuning, Tensorboard
•  Serving: KFServing, Seldon Core, TFServing, +
•  Training Operators: Tensorflow, Pytorch, XGBoost, +
•  Metadata
•  Prometheus
•  Data Management: Versioning, Reproducibility, Secure Sharing

Page 15:

Develop (Kubeflow Jupyter Notebooks)

Data Scientist

-  Self-service Jupyter Notebooks provide faster model experimentation

-  Simplified configuration of CPU/GPU, RAM, Persistent Volumes

-  Faster model creation with training operators, TFX, magics, workflow automation (Kale, Fairing)

-  Simplify access to external data sources (using stored secrets)

-  Easier protection, faster restoration & sharing of “complete” notebooks

IT Operator

-  Profile Controller, Istio, Dex enable secure RBAC to notebooks, data & resources

-  Smaller base container images for notebooks, fewer crashes, faster to recover

Page 16:

Distributed Model Training and HPO (TFJob, PyTorch Job, MPI Job, Katib, …)

Addresses one of the key goals for the model builder persona: distributed model training and hyperparameter optimization (HPO) for TensorFlow, PyTorch, etc. Common problems in HP optimization:

–  Overfitting

–  Wrong metrics

–  Too few hyperparameters

Katib: a fully open source, Kubernetes-native hyperparameter tuning service

–  Inspired by Google Vizier

–  Framework agnostic

–  Extensible algorithms

–  Simple integration with other Kubeflow components

Kubeflow also supports distributed MPI-based training using Horovod.
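As a concrete illustration of what submitting such a training job looks like (a minimal sketch, not taken from the talk), the following creates a TFJob custom resource with the official Kubernetes Python client; the namespace, job name, image and training command are placeholder assumptions.

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "dist-train-demo", "namespace": "user-namespace"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",              # TFJob expects the container to be named "tensorflow"
                    "image": "example/train:latest",   # placeholder training image
                    "args": ["python", "/opt/train.py", "--epochs", "5"],
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }]}},
            }
        }
    },
}

# TFJob is a custom resource, so it is created through the CustomObjectsApi.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="user-namespace",
    plural="tfjobs", body=tfjob)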

Page 17:

KFServing

●  Founded by Google, Seldon, IBM, Bloomberg and Microsoft
●  Part of the Kubeflow project
●  Focus on 80% of use cases - single model rollout and update
●  KFServing 1.0 goals:
    ○  Serverless ML inference
    ○  Canary rollouts
    ○  Model explanations
    ○  Optional pre/post processing
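For illustration (assumed details, not from the slides), here is a minimal sketch of deploying a trained model as a KFServing InferenceService custom resource, using the v1alpha2 API current around Kubeflow 1.0, again via the Kubernetes Python client; the name, namespace and storage URI are placeholders.

from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "flowers-sample", "namespace": "user-namespace"},
    "spec": {
        "default": {
            "predictor": {
                # The serving runtime pulls the model artifacts from object storage.
                "tensorflow": {"storageUri": "s3://models/flowers/0001"}
            }
        }
        # A "canary" predictor plus canaryTrafficPercent can be added here for the
        # gradual rollouts mentioned in the KFServing 1.0 goals above.
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org", version="v1alpha2",
    namespace="user-namespace", plural="inferenceservices",
    body=inference_service)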

Page 18:

Kubeflow Pipelines

§  Containerized implementations of ML tasks
    §  Pre-built components: just provide params or code snippets (e.g. training code)
    §  Create your own components from code or libraries (see the sketch after the example below)
    §  Use any runtime, framework, data types
    §  Attach k8s objects - volumes, secrets
§  Specification of the sequence of steps
    §  Specified via Python DSL
    §  Inferred from data dependencies on input/output
§  Input parameters
    §  A "Run" = pipeline invoked w/ specific parameters
    §  Can be cloned with different parameters
§  Schedules
    §  Invoke a single run or create a recurring scheduled pipeline

Page 19:

Define Pipeline with Python SDK

@dsl.pipeline(name='TaxiCabClassificationPipelineExample')
def taxi_cab_classification(output_dir, project,
                            train_data='gs://bucket/train.csv',
                            evaluation_data='gs://bucket/eval.csv',
                            target='tips', learning_rate=0.1,
                            hidden_layer_size='100,50', steps=3000):
    tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
    preprocess = PreprocessOp(train_data, evaluation_data, tfdv.output["schema"],
                              project, output_dir)
    training = DnnTrainerOp(preprocess.output, tfdv.schema, learning_rate,
                            hidden_layer_size, steps, target, output_dir)
    tfma = TfmaOp(training.output, evaluation_data, tfdv.schema, project, output_dir)
    deploy = TfServingDeployerOp(training.output)

Compile and Submit Pipeline Run

kfp.compiler.Compiler().compile(taxi_cab_classification, 'tfx.tar.gz')
experiment = client.create_experiment('tfx')
run = client.run_pipeline(experiment.id, 'tfx_run', 'tfx.tar.gz',
                          params={'output_dir': 'gs://dpa22', 'project': 'my-project-33'})
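The example above wires together pre-built components. For the "create your own components" path mentioned on the previous page, here is a hedged sketch using the kfp SDK's lightweight Python components; the function, pipeline name and base image are illustrative, not from the talk.

import kfp
from kfp import dsl
from kfp.components import func_to_container_op


def add(a: float, b: float) -> float:
    # Stand-in step; any preprocessing or training code could live here.
    return a + b


# Wrap the plain Python function in a container image so it can run as a pipeline step.
add_op = func_to_container_op(add, base_image='python:3.7')


@dsl.pipeline(name='lightweight-component-example')
def add_pipeline(a: float = 1.0, b: float = 2.0):
    first = add_op(a, b)
    # The second step's order is inferred from its data dependency on first.output.
    add_op(first.output, b)


kfp.compiler.Compiler().compile(add_pipeline, 'add_pipeline.tar.gz')

The compiled archive can then be uploaded through the Pipelines UI or submitted with kfp.Client(), exactly as above.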

Page 20:

From Single Apps to Complete Platform

Individual Applications
•  Dec 2017: Introduce Kubeflow - JupyterHub, TFJob, TFServing
•  May 2018: Kubeflow 0.1 - Argo, Ambassador, Seldon
•  Aug 2018: Kubeflow 0.2 - Katib (HP tuning), Kubebench, PyTorch
•  Oct 2018: Kubeflow 0.3 - kfctl.sh, TFJob v1alpha2
•  Jan 2019: Kubeflow 0.4 - Pipelines, JupyterHub UI refresh, TFJob/PyTorch beta
•  April 2019: Kubeflow 0.5 - KFServing, Fairing, Jupyter WebApp + CR

Connecting Apps and Metadata
•  Jul 2019: Kubeflow 0.6 - Metadata, Kustomize, multi-user support
•  Sep 2019: Contributor Summit
•  November 2019: Kubeflow 0.7 - Pipelines+, KFServing v0.2, kfctl refactor

Productionisation & Hardening
•  March 2020: Kubeflow 1.0 - production-ready stable components

Page 21:

Telstra AI Lab - (TAIL) - Configuration

•  Kubernetes – 1.15

•  Spectrum Scale CSI Driver

•  MetalLB for Load Balancing

•  Istio 1.3.1 for ingress

•  Kubeflow – 1.0.1

•  Jupyter Notebook images are IBM's multi-architecture PowerAI images (https://hub.docker.com/r/ibmcom/powerai/tags)

Page 22:

Telstra AI Lab - (TAIL)

Mixed-architecture cluster: 2x IBM Power9 AC922 nodes, 4x Cisco Intel nodes

Page 23:

Telstra AI Lab - (TAIL)

237.6 TFLOPS of single-precision GPU performance

Page 24:

Telstra AI Lab - (TAIL): Compute

4x NVLink’ed Nvidia V100 GPUs

4x PCIe Nvidia V100 GPUs

64x Power 9 Cores

68x Intel Cores

Page 25:

Telstra AI Lab - (TAIL): AC922

[Diagram: AC922 node with two Power9 CPUs linked to the GPUs over NVLink at 150 GB/s]

Large Model Support: able to train models that exceed GPU memory.

Distributed Deep Learning: linear scaling for deep learning training across multiple GPU-enabled nodes.

Supports open source DL frameworks: TensorFlow, PyTorch and Caffe are all supported and optimized.

Page 26:

Telstra AI Lab - (TAIL): Configuration

•  Taint nodes + node selector: the GPU nodes only do data science (see the sketch below)
•  Kubeflow running on x86
    •  Can be used to run other components, such as databases, microservices, etc.
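A minimal sketch of what such a pod spec looks like with the Kubernetes Python client; the label key, taint values, host alias and image tag are illustrative assumptions, not Telstra's actual configuration.

from kubernetes import client

pod_spec = client.V1PodSpec(
    # Land the notebook on the ppc64le (Power) nodes ...
    node_selector={"kubernetes.io/arch": "ppc64le"},
    # ... and tolerate the taint that keeps everything else off them.
    tolerations=[client.V1Toleration(
        key="dedicated", operator="Equal", value="data-science",
        effect="NoSchedule")],
    containers=[client.V1Container(
        name="notebook",
        image="ibmcom/powerai:latest",  # pick a tag from the registry listed above
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}))],
    # Map enterprise-internal host names that are not in cluster DNS.
    host_aliases=[client.V1HostAlias(ip="10.0.0.10",
                                     hostnames=["repo.internal.example.com"])],
)

The "Successes" slide below notes that these options (node selector, tolerations, hostAliases) were exposed through the notebook server configuration.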

Page 27:

Telstra AI Lab - (TAIL): Challenges

•  Enterprise proxy and internal host names
    •  Running a squid proxy that routes to the enterprise proxy to enable access to docker.io, github.com, pypi.org, etc.
    •  Configure hostAliases in notebooks
•  Getting data into the cluster
    •  Provisioned a Minio object storage instance in each user namespace, accessible via the Kubeflow endpoint (see the sketch below)
•  User over-provisioning of cores / PVCs
    •  Locked defaults and created reasonable limits
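As an illustration of the Minio point above, a hedged sketch of pulling a dataset from a per-namespace Minio instance inside a notebook; the endpoint, credentials, bucket and object names are placeholders.

from minio import Minio

# In-cluster service endpoint for the namespace's Minio instance (placeholder).
minio_client = Minio("minio-service.user-namespace.svc.cluster.local:9000",
                     access_key="minio",      # normally injected from a Kubernetes
                     secret_key="minio123",   # secret rather than hard-coded
                     secure=False)

# Download a training file onto the notebook's persistent volume.
minio_client.fget_object("datasets", "calls/train.csv", "/home/jovyan/train.csv")

# See what else is available in the bucket.
for obj in minio_client.list_objects("datasets", recursive=True):
    print(obj.object_name, obj.size)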

Page 28:

Telstra AI Lab - (TAIL): Successes

•  Easy to select the Power platform with configuration options in the notebook server

•  Added open source code to enable node selector, tolerations, and hostAliases

•  Using Kubeflow-Kale to simplify pipelining of code
    •  Significantly simplifies the adoption of pipelines and conversion of code
    •  First instance: code conversion took ~1 day, optimisation of the code another 2 weeks
•  Significant performance improvements thanks to the available compute and software tools
    •  First use case went from a run time of 15 hours down to 2 hours

Page 29:

Telstra AI Lab - (TAIL) – Future state

•  Red Hat OpenShift – 4.3

•  GPU Operator

•  Kubeflow Operator

•  Extending the compute

•  Integrate feature stores and streaming technologies
•  Integrate with CI/CD tools (Tekton Pipelines)