
Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems

Jeff Zhang1, Sameh Elnikety2, Shuayb Zarar2, Atul Gupta2, Siddharth Garg1

1New York University, 2Microsoft
jeffjunzhang@nyu.edu

USENIX HotCloud 2020    [*] Machine Learning-as-a-Service

Deep-Learning Models are Pervasive


Computations in Deep Learning

[Figure: a convolution layer; the activation tensor is convolved with a set of convolution filters to produce the output tensor.]

[Ref.] Sze, V., et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017.

Convolutions account for more than 90% of the computation → they dominate both run-time and energy.*

Execution time factors: depth, activation/filter size in each layer
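To make this concrete, here is a minimal sketch of the arithmetic behind the claim; the layer shapes are hypothetical, chosen to resemble a mid-network ResNet layer:

```python
# Rough multiply-accumulate (MAC) count for one 2-D convolution layer:
# each of the h_out * w_out output positions of each of the c_out
# filters performs c_in * k * k multiply-accumulates.
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    return c_out * h_out * w_out * c_in * k * k

# Hypothetical mid-network ResNet-style layer: 256 -> 256 channels,
# 3x3 filters, 14x14 output -> ~115.6 M MACs for this single layer.
print(f"{conv2d_macs(256, 256, 3, 14, 14) / 1e6:.1f} M MACs")
```

Summing such counts over all layers shows why convolution dominates: the single fully connected layer at the end of a ResNet contributes only a few million MACs by comparison.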

Machine Learning Lifecycle (Typical)

[Figure: offline, a data scientist and data engineers run training pipelines (data collection, cleaning & visualization, feature engineering & model design, training & validation) that produce trained models; online, a prediction service answers queries from the end-user application with predictions, while live data and feedback flow back into training.]

[Ref.] Gonzalez, J. et al., “Deploying Interactive ML Applications with Clipper.”

Model Development
• Prescribes model design, architecture, and data processing

Training
• At scale, on live data
• Retraining on new data and managing model versioning

Serving or Online Inference
• Deploys the trained model to devices, the edge, or the cloud

MLaaS

MLaaS: Challenges and Limitations

[Ref.] Gujarati, A. et al., “Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency”, ACM Middleware 2017.

Existing solution: static model versioning
- Tie each application to one specific model at run-time
- In the event of load spikes:
  • Prune requests (new, low-priority, near-deadline, etc.) → QoS violations, unhappy customers
  • Add “significant new capacity” (autoscaling) → not economically viable, unhappy provider

Challenge: maintain QoS under dynamic workloads.

Example load spike (March 28, 2020):
- Service success rates dropped below 99.99%
- Teams suffered a 2-hour outage in Europe
- Free offers and new subscriptions were limited

Opportunity: DNN Model Diversity

For the same application, many models can be trained with tradeoffs among accuracy, inference time, and computation cost (parallelism).

[Figure: accuracy (90 to 94.5%) vs. single-image inference time (0 to 1000 ms) for ResNet-18/34/50/101/152 (increasing numbers of residual blocks), each model run with 1, 2, 4, 8, or 16 threads.]

[Ref.] He, K., et al., “Deep Residual Learning for Image Recognition”, IEEE CVPR 2016.

Questions in this Study

Assumption: fluctuating workloads, fixed hardware capacity.

[Figure: ML app end users send requests to the MLaaS provider and receive rendered predictions, governed by SLAs.]

• What is the QoS? Typical SLA objectives: latency, throughput, cost, …
• Which DNN model to use?
• How to allocate resources?

Make all decisions online!

What Do Users Care About?

[Figure: application end users query the MLaaS provider; one response arrives as “Cat” after 5 s, another as “Dog” after 0.3 s.]

From the users’ perspective, deadline misses and incorrect predictions are equally bad:
• A user could always meet the deadline by guessing randomly
→ We want quick and correct predictions!

A New Metric for MLaaS

Effective accuracy (a_eff): the fraction of correct predictions delivered within the deadline D,

a_eff = p_{λ,D} × a

where p_{λ,D} is the likelihood of meeting the deadline at load λ and a is the model’s baseline accuracy.

[Figure: effective accuracy (0.88 to 0.94) vs. load (0 to 30 queries per second) for ResNet-18/34/50/101/152.]

No single DNN works best at all load levels.
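As a minimal sketch (not the paper’s code), effective accuracy can be estimated from per-query latency samples; the function name and inputs are illustrative:

```python
def effective_accuracy(latencies_s, deadline_s, baseline_accuracy):
    """a_eff = p_{lambda,D} * a: baseline accuracy scaled by the empirical
    probability of finishing within the deadline at the measured load."""
    p_meet = sum(t <= deadline_s for t in latencies_s) / len(latencies_s)
    return p_meet * baseline_accuracy

# Illustrative numbers: 3 of 4 sampled queries meet a 750 ms deadline,
# and the model's baseline top-1 accuracy is 76.1%, so a_eff ~= 0.571.
print(effective_accuracy([0.40, 0.62, 0.71, 0.93], 0.75, 0.761))
```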

Characterizing DNN Parallelism

[Figure: ResNet-152 99th-percentile tail latency of end-to-end query latency (0 to 4500 ms) vs. job arrival rate (1 to 8 queries per second) under a fixed capacity of 16 threads, for the splits R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, and R:1 T:16 (R: replicas, T: threads per replica).]

As load increases, additional replicas help more than additional threads.
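Here is a sketch of how a fixed 16-thread budget could be split into R replicas × T intra-op threads with PyTorch multiprocessing; the queue plumbing is simplified and illustrative, not the paper’s serving code:

```python
import torch
import torch.multiprocessing as mp
from torchvision.models import resnet152

REPLICAS, THREADS = 4, 4  # R:4 T:4 split of a 16-thread budget

def serve_replica(requests, responses):
    torch.set_num_threads(THREADS)  # cap intra-op parallelism per replica
    model = resnet152(pretrained=True).eval()
    with torch.no_grad():
        while True:
            req_id, image = requests.get()  # image: 1x3x224x224 tensor
            responses.put((req_id, model(image).argmax(dim=1).item()))

if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    for _ in range(REPLICAS):  # more replicas -> less queueing at high load
        mp.Process(target=serve_replica, args=(requests, responses)).start()
```

Replicas help at high load because an arriving query can be served by any idle replica, whereas extra intra-op threads only speed up a single query, with diminishing returns.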

Online Model Switching Framework

[Figure: front-end queries flow through the model-switching controller to online serving, which returns predictions; the controller combines load change detection with an offline-trained policy to dynamically select the model with the best effective accuracy for the current load and SLA deadline.]

The model that exhibits the best effective accuracy is a function of load, captured by an offline-trained policy table (a controller sketch follows the table):

Load      | Best Model | Parallelism
----------|------------|------------
0-4 QPS   | ResNet-152 | <R:4 T:4>
…         | …          | …
> 20 QPS  | ResNet-18  | …
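A minimal sketch of the switching loop follows; only the first and last policy rows come from the slide, so the intermediate thresholds and the measure_load_qps/activate_model callbacks are hypothetical:

```python
import time

# Offline-trained policy: (load upper bound in QPS) -> (model, (R, T)).
# Only the 0-4 QPS and >20 QPS rows appear on the slide; the rows in
# between are hypothetical placeholders.
POLICY = [
    (4.0,          ("resnet152", (4, 4))),
    (9.0,          ("resnet101", (4, 4))),
    (14.0,         ("resnet50",  (4, 4))),
    (20.0,         ("resnet34",  (4, 4))),
    (float("inf"), ("resnet18",  (4, 4))),
]

def select_model(load_qps):
    """Return the policy entry with the best effective accuracy at this load."""
    return next(choice for bound, choice in POLICY if load_qps <= bound)

def control_loop(measure_load_qps, activate_model, period_s=1.0):
    """Re-check load every sampling period and switch models if needed."""
    current = None
    while True:
        choice = select_model(measure_load_qps())
        if choice != current:          # switch: route new queries to the
            activate_model(*choice)    # selected model's replicas
            current = choice
        time.sleep(period_s)
```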

Experimental Setup

• Built on top of Clipper, an open-source containerized model-serving framework (caching and adaptive batching disabled)
• Deployed PyTorch pretrained ResNet models on ImageNet (R:4 T:4)
• Two dedicated Azure VMs:
  • Server: 32 vCPUs + 128 GB RAM
  • Client: 8 vCPUs + 32 GB RAM
• Markov-model-based load generator (sketched below):
  • Open system model
  • Poisson inter-arrivals
• Model switching with a 1-second sampling period

[Ref.] Crankshaw, D. et al., “Clipper: A Low-Latency Online Prediction Serving System”, USENIX NSDI 2017.
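A sketch of such an open-loop load generator, under stated assumptions: Poisson inter-arrivals whose rate is modulated by a small Markov chain over load levels (the rates, dwell time, and transition matrix are illustrative):

```python
import random

RATES = [2.0, 8.0, 22.0]            # low / medium / high load levels, in QPS
TRANSITIONS = [[0.80, 0.15, 0.05],  # row i: transition probabilities
               [0.10, 0.80, 0.10],  # out of load level i
               [0.05, 0.15, 0.80]]

def arrival_times(duration_s, dwell_s=10.0, seed=0):
    """Yield query arrival times for an open system model: exponential
    inter-arrival gaps (Poisson process) at the current level's rate."""
    rng = random.Random(seed)
    t, level, next_jump = 0.0, 0, dwell_s
    while True:
        if t >= next_jump:  # re-draw the load level every dwell period
            level = rng.choices(range(len(RATES)), TRANSITIONS[level])[0]
            next_jump += dwell_s
        t += rng.expovariate(RATES[level])
        if t >= duration_s:
            return
        yield t

# Open loop: issue each query at its timestamp without waiting for replies.
for t in arrival_times(300.0):
    pass  # send the query scheduled at time t
```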

Evaluation: Automatic Model-Switching

[Figure: offered load (0 to 25 queries per second) over a 300-second run, annotated with the model selected at each instant (ResNet-18 through ResNet-152).]

Model-Switching can quickly adapt to load spikes.

Evaluation: Effective Accuracy

[Figure: effective accuracy (0.6 to 0.95) vs. SLA deadline (0.7 to 1.5 s) for Model-Switch and each static ResNet model.]

Model-Switching achieves Pareto-optimal effective accuracy.

Evaluation: Tail Latency

SLA deadline: 750 ms

[Figure: empirical CDF of end-to-end query latency (0 to 2 s) for Model-Switch and each static ResNet model; the 0.75 s deadline is marked.]

Model-Switching trades off deadline slack for accuracy.

Discussion and Future Work

• How to prepare a pool of models for each application?
  • Neural Architecture Search, multi-level quantization
• Current approach pre-deploys all (20) candidate models
  • Cold-start time (ML): tens of seconds
  • RAM overhead: currently 11.8% of the total 128 GB RAM
• Reinforcement-learning-based controller for model switching
  • Account for job queue status, system load, and current latency
  • No offline policy training required
• Integrate with existing MLaaS techniques
  • Batching, caching, autoscaling, etc.
• Exploit the availability of heterogeneous computing resources
  • CPU, GPU, TPU, FPGA

Model-Switching: Dealing with Fluctuating Workloads in MLaaS Systems

Thank You and Questions

Jeff Zhang
jeffjunzhang@nyu.edu
