
Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems

Jeff Zhang1, Sameh Elnikety2, Shuayb Zarar2, Atul Gupta2, Siddharth Garg1

1New York University, 2Microsoft
jeffjunzhang@nyu.edu

USENIX HotCloud 2020    [*] Machine Learning-as-a-Service

Deep-Learning Models are Pervasive


Computations in Deep Learning

[Figure: a convolution layer; the activation tensor is convolved with a set of convolution filters to produce the output tensor.]

[Ref.] Sze, V., et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017.

Convolutions account for more than 90% of the computation → they dominate both run-time and energy.*

Execution time factors: depth, activation/filter size in each layer
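To make this concrete, here is a minimal sketch of the arithmetic behind the claim; the layer shapes are hypothetical, chosen to resemble a mid-network ResNet layer:

```python
# Rough multiply-accumulate (MAC) count for one 2-D convolution layer:
# each of the h_out * w_out output positions of each of the c_out
# filters performs c_in * k * k multiply-accumulates.
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    return c_out * h_out * w_out * c_in * k * k

# Hypothetical mid-network ResNet-style layer: 256 -> 256 channels,
# 3x3 filters, 14x14 output -> ~115.6 M MACs for this single layer.
print(f"{conv2d_macs(256, 256, 3, 14, 14) / 1e6:.1f} M MACs")
```

Summing such counts over all layers shows why convolution dominates: the single fully connected layer at the end of a ResNet contributes only a few million MACs by comparison.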

Machine Learning Lifecycle (Typical)

[Figure: offline, a data scientist and data engineers run training pipelines (data collection, cleaning & visualization, feature engineering & model design, training & validation) that produce trained models; online, a prediction service answers queries from the end-user application with predictions, while live data and feedback flow back into training.]

[Ref.] Gonzalez, J. et al., “Deploying Interactive ML Applications with Clipper.”

Model Development
• Prescribes model design, architecture, and data processing

Training
• At scale, on live data
• Retraining on new data and managing model versioning

Serving or Online Inference
• Deploys the trained model to devices, the edge, or the cloud

MLaaS

MLaaS: Challenges and Limitations

[Ref.] Gujarati, A. et al., “Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency”, ACM Middleware 2017.

Existing solution: static model versioning
- Tie each application to one specific model at run-time
- In the event of load spikes:
  • Prune requests (new, low-priority, near-deadline, etc.) → QoS violations, unhappy customers
  • Add “significant new capacity” (autoscaling) → not economically viable, unhappy provider

Challenge: maintain QoS under dynamic workloads.

Example load spike (March 28, 2020):
- Service success rates dropped below 99.99%
- Teams suffered a 2-hour outage in Europe
- Free offers and new subscriptions were limited

Opportunity: DNN Model Diversity

For the same application, many models can be trained with tradeoffs among accuracy, inference time, and computation cost (parallelism).

[Figure: accuracy (90 to 94.5%) vs. single-image inference time (0 to 1000 ms) for ResNet-18/34/50/101/152 (increasing numbers of residual blocks), each model run with 1, 2, 4, 8, or 16 threads.]

[Ref.] He, K., et al., “Deep Residual Learning for Image Recognition”, IEEE CVPR 2016.

Questions in this Study

Assumption: fluctuating workloads, fixed hardware capacity.

[Figure: ML app end users send requests to the MLaaS provider and receive rendered predictions, governed by SLAs.]

• What is the QoS? Typical SLA objectives: latency, throughput, cost, …
• Which DNN model to use?
• How to allocate resources?

Make all decisions online!

What Do Users Care About?

[Figure: application end users query the MLaaS provider; one response arrives as “Cat” after 5 s, another as “Dog” after 0.3 s.]

From the users’ perspective, deadline misses and incorrect predictions are equally bad:
• A user could always meet the deadline by guessing randomly
→ We want quick and correct predictions!

A New Metric for MLaaS

Effective accuracy (a_eff): the fraction of correct predictions delivered within the deadline D,

a_eff = p_{λ,D} × a

where p_{λ,D} is the likelihood of meeting the deadline at load λ and a is the model’s baseline accuracy.

[Figure: effective accuracy (0.88 to 0.94) vs. load (0 to 30 queries per second) for ResNet-18/34/50/101/152.]

No single DNN works best at all load levels.
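As a minimal sketch (not the paper’s code), effective accuracy can be estimated from per-query latency samples; the function name and inputs are illustrative:

```python
def effective_accuracy(latencies_s, deadline_s, baseline_accuracy):
    """a_eff = p_{lambda,D} * a: baseline accuracy scaled by the empirical
    probability of finishing within the deadline at the measured load."""
    p_meet = sum(t <= deadline_s for t in latencies_s) / len(latencies_s)
    return p_meet * baseline_accuracy

# Illustrative numbers: 3 of 4 sampled queries meet a 750 ms deadline,
# and the model's baseline top-1 accuracy is 76.1%, so a_eff ~= 0.571.
print(effective_accuracy([0.40, 0.62, 0.71, 0.93], 0.75, 0.761))
```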

Characterizing DNN Parallelism

[Figure: ResNet-152 99th-percentile tail latency of end-to-end query latency (0 to 4500 ms) vs. job arrival rate (1 to 8 queries per second) under a fixed capacity of 16 threads, for the splits R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, and R:1 T:16 (R: replicas, T: threads per replica).]

As load increases, additional replicas help more than additional threads.
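Here is a sketch of how a fixed 16-thread budget could be split into R replicas × T intra-op threads with PyTorch multiprocessing; the queue plumbing is simplified and illustrative, not the paper’s serving code:

```python
import torch
import torch.multiprocessing as mp
from torchvision.models import resnet152

REPLICAS, THREADS = 4, 4  # R:4 T:4 split of a 16-thread budget

def serve_replica(requests, responses):
    torch.set_num_threads(THREADS)  # cap intra-op parallelism per replica
    model = resnet152(pretrained=True).eval()
    with torch.no_grad():
        while True:
            req_id, image = requests.get()  # image: 1x3x224x224 tensor
            responses.put((req_id, model(image).argmax(dim=1).item()))

if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    for _ in range(REPLICAS):  # more replicas -> less queueing at high load
        mp.Process(target=serve_replica, args=(requests, responses)).start()
```

Replicas help at high load because an arriving query can be served by any idle replica, whereas extra intra-op threads only speed up a single query, with diminishing returns.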

Online Model Switching Framework

[Figure: front-end queries flow through the model-switching controller to online serving, which returns predictions; the controller combines load change detection with an offline-trained policy to dynamically select the model with the best effective accuracy for the current load and SLA deadline.]

The model that exhibits the best effective accuracy is a function of load, captured by an offline-trained policy table (a controller sketch follows the table):

Load      | Best Model | Parallelism
----------|------------|------------
0-4 QPS   | ResNet-152 | <R:4 T:4>
…         | …          | …
> 20 QPS  | ResNet-18  | …
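A minimal sketch of the switching loop follows; only the first and last policy rows come from the slide, so the intermediate thresholds and the measure_load_qps/activate_model callbacks are hypothetical:

```python
import time

# Offline-trained policy: (load upper bound in QPS) -> (model, (R, T)).
# Only the 0-4 QPS and >20 QPS rows appear on the slide; the rows in
# between are hypothetical placeholders.
POLICY = [
    (4.0,          ("resnet152", (4, 4))),
    (9.0,          ("resnet101", (4, 4))),
    (14.0,         ("resnet50",  (4, 4))),
    (20.0,         ("resnet34",  (4, 4))),
    (float("inf"), ("resnet18",  (4, 4))),
]

def select_model(load_qps):
    """Return the policy entry with the best effective accuracy at this load."""
    return next(choice for bound, choice in POLICY if load_qps <= bound)

def control_loop(measure_load_qps, activate_model, period_s=1.0):
    """Re-check load every sampling period and switch models if needed."""
    current = None
    while True:
        choice = select_model(measure_load_qps())
        if choice != current:          # switch: route new queries to the
            activate_model(*choice)    # selected model's replicas
            current = choice
        time.sleep(period_s)
```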

Experimental Setup

• Built on top of Clipper, an open-source containerized model-serving framework (caching and adaptive batching disabled)
• Deployed PyTorch pretrained ResNet models on ImageNet (R:4 T:4)
• Two dedicated Azure VMs:
  • Server: 32 vCPUs + 128 GB RAM
  • Client: 8 vCPUs + 32 GB RAM
• Markov-model-based load generator (sketched below):
  • Open system model
  • Poisson inter-arrivals
• Model switching with a 1-second sampling period

[Ref.] Crankshaw, D. et al., “Clipper: A Low-Latency Online Prediction Serving System”, USENIX NSDI 2017.
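A sketch of such an open-loop load generator, under stated assumptions: Poisson inter-arrivals whose rate is modulated by a small Markov chain over load levels (the rates, dwell time, and transition matrix are illustrative):

```python
import random

RATES = [2.0, 8.0, 22.0]            # low / medium / high load levels, in QPS
TRANSITIONS = [[0.80, 0.15, 0.05],  # row i: transition probabilities
               [0.10, 0.80, 0.10],  # out of load level i
               [0.05, 0.15, 0.80]]

def arrival_times(duration_s, dwell_s=10.0, seed=0):
    """Yield query arrival times for an open system model: exponential
    inter-arrival gaps (Poisson process) at the current level's rate."""
    rng = random.Random(seed)
    t, level, next_jump = 0.0, 0, dwell_s
    while True:
        if t >= next_jump:  # re-draw the load level every dwell period
            level = rng.choices(range(len(RATES)), TRANSITIONS[level])[0]
            next_jump += dwell_s
        t += rng.expovariate(RATES[level])
        if t >= duration_s:
            return
        yield t

# Open loop: issue each query at its timestamp without waiting for replies.
for t in arrival_times(300.0):
    pass  # send the query scheduled at time t
```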

Evaluation: Automatic Model-Switching

[Figure: offered load (0 to 25 queries per second) over a 300-second run, annotated with the model selected at each instant (ResNet-18 through ResNet-152).]

Model-Switching can quickly adapt to load spikes.

Evaluation: Effective Accuracy

[Figure: effective accuracy (0.6 to 0.95) vs. SLA deadline (0.7 to 1.5 s) for Model-Switch and each static ResNet model.]

Model-Switching achieves Pareto-optimal effective accuracy.

Evaluation: Tail Latency

SLA deadline: 750 ms

[Figure: empirical CDF of end-to-end query latency (0 to 2 s) for Model-Switch and each static ResNet model; the 0.75 s deadline is marked.]

Model-Switching trades off deadline slack for accuracy.

Discussion and Future Work

• How to prepare a pool of models for each application?
  • Neural Architecture Search, multi-level quantization
• Current approach pre-deploys all (20) candidate models
  • Cold-start time (ML): tens of seconds
  • RAM overhead: currently 11.8% of the total 128 GB RAM
• Reinforcement-learning-based controller for model switching
  • Account for job queue status, system load, and current latency
  • No offline policy training required
• Integrate with existing MLaaS techniques
  • Batching, caching, autoscaling, etc.
• Exploit the availability of heterogeneous computing resources
  • CPU, GPU, TPU, FPGA

Model-Switching: Dealing with Fluctuating Workloads in MLaaS Systems

Thank You and Questions

Jeff Zhang
jeffjunzhang@nyu.edu
