Upload
others
View
10
Download
0
Embed Size (px)
Citation preview
Arpan Gujarati, Björn B. Brandenburg
Sameh Elnikety, Yuxiong He Kathryn S. McKinley
SwayamDistributed Autoscaling for Machine Learning as a Service
1
Machine Learning as a Service (MLaaS)
Data Science & Machine Learning
Amazon Machine Learning
Machine Learning
Google Cloud AI
2
Machine Learning as a Service (MLaaS)
Data Science & Machine Learning
Amazon Machine Learning
Machine Learning
Google Cloud AI
+ =Trained Model
Untrained model
Dataset
1. Training
+ =Trained Model
2. Prediction
QueryAnswer
2
Machine Learning as a Service (MLaaS)
+ =Trained Model
2. Prediction
QueryAnswer
Models are already trained and available for prediction
This work
2
Swayam
inside the MLaaS infrastructure
Distributed autoscaling
of the compute resourcesneeded for prediction serving + =
Trained Model
2. Prediction
QueryAnswer
3
Prediction serving (application perspective)
MLaaS Provider
Image classifier
"cat"
image
Application / End User
4
Lots of trained models!
MLaaS Provider Finite compute resources "Backends" for prediction
Prediction serving (provider perspective)
5
Lots of trained models!
MLaaS Provider Finite compute resources "Backends" for prediction Application / End User
(1) New prediction request for the
pink model
(2) A frontend receives the request
Multiple request dispatchers "Frontends"
Prediction serving (provider perspective)
5
Lots of trained models!
MLaaS Provider Finite compute resources "Backends" for prediction Application / End User
(1) New prediction request for the
pink model
(3) The request is dispatched to an
idle backend
(2) A frontend receives the request
(4) The backend fetches the pink model
Multiple request dispatchers "Frontends"
Prediction serving (provider perspective)
5
Lots of trained models!
MLaaS Provider Finite compute resources "Backends" for prediction Application / End User
(1) New prediction request for the
pink model
(3) The request is dispatched to an
idle backend
(2) A frontend receives the request
(4) The backend fetches the pink model (5) The request
outcome is predicted
(6) The response is sent back through
the frontend
Multiple request dispatchers "Frontends"
Prediction serving (provider perspective)
5
Lots of trained models!
MLaaS Provider Application / End UserApplication / End User
Multiple request dispatchers "Frontends"
Finite compute resources "Backends" for prediction
Prediction serving (objectives)
6
Lots of trained models!
MLaaS ProviderMLaaS Provider Application / End UserResource efficiency
Low latency, SLAs
Application / End UserApplication / End User
Multiple request dispatchers "Frontends"
Finite compute resources "Backends" for prediction
Prediction serving (objectives)
6
Static partitioning of trained models
7
MLaaS ProviderMLaaS Provider
Static partitioning of trained modelsThe trained models partitioned among the finite backends
7
MLaaS ProviderMLaaS Provider Application / End User
Multiple request dispatchers "Frontends"
Static partitioning of trained models
No need to fetch and install the pink model
The trained models partitioned among the finite backends
7
MLaaS ProviderMLaaS Provider Application / End User
Multiple request dispatchers "Frontends"
Problem: Not all models are used at all times
Static partitioning of trained models
No need to fetch and install the pink model
The trained models partitioned among the finite backends
7
MLaaS ProviderMLaaS Provider Application / End User
Multiple request dispatchers "Frontends"
Problem: Not all models are used at all times
Problem: Many more models than backends, high memory footprint per model
Static partitioning of trained models
No need to fetch and install the pink model
The trained models partitioned among the finite backends
7
MLaaS Provider
Static partitioning of trained modelsMLaaS Provider
Multiple request dispatchers "Frontends"
Problem: Not all models are used at all times
Problem: Many more models than backends, high memory footprint per model
No need to fetch and install the pink model
The trained models partitioned among the finite backends
Application / End UserResource efficiency
Low latency, SLAs
Static partitioning is infeasible
8
Classical approach: autoscaling
The number of active backends are automatically scaled up or
down based on load
Time
# Ac
tive
back
ends
fo
r the
pin
k m
odel Request load for
the pink model
9
Classical approach: autoscaling
The number of active backends are automatically scaled up or
down based on load
Time
# Ac
tive
back
ends
fo
r the
pin
k m
odel Request load for
the pink model
Enough backends to guarantee low latency# Active backends over time is minimized for resource efficiency
With ideal autoscaling ...
9
Autoscaling for MLaaS is challenging [1/3]
10
Autoscaling for MLaaS is challenging [1/3]
Lots of trained models!
MLaaS Provider Finite compute resources "Backends" for prediction
(4) The backend fetches the pink model
(5) The request outcome is predicted
Multiple request dispatchers "Frontends"
10
Autoscaling for MLaaS is challenging [1/3]
Lots of trained models!
MLaaS Provider Finite compute resources "Backends" for prediction
(4) The backend fetches the pink model
(5) The request outcome is predicted
Multiple request dispatchers "Frontends"
Provisioning Time (4)
Execution Time (5)>>
(~ a few seconds) (~ 10ms to 500ms)
Challenge
Predictive autoscaling to hide the provisioning latency
Requirement
10
MLaaS architecture is large-scale, multi-tiered
Frontends
Backends [ VMs, containers ]
Hardware broker
Autoscaling for MLaaS is challenging [2/3]
11
MLaaS architecture is large-scale, multi-tiered
Frontends
Backends [ VMs, containers ]
Hardware broker
Autoscaling for MLaaS is challenging [2/3]
Challenge
Fast, coordination-free, globally-consistent autoscaling
decisions on the frontends
Requirement
Multiple frontends with partial information about
the workload
11
"99% of requests must complete under 500ms"
Strict, model-specific SLAs on response times
"99.9% of requests must complete under 1s"
"[B] Tolerate up to 25% increase in request rates
without violating [A]"
"[A] 95% of requests must complete under
850ms"
Autoscaling for MLaaS is challenging [3/3]
12
"99% of requests must complete under 500ms"
Strict, model-specific SLAs on response times
"99.9% of requests must complete under 1s"
No closed-form solutions to get response-time distributions
for SLA-aware autoscaling
Challenge
Accurate waiting-time and execution-time distributions
Requirement"[B] Tolerate up to 25%
increase in request rates without violating [A]"
"[A] 95% of requests must complete under
850ms"
Autoscaling for MLaaS is challenging [3/3]
12
}Provisioning Time (4)
Execution Time (5)>>
(~ a few seconds) (~ 10ms to 500ms)
Challenges
Multiple frontends withpartial information about
the workload
No closed-form solutions to get response-time distributions
for SLA-aware autoscaling
We address these challenges
by leveraging specificML workload characteristicsand design an analytical model for resource estimationthat allows distributed and predictive autoscaling
Swayam: model-driven distributed autoscaling
13
Outline
1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results
14
System architecture
15
Application / End User
Application / End User
Application / End User
Application / End User
Hardware broker
Frontends Global pool of backends
Backends dedicated for the pink model
Backends dedicated for the blue model
Backends dedicated for the
green model
System architecture
15
Application / End User
Application / End User
Application / End User
Application / End User
Hardware broker
Frontends Global pool of backends
Backends dedicated for the pink model
Backends dedicated for the blue model
Backends dedicated for the
green model
System architecture
1. If load decreases, extra backends go back to the global pool (for resource efficiency)2. If load increases, new backends are set up in advance (for SLA compliance)
Objective: dedicated set of backends should dynamically scale
15
Application / End User
Application / End User
Application / End User
Application / End User
Hardware broker
Frontends
Backends dedicated for the pink model
System architectureLet's focus on the pink model
1. If load decreases, extra backends go back to the global pool (for resource efficiency)2. If load increases, new backends are set up in advance (for SLA compliance)
Objective: dedicated set of backends should dynamically scale
15
Key idea 1: Assign states to each backend
16
cold
warm
In the global pool
Dedicated to a trained model
Key idea 1: Assign states to each backend
16
cold
warm
In the global pool
Dedicated to a trained model in-use
not-in-use
Haven't executed a request for a
while
Maybe executing a request
Key idea 1: Assign states to each backend
16
cold
warm
In the global pool
Dedicated to a trained model in-use
not-in-use
busy
idle
Haven't executed a request for a
while
Maybe executing a request
Waiting for a request
Executing a request
Key idea 1: Assign states to each backend
16
cold
warm
In the global pool
Dedicated to a trained model in-use
not-in-use
busy
idle
Haven't executed a request for a
while
Maybe executing a request
Waiting for a request
Executing a request
Dedicated, but not used due to reduced load
Can be safely garbage collected
(scale-in)
Key idea 1: Assign states to each backend
... or easily transitioned to an in-use state (scale-out)
16
cold
warm
In the global pool
Dedicated to a trained model in-use
not-in-use
busy
idle
Haven't executed a request for a
while
Maybe executing a request
Waiting for a request
Executing a request
Dedicated, but not used due to reduced load
Can be safely garbage collected
(scale-in)
Key idea 1: Assign states to each backend
... or easily transitioned to an in-use state (scale-out)
How do frontends know which dedicated backends to use, and which to not use?
16
Backends dedicated for the pink model
Key idea 2: Order the dedicated set of backends
1 2 3 4 65
1110987 12
17
Backends dedicated for the pink model
Key idea 2: Order the dedicated set of backends
1 2 3 4 65
1110987 12
If 9 backends are sufficient for SLA compliance ...
17
Backends dedicated for the pink model
Key idea 2: Order the dedicated set of backends
1 2 3 4 65
1110987 12
= warm in-use busy/idle = warm not-in-use
Backends dedicated for the pink model
1 2 3 4 65
1110987 12
If 9 backends are sufficient for SLA compliance ...
backends 10-12 transition to not-in-use state
frontends use backends 1-9
17
Backends dedicated for the pink model
Key idea 2: Order the dedicated set of backends
1 2 3 4 65
1110987 12
= warm in-use busy/idle = warm not-in-use
Backends dedicated for the pink model
1 2 3 4 65
1110987 12
If 9 backends are sufficient for SLA compliance ...
backends 10-12 transition to not-in-use state
frontends use backends 1-9
How do frontends know how many backends are sufficient?
17
Frontends
Key idea 3: Swayam instance on every frontend
Incoming requests
Swayam instance
computes globally consistent minimum # backends necessary for SLA compliance
Backends dedicated for the pink model
1 2 3 4 65
1110987 12
= warm in-use busy/idle = warm not-in-use
18
Outline
1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results
19
Making globally-consistent decisions
What is the minimum # backends required for SLA compliance?
at each frontend (Swayam instance)
20
Making globally-consistent decisions
What is the minimum # backends required for SLA compliance?
at each frontend (Swayam instance)
1. Expected request execution time2. Expected request waiting time3. Total request load
20
Making globally-consistent decisions
} leverage ML workload characteristics
What is the minimum # backends required for SLA compliance?
at each frontend (Swayam instance)
1. Expected request execution time2. Expected request waiting time3. Total request load
20
Determining expected request execution times
Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform
0
5
10
15
20
25
30
35
0 50 100 150 200 250 300 350 400N
orm
aliz
ed F
requ
ency
(%)
Service Times (ms)
Trace 1
Data from trace (bin width = 10)
21
Determining expected request execution times
Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform
‣ Fixed-sized feature vectors‣ Input-independent control flow‣ Non-deterministic machine & OS
events main sources of variability
Variation is low
0
5
10
15
20
25
30
35
0 50 100 150 200 250 300 350 400N
orm
aliz
ed F
requ
ency
(%)
Service Times (ms)
Trace 1
Data from trace (bin width = 10)
21
Determining expected request execution times
Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform
‣ Fixed-sized feature vectors‣ Input-independent control flow‣ Non-deterministic machine & OS
events main sources of variability
Variation is low
Modeled using log-normal distributions
0
5
10
15
20
25
30
35
0 50 100 150 200 250 300 350 400N
orm
aliz
ed F
requ
ency
(%)
Service Times (ms)
Trace 1
Data from trace (bin width = 10)
0
5
10
15
20
25
30
35
0 50 100 150 200 250 300 350 400N
orm
aliz
ed F
requ
ency
(%)
Service Times (ms)
Trace 1
Data from trace (bin width = 10)Fitted lognormal distribution
21
Determining expected request waiting times
load balancing (LB)
22
Determining expected request waiting times
load balancing (LB)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)
22
Determining expected request waiting times
load balancing (LB)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned Scheduling
Global and partitioned perform well, but there are implementation tradeoffs
22
Determining expected request waiting times
load balancing (LB)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned Scheduling
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned SchedulingJoin-Idle-Queue
JIQ doesn't result in good tail waiting
times
Global and partitioned perform well, but there are implementation tradeoffs
22
Determining expected request waiting times
load balancing (LB)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned Scheduling
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned SchedulingJoin-Idle-Queue
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned SchedulingJoin-Idle-Queue
Random Dispatch
JIQ doesn't result in good tail waiting
times
Global and partitioned perform well, but there are implementation tradeoffs
Random dispatch gives much better tail
waiting times
22
Determining expected request waiting times
load balancing (LB)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned Scheduling
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned SchedulingJoin-Idle-Queue
0 100 200 300 400 500 600 700
30 40 50 60 70 80 90 100
Wai
ting
Tim
e (m
s)
#Backends
Threshold (350ms)Global Scheduling
Partitioned SchedulingJoin-Idle-Queue
Random Dispatch
JIQ doesn't result in good tail waiting
times
Global and partitioned perform well, but there are implementation tradeoffs
Random dispatch gives much better tail
waiting times
22
We use a LB policy based on random dispatch!
in the near future, to account for high provisioning times
Determining the total request load
23
}in the near future, to account
for high provisioning times
Determining the total request load
Frontends
Hardware broker
L
Since the broker spreads requests uniformly among
each frontends
L' = L/F
F Total # frontendsL'
L'
Total request rate
23
}in the near future, to account
for high provisioning times
Determining the total request load
Frontends
Hardware broker
L
L' = L/F
F L'
L'
‣Predicts L' for near futureEach Swayam instance
Depends on the time to setup a new backend
23
}in the near future, to account
for high provisioning times
Determining the total request load
Frontends
Hardware broker
L
L' = L/F
F L'
L'
‣Predicts L' for near future
Determined from broker / through a
gossip protocol
Each Swayam instance
‣Given F, computes L = F x L'
23
Making globally-consistent decisions
What is the minimum # backends required for SLA compliance?
at each frontend (Swayam instance)
1. Expected request execution time2. Expected request waiting time3. Total request load
24
SLA-aware resource estimationFor each
trained model
Response-Time Threshold
Service Level
Burst Threshold
RTmax
SLmin
U
n = min # backends
25
SLA-aware resource estimationFor each
trained model
Response-Time Threshold
Service Level
Burst Threshold
RTmax
SLmin
U
n = min # backends
Waiting Time Distribution
Execution Time Distribution
Response Time Modeling
Load
SLmin percentile
response time< RTmax?
n = 1 n++
No
Yesx U
25
SLA-aware resource estimationFor each
trained model
Response-Time Threshold
Service Level
Burst Threshold
RTmax
SLmin
U
n = min # backends
Waiting Time Distribution
Execution Time Distribution
Response Time Modeling
Load
SLmin percentile
response time< RTmax?
n = 1 n++
No
Yesx U
Closed-form expression for percentile response time
(see the appendix)
Convolution
25
SLA-aware resource estimationFor each
trained model
Response-Time Threshold
Service Level
Burst Threshold
RTmax
SLmin
U
n = min # backends
Waiting Time Distribution
Execution Time Distribution
Response Time Modeling
Load
SLmin percentile
response time< RTmax?
n = 1 n++
No
Yesx U
Amplified based on the burst threshold
25
SLA-aware resource estimationFor each
trained model
Response-Time Threshold
Service Level
Burst Threshold
RTmax
SLmin
U
n = min # backends
Waiting Time Distribution
Execution Time Distribution
Response Time Modeling
Load
SLmin percentile
response time< RTmax?
n = 1 n++
No
Yesx U
Initialization
Retry, as long as not SLA compliant
25
Compute percentile response time for n
Frontends
Incoming requests
Swayam instance
computes globally consistent minimum # backends necessary for SLA compliance
Backends dedicated for the pink model
1 2 3 4 65
1110987 12
= warm in-use busy/idle = warm not-in-use
26
Swayam Framework
Outline
1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results
27
Evaluation setup
• Prototype in C++ on top of Apache Thrift➡ 100 backends per service➡ 8 frontends➡ 1 broker➡ 1 server (for simulating the clients)
28
Evaluation setup
• Prototype in C++ on top of Apache Thrift➡ 100 backends per service➡ 8 frontends➡ 1 broker➡ 1 server (for simulating the clients)
• Workload➡ 15 production service traces (Microsoft Azure MLaaS)➡ Three-hour traces (request arrival times and computation times)➡Query computation & model setup times emulated by spinning
28
SLA configuration for each model
• Response-time threshold RTmax = 5C ➡C denotes the mean computation time for the model
• Desired service level SLmin = 99%➡ 99% of the requests must have response times under RTmax
• Burst threshold U = 2x ➡ Tolerate increase in request rate by up to 100%
• Initially, 5 pre-provisioned backends
29
Baseline: Clairvoyant Autoscaler (ClairA)
➡ It knows the processing time of each request beforehand➡ It can travel back in time to provision a backend
➡ "Deadline-driven" approach to minimize resource waste
30
Baseline: Clairvoyant Autoscaler (ClairA)
• ClairA1 assumes zero setup times, immediate scale-ins➡Reflects the size of the workload
➡ It knows the processing time of each request beforehand➡ It can travel back in time to provision a backend
➡ "Deadline-driven" approach to minimize resource waste
30
Baseline: Clairvoyant Autoscaler (ClairA)
• ClairA1 assumes zero setup times, immediate scale-ins➡Reflects the size of the workload
• ClairA2 assumes non-zero setup times, lazy scale-ins➡ Swayam-like
➡ It knows the processing time of each request beforehand➡ It can travel back in time to provision a backend
➡ "Deadline-driven" approach to minimize resource waste
30
Baseline: Clairvoyant Autoscaler (ClairA)
• ClairA1 assumes zero setup times, immediate scale-ins➡Reflects the size of the workload
• ClairA2 assumes non-zero setup times, lazy scale-ins➡ Swayam-like
• Both ClairA1 and ClairA2 depend on RTmax, but not on SLmin and U
➡ It knows the processing time of each request beforehand➡ It can travel back in time to provision a backend
➡ "Deadline-driven" approach to minimize resource waste
30
Resource usage vs. SLA compliance
31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
97%
98%
64%
95%
100%
100%
100%
100%
100%
100%
100% 87
%
91%
89%
97%
Frequency of SLA Compliance
31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
97%
98%
64%
95%
100%
100%
100%
100%
100%
100%
100% 87
%
91%
89%
97%
Swayam performs much better than ClairA2 in terms of
resource efficiency
31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
97%
98%
64%
95%
100%
100%
100%
100%
100%
100%
100% 87
%
91%
89%
97%
Swayam is resource efficient but at the cost of
SLA compliance31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
97%
98%
64%
95%
100%
100%
100%
100%
100%
100%
100% 87
%
91%
89%
97%
Swayam is resource efficient but at the cost of
SLA compliance31
Resource usage vs. SLA compliance
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
0 0.2 0.4 0.6 0.8
1 1.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Res
ourc
e us
age
(nor
mal
ized
)
Trace IDs
ClairA1ClairA2
Swayam (frequency of SLA compliance)
97%
98%
64%
95%
100%
100%
100%
100%
100%
100%
100% 87
%
91%
89%
97%
Swayam seems to perform poorly because of a very bursty trace
31
Summary• Perfect SLA, irrespective of the input workload, is too expensive➡ in terms of resource usage (as modeled by ClairA)
32
Summary• Perfect SLA, irrespective of the input workload, is too expensive➡ in terms of resource usage (as modeled by ClairA)
• To ensure resource efficiency, practical systems➡ need to trade off some SLA compliance➡while managing client expectations
32
Summary• Perfect SLA, irrespective of the input workload, is too expensive➡ in terms of resource usage (as modeled by ClairA)
• To ensure resource efficiency, practical systems➡ need to trade off some SLA compliance➡while managing client expectations
• Swayam strikes a good balance, for MLaaS prediction serving➡ by realizing significant resource savings➡ at the cost of occasional SLA violations
32
Summary• Perfect SLA, irrespective of the input workload, is too expensive➡ in terms of resource usage (as modeled by ClairA)
• To ensure resource efficiency, practical systems➡ need to trade off some SLA compliance➡while managing client expectations
• Swayam strikes a good balance, for MLaaS prediction serving➡ by realizing significant resource savings➡ at the cost of occasional SLA violations
• Easy integration into any existing request-response architecture
32
Thank you. Questions?
33