Significant achievements have been made for automated allocation of cloud resources. However, the performance of applications may be poor in peak load periods, unless their cloud resources are dynamically adjusted. Moreover, although cloud resources dedicated to different applications are virtually isolated, performance fluctuations do occur because of resource sharing, and software or hardware failures (e.g. unstable virtual machines, power outages, etc.). We propose a decentralized economic approach for dynamically adapting the cloud resources of various applications, so as to statistically meet their SLA performance and availability goals in the presence of varying loads or failures. According to our approach, the dynamic economic fitness of a Web service determines whether it is replicated or migrated to another server, or deleted. The economic fitness of a Web service depends on its individual performance constraints, its load, and the utilization of the resources where it resides. Cascading performance objectives are dynamically calculated for individual tasks in the application workflow according to the user requirements.
Autonomic SLA-driven Provisioning for Cloud Applications
Nicolas Bonvin, Thanasis Papaioannou, Karl Aberer
CCGRID 2011, May 23-26 2011, Newport Beach, CA, USA
[email protected] - EPFL
Cloud Apps – Issue #1 : Placement
EPFL – LSIR - Nicolas Bonvin
● A distributed, component-based application running on an elastic infrastructure
● Performance of C1, C2 and C3 is probably less than C4
● No info on other VMs colocated on same server !
No control on placement
[Diagram: C1 and C2 run on VM1, C3 on VM2, C4 on VM3; VM1 and VM2 are hosted on Server 1, VM3 on Server 2]
Cloud Apps – Issue #2 : Instability
● Load-balanced traffic to 4 identical components (C1) on 4 identical VMs
– VM performance can vary by up to a factor of 4 ! [Dej2009]
● Physical server, Hypervisor, Storage, ...
● Component overloaded
● Component bug, crash, deadlock, ...
● Failure of C1 on VM4 -> load is rebalanced
Timings shown under each VM as the issues accumulate :
Step                       VM1      VM2      VM3      VM4
Initial state              100 ms   100 ms   100 ms   100 ms
VM performance varies      100 ms   140 ms   100 ms   100 ms
Component overloaded       130 ms   140 ms   100 ms   100 ms
Bug, crash, deadlock       130 ms   140 ms   100 ms   infinity
Load is rebalanced         140 ms   150 ms   130 ms   infinity
Application should react early !
Cloud Apps – Overview
● Build for failures
– Do not trust the underlying infrastructure
– Do not trust your components either !
● Components should adapt to the changing conditions
– Quickly
– Automatically
– e.g. by replacing a wonky VM with a new one
Scarce: a framework to build scalable cloud applications
Architecture Overview
● An agent on each server / VM
– Starts/stops/monitors the components
– Takes decisions on behalf of the components
● An agent communicates with other agents
– Routing table
– Status of the server (resource usage)
[Diagram: servers, each running an agent alongside its components (A, B, E shown), with agents linked by gossiping + broadcast]
An economic approach
● Time is split into epochs (no synchronization between servers)
● Servers charge a virtual rent for hosting a component according to
– Current resource usage (I/O, CPU, ...) of the server
– Technical factors (HW, connectivity, ...)
– Non-technical factors (country stability, ...)
● Components
– Pay virtual rent at each epoch
– Gain virtual money by processing requests
– Take decisions based on balance ( = gain – rent )
● Replicate, migrate, suicide, stay
● Virtual rents are updated by gossiping (no centralized board)
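The per-epoch decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework's actual code: the `Component` class, the `decide` function, the three-epoch look-back window and the `replicate_threshold` value are all hypothetical. The slides only state that a component chooses between replicate, migrate, suicide and stay based on balance = gain – rent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    """A component that tracks its virtual balance, one entry per epoch."""
    name: str
    balances: List[float] = field(default_factory=list)

    def end_epoch(self, gain: float, rent: float) -> float:
        """Record balance = gain - rent for the epoch that just ended."""
        balance = gain - rent
        self.balances.append(balance)
        return balance


def decide(component: Component,
           current_rent: float,
           cheapest_other_rent: Optional[float],
           window: int = 3,                      # hypothetical look-back
           replicate_threshold: float = 10.0     # hypothetical threshold
           ) -> str:
    """Pick one of: replicate, migrate, suicide, stay."""
    recent = component.balances[-window:]
    if len(recent) < window:
        return "stay"          # not enough history yet
    if all(b > replicate_threshold for b in recent):
        return "replicate"     # consistently profitable: add a replica
    if all(b < 0 for b in recent):
        # Consistently losing money: move to a cheaper server if one is known.
        if cheapest_other_rent is not None and cheapest_other_rent < current_rent:
            return "migrate"
        return "suicide"
    return "stay"
```

For instance, a component that earns 30 and pays 5 for three epochs in a row would replicate under these assumed thresholds, while one that earns 1 and pays 5 would migrate if a cheaper server is known. A real implementation would also need to check availability constraints before a suicide, which this sketch leaves out.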
Economic model (i)
● The rent of a server is different for each component !
Economic model (ii)
● VM1 and VM2 have an « identical » resource usage : 45%
● Server rent = server's resource usage weighted by the component's weights
– Rent for C1 @ VM1 > rent for C1 @ VM2
[Diagram: multiplexing of server resources; where should C1 be placed ?]
C1 : CPU 30%, I/O 5%
VM1 : CPU 70%, I/O 20%
VM2 : CPU 25%, I/O 65%
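The numbers on this slide can be worked through with a short Python sketch. The exact rent formula is not recoverable from the slides, so this assumes (hypothetically) that a component's weights are its own resource shares normalized to sum to 1, and that the rent is the weighted sum of the server's usage. The names `component_weights` and `rent` are illustrative only.

```python
from typing import Dict


def component_weights(usage: Dict[str, float]) -> Dict[str, float]:
    """Assumed weighting: normalize the component's own usage to sum to 1."""
    total = sum(usage.values())
    return {resource: value / total for resource, value in usage.items()}


def rent(server_usage: Dict[str, float], weights: Dict[str, float]) -> float:
    """Server usage seen through the component's weights."""
    return sum(weights[r] * server_usage[r] for r in weights)


# Figures taken from the slide (percent).
c1 = {"cpu": 30.0, "io": 5.0}
vm1 = {"cpu": 70.0, "io": 20.0}
vm2 = {"cpu": 25.0, "io": 65.0}

weights_c1 = component_weights(c1)   # CPU dominates: 6/7 vs 1/7
rent_vm1 = rent(vm1, weights_c1)     # 440/7, about 62.9
rent_vm2 = rent(vm2, weights_c1)     # 215/7, about 30.7

# The unweighted average usage is the same on both VMs.
mean_vm1 = sum(vm1.values()) / len(vm1)
mean_vm2 = sum(vm2.values()) / len(vm2)
```

Both VMs show the same 45% average usage, yet the CPU-heavy C1 sees VM1 as roughly twice as expensive as VM2, which matches the slide's claim that the rent for C1 @ VM1 exceeds the rent for C1 @ VM2.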
Economic model (iii)
● Choosing a candidate server j during replication/migration of a component i
– net benefit maximization
● 2 optimization goals :
– high availability by geographical diversity of replicas
– low latency by grouping related components
● g_j : weight related to the proximity of the server location to the geographical distribution of the client requests to the component
● S_i is the set of servers hosting a replica of component i
SLA Performance Guarantees (i)
● Each component has its own SLA constraints
● SLA derived directly from entry components
● Resp. Time = Service Time + max (Resp. Time of Dependencies)
[Diagram: dependency graph of components C1 to C5; C1 is annotated with SLA : 500ms]
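The response-time rule above (Resp. Time = Service Time + max of the dependencies' response times) can be evaluated recursively over the dependency graph. The sketch below is illustrative only: the edges between C1 to C5 and the service times are hypothetical, since the diagram's structure and timings are not recoverable from the slide text, and the graph is assumed to be acyclic.

```python
from typing import Dict, List


def response_time(component: str,
                  service_time: Dict[str, float],
                  deps: Dict[str, List[str]]) -> float:
    """Service time of the component plus the slowest dependency's response time.

    Assumes the dependency graph has no cycles.
    """
    children = deps.get(component, [])
    slowest = max(
        (response_time(child, service_time, deps) for child in children),
        default=0.0,  # a leaf has no dependencies to wait for
    )
    return service_time[component] + slowest


# Hypothetical workflow: C1 calls C2 and C3; C2 calls C4; C3 calls C5.
deps = {"C1": ["C2", "C3"], "C2": ["C4"], "C3": ["C5"]}
# Hypothetical service times in milliseconds.
service_time = {"C1": 50.0, "C2": 100.0, "C3": 80.0, "C4": 120.0, "C5": 200.0}

end_to_end = response_time("C1", service_time, deps)
meets_sla = end_to_end <= 500.0  # 500 ms SLA on C1, as on the slide
```

With these made-up numbers the slowest path is C1, C3, C5 (50 + 80 + 200 = 330 ms), so the 500 ms SLA on C1 would be met. Only the slowest branch matters because dependencies are taken with `max`, which corresponds to calling them in parallel.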
SLA Performance Guarantees (ii)
● SLA propagation from parents to children
● Parent j sends its performance constraints (e.g. response time upper bound) to its dependencies D(j) :
● Child i computes its own performance constraints :
● : group of constraints sent by the replicas of the parent g
SLA Performance Guarantees (iii)
● SLA propagation from parents to children
Automatic Provisioning
● Usage of allocated resources is maximized :
– autonomic migration / replication / suicide of components
– not enough to ensure end-to-end response time
● Cloud resources managed by framework via cloud API
● Each individual component has to satisfy its own SLA
– SLA easily met -> decrease resources (scale down)
– SLA not met -> increase resources (scale up, scale out)
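The scale-up / scale-down rule above can be illustrated with a small decision function. This is a sketch under assumptions: the slides do not define what "easily met" means, so the `slack` factor (here, observed response time below half of the SLA bound) and the function name `provisioning_action` are hypothetical.

```python
def provisioning_action(observed_ms: float, sla_ms: float, slack: float = 0.5) -> str:
    """Map a component's observed response time to a provisioning action.

    `slack` is an assumed fraction of the SLA bound below which the SLA
    counts as "easily met".
    """
    if observed_ms > sla_ms:
        return "increase"   # SLA not met: scale up or scale out
    if observed_ms < slack * sla_ms:
        return "decrease"   # SLA easily met: scale down
    return "keep"           # SLA met without much headroom
```

With a 500 ms bound, an observed 620 ms would trigger an increase, 180 ms a decrease, and 400 ms no change. In practice such a rule would likely be applied to a smoothed statistic over several epochs rather than a single sample, to avoid oscillating between scale-up and scale-down.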
Adaptivity to slow servers
● Each component keeps statistics about its children
– e.g. 95th perc. response time
● A routing coefficient is computed for each child at each epoch
– Send more requests to more performant children
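One way to turn per-child statistics into routing coefficients is sketched below. The slides do not give the actual formula, so this assumes (hypothetically) that each coefficient is proportional to the inverse of the child's 95th percentile response time. The names `routing_coefficients` and `pick_child` are illustrative.

```python
import random
from typing import Dict


def routing_coefficients(p95_ms: Dict[str, float]) -> Dict[str, float]:
    """Coefficients proportional to 1 / p95, normalized to sum to 1.

    Assumes every p95 value is strictly positive and at least one is finite.
    A child reported as float("inf") gets a coefficient of 0.
    """
    inverse = {child: 1.0 / t for child, t in p95_ms.items()}
    total = sum(inverse.values())
    return {child: value / total for child, value in inverse.items()}


def pick_child(coefficients: Dict[str, float], rng: random.Random) -> str:
    """Choose the child for the next request, weighted by its coefficient."""
    children = list(coefficients)
    weights = [coefficients[c] for c in children]
    return rng.choices(children, weights=weights, k=1)[0]


# Hypothetical statistics gathered during the last epoch (milliseconds).
stats = {"replica_a": 100.0, "replica_b": 200.0}
coefficients = routing_coefficients(stats)  # replica_a gets 2/3 of the traffic
```

Recomputing the coefficients at each epoch lets a parent shift traffic away from a child that has slowed down, without removing it from the routing table entirely.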
Evaluation
Evaluation: Setup
● 5 components, mostly CPU-intensive (wc >> wm,wn,wd)
● 8 servers with 8 cores each (Intel Core i7 920, 2.67 GHz, 8 GB, Linux 2.6.32-trunk-amd64)
● d=0, C=110, k =10000, xs* = 25%
[Diagram: dependency graph of components C1 to C5; C1 is annotated with SLA : 500ms]
Adaptation to Varying Load (i)
● 5 rps to 60 rps at minute 8, step 5 rps/min
● Static setup : 2 servers with 2 cores
Adaptation to Varying Load (ii)
● 5 rps to 60 rps at minute 8, step 5 rps/min
● Static setup : 2 servers with 2 cores
Adaptation to Slow Server
● Max 2 cores/server, 25 rps
● At minute 4, a server gets slower (200 ms delay)
Scalability
● Add 5 rps per minute until 150 rps
● Max 6 cores/server
Conclusion
● Framework for building cloud applications
● Elasticity : add/remove resources
● High Availability : software, hardware, network failures
● Scalability : growing load, peaks, scaling down, ...
– Quick replication of busy components
● Load Balancing : load has to be shared by all available servers
– Replication of busy components
– Migration of less busy components
– Reach equilibrium when load is stable
● SLA performance guarantees
– Automatic provisioning
● No synchronization, fully decentralized
Thank you !