mClock: Handling Throughput Variability for Hypervisor IO Scheduling
in USENIX Conference on Operating Systems Design and Implementation (OSDI) 2010

Ajay Gulati, VMware Inc.
Arif Merchant, HP Labs
Peter Varman, Rice University
Outline
• Introduction
• Scheduling goals of mClock
• mClock Algorithm
• Distributed mClock
• Performance Evaluation
• Conclusion
Introduction
• Hypervisors are responsible for multiplexing the underlying hardware resources among VMs
– CPU, memory, network, and storage IO
[Figure: two hosts, each running several VMs over CPU, RAM, and a storage IO scheduler, sharing one storage array]
• The amount of CPU and memory resources on a host is fixed and time-invariant.
• The storage throughput available to a host is not under its own control.
Introduction (cont’d)
• Existing methods provide many knobs for allocating CPU and memory to VMs.
• The current state of the art in IO resource allocation is much more rudimentary.
– It is limited to providing proportional shares to different VMs.
• Lack of QoS support for IO resources can have widespread effects, rendering existing CPU and memory controls ineffective when applications block on IO requests.
Introduction (cont’d)
• The amount of IO throughput available to any particular host can fluctuate widely based on the behavior of other hosts accessing the shared device.
[Figure: per-host throughput over time, fluctuating as VM1, VMs 2–3, VM4, and VM5 start and stop]
Introduction (cont’d)
• Three main controls in resource allocation
– Shares (a.k.a. weights)
• proportional resource allocation
– Reservations
• minimum amount of resource allocation
• to provide latency guarantees
– Limits
• maximum allowed resource allocation
• prevent competing IO-intensive applications from consuming all the spare bandwidth in the system
Scheduling goals of mClock

VM                                     IO Throughput   IO Latency
Remote Desktop (RD)                    Low             Low
Online Transaction Processing (OLTP)   High            Low
Data Migration (DM)                    High            Insensitive

• When reservations cannot be met: allocate in proportion to the reservations.
• When reservations can be met: satisfy reservations first, then allocate the remainder in proportion to the weights.
• Limit the maximum throughput of DM.
Scheduling goals of mClock (cont’d)
• Each VM i has three parameters:
– Reservation (ri), Limit (li), Weight (wi)
• VMs are partitioned into three sets: reservation-clamped (R), limit-clamped (L), or proportional (P), based on whether their current allocation is clamped at the lower bound, at the upper bound, or lies in between.
• Define: [allocation equation not recoverable from the extracted slide]
mClock Algorithm
• mClock uses two main ideas:
– multiple real-time clocks
• Reservation-based, Limit-based, and Weight-based clocks
– dynamic clock selection
• dynamically select one of the clocks for scheduling each request
• The tag assignment method is similar to Virtual Clock scheduling.
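A minimal sketch of this per-request tag assignment, assuming one tag per clock spaced by 1/ri, 1/li, and 1/wi (class and field names are illustrative, not the paper's implementation):

```python
class VMTags:
    """Per-VM tag state for an mClock-style scheduler (illustrative only)."""
    def __init__(self, reservation, limit, weight):
        self.r_inv = 1.0 / reservation   # R-tag spacing, 1/r_i
        self.l_inv = 1.0 / limit         # L-tag spacing, 1/l_i
        self.w_inv = 1.0 / weight        # P-tag spacing, 1/w_i
        self.R = self.L = self.P = 0.0

    def tag_request(self, now):
        # Each tag is the later of "real time now" (the VM was idle) and
        # "previous tag + spacing" (the VM is backlogged) -- as in Virtual Clock.
        self.R = max(now, self.R + self.r_inv)
        self.L = max(now, self.L + self.l_inv)
        self.P = max(now, self.P + self.w_inv)
        return self.R, self.L, self.P

vm = VMTags(reservation=100, limit=500, weight=2)
print(vm.tag_request(now=0.0))  # (0.01, 0.002, 0.5)
```

With a reservation of 100 IOPS, consecutive R tags are 10 ms apart: if the scheduler always serves overdue R tags, the VM gets at least its reserved rate.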
mClock Algorithm (cont’d)
• Tag Adjustment
– calibrates the proportional-share (P) tags against real time
• to prevent starvation
• In virtual-time-based scheduling, this synchronization is done using global virtual time: Si,k = max{Fi,k-1, V(ai,k)}.
• In mClock, the reservation and limit tags must be based on real time, so instead the origin of the existing P tags is adjusted to the current real time.
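The origin shift can be sketched as follows (a simplification with made-up names; the point is that only the common offset changes, so relative P-tag order is preserved):

```python
from types import SimpleNamespace

def adjust_p_tags(vms, now):
    """Shift all backlogged VMs' P tags so the smallest equals real time.
    Relative spacing between tags (i.e. proportional fairness) is kept."""
    backlogged = [v for v in vms if v.queue]
    if not backlogged:
        return
    offset = now - min(v.P for v in backlogged)
    for v in backlogged:
        v.P += offset

# Example: tags 5.0 and 7.0 become 100.0 and 102.0 at real time t = 100.
a = SimpleNamespace(queue=[object()], P=5.0)
b = SimpleNamespace(queue=[object()], P=7.0)
adjust_p_tags([a, b], now=100.0)
print(a.P, b.P)  # 100.0 102.0
```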
mClock Algorithm (cont’d)
• Annotations on the scheduler pseudocode:
– Reservations first: overdue R tags are served before anything else.
– In the weight-based phase, requests are selected only from VMs still under their limit.
– Active_IOs counts the queue length.
– Tag adjustment is applied when an idle VM becomes active.
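The dynamic clock selection above can be sketched as a two-phase dispatch loop (illustrative; the real scheduler also performs tag adjustment and batching):

```python
from types import SimpleNamespace

def pick_next(vms, now):
    """Two-phase dispatch: constraint-based (R tags) first, then
    weight-based (P tags) among VMs still under their limit."""
    # Constraint-based phase: any backlogged VM whose R tag is overdue?
    overdue = [v for v in vms if v.queue and v.R <= now]
    if overdue:
        return min(overdue, key=lambda v: v.R)       # smallest R tag wins
    # Weight-based phase: only VMs whose limit tag has not passed `now`.
    within_limit = [v for v in vms if v.queue and v.L <= now]
    if within_limit:
        return min(within_limit, key=lambda v: v.P)  # smallest P tag wins
    return None  # every VM is at its limit; wait for a tag to mature

# Example: VM "a" has an overdue reservation at t = 1.0, so it goes first.
a = SimpleNamespace(name="a", queue=[object()], R=0.5, L=0.1, P=0.2)
b = SimpleNamespace(name="b", queue=[object()], R=2.0, L=0.1, P=0.1)
print(pick_next([a, b], now=1.0).name)  # a
print(pick_next([a, b], now=0.4).name)  # b (no R tag overdue; smallest P tag)
```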
mClock Algorithm (cont’d)
• This maintains the condition that R tags are always spaced apart by 1/ri, so that reserved service is not affected by the service provided in the weight-based phase.
[Figure: R tags Rk1 … Rk5 spaced 1/rk apart on a time axis; at current time t, request rk3 is served, so the waiting time of rk4 may be longer than 1/rk]
Storage-specific Issues
• Burst Handling
– Storage workloads are known to be bursty, and requests from the same VM often have high spatial locality.
– Bursty workloads that were idle gain a limited preference in scheduling when the system next has spare capacity.
– To accomplish this, VMs are allowed to gain idle credits.
[Figure: P tags on a time axis; after an idle period t_idle, request rk3 arriving at current time t gets tag Pk3 set back from t by up to σi/wi]
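The idle-credit adjustment in the figure can be sketched as follows (a simplification; `idle_credits` plays the role of σi, and the function name is mine):

```python
def assign_p_tag(p_prev, now, weight, idle_credits):
    """P tag for a newly arrived request.  A VM returning from idle may
    start up to idle_credits/weight *behind* real time, so its burst gets
    a limited scheduling preference; a backlogged VM advances by 1/weight."""
    backlogged_tag = p_prev + 1.0 / weight   # normal spacing, 1/w_i
    idle_tag = now - idle_credits / weight   # bounded credit for idle time
    return max(backlogged_tag, idle_tag)

# Backlogged VM: the tag advances by 1/w as usual.
assert assign_p_tag(p_prev=10.0, now=10.1, weight=2, idle_credits=4) == 10.5
# VM idle for a long time: the tag lands idle_credits/weight behind `now`.
assert assign_p_tag(p_prev=1.0, now=100.0, weight=2, idle_credits=4) == 98.0
```

Because the tag can fall at most σi/wi behind real time, the preference is bounded, which is why idle credits do not change long-term allocations.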
Storage-specific Issues (cont’d)
• IO size
– Since larger IOs take longer to complete, differently-sized IOs should not be treated equally by the IO scheduler.
– The IO latency with n random outstanding IOs of size S each can be written as:
Lat(n, S) = n × (Tm + S / Bpeak)
where Tm is the mechanical delay due to seek and disk rotation, and Bpeak is the peak transfer bandwidth of the disk.
– Converting the latency observed for an IO of size S1 to an IO of a reference size S2:
Lat2 / Lat1 = (Tm + S2 / Bpeak) / (Tm + S1 / Bpeak)
– A single request of IO size S is therefore treated as equivalent to (1 + S / (Tm × Bpeak)) requests of the reference size; for a small reference size, the reference transfer term is negligible.
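The size normalization works out as below (a sketch; the Tm and Bpeak values are made-up placeholders, not numbers from the paper):

```python
def normalized_cost(size_bytes, tm=0.005, bpeak=60e6):
    """Scheduling cost of one IO of the given size, in units of
    small reference-sized requests: 1 + S / (Tm * Bpeak).
    tm: mechanical delay in seconds; bpeak: peak bandwidth in bytes/s."""
    return 1.0 + size_bytes / (tm * bpeak)

# A 4KB IO costs barely more than one reference request,
# while a 256KB IO counts as almost two.
print(round(normalized_cost(4 * 1024), 3))    # 1.014
print(round(normalized_cost(256 * 1024), 3))  # 1.874
```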
Storage-specific Issues (cont’d)
• Request Location
– mClock improves the overall efficiency of the system by scheduling IOs with high locality as a batch.
• A VM is allowed to issue IO requests in a batch as long as the requests are close in logical block number space.
• Reservation Setting
– IOPS = outstanding IOs / latency
– An application that keeps 8 IOs outstanding and requires 25 ms latency needs a reservation of 8 / 0.025 = 320 IOPS.
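The slide's worked example follows directly from Little's law:

```python
def reservation_iops(outstanding_ios, latency_seconds):
    """Reservation needed to sustain a given concurrency at a target
    latency (Little's law: IOPS = outstanding IOs / latency)."""
    return outstanding_ios / latency_seconds

# 8 outstanding IOs at 25 ms target latency -> 320 IOPS reservation.
print(reservation_iops(8, 0.025))  # 320.0
```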
Distributed mClock
• Targets cluster-based storage systems.
• dmClock runs a modified version of mClock at each storage server.
– Each request from VM vi to a storage server sj piggybacks two integers, ρi and δi:
• δi: the number of IO requests from vi that have completed service at all the servers between the previous request (from vi) to server sj and the current request
• ρi: the number of IO requests from vi that have been served as part of the constraint-satisfying phase between the previous request to sj and the current request
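The client-side bookkeeping for ρi and δi can be sketched as follows (illustrative names, not the paper's code; the server then advances tags by δ/wi or ρ/ri instead of 1/wi or 1/ri):

```python
class DmClockClient:
    """Per-VM bookkeeping for the rho/delta counters piggybacked on
    each request in dmClock (sketch only)."""
    def __init__(self):
        self.total_done = 0  # completions at all servers (feeds delta)
        self.rho_done = 0    # completions served in the reservation phase
        self.last_sent = {}  # server -> (total_done, rho_done) at last send

    def on_completion(self, served_by_reservation):
        """Called whenever any server completes one of this VM's requests."""
        self.total_done += 1
        if served_by_reservation:
            self.rho_done += 1

    def piggyback(self, server):
        """(rho, delta) to attach to the next request sent to `server`."""
        prev_total, prev_rho = self.last_sent.get(server, (0, 0))
        rho = self.rho_done - prev_rho
        delta = self.total_done - prev_total
        self.last_sent[server] = (self.total_done, self.rho_done)
        return rho, delta

# Three completions cluster-wide, one of them in the reservation phase:
c = DmClockClient()
for by_reservation in (True, False, False):
    c.on_completion(by_reservation)
print(c.piggyback("s1"))  # (1, 3)
print(c.piggyback("s1"))  # (0, 0) -- nothing completed since the last send
```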
Performance Evaluation
• Implemented in the VMware ESX server hypervisor
– by modifying the SCSI scheduling layer in its I/O stack
• The host is a Dell PowerEdge 2950 server
– with two QLogic HBAs connected to an EMC CLARiiON CX3-40 storage array over a FC SAN
– Two different storage volumes were used:
• a 10-disk RAID 0 disk group
• a 10-disk RAID 5 disk group
Performance Evaluation (cont’d)
• Two kinds of VMs
– Linux VMs with a 10GB virtual disk, one VCPU, and 512MB memory
– Windows Server 2003 VMs with a 16GB virtual disk, one VCPU, and 1GB memory
• Workload generators
– Iometer in the Windows Server VMs (http://www.iometer.org/)
– a self-designed workload generator in the Linux VMs
Performance Evaluation (cont’d)
• Limit Enforcement

                 RD                          OLTP                    DM
Workload         32 random IOs (75% read)    Always backlogged       Always backlogged
                 every 250 ms                (75% read)              (all sequential reads)
IO size          4KB                         8KB                     32KB
Latency bound    30 ms                       30 ms                   —
Weight           2                           2                       1

• At t = 140, the limit for DM is set to 300 IOPS.
Performance Evaluation (cont’d)
• Reservations Enforcement
– Five VMs with weights in the ratio 1:1:2:2:2
– VMs are started at 60-second intervals
– SFQ only does proportional allocation; mClock also enforces the reservations (300 IOPS and 250 IOPS in the figure)
Performance Evaluation (cont’d)
• Bursty VM Workloads
– VM1: 128 IOs every 400 ms, all 4KB reads, 80% random
– VM2: 16KB reads, 20% of them random and the rest sequential, with 32 outstanding IOs
– Idle credits do not impact the overall bandwidth allocation over time.
– The latency seen by the bursty VM1 decreases as the idle credits are increased.
Performance Evaluation (cont’d)
• Filebench Workloads
– emulate the workload of OLTP VMs
[25] R. McDougall. Filebench: Application level file system benchmark. http://www.solarisinternals.com/si/tools/filebench/index.php
Performance Evaluation (cont’d)
• dmClock Evaluation
– Implemented in a distributed storage system consisting of multiple storage servers (nodes).
– Each node is a virtual machine running RHEL Linux with a 10GB OS disk and a 10GB experimental disk.
Conclusion
• mClock provides per-VM quality of service. The QoS requirements are expressed as:
– a minimum reservation
– a maximum limit
– a proportional share (weight)
• The controls provided by mClock allow stronger isolation among VMs.
• The techniques are quite generic and can be applied to array-level scheduling and to other resources, such as network bandwidth allocation, as well.
Comments
• Existing VM services provision resources only in terms of CPU, memory, and storage capacity, yet IO throughput may be the largest factor in QoS provisioning
– in terms of response time or delay
• Combining reservations, limits, and proportional shares in one scheduling algorithm is a good idea.
– WF2Q-M considered limits but not reservations.
• How should reservations, limits, and proportional shares be coordinated between VMs on different hosts?
Comments (cont’d)
• The experiments only validate the correctness of mClock.
– What about short-term fairness, latency distributions, and computation overhead?
• The experiments use only one host machine.
– They cannot reflect the throughput variability that arises when multiple hosts share the array.