© 2010 VMware Inc. All rights reserved
The Design and Evolution of Live Storage Migration in VMware ESX Ali Mashtizadeh, VMware, Inc. Emré Celebi, VMware, Inc. Tal Garfinkel, VMware, Inc. Min Cai, VMware, Inc.
2
Agenda
What is live migration?
Migration architectures
Lessons
3
What is live migration (vMotion)?
Moves a VM between two physical hosts
No noticeable interruption to the VM (ideally)
Use cases: • Hardware/software upgrades
• Distributed resource management • Distributed power management
4
Live Migration
(Diagram: a virtual machine on the source host migrating to the destination host)
Disk is placed on a shared volume (100s of GB to TBs)
CPU and device state are copied (MBs)
Memory is copied (GBs)
• Large and changes often → copy iteratively (sketched below)
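Memory is the tricky part because it is large and keeps changing while the VM runs, so it is copied in passes: everything once, then only the pages dirtied during the previous pass, until the remainder is small enough to move during a brief pause. A minimal sketch of that loop, assuming hypothetical source/dest objects, method names, and thresholds (illustrative stand-ins, not ESX internals):

```python
# Illustrative pre-copy loop for live migration; the source/dest objects,
# methods, and thresholds are hypothetical stand-ins, not ESX APIs.

def iterative_precopy(source, dest, all_pages, small_enough=64, max_passes=10):
    to_copy = set(all_pages)                 # first pass: copy everything
    for _ in range(max_passes):
        source.clear_dirty_tracking()        # start recording new writes
        for page in to_copy:
            dest.write_page(page, source.read_page(page))
        to_copy = source.dirtied_pages()     # pages written during the pass
        if len(to_copy) <= small_enough:     # remainder fits in a short pause
            break
    source.suspend()                         # brief downtime: stop the VM
    for page in to_copy:                     # copy the final dirty set
        dest.write_page(page, source.read_page(page))
    dest.resume()                            # VM continues on the destination
```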
5
Live Storage Migration
What is live storage migration? • Migration of a VM’s virtual disks
Why does this matter? • VMs can be very large • Array maintenance means you may migrate all VMs in an array
• Migration times are measured in hours
6
Live Migration and Storage Live Migration – a short history
ESX 2.0 (2003) – Live migration (vMotion) • Virtual disks must live on shared volumes
ESX 3.0 (2006) – Live storage migration lite (Upgrade vMotion) • Enabled upgrade of VMFS by migrating the disks
ESX 3.5 (2007) – Live storage migration (Storage vMotion) • Storage array upgrade and repair • Manual storage load balancing
• Snapshot based
ESX 4.0 (2009) – Dirty block tracking (DBT)
ESX 5.0 (2011) – IO Mirroring
7
Agenda
What is live migration?
Migration architectures
Lessons
8
Goals
Migration Time • Minimize total end-to-end migration time
• Predictability of migration time
Guest Penalty • Minimize performance loss • Minimize downtime
Atomicity • Avoid dependence on multiple volumes (for replication fault domains)
Guarantee Convergence • Ideally we want migrations to always complete successfully
9
Convergence
Migration time vs. downtime
More migration time → more guest performance impact
More downtime → more service unavailability
Factors that affect convergence: • Block dirty rate
• Available storage network bandwidth • Workload interactions
Challenges: • Many workloads interacting on storage array • Unpredictable behavior
(Chart: data remaining vs. time, annotated with the migration time and downtime phases)
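The convergence question reduces to whether the effective copy rate exceeds the block dirty rate: if it does not, each pass leaves at least as much data behind as the one before. The rough estimate below illustrates this; the bandwidth and dirty-rate numbers are made up for illustration, not measurements from this work.

```python
# Rough convergence estimate for an iterative copy; all numbers are
# illustrative, not measurements from the talk.

def remaining_after_pass(remaining_gb, copy_mbps, dirty_mbps):
    """Data left to copy after one pass, assuming constant rates."""
    pass_seconds = remaining_gb * 1024 / copy_mbps
    return pass_seconds * dirty_mbps / 1024   # GB dirtied during the pass

remaining = 32.0                    # GB still to copy
copy_bw, dirty_rate = 200.0, 50.0   # MB/s
for i in range(5):
    remaining = remaining_after_pass(remaining, copy_bw, dirty_rate)
    print(f"pass {i + 1}: {remaining:.2f} GB remaining")
# The remainder shrinks only because copy_bw > dirty_rate; with
# dirty_rate >= copy_bw it never shrinks and the migration cannot converge.
```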
10
Migration Architectures
Snapshotting
Dirty Block Tracking (DBT) • Heat Optimization
IO Mirroring
11
Snapshot Architecture – ESX 3.5 U1
(Architecture diagram: ESX host with VMDKs on the source and destination volumes)
12
Synthetic Workload
Synthetic Iometer workload (OLTP): • 30% Write/70% Read
• 100% Random • 8KB IOs
• Outstanding IOs (OIOs) from 2 to 32
Migration Setup: • Migrated both the 6 GB System Disk and 32 GB Data Disk
Hardware: • Dell PowerEdge R710: Dual Xeon X5570 2.93 GHz
• Two EMC CX4-120 arrays connected via 8Gb Fibre Channel
13
Migration Time vs. Varying OIO
(Chart: migration time in seconds vs. workload OIO from 2 to 32 for the snapshot architecture)
14
Downtime vs. Varying OIO
(Chart: downtime in seconds vs. workload OIO from 2 to 32 for the snapshot architecture)
15
Total Penalty vs. Varying OIO
(Chart: effective lost workload time in seconds vs. workload OIO from 2 to 32 for the snapshot architecture)
16
Snapshot Architecture
Benefits • Simple implementation
• Built on existing and well tested infrastructure
Challenges • Suffers from snapshot performance issues • Disk space: Up to 3x the VM size
• Not an atomic switch from source to destination • A problem when spanning replication fault domains
• Downtime • Long migration times
17
Snapshot versus Dirty Block Tracking Intuition
Virtual-disk-level snapshots have overhead to maintain metadata
They require lots of disk space
We want to operate more like live migration • Iterative copy
• Block level copy rather than disk level – enables optimizations
We need a mechanism to track writes
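A minimal sketch of such a tracking mechanism, assuming a simple per-block dirty bitmap; the class name and block granularity are illustrative, not the ESX implementation:

```python
# Minimal dirty block tracking sketch; not the ESX implementation.

class DirtyBlockTracker:
    def __init__(self, num_blocks):
        self.dirty = bytearray(num_blocks)     # one flag per disk block

    def on_guest_write(self, block):
        self.dirty[block] = 1                  # mark block for re-copy

    def drain(self):
        """Return the blocks dirtied since the last pass and reset them."""
        blocks = [b for b, d in enumerate(self.dirty) if d]
        for b in blocks:
            self.dirty[b] = 0
        return blocks

tracker = DirtyBlockTracker(num_blocks=1024)
tracker.on_guest_write(7)
tracker.on_guest_write(512)
print(tracker.drain())   # [7, 512] -> copy these in the next pass
```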
18
Dirty Block Tracking (DBT) Architecture – ESX 4.0/4.1
(Architecture diagram: ESX host with VMDKs on the source and destination volumes)
19
Data Mover (DM)
Kernel Service • Provides disk copy operations
• Avoids memory copy (DMAs only)
Operation (default configuration) • 16 Outstanding IOs • 256 KB IOs
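A simplified picture of that default configuration, keeping up to 16 copy requests of 256 KB in flight; Python threads and pread/pwrite stand in purely for illustration, whereas the real data mover issues asynchronous kernel IOs and avoids extra memory copies:

```python
# Simplified sketch of a data-mover-style copy: 256 KB requests with up to
# 16 in flight at a time. Illustrative only; not the ESX kernel service.

import os
from concurrent.futures import ThreadPoolExecutor

IO_SIZE = 256 * 1024      # 256 KB per request
OUTSTANDING = 16          # requests kept in flight

def copy_chunk(src_fd, dst_fd, offset):
    data = os.pread(src_fd, IO_SIZE, offset)
    os.pwrite(dst_fd, data, offset)

def datamover_copy(src_path, dst_path):
    src_fd = os.open(src_path, os.O_RDONLY)
    size = os.fstat(src_fd).st_size
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    with ThreadPoolExecutor(max_workers=OUTSTANDING) as pool:
        for offset in range(0, size, IO_SIZE):
            pool.submit(copy_chunk, src_fd, dst_fd, offset)
    os.close(src_fd)
    os.close(dst_fd)
```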
20
Migration Time vs. Varying OIO
(Chart: migration time in seconds vs. workload OIO from 2 to 32, comparing Snapshot and DBT)
21
Downtime vs. Varying OIO
(Chart: downtime in seconds vs. workload OIO from 2 to 32, comparing Snapshot and DBT)
22
Total Penalty vs. Varying OIO
(Chart: effective lost workload time in seconds vs. workload OIO from 2 to 32, comparing Snapshot and DBT)
23
Dirty Block Tracking Architecture
Benefits • Well-understood architecture, similar to live VM migration
• Eliminated performance issues associated with snapshots • Enables block level optimizations
• Atomicity
Challenges • Migrations may not converge (and then cannot complete with reasonable downtime)
• Destination slower than source • Insufficient copy bandwidth
• Convergence logic difficult to tune • Downtime
24
Block Write Frequency – Exchange Workload
25
Heat Optimization – Introduction
Defer copying data that is frequently written
Detects frequently written blocks • File system metadata
• Circular logs • Application specific data
No significant benefit for: • Copy on write file systems (e.g. ZFS, HAMMER, WAFL) • Workloads with limited locality (e.g. OLTP)
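One way to realize this is to count writes per block and defer blocks that cross a heat threshold to the final passes, so they are copied once rather than repeatedly. A sketch of that bookkeeping; the threshold and class names are illustrative assumptions, not the ESX heuristic:

```python
# Illustrative heat tracking: count writes per block and defer "hot"
# blocks so they are copied late (ideally once) instead of on every pass.

from collections import Counter

class HeatTracker:
    def __init__(self, hot_threshold=4):
        self.writes = Counter()
        self.hot_threshold = hot_threshold

    def on_guest_write(self, block):
        self.writes[block] += 1

    def split(self, dirty_blocks):
        """Partition a dirty set into (copy_now, defer_until_later)."""
        cold = [b for b in dirty_blocks if self.writes[b] < self.hot_threshold]
        hot = [b for b in dirty_blocks if self.writes[b] >= self.hot_threshold]
        return cold, hot

heat = HeatTracker()
for b in [10, 10, 10, 10, 42]:        # block 10 is written repeatedly
    heat.on_guest_write(b)
print(heat.split([10, 42]))           # ([42], [10]): copy 42 now, defer 10
```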
26
Heat Optimization – Design
(Diagram: heat tracking across disk LBAs)
27
DBT versus IO Mirroring Intuition
Live migration intuition – intercepting all memory writes is expensive • Trapping interferes with data fast path
• DBT traps only the first write to a page • Writes are batched to aggregate subsequent writes without trapping
Intercepting all storage writes is cheap • IO stack processes all IOs already
IO Mirroring • Synchronously mirror all writes • Single pass copy of the bulk of the disk
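Concretely, a guest write that lands in the region the bulk copy has already passed is issued to both disks, while writes ahead of the copy need no mirroring because the single pass will pick them up. A minimal sketch of that write path; the disk objects and the copied_up_to cursor are hypothetical stand-ins, and the locked in-flight copy region shown later in the deck is omitted for brevity:

```python
# Minimal IO mirroring sketch; source/dest disk objects and the region
# boundary are hypothetical stand-ins for the ESX internals.

def guest_write(offset, data, src_disk, dst_disk, copied_up_to):
    """Synchronously mirror writes that fall in the already-copied region."""
    src_disk.write(offset, data)                 # source always gets the write
    if offset < copied_up_to:                    # region the bulk copy passed
        dst_disk.write(offset, data)             # keep destination identical
    # Writes beyond copied_up_to need no mirroring: the single-pass bulk
    # copy has not reached them yet and will pick the new data up later.
```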
28
IO Mirroring – ESX 5.0
(Architecture diagram: ESX host mirroring VMDK writes to the source and destination volumes)
29
Migration Time vs. Varying OIO
(Chart: migration time in seconds vs. workload OIO from 2 to 32, comparing Snapshot, DBT, and IO Mirroring)
30
Downtime vs. Varying OIO
(Chart: downtime in seconds vs. workload OIO from 2 to 32, comparing Snapshot, DBT, and IO Mirroring)
31
Total Penalty vs. Varying OIO
(Chart: effective lost workload time in seconds vs. workload OIO from 2 to 32, comparing Snapshot, DBT, and IO Mirroring)
32
IO Mirroring
Benefits • Minimal migration time
• Near-zero downtime • Atomicity
Challenges • Complex code to guarantee atomicity of the migration
• Odd guest interactions require code for verification and debugging
33
Throttling the source
34
IO Mirroring to Slow Destination
(Chart: IOPS over time during a migration to a slow destination; series: source read IOPS, source write IOPS, and destination write IOPS, with the migration start and end marked)
35
Agenda
What is live migration?
Migration architectures
Lessons
36
Recap
In the beginning, there was live migration
Snapshot: • Usually has the worst downtime/penalty
• Whole disk level abstraction • Snapshot overheads due to metadata
• No atomicity
DBT: • Manageable downtime (except when OIO > 16)
• Enabled block level optimizations • Difficult to make convergence decisions
• No natural throttling
37
Recap – Cont.
Insight: storage is not memory • Interposing on all writes is practical and performant
IO Mirroring: • Near-zero downtime • Best migration time consistency
• Minimal performance penalty • No convergence logic necessary
• Natural throttling
38
Future Work
Leverage workload analysis to reduce mirroring overhead • Defer mirroring regions with potential sequential write IO patterns
• Defer hot blocks • Read ahead for lazy mirroring
Apply mirroring to WAN migrations • New optimizations and hybrid architecture
39
Thank You!
40
Backup Slides
41
Exchange Migration with Heat
(Chart: data copied in MB per iteration 1 through 10, broken into hot blocks and cold blocks, with a baseline without heat for comparison)
42
Exchange Workload
Exchange 2010: • Workload generated by Exchange Load Generator
• 2000 User mailboxes • Migrated only the 350 GB mailbox disk
Hardware: • Dell PowerEdge R910: 8-core Nehalem-EX
• EMC CX3-40
• Migrated between two 6-disk RAID-0 volumes
43
Exchange Results
Type                   Migration Time (s)   Downtime (s)
DBT                    2935.5               13.297
Incremental DBT        2638.9                7.557
IO Mirroring           1922.2                0.220
DBT (2x)               Failed                -
Incremental DBT (2x)   Failed                -
IO Mirroring (2x)      1824.3                0.186
44
IO Mirroring Lock Behavior
Moving the lock region:
1. Queue all new IOs and wait for non-mirrored in-flight read IOs to complete
2. Move the lock range
3. Release queued IOs
Disk regions during the move:
I. Mirrored region: write IOs mirrored
II. Locked region: IOs deferred until lock release
III. Unlocked region: write IOs go to the source only
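A compact sketch of that sequence; the disk object and its method names are illustrative, not the ESX code:

```python
# Illustrative lock-region move for IO mirroring; not the ESX code.

def move_lock_region(disk, new_lock_range):
    disk.queue_new_ios()                  # 1. hold newly arriving guest IOs
    disk.wait_for_unmirrored_reads()      #    and drain in-flight reads
    disk.lock_range = new_lock_range      # 2. advance the lock range
    disk.release_queued_ios()             # 3. let the held IOs proceed
```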
45
Non-trivial Guest Interactions
Guest IO crossing disk locked regions
Guest buffer cache changing
Overlapped IOs
46
Lock Latency and Data Mover Time
(Chart: data mover copy time in ms and lock latency in ms over time in seconds, with regions marked as no contention, lock contention, and lock contention from IO mirroring locking a second disk)
47
Source/Destination Valid Inconsistencies
Normal Guest Buffer Cache Behavior
This inconsistency is okay! • Source and destination are both valid crash consistent views of the disk
(Timeline diagram of events: guest OS issues IO, source IO issued, destination IO issued, guest OS modifies buffer cache)
48
Source/Destination Valid Inconsistencies
Overlapping IOs (Synthetic workloads only)
Seen in Iometer and other synthetic benchmarks
File systems do not generate this
(Diagram: IO reordering causes IO 1 and IO 2, which overlap in LBA, to land in different issue orders on the virtual disk, the source disk, and the destination disk)
49
Incremental DBT Optimization – ESX 4.1
(Diagram: writes marking disk blocks dirty, with each written block labeled either Copy or Ignore)
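One plausible reading of the copy/ignore split, consistent with how an incremental tracking optimization typically works, is that a guest write only needs to be recorded as dirty if the bulk copy has already passed that block; writes ahead of the copy cursor will be picked up by the copy anyway. A small illustrative sketch of that test; the names are assumptions, not the ESX implementation:

```python
# Illustrative incremental DBT test: only track writes behind the copy
# cursor; writes ahead of it are captured by the bulk copy anyway.

def on_guest_write(block, copy_cursor, dirty):
    if block < copy_cursor:       # already copied once -> must re-copy
        dirty.add(block)
    # else: ignore; the single-pass bulk copy has not reached it yet

dirty = set()
on_guest_write(5, copy_cursor=100, dirty=dirty)     # behind cursor -> tracked
on_guest_write(250, copy_cursor=100, dirty=dirty)   # ahead of cursor -> ignored
print(dirty)   # {5}
```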