Extreme Performance Series: vSphere Compute & Memory
Fei Guo, VMware, Inc
Seong Beom Kim, VMware, Inc
INF5701
#INF5701
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
vSphere CPU Management
Outline
• What to Expect on VM Performance
• Ready Time (%RDY)
• VM Sizing: How Many vCPUs?
• NUMA & vNUMA
Set the Right Expectation on VM Performance
What Happens When a VCPU Goes Idle / Active?
[Diagram: VM, VMM (VT / AMD-V), and VMkernel layers. Privileged instructions and TLB misses trap into the VMM; IO requests are issued to IO threads; on HLT the VCPU is de-scheduled and its state set to IDLE; on wakeup the VCPU state moves to RDY and it waits in the ready queue to be scheduled.]
When Your App is Slow in a VM
• High virtualization overhead
– A lot of privileged instructions / operations (CPUID, mov CR3, etc.)
– A lot of TLB misses (addressing huge memory); large pages help a lot
• Resource contention
– High ready (%RDY) time?
– Host memory swap? (i.e. memory over-commit)
Reasonable Expectations for VM Performance
• Best cases
– Computation heavy, small memory footprint
– No CPU / memory over-commit
– ~100% of bare metal performance
• Common cases
– Moderate mix of compute / memory / IO
– Little to no CPU / memory over-commit
– ~90% of bare metal performance
• Worst cases
– Huge number of TLB misses / privileged instructions
– Heavy ESXi host memory swap
%RDY Can Happen Without CPU Contention
CPU Scheduler Accounting
[Timeline diagram: a VCPU's time is split into segments]
• A: CPU scheduling cost → %RDY
• B: Time in ready queue → %RDY
• C: Actual execution → %RUN
• D: Interrupted time → %OVRLP (also added to %SYS if the interrupt work is for this VM)
• E: Efficiency loss from power mgmt, hyper-threading, etc.

%USED = %RUN + %SYS - %OVRLP - E (see the sketch below)
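To make the bookkeeping concrete, here is a minimal sketch in Python of the %USED derivation above; the sample values are hypothetical, not from the talk.

```python
def pct_used(pct_run: float, pct_sys: float,
             pct_ovrlp: float, efficiency_loss: float) -> float:
    """esxtop-style accounting: %USED = %RUN + %SYS - %OVRLP - E."""
    return pct_run + pct_sys - pct_ovrlp - efficiency_loss

# Hypothetical sample: 80% running, 5% system time charged to this VM,
# 3% overlap, 2% lost to power mgmt / hyper-threading.
print(pct_used(80.0, 5.0, 3.0, 2.0))  # -> 80.0
```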
Meaning of High %RDY
A: Scheduling Cost, B: Time in Ready Queue, C: Actual Execution
[Diagram: a long B segment between A and C — caused by CPU contention; limit or low shares; poor CPU affinity; poor load balancing]
[Diagram: many short A–C pairs with no B, e.g. frequent sleep/wakeup — %RDY dominated by CPU scheduling cost]
Troubleshooting High %RDY
• High queue time
– Check for DRS load balancing issues
– Check CPU resource specification (limit, low shares)
• %MLMTD: percent of time in the RDY state due to a CPU limit
– Avoid using CPU affinity
• Dominant CPU scheduling cost
– Change application behavior (avoid frequent sleep / wakeup)
– Delay or do not yield the PCPU (see the sketch below)
• monitor.idleLoopSpinUS > 0: burns more CPU power; OK for consolidation
• LatencySensitivity = HIGH: power efficient; bad for consolidation
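As one way to apply the monitor.idleLoopSpinUS option programmatically, here is a hedged pyVmomi sketch (the service connection and VM lookup are assumed to exist elsewhere; the 50 µs value is only an example):

```python
from pyVmomi import vim

def set_idle_loop_spin(vm, spin_us=50):
    """Make the VMM spin briefly in the idle loop before yielding the
    PCPU, trading CPU power for less de-schedule/re-schedule cost."""
    opt = vim.option.OptionValue(key="monitor.idleLoopSpinUS",
                                 value=str(spin_us))
    spec = vim.vm.ConfigSpec(extraConfig=[opt])
    return vm.ReconfigVM_Task(spec)  # takes effect at next power-on
```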
Same %RDY, Different Performance Impact
%RDY Impact on Throughput
[Chart: throughput (bops) vs. %RDY from 0 to 20]
• Throughput workload
• Java server
• CPU & memory
• Throughput drop is roughly proportional to %RDY
%RDY Impact on Latency
• Latency workload
• In-memory key-value store
• CPU & memory
• %RDY can have significant impact on tail latency
• Same %RDY but different impact
[Chart: 99.99th percentile latency (msec) vs. %RDY, for a “spiky” and a “flat” CPU contention pattern]
When %RDY Is Acceptable
• VMs are consolidated into one NUMA node
– When VMs share data (communication, same IO context, etc.)
– %RDY may increase
– Better than running slowly, without %RDY, on separate NUMA nodes
• vSphere 6.0 becomes less aggressive
– Leaves 10% CPU headroom
– Lower /Numa/CoreCapRatioPct to increase the headroom
Oversizing VM is Wasteful and Even Harmful
Unused VCPU Wastes CPU
[Chart: %USED of an idle VM by guest OS / timer frequency — RHEL5 100Hz (*), RHEL5 1kHz, RHEL6 tickless (*), Win2k8 64Hz (*), Win2k8 1kHz]
• Idle VCPU does consume CPU
• Can be significant with 1kHz timer (RHEL5 1kHz)
• Mostly trivial
Over-sizing VM Can Hurt Performance
• Single-threaded app
• Does not benefit from more VCPUs
• Hurt by in-guest migrations
[Chart: normalized throughput vs. VM size (1–64 vCPUs) for a single-threaded app]
ESXi is Optimized for NUMA
What is NUMA?
• Non-Uniform Memory Access system architecture
– Each node consists of CPU cores and memory
• A VM can access memory on remote NUMA nodes, but at a performance cost
– Access time can be 30% ~ 200% longer (see the worked example below)
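A quick worked example of what that range means for average access time (the latencies and penalty below are illustrative numbers, not measurements from the talk):

```python
def avg_access_ns(local_ns, remote_penalty, remote_fraction):
    """Average memory access time when a fraction of accesses go remote."""
    remote_ns = local_ns * (1.0 + remote_penalty)
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

# 100 ns local access, remote 50% slower (inside the 30%~200% range),
# half of all accesses remote -> 25% higher average latency.
print(avg_access_ns(100.0, 0.50, 0.5))  # -> 125.0
```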
ESXi Schedules VM for Optimal NUMA Performance
[Diagram: NUMA node 1 and NUMA node 2, each with cores C0–C5 and local memory; the VM's VCPUs and memory are placed within a single node — good memory locality]
Wide VM Needs vNUMA
[Diagram: a wide VM spanning both NUMA nodes; without vNUMA the guest sees no topology and memory accesses cross nodes — poor memory locality]
vNUMA Achieves Optimal NUMA Performance
[Diagram: the same wide VM with vNUMA exposed to the guest; the guest OS keeps memory accesses node-local — good memory locality]
Stick to vNUMA Default
Do Not Change coresPerSocket Without Good Reason
• Changing the default means you set the vNUMA size
• If licensing requires fewer vSockets
– Find the optimal vNUMA size
– Match coresPerSocket to the vNUMA size
– e.g. a 20-vCPU VM on a 10 cores/node system:
• Default vNUMA size = 10 vCPUs per vNode
• Set coresPerSocket = 10 (see the sketch below)
• Enabling “CPU Hot Add” disables vNUMA
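The licensing example above amounts to simple arithmetic; a hypothetical helper sketching it in Python (the divisibility assumption is mine, not from the talk):

```python
def cores_per_socket_for(vcpus, host_cores_per_node):
    """Match coresPerSocket to the default vNUMA size so each
    virtual socket maps onto one vNUMA node."""
    vnuma_size = min(vcpus, host_cores_per_node)
    assert vcpus % vnuma_size == 0, "pick a vCPU count divisible by vNUMA size"
    return vnuma_size

# 20-vCPU VM on a 10 cores/node host -> coresPerSocket = 10 (2 vSockets)
print(cores_per_socket_for(20, 10))  # -> 10
```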
Key Takeaways
Summary
• Set the right expectation on VM performance
• %RDY can happen without CPU contention
– Watch out for frequent sleep / wakeup
• Same %RDY, different performance impact
– More significant impact on tail latency
• Oversizing a VM wastes CPU and may hurt performance
• ESXi is optimized for NUMA
• Stick to the vNUMA default
• Check out the CPU scheduler white paper
– https://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf
vSphere Memory Management
Outline
• ESXi Memory Management Basics
• VM Sizing
• Reservation vs. Preallocation
• Page Sharing vs. Large Pages
• Memory Overhead
• Memory Overcommitment Guidance
Memory Terminology
• Memory Size: the total amount of memory, split into Allocated Memory and Free Memory
• Active Memory: allocated memory recently accessed or used
• Idle Memory: allocated memory not recently accessed
Task of the Memory Scheduler
• Compute memory entitlement for each VM
– Based on reservation, limit, shares, and memory demand
– Memory demand is determined by active memory (sampling-based estimation)
• Reclaim guest memory if entitlement < consumed (see the sketch below)
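A minimal sketch of that trigger (the real entitlement calculation also weighs reservation, limit, and shares; this only restates the reclaim condition):

```python
def reclaim_target_mb(entitlement_mb, consumed_mb):
    """Memory to reclaim (sharing/ballooning/compression/swap) when a
    VM consumes more than its entitlement; zero otherwise."""
    return max(0.0, consumed_mb - entitlement_mb)

print(reclaim_target_mb(entitlement_mb=4096, consumed_mb=5120))  # -> 1024.0
```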
Performance goal: handle burst memory pressure well
Memory Reclamation Basics (vSphere 5.5 and earlier)
[Diagram: host memory from 0 to max, split into consumed and free; when free memory drops below minFree the state changes from HIGH to LOW]

State | Page Sharing | Ballooning | Compression | Swapping
High  |      X       |            |             |
Low   |      X       |     X      |      X      | X (expensive)
Refer to http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf for details.
Reservation vs. Preallocation
Different in Many Aspects
• Reservation
– Used in admission control and entitlement calculation
– Setting it does NOT mean memory is fully allocated
– General protection against memory reclamation
• Preallocation
– Memory is fully reserved AND fully allocated
– Advanced config option: sched.mem.prealloc = TRUE (see the sketch below)
– Mostly used for latency-sensitive workloads
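For illustration, a pyVmomi sketch that sets both a full reservation and the sched.mem.prealloc option (connection setup and error handling omitted; assumes the VM is powered off):

```python
from pyVmomi import vim

def reserve_and_preallocate(vm):
    """Fully reserve the VM's memory and request up-front allocation."""
    spec = vim.vm.ConfigSpec()
    # Full reservation: admission control guarantees the memory and
    # protects it from reclamation.
    spec.memoryAllocation = vim.ResourceAllocationInfo(
        reservation=vm.config.hardware.memoryMB)
    # Preallocation: memory is also fully allocated, not just reserved.
    spec.extraConfig = [vim.option.OptionValue(key="sched.mem.prealloc",
                                               value="TRUE")]
    return vm.ReconfigVM_Task(spec)
```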
VM Sizing
Guard Against “Active Memory” Reclamation
• VM memory size > the peak demand
• If necessary, set the reservation above guest demand
Page Sharing & Large Page
Memory Saving from Page Sharing
• Significant for homogeneous VMs

Workload   | Guest OS   | # of VMs | Total guest memory
Swingbench | RedHat 5.6 | 12       | 48GB
VDI        | Windows 7  | 15       | 60GB
[Pie charts: Swingbench — 43% shared vs. 57% non-shared, 34% of memory saved by sharing; VDI — 75% shared vs. 25% non-shared, 73% saved]
What “Prevents” Sharing
• Guest features
– ASLR (Address Space Layout Randomization): less than 50MB sharing reduction
– Superfetch (proactive caching): largely reduces sharing; the increase in I/Os hurts VM performance
• Host features
– Host large pages: ESXi does not share large pages, but the page-sharing scanning thread still works (generates page signatures)
Why Large Pages?
• Fewer TLB misses; faster page table lookup time
• Enabled by default

Guest Large Pages | Host Large Pages | SPECjbb  | Swingbench
√                 | √                | +30%     | +12%
×                 | √                | +12%     | +7%
√                 | ×                | +6%      | -
×                 | ×                | baseline | baseline
Large Page Impact on Memory Overcommitment
• Higher memory pressure due to no sharing
• A large page is broken when any of its small pages is ballooned or swapped
– Sharing happens thereafter
[Chart: “Memory Overcommitment with Swingbench VMs” — Ballooned / Swapped / Shared memory (GB) and # of Large Pages (nrLarge) vs. time (minutes)]
New in vSphere 6.0
• Adds a new memory state “Clear” (between High and Low)
• Large pages are broken in the Clear state
– Only if they contain shareable small pages
– Avoids entering the Low state
– Best use of large pages and small pages
[Diagram: free-memory states High → Clear → Low around minFree]
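Restated as a toy decision function (my wording of the policy above, not ESXi code):

```python
def should_break_large_page(mem_state, has_shareable_small_pages):
    """In the Clear state, break a large page only when its small pages
    are shareable, so sharing can relieve memory pressure before the
    host ever reaches the Low state."""
    return mem_state == "Clear" and has_shareable_small_pages
```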
Performance Improvement
[Chart: VDI average latency (seconds) vs. # of extra VMs (0–4), ESXi 5.5 vs. ESXi 6.0]
• ESXi 6.0 (with Clear state): sharing happens much earlier => no ballooning/swapping!
• Reference: http://dl.acm.org/citation.cfm?id=27311870
[Charts: Total Ballooned + Swapped Memory (MB) vs. time (minutes), and Total Shared Memory (GB) vs. time (minutes) — ESXi 5.5 vs. ESXi 6.0]
Overhead Memory
Per Host & Per VM
• Composed of MANY components
– In an idle host, the kernel overhead memory breakdown looks like this …
• Impossible to construct an accurate formula
“Experimentally Safe” Estimation
• Per-VM overhead: less than 10% of configured memory
• Host memory usage without noticeable impact:
– <= 64GB: 90% of host memory
– > 64GB: 95% of host memory
• The above are conservative! (see the sketch below)
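The rule of thumb above, written out in Python (a conservative estimate per the slide, not a guarantee):

```python
def safe_host_usage_gb(host_gb):
    """Host memory usable without noticeable impact."""
    return host_gb * (0.90 if host_gb <= 64 else 0.95)

def per_vm_overhead_bound_gb(configured_gb):
    """Upper bound: per-VM overhead is less than 10% of configured memory."""
    return 0.10 * configured_gb

print(safe_host_usage_gb(64))        # -> 57.6
print(safe_host_usage_gb(128))       # -> 121.6
print(per_vm_overhead_bound_gb(16))  # -> 1.6
```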
Memory Overcommitment Guidance
Configured vs. Active Memory Overcommitment
• Two types of memory overcommitment
– “Configured” memory overcommitment
• SUM (memory size of all VMs) / host memory size
– “Active” memory overcommitment
• SUM (mem.active of all VMs) / host memory size
• Performance impact
– “Active” memory overcommitment ≈ 1 → high likelihood of performance degradation!
• Some active memory is not in physical RAM
– “Configured” memory overcommitment > 1 → zero or negligible impact
• Most reclaimed memory is free/idle guest memory
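The two ratios as code (the VM sizes and active-memory figures below are made up for the example):

```python
def overcommit_ratios(vm_mem_mb, vm_active_mb, host_mem_mb):
    """Configured and active memory overcommitment for one host."""
    configured = sum(vm_mem_mb) / host_mem_mb   # > 1 is usually fine
    active = sum(vm_active_mb) / host_mem_mb    # near 1 is the danger sign
    return configured, active

cfg, act = overcommit_ratios([8192, 8192, 4096], [2048, 1024, 512], 16384)
print(f"configured={cfg:.2f}x  active={act:.2f}x")  # 1.25x  0.22x
```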
General Principles
• High consolidation
– Keep “active memory” overcommitment < 1
• How to know the “active memory” of a VM?
– Use vRealize Operations to track a VM's average and maximum memory demand
• What if I have no idea about active memory…
– Monitor performance counters while adding VMs
Host Statistics (Not Recommended)
• mem.consumed
– Memory allocation varies dynamically based on entitlement
– It does not imply a performance problem
• Reclamation-related counters: mem.balloon, mem.swapUsed, mem.compressed, mem.shared
– Nonzero values do NOT necessarily mean a performance problem!
Example One (Transient Memory Pressure)
• Six 4GB Swingbench VMs (VM-4,5,6 are idle) on a 16GB host
[Charts: operations per minute vs. time (minutes) for VM1–VM3; and Balloon / Swap Used / Compressed / Shared size (GB) vs. time (minutes)]
∆VM1 = 0%, ∆VM2 = 0%
Which Statistics to Watch?
• mem.swapInRate
– A constant nonzero value indicates a performance problem (see the sketch below)
• mem.latency
– % of time waiting for decompression and swap-in
– Estimates the performance impact due to compression and swapping
• mem.active
– If active is low, reclaimed memory is less likely to be a problem
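A small illustrative check for the mem.swapInRate rule (the sampling scheme and threshold are assumptions for the example, not ESXi guidance):

```python
def constant_swapin_pressure(samples_kbps, threshold_kbps=0.0):
    """True when every sampled mem.swapInRate value stays above the
    threshold, i.e. constant rather than transient swap-in."""
    return bool(samples_kbps) and all(s > threshold_kbps for s in samples_kbps)

print(constant_swapin_pressure([0, 0, 850, 0, 0]))          # False: transient
print(constant_swapin_pressure([300, 420, 390, 510, 480]))  # True: problem
```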
Example Two (Constant Memory Pressure)
• All six VMs run Swingbench workloads
[Charts: operations per minute vs. time (minutes) for VM1–VM6; and Swap-in Rate (KB per second) vs. time (minutes)]
∆VM1 = -16%, ∆VM2 = -21%
Key Takeaways
Summary
• Track mem.{swapInRate, active, latency} for performance issues.
• VM memory should be sized based on memory demand.
• “Single digit” memory overhead.
• New ESXi memory management feature improves performance.
• ESXi is expected to handle transient memory pressure well.