SDDC Performance and Capacity Management

Master VMware Performance and Capacity Management



Page 1: Master VMware Performance and Capacity Management

SDDC Performance and Capacity Management

Page 2: Master VMware Performance and Capacity Management

About your speakers

Iwan ‘e1’ | [email protected] | @e1_ang | linkedin.com/in/e1ang9119-9226

Sunny Dua | [email protected] | @sunny_dua | linkedin.com/in/duasunny

Page 3: Master VMware Performance and Capacity Management

One day in the life of a VMware Admin…
• A VM Owner complains to the IaaS team that her VM is slow.
• Her application architect has verified that:
  – VM CPU and RAM utilization is good.
  – Disk latency is good.
  – There are no dropped network packets.
  – No change in the application settings.
  – No recent patch to Windows.

What do you do?
• A: Check ESXi utilization. If it is low, tell her to doubt no more.
• B: Buy her a nice lunch and flowers. Ask her to forget about it.
• C: Call your VMware TAM and MCS. That's why you pay them, right?
• D: Roll up your sleeves. You were born for this!

Page 4: Master VMware Performance and Capacity Management

What’s wrong with these statements?

• Cluster CPU
  – The CPU overcommit ratio is high, at 1:5, on cluster "XYZ".
  – All the other clusters' overcommit ratios look good, around 1:3.
  – Keep the overcommit ratio at 1:4.
  – CPU usage is around 50% on cluster "ABCDE". Since they are UAT servers, don't worry.
  – The other clusters' CPU utilization is around 25%. This is good!

• Cluster RAM
  – We recommend a 1:2 overcommit ratio between physical RAM and virtual RAM.
  – Memory usage on most of the clusters is high, around 60%.
  – Cluster "ABCD" is peaking at around 75%. CPU utilization should be less than 70%.
  – If we see that Active Memory % is also high, then we should add more RAM to the cluster.
  – % Active should not exceed 50-60%, and memory should be running at the High state on each host.

Page 5: Master VMware Performance and Capacity Management

Monitoring

• There are 2 levels to monitor in VMware:
  – The VM.
    • The VM is the most important, as that is all customers care about.
    • They do not care about your infrastructure. It is a service: IaaS.
  – The Infra.
    • Software: NSX, vCenter, VSAN, vRealize, Distributed vSwitch, Datastore
    • ESXi + hardware
    • Storage & fabric
    • Network

• There are 4 areas to monitor: CPU, RAM, Disk, Network.

• The 4 areas impact one another.

Page 6: Master VMware Performance and Capacity Management

2 distinct layers

Consumer Layer (the VMs)
1. Performance: check whether each VM is being served well by the platform. Other VMs are irrelevant from the VM Owner's point of view.
2. Capacity: check whether the VM is right-sized. If too small, increase its configuration. If too big, right-size it for better performance.

Provider Layer (the SDDC)
1. Performance: check whether the IaaS is serving everyone well. Make sure there is no contention for resources among the VMs.
2. Capacity: check utilization. Too low, we spent too much on hardware. Too high, we need to buy more hardware.
3. Configuration: check for compliance and config drift. Availability: get alerted on hardware faults or software failures.

Page 7: Master VMware Performance and Capacity Management

Performance

How do you know your IaaS is performing well?

Does ESXi utilization at 10% mean your ESXi is fast? Does ESXi utilization at 90% mean your ESXi is fast? Storage is doing 10K IOPS? Network is pushing 8 Gbps?

What counter do you use as proof to your customers (the VM Owners)?

Utilization?

Performance is measured by how well your IaaS serves the VMs.

Fast is relative to your customer. Use the SLA as your line of defense.

Page 8: Master VMware Performance and Capacity Management

Capacity

Page 9: Master VMware Performance and Capacity Management

Performance vs Capacity

Performance:
• Focus is on the VM. It does not apply to the IaaS.
• Primary counter: Contention or Latency. Utilization is largely irrelevant.
• Does not take the Availability SLA into account.

Capacity:
• Focus is on the IaaS. VM capacity management is just right-sizing.
• Primary counter: Contention or Latency. Secondary counter: Utilization.
• Takes the Availability SLA into account. Tier 1 is in fact availability-driven.

Page 10: Master VMware Performance and Capacity Management


The Consumer Layer
The "dining area"


Page 11: Master VMware Performance and Capacity Management

How a VM gets its resources

(Diagram: a scale from 0 vCPU / 0 GB up to the configured 4 vCPU / 16 GB, marking Provisioned, Limit, Reservation, Entitlement, Usage, Contention and Demand. Demand is the counter we need to measure. A sketch of one way to think about these counters follows below.)
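The exact relationships among these counters are defined by vSphere; as a rough mental model only (an assumption for illustration, not the official formulas), the entitlement sits between reservation and limit, usage cannot exceed entitlement, and contention is the part of demand that went unserved. A minimal Python sketch of that mental model:

from dataclasses import dataclass

@dataclass
class VmCpuCounters:
    """Toy model of the per-VM resource counters shown on this slide.
    Values in GHz (or GB for RAM). Illustration only; the real counters
    come from vCenter / vRealize Operations."""
    provisioned: float   # configured size of the VM
    limit: float         # admin-set upper bound (<= provisioned)
    reservation: float   # guaranteed portion (>= 0)
    entitlement: float   # what the scheduler decides the VM deserves right now
    usage: float         # what the VM actually received
    demand: float        # what the VM wanted

    def contention(self) -> float:
        """Rough approximation: the part of demand that was not served."""
        return max(self.demand - self.usage, 0.0)

    def sanity_check(self) -> bool:
        return (0 <= self.reservation <= self.entitlement <= self.limit <= self.provisioned
                and self.usage <= self.entitlement)

vm = VmCpuCounters(provisioned=8.0, limit=8.0, reservation=2.0,
                   entitlement=5.0, usage=4.0, demand=6.0)
print(vm.contention(), vm.sanity_check())   # 2.0 True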

Page 12: Master VMware Performance and Capacity Management

Dashboards

• Detailed monitoring of a single VM
  – When a customer complains that his VM is slow, can the Help Desk add value right away?

• Large VM monitoring
  – Because large VMs are actually hurting your IaaS business.
  – This impacts both Performance and Capacity.
  – VM right-sizing.

• Excessive usage
  – Excessive usage by 1-2 VMs can impact the overall IaaS performance.
  – VMs with excessive usage hurt the business if we do not charge for network and disk IOPS.

Page 13: Master VMware Performance and Capacity Management

Single VM Monitoring
• A VM Owner complains that his VM is slow.
  – It was okay the day before.
  – How does the Help Desk quickly determine where the issue is?

• How well does the Infra serve the VM?
  – VM CPU Contention
  – VM RAM Contention
  – VM Disk Latency, for each virtual disk, not the average.

• Is the VM undersized?
  – VM CPU Utilization
  – VM RAM Consumed (not Usage)
  – VM RAM Usage
  – VM Disk IOPS

(A minimal triage sketch based on these two groups of counters follows below.)
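A minimal first-pass triage sketch for the Help Desk. The counter names and thresholds here are illustrative assumptions, not values from the deck; align them with your own SLA.

def triage_vm(metrics: dict) -> str:
    """Very small first-pass triage for a 'my VM is slow' ticket.

    `metrics` is assumed to hold the counters listed above, e.g.
    {'cpu_contention_pct': 3.2, 'ram_contention_pct': 0.0,
     'max_vdisk_latency_ms': 12, 'cpu_usage_pct': 95, 'ram_usage_pct': 40}.
    """
    if metrics['cpu_contention_pct'] > 5 or metrics['ram_contention_pct'] > 1:
        return "Infra issue: the platform is not serving the VM well (contention)."
    if metrics['max_vdisk_latency_ms'] > 20:
        return "Infra issue: storage latency on at least one virtual disk."
    if metrics['cpu_usage_pct'] > 90 or metrics['ram_usage_pct'] > 90:
        return "VM is likely undersized: high utilization with no contention."
    return "Neither contention nor sizing looks wrong; investigate inside the guest."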

Page 14: Master VMware Performance and Capacity Management
Page 15: Master VMware Performance and Capacity Management
Page 16: Master VMware Performance and Capacity Management
Page 17: Master VMware Performance and Capacity Management
Page 18: Master VMware Performance and Capacity Management

Dashboard 1

Single VM Monitoring

Page 19: Master VMware Performance and Capacity Management

How oversized are the Large VMs?
• They cause performance issues.
  – They impact other VMs, and also themselves!
  – The ESXi VMkernel scheduler has to find available cores for all the vCPUs, even when they are idle.
  – Other VMs may be migrated from core to core. The esxtop counter tracks these migrations.

• They tend to have slower performance.
  – ESXi may not have all the vCPUs available for them.

• They reduce the consolidation ratio.
  – You can pack more vCPUs with small VMs than with big VMs.
  – Unless you have progressive pricing, you make more money with smaller VMs as you sell more vCPUs.

Page 20: Master VMware Performance and Capacity Management

Dashboard of Large VMs
• Overall picture
  – A line chart showing the Max CPU Demand among all the Large VMs.
    • If this is low, they are oversized across the board. Remember, it only takes 1 VM to make this number high.
    • This number should be around 80% most of the time, indicating right sizing.
  – A line chart showing the Average CPU Demand.
    • If this chart is below 25% all the time for the entire month, the large VMs are oversized.

• Heat map of Large VMs
  – Size by vCPU config, so it is easy to see which is the biggest among these large VMs.
  – Color by CPU Workload. Both high and low are bad; you want to see ~50% CPU utilization.
    • To differentiate between the 2 ends, choose black and red. Expect to see mostly green.

• Top-N CPU Demand
  – Allows us to zoom into a specific time to see the past.
  – Line chart of a selected VM (plotted automatically).

(A small sketch of the calculation behind the first two charts follows below.)
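As an illustration of the calculation behind the Max and Average CPU Demand charts. The data layout and field names are assumptions; in practice the series come from vRealize Operations or vCenter.

import statistics

def large_vm_sizing_report(cpu_demand_history: dict[str, list[float]],
                           avg_threshold_pct: float = 25.0) -> dict:
    """cpu_demand_history maps a large VM's name to its CPU Demand (%)
    samples over the review period (e.g. one month of 5-minute averages).
    All series are assumed to have the same length."""
    # Per-sample maximum across all large VMs: the 'Max CPU Demand' line chart.
    max_series = [max(samples) for samples in zip(*cpu_demand_history.values())]
    # Per-VM average: flags VMs that stayed under the threshold the whole period.
    oversized = sorted(
        (name for name, series in cpu_demand_history.items()
         if statistics.mean(series) < avg_threshold_pct),
        key=lambda name: statistics.mean(cpu_demand_history[name]))
    return {"max_demand_series": max_series, "oversized_vms": oversized}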

Page 21: Master VMware Performance and Capacity Management

As expected, the Max of all VMs is low. We can go back in time and see over 3 months. As expected, they are mostly black. This means they are overprovisioned.

This shows the Top 15 VMs. You can change the period to any time. This is shown automatically. We are showing CPU and RAM.

You expect the 70% range, not 20% like in this example.

Page 22: Master VMware Performance and Capacity Management

VM Right Sizing
• Focus on large VMs.
  – Every downsize is a battle. No one likes to give up what they were given.
  – It also requires downtime.
  – Downsizing from 4 vCPU to 2 does not buy much nowadays with >10-core Xeons.

• Focus on CPU, not RAM.
  – RAM is generally plentiful, as RAM is cheap nowadays.
  – RAM is hard to measure, even with agents, as it is application dependent.


Page 23: Master VMware Performance and Capacity Management

MS Windows: Memory Management

• Windows makes great and full use of RAM.
  – It uses RAM as cache.
  – Adding more physical RAM will result in more usage.

• Virtual memory is an integral part of Windows memory management.
  – It is not a swap file.
  – A growing pagefile is an early warning.
  – Track that the ratio stays at or below 2.0 with this formula (a small sketch follows below):

  Commit Limit / Physical RAM ≤ 2.0
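A minimal sketch of that check, which also handles the later question of which VMs need to be upsized. The parameter names are assumptions; the inputs can typically be read from Windows Performance Monitor (the Memory \ Commit Limit counter) or from the vRealize Operations endpoint agent.

def commit_limit_ratio(commit_limit_bytes: int, physical_ram_bytes: int) -> float:
    """Commit Limit is roughly physical RAM plus the pagefile size, so a ratio
    above 2.0 means the pagefile has grown beyond the size of RAM."""
    return commit_limit_bytes / physical_ram_bytes

def vms_needing_more_ram(vms: dict[str, tuple[int, int]], threshold: float = 2.0):
    """vms maps VM name -> (commit_limit_bytes, physical_ram_bytes).
    Returns the offenders, worst first."""
    ratios = {name: commit_limit_ratio(cl, ram) for name, (cl, ram) in vms.items()}
    return sorted(((name, r) for name, r in ratios.items() if r > threshold),
                  key=lambda item: item[1], reverse=True)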

Page 24: Master VMware Performance and Capacity Management

MS Windows: Memory Management

(Diagram: Windows Task Manager memory view showing In Use, Available and Cached, mapped to the vRealize Ops 6.1 EP Agent metrics Memory Used and Memory Available.)

Page 25: Master VMware Performance and Capacity Management

MS Windows: Memory Sizing

Conservative

Cost Effective

Page 26: Master VMware Performance and Capacity Management

Windows: Memory Management

• Which VMs need to be upsized?
  – Get all the VMs whose commit limit ratio is > 2.0 (see the sketch earlier).
  – The list can be sorted by the highest commit limit ratio first.

  Commit Limit / Physical RAM ≤ 2.0

Page 27: Master VMware Performance and Capacity Management

Windows RAM: Right Sizing

• Use the Commit Limit Ratio super metric to upsize VMs.

• Conservative or Cost Effective?
  – Cost Effective: Used.
  – Conservative: Used + Cache.

• Example
  – Used + Cache: 90%.
  – A value exceeding 90% means the VM needs more RAM.

Server Workload         | VDI Workload
1 app                   | Many apps
Long-lived apps         | Many apps launched and closed
Varies                  | Many files opened and closed
No Internet browsing    | Internet browsing (movies!)
Workload predictable    | Workload spiky and unpredictable
Varies (UI-less)        | Flash, Java, JavaScript (UI heavy)

Page 28: Master VMware Performance and Capacity Management

Windows RAM: What counters to use?

Cost Effective: Used
Conservative: Used + Standby Cache Normal Priority + Standby Cache Reserve

Page 29: Master VMware Performance and Capacity Management

Windows RAM: Hypervisor vs In-Guest

(Screenshot comparing hypervisor and in-guest memory counters, with four points labeled A, B, C and D.)

Page 30: Master VMware Performance and Capacity Management


Page 31: Master VMware Performance and Capacity Management


Page 32: Master VMware Performance and Capacity Management

VM Right Sizing
• Do not reduce RAM without changing the application.
  – If the VM has no RAM shortage, reducing RAM will not speed up anything.
  – Reducing RAM can trigger more internal swapping in the guest. This in turn generates IO.
  – Reducing RAM beyond the ISV recommendation can result in an unsupported configuration.
  – Reducing RAM requires a manual reduction for apps that manage their own RAM (e.g. Java, SQL, Oracle).
  – It is hard enough to ask the apps team to reduce CPU, so asking for both will be even harder.
  – If there is a performance issue after you reduce both CPU and RAM…

• If a VM is not using the full RAM, ask the App Team if they can
  – To monitor paging from outside the guest, put the pagefile into its own vmdk file.

Page 33: Master VMware Performance and Capacity Management

Why the VM Owner should right-size
• It takes longer to boot.
  – If a VM does not have a reservation, vSphere will create a swap file the size of the configured RAM.

• It takes longer to vMotion.

• Risk of NUMA effect.
  – The RAM or CPU may be spread over more than a single socket. Due to the NUMA architecture, the performance will not be as good.

• It will experience higher co-stop and ready time.

• It takes longer to snapshot, especially if a memory snapshot is included.

• The processes inside the Guest OS may experience ping-pong.

• Lack of performance visibility at the individual vCPU or virtual core level.


Page 34: Master VMware Performance and Capacity Management

Dashboard 2

VM Right Sizing

Page 35: Master VMware Performance and Capacity Management

Any Excessive Utilization in our DC?
• A VM consumes 5 resources:
  1. vCPU
  2. vRAM (GB)
  3. Disk Space
  4. Disk IOPS
  5. Network (Mbps)

• The first 3 you can bound and control.

• The last 2 you can too, but normally you don't. You should.

• We need a dashboard to track excessive usage:
  – Disk IOPS
  – Network throughput

Page 36: Master VMware Performance and Capacity Management

Dashboard for Excessive Utilization
• Excessive storage consumption
  – Line chart:
    • Max VM Disk IOPS among all VMs
    • Average VM Disk IOPS
  – Heat map:
    • Size by IOPS, color by latency.
    • If you see a single big box, you have a VM dominating your storage IOPS.

• Excessive network consumption
  – Similar concept as above.

(A sketch of finding the offending VM follows below.)
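As an illustration of the drill-down shown on the next slides (finding which VM generated a given IOPS peak), a minimal sketch. The data layout is an assumption; the series would come from your monitoring tool.

def find_iops_peak(iops_by_vm: dict[str, dict[str, float]]):
    """iops_by_vm maps VM name -> {timestamp: IOPS} for the review period.
    Returns the overall peak and the VM responsible for it."""
    peak_vm, peak_ts, peak_iops = None, None, 0.0
    for vm, series in iops_by_vm.items():
        for ts, iops in series.items():
            if iops > peak_iops:
                peak_vm, peak_ts, peak_iops = vm, ts, iops
    return peak_vm, peak_ts, peak_iops

def top_n_iops_at(iops_by_vm, timestamp, n=15):
    """The 'Top VMs generating IOPS at a given time' view."""
    ranked = sorted(iops_by_vm.items(),
                    key=lambda kv: kv[1].get(timestamp, 0.0), reverse=True)
    return [(vm, series.get(timestamp, 0.0)) for vm, series in ranked[:n]]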

Page 37: Master VMware Performance and Capacity Management

This tracks the IOPS from the VMs. From here we can tell there is a distinct peak. It looks like it is coming from 1 VM, as the average is far lower. This is a cluster of 500 VMs, so even with 1 VM hitting 13,200 IOPS, the average did not even pass 15 IOPS.

Let's zoom into the peak.

Page 38: Master VMware Performance and Capacity Management

Excessive Storage Dashboard

The peak was 13,212 IOPS on 24 May, around 3:16 am. Let’s find out which VM.

Page 39: Master VMware Performance and Capacity Management

Excessive Storage Dashboard

• We can list the Top VMs generating the IOPS for any given period.

Bingo, it was VM 63ee that did those 13,212 IOPS. Gotcha!

The dashboards are great, but they do not tell you how the IOPS are distributed among all the VMs. They also do not tell you whether the VMs are experiencing high latency.

You need a heat map for this.

Page 40: Master VMware Performance and Capacity Management

At a glance, we can tell the IOPS distribution among the VMs. We can also tell whether they are getting low latency or not.

Page 41: Master VMware Performance and Capacity Management

Dashboard 3

Excessive DC Utilization

Page 42: Master VMware Performance and Capacity Management


And that's it! If you "passed" those dashboards, you're done with the "dining area"!

Page 43: Master VMware Performance and Capacity Management


The Provider Layer
The "kitchen"


Page 44: Master VMware Performance and Capacity Management

Performance Management
• Overall performance monitoring
  – Are any of our customers experiencing bad performance?
  – CPU, RAM, Disk, Network.

• If yes, who is affected?
  – Different VMs may be impacted differently.
  – VM 007 may get hit on CPU, while VM 747 may get hit on Storage.

Page 45: Master VMware Performance and Capacity Management

Performance SLA Monitoring
• How do we prove that not a single VM, in any service tier, failed the SLA threshold we agreed for that tier in the past month?

• Since VMs move around a cluster due to DRS and HA, we need to track at the cluster level.

• If you oversubscribe, there is a risk of contention.
  – For Tier 1, do not overcommit.
  – For Tier 2 and 3, do overcommit.

Page 46: Master VMware Performance and Capacity Management

Using Max and Average to determine how the VMs are served

If the Max is:
• below what you think your customers can tolerate, you are good.
• near the threshold, your capacity is full. Do not add more VMs.
• above the threshold, move a few VMs out, preferably the large ones.

(A small sketch of this check follows below.)
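A minimal sketch of that decision, assuming we already have the per-VM contention values for the cluster. The counter names, threshold and buffer are placeholders.

import statistics

def cluster_capacity_verdict(vm_contention_pct: list[float],
                             sla_threshold_pct: float,
                             buffer_pct: float = 1.0) -> str:
    """Judge a cluster by the worst-served VM, not by host utilization."""
    worst = max(vm_contention_pct)
    average = statistics.mean(vm_contention_pct)
    if worst < sla_threshold_pct - buffer_pct:
        return f"OK: max contention {worst:.2f}% (avg {average:.2f}%); room for more VMs."
    if worst <= sla_threshold_pct:
        return "Full: max contention is near the SLA threshold; do not add more VMs."
    return "Over: max contention breaches the SLA; move a few VMs out, preferably large ones."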

Page 47: Master VMware Performance and Capacity Management
Page 48: Master VMware Performance and Capacity Management
Page 49: Master VMware Performance and Capacity Management
Page 50: Master VMware Performance and Capacity Management
Page 51: Master VMware Performance and Capacity Management

This dashboard is good as a summary. You stop here if there is no issue. But say there is an issue; can you drill down?

Page 52: Master VMware Performance and Capacity Management

This is an example of how you can drill down to an individual cluster and see any metrics you want. Notice that Max Contention and Average Contention both spike. The average is much lower than the Max, indicating room for additional VMs. The VM contention hit 4.44%. It is still okay.

Page 53: Master VMware Performance and Capacity Management

Drill down: Cluster CPU Performance

There is an issue here. The Max value hit 31%. The Average is still good at 0.31%, so this means 90% of the VMs are being served well.

Page 54: Master VMware Performance and Capacity Management

Storage Latency Monitoring: Details
• The data you see in vCenter and vRealize Operations are averages.
  – The storage latency data that vCenter provides is a 20-second average. With vRealize Operations, and other management tools that keep the data much longer, it is a 5-minute average.
  – 20 seconds may seem short, but if the storage is doing 10K IOPS, that is an average over 10,000 x 20 = 200,000 reads or writes.

• The data you see in the vmkernel log is not an average.
  – It is per individual IO.
  – It is acceptable to have higher latency there, but ensure it is not too high. Set your threshold at 250 ms for a start.

• Additional info
  – Data at the vmkernel level excludes bottlenecks at the upper layers, for example queuing at the vDisk. As a result, we can conclude that this latency is not caused at the VM level.

Page 55: Master VMware Performance and Capacity Management

Drill down: Cluster Storage Performance


We look at storage across the past month and more. We are seeing latency spikes. There is an outlier at 6000 ms on a magnetic disk.

This data goes back to 18 April. Let's zoom into May 22, as there are recent spikes there.

Page 56: Master VMware Performance and Capacity Management

Drill down: Cluster Storage Performance

Zooming into May 17 – 23. We also exclude all the magnetic disks.

Device ID naa.55* is SSD, while naa.5000* is magnetic. We are seeing latency on the SSD. There is 1 outlier.

Page 57: Master VMware Performance and Capacity Management

Drill down: Cluster Storage Performance

We can also group the data by ESXi host, and we can present the data as a bar chart.

We can zoom into a much more granular timeline, below 1 second!

Page 58: Master VMware Performance and Capacity Management

Which VMs are affected?
• The previous slides give us info at the cluster level.
  – If no VM is affected, good. No need to analyse further.
  – If VMs are affected, we want to know which ones.

• We can address the above by listing the Top 30 VMs by (see the sketch below):
  – CPU Contention
  – RAM Contention
  – Disk Latency
  – Network dropped packets (ensure it is 0)
  – Network latency (this needs NetFlow)
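A minimal Top-N sketch that works for any of these counters; the input layout is an assumption.

def top_n_affected(vm_metrics: dict[str, float], n: int = 30):
    """vm_metrics maps VM name -> worst value of one counter in the period,
    e.g. CPU Contention (%) or Disk Latency (ms). Returns the worst N."""
    return sorted(vm_metrics.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Example: the three lists on the next slide would be top_n_affected(cpu_contention),
# top_n_affected(ram_contention) and top_n_affected(disk_latency), each with n=40.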

Page 59: Master VMware Performance and Capacity Management

These are the top 40 VMs which experienced the worst CPU Contention.

These are the top 40 VMs which experienced the worst RAM Contention.

These are the top 40 VMs which experienced the worst Disk Latency.

Page 60: Master VMware Performance and Capacity Management


And that's it! If Performance is OK, it's time to review Capacity.


Page 61: Master VMware Performance and Capacity Management

Capacity Management based on Business Policy

http://virtual-red-dot.info/capacity-management-based-on-business-policy/

Page 62: Master VMware Performance and Capacity Management


Performance Policy

Group Discussion: What should your Performance Policy be?

Page 63: Master VMware Performance and Capacity Management

Availability Policy

Group Discussion: What should your Availability Policy be?

Tier   | IOPS (per VM) | Latency (VM level) | Automated DR (SRM) | RPO       | RTO
Tier 1 | 1000          | <10 ms             | Yes                | 5 minutes | 1 hour
Tier 2 | 500           | <20 ms             | Yes                | <2 hours  | <2 hours
Tier 3 | 100           | <30 ms             | Yes                | <8 hours  | <4 hours

Page 64: Master VMware Performance and Capacity Management

Capacity Management: Tier 1
5 line charts showing these for the past 3 months:

• Number of vCPUs left in the cluster.

• Amount of vRAM left in the cluster.

• Number of VMs left in the cluster.

• Maximum & average storage latency experienced by any VM in the cluster.

• "Usable" space left in the datastore cluster.

If a number is approaching your threshold, it is time to increase supply (e.g. IOPS, cluster). A sketch of the first three counters follows below.
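Since Tier 1 is not overcommitted, "what is left" can be computed directly from the configured sizes. A minimal sketch under that assumption; the HA buffer and average-VM-size figures are placeholders, not values from the deck.

def tier1_capacity_left(physical_cores: int, physical_ram_gb: int,
                        provisioned_vcpu: int, provisioned_vram_gb: int,
                        ha_buffer_ratio: float = 1/8,          # e.g. N+1 in an 8-host cluster
                        avg_vm_vcpu: int = 4, avg_vm_vram_gb: int = 16):
    """Tier 1 policy: no CPU/RAM overcommit, so supply = physical capacity
    minus the HA buffer, and demand = what is already provisioned."""
    usable_vcpu = physical_cores * (1 - ha_buffer_ratio)
    usable_vram = physical_ram_gb * (1 - ha_buffer_ratio)
    vcpu_left = usable_vcpu - provisioned_vcpu
    vram_left = usable_vram - provisioned_vram_gb
    vms_left = int(min(vcpu_left // avg_vm_vcpu, vram_left // avg_vm_vram_gb))
    return {"vcpu_left": vcpu_left, "vram_gb_left": vram_left, "vms_left": max(vms_left, 0)}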

Page 65: Master VMware Performance and Capacity Management

Capacity Management: Tier 2 or 3
5 line charts showing data for the past 3 months:

• The maximum CPU contention experienced by any VM in the cluster.
  – This number has to be lower than the SLA we promise.

• The maximum RAM contention experienced by any VM in the cluster.
  – This number has to be lower than the SLA we promise.

• The total number of VMs left in the cluster.

• The maximum & average storage latency experienced by any VM in the cluster.

• The disk capacity left in the datastore cluster.


Page 66: Master VMware Performance and Capacity Management

Tier 2 or 3

In this example, if we use 10% as the threshold, the cluster is full.

In this example, if we use 30 ms as the threshold, the cluster is full.

Page 67: Master VMware Performance and Capacity Management

Capacity Management: Tier 2 or 3
• RAM has a different pattern to CPU, as it is a form of storage.
  – The SLA principle remains the same. If you exceed the SLA, that cluster is full.

Page 68: Master VMware Performance and Capacity Management

SLA vs Internal Threshold

SLA             | Tier 1 | Tier 2 | Tier 3
CPU Contention  | 1%     | 2%     | 13%
RAM Contention  | 0%     | 10%    | 20%
Disk Latency    | 10 ms  | 20 ms  | 30 ms

The SLA only applies to the VM. The VM Owner does not care about the underlying platform.

The above is my personal opinion. You need to get your customers to agree.

Internal (your own) | Tier 1 | Tier 2 | Tier 3
CPU Contention      | 1%     | 3%     | 10%
RAM Contention      | 0%     | 5%     | 10%
Disk Latency        | 10 ms  | 15 ms  | 20 ms

(A sketch of checking observed values against these thresholds follows below.)
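A minimal sketch of encoding these two sets of thresholds and checking a cluster against them. The values are copied from the tables above; the metric names are assumptions.

SLA = {  # what you promise the VM Owner
    "tier1": {"cpu_contention_pct": 1, "ram_contention_pct": 0, "disk_latency_ms": 10},
    "tier2": {"cpu_contention_pct": 2, "ram_contention_pct": 10, "disk_latency_ms": 20},
    "tier3": {"cpu_contention_pct": 13, "ram_contention_pct": 20, "disk_latency_ms": 30},
}
INTERNAL = {  # your own, stricter, early-warning thresholds
    "tier1": {"cpu_contention_pct": 1, "ram_contention_pct": 0, "disk_latency_ms": 10},
    "tier2": {"cpu_contention_pct": 3, "ram_contention_pct": 5, "disk_latency_ms": 15},
    "tier3": {"cpu_contention_pct": 10, "ram_contention_pct": 10, "disk_latency_ms": 20},
}

def check_cluster(tier: str, observed_max: dict) -> list[str]:
    """observed_max holds the worst value any VM in the cluster hit, per metric."""
    findings = []
    for metric, value in observed_max.items():
        if value > SLA[tier][metric]:
            findings.append(f"{metric}: {value} breaches the {tier} SLA")
        elif value > INTERNAL[tier][metric]:
            findings.append(f"{metric}: {value} exceeds the internal threshold; cluster is full")
    return findings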

Page 69: Master VMware Performance and Capacity Management

Key Takeaways

Agree on a Performance SLA.

Contention, not Utilization.

Capacity is defined by Performance.


Page 70: Master VMware Performance and Capacity Management

More Details

• The book provides details that we could not cover in half a day.

• The book is not a product book.
  – It focuses on concepts, which you can apply using any product. It does not have to be vRealize Ops.


Page 71: Master VMware Performance and Capacity Management

Thank You!