Top Support Issues and How to Solve Them

Sriram Rajendran, Escalation Engineer

Nov 6 2008


Agenda

Support issues covered

• VMFS volumes missing or lost, and LUN resignaturing

• ESX host not responding or disconnected in VirtualCenter

• Expanding the size of a VMDK with existing snapshots

• Virtualization performance misconceptions


Cannot See My VMFS Volumes

A common support issue: “My VMFS volumes have disappeared!”

In fact, the volumes are often still present but are seen as snapshots, which are not mounted by default.

Why are VMFS volumes seen as snapshots when they are not?

• ESX server A is presented with a LUN on ID 0.

• The same LUN is presented to ESX server B on ID 1.

• A VMFS-3 volume is created on LUN ID 0 from server A.

• The volume will not be mounted on server B when the SAN is rescanned.

• Server B reports the volume as a snapshot because of the LUN ID mismatch.

LUNs must be presented with the same LUN IDs to all ESX hosts.

• In the next release of ESX, the LUN ID is no longer compared if the target exports NAA type IDs.


How Does ESX Determine If Volume Is A Snapshot?

• When a VMFS-3 volume is created, the SCSI Disk ID data from the LUN/storage array is stored in the volume’s LVM header.

• This contains, along with other information, the LUN ID.

• When another ESX server finds a LUN with a VMFS-3 filesystem, the SCSI Disk ID information returned from the LUN/storage array is compared with the LVM header metadata.

• The VMkernel treats a volume as a snapshot if there is a mismatch in this information.
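The comparison described above can be modelled in a few lines of Python. This is only an illustrative sketch, not VMkernel code: the `ScsiDiskId` class and its two fields are assumptions standing in for the real SCSI Disk ID data, of which the slides name only the LUN ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScsiDiskId:
    """Hypothetical stand-in for the SCSI Disk ID data; field names are assumptions."""
    lun_id: int
    array_serial: str


def seen_as_snapshot(stored_in_lvm_header: ScsiDiskId, reported_by_array: ScsiDiskId) -> bool:
    # Any mismatch between the stored identity and the one the array reports
    # makes the volume look like a snapshot.
    return stored_in_lvm_header != reported_by_array


# Server A created the volume while the LUN was presented on ID 0.
header = ScsiDiskId(lun_id=0, array_serial="ARRAY-1")
# Server B is presented with the same LUN on ID 1.
server_b_view = ScsiDiskId(lun_id=1, array_serial="ARRAY-1")

print(seen_as_snapshot(header, header))         # False
print(seen_as_snapshot(header, server_b_view))  # True
```

Under this model, a LUN ID mismatch alone is enough to trigger the snapshot condition, which matches the server A / server B scenario.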


How Should You Handle Snapshots?

First of all, determine if it is really a snapshot:

• If it is a mismatch of LUN IDs across different ESX hosts, fix the LUN ID through the array management software so that the same LUN ID is presented to all hosts for a shared volume.

• Other reasons a volume might appear as a snapshot include changes in the way the LUN is presented to the ESX host:

    • HDS Host Mode setting

    • EMC Symmetrix SPC-2 director flag

    • Change from A/P firmware to A/A firmware on the array

If it is definitely a snapshot, you have two options:

• Set LVM.EnableResignature

or

• Disable LVM.DisallowSnapshotLUN


LVM.EnableResignature

Used when mounting the original and the snapshot VMFS volumes on the same ESX server.

• Set LVM.EnableResignature to 1 and issue a rescan of the SAN.

• This updates the LVM header with:

    • new SCSI Disk ID information

    • a new VMFS-3 UUID

    • a new label

• The label format will be snap-<generation number>-<label>, or snap-<generation number>-<uuid> if there is no label, e.g.

    • Before resignature: /vmfs/volumes/lun2

    • After resignature: /vmfs/volumes/snap-00000008-lun2

• Remember to set LVM.EnableResignature back to 0.
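The label format above can be expressed as a small Python helper. This is a sketch only: the function name is made up, and the 8-digit zero-padded hexadecimal rendering of the generation number is an assumption inferred from the single example snap-00000008-lun2.

```python
def resignatured_label(generation: int, label: str = "", uuid: str = "") -> str:
    """Build snap-<generation number>-<label>, falling back to the UUID when there is no label."""
    # Assumption: the generation number is printed as 8 zero-padded hex digits.
    suffix = label if label else uuid
    return f"snap-{generation:08x}-{suffix}"


print(resignatured_label(8, label="lun2"))  # snap-00000008-lun2
```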


LVM.DisallowSnapshotLUN

DisallowSnapshotLUN will not modify any part of the LVM header.

To allow the mounting of snapshot LUNs, set:

• EnableResignature to 0 (disable)

and

• DisallowSnapshotLUN to 0 (disable)

Do not use DisallowSnapshotLUN to present snapshots back to the same ESX server.

LVM.EnableResignature overrides LVM.DisallowSnapshotLUN.
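How the two settings combine can be summarised as a small decision function. This Python sketch is a reading of the slides, not ESX code; the function name and the returned strings are illustrative.

```python
def snapshot_volume_outcome(enable_resignature: int, disallow_snapshot_lun: int) -> str:
    """Return what happens, on the next rescan, to a volume detected as a snapshot."""
    if enable_resignature == 1:
        # EnableResignature overrides DisallowSnapshotLUN.
        return "resignatured: new SCSI Disk ID info, UUID and label, then mounted"
    if disallow_snapshot_lun == 0:
        # Snapshots are allowed and the LVM header is left untouched.
        return "mounted as-is"
    return "not mounted"


for resig, disallow in ((0, 1), (0, 0), (1, 1), (1, 0)):
    print(resig, disallow, "->", snapshot_volume_outcome(resig, disallow))
```

The (0, 1) row corresponds to the behaviour described earlier, where snapshot volumes are not mounted by default.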


LVM.EnableResignature

[Diagram: Server A (HBA 1, HBA 2) connected through FC Switch 1 and FC Switch 2 to storage processors SPA and SPB; the storage array presents LUN 0 and LUN 1.]

LVM.EnableResignature will have to be used to make the volume located on the cloned LUN, LUN 1, visible to the same ESX server after a rescan.

Two volumes with the same UUID must not be presented to the same ESX server. Issues with data integrity will occur.


LVM.EnableResignature OR LVM.DisallowSnapshotLUN

Snapshot LUN presented to a different ESX server

[Diagram: Server A and Server B (each with HBA 1 and HBA 2) connected through FC Switch 1 and FC Switch 2 to storage processors SPA and SPB; LUN 1 is a snapshot of LUN 0.]

We can present the snapshot LUN, LUN 1, using DisallowSnapshotLUN = 0 on Server B as long as Server B cannot see LUN 0.

If Server B can also see LUN 0, then we must use resignaturing since we cannot present two LUNs with the same UUID to the same ESX server.
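The rule for choosing between the two options can be sketched in Python as follows. The function name and the UUID strings are hypothetical; the only rule encoded is the one stated in the slides, that two volumes with the same UUID must not be presented to the same ESX server.

```python
from typing import Iterable


def choose_snapshot_option(snapshot_uuid: str, uuids_visible_to_host: Iterable[str]) -> str:
    """Pick the LVM setting for presenting a snapshot LUN to one ESX host."""
    if snapshot_uuid in set(uuids_visible_to_host):
        # The host already sees a volume with this UUID (the original LUN),
        # so the snapshot must be resignatured.
        return "LVM.EnableResignature = 1"
    # No volume with the same UUID is visible to this host.
    return "LVM.DisallowSnapshotLUN = 0"


# Server B can also see LUN 0, which carries the same UUID as the snapshot:
print(choose_snapshot_option("uuid-of-lun0", ["uuid-of-lun0"]))
# Server B cannot see LUN 0:
print(choose_snapshot_option("uuid-of-lun0", []))
```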


LVM.DisallowSnapshotLUN

Remote snapshot, e.g. SRDF

[Diagram: At the production site, Server A (HBA 1, HBA 2) connects through FC Switch 1 and FC Switch 2 to Storage A (SPA, SPB), which presents LUN 0. At the DR site, Server B (HBA 1, HBA 2) connects through FC Switch 3 and FC Switch 4 to Storage B (SPA, SPB), which presents LUN 1, the remote snapshot.]

Since there is not going to be a LUN with the same UUID at the remote site, one can allow snapshots.


ESX host not-responding/disconnected in VirtualCenter

Components involved in the communication between ESX and VC servers.

[Diagram: VC Server: VPXD. ESX Server: VC agent (VPXA) and host agent (HostD).]


Back to the issue

Customer complaint: the ESX server is seen in a disconnected or not-responding state in the VC server.

• If the ESX host is seen in the Disconnected state, reconnecting the host will solve the issue.

• However, in most cases, the ESX host is seen in the “Not Responding” state.

    • What do we do in this case?


List of things we can do

• Verify that network connectivity exists from the VirtualCenter Server to the ESX Server.

    • Use ping to check the connectivity.

• Verify that you can connect from the VirtualCenter Server to the ESX Server on port 902 (if the ESX Server was upgraded from version 2.x, verify that you can connect on port 905).

    • Use telnet to connect to the specified port.

• Verify that the hostd agent is running.

• Verify that the vpxa agent is running.

• Check system resources.
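The port check in the list above can also be scripted instead of using telnet. The following Python sketch, meant to run on the VirtualCenter Server side, only tests whether a TCP connection can be opened; the function name and the default timeout are assumptions.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `can_connect("esx01.example.com", 902)` would test the default port, and port 905 would be tested for a host upgraded from ESX 2.x. The host name here is hypothetical.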


Checking if Hostd is alive

• Verify that the ESX Server management service, hostd, is still running on the ESX server.

• How to check whether hostd is still alive:

    • Connect to the ESX server using SSH from another Windows/Linux box.

    • Execute the command vmware-cmd -l

        • If the command succeeds, hostd is working fine.

        • If the command fails, hostd is not working or has probably stopped.

    • Restarting the mgmt-vmware service will get the ESX host back online in VirtualCenter Server.

    • Once this is done, execute the same command again to confirm that hostd is working.
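The succeed/fail test above can be automated with a small wrapper. This Python sketch assumes that `vmware-cmd -l` returns a non-zero exit status when hostd is not responding, and that a Python 3 interpreter is available wherever it runs (which may not be the case on the ESX service console itself); the function name and timeout are made up for illustration.

```python
import subprocess
from typing import Sequence


def hostd_responds(command: Sequence[str] = ("vmware-cmd", "-l"), timeout: float = 30.0) -> bool:
    """Return True if the command exits with status 0, which is read here as 'hostd is fine'."""
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hung: treat as "hostd not responding".
        return False
    return result.returncode == 0
```

A hung command is treated the same as a failed one, since a hostd that never answers is as unhelpful as one that has stopped.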


Checking if Hostd is alive

If hostd fails to start, check the following log for hints to identify the issue:

/var/log/vmware/hostd.log

Note: In most cases, the causes we have seen are:

• The / (root) filesystem is full.

• There are some rogue VMs registered.

• Some of the presented LUNs are either corrupted or do not have a valid partition table.

If you are not able to proceed any further, file a ticket with VMware Tech-Support.
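The first cause listed above, a full root filesystem, is easy to test for. A minimal Python sketch follows; the function name and the 95% threshold are arbitrary assumptions, and a Python 3 interpreter is assumed to be available where it runs.

```python
import shutil


def filesystem_nearly_full(path: str = "/", threshold: float = 0.95) -> bool:
    """Return True if the filesystem holding `path` is at or above the usage threshold."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report a size of zero; nothing to fill up.
        return False
    return usage.used / usage.total >= threshold


print("root filesystem nearly full:", filesystem_nearly_full("/"))
```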


Checking if VPXA is alive

To verify if the VirtualCenter Agent Service (vmware-vpxa) is running:

Log in to your ESX Server as root, from an SSH session or directly from the console of the server, and run ps -ef | grep vpxa. The output appears similar to the following if vmware-vpxa is running:

    [root@server]# ps -ef | grep vpxa
    root 24663 1 0 15:44 ? 00:00:00 /bin/sh /opt/vmware/vpxa/bin/vmware-watchdog -s vpxa -u 30 -q 5 /opt/vmware/vpxa/sbin/vpxa
    root 26639 24663 0 21:03 ? 00:00:00 /opt/vmware/vpxa/vpx/vpxa
    root 26668 26396 0 21:23 pts/3 00:00:00 grep vpxa

The output appears similar to the following if vmware-vpxa is not running:

    [root@server]# ps -ef | grep vpxa
    root 26709 26396 0 21:24 pts/3 00:00:00 grep vpxa
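Telling the two outputs apart can be automated. The Python sketch below parses captured `ps -ef` text and reports whether a process whose executable is named vpxa is present, ignoring the grep line and the watchdog wrapper. The function name is hypothetical, and the parsing assumes the eight-column layout shown in the sample output.

```python
import os


def vpxa_running(ps_output: str) -> bool:
    """Return True if the ps -ef output contains a process whose executable is named vpxa."""
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8:
            continue  # prompt lines, headers or blank lines
        executable = fields[7].split()[0]
        # The grep process and the /bin/sh watchdog wrapper do not count.
        if os.path.basename(executable) == "vpxa":
            return True
    return False


stopped = "root 26709 26396 0 21:24 pts/3 00:00:00 grep vpxa\n"
running = stopped + "root 26639 24663 0 21:03 ? 00:00:00 /opt/vmware/vpxa/vpx/vpxa\n"
print(vpxa_running(running))  # True
print(vpxa_running(stopped))  # False
```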


Checking if VPXA is alive

Sometimes the vpxa process may become orphaned. Restarting the vmware-vpxa service helps.

How to restart the service:

    /etc/init.d/vmware-vpxa restart

If the service fails to start, check the following log to identify the issue:

/var/log/vmware/vpx/vpxa.log

If you are not able to proceed any further, file a ticket with VMware Tech-Support.


Checking System Resources

Sometimes hostd or vpxa fails to start due to a lack of system resources. Check for:

• High CPU utilization on the ESX Server (use esxtop)

• High memory utilization on the ESX Server (see /proc/vmware/mem)

• Slow response when administering the ESX Server


Expanding the size of a VMDK with existing Snapshots

You CANNOT expand a VM’s VMDK file while it still has snapshots.

e.g.

    # ls *
    important.vmdk important-000001-delta.vmdk
    # vmkfstools -X 20G important.vmdk
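A simple guard against this mistake is to look for snapshot delta files before growing a disk. The Python sketch below assumes the naming pattern from the example above (<disk>-NNNNNN-delta.vmdk in the same directory as the base disk); the function names are hypothetical.

```python
import glob
import os


def snapshot_deltas(vm_dir: str, disk_name: str) -> list:
    """List <disk>-NNNNNN-delta.vmdk files next to the base disk."""
    base = disk_name[:-len(".vmdk")] if disk_name.endswith(".vmdk") else disk_name
    pattern = os.path.join(glob.escape(vm_dir), glob.escape(base) + "-[0-9]*-delta.vmdk")
    return sorted(glob.glob(pattern))


def safe_to_expand(vm_dir: str, disk_name: str) -> bool:
    # Only expand when no snapshot delta exists for this disk.
    return not snapshot_deltas(vm_dir, disk_name)
```

In the example directory, `safe_to_expand(".", "important.vmdk")` would return False because important-000001-delta.vmdk is present.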

[Diagram: the VM's data laid out across important.vmdk and important-000001-delta.vmdk.]

Expanding VM With a Snapshot

If you do, you will have a VM that won’t boot.


Expanding VM With a Snapshot

Tricking ESX into seeing the expanded VMDK as the original size.

In this example, we have a test.vmdk that we expand from 5 GB to 6 GB:

    # vmkfstools -X 6G test.vmdk


Expanding VM With a Snapshot

If we check test.vmdk we see:

    # Disk DescriptorFile
    version=1
    CID=3f24a1b3
    parentCID=ffffffff
    createType="vmfs"

    # Extent description
    RW 12582912 VMFS "test-flat.vmdk"

    # The Disk Data Base
    #DDB

    ddb.virtualHWVersion = "4"
    ddb.geometry.cylinders = "783"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "buslogic"


Expanding VM With a Snapshot

Original: RW 10485760 VMFS "test-flat.vmdk"

New: RW 12582912 VMFS "test-flat.vmdk"

If we have no backups, how do we get the original value? The snapshot's descriptor file still records the original size:

    # grep -i rw test-000001.vmdk
    RW 10485760 VMFSSPARSE "test-000001-delta.vmdk"
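The two RW values line up with the 5 GB and 6 GB sizes if the extent size is counted in 512-byte sectors. The Python sketch below extracts the sector count from a descriptor's extent line and converts it; the function names are hypothetical and the regular expression covers only the simple extent line format shown above.

```python
import re

SECTOR_BYTES = 512  # extent sizes are read here as 512-byte sectors

# Matches lines such as: RW 10485760 VMFSSPARSE "test-000001-delta.vmdk"
EXTENT_RE = re.compile(r'^RW\s+(\d+)\s+(\S+)\s+"([^"]+)"', re.MULTILINE)


def extent_sectors(descriptor_text: str) -> int:
    """Return the sector count of the first RW extent line in a descriptor."""
    match = EXTENT_RE.search(descriptor_text)
    if match is None:
        raise ValueError("no RW extent line found")
    return int(match.group(1))


def sectors_to_gib(sectors: int) -> float:
    return sectors * SECTOR_BYTES / 1024 ** 3


snapshot_descriptor = 'RW 10485760 VMFSSPARSE "test-000001-delta.vmdk"\n'
print(sectors_to_gib(extent_sectors(snapshot_descriptor)))  # 5.0
print(sectors_to_gib(12582912))                             # 6.0
```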


Expanding VM With a Snapshot

We change the RW value in test.vmdk back to the original:

    # Disk DescriptorFile
    version=1
    CID=3f24a1b3
    parentCID=ffffffff
    createType="vmfs"

    # Extent description
    RW 10485760 VMFS "test-flat.vmdk"

    # The Disk Data Base
    #DDB

    ddb.virtualHWVersion = "4"
    ddb.geometry.cylinders = "783"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "buslogic"
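The same edit can be made programmatically rather than by hand. This Python sketch rewrites only the sector count on the first `RW ... VMFS` extent line and leaves every other line untouched; the function name is hypothetical, and it is best applied to a copy of the descriptor text.

```python
import re


def restore_extent_size(base_descriptor: str, original_sectors: int) -> str:
    """Return the descriptor text with the first RW extent's sector count replaced."""
    new_text, count = re.subn(
        r'^(RW\s+)\d+(\s+VMFS\s+")',
        rf'\g<1>{original_sectors}\g<2>',
        base_descriptor,
        count=1,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("no RW ... VMFS extent line found")
    return new_text


expanded = '# Extent description\nRW 12582912 VMFS "test-flat.vmdk"\n'
print(restore_extent_size(expanded, 10485760))
```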


Expanding VM With a Snapshot

Commit the snapshot(s):

    # vmware-cmd /pathtovmx/test.vmx removesnapshots

Grow the VMDK file:

    # vmkfstools -X 6G test.vmdk

If needed, add a snapshot:

    # vmware-cmd /pathtovmx/test.vmx createsnapshot <name> <description>


Virtualization Performance Myths

• CPU affinity

• Virtual SMP performance

• Ready time

• Transparent page sharing

• Memory over-commitment

• Memory ballooning

• NIC teaming

• Hyperthreading


CPU affinity

Myth: Set CPU affinity to improve VM performance

CPU affinity implications

• CPU affinity restricts scheduling freedom. The VM will accrue ready time if the pinned CPU is not available for scheduling.

• On a NUMA system, setting CPU affinity disables NUMA scheduling. VM performance will suffer if memory is allocated on the remote node.

• On a hyperthreaded system, CPU affinity binds the VM to a logical CPU.

• ESX tries to balance interrupts. Setting CPU affinity to a physical CPU where interrupts occur frequently can hurt performance.

Fact: Setting CPU affinity could hurt performance


Virtual SMP Performance

Myth: Virtual SMP improves performance of all CPU-bound applications

SMP performance

• A single-threaded application cannot use more than one CPU at a time.

• A single thread may ping-pong between the virtual CPUs.

    • This incurs virtualization overhead; pinning the thread to a vCPU helps in this case.

• Co-scheduling overhead

    • Multiple idle physical CPUs may not be available when the VM wants to run (the VM may accumulate ready time).

Fact: Virtual SMP does not improve performance of single-threaded applications


Ready Time (1 of 2)

Myth: Ready time should be zero when CPU usage is low

VM state

• running (%used)

• waiting (%twait)

• ready to run (%ready)

When does a VM go to the “ready to run” state?

• The guest wants to run or needs to be woken up (to deliver an interrupt)

• A CPU is unavailable for scheduling the VM

[Diagram: VM states Run, Ready, and Wait.]


Ready Time (2 of 2)

Factors affecting CPU availability

• CPU over-commitment

    • Even idle VMs have to be scheduled periodically to deliver timer interrupts.

• NUMA constraints

    • NUMA node locality gives better performance.

• Burstiness: inter-related workloads

    • Tip: Use host anti-affinity rules to place inter-related workloads on different hosts.

• Co-scheduling constraints

• CPU affinity restrictions

Fact: Ready time could exist even when CPU usage is low


Transparent Page Sharing

Myth: Disable Transparent page sharing to improve performance

Transparent page sharing

• The VMkernel scans memory for identical pages and collapses them into a single page.

• Copy-on-write is performed if a shared page is modified.

• OSes have static code pages that rarely change.

• The number of shared pages becomes significant with more consolidation.

    • Huge win for memory over-commitment

• The default scanning rate is low and incurs negligible overhead (<1%).

Fact: Transparent page sharing does not affect performance adversely, and it improves performance under memory over-commitment
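The sharing and copy-on-write behaviour described above can be illustrated with a toy model in Python. It is not how the VMkernel implements page sharing: the class name and its methods are invented for illustration, and a SHA-256 digest is used as a simple stand-in for however identical pages are actually detected.

```python
import hashlib


class SharedPageStore:
    """Toy model: identical page contents are stored once and reference-counted."""

    def __init__(self):
        self._pages = {}     # digest -> page bytes (one physical copy)
        self._refcount = {}  # digest -> number of guest pages mapped to it
        self._mapping = {}   # (vm, page_number) -> digest

    def write(self, vm: str, page_number: int, data: bytes) -> None:
        key = (vm, page_number)
        old = self._mapping.get(key)
        if old is not None:
            # Copy-on-write: the writer drops its reference; other sharers keep the old page.
            self._refcount[old] -= 1
            if self._refcount[old] == 0:
                del self._refcount[old], self._pages[old]
        digest = hashlib.sha256(data).hexdigest()
        self._pages.setdefault(digest, data)
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        self._mapping[key] = digest

    def read(self, vm: str, page_number: int) -> bytes:
        return self._pages[self._mapping[(vm, page_number)]]

    def physical_pages(self) -> int:
        return len(self._pages)


store = SharedPageStore()
zero_page = bytes(4096)
store.write("vm1", 0, zero_page)
store.write("vm2", 0, zero_page)
print(store.physical_pages())  # 1: both guests share one copy
store.write("vm2", 0, b"x" * 4096)
print(store.physical_pages())  # 2: vm2 now has its own page, vm1 is unaffected
```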


Memory Ballooning

Myth: Disable balloon driver to increase VM performance

How ballooning works

• The balloon driver gives memory to, or takes memory away from, the guest under memory pressure.

• The rate of memory reclamation is determined by memory shares.

• In the default configuration, memory shares are proportional to VM memory size.

• Memory is reclaimed forcibly by swapping if the balloon driver is not installed.

Tip: To avoid swapping/ballooning, use a memory reservation.

Fact: Disabling the balloon driver severely affects VM performance under memory over-commitment


Hyperthreading

Myth: Hyperthreading hurts ESX performance

Hyperthreading support in ESX

• Hyperthreading increases the number of CPUs available for scheduling.

• SMP VMs use logical CPUs from different physical CPUs whenever possible.

• The scheduler temporarily quarantines a VM from a logical CPU if it misbehaves (cache thrashing).

• Frequently misbehaving VMs can be selectively excluded from hyperthreading with the htsharing=none option.

Fact: Hyperthreading improves overall performance by reducing ready time


Additional References

Best practices Using VMware Virtual SMP

http://www.vmware.com/pdf/vsmp_best_practices.pdf

Performance tuning best practices for ESX server 3

http://www.vmware.com/pdf/vi_performance_tuning.pdf

ESX Resource Management Guide

http://www.vmware.com/pdf/esx_resource_mgmt.pdf