8/8/2019 Performance Report Sun Unified Storage and VMware View 1.0
1/74
Performance Report
VMware View linked clone performance
on Sun's Unified Storage
Author: Erik Zandboer
Date: 02-04-2010
Version 1.00
Table of contents
1 Management Summary ................................................................................................................. 6
1.1 Introduction ........................................................................................................................ 6
1.2 Objectives ........................................................................................................................... 6
1.3 Results ................................................................................................................................ 6
2 Initial objective ............................................................................................................................. 7
2.1 VMware View ....................................................................................................................... 7
2.2 Storage requirements .......................................................................................................... 7
3 Technical overview of the solutions .............................................................................................. 8
3.1 VMware View linked cloning ................................................................................................ 8
3.2 Sun Unified Storage ............................................................................................................. 8
3.3 Linked cloning technology combined with Unified Storage ................................................... 9
4 Performance test setup ............................................................................................................... 10
4.1 VMware ESX setup ............................................................................................................. 10
4.2 VMware View setup ........................................................................................................... 11
4.3 Windows XP vDesktop setup .............................................................................................. 11
4.4 Unified Storage setup ........................................................................................................ 12
5 Tests performed ......................................................................................................................... 13
5.1 Test 1: 1500 idle vDesktops .............................................................................................. 13
5.2 Test 2: User load simulated linked clone desktops ............................................................. 13
5.3 Test 2a: Rebooting 100 vDesktops in parallel .................................................................... 13
5.4 Test 2b: Recovering all vDesktops after storage appliance reboot ...................................... 13
5.5 Test 3: User load simulated full clone desktops ................................................................. 14
6 Test results ................................................................................................................................ 15
6.1 Test Results 1: 1500 idle vDesktops .................................................................................. 15
6.1.1 Measured Bandwidth and IOP sizes ................................................................................ 16
6.1.2 Caching in the ARC and L2ARC ...................................................................................... 20
6.1.3 I/O Latency ................................................................................................................... 22
6.2 Test Results 2: User load simulated linked clone desktops ................................................ 24
6.2.1 Deploying the initial 500 user load-simulated vDesktops ............................................... 25
6.2.2 Impact of 500 vDesktop deployment on VMware ESX ..................................................... 31
6.2.3 Impact of 500 vDesktop deployment on VMware vCenter and View ................................ 34
6.2.4 Deploying vDesktops beyond 500 .................................................................................. 36
6.2.5 Performance figures at 1300 vDesktops ......................................................................... 40
6.2.6 Extrapolating performance figures ................................................................................. 47
6.3 Test Results 2a: Rebooting 100 vDesktops ........................................................................ 54
6.4 Test Results 2b: Recovering all vDesktops after storage appliance reboot........................... 58
6.5 Test Results 3: User load simulated full clone desktops ..................................................... 62
7 Conclusions ............................................................................................................................... 65
7.1 Conclusions on scaling VMware ESX ................................................................................... 65
7.2 Conclusions on scaling networking between ESX and Unified Storage ................................. 66
7.3 Conclusions on scaling Unified Storage CPU power ............................................................ 67
7.4 Conclusions on scaling Unified Storage Memory and L2ARC ............................................... 68
7.5 Conclusions on scaling Unified Storage LogZilla SSDs ........................................................ 68
7.6 Conclusions on scaling Unified Storage SATA storage ........................................................ 69
8 Conclusions in numbers ............................................................................................................. 70
9 References ................................................................................................................................. 72
Appendix 1: Hardware test setup ...................................................................................................... 73
Appendix 2: Table of derived constants ............................................................................................ 74
People involved
Name Company Responsibility E-Mail
Erik Zandboer Dataman B.V. Sr. Technical Consultant [email protected]
Simon Huizenga Dataman B.V. Technical Consultant [email protected]
Kees Pleeging Sun Project leader [email protected]
Cor Beumer Sun Storage Solution Architect [email protected]
Version control
Version Date Status Description
0.01 11-02-2010 Initial draft Initial draft for internal (Dataman / Sun) review
0.02 12-03-2010 Final draft Adjusted minor review comments; added conclusions and derived constants
1.0 02-04-2010 Release Final minor changes; revised the items added in 0.02
Abbreviations and definitions
Abbreviation Description
VM Virtual Machine. Virtualized workload on a virtualization platform (such as VMware ESX)
GbE Gigabit Ethernet. Physical network connection at Gigabit speed.
IOPS I/O Operations Per Second. The combined number of read and write commands issued to a
storage device per second. Take note that the ratio between reads and writes cannot be
extracted from this value, only the sum of the two. Also see ROPS and WOPS.
OPS Operations Per Second. A more general term, closely related to IOPS.
ROPS Read Operations Per Second. The number of read commands performed on a storage
device per second.
WOPS Write Operations Per Second. The number of write commands performed on a storage
device per second.
TPS Transparent Page Sharing. A feature unique to VMware ESX, where several memory pages
can be identified as containing equal data, and then stored only once in physical memory,
effectively saving physical memory. It is in most respects comparable to data deduplication.
SSD Solid State Drive. Normally indicates a non-volatile storage device with no moving parts. It can be a flash drive (like the ReadZilla device), but it can also be a
battery-backed (and optionally flash-backed) RAM drive (like the LogZilla device).
KB KBytes. Also seen in conjunction with /s or .sec-1, which denotes KBytes per second.
MB MBytes. Also seen in conjunction with /s or .sec-1, which denotes MBytes per second.
Mb Mbits. Also seen in conjunction with /s or .sec-1, which denotes Mbits per second.
vDesktop Virtualized Desktop. A Virtual Machine (VM) running a client operating system such as
Windows XP.
ave Average. Shorthand used in graphs to indicate the value is an averaged value.
HT, HTx Hyper Transport bus. High bandwidth connection between CPUs and I/O devices on
mainboards. Often indicated with numbers (HT0, HT1) to indicate specific connections.
UFS Unified Storage (Device). Storage device which is capable of delivering the same data using
multiple protocols.
1 Management Summary

1.1 Introduction
Running virtual desktops (vDesktops) puts a lot of stress on storage systems. Conventional storage systems are easily scaled to the right size: a number of disks delivers a certain capacity and performance.
In an effort to tackle the need for a large number of disks in a virtualized desktop (vDesktop) environment, Dataman started to analyze the basic needs of a vDesktop storage solution based on VMware linked cloning technology. The new Sun Unified Storage (UFS) solution (see reference [4]) appeared to have a significant head start in delivering high vDesktop performance with a small number of disks.
Because of the unconventional way this storage solution works, it is next to impossible to calculate performance numbers up front; how the Unified Storage performs depends heavily on the workload offered. This is why Dataman teamed up with Sun to run performance tests on these storage devices.
1.2 Objectives
The performance test had several goals:
- To measure the performance impact on the Unified Storage array as more vDesktops were deployed in the environment;
- To examine the impact of vDesktop reboots;
- To extrapolate the measured performance data;
- To project (and avoid) performance bottlenecks;
- To define scaling constants for scaling the environment to a projected number of vDesktops.
The tests were performed in Sun's datacenter in Linlithgow, Scotland. Hardware and housing were generously made available to Dataman for a period of two months, during which all necessary tests were performed.
1.3 Results
The performance tests proved very effective; during the final stages, testing stopped at 1319 user-simulated vDesktops because the VMware environment, having only eight nodes, could not handle any more virtual machines (VMs). At that stage, all vDesktops still performed without any issues or noticeable latency on a single-headed UFS device. Even more remarkable, the environment could have run on only 16 SATA spindles in a mirrored setup! It is the underlying ZFS file system and the intelligent use of memory and Solid State Drives (SSDs) that make all the difference here.
2 Initial objective
After virtualization practically conquered the world for server loads, it now continues on to the desktop. Virtualizing a large number of desktops on a small set of servers has proven to pose its own set of challenges. The one most often encountered is the performance requirement of the underlying storage array. Scaling disks just to satisfy capacity needs has always been bad practice, but it works out especially badly in a virtual desktop environment. Today's large disk capacities do not help either.
2.1 VMware View
One of the leading platforms for delivering virtualized desktops is VMware ESX in combination with VMware View. VMware View can deliver virtual desktops using linked cloning technology, which duplicates desktop images very quickly and is more efficient in terms of storage capacity.
Calculating the number of ESX nodes (cores and memory) is not too hard; it is no different from having fully cloned desktops. But what are the requirements of the underlying storage array?
2.2 Storage requirements
The structure of linked clones poses some challenges to the storage. For reasons explained in the next paragraphs, Sun's 7000 series Unified Storage (see reference [4]) was selected as the platform to drive linked clone loads most efficiently.
The objective of this performance test is to prove that Sun's 7000 series Unified Storage in combination with linked clones gives great performance at little cost.
3 Technical overview of the solutions
In order to better understand the performance test setup and its results, it is important to have some knowledge of the underlying technologies.

3.1 VMware View linked cloning
VMware View is basically a broker between the clients and the virtualized desktops (vDesktops) in the datacenter. The idea is that a single Windows XP image can be used to clone thousands of identical desktops. The broker controls the cloning and customization of these desktops.
VMware View enables an extra feature: linked cloning. When using linked cloning technology, only a small number of fully cloned desktop images exist. All virtual desktops that are actually used are derivatives of these full clone images. In order to differentiate the desktops, all writes to a virtual desktop's disk are captured in a separate file, much like VMware snapshot technology. The result is that many read operations are performed from the few full clones within the environment.
Following VMware best practices, it is recommended to have a maximum of 64 linked clones under every full clone (called a replica).
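The replica count implied by this best practice follows from a quick calculation; the sketch below uses only the 64-clone limit stated above and the 1319-desktop figure reached later in this report:

```python
import math

def replicas_needed(linked_clones: int, clones_per_replica: int = 64) -> int:
    """Number of full-clone replicas required for a given number of linked clones."""
    return math.ceil(linked_clones / clones_per_replica)

# For the 1319 vDesktops reached in the final test stage:
print(replicas_needed(1319))  # 21
```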
3.2 Sun Unified Storage
Sun's Unified Storage uses the ZFS file system internally. There are some very specific differences from just about any other file system. It is far beyond the scope of this document to dive deep into ZFS, so only some features of these appliances will be discussed.
Sun's Unified Storage appliances have a lot of CPU power and memory compared to most competitors. The CPU power is required to drive the ZFS file system appropriately, and the memory helps cache data. This caching is partly the key to the extreme performance of the appliance, even with relatively slow SATA disks. The use of Solid State Drives (SSDs) further enhances the performance of the appliance: read SSDs (called ReadZillas) basically extend the appliance's memory, and logging SSDs (called LogZillas) help synchronous writes to be acknowledged faster (the effect appears somewhat similar to write caching, but the technology is very different).
3.3 Linked cloning technology combined with Unified Storage
The basic idea of using Sun's Unified Storage for linked cloned desktops came from two directions. First, a storage device with a lot of cache was needed, in order to be able to keep the replicas (full clone images) in cache. Secondly, the barrier of 64 linked clones per replica limited the effectiveness of the cache, since one replica is needed for every 64 linked clones. This limit applies to storage devices having LUNs with VMFS (the VMware file system for storing VMs) on them; LUN queuing, LUN locking and some other artifacts come into play here.
But when using NFS for storage instead of iSCSI or FC, the 64-linked-clones-per-replica barrier could possibly be broken: NFS has no issues with a thousand or more open files accessed in parallel. Since Sun's Unified Storage is also able to deliver NFS, Sun's storage device appeared to be the right choice.
4 Performance test setup
The performance test was set up in Sun's test laboratory in Linlithgow, Scotland. Sun made a number of servers, a Sun 7410 Unified Storage device and the necessary switching components available. The total hardware setup can be viewed in appendix 1.
4.1 VMware ESX setup
A total of nine servers were available for VMware ESX. Eight were used for virtual desktop loads; the ninth server was used for all other required VMs such as vCenter, SQL, View and Active Directory. The specifications of the servers used:
8x Sun X4450 with 4x 6-core Intel CPU (2.6GHz), 64GB memory
1x Sun X4450 with 4x 4-core Intel CPU (2.6GHz), 16GB memory
All nodes were connected with a single GbE NIC to the management network, a single NIC to a VMotion network, and with a third Ethernet NIC to an isolated client network where the Windows XP virtual desktops could connect to Active Directory / file serving.
The eight nodes carrying virtual desktop loads were also connected to an NFS storage network using two GbE interfaces. All these interfaces were connected to a single GbE switch.
ESX 3.5 update 5 was used to perform the tests. The setup was kept at defaults; console memory was increased to 800MB (the maximum). In order to make sure both GbE connections to the storage array would be used, two different subnets were used towards the array, each subnet accessed by its own VMkernel interface. Each VMkernel interface in its turn was connected to one of the two GbE interfaces, guaranteeing static load balancing across both interfaces for every host.
To be able to house the maximum number of VMs possible on a single vSwitch, the port count of the vSwitch was increased to 248 ports.
4.2 VMware View setup
For managing the desktops, a Windows 2003 64-bit Enterprise Edition template was created. From this template, five VMs were derived:
1) Microsoft SQL 2005 Standard server with SP3;
2) Domain controller with DNS and file sharing enabled;
3) VMware vCenter 2.5 update 5;
4) VMware View 3.1.2;
5) VMware Update Manager.
During the tests, all these VMs were constantly monitored to guarantee that any limits found in the performance tests were not due to limitations within these VMs.
All ESX nodes involved in carrying vDesktops were put in a single VMware cluster, which was kept at defaults. A single Resource Pool was created within the cluster (at defaults) to hold all vDesktops during the tests.
4.3 Windows XP vDesktop setup
The Windows XP image used was a standard Windows XP installation with SP2 integrated. PSTools was installed inside the image, in order to be able to start and stop applications in batches, to simulate a simple user load on the vDesktops. No further tuning was done to the image.
Within VMware the images were configured with an 8GB disk, a single vCPU and 512MB of memory.
User load was simulated by using autologon on the vDesktop, after which a batch file was started. This batch file performed standard tasks with built-in delays. Examples of the tasks were:
- Starting MSPaint, which loads an image from the Domain Controller/File server;
- Starting Internet Explorer;
- Starting MSinfo32;
- Unzipping putty.zip to a local directory, then deleting it again;
- Starting Solitaire;
- Stopping all applications again.
These actions were fixed in order and delay. The delays were tuned until each vDesktop delivered an average load of 300MHz and just about 6 IOPS (this is accepted as being a lightweight user). In this user load, a rather high write load was introduced (of every 6 IOs, 5 are writes). This is considered to be a worst-case IO distribution for a vDesktop, making it a perfect setup for storage performance testing.
Checking the performance of the XP desktops was not a primary objective of the performance tests; however, after each test a few randomly chosen vDesktops were accessed and the introduction to Windows XP was started to check the fluidity of the animation, making sure the desktops were still responsive.
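The tuning target described above (about 6 IOPS per desktop, 5 of every 6 being writes) can be turned into an aggregate projection for the storage array. A back-of-the-envelope sketch, not part of the original test tooling:

```python
def aggregate_load(desktops: int, iops_per_desktop: float = 6.0, write_fraction: float = 5 / 6):
    """Project total, write and read IOPS hitting the storage for N lightweight users."""
    total = desktops * iops_per_desktop
    writes = total * write_fraction
    return total, writes, total - writes

# At the 1300-vDesktop level reached in test 2:
total, wops, rops = aggregate_load(1300)
print(round(total), round(wops), round(rops))  # 7800 6500 1300
```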
4.4 Unified Storage setup
The Sun 7410 Unified Storage device was connected to the storage switch using two 10GbE interfaces. Only a single head was used in the performance test, connected to 136 underlying SATA disks in six trays. In four of the trays a LogZilla was present; in total, two LogZillas (2x 18[GB]) were assigned to the 7410 head. Inside the 7410 head itself, two ReadZillas were available (2x 100[GB]). All SATA storage (apart from some hot spares) was mirrored (at the ZFS level). With a drive size of 1TB, this effectively delivers 60TB of total storage.
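The 60TB figure can be sanity-checked from the drive count: with 1TB drives in ZFS mirrors, usable capacity is (drives − spares) / 2. The 16-spare count in the sketch below is an inference from the stated 60TB, not a number given in the report:

```python
def mirrored_usable_tb(total_drives: int, hot_spares: int, drive_tb: int = 1) -> float:
    """Usable capacity of a ZFS mirrored pool, ignoring filesystem overhead."""
    return (total_drives - hot_spares) / 2 * drive_tb

# 136 SATA drives; 60TB usable implies 16 drives held back as hot spares.
print(mirrored_usable_tb(136, 16))  # 60.0
```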
The 7410 itself was configured with two Quad-Core AMD Opteron 2356 processors and 64[GB] of memory. A single dual-port 10GbE interface was added to the system for connection to the storage network. A third link (1GbE) was introduced for management inside the management network.
During configuration, two shares were created, each having its own IP address on its own 10GbE uplink. This ensures static load balancing for the ESX nodes, and also ensures the load is evenly spread over both 10GbE links on the storage unit. Jumbo frames were not enabled anywhere in the tests.
In order to be able to measure the usage of the HyperTransport busses inside the 7410, a script was inserted into the unit to measure these loads.
5 Tests performed
A total of three tests were performed. The first test loaded 1500 idle vDesktops in linked clone mode onto the storage. In the second test an attempt was made to load as many user-load-simulated vDesktops onto the testing environment as possible, in steps of 100 vDesktops. The third and final test was equal to the second test, but using full clones from VMware View.
For all tests both NFS shares were used. VMware View automatically balances the number of VMs equally across all available stores.
5.1 Test 1: 1500 idle vDesktops
In the first test, VMware View was simply instructed to deploy 1500 Windows XP images from a single source image. The resulting images were not performing any user load simulation; they were booted and then left idle. This test was performed to get a general idea of the load on ESX and storage required for this number of VMs.
5.2 Test 2: User load simulated linked clone desktops
After the initial test mentioned in 5.1, the test was repeated, now with user-load-simulated desktops. The test was performed in steps, with an additional 100 vDesktops at every step. The steps were repeated until a limitation in storage, ESX and/or the external environment was met.
5.3 Test 2a: Rebooting 100 vDesktops in parallel
As test 2 (5.2) reached the 1000 vDesktop mark, a hundred vDesktops were rebooted in parallel. This test was performed to simulate a real-life scenario, where a group of desktops is rebooted in a live environment. The impact on the storage device especially is to be monitored.
5.4 Test 2b: Recovering all vDesktops after storage appliance reboot
As test 2 (5.2) reached its maximum, the storage array was forcibly rebooted. This was not really part of the performance test, yet it was interesting to see the recovery process of the storage array, and the recovery of the VMs on it.
5.5 Test 3: User load simulated full clone desktops
Using full clones on a Sun 7000 storage device was not expected to work as efficiently as a linked cloning configuration. In this test a number of full clone desktops were deployed, 25 vDesktops per step.
6 Test results
The test results are described on a per-test basis. The initial 1500 idle-running vDesktop test is also used as a general introduction to the behavior of the storage device, the solid state drives and the observed latencies.

6.1 Test Results 1: 1500 idle vDesktops
As an initial test, 1500 idle-running, linked-cloned vDesktops were deployed onto the test environment. After the system had settled, there was first proof that the storage device was able to cope with at least 1500 idle vDesktop loads.
6.1.1 Measured Bandwidth and IOP sizes
The NFS bandwidth used while running this workload is shown in figure 6.1.1:
Figure 6.1.1: Running 1500 idle desktops, about 22MB/s of writes and 10MB/s of reads are observed.
The fact that about twice as much data is written as read is probably due to the vDesktops running idle (few reads taking place), while each vDesktop has only 512[MB] of memory, causing them to use their local swap files and write out to the storage device.
[Chart: NFS rate [MB.sec-1] vs. time [sec] — NFS read and write MBs (1500 idle-running vDesktops); series: NFS writes ave [MB/sec], NFS reads ave [MB/sec]]
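Dividing the observed totals by the number of desktops gives a feel for the per-desktop footprint of an idle vDesktop. This is arithmetic on the chart values above, not a measured per-VM figure:

```python
def per_desktop_kb_per_sec(total_mb_per_sec: float, desktops: int) -> float:
    """Per-desktop bandwidth in KB/s derived from an aggregate MB/s figure."""
    return total_mb_per_sec * 1024 / desktops

print(round(per_desktop_kb_per_sec(22, 1500), 1))  # 15.0 KB/s written per idle desktop
print(round(per_desktop_kb_per_sec(10, 1500), 1))  # 6.8 KB/s read per idle desktop
```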
As both bandwidth and the number of IOPS have been measured, it is easy to derive the average block size of the NFS reads and writes:
Figure 6.1.2: Average NFS read and write block sizes observed
Since VMware ESX will try to concatenate sequential reads and writes whenever possible, it is very likely that the writes are completely random (the NTFS 4K block size appears to be the determining factor here). The read operations are bigger on average, probably meaning some quasi-sequential reads are going on.
[Chart: Average NFS blocksize [KB] vs. time [sec] — Average NFS read and write blocksizes (1500 idle-running vDesktops); series: NFS write blocksize [KB], NFS read blocksize [KB]]
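The derivation is simply bandwidth divided by operations per second. A sketch of the arithmetic, using the approximate write-side numbers from figures 6.1.1 and 6.1.3 (about 22MB/s at just under 5000 WOPS), which lands close to the NTFS 4K block size:

```python
def avg_blocksize_kb(bandwidth_mb_per_sec: float, ops_per_sec: float) -> float:
    """Average block size in KB derived from bandwidth and operations per second."""
    return bandwidth_mb_per_sec * 1024 / ops_per_sec

print(round(avg_blocksize_kb(22, 5000), 1))  # 4.5 KB per NFS write
```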
Since all writes to the storage device are synchronous and have very small block sizes, all writes are put into the LogZilla devices before they pass on to SATA. As the data to be written traverses these stages, the number of WOPS becomes smaller with every step:
Figure 6.1.3: Number of write operations observed through the three stages
Here it becomes obvious how effective the underlying ZFS file system is. The completely random write load, which consists of nearly 5000 write operations per second, is converted in the last stage (SATA) to just over 30 write operations per second. ZFS effectively converts the tiny random NFS writes into large sequential blocks, dealing with the relatively poor seek times of the physical SATA drives.
[Chart: Write operations [sec-1] vs. time [sec] — Comparing write OPS through stages (1500 idle-running vDesktops); series: NFS WOPS [/sec], LogZilla WOPS [/sec], SATA WOPS [/sec]]
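The coalescing effect can be quantified from the same figures: nearly 5000 NFS WOPS are reduced to just over 30 SATA WOPS while write bandwidth stays around 22MB/s, implying large sequential writes to disk. Back-of-the-envelope arithmetic on the reported numbers:

```python
def coalescing_factor(front_wops: float, back_wops: float) -> float:
    """How many small front-end writes are merged into one back-end write, on average."""
    return front_wops / back_wops

def implied_write_kb(bandwidth_mb_per_sec: float, wops: float) -> float:
    """Average size of each back-end write in KB at the given bandwidth."""
    return bandwidth_mb_per_sec * 1024 / wops

print(round(coalescing_factor(5000, 30)))  # ~167 NFS writes per SATA write
print(round(implied_write_kb(22, 30)))     # ~751 KB per SATA write
```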
The write operations are effectively being dealt with. For reads, the following is observed on the SATA drives:
Figure 6.1.4: Observed SATA read operations per second.
At an average read bandwidth of 10[MB.sec-1] (see figure 6.1.1), fewer than 0.3 read operations per second (ROPS) are observed on the SATA drives. This raises the suspicion that most (in fact almost all) read operations are served by the read cache (ARC or L2ARC), and only very few reads actually originate from the SATA drives, effectively boosting the overall read performance of the Sun 7000 storage device.
[Chart: SATA read operations [sec-1] vs. time [sec] — SATA IOPS read ave [/sec]; series: SATA ROPS [/sec]]
6.1.2 Caching in the ARC and L2ARC
Zooming in on the read performance, we need to look more closely at the read caching going on. In figure 6.1.5 it is obvious that the ARC (64[GB] minus overhead) was saturated, while the L2ARC (200[GB]) was only filled up to about 70[GB]:
Figure 6.1.5: Running 1500 idle desktops, the ARC shows fully filled while the L2ARC flash drives vary in usage around 64[GB].
[Chart: ARC/L2ARC size [MB] vs. time [sec] — ARC / L2ARC size (1500 idle-running vDesktops); series: ARC datasize [MB], L2ARC datasize [MB]]
The ARC/L2ARC not being saturated should mean that all actively read data still fits into memory (ARC) or
Readzilla (L2ARC). This is clearly shown in figure 6.1.6, where the number of ARC hits show to be much larger
than the number of ARC misses:
Figure 6.1.6: Running 1500 idle desktops, the ARC hits show around 7000 per second while the
ARC misses show up at about 250. This is an indication of the effectiveness of the
(L2)ARC while running this specific workload.
While read operations appear to be properly served from the ARC or L2ARC, write operations must be committed to the disks at some point. The NFS writes are synchronous, meaning that each write operation must be guaranteed to be saved by the storage device before the operation is acknowledged. This would normally mean poor write performance, since the underlying disks are relatively slow SATA drives.
This problem is countered by the use of LogZilla devices. These devices are write-optimized solid state disks (SSDs), which store the write operation metadata and acknowledge the write immediately, before it is actually committed to disk. As soon as the write is committed to SATA storage, the metadata entry is removed from the LogZilla (this is the reason it is called a LogZilla and not a write cache; the LogZilla is only there to make sure the dataset does not become inconsistent when, for example, a power outage occurs).
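The write path described above can be sketched conceptually. This is not the actual ZFS/LogZilla implementation, and all names are hypothetical; it is only a minimal illustration of "acknowledge from the intent log first, commit to the slow disks later":

```python
# Conceptual sketch of a synchronous write path with an intent log.
# Hypothetical names; not the real ZFS code.
class IntentLog:
    """Stands in for a LogZilla device: fast, persistent log entries."""
    def __init__(self):
        self.entries = []

    def record(self, op):
        self.entries.append(op)  # fast SSD write
        return "ack"             # write acknowledged immediately


log = IntentLog()
sata_blocks = {}   # stands in for the slow SATA pool
pending = []       # writes queued for the next flush

def sync_write(op):
    ack = log.record(op)  # persistence is guaranteed before the ack
    pending.append(op)    # the actual commit to SATA happens later
    return ack

def flush():
    # ZFS flushes pending writes to disk at least every 30 seconds;
    # once the data is on SATA, the log entries are no longer needed.
    for op in pending:
        sata_blocks[op["block"]] = op["data"]
    log.entries.clear()
    pending.clear()

sync_write({"block": 7, "data": "x"})
flush()
print(sata_blocks)  # data on disk, log empty again
```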
[Chart: ARC hits / misses (1500 idle-running vDesktops); ARC hits [/sec] and ARC misses [/sec] plotted against time [sec]]
The underlying ZFS file system flushes the writes to disk at least every 30 seconds. ZFS performs random writes to the SATA disks very effectively, coalescing them into one big sequential write whenever possible. This can be verified from the graph in figure 6.1.3.
6.1.3 I/O Latency
Besides read and write performance, it is also necessary to look at storage latency. Latency is the delay between a request to the storage and the answer back: for a read, typically the time from the read request to the delivery of the data; for a write, typically the time from the write request to the write acknowledgement.
Performance is best when latency is minimal. To graph latency through time, a three-dimensional graph is required. The functions of the different axes are:
- Horizontal axis: Time;
- Vertical axis: Number of Read and/or Write Operations;
- Depth axis: Latency.
Latency is grouped into ranges instead of unique values. This enables the creation of 3D graphs, because it is now possible to see groups of IOPS which fall within a certain latency range.
Since in many cases almost all latency falls within the lowest group of 0-20[ms], graphs are often zoomed in, with the number of IOPS (vertical axis) clipped to a low number. As a result, the peaks of the 0-20[ms] latency group go off the chart. This gives room for a clearer view of the higher latency groups. Note that these graphs do not give a total overview of the number of IOPS performed; they merely give insight into the tiny details which are almost invisible in the original (non-zoomed) graph.
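The grouping into latency ranges can be sketched as a simple bucketing step. The sample values below are made up for illustration:

```python
# Group individual latency samples into fixed ranges (here 20 ms wide),
# as done for the 3D latency graphs. The sample values are hypothetical.
def bucket_latencies(latencies_ms, bucket_width_ms=20):
    """Return a {range_start_ms: count} mapping for the samples."""
    buckets = {}
    for lat in latencies_ms:
        start = int(lat // bucket_width_ms) * bucket_width_ms
        buckets[start] = buckets.get(start, 0) + 1
    return buckets

samples = [1.2, 3.5, 14.0, 19.9, 22.0, 45.0, 99.0]
print(bucket_latencies(samples))
# most samples fall in the 0-20[ms] group, as in the graphs
```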
In figure 6.1.7a (with its zoomed counterpart 6.1.7b) the latency graph is displayed for NFS Read Operations with 1500 idle-running vDesktops. Almost all operations fall within the 0-20[ms] latency group. Only in the zoomed graph (figure 6.1.7b) can some higher latencies be observed. However, these are so small in number compared to the IOPS within the 0-20[ms] latency group that very little impact is to be expected from them.
The read operations that required more time to complete are probably the ARC/L2ARC cache misses, which had to be read from SATA. These SATA reads are the reads observed in figure 6.1.4.
Figure 6.1.7a: Observed latency in NFS reads. Most read operations are served within 20[msec]
Figure 6.1.7b: Detail of latency in NFS read operations. Clipped at only 20 OPS to visualize higher
latency read operations.
[Chart: NFS Read Latency (1500 idle-running vDesktops); NFS Read Operations [sec-1], grouped by latency]
[Chart: NFS Read Latency ZOOMED (1500 idle-running vDesktops); vertical axis clipped at 20 operations per second]
6.2 Test Results 2: User load simulated linked clone desktops
After the initial test with idle-running desktops, the environment was reset. A new Windows XP image was introduced, which generates a lightweight user load pattern:
- 200[MHz] CPU load;
- 300[MB] active memory;
- 7 observed NFS IOPS.
The memory and CPU load were deliberately kept low, so that a maximum number of VMs would fit onto the virtualization platform. The number of IOPS was matched to the accepted industry average of 5 - 5.6 IOPS, with a calculated 150% overhead factor for linked cloning technology (see reference [1] for an explanation of the 150% factor).
6.2.1 Deploying the initial 500 user load-simulated vDesktops
When deploying the initial 500 vDesktops, the effect of the deployment was clearly reflected in several graphs. In figure 6.2.1 the ARC + L2ARC size grows almost linearly during deployment:
Figure 6.2.1: Observed ARC/L2ARC data size growth when deploying the first 500 desktops.
During the deployment of the very first vDesktops, the ARC immediately fills with both replicas (a replica is
the full-clone image from which the linked clones are derived). There are two replicas, because two NFS
shares were used, and VMware View places one replica on each share. In the leftmost part of the graph it is
actually identifiable that both replicas are put into the ARC one by one.
After this initial action, the ARC continues to fill, because the newly created linked clones are also read back. Since every vDesktop behaves identically, the reads performed on the linked clones are identical as well, which explains the near-linear growth.
[Chart: ARC / L2ARC datasize (0 - 500 userloaded vDesktops); ARC datasize [MB] and L2ARC datasize [MB] against time]
At the right of figure 6.2.1, the ARC fills up to its memory limit of 64[GB] minus the Storage 7000 overhead. It
is not until this time that the L2ARC starts to fill in the same linear manner as the ARC did. It becomes clear
that the L2ARC behaves as a direct (though somewhat slower) extension of the ARC (which resides in
memory).
When looking at ARC hits and misses in figure 6.2.2, it becomes clear that more and more read operations
are performed throughout the deployment:
Figure 6.2.2: Observed ARC hits and misses while deploying the initial 500 user loaded vDesktops.
The graph in figure 6.2.2 clearly shows the growing number of ARC hits. The ARC misses hardly increase at
all. This means that as more vDesktops are deployed, the effectiveness of the read cache mechanism
increases.
[Chart: ARC hits/misses (0 - 500 userloaded vDesktops); ARC hits [/sec] and ARC misses [/sec]]
Figure 6.2.3: Consumed NFS bandwidth during deployment of the initial 500 vDesktops
In figure 6.2.3 it is clearly visible that the first 500 vDesktops were deployed in batches of 100. During the
linked cloning deployment, consumed NFS bandwidth is clearly higher than during normal running periods.
[Chart: NFS bandwidth consumed (0-500 userloaded vDesktops); NFS read and write averages [MB/sec]]
Figure 6.2.4: SATA Read- and Write Operations observed during the deployment of the initial 500
vDesktops. Note that the vertical scale has been extended to -2 in order to clearly
display the Read Operations, which run over the vertical axis itself.
Figure 6.2.4 shows that the SATA Write Operations increase with the number of vDesktops running. The Read
Operations remain at a minimum level, without any measurable increase. This is in line with figure 6.2.2
showing that the read cache gets more effective with a growing number of deployed vDesktops.
[Chart: SATA Read- and Write Operations (0 - 500 userloaded vDesktops); SATA WOPS [/sec] and SATA ROPS [/sec]]
The write operations to SATA are synchronous, and get accelerated by the LogZillas. The graph in figure 6.2.5
shows the WOPS to the LogZilla devices:
Figure 6.2.5: Write Operations to the LogZilla device(s).
[Chart: LogZilla WOPS ave [/sec] (0-500 userloaded vDesktops)]
The ZFS file system is able to deliver this workload using a very limited number of SATA write operations. A possible downside of the ZFS file system is the large amount of CPU overhead it imposes. See figure 6.2.6 for details on CPU usage of the Sun 7000 storage device:
Figure 6.2.6: CPU usage in the Sun 7000 storage during deployment of 500 user load-simulated
vDesktops
[Chart: Sun Storage 7000 CPU load ave [%] during deployment]
6.2.2 Impact of 500 vDesktop deployment on VMware ESX
As the number of vDesktops increases, the load on VMware ESX and vCenter also increases. See figures 6.2.7, 6.2.8 and 6.2.9 for more details:
Figure 6.2.7: CPU usage within one of the eight VMware ESX hosts during the deployment of the
initial 250 vDesktops. The topmost grey graph is the CPU overhead of VMware ESX.
In figure 6.2.7 the deployment of vDesktops is clearly visible. Each time a vDesktop is deployed and started, a ribbon is added to the graph. Each vDesktop uses the same amount of CPU power, which increases slightly just after deployment (when the VM is booting its operating system).
Figure 6.2.8: Active Memory used by the vDesktops on one of the ESX nodes during the
deployment of the initial 250 vDesktops. The lower red ribbon is ESX memory
overhead due to the Service Console.
Figure 6.2.8 shows the active memory consumed as the vDesktops are deployed on one of the ESX nodes. After each batch of 100 vDesktops, the memory consumption stops increasing, then slightly decreases. This effect is caused by two things:
1) Freeing up tested memory within the VMs (Windows VMs touch all memory during their memory test);
2) VMware's Transparent Page Sharing technology.
As the VMs settle on the ESX server, ESX starts to detect identical memory pages, effectively deduplicating them (item 2 on the list above). This feature can save a lot of physical memory, especially when deploying many (almost) identical VM workloads.
Figure 6.2.9: Physical memory shared between vDesktops thanks to VMware's Transparent Page
Sharing (TPS) function within VMware ESX.
Transparent Page Sharing (TPS) effects become clearer when looking at the graph in figure 6.2.9. As VMs are
added to the ESX server, more memory pages are identified as being duplicates, saving more and more
physical memory.
6.2.3 Impact of 500 vDesktop deployment on VMware vCenter and View
VMware vCenter and VMware View are not directly involved in delivering the vDesktop workloads, but they play an important role during the deployment of new vDesktops. The CPU loads on these machines clearly show the deployment of the batches of vDesktops:
Figure 6.2.10: Observed CPU load on the (dual vCPU) vCenter server during vDesktop deployment.
Note the dual y-axis descriptions; some values are percentages, others are [MHz].
In figure 6.2.10, the deployment batches can be clearly identified. After each batch, the vCenter server settles at a slightly higher CPU load. This is caused by the growing number of VMs to manage and monitor within the entire ESX cluster.
Figure 6.2.11: Observed CPU load on the VMware View server during vDesktop deployment.
Note the dual y-axis descriptions; some values are percentages, others are [MHz].
The VMware View server shows much the same characteristics as the VMware vCenter server: higher CPU loads during the batch deployment of vDesktops, settling somewhat higher after each batch.
6.2.4 Deploying vDesktops beyond 500
After the successful deployment of the initial 500 vDesktops, further batches of 100 vDesktops were deployed. The goal was to fit as many vDesktops onto the testbed as possible, while keeping track of all potential performance boundaries.
The largest number of vDesktops that could be deployed was 1319. At this point VMware stopped deploying more vDesktops because the ESX servers were running out of vCPUs. Within ESX version 3.5, the number of VMs that can run on a single node is fixed at a maximum of 170. This maximum was reached just before ESX physical memory ran out:
Figure 6.2.12: ESX node resource usage when deploying 1300 vDesktops
As the graph in figure 6.2.12 shows, memory and CPU usage grew at almost the same rate. The limit on the number of running VMs, the memory limit and the CPU limit were all reached almost simultaneously.
[Chart: VMware ESX node resource usage (0 to 1300 vDesktops); node CPU and node memory usage (average) [%] against number of deployed user-simulated vDesktops]
Due to the nature of the ZFS file system, the CPU load on the storage device was a concern. The
measured values can be found in figure 6.2.13:
Figure 6.2.13: CPU load on the 7000 storage during deployment of 1300 vDesktops. Note the HT0
value. This is the HT-bus between the two quad core CPUs inside the storage device.
The relaxation points at 600-700 and 1200 vDesktops were due to settling of the
environment during weekends.
As shown, the CPU load on the storage device is quite high, but not near saturation yet. The HT0 bus
displayed here was the one HT-bus having the biggest bandwidth usage. This is due to the fact that a
single, dual-channel PCI-e 10GbE card was used in the environment. The result of this was that the
second CPU had to transport all of its data to the first CPU in order to be able to get its data in and out
of the 10GbE interfaces. Note that the design could have been optimized here to use two separate
10GbE cards, each on PCI-e lanes that use a different HT-bus. This would have resulted in a better
load balancing across CPUs and HyperTransport busses. See figure 7.3.1 for a graphical representation
of this.
[Chart: 7000 Storage CPU resources (0 to 1300 vDesktops); 7410 CPU load [%] and HT0/socket1 HyperTransport bus bandwidth usage [GB.sec-1] against number of deployed user-simulated vDesktops]
The memory consumption of the 7000 storage is directly linked to the amount of read cache used. As the number of vDesktops increases, the ARC (memory cache) fills up. At about 450 vDesktops, the ARC reaches its 64[GB] limit and the L2ARC (solid state drives) starts to fill (see figure 6.2.14):
Figure 6.2.14: Memory usage on the 7000 storage during the deployment of 1300 vDesktops. Note
the L2ARC (SSD drive) starting to fill as the ARC (memory) saturates. The relaxation
between 600 and 700 vDesktops is due to a stop of deploying during a weekend
(ARC flushing occurred through time as the vDesktops settled in their workload).
The L2ARC finally settled at just about 100[GB] of used space (on the testbed there was a total of
200[GB] of ReadZilla available).
[Chart: 7000 Storage memory usage (0 to 1300 vDesktops); 7410 L2ARC [GB], ARC [GB] and kernel use [GB] against number of deployed user-simulated vDesktops]
The networking bandwidth and IOPS used by the testbed are displayed in figure 6.2.15:
Figure 6.2.15: NFS traffic observed during the deployment of 1300 vDesktops.
The dips in the graph at 600/700 and 900/1000 vDesktops are in fact weekends; the vDesktops settled into their behavior, which shows in the graph in figure 6.2.15.
[Chart: NFS traffic (0 to 1300 vDesktops); NFS IOPS [sec-1], NFS reads and NFS writes [MB.sec-1] against number of deployed user-simulated vDesktops]
6.2.5 Performance figures at 1300 vDesktops
The system saturated at 1300 vDesktops, due to the limit on the maximum number of running VMs inside the ESX servers. Performance of the vDesktops at this number was still very acceptable, even though both memory and CPU power were almost at their maximum.
The VMs were still very responsive. Random vDesktops were accessed through the console, and responsiveness was tested by starting the "welcome to Windows XP" introduction animation. Neither frame rate nor animation speed deteriorated significantly through the entire range of 0 to 1300 vDesktops.
A good technical metric for this is the CPU ready time: the time a VM is ready to execute on a physical CPU core, but ESX cannot manage to schedule it onto one:
Figure 6.2.16: CPU ready time measured on a vDesktop on a 30 minute interval.
Note that these values are summed between samples, and all millisecond values should be divided by 1800 (30 minutes) in order to obtain the ready time in milliseconds per second (instead of per 30 minutes). In the leftmost part of the graph, vDesktops are still being deployed and booted, impacting performance (ready time is about 12.5 [ms.sec-1]). After the deployment is complete, ready time drops to about 4.2 [ms.sec-1]. These values are very acceptable from a CPU performance point of view.
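The normalization described above can be sketched as follows; the summed input values are back-computed from the approximate 12.5 and 4.2 [ms.sec-1] figures mentioned in the text:

```python
# Convert CPU ready time summed over a 30-minute sample interval
# into milliseconds of ready time per second.
SAMPLE_INTERVAL_SEC = 30 * 60  # 1800 seconds per sample

def ready_ms_per_sec(summed_ready_ms):
    return summed_ready_ms / SAMPLE_INTERVAL_SEC

print(ready_ms_per_sec(22500))  # during deployment: 12.5 [ms.sec-1]
print(ready_ms_per_sec(7560))   # settled: 4.2 [ms.sec-1]
```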
Next to CPU ready times, NFS latency also greatly influences the responsiveness of the vDesktops. The graphs in the following figures were made at a load of 1300 vDesktops:
Figure 6.2.17a and 6.2.17b: Observed NFS read latency at 1300 user simulated vDesktops
[Chart: NFS Read Latency (1300 userloaded vDesktops); Read IOPS [sec-1], grouped by latency]
[Chart: NFS Read Latency ZOOMED (1300 userloaded vDesktops); vertical axis clipped at 20 Read IOPS [sec-1]]
Graph 6.2.17a shows that almost all Read Operations are served within 20 [ms], which is quite impressive at this load.
Looking at the read latency in more detail (figure 6.2.17b), there are some Read Operations which take longer to be served. To put this in numbers: every second, between 1 and 2 read operations take up to about 100 [ms] to be served. Note that this is only about 0.2% of the read operations performed.
Next to read latency, write latency was also measured. The write latency appears to be somewhat worse than the read latency:
Figure 6.2.18a and 6.2.18b: Observed NFS write latency at 1300 user simulated vDesktops
[Chart: NFS Write Latency (1300 userloaded vDesktops); NFS write Operations [sec-1], grouped by latency]
[Chart: NFS Write Latency ZOOMED (1300 userloaded vDesktops); vertical axis clipped at 1000 write operations per second]
About 150 write operations require more than the base 0-40[ms] window to complete. Since the total number of write operations is about 6000, this is about 2.5% of the operations performed. The only explanation for these high latency numbers is that some writes are not committed to the LogZilla, but are flushed to disk directly. This is normal behavior for ZFS.
Within ZFS, larger blocks are not committed to the LogZilla. This is controlled by a parameter called zfs_immediate_write_sz. This parameter is actually a constant within ZFS, and set to 32768 (see reference [2]).
VMware ESX will concatenate writes if possible, up to 64[KB]. It is safe to assume that the majority of writes equal the vDesktop's NTFS block size (4[KB]). However, some blocks do get concatenated within VMware.
Looking at figure 6.1.2, we can see that the average write size is 5.5 [KB]. If we calculate the projected average write size from the behavior seen above, we can conclude that:
0.975 x 4[KB] + 0.025 x 64[KB] = 5.5[KB]
This is a perfect match, so it is safe to assume that the large write latency observed is in fact due to this behavior. Tuning the zfs_immediate_write_sz constant could help in this case (increasing it to 65537 (which is 2^16+1) to make sure 64[KB] writes are also stored in the LogZillas). Unfortunately, adjusting this parameter is not supported on the Sun 7000 storage arrays (nor is it in ZFS, to my knowledge).
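The block-size reasoning can be verified with a short calculation: if roughly 2.5% of the writes are 64[KB] concatenations and the remainder are 4[KB] NTFS blocks, the projected average write size should match the observed 5.5[KB]:

```python
# Projected average write size from a 4[KB] / 64[KB] write mix.
small_kb, large_kb = 4, 64
fraction_large = 0.025  # the ~2.5% of writes that bypass the LogZilla

avg_write_kb = (1 - fraction_large) * small_kb + fraction_large * large_kb
print(f"projected average write size: {avg_write_kb:.1f} [KB]")  # 5.5 [KB]
```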
VMware ESX has a feature called Transparent Page Sharing (TPS). This allows VMware ESX to map several identical virtual memory pages to the same physical memory page. VMware performs this memory deduplication either in hardware (vSphere 4 plus supported CPUs) or in spare CPU cycles (both ESX 3.x and vSphere 4, optionally), so the positive effect of TPS gets bigger over time (also see figure 6.2.9).
At a total of 1300 deployed vDesktops, ESX saves a large amount of memory:
Figure 6.2.19: Memory shared between vDesktops within a single ESX server.
As shown in figure 6.2.19, there are 170 VMs running (the graph shows only one of the eight ESX nodes). Each ribbon in this graph represents a VM. In total, 22.5 [GB] of memory is shared, and thus saved, between vDesktops per ESX node. Without TPS, the ESX servers would each have required 64 + 22.5 = 86.5 [GB] of memory (a saving of roughly 26%).
Looking at the entire ESX cluster, each ESX server saves about the same amount of memory thanks to TPS, saving 8 x 22.5 = 180 [GB] of memory.
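The memory arithmetic behind these savings can be sketched as:

```python
# TPS memory savings per node and cluster-wide, using the figures above.
shared_per_node_gb = 22.5   # memory deduplicated per ESX node
physical_per_node_gb = 64   # physical RAM per ESX node
nodes = 8

required_without_tps_gb = physical_per_node_gb + shared_per_node_gb
saving_fraction = shared_per_node_gb / required_without_tps_gb
cluster_saving_gb = nodes * shared_per_node_gb

print(required_without_tps_gb)         # 86.5 [GB] needed without TPS
print(f"{saving_fraction:.0%} saved")  # roughly 26%
print(cluster_saving_gb)               # 180 [GB] saved cluster-wide
```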
6.2.6 Extrapolating performance figures
In order to predict the maximum number of vDesktops which can be placed on a certain environment, it is important to take note of all limiting factors. By extrapolating the measurements performed on these factors, we can determine how to scale the different resources (like CPU, memory, SSD drives) to match the number of vDesktops we need to deploy.
For scaling VMware ESX CPU and memory, we set the maximum allowable load to 85%. The
extrapolated graph can be found in figure 6.2.20:
Figure 6.2.20: Extrapolation of figure 6.2.12: ESX node resource usage
In figure 6.2.20, memory is limited at 1300 desktops (which actually was the limit we ran into during
the test). CPU had some room to spare: If we pushed CPU consumption to 85%, we could deploy 1650
vDesktops.
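The extrapolation assumes that resource usage grows linearly with the number of vDesktops. A minimal sketch follows; the 67% CPU reading at 1300 vDesktops is an assumed value, back-computed from the projected result of roughly 1650:

```python
# Linear extrapolation of a resource toward an 85% usage cap.
def max_vdesktops(measured_vdesktops, measured_usage_pct, cap_pct=85):
    """Assumes usage grows linearly from zero with the vDesktop count."""
    usage_per_vdesktop = measured_usage_pct / measured_vdesktops
    return int(cap_pct / usage_per_vdesktop)

# An assumed ~67% CPU at 1300 vDesktops projects to roughly 1650
# vDesktops at an 85% cap.
print(max_vdesktops(1300, 67))
```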
[Chart: VMware ESX node resource usage (extrapolated); node CPU and node memory usage (average) [%] against number of deployed user-simulated vDesktops, 0 to 1800]
Figure 6.2.21: Extrapolation of figure 6.2.13: CPU load on the 7000 storage
Looking at figure 6.2.21, the extrapolated value for the 7000 storage CPU usage would put the maximum number of vDesktops at 1900. The theoretical maximum of the HT bus is 4[GB.sec-1], but a generally accepted value is around 2.5[GB.sec-1]. This would mean the HT bus limits the number of vDesktops to 1950.
[Chart: 7000 Storage CPU resources (extrapolated); 7410 CPU load [%] and HT0/socket1 HyperTransport bus throughput [GB.sec-1] against number of deployed user-simulated vDesktops, 0 to 2000]
For read caching, the 7000 storage relies on memory and solid state drives (SSDs). Both extrapolate in basically the same way; memory is simply much faster than SSD. For extrapolating memory usage, using the ARC values is sufficient:
Figure 6.2.22: Extrapolation of figure 6.2.14: Memory usage on the 7000 storage
Extrapolation of the ARC size shows that, with 256[GB] of memory minus some overhead for the kernel, up to 2400 vDesktops could be deployed. Beyond this point SSDs (ReadZilla) would have to be used in order to extend the cache beyond 256[GB], which is the maximum amount of RAM that can fit in the biggest 7000 series array at the time of this writing.
An important note is that the measured range of the ARC is rather short. A slight variation in the measurement could have quite a dramatic effect on the final number of vDesktops that can be deployed in a given environment.
[Chart: 7000 Storage memory usage (extrapolated); 7410 ARC [GB] against number of deployed user-simulated vDesktops, up to 2300]
Finally, the NFS traffic is extrapolated in order to be able to see projected network bandwidth and
number of IOPS required for a given number of vDesktops:
Figure 6.2.23: Extrapolation of figure 6.2.15: NFS traffic observed
The extrapolation in figure 6.2.23 is bounded by several limits. In the network bandwidth projection a maximum of 2x 1GbE is used, with usage limited to 50% for each link in order to avoid possible saturation / packet dropping on the link.
The number of total IOPS in this projection is limited to 12,000 [sec-1]. The reason for choosing this number is that at the measured I/O distribution, about 10,000 Write Operations Per Second (WOPS) would be performed, which is the maximum for a LogZilla device.
According to this graph, maximums come into play above 1800 vDesktops. For the NFS read
bandwidth, the maximum is not reached in this graph but would end somewhere near 4000
vDesktops (!).
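A back-of-the-envelope check of the read-bandwidth ceiling mentioned above. The ~125[MB/s] per 1GbE link and the ~60[MB/s] of NFS reads around 1800 vDesktops are assumptions of this sketch, read approximately from the graphs:

```python
# Network ceiling: two 1GbE links, each limited to 50% utilisation.
link_mb_per_sec = 125                # ~1 Gbit/s expressed in MB/s
usable_mb_per_sec = 2 * link_mb_per_sec * 0.50  # 125 MB/s in total

# Assumed NFS read bandwidth: ~60 MB/s at 1800 vDesktops (figure 6.2.23).
measured_vdesktops, measured_read_mb = 1800, 60

max_by_read_bw = usable_mb_per_sec * measured_vdesktops / measured_read_mb
print(int(max_by_read_bw))  # 3750, consistent with "near 4000" vDesktops
```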
[Chart: NFS traffic (extrapolated); NFS IOPS [sec-1], NFS reads and NFS writes [MB.sec-1] against number of deployed user-simulated vDesktops, up to 1800]
In close relation to the NFS IOPS performed, SATA ROPS and WOPS can also be extrapolated:
Figure 6.2.24: Extrapolation of figure 6.2.4: SATA Read- and Write-Operations Extrapolated
The graph in figure 6.2.24 clearly shows that hardly any SATA ROPS are performed, while SATA WOPS steadily increase with the number of running vDesktops. Note that at 1500 running vDesktops the number of WOPS is projected to be only 68 [sec-1]. ROPS remain near zero.
[Chart: SATA Read- and Write Operations (extrapolated); SATA WOPS [/sec] and SATA ROPS [/sec] against number of deployed vDesktops]
The write acceleration through the LogZilla device(s) can also be extrapolated:
Figure 6.2.25: Extrapolation of figure 6.2.5: LogZilla WOPS performed
[Chart: LogZilla WOPS ave [/sec] (extrapolated), against number of vDesktops deployed]
Latency is more complex to extrapolate. By extrapolating each latency-group, a 3D graph can be
recreated to show projected NFS read latencies:
Figure 6.2.26: Extrapolation of NFS read latency, clipped at 100 read operations per second.
Figure 6.2.26 is an extreme zoom of an extrapolated NFS read latency graph. The graph has been cut into segments, with separations inserted, to give a clear view of the latency graphs as more vDesktops are deployed on the environment.
As the number of vDesktops grows, more latency is introduced, as already determined. This graph, however, makes it clear that the distribution of latency changes as the load increases.
[Chart: Extrapolated NFS read latencies; NFS Read Operations [sec-1], clipped at 100 read operations per second]
6.3 Test Results 2a: Rebooting 100 vDesktops
The impact on the storage of rebooting a large number of VMs should never be underestimated. The reboot process uses far more resources than a regular workload, and rebooting many vDesktops in parallel in particular can mean a large increase in I/O operations performed.
As a subtest, we shut down then restarted a hundred vDesktops with a total of 800 vDesktops deployed. The
impact is best seen in the latency graphs:
Figure 6.3.1a and 6.3.1b: NFS read latency rebooting 100 vDesktops (@800 deployed).
As can be seen in graphs 6.3.1a and b, the reboot took about one hour in total. The restart was issued
through VMware View, which schedules the restarts, spread over time, through vCenter. In graph a (unzoomed),
the peaks above 4000 ROPS indicate the higher number of read operations caused by the restarting
vDesktops. The zoomed graph (graph b) shows in more detail how the read latency worsens during the
restarts. This is because the linked-clone files that were previously written by the VMs are now read back
and have to be brought into the ARC/L2ARC read cache, meaning these reads have to come
from the relatively slow SATA drives. A second restart might have had less impact in this respect (untested).
The filling of the L2ARC (from SATA) during the reboot of the vDesktops can clearly be seen in the graph in
figure 6.3.2:
Figure 6.3.2: L2ARC growth on desktops reboot
A hundred rebooting vDesktops caused the L2ARC to grow by about 50 [GB]. Since all common reads
come from only two replicas, which are already stored in the ARC, each VM apparently reads about 0.5 [GB] of
unique data (from its linked clone).
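As a quick sanity check, the per-vDesktop figure follows directly from the observed numbers. A minimal sketch; both input values are read from figure 6.3.2:

```python
# Estimate of unique data read per vDesktop during the reboot storm.
l2arc_growth_gb = 50      # L2ARC grew by about 50 GB during the reboot
rebooted_vdesktops = 100  # number of vDesktops that were restarted

unique_read_per_vm_gb = l2arc_growth_gb / rebooted_vdesktops
print(f"~{unique_read_per_vm_gb:.1f} GB of unique linked-clone data per vDesktop")
```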
Network bandwidth used is also clearly higher during the reboot of the vDesktops:
Figure 6.3.3: NFS bandwidth used during reboot of 100 vDesktops.
At the left of the graph above, the regular I/O workload can be observed. The rest of the graph shows the
reboot of the 100 vDesktops.
6.4 Test Results 2b: Recovering all vDesktops after storage appliance reboot
When running 1000 vDesktops, the storage array was forcibly rebooted. This subtest was performed to see
the impact on the storage array, on the data and on the vDesktops.
At the time of the forced shutdown of the storage device, all VMs froze. After the storage appliance
rebooted, the ZFS file system had to perform some resilvering (checking the data and making sure it is
consistent, a very reliable feature of ZFS) before normal NFS communication with the ESX servers could
resume. At that point, the VMs simply unfroze and resumed their normal behavior almost instantly.
In graph 6.4.1 the effects of the forced reboot can clearly be seen:
Figure 6.4.1: Network and CPU load behavior during reboot of the storage appliance. The red
striped bars indicate that no measurements were made (during the reboot of the
storage device itself).
The red bar in figure 6.4.1 indicates the time required to (re)boot the storage device. The silent period
after that is the so-called resilvering of the ZFS file system. No I/O is performed at this stage, but as can
be seen the CPU is quite busy during the resilvering.
After resilvering is done, the storage device immediately resumes I/O and settles quite fast. After a
reboot of the appliance, the ARC is empty (being RAM), and the L2ARC data is forcibly deleted; it will be
rebuilt as read operations start to occur. Initially, the reads have to come from SATA, filling up the ARC
and after that the L2ARC. In figure 6.4.2 the refilling of the ARC and L2ARC is clearly visible:
Figure 6.4.2: Filling of the ARC and the L2ARC after a forced reboot of the storage appliance.
The graph in figure 6.4.2 clearly shows the rapid filling of the ARC. It fills a little during resilvering,
then shoots up quickly (probably the two replicas being pulled into the ARC). From there on, the
filling of the ARC slows its pace, and the L2ARC starts to fill as well. The third graph in figure 6.4.2
shows the (L2)ARC misses. For a few minutes there are quite a lot of misses, but this resolves rather quickly.
All in all, the device was up and running again within 15 minutes. Note that the setup used here did not
make use of the clustering features available for the 7000 series; all tests were performed on a single
storage processor.
6.5 Test Results 3: User load simulated full clone desktops
A limited test was added to the original linked-clone test scenario. In this test the same (user-simulated)
Windows XP images were deployed, but this time in full-clone rather than linked-clone mode.
Only 150 full-clone desktops were deployed, to observe the behavior of the ARC and L2ARC in this
scenario.
Figure 6.5.1: Filling of the ARC and the L2ARC during the deployment of 150 full-clone
vDesktops. At the far left some test vDesktops (full clones) are deployed. At (1)
the first batch of 25 vDesktops is deployed; at (2) the rest of the vDesktops
are deployed.
See figure 6.5.1. After the start of the test (far left) some full-clone vDesktops are deployed. At marker
(1), the first batch of 25 vDesktops is deployed. Shortly after marker (1), the combined ARC and L2ARC size
settles around 25 [GB]. This indicates that the vDesktops perform around 1 [GB] of reads per vDesktop.
Because the ARC is not yet saturated, the L2ARC remains (almost) empty at this stage. Beyond marker (2)
the rest of the vDesktops are deployed, quickly filling the ARC and the L2ARC.
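The same arithmetic extends to a rough cache projection for the whole full-clone deployment. This is a sketch only: it assumes the per-desktop footprint stays linear, which figure 6.5.1 only partially confirms.

```python
# Per-vDesktop read-cache footprint observed for full clones (figure 6.5.1),
# and a naive linear projection to the complete 150-desktop deployment.
cache_after_first_batch_gb = 25  # combined ARC+L2ARC size after the first batch
first_batch_vdesktops = 25

per_vdesktop_gb = cache_after_first_batch_gb / first_batch_vdesktops

# Projection for all 150 full clones, assuming linear growth:
projected_cache_gb = per_vdesktop_gb * 150
print(per_vdesktop_gb, projected_cache_gb)  # 1.0 150.0
```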
Figure 6.5.2: Extrapolation of the 7000 storage CPU usage.
Figure 6.5.2 contains an extrapolation of the CPU load on the 7000 storage. The extrapolation spans a
wide range and therefore leaves room for error. However, it appears to be well in line with the CPU
figures measured in the linked-cloning setup (see figure 6.2.20).
More interesting is the number of IOPS performed in the full-clone scenario compared to the linked-clone
scenario:
Figure 6.5.3: NFS IOPS comparison of full-clone versus linked-clone vDesktop deployment.
Figure 6.5.3 shows that linked-clone vDesktops use more IOPS than full-clone vDesktops. This effect
can be explained by the way linked clones function within VMware ESX; their behavior is much like
VMware snapshotting (see reference [1] for more details).
Another thing that can be seen in figure 6.5.3 is that the deployment of linked clones appears to
have a greater IOPS impact than full-clone deployment. Note, though, that this is not actually the case:
in figure 6.5.3, the time scale has been adjusted in order to fit both graphs into a single figure. In fact,
the speed of deployment is very different:
- Linked clones deploy at a rate of 100 vDesktops per hour;
- Full clones deploy at a rate of 10 vDesktops per hour.
This factor of 10 is not visible in the graph, but the full-clone vDesktop deployment actually uses far more
IOPS. This makes sense: in the full-clone scenario every vDesktop gets its boot drive fully copied, while
linked clones only incur some IOPS overhead when creating an empty linked clone (plus some other
administrative actions on disk).
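Because the time axes in figure 6.5.3 were rescaled, the fair comparison is I/O per deployed desktop rather than momentary IOPS. A minimal sketch; the average IOPS value is a placeholder, while the deployment rates are the ones from the report:

```python
def io_per_desktop(avg_iops, desktops_per_hour):
    """Total I/O operations spent per deployed desktop."""
    seconds_per_desktop = 3600 / desktops_per_hour
    return avg_iops * seconds_per_desktop

# With a similar-looking IOPS level in both graphs (placeholder: 2000 IOPS),
# the tenfold difference in deployment rate dominates:
linked_io = io_per_desktop(avg_iops=2000, desktops_per_hour=100)
full_io = io_per_desktop(avg_iops=2000, desktops_per_hour=10)
print(full_io / linked_io)  # 10.0 -- full clones cost ~10x the I/O per desktop
```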
7 Conclusions
From all tests conducted, some very interesting conclusions can be drawn. First of all, the fact that the
environment managed to run over 1300 vDesktops without performance issues is on its own a great
accomplishment. Looking deeper into the measured values yields a wealth of information on best practices
for configuring Sun Unified Storage 7000 in combination with VMware View linked clones.
7.1 Conclusions on scaling VMware ESX
It proves to be very important to scale your VMware ESX nodes correctly. There are basically three things to
keep in mind:
1) The number of CPU cores inside an ESX server;
2) The amount of memory inside an ESX server;
3) The number of vCPUs/VMs the ESX server can deliver.
The first and second are the obvious ones: put in too much CPU power and you run out of memory, leaving
the CPU cores underutilized; put in too much memory and you run out of CPU power, leaving memory
underutilized.
The third is sometimes forgotten, but proved to be the culprit in our test setup: if you use ESX servers with
too much CPU and memory, you'll run out of vCPUs and VMs will no longer start beyond a certain point.
Luckily, with each release of VMware ESX this limit appears to get higher:
- ESX 3.0.1 / ESX 3.5: 128 vCPUs, 128 VMs;
- ESX 3.5U2+: 192 vCPUs, 170 VMs;
- vSphere (ESX4): 512 vCPUs, 320 VMs.
As shown, using vSphere as a basis will allow for much bigger ESX servers.
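The third scaling limit can be expressed as a simple sizing check. The per-host maximums are the ones listed above; treating them as the only constraints is of course a simplification, since CPU and memory sizing also apply.

```python
# Per-host vCPU and powered-on VM maximums, as listed above.
ESX_LIMITS = {
    "ESX 3.0.1/3.5": {"vcpus": 128, "vms": 128},
    "ESX 3.5U2+": {"vcpus": 192, "vms": 170},
    "vSphere (ESX4)": {"vcpus": 512, "vms": 320},
}

def max_vdesktops(version, vcpus_per_vm=1):
    """Hard cap on vDesktops per host, ignoring CPU/memory sizing."""
    lim = ESX_LIMITS[version]
    return min(lim["vcpus"] // vcpus_per_vm, lim["vms"])

for version in ESX_LIMITS:
    print(version, max_vdesktops(version))  # 128, 170 and 320 single-vCPU desktops
```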
7.2 Conclusions on scaling networking between ESX and Unified Storage
The network did not really prove to be an issue during the performed tests. Bandwidth usage to any single
ESX node proved to be well within the capabilities of a single GbE connection.
Bandwidth to the storage also remained far within the designed bandwidth. The two 10 GbE connections
remained underutilized throughout all tests.
Load balancing was forcibly introduced into the test environment, but could have been skipped without issue
in this case. Had the 7000 storage been driven using 1 GbE links, load balancing would be
recommended.
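A back-of-the-envelope utilization check illustrates why the 10 GbE links never came close to saturation. The ~120 MB/s peak is taken from figure 6.3.3; protocol overhead and duplex effects are ignored here:

```python
peak_mb_per_s = 120       # highest NFS read+write rate observed (figure 6.3.3)
links, link_gbit = 2, 10  # two 10 GbE connections to the storage

capacity_mb_per_s = links * link_gbit * 1000 / 8  # raw capacity: 2500 MB/s
utilization = peak_mb_per_s / capacity_mb_per_s
print(f"{utilization:.1%} of the raw link capacity")  # 4.8%
```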
7.3 Conclusions on scaling Unified Storage CPU power
During the tests, the CPUs inside the 7000 storage were never fully saturated. At 1300 user-simulated
vDesktops, the load on the two CPUs reached 85%, which should be considered near the maximum
performance. In order to scale up further, four CPUs (or 6-core CPUs) would be required.
The HyperTransport bus between the two CPUs showed quite large values (in the order of 1.7 [GByte.sec-1]).
This was partially due to the fact that the two 10 GbE ports both reside on a single PCIe card. This caused all
traffic to be forcibly sent through the HyperTransport bus of CPU0, instead of being load-balanced between
CPU0 and CPU1:
Figure 7.3.1: Sun 7410 Unified Storage HyperTransport bus architecture. In the performance tests
a single PCIe card with dual 10GbE was used. Best practice would be to use two single
port 10GbE PCIe cards using a different HT-Bus (shown in semi-transparency).
7.4 Conclusions on scaling Unified Storage Memory and L2ARC
In order to obtain the best performance from the 7000 Unified Storage, read cache is very important. This
type of storage was primarily selected for its large read cache capabilities. Using linked clones, all replicas
(the full-clone mothers of the linked clones) were directly committed to read cache. For each linked clone
deployed, a small additional amount of read cache was required. The amount of read cache should be
carefully matched to the projected number of vDesktops on the storage device. See chapter 8 for more details.
The L2ARC presents itself in the form of one or more read-optimized Solid State Drives (SSDs). It can be seen
as a direct extension of internal memory. It is important to note, though, that L2ARC storage is about a factor
of 1000 slower than memory. Best practice would be to match internal memory to the required read cache. If
(and only if) the read-cache requirements exceed the physical maximum amount of internal memory, the
L2ARC can be used to reach the required amount.
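This sizing rule can be sketched as follows. It is a hedged sketch: the 0.5 GB per linked clone is an assumption based on the reboot test earlier in this report, and the replica size used in the example is hypothetical.

```python
def required_read_cache_gb(replicas_gb, vdesktops, per_clone_gb=0.5):
    """Read cache needed: the replicas plus a per-linked-clone increment."""
    return replicas_gb + vdesktops * per_clone_gb

def split_arc_l2arc(required_gb, ram_gb):
    """Serve as much as possible from RAM (ARC); spill the rest to L2ARC SSDs."""
    arc = min(required_gb, ram_gb)
    l2arc = max(0.0, required_gb - ram_gb)
    return arc, l2arc

need = required_read_cache_gb(replicas_gb=20, vdesktops=800)  # hypothetical replica size
print(split_arc_l2arc(need, ram_gb=128))  # (128, 292.0)
```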
7.5 Conclusions on scaling Unified Storage LogZilla SSDs
The LogZilla devices enable the 7000 Unified Storage to quickly acknowledge synchronous writes to the
storage device. The metadata of a write is stored in the LogZilla and the write itself in the ARC. Finally,
the write is committed to disk from the ARC and the metadata in the LogZilla is flagged as handled.
In normal operation, the LogZilla is never read from. Only on recovery (for example after power loss) is the
LogZilla read, and the ZFS file system is returned to a consistent state using the metadata in the LogZilla
that was not yet flagged as handled.
In effect, the addition of a LogZilla greatly lowers the write latency of the storage device. The
performed tests show that the LogZilla really helps to keep write latency to a minimum.
Each LogZilla is able to perform 10.000 [WOPS]. When the projected number of writes is larger than 10.000
[WOPS], adding LogZillas could help. Note, though, that adding a second LogZilla will not help
performance-wise: the Unified Storage will place both LogZillas in a RAID1 configuration. This RAID1
configuration does help in ensuring performance: a LogZilla may fail and the storage device will keep working
normally, whereas with a single LogZilla the synchronous writes would have to be written to disk directly if
the LogZilla fails, clipping performance.
Using four LogZilla devices does increase the number of WOPS a single storage device can perform: the
Unified Storage will put four LogZillas into a RAID10 configuration, effectively able to perform
20.000 [WOPS].
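The scaling rules above can be summarized in a small sketch, assuming (as stated above) 10.000 WOPS per device and that the Unified Storage always mirrors LogZillas in pairs:

```python
LOGZILLA_WOPS = 10_000  # per device, as stated above

def effective_wops(logzillas):
    """Usable synchronous-write ops/sec; mirroring adds redundancy, not WOPS."""
    if logzillas == 0:
        return 0
    pairs = max(1, logzillas // 2)  # a single device runs unmirrored
    return pairs * LOGZILLA_WOPS

print(effective_wops(1))  # 10000 -- but writes fall back to disk if it fails
print(effective_wops(2))  # 10000 -- RAID1 mirror: same WOPS, with redundancy
print(effective_wops(4))  # 20000 -- RAID10: two mirrored pairs
```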
7.6 Conclusions on scaling Unified Storage SATA storage
Throughout the tests, the number of SATA ROPS and WOPS remained consistently low. This is
due to the way ZFS works: ZFS aims to read most (if not all) data from the ARC and L2ARC, and ZFS combines
and reorders small random writes into very large blocks, converting the small random writes into large
sequential writes. This way of working minimizes ROPS and performs only a few large sequential writes to
SATA (see also Reference [3]).
Given the fact that a single