8/8/2019 Performance Report Sun Unified Storage and VMware View 1.0
1/74
Performance Report
VMware View linked clone performance
on Sun's Unified Storage
Author: Erik Zandboer
Date: 02-04-2010
Version 1.00
Table of contents
1 Management Summary ................................................................................................................. 6
1.1 Introduction ........................................................................................................................ 6
1.2 Objectives ........................................................................................................................... 6
1.3 Results ................................................................................................................................ 6
2 Initial objective ............................................................................................................................. 7
2.1 VMware View ....................................................................................................................... 7
2.2 Storage requirements .......................................................................................................... 7
3 Technical overview of the solutions .............................................................................................. 8
3.1 VMware View linked cloning ................................................................................................ 8
3.2 Sun Unified Storage ............................................................................................................. 8
3.3 Linked cloning technology combined with Unified Storage ................................................... 9
4 Performance test setup ............................................................................................................... 10
4.1 VMware ESX setup ............................................................................................................. 10
4.2 VMware View setup ........................................................................................................... 11
4.3 Windows XP vDesktop setup .............................................................................................. 11
4.4 Unified Storage setup ........................................................................................................ 12
5 Tests performed ......................................................................................................................... 13
5.1 Test 1: 1500 idle vDesktops .............................................................................................. 13
5.2 Test 2: User load simulated linked clone desktops ............................................................. 13
5.3 Test 2a: Rebooting 100 vDesktops in parallel .................................................................... 13
5.4 Test 2b: Recovering all vDesktops after storage appliance reboot ...................................... 13
5.5 Test 3: User load simulated full clone desktops ................................................................. 14
6 Test results ................................................................................................................................ 15
6.1 Test Results 1: 1500 idle vDesktops .................................................................................. 15
6.1.1 Measured Bandwidth and IOP sizes ................................................................................ 16
6.1.2 Caching in the ARC and L2ARC ...................................................................................... 20
6.1.3 I/O Latency ................................................................................................................... 22
6.2 Test Results 2: User load simulated linked clone desktops ................................................ 24
6.2.1 Deploying the initial 500 user load-simulated vDesktops ............................................... 25
6.2.2 Impact of 500 vDesktop deployment on VMware ESX ..................................................... 31
6.2.3 Impact of 500 vDesktop deployment on VMware vCenter and View ................................ 34
6.2.4 Deploying vDesktops beyond 500 .................................................................................. 36
6.2.5 Performance figures at 1300 vDesktops ......................................................................... 40
6.2.6 Extrapolating performance figures ................................................................................. 47
6.3 Test Results 2a: Rebooting 100 vDesktops ........................................................................ 54
6.4 Test Results 2b: Recovering all vDesktops after storage appliance reboot........................... 58
6.5 Test Results 3: User load simulated full clone desktops ..................................................... 62
7 Conclusions ............................................................................................................................... 65
7.1 Conclusions on scaling VMware ESX ................................................................................... 65
7.2 Conclusions on scaling networking between ESX and Unified Storage ................................. 66
7.3 Conclusions on scaling Unified Storage CPU power ............................................................ 67
7.4 Conclusions on scaling Unified Storage Memory and L2ARC ............................................... 68
7.5 Conclusions on scaling Unified Storage LogZilla SSDs ........................................................ 68
7.6 Conclusions on scaling Unified Storage SATA storage ........................................................ 69
8 Conclusions in numbers ............................................................................................................. 70
9 References ................................................................................................................................. 72
Appendix 1: Hardware test setup ...................................................................................................... 73
Appendix 2: Table of derived constants ............................................................................................ 74
People involved
Name Company Responsibility E-Mail
Erik Zandboer Dataman B.V. Sr. Technical Consultant [email protected]
Simon Huizenga Dataman B.V. Technical Consultant [email protected]
Kees Pleeging Sun Project leader [email protected]
Cor Beumer Sun Storage Solution Architect [email protected]
Version control
Version Date Status Description
0.01 11-02-2010 Initial draft Initial draft for internal (Dataman / Sun) review
0.02 12-03-2010 Final draft Adjusted minor review comments; added conclusions and derived constants
1.0 02-04-2010 Release Final minor changes; revised the items added in 0.02
Abbreviations and definitions
Abbreviation Description
VM Virtual Machine. Virtualized workload on a virtualization platform (such as VMware ESX)
GbE Gigabit Ethernet. Physical network connection at Gigabit speed.
IOPS I/O Operations Per Second. The combined number of read and write commands issued to a
storage device per second. Take note that the ratio between reads and writes cannot be
extracted from this value, only the sum of the two. Also see ROPS and WOPS.
OPS Operations Per Second. A more general term, closely related to IOPS.
ROPS Read Operations Per Second. The number of read commands performed on a storage
device per second.
WOPS Write Operations Per Second. The number of write commands performed on a storage
device per second.
TPS Transparent Page Sharing. A feature unique to VMware ESX, where several memory pages
can be identified as containing equal data, and then stored only once in physical memory,
effectively saving physical memory. It is in most respects comparable to data deduplication.
SSD Solid State Drive. Normally indicates a non-volatile storage device with no moving parts. It can be a flash drive (like the ReadZilla device), but it can also be a
battery-backed (and optionally flash-backed) RAM drive (like the LogZilla device).
KB KBytes. Also seen in conjunction with /s or .sec-1, which denotes KBytes per second.
MB MBytes. Also seen in conjunction with /s or .sec-1, which denotes MBytes per second.
Mb Mbits. Also seen in conjunction with /s or .sec-1, which denotes Mbits per second.
vDesktop Virtualized Desktop. A Virtual Machine (VM) running a client operating system such as
Windows XP.
ave Average. Shorthand used in graphs to indicate the value is an averaged value.
HT, HTx Hyper Transport bus. High bandwidth connection between CPUs and I/O devices on
mainboards. Often indicated with numbers (HT0, HT1) to indicate specific connections.
UFS Unified Storage (Device). Storage device which is capable of delivering the same data using
multiple protocols.
1 Management Summary

1.1 Introduction
Running virtual desktops (vDesktops) puts a lot of stress on storage systems. Conventional storage systems are easily scaled to the right size: a number of disks delivers a certain capacity and performance.
In an effort to tackle the need for a large number of disks in a virtualized desktop (vDesktop) environment, Dataman started to analyze the basic needs of a vDesktop storage solution based on VMware linked cloning technology. The new Sun Unified Storage (UFS) solution (see reference [4]) appeared to have a significant head start in delivering high vDesktop performance with a small number of disks.
Because of the unconventional way this storage solution works, it is next to impossible to calculate performance numbers up front; how the Unified Storage performs depends heavily on the workload offered. This is why Dataman teamed up with Sun to run performance tests on these storage devices.
1.2 Objectives
The performance test had several goals:
- To measure the performance impact on the Unified Storage array as more vDesktops were deployed in the environment;
- To examine the impact of vDesktop reboots;
- To extrapolate the measured performance data;
- To project (and avoid) performance bottlenecks;
- To define scaling constants for scaling the environment to a projected number of vDesktops.
The tests were performed in Sun's datacenter in Linlithgow, Scotland. Hardware and housing were generously made available to Dataman for a period of two months, during which all necessary tests were performed.
1.3 Results
The performance tests proved very effective; during the final stages, testing stopped at 1319 user-simulated vDesktops because the VMware environment, having only eight nodes, could not handle any more virtual machines (VMs). At that stage, all vDesktops still performed without any issues or noticeable latency on a single-headed UFS device. Even more remarkable, the environment could have run on only 16 SATA spindles in a mirrored setup! It is the underlying ZFS file system and the intelligent use of memory and Solid State Drives (SSDs) that make all the difference here.
2 Initial objective
After virtualization practically conquered the world for server loads, it now continues on to the desktop. Virtualizing a large number of desktops on a small set of servers has proven to pose its own set of challenges. The one most often encountered is the performance requirement of the underlying storage array. Scaling disks just to satisfy capacity needs has always been bad practice, but it works out especially badly in a virtual desktop environment. Today's large disk capacities do not help either.
2.1 VMware View
One of the leading platforms for delivering virtualized desktops is VMware ESX in combination with VMware View. VMware View can deliver virtual desktops using linked cloning technology, which duplicates desktop images very quickly and is more efficient in terms of storage capacity.
Calculating the number of ESX nodes (cores and memory) is not too hard; it is no different from having fully cloned desktops. But what are the requirements of the underlying storage array?
2.2 Storage requirements
The structure of linked clones poses some challenges to the storage. For reasons explained in the next paragraphs, Sun's 7000 series Unified Storage (see reference [4]) was selected as the platform to drive linked clone loads most efficiently.
The objective of this performance test is to prove that Sun's 7000 series Unified Storage in combination with linked clones gives great performance at little cost.
3 Technical overview of the solutions
In order to better understand the performance test setup and its results, it is important to have some knowledge of the underlying technologies.

3.1 VMware View linked cloning
VMware View is basically a broker between the clients and the virtualized desktops (vDesktops) in the datacenter. The idea is that a single Windows XP image can be used to clone thousands of identical desktops. The broker controls the cloning and customization of these desktops.
VMware View enables an extra feature: linked cloning. When using linked cloning technology, only a small number of fully cloned desktop images exist. All virtual desktops that are actually used are derivatives of these full clone images. In order to differentiate the desktops, all writes to a virtual desktop's disk are captured in a separate file, much like VMware snapshot technology. The result is that many read operations are performed from the few full clones within the environment.
Following VMware best practices, it is recommended to have a maximum of 64 linked clones under every full clone (called a replica).
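The replica count implied by this best practice follows from a quick calculation; the sketch below uses only the 64-clone limit stated above and the 1319-desktop figure reached later in this report:

```python
import math

def replicas_needed(linked_clones: int, clones_per_replica: int = 64) -> int:
    """Number of full-clone replicas required for a given number of linked clones."""
    return math.ceil(linked_clones / clones_per_replica)

# For the 1319 vDesktops reached in the final test stage:
print(replicas_needed(1319))  # 21
```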
3.2 Sun Unified Storage
Sun's Unified Storage uses the ZFS file system internally. There are some very specific differences from just about any other file system. It is far beyond the scope of this document to dive deep into ZFS, so only some features of these appliances will be discussed.
Sun's Unified Storage appliances have a lot of CPU power and memory compared to most competitors. The CPU power is required to drive the ZFS file system appropriately, and the memory helps cache data. This caching is partly the key to the extreme performance of the appliance, even with relatively slow SATA disks. The use of Solid State Drives (SSDs) further enhances the performance of the appliance: read SSDs (called ReadZillas) basically extend the appliance's memory, and logging SSDs (called LogZillas) help synchronous writes to be acknowledged faster (the effect appears somewhat similar to write caching, but the technology is very different).
3.3 Linked cloning technology combined with Unified Storage
The basic idea of using Sun's Unified Storage for linked cloned desktops came from two directions. First, a storage device with a lot of cache was needed, in order to be able to keep the replicas (full clone images) in cache. Secondly, the barrier of 64 linked clones per replica limited the effectiveness of the cache, since one replica is needed for every 64 linked clones. This limit applies to storage devices having LUNs with VMFS (the VMware file system for storing VMs) on them; LUN queuing, LUN locking and some other artifacts come into play here.
But when using NFS for storage instead of iSCSI or FC, the 64-linked-clones-per-replica barrier could possibly be broken: NFS has no issues with a thousand or more open files accessed in parallel. Since Sun's Unified Storage is also able to deliver NFS, Sun's storage device appeared to be the right choice.
4 Performance test setup
The performance test was set up in Sun's test laboratory in Linlithgow, Scotland. Sun made a number of servers, a Sun 7410 Unified Storage device and the necessary switching components available. The total hardware setup can be viewed in appendix 1.
4.1 VMware ESX setup
A total of nine servers were available for VMware ESX. Eight were used for virtual desktop loads; the ninth server was used for all other required VMs such as vCenter, SQL, View and Active Directory. The specifications of the servers used:
8x Sun X4450 with 4x 6-core Intel CPU (2.6GHz), 64GB memory
1x Sun X4450 with 4x 4-core Intel CPU (2.6GHz), 16GB memory
All nodes were connected with a single GbE NIC to the management network, a single NIC to a VMotion network, and with a third Ethernet NIC to an isolated client network where the Windows XP virtual desktops could connect to Active Directory / file serving.
The eight nodes carrying virtual desktop loads were also connected to an NFS storage network using two GbE interfaces. All these interfaces were connected to a single GbE switch.
ESX 3.5 update 5 was used to perform the tests. The setup was kept at defaults; console memory was increased to 800MB (the maximum). In order to make sure both GbE connections to the storage array would be used, two different subnets were used towards the array, each subnet accessed by its own VMkernel interface. Each VMkernel interface in its turn was connected to one of the two GbE interfaces, guaranteeing static load balancing across both interfaces for every host.
To be able to house the maximum number of VMs possible on a single vSwitch, the port count of the vSwitch was increased to 248 ports.
4.2 VMware View setup
For managing the desktops, a Windows 2003 64-bit Enterprise Edition template was created. From this template, five VMs were derived:
1) Microsoft SQL 2005 Standard server with SP3;
2) Domain controller with DNS and file sharing enabled;
3) VMware vCenter 2.5 update 5;
4) VMware View 3.1.2;
5) VMware Update Manager.
During the tests, all these VMs were constantly monitored to guarantee that any limits found in the performance tests were not due to limitations within these VMs.
All ESX nodes involved in carrying vDesktops were put in a single VMware cluster, which was kept at defaults. A single Resource Pool was created within the cluster (at defaults) to hold all vDesktops during the tests.
4.3 Windows XP vDesktop setup
The Windows XP image used was a standard Windows XP installation with SP2 integrated. PSTools was installed inside the image, in order to be able to start and stop applications in batches, to simulate a simple user load on the vDesktops. No further tuning was done to the image.
Within VMware the images were configured with an 8GB disk, a single vCPU and 512MB of memory.
User load was simulated by using autologon on the vDesktop, after which a batch file was started. This batch file performed standard tasks with built-in delays. Examples of the tasks were:
- Starting MSPaint, which loads an image from the Domain Controller/File server;
- Starting Internet Explorer;
- Starting MSinfo32;
- Unzipping putty.zip to a local directory, then deleting it again;
- Starting Solitaire;
- Stopping all applications again.
These actions were fixed in order and delay. The delays were tuned until each vDesktop delivered an average load of 300MHz and just about 6 IOPS (this is accepted as being a lightweight user). In this user load, a rather high write load was introduced (of every 6 IOs, 5 are writes). This is considered to be a worst-case IO distribution for a vDesktop, making it a perfect setup for storage performance testing.
Checking the performance of the XP desktops was not a primary objective of the performance tests; however, after each test a few randomly chosen vDesktops were accessed and the introduction to Windows XP was started to check the fluidity of the animation, making sure the desktops were still responsive.
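The tuning target described above (about 6 IOPS per desktop, 5 of every 6 being writes) can be turned into an aggregate projection for the storage array. A back-of-the-envelope sketch, not part of the original test tooling:

```python
def aggregate_load(desktops: int, iops_per_desktop: float = 6.0, write_fraction: float = 5 / 6):
    """Project total, write and read IOPS hitting the storage for N lightweight users."""
    total = desktops * iops_per_desktop
    writes = total * write_fraction
    return total, writes, total - writes

# At the 1300-vDesktop level reached in test 2:
total, wops, rops = aggregate_load(1300)
print(round(total), round(wops), round(rops))  # 7800 6500 1300
```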
4.4 Unified Storage setup
The Sun 7410 Unified Storage device was connected to the storage switch using two 10GbE interfaces. Only a single head was used in the performance test, connected to 136 underlying SATA disks in six trays. In four of the trays a LogZilla was present; in total, two LogZillas (2x 18[GB]) were assigned to the 7410 head. Inside the 7410 head itself, two ReadZillas were available (2x 100[GB]). All SATA storage (apart from some hot spares) was mirrored (at the ZFS level). With a drive size of 1TB, this effectively delivers 60TB of total storage.
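The 60TB figure can be sanity-checked from the drive count: with 1TB drives in ZFS mirrors, usable capacity is (drives − spares) / 2. The 16-spare count in the sketch below is an inference from the stated 60TB, not a number given in the report:

```python
def mirrored_usable_tb(total_drives: int, hot_spares: int, drive_tb: int = 1) -> float:
    """Usable capacity of a ZFS mirrored pool, ignoring filesystem overhead."""
    return (total_drives - hot_spares) / 2 * drive_tb

# 136 SATA drives; 60TB usable implies 16 drives held back as hot spares.
print(mirrored_usable_tb(136, 16))  # 60.0
```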
The 7410 itself was configured with two Quad-Core AMD Opteron 2356 processors and 64[GB] of memory. A single dual-port 10GbE interface was added to the system for connection to the storage network. A third link (1GbE) was introduced for management inside the management network.
During configuration, two shares were created, each having its own IP address on its own 10GbE uplink. This ensures static load balancing for the ESX nodes, and also ensures the load is evenly spread over both 10GbE links on the storage unit. Jumbo frames were not enabled anywhere in the tests.
In order to be able to measure the usage of the HyperTransport busses inside the 7410, a script was inserted into the unit to measure these loads.
5 Tests performed
A total of three tests were performed. The first test loaded 1500 idle vDesktops in linked clone mode onto the storage. In the second test an attempt was made to load as many user-load-simulated vDesktops onto the testing environment as possible, in steps of 100 vDesktops. The third and final test was equal to the second test, but using full clones from VMware View.
For all tests both NFS shares were used. VMware View automatically balances the number of VMs equally across all available stores.
5.1 Test 1: 1500 idle vDesktops
In the first test, VMware View was simply instructed to deploy 1500 Windows XP images from a single source image. The resulting images were not performing any user load simulation; they were booted and then left idle. This test was performed to get a general idea of the load on ESX and storage required for this number of VMs.
5.2 Test 2: User load simulated linked clone desktops
After the initial test mentioned in 5.1, the test was repeated, now with user-load-simulated desktops. The test was performed in steps, with an additional 100 vDesktops at every step. The steps were repeated until a limitation in storage, ESX and/or the external environment was met.
5.3 Test 2a: Rebooting 100 vDesktops in parallel
As test 2 (5.2) reached the 1000 vDesktop mark, a hundred vDesktops were rebooted in parallel. This test was performed to simulate a real-life scenario, where a group of desktops is rebooted in a live environment. The impact on the storage device especially is to be monitored.
5.4 Test 2b: Recovering all vDesktops after storage appliance reboot
As test 2 (5.2) reached its maximum, the storage array was forcibly rebooted. This was not really part of the performance test, yet it was interesting to see the recovery process of the storage array, and the recovery of the VMs on it.
5.5 Test 3: User load simulated full clone desktops
Using full clones on a Sun 7000 storage device was not expected to work as efficiently as a linked cloning configuration. In this test a number of full clone desktops were deployed, 25 vDesktops per step.
6 Test results
The test results are described on a per-test basis. The initial 1500 idle-running vDesktop test is also used as a general introduction to the behavior of the storage device, the solid state drives and the observed latencies.

6.1 Test Results 1: 1500 idle vDesktops
As an initial test, 1500 idle-running, linked-cloned vDesktops were deployed onto the test environment. After the system had settled, there was first proof that the storage device was able to cope with at least 1500 idle vDesktop loads.
6.1.1 Measured Bandwidth and IOP sizes
The NFS bandwidth used while running this workload is shown in figure 6.1.1:
Figure 6.1.1: Running 1500 idle desktops, about 22MB/s of writes and 10MB/s of reads are observed.
The fact that about twice as much data is written as read is probably due to the vDesktops running idle (few reads taking place), while each vDesktop has only 512[MB] of memory, causing them to use their local swap files and write out to the storage device.
[Chart: NFS rate [MB.sec-1] vs. time [sec] — NFS read and write MBs (1500 idle-running vDesktops); series: NFS writes ave [MB/sec], NFS reads ave [MB/sec]]
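Dividing the observed totals by the number of desktops gives a feel for the per-desktop footprint of an idle vDesktop. This is arithmetic on the chart values above, not a measured per-VM figure:

```python
def per_desktop_kb_per_sec(total_mb_per_sec: float, desktops: int) -> float:
    """Per-desktop bandwidth in KB/s derived from an aggregate MB/s figure."""
    return total_mb_per_sec * 1024 / desktops

print(round(per_desktop_kb_per_sec(22, 1500), 1))  # 15.0 KB/s written per idle desktop
print(round(per_desktop_kb_per_sec(10, 1500), 1))  # 6.8 KB/s read per idle desktop
```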
As both bandwidth and the number of IOPS have been measured, it is easy to derive the average block size of the NFS reads and writes:
Figure 6.1.2: Average NFS read and write block sizes observed
Since VMware ESX will try to concatenate sequential reads and writes whenever possible, it is very likely that the writes are completely random (the NTFS 4K block size appears to be the determining factor here). The read operations are bigger on average, probably meaning some quasi-sequential reads are going on.
[Chart: Average NFS blocksize [KB] vs. time [sec] — Average NFS read and write blocksizes (1500 idle-running vDesktops); series: NFS write blocksize [KB], NFS read blocksize [KB]]
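The derivation is simply bandwidth divided by operations per second. A sketch of the arithmetic, using the approximate write-side numbers from figures 6.1.1 and 6.1.3 (about 22MB/s at just under 5000 WOPS), which lands close to the NTFS 4K block size:

```python
def avg_blocksize_kb(bandwidth_mb_per_sec: float, ops_per_sec: float) -> float:
    """Average block size in KB derived from bandwidth and operations per second."""
    return bandwidth_mb_per_sec * 1024 / ops_per_sec

print(round(avg_blocksize_kb(22, 5000), 1))  # 4.5 KB per NFS write
```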
Since all writes to the storage device are synchronous and have very small block sizes, all writes are put into the LogZilla devices before they pass on to SATA. As the data to be written traverses these stages, the number of WOPS becomes smaller with every step:
Figure 6.1.3: Number of write operations observed through the three stages
Here it becomes obvious how effective the underlying ZFS file system is. The completely random write load, which consists of nearly 5000 write operations per second, is converted in the last stage (SATA) to just over 30 write operations per second. ZFS effectively converts the tiny random NFS writes into large sequential blocks, dealing with the relatively poor seek times of the physical SATA drives.
[Chart: Write operations [sec-1] vs. time [sec] — Comparing write OPS through stages (1500 idle-running vDesktops); series: NFS WOPS [/sec], LogZilla WOPS [/sec], SATA WOPS [/sec]]
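The coalescing effect can be quantified from the same figures: nearly 5000 NFS WOPS are reduced to just over 30 SATA WOPS while write bandwidth stays around 22MB/s, implying large sequential writes to disk. Back-of-the-envelope arithmetic on the reported numbers:

```python
def coalescing_factor(front_wops: float, back_wops: float) -> float:
    """How many small front-end writes are merged into one back-end write, on average."""
    return front_wops / back_wops

def implied_write_kb(bandwidth_mb_per_sec: float, wops: float) -> float:
    """Average size of each back-end write in KB at the given bandwidth."""
    return bandwidth_mb_per_sec * 1024 / wops

print(round(coalescing_factor(5000, 30)))  # ~167 NFS writes per SATA write
print(round(implied_write_kb(22, 30)))     # ~751 KB per SATA write
```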
The write operations are effectively being dealt with. For reads, the following is observed on the SATA drives:
Figure 6.1.4: Observed SATA read operations per second.
At an average read bandwidth of 10[MB.sec-1] (see figure 6.1.1), fewer than 0.3 read operations per second (ROPS) are observed on the SATA drives. This raises the suspicion that most (in fact almost all) read operations are served by the read cache (ARC or L2ARC), and only very few reads actually originate from the SATA drives, effectively boosting the overall read performance of the Sun 7000 storage device.
[Chart: SATA read operations [sec-1] vs. time [sec] — SATA IOPS read ave [/sec]; series: SATA ROPS [/sec]]
6.1.2 Caching in the ARC and L2ARC
Zooming in on the read performance, we need to look more closely at the read caching going on. In figure 6.1.5 it is obvious that the ARC (64[GB] minus overhead) was saturated, while the L2ARC (200[GB]) was only filled up to about 70[GB]:
Figure 6.1.5: Running 1500 idle desktops, the ARC shows fully filled while the L2ARC flash drives vary in usage around 64[GB].
[Chart: ARC/L2ARC size [MB] vs. time [sec] — ARC / L2ARC size (1500 idle-running vDesktops); series: ARC datasize [MB], L2ARC datasize [MB]]
The ARC/L2ARC not being saturated should mean that all actively read data still fits into memory (ARC) or
Readzilla (L2ARC). This is clearly shown in figure 6.1.6, where the number of ARC hits show to be much larger
than the number of ARC misses:
Figure 6.1.6: Running 1500 idle desktops, the ARC hits show around 7000 per second while the
ARC misses show up at about 250. This is an indication of the effectiveness of the
(L2)ARC while running this specific workload.
While read operations appear to be properly served from the ARC or L2ARC, write operations must be committed to the disks at some point. The NFS writes are synchronous, meaning that each write operation must be guaranteed to be saved by the storage device before the operation is acknowledged. This would normally mean poor write performance, since the underlying disks are relatively slow SATA drives.
This problem is countered by the use of LogZilla devices. These devices are write-optimized solid state disks (SSDs), which store the write operation metadata and acknowledge the write immediately, before it is actually committed to disk. As soon as the write is committed to SATA storage, the metadata entry is removed from the LogZilla (this is the reason it is called a LogZilla and not a write cache; the LogZilla is only there to make sure the dataset does not become inconsistent when, for example, a power outage occurs).
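The write path described above can be sketched conceptually. This is not the actual ZFS/LogZilla implementation, and all names are hypothetical; it is only a minimal illustration of "acknowledge from the intent log first, commit to the slow disks later":

```python
# Conceptual sketch of a synchronous write path with an intent log.
# Hypothetical names; not the real ZFS code.
class IntentLog:
    """Stands in for a LogZilla device: fast, persistent log entries."""
    def __init__(self):
        self.entries = []

    def record(self, op):
        self.entries.append(op)  # fast SSD write
        return "ack"             # write acknowledged immediately


log = IntentLog()
sata_blocks = {}   # stands in for the slow SATA pool
pending = []       # writes queued for the next flush

def sync_write(op):
    ack = log.record(op)  # persistence is guaranteed before the ack
    pending.append(op)    # the actual commit to SATA happens later
    return ack

def flush():
    # ZFS flushes pending writes to disk at least every 30 seconds;
    # once the data is on SATA, the log entries are no longer needed.
    for op in pending:
        sata_blocks[op["block"]] = op["data"]
    log.entries.clear()
    pending.clear()

sync_write({"block": 7, "data": "x"})
flush()
print(sata_blocks)  # data on disk, log empty again
```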
[Chart: ARC hits / misses (1500 idle-running vDesktops); ARC hits [/sec] and ARC misses [/sec] plotted against time [sec]]
The underlying ZFS file system flushes the writes to disk at least every 30 seconds. ZFS performs random writes to the SATA disks very effectively, coalescing them into one big sequential write whenever possible. This can be verified from the graph in figure 6.1.3.
6.1.3 I/O Latency
Besides read and write performance, it is also necessary to look at storage latency. Latency is the delay between a request to the storage and the answer back: for a read, typically the time from the read request to the delivery of the data; for a write, typically the time from the write request to the write acknowledgement.
Performance is best when latency is minimal. To graph latency through time, a three-dimensional graph is required. The functions of the different axes are:
- Horizontal axis: Time;
- Vertical axis: Number of Read and/or Write Operations;
- Depth axis: Latency.
Latency is grouped into ranges instead of unique values. This enables the creation of 3D graphs, because it is now possible to see groups of IOPS which fall within a certain latency range.
Since in many cases almost all latency falls within the lowest group of 0-20[ms], graphs are often zoomed in, with the number of IOPS (vertical axis) clipped to a low number. As a result, the peaks of the 0-20[ms] latency group go off the chart. This gives room for a clearer view of the higher latency groups. Note that these graphs do not give a total overview of the number of IOPS performed; they merely give insight into the tiny details which are almost invisible in the original (non-zoomed) graph.
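The grouping into latency ranges can be sketched as a simple bucketing step. The sample values below are made up for illustration:

```python
# Group individual latency samples into fixed ranges (here 20 ms wide),
# as done for the 3D latency graphs. The sample values are hypothetical.
def bucket_latencies(latencies_ms, bucket_width_ms=20):
    """Return a {range_start_ms: count} mapping for the samples."""
    buckets = {}
    for lat in latencies_ms:
        start = int(lat // bucket_width_ms) * bucket_width_ms
        buckets[start] = buckets.get(start, 0) + 1
    return buckets

samples = [1.2, 3.5, 14.0, 19.9, 22.0, 45.0, 99.0]
print(bucket_latencies(samples))
# most samples fall in the 0-20[ms] group, as in the graphs
```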
In figure 6.1.7a (with its zoomed counterpart 6.1.7b) the latency graph is displayed for NFS Read Operations with 1500 idle-running vDesktops. Almost all operations fall within the 0-20[ms] latency group. Only in the zoomed graph (figure 6.1.7b) can some higher latencies be observed. However, these are so small in number compared to the IOPS within the 0-20[ms] latency group that very little impact is to be expected from them.
The read operations that required more time to complete are probably the ARC/L2ARC cache misses, which had to be read from SATA. These SATA reads are the reads observed in figure 6.1.4.
Figure 6.1.7a: Observed latency in NFS reads. Most read operations are served within 20[msec]
Figure 6.1.7b: Detail of latency in NFS read operations. Clipped at only 20 OPS to visualize higher
latency read operations.
[Chart: NFS Read Latency (1500 idle-running vDesktops); NFS Read Operations [sec-1], grouped by latency]
[Chart: NFS Read Latency ZOOMED (1500 idle-running vDesktops); vertical axis clipped at 20 operations per second]
6.2 Test Results 2: User load simulated linked clone desktops
After the initial test with idle-running desktops, the environment was reset. A new Windows XP image was introduced, which generates a lightweight user load pattern:
- 200[MHz] CPU load;
- 300[MB] active memory;
- 7 observed NFS IOPS.
The memory and CPU load were deliberately kept low, so that a maximum number of VMs would fit onto the virtualization platform. The number of IOPS was matched to the accepted industry average of 5 - 5.6 IOPS, with a calculated 150% overhead factor for linked cloning technology (see reference [1] for an explanation of the 150% factor).
6.2.1 Deploying the initial 500 user load-simulated vDesktops
When deploying the initial 500 vDesktops, the effect of the deployment was clearly reflected in several graphs. In figure 6.2.1 the ARC + L2ARC size grows almost linearly during deployment:
Figure 6.2.1: Observed ARC/L2ARC data size growth when deploying the first 500 desktops.
During the deployment of the very first vDesktops, the ARC immediately fills with both replicas (a replica is
the full-clone image from which the linked clones are derived). There are two replicas, because two NFS
shares were used, and VMware View places one replica on each share. In the leftmost part of the graph it is
actually identifiable that both replicas are put into the ARC one by one.
After this initial action, the ARC continues to fill, because the newly created linked clones are also read back. Since every vDesktop behaves identically, the reads performed on the linked clones are identical as well, which explains the near-linear growth.
[Chart: ARC / L2ARC datasize (0 - 500 userloaded vDesktops); ARC datasize [MB] and L2ARC datasize [MB] against time]
At the right of figure 6.2.1, the ARC fills up to its memory limit of 64[GB] minus the Storage 7000 overhead. It
is not until this time that the L2ARC starts to fill in the same linear manner as the ARC did. It becomes clear
that the L2ARC behaves as a direct (though somewhat slower) extension of the ARC (which resides in
memory).
When looking at ARC hits and misses in figure 6.2.2, it becomes clear that more and more read operations
are performed throughout the deployment:
Figure 6.2.2: Observed ARC hits and misses while deploying the initial 500 user loaded vDesktops.
The graph in figure 6.2.2 clearly shows the growing number of ARC hits. The ARC misses hardly increase at
all. This means that as more vDesktops are deployed, the effectiveness of the read cache mechanism
increases.
[Chart: ARC hits/misses (0 - 500 userloaded vDesktops); ARC hits [/sec] and ARC misses [/sec]]
Figure 6.2.3: Consumed NFS bandwidth during deployment of the initial 500 vDesktops
In figure 6.2.3 it is clearly visible that the first 500 vDesktops were deployed in batches of 100. During the
linked cloning deployment, consumed NFS bandwidth is clearly higher than during normal running periods.
[Chart: NFS bandwidth consumed (0-500 userloaded vDesktops); NFS read and write averages [MB/sec]]
Figure 6.2.4: SATA Read- and Write Operations observed during the deployment of the initial 500
vDesktops. Note that the vertical scale has been extended to -2 in order to clearly
display the Read Operations, which run over the vertical axis itself.
Figure 6.2.4 shows that the SATA Write Operations increase with the number of vDesktops running. The Read
Operations remain at a minimum level, without any measurable increase. This is in line with figure 6.2.2
showing that the read cache gets more effective with a growing number of deployed vDesktops.
[Chart: SATA Read- and Write Operations (0 - 500 userloaded vDesktops); SATA WOPS [/sec] and SATA ROPS [/sec]]
The write operations to SATA are synchronous, and get accelerated by the LogZillas. The graph in figure 6.2.5
shows the WOPS to the LogZilla devices:
Figure 6.2.5: Write Operations to the LogZilla device(s).
[Chart: LogZilla WOPS ave [/sec] (0-500 userloaded vDesktops)]
The ZFS file system is able to deliver this workload using a very limited number of SATA write operations. A possible downside of the ZFS file system is the large amount of CPU overhead it imposes. See figure 6.2.6 for details on CPU usage of the Sun 7000 storage device:
Figure 6.2.6: CPU usage in the Sun 7000 storage during deployment of 500 user load-simulated
vDesktops
[Chart: Sun Storage 7000 CPU load ave [%] during deployment]
6.2.2 Impact of 500 vDesktop deployment on VMware ESX
As the number of vDesktops increases, the load on VMware ESX and vCenter also increases. See figures 6.2.7, 6.2.8 and 6.2.9 for more details:
Figure 6.2.7: CPU usage within one of the eight VMware ESX hosts during the deployment of the
initial 250 vDesktops. The topmost grey graph is the CPU overhead of VMware ESX.
In figure 6.2.7 the deployment of vDesktops is clearly visible. Each time a vDesktop is deployed and started, a ribbon is added to the graph. Each vDesktop uses the same amount of CPU power, which increases slightly just after deployment (when the VM is booting its operating system).
Figure 6.2.8: Active Memory used by the vDesktops on one of the ESX nodes during the
deployment of the initial 250 vDesktops. The lower red ribbon is ESX memory
overhead due to the Service Console.
Figure 6.2.8 shows the active memory consumed as the vDesktops are deployed on one of the ESX nodes. After each batch of 100 vDesktops, the memory consumption stops increasing, then slightly decreases. This effect is caused by two things:
1) Freeing up tested memory within the VMs (Windows VMs touch all memory during their memory test);
2) VMware's Transparent Page Sharing technology.
As the VMs settle on the ESX server, ESX starts to detect identical memory pages, effectively deduplicating them (item 2 on the list above). This feature can save a lot of physical memory, especially when deploying many (almost) identical VM workloads.
Figure 6.2.9: Physical memory shared between vDesktops thanks to VMware's Transparent Page
Sharing (TPS) function within VMware ESX.
Transparent Page Sharing (TPS) effects become clearer when looking at the graph in figure 6.2.9. As VMs are
added to the ESX server, more memory pages are identified as being duplicates, saving more and more
physical memory.
6.2.3 Impact of 500 vDesktop deployment on VMware vCenter and View
VMware vCenter and VMware View are not directly involved in delivering the vDesktop workloads, but they play an important role during the deployment of new vDesktops. The CPU loads on these machines clearly show the deployment of the batches of vDesktops:
Figure 6.2.10: Observed CPU load on the (dual vCPU) vCenter server during vDesktop deployment.
Note the dual y-axis descriptions; some values are percentages, others are [MHz].
In figure 6.2.10, the deployment batches can be clearly identified. After each batch, the vCenter server settles at a slightly higher CPU load. This is caused by the growing number of VMs to manage and monitor within the entire ESX cluster.
Figure 6.2.11: Observed CPU load on the VMware View server during vDesktop deployment.
Note the dual y-axis descriptions; some values are percentages, others are [MHz].
The VMware View server shows much the same characteristics as the VMware vCenter server: higher CPU loads during the batch deployment of vDesktops, settling somewhat higher after each batch.
6.2.4 Deploying vDesktops beyond 500
After the successful deployment of the initial 500 vDesktops, further batches of 100 vDesktops were deployed. The goal was to fit as many vDesktops onto the testbed as possible, while keeping track of all potential performance boundaries.
The largest number of vDesktops that could be deployed was 1319. At this point VMware stopped deploying more vDesktops because the ESX servers were running out of vCPUs. Within ESX version 3.5, the number of VMs that can run on a single node is fixed at a maximum of 170. This maximum was reached just before ESX physical memory ran out:
Figure 6.2.12: ESX node resource usage when deploying 1300 vDesktops
As the graph in figure 6.2.12 shows, memory and CPU usage grew at almost the same rate. The limit on the number of running VMs, the memory limit and the CPU limit were all reached almost simultaneously.
[Chart: VMware ESX node resource usage (0 to 1300 vDesktops); node CPU and node memory usage (average) [%] against number of deployed user-simulated vDesktops]
Due to the nature of the ZFS file system, the CPU load on the storage device was a concern. The
measured values can be found in figure 6.2.13:
Figure 6.2.13: CPU load on the 7000 storage during deployment of 1300 vDesktops. Note the HT0
value. This is the HT-bus between the two quad core CPUs inside the storage device.
The relaxation points at 600-700 and 1200 vDesktops were due to settling of the
environment during weekends.
As shown, the CPU load on the storage device is quite high, but not near saturation yet. The HT0 bus
displayed here was the one HT-bus having the biggest bandwidth usage. This is due to the fact that a
single, dual-channel PCI-e 10GbE card was used in the environment. The result of this was that the
second CPU had to transport all of its data to the first CPU in order to be able to get its data in and out
of the 10GbE interfaces. Note that the design could have been optimized here to use two separate
10GbE cards, each on PCI-e lanes that use a different HT-bus. This would have resulted in a better
load balancing across CPUs and HyperTransport busses. See figure 7.3.1 for a graphical representation
of this.
[Chart: 7000 Storage CPU resources (0 to 1300 vDesktops); 7410 CPU load [%] and HT0/socket1 HyperTransport bus bandwidth usage [GB.sec-1] against number of deployed user-simulated vDesktops]
The memory consumption of the 7000 storage is directly linked to the amount of read cache used. As the number of vDesktops increases, the ARC (memory cache) fills up. At about 450 vDesktops, the ARC reaches its 64[GB] limit and the L2ARC (solid state drives) starts to fill (see figure 6.2.14):
Figure 6.2.14: Memory usage on the 7000 storage during the deployment of 1300 vDesktops. Note
the L2ARC (SSD drive) starting to fill as the ARC (memory) saturates. The relaxation
between 600 and 700 vDesktops is due to a stop of deploying during a weekend
(ARC flushing occurred through time as the vDesktops settled in their workload).
The L2ARC finally settled at just about 100[GB] of used space (on the testbed there was a total of
200[GB] of ReadZilla available).
[Chart: 7000 Storage memory usage (0 to 1300 vDesktops); 7410 L2ARC [GB], ARC [GB] and kernel use [GB] against number of deployed user-simulated vDesktops]
The networking bandwidth and IOPS used by the testbed are displayed in figure 6.2.15:
Figure 6.2.15: NFS traffic observed during the deployment of 1300 vDesktops.
The dips in the graph at 600/700 and 900/1000 vDesktops are in fact weekends; the vDesktops settled into their behavior, which shows in the graph in figure 6.2.15.
[Chart: NFS traffic (0 to 1300 vDesktops); NFS IOPS [sec-1], NFS reads and NFS writes [MB.sec-1] against number of deployed user-simulated vDesktops]
6.2.5 Performance figures at 1300 vDesktops
The system saturated at 1300 vDesktops, due to the limit on the maximum number of running VMs inside the ESX servers. Performance of the vDesktops at this number was still very acceptable, even though both memory and CPU power were almost at their maximum.
The VMs were still very responsive. Random vDesktops were accessed through the console, and responsiveness was tested by starting the "welcome to Windows XP" introduction animation. Neither frame rate nor animation speed deteriorated significantly through the entire range of 0 to 1300 vDesktops.
A good technical metric for this is the CPU ready time: the time a VM is ready to execute on a physical CPU core, but ESX cannot manage to schedule it onto one:
Figure 6.2.16: CPU ready time measured on a vDesktop on a 30 minute interval.
Note that these values are summed between samples, and all millisecond values should be divided by 1800 (30 minutes) in order to obtain the ready time in milliseconds per second (instead of per 30 minutes). In the leftmost part of the graph, vDesktops are still being deployed and booted, impacting performance (ready time is about 12.5 [ms.sec-1]). After the deployment is complete, ready time drops to about 4.2 [ms.sec-1]. These values are very acceptable from a CPU performance point of view.
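The normalization described above can be sketched as follows; the summed input values are back-computed from the approximate 12.5 and 4.2 [ms.sec-1] figures mentioned in the text:

```python
# Convert CPU ready time summed over a 30-minute sample interval
# into milliseconds of ready time per second.
SAMPLE_INTERVAL_SEC = 30 * 60  # 1800 seconds per sample

def ready_ms_per_sec(summed_ready_ms):
    return summed_ready_ms / SAMPLE_INTERVAL_SEC

print(ready_ms_per_sec(22500))  # during deployment: 12.5 [ms.sec-1]
print(ready_ms_per_sec(7560))   # settled: 4.2 [ms.sec-1]
```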
Next to CPU ready times, NFS latency also greatly influences the responsiveness of the vDesktops. The graphs in the following figures were made at a load of 1300 vDesktops:
Figure 6.2.17a and 6.2.17b: Observed NFS read latency at 1300 user simulated vDesktops
[Chart: NFS Read Latency (1300 userloaded vDesktops); Read IOPS [sec-1], grouped by latency]
[Chart: NFS Read Latency ZOOMED (1300 userloaded vDesktops); vertical axis clipped at 20 Read IOPS [sec-1]]
Graph 6.2.17a shows that almost all Read Operations are served within 20 [ms], which is quite impressive at this load.
Looking at the read latency in more detail (figure 6.2.17b), there are some Read Operations which take longer to be served. To put this in numbers: every second, between 1 and 2 read operations take up to about 100 [ms] to be served. Note that this is only about 0.2% of the read operations performed.
Next to read latency, write latency was also measured. The write latency appears to be somewhat worse than the read latency:
Figure 6.2.18a and 6.2.18b: Observed NFS write latency at 1300 user simulated vDesktops
[Chart: NFS Write Latency (1300 userloaded vDesktops); NFS write Operations [sec-1], grouped by latency]
[Chart: NFS Write Latency ZOOMED (1300 userloaded vDesktops); vertical axis clipped at 1000 write operations per second]
About 150 write operations require more than the base 0-40[ms] window to complete. Since the total number of write operations is about 6000, this is about 2.5% of the operations performed. The only explanation for these high latency numbers is that some writes are not committed to the LogZilla, but are flushed to disk directly. This is normal behavior for ZFS.
Within ZFS, larger blocks are not committed to the LogZilla. This is controlled by a parameter called zfs_immediate_write_sz. This parameter is actually a constant within ZFS, and set to 32768 (see reference [2]).
VMware ESX will concatenate writes if possible, up to 64[KB]. It is safe to assume that the majority of writes equal the vDesktop's NTFS block size (4[KB]). However, some blocks do get concatenated within VMware.
Looking at figure 6.1.2, we can see that the average write size is 5.5 [KB]. If we calculate the projected average write size from the behavior seen above, we can conclude that:
0.975 x 4[KB] + 0.025 x 64[KB] = 5.5[KB]
This is a perfect match, so it is safe to assume that the large write latency observed is in fact due to this behavior. Tuning the zfs_immediate_write_sz constant could help in this case (increasing it to 65537 (which is 2^16+1) to make sure 64[KB] writes are also stored in the LogZillas). Unfortunately, adjusting this parameter is not supported on the Sun 7000 storage arrays (nor is it in ZFS, to my knowledge).
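The block-size reasoning can be verified with a short calculation: if roughly 2.5% of the writes are 64[KB] concatenations and the remainder are 4[KB] NTFS blocks, the projected average write size should match the observed 5.5[KB]:

```python
# Projected average write size from a 4[KB] / 64[KB] write mix.
small_kb, large_kb = 4, 64
fraction_large = 0.025  # the ~2.5% of writes that bypass the LogZilla

avg_write_kb = (1 - fraction_large) * small_kb + fraction_large * large_kb
print(f"projected average write size: {avg_write_kb:.1f} [KB]")  # 5.5 [KB]
```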
VMware ESX has a feature called Transparent Page Sharing (TPS). This allows VMware ESX to map several identical virtual memory pages to the same physical memory page. VMware performs this memory deduplication either in hardware (vSphere 4 plus supported CPUs) or in spare CPU cycles (both ESX 3.x and vSphere 4, optionally), so the positive effect of TPS gets bigger over time (also see figure 6.2.9).
At a total of 1300 deployed vDesktops, ESX saves a large amount of memory:
Figure 6.2.19: Memory shared between vDesktops within a single ESX server.
As shown in figure 6.2.19, there are 170 VMs running (the graph shows only one of the eight ESX nodes). Each ribbon in this graph represents a VM. In total, 22.5 [GB] of memory is shared, and thus saved, between vDesktops per ESX node. Without TPS, the ESX servers would each have required 64 + 22.5 = 86.5 [GB] of memory (a saving of roughly 26%).
Looking at the entire ESX cluster, each ESX server saves about the same amount of memory thanks to TPS, saving 8 x 22.5 = 180 [GB] of memory.
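The memory arithmetic behind these savings can be sketched as:

```python
# TPS memory savings per node and cluster-wide, using the figures above.
shared_per_node_gb = 22.5   # memory deduplicated per ESX node
physical_per_node_gb = 64   # physical RAM per ESX node
nodes = 8

required_without_tps_gb = physical_per_node_gb + shared_per_node_gb
saving_fraction = shared_per_node_gb / required_without_tps_gb
cluster_saving_gb = nodes * shared_per_node_gb

print(required_without_tps_gb)         # 86.5 [GB] needed without TPS
print(f"{saving_fraction:.0%} saved")  # roughly 26%
print(cluster_saving_gb)               # 180 [GB] saved cluster-wide
```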
6.2.6 Extrapolating performance figures
In order to predict the maximum number of vDesktops which can be placed on a certain environment, it is important to take note of all limiting factors. By extrapolating the measurements performed on these factors, we can determine how to scale the different resources (like CPU, memory, SSD drives) to match the number of vDesktops we need to deploy.
For scaling VMware ESX CPU and memory, we set the maximum allowable load to 85%. The
extrapolated graph can be found in figure 6.2.20:
Figure 6.2.20: Extrapolation of figure 6.2.12: ESX node resource usage
In figure 6.2.20, memory is limited at 1300 desktops (which actually was the limit we ran into during
the test). CPU had some room to spare: If we pushed CPU consumption to 85%, we could deploy 1650
vDesktops.
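The extrapolation assumes that resource usage grows linearly with the number of vDesktops. A minimal sketch follows; the 67% CPU reading at 1300 vDesktops is an assumed value, back-computed from the projected result of roughly 1650:

```python
# Linear extrapolation of a resource toward an 85% usage cap.
def max_vdesktops(measured_vdesktops, measured_usage_pct, cap_pct=85):
    """Assumes usage grows linearly from zero with the vDesktop count."""
    usage_per_vdesktop = measured_usage_pct / measured_vdesktops
    return int(cap_pct / usage_per_vdesktop)

# An assumed ~67% CPU at 1300 vDesktops projects to roughly 1650
# vDesktops at an 85% cap.
print(max_vdesktops(1300, 67))
```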
[Chart: VMware ESX node resource usage (extrapolated); node CPU and node memory usage (average) [%] against number of deployed user-simulated vDesktops, 0 to 1800]
Figure 6.2.21: Extrapolation of figure 6.2.13: CPU load on the 7000 storage
Looking at figure 6.2.21, the extrapolated value for the 7000 storage CPU usage would put the maximum number of vDesktops at 1900. The theoretical maximum of the HT bus is 4[GB.sec-1], but a generally accepted value is around 2.5[GB.sec-1]. This would mean the HT bus limits the number of vDesktops to 1950.
[Chart: 7000 Storage CPU resources (extrapolated); 7410 CPU load [%] and HT0/socket1 HyperTransport bus throughput [GB.sec-1] against number of deployed user-simulated vDesktops, 0 to 2000]
For read caching, the 7000 storage relies on memory and solid state drives (SSDs). Both extrapolate in basically the same way; memory is simply much faster than SSD. For extrapolating memory usage, using the ARC values is sufficient:
Figure 6.2.22: Extrapolation of figure 6.2.14: Memory usage on the 7000 storage
Extrapolation of the ARC size shows that, with 256[GB] of memory minus some overhead for the kernel, up to 2400 vDesktops could be deployed. Beyond this point SSDs (ReadZilla) would have to be used in order to extend the cache beyond 256[GB], which is the maximum amount of RAM that can fit in the biggest 7000 series array at the time of this writing.
An important note is that the measured range of the ARC is rather short. A slight variation in the measurement could have quite a dramatic effect on the final number of vDesktops that can be deployed in a given environment.
[Chart: 7000 Storage memory usage (extrapolated); 7410 ARC [GB] against number of deployed user-simulated vDesktops, up to 2300]
Finally, the NFS traffic is extrapolated in order to be able to see projected network bandwidth and
number of IOPS required for a given number of vDesktops:
Figure 6.2.23: Extrapolation of figure 6.2.15: NFS traffic observed
The extrapolation in figure 6.2.23 is bounded by several limits. In the network bandwidth projection a maximum of 2x 1GbE is used, with usage limited to 50% for each link in order to avoid possible saturation / packet dropping on the link.
The number of total IOPS in this projection is limited to 12,000 [sec-1]. The reason for choosing this number is that at the measured I/O distribution, about 10,000 Write Operations Per Second (WOPS) would be performed, which is the maximum for a LogZilla device.
According to this graph, maximums come into play above 1800 vDesktops. For the NFS read
bandwidth, the maximum is not reached in this graph but would end somewhere near 4000
vDesktops (!).
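A back-of-the-envelope check of the read-bandwidth ceiling mentioned above. The ~125[MB/s] per 1GbE link and the ~60[MB/s] of NFS reads around 1800 vDesktops are assumptions of this sketch, read approximately from the graphs:

```python
# Network ceiling: two 1GbE links, each limited to 50% utilisation.
link_mb_per_sec = 125                # ~1 Gbit/s expressed in MB/s
usable_mb_per_sec = 2 * link_mb_per_sec * 0.50  # 125 MB/s in total

# Assumed NFS read bandwidth: ~60 MB/s at 1800 vDesktops (figure 6.2.23).
measured_vdesktops, measured_read_mb = 1800, 60

max_by_read_bw = usable_mb_per_sec * measured_vdesktops / measured_read_mb
print(int(max_by_read_bw))  # 3750, consistent with "near 4000" vDesktops
```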
[Chart: NFS traffic (extrapolated); NFS IOPS [sec-1], NFS reads and NFS writes [MB.sec-1] against number of deployed user-simulated vDesktops, up to 1800]
In close relation to the NFS IOPS performed, SATA ROPS and WOPS can also be extrapolated:
Figure 6.2.24: Extrapolation of figure 6.2.4: SATA Read- and Write-Operations Extrapolated
The graph in figure 6.2.24 clearly shows that hardly any SATA ROPS are performed, while SATA WOPS steadily increase with the number of running vDesktops. Note that at 1500 running vDesktops the number of WOPS is projected to be only 68 [sec-1]. ROPS remain near zero.
[Chart: SATA Read- and Write Operations (extrapolated); SATA WOPS [/sec] and SATA ROPS [/sec] against number of deployed vDesktops]
The write acceleration through the LogZilla device(s) can also be extrapolated:
Figure 6.2.25: Extrapolation of figure 6.2.5: LogZilla WOPS performed
[Chart: LogZilla WOPS ave [/sec] (extrapolated), against number of vDesktops deployed]
Latency is more complex to extrapolate. By extrapolating each latency-group, a 3D graph can be
recreated to show projected NFS read latencies:
Figure 6.2.26: Extrapolation of NFS read latency, clipped at 100 read operations per second.
Figure 6.2.26 is an extreme zoom of an extrapolated NFS read latency graph. The graph has been cut into segments, with separations inserted, to give a clear view of the latency graphs as more vDesktops are deployed on the environment.
As the number of vDesktops grows, more latency is introduced, as already determined. This graph, however, makes it clear that the distribution of latency changes as the load increases.
[Chart: Extrapolated NFS read latencies; NFS Read Operations [sec-1], clipped at 100 read operations per second]
6.3 Test Results 2a: Rebooting 100 vDesktops
The impact on the storage of rebooting a large number of VMs should never be underestimated. The reboot process uses far more resources than a regular workload, and rebooting many vDesktops in parallel in particular can mean a large increase in I/O operations performed.
As a subtest, we shut down then restarted a hundred vDesktops with a total of 800 vDesktops deployed. The
impact is best seen in the latency graphs:
Figure 6.3.1a and 6.3.1b: NFS read latency rebooting 100 vDesktops (@800 deployed).
As can be seen in graphs 6.3.1a and b, the reboot took about one hour in total. The restart was issued
through VMware View, which schedules the restarts, spread over time, through vCenter. In graph a (unzoomed),
the peaks above 4000 ROPS indicate the higher number of read operations caused by the restarting
vDesktops. The zoomed graph (graph b) shows in more detail how the read latency worsens during the
restarts. This is because the linked-clone files that were previously written by the VMs are now read back
and have to be brought into the ARC/L2ARC read cache, meaning these reads have to come
from the relatively slow SATA drives. A second restart might have had less impact in this respect (untested).
The filling of the L2ARC (from SATA) during the reboot of the vDesktops can clearly be seen in the graph in
figure 6.3.2:
Figure 6.3.2: L2ARC growth on desktops reboot
A hundred rebooting vDesktops caused the L2ARC to grow by about 50 [GB]. Since all common reads
come from only two replicas, which are already stored in the ARC, each VM apparently reads about 0.5 [GB] of
unique data (from its linked clone).
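As a quick sanity check, the per-vDesktop figure follows directly from the observed numbers. A minimal sketch; both input values are read from figure 6.3.2:

```python
# Estimate of unique data read per vDesktop during the reboot storm.
l2arc_growth_gb = 50      # L2ARC grew by about 50 GB during the reboot
rebooted_vdesktops = 100  # number of vDesktops that were restarted

unique_read_per_vm_gb = l2arc_growth_gb / rebooted_vdesktops
print(f"~{unique_read_per_vm_gb:.1f} GB of unique linked-clone data per vDesktop")
```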
Network bandwidth used is also clearly higher during the reboot of the vDesktops:
Figure 6.3.3: NFS bandwidth used during reboot of 100 vDesktops.
At the left of the graph above, the regular I/O workload can be observed. The rest of the graph shows the
reboot of the 100 vDesktops.
6.4 Test Results 2b: Recovering all vDesktops after storage appliance reboot
When running 1000 vDesktops, the storage array was forcibly rebooted. This subtest was performed to see
the impact on the storage array, on the data and on the vDesktops.
At the time of the forced shutdown of the storage device, all VMs froze. After the storage appliance
rebooted, the ZFS file system had to perform some resilvering (checking the data and making sure it is
consistent, a very reliable feature of ZFS) before normal NFS communication with the ESX servers could
resume. At that point, the VMs simply unfroze and resumed their normal behavior almost instantly.
In graph 6.4.1 the effects of the forced reboot can clearly be seen:
Figure 6.4.1: Network and CPU load behavior during reboot of the storage appliance. The red
striped bars indicate that no measurements were made (during the reboot of the
storage device itself).
The red bar in figure 6.4.1 indicates the time required to (re)boot the storage device. The silent period
after that is the so-called resilvering of the ZFS file system. No I/O is performed at this stage, but as can
be seen the CPU is quite busy during the resilvering.
After resilvering is done, the storage device immediately resumes I/O and settles quite fast. After a
reboot of the appliance, the ARC is empty (being RAM), and the L2ARC data is forcibly deleted; it will be
rebuilt as read operations start to occur. Initially, the reads have to come from SATA, filling up the ARC
and after that the L2ARC. In figure 6.4.2 the refilling of the ARC and L2ARC is clearly visible:
Figure 6.4.2: Filling of the ARC and the L2ARC after a forced reboot of the storage appliance.
The graph in figure 6.4.2 clearly shows the rapid filling of the ARC. It fills a little during resilvering,
then shoots up quickly (probably the two replicas being pulled into the ARC). From there on, the
filling of the ARC slows its pace, and the L2ARC starts to fill as well. The third graph in figure 6.4.2
shows the (L2)ARC misses. For a few minutes there are quite a lot of misses, but this resolves rather quickly.
All in all, the device was up and running again within 15 minutes. Note that the setup used here did not
make use of the clustering features available for the 7000 series; all tests were performed on a single
storage processor.
6.5 Test Results 3: User load simulated full clone desktops
A limited test was added to the original linked-clone test scenario. In this test the same (user-simulated)
Windows XP images were deployed, but this time in full-clone rather than linked-clone mode.
Only 150 full-clone desktops were deployed, to observe the behavior of the ARC and L2ARC in this
scenario.
Figure 6.5.1: Filling of the ARC and the L2ARC during the deployment of 150 full-clone
vDesktops. At the far left some test vDesktops (full clones) are deployed. At (1)
the first batch of 25 vDesktops is deployed; at (2) the rest of the vDesktops
are deployed.
See figure 6.5.1. After the start of the test (far left) some full-clone vDesktops are deployed. At marker
(1), the first batch of 25 vDesktops is deployed. Shortly after marker (1), the combined ARC and L2ARC size
settles around 25 [GB]. This indicates that the vDesktops perform around 1 [GB] of reads per vDesktop.
Because the ARC is not yet saturated, the L2ARC remains (almost) empty at this stage. Beyond marker (2)
the rest of the vDesktops are deployed, quickly filling the ARC and the L2ARC.
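The same arithmetic extends to a rough cache projection for the whole full-clone deployment. This is a sketch only: it assumes the per-desktop footprint stays linear, which figure 6.5.1 only partially confirms.

```python
# Per-vDesktop read-cache footprint observed for full clones (figure 6.5.1),
# and a naive linear projection to the complete 150-desktop deployment.
cache_after_first_batch_gb = 25  # combined ARC+L2ARC size after the first batch
first_batch_vdesktops = 25

per_vdesktop_gb = cache_after_first_batch_gb / first_batch_vdesktops

# Projection for all 150 full clones, assuming linear growth:
projected_cache_gb = per_vdesktop_gb * 150
print(per_vdesktop_gb, projected_cache_gb)  # 1.0 150.0
```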
Figure 6.5.2: Extrapolation of the 7000 storage CPU usage.
Figure 6.5.2 contains an extrapolation of the CPU load on the 7000 storage. The extrapolation spans a
wide range and therefore leaves room for error. However, it appears to be well in line with the CPU
figures measured in the linked-cloning setup (see figure 6.2.20).
More interesting is the number of IOPS performed in the full-clone scenario compared to the linked-clone
scenario:
Figure 6.5.3: NFS IOPS comparison of full-clone versus linked-clone vDesktop deployment.
Figure 6.5.3 shows that linked-clone vDesktops use more IOPS than full-clone vDesktops. This effect
can be explained by the way linked clones function within VMware ESX; their behavior is much like
VMware snapshotting (see reference [1] for more details).
Another thing that can be seen in figure 6.5.3 is that the deployment of linked clones appears to
have a greater IOPS impact than full-clone deployment. Note, though, that this is not actually the case:
in figure 6.5.3, the time scale has been adjusted in order to fit both graphs into a single figure. In fact,
the speed of deployment is very different:
- Linked clones deploy at a rate of 100 vDesktops per hour;
- Full clones deploy at a rate of 10 vDesktops per hour.
This factor of 10 is not visible in the graph, but the full-clone vDesktop deployment actually uses far more
IOPS. This makes sense: in the full-clone scenario every vDesktop gets its boot drive fully copied, while
linked clones only incur some IOPS overhead when creating an empty linked clone (plus some other
administrative actions on disk).
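Because the time axes in figure 6.5.3 were rescaled, the fair comparison is I/O per deployed desktop rather than momentary IOPS. A minimal sketch; the average IOPS value is a placeholder, while the deployment rates are the ones from the report:

```python
def io_per_desktop(avg_iops, desktops_per_hour):
    """Total I/O operations spent per deployed desktop."""
    seconds_per_desktop = 3600 / desktops_per_hour
    return avg_iops * seconds_per_desktop

# With a similar-looking IOPS level in both graphs (placeholder: 2000 IOPS),
# the tenfold difference in deployment rate dominates:
linked_io = io_per_desktop(avg_iops=2000, desktops_per_hour=100)
full_io = io_per_desktop(avg_iops=2000, desktops_per_hour=10)
print(full_io / linked_io)  # 10.0 -- full clones cost ~10x the I/O per desktop
```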
7 Conclusions
From all tests conducted, some very interesting conclusions can be drawn. First of all, the fact that the
environment managed to run over 1300 vDesktops without performance issues is on its own a great
accomplishment. Looking deeper into the measured values yields a wealth of information on best practices
for configuring Sun Unified Storage 7000 in combination with VMware View linked clones.
7.1 Conclusions on scaling VMware ESX
It proves to be very important to scale your VMware ESX nodes correctly. There are basically three things to
keep in mind:
1) The number of CPU cores inside an ESX server;
2) The amount of memory inside an ESX server;
3) The number of vCPUs/VMs the ESX server can deliver.
The first and second are the obvious ones: put in too much CPU power and you run out of memory, leaving
the CPU cores underutilized; put in too much memory and you run out of CPU power, leaving memory
underutilized.
The third is sometimes forgotten, but proved to be the culprit in our test setup: if you use ESX servers with
too much CPU and memory, you'll run out of vCPUs and VMs will no longer start beyond a certain point.
Luckily, with each release of VMware ESX this limit appears to get higher:
- ESX 3.0.1 / ESX 3.5: 128 vCPUs, 128 VMs;
- ESX 3.5U2+: 192 vCPUs, 170 VMs;
- vSphere (ESX4): 512 vCPUs, 320 VMs.
As shown, using vSphere as a basis will allow for much bigger ESX servers.
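The third scaling limit can be expressed as a simple sizing check. The per-host maximums are the ones listed above; treating them as the only constraints is of course a simplification, since CPU and memory sizing also apply.

```python
# Per-host vCPU and powered-on VM maximums, as listed above.
ESX_LIMITS = {
    "ESX 3.0.1/3.5": {"vcpus": 128, "vms": 128},
    "ESX 3.5U2+": {"vcpus": 192, "vms": 170},
    "vSphere (ESX4)": {"vcpus": 512, "vms": 320},
}

def max_vdesktops(version, vcpus_per_vm=1):
    """Hard cap on vDesktops per host, ignoring CPU/memory sizing."""
    lim = ESX_LIMITS[version]
    return min(lim["vcpus"] // vcpus_per_vm, lim["vms"])

for version in ESX_LIMITS:
    print(version, max_vdesktops(version))  # 128, 170 and 320 single-vCPU desktops
```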
7.2 Conclusions on scaling networking between ESX and Unified Storage
The network did not really prove to be an issue during the performed tests. Bandwidth usage to any single
ESX node proved to be well within the capabilities of a single GbE connection.
Bandwidth to the storage also remained far within the designed bandwidth. The two 10 GbE connections
remained underutilized throughout all tests.
Load balancing was forcibly introduced into the test environment, but could have been skipped without issue
in this case. Had the 7000 storage been driven using 1 GbE links, load balancing would be
recommended.
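A back-of-the-envelope utilization check illustrates why the 10 GbE links never came close to saturation. The ~120 MB/s peak is taken from figure 6.3.3; protocol overhead and duplex effects are ignored here:

```python
peak_mb_per_s = 120       # highest NFS read+write rate observed (figure 6.3.3)
links, link_gbit = 2, 10  # two 10 GbE connections to the storage

capacity_mb_per_s = links * link_gbit * 1000 / 8  # raw capacity: 2500 MB/s
utilization = peak_mb_per_s / capacity_mb_per_s
print(f"{utilization:.1%} of the raw link capacity")  # 4.8%
```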
7.3 Conclusions on scaling Unified Storage CPU power
During the tests, the CPUs inside the 7000 storage were never fully saturated. At 1300 user-simulated
vDesktops, the load on the two CPUs reached 85%, which should be considered near the maximum
performance. In order to scale up further, four CPUs (or 6-core CPUs) would be required.
The HyperTransport bus between the two CPUs showed quite large values (in the order of 1.7 [GByte.sec-1]).
This was partially due to the fact that the two 10 GbE ports both reside on a single PCIe card. This caused all
traffic to be forcibly sent through the HyperTransport bus of CPU0, instead of being load-balanced between
CPU0 and CPU1:
Figure 7.3.1: Sun 7410 Unified Storage HyperTransport bus architecture. In the performance tests
a single PCIe card with dual 10GbE was used. Best practice would be to use two single
port 10GbE PCIe cards using a different HT-Bus (shown in semi-transparency).
7.4 Conclusions on scaling Unified Storage Memory and L2ARC
In order to obtain the best performance from the 7000 Unified Storage, read cache is very important. This
type of storage was primarily selected for its large read cache capabilities. Using linked clones, all replicas
(the full-clone mothers of the linked clones) were directly committed to read cache. For each linked clone
deployed, a small additional amount of read cache was required. The amount of read cache should be
carefully matched to the projected number of vDesktops on the storage device. See chapter 8 for more details.
The L2ARC presents itself in the form of one or more read-optimized Solid State Drives (SSDs). It can be seen
as a direct extension of internal memory. It is important to note, though, that L2ARC storage is about a factor
of 1000 slower than memory. Best practice would be to match internal memory to the required read cache. If
(and only if) the read-cache requirements exceed the physical maximum amount of internal memory, the
L2ARC can be used to reach the required amount.
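This sizing rule can be sketched as follows. It is a hedged sketch: the 0.5 GB per linked clone is an assumption based on the reboot test earlier in this report, and the replica size used in the example is hypothetical.

```python
def required_read_cache_gb(replicas_gb, vdesktops, per_clone_gb=0.5):
    """Read cache needed: the replicas plus a per-linked-clone increment."""
    return replicas_gb + vdesktops * per_clone_gb

def split_arc_l2arc(required_gb, ram_gb):
    """Serve as much as possible from RAM (ARC); spill the rest to L2ARC SSDs."""
    arc = min(required_gb, ram_gb)
    l2arc = max(0.0, required_gb - ram_gb)
    return arc, l2arc

need = required_read_cache_gb(replicas_gb=20, vdesktops=800)  # hypothetical replica size
print(split_arc_l2arc(need, ram_gb=128))  # (128, 292.0)
```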
7.5 Conclusions on scaling Unified Storage LogZilla SSDs
The LogZilla devices enable the 7000 Unified Storage to quickly acknowledge synchronous writes to the
storage device. The metadata of a write is stored in the LogZilla and the write itself in the ARC. Finally,
the write is committed to disk from the ARC and the metadata in the LogZilla is flagged as handled.
In normal operation, the LogZilla is never read from. Only on recovery (for example after power loss) is the
LogZilla read, and the ZFS file system is returned to a consistent state using the metadata in the LogZilla
that was not yet flagged as handled.
In effect, the addition of a LogZilla greatly lowers the write latency of the storage device. The
performed tests show that the LogZilla really helps to keep write latency to a minimum.
Each LogZilla is able to perform 10.000 [WOPS]. When the projected number of writes is larger than 10.000
[WOPS], adding LogZillas could help. Note, though, that adding a second LogZilla will not help
performance-wise: the Unified Storage will place both LogZillas in a RAID1 configuration. This RAID1
configuration does help in ensuring performance: a LogZilla may fail and the storage device will keep working
normally, whereas with a single LogZilla the synchronous writes would have to be written to disk directly if
the LogZilla fails, clipping performance.
Using four LogZilla devices does increase the number of WOPS a single storage device can perform: the
Unified Storage will put four LogZillas into a RAID10 configuration, effectively able to perform
20.000 [WOPS].
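The scaling rules above can be summarized in a small sketch, assuming (as stated above) 10.000 WOPS per device and that the Unified Storage always mirrors LogZillas in pairs:

```python
LOGZILLA_WOPS = 10_000  # per device, as stated above

def effective_wops(logzillas):
    """Usable synchronous-write ops/sec; mirroring adds redundancy, not WOPS."""
    if logzillas == 0:
        return 0
    pairs = max(1, logzillas // 2)  # a single device runs unmirrored
    return pairs * LOGZILLA_WOPS

print(effective_wops(1))  # 10000 -- but writes fall back to disk if it fails
print(effective_wops(2))  # 10000 -- RAID1 mirror: same WOPS, with redundancy
print(effective_wops(4))  # 20000 -- RAID10: two mirrored pairs
```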
7.6 Conclusions on scaling Unified Storage SATA storage
Throughout the tests, the number of SATA ROPS and WOPS remained consistently low. This is
due to the way ZFS works: ZFS aims to read most (if not all) data from the ARC and L2ARC, and ZFS combines
and reorders small random writes into very large blocks, converting the small random writes into large
sequential writes. This way of working minimizes ROPS and performs only a few large sequential writes to
SATA (see also Reference [3]).
Given the fact that a single