
Page 1: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013

© 2009 VMware Inc. All rights reserved

Software Defined DatacenterSample Architectures based on vCloud Suite 5.1

Singapore, Q2 2013

Iwan ‘e1’ RahabokLinkedin.com/in/e1ang/ | tinyurl.com/SGP-User-Group

M: +65 9119-9226 | [email protected]

VCAP-DCD, TOGAF Certified

Page 2: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Purpose of This Document

There is a lot of talk about private cloud. But what does it look like at the technical level?
• How do we really assure SLAs and offer 3 tiers of service?
• If I'm a small company with just 50 servers, what does my architecture look like? If I have 2000 VMs, what does it look like?

For existing VMware customers, I go around and do a lot of "health checks" at customer sites.
• The No 1 question is around design best practice. So this doc serves as a quick reference for me; I can pull a slide from here for discussion.

I am an employee of VMware, but this is my personal opinion.
• Please don't take it as an official, formal VMware recommendation. I'm not authorised to give one.
• Also, we should generally judge the content rather than the organisation/person behind it. A technical fact is a technical fact, regardless of whether an intern said it or a 50-year IT engineer said it.

Technology changes
• 10 Gb Ethernet, Flash, SSD disks, FCoE, Converged Infrastructure, SDN, NSX, storage virtualisation, etc. will impact the design. A lot of new innovation is coming within the next 2 years, and some of it was already featured at VMworld.
• New modules/products from VMware's Ecosystem Partners will also impact the design.

This is a guide
• Not a Reference Architecture, let alone a Detailed Blueprint.
• Don't print it and follow it to the dot. This is for you to think about and tailor.

It is written for hands-on vSphere Admins who have attended the Design Workshop & ICM
• A lot of the design considerations are covered in the vSphere Design Workshop.
• It complements vCAT 3.0.
• You should be at least a VCP 5, preferably a VCAP-DCD 5.
• No explanation of features. Sorry, it's already >100 slides.

With that, let's have a professional* discussion
• * Not an emotional, religious or political discussion. Let's not get angry over technical stuff; it's not worth your health.

Folks, some disclaimers:

Use it like a book, not a slide deck

Page 3: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Table of Contents

Introduction
• Architecting in vSphere
• Applications with special consideration
• Requirements & Assumptions
• Design Summary

vSphere Design: Datacenter
• Datacenter, Cluster (DRS, HA, DPM, Resource Pool)

vSphere Design: Server
• ESXi, physical host

vSphere Design: Network

vSphere Design: Storage

vSphere Design: Security & Compliance
• vCenter roles/permission, config management

vSphere Design: VM

vSphere Design: Management
• See this deck: http://communities.vmware.com/docs/DOC-17841

Disaster Recovery
• See this deck: http://communities.vmware.com/docs/DOC-19992

Additional Info: email me for Appendix slide

Some slides have speaker notes for details.

Refer to Module 1 and Module 2 of vSphere Design Workshop. I’m going straight into more technical material here.

This covers only items that have major design impact; non-design items are not covered. The focus is vCloud Suite 5.1 (infrastructure). Application-specific topics (e.g. databases) are not covered.

Page 4: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Introduction

Page 5: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


vCloud Suite Architecture: what do I consider

Architecting a vSphere-based DC is very different from a physical data center
• It breaks best practice, as virtualisation is a disruptive technology and it changes the paradigm. Do not apply physical-world paradigms to the virtual world. Many "best practices" in the physical world are caused by physical-world limitations; once the limitation is removed, the best practice is no longer valid.
• Adopt emerging technology, as virtualisation is still innovating rapidly. Best practice means proven practice, and that might mean outdated practice.
• Consider unrequested requirements, as the business expects the cloud to be agile. You have experienced VM sprawl, right?

My personal principle: do not design something you cannot troubleshoot.
• A good IT Architect does not set up potential risks for the support person down the line.
• I tend to keep things simple and modular. Cost will go up a bit, but it is worth the benefits.

What I consider in a vSphere-based architecture
• No 1: Upgradability
  • This is unique to the virtual world: a key component of cloud that people have not talked about much.
  • After all my apps run on the virtual infrastructure, how do I upgrade the virtualisation layer itself? How do you upgrade SRM?
  • Based on historical data, VMware releases a major upgrade every 2 years.
  • Your architecture will likely span 3 years, so check with your VMware rep for an NDA roadmap presentation.
• No 2: Debug-ability
  • Troubleshooting in a virtual environment is harder than in physical, as boundaries are blurred and physical resources are shared.
  • 3 types of troubleshooting:
    • Configuration. This does not normally happen in production, as once something is configured it is not normally changed.
    • Stability. Stability means something hangs, crashes (BSOD, PSOD, etc.) or is corrupted.
    • Performance. This is the hardest of the 3, especially if the slow performance is short-lived and the system performs well most of the time.
  • This is why the design has extra server and storage capacity, so we can isolate some VMs while doing joint troubleshooting with the App team.
• Supportability
  • This is related to, but not the same as, Debug-ability.
  • Support relates to things that make day-to-day support easier: monitoring counters, reading logs, setting up alerts, etc. For example, centralising the logs via syslog and providing intelligent search improves Supportability (a PowerCLI sketch at the end of this slide shows the syslog part).
  • A good design makes it harder for the Support team to make human errors. Virtualisation makes tasks easy, sometimes way too easy relative to the physical world. Consider this operational/psychological impact in your design.
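To illustrate the Supportability point, here is a minimal PowerCLI sketch that points every host in a cluster at a central syslog target. The cluster name and the syslog address are assumptions, not part of the original design.

  # Assumed names: cluster "Prod-Tier1", syslog collector syslog.corp.local
  foreach ($esx in Get-Cluster "Prod-Tier1" | Get-VMHost) {
      Get-AdvancedSetting -Entity $esx -Name "Syslog.global.logHost" |
          Set-AdvancedSetting -Value "tcp://syslog.corp.local:514" -Confirm:$false
  }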

Page 6: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


vCloud Suite Architecture: what do I consider

Consideration
• Cost
  • You will notice that the "Small" design example has a lot more limitations than the "Large" design.
  • An even bigger cost is ISV licensing. Some, like Oracle, charge for the entire cluster. Dedicating a cluster to them is cheaper.
  • The DR site serves 3 purposes to reduce cost.
  • VMs from different Business Units are mixed in 1 cluster. If they can share the same Production LAN and SAN, the same reasoning applies to the hypervisor.
  • Windows, Linux and Solaris VMs are mixed in 1 cluster. In a large environment, separate them to maximise your OS licensing.
  • DMZ and non-DMZ are mixed in 1 cluster.
• Security & Compliance
  • The vSphere Security Hardening Guide splits security into 3 levels: Production, DMZ and SSLF.
  • Prod and Non-Prod don't share the same cluster, storage or network:
    • It is easy to make a mistake, and easy to move in and out of the Production environment.
    • Production is more controlled and secure.
    • Non-Prod may spike (e.g. during load testing).
• Availability
  • Software has bugs. Hardware has faults. We mostly cater for hardware faults; what about software bugs?
  • I try to cater for software bugs, which is why the design has 2 VMware clusters with 2 vCenters. This lets you test cluster-related features in one cluster while keeping your critical VMs on the other.
  • Cluster sizing is always based on 1 host failure. In a small cluster, the overhead can be high (50% in a 2-node cluster).
• Reliability
  • Related to availability, but not the same. Availability is normally achieved by redundancy. Reliability is normally achieved by keeping things simple, using proven components, separating things, and standardising.
  • For example, the solution for the Small design is simpler (a lot fewer features relative to the Large design). It also uses 1 vSwitch for 1 purpose, as opposed to a big vSwitch with many port groups and a complex NIC fail-over policy.
  • You will notice a lot of standardisation in all 3 examples. The drawback of standardisation is overhead, as we have to round up to the next bracket: a VM that needs 24 GB RAM ends up getting 30 GB.

Page 7: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


vCloud Suite Architecture: what do I consider

Consideration
• Performance (1 and Many)
  • 2 types:
    • How fast can we do 1 transaction? Latency and clock speed matter here.
    • How many transactions can we do within the SLA? Throughput and scalability matter here.
  • Storage, Network, VMkernel, VMM, Guest OS, etc. are all considered.
  • We are aiming for <1% CPU Ready Time and near-0 memory ballooning in Tier 1. In Tier 3, we can and should have higher ready time and some ballooning, as long as it still meets the SLA (a PowerCLI sketch at the end of this slide shows one way to check CPU ready).
  • Some techniques to address this: add ESXi hosts, add clusters, add spindles, etc. This includes both horizontal and vertical scaling, and both hardware and software.
• Skills of the IT team
  • Especially the SAN vs NAS skill. This is more important than the protocol itself.
  • Skills include both internal and external (a preferred vendor who complements the IT team).
  • In a Small/Medium environment, it is impossible to be an expert in all areas. Consider complementing the internal team by establishing a long-term partnership with an IT vendor. A purely transactional vendor relationship saves cost initially, but in the long run there is a cost.
• Existing environment
  • How does the new component fit into the existing environment? E.g. adding a new Brand A server into a data center full of Brand B servers needs to take into account management and compatibility with common components.
  • Most customers do not have a separate network for DR tests. In other words, they test their DR in the production network.
• Improvement
  • Besides meeting current requirements, can we improve things?
  • Almost all companies need more servers, especially in non-production. So when virtualisation happens, we get VM sprawl; as such, the design has headroom.
  • Move toward "1 VM, 1 OS, 1 App". In the physical world, some servers may serve multiple purposes. In the virtual world, they can afford to run 1 app per VM, and should do so.
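For the CPU Ready target mentioned above, a minimal PowerCLI sketch that reports recent CPU ready % per VM in a Tier 1 cluster. The cluster name is an assumption; real-time samples are 20 seconds each, and the figure is the aggregate across all vCPUs of each VM.

  $intervalSec = 20   # real-time sample interval on vCenter
  Get-Cluster "Prod-Tier1" | Get-VM | ForEach-Object {
      $samples = Get-Stat -Entity $_ -Stat "cpu.ready.summation" -Realtime -MaxSamples 30 |
                 Where-Object { $_.Instance -eq "" }          # aggregate instance only
      $avgMs   = ($samples | Measure-Object -Property Value -Average).Average
      New-Object PSObject -Property @{
          VM       = $_.Name
          ReadyPct = [math]::Round(($avgMs / ($intervalSec * 1000)) * 100, 2)   # ms ready per 20 s window
      }
  }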

Page 8: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


First Things First: the applications

Your cloud's purpose is to run apps. We must know what types of VMs we are running, as they impact the design or operation.

Type of VM | Impact on Design

Microsoft NLB (network load balancer). Typical apps: IIS, VPN, ISA
• VMware recommends Multicast. Needs its own port group, and that port group needs Forged Transmits enabled (as NLB changes the MAC address). A PowerCLI sketch follows this table.

MSCS (consider Symantec VCS instead, as it has none of the restrictions listed here)
• Needs FC; iSCSI, NFS and FCoE are not supported. The array must also be explicitly certified on vSphere.
• Needs an anti-affinity rule (Host-to-VM mapping, not VM-VM, as VMware HA does not obey VM-VM affinity rules). As such, needs 4 nodes in a cluster.
• Needs RAM to be 100% reserved. Impacts the HA slot size if you use default settings.
• Disks have to be eagerzeroedthick, so they are full size. Thin provisioning at the array will not help, as we zero all of the disk.
• Needs 2 extra NIC ports per ESXi host for the heartbeat.
• Needs RDM disks in physical compatibility mode, so the VM can't be cloned or converted to a template. vMotion is not supported as at vSphere 5 (this is not due to the physical RDM).
• Impacts ESXi upgrades, as the ESXi version must be the same.
• With native multipathing (NMP), the path policy can't be Round Robin.
• It uses Microsoft NLB. Impacts SRM 5: it works, but needs scripting. Preferably keep the same IP, so create a stretched VLAN if possible.

Microsoft Exchange
• If you need CCR (clustered continuous replication), then you need MSCS.

Oracle software
• Oracle charges per cluster (or sub-cluster, if you configure host-VM affinity). I'm not 100% sure whether Oracle still charges per cluster if we do not configure automatic vMotion for the VM (so just Active/Passive HA, like the physical world, with DRS set to manual for this VM). It looks like they will charge per host in this case, based on their document dated 13 July 2010. But the interpretation from Gartner is that Oracle charges for the entire cluster.

App that is licenced per cluster
• Similar to Oracle. I'm not aware of any other apps.

Apps that are not supported
• While ISVs support VMware in general, they may only support certain versions. SAP, for example, only supports from SAP NetWeaver 2004 (SAP Kernel 6.40) and only on Windows and Linux 64-bit (not on Solaris, for example).
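For the Microsoft NLB row above, a minimal PowerCLI sketch that relaxes the security policy on a dedicated NLB port group on a standard vSwitch. The host and port group names are assumptions.

  Get-VMHost "esx01.corp.local" |
      Get-VirtualPortGroup -Name "NLB-Multicast" |
      Get-SecurityPolicy |
      Set-SecurityPolicy -ForgedTransmits $true -MacChanges $true   # allow NLB's MAC rewriting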

Page 9: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


VMs with additional consideration

Type of VM | Impact on Design

Peer applications (apps that scale horizontally, e.g. web servers, app servers)
• They need to exist on different ESXi hosts in a cluster, so set up an anti-affinity rule. You need to configure this per peer set: if you have 5 sets of web servers from 5 different systems (so 5 pairs, 10 VMs), you need to create 5 anti-affinity rules. Too many rules create complexity, more so when the number of nodes is less than 4. A PowerCLI sketch follows this table.

Paired applications (apps that protect each other for HA, e.g. AD, DHCP servers)
• As above.

Security VM or network packet capture tool
• Need to create another port group to separate the VMs being monitored from those that are not. Need to use the Distributed vSwitch to turn on port mirroring or NetFlow.

App that depends on the MAC address for its licence
• Needs its own port group. May need MAC Address Changes set to Accept.

App that holds sensitive data
• Should encrypt the data or the entire file system. vSphere 5 can't encrypt the vmdk file yet. If you encrypt the Guest OS, the backup product may not be able to do file-level backup.
• Should ensure no access by the MS AD Administrators group. Find out how it is backed up, and who has access to the tape. If IT does not even have access to the system, then vSphere may not pass the audit requirement.
• Check partner products like Intel TXT and HyTrust.

Fault Tolerance requirements
• Impacts the HA slot size (if we use that policy), as FT uses full reservation. Impacts Resource Pools; make sure we cater for the (small) VM overhead.

App on Fault Tolerant hardware
• FT is still limited to 1 vCPU. Consider Stratus to complement vSphere 5.
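For the peer/pair applications above, a minimal PowerCLI sketch that creates one VM-VM anti-affinity rule per pair. The cluster and VM names are assumptions; repeat per peer set.

  $cluster = Get-Cluster "Prod-Tier2"
  New-DrsRule -Cluster $cluster -Name "Separate-SystemA-Web" -KeepTogether:$false `
      -VM (Get-VM -Name "SystemA-Web1","SystemA-Web2" -Location $cluster)   # keep the pair on different hosts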

Page 10: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


VMs with additional consideration

Type of VM Impact on Design

App that requires a hardware dongle
• The dongle must be attached to 1 ESXi host; vSphere 4.1 added this support. Best to use a network dongle. In the DR site, the same dongle must be provided too.

App with high IOPS
• Need to size properly. There is no point having dedicated datastores if the underlying spindles are shared among multiple datastores.

Apps that use a very large block size
• SharePoint uses a 256 KB block size, so a mere 400 IOPS will already saturate a GE link. For such applications, FC or FCoE is a better protocol. Any application with a 1 MB block size can easily saturate a 1 GE link.

App with very large RAM (>64 GB)
• This will impact DRS when an HA event occurs, as it needs a host that can house the VM. It will still boot as long as the reservation is not set to a high number.

App that needs Jumbo Frames
• This must be configured end to end (guest OS, port group, vSwitch, physical switch). Not all devices support 9000, so do a ping test (e.g. vmkping with the don't-fragment flag and an 8972-byte payload) and find the value.

App with >95% CPU utilisation in the physical world and a high run queue
• Find out first why it is so high. We should not virtualise an app whose performance characteristics we are blind to.

App that is very sensitive to time accuracy
• Time drift is a possibility in the virtual world. Find out the business or technical impact if time deviates by 10 seconds.

A group of apps with a complex power-on sequence and dependencies
• Be aware of the impact on the application during an HA event. If 1 VM is shut down by HA and then powered on, the other VMs in the chain may need a restart too. This should be discussed with the App Owner.

App that takes advantage of a specific CPU instruction set
• Mixing with an older CPU architecture is not possible. This is a small problem if you are buying new servers. EVC will not help, as it's only a mask. See speaker notes.

App that needs <0.01 ms end-to-end latency
• Separate cluster, as the tuning is not suitable for a "normal" cluster.

Page 11: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


This entire deck does not cover Mission Critical Applications

The deck focuses on designing a generic platform for most applications. In the 80/20 concept, it focuses on the easiest 80.

Special apps have unique requirements. They differ in the following areas:
• Size is much larger, so the S, M, L sizes for VMs or ESXi hosts do not apply to them.
• The VM has unique properties.
• They might get a dedicated cluster.
• The picture on the right shows a VM with 12 vCPU, 160 GB vRAM, 3 SCSI controllers, use of PVSCSI, 18 vDisks and 2 vNICs. This is an exceptional case.

There are basically 2 overall architectures in vCloud Suite 5.1:
• One for the easiest 80%
• One for the hardest 20%

The management cluster described later will still apply to both architectures.

[Figure labels: separate SCSI controllers for the root disk and for the data and redo disks; vDisks grouped as OS and Binary, Data Disks, Redo Disks and CRS Disks; plus the vNICs.]

Page 12: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


3 Sizes: Assumptions

Assumptions are needed to avoid the infamous "It depends…" answer.
• The architecture for 50 VMs differs from that for 500 VMs, which in turn differs from that for 5000 VMs. It is the same vSphere, but you design it very differently.
• A design for large VMs (20 vCPU, 200 GB vRAM) differs from a design for small VMs (1 vCPU, 1 GB).
• Workloads for an SME are smaller than for a Large Enterprise. Exchange handling 100 staff vs 10,000 staff results in a different architecture.
• A design for a server farm differs from one for a desktop farm.

I provide 3 sizes in this document: 50, 500 and 1500 VMs.
• The table below shows the definition.
• I try to make each choice as real as possible. 3 sizes give you choice and show the reasoning used.
• Take the closest size to your needs, then tailor it to the specific customer (not the project). Do not tailor to a project, as it is a subset of the entire data center. Always architect for the entire data center, not a subset.

Size means the size of the entire company or branch, not the size of Phase 1 of the journey.
• A large company starting small should not use the "Small" option below; it should use the "Large" option but reduce the number of ESXi hosts.
• I believe in "begin with the end in mind", projecting around 2 years. Longer than 3 years is rather hazy, as private cloud has not fully matured yet; I expect major innovation until 2015.
• A large company can use the Small Cloud example for a remote office, but this needs further tailoring (VSA & ROBO).

Small VDC | Medium VDC | Large VDC

Company: Small company or remote branch | Medium | Large

IT Staff: 1-2 people doing everything | 4 people doing infra, 2 doing desktop, 10 doing apps | Different teams for each; matrix reporting; lots of politics & little kingdoms

Data Center: 1 or none (hosted), or just a corner rack | 2, but no array-level replication | 2 with private connectivity, plus 5 satellite DCs

Page 13: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Assumptions, Requirements, Constraints for our Architecture

Small VDC | Medium VDC | Large VDC

# Servers currently: 25 servers, all production | ~150 servers, 70% production | 700 servers, 55% production

# Servers in 2 years: Prod 30, Non-Prod 15 (50%) | Prod 250, Non-Prod 250 (100%) | Prod 500, Non-Prod 1000 (200%)

# Server VMs our design needs to cater for: 50 | 500 | 1500

# View VMs or laptops: 500, with remote access, no need for offline VDI | 5000, with remote access, need offline VDI | 15000, with remote access + 2FA, need offline VDI

DR requirements: Yes | Yes | Yes

Storage expertise: Minimal; also keeping cost low by using IP storage | No SAN | Yes; RDM will be used as some DBs may be large

DMZ zone / SSLF zone: Yes/No | Yes/No | Yes/No; Intranet also zoned

Backup: Disk | Tape | Tape

Network standard: No standard | No standard | Cisco

ITIL compliance: Not applicable | A few are in place | Some are in place

Change Management: Mostly not in place | A few are in place | Some are in place

Overall system mgmt SW (BMC, CA, etc.): No | Needs to have tools | Needs to have tools

Configuration Management: No | Needs to have tools | Needs to have tools

Oracle RAC: No | Yes | Yes

Audit team: No | External | External & Internal

Capacity Planning: No | Needs to have tools | Needs to have tools

Oracle software (BEA, DB, etc.): No | No | Yes

Page 14: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


3 Sizes: Design Summary

The table below provides the overall comparison, so you can easily see what was taken out in the Small or Medium design.
• Just like any other design, there is no single perfect answer. Example: you may use FC or iSCSI for Small.

This assumes 100% virtualised. It is easier to have 1 platform than 2.
• Certain things in a company you should only have 1 of (email, directory, office suite, backup). Something as big as a "platform" should be standardised; that's why it is called a platform.

Design for Medium will be in between Small and Large.

Small | Large

# FT VMs: 0-3 (in Prod cluster only) | 0-6 (Prod cluster only)

VMware products: vSphere Standard, SRM Standard, vCloud Networking & Security, Horizon View Enterprise, vCenter Operations Standard, vSphere Storage Appliance | vCloud Suite Enterprise, vCenter Server Standard, vCenter Server Heartbeat, Horizon Suite

VMware certification & skill: 1 VCP | 1 VCAP-DCA, 1 VCAP-DCD, VMware Mission Critical Support

Storage: iSCSI, vSphere Replication | FC + iSCSI with snapshots, vSphere + array replication

Server: 2x Xeon 5650, 72 GB RAM | 2x Xeon (8-10 cores/socket), 128 GB RAM

Backup: VMware Data Protection to array | 2x VADP + 3rd party to tape

Page 15: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Other Design Possibilities

What if you need to architect for a larger environment?
• Take the Large Cloud sample as a starting point. It can be scaled to 10,000 VMs.
• Above 1000 VMs, you should consider a Pod approach.
• Upsize it by:
  • Adding larger ESXi hosts. I'm using an 8-core socket, based on Xeon 5600. You could use a 10-core Xeon 7500 to fit larger VMs; take note of the cost.
  • Adding more ESXi hosts to the existing cluster. Keep it to a maximum of 10 nodes per cluster.
  • Adding more clusters. For example, you can have multiple Tier 1 clusters.
  • Adding Fault Tolerant hardware from Stratus. Make this Stratus server a member of the Tier 1 cluster. It appears as 1 ESXi host, although there are 2 physical machines. Stratus has its own hardware, so ensure consistency in your cluster design.
  • Splitting the IT datastore into multiple datastores, grouped by function or criticality.
  • If you are using blade servers and have filled 2 chassis, put the IT cluster outside the blades and use rack mount. Separating the blades and the servers managing them minimises the chance of human error, as we avoid the "managing itself" complexity.
• Migrating between clusters
  • vSphere 5.1 supports live migration between clusters that don't have a common datastore.
  • I don't advocate live migration from/to the Production environment; it should be part of Change Control.
• The Large Cloud is not yet architected for vCloud Director
  • vCloud Director has its own best practices for vSphere design.
  • Adding vCloud + SRM on the DR site requires a proper design by itself. And this deck is already 100+ slides….

Page 16: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Design Methodology

Architecting a Private Cloud is not a sequential process.
• There are 8 components.
• The application drives the infrastructure.
• The components are inter-linked, like a mesh.
• In the >1000 VM category, where it takes >2 years to virtualise >1000 VMs, new vSphere releases will change the design.

Even the bigger picture is not sequential.
• Sometimes you may even have to leave Design and go back to Requirements or Budgeting.

There is no perfect answer. Below is one example.

This entire document is about Design only. Operation is another big space.
• I have not taken into account Audit, Change Control, ITIL, etc.

[Figure: the 8 design components — VM, Server, Storage, Network, Security, Data Center, Mgmt, DR — and the order the steps actually take.]

Page 17: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Data Center Design
Data Center, DR, Cluster, Resource Pool

Page 18: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Just what is a software-defined datacenter anyway?

[Figure: a Virtual Datacenter spanning Physical Datacenter 1 and Physical Datacenter 2. Each physical datacenter provides a physical compute function, a physical network function and a physical storage function, and each function can come from two different vendors (Compute Vendor 1/2, Network Vendor 1/2, Storage Vendor 1/2).]

Shared Nothing Architecture: not stretched between the 2 physical DCs. Production might be 10.10.x.x; DR might be 20.20.x.x.

Shared Nothing Architecture: no replication between the 2 physical DCs. Production might be FC; DR might be iSCSI.

Shared Nothing Architecture: no stretched cluster between the 2 physical DCs. Each site has its own vCenter.

Page 19: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


2 Distinct Layers: Consumer and Producer

Separation & Abstraction (done by the Hypervisor or DC OS)

2 distinct layers
• Supporting the principle of Consumer and Producer.
• The VM is the Consumer. It does not care about the underlying technology; its sole purpose is to run the application.
• The DC infra is the Producer. It provides common services.

The VM is freed from (or independent of) the underlying technology. These technologies can change without impacting the VM:
• Storage protocol (iSCSI, NFS, FC, FCoE)
• Storage file system (NFS, VMFS, VVOL)
• Storage multi-pathing (VMware, EMC, etc.)
• Storage replication
• Network teaming

The Datacenter OS provides a lot of services, such as:
• Security: firewall, IDS, IPS, virtual patching
• Networking: LB, NAT
• Availability: backup, cluster, HA, DR, FT
• Management & Monitoring

A lot of agents are removed from the VM, resulting in a simpler server.

[Figure layers: DC Services, Datacenter Technologies, Datacenter Implementation]

Page 20: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: A closer look at Active/Active Datacenter

[Figure, Active/Active option: each site runs 250 Prod VMs on Prod Clusters and 500 Test/Dev VMs on T/D Clusters, with its own vCenter. There is lots of traffic between the sites: Prod to Prod and T/D to T/D. Shown for comparison: one site with 500 Prod VMs on Prod Clusters and its own vCenter, the other with 1000 Test/Dev VMs on T/D Clusters and its own vCenter.]

Page 21: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Adding Active/Active to a mostly Active/Passive vSphere

[Figure: the mostly Active/Passive base — 500 Prod VMs on Prod Clusters and 1000 Test/Dev VMs on T/D Clusters, each behind its own vCenter — shown across the two sites, with a small Active/Active addition: 50 VMs in 1 cluster fronted by Global Load Balancers at each site.]

Page 22: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Level of Availability

Tier Technology RPO RTO

Tier 0 Active/Active at Application Layer 0 hours 0 hours

Tier 1 Array-based Replication 0 hours 2 hours

Tier 2 vSphere Replication 15 min 1 hour

Tier 3 No replication. Backup & Restore 1 day 8 hours

Page 23: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Methodology

Define how many physical data centers are required
• DR requirements normally dictate 2.

For each physical DC, define how many vCenters are required
• Desktop and Server should be separated by vCenter:
  • Connected to the same SSO server, fronted by the same Web Client VM.
  • View comes with vSphere bundled (unless you are buying the add-on).
  • Ease of management.
• In some cases (hybrid Active/Active), a vCenter may span multiple physical DCs.

For each vCenter, define how many virtual data centers are required
• A Virtual Data Center serves as a naming boundary.
• A good way to separate IT (Provider) and Business (Consumer).

For each vDC, define how many clusters are required
• In a large setup, there will be multiple clusters for each Tier.

For each cluster, define how many ESXi hosts are required
• Preferably 4-12; 2 is too small a size.
• Standardise the host spec across clusters. While each cluster can have its own host type, this adds complexity.
(A minimal PowerCLI sketch of this hierarchy follows the figure below.)

[Figure: hierarchy Physical DC > vCenter > Virtual DC > Cluster > ESXi. Example: a physical DC with a vCenter (server pool) containing a Virtual DC (IT) and a Virtual DC (Biz) with a Tier 1 Cluster and a Tier 2 Cluster of ESXi hosts, plus a vCenter (desktop pool) with its own Virtual DC.]
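A minimal PowerCLI sketch of this hierarchy, from virtual datacenter down to host. All object names and the credentials are assumptions; the automation level is just one possible choice.

  $dc = New-Datacenter -Location (Get-Folder -NoRecursion) -Name "VirtualDC-Biz"
  $cl = New-Cluster -Location $dc -Name "Tier1-Cluster" -HAEnabled -DrsEnabled -DrsAutomationLevel FullyAutomated
  Add-VMHost -Name "esx01.corp.local" -Location $cl -User "root" -Password "placeholder" -Force   # placeholder credentials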

Page 24: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: The Need for a Non-Prod Cluster

This is unique to the virtual data center.
• We don't have a "cluster" to begin with in the physical DC, as cluster means a different thing there.

The Non-Prod cluster serves multiple purposes
• Running Non-Production VMs.
  • In our design, all Non-Production runs on the DR site to save cost.
  • A consequence of our design is that migrating from/to Production can mean copying a large amount of data across the WAN.
• Disaster Recovery.
• Test bed for infrastructure patching or updates.
• Test bed for infrastructure upgrade or expansion.

Evaluating or implementing new features
• In a virtual data center, a lot of enhancements can impact the entire data center, e.g. Distributed Switch, Nexus 1000V, Fault Tolerance, vShield.
• All of the above need proper testing.
• The Non-Prod cluster should provide a sufficiently large scope to make testing meaningful.

Upgrade of the core virtual infrastructure
• e.g. from vSphere 4 to 5 (a major release).
• This needs extensive testing and a roll-back plan.

Even with all the above…
• How are you going to test SRM upgrades & updates properly?
  • In Singapore, the MAS TRM guidelines require Financial Institutions to test before updating production.
  • An SRM test needs 2 vCenters, 2 arrays and 2 SRM servers. If all are used in production, then where is the test environment for SRM?
• What happens when you are upgrading SRM?
  • You will lose protection during this period.

This new layer does not exist in the physical world. It is software, hence it needs its own Non-Prod environment.

Page 25: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: The Need for an IT Cluster

A special-purpose cluster
• More than a Management Cluster: it runs non-management VMs that are not owned by the Business. Examples:
  • Active Directory
  • File Server
  • Email & Collaboration (in the Large example, this might warrant its own cluster)
• Running all the IT VMs used to manage the virtual DC or provide core services.
• The central management will reside here too.
• Separated for ease of management & security.

The next page shows the list of VMs that reside on the IT Cluster. Each line represents a VM.
• This is for the Production site. The DR site will have a subset of this, except for vCloud Director, which is only deployed on the DR site.

Explanation of some of the servers below:
• Security Management Server = a VM to manage security (e.g. TrendMicro Deep Security)

This separation keeps the Business clusters clean, "strictly for business".

Page 26: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: IT Cluster (part 1)

The table provides samples of VMs that should run on the IT cluster.

4 ESXi hosts should be sufficient, as most VMs are not demanding; they are mostly management tools.
• The relatively more demanding VMs are the vCenter Operations ones.

There are many databases here. Standardise on 1.

I would not put these databases together with DBs running business workloads. Keep Business and IT separate.

Category: Base Platform
• vCenter (for Server Cloud) - active node
• vCenter (for Server Cloud) - passive node
• vCenter (for Server Cloud) DB - active node
• vCenter (for Server Cloud) DB - passive node
• vCenter Web Server
• vCenter Inventory Server
• vCenter SSO Server x2 with Global HA
• vCenter Heartbeat
• Auto-Deploy + Authentication Proxy (1 per vCenter)
• vCenter Update Manager + DB (1 per vCenter)
• vCenter Update Manager Download Service (in DMZ)
• Auto-Deploy + vSphere Authentication Proxy
• vCloud Director (Non-Prod) + DB
• Certificate Server

Category: Storage
• Storage Mgmt tool (needs a physical RDM to get fabric info)
• VSA Manager
• Backup Server

Category: Network
• Network Management Tool (needs a lot of bandwidth)
• Nexus 1000V Manager (VSM) x 2

Category: Sys Admin Tools
• Admin client (1 per Sys Admin) with PowerCLI
• VMware Converter
• vMA (Management Assistant)
• vCenter Orchestrator + DB

Page 27: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large Cloud: IT Cluster (page 2)

Continued from the previous page.

IT apps that are not in this cluster:
• View Security Servers. These servers reside in the DMZ zone and are directly accessible from the Internet. Putting them in the management cluster means the management cluster needs to support an Internet-facing network.

Category: Application Mgmt
• AppDirector
• Hyperic

Category: Advanced vDC Services (Security, Availability)
• Site Recovery Manager + DB
• SRM Replication Mgmt Server + DB
• vSphere Replication Servers (1 per 1 Gbps of bandwidth, 1 per site)
• AppHA Server (e.g. Symantec)
• Security Management Server (e.g. TrendMicro Deep Security)
• vShield Manager

Category: Management (Performance, Capacity, Configuration)
• vCenter Operations Enterprise (2 VMs)
• vCenter Infrastructure Navigator
• vCloud Automation Center (5 VMs)
• VCM: Web + App + DB (3 VMs)
• Chargeback + DB, Chargeback Data Collector (2)
• Help Desk system
• CMDB
• Change Management system

Category: Desktop as a Service
• View Managers + DB
• View Security Servers (sitting in the DMZ zone!)
• ThinApp Update Server
• vCenter (for Desktop Cloud) + DB
• vCenter Operations for View
• Horizon Suite
• Mirage Server

Category: Core Infra
• MS AD 1, AD 2, AD 3, etc.
• DNS, DHCP, etc.
• Syslog server + core dump server
• File Server (FTP Server) for IT
• File Server (FTP Server) for Business (multiple)
• Print Server

Category: Core Services
• Email & Collaboration

Page 28: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Cluster Size

I recommend 6-10 nodes per cluster, depending on the Tier. Why not 4 or 12 or 16 or 32?
• A balance between too small (4 hosts) and too large (>12 hosts).
• DRS: 8 gives DRS sufficient hosts to "maneuver". 4 is rather small from the DRS scheduler's point of view.
• With the "sub-cluster" ability introduced in 4.1, we can get the benefit of a small cluster without creating one.

Best practice for a cluster is the same hardware spec with the same CPU frequency.
• Eliminates the risk of incompatibility.
• Complies with Fault Tolerance & VMware View best practices.
• So more than 8 means it is more difficult/costly to keep them all the same: you need to buy 8 hosts at a time, and upgrading >8 servers at a time is expensive ($$) and complex. A lot of VMs are impacted when you upgrade >8 hosts.

Manageability
• Too many hosts are harder to manage (patching, performance troubleshooting, too many VMs per cluster, HW upgrades).
• It allows us to isolate 1 host for VM-troubleshooting purposes. At 4 nodes, we can't afford such "luxury".
• VM restart priority is simpler when you don't have too many VMs.

Too many paths to a LUN can be complex to manage and troubleshoot
• Normally, a LUN is shared by 2 clusters, which are "adjacent" clusters.
• 1 ESXi host is 4 paths, so 8 hosts is 32 paths and 2 clusters is 64 paths. This is a rather high number (if you compare with the physical world).

N+2 for Tier 1 and N+1 for others
• With 8 hosts, you can withstand 2 host failures if you design for it.
• At 4 nodes, it is too expensive, as the payload is only 50% at N+2.

Small cluster sizes
• In a lot of cases, the cluster size is just 2-4 nodes. From an availability and performance point of view, this is rather risky.
• Say you have a 3-node cluster: you are doing maintenance on Host 1 and suddenly Host 2 goes down. You are exposed with just 1 node. Assuming HA Admission Control is enabled (which it should be), the affected VMs may not even boot. When a host is placed into maintenance mode, or disconnected for that matter, it is taken out of the admission control calculation.
• Cost: too few hosts results in overhead (the "spare" host).
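As a minimal PowerCLI sketch of the N+2 point above: enabling HA with admission control and a host-failures-tolerated policy on a Tier 1 cluster. The cluster name is an assumption; a failover level of 2 corresponds to N+2.

  Set-Cluster -Cluster "Prod-Tier1" -HAEnabled:$true `
      -HAAdmissionControlEnabled:$true -HAFailoverLevel 2 -Confirm:$false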

Page 29: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Small Cloud: Cluster Design

Page 30: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Small Cloud: Design Limitations

It is important to clearly document the design limitations.
• It is perfectly fine for a design to have limitations; after all, you have a limited budget.
• Inform the CIO and the Business clearly of the limitations.

It is based on vSphere Standard edition
• No Storage vMotion
• No DRS and DPM
• No Distributed Switch
• Can't use 3rd-party multi-pathing

Does not support MSCS
• Veritas VCS does not have this restriction.
• vSphere 5.1 only supports FC (for MSCS) for now, and I use iSCSI in this design.
• For a 30-server environment, HA with VM monitoring should be sufficient.
  • In vSphere 5.1 HA, a script can be added that checks whether the application (service) is active on its given port/socket.
  • Alternatively, a script within the Guest OS checks whether the process is up; if not, it sends an alert.

Only 1 cluster in the primary data center
• Production, DMZ and IT all run on the same cluster.
• Networks are segregated as they use different networks (VLANs).
• Storage is separated as they use different datastores.

Page 31: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Small Cloud: Scaling to 100 VMs

The next slide shows an example where the requirement is for 100 VMs instead of 50.
• We have 7 hosts in DC 1 instead of 3 hosts.
• We have 3 hosts in DC 2 instead of 2 hosts.

Only 1 cluster in the primary data center
• Production, DMZ and IT all run on the same cluster.
• Networks are segregated as they use different networks (VLANs).
• Storage is separated as they use different datastores.

Since we have more hosts, we can do sub-clusters. We will place the following as sub-clusters:
• Hosts 1-2: Oracle BEA sub-cluster
• Hosts 6-7: Oracle DB sub-cluster
• Production is a soft sub-cluster, so a host failure means it can use Hosts 1-2 too.

Complex affinity and Host/VM rules
• Be careful in designing VM anti-affinity rules.
• We are using group affinity as we have sub-clusters, so we have extra constraints.

Page 32: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Small Cloud: Scaling to 100 VMs

Certainly, there can be variations. 2 are described below.

If we can add 1 more ESXi host, we can create 2 clusters of 4 nodes each.
• This will simplify the affinity rules.

We can use 1-socket ESXi hosts instead of 2-socket.
• Saves on VMware licences.
• Extra cost on servers.
• Extra cooling/power operational cost.

[Figure: cluster layout showing the Oracle BEA and Oracle DB sub-clusters and the rest of the VMs, with separate DMZ, Production and Management LANs.]

Page 33: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Small Cloud: Scaling to 150 VMs

We have more "room" to design, but it is still too small.
• Production needs 7 hosts.
• IT needs 2 hosts.
• DMZ needs 2 hosts.

Putting IT with DMZ is a design trade-off
• vShield is used to separate IT and DMZ.
• If the above is deemed not enough, we can add VLANs.
• If it is still not enough, use different physical cables or switches.
• The more you separate physically, the more you defeat the purpose of virtualisation.

[Figure: cluster layout showing the Oracle BEA and Oracle DB sub-clusters, the rest of the VMs, and the Production LAN.]

Page 34: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Small: Scaling to 150 VM

[Figure, Physical Data Center 1: a Production Cluster (7 ESXi), a DMZ Cluster and IT Cluster (4 ESXi), and Desktop Cluster 1 to Desktop Cluster N (8 ESXi each), managed by vCenter 1 and vCenter 2 with Linked Mode enabling a global view. The management VMs for desktops reside in the IT Cluster. SRM enables DR. Storage: FC Storage on the SAN fabric and NFS Storage on the NFS LAN.]

Page 35: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Overall Architecture

Page 36: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Page 37: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Datacenter and Cluster

In our design, we will have 2 Datacenters only.
• Separating the IT Cluster from the Business Clusters.
• Certain objects can go across clusters, but not across Datacenters:
  • You can vMotion from one cluster to another within a Datacenter, but not to another Datacenter.
  • Networking: Distributed Switch, VXLAN and vShield Edge can't go across Datacenters as at vCloud Suite 5.1.
  • Datastore names are per Datacenter. So network and storage are per Datacenter.
• You can still clone a VM within a Datacenter and to a different Datacenter.

Page 38: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Cluster Design

Page 39: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Tiered Cluster

The 3 tiers become the standard offering that the Infra team provides to the app teams.
• If Tier 3 is charged at $X per VM, then Tier 2 is priced at 2x and Tier 1 at 4x.
• App teams can then choose based on their budget.
• Cluster size varies, depending on criticality. A test/dev cluster might have 10 nodes, while a Tier 0 cluster might have just 2 nodes.

The server cluster also maps 1:1 to the storage cluster.
• This keeps things simple.
• If a VM is so important that it is on a Tier 1 cluster, then its storage should be on a Tier 1 storage cluster too.

This excludes Tier 0, which is special and handled per application.
• Tier 0 means the cost of infra is very low relative to the value & cost of the apps to the business.

Tier "SW" is a dedicated cluster running a particular software.
• Normally this is Oracle, MS SQL or Exchange. While we can have "sub-clusters", it is simpler to dedicate an entire cluster.

Tier 1: always 6 hosts; node spec always identical; failure tolerance 2 hosts; MSCS: Yes; max 25 VMs; monitoring at application level with extensive alerts. Only for critical apps; no resource overcommit.

Tier 2: 4-8 hosts; node spec maybe identical; failure tolerance 1 host; MSCS: Limited; max 75 VMs. App can be vMotioned to Tier 1 during a critical run.

Tier 3: 4-10 hosts; node spec not identical; failure tolerance 1 host; MSCS: No; max 150 VMs; monitoring at infrastructure level with minimal alerts. Some resource overcommit.

SW: 2-10 hosts; node spec maybe identical; failure tolerance 1-3 hosts; MSCS: No; max 25 VMs; application-specific monitoring. Runs expensive software; Oracle and SQL are the norm, as part of DB as a Service.

Page 40: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Example 1

Goal: provide 500 Prod VMs and 1000 Non-Prod VMs.

Production Cloud (Tier | Type | # VM | # Datastores | Nett cores | Nett RAM | # ESXi):
1 | Standard | 30 | 3 | 56 cores | 360 GB | 6
1 | Standard | 30 | 3 | 56 cores | 360 GB | 6
1 | Standard | 30 | 3 | 56 cores | 360 GB | 6
1 | Large | 30 | 3 | 152 cores | 740 GB | 6
2 | Standard | 75 | 4 | 98 cores | 630 GB | 8
2 | Standard | 75 | 4 | 98 cores | 630 GB | 8
2 | Standard | 75 | 4 | 98 cores | 630 GB | 8
2 | Large | 75 | 4 | 266 cores | 1295 GB | 8
2 | Large | 75 | 4 | 266 cores | 1295 GB | 8
Total | | 495 | 32 | 1146 cores | 6300 GB | 64

Non-Production Cloud (Tier | Type | # VM | # Datastores | Nett cores | Nett RAM | # ESXi):
2 | Standard | 75 | 4 | 98 cores | 630 GB | 8
2 | Standard | 75 | 4 | 98 cores | 630 GB | 8
2 | Large | 75 | 4 | 266 cores | 1295 GB | 8
3 | Standard | 150 | 5 | 126 cores | 810 GB | 10
3 | Standard | 150 | 5 | 126 cores | 810 GB | 10
3 | Standard | 150 | 5 | 126 cores | 810 GB | 10
3 | Standard | 150 | 5 | 126 cores | 810 GB | 10
3 | Large | 150 | 5 | 342 cores | 1665 GB | 10
Total | | 975 | 37 | 1308 cores | 7460 GB | 74

Average VM size (with over-subscription): Production 3.4 vCPU and 19 GB vRAM per VM; Non-Production 2.6 vCPU and 15 GB vRAM per VM.

Consolidation ratio: Production 8:1; Non-Production 13:1.

As you scale beyond 1000 VMs, keep in mind the number of clusters & hosts. As you scale beyond 10 clusters, consider using 4-socket hosts. This example does have a Large VM cluster, which is an exception cluster; a Large VM in this case is >8 vCPU and >64 GB vRAM.

Page 41: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Example 2

Same goal as the previous example, but we're going for a higher consolidation ratio (and hence using a 40-core box).

Production Cloud (Tier | # VM | # Datastores | Nett cores | Nett RAM | # ESXi):
1 | 60 | 12 | 152 cores | 1000 GB | 6
2 | 150 | 8 | 266 cores | 1750 GB | 8
2 | 150 | 8 | 266 cores | 1750 GB | 8
2 | 150 | 8 | 266 cores | 1750 GB | 8
Total | 510 | 36 | 950 cores | 6250 GB | 30

Non-Production Cloud (Tier | # VM | # Datastores | Nett cores | Nett RAM | # ESXi):
2 | 150 | 8 | 266 cores | 1750 GB | 8
3 | 300 | 10 | 342 cores | 2250 GB | 10
3 | 300 | 10 | 342 cores | 2250 GB | 10
3 | 300 | 10 | 342 cores | 2250 GB | 10
Total | 1050 | 38 | 1292 cores | 8500 GB | 38

Consolidation ratio: Production 17:1; Non-Production 28:1.

Page 42: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Large: Example Pod (with Converged Hardware)

[Figure: a Pod of 2 racks (Rack 1 and Rack 2), each 42 RU, containing a Converged Block (Compute + Storage, 32 RU), a Network Block (5 RU) and a Management Block (2 RU). The IT Cluster is a 4-node cluster.]

Network Block per rack: 4x 48-port 10 GE switches (192 ports in total) and 1x 48-port 1 GE switch (for management).

Each ESXi host has:
• 4x 10 GE for network and storage
• 1x 1 GE for iLO
• 1x Flash for performance
• 2x SSD for performance
• 4x SAS for capacity

Total port requirements per rack:
• 34 x 4 = 136 10 GE ports
• 34 x 1 = 34 1 GE ports
• ISL & uplinks = 6 GE ports

Total compute per Pod: 2 racks x 32 hosts x 16 cores = 1024 cores.

Page 43: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Resource Pool: Best Practices

What they are not
• A way to organise VMs. Use folders for this.
• A way to segregate admin access to VMs. Use folders for this.

For a Tier 1 cluster, where all the VMs are critical to the business
• Architect for Availability first, Performance second.
  • Translation: do not over-commit.
  • So resource pools, reservations, etc. are immaterial, as there is enough for everyone.
• But size each VM accordingly. No oversizing, as it might actually slow the VM down.

For a Tier 3 cluster, use them carefully
• Tier 3 = overcommit.
• Use Reservation sparingly, even at the VM level.
  • It guarantees resources, so it impacts the cluster slot size.
  • Naturally, you can't boot additional VMs if your guarantee is fully used.
  • Take note of the extra complexity in performance troubleshooting.
• Use resource pools as a mechanism to reserve at the "group of VMs" level.
  • If Department A pays for half the cluster, then creating an RP with 50% of the cluster resources will guarantee them that resource in the event of contention. They can then put in as many VMs as they need.
  • But as a result, you cannot overcommit at the cluster level, as you have guaranteed at the RP level.
• Introduce a scheduled task which sets the shares per resource pool based on the number of VMs/vCPUs they contain.
  • E.g. a PowerCLI script which runs daily and takes corrective action (a sketch follows at the end of this slide). Just google it.

Don't put VMs and resource pools as "siblings" at the same level. See my Resource Management slides for details.
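A minimal sketch of the scheduled-task idea above: a PowerCLI loop that sets each resource pool's shares in proportion to the vCPUs and RAM of the VMs it contains. The cluster name and the share multipliers are assumptions; shares only take effect under contention.

  foreach ($rp in Get-Cluster "Prod-Tier3" | Get-ResourcePool | Where-Object { $_.Name -ne "Resources" }) {
      $vms = @(Get-VM -Location $rp)
      if ($vms.Count -eq 0) { continue }
      $vcpus = ($vms | Measure-Object -Property NumCpu   -Sum).Sum
      $ramGB = ($vms | Measure-Object -Property MemoryGB -Sum).Sum
      Set-ResourcePool -ResourcePool $rp `
          -CpuSharesLevel Custom -NumCpuShares ($vcpus * 1000) `
          -MemSharesLevel Custom -NumMemShares ([int]($ramGB * 100)) -Confirm:$false
  }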

Page 44: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


VM-level Reservation & Limit

CPU reservation
• Guarantees a certain level of resources to a VM.
• Influences admission control (power-on).
• CPU reservation isn't as bad as often stated:
  • A CPU reservation doesn't claim the CPU when the VM is idle (it is "refundable").
• CPU reservation caveat: a CPU reservation does not always equal priority.
  • If other VMs are actively using the processors that the "reserved" VM is claiming, the reserved VM still has to wait until those threads/tasks are finished.
  • Active threads can't simply be de-scheduled, or you risk a Blue Screen / Kernel Panic.

Memory reservation
• Guarantees a certain level of resources to a VM.
• Influences admission control (power-on).
• Memory reservation is as bad as often stated: "non-refundable" once allocated.
  • Windows zeroes out every bit of memory during startup…
• Memory reservation caveats:
  • Will drop the consolidation ratio.
  • May waste resources (idle memory can't be reclaimed).
  • Introduces higher complexity (capacity planning).

Do not configure high CPU or RAM and then use a Limit
• E.g. configuring 4 vCPU and then using a limit to make it behave like "2" vCPU.
• It can result in unpredictable performance, as the Guest OS does not know about the limit.
• High vCPU or high RAM also has a higher overhead.
• A Limit is used when you need to force a VM to slow down; using Shares won't achieve the same result.
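If a full memory reservation is genuinely needed (e.g. for a latency-sensitive Tier 1 VM), here is a minimal PowerCLI sketch; the VM name is an assumption, and the consolidation-ratio caveats above still apply.

  $vm = Get-VM "Tier1-DB01"
  $vm | Get-VMResourceConfiguration |
      Set-VMResourceConfiguration -MemReservationMB $vm.MemoryMB   # reserve all configured RAM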

Page 45: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Fault Tolerance

Design considerations
• Still limited to 1 vCPU in vSphere 5.1.
• FT impacts Reservation: it will auto-reserve at 100%.
• The reservation impacts HA Admission Control, as the slot size is bigger.
• HA does not check the slot size nor actual utilisation when booting up; it checks the Reservation of the affected VM.
• FT impacts Resource Pools: make sure the RP includes the RAM overhead.
• Cluster size is minimum 3, recommended 4.
• Tune the application and Windows HAL to use 1 CPU.
  • In Win2008 this no longer matters [e1: need to verify].

General guide
• Assuming a 10:1 consolidation ratio, I'd cap FT usage at just 10% of Production VMs.
• So 80 VMs means around 8 ESXi hosts, which means around 8 FT VMs.
• This translates to 1 Primary VM + 1 Secondary VM per host.
• Small cluster sizes (<5 nodes) are more affected when there is an HA event. See the picture for a 3-node example.

Limitations
• Turn off FT before doing Storage vMotion.
• FT protects the infra, not the app. Use Symantec ApplicationHA to protect the app.

Page 46: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Branch or Remote Sites

Some small sites may not warrant their own vCenter.
• There is no expertise to manage it either.
• Consider the vSphere Essentials Plus ROBO edition. You need 10 sites for the best financial return, as it is sold in units of 10.

Features that are network-heavy should be avoided.
• Auto Deploy means sending around 150 MByte. If the link is a shared 10 Mbit, it adds up.

Best practices
• Install a copy of the template at the remote site. If not, use OVF, as it is compressed.
• Increase vCenter Server and vSphere host timeout values to ~3 hours.
• Consider manual vCenter agent installs prior to connecting ESXi hosts.
• Use RDP/SSH instead of the Remote Console for VM console access.
  • If absolutely needed, reduce the remote console display to smaller values, e.g. 800x600/16-bit.

vCenter 5.1 improvements over 4.1 for remote ESXi
• Use the Web Client if vCenter is remote, as it uses less bandwidth.
• No other significant changes.

Certain vCenter operations involve a heavier payload
• E.g. Add Host, vCenter agent upgrades, HA enablement, Update Manager based host patching.

Page 47: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Server Design
ESXi Host

Page 48: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


Approach

General guidelines as at Q3 2013:
• Use 2 sockets, 16-20 cores, Xeon 2820 with 128 GB RAM.
• For large VMs, use 4 sockets, 40 cores, Xeon 4820 with 256 GB RAM.
• 8 GB RAM per core: a 12-core ESXi box should have 96 GB. This should be enough to cater for VMs with large RAM.

Considerations when deciding the size of the ESXi host
• Look at overall cost, not just the cost of the ESXi host: cost of network equipment, cost of management, power cost, space cost.
• A larger host can take larger VMs or more VMs per host.
• Think of the cluster, not 1 ESXi host, when sizing the ESXi host. The cluster is the smallest logical building block in this Pod approach.
• Plan for 1 fiscal year, not just the next 6 months.
  • You should buy hosts per cluster. This ensures they are from the same batch.
  • Standardising the host spec makes management easier.
• Know the number of VMs you need to host and their sizes.
  • This gives you an idea of how many ESXi hosts you need.
• Define 2 VM sizings: Common and Large.
  • If your largest VM needs >8 cores, go for a >8-core pCPU. Ideally, a VM should fit inside a socket to minimise the NUMA effect. This happens in the physical world too.
  • If your largest VM needs 64 GB of RAM, then each socket should have 72 GB; I consider the RAM overhead. Note that extra RAM = slower boot: ESXi creates a swap file that matches the unreserved RAM size (for example, a 64 GB VM with a 16 GB reservation gets a 48 GB .vswp file). You can use reservations to reduce this, as long as you use the "% based" admission control policy in the cluster setting.
• The ESXi host should be >2x the largest VM.

Decide: Blade or Rack or Converged.

Decide: IP or FC storage.
• If you use Converged, then it's either NFS or iSCSI.

Page 49: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


ESXi Host: CPU

CPU performance has improved drastically.
• Something like 1800%.

No need to buy the highest-end CPU, as the premium is too high. Use the savings to buy more hosts instead, unless:
• the number of hosts is becoming a problem
• you need to run high-performance single-threaded workloads
• you need to run more VMs per host.

The 2 tables below show VMmark results.
• The first table shows the improvement from 2005-2010, based on VMmark 1.x.
• The second table shows 2010 to May 2013, based on VMmark 2.x.

Page 50: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


ESXi Host: CPU Sizing

Buffer for the following:
• Agent VMs or vmkernel modules:
  • Hypervisor-based firewall such as vShield App
  • Hypervisor-based IDS/IPS such as TrendMicro Deep Security
  • vSphere Replication
  • Distributed Storage
• HA events.
• Performance isolation.
• Hardware maintenance.
• Peaks: month end, quarter end, year end.
• Future requirements within the same fiscal year.
• DR, if your cluster needs to run VMs from the Production site.

The table below is before we take HA into account, so it is purely from a performance point of view.
• When you add the HA host, the day-to-day ratio will drop, so utilisation will be lower as you have a "spare" host.
• Doing 2 vCPU per physical core is around 1.6x over-subscription, as there is a benefit from Hyper-Threading.

Tier | vCPU ratio | VM ratio | Total vCPU (2 sockets, 16 cores) | Average VM size

Tier 1 | 2 vCPU per core | 5:1 | 32 vCPU | 32/5 = 6.4 vCPU each

Tier 2 | 4 vCPU per core | 10:1 - 15:1 | 64 vCPU | 64/10 = 6.4 vCPU each

Tier 3 or Dev | 6 vCPU per core | 20:1 - 30:1 | 96 vCPU | 96/30 = 3.2 vCPU each

Page 51: Software Defined Datacenter Sample  Architectures based on vCloud Suite 5.1 Singapore, Q2 2013


UNIX X64 migration: Performance SizingWhen migrating from UNIX to X64, we can use industry standard benchmark where both platforms participate. Benchmarks like SAP and SPEC are established benchmark, so we can easily get data from older UNIX machines (which are common source of migration as they have reached 5 years and hence have high maintenance cost).

Based on SPEC-int2006 rate benchmark results published July 2012:

HP Integrity Superdome (1.6GHz/24MB Dual-Core Intel Itanium 2) 128 cores • SPEC-int2006 rate result: 1650

Fujitsu / Oracle SPARC Enterprise M9000 256 cores • SPEC-int2006 rate result: 3150

IBM Power 780 (3.44 GHz, 96 cores) • SPEC-int2006 rate result: 3520 • IBM's per-core result is higher than X64 as it uses an MCM module. On the Power series, CPU and software are priced per core, not per socket.

Bull Bullion E7-4870 (160 cores, 4 TB RAM) • SPEC-int2006 rate result: 4110

Sizing of RAM, disk and network is much easier, as we can ignore the speed/generation and simply match it. For example, if the UNIX app needs 7000 IOPS and 100 GB of RAM, we simply match that. The higher speed of the RAM is a bonus.

With Flash and SSD, IOPS is no longer the main concern. vCPU is the main factor, as a UNIX partition can be large (e.g. 48 cores) and we need to reduce the vCPU count.
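To turn the published SPECint_rate2006 figures into a rough host count, divide the UNIX system's rate by the rate of your chosen x64 host. The sketch below assumes a hypothetical 2-socket x64 host rated around 700; substitute the published SPEC result for the actual server model you are buying.

```python
# Hedged sizing sketch: how many x64 hosts deliver the integer throughput of a UNIX box.
# UNIX figures are the published results quoted above; X64_HOST_RATE is an assumption.

unix_systems = {
    "HP Superdome (128-core Itanium 2)": 1650,
    "Fujitsu/Oracle M9000 (256 cores)":  3150,
    "IBM Power 780 (96 cores)":          3520,
    "Bull Bullion E7-4870 (160 cores)":  4110,
}

X64_HOST_RATE = 700  # assumption: SPECint_rate2006 of one 2-socket x64 host (check the real result)

for name, rate in unix_systems.items():
    print(f"{name}: ~{rate / X64_HOST_RATE:.1f} x64 hosts of integer throughput")
```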


ESXi Host: RAM Sizing

How much RAM? It depends on the core count in the previous slide.
• Not so simple anymore; each vendor is different.
• An 8 GB DIMM is cheaper than 2x 4 GB DIMMs.
• 8 GB per core, so 12 cores means around 96 GB.
• Consider the memory-channel best practice: don't leave channels empty. Populating them evenly brings the benefits of memory interleaving.
• Check with the server vendor on the specific model:
  • Some models now come with 16 slots per socket, so you might be able to use a lower DIMM size.
  • Some vendors, like HP, price 4 GB and 8 GB DIMMs similarly.
  • Dell R710 has 18 DIMM slots (?)
  • IBM x3650 M3 has 18 DIMM slots
  • HP DL 360/380 G8 has 24 DIMM slots
  • HP DL 380 G7 and BL490c G6/G7 have 18 DIMM slots
  • Cisco has multiple models; the B200 M3 has 24 slots.

VMkernel has a Home Node concept on NUMA systems. For ideal performance, fit a VM within one CPU-RAM "pair" to avoid the "remote memory" effect (see the sketch below):
• # of vCPUs + 1 <= # of cores in 1 socket. So running a 5 vCPU VM on a quad-core socket will force a remote-memory situation.
• VM memory <= memory of one NUMA node.
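A quick way to apply the two fit rules is shown below; it is only an illustration, with cores and RAM per socket taken from your host specification.

```python
# Minimal sketch of the two NUMA "home node" fit rules above. Inputs are illustrative.

def fits_in_numa_node(vm_vcpu, vm_ram_gb, cores_per_socket, ram_per_node_gb):
    """Return True if the VM fits inside one CPU-RAM pair (no remote memory expected)."""
    cpu_ok = (vm_vcpu + 1) <= cores_per_socket   # rule 1: vCPUs + 1 <= cores per socket
    ram_ok = vm_ram_gb <= ram_per_node_gb        # rule 2: VM memory <= one node's memory
    return cpu_ok and ram_ok

# A 5 vCPU VM on a quad-core socket breaks the rule (remote memory likely).
print(fits_in_numa_node(vm_vcpu=5, vm_ram_gb=16, cores_per_socket=4, ram_per_node_gb=64))  # False
print(fits_in_numa_node(vm_vcpu=6, vm_ram_gb=48, cores_per_socket=8, ram_per_node_gb=64))  # True
```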

Turn on Large Pages, especially for Tier 1.
• Needs application-level support.

(Diagram: two NUMA nodes, 64 GB of RAM each.)


ESXi Host: IO & Management

IO requirements will increase in 2014. The table below provides an estimate.
• It is a prediction based on tech previews and VMworld 2012. Actual results may vary.
• Converged Infrastructure needs high bandwidth.

IO card
• I personally prefer 4x 10 GE NICs.
• Not supported: mixing hardware iSCSI and software iSCSI.

Management
• Lights-out management, so you don't have to be in front of the physical server to do certain things (e.g. go into the CLI as requested by VMware Support).
• Ensure the hardware agent is properly configured. It is very important to monitor hardware health when many VMs run in 1 box.

PCI slots on the motherboard
• Since we are using 8 Gb FC HBAs, make sure the physical PCI-E slot has sufficient bandwidth.
• A single dual-port FC HBA makes more sense if the saving is high and you need the slot, but there is a risk of bus failure. Also double-check that the chip can handle the throughput of both ports.
• If you are using blades and have to settle for a single 2-port HBA (instead of two 1-port HBAs), ensure the PCI slot has bandwidth for 16 Gb. When using a dual-port HBA, ensure the chip & bus in the HBA can handle the peak load of 16 Gb.

Purpose             | Bandwidth | Remarks
VM                  | 4 Gb      | For ~20 VM. The vShield Edge VM needs a lot of bandwidth as all traffic passes through it.
FT                  | 10 Gb     | Based on a VMworld 2012 presentation.
Distributed Storage | 10 Gb     | Based on the Tech Preview in 5.1 and Nutanix.
vMotion             | 8 Gb      | vSphere 5.1 introduces shared-nothing live migration. This increases demand, as the vmdk is much larger than the vRAM. Include multi-NIC vMotion for faster vMotion when there are multiple VMs to be migrated.
Management          | 1 Gb      | Copying a powered-off VM to another host without a shared datastore takes this bandwidth.
IP Storage          | 6 Gb      | NFS or iSCSI. Not the same as Distributed Storage, as DS is not serving VM. No need for 10 Gb as the storage array is likely shared by 10-50 hosts; the array may only have 40 Gb total for all these hosts.
vSphere Replication | 1 Gb      | Should be sufficient as the WAN link is likely the bottleneck.
Total               | 40 Gb     |

Estimated ESXi IO bandwidth in early 2014
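As a sanity check, the per-function figures in the table should add up to the 40 Gb that 4x 10 GE ports provide. A trivial sketch:

```python
# Sanity check of the bandwidth estimate above; figures are taken from the table.
bandwidth_gb = {
    "VM": 4, "FT": 10, "Distributed Storage": 10, "vMotion": 8,
    "Management": 1, "IP Storage": 6, "vSphere Replication": 1,
}
total = sum(bandwidth_gb.values())
print(f"Total: {total} Gb vs 4 x 10 GE = 40 Gb")   # Total: 40 Gb
```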


Large: Sample ESXi Host Specification

Estimated hardware cost: US$ 10K per ESXi.

Configuration included in the above price:
• 2x Xeon X5650. (The E series has different performance & price attributes.)
• 128 GB RAM
• 4x 10 GE NIC. No HBA.
• 2x 100 GB SSD, used for:
  • the swap-to-host-cache feature in ESXi 5
  • running agent VMs that are IO intensive
  • troubleshooting; it can be handy, and only 1 is needed for that purpose.
• No installation disk. We will use Auto Deploy, except for the management cluster.
• Lights-out management. Avoid using WoL; use IPMI or HP iLO.

Costs not yet included:
• LAN switches. Around S$15K for a pair of 48-port GE switches (total 96 ports).
• SAN switches.


Blade or Rack or Converged

Both blade and rack are good; both have pros and cons. The table below is a relative comparison, not an absolute one.
• Consult the principal for specific models. Below is just a guideline.

The comparison below is only for vSphere purposes, not for other use cases such as HPC or non-VMware workloads. There is a third choice, converged infrastructure; an example is Nutanix.

Blade – relative advantages:
• Some blades come with built-in 2x 10 GE ports. To use them, you just need a 10 GE switch.
• Less cabling, fewer problems.
• Easier to replace a blade.
• Better power, rack-space and cooling efficiency. The larger fan (4 RU) is better than the small fan (2 RU) used in rack servers.
• Some blades can be stateless. The management software can clone 1 ESXi to another.
• Better management, especially when you have many ESXi hosts.

Rack – relative advantages:
• A typical 1RU rack server normally comes with 4 built-in ports.
• Better suited for <20 ESXi per site.
• More local storage.

Blade – relative disadvantages:
• More complex management, both on the switch and the chassis. Proprietary too: you need to learn the rules of the chassis/switches, and switch placement matters in some models.
• Some blades virtualise the 10 GE NIC and can slice it. This adds another layer and more complexity.
• Some replacements or major upgrades may require the entire chassis to be powered off.
• Some have only 2 PCI slots. This might not suffice if you need >20 GE per ESXi.
• Best practice recommends 2 enclosures. The enclosure is passive; it does not contain electronics.
• There is an initial cost, as each chassis/enclosure needs 2 switches.
• Ownership of the SAN/LAN switches in the chassis needs to be made clear.
• The common USB port in the enclosure may not be accessible by ESXi. Check with the respective blade vendor.
• A USB dongle (which you should not use) can only be mounted in front. Make sure it's short enough that you can still close the rack door.

Rack – relative disadvantages:
• The 1 RU rack server has a very small fan, which is not as good as a larger fan.
• Less suited when each DC is big enough to have 2 chassis.
• Cabling & rewiring.


ESXi Boot Options

4 methods of ESXi boot:
• Need installation:
  • Local Compact Flash
  • Local Disk
  • SAN Boot
• No installation needed:
  • LAN Boot (PXE) with Auto Deploy

Auto Deploy
• Environments with >30 ESXi should consider Auto Deploy. Best practice is to put the IT Cluster on non-Auto-Deploy.
• An ideal ESXi is just pure CPU and RAM: no disk, no PCI card, no identity.
• Auto Deploy is also good when you need to prove to the security team that your ESXi has not been tampered with (you can simply reboot it and it is back to "normal").
• Centralised image management.
• Consideration when the Auto Deploy infrastructure components are themselves VMs: keep the IT cluster on local install.

Advantages of Local Disk relative to SAN Boot
• No SAN complexity.
• Avoids having to label the boot LUN properly.

Disadvantages of Local Disk relative to SAN Boot
• Need 2 local disks, mirrored.
• Certain organisations do not like local disk.
• Disk is a moving part, so lower MTBF. Going diskless saves power/cooling.


Storage Design


Methodology

• Most app teams will not know their IOPS and latency requirements. Make this part of Storage Tiering, so they consider the bigger picture.
• Turn on Storage IO Control. Storage IO Control is per datastore; if the underlying LUN shares spindles with other LUNs, it may not achieve the desired result. Consult your storage vendor on this, as they have whole-array visibility/control.

(Process flow: SLA -> Datastore -> Cluster -> VM input -> Mapping -> Monitor)
• SLA: define the standard (storage profiles / tiers). See the next slide for detail.
• Datastore: define the datastore profile.
• Cluster: map each cluster to a datastore cluster.
• VM input: for each VM, ask the owner to choose the capacity (GB) and which tier they want to buy. Let them decide, as they know their own app.
• Mapping: map each VM to a datastore. Create another datastore if capacity or performance is insufficient.
• Monitor.


SLA: 3 Tiers of Storage Profile

Create 3 tiers of storage with Storage DRS.
• These become the storage pools presented to clusters or VMs.
• Implement VASA so the profiles are automatically presented and compliance checks can be performed.
• Paves the way for standardisation.
• Choose 1 datastore size for each tier. Keep it consistent, and choose an easy number (e.g. 1000 vs 800).
• Tier 3 is also used for non-production.
• Map the ESXi cluster tier to the datastore tier.
  • If a VM is on a Tier 1 Production cluster, it will be placed on a Tier 1 datastore, not a Tier 2 datastore.
  • The strict mapping reduces the number of paths drastically.

Example
• Based on the Large Cloud scenario. The Small Cloud will have a simpler and smaller design.
• "Array Snapshot" means protected with an array-level snapshot for fast restore.
• VMDKs larger than 1.5 TB will be provisioned as RDM.
• RDM will be used sparingly. Virtual-compatibility mode is used unless the App team says otherwise.
• Tier 2 and 3 can have large datastores, as replication is done at the vSphere layer.
• The interface will be FC for all tiers. This means storage vMotion can be done with VAAI.

Consult the storage vendor for array-specific design; I don't think the array has a Shares & Reservation concept for:
• IOPS
• Latency. The array can't guarantee or control latency per tier.

Tier | Price | Min IOPS | Max Latency | RAID | RPO | RTO | Size Limit | Replicated | Method | Array Snapshot | # VM per datastore | Disk Format
1    | 4x/GB | 6000     | 10 ms       | 10   | 15 minutes | 1 hour  | 2 TB | 70% | Array level   | Yes | ~10 VM | Eager Zeroed Thick
2    | 2x/GB | 4000     | 20 ms       | 10   | 2 hours    | 4 hours | 3 TB | 80% | vSphere level | No  | ~20 VM | Normal Thick
3    | 1x/GB | 2000     | 30 ms       | 10   | 4 hours    | 8 hours | 4 TB | 80% | vSphere level | No  | ~30 VM | Thin Provision
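One way to make the tiers consumable is to hold them as a data structure and let the VM owner supply only capacity and tier, as described in the methodology. The sketch below simply encodes the table above; the function name and structure are illustrative, not part of any VMware API.

```python
# A sketch of the 3-tier storage profile as data, assuming the values in the table above.
STORAGE_TIERS = {
    1: {"price": "4x/GB", "min_iops": 6000, "max_latency_ms": 10, "raid": 10,
        "rpo": "15 minutes", "rto": "1 hour", "size_limit_tb": 2,
        "replication": "Array level", "array_snapshot": True,
        "vm_per_datastore": 10, "disk_format": "Eager Zeroed Thick"},
    2: {"price": "2x/GB", "min_iops": 4000, "max_latency_ms": 20, "raid": 10,
        "rpo": "2 hours", "rto": "4 hours", "size_limit_tb": 3,
        "replication": "vSphere level", "array_snapshot": False,
        "vm_per_datastore": 20, "disk_format": "Normal Thick"},
    3: {"price": "1x/GB", "min_iops": 2000, "max_latency_ms": 30, "raid": 10,
        "rpo": "4 hours", "rto": "8 hours", "size_limit_tb": 4,
        "replication": "vSphere level", "array_snapshot": False,
        "vm_per_datastore": 30, "disk_format": "Thin Provision"},
}

def buy_storage(capacity_gb, tier):
    """The VM owner only chooses capacity and tier; the rest follows from the profile."""
    return {"capacity_gb": capacity_gb, "tier": tier, **STORAGE_TIERS[tier]}

print(buy_storage(capacity_gb=200, tier=2)["rpo"])   # '2 hours'
```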


Arrangement Within an Array

Below is a sample diagram showing disk grouping inside an array.
• The array has 48 disks. Hot spares are not shown for simplicity.
• This example only has one RAID Group configuration (2+2), for simplicity.

Design considerations
• Datastore 1 and Datastore 2 can impact each other's performance, as they share physical spindles.
  • Each datastore spans 16 spindles.
  • IOPS is only 2800 max (based on 175 IOPS for a 15K RPM FC disk). Because of RAID, the effective IOPS will be lower.
  • The only way they don't impact each other is if there are "Shares" and "Reservation" concepts at the "meta slice" level.
• Datastores 3, 4, 5 and 6 can impact one another.
• DS 1 and DS 3 can impact each other since they share the same controller (SP). This contention happens if a shared component becomes the bottleneck (e.g. cache, RAM, CPU).
  • The only way to prevent it is to implement "Shares" or "Reservation" at the SP level.

For Storage IO Control to be effective, it should be applied to all datastores sharing the same physical spindles. So if we enable it for Datastore 3, then Datastores 4, 5 and 6 should be enabled too.


Storage IO Control

Storage IO Control is at the datastore level.
• There is no control at the RDM level. ???

But arrays normally share spindles.
• In the example below, the array has 3 volumes. Each volume is configured the same way: 32 spindles in a RAID 10 configuration (8 units of 2+2 disk groups).
• There are non-vSphere workloads sharing the same spindles.

Best practice
• Unless the array has a "Shares" or "Reservation" concept, avoid sharing spindles between Storage Profiles.

(Diagram: Datastore A and Datastore B, each with SIOC enabled.)


Storage DRS, Storage IO Control and the Physical Array

The array is not aware of the VMs inside a VMFS. It only sees LUNs.
• Moving a VM from 1 datastore to another looks like a large IO operation to the array. One LUN decreases in usage, while the other increases drastically.

With an array capable of auto-tiering:
• VMware recommends configuring Storage DRS in Manual Mode with the I/O metric disabled.
• Use Storage DRS for its initial placement and out-of-space avoidance features.
• Whitepaper on Storage DRS interoperability with storage technologies: http://www.vmware.com/resources/techresources/10286

Feature or Product | Initial Placement | Migration Recommendations
Array-based replication (SRDF, MirrorView, SnapMirror, etc.) | Supported | Moving a VM from one datastore to another can cause a temporary lapse in SRM protection (?) and increase the size of the next replication transfer.
Array-based snapshots | Supported | Moving a VM from one datastore to another can increase space usage in the destination LUN, so the snapshot takes longer.
Array-based dedupe | Supported | Moving a VM from one datastore to another can cause a temporary increase in space usage, so the dedupe takes longer.
Array-based thin provisioning | Supported | Supported on VASA-enabled arrays only. [e1: reason??]
Array-based auto-tiering (EMC FAST, Compellent Data Progression, etc.) | Supported | Do not use IO-based balancing. Just use space-based.
Array-based I/O balancing (Dell EqualLogic) | n/a, as it is controlled by the array | Do not use IO-based balancing. Just use space-based.


RAID Type

In this example, I'm using just RAID 10.
• Generally speaking, I see a rather high write ratio (around 50%). RAID 5 results in higher cost, as it needs more spindles.
• More spindles give the impression that we have enough storage. It is difficult to say no to a request when you don't have a storage shortage.
• More spindles also mean you're burning more power and cooling.

vCloud Suite introduces additional IOPS outside the guest OS:
• VM boots result in writing the vRAM-sized swap file to disk.
• Storage vMotion and Storage DRS
• Snapshots

Mixing RAID 5 and RAID 10
• This increases complexity. RAID 5 was used for capacity, but nowadays each disk is huge (2 TB).
• I'd go for mixing SSD and disk rather than mixing RAID types. So it is:
  • SSD RAID 10 for performance & IOPS
  • Disk RAID 10 for capacity
• I'd go for just 2 tiers instead of 3. This minimises movement; each movement costs both a read and a write.

The sample below is based on:
• 150 IOPS per spindle
• a requirement of 1200 IOPS

RAID Level | # Disks required (20% write) | # Disks required (80% write)
6          | 16                           | 40
5          | 13                           | 27 (nearly 2x of RAID 10)
10         | 10                           | 14

RAID Type | Write IO Penalty
5         | 4
6         | 6
10        | 2
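The two tables are tied together by the usual write-penalty formula: backend IOPS = read IOPS + (write IOPS x write penalty), then divide by the IOPS one spindle can deliver. A minimal sketch, using the 150 IOPS/spindle and 1200 IOPS figures from the sample (the sample table rounds to the nearest disk):

```python
# Spindle-count sketch behind the two tables above.
WRITE_PENALTY = {5: 4, 6: 6, 10: 2}   # write IO penalty per RAID type (from the table)
IOPS_PER_SPINDLE = 150                 # 15K RPM disk assumption used in the sample

def disks_required(target_iops, write_ratio, raid_level):
    backend_iops = (target_iops * (1 - write_ratio)
                    + target_iops * write_ratio * WRITE_PENALTY[raid_level])
    return round(backend_iops / IOPS_PER_SPINDLE)   # the sample rounds to the nearest disk

for raid in (6, 5, 10):
    print(raid, disks_required(1200, 0.20, raid), disks_required(1200, 0.80, raid))
# RAID 6: 16 / 40, RAID 5: 13 / 27, RAID 10: 10 / 14 - matching the sample table
```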


Cluster Mapping: Host to Datastore

Always know which ESXi cluster mounts which datastore cluster.
• Keep the diagram simple. Its main purpose is to communicate with other teams, not to hold too much information. The idea is a mental picture that they can understand.
• If your diagram has too many lines, too many datastores or too many clusters, then it is too complex. Create a Pod when that happens. Modularisation makes management and troubleshooting easier.


Mapping: Datastore Replication


Types of Datastores

Types of datastore:
• Business VM
  • Tier 1 VM, Tier 2 VM, Tier 3 VM, Single VM
  • Each tier may have multiple datastores.
• IT VM
• Staging VM
  • From the P2V process, or moving from Non-Prod to Prod.
• Isolated VM
• Template & ISO
• Desktop VM
  • Mostly local datastores on the ESXi host, backed by SSD.
• SRM Placeholder
• Datastore Heartbeat
  • Pro: a dedicated DS means we don't accidentally impact heartbeating while offlining a datastore.
  • Cons: another 2 DS to manage per cluster, and increased scanning time.
  • Can we use the SRM placeholder as heartbeat?

Always know where a key VM is stored. Datastore corruption, while rare, is possible.

1 datastore = 1 LUN
• Relative to "1 LUN = many VMFS", it gives better performance due to less SCSI reservation.

Other guides:
• Use thin provisioning at the array level, not the ESXi level.
• Separate Production and Non-Production, and add a process to migrate into Prod. You can't guarantee Production performance if VMs move in and out without control.
• RAID level does not matter so much if the array has sufficient cache (battery-backed, naturally).
• Keep 20% free capacity for VM swap files, snapshots, logs, thin volume growth, and storage vMotion (inter-tier).


Special-Purpose Datastores

1 low-cost datastore for ISOs and Templates
• Need 1 per vCenter data center.
• Need 1 per physical data center, else you will transfer GBs of data across the WAN.
• Around 1 TB.
• ISO directory structure:
  \ISO\
    \OS\Windows
    \OS\Linux
    \Non OS\  (store things like anti-virus, utilities, etc.)

1 staging/troubleshooting datastore
• To isolate a VM; proof to the Apps team that the datastore is not affected by other VMs.
• For storage performance studies or issues. Makes it easier to correlate with data from the array.
• The underlying spindles should have enough IOPS & size for the single VM.
• Our sizing:
  • Small Cloud: 1 TB
  • Large Cloud: 1 TB

1 SRM Placeholder datastore
• So you always know where it is. Sharing with another datastore may confuse others.
• Used in SRM 5 to place the VM metadata so it can be seen in vCenter.
• 10 GB is enough. Low performance.


Storage Capacity Planning

Theory and reality can differ. Theory is the initial, high-level planning you do; reality is what it becomes after 1-2 years.

Theory, or initial planning
• For greenfield deployments, use the Capacity Planner. The info on actual usage is useful, as utilisation can be low. The IOPS info is a good indicator too.
• For brownfield deployments, use the existing VMs as the indicator. If you have virtualised 70%, that 70% is a good indicator as it is your actual environment.
• You can also use rules of thumb (see the sketch below), such as:
  • 100 IOPS per normal VM. That is low, but it is a concurrent average; with 1000 VMs this is 100K IOPS.
  • 500 IOPS per database VM
  • 20 GB per C:\ drive (or wherever you store OS + Apps)
  • 50 GB per data drive for a small VM
  • 500 GB per data drive for a database VM
  • 2 TB per data drive for a file server
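A minimal sketch applying these rules of thumb to an inventory mix; the VM counts are hypothetical, while the per-VM figures come from the list above.

```python
# Rules-of-thumb capacity estimate. The inventory mix is a made-up example.
inventory = {"normal": 800, "database": 150, "file_server": 50}

iops = (inventory["normal"] * 100          # 100 IOPS per normal VM (concurrent average)
        + inventory["database"] * 500      # 500 IOPS per database VM
        + inventory["file_server"] * 100)

os_gb = sum(inventory.values()) * 20       # 20 GB OS/app drive per VM
data_gb = (inventory["normal"] * 50
           + inventory["database"] * 500
           + inventory["file_server"] * 2000)   # 2 TB per file-server data drive

print(f"~{iops} concurrent average IOPS, ~{(os_gb + data_gb) / 1024:.0f} TB before overhead")
```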

Actual, or reality
• Use a tool such as VC Ops 5.6 for actual measurement.
• VC Ops 5.6 needs to be configured (read: tailored) to your environment.
  • Create custom groups. For each group, adjust the buffer accordingly.
  • You will need at least 3 groups, 1 per tier.
• I'd not use a spreadsheet or rules of thumb for a >100 VM environment.


Multi-Pathing

Different protocols need different solutions.
• NFS, iSCSI and FC each have their own solution.
• NFS uses a single path for a given datastore - no multi-pathing - so use multiple datastores to spread the load.

In this design, I do not go for a high-end array due to cost.
• A high-end array gives Active/Active, so we don't have to do regular load balancing.
• Most mid-range arrays are Active/Passive (ALUA). Always ensure the LUNs are balanced between the 2 SPs. This is done manually within the array.

Choose an ALUA array instead of a plain Active/Passive one.
• Less manual work on balancing and selecting the optimal path.
  • Both controllers can receive IO requests/commands, although only 1 owns the LUN.
  • The path through the managing controller is the optimized path.
• Better utilization of the array storage processors (minimizes unnecessary SP failover).
• vSphere will show both paths as Active, but the preferred one is marked "Active (I/O)".
• Round Robin will issue IO across all optimized paths and will use non-optimized paths only if no optimized paths are available.
• See http://www.yellow-bricks.com/2009/09/29/whats-that-alua-exactly/

Array Type     | My selection
Active/Active  | Round Robin or Fixed
ALUA           | Round Robin or MRU
Active/Passive | Round Robin or MRU

Vendor          | Third-party multipathing option
EMC             | PowerPath/VE 5.4 SP2
Dell EqualLogic | EqualLogic MMP
HP/HDS          | PowerPath/VE 5.4 SP2?


FC: Multi-Pathing

VMware recommends 4 paths.
• A path is point to point. The switch in the middle is not part of the path as far as vSphere is concerned.
• Ideally they are all active-active for a given datastore.
• Fixed means 1 path active, 3 idle.
• 1 zone per HBA port. The zone should see all the target ports.

If you are buying new SAN switches, consider the direction for the next 3 years.
• Whatever you choose will likely be in your data center for the next 5 years.
• If you are buying a director-class switch, plan for the next 5 years. Upgrading a director is major work, so plan for 5 years of usage. Consider both the EOL and EOSL dates.
• Discuss with the SAN switch vendors and understand their roadmaps.
• 8 Gb and FCoE are becoming common.

Round Robin
• It is per datastore, not per HBA.
  • 1 ESXi host typically has multiple datastores.
  • 1 array certainly has multiple datastores.
  • All these datastores share the same SPs, cache, ports, and possibly spindles.
• It is active/passive for a given datastore.
• Leave the default setting of 1000. No need to set iooperationslimit=1.
• Choose this over MRU. MRU needs a manual failback after a path failure.


FC: Zoning & Masking

Implement zoning
• Do it before going live, or during a quiet maintenance window, due to the high risk potential.
• 1 zone per HBA port.
  • 1 HBA port does not need to know about the existence of the others.
  • This eliminates Registered State Change Notifications (RSCN).
• Use soft zoning, not hard zoning.
  • Hard zone: zone based on the SAN switch port. Any HBA connected to this switch port gets this zone, so it is more secure. But be careful when recabling things into the SAN switch!
  • Soft zone: zone based on the HBA port (WWN). The switch port is irrelevant.
  • Situations that need rezoning with soft zoning: changing an HBA, replacing an ESXi server (which comes with a new HBA), upgrading an HBA.
  • Situations that need rezoning with hard zoning: reassigning the ESXi to another zone, port failure in the SAN switch.
• Virtual HBAs can further reduce cost and offer more flexibility.

Implement LUN masking
• It complements zoning. Zoning is about path segregation; masking is about access.
• Do it at the array level, not the ESXi level.
  • Masking done at the ESXi host level is often based on controller, target and LUN numbers, all of which can change with the hardware configuration.


FC: Zoning & Masking

See the figure: there are 3 zones.
• Zone A has 1 initiator and 1 target. A single-initiator zone is good.
• Zone B has two initiators and two targets. This is bad.
• Zone C has 1 initiator and 1 target.
• Both SAN switches are connected via an Inter-Switch Link.
  • If Host X reboots and its HBA in Zone B logs out of the SAN, an RSCN will be sent to Host Y's initiator in Zone B and cause all I/O going to that initiator to halt momentarily, recovering within seconds.
  • Another RSCN will be sent to Host Y's initiator in Zone B when Host X's HBA logs back in to the SAN, causing another momentary halt in I/O.
  • Initiators in Zone A and Zone C are protected from these events because there are no other initiators in those zones.
• Most of the latest SAN switches provide RSCN suppression methods. But suppressing RSCNs is not recommended, since RSCNs are the primary way for initiators to determine that an event has occurred and to act on it, such as loss of access to targets.


Large: Reasons for FC (partial list)

• A network issue does not create a storage issue.
• Troubleshooting storage does not mean troubleshooting the network too.
• FC vs IP
  • The FC protocol is more efficient & scalable than the IP protocol for storage.
  • Path failover is <30 seconds, compared with <60 seconds for iSCSI.
• Lower CPU cost
  • See the chart: FC has the lowest CPU hit to process the IO, followed by hardware iSCSI.
• Storage vMotion is best served with 10 GE.

FC considerations
• Need SAN skills - troubleshooting skills, not just Install/Configure/Manage.
• Need to be aware of WWWWW. This can impact upgrades later on, as a new component may not work with an older component.

(Chart: relative CPU cost per I/O for NFS, software iSCSI, hardware iSCSI and FC, on ESX 3.5 vs ESX 4.0.)


Large: Backup with VADP

1 backup job per ESXi, so the impact to production is minimized.


Backup Server

A backup server is an "I/O machine".
• By far, the majority of the work done is I/O related: lots of data in from clients and out to disk or tape.
• Performance of the disk is key.
• A fast internal bus is key; multiple internal buses are desirable.
• No shared path: 1 port from ESXi (source) and 1 port to tape (target).
• Not much CPU usage. A 1-socket 4-core Xeon 5600 is more than sufficient.
• Not much RAM usage. 4 GB is more than enough.

But deduplication uses CPU and RAM.
• Deduplication relies on the CPU to compare segments (or blocks) of data to determine whether they have been previously backed up or are unique.
• This comparison is done in RAM. Consider 32 GB RAM (64-bit Windows).

Size the concurrency properly.
• Too many simultaneous backups can actually slow the overall backup speed.
• Use backup policy to control the number of backups that occur against any datastore. This minimizes the I/O impact on the datastore, as it must still serve production usage.

2 ways of backing up:
• Mount the VMDK file as a virtual disk (with a drive letter). The backup software can then browse the directory.
• Mount the VM as an image file.


Network Design


Methodology

• Plan how VXLAN and SDN impact your architecture.
• Define how vShield will complement your VLAN-based network.
• Decide if you will use 10 GE or 1 GE.
  • I'd go for 10 GE for the Large Cloud example.
  • If you use 10 GE, define how you will use Network IO Control.
• Decide if you use IP storage or FC storage.
• Decide which vSwitch to use: local (standard), distributed, or Nexus.
• Decide when to use Load Based Teaming.
• Select blade or rack mount. This has an impact on NIC ports and switches.
• Define the detailed design with the vendor of choice.


VXLAN

Benefits
• Complete isolation at the network layer: overlay networks are isolated from each other and from the physical network.
• Separation of the virtualization and network layers: the physical network has no knowledge of the virtual networks, and virtual networks are spun up automatically as needed for VDCs.

Considerations
• Loss of visibility, as all overlay traffic is now UDP-tunneled.
• Can't isolate virtual network traffic from the physical network.
• Virtual networks can have overlapping address spaces.
• Today's network management tools are useless in VXLAN environments.


Network Architecture (still VLAN-based, not vCNS-based)


ESXi Network configuration


Design Considerations

Design considerations for 10 GE
• We only have 2-4 physical ports.
  • This means we only have 1-2 vSwitches.
  • Some customers have gone with 4 physical ports, as 20 GE may not be enough for both storage and network.
• The Distributed Switch relies on vCenter.
  • Database corruption on vCenter will impact it.
  • vCenter availability becomes more critical.
• Use Load Based Teaming.
  • This prevents one burst from impacting Production. For example, a large vMotion can send a lot of traffic.

Some best practices
• Enable jumbo frames.
• Disable STP on ESXi-facing ports on the physical switch.
• Enable PortFast mode on ESXi-facing ports.
• Do not use DirectPath I/O, unless the app really has proof that it needs it.


Network IO Control

2x 10 GE is much preferred to 12x 1 GE.
• 10 GE ports give flexibility. For example, vMotion can exceed 1 GE when the physical cable is not used by other traffic.
• But a link failure means losing 10 GE.
• External communication can still be 1 GE. Not an issue if most communication is among VMs.

Use ingress traffic shaping to control the traffic types into the ESXi?

Shares (bandwidth per pNIC) | Function | vShield | Remarks
20% | VM - Production, VM - Non Production, VM - Admin Network, VM - Backup LAN (agent) | Yes | A good rule of thumb is ~8 VMs per Gigabit. The Admin Network is used for basic network services like DNS and AD servers. Use vShield App to separate it from Production; it complements the existing VLANs, no need to create more VLANs. The Infra VMs are not connected to the Production LAN; they are connected to the Management LAN.
10% | Management LAN, VMware Management, VMware Cluster Heartbeat | No | In some cases, the Nexus Control & Nexus Packet traffic needs to be physically separated from Nexus Management.
20% | vMotion | No | Non-routable, private network.
15% | Fault Tolerance | No | Non-routable, private network.
0 – 10% | VM - Troubleshooting | Yes | Same as Production. Used when we need to isolate networking performance.
5% | Host-Based Replication | No? | Only for the ESXi that is assigned to do vSphere Replication. From a throughput point of view, if the inter-site link is only 100 Mb, you only need 0.1 GE max.
20% | Storage | Yes |
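Shares only matter under contention: when every traffic type is active on a 10 GE uplink, each type gets at least its proportional slice, and idle share is redistributed. A small sketch of that worst-case arithmetic, using the percentages from the table (troubleshooting taken at its 10% maximum):

```python
# Worst-case bandwidth per traffic type on one 10 GE uplink under full contention.
# Share percentages are taken from the table above; this is only an illustration of
# how proportional shares behave, not a NIOC configuration.

shares_pct = {
    "VM (Prod/Non-Prod/Admin/Backup)": 20, "Management": 10, "vMotion": 20,
    "Fault Tolerance": 15, "VM Troubleshooting": 10, "Host-Based Replication": 5,
    "Storage": 20,
}
uplink_gbps = 10
total = sum(shares_pct.values())

for name, share in shares_pct.items():
    print(f"{name}: {uplink_gbps * share / total:.1f} Gb minimum under full contention")
```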


Large: IP Address Scheme

The example below is based on 1500 server VMs and 10000 desktop VMs, which works out to around 125 ESXi hosts for each farm. Do we separate the network between the Server and Desktop farms? Since we are using the x.x.x.1 addresses for hosts, the gateway will be on x.x.x.254.

Purpose     | IP Addresses | Total Segments | Remarks
ESX iLO     | 1 per ESXi   | 1 | Out-of-band management & console access
ESX Mgmt    | 1 per ESXi   | 1 |
ESX iSCSI   | 1-2 per ESXi | 1 | Need 2 (1 address per active path) if we don't use LBT and do static mapping
ESX vMotion | 2 per ESXi   | 1 | Multi-NIC vMotion
ESX FT      | 1            | 1 | Cannot multi-path?
Agent VMs   | 5 per ESXi   | 3 | vShield App, TrendMicro DS, Distributed Storage, etc.
Mgmt VMs    | 1 per DC     | 1 | vCenter, SRM, Update Manager, vCloud, etc. Group in 20s so similar VMs have sequential IP addresses - easier to remember

Address         | ESXi #001                          | ESXi #125                                | Remarks
iLO             | 10.10.10.1                         | 10.10.10.125                             | 10.10.10.x for the Server farm (enough for 254 ESXi). 10.10.11.x for the Desktop farm. 10.10.12.x for non-ESXi devices (e.g. network switches, arrays, etc.)
Mgmt            | 10.10.13.1                         | 10.10.13.125                             | 10.10.13.x for the Server farm (enough for 254 ESXi). 10.10.14.x for the Desktop farm
iSCSI           | 10.10.15.1, 10.10.16.1             | 10.10.15.125, 10.10.16.125               | This is for ESXi only. Other devices should be on 10.10.17.x. VSA will have many addresses when it scales beyond 3 ESXi.
vMotion         | 10.10.17.1, 10.10.18.1             | 10.10.17.125, 10.10.18.125               |
Fault Tolerance | 10.10.19.1                         | 10.10.19.125                             |
Agent VMs       | 10.10.20.1, 10.10.21.1, 10.10.22.1 | 10.10.20.125, 10.10.21.125, 10.10.22.125 |
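Because every subnet uses the same host number as the last octet, the full address set for any ESXi host can be derived mechanically. A small sketch for the server farm, using the subnets from the table:

```python
# Derive all addresses for a given ESXi host number in the server farm.
# Subnets are those listed in the table above; the helper itself is illustrative.

SERVER_FARM_SUBNETS = {
    "iLO": ["10.10.10"],
    "Mgmt": ["10.10.13"],
    "iSCSI": ["10.10.15", "10.10.16"],       # 2 addresses if not using LBT
    "vMotion": ["10.10.17", "10.10.18"],     # multi-NIC vMotion
    "FT": ["10.10.19"],
    "Agent VMs": ["10.10.20", "10.10.21", "10.10.22"],
}

def host_addresses(host_number):
    """Return every address assigned to ESXi #host_number (1-254) in the server farm."""
    return {purpose: [f"{subnet}.{host_number}" for subnet in subnets]
            for purpose, subnets in SERVER_FARM_SUBNETS.items()}

print(host_addresses(37)["vMotion"])   # ['10.10.17.37', '10.10.18.37']
```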


Security & Compliance Design


Areas to consider

Source & tools
• vSphere Hardening Guide
• VMware Configuration Manager
• Other industry requirements such as PCI-DSS

Take advantage of vCNS
• It changes the security paradigm: from "hypervisor as another point to secure" to "hypervisor gives the security team an unfair advantage".
• vShield App for firewalling and vShield Endpoint for anti-virus (only Trend Micro has the product as at Sep 2011).
• No need to throw away the physical firewall first. Complement it by adding "object-based" rules that follow the VM.

Areas per layer:
• VM: Guest OS, vmdk, prevent DoS, log review, VMware Tools
• Server: lockdown mode, firewall, SSH, log review
• Storage: zoning and LUN masking, VMFS & LUN, iSCSI CHAP, NFS storage
• Network: VLAN & PVLAN, Management LAN, no air gap with vShield, virtual appliances
• Management: vSphere roles, separation of duty


Separation of Duties with vSphere

VMware Admin vs AD Admin
• In a small setup, it's the same person doing both.
• The AD Admin has access to NTFS. This can be too powerful if it holds the data.

Segregate the virtual world
• Split vSphere access into 3: Storage, Server, Network.
• Give Network to the Network team.
• Give Storage to the Storage team.
• A role with full access to vSphere should be rarely used.
• VM owners can be given some access that they don't have in the physical world. They will like the empowerment (self service).

(Diagram: the enterprise IT space - MS AD Admin, Storage Admin, Network Admin, DBA, Apps Admin - versus the vSphere space - VMware Admin, Networking Admin, Server Admin, Storage Admin, Operators and VM Owners.)


Folders

Use them properly.
• Do not use Resource Pools to organise VMs.
  • Caveat: the Host/Cluster view + VM is the only view where you can see both ESXi hosts and VMs.
• Study the hierarchy on the right.
  • It is folders everywhere. Folders are the way to limit access.
  • Certain objects don't have their own access control; they rely on folders.
  • E.g. you cannot set permissions directly on a vNetwork Distributed Switch. To set permissions, create a folder on top of it.


Storage-Related Access

Non-Storage Admins should not have the following privileges:
• Initiate Storage vMotion
• Rename or Move Datastore
• Create
• Low level file operations

Different ways of controlling access:
• Network level. The ESXi cannot access the array at all, as it can't even see it on the network.
• Array level. Control which ESXi hosts can or cannot see it.
  • For iSCSI, we can configure per target using CHAP.
  • For FC, we can use Fibre Channel zoning or LUN masking.
• vCenter level, using vCenter permissions (at folder or datastore level). Most granular.


Network-Related Access

Server Admins should not have the following privileges:
• Move network (this can be a security concern)
• Configure network
• Remove network

Server Admins should have:
• Assign network (to assign a network to a VM)


Roles and Groups

Create new groups for vCenter Server users.
• Avoid using MS AD built-in groups or other existing groups.
• Do not use the default "Administrator" user in any operation.
  • Each vCenter plug-in should have its own user, so you can differentiate among the plug-ins.
  • Disable the default "Administrator" user.
• Use your own personal ID. The idea is that security should be traceable to an individual.
  • Do not create another generic user (e.g. "VMware Admin"). This defeats the purpose and is practically no different from "Administrator".
  • Creating a generic user increases the risk of sharing, since it carries no personal data.

Create 3 roles (not users) in MS AD:
• Network Admin
• Storage Admin
• Security Admin

Create a unique ID for each vSphere plug-in that you use:
• SRM, Update Manager, Chargeback, CapacityIQ, vShield Zones, Converter, Nexus, etc.
  • E.g. "SRM Admin", "Chargeback Admin".
• This is the ID that the product uses to log in to vCenter. It is not the ID you use to log in to that product; use your personal ID for that.
• This helps in troubleshooting. Otherwise there are too many "Administrator" logins and you are not sure who they really are.
• Also, if the Administrator password has to change, you don't have to change it everywhere.


VM Design


Standard VM Sizing: Follow McDonald's

1 VM = 1 App = 1 purpose. No bundling of services.
• Having multiple applications or services in 1 OS tends to create more problems. The Apps team knows this better.

Start with the Small size, especially for CPU & RAM.
• Use as few virtual CPUs (vCPUs) as possible.
  • CPU count impacts the scheduler, and hence performance.
  • It is hard to take vCPUs back once you give them. Also, the app might be configured to match the processor count (you will not know unless you ask the application team).
  • Maintaining a consistent memory view among multiple vCPUs consumes resources.
  • There is a licensing impact if you assign more CPU. vSphere 4.1 multi-core can help (always verify with the ISV).
  • Virtual CPUs that are not used still consume timer interrupts and execute the idle loop of the guest OS.
  • In the physical world, CPU tends to be oversized. Right-size it in the virtual world.
• RAM
  • RAM starts with 1 GB, not 512 MB. Patches can be large (330 MB for XP SP3) and need RAM.
  • Size impacts vMotion, ballooning, etc., so you want to trim the fat.
  • The Tier 1 Cluster should use Large Pages.
• Anything above XL needs to be discussed case by case. Utilise Hot Add to start small (needs DC edition).
• See the speaker notes for more info.

Item | Small VM | Medium VM | Large VM | Custom
CPU  | 1        | 2         | 4        | 8 – 32
RAM  | 1 GB     | 2 GB      | 4 GB     | 8, 12, 16 GB, etc.
Disk | 50 GB    | 100 GB    | 200 GB   | 300, 400, etc. GB
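The standard sizes can be kept as a small catalogue, together with a helper that always starts from the smallest size that satisfies a request. The helper below is only an illustration of the "start small" principle; the sizes themselves are from the table.

```python
# Standard VM sizes from the table, plus an illustrative "start small" helper.
SIZES = {
    "Small":  {"vcpu": 1, "ram_gb": 1, "disk_gb": 50},
    "Medium": {"vcpu": 2, "ram_gb": 2, "disk_gb": 100},
    "Large":  {"vcpu": 4, "ram_gb": 4, "disk_gb": 200},
}

def pick_size(vcpu, ram_gb, disk_gb):
    """Return the smallest standard size that satisfies the request."""
    for name, s in SIZES.items():   # insertion order: Small, Medium, Large
        if vcpu <= s["vcpu"] and ram_gb <= s["ram_gb"] and disk_gb <= s["disk_gb"]:
            return name
    return "Custom - discuss case by case"

print(pick_size(2, 2, 80))     # Medium
print(pick_size(8, 16, 400))   # Custom - discuss case by case
```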


SMP and UP HAL

This does not apply to recent OSes such as Windows Vista, Windows 7 and Windows 2008.

Design principle
• Going from 1 vCPU to many is OK.
  • Windows XP and Windows Server 2003 automatically upgrade to the ACPI Multiprocessor HAL.
• Going from many to 1 is not OK.

To change from 1 vCPU to 2 vCPUs:
• You must change the kernel to SMP.
• "In Windows 2000, you can change to any listed HAL type. However, if you select an incorrect HAL, the computer may not start correctly. Therefore, only compatible HALs are listed in Windows Server 2003 and Windows XP. If you run a multiprocessor HAL with only a single processor installed, the computer typically works as expected, and there is little or no affect on performance."
  • http://support.microsoft.com/default.aspx?scid=kb;EN-US;811366
  • Steps to change: http://support.microsoft.com/kb/237556/

To change from many vCPUs to 1:
• The step is simple, but MS recommends a reinstall: "In this scenario, an easier solution is to create the image on the ACPI Uniprocessor computer."
  • http://kb.vmware.com/kb/1003978
  • http://support.microsoft.com/kb/309283


MS Windows: Standardisation

The Datacenter edition is cheaper at >6 VMs per box.

MS licensing is complex.
• The summary below may not apply in your case.
• Standard edition: licensed per VM. 10 VMs means 10 licences.
• Enterprise edition: 1 licence covers 4 VMs. 10 VMs means 3 licences.
• Datacenter edition: licensed per socket. 2 sockets means 2 licences, with unlimited VMs per box.

Source: http://www.microsoft.com/windowsserver2008/en/us/hyperv-calculators.aspx


Guest OS

Use 64-bit where possible.
• Access to >3 GB RAM.
• The performance penalty is generally negligible, or even negative.
• In Linux VMs, Highmem can show significant overhead with 32 bit; 64-bit guests can offer better performance.
• Workloads with a large memory footprint benefit more from 64-bit guests.
• Some Microsoft & VMware products have dropped support for 32 bit.
• Increased scalability in the VM. Example: Update Manager 4 installed on 64-bit Windows can concurrently scan 4000 VMs; on 32-bit Windows the concurrency drops to 200. Powered-on Windows VM scans per VUM server: 72. Most other numbers are not as drastic as this example.

Disable unnecessary devices in the Guest OS.

Choose the right SCSI controller.

Set the right IO timeout.
• On Windows VMs, increase the value of the SCSI TimeoutValue parameter to allow Windows to better tolerate delayed I/O resulting from path failover.

For Windows VMs, stagger anti-virus scans. Performance will degrade significantly if you scan all VMs simultaneously.


Management Design
Performance, Capacity, Configuration


vCenter

Run vCenter Server as a VM.
• vCenter Server VM best practices:
  • Disable DRS on all vCenter VMs. Move them to the first ESXi host in your farm.
  • Always remember where your vCenter runs. Remember both the host name and the IP address of that first ESXi host.
  • Start in this order: Active Directory, DNS, vCenter DB, vCenter.
  • Set HA to high priority.
• Limitations:
  • Windows patching of the vCenter VM can't be done via Update Manager.
  • Can't cold clone the VM; use hot clone instead.
  • VM-level operations that require the VM to be powered off can be done via the ESXi host: log in directly to the ESXi host that has the vCenter VM, make the changes, then boot the VM.
• Not connected to the Production LAN. Connect it to the Management LAN, so VLAN trunking is required as the vSwitches are shared (assuming you do not have a dedicated IT Cluster).

Security
• Protect the special-purpose local vSphere administrator account from regular usage. Instead, rely on accounts tied to specific individuals for clearer accountability.

Other configuration
• Keep the statistics level at Level 1, and use vCenter Operations to complement it.
• Level 3 is a big jump in terms of data collected.


Naming Convention

Object | Standard | Examples | Remarks
Data center | Purpose | Production | This is the virtual data center in vCenter. Normally, a physical data center has 1 or many virtual data centers. As you will only have a few of these, there is no need for a cryptic naming convention. Avoid renaming it.
Cluster | Purpose | | As above.
ESXi host name | esxi_locationcode_##.domain.name | esxi_SGP_01.vmware.com, esxi_KUL_01.vmware.com | Don't include the version number as it may change. No spaces.
VM | Project_Name Purpose ## | Intranet WebServer 01 | Don't include the OS name. Can include spaces.
Datastore | Environment_type_## | PROD_FC_01, TEST_iSCSI_01, DEV_NFS_01, Local_ESXname_01 | The type is useful when we have multiple types. If you have 1 type but multiple vendors, you can use the vendor name (EMC, IBM, etc.) instead. Prefix all local datastores with "Local_" so they are separated easily in the dialog boxes.
"Admin ID" for plug-ins | ProductName-Purpose | VCOps-Collector, Chargeback- | All the various plug-ins to vSphere need Admin access.
Folder | | |

Avoid special characters, as you (or other VMware and 3rd-party products or plug-ins) may need to access these names programmatically. If you are using VC Ops to manage multiple vCenters, then the naming convention should ensure names are unique across vCenters.
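A simple check can keep names within the convention (no special characters; no spaces in ESXi host names). The patterns below are illustrative only, not an official rule set:

```python
# Illustrative naming-convention checks: no special characters, and no spaces
# in ESXi host names. Adjust the pattern to your own standard.
import re

SAFE = re.compile(r"^[A-Za-z0-9_. -]+$")   # letters, digits, underscore, dot, space, dash

def valid_vm_name(name):
    return bool(SAFE.match(name))           # VM names may include spaces

def valid_esxi_name(name):
    return bool(SAFE.match(name)) and " " not in name

print(valid_vm_name("Intranet WebServer 01"))      # True
print(valid_esxi_name("esxi_SGP_01.vmware.com"))   # True
print(valid_esxi_name("esxi SGP 01"))              # False (no spaces allowed)
```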


vCenter Server: HA

Many vSphere features depend on vCenter:
• Distributed Switch
• Auto Deploy
• HA (management)
• DRS and vMotion
• Storage vMotion
• Licensing

Many add-ons depend on vCenter:
• vShield
• vCenter SRM
• VCM
• vCenter Operations
• vCenter Chargeback
• vCloud Director
• View + Composer

Implement vCenter Heartbeat.
• Automated recovery from hardware, OS, application and network failure.
• Awareness of all vCenter Server components.
• The only solution fully supported by VMware.
• Can protect the database (SQL Server) and vCenter plug-ins: View Composer, Update Manager, Converter, Orchestrator.


vMA: Centralised Logging

Benefits
• Ability to search across ESXi hosts.
• Convenience.

Best practices
• One vMA per 100 hosts with vilogger.
• Place the vMA on the management LAN.
• Use a static IP address, FQDN and DNS.
• Limit use of resxtop (it is for real-time troubleshooting, not monitoring).

Enable remote system logging for targets
• vilogger (enable/disable/updatepolicy/list)
  • Rotation default is 5.
  • Maxfiles defaults to 5 MB.
  • Collection period is 10 seconds.
• ESX/ESXi log files go to /var/log/vmware/<hostname>.
• vpxa logs are not sent to syslog.
  • See KB 1017658.


Thank You