
Fault Tolerant Virtualization with ESXi and iSCSI Shared Storage

Infrastructure Level Disaster Recovery

Steven Ford, Felix Carrera, Michael Wu, Andy Beltre
Institute for Bioscience and Biotechnology Research

December 16, 2015


CONTENTS

1 Introduction

2 Is iSCSI a Viable Storage Back-End?
  2.1 Local Storage versus iSCSI
    2.1.1 Sequential Output
    2.1.2 Sequential Input
  2.2 Conclusion

3 Multiple ESXi Hosts Sharing a Single iSCSI Target

4 Installation and Configuration
  4.1 Overview
  4.2 Hardware Configuration
    4.2.1 Block Storage Nodes
    4.2.2 ESXi Hosts
  4.3 Installing the Operating System for Block Storage Nodes
  4.4 Post-Installation OS Configuration
    4.4.1 Network Configuration
    4.4.2 Updates
    4.4.3 SELinux
  4.5 Cluster Software Installation and Configuration
    4.5.1 PCS and Pacemaker
    4.5.2 Corosync
    4.5.3 CMAN
    4.5.4 Shoot The Other Node In The Head
    4.5.5 No Quorum Policy
    4.5.6 Start the Cluster
  4.6 DRBD Installation and Configuration
    4.6.1 ELRepo
    4.6.2 DRBD Install
    4.6.3 Block Device Initialization
  4.7 Resource Definitions
    4.7.1 DRBD
    4.7.2 iSCSI Target
    4.7.3 iSCSI Logical Unit
    4.7.4 IP Address
  4.8 Constraint Definitions
    4.8.1 Location
    4.8.2 Colocation
    4.8.3 Order

5 Maintenance Tasks
  5.1 Checking Pacemaker Status
  5.2 Checking DRBD Device Status
  5.3 Recovering from a Fail-Over

6 Revision History


CHAPTER 1

INTRODUCTION

Our recent research on VM disaster recovery options has led us back to looking at shared storage used with VMWare ESXi. Virtualization infrastructures based on other technologies – such as OpenStack and KVM – have shown promise, but we have decided to pursue modifications to our current infrastructure rather than implement a new one. A large factor in this decision is the time it would take to migrate to a new technology. It is important that we achieve fault tolerance at the infrastructure level as soon as possible to greatly reduce downtime for our users in the event of a server failure.

The proposed modifications to our current infrastructure are to add an iSCSI storage back-end that uses a distributed replicated block device (DRBD) to maintain data redundancy. The back-end will consist of four iSCSI targets split across two separate machines. Each machine will serve as a backup for the other, taking on the IPs and iSCSI targets of the other in the case of a failure. The fail-over of the iSCSI targets will be facilitated by Pacemaker – an open source cluster manager – in an active/active configuration. Because the iSCSI targets always remain associated with the same IP address regardless of their physical location, the VMWare ESXi servers are blind to the fail-over process. This means that when a fail-over occurs, the only thing the ESXi servers experience is a temporary iSCSI disconnect.

This proposed back-end for providing infrastructure level fault tolerance to our virtual machines looks promising, but hinges on the answers to a few key questions:

• Is iSCSI a viable storage back-end for our tier of VM infrastructure? Will our volume of virtual machine disk I/O struggle to travel through the network?

• Can multiple ESXi hosts coordinate the use of a single iSCSI target? In other words, does the VMFS file system allow multiple ESXi hosts to access it at the same time?

In the following chapters, we will answer each of these questions. After these questions are addressed, we will walk through the configuration process. Throughout this text, I will use angled brackets to specify portions of text that should be substituted with values as you see fit. Be sure to replace all text inside the <...> with the proper names to match your configuration. For example, instead of drbd-<DEVICE-A>.res you might enter drbd-bs1-ssd.res.


CHAPTER 2

IS ISCSI A VIABLE STORAGE BACK-END?

An important consideration in using iSCSI for virtual machine storage is how virtual hard disk performance will be affected. Since all hard disk I/O is being sent over the network, rather than local data buses, there is a potentially large impact to VM performance. This chapter aims to provide a clear outline of what performance impact we can expect from implementing iSCSI storage for our VM servers.

To observe the difference in performance between local storage and iSCSI storage, we provisioned two virtual machines. Both machines run Linux and have identical amounts of memory and CPU cores; the only difference is their storage back-ends. One machine has its virtual hard drive located locally on the ESXi server, while the other machine has its virtual hard drive located on an iSCSI target. The disk performance for these machines was measured using the disk I/O benchmarking tool “Bonnie++.”
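
The exact Bonnie++ invocation used for these tests is not recorded in this report. The following is a minimal sketch of a comparable run, assuming the disk under test is mounted at /mnt/test and a 32GB test file size is used (both are assumptions; Bonnie++ normally wants a file size of at least twice the machine's RAM so that caching does not skew the results):

# bonnie++ -d /mnt/test -s 32g -n 0 -u root

The -n 0 flag skips the small-file creation tests, leaving only the sequential output, rewrite, and input phases discussed below.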

2.1 Local Storage versus iSCSI

The first test we ran was aimed at determining general disk performance for a virtual machine running on local storage versus general disk performance running with its hard drive on an iSCSI target. Both machines were benchmarked, one after the other, and the results were recorded and graphed for comparison.

2.1.1 Sequential Output

(a) Sequential output (write) to a locally stored virtual machine’s disk versus one stored on iSCSI

(b) CPU load for sequential output (write) to a locally stored virtual machine’s disk versus one stored on iSCSI


The first thing these graphs tell us is that when writing data one byte at a time (Per Char) the CPU becomes the bottleneck. Byte-by-byte writes running on local storage top out at a speed of only 1,048KBps and utilize 99% of the machine’s CPU power. Byte-by-byte writes on iSCSI storage are slightly faster at 1,054KBps, also with 99% CPU utilization. The difference in performance between local and iSCSI for byte-by-byte writes is +0.57% in write speed and 0% in CPU utilization. These numbers do not vary significantly, but byte-by-byte writes to disk are not a very likely scenario, so let’s look at block write speeds.

With block writes, we start to see some real performance. Block writes running on local storage top out at 221,121KBps (215.9MBps) and utilize only 26% of the machine’s CPU power. Block writes on iSCSI storage again outperform local storage at 282,910KBps (276.3MBps) with a slightly higher CPU utilization of 27%. The difference in performance between local and iSCSI for block writes is +24.52% in write speed and +3.8% in CPU utilization. Here we see a large difference in performance. Now let’s look at rewrites.

The rewrite test is performed by reading some data from the hard drive, changing it, and writing it back to the hard drive. Rewrites on local storage reach speeds of 194,854KBps (190.3MBps) and utilize 20% of the machine’s CPU power. Rewrites on iSCSI storage reach speeds of 213,557KBps (208.6MBps) and utilize 18% of the machine’s CPU power. The difference in performance between local and iSCSI for rewrites is +9.16% in rewrite speed and -10.5% in CPU utilization. This test is particularly important because it simulates loading and changing files, a function that makes up a large part of a system’s disk I/O.

2.1.2 Sequential Input

(a) Sequential input (read) from a locally stored virtual machine’s disk versus one stored on iSCSI

(b) CPU load for sequential input (read) from a locally stored virtual machine’s disk versus one stored on iSCSI

These figures supply even more insight into the performance differences we can expect. Once again, CPU seems to be the bottleneck for byte-by-byte reads. Due to the insignificance of byte-by-byte read performance in a production environment, I am going to omit analyzing those results and skip directly to block reads.

The results for reading blocks show the largest difference so far in performance between local storage and iSCSI storage. Block reads from local storage reach speeds of 428,803KBps (418.6MBps) and utilize 21% of the machine’s CPU power. Block reads from iSCSI storage reach speeds of 476,165KBps (467.9MBps) and utilize 16% of the machine’s CPU power. The difference in performance between local and iSCSI for block reads is +11.09% in read speed and -27.03% in CPU utilization.

2.2 Conclusion

Our tests showed iSCSI storage outperforming local storage in both reads and writes. This may be due to the fact that the storage disks for the iSCSI target are in a RAID array, while the local storage is not. Regardless of unconsidered factors in the performance difference, the benchmarks of the new infrastructure’s iSCSI storage show that iSCSI is a viable back-end for VM storage.


CHAPTER 3

MULTIPLE ESXI HOSTS SHARING A SINGLE ISCSI TARGET

Another concern about using iSCSI as a storage back-end for ESXi is the ability of multiple ESXi hosts to simultaneously access a single iSCSI target. Initially, this virtual server implementation will have only two ESXi hosts and two iSCSI targets. However, we will be adding additional ESXi hosts in the future. Even without additional ESXi hosts, all the virtual machines stored on one iSCSI target may not necessarily run on the same ESXi host. Multiple ESXi hosts running virtual machines from the same iSCSI target is critical to implementing this infrastructure. In this brief chapter, we will evaluate the possibility of using VMWare vCenter Server to share iSCSI storage between multiple ESXi hosts.

Figure 3.1: VMWare vCenter Server Inventory

When connected to a vCenter server with the vSphere client, a navigation bar is displayed on the left of the interface. Figure 3.1 is a view of IBBR’s vCenter server navigation bar. Inside the navigation bar, there are five different types of items. There is the vCenter server host on top, here labeled “localhost”. There are groups of ESXi hosts called datacenters. The “Development” datacenter was created for the purpose of this project. Directly under a datacenter can be either an ESXi host, or a cluster. “IBBR Servers” contains only ESXi hosts, while “Development” contains a cluster called “Shared Storage”. Virtual machines (VMs) are the last type of item listed in the navigation bar. Here we see a few VMs labeled as either Windows or Linux nested inside of the “Shared Storage” cluster.

When grouping ESXi hosts into a VMWare cluster, you have the option to enable high availability features. High availability allows one ESXi host to boot up another host’s virtual machines in the event the other host goes offline. This functionality relies on virtual machines being run from shared storage. VMWare clusters make shared storage simple by automatically adding all external datastores configured on one host to all other hosts in the cluster. Once this is done, virtual machines on a shared datastore can be booted up on any ESXi host in the cluster. This includes iSCSI datastores, answering the question of multiple ESXi hosts accessing a single iSCSI target.
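
Our datastore configuration was done through the vSphere client, but for reference, the software iSCSI initiator on an ESXi host can also be pointed at a target from the ESXi shell. This is a rough sketch only; the adapter name vmhba33 and the <TARGET-IP> placeholder are assumptions and must be adjusted to match your host:

# esxcli iscsi software set --enabled=true

# esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=<TARGET-IP>:3260

# esxcli storage core adapter rescan --adapter=vmhba33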


CHAPTER 4

INSTALLATION AND CONFIGURATION

4.1 Overview

Throughout the research and design process for this project, the proposed pacemaker configuration has changed significantly. We began with the intention of configuring one DRBD device, and ended up configuring four. Figure 4.1 represents the final configuration that was implemented in our production environment.

Figure 4.1: Visual Representation of Pacemaker Cluster Configuration

The various colors used in the diagram above represent different elements of the configuration. The orange boxes each represent a physical machine. The green boxes above each of the orange boxes represent pacemaker resources located on those machines. Solid yellow lines represent physical network connections and dotted yellow lines represent virtual IP addresses. The dotted yellow lines connecting to the solid yellow lines indicate that those IP addresses operate through that physical network connection.

At the core of this configuration are the Distributed Replicated Block Devices. Each node has data stores for all four devices (bs1-drbd-ssd, bs2-drbd-ssd, bs1-drbd-hdd, and bs2-drbd-hdd). When the DRBD resources are added to the cluster as “clones”, each node assumes the master role for two of the four devices, and the slave role for the other two. When holding the master role for a device, the node starts an iSCSI target (bs1-iscsi-ssd, bs2-iscsi-ssd, bs1-iscsi-hdd, and bs2-iscsi-hdd), maps the target to the storage device using a Logical Unit (bs1-lun-ssd, bs2-lun-ssd, bs1-lun-hdd, and bs2-lun-hdd), and makes the target available via an IP address resource (bs1-ip-ssd-102-61, bs2-ip-ssd-102-63, bs1-ip-hdd-102-64, and bs2-ip-hdd-102-65).

4.2 Hardware Configuration

4.2.1 Block Storage Nodes

Two block storage nodes were built. Each node is housed in a 2U rack mounted chassis with redundant power and eight hot-swappable drive bays. A quad core 3.5GHz Intel Xeon E5-1620 was used for the processor, along with 16GB of registered memory. The operating system, CentOS 6, is installed on a 120GB SSD. Two software RAID 10 arrays are configured on each node: one array with four 4TB hard disk drives, and the other with four 1TB solid state drives. Each RAID 10 device is split into two equal partitions to give us our four DRBD devices bs1-drbd-ssd, bs2-drbd-ssd, bs1-drbd-hdd, and bs2-drbd-hdd. Two 10Gbps network cards were used to interconnect the nodes for data replication.
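
The exact commands used to build the arrays are not reproduced in this document. As a rough sketch (the member drive names sdb through sde are placeholders for your actual devices), the SSD array and its two equal partitions could be created like this:

# mdadm --create /dev/md/SSD_RAID_0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# parted -s /dev/md/SSD_RAID_0 mklabel gpt

# parted -s /dev/md/SSD_RAID_0 mkpart primary 0% 50%

# parted -s /dev/md/SSD_RAID_0 mkpart primary 50% 100%

The resulting partitions appear as /dev/md/SSD_RAID_0p1 and /dev/md/SSD_RAID_0p2, which match the backing disks referenced in the DRBD resource files later in this chapter.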

4.2.2 ESXi Hosts

Two ESXi hosts were built. Each host is housed in a 2U rack mounted chassis with redundant power. No hot-swappable drive bays were used. Two eight core 2.4GHz Intel Xeon E5-2630 v3 processors were installed, along with 128GB of registered memory. Each host is connected to the server space using a 10Gbps network card. The operating system, VMWare ESXi 5, is installed on a 60GB solid state drive. Due to the lack of local storage on these hosts, we refer to them as “lean” VM hosts.

4.3 Installing the Operating System for Block Storage Nodes

It is important that we keep the configuration of both machines consistent. The following steps will need to be repeated for both Block Storage nodes in the cluster.

1. Begin by downloading the latest version of CentOS 6 from one of the mirrors listed on their website (http://isoredirect.centos.org/centos/6/isos/x86_64/), and burning it to a DVD.

2. Boot to the installation DVD and select “Install or Upgrade Existing System” from the welcome screen.

3. Skip testing the media at the Disc Found screen.

4. After the GUI launches, click Next.

5. Select your desired language and keyboard configuration.

6. Select “Basic Storage Devices” and click Next.

7. If prompted to delete all data on the hard drive, select “Yes, discard any data”.

8. Enter your desired hostname. The hostnames used for our deployment were vm-block-storage-1 and vm-block-storage-2.

9. Select your time zone and click Next.

10. Assign a root password and click Next.

8

Page 10: Fault Tolerant Virtualization

11. Select “Create Custom Layout” and click Next.

12. Create a 500MB partition on the system disk and mount it on /boot.

13. Create a swap partition on the system disk that is equal in size to the RAM in the machine.

14. Create a partition spanning the rest of the system disk and mount it on /.

15. Leave the iSCSI target disks untouched during installation.

16. Click Next and then click “Write Changes to Disk”.

17. Click Next to confirm the default boot loader installation options.

18. Select a minimal installation and click Next.

19. Once installation completes, click Reboot.

4.4 Post-Installation OS Configuration

4.4.1 Network Configuration

By default, the network interfaces will not start on boot. To fix this, two lines need to be changed inside of the configuration file for the interface. These files are located inside of /etc/sysconfig/network-scripts/ and are named ifcfg-<iface>, where <iface> is the name of the interface that each file corresponds to. Open the file for the 10Gbps interface that will be used to connect the machine to the LAN (we used eth1). Inside of this file, change the value of ONBOOT to yes, and change NM_CONTROLLED to no. You can also accomplish this by using sed with the following commands.

# sed -i "/ONBOOT/ { s/no/yes/ }" /etc/sysconfig/network-scripts/ifcfg-eth0

# sed -i "/NM_CONTROLLED/ { s/yes/no/ }" /etc/sysconfig/network-scripts/ifcfg-eth0

Create corresponding DHCP reservations for the new machines, then bring them online with the following command:

# ifup <iface>

The crossover connection between the two machines must also be configured. The crossover connection will be made via the 10Gbps network interfaces and will be used to sync block devices between the two nodes. We will do this by again editing an interface configuration file. Inside of /etc/sysconfig/network-scripts/, open the ifcfg-<iface> file for the interface to be used to cross connect the two machines. Change the configuration so the interface has a static IP. The resulting configuration should look similar to the following:

DEVICE=eth1

HWADDR=XX:XX:XX:XX:XX:XX

TYPE=Ethernet

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

ONBOOT=yes

NM_CONTROLLED=no

BOOTPROTO=static

IPADDR=192.168.100.101

NETMASK=255.255.255.0

Do this on both machines, assigning each one a unique private IP address on the same subnet. Once this is done, connect the two machines to each other via a direct crossover cable. Bring the interfaces online by running the following command on both machines:


# ifup <iface>

To make the machines easier to refer to later on, and easily recognized by the pacemaker cluster software, add entries for both machines to the /etc/hosts file on each node. The entries will look like the following:

192.54.102.60 vm-block-storage-1

192.54.102.62 vm-block-storage-2

Use the outward facing IP addresses for pacemaker, rather than the private crossover addresses. If the private crossover was used, pacemaker might see a node as up even though it is unreachable elsewhere due to a network outage.
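
A quick way to confirm that the entries are resolvable on each node (the hostnames below are the ones used in our deployment):

# getent hosts vm-block-storage-1 vm-block-storage-2

Each hostname should resolve to the outward facing address listed in /etc/hosts above.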

4.4.2 Updates

Once both machines have their network interfaces configured, run the following command to update the operating system and all of its packages to the latest versions.

# yum update -y

4.4.3 SELinux

To ensure proper operation of the fail-over cluster, it is necessary to disable the built-in security system that comes with Linux, called SELinux. To disable SELinux, edit the configuration file called selinux, located inside /etc/sysconfig/. Inside of this file, change SELINUX=enforcing to SELINUX=disabled. This change can alternatively be applied using sed by running the following command.

# sed -i "/^SELINUX/ { s/enforcing/disabled/ }" /etc/sysconfig/selinux

After making this change on both machines, reboot them to apply the settings.
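
After the reboot, the change can be confirmed on each node; getenforce should report that SELinux is disabled:

# getenforce

Disabled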

4.5 Cluster Software Installation and Configuration

4.5.1 PCS and Pacemaker

The software to monitor the cluster and trigger fail-over is Pacemaker. To configure Pacemaker, we need the program PCS as well. Both of these programs are available through yum, and can be installed with the following command:

# yum install -y pacemaker pcs

PCS runs as a service. After initial installation, the service is stopped and configured to start only when started manually. Start the service, and configure it to start on boot as well, with the following two commands:

# /etc/init.d/pcsd start

# chkconfig pcsd on

The installation of Pacemaker and PCS creates a user account: hacluster. This account will be used during the configuration of the cluster, so it is necessary to assign it a password. This is done as follows:


# passwd hacluster

Before proceeding, repeat all these steps on the other machine if not done already.

4.5.2 Corosync

Copy the example configuration file to the active configuration file:

# cp /etc/corosync/corosync.conf.example.udpu /etc/corosync/corosync.conf

Now modify the interface section in the corosync configuration file to contain one member entry per machine and the proper subnet via the bindnetaddr variable. The configuration file should resemble the following:

compatibility: whitetank

totem {

version: 2

secauth: off

interface {

member {

memberaddr: 192.54.102.60

}

member {

memberaddr: 192.54.102.62

}

ringnumber: 0

bindnetaddr: 192.54.102.0

mcastport: 5405

ttl: 1

}

transport: udpu

}

logging {

fileline: off

to_logfile: yes

to_syslog: yes

logfile: /var/log/cluster/corosync.log

debug: off

timestamp: on

logger_subsys {

subsys: AMF

debug: off

}

}

Create the folder where the cluster configuration will be saved.

# mkdir /etc/cluster


Now create the failover cluster using pcs.

# pcs cluster auth vm-block-storage-1 vm-block-storage-2

Username: hacluster

Password: <hacluster password>

vm-block-storage-1: Authorized

vm-block-storage-2: Authorized

# pcs cluster setup --local --name vm-storage vm-block-storage-1 vm-block-storage-2

4.5.3 CMAN

We must uninstall NetworkManager, then install CMAN:

# yum remove NetworkManager

# yum install cman -y

4.5.4 Shoot The Other Node In The Head

Shoot The Other Node In The Head (STONITH) is Pacemaker’s fencing mechanism. When enabled, it allows the cluster to forcibly power off or reset a node whose state is unknown or misbehaving, so that a malfunctioning node does not compromise the integrity of the cluster’s resources. The pacemaker service must be started on both nodes before this setting can be configured. We do not yet want to start this service automatically; we want to control when it is running during the configuration process. Start the service with the following command:

# /etc/init.d/pacemaker start

Now disable STONITH, then verify the configuration:

# pcs property set stonith-enabled=false

# crm_verify -L

4.5.5 No Quorum Policy

Pacemaker is most often used to configure clusters of three or more nodes. A cluster is said to ‘have quorum’ when more than half of its nodes are online. When only one node of our two-node cluster remains, the cluster no longer has quorum and will be considered non-functioning unless this behavior is overridden. This behavior is overridden by setting the no-quorum-policy property through pcs. This property is set in a similar fashion to the STONITH property.

# pcs property set no-quorum-policy=ignore

4.5.6 Start the Cluster

Now that the services are installed and configured, it’s time to verify that everything works. Start the cluster with the following command:


# pcs cluster start --all
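
Once the cluster has started, both nodes should report as online. A quick check, run from either node:

# pcs status

Both vm-block-storage-1 and vm-block-storage-2 should appear in the Online list before proceeding.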

4.6 DRBD Installation and Configuration

4.6.1 ELRepo

The Distributed Replicated Block Device software is not in the default CentOS package repository. The ELRepo repository must be installed before the DRBD package will be available. Install ELRepo on both nodes using the following command:

# rpm -ivh http://www.elrepo.org/elrepo-release-6-6.el6.elrepo.noarch.rpm

4.6.2 DRBD Install

With ELRepo installed, DRBD can be installed using yum:

# yum install -y kmod-drbd84

4.6.3 Block Device Initialization

Zero out any existing file system signature at the start of each backing partition (if there is one):

# dd if=/dev/zero of=/dev/<DISK> bs=1M count=128

Our configuration will utilize four DRBD devices, so we will create a configuration file for each one. Create four files inside of /etc/drbd.d/ named drbd-bs1-ssd.res, drbd-bs2-ssd.res, drbd-bs1-hdd.res, and drbd-bs2-hdd.res. Populate the files with the following content.

drbd-bs1-ssd.res:

resource drbd-bs1-ssd {

meta-disk internal;

device /dev/drbd0;

syncer {

verify-alg sha1;

}

on vm-block-storage-1 {

disk /dev/md/SSD_RAID_0p1;

address 192.168.100.101:7789;

}

on vm-block-storage-2 {

disk /dev/md/SSD_RAID_0p1;

address 192.168.100.102:7789;

}

}

drbd-bs2-ssd.res:


resource drbd-bs2-ssd {

meta-disk internal;

device /dev/drbd1;

syncer {

verify-alg sha1;

}

on vm-block-storage-1 {

disk /dev/md/SSD_RAID_0p2;

address 192.168.100.101:7790;

}

on vm-block-storage-2 {

disk /dev/md/SSD_RAID_0p2;

address 192.168.100.102:7790;

}

}

drbd-bs1-hdd.res:

resource drbd-bs1-hdd {

meta-disk internal;

device /dev/drbd2;

syncer {

verify-alg sha1;

}

on vm-block-storage-1 {

disk /dev/md/HDD_RAID_0p1;

address 192.168.100.101:7791;

}

on vm-block-storage-2 {

disk /dev/md/HDD_RAID_0p1;

address 192.168.100.102:7791;

}

}

drbd-bs2-hdd.res:

resource drbd-bs2-hdd {

meta-disk internal;

device /dev/drbd3;

syncer {

verify-alg sha1;

}

on vm-block-storage-1 {

disk /dev/md/HDD_RAID_0p2;

address 192.168.100.101:7792;

}

on vm-block-storage-2 {

disk /dev/md/HDD_RAID_0p2;

address 192.168.100.102:7792;

}

}

Note that the port numbers are different for each DRBD device. Each device uses its own dedicated port to sync over the network.

Initialize and bring up the shared devices on both machines:


# drbdadm create-md drbd-bs1-ssd

# drbdadm create-md drbd-bs2-ssd

# drbdadm create-md drbd-bs1-hdd

# drbdadm create-md drbd-bs2-hdd

# modprobe drbd

# drbdadm up drbd-bs1-ssd

# drbdadm up drbd-bs2-ssd

# drbdadm up drbd-bs1-hdd

# drbdadm up drbd-bs2-hdd
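
Once the devices are up, each node should be listening on the four replication ports noted above. As a quick sanity check (assuming the stock net-tools package is installed):

# netstat -ltn | grep -E ':(7789|7790|7791|7792)'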

Designate vm-block-storage-1 as the primary for drbd-bs1-ssd and drbd-bs1-hdd with the following commands:

# drbdadm --force primary drbd-bs1-ssd

# drbdadm --force primary drbd-bs1-hdd

Designate vm-block-storage-2 as the primary for drbd-bs2-ssd and drbd-bs2-hdd with the following commands:

# drbdadm --force primary drbd-bs2-ssd

# drbdadm --force primary drbd-bs2-hdd

The initial sync of the disks will take some time. The process can be monitored with the following command, which continuously outputs the status:

# while true; do cat /proc/drbd; sleep 1; done

Press Ctrl+C to break the loop and return to the command prompt.
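
If the drbd84-utils package on your nodes provides it, the drbd-overview utility gives a more compact, per-resource summary of the same information:

# drbd-overview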

4.7 Resource Definitions

The cluster configuration we are aiming for is known as an Active/Active configuration. In an Active/Active configuration, each cluster node provides a set of services (in our case, iSCSI targets) and can readily assume the role of other cluster nodes if they fail. Both nodes will be the primary source for two iSCSI targets, and the backup source for the other node’s targets. The services provided by a node are called cluster resources. There are many different types of cluster resources, some dealing with local daemons, others providing services to the network. The four types of cluster resources we will be using are as follows:

DRBD: This resource controls which node is the primary and which is the secondary for each DRBD device.

iSCSI Target: This resource creates the iSCSI target that will be used to access the DRBD device.

iSCSI Logical Unit: Abbreviated LUN. This resource associates a local disk object with an iSCSI target.

IP Address: This resource provides a floating IP address to ensure a specific iSCSI target is always available at a certain IP.

The pcs cluster manager allows us to construct configuration files separate from the active configuration. This ensures that the services are not started until they are fully configured. The command to write the current configuration to a file is:


# pcs cluster cib <filename>

When making changes to the external configuration file, use -f <filename> to specify the configuration file to change. After all the desired changes are made, apply them to the active configuration by executing:

# pcs cluster cib-push <filename>

4.7.1 DRBD

Create a separate configuration file for us to edit:

# pcs cluster cib drbd_cfg

Now configure the four DRBD resources:

# pcs -f drbd_cfg resource create bs1-drbd-ssd ocf:linbit:drbd \

drbd_resource=drbd-bs1-ssd \

op monitor interval=30s

# pcs -f drbd_cfg resource create bs2-drbd-ssd ocf:linbit:drbd \

drbd_resource=drbd-bs2-ssd \

op monitor interval=30s

# pcs -f drbd_cfg resource create bs1-drbd-hdd ocf:linbit:drbd \

drbd_resource=drbd-bs1-hdd \

op monitor interval=30s

# pcs -f drbd_cfg resource create bs2-drbd-hdd ocf:linbit:drbd \

drbd_resource=drbd-bs2-hdd \

op monitor interval=30s

Convert the DRBD resources into master/slave clones with the following commands:

# pcs -f drbd_cfg resource master bs1-drbd-ssd-clone bs1-drbd-ssd \

master-max=1 \

master-node-max=1 \

clone-max=2 \

clone-node-max=1 \

notify=true

# pcs -f drbd_cfg resource master bs2-drbd-ssd-clone bs2-drbd-ssd \

master-max=1 \

master-node-max=1 \

clone-max=2 \

clone-node-max=1 \

notify=true

# pcs -f drbd_cfg resource master bs1-drbd-hdd-clone bs1-drbd-hdd \

master-max=1 \

master-node-max=1 \

clone-max=2 \

clone-node-max=1 \

notify=true

# pcs -f drbd_cfg resource master bs2-drbd-hdd-clone bs2-drbd-hdd \

master-max=1 \

master-node-max=1 \

clone-max=2 \

clone-node-max=1 \

notify=true


Apply the configuration:

# pcs cluster cib-push drbd_cfg

4.7.2 iSCSI Target

Before we can configure iSCSI target resources, we must install the scsi-target-utils package from yum.

# yum install -y scsi-target-utils

Now we will create the iSCSI targets that will serve the data to the network. The installation of Pacemaker that is available from the default repositories did not come with the iSCSI target resource definition we need. I located the heartbeat resource agent script online at https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/iSCSITarget. Place this script in /usr/lib/ocf/resource.d/heartbeat/, then make it executable with chmod +x /usr/lib/ocf/resource.d/heartbeat/iSCSITarget. Once this is complete, create the iSCSI resources with the following commands.

# pcs resource create bs1-iscsi-ssd ocf:heartbeat:iSCSITarget \

params iqn="<TARGET-A-IQN>" tid="1" op monitor interval="30s"

# pcs resource create bs2-iscsi-ssd ocf:heartbeat:iSCSITarget \

params iqn="<TARGET-B-IQN>" tid="2" op monitor interval="30s"

# pcs resource create bs1-iscsi-hdd ocf:heartbeat:iSCSITarget \

params iqn="<TARGET-C-IQN>" tid="3" op monitor interval="30s"

# pcs resource create bs2-iscsi-hdd ocf:heartbeat:iSCSITarget \

params iqn="<TARGET-D-IQN>" tid="4" op monitor interval="30s"

An IQN is constructed using the current date, the machine’s fully qualified domain name, and a name for the target. The format is iqn.<YEAR>-<MONTH>.<REVERSE-FQDN>:<TARGET-NAME>. An example of something we might use is iqn.2015-04.edu.umd.ibbr.vm-block-storage-1:bs1-ssd. It is also important that each of the iSCSI target resources has a unique tid in case they reside on the same machine.
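
For instance, with the example IQN above substituted into the placeholder, the first target definition would look like this (illustrative only; substitute your own IQN):

# pcs resource create bs1-iscsi-ssd ocf:heartbeat:iSCSITarget \

params iqn="iqn.2015-04.edu.umd.ibbr.vm-block-storage-1:bs1-ssd" tid="1" op monitor interval="30s"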

4.7.3 iSCSI Logical Unit

The iSCSI resources do not by default have any disks associated with them. In order to expose a disk to the network using the iSCSI target, we must create a logical unit number (LUN) resource. The LUN resources are created with the following commands.

# pcs resource create bs1-lun-ssd ocf:heartbeat:iSCSILogicalUnit \

params target_iqn="<TARGET-A-IQN>" lun="2" path="/dev/drbd0"

# pcs resource create bs2-lun-ssd ocf:heartbeat:iSCSILogicalUnit \

params target_iqn="<TARGET-B-IQN>" lun="3" path="/dev/drbd1"

# pcs resource create bs1-lun-hdd ocf:heartbeat:iSCSILogicalUnit \

params target_iqn="<TARGET-C-IQN>" lun="4" path="/dev/drbd2"

# pcs resource create bs2-lun-hdd ocf:heartbeat:iSCSILogicalUnit \

params target_iqn="<TARGET-D-IQN>" lun="5" path="/dev/drbd3"


4.7.4 IP Address

Each iSCSI target has an associated IP address that will always run alongside it. This is so ESXi hosts can always access the same iSCSI targets at the same IP addresses. The IP address resources are created with the following commands.

# pcs resource create bs1-ip-ssd-102-61 ocf:heartbeat:IPaddr2 \

ip=192.54.102.61 cidr_netmask=32 nic=eth1:0 op monitor interval=30s

# pcs resource create bs2-ip-ssd-102-63 ocf:heartbeat:IPaddr2 \

ip=192.54.102.63 cidr_netmask=32 nic=eth1:1 op monitor interval=30s

# pcs resource create bs1-ip-hdd-102-64 ocf:heartbeat:IPaddr2 \

ip=192.54.102.64 cidr_netmask=32 nic=eth1:2 op monitor interval=30s

# pcs resource create bs2-ip-hdd-102-65 ocf:heartbeat:IPaddr2 \

ip=192.54.102.65 cidr_netmask=32 nic=eth1:3 op monitor interval=30s

The IP resources are named to indicate which target they are associated with, as well as to quickly show which IP they carry.
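
After the resources start, the floating addresses should be visible as additional addresses on the LAN interface of whichever node currently holds them (eth1 here matches the interface used in the IPaddr2 definitions above):

# ip addr show eth1 | grep 192.54.102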


4.8 Constraint Definitions

Another important part of cluster resource configuration is resource constraints. Constraints allow us to specify conditions that must be met for certain resources. Some examples include mandating that two resources reside on the same node, or specifying the order in which resources will be started. We will use constraints to control exactly how our cluster starts and allocates resources. We will use location, colocation, and order constraints.

4.8.1 Location

Location constraints ensure that specific resources move onto specific nodes when they are available. Our location constraints will keep bs1-drbd-ssd-clone and bs1-drbd-hdd-clone on vm-block-storage-1, and keep bs2-drbd-ssd-clone and bs2-drbd-hdd-clone on vm-block-storage-2. The following commands will define these location constraints.

# pcs constraint location bs1-drbd-ssd-clone prefers vm-block-storage-1=INFINITY

# pcs constraint location bs2-drbd-ssd-clone prefers vm-block-storage-2=INFINITY

# pcs constraint location bs1-drbd-hdd-clone prefers vm-block-storage-1=INFINITY

# pcs constraint location bs2-drbd-hdd-clone prefers vm-block-storage-2=INFINITY

4.8.2 Colocation

Colocation constraints ensure that two specific resources are placed on the same node. We use these constraints to ensure the IP address, iSCSI target, and LUN stay on the master node of their associated DRBD device. The following commands define all of the necessary colocation constraints.

# pcs constraint colocation add bs1-ip-ssd-102-61 with bs1-drbd-ssd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs2-ip-ssd-102-63 with bs2-drbd-ssd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs1-ip-hdd-102-64 with bs1-drbd-hdd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs2-ip-hdd-102-65 with bs2-drbd-hdd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs1-lun-ssd with bs1-drbd-ssd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs2-lun-ssd with bs2-drbd-ssd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs1-lun-hdd with bs1-drbd-hdd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs2-lun-hdd with bs2-drbd-hdd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs1-iscsi-ssd with bs1-drbd-ssd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs2-iscsi-ssd with bs2-drbd-ssd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs1-iscsi-hdd with bs1-drbd-hdd-clone \

INFINITY with-rsc-role=Master

# pcs constraint colocation add bs2-iscsi-hdd with bs2-drbd-hdd-clone \

INFINITY with-rsc-role=Master

4.8.3 Order

The last type of constraint used in our configuration is the order constraint. It is not enough to have the DRBD/iSCSI/LUN/IP resource stack on the same node; the resources must also start up in a specific order. The order is: the DRBD clone resource is promoted to master, then the iSCSI target resource starts, then the LUN resource, and finally the IP address resource. This order is important for two reasons: the LUN will not start if the iSCSI target is not present, and connection errors may occur if iSCSI initiators can contact the target via the IP address before it has been linked to the block device. The following commands define all of the necessary order constraints.

# pcs constraint order promote bs1-drbd-ssd-clone then start bs1-iscsi-ssd

# pcs constraint order promote bs2-drbd-ssd-clone then start bs2-iscsi-ssd

# pcs constraint order promote bs1-drbd-hdd-clone then start bs1-iscsi-hdd

# pcs constraint order promote bs2-drbd-hdd-clone then start bs2-iscsi-hdd

# pcs constraint order start bs1-iscsi-ssd then start bs1-lun-ssd

# pcs constraint order start bs2-iscsi-ssd then start bs2-lun-ssd

# pcs constraint order start bs1-iscsi-hdd then start bs1-lun-hdd

# pcs constraint order start bs2-iscsi-hdd then start bs2-lun-hdd

# pcs constraint order start bs1-lun-ssd then start bs1-ip-ssd-102-61

# pcs constraint order start bs2-lun-ssd then start bs2-ip-ssd-102-63

# pcs constraint order start bs1-lun-hdd then start bs1-ip-hdd-102-64

# pcs constraint order start bs2-lun-hdd then start bs2-ip-hdd-102-65


CHAPTER 5

MAINTENANCE TASKS

5.1 Checking Pacemaker Status

The status of the pacemaker cluster can be checked from either node in the cluster. To check the status, use the crm_mon command. The output should look like the following.

Last updated: Thu Jul 23 00:03:35 2015

Last change: Wed Jul 15 00:14:21 2015

Stack: cman

Current DC: vm-block-storage-1 - partition with quorum

Version: 1.1.11-97629de

2 Nodes configured

20 Resources configured

Online: [ vm-block-storage-1 vm-block-storage-2 ]

Master/Slave Set: bs1-drbd-ssd-clone [bs1-drbd-ssd]

Masters: [ vm-block-storage-1 ]

Slaves: [ vm-block-storage-2 ]

Master/Slave Set: bs2-drbd-ssd-clone [bs2-drbd-ssd]

Masters: [ vm-block-storage-2 ]

Slaves: [ vm-block-storage-1 ]

Master/Slave Set: bs1-drbd-hdd-clone [bs1-drbd-hdd]

Masters: [ vm-block-storage-1 ]

Slaves: [ vm-block-storage-2 ]

Master/Slave Set: bs2-drbd-hdd-clone [bs2-drbd-hdd]

Masters: [ vm-block-storage-2 ]

Slaves: [ vm-block-storage-1 ]

bs1-ip-ssd-102-61 (ocf::heartbeat:IPaddr2): Started vm-block-storage-1

bs2-ip-ssd-102-63 (ocf::heartbeat:IPaddr2): Started vm-block-storage-2

bs1-ip-hdd-102-64 (ocf::heartbeat:IPaddr2): Started vm-block-storage-1

bs2-ip-hdd-102-65 (ocf::heartbeat:IPaddr2): Started vm-block-storage-2

bs1-iscsi-ssd (ocf::heartbeat:iSCSITarget): Started vm-block-storage-1

bs2-iscsi-ssd (ocf::heartbeat:iSCSITarget): Started vm-block-storage-2

bs1-iscsi-hdd (ocf::heartbeat:iSCSITarget): Started vm-block-storage-1

bs2-iscsi-hdd (ocf::heartbeat:iSCSITarget): Started vm-block-storage-2

bs1-lun-ssd (ocf::heartbeat:iSCSILogicalUnit): Started vm-block-storage-1

bs2-lun-ssd (ocf::heartbeat:iSCSILogicalUnit): Started vm-block-storage-2

bs1-lun-hdd (ocf::heartbeat:iSCSILogicalUnit): Started vm-block-storage-1

bs2-lun-hdd (ocf::heartbeat:iSCSILogicalUnit): Started vm-block-storage-2

The display will update periodically until the program is terminated with Ctrl+C. In a non-failover scenario, all resources beginning with “bs1” should be started on vm-block-storage-1 and all resources beginning with “bs2” should be started on vm-block-storage-2.
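
For a one-shot status report that prints once and exits (useful in scripts or when checking over SSH), crm_mon accepts the -1 flag:

# crm_mon -1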

5.2 Checking DRBD Device Status

While the crm_mon command can show us which resources are started and their location, it cannot indicate whether or not DRBD devices are in sync between nodes. It is not likely that DRBD devices will fall out of sync unless an actual fail-over is triggered by one block storage node going offline for a period of time. To check the status of DRBD devices, examine the contents of /proc/drbd using the cat command (cat /proc/drbd). The contents should resemble the following.

0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----

ns:175674772 nr:0 dw:175674772 dr:8720304 al:23396 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----

ns:0 nr:130038968 dw:130038968 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----

ns:157820396 nr:0 dw:157820396 dr:12146212 al:5606 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----

ns:0 nr:140320948 dw:140320948 dr:664 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

The output displays two lines for each DRBD device. The second line is mostly data transfer statistics; these numbers will vary each time the output is examined. The first line is what we can use to determine if a specific device needs our attention. The important parts of this output are the connection state (cs) and the role (ro). If all is well, the connection state should indicate “Connected”, and the role should indicate either “Primary/Secondary” or “Secondary/Primary”. If any devices indicate a connection state of “StandAlone” and a role of either “Secondary/Unknown” or “Primary/Unknown”, then the device is in a state called a “Split Brain” and must be manually reconnected. For instructions on resolving this issue, see the following section titled Recovering from a Fail-Over.

For easy reference, the numbers 0-3 refer to each of the following DRBD devices.

0: drbd-bs1-ssd

1: drbd-bs2-ssd

2: drbd-bs1-hdd

3: drbd-bs2-hdd
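
A quick scripted check for the problem states described above (a minimal sketch; it simply greps /proc/drbd for connection states that need attention):

# grep -E 'cs:(StandAlone|WFConnection)' /proc/drbd || echo "all DRBD devices connected"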

5.3 Recovering from a Fail-Over

If a fail-over occurs and all pacemaker resources are forced to migrate to a single node, certain steps must be taken before bringing the failed node back online. There are many events that can cause a fail-over: a failed hardware device, a power outage, or even an unplugged network cable. Addressing these individual events is beyond the scope of this document. For the purpose of this document, we will assume that the original cause of the fail-over has been resolved and what is left to be done is to bring the cluster back to its normal operational state.

Once the previously failed node is booted up, check the status of the DRBD devices using the method described in the previous section. Do not start the pacemaker cluster on this node yet! Doing so will cause pacemaker to attempt to migrate resources to the node before its data is in sync. It is likely that all devices are in connection state “StandAlone” and role “Secondary/Unknown”. They may also be in connection state “WFConnection” (waiting for connection) but will eventually fail to connect and end up in a stand-alone state. If you check the DRBD device status on the other node, the devices will be in connection state “StandAlone” and role “Primary/Unknown”.

On the previously failed node, run the following command to reconnect the DRBD device, discarding its local changes so that the data written on the surviving node while this node was offline can be synced over.


# drbdadm connect --discard-my-data <drbd-device>

Then, on the other node, reconnect the device with the following command.

# drbdadm connect <drbd-device>
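
For example, if the drbd-bs1-ssd device needs to be recovered and vm-block-storage-1 was the node that failed, the sequence would be as follows. On vm-block-storage-1 (the previously failed node):

# drbdadm connect --discard-my-data drbd-bs1-ssd

And on vm-block-storage-2 (the surviving node):

# drbdadm connect drbd-bs1-ssd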

If you examine the contents of the file /proc/drbd, you should see the device syncing. Repeat these commands for each DRBD device as necessary. Once all devices are syncing, use the following command to continually monitor their progress.

# while true; do cat /proc/drbd; sleep 1; done

Once all devices are synced, press Ctrl+C to break the command. Now start pacemaker on the previously failed node with the following command.

# pcs cluster start

After pacemaker starts, resources will be moved onto the nodes where they normally reside. The progress of resource migration can be checked with the crm_mon command.


CHAPTER 6

REVISION HISTORY

July 22, 2015: Initial version of the document, created upon the completion of the VM infrastructure project.

July 23, 2015: Added instructions for checking DRBD device status and recovering from a fail-over.

December 16, 2015: Reformatted code blocks using the LaTeX package ‘minted’ (https://code.google.com/p/minted/). Made several minor edits to improve document clarity.
