Sharing High-Performance Devices Across Multiple Virtual Machines
Preamble
• What does "sharing devices across multiple virtual machines" in our title mean?
• How is it different from virtual networking / NSX, which allows multiple virtual networks to share the underlying networking hardware?
• Virtual networking works well for many standard workloads, but in the realm of extreme performance we need to deliver much closer to bare-metal performance to meet application requirements
• Application areas: Science & Research (HPC), Finance, Machine Learning & Big Data, etc.
• This talk is about achieving both extremely high performance and device sharing
Sharing High-Performance PCI Devices
1. Technical Background
2. Big Data Analytics with SPARK
3. High Performance (Technical) Computing
Direct Device Access Technologies
Accessing PCI devices with maximum performance

VM DirectPath I/O
• Allows PCI devices to be accessed directly by the guest OS
– Examples: GPUs for computation (GPGPU), ultra-low-latency interconnects like InfiniBand and RDMA over Converged Ethernet (RoCE)
• Downsides: no vMotion, no snapshots, etc.
• The full device is made available to a single VM – no sharing
• No ESXi driver required – just the standard vendor device driver

[Diagram: a virtual machine (application → guest OS kernel) accessing a PCI device directly via DirectPath I/O on VMware ESXi]
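As a quick sanity check, a passed-through device appears as ordinary hardware inside the guest. A minimal sketch (which tool applies depends on the device; ibstat and nvidia-smi are assumptions that require the vendor driver and utilities to be installed in the guest):

    # Inside the guest OS: the passed-through device shows up as a normal PCI device
    lspci | grep -i -e mellanox -e nvidia

    # For an InfiniBand/RoCE HCA (requires infiniband-diags):
    ibstat

    # For an NVIDIA GPU (requires the NVIDIA driver):
    nvidia-smi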
Device Partitioning (SR-IOV)
• The PCI standard includes a specification for SR-IOV, Single Root I/O Virtualization
• A single PCI device can present itself as multiple logical devices (Virtual Functions, or VFs) to ESXi and to VMs
• Downsides: no vMotion, no snapshots (but note: the pvRDMA feature in ESXi 6.5)
• An ESXi driver and a guest driver are required for SR-IOV
• Mellanox Technologies supports ESXi SR-IOV for both InfiniBand and RDMA over Converged Ethernet (RoCE) interconnects

[Diagram: one VM attached to a vSwitch through a VMXNET3 vNIC, and another VM attached directly to an SR-IOV Virtual Function; the nmlx5 ESXi driver manages the Physical Function (PF) and its VFs]
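The number of VFs a device exposes is typically controlled by a host driver parameter. A minimal sketch for the Mellanox nmlx5_core driver (the max_vfs parameter name and value are taken from Mellanox's ESXi driver documentation and should be verified against your driver release):

    # On the ESXi host: ask the nmlx5_core driver to expose 8 VFs
    esxcli system module parameters set -m nmlx5_core -p "max_vfs=8"

    # A host reboot is required before the VFs appear
    reboot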
Remote Direct Memory Access (RDMA)
• A hardware transport protocol
– Optimized for moving data to/from memory
• Extreme performance
– 600 ns application-to-application latencies
– 100 Gb/s throughput
– Negligible CPU overhead
• RDMA applications
– Storage (iSER, NFS-RDMA, NVMe-oF, Lustre)
– HPC (MPI, SHMEM)
– Big data and analytics (Hadoop, Spark)
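Latency and throughput numbers like these can be reproduced with the standard perftest utilities shipped with most RDMA stacks. A sketch (the mlx5_0 device name and the server hostname are placeholders):

    # On the server VM: start the RDMA-write latency test listener
    ib_write_lat -d mlx5_0

    # On the client VM: run the latency test against the server
    ib_write_lat -d mlx5_0 server-hostname

    # Bandwidth test, sweeping all message sizes (-a)
    ib_write_bw -d mlx5_0 -a server-hostname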
How does RDMA achieve high performance?
• Traditional network stack challenges
– Per-message / per-packet / per-byte overheads
– User–kernel crossings
– Memory copies
• RDMA provides in hardware:
– Isolation between applications
– Transport
• Packetizing messages
• Reliable delivery
– Address translation
• User-level networking
– Direct hardware access for the data path
[Diagram: applications (AppA, AppB) in user space and kernel consumers (NVMe-oF, iSER) each own buffers that RDMA-capable hardware reads and writes directly, bypassing the kernel on the data path]
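The benefit of bypassing the kernel stack is easy to see by measuring kernel TCP and RDMA on the same link, for example with qperf. A sketch (the test names follow qperf's conventions; the server hostname is a placeholder):

    # On the server VM: start the qperf listener
    qperf

    # On the client VM: kernel TCP latency and bandwidth...
    qperf server-hostname tcp_lat tcp_bw

    # ...versus RDMA-write latency and bandwidth over the same fabric
    qperf server-hostname rc_rdma_write_lat rc_rdma_write_bw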
Host Configuration – Driver Installation
• VM DirectPath I/O does not require an ESXi driver – InfiniBand and RoCE work with the standard guest driver in this case
• To use SR-IOV, a host driver is required:
– SR-IOV RoCE bundle: https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI65-MELLANOX-NMLX5_CORE-41688&productId=614
– SR-IOV InfiniBand bundle: will be GA in Q4 2017
– Management tools: http://www.mellanox.com/page/management_tools
– Install and configure the host driver using suitable driver parameters (a sketch follows below)
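Driver bundles install through the standard esxcli workflow. A minimal sketch (the datastore path and bundle file name are placeholders for whichever bundle you downloaded):

    # On the ESXi host: install the offline driver bundle, then reboot
    esxcli software vib install -d /vmfs/volumes/datastore1/MLNX-NMLX5-bundle.zip
    reboot

    # After reboot, confirm the driver module is loaded
    esxcli system module list | grep nmlx5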
Verify Virtual Functions are available
1) Select Host
2) Select Configure Tab
3) Select PCI Devices
4) Check Virtual Function is available
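The same check can be made from the ESXi shell. A sketch (vmnic4 is a placeholder for the SR-IOV-capable uplink on your host):

    # List SR-IOV capable NICs, then the VFs enabled on one of them
    esxcli network sriovnic list
    esxcli network sriovnic vf list -n vmnic4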
Host Configuration – Assign a VF to a VM
1) Select VM
2) Select Manage Tab
3) Select VM Hardware
4) Select Edit
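Once the VM is powered on with a VF attached (note that both DirectPath I/O and SR-IOV require the VM's memory to be fully reserved), the guest sees it as an ordinary RDMA device. A sketch, assuming the Mellanox guest driver and the user-space verbs utilities are installed:

    # Inside the guest: the VF appears as a PCI device with an RDMA interface
    lspci | grep -i mellanox
    ibv_devinfo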
SPARK Big Data Analytics
Accelerating time to solution with shared, high-performance interconnect
SPARK Test Results – vSphere with SR-IOV
Runtime samples | SR-IOV TCP (sec) | SR-IOV RDMA (sec) | Improvement
Average         | 222 (1.05x)      | 171 (1.01x)       | 23%
Min             | 213 (1.07x)      | 165 (1.05x)       | 23%
Max             | 233 (1.05x)      | 174 (1.0x)        | 25%
[Chart: run time in seconds, SR-IOV TCP vs. SR-IOV RDMA (lower is better), for the average, min, and max samples above]
16 ESXi 6.5 hosts, one Spark VM per host
1 server used as the HDFS NameNode
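Results like these rely on an RDMA-enabled shuffle implementation. A minimal sketch of wiring one in, assuming the open-source Mellanox SparkRDMA plugin (the jar path is a placeholder; the shuffle-manager class name is taken from that project's documentation and should be checked against the release you use):

    # Point Spark's shuffle at the RDMA shuffle manager via spark-submit
    spark-submit \
      --conf spark.driver.extraClassPath=/opt/sparkrdma/spark-rdma.jar \
      --conf spark.executor.extraClassPath=/opt/sparkrdma/spark-rdma.jar \
      --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
      your-app.jar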
High Performance Computing
Research, Science, and Engineering applications on vSphere
Two Classes of Workloads: Throughput and Tightly-Coupled

Throughput ("embarrassingly parallel")
Examples:
• Digital movie rendering
• Financial risk analysis
• Microprocessor design
• Genomics analysis

Tightly-Coupled (often use the Message Passing Interface, MPI)
Examples:
• Weather forecasting
• Molecular modelling
• Jet engine design
• Spaceship, airplane & automobile design

[Diagram: an HPC cluster of ESXi hosts running both classes of workload]
InfiniBand SR-IOV MPI Example

[Diagram: two virtual clusters (Cluster 1 and Cluster 2) across three ESXi hosts; each host runs one Linux VM per cluster, and each VM reaches the InfiniBand (IB) adapter through its own SR-IOV Virtual Function]

• SR-IOV InfiniBand
• All VMs: #vCPU = #cores
• 100% CPU overcommit
• No memory overcommit
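An MPI job on such a virtual cluster launches exactly as it would on bare metal. A sketch using Open MPI of that era (the host names, slot counts, and the openib transport selection are assumptions):

    # hostfile listing the VMs of one virtual cluster:
    #   vm1 slots=20
    #   vm2 slots=20
    #   vm3 slots=20

    # Launch 60 MPI ranks over the InfiniBand verbs transport
    mpirun -np 60 --hostfile hostfile \
           --mca btl self,vader,openib ./my_mpi_app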
InfiniBand SR-IOV MPI Performance Test

[Diagram: the same two-vCluster, three-host SR-IOV InfiniBand configuration as the previous slide]

Application: NAMD
Benchmark: STMV
20-vCPU VMs for all tests
60 MPI processes per job

[Chart: run time in seconds – bare metal: 93.4; one vCluster: 98.5; two vClusters running concurrently: 169.3 each; a ~10% delta is annotated on the chart]
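For reference, a sketch of how such a test is launched, assuming an MPI build of NAMD and the standard STMV benchmark input (the hostfile and input path are placeholders):

    # 60 ranks of NAMD running the 1M-atom STMV benchmark
    mpirun -np 60 --hostfile hostfile namd2 stmv/stmv.namd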
Compute Accelerators
Enabling Machine Learning, Financial and other HPC applications on vSphere
Shared NVIDIA GPGPU Computing

[Diagram: one ESXi host with an NVIDIA P100 GPU and the NVIDIA GRID host driver; two Linux VMs, each running CUDA, the NVIDIA guest driver, and TensorFlow]

• TensorFlow RNN
• SuperMicro dual 12-core system
• 16 GB NVIDIA P100 GPU
• Two VMs, each with an 8Q GPU profile
• NVIDIA GRID 5.0
• ESXi 6.5

Scheduling policies (see the configuration sketch below):
• Fixed share
• Equal share
• Best effort
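NVIDIA's vGPU documentation describes selecting among these schedulers through the RmPVMRL registry key of the host driver. A minimal sketch (the exact values should be checked against the GRID release notes for your version):

    # On the ESXi host: select the vGPU scheduling policy via the nvidia module
    #   0x01 = equal share, 0x11 = fixed share; omit the key for best effort (default)
    esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x1"

    # Reboot the host for the policy to take effect
    reboot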
Shared NVIDIA GPGPU Computing

[Chart: TensorFlow RNN results – single P100, two 8Q VMs, legacy scheduler]
Summary
• Virtualization can support high-performance device sharing for cases in which extreme performance is a critical requirement
• Virtualization supports device sharing and delivers near bare-metal performance
– High Performance Computing
– Big Data SPARK Analytics
– Machine and Deep Learning with GPGPU
• The VMware platform and partner ecosystem address the extreme performance needs of the most demanding emerging workloads