Sharing High-Performance Devices Across Multiple Virtual Machines
Preamble
• What does "sharing devices across multiple virtual machines" in our title mean?
• How is it different from virtual networking / NSX, which allows multiple virtual networks to share the underlying networking hardware?
• Virtual networking works well for many standard workloads, but in the realm of extreme performance we need to deliver much closer to bare-metal performance to meet application requirements
• Application areas: Science & Research (HPC), Finance, Machine Learning & Big Data, etc.
• This talk is about achieving both extremely high performance and device sharing
Sharing High-Performance PCI Devices
1. Technical Background
2. Big Data Analytics with SPARK
3. High Performance (Technical) Computing
Direct Device Access Technologies
Accessing PCI devices with maximum performance

VM DirectPath I/O
• Allows PCI devices to be accessed directly by the guest OS
– Examples: GPUs for computation (GPGPU), ultra-low-latency interconnects like InfiniBand and RDMA over Converged Ethernet (RoCE)
• Downsides: no vMotion, no snapshots, etc.
• The full device is made available to a single VM – no sharing
• No ESXi driver required – just the standard vendor device driver

[Diagram: a virtual machine (application → guest OS kernel) accessing a PCI device directly via DirectPath I/O on VMware ESXi]
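As a quick sanity check, a passed-through device appears as ordinary hardware inside the guest. A minimal sketch (which tool applies depends on the device; ibstat and nvidia-smi are assumptions that require the vendor driver and utilities to be installed in the guest):

    # Inside the guest OS: the passed-through device shows up as a normal PCI device
    lspci | grep -i -e mellanox -e nvidia

    # For an InfiniBand/RoCE HCA (requires infiniband-diags):
    ibstat

    # For an NVIDIA GPU (requires the NVIDIA driver):
    nvidia-smi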
Device Partitioning (SR-IOV)
• The PCI standard includes a specification for SR-IOV, Single Root I/O Virtualization
• A single PCI device can present itself as multiple logical devices (Virtual Functions, or VFs) to ESXi and to VMs
• Downsides: no vMotion, no snapshots (but note: the pvRDMA feature in ESXi 6.5)
• An ESXi driver and a guest driver are required for SR-IOV
• Mellanox Technologies supports ESXi SR-IOV for both InfiniBand and RDMA over Converged Ethernet (RoCE) interconnects

[Diagram: one VM attached to a vSwitch through a VMXNET3 vNIC, and another VM attached directly to an SR-IOV Virtual Function; the nmlx5 ESXi driver manages the Physical Function (PF) and its VFs]
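The number of VFs a device exposes is typically controlled by a host driver parameter. A minimal sketch for the Mellanox nmlx5_core driver (the max_vfs parameter name and value are taken from Mellanox's ESXi driver documentation and should be verified against your driver release):

    # On the ESXi host: ask the nmlx5_core driver to expose 8 VFs
    esxcli system module parameters set -m nmlx5_core -p "max_vfs=8"

    # A host reboot is required before the VFs appear
    reboot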
Remote Direct Memory Access (RDMA)
• A hardware transport protocol
– Optimized for moving data to/from memory
• Extreme performance
– 600 ns application-to-application latencies
– 100 Gb/s throughput
– Negligible CPU overhead
• RDMA applications
– Storage (iSER, NFS-RDMA, NVMe-oF, Lustre)
– HPC (MPI, SHMEM)
– Big data and analytics (Hadoop, Spark)
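Latency and throughput numbers like these can be reproduced with the standard perftest utilities shipped with most RDMA stacks. A sketch (the mlx5_0 device name and the server hostname are placeholders):

    # On the server VM: start the RDMA-write latency test listener
    ib_write_lat -d mlx5_0

    # On the client VM: run the latency test against the server
    ib_write_lat -d mlx5_0 server-hostname

    # Bandwidth test, sweeping all message sizes (-a)
    ib_write_bw -d mlx5_0 -a server-hostname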
How does RDMA achieve high performance?
• Traditional network stack challenges
– Per-message / per-packet / per-byte overheads
– User–kernel crossings
– Memory copies
• RDMA provides in hardware:
– Isolation between applications
– Transport
• Packetizing messages
• Reliable delivery
– Address translation
• User-level networking
– Direct hardware access for the data path
[Diagram: applications (AppA, AppB) in user space and kernel consumers (NVMe-oF, iSER) each own buffers that RDMA-capable hardware reads and writes directly, bypassing the kernel on the data path]
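The benefit of bypassing the kernel stack is easy to see by measuring kernel TCP and RDMA on the same link, for example with qperf. A sketch (the test names follow qperf's conventions; the server hostname is a placeholder):

    # On the server VM: start the qperf listener
    qperf

    # On the client VM: kernel TCP latency and bandwidth...
    qperf server-hostname tcp_lat tcp_bw

    # ...versus RDMA-write latency and bandwidth over the same fabric
    qperf server-hostname rc_rdma_write_lat rc_rdma_write_bw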
Host Configuration – Driver Installation
• VM DirectPath I/O does not require an ESXi driver – InfiniBand and RoCE work with the standard guest driver in this case
• To use SR-IOV, a host driver is required:
– SR-IOV RoCE bundle: https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI65-MELLANOX-NMLX5_CORE-41688&productId=614
– SR-IOV InfiniBand bundle: will be GA in Q4 2017
– Management tools: http://www.mellanox.com/page/management_tools
– Install and configure the host driver using suitable driver parameters (a sketch follows below)
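Driver bundles install through the standard esxcli workflow. A minimal sketch (the datastore path and bundle file name are placeholders for whichever bundle you downloaded):

    # On the ESXi host: install the offline driver bundle, then reboot
    esxcli software vib install -d /vmfs/volumes/datastore1/MLNX-NMLX5-bundle.zip
    reboot

    # After reboot, confirm the driver module is loaded
    esxcli system module list | grep nmlx5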
Verify Virtual Functions are available
1) Select Host
2) Select Configure Tab
3) Select PCI Devices
4) Check Virtual Function is available
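The same check can be made from the ESXi shell. A sketch (vmnic4 is a placeholder for the SR-IOV-capable uplink on your host):

    # List SR-IOV capable NICs, then the VFs enabled on one of them
    esxcli network sriovnic list
    esxcli network sriovnic vf list -n vmnic4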
Host Configuration – Assign a VF to a VM
1) Select VM
2) Select Manage Tab
3) Select VM Hardware
4) Select Edit
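Once the VM is powered on with a VF attached (note that both DirectPath I/O and SR-IOV require the VM's memory to be fully reserved), the guest sees it as an ordinary RDMA device. A sketch, assuming the Mellanox guest driver and the user-space verbs utilities are installed:

    # Inside the guest: the VF appears as a PCI device with an RDMA interface
    lspci | grep -i mellanox
    ibv_devinfo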
SPARK Big Data Analytics
Accelerating time to solution with shared, high-performance interconnect
SPARK Test Results – vSphere with SR-IOV
Runtime samples | SR-IOV TCP (sec) | SR-IOV RDMA (sec) | Improvement
Average         | 222 (1.05x)      | 171 (1.01x)       | 23%
Min             | 213 (1.07x)      | 165 (1.05x)       | 23%
Max             | 233 (1.05x)      | 174 (1.0x)        | 25%
[Chart: run time in seconds, SR-IOV TCP vs. SR-IOV RDMA (lower is better), for the average, min, and max samples above]
16 ESXi 6.5 hosts, one Spark VM per host
1 server used as the HDFS NameNode
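Results like these rely on an RDMA-enabled shuffle implementation. A minimal sketch of wiring one in, assuming the open-source Mellanox SparkRDMA plugin (the jar path is a placeholder; the shuffle-manager class name is taken from that project's documentation and should be checked against the release you use):

    # Point Spark's shuffle at the RDMA shuffle manager via spark-submit
    spark-submit \
      --conf spark.driver.extraClassPath=/opt/sparkrdma/spark-rdma.jar \
      --conf spark.executor.extraClassPath=/opt/sparkrdma/spark-rdma.jar \
      --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
      your-app.jar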
High Performance Computing
Research, Science, and Engineering applications on vSphere
Two Classes of Workloads: Throughput and Tightly-Coupled

Throughput ("embarrassingly parallel")
Examples:
• Digital movie rendering
• Financial risk analysis
• Microprocessor design
• Genomics analysis

Tightly-Coupled (often use the Message Passing Interface, MPI)
Examples:
• Weather forecasting
• Molecular modelling
• Jet engine design
• Spaceship, airplane & automobile design

[Diagram: an HPC cluster of ESXi hosts running both classes of workload]
InfiniBand SR-IOV MPI Example

[Diagram: two virtual clusters (Cluster 1 and Cluster 2) across three ESXi hosts; each host runs one Linux VM per cluster, and each VM reaches the InfiniBand (IB) adapter through its own SR-IOV Virtual Function]

• SR-IOV InfiniBand
• All VMs: #vCPU = #cores
• 100% CPU overcommit
• No memory overcommit
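An MPI job on such a virtual cluster launches exactly as it would on bare metal. A sketch using Open MPI of that era (the host names, slot counts, and the openib transport selection are assumptions):

    # hostfile listing the VMs of one virtual cluster:
    #   vm1 slots=20
    #   vm2 slots=20
    #   vm3 slots=20

    # Launch 60 MPI ranks over the InfiniBand verbs transport
    mpirun -np 60 --hostfile hostfile \
           --mca btl self,vader,openib ./my_mpi_app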
InfiniBand SR-IOV MPI Performance Test

[Diagram: the same two-vCluster, three-host SR-IOV InfiniBand configuration as the previous slide]

Application: NAMD
Benchmark: STMV
20-vCPU VMs for all tests
60 MPI processes per job

[Chart: run time in seconds – bare metal: 93.4; one vCluster: 98.5; two vClusters running concurrently: 169.3 each; a ~10% delta is annotated on the chart]
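For reference, a sketch of how such a test is launched, assuming an MPI build of NAMD and the standard STMV benchmark input (the hostfile and input path are placeholders):

    # 60 ranks of NAMD running the 1M-atom STMV benchmark
    mpirun -np 60 --hostfile hostfile namd2 stmv/stmv.namd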
Compute Accelerators
Enabling Machine Learning, Financial and other HPC applications on vSphere
Shared NVIDIA GPGPU Computing

[Diagram: one ESXi host with an NVIDIA P100 GPU and the NVIDIA GRID host driver; two Linux VMs, each running CUDA, the NVIDIA guest driver, and TensorFlow]

• TensorFlow RNN
• SuperMicro dual 12-core system
• 16 GB NVIDIA P100 GPU
• Two VMs, each with an 8Q GPU profile
• NVIDIA GRID 5.0
• ESXi 6.5

Scheduling policies (see the configuration sketch below):
• Fixed share
• Equal share
• Best effort
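NVIDIA's vGPU documentation describes selecting among these schedulers through the RmPVMRL registry key of the host driver. A minimal sketch (the exact values should be checked against the GRID release notes for your version):

    # On the ESXi host: select the vGPU scheduling policy via the nvidia module
    #   0x01 = equal share, 0x11 = fixed share; omit the key for best effort (default)
    esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x1"

    # Reboot the host for the policy to take effect
    reboot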
Shared NVIDIA GPGPU Computing

[Chart: TensorFlow RNN results – single P100, two 8Q VMs, legacy scheduler]
Summary
• Virtualization can support high-performance device sharing for cases in which extreme performance is a critical requirement
• Virtualization supports device sharing and delivers near bare-metal performance
– High Performance Computing
– Big Data SPARK Analytics
– Machine and Deep Learning with GPGPU
• The VMware platform and partner ecosystem address the extreme performance needs of the most demanding emerging workloads