
IO Virtualization with InfiniBand
[InfiniBand as a Hypervisor Accelerator]

Michael Kagan
Vice President, Architecture
Mellanox Technologies
[email protected]


Key messages

• InfiniBand enables efficient server virtualization
  – Cross-domain isolation
  – Efficient IO sharing
  – Protection enforcement
• Existing HW fully supports virtualization
  – The most cost-effective path for single-node virtual servers
  – SW-transparent scale-out
• VMM support in the OpenIB SW stack by fall '05
  – Alpha version of FW and driver in June


InfiniBand scope in server virtualization

  Virtualization area            In InfiniBand scope?
  • CPU virtualization           No
    – Compute power
  • Memory virtualization        Partial
    – Memory allocation            – No
    – Address translation          – Yes, for IO accesses
    – Protection                   – Yes, for IO accesses
  • IO virtualization            Yes

[Diagram: virtualized server – a Hypervisor hosts Domain0, DomainX and DomainY; each guest kernel has an IO driver, and Domain0 bridges them through virtual switch(es) to the physical IO device; CPU and memory sit underneath.]


[Diagram: InfiniBand fabric – multiple switches interconnecting many end nodes.]

InfiniBand – Overview

• Performance
  – Bandwidth – up to 120 Gbit/sec per link
  – Latency – under 3 µsec (today)
• Kernel bypass for IO access
  – Cross-process protection and isolation
• Quality of Service
  – End node
  – Fabric
• Scalability/flexibility
  – Up to 48K local nodes, up to 2^128 total
  – Multiple link widths/traces (Cu, fiber)
• Multiple transport services in HW
  – Reliable and unreliable
  – Connected and datagram (see the sketch below)
  – Automatic path migration in HW
• Memory exposure to remote node
  – RDMA-read and RDMA-write
• Multiple networks on a single wire
  – Network partitioning in HW ("VLAN")
  – Multiple independent virtual networks on a wire

Link data rate – today: 2.5, 10, 20, 30, 60 Gb/s; spec: up to 120 Gb/s; Cu and optical.
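As an illustration of the "multiple transport services in HW" point above, the sketch below (mine, not from the deck) creates two queue pairs with different transport types through the OpenIB verbs API (libibverbs); the HCA, not the kernel, implements the chosen transport. Device, PD and CQ setup are assumed to exist already.

```c
/* Minimal sketch (not from the slides): selecting the HW transport service
 * when creating a queue pair with libibverbs. Device/PD/CQ setup and error
 * handling are trimmed for brevity. */
#include <infiniband/verbs.h>

static struct ibv_qp *make_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                              enum ibv_qp_type type)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = type,              /* transport service lives in the HCA */
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    return ibv_create_qp(pd, &attr);
}

/* Usage: a reliable-connected channel and an unreliable datagram channel
 * on the same HCA, e.g.
 *   struct ibv_qp *rc = make_qp(pd, cq, IBV_QPT_RC);
 *   struct ibv_qp *ud = make_qp(pd, cq, IBV_QPT_UD);
 */
```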


InfiniBand communication

[Diagram: consumers communicate over InfiniBand through the consumer channel interface (queues); the HCA implements the network (fabric) interface.]


Consumer Queue Model

• Asynchronous execution
• In-order execution on a queue
• Flexible completion reporting (see the verbs sketch below)

Host Channel Adapter (HCA)

• Consumers connected via queues
  – Local or remote node
• 16M independent queues
  – 16M IO channels
  – 16M QoS levels (transport, priority)
• Memory access through virtual addresses
  – Remote and local
  – 2G address spaces, 64 bits each
  – Access rights and isolation enforced by HW

[Diagram: the HCA sits between the InfiniBand channel interface (consumer side, attached over PCI-Express) and the InfiniBand ports.]
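To make the asynchronous queue model concrete, here is a minimal, hypothetical sketch using libibverbs: work is posted to a send queue and the consumer later harvests a completion from the completion queue. It assumes a QP that has already been connected (the INIT/RTR/RTS transitions are omitted) and a memory region `mr` that covers `buf`.

```c
/* Hypothetical sketch of the asynchronous queue model with libibverbs:
 * post a send work request, then poll the completion queue.
 * Assumes qp is already connected and mr covers buf. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,                 /* echoed back in the completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED, /* ask for a completion report */
    };
    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad))    /* enqueue; returns immediately */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) /* execution is asynchronous */
        ;                                /* busy-poll for the completion */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```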


InfiniBand Host Channel Adapter

• HCA configuration via Command Queue
  – Initialization
  – Run-time resource assignment and setup
• HCA resources (queues) allocated to applications
  – Resource protection through the User Access Region (UAR)
• IO access through HCA QPs ("IO channels")
  – QP properties match IO requirements
  – Cross-QP resource isolation
• Memory protection via Protection Domains (see the sketch below)
  – Many-to-one association
    • Address space to Protection Domain
    • QP to Protection Domain
  – Memory access using a Key and a virtual address
    • Boundary and access-right validation
    • Protection Domain validation
    • Virtual-to-physical (HW) address translation
• Interrupt delivery – Event Queues

[Diagram: non-virtualized host – applications in userland and the driver in the kernel each own work queues (WQ) and completion queues (CQ) on the HCA; up to 16M work queues.]
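A hypothetical verbs-level illustration of the Protection Domain and Key machinery described above: memory is registered within a PD, the registration yields a local key (lkey) and a remote key (rkey), and an RDMA write names the peer's rkey and virtual address; the HCA validates boundaries, rights and PD membership in hardware. Function names and parameters here are placeholders.

```c
/* Hypothetical sketch: Protection Domain, memory registration and an
 * RDMA-write work request with libibverbs. Connection setup, remote
 * address/rkey exchange and error handling are omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

struct ibv_mr *expose_buffer(struct ibv_pd *pd, void *buf, size_t len)
{
    /* Register buf within the PD; the HCA builds its translation/protection
     * table entry and hands back lkey (local use) and rkey (remote use). */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}

int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
               void *local_buf, size_t len,
               uint64_t remote_vaddr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)local_buf, .length = (uint32_t)len,
        .lkey = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
    };
    /* The remote buffer is addressed by (virtual address, rkey); the remote
     * HCA checks bounds, access rights and Protection Domain membership. */
    wr.wr.rdma.remote_addr = remote_vaddr;
    wr.wr.rdma.rkey        = remote_rkey;

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}
```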


InfiniBand Host Channel Adapter in a virtualized server

• HCA model as on the previous slide (command queue, QP-based IO channels, Protection Domains, Event Queues)
• HCA initialization by the VMM
  – Assign a command queue per guest domain
  – HCA resources partitioned and exported to guest OSes
• HCA resources allocated to guests and their apps
  – Resource protection through the UAR
• Each VM has direct IO access
  – "Hypervisor offload"
• Memory protection via Protection Domains
  – The address-translation step generates the HW address
    • Guest Physical Address to HW address translation
    • Access-right validation

[Diagram: Domain0 and guest domains DomainX, DomainY, DomainZ, each with its own kernel, applications and work/completion queues on the shared HCA; up to 16M work queues.]


InfiniBand Host Channel Adapter in a virtualized server (cont.)

• HCA model and VMM initialization as on the previous slides
• Guest driver manages HCA resources at run-time
  – Each OS sees "its own HCA"
  – HCA HW keeps the guest OS honest
  – Connection manager – see later

[Diagram: as before, but each guest domain (DomainX, DomainY, DomainZ) now runs its own driver with its own command/completion queues on the shared HCA; up to 128 CCQ, up to 16M work queues.]


Address translation and protection

Non-virtual server
• HCA TPT set by the driver
  – Boundaries, access rights
  – vir2phys table
• Run-time address translation
  – Access-right validation
  – Translation-table walk

Virtual server
• VMM sets the guest HW address tables
  – Address space per guest domain
  – Managed and updated by the VMM
• Guest driver sets the HCA TPT
  – Guest PA in the vir2phys table
• Run-time address translation (modeled in the sketch below)
  1. Virtual address to Guest Physical Address
  2. Guest Physical Address to HW address

[Diagram: non-virtual server – MKey + virtual address is looked up in the MKey table and walked through the translation tables to a HW physical address (one step). Virtual server – the application's MKey entry first yields the VM's guest-physical address, then a second MKey/translation-table walk yields the HW physical address (two steps).]
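The two-stage lookup can be modeled in a few lines of C. This is purely a conceptual sketch of what the HCA does in hardware; the table layouts, field names and the linear stage-2 mapping are invented for illustration and are not Mellanox data structures.

```c
/* Conceptual model (invented names and layouts, not Mellanox structures)
 * of the HCA's two-stage translation in a virtual server:
 *   consumer virtual address -> guest physical address -> HW physical address
 * with access-right and Protection Domain checks along the way. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define RIGHT_READ  1u
#define RIGHT_WRITE 2u

struct mkey_entry {           /* one row of a hypothetical MKey/TPT table */
    uint64_t base_va, length; /* boundaries of the registered region      */
    unsigned rights, pd;      /* access rights, owning Protection Domain  */
    uint64_t guest_pa_base;   /* vir2phys: guest PA of base_va (step 1)   */
};

/* Stage 2: per-domain guest-physical to HW-physical mapping, owned by the
 * VMM in the real design (here a trivial linear offset for illustration). */
static const uint64_t hw_base_of_domain[] = { 0x0, 0x40000000, 0x80000000 };

static bool translate(unsigned domain, unsigned pd, uint64_t va, unsigned need,
                      const struct mkey_entry *e, uint64_t *hw_pa)
{
    if (e->pd != pd)                                     return false; /* PD check   */
    if (va < e->base_va || va >= e->base_va + e->length) return false; /* boundaries */
    if ((e->rights & need) != need)                      return false; /* rights     */
    uint64_t gpa = e->guest_pa_base + (va - e->base_va); /* step 1: guest driver's table */
    *hw_pa = hw_base_of_domain[domain] + gpa;            /* step 2: VMM's table          */
    return true;
}

int main(void)
{
    struct mkey_entry e = { .base_va = 0x10000, .length = 0x1000,
                            .rights = RIGHT_READ | RIGHT_WRITE,
                            .pd = 7, .guest_pa_base = 0x2000 };
    uint64_t hw;
    if (translate(1, 7, 0x10040, RIGHT_WRITE, &e, &hw))
        printf("HW physical address: 0x%llx\n", (unsigned long long)hw);
    return 0;
}
```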


IO virtualization with InfiniBand – single node, local IO

• Full offload for local cross-domain access
  – Eliminates the Hypervisor kernel transition on the data path
    • Reduces cross-domain access latency
    • Reduces CPU utilization
  – Kernel bypass on IO access for guest applications
• Shared [local] IO
  – Shared by guest domains

[Diagram: left – the SW path: the Hypervisor's virtual switch(es) bridge the IO drivers of Domain0, DomainX and DomainY to the IO device through Domain0. Right – the InfiniBand path: the same domains attach their IO drivers directly to HW switch(es) in the HCA ("IO Hypervisor off-load").]


IO virtualization with InfiniBand – multiple nodes (cluster), network-resident IO

• SW-transparent remote IO access
• No Hypervisor kernel transition
• Kernel bypass for guest apps
• Shared [remote] IO
  – Shared by domains

[Diagram: left – a single virtualized server where the Hypervisor's virtual switch(es) and Domain0 bridge guest IO drivers to local IO. Right – multiple virtualized servers whose guest domains (DomainX1/Y1, DomainX2/Y2) reach a shared, network-resident IO bridge through their HCAs' HW switch(es) and an InfiniBand switch: "IO Hypervisor off-load" plus "SW-transparent scale-out IO sharing".]


Network – IP over IB

• IP over Ethernet
  – SW channel for each domain
    • Virtual NIC in the domain
    • Switch in SW (copy, VLANs)
  – Hypervisor call
    • Kernel transition
  – NIC driver in domain0
    • External L2 bridge
• IP over IB
  – HW channel for each domain
    • Virtual NIC in the domain
    • Switch in HW (VLANs, data move)
  – Direct HW access from the guest domain
    • No Hypervisor transition
  – IP over IB in domain0
    • Bypasses the L2 bridge

[Diagram: left – Ethernet path: per-domain N/W drivers feed the Hypervisor's virtual switch(es); Domain0 runs the NIC driver and bridge in front of the physical NIC. Right – InfiniBand path: each domain runs IPoIB directly on HW switch(es) in the HCA; Domain0's NIC driver/bridge is needed only for external Ethernet traffic.]


Network – sockets

• Sockets over Ethernet
  – TCP/IP stack in the guest domain
  – SW L2 channel for the guest domain
    • Virtual NIC in the domain
    • Switch in SW (copy, VLANs)
  – Hypervisor call
    • Kernel transition
  – NIC driver in domain0
• Sockets over InfiniBand (SDP)
  – HW L4 channel for the guest domain (see the sketch below)
    • Socket QP(s) per domain
    • Transport and switch in HW (copy, VLANs)
  – Direct HW access from the guest domain
    • No Hypervisor transition
  – Full bypass of domain0

[Diagram: left – Ethernet path with TCP/IP and N/W drivers per domain, virtual switch(es) in the Hypervisor, NIC driver and bridge in Domain0. Right – SDP in each domain running directly on HW switch(es) in the HCA; Domain0's NIC driver/bridge is used only to reach the external NIC.]
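The point of SDP is that ordinary socket code does not change; the byte stream is carried over an InfiniBand reliable connection instead of TCP/IP. The client below is plain BSD sockets with nothing SDP-specific in it; whether it runs over TCP or over SDP is decided by how the socket is switched underneath (for example, an LD_PRELOAD shim such as the libsdp library that shipped with later OpenIB/OFED releases). The address and port are placeholders.

```c
/* Plain BSD-sockets client: unchanged application code. Under SDP the same
 * byte stream maps onto an InfiniBand reliable connection in HW, so the
 * guest takes no Hypervisor transition on the data path. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);         /* stream socket      */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(5000) };
    inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr); /* example address    */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        perror("connect");
        return 1;
    }
    const char msg[] = "hello over a socket";
    send(fd, msg, sizeof msg, 0);                     /* same API over SDP  */

    char reply[128];
    ssize_t n = recv(fd, reply, sizeof reply, 0);
    if (n > 0)
        printf("got %zd bytes back\n", n);
    close(fd);
    return 0;
}
```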


Storage

• Virtualized disk access – SW path
  – vSCSI driver in the guest domain
  – SCSI "switch" in the Hypervisor
    • Switch in SW (copy, isolation)
    • Hypervisor call
      – Kernel transition
  – Disk driver in domain0
    • HBA for SAN
• Virtualized disk access – InfiniBand path
  – SRP initiator per guest domain
  – SCSI "switch" in the HCA
    • Transport and switch in HW (copy, isolation)
  – Direct HW access from the guest domain
    • No Hypervisor transition
  – Disk driver in domain0
    • Bypass domain0 for SAN

[Diagram: left – vSCSI in DomainX and DomainY through the Hypervisor's virtual switch(es) to the vSCSI back end, SRP target and adapter via Domain0. Right – an SRP initiator in each guest domain through HW switch(es) in the HCA straight to the SRP target/adapter.]


MPI applications
[MPI as an example of user-mode access to the network]

• MPI over TCP/IP?
  – Data-path performance hit
    • Two kernel transitions on the performance path
  – Forget about low latency
• MPI driver in the guest app (see the sketch below)
  – No data-path performance hit
    • Direct access to HCA HW
    • Full guest-kernel bypass
    • Full Hypervisor bypass
  – Event delivery directly to the guest OS
    • Retains control-path performance
  – Memory registration needs attention
    • Registration cache

[Diagram: left – MPI over TCP/IP: MPI and TCP/IP in each guest, N/W drivers into the Hypervisor's virtual switch(es), NIC driver/bridge in Domain0, external NIC. Right – MPI in each guest domain talking directly to HW switch(es) in the HCA.]
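A minimal MPI ping-pong, as a stand-in for the user-mode access pattern above: when the MPI library is built over InfiniBand verbs, MPI_Send/MPI_Recv post work directly to the HCA from user space, with no guest-kernel or Hypervisor transition on the data path. This is generic MPI code, not taken from the deck.

```c
/* Minimal MPI ping-pong between ranks 0 and 1. With an InfiniBand-based MPI,
 * the send/recv calls map to user-space verbs postings on the HCA. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[64];
    if (rank == 0) {
        strcpy(buf, "ping");
        MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 got '%s' back\n", buf);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        strcpy(buf, "pong");
        MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```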


Plans

• Stage 1, 2 – driver update, June '05
  – Hypervisor bypass for data-path operation
• Stage 3 – FW and driver update, Aug '05
  – Full HCA export to the guest domain

[Diagram: virtualized server with Domain0 and guest domains DomainX, DomainY, DomainZ sharing the HCA (up to 128 CCQ, up to 16M work queues); labels mark which pieces arrive in stages 1-2 versus stage 3, with each guest domain running its own driver and command/completion queues by stage 3.]


Summary

• The InfiniBand HCA is a Hypervisor offload engine
• InfiniBand enables efficient server virtualization
  – Cross-domain isolation
  – Efficient IO sharing
  – Protection enforcement
• Existing HW fully supports virtualization
  – The most cost-effective path for single-node virtual servers
  – SW-transparent scale-out
• VMM support in the OpenIB SW stack by fall '05
  – Alpha version of FW and driver in June