Realizing the Performance Potential of the Virtual Interface Architecture

Evan Speight, Hazim Abdel-Shafi, and John K. Bennett

Rice University, Dept. of Electrical and Computer Engineering

Presented by Constantin Serban, R.U.

VIA Goals

• Communication infrastructure for System Area Networks (SANs)

• Targets mainly high speed cluster applications

• Efficiently harnesses the communication performance of underlying networks

Trends

• Peak bandwidth has increased by two orders of magnitude over the past decade, while user-level latency has decreased only modestly.

• The latency introduced by the protocol is typically several times the latency of the transport layer.

• The problem becomes acute especially for small messages
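The effect on small messages can be illustrated with a simple cost model: delivery time is a fixed per-message latency plus the time to push the bytes through the link. The sketch below is only an illustration; the function names and the example figures (100 µs latency, 100 MB/s link) are assumptions chosen for the arithmetic, not measurements from the paper.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_mb_s):
    """Time to deliver one message: fixed latency plus serialization time."""
    # 1 MB/s equals 1 byte per microsecond, so size / bandwidth is in microseconds.
    return latency_us + size_bytes / bandwidth_mb_s


def effective_bandwidth_mb_s(size_bytes, latency_us, bandwidth_mb_s):
    """Bandwidth actually seen by the application for one message of this size."""
    return size_bytes / transfer_time_us(size_bytes, latency_us, bandwidth_mb_s)


# Hypothetical link: 100 us per-message latency, 100 MB/s peak bandwidth.
for size in (64, 1024, 65536, 1_000_000):
    eff = effective_bandwidth_mb_s(size, latency_us=100.0, bandwidth_mb_s=100.0)
    print(f"{size:>9} bytes: {eff:8.2f} MB/s effective")
```

Under these assumed numbers a 64-byte message sees well under 1 MB/s, while a 1 MB message gets close to the peak: the fixed latency term dominates small transfers.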

Targets

VI architecture addresses the following issues:

• Decrease the latency especially for small messages (used in synchronization)

• Increase the aggregate bandwidth (only a fraction of the peak bandwidth is utilized)

• Reduce the CPU processing due to the message overhead

Overhead

Overhead mainly comes from two sources:

• Every network access requires one or two traps into the kernel; the user/kernel mode switch is time consuming

• Usually two data copies occur:

– From the user buffer to the message-passing API

– From the message layer to the kernel buffer

VIA approach

• Remove the kernel from the critical path

– Move communication code out of the kernel into user space

• Provide a zero-copy protocol

– Data is sent from and received into the user buffer directly; no message copy is performed

VIA emerged as a standardization effort from Compaq, Intel, and Microsoft

It was built on several academic ideas:

• The main architecture is most similar to U-Net

• Essential features are derived from VMMC

Among current implementations:

– GigaNet cLan: VIA implemented in hardware

– Tandem ServerNet: VIA emulated in a software driver

– Myricom Myrinet: VIA emulated in software in the firmware

VIA architecture

VIA operations

• Set-Up/Tear-Down

• Register/De-Register Memory

• Connect/Disconnect

• Transmit

• Receive

• RDMA

VIA operations - cont’d

Set-Up/Tear-Down:

• VIA is a point-to-point, connection-oriented protocol

• VI endpoint: the core concept in VIA

• The VipCreateVi function creates a VI endpoint in user space

• The user-level library passes the call to the kernel agent, which passes the creation information to the NIC

• The OS thus controls application access to the NIC

VIA operations - cont’d

Register/De-Register Memory:

• All data buffers and descriptors reside in registered memory

• The NIC performs DMA I/O operations in this registered memory

• Registration pins the pages in physical memory, provides a handle for manipulating the pages, and transfers their addresses to the NIC

• It is performed once, usually at the beginning of the communication session
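The registration step can be pictured as a table that maps handles to pinned page ranges, which is then consulted on every transfer. The sketch below models only that bookkeeping in plain Python; `RegisteredMemory`, its methods, and the 4096-byte page size are hypothetical names and values for illustration and are not the VIA API (real registration pins physical pages through the kernel agent).

```python
import itertools

PAGE_SIZE = 4096  # assumed page size, for illustration only


class RegisteredMemory:
    """Toy registration table: handle -> (first page, last page)."""

    def __init__(self):
        self._next_handle = itertools.count(1)
        self._regions = {}

    def register(self, start, length):
        """Record the page range covering [start, start + length) and return a handle."""
        if length <= 0:
            raise ValueError("length must be positive")
        first_page = start // PAGE_SIZE
        last_page = (start + length - 1) // PAGE_SIZE
        handle = next(self._next_handle)
        self._regions[handle] = (first_page, last_page)
        return handle

    def deregister(self, handle):
        """Forget the region; its pages would no longer be pinned."""
        del self._regions[handle]

    def is_valid(self, handle, start, length):
        """Check that [start, start + length) lies inside the registered pages."""
        if handle not in self._regions or length <= 0:
            return False
        first_page, last_page = self._regions[handle]
        return (first_page <= start // PAGE_SIZE
                and (start + length - 1) // PAGE_SIZE <= last_page)

    def pinned_pages(self):
        """Total pages recorded as pinned (overlapping regions are counted twice)."""
        return sum(last - first + 1 for first, last in self._regions.values())
```

Because registration works at page granularity, even a small buffer that straddles a page boundary pins two full pages in this model.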

VIA operations - cont’d

Connect/Disconnect:

• Before communication, each endpoint is connected to a remote endpoint

• The connection is passed to the kernel agent and down to the NIC

• VIA does not define any addressing scheme; existing schemes can be used in various implementations

VIA operations - cont’d

Transmit/receive:

• The sender builds a descriptor for the message to be sent. The descriptor points to the actual data buffer. Both the descriptor and the data buffer reside in a registered memory area.

• The application then posts a doorbell to signal the availability of the descriptor. The doorbell contains the address of the descriptor.

• The doorbells are maintained in an internal queue inside the NIC

VIA operations - cont’d

Transmit/receive (cont’d):

• Meanwhile, the receiver creates a descriptor that points to an empty data buffer and posts a doorbell in the receiver NIC queue

• When the doorbell in the sender queue reaches the head of the queue, the data is located through a double indirection (doorbell to descriptor, descriptor to buffer) and sent into the network.

• The first doorbell/descriptor is picked up from the receiver queue and its buffer is filled with the data
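The transmit/receive steps can be sketched as a small simulation: descriptors live in a table standing in for registered memory, doorbells are queued descriptor addresses, and a transfer follows the doorbell to the descriptor and the descriptor to the buffer. All names here (`Descriptor`, `Nic`, `transfer`) and the address values are hypothetical; this is a toy model of the flow, not the VIA API or NIC behavior. In particular, returning `False` when no receive descriptor is posted is a simplification.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer: bytearray      # data buffer the descriptor points to
    length: int = 0        # number of valid bytes in the buffer
    done: bool = False     # set when the NIC has completed the descriptor


class Nic:
    """Toy NIC: a descriptor table plus send and receive doorbell queues."""

    def __init__(self):
        self.descriptors = {}          # address -> Descriptor (stands in for registered memory)
        self.send_doorbells = deque()  # queued descriptor addresses to send
        self.recv_doorbells = deque()  # queued descriptor addresses to fill
        self._next_addr = 0x1000

    def place(self, desc):
        """Store a descriptor and return its (made-up) address."""
        addr = self._next_addr
        self._next_addr += 0x40
        self.descriptors[addr] = desc
        return addr

    def post_send(self, addr):
        self.send_doorbells.append(addr)

    def post_recv(self, addr):
        self.recv_doorbells.append(addr)


def transfer(sender, receiver):
    """Move one message from the head send descriptor to the head receive descriptor."""
    if not sender.send_doorbells or not receiver.recv_doorbells:
        return False  # simplification: nothing happens until both sides have posted
    # Double indirection: doorbell -> descriptor -> data buffer.
    send_desc = sender.descriptors[sender.send_doorbells.popleft()]
    recv_desc = receiver.descriptors[receiver.recv_doorbells.popleft()]
    n = send_desc.length
    if n > len(recv_desc.buffer):
        raise ValueError("receive buffer too small")
    recv_desc.buffer[:n] = send_desc.buffer[:n]
    recv_desc.length = n
    send_desc.done = recv_desc.done = True
    return True
```

A typical use posts the receive descriptor first, then the send descriptor, and then calls `transfer(sender, receiver)`; the receiver reads `buffer[:length]` once `done` is set.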

VIA operations - cont’d

RDMA:

• As a mechanism derived from VMMC, VIA allows Remote DMA operations: RDMA Read and Write

• Each node allocates a receive buffer and registers it with the NIC. Additional structures that contain read and write pointers to the receive buffers are exchanged during connection setup

• Each node can read and write to the remote node address directly.

• These operations pose potential implementation problems.
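The RDMA idea can be illustrated with two in-process "nodes" that exchange views of each other's receive buffer at connection setup and then read or write the remote buffer without involving the peer. This is a minimal sketch under stated assumptions: `Node`, `connect`, `rdma_write`, and `rdma_read` are hypothetical names, and a Python `memoryview` stands in for the exchanged remote pointers.

```python
class Node:
    def __init__(self, name, size):
        self.name = name
        self.memory = bytearray(size)  # stands in for the registered receive buffer
        self.remote = None             # view of the peer's buffer, set by connect()


def connect(a, b):
    """Connection setup: each side learns where the other's receive buffer is."""
    a.remote = memoryview(b.memory)
    b.remote = memoryview(a.memory)


def _check(node, offset, length):
    if node.remote is None:
        raise RuntimeError("not connected")
    if offset < 0 or length < 0 or offset + length > len(node.remote):
        raise ValueError("access outside the remote registered buffer")


def rdma_write(node, offset, data):
    """Write data directly into the peer's buffer at the given offset."""
    _check(node, offset, len(data))
    node.remote[offset:offset + len(data)] = data


def rdma_read(node, offset, length):
    """Read bytes directly from the peer's buffer."""
    _check(node, offset, length)
    return bytes(node.remote[offset:offset + length])
```

Note that the peer's code never runs during `rdma_write` or `rdma_read`; in this model, as in the slide's description, the remote side is passive and only its memory is touched.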

Evaluation Benchmarks

• Two VI implementations:

– GigaNet cLan: bandwidth 125 MB/s, latency 480 ns

– Tandem ServerNet: bandwidth 50 MB/s, latency 300 ns

• Performance measured:

– Bandwidth and latency

– Polling vs. blocking

– CPU utilization

Bandwidth

Latency

Latency Polling/Blocking

CPU utilization

MPI performance using VIA

• The challenge is to deliver this performance to distributed applications

• Software layers such as MPI are commonly used between VIA and the application: they provide increased usability but bring additional overhead

• How can this layer be optimized so that it uses VIA efficiently?

MPI VIA - performance

MPI observations

• The difference between MPI-UDP and MPI-VIA-baseline is remarkable

• MPI-VIA-baseline is dramatically far from VIA-Native

• Several improvements are proposed to bring MPI-VIA closer to VIA-Native by reducing MPI overhead

MPI Improvements

• Eliminating unnecessary copies: the UDP and VIA versions of MPI use a single set of receive buffers, so data must be copied to the application. The fix is to allow the user to register any buffer

• Choosing a synchronization primitive: all synchronization formerly used OS constructs/events. A better implementation uses the processor's swap instructions

• No acknowledgment: remove the message acknowledgment by switching to a reliable VIA mode

VIA - Disadvantages

• Polling vs. blocking synchronization – a tradeoff between CPU consumption and overhead

• Memory registration: locking large amounts of memory makes virtual memory mechanisms inefficient. Registering/deregistering on the fly is slow

• Point-to-point vs. multicast: VIA lacks multicast primitives. Implementing multicast on top of the existing point-to-point mechanism makes communication inefficient
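The polling vs. blocking tradeoff from the first bullet can be made concrete with a toy cost model. Everything in it is an assumption for illustration: the function name, the 1 µs poll interval, and the 20 µs wake-up cost are made-up parameters, not figures from the paper. Polling notices the message at the next poll but keeps the CPU busy for the whole wait; blocking frees the CPU but adds a wake-up delay.

```python
import math


def wait_cost(arrival_us, mode, poll_interval_us=1.0, wakeup_us=20.0):
    """Return (observed latency, CPU time spent waiting), both in microseconds."""
    if mode == "poll":
        # The CPU spins until the first poll at or after the arrival time.
        polls = math.ceil(arrival_us / poll_interval_us)
        observed = polls * poll_interval_us
        return observed, observed
    if mode == "block":
        # The thread sleeps; the only CPU cost modeled is the wake-up path.
        return arrival_us + wakeup_us, wakeup_us
    raise ValueError("mode must be 'poll' or 'block'")


# Hypothetical message that arrives 50 us after the wait begins.
print(wait_cost(50.0, "poll"))
print(wait_cost(50.0, "block"))
```

With these assumed parameters, polling sees the message sooner but burns CPU for the entire wait, while blocking adds latency and spends far less CPU; which one wins depends on how long the wait is relative to the wake-up cost.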

Conclusion

• Low latency for small messages, which have a strong impact on application behavior

• Significant improvement over UDP communication (does this still hold after recent hardware TCP/UDP implementations?)

• At the expense of an uncomfortable API
