Kernel in the Way: Bypass and Offload Technologies
End User Summit 2012, New York
Christoph Lameter <[email protected]>
Introduction
● Example of a socket-based send vs. a bypass send
● Why?
● How?
● What are bypass and offload?
● Issues in storage
● Potential solutions
[Diagram: "Sending a message via the sockets API". Application → C library (fwrite(data, size, 1, FILE);) → system call with a context switch (write(fd, data, size);) → kernel socket buffers and stream buffer data structures → driver (dev_queue_xmit(skb);) → TX ring; the device maps the kernel's buffers.]
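A minimal sketch of this conventional path in C (standard POSIX calls; the port number is a placeholder): every byte handed to write() is copied into kernel socket buffers before the driver queues it for the device.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int send_message(const char *host, const void *data, size_t size)
    {
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(7000) }; /* placeholder port */
        inet_pton(AF_INET, host, &addr.sin_addr);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;

        /* write() copies 'data' into kernel socket buffers; the device
           never sees the user buffer directly. */
        ssize_t n = write(fd, data, size);
        close(fd);
        return n == (ssize_t)size ? 0 : -1;
    }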
[Diagram: "Bypass: Device maps user space buffers". Application → verbs library → TX ring in user space; the device maps user space buffers directly. Verbs system calls reach the IB driver API only for device control and metadata structures, not on the data path, so no context switch is needed per message.]
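By contrast, a minimal sketch of the bypass send using the verbs API (libibverbs). It assumes a queue pair qp that is already connected and a memory region mr registered over buf with ibv_reg_mr(); the setup code, omitted here, is substantial.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Post 'len' bytes from a registered user buffer directly to the NIC. */
    int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,   /* user space address, mapped by the device */
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry */
        };
        struct ibv_send_wr *bad;

        /* No data copy: the request goes onto a TX ring that the device
           reads from user-mapped memory. */
        return ibv_post_send(qp, &wr, &bad);
    }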
Rsockets by Sean Hefty (Intel)
● Presented at the Open Fabrics Developer meeting at the end of March.
● Socket emulation layer in user space.
● LD_PRELOAD library to override socket system calls (sketch after the charts below).
● Performance comparison shows that kernel processing is detrimental to performance. Bypass is essential.
● IPoIB = kernel sockets layer using IP emulation on Infiniband.
● SDP = kernel sockets layer using an Infiniband native connection.
● IB = native Infiniband connection, user space → user space.
[Chart: Bandwidth (Gbps, 0-30) vs. message size (64 bytes to 1 MB) for IPoIB, SDP, RSOCKET and IB.]
[Chart: 64-byte ping-pong latency (us, 0-10) for IPoIB, SDP, RSOCKET and IB.]
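The rsockets calls mirror the BSD socket calls one for one (rsocket, rconnect, rsend, rclose in librdmacm), which is what makes the LD_PRELOAD override possible. A minimal sketch, assuming an already-resolved addr; error handling is abbreviated:

    #include <rdma/rsocket.h>   /* librdmacm's rsockets */
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Same shape as the BSD socket code, but the data travels over RDMA. */
    int rsend_message(struct sockaddr_in *addr, const void *data, size_t size)
    {
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (rconnect(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0) {
            rclose(fd);
            return -1;
        }
        ssize_t n = rsend(fd, data, size, 0);  /* drop-in for send() */
        rclose(fd);
        return n == (ssize_t)size ? 0 : -1;
    }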
State of the Art NIC characteristics
● 56 Gigabits per second (the Unix network stack was designed for 10 Mbits).
● 4-7 Gigabytes per second (Unix: 1 MB/s).
● >8 million packets per second (Unix: ~1000 packets per second).
● Less than a microsecond per packet for processing.
● Windows 8 will fully support these speeds through offload/bypass APIs.
Bypass is
● Not using a kernel subsystem although it is there and could provide similar services.
● Implementing in user space what a kernel subsystem does.
● Direct access to and control of memory mapped devices from user space.
● Some people call zero copy a form of bypass since it avoids the kernel's copy pass.
Offload is
● Replacement of what could be done in software with dedicated hardware.
● Overlaps with bypass, because direct device interaction replaces software action in the kernel with the actions of a hardware device.
● Typical cases of hardware offload: DMA engines, GPUs, screen rendering, cryptography, TCP (TOE), FPGAs.
Why bypass the kernel?
● Kernel is too slow and inefficient at high packet rates. Problems already begin at 10G.
● Contemporary devices can map user space memory and perform transfers directly to and from user space.
● Kernel must copy data between kernel buffers and userspace.
● Kernel is continually regressing in terms of the overhead of basic system calls and operations. Only new hardware compensates.
10G woes
● Packet rate too high to receive all packets on one processor.
● All sorts of optimizations: interrupt mitigation, multiqueue support, receive flow steering, etc.
● Compromises that do not allow full use of the capabilities of the hardware.
● 10G technology goes beyond the boundary of what socket based technology can cleanly support.
● 40G and higher speed technologies are available and introduce different APIs that allow effective operations.
CPU problems
● The processing capacity of an individual hardware thread is mostly stagnating. No major improvements in upcoming processor generations.
● I/O bandwidth, memory capacity and bandwidth are growing fast.
● Multithreaded handling of I/O requires synchronization, which creates additional overhead that in turn limits processing capability.
Bypass technologies
● Kernel: zero copy (using sendfile(), for example; sketch below).
● Kernel: Sockets Direct Protocol (SDP)
● Kernel: RDMA or “Verbs”
● Linux: u/kDAPL
● Solarflare: OpenOnload
● Myricom: DBL
● Numerous user space implementations, mostly based on the kernel RDMA APIs.
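A minimal sketch of the first item, kernel zero copy with sendfile() (sock_fd and file_fd are assumed to be an already-connected socket and an open file): the file pages move from the page cache to the wire without being copied through user space.

    #include <sys/sendfile.h>
    #include <sys/stat.h>

    /* Hand the whole file to the socket without a user space buffer. */
    int send_file(int sock_fd, int file_fd)
    {
        struct stat st;
        if (fstat(file_fd, &st) < 0)
            return -1;

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
            if (n <= 0)
                return -1;  /* sendfile advances 'offset' on success */
        }
        return 0;
    }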
Expanding role of Kernel RDMA Verbs API
● Initially only Infiniband packets were supported.
● Chelsio added iWarp via the verbs API to get around TOE restrictions in the Linux network stack. Recently added UDP support via verbs.
● Mellanox added ROCEE to allow Infiniband RDMA type traffic on Ethernet networks.
● Mellanox added RAW_ETH_QP to allow another form of kernel bypass for Ethernet devices.
● Manufacturers keep coming up with new devices for RDMA verbs.
Trouble with offload APIs
● Complexity.
● Manage RX/TX rings and timing issues in user space.
● Difficult to code for.
● 4K page size is still an issue.
● No unified API.
● Proprietary vendor solutions.
“Stateless Offload”
● An artificial name created to refer to the ability of the network device to transfer a limited number of packets in a single system call. Mostly used for TCP.
● Does reduce kernel overhead but does not copy directly into user space.
● No use of RDMA
● For the purposes of this talk, this is not offload as understood here: by offload we mean bypassing the kernel in critical paths, whereas this only batches packets.
Bypass API characteristics
● Direct operation on user space memory, bypassing kernel buffers.
● The kernel manages the connection, but its involvement in actual read and write operations is minimal.
● I/O is asynchronous.
● Problem: notification of newly available packets or completed writes (see the sketch below).
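One common answer to the notification problem, sketched here with libibverbs (ch and cq are assumed to have been created with ibv_create_comp_channel() and ibv_create_cq()): arm the completion queue and sleep on its event channel, or busy-poll ibv_poll_cq() when latency matters more than CPU.

    #include <infiniband/verbs.h>

    /* Wait for one completion on 'cq' via the event channel 'ch'. */
    int wait_one_completion(struct ibv_comp_channel *ch, struct ibv_cq *cq)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;

        if (ibv_req_notify_cq(cq, 0))              /* arm: wake on next completion */
            return -1;
        if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx)) /* blocks until the device signals */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);

        /* Drain the completion that woke us. */
        return ibv_poll_cq(cq, 1, &wc) == 1 && wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }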
Storage
● Similar issues are particularly evident for remote filesystem clients, since they have traditionally used socket-based APIs or in-kernel APIs for network connectivity.
● Paper: “Sockets vs RDMA Interface over 10-Gigabit Networks”.
● Storage issues are not as severe for local disk access, since APIs already exist that do a kind of bypass: direct I/O and memory-mapped I/O (sketch below). Infrastructure for control is missing, though.
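A minimal sketch of the direct I/O bypass mentioned above (standard open(2) with O_DIRECT; the 4096-byte alignment is a common safe assumption, the real requirement depends on the device):

    #define _GNU_SOURCE         /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read 'size' bytes from 'path' bypassing the page cache. O_DIRECT
       requires the buffer, offset and size to be block aligned. */
    void *direct_read(const char *path, size_t size)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, size))
            return NULL;

        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0 || read(fd, buf, size) != (ssize_t)size) {
            if (fd >= 0)
                close(fd);
            free(buf);
            return NULL;
        }
        close(fd);
        return buf;  /* caller frees */
    }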
Storage Challenges
● High-end SSDs max out PCI-E speeds.
● >1 million IOPS.
● 6.5 GBytes/sec.
● Linux uses 4K pages => the kernel has to manage (repeatedly updating!) more than 1 million page descriptors per second.
● The design target was 500 IOPS and tens or maybe hundreds of MBytes per second.
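A quick back-of-the-envelope check of those numbers: at 6.5 GBytes/sec in 4K pages, 6.5 × 10^9 / 4096 ≈ 1.6 million page descriptors per second, more than three orders of magnitude beyond the ~500 IOPS the design assumed.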
PCI-E “Bypass”
● I/O bus hardware is insufficient for contemporary state-of-the-art devices.
● 40Gb networks cannot be handled by current Westmere processors (there is a limit of 32Gb/s for 8 lanes).
● 56Gb/s transfer rates can barely be handled by the upcoming generation (Sandybridge).
● I/O devices with much higher transfer rates are on the horizon.
● PCI-E bypass is being explored by various companies.
● Hardware development is not keeping up
Linux Memory Management
● 4K page size: the kernel handles memory in 4K chunks (yes, 2M huge pages are available, but they have serious limitations).
● A standard 64G server has 16 million page structs to manage. Memory reclaim can become adventurous.
● With terabyte servers on the horizon, we are looking at billions of 4K chunks to manage.
● If I just want to transfer a 4G file from disk, the hardware and the memory management system need to handle 1 million 4K pages.
● 4G can be transferred in ~1 second on state-of-the-art networks (but the kernel overhead will ensure that such a transfer takes much longer).
Storage system bypass by Network Filesystems
● Network filesystems intercept I/O requests in user space and forward them to the server, bypassing the OS on the client.
● Lustre, Gluster, GPFS.
● Special protocols for block offload (SRP, iSER).
● Network performance is much higher than local disk performance(!).
● The situation may change with new SSD/Flash technologies, but those are very pricey.
Approaches to fix things
● Memory management needs to handle larger chunks of memory (2M, 1G page sizes). Large physically contiguous chunks of memory need to be managed (see the sketch after this list).
● A processor should not touch the data transferred (zero copy requirement). This implies that POSIX style socket I/O cannot be used. There is a need for new APIs.
● I/O subsystems need to be able to handle large chunks of memory and avoid creating large scatter gather lists. Devices struggle to support those.
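A minimal sketch of the first point, using the existing MAP_HUGETLB interface (assumes the administrator has reserved huge pages, e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #define BUF_SIZE (512UL * 2 * 1024 * 1024)  /* 1 GB as 512 x 2 MB huge pages */

    /* Map a buffer backed by 2 MB huge pages: one page descriptor per
       2 MB instead of one per 4 KB. */
    int main(void)
    {
        void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        /* 1 GB now costs 512 descriptors instead of 262144. */
        munmap(buf, BUF_SIZE);
        return 0;
    }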
A New Standard RDMA system call API
● Socket based.
● Reuse as much of the POSIX functionality as possible.
● Buffer management in user space.
● Raw hardware interface elements.
● There are a couple of academic projects in this area, but none of them will be viable without kernel support.
Conclusion / Q&A
● We are heading toward a need for offload of key components of the OS.
● The OS needs to manage larger chunks of memory effectively.
● A need to replace/augment the POSIX APIs?
● RDMA API standardization and generalization is necessary.
Coming in 2013
● 100 GB/sec networking.
● >100 GB/sec SSD / Flash devices.
● More cores in Intel processors.
● GPUs already support thousands of hardware threads. Newer models will offer more.