Kernel in the Way: Bypass and Offload Technologies
End User Summit 2012, New York
Christoph Lameter <[email protected]>
Introduction
● Example of a socket-based send vs. a bypass send
● Why?
● How?
● What are bypass and offload?
● Issues in storage
● Potential solutions
[Diagram: "Sending a message via the sockets API". Application → C library (fwrite(data, size, 1, FILE);) → system call with a context switch (write(fd, data, size);) → kernel socket buffers and stream buffer data structures → driver (dev_queue_xmit(skb);) → TX ring; the device maps the kernel's buffers.]
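A minimal sketch of this conventional path in C (standard POSIX calls; the port number is a placeholder): every byte handed to write() is copied into kernel socket buffers before the driver queues it for the device.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int send_message(const char *host, const void *data, size_t size)
    {
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(7000) }; /* placeholder port */
        inet_pton(AF_INET, host, &addr.sin_addr);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;

        /* write() copies 'data' into kernel socket buffers; the device
           never sees the user buffer directly. */
        ssize_t n = write(fd, data, size);
        close(fd);
        return n == (ssize_t)size ? 0 : -1;
    }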
[Diagram: "Bypass: Device maps user space buffers". Application → verbs library → TX ring in user space; the device maps user space buffers directly. Verbs system calls reach the IB driver API only for device control and metadata structures, not on the data path, so no context switch is needed per message.]
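By contrast, a minimal sketch of the bypass send using the verbs API (libibverbs). It assumes a queue pair qp that is already connected and a memory region mr registered over buf with ibv_reg_mr(); the setup code, omitted here, is substantial.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Post 'len' bytes from a registered user buffer directly to the NIC. */
    int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,   /* user space address, mapped by the device */
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry */
        };
        struct ibv_send_wr *bad;

        /* No data copy: the request goes onto a TX ring that the device
           reads from user-mapped memory. */
        return ibv_post_send(qp, &wr, &bad);
    }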
Rsockets by Sean Hefty (Intel)
● Presented at the Open Fabrics Developer meeting at the end of March.
● Socket emulation layer in user space.
● LD_PRELOAD library to override socket system calls (sketch after the charts below).
● Performance comparison shows that kernel processing is detrimental to performance. Bypass is essential.
● IPoIB = kernel sockets layer using IP emulation on Infiniband.
● SDP = kernel sockets layer using an Infiniband native connection.
● IB = native Infiniband connection, user space → user space.
[Chart: Bandwidth (Gbps, 0-30) vs. message size (64 bytes to 1 MB) for IPoIB, SDP, RSOCKET and IB.]
[Chart: 64-byte ping-pong latency (us, 0-10) for IPoIB, SDP, RSOCKET and IB.]
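The rsockets calls mirror the BSD socket calls one for one (rsocket, rconnect, rsend, rclose in librdmacm), which is what makes the LD_PRELOAD override possible. A minimal sketch, assuming an already-resolved addr; error handling is abbreviated:

    #include <rdma/rsocket.h>   /* librdmacm's rsockets */
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Same shape as the BSD socket code, but the data travels over RDMA. */
    int rsend_message(struct sockaddr_in *addr, const void *data, size_t size)
    {
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (rconnect(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0) {
            rclose(fd);
            return -1;
        }
        ssize_t n = rsend(fd, data, size, 0);  /* drop-in for send() */
        rclose(fd);
        return n == (ssize_t)size ? 0 : -1;
    }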
State of the Art NIC characteristics
● 56 Gigabits per second (the Unix network stack was designed for 10 Mbits).
● 4-7 Gigabytes per second (Unix: 1 MB/s).
● >8 million packets per second (Unix: ~1000 packets per second).
● Less than a microsecond per packet for processing.
● Windows 8 will fully support these speeds through offload/bypass APIs.
Bypass is
● Not using a kernel subsystem although it is there and could provide similar services.
● Implementing in user space what a kernel subsystem does.
● Direct access to and control of memory mapped devices from user space.
● Some people call zero copy a form of bypass since it avoids the kernel's copy pass.
Offload is
● Replacement of what could be done in software with dedicated hardware.
● Overlaps with bypass, because direct device interaction replaces software action in the kernel with the actions of a hardware device.
● Typical cases of hardware offload: DMA engines, GPUs, screen rendering, cryptography, TCP (TOE), FPGAs.
Why bypass the kernel?
● Kernel is too slow and inefficient at high packet rates. Problems already begin at 10G.
● Contemporary devices can map user space memory and perform transfers directly to and from user space.
● Kernel must copy data between kernel buffers and userspace.
● Kernel is continually regressing in terms of the overhead of basic system calls and operations. Only new hardware compensates.
10G woes
● Packet rate too high to receive all packets on one processor.
● All sorts of optimizations: interrupt mitigation, multiqueue support, receive flow steering, etc.
● Compromises that do not allow full use of the capabilities of the hardware.
● 10G technology goes beyond the boundary of what socket based technology can cleanly support.
● 40G and higher speed technologies are available and introduce different APIs that allow effective operations.
CPU problems
● The processing capacity of an individual hardware thread is mostly stagnating. No major improvements in upcoming processor generations.
● I/O bandwidth, memory capacity and bandwidth are growing fast.
● Multithreaded handling of I/O requires synchronization, which creates additional overhead that in turn limits processing capability.
Bypass technologies
● Kernel: zero copy (using sendfile(), for example; sketch below).
● Kernel: Sockets Direct Protocol (SDP)
● Kernel: RDMA or “Verbs”
● Linux: u/kDAPL
● Solarflare: OpenOnload
● Myricom: DBL
● Numerous user space implementations, mostly based on the kernel RDMA APIs.
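A minimal sketch of the first item, kernel zero copy with sendfile() (sock_fd and file_fd are assumed to be an already-connected socket and an open file): the file pages move from the page cache to the wire without being copied through user space.

    #include <sys/sendfile.h>
    #include <sys/stat.h>

    /* Hand the whole file to the socket without a user space buffer. */
    int send_file(int sock_fd, int file_fd)
    {
        struct stat st;
        if (fstat(file_fd, &st) < 0)
            return -1;

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
            if (n <= 0)
                return -1;  /* sendfile advances 'offset' on success */
        }
        return 0;
    }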
Expanding role of Kernel RDMA Verbs API
● Initially only Infiniband packets were supported.
● Chelsio added iWarp via the verbs API to get around TOE restrictions in the Linux network stack. Recently added UDP support via verbs.
● Mellanox added ROCEE to allow Infiniband RDMA type traffic on Ethernet networks.
● Mellanox added RAW_ETH_QP to allow another form of kernel bypass for Ethernet devices.
● Manufacturers keep coming up with new devices for RDMA verbs.
Trouble with offload APIs
● Complexity.
● Manage RX/TX rings and timing issues in user space.
● Difficult to code for.
● 4K page size is still an issue.
● No unified API.
● Proprietary vendor solutions.
“Stateless Offload”
● An artificial name created to refer to the ability of the network device to transfer a limited number of packets in a single system call. Mostly used for TCP.
● Does reduce kernel overhead but does not copy directly into user space.
● No use of RDMA
● For the purposes of this talk, this is not offload as understood here: by offload we mean bypassing the kernel in critical paths, whereas this only batches packets.
Bypass API characteristics
● Direct operation on user space memory, bypassing kernel buffers.
● The kernel manages the connection, but its involvement in actual read and write operations is minimal.
● I/O is asynchronous.
● Problem: notification of newly available packets or completed writes (see the sketch below).
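One common answer to the notification problem, sketched here with libibverbs (ch and cq are assumed to have been created with ibv_create_comp_channel() and ibv_create_cq()): arm the completion queue and sleep on its event channel, or busy-poll ibv_poll_cq() when latency matters more than CPU.

    #include <infiniband/verbs.h>

    /* Wait for one completion on 'cq' via the event channel 'ch'. */
    int wait_one_completion(struct ibv_comp_channel *ch, struct ibv_cq *cq)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;

        if (ibv_req_notify_cq(cq, 0))              /* arm: wake on next completion */
            return -1;
        if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx)) /* blocks until the device signals */
            return -1;
        ibv_ack_cq_events(ev_cq, 1);

        /* Drain the completion that woke us. */
        return ibv_poll_cq(cq, 1, &wc) == 1 && wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }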
Storage
● Similar issues are particularly evident for remote filesystem clients, since they have traditionally used socket-based APIs or in-kernel APIs for network connectivity.
● Paper: “Sockets vs RDMA Interface over 10-Gigabit Networks”.
● Storage issues are not as severe for local disk access, since APIs already exist that do a kind of bypass: direct I/O and memory-mapped I/O (sketch below). Infrastructure for control is missing, though.
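A minimal sketch of the direct I/O bypass mentioned above (standard open(2) with O_DIRECT; the 4096-byte alignment is a common safe assumption, the real requirement depends on the device):

    #define _GNU_SOURCE         /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read 'size' bytes from 'path' bypassing the page cache. O_DIRECT
       requires the buffer, offset and size to be block aligned. */
    void *direct_read(const char *path, size_t size)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, size))
            return NULL;

        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0 || read(fd, buf, size) != (ssize_t)size) {
            if (fd >= 0)
                close(fd);
            free(buf);
            return NULL;
        }
        close(fd);
        return buf;  /* caller frees */
    }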
Storage Challenges
● High-end SSDs max out PCI-E speeds.
● >1 million IOPS.
● 6.5 GBytes/sec.
● Linux uses 4K pages => the kernel has to manage (repeatedly updating!) more than 1 million page descriptors per second.
● The design target was 500 IOPS and tens or maybe hundreds of MBytes per second.
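A quick back-of-the-envelope check of those numbers: at 6.5 GBytes/sec in 4K pages, 6.5 × 10^9 / 4096 ≈ 1.6 million page descriptors per second, more than three orders of magnitude beyond the ~500 IOPS the design assumed.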
PCI-E “Bypass”
● I/O bus hardware is insufficient for contemporary state-of-the-art devices.
● 40Gb networks cannot be handled by current Westmere processors (there is a limit of 32Gb/s for 8 lanes).
● 56Gb/s transfer rates can barely be handled by the upcoming generation (Sandybridge).
● I/O devices with much higher transfer rates are on the horizon.
● PCI-E bypass is being explored by various companies.
● Hardware development is not keeping up
Linux Memory Management
● 4K page size: the kernel handles memory in 4K chunks (yes, 2M huge pages are available, but they have serious limitations).
● A standard 64G server has 16 million page structs to manage. Memory reclaim can become adventurous.
● With terabyte servers on the horizon, we are looking at billions of 4K chunks to manage.
● If I just want to transfer a 4G file from disk, the hardware and the memory management system need to handle 1 million 4K pages.
● 4G can be transferred in ~1 second on state-of-the-art networks (but the kernel overhead will ensure that such a transfer takes much longer).
Storage system bypass by Network Filesystems
● Network filesystems intercept I/O requests in user space and forward them to the server, bypassing the OS on the client.
● Lustre, Gluster, GPFS.
● Special protocols for block offload (SRP, iSER).
● Network performance is much higher than local disk performance(!).
● The situation may change with new SSD/Flash technologies, but those are very pricey.
Approaches to fix things
● Memory management needs to handle larger chunks of memory (2M, 1G page sizes). Large physically contiguous chunks of memory need to be managed (see the sketch after this list).
● A processor should not touch the data transferred (zero copy requirement). This implies that POSIX style socket I/O cannot be used. There is a need for new APIs.
● I/O subsystems need to be able to handle large chunks of memory and avoid creating large scatter gather lists. Devices struggle to support those.
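A minimal sketch of the first point, using the existing MAP_HUGETLB interface (assumes the administrator has reserved huge pages, e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #define BUF_SIZE (512UL * 2 * 1024 * 1024)  /* 1 GB as 512 x 2 MB huge pages */

    /* Map a buffer backed by 2 MB huge pages: one page descriptor per
       2 MB instead of one per 4 KB. */
    int main(void)
    {
        void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        /* 1 GB now costs 512 descriptors instead of 262144. */
        munmap(buf, BUF_SIZE);
        return 0;
    }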
A New Standard RDMA system call API
● Socket based.
● Reuse as much of the POSIX functionality as possible.
● Buffer management in user space.
● Raw hardware interface elements.
● There are a couple of academic projects in this area, but none of them will be viable without kernel support.
Conclusion / Q&A
● We are heading toward a need for offload of key components of the OS.
● The OS needs to manage larger chunks of memory effectively.
● A need to replace/augment the POSIX APIs?
● RDMA API standardization and generalization is necessary.
Coming in 2013
● 100 GB/sec networking.
● >100 GB/sec SSD / Flash devices.
● More cores in Intel processors.
● GPUs already support thousands of hardware threads. Newer models will offer more.