73
Copyright © 2016 Solarflare Communications, Inc. All rights reserved. Much Faster Networking David Riddoch [email protected]

Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Copyright © 2016 Solarflare Communications, Inc. All rights reserved.

Much Faster Networking

David [email protected]

Page 2: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

What is kernel bypass?

Page 3: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

The standard receive path

Page 4: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

The standard receive path

Page 5: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

The standard receive path

Page 6: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

The standard receive path

Page 7: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Kernel-bypass receive

Page 8: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Kernel-bypass transmit

Page 9: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Kernel-bypass transmit

Page 10: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Kernel-bypass transmit

Page 11: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Kernel-bypass transmit – even faster

Page 12: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

What does all of thiscleverness achieve?

Page 13: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Better performance!

• Fewer CPU instructions for network operations

• Better cache locality

• Faster response (lower latency)

• Higher throughput (higher bandwidth/message rate)

• Reduced contention between threads

• Better core scaling

• Reduced latency jitter

Page 14: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 15: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Solarflare’s OpenOnload

• Sockets acceleration using kernel bypass

• Standard Ethernet, IP, TCP and UDP

• Standard BSD sockets API

• Binary compatible with existing applications

Page 16: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

OpenOnload intercepts network calls

Page 17: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Single thread throughput and latency

Page 18: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

UDP receive throughput (small messages)

Page 19: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Kernel stack OpenOnload

Page 20: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

A much more challenging application

Page 21: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

HAProxy performance and scaling

0

10

20

30

40

50

1 2 3 4 5 6 7 8 9

Co

nn

ect

ion

s/se

c (0

00

s)

CPU Cores

100K Byte

Solarflare Net Driver

Intel Net Driver

Solarflare OpenOnload

0

100

200

300

400

500

600

1 2 3 4 5 6 8 10 12

Co

nn

ect

ion

s/se

c (0

00

s)

CPU Cores

1K Byte

Solarflare Net Driver

Intel Net Driver

Solarflare OpenOnload

Solarflare kernel

Other kernel

OpenOnload

Solarflare kernel

Other kernel

OpenOnload

1 KiB message size

100 KiB message size

Co

nn

ecti

on

s/s

(10

00

s)C

on

nec

tio

ns/

s (1

00

0s)

Page 22: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Why doesn’t performance scale when using the kernel stack?

Page 23: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Better question:

How come it scales aswell as it does?

Page 24: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

CPU cores

Page 25: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Received packet is delivered intomemory (or L3 cache)

Page 26: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Interrupt triggers packethandling on this CPU core

Page 27: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Applicationcalls recv()

Page 28: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Socket state pulled from onecache to another; Inefficient

Single core bottleneckfor interrupt handling

Page 29: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Multiple receive channels(up to one per core)

channel_id = hash(4tuple) % n_cores;

Page 30: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Multiple flows

Page 31: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Hopefully!

Page 32: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Usually

Page 33: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

channel_id, n = lookup(4tuple);

if( n > 1 )

channel_id += hash(4tuple) % n_cores;

Page 34: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

SO_REUSEPORT to the rescue

• Multiple listening sockets on the same TCP port• One listening socket per worker thread

• Each gets a subset of incoming connection requests

• New connections go to the worker running on the core that the flow hashes to

• Connection establishment scales with the number of workers

• Received packets are delivered to the ‘right’ core

Page 35: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Problem solved?

Page 36: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 37: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 38: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 39: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

0

100

200

300

400

500

600

1 2 3 4 5 6 8 10 12

Co

nn

ect

ion

s/se

c (0

00

s)

CPU Cores

1K Byte

Solarflare Net Driver

Intel Net Driver

Solarflare OpenOnload

Solarflare kernel

Other kernel

OpenOnload

1 KiB message size

Co

nn

ecti

on

s/s

(10

00

s)

Number of CPU cores

Page 40: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

So much for sockets

Page 41: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Let’s get closer to the metal…

Page 42: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Layer-2 APIs

1-6 Mpps/coreMany protocolsEnd-host applications

1-60 Mpps/coreAll protocolsAny network function

Page 43: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

60 million pkt/s?

Page 44: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

17 ns

Page 45: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Some tips for achievingreally fast networking…

Page 46: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Tip 1. The faster your application is already, the more speedup you’ll get from kernel bypass

(This is just Ahmdal’s law)

Page 47: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Tip 2. NUMA locality applies doubly to I/O devices

Page 48: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 49: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 50: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 51: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 52: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,
Page 53: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

• DMA transfers will use the L3 cache if:

• The targeted cache line is resident

• Or if not then up to 10% of L3 is available for write-allocate

• Therefore

• If you want consistent high performance, DMA buffers must be resident in L3 cache

• To achieve that

• Small set of DMA buffers recycled quickly

• (Even if that means doing an extra copy)

Page 54: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Tip 3.Queue management is critical

Page 55: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Queues exist mostly tohandle mismatch betweenarrival rate and service rate

Page 56: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

• Buffers in switches and routers

• Descriptor rings in network adapters

• Socket send and receive buffers

• Shared memory queues, locks etc.

• Run queue in the kernel task scheduler

Page 57: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

What happens when queues start to fill?

Page 58: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Service rate < arrival rate

Queue fill level increases(latency++)

Working-set sizeincreases

(efficiency--)

Service rate drops

Page 59: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Service rate < arrival rate

Queue fills

DROP!

Page 60: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Drops are bad, m’kay

Make SO_RCVBUF bigger!

Page 61: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Dilemma!

Small buffers:

Necessary for stable performance when overloaded

Large buffers:

Necessary for absorbing bursts without loss

Page 62: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

For stable performance when overloaded

Limit working set size

• Limit the sizes of pools, queues, socket buffers etc.

Shed excess load early (Tip 3.1)

• Before you’ve wasted time on requests you’re going to have to drop anyway

Page 63: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Interrupt moves packets fromdescriptor ring to sockets

App thread consumesfrom socket buffer

Page 64: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Drop newest data(only at very high rates)

Drop newest data

Do work for every packet,whether dropped or not

Page 65: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Drop newest data

Page 66: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Tip 3.2 Drop old data for better response time

Page 67: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Last example: Cache locality

Page 68: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

recv()

OpenOnload stack; includes sockets, DMA buffers,TX and RX rings, control plane etc.

Page 69: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

recv()

Page 70: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Problem: Send a small (eg. 200 bytes) reply from a different thread

Page 71: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

send()

send()

Or…

Page 72: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

send()

• Passing a small message to another thread:

• A few cache misses

• send() on a socket last accessed on another core:

• Dozens of cache misses

Page 73: Much Faster Networking · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for every packet,

Thank you!