Page 1

RMI: High Performance User-level Multi-threading on SPDK/DPDK

Munenori MAEDA†, Shinji KOBAYASHI†, Jun KATO†, Mitsuru SATO†, Hiroki OHTSUJI‡

† Fujitsu Limited, ‡ Fujitsu Laboratories Ltd.

Page 2

RMI Development Team


Munenori MAEDA • Fujitsu Limited • TL, Engineer, Ph.D.

Shinji KOBAYASHI • Fujitsu Limited • Engineer

Jun KATO • Fujitsu Limited • Engineer

Mitsuru SATO • Fujitsu Limited • Senior Manager, Ph.D.

Hiroki OHTSUJI (Speaker) • Fujitsu Laboratories Ltd. • Researcher, Ph.D.

Page 3

Agenda

Background
  Why Thread?

RMI Architecture
  RMI Design and Feature
  ARCHITECTURE

RMI Threading Environment
  Framework Overview
  Multicore Optimizations
  Benchmark Result

RMI Internode Communication
  Design Overview
  Multicore Optimizations
  Performance Evaluation

Summary

Page 4

Background

Page 5

Background

Enterprise AFA Requirements

Capacity

Performance

Reliability and Availability

Scalability

Rich Storage Services

• Deduplication

• Erasure Coding

• Cloud native functions

• … and so on

HW Technology Trends

Multicore CPUs with dual or more sockets

High-speed interconnect with RDMA (RoCE)

Low-latency storage devices (NVMe, 3D XPoint)

SPDK Framework

Kernel-bypass Architecture to exploit HW Performance of the latest devices

Potential to become the de facto standard


Our research started in 2016.

Page 6

Background

What issues did we face in development? From early evaluation, we found:

Development Expense and Speed

• Developers are unfamiliar with asynchronous programming

• Debugging takes a long time

Effective Performance and Scalability

• Multicore parallelism sometimes degrades performance

• No load balancer to level CPU utilizations

• Internode Communication impacts I/O latency and Performance

Missing pieces of SPDK

Programmability

RDMA based Internode Communication

Page 7

What we need for Programming Framework

Better programmability for Industrial Applications

Compact

Comprehensive

Reliable

Integration with SPDK

SPDK already provides basic storage services and HW drivers, including the NVMe driver

Use them “as is” to shorten the development period

Page 8

Multi-threading Framework

Thread vs. Event

A long-standing debate in the academic and industrial communities

Both have their pros and cons

Yes, SPDK/DPDK is on the “Event” side

Why Thread?

Developers are quite familiar with Blocking API

• Traditional POSIX programming: files, sockets

Why Thread? (Contd.)

Multi-threading has great success in some domains:

• Apache WEB server and WEB services with stateful sessions

Suited to AFA storage services, which have:

Complicated and large code bases

Many states to handle HW or SW failures

Page 9

Internode Communication

Various communication purposes

Delegate I/O request to LUN Node

Update Metadata

Maintain Data Redundancy

Check Sanity/Soundness for HA

Data Redundancy constrains Scalability

Mixed use of short messages and long RDMA transfers (8 KB or more)

[Diagram: internode I/O flow. Node #m runs the SCSI, Extent Lock, Cache, Deduplication, Log Structured FS, and RAID layers; it cross-accesses the responsible Node #p, updates reference counts, and mirrors data to the pair Node #n with synchronous RDMA writes.]

Page 10

RMI Architecture

Page 11

RMI Design and Feature

Our Framework: RMI

An abbreviation of Rocket Messaging Infrastructure

Supports Multi-threading

Lightweight user-level threads

Various synchronizing primitives, similar to Java and Go

SPDK compliant

No changes to the SPDK NVMe driver or 3rd-party drivers

SPDK event-driven programming is still effective

Provides Dynamic Load Balancing

Levels multicore utilization to exploit full performance

Page 12

ARCHITECTURE

[Architecture diagram, components by layer (Legend: RMI / SPDK / 3rd Party):

Storage Services

Storage Protocols: iSCSI Target, NVMe-oF Target, SCSI, NVMe

Block Device Abstraction (BDEV) with the RMI BDEV Bridge API

RMI Threading Environment: Synchronizing Primitives, Thread Pool / Load Balancer, Synchronous Multi-thread Support

Drivers: NVMe driver, NVMe-oF Initiator, RDMA (Verbs) Driver, Intel QuickData Technology (I/OAT) Driver, Intel QuickAssist Technology Driver

Devices: NVMe PCIe Devices, Accelerator, Internode Device

3rd party: Vendor’s Impl.]

Page 13

This Talk …

The two main topics we present are:

RMI Threading Environment

RMI Internode Communication

Meanwhile, the following topics are omitted or only outlined:

RMI BDEV Bridge API

RMI Load Balancing Algorithm

RMI Synchronizing Primitives

RMI QuickAssist Driver Glue

Page 14

RMI Threading Environment

Page 15

RMI Threading Environment

Based on Lthread of DPDK 16.11

User-level: N OS native threads handle M RMI threads

“Cooperative”, or Non-preemptive: running threads voluntarily yield control

Synchronizing Primitives

i-structure (the same as Haskell’s ivar, Java’s Future)

q-structure (the same as Java’s BlockingQueue, Go’s channel)

RCU, Semaphore, and so on.

Thread Pool / Load Balancer

Enables fast and efficient allocation of Threads

Provides a “matching place” between idle threads and tasks

Page 16

RMI Threading Environment: Thread Pool

Design Considerations

Practical Limitation of Memory Footprint

• How many available threads?

• It depends on the total stack memory of all threads

Locality-aware Task Scheduling

• Existing task scheduling methods are:

(1) Tasks activate Idle Threads

(2) Active Threads pull Waiting Tasks

• Common Problems

• Locality Unawareness

• Unnecessary Context Switches


Page 17

RMI Threading Environment: Thread Pool

RMI Approach

Fixed memory footprint of 2 GB

• Total stack memory is the product of 16K threads × 128 KB of fixed stack each

Hybrid Task Scheduling

• Hybrid of existing (1) and (2) methods

• Idle threads and waiting tasks may both be non-empty under certain conditions

• Lazy binding of Task and Thread (see the sketch below)

• Locality-aware

• Minimized Context Switching

[Diagram: the RMI Thread Pool holds Idle Threads, Active Threads, and Waiting Tasks; an incoming Task is bound lazily to a thread.]
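As a rough illustration of this hybrid scheme (not RMI's code), the submit path and the thread-idle path might look like the sketch below. A submitted task is bound to a thread only if one is already idle; a thread that finishes its work pulls the next waiting task directly, avoiding a context switch back to the scheduler. All type and helper names here are assumptions, and the locality-aware placement that can leave both lists non-empty at once is omitted.

    /* Illustrative sketch of hybrid task scheduling with lazy binding.
     * All names are assumed for this example and are not RMI's API.
     * Locality-aware placement is omitted for brevity. */
    #include <stddef.h>

    struct task;
    struct thread;
    struct task_list;      /* FIFO of waiting tasks  (details omitted) */
    struct thread_list;    /* stack of idle threads  (details omitted) */

    struct task   *task_list_pop(struct task_list *);
    void           task_list_push(struct task_list *, struct task *);
    struct thread *thread_list_pop(struct thread_list *);
    void           thread_list_push(struct thread_list *, struct thread *);
    void           wake_thread_with(struct thread *, struct task *);

    struct pool {
        struct task_list   *waiting_tasks;   /* tasks not yet bound to a thread */
        struct thread_list *idle_threads;    /* parked threads without a task   */
    };

    /* Method (1): a new task activates an idle thread, but only if one is parked. */
    static void submit_task(struct pool *p, struct task *t)
    {
        struct thread *th = thread_list_pop(p->idle_threads);
        if (th != NULL)
            wake_thread_with(th, t);              /* bind now           */
        else
            task_list_push(p->waiting_tasks, t);  /* bind lazily, later */
    }

    /* Method (2): an active thread that becomes idle pulls the next waiting
     * task itself, so no context switch back to the scheduler is needed.  */
    static struct task *thread_becomes_idle(struct pool *p, struct thread *self)
    {
        struct task *t = task_list_pop(p->waiting_tasks);
        if (t == NULL)
            thread_list_push(p->idle_threads, self);  /* park until submit_task */
        return t;                                     /* NULL means "now idle"  */
    }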

Page 18

RMI Threading Environment: Example


    static int number_stream(void) {
        …
        // Task declaration
        struct rmi_exec_task gen_task;

        // Content of Task is defined
        rmi_exec_task_init(&gen_task);
        gen_task.fn = gen;
        gen_task.fn_arg = &ga;

        // Submit the Task into Thread pool
        rmi_task_submit_thrpool(&gen_task);
        …
    }

    static int gen(void *arg) {
        struct gen_arg args = *(struct gen_arg *)arg;

        for (uint64_t i = 2; i < args.max; ++i) {
            // use of RMI synchronizing primitive:
            // write integer to bounded-buffer
            rmi_bbstr_u64_write(args.out, i);
        }

        rmi_bbstr_u64_write(args.out, 0UL);
        return 0;
    }

Computational Entity defined as Task

Page 19

RMI Threading Environment: Load Balancing

Design Considerations

Possible migration targets are:

• Active Running Threads, or

• Runnable Tasks in the Thread Pool

Active Running Threads

• [Pros] LB can directly level core utilization

• [Cons] C programs and libraries are constrained in their use of Thread Local Storage (TLS)

• Migrating running threads is usually supported by a language runtime, as in Go

Runnable Tasks ← the only target RMI supports

• [Pros] No limitations for C programs, including OSS, because TLS has not yet been accessed

• [Cons] Because it works by rearranging tasks, LB levels core utilization only gradually


Native threads may access TLS safely, assisted by the OS. Cooperative threads, however, have no OS assist.

Page 20

RMI and SPDK Integration

RMI thread scheduler takes over an SPDK reactor, running on each core

SPDK runs under the RMI scheduler context

[Diagram: the RMI Main Loop wraps a single SPDK reactor iteration (Poll HW and SW Queues, Manage Timers; no change, used just as-is) plus the Poller and Timer Poller, Prioritized Event run, Event run, and a switch to a thread context.]

Page 21

Multicore Performance Issue

SW Queues in RMI Main Loop are Frequently Accessed

The SW queues form RMI’s hub infrastructure and are designed as concurrent queues:

RMI task queue

Ready queue

SPDK event queue

Concurrent Queue may impact Multicore Performance

Mutexes are NEVER used in the RMI Main Loop, as a matter of principle

Most Non-blocking Concurrent Queues use HW Locks (atomic instructions) instead

HW locks restrict Multicore Scalability

Page 22

Preliminary Concurrent Queue Evaluation

Round-trip Test for Concurrent Queue

Single queue per core, used as “Multi-Producer, Single-Consumer” (MPSC)

Measure Elapsed time from (1) to (4)

All-to-all

Comparison

rte_ring of DPDK 16.11

Vyukov’s queue in Lthread of DPDK 16.11

[Diagram: round trip between core #P and core #Q; (1) element push by core #P, (2) element pop by core #Q, (3) element push back by core #Q, (4) element returned by pop at core #P.]
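For reference, the sketch below shows a minimal two-core version of this round-trip measurement written against DPDK's public rte_ring API; the real test uses one MPSC queue per core and runs all-to-all, so this is only meant to illustrate what is being timed (ring names, sizes, and the iteration count are arbitrary).

    /* Minimal two-core round-trip sketch with rte_ring (DPDK). Core P pushes a
     * token to core Q's ring, core Q pops it and pushes it back, and core P
     * pops the returned token; the average round trip is printed in us. */
    #include <stdio.h>
    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ring.h>
    #include <rte_lcore.h>
    #include <rte_launch.h>
    #include <rte_cycles.h>

    #define ITERS 100000

    static struct rte_ring *q_ring, *p_ring;   /* owned by core Q / core P */

    /* Runs on core Q: (2) element pop, (3) element push back. */
    static int echo_lcore(void *arg)
    {
        void *obj;
        (void)arg;
        for (long i = 0; i < ITERS; i++) {
            while (rte_ring_dequeue(q_ring, &obj) != 0)
                ;
            while (rte_ring_enqueue(p_ring, obj) != 0)
                ;
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            return 1;

        q_ring = rte_ring_create("q_ring", 1024, rte_socket_id(), 0);
        p_ring = rte_ring_create("p_ring", 1024, rte_socket_id(), 0);

        unsigned lcore_q = rte_get_next_lcore(rte_lcore_id(), 1, 0);
        rte_eal_remote_launch(echo_lcore, NULL, lcore_q);

        void *token = (void *)0x1, *obj;
        uint64_t start = rte_rdtsc();
        for (long i = 0; i < ITERS; i++) {
            while (rte_ring_enqueue(q_ring, token) != 0)   /* (1) element push   */
                ;
            while (rte_ring_dequeue(p_ring, &obj) != 0)    /* (4) element return */
                ;
        }
        uint64_t cycles = rte_rdtsc() - start;
        rte_eal_wait_lcore(lcore_q);

        printf("avg round trip: %.3f us\n",
               (double)cycles * 1e6 / rte_get_tsc_hz() / ITERS);
        return 0;
    }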

Page 23

Preliminary Result

Result / rte_ring

[Chart: round-trip time (us) vs. # of cores (2, 4, 8, 16, 32) for rte_ring of DPDK 16.11 on a Xeon E5-2697 v4 (Broadwell) 2.3GHz, 2-socket machine. Single socket (all cores in one socket; 2-16 cores): 0.64, 1.51, 3.84, 11.51 us. Dual sockets (cores divided in half, 4+4 / 8+8 / 16+16; 2-32 cores): 1.43, 3.33, 8.41, 22.95, 68.56 us.]

• Long latency: the 32-core RTT is 68.5 us.

• Poor scalability: RTT grows 48x when the number of cores grows 16x.

Page 24

Preliminary Result (Contd.)

Result / Vyukov’s queue

[Chart: round-trip time (us) vs. # of cores (2, 4, 8, 16, 32) for Vyukov’s queue in Lthread of DPDK 16.11, same Xeon E5-2697 v4 (Broadwell) 2.3GHz, 2-socket machine. Single socket (2-16 cores): 0.39, 0.50, 0.96, 2.25 us. Dual sockets (4+4 / 8+8 / 16+16; 2-32 cores): 0.55, 1.16, 2.29, 5.48, 10.41 us.]

6.6x faster than rte_ring

• Acceptable latency? The 32-core RTT is 10 us.

• Good scalability: RTT grows 18x when the number of cores grows 16x.

Page 25

RMI Multicore Optimizations

Observations

HW locking across sockets is quite heavy

• High performance requires reducing HW locks

Further latency improvement for the concurrent queue is still necessary

• Vyukov’s queue is faster, but its latency is still relatively high (>5 us each way)

RMI Approach

Optimizations to reduce HW Locks, focusing on some dedicated use:

• RMI task queue for Load Balancing

• Lockless i-structure for HW completions

• RMI Internode Communication (←presented later)

Copyright 2019 FUJITSU LIMITED24

Page 26: RMI: High Performance User-level Multi- threading …...User-level: N OS native threads handle M RMI threads “Cooperative”, or Non-preemptive: running threads voluntarily yield

RMI task queue

Consists of N fast SPSC queues with no HW lock plus one slow MPSC queue (a sketch follows the diagram below)

Enables fast push/pop as long as the SPSC queues do not overflow

Preserves FIFO order even when overflowed

[Diagram: N SPSC queues (similar to an SPSC rte_ring), one per producer core, in front of a slow MPSC rte_ring; elements are pushed to the MPSC ring only when an SPSC queue overflows.]
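The sketch below illustrates the idea (it is not RMI's implementation): each producer core owns a lock-free SPSC ring that needs no atomic read-modify-write instructions, and only an overflow falls back to a shared slow queue, shown here with a mutex for brevity where RMI uses an MPSC rte_ring. The extra bookkeeping RMI does to keep FIFO order across the overflow path is omitted.

    /* Illustrative sketch of an SPSC-fast-path task queue with a slow overflow
     * queue; not RMI's code. FIFO preservation across the overflow path, which
     * rmi_task_queue guarantees, is omitted here for brevity. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <pthread.h>

    #define SPSC_CAP 256u            /* power of two */
    #define SLOW_CAP 4096u

    struct task { void (*fn)(void *); void *arg; };

    struct spsc_ring {               /* one per producer core */
        _Atomic size_t head;         /* advanced by the single consumer */
        _Atomic size_t tail;         /* advanced by the single producer */
        struct task slots[SPSC_CAP];
    };

    struct slow_queue {              /* shared overflow queue (MPSC) */
        pthread_mutex_t lock;        /* initialize with pthread_mutex_init() */
        struct task buf[SLOW_CAP];
        size_t head, tail;
    };

    /* Fast path: single-producer push; plain loads/stores, no HW lock. */
    static bool spsc_push(struct spsc_ring *r, struct task t)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == SPSC_CAP)
            return false;                          /* full: take the slow path */
        r->slots[tail % SPSC_CAP] = t;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Fast path: single-consumer pop. */
    static bool spsc_pop(struct spsc_ring *r, struct task *t)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return false;                          /* empty */
        *t = r->slots[head % SPSC_CAP];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Producer side: try the per-core SPSC ring first, overflow into the
     * slow queue only when the ring is full. */
    static void task_queue_push(struct spsc_ring *mine, struct slow_queue *slow,
                                struct task t)
    {
        if (spsc_push(mine, t))
            return;
        pthread_mutex_lock(&slow->lock);
        slow->buf[slow->tail++ % SLOW_CAP] = t;    /* sketch: no full check */
        pthread_mutex_unlock(&slow->lock);
    }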

Page 27

RTT Benchmark Result

Result / RMI task queue

[Chart: round-trip time (us) vs. # of cores (2, 4, 8, 16, 32) for rmi_task_queue compared with Vyukov’s queue, Xeon E5-2697 v4 (Broadwell) 2.3GHz, 2 sockets. rmi_task_queue, single socket (2-16 cores): 0.40, 0.59, 0.97, 1.66 us; dual sockets (4+4 / 8+8 / 16+16; 2-32 cores): 0.85, 1.28, 1.90, 2.75, 4.76 us. Vyukov’s queue (2-32 cores): 0.55, 1.16, 2.29, 5.48, 10.41 us.]

• Low latency: the 32-core RTT is 4.8 us.

• Super scalability: RTT grows only 12x when the number of cores grows 16x.

Page 28

Prime Benchmark

Description

“Prime” sieves N natural numbers, N = 10M, through 10K filter threads, one per prime (2, 3, 5, …); a sketch of one filter stage follows.

Filter threads are connected by channels

The # of actual threads stays below 16K because of the RMI Thread Pool capacity limit

Compared with a native Go program

Both use compatible synchronization primitives

The Go program creates threads directly, while the RMI program submits tasks to the Thread Pool
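To connect this with the number_stream/gen example shown earlier, one filter stage could look like the sketch below. The bounded-buffer write primitive comes from that example slide, but the read primitive rmi_bbstr_u64_read(), the rmi_bbstr_u64 channel type, and struct filter_arg are assumed names introduced only for illustration.

    /* Hypothetical sketch of one "Prime" filter thread. It holds one prime p,
     * drops multiples of p, and forwards all other candidates downstream where
     * the next filter stage (holding the next prime) consumes them.
     * rmi_bbstr_u64_read(), the rmi_bbstr_u64 type, and struct filter_arg are
     * assumed names; only rmi_bbstr_u64_write() appears on the example slide. */
    #include <stdint.h>

    typedef struct rmi_bbstr_u64 rmi_bbstr_u64;    /* opaque channel (assumed) */
    uint64_t rmi_bbstr_u64_read(rmi_bbstr_u64 *);  /* assumed blocking read    */
    void     rmi_bbstr_u64_write(rmi_bbstr_u64 *, uint64_t);

    struct filter_arg {
        uint64_t prime;               /* the prime held by this stage          */
        rmi_bbstr_u64 *in, *out;      /* channels to the previous / next stage */
    };

    static int filter(void *arg)
    {
        struct filter_arg *a = arg;

        for (;;) {
            uint64_t n = rmi_bbstr_u64_read(a->in);   /* blocks cooperatively   */
            if (n == 0UL) {                           /* 0 terminates the stream */
                rmi_bbstr_u64_write(a->out, 0UL);
                return 0;
            }
            if (n % a->prime != 0)
                rmi_bbstr_u64_write(a->out, n);       /* pass candidate on      */
        }
    }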

Page 29

Prime Benchmark Result

Good Performance, superior to Go

The decrease in scalability needs further analysis

[Charts: execution time (s) and speedup ratio vs. # of cores for Go and RMI. SUPERMICRO 2028U-TN24R4T / Xeon E5-2697 v4 2.3GHz, 36 cores / 512GB.]

# of cores               1      2      4      8     16     32
Go  execution time (s) 146.9   84.0   40.5   21.6   12.6    8.4
RMI execution time (s)  40.0   21.3   11.8    5.9    3.6    3.7
Go  speedup ratio         1.0    1.7    3.6    6.8   11.7   17.5
RMI speedup ratio         1.0    1.9    3.4    6.8   11.3   10.8

Page 30

RMI Internode Communication

Page 31

Design of Communication Primitives

Verbs API is commonly used in RDMA over Converged Ethernet (RoCE).

We adopt RoCE for low latency and low CPU usage.

Communication primitives for RMI

Use the native Verbs API?

• The Verbs API provides asynchronous posting- and polling-style primitives

• E.g. ibv_post_send() and ibv_poll_cq()

• Polling in every RMI thread is complicated and inefficient

RMI Approach

• Provides simple blocking style primitives implemented using Verbs APIs

• E.g. mmi_send(), mmi_recv(), mmi_read_rdma(), mmi_write_rdma()

Page 32

Implementation of Blocking Primitives

Send/RDMA Read/RDMA Write

i-structure is used to synchronize with the I/O completion.

[Diagram: an RMI Thread and the RMI Main Loop (one SPDK reactor iteration: Poll HW and SW Queues, Manage Timers, Switch to a thread context) synchronizing through an i-structure; 1. Post Send (RMI Thread), 2. Read i-structure (the thread blocks), 3. Poll Send Completion (Main Loop), 4. Write i-structure (Main Loop), 5. Resume execution (RMI Thread).]
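As a rough illustration of this flow, a blocking send can be built from the Verbs API and an i-structure roughly as sketched below. ibv_post_send()/ibv_poll_cq() are standard Verbs calls and rmi_istr_int_read() appears later in the deck; the rmi_istr_int layout, the init/write helpers, and struct rmi_conn are assumed names for this example.

    /* Sketch only: a blocking send in the style of mmi_send(), not RMI's code.
     * The i-structure layout, its init/write helpers, and struct rmi_conn are
     * assumed for illustration. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    typedef struct rmi_istr_int { int value; int filled; } rmi_istr_int; /* assumed layout */
    void rmi_istr_int_init(rmi_istr_int *);            /* assumed: reset to "empty" */
    int  rmi_istr_int_read(rmi_istr_int *);            /* blocks the RMI thread     */
    void rmi_istr_int_write(rmi_istr_int *, int val);  /* assumed: wakes the reader */

    struct rmi_conn {
        struct ibv_qp *qp;          /* queue pair toward the destination node        */
        struct ibv_cq *send_cq;     /* send completion queue polled by the main loop */
    };

    /* Called from an RMI thread. */
    static int sketch_send(struct rmi_conn *c, struct ibv_send_wr *wr)
    {
        struct ibv_send_wr *bad = NULL;
        rmi_istr_int done;

        rmi_istr_int_init(&done);                 /* empty i-structure       */
        wr->wr_id = (uintptr_t)&done;             /* lets the poller find it */

        if (ibv_post_send(c->qp, wr, &bad) != 0)  /* 1. Post Send            */
            return -1;

        /* 2. Read i-structure: the RMI thread yields here until the main loop
         * polls the completion (3.) and writes the i-structure (4.).         */
        return rmi_istr_int_read(&done);          /* 5. resumes with the status */
    }

    /* Called from the RMI Main Loop, alongside the SPDK reactor iteration. */
    static void sketch_poll_send_cq(struct rmi_conn *c)
    {
        struct ibv_wc wc;
        while (ibv_poll_cq(c->send_cq, 1, &wc) > 0)              /* 3. poll completion */
            rmi_istr_int_write((rmi_istr_int *)(uintptr_t)wc.wr_id,
                               wc.status == IBV_WC_SUCCESS ? 0 : -1);  /* 4. write */
    }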

Page 33

Implementation of Blocking Primitives (Contd.)

Receive

q-structure is used for synchronization as multiple messages may arrive.

[Diagram: the receive path mirrors the send path; 1. Read q-structure (the RMI Thread blocks), 2. Poll Receive (RMI Main Loop, within the SPDK reactor iteration), 3. Write q-structure (Main Loop), 4. Resume execution (RMI Thread).]

Page 34

Multicore Performance Issue of Internode Comm.

Spin Locks in the communication library impact performance

Performance degrades when many cores access a Queue Pair (QP)

• QP is the communication endpoint in Verbs API. One QP is needed for each communication target

[Diagram: Core 0, Core 1, …, Core N all contending on a single shared QP.]

Page 35

Dedicated QP for Each Core

Provide a dedicated QP for each Destination Node / RoCE Port / Core.

Each core communicates with the corresponding core on the destination node.

The illustration below assumes a single RoCE port for simplicity.

[Diagram: Nodes 1, 2, and 3; on each node, Core 0, Core 1, …, Core N each own a dedicated QP, paired one-to-one with the matching core’s QP on the peer node.]

Page 36

Dedicated QP for Each Core (Contd.)

Pros

No simultaneous access to a QP.

• Spin lock costs are greatly reduced.

• Specifying an appropriate threading model through Mellanox’s experimental Verbs API further reduces the CPU time wasted on unnecessary HW locks.

Cons

Many QPs are needed.

• Modern NICs support more than ~100K QPs.

Copyright 2019 FUJITSU LIMITED

Number of QPs needed = Max. # of destination nodes × # of RoCE ports × # of cores = 15 × 2 × 40 = 1200

We have enough QPs for our case.
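One straightforward realization is to pre-create all QPs at setup time and index them by (destination node, RoCE port, local core), so the send path never shares a QP between cores; the table layout and lookup below are an assumed sketch rather than RMI's code (rte_lcore_id() is DPDK's query for the calling core's id).

    /* Assumed sketch: a per-(node, port, core) QP table so no two cores ever
     * touch the same QP on the fast path. All QPs are created at setup time. */
    #include <infiniband/verbs.h>
    #include <rte_lcore.h>

    #define MAX_NODES 15   /* max. # of destination nodes (from this slide) */
    #define MAX_PORTS  2   /* # of RoCE ports                               */
    #define MAX_CORES 40   /* # of cores                                    */

    /* 15 x 2 x 40 = 1200 QPs in total, well within what modern NICs support. */
    static struct ibv_qp *qp_table[MAX_NODES][MAX_PORTS][MAX_CORES];

    static inline struct ibv_qp *qp_for(unsigned node, unsigned port)
    {
        /* Each core only ever reads its own column of the table, so posting
         * to the returned QP needs no lock at all. */
        return qp_table[node][port][rte_lcore_id()];
    }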

Page 37

Performance Evaluation Environment

Software

Latency benchmark using pairs of RMI threads exchanging a message or performing RDMA operations in ping-pong style.

Hardware

Processors: 2× Intel Xeon Gold 6148 (Skylake)

• 38 of the 40 cores are used for the experiments.

• Hyper-Threading is disabled.

NIC: Mellanox ConnectX-4 EN

• Dual-port 40GbE.

Network: 40GbE switch

[Diagram: Node 1 and Node 2 connected through a 40GbE switch (VLAN-separated links); RMI Threads 1…N on Node 1 are paired with Threads 1…N on Node 2.]

Page 38

Performance Evaluation Results

Latency of Dedicated QP is ~1/10 of that of Shared QP when many threads are communicating.

A lock-before-polling optimization is applied to the Shared QP case.

[Charts: latency (us) vs. # of RMI threads (1-256). Left: Latency of RDMA 512B, Dedicated QP vs. Shared QP (RDMA Read and Write), with the theoretical minimum for 2 × 40GbE shown. Right: Latency of Message Send/Receive, Dedicated QP vs. Shared QP.]

Page 39

Performance Evaluation Results (Contd.)

CPU Usage of pthread_spin_lock() is reduced if Exp. Verbs is used.

We use IBV_EXP_THREAD_UNSAFE in ibv_exp_create_res_domain().

[Charts: CPU usage (%) of pthread_spin_lock() vs. # of RMI threads. Left: RDMA 512B, Read and Write, with and without experimental Verbs; usage is reduced with Exp. Verbs. Right: Message, with and without experimental Verbs; usage is reduced with Exp. Verbs.]

Page 40

Summary

Designed and Implemented RMI on SPDK/DPDK for Enterprise AFA

Better programmability

• Light-weight user-level threads

• Rich synchronizing primitives

• Blocking style communication primitives

SPDK compliant

• Various libraries and HW drivers can be used “as is.”

High Performance

• Load balancing with fast concurrent queue

• Lockless i-structure

• Low latency RDMA

Page 41

Questions ?


E-mail:

Page 42

Page 43

Bdev Extension

SPDK design

Bdev fn_table may define:

• What to handle

• submit_request

• get_io_channel

• destruct

• How to dispatch

• Function call

• Event call

spdk_bdev_io_submit/_complete

• Invoke the Bdev fn_table, like a vtable

RMI design

Bdev fn_table Extension:

• Adds RMI’s dispatch mechanism

• Task submit

rmi_bdev_io_submit/_complete

• Determines whether the transfer direction is from SPDK to RMI, or vice versa

• Select dispatch mechanism

• [RMI→SPDK] event call

• [SPDK→RMI] task submit

• [*→*] Function call

• Invoke extended Bdev fn_table
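To make the dispatch selection concrete, a sketch in the spirit of rmi_bdev_io_submit is given below; everything other than the dispatch rules themselves ([RMI→SPDK] event call, [SPDK→RMI] task submit, same layer: function call) is an assumption, and the helper names do not reproduce RMI's or SPDK's actual interfaces.

    /* Assumed sketch of the dispatch selection described above; the enum,
     * struct, and helper names are illustrative and not taken from the deck. */
    enum layer { LAYER_RMI, LAYER_SPDK };

    /* Stand-ins for SPDK's event call and RMI's thread-pool task submission. */
    void send_spdk_event(void (*fn)(void *), void *ctx);
    void submit_rmi_task(void (*fn)(void *), void *ctx);

    struct bdev_io_sketch {
        enum layer from, to;        /* layer that issued / layer that handles the I/O */
        void (*fn)(void *);         /* extended Bdev fn_table entry to run            */
        void *ctx;
    };

    static void rmi_bdev_io_dispatch_sketch(struct bdev_io_sketch *io)
    {
        if (io->from == io->to)
            io->fn(io->ctx);                    /* [*->*]       function call */
        else if (io->from == LAYER_RMI)
            send_spdk_event(io->fn, io->ctx);   /* [RMI->SPDK]  event call    */
        else
            submit_rmi_task(io->fn, io->ctx);   /* [SPDK->RMI]  task submit   */
    }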

Page 44

Control Transfer between RMI and SPDK

[Diagram: control transfer between the SPDK layer (spdk_event_queue, Reactor ready_queue) and the RMI layer (rmi_task_queue, RMI Thread Pool, RMI Scheduler, RMI Threads); event call toward SPDK, task submit toward RMI, plus thread resume, i-structure write, etc.]

Dedicated asynchronous primitives may be called anytime, anywhere, to transfer control between layers.

Page 45

Lockless i-structure

The HW lock in the i-structure can be safely removed if posting the Send is performed AFTER the RMI Thread has been blocked.

[Diagram: same structure as the blocking send, but reordered; 1. Read i-structure (the RMI Thread blocks), 2. Post Send (issued from the scheduler context after the thread is blocked), 3. Poll Send Completion (RMI Main Loop), 4. Write i-structure, 5. Resume execution.]

Page 46

Lockless i-structure (Contd.)

Generic i-structure

Post Send before reading i-structure.

HW lock in i-structure is mandatory as another core may write to the i-structure.

Lockless i-structure

Post the Send in the pending function, which is executed after the RMI thread has been blocked by reading the i-structure.

HW lock in i-structure is not necessary.

Generic i-structure:

    …
    post_send(…);

    int ret = rmi_istr_int_read(&istr);
    // Blocks until the Send is completed.

Lockless i-structure:

    static void pending_send(rmi_istr_int *istr, void *arg) {
        // Executed in the scheduler context just after the RMI
        // Thread is blocked.
        post_send(…);
    }

    …
    int ret = rmi_istr_lockless_int_read(&istr, pending_send, &args);
    // Blocks until the Send is completed.

Page 47

Performance Evaluation Results (Contd.)

Latency of RDMA 8KB is limited by the 40GbE throughput.

[Chart: latency (us) vs. # of RMI threads (1-256) for RDMA 8KB, Dedicated QP vs. Shared QP (RDMA Read and Write), with the theoretical minimum for 2 × 40GbE shown.]

Page 48

Performance Evaluation Results (Contd.)

CPU Usage of pthread_spin_lock() is reduced if Exp. Verbs is used.

We use IBV_EXP_THREAD_UNSAFE in ibv_exp_create_res_domain().

[Chart: CPU usage (%) of pthread_spin_lock() vs. # of RMI threads (1-256) for RDMA 8KB, Read and Write, with and without experimental Verbs; usage is reduced with Exp. Verbs.]

Page 49

Performance Evaluation Results (Contd.)

Shared QP without Lock Optimization has severe Performance Degradation.

[Charts: latency (us) vs. # of RMI threads (1-256). Left: Latency of RDMA 512B, Dedicated QP vs. Shared QP vs. Shared QP w/o Lock Opt. (RDMA Read and Write), with the 2 × 40GbE minimum shown. Right: Latency of Message Send/Receive, Dedicated QP vs. Shared QP vs. Shared QP w/o Lock Opt.]

Page 50

Performance Evaluation Results (Contd.)

Shared QP without Lock Optimization has severe Performance Degradation.

[Chart: latency (us) vs. # of RMI threads (1-256) for RDMA 8KB, Dedicated QP vs. Shared QP vs. Shared QP w/o Lock Opt. (RDMA Read and Write), with the 2 × 40GbE minimum shown.]
