Linux Block IO: Introducing Multiqueue SSD Access on
Multicore Systems
Matias Bjørling (1,2), Jens Axboe (2), David Nellans (2), Philippe Bonnet (1)
2013/03/05
(1) IT University of Copenhagen   (2) Fusion-io
Non-Volatile Memory Devices
• Higher IOPS and lower latency than magnetic disks
• Hardware continues to improve
  – Parallel architecture
  – Future NVM technologies
• Software design decisions assume
  – Fast sequential access
  – Slow random access
The Block Layer in the Linux Kernel
• Provides
  – Single point of access for block devices
  – IO scheduling
  – Merging & reordering
  – Accounting
• Optimized for traditional hard drives
  – Trade CPU cycles for sequential disk accesses
  – Synchronization is a performance bottleneck
In-Kernel IO Submission / Completion
[Diagram: a process (app/FS/etc.) submits bios into the kernel block IO layer, whose request queue handles scheduling, reordering, merging, and accounting, before handing requests to the low-level block IO device driver and the NVM device]
Block Layer IO for Disks
• Single request queue
  – A single lock protects the shared data structure
  – Cache-coherence traffic is expensive on multi-CPU systems
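To make the bottleneck concrete, here is a simplified sketch (illustrative only, not the kernel's actual request-queue code) of the single-queue submission pattern: every core funnels its requests through one lock-protected queue, so lock acquisition and the resulting cache-line bouncing grow with core count.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Illustrative single-queue submission path: all CPUs contend on one lock. */
struct sq_request_queue {
    spinlock_t       queue_lock;   /* one lock for the whole device */
    struct list_head queue_head;   /* pending requests, sorted/merged here */
};

static void sq_submit(struct sq_request_queue *q, struct list_head *rq_entry)
{
    spin_lock_irq(&q->queue_lock);           /* serializes every submitting CPU */
    list_add_tail(rq_entry, &q->queue_head); /* insert (merging/sorting would also happen here) */
    spin_unlock_irq(&q->queue_lock);
}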
Methodology
• Null block device driver
  – In-kernel dummy device
  – Acts as a proxy for a zero-latency device
  – Avoids DMA overhead
  – Removes driver effects from the block layer
  – "Device" completions can be in-line or soft IRQ (see the sketch below)
[Diagram: process (app/FS/etc.) → block IO layer → null block driver, standing in for an NVM device]
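The in-line versus soft-IRQ completion choice can be sketched as below. This is an assumed shape based on the blk-mq driver API that later landed in mainline Linux (blk_mq_start_request, blk_mq_end_request, blk_mq_complete_request); the 2013 prototype's null driver differs in detail, and inline_completion is a hypothetical knob.

#include <linux/blk-mq.h>

static bool inline_completion;  /* hypothetical knob: complete in-line vs. via softirq */

/* Sketch of a null driver's queue_rq: no real IO is done; the request is
 * completed immediately, either in the submitter's context or deferred to
 * the completion (softirq/IPI) path. */
static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
                                  const struct blk_mq_queue_data *bd)
{
    struct request *rq = bd->rq;

    blk_mq_start_request(rq);
    if (inline_completion)
        blk_mq_end_request(rq, BLK_STS_OK);   /* complete in the caller's context */
    else
        blk_mq_complete_request(rq);          /* defer to the completion path */
    return BLK_STS_OK;
}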
Current Block Layer (IOPS)
Null block device, noop IO scheduler, 512B random reads
[Plot: IOPS on the four test systems: Sandy Bridge-E, Westmere-EP, Nehalem-EX, and Westmere-EX]
Multi-Queue Block Layer
• Balance IO workload across cores
• Reduce cache-line sharing
• Similar functionality to the single-queue (SQ) design
• Allow multiple hardware queues
[Diagram: processes on each socket submit IO through libaio, the OS block layer, and the device driver]
Multi-Queue Block Layer
• Two levels of queues
  – Software queues
  – Hardware dispatch queues
• Per-CPU accounting
• Tag-based completions
[Architecture diagram: processes in userspace issue async/sync IO into the kernel block layer; per-CPU software queues handle staging (insertion/merge), tagging, accounting, and fairness/IO scheduling; requests then move to hardware dispatch queues for submission to single/multi-queue capable hardware, with status/completion interrupts returning through the block IO device driver]
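As a concrete illustration of the two-level design, the sketch below shows how a driver might register multiple hardware dispatch queues and a tag set for tag-based completions. The names follow the blk-mq API this work evolved into in mainline Linux (blk_mq_alloc_tag_set, blk_mq_init_queue); exact fields and signatures vary by kernel version, and struct mydev / mydev_queue_rq are hypothetical driver pieces, so treat this as an assumed shape rather than the prototype's code.

#include <linux/blk-mq.h>
#include <linux/err.h>

struct mydev {                              /* hypothetical driver state */
    struct blk_mq_tag_set tag_set;
    struct request_queue *queue;
    unsigned int nr_hw_queues;
};

static blk_status_t mydev_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd);

static const struct blk_mq_ops mydev_mq_ops = {
    .queue_rq = mydev_queue_rq,             /* invoked per hardware dispatch queue */
};

static int mydev_init_queues(struct mydev *dev)
{
    struct blk_mq_tag_set *set = &dev->tag_set;

    set->ops          = &mydev_mq_ops;
    set->nr_hw_queues = dev->nr_hw_queues;  /* one per hardware submission queue */
    set->queue_depth  = 64;                 /* tags (completion IDs) per queue */
    set->numa_node    = NUMA_NO_NODE;
    set->flags        = BLK_MQ_F_SHOULD_MERGE;

    if (blk_mq_alloc_tag_set(set))
        return -ENOMEM;

    /* Sets up the per-CPU software queues and maps them onto the
     * hardware dispatch queues. */
    dev->queue = blk_mq_init_queue(set);
    if (IS_ERR(dev->queue)) {
        blk_mq_free_tag_set(set);
        return PTR_ERR(dev->queue);
    }
    return 0;
}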
Mapping Strategies
• Per-core software queues with a varying number of hardware dispatch queues
• Single software queue with a varying number of hardware dispatch queues
Scalability = multiple hardware dispatch queues
[Plots: MQ scalability under each mapping strategy]
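One simple way to map a submitting CPU's software queue onto a hardware dispatch queue is to spread CPUs evenly across the available queues, as in the sketch below. This is an illustrative strategy under an assumed name (cpu_to_hw_queue is not a kernel function), not necessarily the mapping used in the prototype.

/* Spread CPUs evenly across hardware dispatch queues: with fewer hardware
 * queues than cores, several per-core software queues share one dispatch
 * queue. Illustrative only. */
static unsigned int cpu_to_hw_queue(unsigned int cpu,
                                    unsigned int nr_cpus,
                                    unsigned int nr_hw_queues)
{
    return (cpu * nr_hw_queues) / nr_cpus;  /* e.g. 12 CPUs, 3 queues: CPUs 0-3 -> 0, 4-7 -> 1, 8-11 -> 2 */
}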
Multi-Queue IOPS
SQ – Current block layer, Raw – Block layer bypass, MQ – Proposed design
Multi-socket Scaling
AIO interface
• Alternative to the synchronous IO interface
• Prevents blocking of user threads
• Move the bio list to a lockless list
• Remove legacy code and locking
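For reference, a minimal userspace example of the asynchronous interface: one 512 B read is submitted with libaio and reaped later, leaving the thread free in between. The device path /dev/nullb0 (a null_blk instance) is only an example; build with -laio.

#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;

    int fd = open("/dev/nullb0", O_RDONLY | O_DIRECT);   /* example device */
    if (fd < 0) { perror("open"); return 1; }

    if (posix_memalign(&buf, 512, 512)) return 1;         /* O_DIRECT needs alignment */
    if (io_setup(1, &ctx)) return 1;                       /* create the AIO context */

    io_prep_pread(&cb, fd, buf, 512, 0);                   /* 512 B read at offset 0 */
    if (io_submit(ctx, 1, cbs) != 1) return 1;             /* submit without blocking */

    /* ... the thread is free to do other work here ... */

    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) return 1; /* reap the completion */
    printf("read %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    return 0;
}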
Multi-Queue, 8 socket
Latency Comparison
• Sync IO, 512 B
• Queue depth grows with CPU count
[Bar charts: latency (us) for SQ, Raw, and MQ; 1-socket system at 1 and 6 cores, 2-socket system at 1, 7, and 12 cores; annotated reductions of 2.3x (1 socket) and 5.4x (2 sockets)]
Latency Comparison
[Bar charts: latency (us) for SQ, Raw, and MQ; 4-socket system at 1, 9, and 32 cores, 8-socket system at 1, 11, and 80 cores; annotated reductions of 2.9x (4 sockets) and 7.8x (8 sockets)]
Conclusions
• Current block layer is inadequate for NVM
• Proposed multi-queue block layer
  – 3.5x-10x increase in IOPS
  – 10x-38x reduction in latency
  – Supports multiple hardware queues
  – Simplifies driver development
  – Maintains industry-standard interfaces
Thank You & Questions
Implementation available at git://git.kernel.dk/?p=linux-block.git
Systems
• 1 Socket – Sandy Bridge-E, i7-3930K, 3.2 GHz, 12 MB L3
• 2 Socket – Westmere-EP, X5690, 3.46 GHz, 12 MB L3
• 4 Socket – Nehalem-EX, X7560, 2.66 GHz, 24 MB L3
• 8 Socket – Westmere-EX, E7-2870, 2.4 GHz, 30 MB L3