On Tuesday, July 2, 2013, Jens Axboe is presenting “Linux Block I/O: Introducing Multi-Queue SSD access on Multi-Core Systems.” Jens, Fusion-io Fellow and maintainer of the Linux block layer, will discuss achieving maximum and scalable I/O throughput for high-performance flash devices in the Linux I/O stack. The paper presents a new queuing model for storage devices that is scalable and provides the feature set that any modern flash-based device and driver needs. The new architecture achieves near-linear scaling even on complex machine topologies. Work is already in progress to convert existing Linux block drivers and SCSI low-level drivers to this model, and it is expected to land in the mainline Linux kernel in version 3.11 or 3.12. Authors of the paper include Matias Bjørling (IT University of Copenhagen), Jens Axboe and David Nellans (Fusion-io), and Philippe Bonnet (IT University of Copenhagen).
Linux Block IO: Introducing Multiqueue SSD Access on Multicore Systems
Matias Bjørling (1,2), Jens Axboe (2), David Nellans (2), Philippe Bonnet (1)
(1) IT University of Copenhagen, (2) Fusion-io
2013/03/05
Non-Volatile Memory Devices
• Higher IOPS and lower latency than magnetic disks
• Hardware continues to improve
  – Parallel architecture
  – Future NVM technologies
• Software design decisions assume
  – Fast sequential access
  – Slow random access
[Figure: device performance trend, 2010-2013 and beyond]
The Block Layer in the Linux Kernel
• Provides – Single point of access for block devices – IO scheduling – Merging & reordering – Accounting
• Optimized for traditional hard-drives – Trade CPU cycles for sequential disk accesses – Synchronization is a performance bottleneck
3
In-Kernel IO Submission / Completion
[Diagram: a process (app/FS/etc.) submits bios to the block IO layer, where the request queue handles scheduling, merging, reordering, and accounting; requests are then passed to the low-level block IO device driver and on to the NVM device]
Block Layer IO for Disks
• Single request queue
  – A single lock protects the shared data structure
  – Cache-coherence traffic is expensive on multi-CPU systems
Methodology
• Null block device driver (see the sketch below)
  – In-kernel dummy device
  – Acts as a proxy for a zero-latency device
  – Avoids the DMA overhead
  – Removes driver effects from the block layer measurements
  – "Device" completions can be inline or via soft IRQ
[Diagram: process (app/FS/etc.) → block IO layer → null driver, with the null driver standing in for the NVM device]
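A minimal sketch of the idea behind such a dummy driver, written against the legacy single-queue kernel API (blk_init_queue, blk_fetch_request, __blk_end_request_all). This is an illustration only, not the authors' actual null block driver, and the nullb_* names are made up here; every request is completed immediately, so only block-layer overhead remains visible:

#include <linux/blkdev.h>
#include <linux/spinlock.h>

/* Single queue_lock shared by all CPUs - the contention point being measured. */
static DEFINE_SPINLOCK(nullb_lock);

/* Called by the block layer with q->queue_lock held: complete every
 * request on the spot, with no device latency. */
static void nullb_request_fn(struct request_queue *q)
{
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL) {
                /* A real driver would set up DMA and wait for the device;
                 * the "null" device just reports success in-line. */
                __blk_end_request_all(rq, 0);
        }
}

static struct request_queue *nullb_alloc_queue(void)
{
        /* One request queue, one spinlock - the single-queue design. */
        return blk_init_queue(nullb_request_fn, &nullb_lock);
}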
Current Block Layer (IOPS)
Null block device, noop IO scheduler, 512B random reads
[Chart: IOPS of the current block layer on Sandy Bridge-E, Westmere-EP, Nehalem-EX, and Westmere-EX]
Multi-Queue Block Layer
• Balance IO workload across cores
• Reduce cache-line sharing
• Similar functionality to the single-queue (SQ) block layer
• Allow multiple hardware queues
[Diagram: processes on Socket 1 and Socket 2, each issuing IO through libaio, the OS block layer, and the device driver]
Multi-Queue Block Layer
• Two levels of queues (see the driver registration sketch below)
  – Software queues
  – Hardware dispatch queues
• Per-CPU accounting
• Tag-based completions
[Diagram: userspace processes issue async/sync IO into per-CPU software queues in the block layer (staging: insertion/merge, tagging, accounting, fairness/IO scheduling); software queues map onto hardware dispatch queues (submission/completion); the block IO device driver submits IO to single- or multi-queue capable hardware and handles status/completion interrupts]
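For context, a rough driver-side sketch of registering with this two-level design, using the blk-mq API as it later landed in mainline Linux. Field names, return types, and the nvm_* identifiers here are approximate and my own; the exact interface has changed across kernel versions:

#include <linux/blk-mq.h>

/* Hardware dispatch queue handler: the block layer has already staged,
 * merged, and tagged the request in a per-CPU software queue. */
static blk_status_t nvm_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);
        /* A real driver would issue the command to the hardware queue
         * backing hctx and complete it later (e.g. from the interrupt
         * handler); here it is completed immediately. */
        blk_mq_end_request(rq, BLK_STS_OK);
        return BLK_STS_OK;
}

static const struct blk_mq_ops nvm_mq_ops = {
        .queue_rq = nvm_queue_rq,
};

static struct blk_mq_tag_set nvm_tag_set = {
        .ops          = &nvm_mq_ops,
        .nr_hw_queues = 4,            /* one per hardware submission queue */
        .queue_depth  = 64,           /* tags per hardware queue */
        .numa_node    = NUMA_NO_NODE,
};

/* blk_mq_alloc_tag_set(&nvm_tag_set) followed by
 * blk_mq_init_queue(&nvm_tag_set) creates the request queue that gets
 * attached to the gendisk exposed to userspace. */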
Mapping Strategies
• Per-core software queues, varying hardware dispatch queues
• Single software queue, varying hardware dispatch queues
Scalability = multiple hardware dispatch queues (see the mapping sketch below)
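A minimal sketch of the per-core mapping idea (an illustration, not the kernel's actual mapping code, which also accounts for NUMA topology):

/* Map a per-CPU software queue onto one of nr_hw_queues hardware
 * dispatch queues. With nr_hw_queues == 1 this degenerates to the
 * single-queue case; with one hardware queue per core it is 1:1. */
static unsigned int swq_to_hwq(unsigned int cpu, unsigned int nr_hw_queues)
{
        return cpu % nr_hw_queues;
}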
Multi-Queue IOPS
SQ – current block layer, Raw – block layer bypass, MQ – proposed design
[Chart: MQ scalability, IOPS for SQ, Raw, and MQ]

Multi-socket Scaling
[Chart: IOPS scaling across sockets for SQ, Raw, and MQ]
AIO interface
• Alternative to the synchronous interface (see the libaio sketch below)
• Prevents blocking of user threads
• Move the bio list to a lockless list
• Remove legacy code and locking
[Diagram: processes on Socket 1 and Socket 2, each issuing IO through libaio, the OS block layer, and the device driver, as on the Multi-Queue Block Layer slide]
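A self-contained user-space sketch of the asynchronous interface using libaio (an illustration, not taken from the slides). The device path /dev/nullb0 is a stand-in; link with -laio:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        io_context_t ctx;
        struct iocb cb, *cbs[1];
        struct io_event ev;
        void *buf;
        int fd;

        fd = open("/dev/nullb0", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        memset(&ctx, 0, sizeof(ctx));
        if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }
        if (posix_memalign(&buf, 512, 512)) { fprintf(stderr, "alloc failed\n"); return 1; }

        /* Queue one 512B read at offset 0; io_submit() returns without
         * blocking the calling thread. */
        io_prep_pread(&cb, fd, buf, 512, 0);
        cbs[0] = &cb;
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

        /* Reap the completion event (this call may block until it arrives). */
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) { fprintf(stderr, "io_getevents failed\n"); return 1; }
        printf("completed, res=%ld\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}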
Multi-Queue, 8 socket
[Chart: multi-queue results on the 8-socket system]
Latency Comparison
• Sync IO, 512B
• Queue depth grows with CPU count

1 socket (latency, µs):
  Cores   SQ     Raw    MQ
  1       1.5    0.8    1.2
  6       4.6    1.2    1.9

2 sockets (latency, µs):
  Cores   SQ     Raw    MQ
  1       1.9    1      1.4
  7       15.8   2.2    3.8
  12      29.4   2.1    5.4

MQ vs SQ: 2.3x (1 socket), 5.4x (2 sockets)
Latency Comparison

4 sockets (latency, µs):
  Cores   SQ     Raw    MQ
  1       2.7    1.5    2
  9       37     5.1    10
  32      251    34     84

8 sockets (latency, µs):
  Cores   SQ     Raw    MQ
  1       2.9    1.6    2.3
  11      386    7.9    10.5
  80      1200   10     153

MQ vs SQ: 2.9x (4 sockets), 7.8x (8 sockets)
Conclusions
• Current block layer is inadequate for NVM
• Proposed multi-queue block layer
  – 3.5x-10x increase in IOPS
  – 10x-38x reduction in latency
  – Supports multiple hardware queues
  – Simplifies driver development
  – Maintains industry-standard interfaces
Thank You & Questions
Implementation available at git://git.kernel.dk/?p=linux-block.git
Systems
• 1 Socket – Sandy Bridge-E, i7-3930K, 3.2 GHz, 12 MB L3
• 2 Socket – Westmere-EP, X5690, 3.46 GHz, 12 MB L3
• 4 Socket – Nehalem-EX, X7560, 2.66 GHz, 24 MB L3
• 8 Socket – Westmere-EX, E7-2870, 2.4 GHz, 30 MB L3