E8 STORAGE - Meetupfiles.meetup.com/18720713/High performance storage...12 Jan 2016 3.1 STORAGE STACK 101 3.2 APPLICATION Application invokes system calls (read, write, mmap) on files

E8 STORAGE

Come join us at [email protected]

https://goo.gl/FOuczd

HIGH PERFORMANCE STORAGE DEVICES IN KERNEL

E8 Storage

By Evgeny Budilovsky / @budevg

12 Jan 2016

STORAGE STACK 101

APPLICATION

  • Application invokes system calls (read, write, mmap) on files
      • Block device files - direct access to block device
      • Regular files - access through specific file system (ext4, btrfs, etc.)
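
To make the three system calls concrete, here is a minimal Python sketch that exercises write(2), read(2) and mmap(2) on a regular file. The helper name `demo_file_syscalls` and the scratch file name are made up for illustration; the same calls would work on a block device file, given sufficient permissions.

```python
import mmap
import os
import tempfile


def demo_file_syscalls(path):
    """Exercise write(2), read(2) and mmap(2) on one regular file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"hello block layer")    # write(2)
        os.lseek(fd, 0, os.SEEK_SET)          # rewind before reading
        via_read = os.read(fd, 5)             # read(2): first 5 bytes
        with mmap.mmap(fd, 0) as mapping:     # mmap(2): map the whole file
            via_mmap = bytes(mapping[6:11])   # bytes 6..10 through the mapping
    finally:
        os.close(fd)
    return via_read, via_mmap


with tempfile.TemporaryDirectory() as scratch_dir:
    via_read, via_mmap = demo_file_syscalls(os.path.join(scratch_dir, "demo.bin"))

print(via_read, via_mmap)
```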

FILE SYSTEM

  • System calls go through VFS layer
  • Specific file system logic applied
  • Read/write operations translated into read/write of memory pages
  • Page cache involved to reduce disk access
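
The translation from byte ranges to pages, and the effect of the page cache, can be sketched with a toy model. This is not kernel code: the 4 KiB page size, `pages_touched` and `ToyPageCache` are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096  # assumed page size; the real value comes from the platform


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return the page indexes covered by the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)


class ToyPageCache:
    """Counts how many page reads actually have to go to the (pretend) disk."""

    def __init__(self):
        self.pages = {}
        self.disk_reads = 0

    def read(self, offset, length):
        for index in pages_touched(offset, length):
            if index not in self.pages:
                self.disk_reads += 1                    # cache miss: fetch from disk
                self.pages[index] = b"\0" * PAGE_SIZE   # stand-in for fetched data


cache = ToyPageCache()
cache.read(4000, 200)   # straddles pages 0 and 1: two misses
cache.read(0, 100)      # page 0 is already cached: no new disk read
print(list(pages_touched(4000, 200)), cache.disk_reads)
```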

BIO LAYER (STRUCT BIO)

  • File system constructs a bio, which is the main unit of I/O
  • Dispatch bio to block layer / bypass driver (make_request_fn driver)

BLOCK LAYER (STRUCT REQUEST)

  • Setup struct request and move it to request queue (single queue per device)
  • IO scheduler can delay/merge requests
  • Dispatch requests
      • To hardware drivers (request_fn driver)
      • To scsi layer (scsi devices)
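
As a rough illustration of what "merge" means here, the toy function below coalesces requests that are contiguous on disk. Real I/O schedulers merge incrementally as bios arrive and respect size limits; the `(start_sector, sector_count)` tuples and `merge_requests` are assumptions for this sketch only.

```python
def merge_requests(requests):
    """Merge requests that are contiguous on disk.

    Each request is a (start_sector, sector_count) tuple.
    """
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)   # extend previous request
        else:
            merged.append((start, count))                   # start a new request
    return merged


queue = [(8, 8), (100, 8), (0, 8), (16, 8)]
dispatched = merge_requests(queue)
print(dispatched)
```

Four queued requests become two dispatched ones, which is the benefit the scheduler is after on devices where per-request overhead dominates.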

SCSI LAYER (STRUCT SCSI_CMND)

  • Translate request into scsi command
  • Dispatch command to low level scsi drivers

WRITE (O_DIRECT)

    # Application
    perf record -g dd if=/dev/zero of=1.bin bs=4k \
         count=1000000 oflag=direct
    ...
    perf report --call-graph --stdio

    # Submit into block queue (bypass page cache)
    write
      system_call_fastpath
        sys_write
          vfs_write
            new_sync_write
              ext4_file_write_iter
                __generic_file_write_iter
                  generic_file_direct_write
                    ext4_direct_IO
                      ext4_ind_direct_IO
                        __blockdev_direct_IO
                          do_blockdev_direct_IO
                            submit_bio
                              generic_make_request

    # after io scheduling submit to scsi layer and to the hardware
    ...
    io_schedule
      blk_flush_plug_list
        queue_unplugged
          __blk_run_queue
            scsi_request_fn
              scsi_dispatch_cmd
                ata_scsi_queuecmd

WRITE (WITH PAGE CACHE)

    # Application
    perf record -g dd if=/dev/zero of=1.bin bs=4k \
         count=1000000
    ...
    perf report --call-graph --stdio

    # write data into page cache
    write
      system_call_fastpath
        sys_write
          vfs_write
            new_sync_write
              ext4_file_write_iter
                __generic_file_write_iter
                  generic_perform_write
                    ext4_da_write_begin
                      grab_cache_page_write_begin
                        pagecache_get_page
                          __page_cache_alloc
                            alloc_pages_current

    # asynchronously flush dirty pages to disk
    kthread
      worker_thread
        process_one_work
          bdi_writeback_workfn
            wb_writeback
              __writeback_inodes_wb
                writeback_sb_inodes
                  __writeback_single_inode
                    do_writepages
                      ext4_writepages
                        mpage_map_and_submit_buffers
                          mpage_submit_page
                            ext4_bio_write_page
                              ext4_io_submit
                                submit_bio
                                  generic_make_request
                                    blk_queue_bio

HIGH PERFORMANCE BLOCK DEVICES

CHANGES IN STORAGE WORLD

  • Rotational devices (HDD)
      • "hundreds" of IOPS, "tens" of milliseconds latency
  • Today, flash based devices (SSD)
      • "hundreds of thousands" of IOPS, "tens" of microseconds latency
      • Large internal data parallelism
  • Increase in cores and NUMA architecture
  • New standardized storage interfaces (NVMe)
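
The link between latency, parallelism and IOPS follows from Little's law (throughput = outstanding I/Os / latency). The sketch below plugs in illustrative numbers: 10 ms and 50 µs are assumed values picked from the "tens of" ranges above, and the queue depth of 16 is likewise an assumption.

```python
def iops(latency_us, outstanding_ios=1):
    """Little's law: completed I/Os per second for a given latency and concurrency."""
    return outstanding_ios * 1_000_000 / latency_us


hdd = iops(10_000)                           # 10 ms per I/O, one at a time
ssd_serial = iops(50)                        # 50 us per I/O, one at a time
ssd_parallel = iops(50, outstanding_ios=16)  # same latency, 16 I/Os in flight

print(hdd, ssd_serial, ssd_parallel)
```

With these numbers a single outstanding I/O is not enough to reach hundreds of thousands of IOPS even at microsecond latency, which is why the device's internal parallelism and a multi-core host both matter.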

NVM EXPRESS

HIGH PERFORMANCE STANDARD

  • Standardized interface for PCIe SSDs
  • Designed from the ground up to exploit
      • Low latency of today's PCIe-based SSDs
      • Parallelism of today's CPUs

BEATS AHCI STANDARD FOR SATA HOSTS

|                                                 | AHCI                                                        | NVMe                                    |
|-------------------------------------------------|-------------------------------------------------------------|-----------------------------------------|
| Maximum queue depth                             | 1 command queue, 32 commands per queue                      | 64K queues, 64K commands per queue      |
| Un-cacheable register accesses (2K cycles each) | 6 per non-queued command, 9 per queued command              | 2 per command                           |
| MSI-X and interrupt steering                    | Single interrupt; no steering                               | 2K MSI-X interrupts                     |
| Parallelism & multiple threads                  | Requires synchronization lock to issue command              | No locking                              |
| Efficiency for 4KB commands                     | Command parameters require two serialized host DRAM fetches | Command parameters in one 64B fetch     |
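
A quick back-of-the-envelope calculation shows how far apart the two interfaces are. The figures are the rounded ones from the comparison above; reading "64K" as 64 × 1024 and "2K cycles" as 2000 cycles per register access are assumptions of this sketch.

```python
CYCLES_PER_REGISTER_ACCESS = 2000  # "2K cycles", read as 2000 (assumption)

INTERFACES = {
    # AHCI uses the queued-command figure (9 accesses) for a like-for-like comparison.
    "AHCI": {"queues": 1, "commands_per_queue": 32, "register_accesses": 9},
    "NVMe": {"queues": 64 * 1024, "commands_per_queue": 64 * 1024, "register_accesses": 2},
}


def max_outstanding(name):
    """Upper bound on commands in flight: queues times commands per queue."""
    spec = INTERFACES[name]
    return spec["queues"] * spec["commands_per_queue"]


def register_cycles_per_command(name):
    """CPU cycles spent on un-cacheable register accesses for one command."""
    return INTERFACES[name]["register_accesses"] * CYCLES_PER_REGISTER_ACCESS


for name in INTERFACES:
    print(name, max_outstanding(name), register_cycles_per_command(name))
```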

IOPS

  • Consumer grade NVMe SSDs (enterprise grade have much better performance)
  • 100% writes less impressive due to NAND limitation

BANDWIDTH

LATENCY

THE OLD STACK DOES NOT SCALE

THE NULL_BLK EXPERIMENT

  • Jens Axboe (Facebook)
  • null_blk configuration
      • queue_mode=1 (rq)
      • completion_nsec=0
      • irqmode=0 (none)
  • fio
      • Each thread does pread(2), 4k, randomly, O_DIRECT
      • Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)
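
The shape of the per-thread workload (random 4k pread(2) calls from several threads) can be sketched in Python. This is not a benchmark and not how fio is implemented: it reads a scratch file instead of a null_blk device, omits O_DIRECT so it runs unprivileged on any file system, and ignores NUMA placement. All function names are made up for illustration.

```python
import os
import random
import tempfile
import threading

BLOCK_SIZE = 4096


def random_reader(fd, file_blocks, reads, results, slot):
    """Issue `reads` random 4 KiB pread(2) calls; count the full-block reads."""
    rng = random.Random(slot)
    full = 0
    for _ in range(reads):
        offset = rng.randrange(file_blocks) * BLOCK_SIZE   # block-aligned offset
        if len(os.pread(fd, BLOCK_SIZE, offset)) == BLOCK_SIZE:
            full += 1
    results[slot] = full


def run_workload(path, threads=4, reads_per_thread=1000):
    """Run the random-read loop in several threads sharing one file descriptor."""
    fd = os.open(path, os.O_RDONLY)   # the experiment additionally uses O_DIRECT
    try:
        file_blocks = os.fstat(fd).st_size // BLOCK_SIZE
        results = [0] * threads
        workers = [
            threading.Thread(
                target=random_reader,
                args=(fd, file_blocks, reads_per_thread, results, slot),
            )
            for slot in range(threads)
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
    finally:
        os.close(fd)
    return sum(results)


with tempfile.NamedTemporaryFile() as scratch:
    scratch.write(b"\0" * (BLOCK_SIZE * 256))   # 1 MiB scratch file
    scratch.flush()
    completed = run_workload(scratch.name)

print(completed)
```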

LIMITED PERFORMANCE

PERF

  • Spinlock contention

PERF

  • 40% from sending request (blk_queue_bio)
  • 20% from completing request (blk_end_bidi_request)
  • 18% sending bio from the application to the bio layer (blk_flush_plug_list)

OLD STACK HAS SEVERE SCALING ISSUES

PROBLEMS

  • Good scalability before block layer (file system, page cache, bio)
  • Single shared queue is a problem
  • We can use bypass mode driver which will work with bios without getting into shared queue
  • Problem with bypass driver: code duplication

BLOCK MULTI-QUEUE TO THE RESCUE

HISTORY

  • Prototyped in 2011
  • Paper in SYSTOR 2013
  • Merged into linux 3.13 (2014)
  • A replacement for old block layer with different driver API
  • Drivers gradually converted to blk-mq (scsi-mq, nvme-core, virtio_blk)

ARCHITECTURE - 2 LAYERS OF QUEUES

  • Application works with per-CPU software queue
  • Multiple software queues map into hardware queues
  • Number of HW queues is based on number of HW contexts supported by device
  • Requests from HW queue submitted by low level driver to the device
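
One way to picture the two layers is as a table from CPU (software queue) to hardware queue. The mapping below is a toy that assigns contiguous groups of CPUs to each hardware queue; the kernel's actual mapping takes CPU topology into account, and `map_sw_to_hw` is a hypothetical helper, not a blk-mq function.

```python
def map_sw_to_hw(nr_cpus, nr_hw_queues):
    """Assign each per-CPU software queue to a hardware queue.

    Toy policy: neighbouring CPUs share a hardware queue, and no more
    hardware queues are used than there are CPUs.
    """
    nr_hw_queues = min(nr_hw_queues, nr_cpus)
    return {cpu: cpu * nr_hw_queues // nr_cpus for cpu in range(nr_cpus)}


single_queue = map_sw_to_hw(nr_cpus=8, nr_hw_queues=1)   # legacy-like device
per_socket = map_sw_to_hw(nr_cpus=8, nr_hw_queues=2)     # one queue per socket
per_cpu = map_sw_to_hw(nr_cpus=8, nr_hw_queues=8)        # one queue per core

print(per_socket)
```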

ARCHITECTURE - ALLOCATION AND TAGGING

  • IO tag
      • Is an integer value that uniquely identifies IO submitted to hardware
      • On completion we can use the tag to find out which IO was completed
      • Legacy drivers maintained their own implementation of tagging
  • With block-mq, requests allocated at initialization time (based on queue depth)
  • Tag and request allocations combined
  • Avoids per request allocations in driver and tag maintenance
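
The idea of combining tag and request allocation can be sketched as a preallocated array where the slot index is the tag. This is a simplified model, not the kernel's tag allocator; the class `TagSet` and its methods are hypothetical names.

```python
class TagSet:
    """Preallocate `depth` request slots; the slot index doubles as the tag."""

    def __init__(self, depth):
        # All requests are allocated once, up front, based on queue depth.
        self.requests = [{"tag": tag, "payload": None} for tag in range(depth)]
        self.free_tags = list(range(depth - 1, -1, -1))   # pop() hands out 0 first

    def start(self, payload):
        """Claim a free request slot; return its tag, or None if the queue is full."""
        if not self.free_tags:
            return None
        tag = self.free_tags.pop()
        self.requests[tag]["payload"] = payload
        return tag

    def complete(self, tag):
        """Look up the finished I/O by tag and release its slot."""
        request = self.requests[tag]
        payload, request["payload"] = request["payload"], None
        self.free_tags.append(tag)
        return payload


tags = TagSet(depth=2)
first = tags.start("read A")
second = tags.start("write B")
rejected = tags.start("read C")     # queue depth exhausted
finished = tags.complete(first)     # hardware reports completion of tag `first`
reused = tags.start("read C")       # the freed tag is handed out again

print(first, second, rejected, finished, reused)
```

No allocation happens on the submission path: starting an I/O is a pop from the free list, and completion is an array lookup by tag.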

ARCHITECTURE - I/O COMPLETIONS

  • Generally we want completions to be as local as possible
  • Use IPIs to complete requests on submitting node
  • Old block layer was using software interrupts instead of IPIs
  • Best case there is an SQ/CQ pair for each core, with MSI-X interrupt setup for each CQ, steered to the relevant core
  • IPIs used when there aren't enough interrupts/HW queues

NULL_BLK EXPERIMENT AGAIN

  • null_blk configuration
      • queue_mode=2 (multiqueue)
      • completion_nsec=0
      • irqmode=0 (none)
      • submit_queues=32
  • fio
      • Each thread does pread(2), 4k, randomly, O_DIRECT
      • Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)

SUCCESS

WHAT ABOUT HARDWARE WITHOUT MULTIQUEUE SUPPORT

  • Same null_blk setup
  • 1/2/n hw queues in blk-mq
  • mq-1 and mq-2 are so close because this is a 2-socket system
  • NUMA issues are eliminated once there is a queue per NUMA node

CONVERSION PROGRESS

  • mtip32xx (micron SSD)
  • NVMe
  • virtio_blk, xen block driver
  • rbd (ceph block)
  • loop
  • ubi
  • SCSI (scsi-mq)

SUMMARY

  • Storage device performance has accelerated from hundreds of IOPS to hundreds of thousands of IOPS
  • Bottlenecks in software gradually eliminated by exploiting concurrency and introducing lock-less architecture
  • blk-mq is one example

Questions ?

REFERENCES

1. Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
2. The Performance Impact of NVMe and NVMe over Fabrics
3. Null block device driver
4. blk-mq: new multi-queue block IO queueing mechanism
5. fio
6. perf
7. Solving the Linux storage scalability bottlenecks - by Jens Axboe