KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC
Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, Lintao Zhang
Presentation by Mieczysław Krawiarz. All illustrations are extracted from the paper.


In-Memory Key-Value Stores
● Very popular, not only for web caching
● Traditionally use the OS and TCP/IP stack
  ○ OS primitives are slow
  ○ CPU performance doesn't match modern NICs

KVS Architectures

RDMA
● Direct Memory Access
  ○ In PCI, many devices are allowed to become a bus master
  ○ Memory transfer between components without involving the CPU
● Remote Direct Memory Access
  ○ DMA analogue allowing memory access between separate computers, see e.g. RFC 5040
  ○ Memory transfer between computers without involving the OSes
  ○ One-sided RDMA allows bypassing the remote CPU (see the sketch below)
● Both provide high-throughput, low-latency communication and decrease CPU usage
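To make the one-sided case concrete, below is a minimal C sketch (not from the paper) of posting an RDMA read with libibverbs; the remote CPU never sees the request. Queue-pair creation, memory registration, and connection setup are omitted, and the function name and parameters are illustrative.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post a one-sided RDMA READ that pulls `len` bytes from a
 * remote address into a local registered buffer, without involving the
 * remote CPU. Queue pair setup, memory registration, and connection
 * establishment are omitted for brevity. */
int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                   void *local_buf, uint64_t remote_addr,
                   uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,          /* local memory region key */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* one-sided read */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;  /* obtained out of band */
    wr.wr.rdma.rkey        = rkey;         /* remote key, ditto */
    return ibv_post_send(qp, &wr, &bad_wr);
}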

KVS Architectures

KV-Direct Architecture

Goals

● High batch throughput for small KVs
● Low tail latency
● High efficiency under write-intensive workloads
● Fast atomic operations on popular keys
● Support for vector-type operations
● Power efficiency

KV Operations

Programmable NIC with FPGA

PCIe DMA Performance

Optimizing for DMA
1. Minimize DMA requests per KV operation
2. Hide PCIe latency while maintaining consistency
3. Dispatch load between NIC DRAM and host memory

KV Processor Architecture

Hash Table
The storage is partitioned into two parts:

1. A hash index, containing hash buckets
2. Dynamically allocated memory managed by a slab allocator

Their proportion is fixed at initialization by a constant, the hash index ratio.

Hash Index

● Each row is a hash bucket, which contains several hash slots and some metadata
● A hash slot consists of an address into the dynamically allocated memory and a key hash used as a parallel-lookup optimization
● KVs smaller than the inline threshold are kept directly in hash buckets; their positions are indicated by a bitmap (layout sketch below)
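A minimal C sketch of one possible bucket layout follows; the field names, slot count, and bit widths are assumptions for illustration, not the paper's exact encoding.

#include <stdint.h>

/* Illustrative layout of one hash bucket (sizes are assumptions).
 * Each slot holds an address into the slab-allocated region plus a
 * key hash used to filter candidate keys in parallel; a bitmap marks
 * which regions of the bucket hold inline KVs (those smaller than the
 * inline threshold). */
#define SLOTS_PER_BUCKET 8

struct hash_slot {
    uint32_t addr;   /* offset into dynamically allocated memory */
    uint16_t khash;  /* key hash for parallel comparison */
};

struct hash_bucket {
    struct hash_slot slots[SLOTS_PER_BUCKET];
    uint16_t inline_bitmap;  /* which regions hold inline KVs */
    uint8_t  inline_data[];  /* inline KVs below the threshold */
};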

Comparison of inline thresholds

Slab Memory Allocator - slab allocation algorithm
● Caching on the NIC's DRAM, synchronized with host memory in batches
● Slabs of sizes 32 B, 64 B, …, 512 B; for each size there is a free slab pool
● Each slab pool is an array of slab entries; each entry is an address plus a slab type
● A global allocation bitmap helps merge small free slabs into larger ones
● Merging is done lazily
● The main logic runs on the CPU and communicates with the NIC through PCIe (data-structure sketch below)
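Below is a hedged C sketch of how such an allocator's data structures might look; everything beyond what the slide states (names, field widths, the pool representation) is an assumption.

#include <stdint.h>

/* Sketch of the slab allocator's data layout. Slab sizes double from
 * 32 B to 512 B, giving 5 size classes; each pool is an array of
 * free-slab entries, where an entry packs an address and its slab
 * type (size class). */
#define NUM_SLAB_TYPES 5           /* 32, 64, 128, 256, 512 bytes */

struct slab_entry {
    uint32_t addr : 29;            /* address of the free slab */
    uint32_t type : 3;             /* size class index */
};

struct slab_pool {                 /* one free pool per size class */
    struct slab_entry *entries;
    uint32_t count;
};

struct slab_allocator {
    struct slab_pool pools[NUM_SLAB_TYPES];
    uint64_t *alloc_bitmap;        /* global allocation bitmap: one bit
                                      per minimum-size slab, used to find
                                      mergeable free neighbors lazily */
};

/* Pop a free slab of the requested size class. Splitting larger slabs
 * and the lazy merge pass (run on the host CPU, synchronized over PCIe
 * in batches) are omitted. */
static inline int slab_alloc(struct slab_allocator *a, int type,
                             struct slab_entry *out)
{
    struct slab_pool *p = &a->pools[type];
    if (p->count == 0)
        return -1;                 /* would split a larger slab here */
    *out = p->entries[--p->count];
    return 0;
}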

Slab Memory Allocator

Execution time of merging 4 billion slab slots

Out-of-Order Execution Engine
1. Use parallelism to increase throughput and decrease latency
2. Around 256 KV operations must be in flight to keep the hardware fully utilized
3. Subsequent operations on the same key should be fast

Out-of-Order Execution Engine
● A reservation station tracks all in-flight operations together with their execution contexts
● KV operations are stored in the chip's BRAM, indexed by hashes of their keys
● Operations with the same key hash are treated as dependent
● 1024 hashes are stored in the reservation station, keeping the collision probability ≤ 1/4
● The newest values are cached for data forwarding (software sketch below)
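A software analogue of this mechanism might look as follows (a sketch only; the real engine is FPGA logic over BRAM, and all names are illustrative). Taking the key modulo the table size stands in for the key hash.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Each of the 1024 lines chains in-flight operations that share a key
 * hash, so dependent operations execute in order while independent
 * ones proceed out of order; the newest value is kept for data
 * forwarding. Completion handling (draining the chain and writing
 * fwd_value) is omitted. */
#define RS_HASHES 1024

struct kv_op {
    uint64_t key;
    struct kv_op *next_dependent;  /* next op with the same key hash */
};

struct rs_line {
    bool in_flight;                /* an op with this hash is executing */
    struct kv_op *head, *tail;     /* queued dependent operations */
    uint8_t fwd_value[64];         /* newest value, for forwarding */
};

static struct rs_line station[RS_HASHES];

/* Issue: if no in-flight op shares the hash, execute immediately;
 * otherwise append to the dependency chain and wait for forwarding. */
static bool rs_issue(struct kv_op *op)
{
    struct rs_line *line = &station[op->key % RS_HASHES];
    if (!line->in_flight) {
        line->in_flight = true;
        return true;               /* caller may start the DMA now */
    }
    op->next_dependent = NULL;
    if (line->tail) line->tail->next_dependent = op;
    else            line->head = op;
    line->tail = op;
    return false;                  /* queued behind a dependent op */
}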

Effectiveness of Out-of-Order Execution Engine

DRAM Load Dispatcher
1. Use memory both on the host and on the NIC
   a. DRAM on NIC: 4 GiB storage, 12.8 GB/s throughput, low latency
   b. DRAM on host: 64 GiB storage, 14 GB/s throughput
2. How to get good latency and throughput out of both?
3. Split host memory into cacheable and non-cacheable parts
4. Load dispatch ratio l: the fraction of the host's DRAM that is cacheable
5. If h(l) is the cache hit probability, then load balance is achieved when
   l · h(l) / tput_NIC = (1 − l · h(l)) / tput_host
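A back-of-the-envelope check, assuming the balance condition above: with tput_NIC ≈ 12.8 GB/s and tput_host ≈ 14 GB/s, balance requires l · h(l) = 12.8 / (12.8 + 14) ≈ 0.48, i.e. roughly half of all memory accesses should be served from the NIC's DRAM.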

DRAM Load Dispatcher

Determining optimal hash index ratio

Vector Operation Decoder
Problems:
1. The network is slower than PCIe (2 μs latency, 5 GB/s throughput)
2. RDMA packets have larger overhead than PCIe packets

Vector Operation Decoder
Solution:
● Client-side batching
  ○ Batching multiple KV operations into one packet
  ○ Support for vector operations
● The VOD decodes the multiple KV operations contained in one packet (format sketch below)
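One plausible packed-request format is sketched below in C; the header fields and opcode values are assumptions for illustration, not the paper's wire format.

#include <stdint.h>
#include <string.h>

/* Several KV operations are packed back-to-back into one network
 * packet; the vector operation decoder (VOD) on the NIC splits the
 * packet back into individual operations. */
enum kv_opcode { KV_GET = 0, KV_PUT = 1, KV_ATOMIC_ADD = 2 };

struct kv_op_hdr {
    uint8_t  opcode;
    uint8_t  reserved;
    uint16_t key_len;
    uint16_t val_len;        /* 0 for GET */
} __attribute__((packed));

/* Append one operation to a batch buffer; returns the new offset, or
 * -1 if the packet is full and should be flushed first. */
static int batch_append(uint8_t *pkt, size_t off, size_t cap,
                        enum kv_opcode op, const void *key, uint16_t klen,
                        const void *val, uint16_t vlen)
{
    size_t need = sizeof(struct kv_op_hdr) + klen + vlen;
    if (off + need > cap)
        return -1;           /* flush the packet, then start a new one */
    struct kv_op_hdr h = { .opcode = op, .key_len = klen, .val_len = vlen };
    memcpy(pkt + off, &h, sizeof(h));
    memcpy(pkt + off + sizeof(h), key, klen);
    if (vlen)
        memcpy(pkt + off + sizeof(h) + klen, val, vlen);
    return (int)(off + need);
}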

Throughput (GB/s) of operations on vectors

Efficiency of Network Batching

Hardware - FPGA
● Intel Stratix V FPGA based programmable NIC
● 2x PCIe Gen3 x8
● 4 GiB DDR3-1600 single-channel DRAM
● 180 MHz clock, giving an upper bound of 180 Mops at one operation per cycle
● For development: Intel FPGA SDK for OpenCL

Hardware - servers
● Testbed of eight servers and a switch
● Each server has:
  ○ An 8-core Xeon processor with hyper-threading disabled
  ○ 128 GB of 8-channel ECC memory
  ○ Arch Linux
● Tests:
  ○ YCSB workload
  ○ Skewed Zipf workload with 0.99 skewness (long-tail workload)

Throughput under YCSB workload

Latency under peak throughput of YCSB workload

Comparison with other KVS systems

Scaling to more NICs

Questions?