A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring
Patrick P. C. Lee1, Tian Bu2, Girish Chandranmenon2
1 The Chinese University of Hong Kong, 2 Bell Labs, Alcatel-Lucent
April 2010
Outline
Motivation
MCRingBuffer, a multi-core ring buffer
Parallel network monitoring prototype
Conclusions
Network Traffic Monitoring
Monitoring data streams in today’s networks is essential for network management:
accounting, resource provisioning, failure diagnosis, intrusion detection/prevention
Goal: achieve line-rate monitoring
Monitoring speed must keep up with the link bandwidth (i.e., prepare for the worst)
Challenges:
Data volume keeps increasing (e.g., to Gigabit scales)
Single-CPU systems may no longer support line-rate monitoring
Can Multi-Core Help?
Can multi-core architectures help line-rate monitoring? Parallelize packet processing
The answer should be “yes”... yet exploiting the full potential of multi-core is still challenging
Inter-core communication has overhead:
Upper layer: protocol messages
Lower layer: thread synchronization in shared data structures
[Figure: single-core case (raw packets into one core) vs. multi-core case (raw packets distributed across a quad-core CPU)]
Can Multi-Core Help?
Multi-core helps only if we minimize inter-core communication overhead
Let’s focus on minimizing thread synchronization
This benefits a broad class of multi-threaded network monitoring applications
Our Contribution
Design a lock-free, cache-efficient multi-core synchronization mechanism for high-speed network traffic monitoring
Why lock-free? Allows concurrent thread accesses
Why cache-efficient? Saves expensive memory accesses
We embed the mechanism into MCRingBuffer, a lock-free, cache-efficient shared ring buffer tailored for multi-core architectures
Producer/Consumer Problem
Classical OS problem
Ring buffer: bounded buffer with fixed number of slots
Thread synchronization:
Producer inserts elements when the buffer is not full
Consumer extracts elements when the buffer is not empty
First-in-first-out (FIFO): elements are extracted in the same order as they were inserted
[Figure: producer inserts elements into the ring buffer; consumer extracts them]
Producer/Consumer Problem
Ring buffer in the multi-core context:
[Figure: producer and consumer run on separate cores, each with its own L1 cache and a shared L2 cache; the control variables and the ring buffer reside in memory, accessed over the system bus]
Thread synchronization operates on control variables. Make the operations as cache-friendly as possible.
Lamport’s Lock-Free Ring Buffer
Operates on two control variables, read and write, which point to the next read and write slots, respectively
[Figure: ring buffer with slots 0 to N-1 and the read/write pointers]
Insert(T element)
1: wait until NEXT(write) != read
2: buffer[write] = element
3: write = NEXT(write)

Extract(T* element)
1: wait until read != write
2: *element = buffer[read]
3: read = NEXT(read)
[Lamport, Comm. of ACM, 1977]
NEXT(x) = (x + 1) % N
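
For concreteness, here is a minimal C sketch of this algorithm (the type name lamport_queue and the use of int elements are illustrative choices, not from the slides; the sketch relies on the atomicity and sequential-consistency assumptions listed on the MCRingBuffer Assumptions slide):

#include <stddef.h>

#define N 2048                      /* number of slots */
#define NEXT(x) (((x) + 1) % N)

typedef struct {
    int buffer[N];                  /* int elements, for illustration */
    volatile size_t read;           /* next slot to read */
    volatile size_t write;          /* next slot to write */
} lamport_queue;

/* Producer: busy-wait while the buffer is full, then insert. */
void insert(lamport_queue *q, int element)
{
    while (NEXT(q->write) == q->read)
        ;                           /* full: busy-wait */
    q->buffer[q->write] = element;
    q->write = NEXT(q->write);
}

/* Consumer: busy-wait while the buffer is empty, then extract. */
void extract(lamport_queue *q, int *element)
{
    while (q->read == q->write)
        ;                           /* empty: busy-wait */
    *element = q->buffer[q->read];
    q->read = NEXT(q->read);
}

Note that one slot is intentionally left unused so that the full condition (NEXT(write) == read) and the empty condition (read == write) remain distinguishable.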
Previous Work
FastForward [Giacomoni et al., PPoPP, 2008]:
couples data and control operations
needs a special NULL data element defined by applications
Hardware-primitive ring buffers:
support multiple producers/multiple consumers
use hardware synchronization primitives (e.g., compare-and-swap)
Hardware primitives are expensive in general
MCRingBuffer Overview
Goal: use Lamport’s ring buffer as a building block to further minimize cost of thread synchronization
Properties:
Lock-free: allows concurrent accesses by the producer and consumer
Cache-efficient: improves the cache locality of synchronization
Generic: no assumptions on data types or insert/extract patterns
Deployable: works on general-purpose multi-core CPUs
Components:
Cache-line protection
Batch updates of control variables
MCRingBuffer Assumptions
Assumptions inherited from Lamport’s ring buffer:
single producer and single consumer
reads and writes of the read/write variables are atomic
memory accesses follow sequential consistency
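
As a side note (my addition, not from the slides): plain C loads and stores do not by themselves guarantee sequential consistency on modern compilers and CPUs, so a present-day port would express the control-variable accesses with C11 atomics, whose default ordering is sequentially consistent:

#include <stdatomic.h>

/* Hypothetical control variables; the names are illustrative. */
static atomic_int read_idx;
static atomic_int write_idx;

static void example(void)
{
    /* atomic_load/atomic_store default to memory_order_seq_cst,
       which matches the assumption above. */
    int r = atomic_load(&read_idx);
    atomic_store(&write_idx, r + 1);
}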
Cache-line Protection
Cache is accessed in units of cache lines
False sharing occurs when two threads access different variables on the same cache line:
the cache line is invalidated when one thread modifies a variable
the cache line is then reloaded from memory when the other thread reads a different variable on it, even if that variable is unchanged
[Figure: read, write, and N laid out on the same cache line]
read/write are modified frequently for thread synchronization
N (the ring buffer size) is reloaded from memory even though it is constant
Cache-line Protection
Add padding bytes to avoid false sharing
[Figure: read, write, and cachePad1 fill one cache line; N and cachePad2 fill another]

int read
int write
char cachePad1[CL - 2*sizeof(int)]
int N
char cachePad2[CL - sizeof(int)]

CL = cache line size
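
Rendered as a C struct, the layout could look as follows (a sketch; the 64-byte CACHE_LINE value is an assumption matching common x86 hardware such as the Xeon used later in the evaluation):

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

struct control_vars {
    /* Frequently modified shared variables: keep them on their own line. */
    volatile int read;
    volatile int write;
    char cachePad1[CACHE_LINE - 2 * sizeof(int)];

    /* Constant: padded so that updates to read/write never invalidate
       the line holding N. */
    int N;
    char cachePad2[CACHE_LINE - sizeof(int)];
};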
Cache-line Protection
Use cache-line protection to minimize memory accesses

[Figure: four padded cache lines: shared variables (read, write, cachePad1); consumer’s local variables (localWrite, nextRead, cachePad2); producer’s local variables (localRead, nextWrite, cachePad3); constants (N, cachePad4)]

Shared variables are the main controls of synchronization
Use local variables to “guess” the shared variables (see the C sketch after the batch-update pseudocode below)
Goal: minimize the frequency of reading shared control variables
Batch Updates of Control Variables
Intuition: nextRead/nextWrite are the positions to read/write next
Update the shared read/write only after every batchSize reads/writes
Producer:
buffer[nextWrite] = element
nextWrite = NEXT(nextWrite)
wBatch++
if (wBatch >= batchSize) {
    write = nextWrite
    wBatch = 0
}

Consumer:
*element = buffer[nextRead]
nextRead = NEXT(nextRead)
rBatch++
if (rBatch >= batchSize) {
    read = nextRead
    rBatch = 0
}
Goal: minimize the frequency of writing shared control variables
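
Putting cache-line protection, the local-variable “guesses”, and batch updates together, here is a minimal self-contained C sketch (a reconstruction for illustration, not the paper’s verbatim code; the 64-byte cache line and batchSize of 50 are assumptions borrowed from the evaluation setup, and the busy-wait loops rely on the atomicity and sequential-consistency assumptions above):

#include <stddef.h>

#define CACHE_LINE 64                /* assumed cache-line size in bytes */
#define N 2048                       /* ring buffer capacity (slots) */
#define BATCH_SIZE 50                /* batchSize; 50 is used in Experiment 3 */
#define NEXT(x) (((x) + 1) % N)

typedef struct {
    /* Shared control variables (own cache line) */
    volatile int read;
    volatile int write;
    char pad1[CACHE_LINE - 2 * sizeof(int)];

    /* Consumer's local variables (own cache line) */
    int localWrite;
    int nextRead;
    int rBatch;
    char pad2[CACHE_LINE - 3 * sizeof(int)];

    /* Producer's local variables (own cache line) */
    int localRead;
    int nextWrite;
    int wBatch;
    char pad3[CACHE_LINE - 3 * sizeof(int)];

    int buffer[N];                   /* int elements, for illustration */
} MCRingBuffer;

/* Producer-only code path. */
void Insert(MCRingBuffer *q, int element)
{
    int afterNextWrite = NEXT(q->nextWrite);
    if (afterNextWrite == q->localRead) {   /* guess: buffer may be full */
        while (afterNextWrite == q->read)
            ;                               /* truly full: busy-wait */
        q->localRead = q->read;             /* refresh the guess */
    }
    q->buffer[q->nextWrite] = element;
    q->nextWrite = afterNextWrite;
    if (++q->wBatch >= BATCH_SIZE) {        /* publish write once per batch */
        q->write = q->nextWrite;
        q->wBatch = 0;
    }
}

/* Consumer-only code path. */
void Extract(MCRingBuffer *q, int *element)
{
    if (q->nextRead == q->localWrite) {     /* guess: buffer may be empty */
        while (q->nextRead == q->write)
            ;                               /* truly empty: busy-wait */
        q->localWrite = q->write;           /* refresh the guess */
    }
    *element = q->buffer[q->nextRead];
    q->nextRead = NEXT(q->nextRead);
    if (++q->rBatch >= BATCH_SIZE) {        /* publish read once per batch */
        q->read = q->nextRead;
        q->rBatch = 0;
    }
}

Note how the shared read/write are read only when a guess fails and written only once per batch; everything else touches thread-private cache lines.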
Batch Updates of Control Variables
Limitation: read/write are advanced on a per-batch basis
Elements may not be extracted even when the buffer is not empty
However, if the elements are raw packets in high-speed networks, read/write will be updated regularly
Correctness of MCRingBuffer
Correctness is based on Lamport’s ring buffer:
Lamport’s:
Insert only if write - read < N
Extract only if read < write
We prove for MCRingBuffer:
Insert only if nextWrite - nextRead < N
Extract only if nextRead < nextWrite
Details in the paper.
Evaluation
Hardware: Intel Xeon 5355 quad-core
sibling cores: a pair of cores sharing an L2 cache
non-sibling cores: a pair of cores not sharing an L2 cache
Ring buffers:
LockRingBuffer: lock-based ring buffer
BasicRingBuffer: Lamport’s ring buffer
MCRingBuffer:
batchSize = 1: cache-line protection only
batchSize > 1: cache-line protection + batch updates of control variables
Metrics:
Throughput: number of insert/extract pairs per second
Number of L2 cache misses: number of cache-line reload operations
Experiment 1: Throughput vs. element size

[Plots: throughput vs. element size on sibling cores and on non-sibling cores; buffer capacity = 2K elements]

MCRingBuffer with batchSize > 1 has a higher throughput gain (up to 5x) for smaller element sizes
Experiment 2: Throughput vs. buffer capacity

[Plots: throughput vs. buffer capacity on sibling cores and on non-sibling cores; element size = 128 bytes]

MCRingBuffer’s throughput stays stable once the buffer capacity is large enough
Experiment 3: Code profiling with the Intel VTune Performance Analyzer

Metric numbers for 10M inserts/extracts (element size = 8 bytes, capacity = 2K elements):

Metric                    BasicRingBuffer    MCRingBuffer (batchSize = 50)
# core cycles             1130M / 1097M      137M / 113M
# retired instructions    358M / 287M        231M / 219M
# L2 cache misses         746K / 808K        102K / 80K

MCRingBuffer improves cache locality
Recap of Evaluation
MCRingBuffer improves throughput in various scenarios:
different data sizes
different buffer capacities
sibling/non-sibling cores
MCRingBuffer has a higher throughput gain via:
careful organization of control variables
careful accesses to control variables
MCRingBuffer’s gain does not require any special insert/extract patterns
Parallel Traffic Monitoring
Applying MCRingBuffer to parallel traffic monitoring
[Figure: raw packets enter a Dispatcher, which feeds multiple SubAnalyzers through ring buffers carrying decoded packets; the SubAnalyzers in turn feed a MainAnalyzer through ring buffers carrying state reports]
Parallel Traffic Monitoring
Dispatch stage:
Decode raw packets
Distribute decoded packets by (srcIP, dstIP)
SubAnalysis stage:
Local analysis on address pairs
e.g., 5-tuple flow stats, vertical portscans
MainAnalysis stage:
Global analysis: aggregate the results of all SubAnalyzers
e.g., per-source volume, horizontal portscans

[Figure: Dispatch → SubAnalysis → MainAnalysis pipeline]

Evaluation results: MCRingBuffer helps scale up packet processing throughput (details in the paper)
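
As a purely illustrative sketch of the Dispatch stage’s distribution rule (the function name, hash, and thread count below are my own assumptions, not from the paper):

#include <stdint.h>

#define NUM_SUBANALYZERS 4   /* assumed number of SubAnalyzer threads */

/* Choose a SubAnalyzer for a decoded packet so that all packets of the
   same (srcIP, dstIP) pair reach the same SubAnalyzer, keeping its
   per-address-pair analysis purely local. */
unsigned pick_subanalyzer(uint32_t srcIP, uint32_t dstIP)
{
    uint32_t h = srcIP ^ dstIP;   /* toy mixing; any stable hash works */
    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return h % NUM_SUBANALYZERS;
}

The Dispatcher would then Insert the decoded packet into the chosen SubAnalyzer’s MCRingBuffer.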
Take-away Messages
Proposed a building block for parallel traffic monitoring: a lock-free, cache-efficient synchronization mechanism
Next question: how do we apply MCRingBuffer to different network monitoring problems?