A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring
Patrick P. C. Lee1, Tian Bu2, Girish Chandranmenon2
1 The Chinese University of Hong Kong, 2 Bell Labs, Alcatel-Lucent
April 2010
Outline
Motivation
MCRingBuffer, a multi-core ring buffer
Parallel network monitoring prototype
Conclusions
Network Traffic Monitoring
Monitoring data streams in today’s networks is essential for network management:
accounting, resource provisioning, failure diagnosis, intrusion detection/prevention
Goal: achieve line-rate monitoring
Monitoring speed must keep up with the link bandwidth (i.e., prepare for the worst)
Challenges:
Data volume keeps increasing (e.g., to Gigabit scales)
Single-CPU systems may no longer support line-rate monitoring
Can Multi-Core Help?
Can multi-core architectures help line-rate monitoring? Parallelize packet processing
The answer should be “yes”... yet exploiting the full potential of multi-core is still challenging
Inter-core communication has overhead:
Upper layer: protocol messages
Lower layer: thread synchronization in shared data structures
[Figure: single-core case (raw packets into one core) vs. multi-core case (raw packets distributed across a quad-core CPU)]
Can Multi-Core Help?
Multi-core helps only if we minimize inter-core communication overhead
Let’s focus on minimizing thread synchronization
This benefits a broad class of multi-threaded network monitoring applications
Our Contribution
Design a lock-free, cache-efficient multi-core synchronization mechanism for high-speed network traffic monitoring
Why lock-free? Allows concurrent thread accesses
Why cache-efficient? Saves expensive memory accesses
We embed the mechanism into MCRingBuffer, a lock-free, cache-efficient shared ring buffer tailored for multi-core architectures
Producer/Consumer Problem
Classical OS problem
Ring buffer: bounded buffer with fixed number of slots
Thread synchronization:
Producer inserts elements when the buffer is not full
Consumer extracts elements when the buffer is not empty
First-in-first-out (FIFO): elements are extracted in the same order as they were inserted
[Figure: producer inserts elements into the ring buffer; consumer extracts them]
Producer/Consumer Problem
Ring buffer in the multi-core context:
[Figure: producer and consumer run on separate cores, each with its own L1 cache and a shared L2 cache; the control variables and the ring buffer reside in memory, accessed over the system bus]
Thread synchronization operates on control variables. Make the operations as cache-friendly as possible.
Lamport’s Lock-Free Ring Buffer
Operates on two control variables, read and write, which point to the next read and write slots, respectively
[Figure: ring buffer with slots 0 to N-1 and the read/write pointers]
Insert(T element)
1: wait until NEXT(write) != read
2: buffer[write] = element
3: write = NEXT(write)

Extract(T* element)
1: wait until read != write
2: *element = buffer[read]
3: read = NEXT(read)
[Lamport, Comm. of ACM, 1977]
NEXT(x) = (x + 1) % N
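
For concreteness, here is a minimal C sketch of this algorithm (the type name lamport_queue and the use of int elements are illustrative choices, not from the slides; the sketch relies on the atomicity and sequential-consistency assumptions listed on the MCRingBuffer Assumptions slide):

#include <stddef.h>

#define N 2048                      /* number of slots */
#define NEXT(x) (((x) + 1) % N)

typedef struct {
    int buffer[N];                  /* int elements, for illustration */
    volatile size_t read;           /* next slot to read */
    volatile size_t write;          /* next slot to write */
} lamport_queue;

/* Producer: busy-wait while the buffer is full, then insert. */
void insert(lamport_queue *q, int element)
{
    while (NEXT(q->write) == q->read)
        ;                           /* full: busy-wait */
    q->buffer[q->write] = element;
    q->write = NEXT(q->write);
}

/* Consumer: busy-wait while the buffer is empty, then extract. */
void extract(lamport_queue *q, int *element)
{
    while (q->read == q->write)
        ;                           /* empty: busy-wait */
    *element = q->buffer[q->read];
    q->read = NEXT(q->read);
}

Note that one slot is intentionally left unused so that the full condition (NEXT(write) == read) and the empty condition (read == write) remain distinguishable.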
Previous Work
FastForward [Giacomoni et al., PPoPP, 2008]:
couples data and control operations
needs a special NULL data element defined by applications
Hardware-primitive ring buffers:
support multiple producers/multiple consumers
use hardware synchronization primitives (e.g., compare-and-swap)
Hardware primitives are expensive in general
MCRingBuffer Overview
Goal: use Lamport’s ring buffer as a building block to further minimize cost of thread synchronization
Properties:
Lock-free: allows concurrent accesses by the producer and consumer
Cache-efficient: improves the cache locality of synchronization
Generic: no assumptions on data types or insert/extract patterns
Deployable: works on general-purpose multi-core CPUs
Components:
Cache-line protection
Batch updates of control variables
MCRingBuffer Assumptions
Assumptions inherited from Lamport’s ring buffer:
single producer and single consumer
reads and writes of the read/write variables are atomic
memory accesses follow sequential consistency
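
As a side note (my addition, not from the slides): plain C loads and stores do not by themselves guarantee sequential consistency on modern compilers and CPUs, so a present-day port would express the control-variable accesses with C11 atomics, whose default ordering is sequentially consistent:

#include <stdatomic.h>

/* Hypothetical control variables; the names are illustrative. */
static atomic_int read_idx;
static atomic_int write_idx;

static void example(void)
{
    /* atomic_load/atomic_store default to memory_order_seq_cst,
       which matches the assumption above. */
    int r = atomic_load(&read_idx);
    atomic_store(&write_idx, r + 1);
}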
Cache-line Protection
Cache is accessed in units of cache lines
False sharing occurs when two threads access different variables on the same cache line:
the cache line is invalidated when one thread modifies a variable
the cache line is then reloaded from memory when the other thread reads a different variable on it, even if that variable is unchanged
[Figure: read, write, and N laid out on the same cache line]
read/write are modified frequently for thread synchronization
N (the ring buffer size) is reloaded from memory even though it is constant
Cache-line Protection
Add padding bytes to avoid false sharing
[Figure: read, write, and cachePad1 fill one cache line; N and cachePad2 fill another]

int read
int write
char cachePad1[CL - 2*sizeof(int)]
int N
char cachePad2[CL - sizeof(int)]

CL = cache line size
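
Rendered as a C struct, the layout could look as follows (a sketch; the 64-byte CACHE_LINE value is an assumption matching common x86 hardware such as the Xeon used later in the evaluation):

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

struct control_vars {
    /* Frequently modified shared variables: keep them on their own line. */
    volatile int read;
    volatile int write;
    char cachePad1[CACHE_LINE - 2 * sizeof(int)];

    /* Constant: padded so that updates to read/write never invalidate
       the line holding N. */
    int N;
    char cachePad2[CACHE_LINE - sizeof(int)];
};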
Cache-line Protection
Use cache-line protection to minimize memory accesses

[Figure: four padded cache lines: shared variables (read, write, cachePad1); consumer’s local variables (localWrite, nextRead, cachePad2); producer’s local variables (localRead, nextWrite, cachePad3); constants (N, cachePad4)]

Shared variables are the main controls of synchronization
Use local variables to “guess” the shared variables (see the C sketch after the batch-update pseudocode below)
Goal: minimize the frequency of reading shared control variables
Batch Updates of Control Variables
Intuition: nextRead/nextWrite are the positions to read/write next
Update the shared read/write only after every batchSize reads/writes
Producer:
buffer[nextWrite] = element
nextWrite = NEXT(nextWrite)
wBatch++
if (wBatch >= batchSize) {
    write = nextWrite
    wBatch = 0
}

Consumer:
*element = buffer[nextRead]
nextRead = NEXT(nextRead)
rBatch++
if (rBatch >= batchSize) {
    read = nextRead
    rBatch = 0
}
Goal: minimize the frequency of writing shared control variables
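
Putting cache-line protection, the local-variable “guesses”, and batch updates together, here is a minimal self-contained C sketch (a reconstruction for illustration, not the paper’s verbatim code; the 64-byte cache line and batchSize of 50 are assumptions borrowed from the evaluation setup, and the busy-wait loops rely on the atomicity and sequential-consistency assumptions above):

#include <stddef.h>

#define CACHE_LINE 64                /* assumed cache-line size in bytes */
#define N 2048                       /* ring buffer capacity (slots) */
#define BATCH_SIZE 50                /* batchSize; 50 is used in Experiment 3 */
#define NEXT(x) (((x) + 1) % N)

typedef struct {
    /* Shared control variables (own cache line) */
    volatile int read;
    volatile int write;
    char pad1[CACHE_LINE - 2 * sizeof(int)];

    /* Consumer's local variables (own cache line) */
    int localWrite;
    int nextRead;
    int rBatch;
    char pad2[CACHE_LINE - 3 * sizeof(int)];

    /* Producer's local variables (own cache line) */
    int localRead;
    int nextWrite;
    int wBatch;
    char pad3[CACHE_LINE - 3 * sizeof(int)];

    int buffer[N];                   /* int elements, for illustration */
} MCRingBuffer;

/* Producer-only code path. */
void Insert(MCRingBuffer *q, int element)
{
    int afterNextWrite = NEXT(q->nextWrite);
    if (afterNextWrite == q->localRead) {   /* guess: buffer may be full */
        while (afterNextWrite == q->read)
            ;                               /* truly full: busy-wait */
        q->localRead = q->read;             /* refresh the guess */
    }
    q->buffer[q->nextWrite] = element;
    q->nextWrite = afterNextWrite;
    if (++q->wBatch >= BATCH_SIZE) {        /* publish write once per batch */
        q->write = q->nextWrite;
        q->wBatch = 0;
    }
}

/* Consumer-only code path. */
void Extract(MCRingBuffer *q, int *element)
{
    if (q->nextRead == q->localWrite) {     /* guess: buffer may be empty */
        while (q->nextRead == q->write)
            ;                               /* truly empty: busy-wait */
        q->localWrite = q->write;           /* refresh the guess */
    }
    *element = q->buffer[q->nextRead];
    q->nextRead = NEXT(q->nextRead);
    if (++q->rBatch >= BATCH_SIZE) {        /* publish read once per batch */
        q->read = q->nextRead;
        q->rBatch = 0;
    }
}

Note how the shared read/write are read only when a guess fails and written only once per batch; everything else touches thread-private cache lines.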
Batch Updates of Control Variables
Limitation: read/write are advanced on a per-batch basis
Elements may not be extracted even when the buffer is not empty
However, if the elements are raw packets in high-speed networks, read/write will be updated regularly
Correctness of MCRingBuffer
Correctness is based on Lamport’s ring buffer:
Lamport’s:
Insert only if write - read < N
Extract only if read < write
We prove for MCRingBuffer:
Insert only if nextWrite - nextRead < N
Extract only if nextRead < nextWrite
Details in the paper.
Evaluation
Hardware: Intel Xeon 5355 quad-core
sibling cores: a pair of cores sharing an L2 cache
non-sibling cores: a pair of cores not sharing an L2 cache
Ring buffers:
LockRingBuffer: lock-based ring buffer
BasicRingBuffer: Lamport’s ring buffer
MCRingBuffer:
batchSize = 1: cache-line protection only
batchSize > 1: cache-line protection + batch updates of control variables
Metrics:
Throughput: number of insert/extract pairs per second
Number of L2 cache misses: number of cache-line reload operations
Experiment 1: Throughput vs. element size

[Plots: throughput vs. element size on sibling cores and on non-sibling cores; buffer capacity = 2K elements]

MCRingBuffer with batchSize > 1 has a higher throughput gain (up to 5x) for smaller element sizes
Experiment 2: Throughput vs. buffer capacity

[Plots: throughput vs. buffer capacity on sibling cores and on non-sibling cores; element size = 128 bytes]

MCRingBuffer’s throughput stays stable once the buffer capacity is large enough
Experiment 3: Code profiling with the Intel VTune Performance Analyzer

Metric numbers for 10M inserts/extracts (element size = 8 bytes, capacity = 2K elements):

Metric                    BasicRingBuffer    MCRingBuffer (batchSize = 50)
# core cycles             1130M / 1097M      137M / 113M
# retired instructions    358M / 287M        231M / 219M
# L2 cache misses         746K / 808K        102K / 80K

MCRingBuffer improves cache locality
Recap of Evaluation
MCRingBuffer improves throughput in various scenarios:
different data sizes
different buffer capacities
sibling/non-sibling cores
MCRingBuffer has a higher throughput gain via:
careful organization of control variables
careful accesses to control variables
MCRingBuffer’s gain does not require any special insert/extract patterns
Parallel Traffic Monitoring
Applying MCRingBuffer to parallel traffic monitoring
[Figure: raw packets enter a Dispatcher, which feeds multiple SubAnalyzers through ring buffers carrying decoded packets; the SubAnalyzers in turn feed a MainAnalyzer through ring buffers carrying state reports]
Parallel Traffic Monitoring
Dispatch stage:
Decode raw packets
Distribute decoded packets by (srcIP, dstIP)
SubAnalysis stage:
Local analysis on address pairs
e.g., 5-tuple flow stats, vertical portscans
MainAnalysis stage:
Global analysis: aggregate the results of all SubAnalyzers
e.g., per-source volume, horizontal portscans

[Figure: Dispatch → SubAnalysis → MainAnalysis pipeline]

Evaluation results: MCRingBuffer helps scale up packet processing throughput (details in the paper)
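
As a purely illustrative sketch of the Dispatch stage’s distribution rule (the function name, hash, and thread count below are my own assumptions, not from the paper):

#include <stdint.h>

#define NUM_SUBANALYZERS 4   /* assumed number of SubAnalyzer threads */

/* Choose a SubAnalyzer for a decoded packet so that all packets of the
   same (srcIP, dstIP) pair reach the same SubAnalyzer, keeping its
   per-address-pair analysis purely local. */
unsigned pick_subanalyzer(uint32_t srcIP, uint32_t dstIP)
{
    uint32_t h = srcIP ^ dstIP;   /* toy mixing; any stable hash works */
    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return h % NUM_SUBANALYZERS;
}

The Dispatcher would then Insert the decoded packet into the chosen SubAnalyzer’s MCRingBuffer.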
Take-away Messages
Proposed a building block for parallel traffic monitoring: a lock-free, cache-efficient synchronization mechanism
Next question: how do we apply MCRingBuffer to different network monitoring problems?