Kargus: A Highly‐scalable Software‐based Intrusion Detection System
M. Asim Jamshed*, Jihyung Lee†, Sangwoo Moon†, Insu Yun*, Deokjin Kim‡, Sungryoul Lee‡, Yung Yi†, KyoungSoo Park*
* Networked & Distributed Computing Systems Lab, KAIST  † Laboratory of Network Architecture Design & Analysis, KAIST
‡ Cyber R&D Division, NSRI
Network Intrusion Detection Systems (NIDS)
• Detect known malicious activities
  – Port scans, SQL injections, buffer overflows, etc.
• Deep packet inspection
  – Detect malicious signatures (rules) in each packet
• Desirable features
  – High performance (> 10 Gbps) with precision
  – Easy maintenance (frequent ruleset updates)
Hardware vs. Software
• H/W-based NIDS
  – Specialized hardware (ASIC, TCAM, etc.)
  – High performance: IDS/IPS sensors at 10s of Gbps (~US$20,000-60,000); IDS/IPS M8000 at 10s of Gbps (~US$10,000-24,000)
  – Expensive (annual servicing costs)
  – Low flexibility
• S/W-based NIDS
  – Commodity machines, open-source S/W
  – High flexibility
  – Low performance (≤ ~2 Gbps): DDoS/packet drops
Goals
• S/W-based NIDS
  – Commodity machines
  – High flexibility
  – High performance
Typical Signature-based NIDS Architecture
Packet Acquisition -> Preprocessing (decode, flow management, reassembly) -> Multi-string Pattern Matching -> Rule Options Evaluation -> Output
• Match failure in multi-string matching, or evaluation failure in rule-option evaluation: innocent flow
• Evaluation success: malicious flow
• Bottlenecks: multi-string pattern matching and rule-option (PCRE*) evaluation
Example rule:
alert tcp $EXTERNAL_NET any -> $HTTP_SERVERS 80 (msg:"possible attack attempt BACKDOOR optix runtime detection"; content:"/whitepages/page_me/100.html"; pcre:"/body=\x2521\x2521\x2521Optix\s+Pro\s+v\d+\x252E\d+\S+sErver\s+Online\x2521\x2521\x2521/")
* PCRE: Perl Compatible Regular Expression
Contributions
Goal: a highly-scalable software-based NIDS for high-speed networks, turning a slow software NIDS into a fast one.
• Bottleneck: inefficient packet acquisition -> Solution: multi-core packet acquisition
• Bottleneck: expensive string & PCRE pattern matching -> Solution: parallel processing & GPU offloading
Outcome: the fastest S/W signature-based IDS at 33 Gbps (10 Gbps on 100% malicious traffic, ~24 Gbps on real network traffic)
Challenge 1: Packet Acquisition
• Default packet module: Packet CAPture (PCAP) library
  – Unsuitable for multi-core environments
  – Low performance
  – Higher power consumption
• A multi-core packet capture library is required
(Diagram: cores 1-5 read from 10 Gbps NICs A and B; cores 7-11 read from 10 Gbps NICs C and D.)
Packet RX bandwidth*: 0.4-6.7 Gbps
CPU utilization: 100%
* Intel Xeon X5680, 3.33 GHz, 12 MB L3 Cache
Solution: PacketShader I/O*
• PacketShader I/O
  – Uniformly distributes packets by flow via RSS hashing over the source/destination IP addresses, port numbers, and protocol ID
  – One core can read packets from the RSS queues of multiple NICs
  – Reads packets in batches (32-4096)
• Symmetric Receive-Side Scaling (RSS)
  – Passes the packets of one connection to the same queue
* S. Han et al., "PacketShader: a GPU-accelerated software router", ACM SIGCOMM 2010
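The per-core, batched read loop described above can be sketched as a toy model (this is not the real PacketShader I/O API; the queue representation and function names are hypothetical, for illustration only):

```python
from collections import deque

BATCH = 32  # PacketShader I/O reads packets in batches of 32-4096

def read_batch(rx_queue: deque, batch_size: int = BATCH) -> list:
    """Drain up to batch_size packets from one RSS queue in a single call,
    amortizing per-packet overhead across the whole batch."""
    batch = []
    while rx_queue and len(batch) < batch_size:
        batch.append(rx_queue.popleft())
    return batch

def engine_core_loop(queues_of_my_nics: list, process_batch) -> None:
    """One engine core polls its own RSS queue on every attached NIC,
    e.g. core 1 reads RxQ_A1 and RxQ_B1 of NICs A and B."""
    for q in queues_of_my_nics:
        batch = read_batch(q)
        if batch:
            process_batch(batch)
```

Because each core only ever touches its own queues, no locks are needed across cores, which is what makes the design scale.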
(Diagram: with PacketShader I/O, each core i reads in batches from its own RSS queues RxQ_Ai and RxQ_Bi on 10 Gbps NICs A and B.)
Packet RX bandwidth: 0.4-6.7 Gbps -> 40 Gbps
CPU utilization: 100% -> 16-29%
Challenge 2: Pattern Matching
• CPU-intensive tasks for serial packet scanning
• Major bottlenecks
  – Multi-string matching (Aho-Corasick phase)
  – PCRE evaluation (if a 'pcre' rule option exists in the rule)
• On an Intel Xeon X5680 (3.33 GHz, 12 MB L3 cache)
  – Aho-Corasick analyzing bandwidth per core: 2.15 Gbps
  – PCRE analyzing bandwidth per core: 0.52 Gbps
Solution: GPU for Pattern Matching
• GPUs
  – Contain 100s of SIMD processors (512 cores on an NVIDIA GTX 580)
  – Ideal for parallel data processing without branches
• DFA-based pattern matching on GPUs
  – Multi-string matching using the Aho-Corasick algorithm
  – PCRE matching
• Pipelined execution on CPU/GPU
  – Concurrent copy and execution
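The multi-string phase is a single DFA walk over each packet. A minimal CPU-side Aho-Corasick sketch (not Snort's acsm implementation, just the underlying algorithm that the GPU kernels parallelize across packets) looks like this:

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: trie + failure links + output sets."""
    nxt, fail, out = [{}], [0], [set()]   # state 0 is the root
    for pat in patterns:                  # 1) insert every pattern into the trie
        s = 0
        for ch in pat:
            if ch not in nxt[s]:
                nxt.append({}); fail.append(0); out.append(set())
                nxt[s][ch] = len(nxt) - 1
            s = nxt[s][ch]
        out[s].add(pat)
    q = deque(nxt[0].values())            # 2) BFS to compute failure links
    while q:
        s = q.popleft()
        for ch, t in nxt[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in nxt[f]:
                f = fail[f]
            fail[t] = nxt[f].get(ch, 0)
            out[t] |= out[fail[t]]        # inherit matches ending at the suffix
    return nxt, fail, out

def match(text, automaton):
    """Scan text once; report (pattern, end_index) for every occurrence."""
    nxt, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in nxt[s]:
            s = fail[s]
        s = nxt[s].get(ch, 0)
        hits.extend((p, i) for p in out[s])
    return hits
```

One pass finds all patterns simultaneously, which is why the per-core bandwidth is bounded by memory accesses into the DFA table rather than by the number of rules.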
(Diagram: each engine thread runs packet acquisition, preprocessing, multi-string matching, and rule-option evaluation; a GPU dispatcher thread offloads work to the GPU through a multi-string matching queue and a PCRE matching queue.)
Aho-Corasick bandwidth: 2.15 Gbps -> 39 Gbps
PCRE bandwidth: 0.52 Gbps -> 8.9 Gbps
Optimization 1: IDS Architecture
• How to best utilize the multi-core architecture?
• Pattern matching is the eventual bottleneck
• Run the entire engine on each core
* GNU gprof profiling results:
  Function                 | Time % | Module
  acsmSearchSparseDFA_Full | 51.56  | multi-string matching
  List_GetNextState        | 13.91  | multi-string matching
  mSearch                  |  9.18  | multi-string matching
  in_chksum_tcp            |  2.63  | preprocessing
Solution: Single-process Multi-thread
• Run multiple IDS engine threads & GPU dispatcher threads concurrently
  – Shared address space
  – Less GPU memory consumption (1/6 the GPU memory usage)
  – Higher GPU utilization & shorter service latency
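The thread layout above, several engine threads feeding one dispatcher through a shared in-process queue, can be sketched with standard threads (a toy model; batch contents and counts are illustrative, not Kargus internals):

```python
import queue
import threading

# Shared offload queue: engine threads enqueue batches, one dispatcher
# thread drains them (in Kargus the dispatcher hands batches to the GPU).
offload_q = queue.Queue()
results = []

def engine_thread(core_id: int, batches: list) -> None:
    """One IDS engine: acquisition/preprocessing would happen here,
    then the batch is offloaded for pattern matching."""
    for batch in batches:
        offload_q.put((core_id, batch))

def dispatcher_thread(n_items: int) -> None:
    """Single dispatcher thread; because all engines share its address
    space, it sees every engine's work with no extra copies."""
    for _ in range(n_items):
        core_id, batch = offload_q.get()
        results.append((core_id, len(batch)))  # stand-in for GPU matching

engines = [threading.Thread(target=engine_thread, args=(i, [[0] * 32] * 4))
           for i in range(5)]
disp = threading.Thread(target=dispatcher_thread, args=(20,))
disp.start()
for t in engines: t.start()
for t in engines: t.join()
disp.join()
```

The shared address space is the point: a process-per-engine design would need one GPU context and one copy of the DFA tables per process, which is where the 6x GPU memory saving comes from.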
(Diagram: cores 1-5 each run a complete engine pipeline of packet acquisition, preprocessing, multi-string matching, and rule-option evaluation; a single GPU dispatcher thread is pinned to core 6.)
Architecture
• Non-Uniform Memory Access (NUMA)-aware
• Core framework as deployed on a dual hexa-core system
• Can be configured for various NUMA set-ups
▲ Kargus configuration on a dual-NUMA hexa-core machine with 4 NICs and 2 GPUs
Optimization 2: GPU Usage
• Caveats of always using the GPU
  – Long per-packet processing latency: buffering in the GPU dispatcher
  – Higher power consumption (NVIDIA GTX 580: 512 cores)
• Use:
  – CPU when the ingress rate is low (GPU idles)
  – GPU when the ingress rate is high

Solution: Dynamic Load Balancing
• Load balancing between CPU & GPU
  – Reads packets from the NIC queues each cycle
  – Analyzes a small number of packets per cycle (a < b < c)
  – Increases the analyzing rate as the queue length grows
  – Activates the GPU once the queue length passes the final threshold
(Diagram: the internal packet queue (per engine) has thresholds α, β, γ; batch sizes a, b, c grow with the queue length, and work is offloaded to the GPU beyond the last threshold.)
Packet latency with GPU: 640 μsecs; CPU: 13 μsecs
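The threshold scheme above can be sketched in a few lines. The names a, b, c and α, β, γ follow the slide; the numeric values here are purely illustrative assumptions, not Kargus's actual tuning:

```python
# Queue-length thresholds (alpha, beta, gamma) and per-cycle batch
# sizes (a < b < c); all four numbers below are made up for illustration.
ALPHA, BETA, GAMMA = 1024, 4096, 8192
A, B, C = 64, 256, 1024

def pick_mode(queue_len: int):
    """Choose the batch size and the CPU-vs-GPU target from the current
    per-engine queue length: analyze more packets per cycle as the queue
    grows, and activate the GPU only when the CPU can no longer keep up."""
    if queue_len < ALPHA:
        return ("CPU", A)    # low load: small batches, GPU stays idle
    if queue_len < BETA:
        return ("CPU", B)
    if queue_len < GAMMA:
        return ("CPU", C)
    return ("GPU", C)        # high load: offload batches to the GPU
```

Keeping the GPU off at low load avoids both the 640 μs dispatcher buffering latency and the idle-GPU power draw, at no throughput cost.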
Optimization 3: Batched Processing
• Huge per-packet processing overhead
  – > 10 million packets per second for small-sized packets at 10 Gbps
  – Reduces overall processing throughput
• Function call batching
  – Reads a group of packets from the RX queues at once
  – Passes the batch of packets to each function
Decode(p), Preprocess(p), Multistring_match(p) -> Decode(list-p), Preprocess(list-p), Multistring_match(list-p)
2x faster processing rate
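Function call batching amounts to changing each pipeline stage's signature from one packet to a list of packets. A minimal sketch (stage bodies are placeholders, not the real decode/preprocess logic):

```python
def decode(pkts: list) -> list:
    """Batched decode: one call handles the whole list of raw packets."""
    return [("decoded", p) for p in pkts]

def preprocess(pkts: list) -> list:
    """Batched preprocessing (flow management, reassembly would go here)."""
    return [("preprocessed", p) for p in pkts]

def multistring_match(pkts: list) -> list:
    """Batched multi-string matching over the whole list."""
    return [("matched", p) for p in pkts]

def process_batch(raw: list) -> list:
    # Three function calls per *batch* instead of three per *packet*:
    # for a 4096-packet batch that removes over 12,000 calls per cycle,
    # and each stage's code and tables stay hot in the cache.
    return multistring_match(preprocess(decode(raw)))
```

At > 10 M packets/s, the saved call overhead and improved locality are what produce the 2x processing-rate gain reported above.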
Kargus Specifications
• Two NUMA nodes, each with:
  – Intel X5680 3.33 GHz (hexa-core), 12 MB L3 NUMA-shared cache
  – 12 GB DRAM (3 GB x 4)
  – Intel 82599 10 Gigabit Ethernet adapter (dual port)
  – NVIDIA GTX 580 GPU
• Per-component prices shown: $1,210, $512, $370, $100
• Total cost (incl. serverboard) = ~$7,000
IDS Benchmarking Tool
• Generates packets at line rate (40 Gbps)
  – Random TCP packets (innocent)
  – Attack packets generated from the attack ruleset
• Supports packet replay using PCAP files
• Useful for performance evaluation
Kargus Performance Evaluation
• Micro-benchmarks
  – Input traffic rate: 40 Gbps
  – Evaluate Kargus (~3,000 HTTP rules) against Kargus CPU-only (12 engines), Snort with PF_RING, and MIDeA*
• Refer to the paper for more results
* G. Vasiliadis et al., "MIDeA: a multi-parallel intrusion detection architecture", ACM CCS '11
Innocent Traffic Performance
(Charts: throughput in Gbps vs. packet size (64, 218, 256, 818, 1024, 1518 bytes) for MIDeA, Snort w/ PF_Ring, Kargus CPU-only, and Kargus CPU/GPU; the second chart shows the actual payload analyzing bandwidth.)
• 2.7-4.5x faster than Snort
• 1.9-4.3x faster than MIDeA
Malicious Traffic Performance
(Chart: throughput in Gbps vs. packet size (64, 256, 1024, 1518 bytes) for Kargus and Snort+PF_Ring at 25%, 50%, and 100% malicious traffic.)
• 5x faster than Snort
Real Network Traffic
• Three 10 Gbps LTE backbone traces from a major ISP in Korea
  – Duration of each trace: 30 min - 1 hour
  – TCP/IPv4 traffic: 84 GB of PCAP traces, 109.3 million packets, 845K TCP sessions
• Total analyzing rate: 25.2 Gbps
  – Bottleneck: flow management (preprocessing)
Effects of Dynamic GPU Load Balancing
• Varying incoming traffic rates (packet size = 1518 B)
(Chart: power consumption in Watts vs. offered incoming traffic (0, 5, 10, 20, 33 Gbps) for Kargus w/o LB (polling), Kargus w/o LB, and Kargus w/ LB; annotated savings of 8.7% and 20%.)
Conclusion
• Software-based NIDS:
  – Based on commodity hardware (~US$7,000)
  – Competes with hardware-based counterparts: > 25 Gbps on real traffic, > 33 Gbps on synthetic traffic
  – 5x faster than previous S/W-based NIDS
  – Power efficient
  – Cost effective
Backup Slides
Kargus vs. MIDeA*
UPDATE             | MIDEA                                    | KARGUS                                      | OUTCOME
Packet acquisition | PF_RING                                  | PacketShader I/O                            | 70% lower CPU utilization
Detection engine   | GPU support for Aho-Corasick             | GPU support for Aho-Corasick & PCRE         | 65% faster detection rate
Architecture       | Process-based                            | Thread-based                                | 1/6 GPU memory usage
Batch processing   | Batching only for detection engine (GPU) | Batching from packet acquisition to output  | 1.9x higher throughput
Power efficiency   | Always GPU (does not offload only when the packet size is too small) | Opportunistic offloading to GPUs (based on ingress traffic rate) | 15% power saving
* G. Vasiliadis, M. Polychronakis, and S. Ioannidis, "MIDeA: a multi-parallel intrusion detection architecture", ACM CCS 2011
Receive-Side Scaling (RSS)
• RSS uses the Toeplitz hash function (with a random secret key, RSK)

Algorithm: RSS Hash Computation
function ComputeRSSHash(Input[], RSK)
    ret = 0
    for each bit b in Input[] do
        if b == 1 then
            ret ^= (left-most 32 bits of RSK)
        end if
        shift RSK left 1 bit position
    end for
    return ret
end function
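The pseudocode above can be implemented directly. This sketch assumes the usual RSS input layout for a TCP/IPv4 4-tuple (src IP, dst IP, src port, dst port, big-endian) and a 40-byte key, and it also demonstrates why a key made of one repeating 16-bit word hashes both directions of a connection to the same value:

```python
import struct

def toeplitz_hash(data: bytes, key: bytes) -> int:
    """RSS Toeplitz hash: for every set bit at position i of the input,
    XOR in the 32-bit window of the key that starts at bit i."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    h = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                pos = i * 8 + b
                h ^= (key_int >> (key_bits - 32 - pos)) & 0xFFFFFFFF
    return h

def flow_hash(src_ip, dst_ip, src_port, dst_port, key):
    """Hash a TCP/IPv4 4-tuple, fields packed in RSS input order."""
    data = struct.pack(">IIHH", src_ip, dst_ip, src_port, dst_port)
    return toeplitz_hash(data, key)

# A key whose 16-bit words all repeat makes the hash symmetric:
# swapping source and destination shifts every field by a multiple of
# 16 bits, which leaves each 32-bit key window unchanged.
SYMMETRIC_RSK = bytes.fromhex("6d5a" * 20)
```

With `SYMMETRIC_RSK`, `flow_hash(a, b, p, q, key) == flow_hash(b, a, q, p, key)`, so both directions of a connection land in the same RSS queue.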
Symmetric Receive-Side Scaling
• Update the RSK (Shinae Woo et al.): replace the default key with one whose 16-bit words all repeat, so a flow hashes to the same queue in both directions
Default RSK:
0x6d5a 0x56da 0x255b 0x0ec2
0x4167 0x253d 0x43a3 0x8fb0
0xd0ca 0x2bcb 0xae7b 0x30b4
0x77cb 0x2d3a 0x8030 0xf20c
0x6a42 0xb73b 0xbeac 0x01fa
Symmetric RSK: 0x6d5a repeated 20 times
Why use a GPU?
(Diagram, adapted from the NVIDIA CUDA C Programming Guide Version 4.2, Figure 1-2: a Xeon X5680 spends most of its die on control logic and cache around 6 cores, whereas a GTX 580 devotes it to 512 simple ALU cores.)
GPU Microbenchmarks - Aho-Corasick
(Chart: throughput in Gbps vs. batch size (32-16,384 pkts/batch); GPU throughput (2 B per entry) reaches 39 Gbps vs. 2.15 Gbps CPU throughput.)
GPU Microbenchmarks - PCRE
(Chart: throughput in Gbps vs. batch size (32-16,384 pkts/batch); GPU throughput reaches 8.9 Gbps vs. 0.52 Gbps CPU throughput.)
Effects of NUMA-aware Data Placement
• Minimal use of global variables
  – Avoids compulsory cache misses
  – Eliminates cross-NUMA cache-bouncing effects
(Chart: performance speedup (1x-2.8x) vs. packet size (64-1518 bytes) for innocent and malicious traffic.)
CPU-only Analysis for Small-sized Packets
• Offloading small-sized packets to the GPU is expensive
  – Contention for page-locked, DMA-accessible memory shared with the GPU
  – The GPU operational cost of packet metadata increases
(Chart: total and pattern-matching latency vs. packet size (64-128 bytes) for GPU and CPU; a crossover is marked at 82 bytes.)
Challenge 1: Packet Acquisition
• Default packet module: Packet CAPture (PCAP) library
  – Unsuitable for multi-core environments
  – Low performance
(Chart: receiving throughput in Gbps and CPU utilization (%) vs. packet size (64-1518 bytes); PCAP polling reaches only 0.4, 0.8, 1.5, 2.9, 5.0, and 6.7 Gbps at 100% CPU utilization.)
Solution: PacketShader* I/O
(Chart: receiving throughput in Gbps and CPU utilization (%) vs. packet size (64-1518 bytes); PSIO far exceeds PCAP polling's 0.4-6.7 Gbps while using a fraction of the CPU.)