SSLShader: Cheap SSL Acceleration with
Commodity Processors
Keon Jang+, Sangjin Han+, Seungyeop Han*,
Sue Moon+, and KyoungSoo Park+
KAIST+ and University of Washington*
1
Security of Paper Submission Websites
2
Network and Distributed System Security Symposium
Security Threats in the Internet
Public WiFi without encryption
• Easy target that requires almost no effort
Deep packet inspection by governments
• Used for censorship
• In the name of national security
NebuAd’s targeted advertisement
• Modify user’s Web traffic in the middle
3
Secure Sockets Layer (SSL)
A de-facto standard for secure communication
• Authentication, Confidentiality, Content integrity
4
Client Server TCP handshake
Encrypted data
Key exchange using public key algorithm
(e.g., RSA) Server identification
SSL Deployment Status
Most of Web-sites are not SSL-protected
• Less than 0.5%
• [NETCRAFT Survey Jan ‘09]
Why is SSL not ubiquitous?
• Small sites: lack of recognition, manageability, etc.
• Large sites: cost
• SSL requires lots of computation power
5
SSL Computation Overhead
Performance overhead (HTTPS vs. HTTP)
• Connection setup
• Data transfer
Good privacy is expensive
• More servers
• H/W SSL accelerators
Our suggestion:
• Offload SSL computation to GPU
6
22x
50x
SSL-accelerator leveraging GPU
• High-performance
• Cost-effective
SSL reverse proxy
• No modification on existing servers
SSLShader
7
SSLShader
Web Server
SMTP Server
POP3 Server
Plain TCP SSL-encrypted session
Our Contributions
GPU cryptography optimization
• The fastest RSA on GPU
• Superior to high-end hardware accelerators
• Low latency
SSLShader
• Complete system exploiting GPU for SSL processing
• Batch processing
• Pipelining
• Opportunistic offloading
• Scaling with multiple cores and NUMA nodes
8
CRYPTOGRAPHIC PROCESSING
WITH GPU
9
How GPU Differs From CPU?
Intel Xeon 5650 CPU:
6 cores
NVIDIA GTX580 GPU:
512 cores
Control
ALU
ALU
ALU
ALU
ALU ALU
Cache
ALU
10
62×109 870×109 < Instructions / sec
Core
Cache
void VecAdd(
int *A, int *B, int *C, int N)
{
//iterate over N elements
for(int i = 0; i < N; i++)
C[i] = A[i] + B[i]
}
VecAdd(A, B, C, N);
__global__ void VecAdd(
int *A, int *B, int *C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i]
}
//Launch N threads
VecAdd<<<1, N>>>(A, B, C);
Single Instruction Multiple Threads (SIMT)
11
GPU code CPU code
Example code: vector addition (C = A + B)
1/3지점 8분 10초
Parallelism in SSL Processing
12
Client 1
Client 2
Client N 1. Independent Sessions
SSL Record SSL Record SSL Record 2. Independent SSL Record
3. Parallelism in Cryptographic Operations
SSLShader
Our GPU Implementation
Choices of cipher-suite
Optimization of GPU algorithms • Exploiting massive parallel processing
• Parallelization of algorithms
• Batch processing
• Data copy overhead is significant
• Concurrent copy and execution
13
앞에랑 매핑이 되게-_- 그림을 가져와서 매핑이 되게 하는게 좋을듯
Client Server
Encryption: AES Message Authentication: SHA1
Key exchange: RSA
Basic RSA Operations
M: plain-text, C: cipher-text
(e, n): public key, (d, n): private key
Encryption:
C = Me mod n
Decryption:
M = Cd mod n
14
1024/2048 bits integer (300 ~ 600 digits)
Small number: 3, 17, 65537
Decryption at the server side is the bottleneck
Exponentiation many multiplications
Server-side
Server
Client
Breakdown of Large Integer Multiplication
15
Schoolbook multiplication
649 X 627 ---------
63 280 4200 180 800
12000 5400 32000
+ 360000 ---------
406923
Accumulation is difficult to parallelize due to
“overlapping digits”
“carry propagation”
3 x 3 = 9 multiplications 9 addition of 6-digits integers
O(s) Parallel Multiplications
16
Example of 649 x 627 = 406,923
2s steps
1 or 2 steps (s – 1 worst case)
s = # of words in a large integer (E.g., 1024-bits = 16 x 64 bits word)
More Optimizations on RSA
Common optimizations for RSA • Chinese Remainder Theorem (CRT) • Montgomery Multiplication • Constant Length Non-zero Window (CLNW)
Parallelization of serial algorithms • Faster Calculation of M×n • Interleaving of T + M×n • Mixed-Radix Conversion Offloading
GPU specific optimizations • Warp Utilization • Loop Unrolling • Elimination of Divergence • Avoiding Bank Conflicts • Instruction-Level Optimization
17
4054 6620 13281 9891 10146 6627 21041
0 10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000
Throughput (operations/s)
Initial (1)
(2)
(3) Warp
Utilization
(4)
(5) (6) 64-bit words (7) Avoiding bank
conflicts
(8) Instruction-level
Optimization CLNW (9) Post-exponentiation offloading
Read our paper for details
Parallelism in SSL Processing
18
Client 1
Client 2
Client N 1. Independent Sessions
SSL Record SSL Record SSL Record 2. Independent SSL Record
3. Parallelism in Cryptographic Operations
SSLShader
Batch Processing
GTX580 Throughput w/o Batching
19
0.08x 0.02x
1.57x
0.02x 0.0x
0.5x
1.0x
1.5x
2.0x
RSA AES-ENC AES-DEC SHA1
Throughput relative to a “single CPU core”
Intel Nehalem single core (2.66Ghz)
22.1x
6.8x 7.7x 9.4x
0x
5x
10x
15x
20x
25x
RSA AES-ENC AES-DEC SHA1
Throughput relative to a "single CPU core"
GTX580 Throughput w/ Batching
20
Difference: ratio of computation to copy
Batch size: 32~4096 depending on the algorithm
Copy Overhead in GPU Cryptography
GPU processing works by
• Data copy: CPU GPU
• Execution in GPU
• Data copy: GPU -> CPU
21
AES-ENC (Gbps)
AES-DEC (Gbps)
HMAC-SHA1 (Gbps)
GTX580 w/ copy 8.8 10 31
GTX580 no copy 21.8 33 124
0
20
40
60
80
100
120
140
AES-ENC AES-DEC HMAC-SHA1
Th
rou
gh
pu
t (G
bp
s)
↑2.4x ↑3.3x
↑4x
w/o copy
w/ copy
w/o copy
Hiding Copy Overhead
22
Synchronous Execution
Pipelining
Processing time : 3t
t
Amortized processing time : t
…
…
…
Data copy: CPU -> GPU
Execution in GPU
Data copy: GPU -> CPU
Data copy: CPU -> GPU
Execution in GPU
Data copy: GPU -> CPU
0x
5x
10x
15x
20x
AES-ENC AES-DEC SHA1
Throughput relative to a single core
GTX580 Performance w/ Pipelining
23
↑ 36% ↑ 36%
↑ 51% w/o copy
synchronous
pipelining
9x 9x
14x
Summary of GPU Cryptography
Performance gain from GTX580
• GPU performs as fast as 9 ~ 28 CPU cores
• Superior to high-end hardware accelerators
Lessons • Batch processing is essential to fully utilize a GPU
• AES and SHA1 are bottlenecked by data copy • PCIe 3.0
• Integrated GPU and CPU
24
RSA-1024 (ops/sec)
AES-ENC (Gbps)
AES-DEC (Gbps)
SHA1 (Gbps)
GTX580 91.9K 11.5 12.5 47.1
CPU core 3.3K 1.3 1.3 3.3
16분 30초
BUILDING SSL-PROXY THAT
LEVERAGES GPU
25
SSLShader Design Goals
Use existing application without modification
• SSL reverse proxy
Effectively leverage GPU
• Batching cryptographic operations
• Load balancing between CPU and GPU
Scale performance with architecture evolution
• Multi-core CPUs
• Multiple NUMA nodes
26
Batching Crypto Operations
Network workloads vary over time • Waiting for fixed batch size doesn’t work
27
Output queue
GPU
Input queue
CPU
GPU
SSL Stack
Batch size is dynamically adjusted to queue length
Balancing Load Between CPU and GPU
For small batch, CPU is faster than GPU • Opportunistic offloading
28
Output queue
GPU
Input queue
CPU processing
GPU processing when input queue length > threshold
GPU queue
CPU
Cryptographic operation Minimum Maximum
RSA (1024-bit) 16 512
AES Decryption 32 2048
AES Encryption 128 2048
HMAC-SHA1 128 2048
Input queue length > threshold
Scaling with Multiple Cores
29
Per-core worker threads • Network I/O, cryptographic operation
Sharing a GPU with multiple cores • More parallelism with larger batch size
Output queues
GPU
CPU
CPU
CPU
Input queues GPU
queue CPU
GPU
Core0
Core1
Core2
Scaling with NUMA systems
A process = worker threads + a GPU thread
• Separate process per NUMA node
• Minimizes data sharing across NUMA nodes
30
CPU0
IOH0
GPU0
RAM
NIC0
CPU1
IOH1
GPU1
RAM
NIC1
Node 0 Node 1
Evaluation
Experimental configurations
31
Model Spec Qty
CPU Intel X5650 2.66Ghz x 6 croes 2
GPU NVIDIA GTX580 1.5Ghz x 512 cores 2
NIC Intel X520-DA2 10GbE x 2 2
SSLShader+ Lighttpd
8 Clients
4x 10GbE
…
Server Server
Lighttpd
SSLShader
Lighttpd
OpenSSL
HTTP
Clients
GPU
Clients
HTTPS HTTPS
Server Specification
Evaluation Metrics
HTTPS connection handling performance
• Use small content size
• Stress on RSA computation
Latency distribution at different loads
• Test opportunistic offloading
Data transfer rate at various content size
32
HTTPS Connection Rate
33
29K
21K
11K
3.6K 0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
1024 bits 2048 bits
SSLShader
lighttpd
2.5x
6x
RSA Key Size
Connections / sec
CPU Usage Breakdown (RSA 1024)
34
Kernel NIC device driver,
2.32
SSLShader, 5.31
Libc , 9.88
IPP + libcrypto,
12.89
lighttpd, 4.9
others, 4.35
Kernel (Including
TCP/IP stack), 60.35
Current Bottleneck
Latency at Light Load
35
0102030405060708090
100
1 10 100 1000
CD
F (
%)
Latency (ms)
Similar latency at light load
Lighttpd at 1k connections / sec
SSLShader at 1k connections / sec
Latency at Heavy Load
36
Lower latency and higher throughput at heavy load
0
20
40
60
80
100
1 10 100 1000
CD
F (
%)
Latency (ms)
Lighttpd at 11k connections / sec
SSLShader at 29k connections / sec
Data Transfer Performance
37
0.0x
0.5x
1.0x
1.5x
2.0x
2.5x
4KB 16KB 64KB 256KB 1MB 4MB 16MB 64MB
Rel
ati
ve
Per
form
an
ce
Content Size
2.1x
0.87x
Lighttpd performance
Typical web content size is under 100KB
SSLShader: 13 Gbps
CONCLUSIONS
38
Summary
Cryptographic algorithms in GPU
• Fast RSA, AES, and SHA1
• Superior to high-end hardware accelerators
SSLShader
• Transparent integration
• Effective utilization of GPU for SSL processing
• Up to 6x connections / sec
• 13 Gbps throughput
39
Linux network stack performance
Copy overhead
QUESTIONS?
THANK YOU!
For more details
https://shader.kaist.edu/sslshader
40