Download pdf - SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

SSLShader: Cheap SSL Acceleration with

Commodity Processors

Keon Jang+, Sangjin Han+, Seungyeop Han*,

Sue Moon+, and KyoungSoo Park+

KAIST+ and University of Washington*

1

Security of Paper Submission Websites

2

Network and Distributed System Security Symposium

Security Threats in the Internet

Public WiFi without encryption

• Easy target that requires almost no effort

Deep packet inspection by governments

• Used for censorship

• In the name of national security

NebuAd’s targeted advertisement

• Modify user’s Web traffic in the middle

3

Secure Sockets Layer (SSL)

A de-facto standard for secure communication

• Authentication, Confidentiality, Content integrity

4

Client Server TCP handshake

Encrypted data

Key exchange using public key algorithm

(e.g., RSA) Server identification

SSL Deployment Status

Most of Web-sites are not SSL-protected

• Less than 0.5%

• [NETCRAFT Survey Jan ‘09]

Why is SSL not ubiquitous?

• Small sites: lack of recognition, manageability, etc.

• Large sites: cost

• SSL requires lots of computation power

5

SSL Computation Overhead

Performance overhead (HTTPS vs. HTTP)

• Connection setup

• Data transfer

Good privacy is expensive

• More servers

• H/W SSL accelerators

Our suggestion:

• Offload SSL computation to GPU

6

22x

50x

SSL-accelerator leveraging GPU

• High-performance

• Cost-effective

SSL reverse proxy

• No modification on existing servers

SSLShader

7

SSLShader

Web Server

SMTP Server

POP3 Server

Plain TCP SSL-encrypted session

Our Contributions

GPU cryptography optimization

• The fastest RSA on GPU

• Superior to high-end hardware accelerators

• Low latency

SSLShader

• Complete system exploiting GPU for SSL processing

• Batch processing

• Pipelining

• Opportunistic offloading

• Scaling with multiple cores and NUMA nodes

8

CRYPTOGRAPHIC PROCESSING

WITH GPU

9

How GPU Differs From CPU?

Intel Xeon 5650 CPU:

6 cores

NVIDIA GTX580 GPU:

512 cores

Control

ALU

ALU

ALU

ALU

ALU ALU

Cache

ALU

10

62×109 870×109 < Instructions / sec

Core

Cache

http://i.haymarket.net.au/News/NVIDIA_Fermi_GTX480_Die_Shot.jpg

void VecAdd(

int *A, int *B, int *C, int N)

{

//iterate over N elements

for(int i = 0; i < N; i++)

C[i] = A[i] + B[i]

}

VecAdd(A, B, C, N);

__global__ void VecAdd(

int *A, int *B, int *C)

{

int i = threadIdx.x;

C[i] = A[i] + B[i]

}

//Launch N threads

VecAdd<<<1, N>>>(A, B, C);

Single Instruction Multiple Threads (SIMT)

11

GPU code CPU code

Example code: vector addition (C = A + B)

1/3지점 8분 10초

Parallelism in SSL Processing

12

Client 1

Client 2

Client N 1. Independent Sessions

SSL Record SSL Record SSL Record 2. Independent SSL Record

3. Parallelism in Cryptographic Operations

SSLShader

Our GPU Implementation

Choices of cipher-suite

Optimization of GPU algorithms • Exploiting massive parallel processing

• Parallelization of algorithms

• Batch processing

• Data copy overhead is significant

• Concurrent copy and execution

13

앞에랑 매핑이 되게-_- 그림을 가져와서 매핑이 되게 하는게 좋을듯

Client Server

Encryption: AES Message Authentication: SHA1

Key exchange: RSA

Basic RSA Operations

M: plain-text, C: cipher-text

(e, n): public key, (d, n): private key

Encryption:

C = Me mod n

Decryption:

M = Cd mod n

14

1024/2048 bits integer (300 ~ 600 digits)

Small number: 3, 17, 65537

Decryption at the server side is the bottleneck

Exponentiation many multiplications

Server-side

Server

Client

Breakdown of Large Integer Multiplication

15

Schoolbook multiplication

649 X 627 ---------

63 280 4200 180 800

12000 5400 32000

+ 360000 ---------

406923

Accumulation is difficult to parallelize due to

“overlapping digits”

“carry propagation”

3 x 3 = 9 multiplications 9 addition of 6-digits integers

O(s) Parallel Multiplications

16

Example of 649 x 627 = 406,923

2s steps

1 or 2 steps (s – 1 worst case)

s = # of words in a large integer (E.g., 1024-bits = 16 x 64 bits word)

More Optimizations on RSA

Common optimizations for RSA • Chinese Remainder Theorem (CRT) • Montgomery Multiplication • Constant Length Non-zero Window (CLNW)

Parallelization of serial algorithms • Faster Calculation of M×n • Interleaving of T + M×n • Mixed-Radix Conversion Offloading

GPU specific optimizations • Warp Utilization • Loop Unrolling • Elimination of Divergence • Avoiding Bank Conflicts • Instruction-Level Optimization

17

4054 6620 13281 9891 10146 6627 21041

0 10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000

Throughput (operations/s)

Initial (1)

(2)

(3) Warp

Utilization

(4)

(5) (6) 64-bit words (7) Avoiding bank

conflicts

(8) Instruction-level

Optimization CLNW (9) Post-exponentiation offloading

Read our paper for details

Parallelism in SSL Processing

18

Client 1

Client 2

Client N 1. Independent Sessions

SSL Record SSL Record SSL Record 2. Independent SSL Record

3. Parallelism in Cryptographic Operations

SSLShader

Batch Processing

GTX580 Throughput w/o Batching

19

0.08x 0.02x

1.57x

0.02x 0.0x

0.5x

1.0x

1.5x

2.0x

RSA AES-ENC AES-DEC SHA1

Throughput relative to a “single CPU core”

Intel Nehalem single core (2.66Ghz)

22.1x

6.8x 7.7x 9.4x

0x

5x

10x

15x

20x

25x

RSA AES-ENC AES-DEC SHA1

Throughput relative to a "single CPU core"

GTX580 Throughput w/ Batching

20

Difference: ratio of computation to copy

Batch size: 32~4096 depending on the algorithm

Copy Overhead in GPU Cryptography

GPU processing works by

• Data copy: CPU GPU

• Execution in GPU

• Data copy: GPU -> CPU

21

AES-ENC (Gbps)

AES-DEC (Gbps)

HMAC-SHA1 (Gbps)

GTX580 w/ copy 8.8 10 31

GTX580 no copy 21.8 33 124

0

20

40

60

80

100

120

140

AES-ENC AES-DEC HMAC-SHA1

Th

rou

gh

pu

t (G

bp

s)

↑2.4x ↑3.3x

↑4x

w/o copy

w/ copy

w/o copy

Hiding Copy Overhead

22

Synchronous Execution

Pipelining

Processing time : 3t

t

Amortized processing time : t

…

…

…

Data copy: CPU -> GPU

Execution in GPU

Data copy: GPU -> CPU

Data copy: CPU -> GPU

Execution in GPU

Data copy: GPU -> CPU

0x

5x

10x

15x

20x

AES-ENC AES-DEC SHA1

Throughput relative to a single core

GTX580 Performance w/ Pipelining

23

↑ 36% ↑ 36%

↑ 51% w/o copy

synchronous

pipelining

9x 9x

14x

Summary of GPU Cryptography

Performance gain from GTX580

• GPU performs as fast as 9 ~ 28 CPU cores


Lessons • Batch processing is essential to fully utilize a GPU

• AES and SHA1 are bottlenecked by data copy • PCIe 3.0

• Integrated GPU and CPU

24

RSA-1024 (ops/sec)

AES-ENC (Gbps)

AES-DEC (Gbps)

SHA1 (Gbps)

GTX580 91.9K 11.5 12.5 47.1

CPU core 3.3K 1.3 1.3 3.3

16분 30초

BUILDING SSL-PROXY THAT

LEVERAGES GPU

25

SSLShader Design Goals

Use existing application without modification

• SSL reverse proxy

Effectively leverage GPU

• Batching cryptographic operations

• Load balancing between CPU and GPU

Scale performance with architecture evolution

• Multi-core CPUs

• Multiple NUMA nodes

26

Batching Crypto Operations

Network workloads vary over time • Waiting for fixed batch size doesn’t work

27

Output queue

GPU

Input queue

CPU

GPU

SSL Stack

Batch size is dynamically adjusted to queue length

Balancing Load Between CPU and GPU

For small batch, CPU is faster than GPU • Opportunistic offloading

28

Output queue

GPU

Input queue

CPU processing

GPU processing when input queue length > threshold

GPU queue

CPU

Cryptographic operation Minimum Maximum

RSA (1024-bit) 16 512

AES Decryption 32 2048

AES Encryption 128 2048

HMAC-SHA1 128 2048

Input queue length > threshold

Scaling with Multiple Cores

29

Per-core worker threads • Network I/O, cryptographic operation

Sharing a GPU with multiple cores • More parallelism with larger batch size

Output queues

GPU

CPU

CPU

CPU

Input queues GPU

queue CPU

GPU

Core0

Core1

Core2

Scaling with NUMA systems

A process = worker threads + a GPU thread

• Separate process per NUMA node

• Minimizes data sharing across NUMA nodes

30

CPU0

IOH0

GPU0

RAM

NIC0

CPU1

IOH1

GPU1

RAM

NIC1

Node 0 Node 1

Evaluation

Experimental configurations

31

Model Spec Qty

CPU Intel X5650 2.66Ghz x 6 croes 2

GPU NVIDIA GTX580 1.5Ghz x 512 cores 2

NIC Intel X520-DA2 10GbE x 2 2

SSLShader+ Lighttpd

8 Clients

4x 10GbE

…

Server Server

Lighttpd

SSLShader

Lighttpd

OpenSSL

HTTP

Clients

GPU

Clients

HTTPS HTTPS

Server Specification

Evaluation Metrics

HTTPS connection handling performance

• Use small content size

• Stress on RSA computation

Latency distribution at different loads

• Test opportunistic offloading

Data transfer rate at various content size

32

HTTPS Connection Rate

33

29K

21K

11K

3.6K 0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

1024 bits 2048 bits

SSLShader

lighttpd

2.5x

6x

RSA Key Size

Connections / sec

CPU Usage Breakdown (RSA 1024)

34

Kernel NIC device driver,

2.32

SSLShader, 5.31

Libc , 9.88

IPP + libcrypto,

12.89

lighttpd, 4.9

others, 4.35

Kernel (Including

TCP/IP stack), 60.35

Current Bottleneck

Latency at Light Load

35

0102030405060708090

100

1 10 100 1000

CD

F (

%)

Latency (ms)

Similar latency at light load

Lighttpd at 1k connections / sec

SSLShader at 1k connections / sec

Latency at Heavy Load

36

Lower latency and higher throughput at heavy load

0

20

40

60

80

100

1 10 100 1000

CD

F (

%)

Latency (ms)

Lighttpd at 11k connections / sec

SSLShader at 29k connections / sec

Data Transfer Performance

37

0.0x

0.5x

1.0x

1.5x

2.0x

2.5x

4KB 16KB 64KB 256KB 1MB 4MB 16MB 64MB

Rel

ati

ve

Per

form

an

ce

Content Size

2.1x

0.87x

Lighttpd performance

Typical web content size is under 100KB

SSLShader: 13 Gbps

CONCLUSIONS

38

Summary

Cryptographic algorithms in GPU

• Fast RSA, AES, and SHA1


SSLShader

• Transparent integration

• Effective utilization of GPU for SSL processing

• Up to 6x connections / sec

• 13 Gbps throughput

39

Linux network stack performance

Copy overhead

QUESTIONS?

THANK YOU!

For more details

https://shader.kaist.edu/sslshader

40

http://shader.kaist.edu/sslshader