GPU Accelerated Decoding of High Performance Error Correcting Codes
Andrew D. Copeland, Nicholas B. Chang, and Stephen Leung
MIT Lincoln Laboratory
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Introduction
  – Wireless Communications
  – Performance vs. Complexity
  – Candidate Implementations
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
• GPU Implementation
• Results
• Summary
Wireless Communications
Generic Communication System
(Figure: Data → Coding → Modulation → Channel → Demodulation → Decoding → Data. Channel impairments: multipath, noise, interference.)
Our focus: the decoding block.

Simple Error Correcting Code
01011 → Coding → 010110101101011 (the data transmitted three times)

High Performance Codes
• Allow for larger link distance and/or higher data rate
• Superior performance comes with a significant increase in complexity
Performance vs. Complexity
• Increased performance requires larger computational cost
• Off-the-shelf systems require 5 dB more power to close the link than a custom system
• The best custom design requires only 0.5 dB more power than the theoretical minimum
(Figure: performance vs. complexity of candidate codes, from commercial interests to custom designs — 4-, 8-, 16-, 32-, and 64-state STTC, GF(2) LDPC BICM-ID, and direct GF(256) LDPC. Our choice: direct GF(256) LDPC.)
Candidate Implementations
• Half second burst of data
  – Decode complexity: 37 TFLOP
  – Memory transferred during decode: 44 TB

                              PC          9x FPGA                ASIC                       4x GPU
Implementation time           -           1 person-year (est.)   1.5 person-years (est.)    ?
(beyond development)                                              + 1 year to fabricate
Performance (decode time)     9 days      5.1 minutes (est.)     0.5 s (if possible)        ?
Hardware cost                 ≈ $2,000    ≈ $90,000              ≈ $1 million               ≈ $5,000 (includes PC cost)
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
  – Decoding LDPC GF(256) Codes
  – Algorithmic Demands of LDPC Decoder
• NVIDIA GPU Architecture
• GPU Implementation
• Results
• Summary
Low Density Parity Check (LDPC) Codes
LDPC GF(256) Parity-Check Matrix
(Figure: a sparse NChecks × NSymbols parity-check matrix H with entries in GF(256), multiplying the codeword vector c to produce the syndrome.)

LDPC Codes
• Proposed by Gallager (1962); the original codes were binary
• Not used until the mid 1990s
• GF(256) codes: Davey and MacKay (1998)
  – Symbols in [0, 1, 2, …, 255]
• Parity check matrix defines a graph (Tanner graph)
  – Symbols connected to check nodes
• Has error state feedback
  – Syndrome = 0 (no errors)
  – Potential to re-transmit
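For reference, the syndrome check that the parity-check matrix and Tanner graph encode can be written as follows (our notation, with \oplus and \otimes denoting GF(256) addition and multiplication):

$$ s = Hc, \qquad s_i = \bigoplus_{j:\,H_{ij} \neq 0} H_{ij} \otimes c_j, \qquad s = 0 \iff c \text{ is a valid codeword}. $$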
Decoding LDPC GF(256) Codes
• Solved using the LDPC sum-product decoding algorithm
  – A belief propagation algorithm
• Iterative algorithm
  – Number of iterations depends on SNR
(Figure: Tanner graph with symbol and check nodes, and the processing chain — ML probabilities feed alternating symbol-update and check-update stages, followed by a hard decision that produces the decoded symbols and the codeword syndrome.)
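A host-side sketch of the iteration just described; the stage functions are illustrative placeholders for the decoder's actual routines, which are not shown in the slides:

    /* illustrative placeholders for the decoder's stages (not the actual API) */
    void symbol_update(float *Q, float *R);
    void check_update(float *Q, float *R);
    void hard_decision(const float *Q, unsigned char *c_hat);
    int  syndrome_is_zero(const unsigned char *c_hat);

    /* Q is assumed pre-loaded with the ML probabilities from the demodulator.
       Iterate until the syndrome clears or the iteration limit is reached;
       the number of iterations needed depends on SNR. */
    void decode_codeword(float *Q, float *R, unsigned char *c_hat, int max_iters)
    {
        for (int iter = 0; iter < max_iters; ++iter) {
            symbol_update(Q, R);           /* messages from symbol nodes to check nodes */
            check_update(Q, R);            /* messages from check nodes to symbol nodes */
            hard_decision(Q, c_hat);       /* most likely GF(256) value per symbol      */
            if (syndrome_is_zero(c_hat))   /* syndrome = 0: no errors detected          */
                break;                     /* error state feedback allows early exit    */
        }
    }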
Algorithmic Demands of LDPC Decoder
Computational complexity
• Decoding of the entire sequence: 37 TFLOP
  – Single frame: 922 MFLOP
  – Check update: 95% of total

Memory requirements
• Data moved between device memory and multiprocessors: 44.2 TB
  – Single codeword: 1.1 GB
• Data moved within the Walsh transform and permutations (part of the check update): 177 TB
  – Single codeword: 4.4 GB

The decoder requires architectures with high computational and memory bandwidths.
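These burst-level figures are consistent with the 40,000-codeword, half-second burst described in the experimental setup (backup slide); per-codeword values are rounded:

$$ 40{,}000 \times 922\ \text{MFLOP} \approx 37\ \text{TFLOP}, \qquad 40{,}000 \times 1.1\ \text{GB} \approx 44\ \text{TB}, \qquad 40{,}000 \times 4.4\ \text{GB} \approx 177\ \text{TB}. $$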
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
  – Dispatching Parallel Jobs
• GPU Implementation
• Results
• Summary
NVIDIA GPU Architecture
• GPU contains N multiprocessors
  – Each performs Single Instruction Multiple Thread (SIMT) operations
• Constant and texture caches speed up certain memory access patterns
• NVIDIA GTX 295 (2 GPUs)
  – N = 30 multiprocessors
  – M = 8 scalar processors per multiprocessor
  – Shared memory: 16 KB
  – Device memory: 896 MB
  – Up to 512 threads on each multiprocessor
  – Capable of 930 GFLOPS
  – Capable of 112 GB/sec
(Figure: GPU block diagram, from the NVIDIA CUDA Programming Guide Version 2.1.)
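The per-device resources listed above can be read back at run time with the CUDA runtime API; a minimal sketch (not part of the original decoder code):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);     /* a GTX 295 appears as two devices */
        for (int d = 0; d < num_devices; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("device %d: %s, %d multiprocessors, %zu KB shared/block, %zu MB global\n",
                   d, p.name, p.multiProcessorCount,
                   p.sharedMemPerBlock / 1024, p.totalGlobalMem / (1024 * 1024));
        }
        return 0;
    }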
Dispatching Parallel Jobs
• Program decomposed into blocks of threads
  – Thread blocks are executed on individual multiprocessors
  – Threads within blocks coordinate through shared memory
  – Threads optimized for coalesced memory access
(Figure: a grid of M thread blocks — Block 1, Block 2, Block 3, …, Block M.)
• Calling parallel jobs (kernels) within a C function is simple:

    grid.x = M;
    threads.x = N;
    kernel_call<<<grid, threads>>>(variables);

• Scheduling is transparent and managed by the GPU and API

Thread blocks execute independently of one another.
Threads allow faster computation and better use of memory bandwidth.
Threads work together using shared memory.
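A self-contained sketch of this launch pattern; the kernel here (scale) is a hypothetical stand-in rather than one of the decoder's kernels:

    #include <cuda_runtime.h>

    /* hypothetical kernel: each thread scales one element of x */
    __global__ void scale(float *x, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        x[i] = a * x[i];
    }

    /* launch M blocks of N threads over a device array of M*N floats */
    void run(float *d_x, float a, int M, int N)
    {
        dim3 grid, threads;
        grid.x = M;                        /* M thread blocks in the grid       */
        threads.x = N;                     /* N threads per block (SIMT groups) */
        scale<<<grid, threads>>>(d_x, a);  /* scheduling handled by GPU and API */
        cudaDeviceSynchronize();           /* wait for the kernel to finish     */
    }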
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
• GPU Implementation
  – Check Update Implementation
• Results
• Summary
GPU Implementation of LDPC Decoder
• Quad GPU workstation
  – 2 NVIDIA GTX 295s
  – 3.20 GHz Intel Core i7 965
  – 12 GB of 1330 MHz DDR3 memory
  – Costs $5,000
• Software written using NVIDIA's Compute Unified Device Architecture (CUDA)
• Replaced Matlab functions with CUDA MEX functions
• Utilized static variables (see the sketch below)
  – GPU memory allocated the first time the function is called
  – Constants used to decode many codewords only loaded once
• Codewords distributed across the 4 GPUs
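A minimal sketch of the static-variable pattern described above, using the standard MATLAB MEX entry point; the buffer names and sizes are illustrative, not the decoder's actual layout:

    #include "mex.h"
    #include <cuda_runtime.h>

    #define Q_BYTES (6000u * 256u * sizeof(float))   /* illustrative sizes only */
    #define R_BYTES (4000u * 256u * sizeof(float))

    static float *d_Q = NULL;   /* device buffers persist across calls to the MEX function */
    static float *d_R = NULL;

    static void cleanup(void)
    {
        cudaFree(d_Q);          /* released when MATLAB clears or unloads the MEX file */
        cudaFree(d_R);
    }

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        if (d_Q == NULL) {
            /* GPU memory is allocated only the first time the function is called;
               constants shared by many codewords would also be loaded here */
            cudaMalloc((void **)&d_Q, Q_BYTES);
            cudaMalloc((void **)&d_R, R_BYTES);
            mexAtExit(cleanup);
        }
        /* ... copy likelihoods from prhs[] to the device, launch decode kernels,
           and copy the decoded symbols back into plhs[0] (omitted) ... */
    }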
Check Update Implementation
(Figure: per-check processing chain — each incoming message passes through a permutation and a Walsh transform, the transformed messages are multiplied pointwise, and each outgoing message passes through an inverse Walsh transform and a permutation.)

• Thread blocks are assigned to each check independently of all other checks
• The messages (Q and R) are 256-element vectors
  – Each thread is assigned to an element of GF(256)
  – Coalesced memory reads
  – Threads coordinate using shared memory

    grid.x = num_checks;
    threads.x = 256;
    check_update<<<grid, threads>>>(Q,R);

The structure of the algorithm is exploitable on the GPU.
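A hedged sketch of the thread-block structure described above — one block per check, 256 threads (one per GF(256) element), with the Walsh transform performed in shared memory. The GF(256) permutations, the exclusion of each edge's own incoming message, and the real message layout are omitted, and EDGES_PER_CHECK is an illustrative constant:

    #define EDGES_PER_CHECK 6                      /* illustrative row weight only */

    __device__ void walsh_transform_256(float *v)
    {
        /* in-place length-256 Walsh-Hadamard transform on a shared-memory vector */
        int t = threadIdx.x;                       /* 0..255, one thread per element  */
        for (int h = 1; h < 256; h <<= 1) {
            __syncthreads();                       /* previous stage's writes visible */
            float a = v[t], b = v[t ^ h];          /* read butterfly pair             */
            __syncthreads();                       /* all reads done before writing   */
            v[t] = (t & h) ? (b - a) : (a + b);
        }
        __syncthreads();
    }

    __global__ void check_update(const float *Q, float *R)
    {
        __shared__ float msg[256];                 /* one transformed message at a time    */
        __shared__ float prod[256];                /* running product over the check edges */
        int check = blockIdx.x;                    /* one thread block per parity check    */
        int t     = threadIdx.x;                   /* one thread per GF(256) element       */

        prod[t] = 1.0f;
        for (int e = 0; e < EDGES_PER_CHECK; ++e) {
            /* coalesced read of one 256-element incoming message */
            msg[t] = Q[(check * EDGES_PER_CHECK + e) * 256 + t];
            walsh_transform_256(msg);              /* convolution becomes a pointwise product */
            prod[t] *= msg[t];
        }
        walsh_transform_256(prod);                 /* same butterfly inverts the transform  */
        R[check * 256 + t] = prod[t] / 256.0f;     /* scale; single output kept for brevity */
    }

Launched exactly as shown on the slide: grid.x = num_checks; threads.x = 256; check_update<<<grid, threads>>>(Q,R);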
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
• GPU Implementation
• Results
• Summary
Candidate Implementations
                              PC          9x FPGA                ASIC                       4x GPU
Implementation time           -           1 person-year (est.)   1.5 person-years (est.)    6 person-weeks
(beyond development)                                              + 1 year to fabricate
Performance (time to decode   9 days      5.1 minutes            0.5 s (if possible)        5.6 minutes
half second of data)
Hardware cost                 ≈ $2,000    ≈ $90,000              ≈ $1 million               ≈ $5,000 (includes PC cost)
Implementation Results
• Decoded test data in 5.6 minutes instead of 9 days
  – A 2336x speedup
• Hard coded version runs a fixed number of 15 iterations
  – Likely used on FPGA or ASIC versions
(Figure: decode time in minutes vs. SNR from -0.5 to 4 dB for the Matlab, GPU, and hard-coded implementations; the Matlab and GPU curves are marked at 15 iterations.)

The GPU implementation supports analysis, testing, and demonstration activities.
Summary
• GPU architecture can provide significant performance improvement with limited hardware cost and implementation effort
• GPU implementation is a good fit for algorithms with parallel steps that require systems with large computational and memory bandwidths
  – LDPC GF(256) decoding is a well-suited algorithm
• Implementation allows a reasonable algorithm development cycle
– New algorithms can be tested and demonstrated in minutes instead of days
Backup
Experimental Setup
Test System Parameters
• Data rate: 1.3 Gb/s
• Half second burst
  – 40,000 codewords (6000 symbols each)
• Parity check matrix: 4000 checks × 6000 symbols
  – Corresponds to a rate 1/3 code
• Performance numbers based on 15 iterations of the decoding algorithm
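As a consistency check (our arithmetic, assuming each GF(256) symbol carries 8 bits): a rate-1/3 codeword of 6000 symbols carries 2000 information symbols, so

$$ 2000 \times 8\ \text{bits} \times \frac{40{,}000\ \text{codewords}}{0.5\ \text{s}} = 1.28\ \text{Gb/s} \approx 1.3\ \text{Gb/s}. $$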