GPU Accelerated Decoding of High Performance Error Correcting Codes
Andrew D. Copeland, Nicholas B. Chang, and Stephen Leung
MIT Lincoln Laboratory
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Introduction
  – Wireless Communications
  – Performance vs. Complexity
  – Candidate Implementations
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
• GPU Implementation
• Results
• Summary
Wireless Communications
Generic Communication System
(Figure: Data → Coding → Modulation → Channel → Demodulation → Decoding → Data. Channel impairments: multipath, noise, interference.)
Our focus: the decoding block.

Simple Error Correcting Code
01011 → Coding → 010110101101011 (the data transmitted three times)

High Performance Codes
• Allow for larger link distance and/or higher data rate
• Superior performance comes with a significant increase in complexity
Performance vs. Complexity
• Increased performance requires larger computational cost
• Off-the-shelf systems require 5 dB more power to close the link than a custom system
• The best custom design requires only 0.5 dB more power than the theoretical minimum
(Figure: performance vs. complexity of candidate codes, from commercial interests to custom designs — 4-, 8-, 16-, 32-, and 64-state STTC, GF(2) LDPC BICM-ID, and direct GF(256) LDPC. Our choice: direct GF(256) LDPC.)
Candidate Implementations
• Half second burst of data
  – Decode complexity: 37 TFLOP
  – Memory transferred during decode: 44 TB

                              PC          9x FPGA                ASIC                       4x GPU
Implementation time           -           1 person-year (est.)   1.5 person-years (est.)    ?
(beyond development)                                              + 1 year to fabricate
Performance (decode time)     9 days      5.1 minutes (est.)     0.5 s (if possible)        ?
Hardware cost                 ≈ $2,000    ≈ $90,000              ≈ $1 million               ≈ $5,000 (includes PC cost)
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
  – Decoding LDPC GF(256) Codes
  – Algorithmic Demands of LDPC Decoder
• NVIDIA GPU Architecture
• GPU Implementation
• Results
• Summary
Low Density Parity Check (LDPC) Codes
LDPC GF(256) Parity-Check Matrix
(Figure: a sparse NChecks × NSymbols parity-check matrix H with entries in GF(256), multiplying the codeword vector c to produce the syndrome.)

LDPC Codes
• Proposed by Gallager (1962); the original codes were binary
• Not used until the mid 1990s
• GF(256) codes: Davey and MacKay (1998)
  – Symbols in [0, 1, 2, …, 255]
• Parity check matrix defines a graph (Tanner graph)
  – Symbols connected to check nodes
• Has error state feedback
  – Syndrome = 0 (no errors)
  – Potential to re-transmit
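For reference, the syndrome check that the parity-check matrix and Tanner graph encode can be written as follows (our notation, with \oplus and \otimes denoting GF(256) addition and multiplication):

$$ s = Hc, \qquad s_i = \bigoplus_{j:\,H_{ij} \neq 0} H_{ij} \otimes c_j, \qquad s = 0 \iff c \text{ is a valid codeword}. $$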
Decoding LDPC GF(256) Codes
• Solved using the LDPC sum-product decoding algorithm
  – A belief propagation algorithm
• Iterative algorithm
  – Number of iterations depends on SNR
(Figure: Tanner graph with symbol and check nodes, and the processing chain — ML probabilities feed alternating symbol-update and check-update stages, followed by a hard decision that produces the decoded symbols and the codeword syndrome.)
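A host-side sketch of the iteration just described; the stage functions are illustrative placeholders for the decoder's actual routines, which are not shown in the slides:

    /* illustrative placeholders for the decoder's stages (not the actual API) */
    void symbol_update(float *Q, float *R);
    void check_update(float *Q, float *R);
    void hard_decision(const float *Q, unsigned char *c_hat);
    int  syndrome_is_zero(const unsigned char *c_hat);

    /* Q is assumed pre-loaded with the ML probabilities from the demodulator.
       Iterate until the syndrome clears or the iteration limit is reached;
       the number of iterations needed depends on SNR. */
    void decode_codeword(float *Q, float *R, unsigned char *c_hat, int max_iters)
    {
        for (int iter = 0; iter < max_iters; ++iter) {
            symbol_update(Q, R);           /* messages from symbol nodes to check nodes */
            check_update(Q, R);            /* messages from check nodes to symbol nodes */
            hard_decision(Q, c_hat);       /* most likely GF(256) value per symbol      */
            if (syndrome_is_zero(c_hat))   /* syndrome = 0: no errors detected          */
                break;                     /* error state feedback allows early exit    */
        }
    }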
Algorithmic Demands of LDPC Decoder
Computational complexity
• Decoding of the entire sequence: 37 TFLOP
  – Single frame: 922 MFLOP
  – Check update: 95% of total

Memory requirements
• Data moved between device memory and multiprocessors: 44.2 TB
  – Single codeword: 1.1 GB
• Data moved within the Walsh transform and permutations (part of the check update): 177 TB
  – Single codeword: 4.4 GB

The decoder requires architectures with high computational and memory bandwidths.
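These burst-level figures are consistent with the 40,000-codeword, half-second burst described in the experimental setup (backup slide); per-codeword values are rounded:

$$ 40{,}000 \times 922\ \text{MFLOP} \approx 37\ \text{TFLOP}, \qquad 40{,}000 \times 1.1\ \text{GB} \approx 44\ \text{TB}, \qquad 40{,}000 \times 4.4\ \text{GB} \approx 177\ \text{TB}. $$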
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
  – Dispatching Parallel Jobs
• GPU Implementation
• Results
• Summary
NVIDIA GPU Architecture
• GPU contains N multiprocessors
  – Each performs Single Instruction Multiple Thread (SIMT) operations
• Constant and texture caches speed up certain memory access patterns
• NVIDIA GTX 295 (2 GPUs)
  – N = 30 multiprocessors
  – M = 8 scalar processors per multiprocessor
  – Shared memory: 16 KB
  – Device memory: 896 MB
  – Up to 512 threads on each multiprocessor
  – Capable of 930 GFLOPS
  – Capable of 112 GB/sec
(Figure: GPU block diagram, from the NVIDIA CUDA Programming Guide Version 2.1.)
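The per-device resources listed above can be read back at run time with the CUDA runtime API; a minimal sketch (not part of the original decoder code):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);     /* a GTX 295 appears as two devices */
        for (int d = 0; d < num_devices; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("device %d: %s, %d multiprocessors, %zu KB shared/block, %zu MB global\n",
                   d, p.name, p.multiProcessorCount,
                   p.sharedMemPerBlock / 1024, p.totalGlobalMem / (1024 * 1024));
        }
        return 0;
    }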
Dispatching Parallel Jobs
• Program decomposed into blocks of threads
  – Thread blocks are executed on individual multiprocessors
  – Threads within blocks coordinate through shared memory
  – Threads optimized for coalesced memory access
(Figure: a grid of M thread blocks — Block 1, Block 2, Block 3, …, Block M.)
• Calling parallel jobs (kernels) within a C function is simple:

    grid.x = M;
    threads.x = N;
    kernel_call<<<grid, threads>>>(variables);

• Scheduling is transparent and managed by the GPU and API

Thread blocks execute independently of one another.
Threads allow faster computation and better use of memory bandwidth.
Threads work together using shared memory.
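A self-contained sketch of this launch pattern; the kernel here (scale) is a hypothetical stand-in rather than one of the decoder's kernels:

    #include <cuda_runtime.h>

    /* hypothetical kernel: each thread scales one element of x */
    __global__ void scale(float *x, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        x[i] = a * x[i];
    }

    /* launch M blocks of N threads over a device array of M*N floats */
    void run(float *d_x, float a, int M, int N)
    {
        dim3 grid, threads;
        grid.x = M;                        /* M thread blocks in the grid       */
        threads.x = N;                     /* N threads per block (SIMT groups) */
        scale<<<grid, threads>>>(d_x, a);  /* scheduling handled by GPU and API */
        cudaDeviceSynchronize();           /* wait for the kernel to finish     */
    }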
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
• GPU Implementation
  – Check Update Implementation
• Results
• Summary
GPU Implementation of LDPC Decoder
• Quad GPU workstation
  – 2 NVIDIA GTX 295s
  – 3.20 GHz Intel Core i7 965
  – 12 GB of 1330 MHz DDR3 memory
  – Costs $5,000
• Software written using NVIDIA's Compute Unified Device Architecture (CUDA)
• Replaced Matlab functions with CUDA MEX functions
• Utilized static variables (see the sketch below)
  – GPU memory allocated the first time the function is called
  – Constants used to decode many codewords only loaded once
• Codewords distributed across the 4 GPUs
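A minimal sketch of the static-variable pattern described above, using the standard MATLAB MEX entry point; the buffer names and sizes are illustrative, not the decoder's actual layout:

    #include "mex.h"
    #include <cuda_runtime.h>

    #define Q_BYTES (6000u * 256u * sizeof(float))   /* illustrative sizes only */
    #define R_BYTES (4000u * 256u * sizeof(float))

    static float *d_Q = NULL;   /* device buffers persist across calls to the MEX function */
    static float *d_R = NULL;

    static void cleanup(void)
    {
        cudaFree(d_Q);          /* released when MATLAB clears or unloads the MEX file */
        cudaFree(d_R);
    }

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        if (d_Q == NULL) {
            /* GPU memory is allocated only the first time the function is called;
               constants shared by many codewords would also be loaded here */
            cudaMalloc((void **)&d_Q, Q_BYTES);
            cudaMalloc((void **)&d_R, R_BYTES);
            mexAtExit(cleanup);
        }
        /* ... copy likelihoods from prhs[] to the device, launch decode kernels,
           and copy the decoded symbols back into plhs[0] (omitted) ... */
    }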
Check Update Implementation
(Figure: per-check processing chain — each incoming message passes through a permutation and a Walsh transform, the transformed messages are multiplied pointwise, and each outgoing message passes through an inverse Walsh transform and a permutation.)

• Thread blocks are assigned to each check independently of all other checks
• The messages (Q and R) are 256-element vectors
  – Each thread is assigned to an element of GF(256)
  – Coalesced memory reads
  – Threads coordinate using shared memory

    grid.x = num_checks;
    threads.x = 256;
    check_update<<<grid, threads>>>(Q,R);

The structure of the algorithm is exploitable on the GPU.
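A hedged sketch of the thread-block structure described above — one block per check, 256 threads (one per GF(256) element), with the Walsh transform performed in shared memory. The GF(256) permutations, the exclusion of each edge's own incoming message, and the real message layout are omitted, and EDGES_PER_CHECK is an illustrative constant:

    #define EDGES_PER_CHECK 6                      /* illustrative row weight only */

    __device__ void walsh_transform_256(float *v)
    {
        /* in-place length-256 Walsh-Hadamard transform on a shared-memory vector */
        int t = threadIdx.x;                       /* 0..255, one thread per element  */
        for (int h = 1; h < 256; h <<= 1) {
            __syncthreads();                       /* previous stage's writes visible */
            float a = v[t], b = v[t ^ h];          /* read butterfly pair             */
            __syncthreads();                       /* all reads done before writing   */
            v[t] = (t & h) ? (b - a) : (a + b);
        }
        __syncthreads();
    }

    __global__ void check_update(const float *Q, float *R)
    {
        __shared__ float msg[256];                 /* one transformed message at a time    */
        __shared__ float prod[256];                /* running product over the check edges */
        int check = blockIdx.x;                    /* one thread block per parity check    */
        int t     = threadIdx.x;                   /* one thread per GF(256) element       */

        prod[t] = 1.0f;
        for (int e = 0; e < EDGES_PER_CHECK; ++e) {
            /* coalesced read of one 256-element incoming message */
            msg[t] = Q[(check * EDGES_PER_CHECK + e) * 256 + t];
            walsh_transform_256(msg);              /* convolution becomes a pointwise product */
            prod[t] *= msg[t];
        }
        walsh_transform_256(prod);                 /* same butterfly inverts the transform  */
        R[check * 256 + t] = prod[t] / 256.0f;     /* scale; single output kept for brevity */
    }

Launched exactly as shown on the slide: grid.x = num_checks; threads.x = 256; check_update<<<grid, threads>>>(Q,R);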
Outline
• Introduction
• Low Density Parity Check (LDPC) Codes
• NVIDIA GPU Architecture
• GPU Implementation
• Results
• Summary
Candidate Implementations
                              PC          9x FPGA                ASIC                       4x GPU
Implementation time           -           1 person-year (est.)   1.5 person-years (est.)    6 person-weeks
(beyond development)                                              + 1 year to fabricate
Performance (time to decode   9 days      5.1 minutes            0.5 s (if possible)        5.6 minutes
half second of data)
Hardware cost                 ≈ $2,000    ≈ $90,000              ≈ $1 million               ≈ $5,000 (includes PC cost)
Implementation Results
• Decoded test data in 5.6 minutes instead of 9 days
  – A 2336x speedup
• Hard coded version runs a fixed number of 15 iterations
  – Likely used on FPGA or ASIC versions
(Figure: decode time in minutes vs. SNR from -0.5 to 4 dB for the Matlab, GPU, and hard-coded implementations; the Matlab and GPU curves are marked at 15 iterations.)

The GPU implementation supports analysis, testing, and demonstration activities.
Summary
• GPU architecture can provide significant performance improvement with limited hardware cost and implementation effort
• GPU implementation is a good fit for algorithms with parallel steps that require systems with large computational and memory bandwidths
  – LDPC GF(256) decoding is a well-suited algorithm
• Implementation allows a reasonable algorithm development cycle
– New algorithms can be tested and demonstrated in minutes instead of days
Backup
Experimental Setup
Test System Parameters
• Data rate: 1.3 Gb/s
• Half second burst
  – 40,000 codewords (6000 symbols each)
• Parity check matrix: 4000 checks × 6000 symbols
  – Corresponds to a rate 1/3 code
• Performance numbers based on 15 iterations of the decoding algorithm
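As a consistency check (our arithmetic, assuming each GF(256) symbol carries 8 bits): a rate-1/3 codeword of 6000 symbols carries 2000 information symbols, so

$$ 2000 \times 8\ \text{bits} \times \frac{40{,}000\ \text{codewords}}{0.5\ \text{s}} = 1.28\ \text{Gb/s} \approx 1.3\ \text{Gb/s}. $$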