35
© 2005, it- instituto de telecomunicações. Todos os direitos reservados. From Low-architectural Expertise Up to High-throughput Non-binary LDPC Decoders: Optimization Guidelines using High-level Synthesis João Andrade 1 , Nithin George 2 , Kimon Karras 3 , David Novo 1 , Vitor Silva 1 , Paolo Ienne 2 , Gabriel Falcão 1 1 University of Coimbra, PT; 2 EPFL, CH; 3 Xilinx Research Labs, IE FPL 2015, London, UK, 1-4 Sept. 2015 0

From Low-architectural Expertise Up to High-throughput Non-binary

  • Upload
    vutuyen

  • View
    215

  • Download
    1

Embed Size (px)

Citation preview

Page 1: From Low-architectural Expertise Up to High-throughput Non-binary

©2005, it- institutode telecomunicações. Todososdireitos reservados.

From Low-architectural Expertise Up toHigh-throughput Non-binary LDPC Decoders:Optimization Guidelines using High-level Synthesis

João Andrade1, Nithin George2, Kimon Karras3, David Novo1,Vitor Silva1, Paolo Ienne2, Gabriel Falcão1

1 University of Coimbra, PT; 2 EPFL, CH; 3 Xilinx Research Labs, IE

FPL 2015, London, UK, 1-4 Sept. 20150

Page 2: From Low-architectural Expertise Up to High-throughput Non-binary

Outline

The Challenge

The ProblemNon-binary LDPC decoding

Decoding architectureHLS decoder design

Experimental Results

Conclusion

1 | FPL’15, London, UK

Page 3: From Low-architectural Expertise Up to High-throughput Non-binary

2 | FPL’15, London, UK

The Challenge

Page 4: From Low-architectural Expertise Up to High-throughput Non-binary

Design an efficient LDPC decoder fast

• RTL requires too specialized knowledge• Our background is GPU and not hardware

• Error-prone design space exploration (DSE)

• Extensive code refactoring for DSE

• High-level synthesis (HLS) has been around for years

• Fast time to market

3 | FPL’15, London, UK

Page 5: From Low-architectural Expertise Up to High-throughput Non-binary

Design an efficient LDPC decoder fast• Adjust DS decisions much faster wo/ extensive refactoring

• Bitwidth for a particular SNR operation point• Decoding schedule• Decoding algorithm

• C/C++ code base can be used with Vivado HLS• C/C++ supported• Cycle-accurate simulation after C-synthesis• Code annotations (#pragma) or Tcl commands

Why?• Power budgets of GPUs way above requirements

• Real-time operation is required• High decoding throughputs• Low latencies

4 | FPL’15, London, UK

Page 6: From Low-architectural Expertise Up to High-throughput Non-binary

5 | FPL’15, London, UK

The Problem

Page 7: From Low-architectural Expertise Up to High-throughput Non-binary

FEC on communication systems

• Belief propagation problem → LDPC codes decoding

• Non-binary LDPC can tackle

• Quantum-key distribution

• Erasure channel (burst)

• AWGN channel

• But have a very high (non-linear) numerical complexity

• Irregular data patterns and intensive access profile

6 | FPL’15, London, UK

Page 8: From Low-architectural Expertise Up to High-throughput Non-binary

Soft-Decoding AlgorithmsBelief-Propagation I

• LDPC decoding is a particular case of belief-propagation

α α αα2 1 α2 α 1 1 α2 1 1

αc1

c6c5c4c3c2c1

α2c1 c6c6α2c5c4 c5αc4αc2 c3 α2c3αc2

F F F F F F F F F F F F

mv(x)

mvc(x) mcv(x)

mcv(z)mvc(z)

perm

ute

deperm

ute

CN1 CN2 CN3

VN1 VN3 VN4 VN5 VN6

Walsh-Hadamard

Transform

m∗v(x)

VN2

dc = 4

dv = 2

• Messages circulate through a bipartite graph structure withcomputation applied at the node- and edge-level

7 | FPL’15, London, UK

Page 9: From Low-architectural Expertise Up to High-throughput Non-binary

Soft-Decoding AlgorithmsBelief-Propagation II

• The bipartite model can be employed to generalize otheralgorithms w/ the following constraints

• node level functions → must produce/consume data coherently

• edge level functions → produce/consume without restrictions

• By defining these kernels different algorithms can be defined

• CN and VN → Hadamard products

• Edges permute/depermute

• Edges apply the Fast Walsh-Hadamard Transform

8 | FPL’15, London, UK

Page 10: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderMapping the LDPC Tanner graph

• We attempt an isomorphic transformation of the Tanner graphH =

α 0 1 α 0 1α2 α 0 1 1 00 α α2 0 α2 1

αα2 1 α2 α 1 1 α2 1 1

αc1

c6c5c4c3c2c1

α2c1 c6c6α2c5c4 c5αc4αc2 c3 α2c3αc2

F F F F F F F F F F F F

mv(x)

mvc(x) mcv(x)

mcv(z)mvc(z)

perm

ute

deperm

ute

CN1

CN2

CN3

VN1

VN3

VN4

VN5

VN6

Walsh-Hadamard

Transform

m∗v(x)

VN2

α α

vnUpdate();

permute();

depermute();

fwht();

cnUpdate();

index_lut

• Therein, each node/edge-level kernel become their ownC-function and nodes/edges an iteration within a loop structure

9 | FPL’15, London, UK

Page 11: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderWhere to begin?

• There are several dimensions to non-binary LDPC decoding

(code related)• N VNs and M CNs to process

• Each VN connects to dv CNs

• Each CN connects to dc VNs

(Galois Field related)• 2m probabilities to compute per probability mass-function (pmf)

10 | FPL’15, London, UK

Page 12: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderHow to express computation?

• Suppose a GPU SIMT-architecture mindset//flat loop unsuitable for Vivado HLS optimizationsfor(int i = 0; i < edges*q*d_v; i++){

int e = i/(d_v*q); //get VN idint g = i%q; //get GF(q) elementint t = (i/q)%d_v; //get d_v element

computation();}

• What is it any different than this?//nested loop suitable for Vivado HLS optimizationsfor(int e = 0; e < edges; e++)

for(int g = 0; g < q; g++)for(int t = 0; t < d_v; t++)

computation();

• Optimizations are hardly picked up in the former

11 | FPL’15, London, UK

Page 13: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderLoop structures

CN->VNVN->CN

E: edges

GF: 2m

vn_proc

perm

ute

E: edges

E: edges

LogGF: m

fwht

cn_proc

deperm

ute

E:

G:

VN/CNW:

E:

G_read:

G_write:

G_read:

LOGGF:

G_compute:

fwht

E: edges

GF: 2m

Dc: dc

E: edges

GF_read: 2m

GF_write: 2m

G_read: 2m

LOGGF: m

G_compute:

2m

DRAM:

mcv

mvc

mv

iterate

pro

log

ue

ep

ilog

ue

iterate

l_mvc

l_mcv

l_mv

GF_read: 2m

GF_write: 2m

Dv: dv

GF_read: 2m

GF: 2m

GF_write: 2m

BRAM arrays

partitioned in Solutions IV-VII

E: edges

LogGF: m

GF_read: 2m

GF: 2m

GF_write: 2m

2 RW ports

available

• Loop trip counts• E: edges or

N×dv = M×dc

• GF: 2m

• LOGGF: m• Dv/Dc: dv /dc

• Local BRAM copiesare maintained

• Data streamsfrom DRAM

12 | FPL’15, London, UK

Page 14: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderLoop structures

//nested loop structure of vn_procE:for(int e = 0; e < edges; e++)

GF:for(int g = 0; g < q; g++)Dv:for(int t = 0; t < d_v; t++)

//computation follows

CN->VNVN->CN

E: edges

GF: 2m

vn_proc

perm

ute

E: edges

E: edges

LogGF: m

fwht

cn_proc

deperm

ute

E:

G:

VN/CNW:

E:

G_read:

G_write:

G_read:

LOGGF:

G_compute:

fwht

E: edges

GF: 2m

Dc: dc

E: edges

GF_read: 2m

GF_write: 2m

G_read: 2m

LOGGF: m

G_compute:

2m

DRAM:

mcv

mvc

mv

iterate

pro

log

ue

ep

ilog

ue

iterate

l_mvc

l_mcv

l_mv

GF_read: 2m

GF_write: 2m

Dv: dv

GF_read: 2m

GF: 2m

GF_write: 2m

BRAM arrays

partitioned in Solutions IV-VII

E: edges

LogGF: m

GF_read: 2m

GF: 2m

GF_write: 2m

2 RW ports

available

13 | FPL’15, London, UK

Page 15: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderLoop structures

E:for(int e = 0; e < limit; e++){GF_read:for(int g = 0; g < GF; g++)

//load data into temporary bufferGF_write:for(int g = 0; g < GF; g++)

//permute and store back to memory}

CN->VNVN->CN

E: edges

GF: 2m

vn_proc

perm

ute

E: edges

E: edges

LogGF: m

fwht

cn_proc

deperm

ute

E:

G:

VN/CNW:

E:

G_read:

G_write:

G_read:

LOGGF:

G_compute:

fwht

E: edges

GF: 2m

Dc: dc

E: edges

GF_read: 2m

GF_write: 2m

G_read: 2m

LOGGF: m

G_compute:

2m

DRAM:

mcv

mvc

mv

iterate

pro

log

ue

ep

ilog

ue

iterate

l_mvc

l_mcv

l_mv

GF_read: 2m

GF_write: 2m

Dv: dv

GF_read: 2m

GF: 2m

GF_write: 2m

BRAM arrays

partitioned in Solutions IV-VII

E: edges

LogGF: m

GF_read: 2m

GF: 2m

GF_write: 2m

2 RW ports

available

14 | FPL’15, London, UK

Page 16: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderLoop structures

E:for(int e = 0; e < edges; e++){G_read:for(int g = 0; g < q; g++){

//load data into temporary array}LogGF:for(int c=0;c<m;c++)

GF:for(int g = 0; g < q; g++)//perform Radix-2 computation

G_write:for(int g = 0; g < q; g++){//store data back to memory

}}

CN->VNVN->CN

E: edges

GF: 2m

vn_proc

perm

ute

E: edges

E: edges

LogGF: m

fwht

cn_proc

deperm

ute

E:

G:

VN/CNW:

E:

G_read:

G_write:

G_read:

LOGGF:

G_compute:

fwht

E: edges

GF: 2m

Dc: dc

E: edges

GF_read: 2m

GF_write: 2m

G_read: 2m

LOGGF: m

G_compute:

2m

DRAM:

mcv

mvc

mv

iterate

pro

log

ue

ep

ilog

ue

iterate

l_mvc

l_mcv

l_mv

GF_read: 2m

GF_write: 2m

Dv: dv

GF_read: 2m

GF: 2m

GF_write: 2m

BRAM arrays

partitioned in Solutions IV-VII

E: edges

LogGF: m

GF_read: 2m

GF: 2m

GF_write: 2m

2 RW ports

available

15 | FPL’15, London, UK

Page 17: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoder solutionsPartitioning

• Scheduling analysis after C-synthesis shows lack of mem. ports• BRAMs are instantiated with dual-port control• Further action is required

set_directive_resource -core RAM_T2P_BRAMset_directive_array_partition -type cyclic -factor 4 -dim 1

Partitioned arraysOriginal arrays

l_mv

l_mvc

l_mcv

l_mv_0

l_mvc_0

l_mv_1

l_mv_2

l_mv_3

l_mvc_1

l_mvc_2

l_mvc_3

l_mcv_0

l_mcv_1

l_mcv_2

l_mcv_30

1

2

3

4

5

.

.

.

0

1

2

3

4

5

.

.

.

0

1

2

3

4

5

.

.

.

0

4

8

.

.

1

5

9

.

.

2

6

10

.

.

3

7

11

.

.

0

4

8

.

.

1

5

9

.

.

2

6

10

.

.

3

7

11

.

.

0

4

8

.

.

1

5

9

.

.

2

6

10

.

.

3

7

11

.

.

2x2m

RW ports availablePartitioning

16 | FPL’15, London, UK

Page 18: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderOptimization solution I

E: edges

fwht

E: edges

LogGF: m

fwht

GF_read: 2m

GF_compute: 2m

GF_write: 2m

E: edges

E: edges

fwht

fwht

2x2m

RW ports available2 RW ports available

E: edges

fwht

Solution II Solution III Solution IV Solution V Solution VI

not parallel

high IIparallel

low IIE: edges

fwht

Solution VII

• Solution I, base version wo/ optimizations

17 | FPL’15, London, UK

Page 19: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderOptimization solution II

E: edges

fwht

E: edges

LogGF: m

fwht

GF_read: 2m

GF_compute: 2m

GF_write: 2m

E: edges

E: edges

fwht

fwht

2x2m

RW ports available2 RW ports available

E: edges

fwht

Solution II Solution III Solution IV Solution V Solution VI

not parallel

high IIparallel

low IIE: edges

fwht

Solution VII

• Solution II Full unroll of inner loops LOGGF and GFset_directive_unroll "*/LOGGF"set_directive_unroll "*/GF"

18 | FPL’15, London, UK

Page 20: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderOptimization solution III

E: edges

fwht

E: edges

LogGF: m

fwht

GF_read: 2m

GF_compute: 2m

GF_write: 2m

E: edges

E: edges

fwht

fwht

2x2m

RW ports available2 RW ports available

E: edges

fwht

Solution II Solution III Solution IV Solution V Solution VI

not parallel

high IIparallel

low IIE: edges

fwht

Solution VII

• Solution III, II+pipeline of outer loops E to II=1set_directive_pipeline "*/E" -II=1

19 | FPL’15, London, UK

Page 21: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderOptimization solution IV

E: edges

fwht

E: edges

LogGF: m

fwht

GF_read: 2m

GF_compute: 2m

GF_write: 2m

E: edges

E: edges

fwht

fwht

2x2m

RW ports available2 RW ports available

E: edges

fwht

Solution II Solution III Solution IV Solution V Solution VI

not parallel

high IIparallel

low IIE: edges

fwht

Solution VII

• Solution IV, I+cyclic partitioning of all BRAM arrays by afactor of 2m

set_directive_array_partition -type cyclic -factor 2^m -dim 1 "decoder" l_buffer

20 | FPL’15, London, UK

Page 22: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderOptimization solution V

E: edges

fwht

E: edges

LogGF: m

fwht

GF_read: 2m

GF_compute: 2m

GF_write: 2m

E: edges

E: edges

fwht

fwht

2x2m

RW ports available2 RW ports available

E: edges

fwht

Solution II Solution III Solution IV Solution V Solution VI

not parallel

high IIparallel

low IIE: edges

fwht

Solution VII

• Solution V, IV+full unroll of inner loops LOGGF and GF

21 | FPL’15, London, UK

Page 23: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderOptimization solution VI

E: edges

fwht

E: edges

LogGF: m

fwht

GF_read: 2m

GF_compute: 2m

GF_write: 2m

E: edges

E: edges

fwht

fwht

2x2m

RW ports available2 RW ports available

E: edges

fwht

Solution II Solution III Solution IV Solution V Solution VI

not parallel

high IIparallel

low IIE: edges

fwht

Solution VII

• Solution VI, III+IV (unroll, pipeline, partition)

22 | FPL’15, London, UK

Page 24: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoderOptimization solution VII

E: edges

fwht

E: edges

LogGF: m

fwht

GF_read: 2m

GF_compute: 2m

GF_write: 2m

E: edges

E: edges

fwht

fwht

2x2m

RW ports available2 RW ports available

E: edges

fwht

Solution II Solution III Solution IV Solution V Solution VI

not parallel

high IIparallel

low IIE: edges

fwht

Solution VII

• Solution VII, IV+pipeline of inner loops LOGGF GF to II=1set_directive_pipeline "*/LOGGF" -II=1set_directive_pipeline "*/GF" -II=1

23 | FPL’15, London, UK

Page 25: From Low-architectural Expertise Up to High-throughput Non-binary

HLS decoder

• Define fixed-point computation suitable for a target SNR/BER#include<ap_cint.h>//data is stored in llr type variables//computation is performed in llr_ type variables//use floating-pointtypedef float llr;typedef float llr_;//use Q8.7 fixed-pointtypedef ap_fixed< 8, 1, AP_RND_INF, SC_SAT > llr;typedef ap_fixed< 16, 3, AP_RND_INF, SC_SAT > llr_;

24 | FPL’15, London, UK

Page 26: From Low-architectural Expertise Up to High-throughput Non-binary

Experimental resultsClock frequency and latency

OptimizationsI II III IV V VI VII

La

ten

cy

[c

yc

les

]

10 3

10 4

10 5

10 6

0

50

100

150

200

250

OptimizationsI II III IV V VI VII

La

ten

cy

[c

yc

les

]

10 4

10 5

10 6

10 7

Fre

qu

en

cy

[M

Hz]

0

50

100

150

200

250

OptimizationsI II III IV V VI VII

La

ten

cy

[c

yc

les

]

10 4

10 5

10 6

10 7

Fre

qu

en

cy

[M

Hz]

0

50

100

150

200

250

• Best clock frequency of operation obtained for Solution VI• Lowest latency always achieved for Solution VI• Solution III is a good compromise between VI and theremaining Solutions

• Solution VII replication of pipelined loops is a poor designchoice

• Most alike to OpenCL strategy

25 | FPL’15, London, UK

Page 27: From Low-architectural Expertise Up to High-throughput Non-binary

Experimental resultsFPGA utilization

I

IV

II

V

III

VI

I

IV

IIV

III

VI

I

IV

II

III

V

VI

partition

unroll

unroll

partition

pipeline

pipeline

partition

higher utilization

latency unchanged

low

er

late

ncy

utiliz

atio

n u

nch

an

ged

• Under 20% LUTutil.

• Multiple decoderinstantiation

• What about pin,clock and meminterface?

26 | FPL’15, London, UK

Page 28: From Low-architectural Expertise Up to High-throughput Non-binary

HLS host platform

• Originally RTL-project and can be Tcl’d automatically• VC709 board target(693K CLBs)

Board DRAM 0 DRAM 1

Memory Interface

...

AXI4

Interconnect

AXI4

Interconnect

BRAMs KBRAMs 1

HLS IP

Core 1

HLS IP

Core K

FPGA

core 0

core 1

core 2

AXI4 I.

Mem. Int.

BRAMs 2

HLS IP

Core 2

• DMA via PCIe → 3KLUTs

• Two DRAM banks controlled(MIG)

• Two AXI interconnect can beconfigured for up to 16 HLScores

• Data streams from the DRAMbank 0 and to bank 1

• Each HLS core performscomputation to its own“BRAM” space

27 | FPL’15, London, UK

Page 29: From Low-architectural Expertise Up to High-throughput Non-binary

Experimental resultsPareto exploration

LUTs [%]0 10 20 30 40 50 60 70 80 90

La

ten

cy

[7

s]

10 0

10 1

10 2

10 3

10 4

10 5

GF(4) GF(8) GF(16)Non-optimal points

Final decoder design w/ DRAM controllersand several accelerators instantiated

Single acceleratorw/o DRAM controllers

ParetoOptimalPoints

• The host HLS arch and the multiple decoders elevatethe LUT utilization to ∼80%

• {14, 5, 3} decoders for GF(22), GF(23) and GF(24)

28 | FPL’15, London, UK

Page 30: From Low-architectural Expertise Up to High-throughput Non-binary

Experimental resultsComparison with RTL decoders

Decoder m K LUT [%] FF BRAM DSP Thr. [Mbit/s] Clk [MHz]

This work

2 1 14 7 0.5 0.5 1.17 25014 80 35 6 6 14.54 219

3 1 21 9 0.9 0.9 0.95 2506 81 34 5 5 4.81 210

4 1 30 13 2 2 0.66 2163 73 32 5 5 1.85 201

Zhang TCS–I’11 4

1

48 (Slices) 41 – 9.3 –

Emden ISTC’102 33.16

1004 – 13.228 1.56

Spagnol SiPS’09 3 13 3 1 – ≤4.7 99Boutillon TCS–I’13 6 19 6 1 – 2.95 61Andrade ICASSP’14 8 85 (LEs) 62 7 1.1 163Scheiber ICECS’13 1 14 (Slices) 21 – 13.4 122

∗ Differences in technology nodes and FPGA are not considered.

29 | FPL’15, London, UK

Page 31: From Low-architectural Expertise Up to High-throughput Non-binary

Comparison with previous HLS works

• Maxeler decoders allow for ∼1 Gbit/s decoding throughputsPratas, GlobalSIP’13Andrade, ASAP’14

• OpenCL (Altera) decoder peaks at hundreds of Kbit/sAndrade, ICASSP’14

FPGA GF(23)

GF(23)

(floating-point)Util.[%] I IV V VI I IV V VILUTs 0.64 1.13 5.20 10.4 0.65 1.48 11.7 17.3FF 0.28 0.53 2.52 3.94 0.29 0.51 3.30 7.56DSP 0.06 0.06 0.89 0.89 0.06 0.06 1.78 1.78BRAM 0.44 0.82 0.82 0.82 0.78 1.63 2.72 1.63

• Vivado HLS decoder reaches dozens of MBit/sScheiber, ICECS’13

30 | FPL’15, London, UK

Page 32: From Low-architectural Expertise Up to High-throughput Non-binary

Summary

• Code writing-style counts• Language is the same, model is not

• Design optimizations come hand-in-hand with the codewriting-style

• Clearly defined bounds are better• Optimizations can be double-edge swords

• We can achieve same ballpark figures of RTL• Higher utilization

• Outlook• When will platforms be automatically generated?

• When will the C programming model merge?

31 | FPL’15, London, UK

Page 33: From Low-architectural Expertise Up to High-throughput Non-binary

32 | FPL’15, London, UK

*(b++)=*(a++)*c;

Qué?

b[i]=a[i]*c;

Ah, si!

Page 34: From Low-architectural Expertise Up to High-throughput Non-binary

Thank you. Questions arewelcome.

33 | FPL’15, London, UK

Page 35: From Low-architectural Expertise Up to High-throughput Non-binary

What tool to use?

What HLS tool?• How much are we willing to lose in control?

• A lot? → OpenCL (C-based)• Dataflow? → MaxCompiler (JAVA)• Some? → LegUp, Vivado HLS (C/C++, SystemC)• None? → Stick to RTL (Verilog, VHDL)

• Vivado HLS allows fine control over• Loop scheduling → unroll, pipeline, merge, flatten• AXI4 blocks → master/slave memory and stream interfaces• Arbitrary bitwidth → fixed-point types supported• No clock, no external memory interfaces, and no pin I/O layout

34 | FPL’15, London, UK