Coarse and Fine Grain Programmable Overlay Architectures for FPGAs

Alex Brant
Advisor: Guy Lemieux
University of British Columbia


Outline

- Motivation
- Contributions
- Prior Work
- ZUMA FPGA Overlay
- CARBON-Razor Overlay
- Summary

Motivation - 1: FPGA Overlays

- FPGA designs that can be further programmed by the user
- What are the benefits?
  - Ease of use (simpler languages, tools, etc.)
  - Optimized for particular problem domains
  - Open access to architecture & CAD
  - User-configured logic added to fixed FPGA bitstream
  - Dynamic reconfiguration on any device
  - Portability between vendors and devices

Motivation - 2: Fine Grain Overlay – ZUMA

- FPGA-like architecture
  - Compatible with VTR CAD tools
  - “Virtual” FPGA for portability of designs
  - Open source for research and applications
- Implements fine grain part of MALIBU architecture
- Generic implementation has high area overhead
  - Overcome by utilizing low level FPGA resources, implementing more efficient structures

Motivation - 3: Coarse Grain Overlay – CARBON

- Array of time-multiplexed ALUs
  - Fast compile
  - High density
  - Efficient mapping of word oriented circuits
- Implements coarse grain part of MALIBU
- Time-multiplexing limits overall performance
  - Performance gained using overclocking with error tolerance (CARBON-Razor)

Contributions

- Area efficient implementation of fine grain routing and logic with LUTRAMs
- Area efficient 2-stage local routing network and configuration controller
- Extension of Razor error tolerance from pipelined processors to 2D processing arrays
- Design of an overclockable coarse grain FPGA overlay with in-circuit error correction

Publications

- ZUMA: An Open FPGA Overlay Architecture. Alexander Brant and Guy G.F. Lemieux (FCCM 2012)
- Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency using Intentional Clock Skew. Alexander Brant, Ameer Abdelhadi, Aaron Severance, Guy G.F. Lemieux (FPT 2012)
- CARBON-Razor: An Error-Tolerant Coarse Grain FPGA (in preparation)


FPGA Architecture

- Implements any logic function

MALIBU Architecture

- Hybrid coarse/fine grain FPGA
- Time-multiplexed ALU (CG) combined with FPGA cluster
- CG passes data to neighbors through memories

MALIBU Hybrid FPGA

- CGs are run on a fast system clock (e.g. > 1 GHz)
- System clock / Schedule length = User clock rate
- Advantages:
  - Greater density from time-multiplexing
  - Ability to trade off between area and speed
  - Compiles up to 300x faster than a normal FPGA
  - Better performance for word-oriented circuits
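The clock relation on the MALIBU slide can be illustrated with a few lines of Python. This is a minimal sketch: the function name and the example clock and schedule lengths are illustrative assumptions, not figures from the talk.

```python
def user_clock_mhz(system_clock_mhz: float, schedule_length: int) -> float:
    """User clock rate = system clock / schedule length (time-multiplexed CG)."""
    if schedule_length <= 0:
        raise ValueError("schedule_length must be a positive number of timeslots")
    return system_clock_mhz / schedule_length


# Hypothetical example: a 1000 MHz system clock with different schedule lengths.
# A longer schedule packs more operations per ALU (less area) but lowers the user clock.
for slots in (10, 25, 50):
    print(slots, "timeslots ->", user_clock_mhz(1000.0, slots), "MHz user clock")
```

The loop shows the area/speed trade-off listed above: doubling the schedule length halves the user clock rate.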

Razor Timing Error Tolerance

- Works with feed-forward pipeline circuits
- Detects timing errors by capturing data a second time with a delayed clock
- Tolerates errors by stalling pipeline one cycle


Razor Timing Error Example

- Data captured in main FF
- A fraction of a cycle later, data captured by shadow latch
- Main FF and shadow latch are compared
  - If different, shadow data is loaded into the main FF and the pipeline is stalled
  - If not, pipelining proceeds normally
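The capture-and-compare steps above can be sketched as a small behavioral model. This is an illustration only, not the Razor circuit: the class name, the timing values, and the assumption that a late signal leaves the previous value in the register (rather than a metastable one) are all simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class RazorStage:
    """One pipeline register: a main FF plus a shadow latch on a delayed clock."""
    clock_period: float   # time from launch to the main FF clock edge
    shadow_delay: float   # extra delay before the shadow latch samples
    main: int = 0         # value currently held by the main FF

    def capture(self, new_value: int, arrival_time: float) -> bool:
        """Clock in new_value, which settles arrival_time after launch.

        Returns True when a timing error is detected (the pipeline must stall).
        """
        old_value = self.main
        # Assumption: data arriving after the edge leaves the old value in place.
        main_sample = new_value if arrival_time <= self.clock_period else old_value
        # The shadow latch samples later, so it tolerates a later arrival.
        shadow_deadline = self.clock_period + self.shadow_delay
        shadow_sample = new_value if arrival_time <= shadow_deadline else old_value
        error = main_sample != shadow_sample
        # On an error the shadow data is loaded into the main FF.
        self.main = shadow_sample if error else main_sample
        return error


stage = RazorStage(clock_period=5.0, shadow_delay=1.0)
print(stage.capture(7, arrival_time=4.5))  # on time: no error
print(stage.capture(9, arrival_time=5.6))  # late for main FF, caught by shadow latch
```

In this model, data that arrives later than `clock_period + shadow_delay` is missed by both registers and goes undetected, which is why the shadow delay has to be large enough for the overclocked path.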


ZUMA Overlay

- Island style FPGA architecture, implemented on an FPGA
- Initially implemented in generic Verilog
  - High area overhead, 125+ host LUTs for each ZUMA LUT (eLUT)
- Area efficiency improvements:
  - Implementation of routing and logic with FPGA LUTRAMs
  - Design of efficient 2-stage local interconnect

ZUMA Layout

[Figure: one tile of the ZUMA architecture. Labels: K-LUT, FF, two-stage crossbar network, S-block, input block, logic cluster.]

Details - LUTRAM

[Figure: reprogrammable LUTRAM in Xilinx and Altera devices. Labels: we, data in, wr addr, rd addr, k, decoder, config bits (2^k), data out.]

Details – LUTRAM Multiplexer

[Figure: two 6-LUT multiplexers with output y. One 6-LUT is configured as a 4-to-1 MUX (data inputs d0-d3, select inputs s0-s1); the other is configured as a 6-to-1 MUX in RAM mode (data inputs d0-d5).]

- LUTRAM can implement larger MUXs than a normal LUT, need no extra configuration memory
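One way to see why the RAM-mode LUT needs no separate select inputs: the select is stored in the LUT contents, so all six address pins can carry data. The sketch below models that idea in Python; the function names are hypothetical and the model ignores how the contents are written at configuration time.

```python
def mux_truth_table(select: int, k: int = 6) -> list[int]:
    """Contents of a k-input LUTRAM that forwards address input `select` to the output."""
    if not 0 <= select < k:
        raise ValueError("select must name one of the k LUT inputs")
    # Entry `addr` holds bit `select` of the address, so the LUT passes that input through.
    return [(addr >> select) & 1 for addr in range(2 ** k)]


def lut_read(table: list[int], inputs: list[int]) -> int:
    """Read the LUT: the input bits (d0 first) form the read address."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]


# Program the LUT to select d4, then drive all six data inputs.
table = mux_truth_table(select=4)
print(lut_read(table, [0, 1, 0, 0, 1, 0]))  # follows d4
```

Changing the selected input means rewriting the 64 stored bits, which is what the overlay's configuration step does; a plain 6-LUT used as a MUX would instead spend two of its six pins on select signals held in extra configuration flip-flops.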

Details – Local Routing Crossbar

[Figure: two-stage (I+N) x (k*N) crossbar used in the ZUMA logic cluster (reduced two-stage network feeding the ZUMA eLUTs). Labels: I+N inputs; P k x k LUTRAM crossbars, with P = (I+N)/k; k P x N LUTRAM crossbars; N*k outputs; N k-input LUTs.]
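A rough way to compare the two-stage network with a full crossbar is to count crosspoints from the figure labels. The sketch below reads the labels as P crossbars of size k x k followed by k crossbars of size P x N; that reading, the function names, and the example values of I, N and k are assumptions, and crosspoint count is only a proxy for host LUT usage.

```python
def full_crossbar_crosspoints(i: int, n: int, k: int) -> int:
    """Crosspoints in a single (I+N) x (k*N) crossbar."""
    return (i + n) * (k * n)


def two_stage_crosspoints(i: int, n: int, k: int) -> int:
    """Crosspoints in P (k x k) crossbars followed by k (P x N) crossbars."""
    if (i + n) % k != 0:
        raise ValueError("I+N must be a multiple of k so that P = (I+N)/k is an integer")
    p = (i + n) // k
    first_stage = p * k * k    # P crossbars, each k x k
    second_stage = k * p * n   # k crossbars, each P x N
    return first_stage + second_stage


# Hypothetical cluster: I = 28 inputs, N = 8 LUTs, k = 6 inputs per LUT.
print(full_crossbar_crosspoints(28, 8, 6))  # 1728
print(two_stage_crosspoints(28, 8, 6))      # 504
```

Under this reading the two-stage count simplifies to (I+N) * (k+N), compared with (I+N) * k * N for the full crossbar.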

Results

- Both Xilinx and Altera versions implemented
- Our generic version is 125-150 LUTs per eLUT
- Area overhead as low as 40 Host LUTs per eLUT with improvements
- Compared to previous work (vFPGA) on 4-LUT host, overhead reduced 3x with same parameters


CARBON Overlay

- FPGA implementation of MALIBU CG
- Modifications to support FPGA block RAMs
- Critical Path is Memory to ALU to Memory

CARBON-Razor

- Razor is applied to the CARBON overlay
- Error tolerance on memory-to-memory critical path
- How to do it:
  - Shadow registers are applied to CARBON memories
  - CARBON schedule gets 1-3 extra timeslots for error recovery
  - Stall propagation is extended from 1D pipeline (Razor) to 2D array (CARBON)

CARBON-Razor Memory

- Shadow register paired with RAM
- Stratix memory mode allows read-back of previously written data

2D Error Propagation

- Can’t propagate errors to entire chip fast enough
- We can propagate an error one tile per cycle
- Error propagation logic can then combine multiple errors into one stall region

2D Error Propagation Example

- Error at tile at cycle 0
- Each cycle, stall propagates to nearest neighbors

[Figure: tile grid built up over successive slides. Each tile is labeled with the cycle (0 to 4) at which the stall reaches it, spreading outward from the tile labeled 0.]

Stall Propagation Logic

- When an error is detected at a CG:
  - Instruction schedule stalls
  - Memories in CG load from shadow register
  - Any writes from neighbor captured in shadow register
- Next cycle:
  - Schedule resumes
  - Neighbor’s write performed from shadow register
  - 4 neighbors stall, unless they stalled last cycle
- Stall region continues in expanding diamond shaped wave
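The expanding diamond can be reproduced with a short simulation. This is a sketch of the propagation rule only, for a single error: the grid size, function names, and the simplification that a tile stalls once per wave (standing in for the "unless they stalled last cycle" check) are assumptions made for illustration.

```python
def stall_wave(rows: int, cols: int, error_tile: tuple[int, int], cycles: int) -> dict:
    """Return {tile: cycle at which the tile stalls} for one error at cycle 0."""
    stalled_at = {error_tile: 0}
    frontier = {error_tile}
    for cycle in range(1, cycles + 1):
        next_frontier = set()
        for r, c in frontier:
            # Each stalled tile tells its 4 nearest neighbors to stall next cycle.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (r + dr, c + dc)
                in_grid = 0 <= nb[0] < rows and 0 <= nb[1] < cols
                # Tiles that already stalled in this wave are skipped.
                if in_grid and nb not in stalled_at:
                    next_frontier.add(nb)
        for nb in next_frontier:
            stalled_at[nb] = cycle
        frontier = next_frontier
    return stalled_at


def render(stalled_at: dict, rows: int, cols: int) -> list[str]:
    """One text row per grid row; '.' marks tiles the wave has not reached."""
    return [" ".join(str(stalled_at.get((r, c), ".")) for c in range(cols))
            for r in range(rows)]


for line in render(stall_wave(5, 5, (2, 2), 4), 5, 5):
    print(line)
# 4 3 2 3 4
# 3 2 1 2 3
# 2 1 0 1 2
# 3 2 1 2 3
# 4 3 2 3 4
```

Each tile stalls at a cycle equal to its Manhattan distance from the error, which is the diamond-shaped wave described above.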

CARBON Schedule Extension

- We add 1-3 cycles of slack to the schedule
  - Allows margin of safety
- Speedup determined by difference in FMAX and schedule length
- If no hard deadline is needed (e.g. when used as a compute accelerator), the average extension of the schedule can be used to find the speedup

\[ \text{Speedup} = \frac{F_{\text{MAX-Razor}} \times SL_{\text{Base}}}{F_{\text{MAX-Base}} \times SL_{\text{Razor}}} \]
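The speedup ratio can be evaluated directly. In this sketch the clock frequencies are hypothetical placeholders, and the Razor schedule length is assumed to be the base schedule length plus the extra recovery cycles; neither comes from the measured results.

```python
def razor_speedup(fmax_base_mhz: float, fmax_razor_mhz: float,
                  sl_base: int, extra_cycles: float) -> float:
    """Speedup = (FMAX-Razor * SLBase) / (FMAX-Base * SLRazor).

    Assumption: SLRazor = SLBase + extra_cycles. With no hard deadline,
    extra_cycles can be the average (possibly fractional) schedule extension.
    """
    sl_razor = sl_base + extra_cycles
    return (fmax_razor_mhz * sl_base) / (fmax_base_mhz * sl_razor)


# Hypothetical clocks: 200 MHz baseline, 240 MHz overclocked with Razor,
# schedule length 24 with 2 extra recovery timeslots.
s = razor_speedup(200.0, 240.0, sl_base=24, extra_cycles=2)
print(f"speedup = {s:.3f}")  # about 1.108, i.e. roughly 11% faster
```

The example shows why short schedules gain less: the same number of extra timeslots is a larger fraction of the schedule, so more of the clock gain is given back.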

Results

- Performance compared between CARBON and CARBON-Razor for 4 benchmarks
- Maximum performance found by pushing clock speed and shadow register delay
- Average increases to 14% with no hard deadline

Benchmark    SL   Extra Cycles   Speedup
Random Ops   24   2              11%
Wang         28   1              6%
Mean(256)    67   2              20%
PR           29   1              3%
Average                          13%

Contributions

- Area efficient implementation of FPGA routing and logic with LUTRAMs
- Area efficient 2-stage local routing network and configuration controller
- Extension of Razor error tolerance from pipelined processors to 2D processing arrays
- Design of an overclockable coarse grain FPGA overlay with in-circuit error correction

Summary

- Fine Grain Overlay – ZUMA
  - FPGA-like architecture, compatible with VTR CAD tools
  - High area overhead implementing fine grain structures
    - Overcome by utilizing FPGA resources, implementing alternate structures
    - Area reduced to 40 host LUTs per eLUT, 3x improvement
- Coarse Grain Overlay – CARBON
  - Fast compile, efficient mapping of word oriented circuits
  - Time-multiplexing decreases overall performance
    - Performance gained using overclocking with error tolerance
    - Speedup of 13% on average compared to baseline design


Thank you

ZUMA Config Controller

[Figure: configuration controller. Labels: Bitstream In (ROM, JTAG), Begin Config, 2^k bit counter, Count, Overflow, Shift Chain (FFs with D, Q, we), Tile (data, addr, we), LUTRAM, data[0], data[1].]

LUTRAM Crossbar

[Figure: LUTRAM used as a crossbar. Labels: LUTRAM as a 2^m x n Memory (we, rd addr, wr addr, data in, data out, widths m and n); n x m Crossbar (data in, data out).]

CARBON-Razor Timing

- Shadow register latches correct data if delay is sufficient

CARBON-Razor Stall Logic

CARBON-Razor Test

[Figure: test setup. Labels: Dynamic PLL (freq., phase, enable; Ø+Δ), System Clock, Razor Clock, Random Vectors, Output Vectors, Error Count, Nios II/f, MALIBU–Razor.]