HPEC using FPGAs

HPEC using FPGAs: Challenges and Benefits

Dr. Aravind Dasu, Assistant Professor

Electrical & Computer Engineering


Utah State University

Cache Valley, 90 miles north of Salt Lake City

David G. Sant Engineering Innovation Building

Agenda

On-board computing for spacecraft

A primer on FPGAs (5 slides)

HPEC using FPGAs (26 slides)
  The Polymorphic Systolic Array Framework
  Improving productivity
  Enabling real-time and responsive reconfiguration

Future technologies for FPGAs

Acknowledgements

On-board Computing

Civilian and military space missions are getting more complex
  Need to support several types of data from several types of sensors

Missions will require the spacecraft computer to be more responsive
  Need for in-situ data processing (signal processing)
  Not just compression, but also data analysis, decision making, etc.

Power budget and form factor of the spacecraft computer are extremely tight

State of the art: RadHard microprocessor from BAE Systems, or a RISC processor?
  An aging workhorse; time for a major upgrade

So, what do we upgrade to?

Commodity microprocessors (Cell, GPU, many/multi-core)
  Very powerful
  Blow out the power budget
  RadHard parts need to be custom ordered

Commodity DSP chips
  Good as long as you stick to just one chip
  RadHard parts can be custom ordered

Commodity reconfigurable chips: FPGAs (field programmable gate arrays)
  Can perform like a custom silicon chip
  Best performance/power ratios
  RadHard parts already available, with a steady roadmap from Xilinx

Programming perspective

Optimistic viewpoint

Microprocessors: frozen pizza
DSP chips: take ‘n’ bake
FPGAs: raw ingredients

Quick Primer on FPGAs

Mixture of blocks on a die
  Some dedicated: DSP (MAC units), PPC (optional), RAM
  Some programmable: Look Up Tables (LUTs), gazillions of network switches

Hidden special circuit
  ICAP (internal configuration access port)

Simple View of Programming an FPGA

An FPGA is essentially a vast set of SRAM cells waiting to be loaded with 0s and 1s to mimic Boolean logic

[Figure label: NMOS transistor]

All computations are assumed to be based on Boolean logic. So:
  Problem-solving concept => algorithms
  Algorithms => discrete set of simple tasks (add/multiply…)
  Simple tasks => a set of Boolean functions talking to each other
  Boolean function => simple manipulation of 1 and 0 bits

Each bit is stored in a small memory cell (SRAM)

Programming an FPGA

Each Look Up Table (LUT) has a unique mailing address
  16 bits go into each LUT

Each routing switch has a unique mailing address
  One bit for each switch

The executable for an FPGA is a sequence of bits that have to be delivered precisely to each LUT and switch box
  This binary/executable is called the “Configuration Bitstream”, or simply the “Bitstream”
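A minimal sketch of the LUT idea, assuming a 4-input LUT (which is why 16 configuration bits are needed: one per input combination). The names make_lut4, and4 and parity4 are made up for illustration; this models the LUT as a 16-entry truth table and is not any vendor's actual primitive.

```python
def make_lut4(config_bits):
    """Return a 4-input Boolean function defined by 16 configuration bits."""
    if len(config_bits) != 16 or any(b not in (0, 1) for b in config_bits):
        raise ValueError("a 4-input LUT needs exactly 16 bits, each 0 or 1")
    table = tuple(config_bits)  # stands in for the 16 SRAM cells

    def lut(a, b, c, d):
        # The four inputs together form the address of one SRAM cell.
        address = (d << 3) | (c << 2) | (b << 1) | a
        return table[address]

    return lut


# "Program" one LUT as a 4-input AND: only address 15 (all inputs 1) holds a 1.
and4 = make_lut4([0] * 15 + [1])

# "Program" another LUT as a parity (XOR) function of its four inputs.
parity4 = make_lut4([bin(i).count("1") & 1 for i in range(16)])
```

The same hardware cell becomes a different Boolean function purely by loading different bits, which is what the bitstream does for every LUT on the chip.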

Programming an FPGA

Programming the FPGA is like having a mailman deliver bits to each address correctly
  Slow process

But a bitstream is slightly more complex
  Each FPGA is like a country (has a unique code)
  Before entering the chip, a bitstream has to undergo security clearance (CRC, or cyclic redundancy check)
  Port of entry = ICAP
  FPGA addresses are hierarchical (state, county, city, suburb, house address)
  The term used for encoding all this is the “Frame Address”
  All this address stuff is overhead
  The actual useful stuff is inside the mail envelope
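The mail analogy can be sketched as a toy packet format. Everything here is an assumption made for illustration: the 8-bit address fields, the header layout, the made-up device code, and the use of zlib.crc32 as a stand-in checksum. It is not the real configuration bitstream format of any FPGA.

```python
import struct
import zlib


def pack_frame_address(block, row, column, minor):
    """Pack a hierarchical address into one 32-bit word (hypothetical layout)."""
    if any(not 0 <= field <= 0xFF for field in (block, row, column, minor)):
        raise ValueError("each address field must fit in 8 bits")
    return (block << 24) | (row << 16) | (column << 8) | minor


def build_packet(device_id, frame_address, payload):
    """Envelope = device code + frame address (overhead), payload, then a CRC."""
    body = struct.pack(">II", device_id, frame_address) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def accept_packet(packet, expected_device_id):
    """Port-of-entry check: return (frame_address, payload), or None if rejected."""
    body, (crc,) = packet[:-4], struct.unpack(">I", packet[-4:])
    if zlib.crc32(body) != crc:
        return None  # failed "security clearance"
    device_id, frame_address = struct.unpack(">II", body[:8])
    if device_id != expected_device_id:
        return None  # addressed to a different "country"
    return frame_address, body[8:]


DEVICE_ID = 0x1234ABCD  # made-up device code
address = pack_frame_address(block=0, row=1, column=7, minor=3)
packet = build_packet(DEVICE_ID, address, b"\x0f\xf0")
```

In this toy packet, 12 of the 14 bytes are overhead (device code, address, CRC) and only 2 bytes are the useful contents of the envelope.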

So what does a real configured/programmed FPGA look like?

Before programming: a nice clean plate. Empty LUTs, switches…

After programming: a messy plate of spaghetti. Configured LUTs, switches… All those green things are wires that have been set up to carry data between LUTs, FFs, etc.

High Performance Embedded Computing (HPEC) using FPGAs

Signal processing algorithms
  Wildly useful and hence widely used
  Computationally quite parallel and amenable to pipelining
  Proven to be accelerable by systolic array designs on FPGAs

The Good of FPGAs:
  FPGA vendors claim orders-of-magnitude performance advantage over DSP chips (www.xilinx.com, www.altera.com)
  They can be reconfigured partially and dynamically

The Bad (no, the Ugly):
  Productivity is the biggest barrier
  The number of signal processing folks willing to adopt FPGAs is small and stagnant
  Partial dynamic reconfiguration is very slow compared to processing speeds


Elaborating the Good of FPGAs: Extreme DSP computing

Elaborating the Good of FPGAs: Partial Dynamic Reconfiguration

At some point in time…

Abruptly, say we need to quickly increase parallelism support for application α (from 4 circuits to 5)

At the cost of taking away parallelism support for the other application, β

Because we did not have enough space on the chip to support high levels of parallelism for both applications, or

There was a power budget we couldn’t satisfy

Can we dynamically reconfigure the chip without disturbing the execution of either application?

And do it fast enough? Remember, programming the FPGA is a very, very, very slow process RELATIVE to the execution speeds of applications

[Figure: animation of one FPGA. It starts with four parallel processing circuits for application α and seven parallel processing circuits for application β. One β circuit is then removed, leaving six, and a fifth α circuit is configured in the freed region.]

Productivity

It’s a funny thing in the FPGA world

FPGA programmers are essentially VLSI design guys

They don’t buy $5K parts to get average performance

Every clock cycle is precious

Every LUT/FF/MAC/BRAM is precious

They don’t adopt new programming languages in a hurry

They love to have full control over every operation

Productivity, so what does it mean?

The designer wants an entire system on an FPGA modeled, performance predicted, designed, implemented, debugged, verified, with guaranteed timing closure, low power, high throughput…

Done really, really fast, just like software

And then wants to make some minor changes and do it quickly all over again, just like software…

Why can’t new designs be compiled, loaded onto FPGAs and tested super fast?

Need to look at the traditional design flow:
1. Hardware-software partition (quick)
2. Create macro- and micro-architectures for the hardware portion (a month, two months…)
3. Write bug-free VHDL/Verilog code for the architectures (a few months)
4. Synthesize, translate, map, place and route (5 to 15 hours)
5. Simulate
   If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip. Test again.
   If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro-architecture change, go back to step 2
8. Good luck trying to finish your project on time and on budget
9. This will still not get you a dynamically reconfigurable design

One way to Improve Productivity

Stick to the traditional design flow as much as possible
  FPGA users are once bitten, twice shy
  Very conservative and believe in the existing flow

But introduce structure into the flow, i.e. physical structure and macro-architecture structure

Make Partial Dynamic Reconfiguration (PDR) almost automatic
  FPGA designers are not conversant with PDR designs

Augmented Design Flow: Exclusively for Signal Processing Algorithms

Hardware-software partitioning (just a concept, and specific to an application)

Structured macro-architecture via floor planning
  Generic structure applicable to many algorithms

Structured micro-architecture design
  Project and schedule the data-flow model of the signal processing kernel onto things called Sockets of the macro-architecture
  Well-understood process

Embed dynamic reconfiguration capability
  New technology
  Works in tandem with the macro-architecture

Code, synthesize…

Test on chip

Structured Macro-architecture

Some important terms/elements:
  Socket: A physical region on the FPGA chip reserved by the designer to be loaded/configured with a PE. Also called a Partial Reconfiguration Region (PRR)
  Switch Box: A circuit that makes the array of Sockets re-partition-able
  PE (Processing Element): A circuit/bitstream that implements a signal processing kernel’s systolic array data-flow functionality. To activate a socket, a PE must be loaded into it

Socket/PRR: Under the Hood

Yellow box: a socket/PRR. It contains BRAMs, MACs and LUTs/FFs (purple and blue/green/black stuff)

If you want to dynamically reconfigure the parallelism of systolic arrays on an FPGA: all PRRs must be created with identical resources of MACs, BRAMs, LUTs and FFs.

Physical fabric of Virtex 4 SX 35 FPGA

Switch Box: Stuff that makes the Array of Sockets Re-partition-able

Simple circuit
Need to set mux select lines and FIFO controls
Resides in the static region on the FPGA

Change SB connections to change the partitioning of sockets/PRRs between the systolic array kernels’ nodes

OK, time to port the Macro-architecture Framework onto the Chip

Virtex 4 SX 35

Static region (luminescent green stuff)
  Microprocessor
  Switch boxes
  Cache
  Controller

PRRs/Sockets (white boxes)
  To be filled with systolic array Processing Elements

What really happened when we tried it

Now to the Micro-architecture… First, Hardware-Software Partitioning

Example: Extended Kalman Filter (EKF), a critical navigation algorithm and a nasty signal processing kernel.
Everything with rounded edges is a task that can change based on the physics of the problem, so put all of it in software (MicroBlaze).
Everything else is consistent, so put it in hardware (PolySAF, the Polymorphic Systolic Array Framework).

Designing/Deriving the Processing Element: Example EKF

Uses the Faddeev algorithm to compute the Schur complement
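For readers unfamiliar with the Faddeev algorithm, here is a minimal software sketch, not the systolic array implementation. It assumes the usual formulation: stack the blocks as [[A, B], [-C, D]], then use row operations to annihilate -C; what is left in the lower-right block is D + C A^-1 B. The function name faddeev and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return D + C * inverse(A) * B by Gaussian elimination on [[A, B], [-C, D]]."""
    n = len(A)
    top = [[Fraction(x) for x in row_a + row_b] for row_a, row_b in zip(A, B)]
    bottom = [[Fraction(-x) for x in row_c] + [Fraction(x) for x in row_d]
              for row_c, row_d in zip(C, D)]
    M = top + bottom

    for k in range(n):
        # Pick a non-zero pivot from the top block only (rows of [A, B]).
        pivot = next((r for r in range(k, n) if M[r][k] != 0), None)
        if pivot is None:
            raise ValueError("A is singular")
        M[k], M[pivot] = M[pivot], M[k]
        # Eliminate column k from every row below the pivot, including
        # the bottom block; this is what annihilates -C.
        for r in range(k + 1, len(M)):
            factor = M[r][k] / M[k][k]
            if factor != 0:
                M[r] = [x - factor * y for x, y in zip(M[r], M[k])]

    # The lower-right block now holds D + C * inverse(A) * B.
    return [row[n:] for row in M[n:]]


result = faddeev(A=[[2, 0], [0, 4]], B=[[1], [2]], C=[[3, 5]], D=[[7]])
```

Because every step is the same multiply-and-subtract on neighbouring rows, the computation is regular enough to map onto an array of identical processing elements.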


One of the many possible ways

Port

Code, Synthesize, … Optimize

Port: code, synthesize, translate, map, place and route
  For one Socket/PRR (just a few days’ worth of work)

Move nets around to meet timing: manually pick up a wire in this small bowl of spaghetti of wires, and move it around
  A nuisance of a task, but necessary
  But you need to do it in only one PRR (just a few hours’ worth of work)

Copy the locally optimized bitstream/circuit of the one PRR to all PRRs
  Automatically obtain global timing closure for the PolySAF
  If the microprocessor and cache are retained for multiple designs, then global timing closure for the whole chip is also automatically gifted to you

Have we answered the Productivity problem? Time to Grade the Approach

Need to look at the traditional design flow:
1. Hardware-software partition (quick)
2. Create macro- and micro-architectures for the hardware portion (a month, two months…)
   Applicable to a wide range of signal processing algorithms
3. Write bug-free VHDL/Verilog code for the architectures (a few months)
   Reuse most of the macro structure and code only for one PRR
4. Synthesize, translate, map, place and route (5 to 15 hours)
   Do for only one PRR
5. Simulate
   If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip. Test again.
   If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro-architecture change, go back to step 3
8. Good luck trying to finish your project on time and on budget

Want the details, the math, the algorithms, etc.?

Read this paper:
A. Sudarsanam, R. Barnes, A. Dasu, J. Carver, and R. Kallam, “Dynamically Reconfigurable Systolic Array Accelerators: A case study with EKF and DWT Algorithms,” IET/IEE Computers & Digital Techniques, Vol. 4, Issue 1, Jan. 2010.

Author preprint available online from the Reconfigurable Computing Group: www.usu.edu/rcg

Now, onto Partial Dynamic Reconfiguration in the PolySAF

3 nodes EKF, 2 nodes DWT
Detach socket: 2 nodes EKF, 2 nodes DWT
Reconfigure, reset new PRR, re-attach: 2 nodes EKF, 3 nodes DWT

DWT: discrete wavelet transform, the kernel used in JPEG 2000 image compression
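The detach, reconfigure, reset, re-attach sequence can be sketched as a small state model. This is only bookkeeping in software, under the assumption of five sockets in a row; the class SocketArray and its methods are hypothetical names, and no real switch box or ICAP is being driven.

```python
class SocketArray:
    """Toy model of sockets (PRRs) shared between systolic array kernels."""

    def __init__(self, kernels):
        self.kernels = list(kernels)          # kernel loaded in each socket
        self.attached = [True] * len(self.kernels)
        self.needs_reset = [False] * len(self.kernels)

    def nodes(self, kernel):
        """Count attached sockets currently serving the given kernel."""
        return sum(1 for k, a in zip(self.kernels, self.attached)
                   if a and k == kernel)

    def detach(self, index):
        # Switch box isolates the socket; the other nodes keep running.
        self.attached[index] = False

    def reconfigure(self, index, kernel):
        if self.attached[index]:
            raise RuntimeError("detach the socket before reconfiguring it")
        self.kernels[index] = kernel
        self.needs_reset[index] = True

    def reset(self, index):
        self.needs_reset[index] = False

    def reattach(self, index):
        if self.needs_reset[index]:
            raise RuntimeError("reset the new PE before re-attaching it")
        self.attached[index] = True


array = SocketArray(["EKF", "EKF", "EKF", "DWT", "DWT"])
array.detach(2)
after_detach = (array.nodes("EKF"), array.nodes("DWT"))
array.reconfigure(2, "DWT")
array.reset(2)
array.reattach(2)
after_reattach = (array.nodes("EKF"), array.nodes("DWT"))
```

The two snapshots correspond to the middle and final states on the slide: (2 EKF, 2 DWT) while the socket is detached, then (2 EKF, 3 DWT) once it is re-attached.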

How to Physically Reconfigure a PRR? Known Methods

Comparison of all known options

Best known technique: from Microsoft Research Labs (2008), the eMIPS project
Too slow, too expensive (hogs valuable on-chip BRAMs)

Embedding Dynamic Reconfiguration into the System

Active Bitstream (PRR) to PRR: Hardware Circuit

[Figure: block diagram of the FPGA showing the ARC, the ICAP and ICAP wrapper (with snoop), a source PRR holding the active bitstream, and destination PRRs.]

Accelerated Relocation Circuit (ARC)

Manipulate frame addresses
  FAR is the Frame Address Register

Lots of unnecessary overhead can be avoided
  No need for CRC processing
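The frame-address manipulation idea can be sketched in software as follows. This is a loose illustration only: the (row, column, minor) address fields and the function relocate are assumptions made for the example, and they do not reproduce the actual FAR layout or the ARC hardware.

```python
from typing import NamedTuple


class FrameAddress(NamedTuple):
    """Hypothetical hierarchical frame address."""
    row: int
    column: int
    minor: int  # frame index within a column; unchanged by relocation


def relocate(frames, src_origin, dst_origin):
    """Re-address frames read from a source PRR so they land in a destination PRR.

    frames: list of (FrameAddress, payload) pairs.
    src_origin, dst_origin: (row, column) of each PRR's first frame.
    Payloads are copied untouched and no CRC is recomputed.
    """
    d_row = dst_origin[0] - src_origin[0]
    d_col = dst_origin[1] - src_origin[1]
    return [
        (FrameAddress(far.row + d_row, far.column + d_col, far.minor), payload)
        for far, payload in frames
    ]


source_frames = [
    (FrameAddress(0, 4, 0), b"a"),
    (FrameAddress(0, 5, 1), b"b"),
]
moved = relocate(source_frames, src_origin=(0, 4), dst_origin=(1, 10))
```

Only the addresses change; this relies on all PRRs having identical resources, so the same payload is valid at the new location.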

Results: reconfiguration times in milliseconds

All systems run at 100 MHz
Footprint of ARC: 1064 LUTs, 638 FFs and 1 BRAM

* Estimated values for state-of-the-art competing technologies

Columns: resources (LUT, FF, DSP, BRAM), bitstream size in bytes, number of frames, then reconfiguration times for ARC (same side/opposite side) and for the competing techniques BiRF* (IEEE TVLSI 2009) and Microsoft* (Tech. Report 2008), with sub-columns in source order: same side, BRAM, same side, opposite side.

Test circuit            LUT   FF DSP BRAM  Bytes Frames   ARC    Same BRAM   Same    Opp

PolySAF node
FSA_no_DSP              486  273   0    0  31159    195  0.48    84.7   14   3.38   8.86
DSA_no_DSP              438  273   0    0  30693    195  0.48    83.4   14   3.33   8.73
Matrix_Mult_no_DSP     1234  988   0    0  68469    432  1.07   186.1   30   7.42  19.47
FSA_with_DSP            423  216   1    0  32349    195  0.48    87.9   15   3.50   9.20
DSA_with_DSP            375  216   1    0  32349    195  0.48    89.8   15   3.58   9.20
Matrix_Mult_with_DSP    502  466   8    0  65261    432  1.07   177.3   29   7.07  18.56

RFT cases
DCT                    1419 1636   8    8  44397    540  1.34  120.64   22   4.81  12.62
CSC                     318  438   1   12  17313    301  0.74   47.04    9   1.87   4.92
DWT                     940  389   0    4  47897    303  0.75   130.2   21   5.19  13.62
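To put the ARC times in perspective relative to execution speed, the short calculation below converts a few of the reported ARC reconfiguration times into clock cycles at the stated 100 MHz. The helper name cycles and the choice of rows are just for illustration; the input numbers are taken from the results above.

```python
CLOCK_HZ = 100_000_000  # all systems run at 100 MHz


def cycles(milliseconds, clock_hz=CLOCK_HZ):
    """Number of clock cycles that elapse during the given time."""
    return round(milliseconds * clock_hz / 1000)


# ARC reconfiguration times (ms) and bitstream sizes (bytes) for three circuits.
arc_ms = {"FSA_no_DSP": 0.48, "Matrix_Mult_no_DSP": 1.07, "DCT": 1.34}
bitstream_bytes = {"FSA_no_DSP": 31159, "Matrix_Mult_no_DSP": 68469, "DCT": 44397}

arc_cycles = {name: cycles(ms) for name, ms in arc_ms.items()}
bytes_per_cycle = {
    name: bitstream_bytes[name] / arc_cycles[name] for name in arc_ms
}

for name in arc_ms:
    print(f"{name}: {arc_cycles[name]} cycles, "
          f"{bytes_per_cycle[name]:.2f} bytes per cycle")
```

Even at under half a millisecond, swapping one node costs tens of thousands of cycles, which is why reconfiguration speed matters so much relative to processing speed.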

Next steps… Improve, Formalize and Collaborate

Performance prediction model
  Predict how big the circuit will be and how it will perform, using Excel and Matlab
  Big leap in productivity

Arithmetic precision manipulation is extraordinarily powerful when it comes to FPGAs
  If the right non-IEEE precision can be chosen for a signal processing application, then you can save medium to massive amounts of area and power in the circuit mapped onto the FPGA
  Great opportunity for small satellites

Efficient communication between the microprocessor and PolySAF via threads

Validate and brutally test this on a large number of algorithms (FFTs, filters, hyperspectral processing…)
  NASA can help with this
  Technology is attractive for software defined radios, precision navigation…
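On the arithmetic precision point above, a minimal sketch of what choosing a non-IEEE precision means: quantize values to a fixed-point format with a chosen number of fractional bits and check the worst-case error. The function names and the sample data are assumptions for illustration; the area and power savings themselves depend on the circuit and are not modeled here.

```python
def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (fixed point)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def max_error(samples, frac_bits):
    """Worst-case absolute quantization error over the samples."""
    return max(abs(s - quantize(s, frac_bits)) for s in samples)


# Hypothetical signal samples in [0, 1).
samples = [i / 1000 for i in range(1000)]

for frac_bits in (4, 8, 12):
    print(frac_bits, "fractional bits -> max error", max_error(samples, frac_bits))
```

The error is bounded by half of the least significant bit, so each extra fractional bit halves the bound. If an application tolerates the error at, say, 8 fractional bits, the datapath can be much narrower than an IEEE floating-point one.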

Kaleidoscope: Future of FPGA

Near term
  Maybe better tools to program and debug FPGAs?
    Mentor’s Catapult, AutoESL compiler, Synfora compiler…
  Maybe some sort of standardization in FPGA programming
    Hopefully the DARPA HPCS program will produce something

Longer term (revolutionary things to come)
  Vertically integrated FPGA + DRAM on a single chip
    1000x improvement in performance/watt
    Visit the Micron Research Center at USU to learn more: www.usu.edu/mrc

Acknowledgements

Joe Bredekamp and the NASA AISR (Applied Information Systems Research) program
  Funding from NASA is valuable
    Focused research
    Want my technology to be adopted for real missions

Xilinx and Mentor Graphics (donated > $100K worth of software)

My grad students
