HPEC using FPGAs: Challenges and Benefits
Dr. Aravind Dasu, Assistant Professor
Electrical & Computer Engineering
2
Utah State University
Cache Valley, 90 miles north of Salt Lake City
David Sant Engineering Innovation Building
3
Agenda
On-board computing for spacecraft
A primer on FPGAs (5 slides)
HPEC using FPGAs (26 slides)
  The Polymorphic Systolic Array Framework
  Improving productivity
  Enabling real-time and responsive reconfiguration
Future technologies for FPGAs
Acknowledgements
4
On-board Computing
Civilian and military space missions are getting more complex
  Need to support several types of data from several types of sensors
Missions will require the spacecraft computer to be more responsive
  Need for in-situ data processing (signal processing)
  Not just compression, but data analysis, decision making, etc.
Power budget and form factors of the spacecraft computer are extremely tight
  State-of-the-art RadHard microprocessor from BAE Systems, or RISC processor?
  Aging workhorse, time to upgrade big time
5
So, what do we upgrade to?
Commodity microprocessors
  Cell, GPU, many/multi-core
  Very powerful
  Blows out the power budget
  RadHard parts need to be custom ordered
Commodity DSP chips
  Good as long as you stick to just one chip
  RadHard parts can be custom ordered
Commodity reconfigurable chips
  FPGAs (field programmable gate arrays)
  Can perform like a custom silicon chip
  Best performance/power ratios
  RadHard parts already available, with a steady roadmap from Xilinx
6
Programming perspective (an optimistic viewpoint)
Microprocessors: frozen pizza
DSP chips: take 'n' bake
FPGAs: raw ingredients
7
Quick Primer on FPGAs
Mixture of blocks on a die
Some dedicated
  DSP (MAC units)
  PPC (optional)
  RAM
Some programmable
  Look Up Tables (LUTs)
  Gazillions of network switches
Hidden special circuit
  ICAP (internal configuration access port)
8
Simple View of Programming an FPGA
An FPGA is essentially a vast set of SRAM cells waiting to be loaded with 0s and 1s to mimic Boolean logic
NMOS transistor
All computations are assumed to be based on Boolean logic
So, problem-solving concept => algorithms
Algorithms => discrete set of simple tasks (add/multiply…)
Simple tasks => a set of Boolean functions talking to each other
Boolean function => simple manipulation of 1 and 0 bits
Each bit is stored in a small memory cell (SRAM)
9
Programming an FPGA
Each Look Up Table (LUT) has a unique mailing address
  16 bits go into each LUT
Each routing switch has a unique mailing address
  One bit for each switch
The executable for an FPGA is a sequence of bits that have to be delivered precisely to each LUT and switch box
  This binary/executable is called the “Configuration Bitstream”, or simply the “Bitstream”
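To make the "16 bits per LUT" idea concrete, here is a minimal Python sketch of a look-up table as a tiny memory. It assumes a 4-input LUT (which is what 16 configuration bits imply); the names `Lut4` and `configure` are made up for illustration and are not part of any vendor tool.

```python
class Lut4:
    """A 4-input look-up table: 16 configuration bits, one per input combination."""

    def __init__(self, config_bits):
        if len(config_bits) != 16 or any(b not in (0, 1) for b in config_bits):
            raise ValueError("a 4-input LUT needs exactly 16 bits, each 0 or 1")
        self.bits = list(config_bits)

    def evaluate(self, a, b, c, d):
        # The four inputs form the address of the SRAM cell to read.
        index = (a << 3) | (b << 2) | (c << 1) | d
        return self.bits[index]


def configure(func):
    """Build the 16 configuration bits for any Boolean function of 4 inputs."""
    bits = []
    for index in range(16):
        a, b, c, d = (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1
        bits.append(func(a, b, c, d) & 1)
    return bits


# Example: a one-bit full adder built from two LUTs (input d is unused).
sum_lut = Lut4(configure(lambda a, b, c, d: a ^ b ^ c))
carry_lut = Lut4(configure(lambda a, b, c, d: (a & b) | (a & c) | (b & c)))
```

Changing the function computed by a LUT means nothing more than loading a different set of 16 bits, which is why the whole "program" for the chip can be shipped as a stream of bits.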
10
Programming an FPGA
Programming the FPGA is like having a mailman deliver bits to each address correctly
  Slow process
But a bitstream is slightly more complex
  Each FPGA is like a country (has a unique code)
  Before entering the chip, a bitstream has to undergo security clearance (CRC, or cyclic redundancy check)
  Port of entry = ICAP
  FPGA addresses are hierarchical (state, county, city, suburb, house address)
  The term used for encoding all this is “Frame Address”
  All this address stuff is overhead
  The actual useful stuff is inside the mail envelope
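The mailman analogy can be sketched as a toy loader in Python. This is only an illustration under loud assumptions: the packet layout, the three-part address and the use of CRC-32 from `zlib` are invented here and do not reproduce the real Xilinx bitstream format.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    address: tuple   # hypothetical hierarchical address, e.g. (row, column, minor)
    payload: bytes   # the configuration bits: the letter inside the envelope


def pack(device_code, frames):
    """Serialize frames and append a CRC-32 computed over everything before it."""
    body = bytearray(device_code.to_bytes(4, "big"))
    for frame in frames:
        body += bytes(frame.address)                     # address = overhead
        body += len(frame.payload).to_bytes(2, "big")    # length = overhead
        body += frame.payload                            # the useful part
    return bytes(body) + zlib.crc32(body).to_bytes(4, "big")


def load(bitstream, device_code, address_len=3):
    """Check the CRC and device code, then deliver each payload to its address."""
    body, crc = bitstream[:-4], int.from_bytes(bitstream[-4:], "big")
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch: bitstream rejected")
    if int.from_bytes(body[:4], "big") != device_code:
        raise ValueError("bitstream was built for a different device")
    config_memory, pos = {}, 4
    while pos < len(body):
        address = tuple(body[pos:pos + address_len])
        pos += address_len
        size = int.from_bytes(body[pos:pos + 2], "big")
        pos += 2
        config_memory[address] = body[pos:pos + size]
        pos += size
    return config_memory
```

Even in this toy, every delivered payload drags along address, length and checksum bytes, which is the overhead the slide is pointing at.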
11
So what does a real configured/programmed FPGA look like?
Before programming: nice clean plate. Empty LUTs, switches…
After programming: messy plate of spaghetti. Configured LUTs, switches…
All those green things are wires that have been set up to carry data between LUTs, FFs, etc.
12
High Performance Embedded Computing (HPEC) using FPGAs
Signal processing algorithms
  Wildly useful and hence widely used
  Computationally quite parallel/pipeline-amenable
  Proven to be accelerate-able by systolic array designs on FPGAs
The Good of FPGAs:
  FPGAs claim to have orders of magnitude performance advantage over DSP chips (www.xilinx.com, www.altera.com)
  They can be reconfigured partially and dynamically
The Bad (no, the Ugly):
  Productivity is the biggest barrier
  The number of signal processing folks willing to adopt FPGAs is small and stagnant
  Partial dynamic reconfiguration is very slow compared to processing speeds
13
Elaborating the Good of FPGAs: Extreme DSP computing
14
Elaborating the Good of FPGAs: Partial Dynamic Reconfiguration
At some point in time…
Abruptly… say we need to quickly increase parallelism support for application α (4 → 5 circuits),
at the cost of taking away parallelism support for the other application, β,
because we did not have enough space on the chip to support high levels of parallelism for both applications, or
there was a power budget we couldn't satisfy.
Can we dynamically reconfigure the chip without disturbing the execution of either application?
And do it fast enough? Remember, programming the FPGA is a very, very, very slow process RELATIVE to the execution speeds of applications.
[Figure: animation of the FPGA being repartitioned. Initially there are 4 parallel processing circuits for application α and 7 for application β. One β circuit is then removed (4 α, 6 β), and the freed region is reconfigured as an α circuit (5 α, 6 β).]
15
Productivity
It’s a funny thing in the FPGA world
FPGA programmers are essentially VLSI design guys
They don’t buy $5K parts to get average performance
Every clock cycle is precious
Every LUT/FF/MAC/BRAM is precious
They don’t adopt new programming languages in a hurry
They love to have full control over every operation
16
Productivity, so what does it mean?
Wants an entire system on an FPGA modeled, performance predicted, designed, implemented, debugged, verified, with guaranteed timing closure, low power, high throughput…
Done really, really fast, just like software
And then wants to make some minor changes and do it quickly all over again, just like software…
17
Why can't new designs be compiled, loaded onto FPGAs and tested super fast?
Need to look at the traditional design flow
1. Hardware-software partition (quick)
2. Create macro and micro architectures for the hardware portion (a month, two months…)
3. Write bug-free VHDL/Verilog code for the architectures (a few months)
4. Synthesize, translate, map, place and route (5 to 15 hours)
5. Simulate
   If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip. Test again.
   If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro-architecture change, go back to step 2
8. Good luck trying to finish your project on time and budget
9. This will still not get you a dynamically reconfigurable design
18
One way to Improve Productivity
Stick to the traditional design flow as much as possible
  FPGA users are once bitten, twice shy
  Very conservative and believe in the existing flow
But introduce structure into the flow, i.e. physical structure, macro-architecture structure
Make Partial Dynamic Reconfiguration (PDR) almost automatic
  FPGA designers are not conversant with PDR designs
19
Augmented Design Flow: Exclusively for Signal Processing Algorithms
Hardware-software partitioning (just a concept, and specific to an application)
Structured macro-architecture via floor planning
  Generic structure applicable to many algorithms
Structured micro-architecture design
  Project and schedule the data flow model of the Sig. Proc. kernel onto things called Sockets of the macro-architecture
  Well understood process
Embed dynamic reconfiguration capability
  New technology
  Works in tandem with the macro-architecture
Code, synthesize…
Test on chip
Structured Macro-architecture
Some important terms/elements:
Socket: a physical region on the FPGA chip reserved by the designer to be loaded/configured with a PE. This is also called a Partial Reconfiguration Region (PRR)
Switch Box: a circuit that makes the array of Sockets re-partition-able
PE/Processing Element: a circuit/bitstream that implements a signal processing kernel’s systolic array data-flow functionality. To activate a socket, a PE must be loaded into it
21
Socket/PRR: Under the Hood
Yellow box: a socket/PRR. It contains BRAMs, MACs and LUTs/FFs (purple and blue/green/black stuff)
If you want to dynamically reconfigure the parallelism of systolic arrays on an FPGA: all PRRs must be created with identical resources of MACs, BRAMs, LUTs, FFs.
Physical fabric of Virtex 4 SX 35 FPGA
Switch Box: Stuff that makes the Array of Sockets Re-partition-able
Simple circuit
Need to set mux select lines and FIFO controls
Resides in the static region on the FPGA
Change SB connections to change the partitioning of sockets/PRRs between systolic array kernels’ nodes
23
Ok, time to port Macro-architecture Framework onto Chip
Virtex 4 SX 35
Static region (luminescent green stuff)
• Microprocessor
• Switch Boxes
• Cache
• Controller
PRRs/Sockets (white boxes)
• To be filled with Systolic Array Processing Elements
What really happened when we tried it
25
Now to the Micro-architecture… First, Hardware-Software Partitioning
Example: Extended Kalman Filter (EKF). A critical navigation algorithm and a nasty signal processing kernel.
All stuff with rounded edges are tasks that can change based on the physics of the problem, so put it all in software (MicroBlaze).
All else is consistent, so put it in hardware (PolySAF).
26
Designing/Deriving the Processing Element: Example EKF
Works on the Faddeev algorithm to compute the Schur complement
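For readers who have not met the Faddeev algorithm, here is a minimal software sketch of the textbook version in Python. It is not the processing element from the slides (that design is shown only as a figure); it simply illustrates the idea that plain row elimination on the compound matrix [[A, B], [-C, D]] leaves D + C·A⁻¹·B in the lower-right block, the Schur-complement form the EKF needs (the usual D − C·A⁻¹·B follows by flipping the sign of C). Exact fractions are used here only to keep the example easy to check.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return D + C * inverse(A) * B using only row operations.

    A is n x n, B is n x m, C is p x n, D is p x m (lists of rows).
    """
    n, p = len(A), len(C)
    # Build the compound matrix [[A, B], [-C, D]] with exact rational entries.
    M = [[Fraction(x) for x in list(ra) + list(rb)] for ra, rb in zip(A, B)]
    M += [[Fraction(-x) for x in rc] + [Fraction(x) for x in rd]
          for rc, rd in zip(C, D)]

    for col in range(n):
        # Pick a nonzero pivot among the remaining rows of the top block.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("A is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Annul the column below the pivot, in both the A rows and the -C rows.
        for r in range(col + 1, n + p):
            factor = M[r][col] / M[col][col]
            if factor:
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]

    # The lower-left block is now zero; the lower-right block is the result.
    return [row[n:] for row in M[n:]]
```

Choosing B and C as identity matrices and D as zero turns the same routine into a matrix inverse, and other choices give products and sums, which is a large part of why one regular elimination structure maps so well onto a systolic array.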
27
One of the many possible ways
Port
28
Code, Synthesize, … Optimize
Port: code, synthesize, translate, map, place and route
  For one Socket/PRR (just a few days' worth of work)
Move nets around to meet timing:
  Manually pick up a wire in this small bowl of spaghetti of wires, and move it around
  Nuisance of a task, but necessary
  But you need to do it in only one PRR (just a few hours' worth of work)
Copy the locally optimized bitstream/circuit of the one PRR to all PRRs
  Automatically obtain global timing closure for the PolySAF
  If the microprocessor and cache are retained for multiple designs, then global timing closure for the whole chip is also automatically gifted to you
29
Have we answered the Productivity problem? Time to Grade the Approach
Need to look at the traditional design flow
1. Hardware-software partition (quick)
2. Create macro and micro architectures for the hardware portion (a month, two months…)
   Applicable to a wide range of Sig. Proc. algorithms
3. Write bug-free VHDL/Verilog code for the architectures (a few months)
   Reuse most of the macro structure and code only for one PRR
4. Synthesize, translate, map, place and route (5 to 15 hours)
   Do for only one PRR
5. Simulate
   If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip. Test again.
   If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro-architecture change, go back to step 3
8. Good luck trying to finish your project on time and budget
30
Want the details, the math, the algorithms, etc.?
Read this paper:
A. Sudarsanam, R. Barnes, A. Dasu, J. Carver, and R. Kallam, “Dynamically Reconfigurable Systolic Array Accelerators: A case study with EKF and DWT Algorithms,” IET/IEE Computers & Digital Techniques, Vol. 4, Issue 1, Jan. 2010.
Author preprint available online at the Reconfigurable Computing Group: www.usu.edu/rcg
31
Now, onto Partial Dynamic Reconfiguration in the PolySAF
Start: 3 nodes EKF, 2 nodes DWT
Detach socket: 2 nodes EKF, 2 nodes DWT
Reconfigure, reset new PRR, re-attach: 2 nodes EKF, 3 nodes DWT
DWT: discrete wavelet transform, the kernel used in JPEG 2000 image compression
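The detach, reconfigure, reset and re-attach sequence can be sketched as a small bookkeeping model in Python. The `Socket` and `PolySAF` classes below are hypothetical names used only for illustration; they track which kernel each socket serves and say nothing about how the hardware actually moves a bitstream.

```python
from dataclasses import dataclass, field


@dataclass
class Socket:
    """One partial reconfiguration region (PRR) and the PE loaded in it."""
    kernel: str            # kernel whose PE is loaded, e.g. "EKF" or "DWT"
    attached: bool = True  # whether the switch boxes route data through it


@dataclass
class PolySAF:
    sockets: list = field(default_factory=list)

    def nodes(self, kernel):
        """Count the attached sockets currently serving a kernel."""
        return sum(1 for s in self.sockets if s.attached and s.kernel == kernel)

    def detach(self, kernel):
        """Step 1: switch boxes bypass one socket of the donor kernel."""
        socket = next((s for s in self.sockets
                       if s.attached and s.kernel == kernel), None)
        if socket is None:
            raise ValueError(f"no attached socket runs {kernel}")
        socket.attached = False
        return socket

    def reconfigure(self, socket, kernel):
        """Steps 2 and 3: load the new PE into the detached socket and reset it."""
        if socket.attached:
            raise RuntimeError("detach the socket before reconfiguring it")
        socket.kernel = kernel

    def attach(self, socket):
        """Step 4: switch boxes route data through the socket again."""
        socket.attached = True


array = PolySAF([Socket("EKF") for _ in range(3)] + [Socket("DWT") for _ in range(2)])
spare = array.detach("EKF")       # 2 nodes EKF, 2 nodes DWT
array.reconfigure(spare, "DWT")   # new PE loaded while the others keep running
array.attach(spare)               # 2 nodes EKF, 3 nodes DWT
```

The point of keeping detach and attach as separate steps is that the other sockets are never touched, so both kernels can keep executing on their remaining nodes during the swap.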
32
How to Physically Reconfigure PRR? Known Methods
33
Comparison of all known options
Best known technique: from Microsoft Research Labs (2008), eMIPS project
Too slow, too expensive (hogs up valuable on-chip BRAMs)
34
Embedding Dynamic Reconfiguration into the System
Active Bitstream (PRR) to PRR: Hardware Circuit
[Figure: block diagram of the FPGA. Labels: ARC, ICAP, ICAP wrapper, snoop, PRR (source), PRR (destination, two instances), active bitstream.]
35
Accelerated Relocation Circuit (ARC)
Manipulates frame addresses
  FAR is the Frame Address Register
Lots of unnecessary overhead can be avoided
  No need for CRC processing
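The core idea of relocation by frame-address manipulation can be sketched in a few lines of Python. Loud assumption: the field widths and the (row, column, minor) layout below are illustrative only and are not the device's actual FAR format, which is defined in the vendor's configuration documentation. The sketch also relies on the requirement stated earlier that all PRRs have identical resources.

```python
ROW_BITS, COL_BITS, MINOR_BITS = 5, 8, 6   # illustrative widths, an assumption


def pack_far(row, col, minor):
    """Pack the three address fields into one frame-address word."""
    for value, bits in ((row, ROW_BITS), (col, COL_BITS), (minor, MINOR_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("address field out of range")
    return (row << (COL_BITS + MINOR_BITS)) | (col << MINOR_BITS) | minor


def unpack_far(far):
    """Split a frame-address word back into (row, column, minor)."""
    minor = far & ((1 << MINOR_BITS) - 1)
    col = (far >> MINOR_BITS) & ((1 << COL_BITS) - 1)
    row = far >> (COL_BITS + MINOR_BITS)
    return row, col, minor


def relocate(frames, src_origin, dst_origin):
    """Retarget frames read from a source PRR at a destination PRR.

    `frames` maps a frame address to frame data. Only the row and column
    fields are shifted; the frame data itself passes through untouched.
    """
    d_row = dst_origin[0] - src_origin[0]
    d_col = dst_origin[1] - src_origin[1]
    moved = {}
    for far, data in frames.items():
        row, col, minor = unpack_far(far)
        moved[pack_far(row + d_row, col + d_col, minor)] = data
    return moved
```

Because only a few address bits change per frame, none of the payload has to be parsed or re-checked, which is consistent with the slide's remark that CRC processing and other overhead can be skipped.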
36
Results… reconfiguration times in milliseconds
All systems run at 100 MHz
Footprint of ARC: 1064 LUTs, 638 FFs and 1 BRAM
* Estimated values for state-of-the-art competing technologies
Test circuit | LUT | FF | DSP | BRAM | Bitstream size (bytes) | No. of frames | ARC, same side/opp. side | BiRF* (IEEE TVLSI 2009), same side | Microsoft* (Tech. Report 2008), BRAM | Microsoft*, same side | Microsoft*, opp. side
PolySAF nodes
FSA_no_DSP | 486 | 273 | 0 | 0 | 31159 | 195 | 0.48 | 84.7 | 14 | 3.38 | 8.86
DSA_no_DSP | 438 | 273 | 0 | 0 | 30693 | 195 | 0.48 | 83.4 | 14 | 3.33 | 8.73
Matrix_Mult_no_DSP | 1234 | 988 | 0 | 0 | 68469 | 432 | 1.07 | 186.1 | 30 | 7.42 | 19.47
FSA_with_DSP | 423 | 216 | 1 | 0 | 32349 | 195 | 0.48 | 87.9 | 15 | 3.50 | 9.20
DSA_with_DSP | 375 | 216 | 1 | 0 | 32349 | 195 | 0.48 | 89.8 | 15 | 3.58 | 9.20
Matrix_Mult_with_DSP | 502 | 466 | 8 | 0 | 65261 | 432 | 1.07 | 177.3 | 29 | 7.07 | 18.56
RFT cases
DCT | 1419 | 1636 | 8 | 8 | 44397 | 540 | 1.34 | 120.64 | 22 | 4.81 | 12.62
CSC | 318 | 438 | 1 | 12 | 17313 | 301 | 0.74 | 47.04 | 9 | 1.87 | 4.92
DWT | 940 | 389 | 0 | 4 | 47897 | 303 | 0.75 | 130.2 | 21 | 5.19 | 13.62
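A quick way to read the results is to normalize them. The Python sketch below takes a handful of rows from the results table (frame count, ARC time, BiRF estimate, Microsoft same-side estimate) and derives the ARC time per frame and the speedups. The derived figures are simple arithmetic on the reported numbers, not additional measurements.

```python
# (frames, ARC ms, BiRF ms, Microsoft same-side ms), taken from the results table.
RESULTS = {
    "FSA_no_DSP": (195, 0.48, 84.7, 3.38),
    "Matrix_Mult_no_DSP": (432, 1.07, 186.1, 7.42),
    "DCT": (540, 1.34, 120.64, 4.81),
    "CSC": (301, 0.74, 47.04, 1.87),
    "DWT": (303, 0.75, 130.2, 5.19),
}
CLOCK_MHZ = 100  # all systems run at 100 MHz


def summarize(frames, arc_ms, birf_ms, microsoft_ms):
    """Derive per-frame cost and speedups from one table row."""
    us_per_frame = arc_ms * 1000 / frames
    return {
        "us_per_frame": us_per_frame,
        "cycles_per_frame": us_per_frame * CLOCK_MHZ,
        "speedup_vs_birf": birf_ms / arc_ms,
        "speedup_vs_microsoft": microsoft_ms / arc_ms,
    }


for name, row in RESULTS.items():
    s = summarize(*row)
    print(f"{name:20s} {s['us_per_frame']:.2f} us/frame, "
          f"{s['speedup_vs_birf']:.0f}x vs BiRF, "
          f"{s['speedup_vs_microsoft']:.1f}x vs Microsoft (same side)")
```

For these rows the ARC time works out to roughly 2.5 microseconds per frame regardless of the circuit, so ARC reconfiguration time appears to scale with the number of frames rather than with what the frames contain.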
37
Next steps… Improve, Formalize and Collaborate
Performance prediction model
  Predict how big a circuit will be and how it will perform, using Excel and Matlab
  Big leap in productivity
Arithmetic precision manipulation is extraordinarily powerful when it comes to FPGAs
  If the right non-IEEE precision can be chosen for a Sig. Proc. app, then you can save medium to massive amounts of area and power in the circuit mapped onto the FPGA
  Great opportunity for small satellites
Efficient communication between the microprocessor and PolySAF via threads
Validate and brutally test this on a large number of algorithms (FFTs, filters, hyperspectral processing…)
  NASA can help with this
  Technology is attractive for software defined radios, precision navigation…
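The precision point can be explored in software before any hardware is built. The Python sketch below is an assumption-laden illustration: it uses plain fixed-point rounding and a sine wave as a stand-in signal, and searches for the smallest number of fractional bits that keeps the rounding error under a chosen tolerance. Real applications would use their own data and error metric.

```python
import math


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (fixed point)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def max_error(samples, frac_bits):
    """Worst-case rounding error over a set of samples."""
    return max(abs(x - quantize(x, frac_bits)) for x in samples)


def smallest_width(samples, tolerance, max_bits=52):
    """Fewest fractional bits whose worst-case error stays within tolerance."""
    for frac_bits in range(max_bits + 1):
        if max_error(samples, frac_bits) <= tolerance:
            return frac_bits
    raise ValueError("tolerance cannot be met within max_bits")


# Stand-in test signal: one period of a sine wave, values in [-1, 1].
signal = [math.sin(2 * math.pi * k / 64) for k in range(64)]

for bits in (4, 8, 12, 16):
    print(f"{bits:2d} fractional bits: max error {max_error(signal, bits):.2e}")
print("bits needed for 1e-3 tolerance:", smallest_width(signal, 1e-3))
```

Every bit shaved off the datapath is a narrower adder, multiplier and register in the fabric, which is where the area and power savings mentioned on the slide come from.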
38
Kaleidoscope: Future of FPGAs
Near term
  Maybe better tools to program and debug FPGAs? Mentor’s Catapult, AutoESL compiler, Synfora compiler…
  Maybe some sort of standardization in FPGA programming. Hopefully the DARPA HPCS program will produce something
Longer term (revolutionary things to come)
  Vertically integrated FPGA + DRAM on a single chip
  1000x improvement in performance/watt
  Visit the Micron Research Center at USU to learn more: www.usu.edu/mrc
39
Acknowledgements
Joe Bredekamp and the NASA AISR (Applied Information Systems Research) program
  Funding from NASA is valuable
  Focused research
  Want my technology to be adopted for real missions
Xilinx and Mentor Graphics (donated > $100K worth of software)
My grad students