Fast Compilation for Reconfigurable Hardware

Mihai Budiu and Seth Copen Goldstein

Carnegie Mellon University, Computer Science Department

Joint work with Srihari Cadambi, Herman Schmit, Matt Moe, Robert Taylor, Ronald Laufer

FPGA, Feb 23 1999. (c) 1998 by Mihai Budiu

Goal

To program reconfigurable devices using the standard software development processes:

– Compile C or Java

– Do it quickly

[Figure: compilation flow. Labels: Java, Partitioner, DIL (Data-flow Intermediate Language), Configuration, Reconfigurable HW, CPU. One part of the flow is marked "This talk".]


Compiler Performance on 1D DCT (8 inputs, 8 bits each)

                        DIL            Classical tools
Total compile time      2.4 s          ~75 min (Synopsys + Design Manager)
Place and route         1 s            14 min 22 s (Design Manager)
Target clock speed      75 MHz         33 MHz
Circuit size            7816 bit-ops   899 CLBs
Application speed-up    20             ~20
Target                  PipeRench      Xilinx 4085XL

Compilation: ~700x faster


The Place and Route Problem

[Figure: a data-flow graph of operators (+, ., <<, >>, &, ~, [1,2]) and their interconnections, shown next to rows of processing elements joined by an interconnection network.]

Our Target:

• Medium grain processing elements (4 bits)

• Pipelined architecture

• Virtualized hardware

• Local interconnection network

• Wide pipelined bus


The Place and Route Problem (continued)

[Figure: the same diagram of operators, processing elements and interconnection network, with a "Stripe" label added.]


Why Place and Route Is Hard

• Hard constraints:
  – Stripe width
  – Pipelined bus width

• Word-based circuit:
  – interconnection network switches words
  – fixed PE size

• Scarce input ports for the interconnection network
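The stripe-width constraint can be illustrated with a little arithmetic. This is a minimal sketch, not the compiler's actual check: the 4-bit PE size comes from the talk, while the stripe width of 16 PEs and the function names are assumptions made for the example.

```python
import math

PE_BITS = 4  # PE width stated in the talk ("medium grain processing elements (4 bits)")


def pes_needed(width_bits, pe_bits=PE_BITS):
    """Number of fixed-size PEs occupied by a word-level operator of the given width."""
    return math.ceil(width_bits / pe_bits)


def fits_in_stripe(operator_widths, stripe_pes):
    """True if all the operators can share one stripe without exceeding its width."""
    return sum(pes_needed(w) for w in operator_widths) <= stripe_pes


STRIPE_PES = 16  # hypothetical stripe width, in PEs (not given in the talk)

print(pes_needed(32))                            # 8
print(fits_in_stripe([32, 16], STRIPE_PES))      # 8 + 4 = 12 PEs -> True
print(fits_in_stripe([32, 32, 8], STRIPE_PES))   # 8 + 8 + 2 = 18 PEs -> False
```

Because the constraint is hard, a set of operators that does not fit cannot be squeezed in; it has to be split across stripes.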


How We Simplify Place and Route

• Computation-oriented programs (restricted language, with unidirectional data flow)

• Hardware resources virtualized

• Relatively rich interconnection network

• High granularity placement (i.e. one 32-bit adder instead of 100 gates)

• There is a wide pipelined bus available

• Timing is very predictable


The Key Idea

• Global analysis and transformations guarantee placeability using lazy noops (conservatively)

• Deterministic, greedy place & route (no backtracking)

• All passes linear time in the size of the circuit


Guaranteeing Placement

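[Figure: a data-flow graph of +, ., <<, >>, &, ~ and [1,2] operators, before and after noop insertion. Labels: "Complex permutation" on the original graph; "Simple permutation" on each stage of the graph with noops inserted.]

The inserted noops are sufficient but not necessary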
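The global transformation can be sketched as a single pass over the graph. The talk does not say how many noops the analysis inserts or where, so this sketch assumes the most conservative policy, one lazy noop on every edge. The edge-list representation and the `noopN` naming are also assumptions.

```python
def insert_lazy_noops(edges):
    """Split every edge (src, dst) into src -> noop -> dst.

    Returns the new edge list and the set of lazy noop names.
    One pass over the edges, so the cost is linear in the circuit size.
    """
    new_edges = []
    lazy = set()
    for i, (src, dst) in enumerate(edges):
        noop = f"noop{i}"  # hypothetical naming scheme
        lazy.add(noop)
        new_edges.append((src, noop))
        new_edges.append((noop, dst))
    return new_edges, lazy


edges = [("a", "+"), ("b", "+"), ("+", "&")]
new_edges, lazy = insert_lazy_noops(edges)
print(new_edges)
print(sorted(lazy))
```

Every noop starts out lazy: it exists in the graph but occupies no hardware unless place and route later needs it.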


Placement of a Non-lazy Noop

[Figure: placement example with &, ~, + and noop nodes.]


Lazy Noops Are Not Placed

[Figure: placement example with &, ~, + and lazy noop nodes.]


Place and Route Overview

• Analysis:
  – Noops have been inserted to guarantee that the graph is routable.

• Place & Route:
  – will determine which lazy noops are instantiated

Next: actual Place and Route


Step 1: Analyze Routability

[Figure: ancestor nodes (&, ~, noops) marked "Already placed"; a + node shown at each of seven candidate positions.]

Q: can we place the + given the placement of its ancestors?
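The routability question can be made concrete with a toy model. The sketch below assumes a one-dimensional stripe of PE columns and a local interconnect that reaches at most `reach` columns sideways from each ancestor. The talk gives no details of the real network, so the model and all names are assumptions.

```python
def routable_columns(ancestor_cols, occupied, stripe_width, reach):
    """Columns where a one-PE node can be placed: the column is free and
    lies within `reach` columns of every already-placed ancestor."""
    return [
        c
        for c in range(stripe_width)
        if c not in occupied
        and all(abs(c - a) <= reach for a in ancestor_cols)
    ]


# Ancestors at columns 2 and 4, column 3 already taken.
print(routable_columns([2, 4], occupied={3}, stripe_width=8, reach=2))    # [2, 4]

# Ancestors too far apart: no column can see both, so the node is unroutable.
print(routable_columns([0, 7], occupied=set(), stripe_width=8, reach=2))  # []
```

An empty result is the "unroutable" case that the next step deals with.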


Step 2: If a Node Is Unroutable

Solution: promote a lazy noop

[Figure: before and after views of the +, &, ~ and noop nodes.]


Step 3: Choosing a Noop

Choose the closest noop that is routable.

[Figure: before and after views of the +, &, ~ and noop nodes.]
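Steps 2 and 3 can be sketched together: pick the nearest lazy noop that can itself be routed, then promote it to a real node. How distance is measured and how routability is tested are not specified in the talk, so the `(distance, name)` pairs and the `is_routable` callback are assumptions.

```python
def choose_noop(lazy_noops, is_routable):
    """Return the name of the closest lazy noop that is routable, or None.

    `lazy_noops` is an iterable of (distance, name) pairs, where distance is
    measured from the node that could not be placed.
    """
    for _distance, name in sorted(lazy_noops):
        if is_routable(name):
            return name
    return None


def promote(name, lazy, instantiated):
    """Turn a lazy noop into a real node that will occupy a PE."""
    lazy.discard(name)
    instantiated.append(name)


candidates = [(2, "noop_b"), (1, "noop_a"), (3, "noop_c")]
routable = {"noop_b", "noop_c"}  # noop_a is closest but cannot be routed

lazy = {"noop_a", "noop_b", "noop_c"}
instantiated = []
chosen = choose_noop(candidates, lambda n: n in routable)
promote(chosen, lazy, instantiated)
print(chosen, instantiated)  # noop_b ['noop_b']
```

Since the inserted noops are sufficient to guarantee placement, the `None` case should not arise in practice; it is kept only so the sketch is total.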


Other Details

• Operators are decomposed in pieces for:
  – timing constraints
  – size constraints

• When placing, optimize for:
  – register pressure when accessing the bus
  – constraints placed on future nodes

• Long critical paths are sliced with pipeline registers
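Operator decomposition can be illustrated with an adder. The sketch below splits a wide addition into narrower additions chained by a carry. The 24-bit and 8-bit widths echo the numbers on the "Timing and Size Guarantees" slide, but the carry-chain scheme itself is an assumption, not a description of the compiler.

```python
def add_in_pieces(a, b, width=24, piece=8):
    """Add two `width`-bit values using `piece`-bit adders chained by a carry.

    The result wraps around at `width` bits, like a hardware adder.
    """
    mask = (1 << piece) - 1
    result, carry = 0, 0
    for shift in range(0, width, piece):
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift  # keep the low `piece` bits of this slice
        carry = s >> piece             # pass the overflow to the next slice
    return result & ((1 << width) - 1)


a, b = 0xABCDEF, 0x123456
print(hex(add_in_pieces(a, b)))         # 0xbe0245
print(add_in_pieces(a, b) == (a + b) & 0xFFFFFF)  # True
```

Each piece is small enough to meet a size limit and short enough to meet a timing limit. The carry between pieces is where a pipeline register could go.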


Compilation Times (Seconds on PII/400)

[Figure: bar chart of compilation times; y-axis from 0 to 9 seconds. Data labels, in extraction order: 1.36, 2.27, 1.25, 0.13, 2.43, 0.84, 8.07, 0.07, 0.95, 0.47, 0.86.]


Compilation Speed (PII/400)

[Figure: chart with two y-axes, "Bit Operations/Kernel" (0 to 20000) and "Bit Operations Compiled/Sec" (0 to 12000). Legend: bitops, bitops/sec.]
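The compilation-speed metric is a simple ratio of circuit size to compile time. As a worked example, the sketch below uses the two DCT figures from the performance table (7816 bit-ops, 2.4 s total compile time). Whether the chart uses exactly these inputs for DCT is an assumption.

```python
def bitops_per_second(bit_ops, compile_seconds):
    """Compilation speed: bit operations compiled per second of compile time."""
    return bit_ops / compile_seconds


rate = bitops_per_second(7816, 2.4)
print(round(rate))  # 3257
```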


Compilation Times Breakdown

[Figure: percentage bar chart, 0% to 100%. Legend: other, place, analysis, library, simplification, evaluation. Annotation: "Place and route".]


Placed Circuit Utilization

[Figure: bar chart, 0% to 100%. Legend: utilization, effective utilization.]


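Simulated Speed-up vs. UltraSparc @ 300 MHz

[Figure: bar chart on a logarithmic axis from 1.0 to 1000.0. Kernels: ATR, Cordic, DCT, FIR, IDEA, Nqueens, Over. Data labels, in extraction order: 328.8, 29.0, 20.6, 90.9, 61.8, 26.0, 76.1.]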


Conclusions

• Fast compilation from a high-level language (HLL) is achievable (seconds, not tens of minutes)

• High-quality output achievable (60% density)

• Linear-time Place and Route feasible using the technique of lazy noops


Future Work

• Time-multiplexing the bus

• Porting to commercial FPGAs

• Front-end from C/Java to DIL



Timing and Size Guarantees

[Figure: + operators on 24-bit values, shown decomposed into + operators on 8-bit values.]


Optimize for Register Pressure

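[Figure: ancestor nodes (&, ~, noops) already placed; a + node shown at several candidate positions.]

Cost per candidate position: 1, 2, 1, --, --, 0. One position is marked "Best position".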
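Choosing among candidate positions by cost can be sketched as a minimum over the routable ones. The cost values are those shown on the slide. Reading "--" as "not routable" and breaking ties by the lowest index are assumptions.

```python
def best_position(costs):
    """Index of the cheapest routable position, or None if there is none.

    `costs` holds one entry per candidate position; None marks a position
    that cannot be used.
    """
    routable = [(cost, index) for index, cost in enumerate(costs) if cost is not None]
    if not routable:
        return None
    return min(routable)[1]  # lowest cost wins; ties go to the lowest index


costs = [1, 2, 1, None, None, 0]  # "--" on the slide represented as None
print(best_position(costs))  # 5
```

The scan is a single pass over the candidates, which keeps the placement step greedy and free of backtracking.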


Kernels

Benchmark   Description
ATR         Automatic Target Recognition (image pattern scan).
Cordic      Honeywell timing benchmark for vector rotation.
CSD         Canonical signed multiplier with the constant 123.
DCT         One-dimensional 8-point discrete cosine transform.
Encoder     Huffman encoder for fixed frequencies.
FIR         Finite Impulse Response filter with 20 taps.
IDEA        PGP encryption algorithm.
Nqueens     8x8 queens solution tester.
Over        Porter-Duff "over" operator.
Square      Squaring a 16-bit number.
Varpoly     Evaluating a degree-3 polynomial with variable coefficients at a given point.
