Fast Compilation for Reconfigurable Hardware Mihai Budiu and Seth Copen Goldstein Carnegie Mellon University Computer Science Department Joint work with Srihari Cadambi, Herman Schmit, Matt Moe, Robert Taylor, Ronald Laufer

Page 1: Fast Compilation for Reconfigurable Hardware

Fast Compilation for Reconfigurable Hardware

Mihai Budiu and Seth Copen Goldstein
Carnegie Mellon University, Computer Science Department

Joint work with Srihari Cadambi, Herman Schmit, Matt Moe, Robert Taylor, Ronald Laufer

Page 2: Fast Compilation for Reconfigurable Hardware

FPGA, Feb 23 1999 (c) 1998 by Mihai Budiu 2

Goal

To program reconfigurable devices using the standard software development processes:

– Compile C or Java
– Do it quickly

[Figure: tool flow. Labels: Java, Partitioner, CPU, DIL (Data-flow Intermediate Language), Configuration, Reconfigurable HW, and a "This talk" marker.]

Page 3: Fast Compilation for Reconfigurable Hardware

Compiler Performance on 1D DCT (8 inputs, 8 bits each)

                       DIL            Classical tools
Total compile time     2.4 s          ~75 min (Synopsys + Design Manager)
Place and route        1 s            14 min 22 s (Design Manager)
Target clock speed     75 MHz         33 MHz
Circuit size           7816 bit-ops   899 CLBs
Application speed-up   20             ~20
Target                 PipeRench      Xilinx 4085XL

Compilation: ~700x faster

Page 4: Fast Compilation for Reconfigurable Hardware

The Place and Route Problem

[Figure: a graph of operators (+, ., <<, >>, &, ~, [1,2]) and their interconnections is mapped onto processing elements joined by an interconnection network.]

Page 5: Fast Compilation for Reconfigurable Hardware


Our Target:

• Medium grain processing elements (4 bits)

• Pipelined architecture

• Virtualized hardware

• Local interconnection network

• Wide pipelined bus

Page 6: Fast Compilation for Reconfigurable Hardware

The Place and Route Problem

[Figure: the operator graph (+, ., <<, >>, &, ~, [1,2]) mapped onto processing elements and an interconnection network, now with the label "Stripe".]

Page 7: Fast Compilation for Reconfigurable Hardware

Why Place and Route Is Hard

• Hard constraints:
  – Stripe width
  – Pipelined bus width

• Word-based circuit:
  – interconnection network switches words
  – fixed PE size

• Scarce input ports for the interconnection network

Page 8: Fast Compilation for Reconfigurable Hardware

How We Simplify Place and Route

• Computation-oriented programs (restricted language, with unidirectional data flow)

• Hardware resources virtualized

• Relatively rich interconnection network

• High granularity placement (i.e., one 32-bit adder instead of 100 gates)

• There is a wide pipelined bus available

• Timing is very predictable

Page 9: Fast Compilation for Reconfigurable Hardware


The Key Idea

• Global analysis and transformations guarantee placeability using lazy noops (conservatively)

• Deterministic, greedy place & route (no backtracking)

• All passes linear time in the size of the circuit

Page 10: Fast Compilation for Reconfigurable Hardware

Guaranteeing Placement

[Figure: the operator graph (+, ., <<, >>, &, ~, [1,2]) drawn without and with two inserted noops. Labels: "Complex permutation" (once) and "Simple permutation" (three times).]

The inserted noops are sufficient but not necessary.
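One way to picture the noop insertion is as edge splitting on a levelized graph. The sketch below is an illustration only: it assumes that the analysis gives each operator a level (its longest distance from an input) and puts one lazy noop on every level that an edge skips, so that each remaining edge joins adjacent levels. The slides do not spell out the actual rule, and the names levelize and insert_lazy_noops are hypothetical.

```python
from collections import defaultdict


def levelize(nodes, edges):
    """Give each node the length of the longest path from any source.

    nodes must be listed in topological order (an assumption of this sketch).
    """
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    level = {}
    for n in nodes:
        # Sources have no predecessors and land on level 0.
        level[n] = 1 + max((level[p] for p in preds[n]), default=-1)
    return level


def insert_lazy_noops(nodes, edges):
    """Split every edge that skips levels into a chain of lazy noops."""
    level = levelize(nodes, edges)
    new_edges, noops = [], []
    for src, dst in edges:
        prev = src
        # One noop per skipped level between the producer and the consumer.
        for lv in range(level[src] + 1, level[dst]):
            noop = f"noop_{src}_{dst}_{lv}"
            noops.append(noop)
            new_edges.append((prev, noop))
            prev = noop
        new_edges.append((prev, dst))
    return new_edges, noops


demo_nodes = ["a", "b", "c"]
demo_edges = [("a", "b"), ("b", "c"), ("a", "c")]
demo_new_edges, demo_noops = insert_lazy_noops(demo_nodes, demo_edges)
print(demo_noops)  # ['noop_a_c_1']
```

In this toy graph only the edge from a to c skips a level, so it is the only one that receives a noop. The noop is lazy in the sense of the talk: it exists in the graph but may never occupy a processing element.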

Page 11: Fast Compilation for Reconfigurable Hardware

Placement of a Non-lazy Noop

[Figure: a graph with &, ~, + and noop nodes and its placement.]

Page 12: Fast Compilation for Reconfigurable Hardware

Lazy Noops Are Not Placed

[Figure: a graph with &, ~, + and noop nodes and its placement.]

Page 13: Fast Compilation for Reconfigurable Hardware

Place and Route Overview

• Analysis:
  – Noops have been inserted to guarantee that the graph is routable.

• Place & Route:
  – will determine which lazy noops are instantiated

Next: actual Place and Route

Page 14: Fast Compilation for Reconfigurable Hardware

Step 1: Analyze Routability

[Figure: nodes +, &, ~ and two noops; a region labeled "Already placed"; a row of seven candidate positions for the +.]

Q: Can we place the + given the placement of its ancestors?

Page 15: Fast Compilation for Reconfigurable Hardware

Step 2: If a Node Is Unroutable

Solution: promote a lazy noop

[Figure: two views of the graph with +, &, ~ and two noops.]

Page 16: Fast Compilation for Reconfigurable Hardware

Step 3: Choosing a Noop

Choose the closest noop that is routable.

[Figure: two views of the graph with +, &, ~ and two noops.]
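Steps 1 to 3 can be sketched for a single edge as follows. This is a toy model, not the compiler's code: the routability test is passed in as a hypothetical predicate reachable(src, dst), "closest" is read as closest to the node being placed, and the demo predicate (a value stays readable for SPAN levels) is invented purely for illustration.

```python
def route_edge(producer, consumer, chain, reachable):
    """Return the lazy noops on one edge that must be promoted.

    chain lists the edge's lazy noops, ordered from producer to consumer.
    reachable(src, dst) is a caller-supplied routability test (hypothetical).
    """
    promoted = []
    source, start = producer, 0
    # Step 1: can the consumer read the current source directly?
    while not reachable(source, consumer):
        # Step 2: it cannot, so a lazy noop has to be promoted.
        # Step 3: scan from the consumer end for the closest routable noop.
        for i in range(len(chain) - 1, start - 1, -1):
            if reachable(source, chain[i]):
                source, start = chain[i], i + 1
                promoted.append(chain[i])
                break
        else:
            # The analysis pass is supposed to rule this out.
            raise ValueError(f"edge {producer}->{consumer} is unroutable")
    return promoted


# Toy routability rule (an assumption): a value can be read at most SPAN levels later.
SPAN = 2
level = {"a": 0, "n1": 1, "n2": 2, "n3": 3, "n4": 4, "d": 5}


def demo_reachable(src, dst):
    return level[dst] - level[src] <= SPAN


print(route_edge("a", "d", ["n1", "n2", "n3", "n4"], demo_reachable))  # ['n2', 'n4']
```

Only n2 and n4 are instantiated; n1 and n3 stay lazy and cost no hardware. Each decision is final, which matches the deterministic, no-backtracking style described in the talk.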

Page 17: Fast Compilation for Reconfigurable Hardware

Other Details

• Operators are decomposed in pieces for:
  – timing constraints
  – size constraints

• When placing, optimize for:
  – register pressure when accessing the bus
  – constraints placed on future nodes

• Long critical paths are sliced with pipeline registers

Page 18: Fast Compilation for Reconfigurable Hardware

Compilation Times (Seconds on PII/400)

[Figure: bar chart of compilation time per kernel, axis from 0 to 9 seconds. Bar labels as extracted: 1.36, 2.27, 1.25, 0.13, 2.43, 0.84, 8.07, 0.07, 0.95, 0.47, 0.86.]

Page 19: Fast Compilation for Reconfigurable Hardware

Compilation Speed (PII/400)

[Figure: per-kernel chart with two series, "bitops" (axis "Bit Operations/Kernel", 0 to 20000) and "bitops/sec" (axis "Bit Operations Compiled/Sec", 0 to 12000).]
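The two series are related by a division: compilation speed is circuit size in bit operations divided by compile time. A worked example using the DCT figures quoted earlier in the talk (7816 bit-ops, 2.4 s total compile time); the helper name is hypothetical.

```python
def bitops_per_second(bit_ops, compile_seconds):
    """Compilation speed: bit operations compiled per second of compile time."""
    return bit_ops / compile_seconds


dct_rate = bitops_per_second(7816, 2.4)
print(round(dct_rate))  # 3257
```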

Page 20: Fast Compilation for Reconfigurable Hardware

Compilation Times Breakdown

[Figure: stacked bar chart per kernel, axis from 0% to 100%. Legend: other, place, analysis, library, simplification, evaluation. Annotation: "Place and route".]

Page 21: Fast Compilation for Reconfigurable Hardware

Placed Circuit Utilization

[Figure: per-kernel chart, axis from 0% to 100%, with two series: "utilization" and "effective utilization".]

Page 22: Fast Compilation for Reconfigurable Hardware

Simulated Speed-up vs. UltraSparc @ 300 MHz

[Figure: bar chart on a logarithmic axis from 1.0 to 1000.0.]

Kernel     ATR     Cordic   DCT    FIR    IDEA   Nqueens   Over
Speed-up   328.8   29.0     20.6   90.9   61.8   26.0      76.1

Page 23: Fast Compilation for Reconfigurable Hardware

Conclusions

• Fast compilation from HLL achievable (seconds, not tens of minutes)

• High-quality output achievable (60% density)

• Linear-time Place and Route feasible using the technique of lazy noops

Page 24: Fast Compilation for Reconfigurable Hardware


Future Work

• Time-multiplexing the bus

• Porting to commercial FPGAs

• Front-end from C/Java to DIL

Page 25: Fast Compilation for Reconfigurable Hardware

How We Simplify Place and Route

• Computation-oriented programs (restricted language, with unidirectional data flow)

• Hardware resources virtualized

• Relatively rich interconnection network

• High granularity placement (i.e., one 32-bit adder instead of 100 gates)

• There is a wide pipelined bus available

• Timing is very predictable

Page 26: Fast Compilation for Reconfigurable Hardware

Timing and Size Guarantees

[Figure: a + operator with 24-bit operands and several + operators with 8-bit operands.]
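The figure appears to show a wide adder broken into narrower pieces, which fits the earlier remark that operators are decomposed to meet timing and size constraints. A minimal sketch of that idea, assuming a 24-bit addition built from 8-bit adders chained through a carry; the function name split_add and its parameters are hypothetical, not taken from the compiler.

```python
def split_add(a, b, width=24, chunk=8):
    """Add two width-bit numbers using only chunk-bit additions.

    Returns (sum modulo 2**width, carry out).
    """
    mask = (1 << chunk) - 1
    result, carry = 0, 0
    for shift in range(0, width, chunk):
        # One narrow adder: two chunk-bit slices plus the incoming carry.
        part = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (part & mask) << shift
        carry = part >> chunk
    return result, carry


print(split_add(0x00FF00, 0x000100))  # (65536, 0)
print(split_add(0xFFFFFF, 1))         # (0, 1)
```

Each loop iteration corresponds to one small operator, so the longest combinational step is bounded by the chunk width rather than by the full operand width.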

Page 27: Fast Compilation for Reconfigurable Hardware

Optimize for Register Pressure

[Figure: &, ~, two noops and a row of candidate positions for the +, one marked "Best position".]

Cost: 1 2 1 -- -- 0
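The selection on this slide can be read as a minimum over the candidate positions. The sketch below assumes that "--" marks a position that cannot be routed and that lower cost means less register pressure; both readings, and the name best_position, are assumptions rather than statements from the talk.

```python
def best_position(costs):
    """Return the index of the cheapest routable position, or None if there is none.

    costs holds one entry per candidate position; None stands for "--".
    Ties go to the lowest index.
    """
    routable = [(cost, i) for i, cost in enumerate(costs) if cost is not None]
    return min(routable)[1] if routable else None


slide_costs = [1, 2, 1, None, None, 0]
print(best_position(slide_costs))  # 5
```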

Page 28: Fast Compilation for Reconfigurable Hardware

Kernels

Benchmark   Description
ATR         Automatic Target Recognition (image pattern scan).
Cordic      Honeywell timing benchmark for vector rotation.
CSD         Canonical signed digit multiplier with the constant 123.
DCT         One-dimensional 8-point discrete cosine transform.
Encoder     Huffman encoder for fixed frequencies.
FIR         Finite Impulse Response filter with 20 taps.
IDEA        PGP encryption algorithm.
Nqueens     8x8 queens solution tester.
Over        Porter-Duff "over" operator.
Square      Squaring a 16-bit number.
Varpoly     Evaluating a degree-3 polynomial with variable coefficients at a given point.