© 2008 Altera Corporation High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware Adrian Ludwin, Vaughn Betz and Ketan Padalia


© 2008 Altera Corporation

High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware

Adrian Ludwin, Vaughn Betz and Ketan Padalia


© 2008 Altera Corporation - Public

Altera, Stratix, Cyclone, MAX, HardCopy, Nios, Quartus, and MegaCore are trademarks of Altera Corporation

FPGA Size vs CPU Performance

[Chart: growth relative to 1999, plotted for 1998 to 2007 on an axis from 0 to 35. Series: SPEC CINT2000; largest device in Quartus II.]

CPUs: 7x faster
FPGAs: 33x bigger


Our Contributions

Parallelized existing high-quality placer
  Routability, timing and power driven
  Deterministic
  Good speedups with identical quality
Present results on multicore PCs
  Identify and quantify bottlenecks


Non-Determinism

Extremely difficult to test for correctness
Extremely difficult to reproduce problems
Very unpopular with customers
  Some outright refuse to use non-deterministic (ND) algorithms
  All customers value reproducible results

We show that making our algorithms deterministic has a relatively small impact on performance.


Serial Equivalency

Any number of cores returns same result
  Including a single core (hence “serial”)
Easy if algorithm is already deterministic
Even easier to test than determinism

Serial equivalency has no additional overhead over determinism in our algorithms.


Algorithm Runtimes

[Chart: share of overall runtime. Segments: Placer Algorithms; Other Fitter (e.g. route); Other CAD (e.g. map).]

The placer algorithms in this paper are a significant portion of overall runtime, but are not a majority.


Agenda

Part I: Pipelined Moves
Part II: Parallel Moves


Algorithm Pseudo-Code

move = propose(place);            // Proposal
cost = evaluate(place, move);     // Evaluation
if (cost < 0) {
    accept(place, move);
}
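The propose/evaluate/accept loop can be made concrete with a toy example. The following is a minimal sketch, not the placer described in the talk: the one-dimensional placement, the swap-only move generator, the `wirelength` cost and the `nets` list are all assumptions introduced for illustration. Only the loop structure (propose, evaluate, accept when the cost is negative) follows the pseudo-code.

```python
import random


def wirelength(place, nets):
    """Toy cost (assumption): sum of 1-D distances between connected blocks."""
    pos = {block: loc for loc, block in enumerate(place)}
    return sum(abs(pos[u] - pos[v]) for u, v in nets)


def propose(place, rng):
    """Hypothetical move generator: pick two distinct locations to swap."""
    a, b = rng.sample(range(len(place)), 2)
    return a, b


def evaluate(place, move, nets):
    """Return the change in cost if the move were applied (negative is better)."""
    a, b = move
    before = wirelength(place, nets)
    place[a], place[b] = place[b], place[a]   # trial swap
    after = wirelength(place, nets)
    place[a], place[b] = place[b], place[a]   # undo the trial swap
    return after - before


def accept(place, move):
    """Commit the swap to the placement."""
    a, b = move
    place[a], place[b] = place[b], place[a]


def run_serial(place, nets, n_moves, seed=0):
    """Serial loop from the pseudo-code; a fixed seed keeps it deterministic."""
    rng = random.Random(seed)
    place = list(place)
    for _ in range(n_moves):
        move = propose(place, rng)
        cost = evaluate(place, move, nets)
        if cost < 0:
            accept(place, move)
    return place
```

Because the random generator is seeded, two runs with the same inputs return the same placement, which is the property the parallel versions later have to preserve.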


Effect of Pipelining Proposals

move = propose(place);            // Proposal: 40% of time
cost = evaluate(place, move);     // Evaluation: 60% of time
if (cost < 0) {
    accept(place, move);
}

Expected speedup: 1/0.6 ≈ 1.7x
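The 1.7x figure follows from the slowest stage setting the pace once each stage has its own core. A small sketch of that arithmetic, assuming ideal overlap with no communication or stall cost (the function name is illustrative):

```python
def pipelined_speedup(stage_fractions):
    """Ideal pipeline speedup: total serial time divided by the slowest stage."""
    return sum(stage_fractions) / max(stage_fractions)


# Proposal takes 40% of the serial time, evaluation 60%.
print(round(pipelined_speedup([0.4, 0.6]), 2))  # 1.67
```

This is an upper bound; it also shows why a two-stage pipeline cannot scale past the point where evaluation alone occupies a core.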


Simplistic Implementation

[Diagram: Core 0 runs Proposal; Core 1 runs Evaluation.]


Simplistic Implementation

[Diagram: Evaluation (C1) is working on Move 0 while Proposal (C0) is working on Move 1.]

In this example, C1 has just started evaluating a move, while C0 has just started proposing the next one.

Since proposals are faster than evaluations (at least in theory), C0 will finish before C1. C0 then stalls until C1 is ready to take the move.

When C1 is ready, it grabs the proposed move and starts evaluating it, and C0 can begin proposing the next move.


Simplistic Implementation

[Diagram: Evaluation (C1) is working on Move 1 while Proposal (C0) is working on Move 2.]


Naïve Pipelined Problems

1. Proposal/evaluation runtime variability
   If evaluation is faster than proposal, then the stall happens on the critical path
2. Large penalty for stalling
   After C0 stalls, it takes almost as long to wake it up as it does to propose the move in the first place!


Better Implementation

[Diagram: Proposal (C0) and Evaluation (C1).]


Better Implementation

[Diagram: Proposal (C0) places moves into an Evaluation Queue, from which Evaluation (C1) takes them.]


Better Implementation

[Diagram: Proposal (C0) places moves into an Evaluation Queue, from which Evaluation (C1) takes them.]

The queue buffers proposal/evaluation runtime variability and “hides” the stalls on C0 from the critical path on C1.
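The buffering idea can be sketched with a bounded FIFO between a proposal thread and an evaluation thread. This is an illustration under stated assumptions, not the talk's implementation: `run_pipelined`, `depth` and the `evaluate` callback are hypothetical names, and the sketch leaves out the feedback of accepted moves to the proposal stage.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the move stream


def run_pipelined(moves, evaluate, depth=4):
    """Proposal thread fills a bounded FIFO; the caller's thread evaluates in order."""
    q = queue.Queue(maxsize=depth)
    results = []

    def proposer():
        for move in moves:
            q.put(move)      # blocks only when the queue is full
        q.put(_DONE)

    producer = threading.Thread(target=proposer)
    producer.start()
    while True:
        move = q.get()       # blocks only when the queue is empty
        if move is _DONE:
            break
        results.append(evaluate(move))
    producer.join()
    return results
```

With `depth=1` this behaves much like the simplistic hand-off; a deeper queue lets the proposer run ahead, so short-term timing variation on either side is less likely to leave the evaluator waiting. The FIFO order keeps the evaluation sequence identical to the serial one.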


Proposal State Updates

[Diagram: Proposal (C0) places moves into an Evaluation Queue for Evaluation (C1); an Accepted Moves Queue connects the two stages so that accepted moves can update the proposal state.]


Proposal Example

[Diagram: Block 1 and Block 2, with Move 1 and Move 5.]

In this example, we propose a move for block 1 to an empty location.

Since we don’t know if it will ultimately be accepted by the evaluation stage, we assume (for the time being) that it will be rejected.

Some time later, if we haven’t heard back from the evaluation stage, it might be reasonable to propose a move for another block to the same “empty” location.


Evaluation Example

[Diagram: Block 1 and Block 2; Move 1 is marked accepted and Move 5 is marked with a question mark.]

In the meantime, however, the evaluation stage has accepted Move 1; it just wasn’t able to tell the proposal stage about it in time (race condition!).

But the later move to the no-longer-empty location is already in the pipe. It can no longer be performed as proposed; what should we do about this?


Resolving Collisions

When two moves have collided, we can:
  Abandon the later moves (non-deterministic)
  Attempt to “fix” colliding moves
We fix a colliding move by reproposing it
  In this example, Move 5 becomes a swap
  This gives the same move as in the serial flow
Therefore, the placer is serially equivalent
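Reproposal can be illustrated with a small model in which a move is defined by a block and a target location, and whether it is a plain move or a swap depends on what occupies the target. Everything here is an assumption made for illustration (the `Move` record, the dictionary placement, the function names); the sketch also assumes both moves are accepted and only checks the target location for collisions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Move:
    block: str
    target: int
    displaced: Optional[str]  # block seen at the target when proposed, None if empty


def propose(view, block, target):
    """Build a move from a (possibly stale) view of the placement."""
    return Move(block, target, view.get(target))


def collided(placement, move):
    """True if the target's occupant changed since the move was proposed."""
    return placement.get(move.target) != move.displaced


def commit(placement, move):
    """Apply a move; a non-empty target makes it a swap."""
    source = next(loc for loc, b in placement.items() if b == move.block)
    placement[source] = move.displaced
    placement[move.target] = move.block


def finalize(placement, move):
    """Repropose a collided move against the real placement, then commit it."""
    if collided(placement, move):
        move = propose(placement, move.block, move.target)
    commit(placement, move)
    return move


placement = {0: "block1", 1: "block2", 2: None}
stale = dict(placement)                 # the proposal stage's out-of-date view
move1 = propose(stale, "block1", 2)     # target looks empty
move5 = propose(stale, "block2", 2)     # target still looks empty
finalize(placement, move1)
fixed = finalize(placement, move5)      # reproposed as a swap with block1
```

After both calls, `fixed.displaced` is `"block1"`, which is what a serial run would have produced by proposing the second move against the up-to-date placement.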


Platforms

nb: Netburst x2 (Pentium 4). Cores C0 and C1, each with its own cache ($0, $1); memory controller; 2 GB.
opt-dc, opt-dp, opt-mc: Dual-core Opteron x2. Cores C0 to C3, each with its own cache ($0 to $3); two memory controllers with 4 GB each.
c2-dc, c2-dp, c2-mc: Core 2 Duo x2. Cores C0 to C3, with one shared cache per pair ($0/1, $2/3); memory controller; 16 GB.

To test a two-core algorithm on a four-core machine, we can either use two cores on the same package (“dc” = “dual core”) or use one core on each package (“dp” = “dual processor”). This decision has a large influence on the performance of the algorithm.


Pipelined Results - 11 Circuits

[Chart: parallel speedup (axis from 1 to 1.7) on nb, opt-dc, opt-dp, c2-dc and c2-dp.]

The results are far lower than the 1.7x ideal. Note that the best and worst results are both on the same platform (Core 2). Where is the runtime going on c2-dp?


Algorithm Components – c2-dp

[Chart: microseconds (axis from 0 to 60) for serial, serial with timers, pipelined equivalent, pipelined with timers and pipelined. Components: evaluation, infrastructure, reproposals, stall, proposal, all.]

This is the pipelined algorithm, but with both stages taking turns on the same core.

This uses high-resolution timers to show the runtime of each stage.

For the pipelined algorithm, we ignore the proposal time since it’s “hidden.”

But why has the evaluation time gotten so big?


Explaining the Results

Reproposals, stalls are very fast
Memory is bottleneck on 4/5 platforms
  Exception: c2-dc has large, shared cache
Many, many more details are in the paper


Pipelined Moves Summary

Poor inherent scalability, memory usage
Reasonable speedups for amount of work
  Far less work than fully parallel moves


Agenda

Part I: Pipelined Moves
Part II: Parallel Moves


Stages with Thread-Safe Code

move = propose(place);
cost = evaluate(place, move);
if (cost < 0) {
    accept(place, move);
}

Processing (propose and evaluate): 99% of time
Finalization (resolve collisions and commit): 1% of time
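The 99%/1% split suggests why this decomposition has more headroom than the pipeline. As a rough estimate only, and assuming that finalization is fully serialized while processing divides perfectly across cores (neither assumption is stated in the talk), Amdahl's law gives:

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: speedup when only the non-serial part scales with cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Finalization (1% of time) treated as the serial part.
for n in (2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.01, n), 2))
```

This ignores memory bandwidth, reproposals and stalls, so it is a ceiling on what the stage split allows rather than a prediction of measured results.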


[Diagram: Core 0 to Core 3 each run Processing (propose and evaluate), shown as Process(C0) to Process(C3). Processed moves go to a Queue, followed by Finalize: Finalization (resolve collisions and commit).]


[Diagram: Process(C0) to Process(C3) work on Moves 0 to 4; processed moves wait in the Queue, and Finalize(C0) finalizes them.]

All four cores begin processing moves at the same time. Since finalizing moves is so fast, it would be a waste to devote a core to that task. Instead, all cores have the ability to finalize moves at the appropriate time, as this example will show.

If a move finishes out of order, it sits in the priority queue until the earlier moves are finished. Meanwhile, the core that processed it goes on to the next move. It does not stall and wait for any other cores.

The priority queue now has two moves ready to be finalized.

The core that processed this last move now becomes responsible for finalizing all the moves in the queue. Note that it did not have to wait for any other core; it knows that the move it inserted went to the front of the queue.


Supervisor (2)

[Diagram: Finalize(C0) handles Move 0 and Move 1 from the Queue while Process(C1), Process(C2) and Process(C3) continue with Moves 2 to 4.]


Supervisor (3)

[Diagram: Process(C0) to Process(C3) continue with Moves 2 to 7; Finalize(C2) handles Move 2 and Move 3, after which C2 returns to processing.]

Once a core has finished finalizing moves, it immediately goes back to processing them. The algorithm continues, with any core being able to finalize moves whenever it’s appropriate.
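The scheme described above (process on any core, hold out-of-order results in a priority queue, let whichever core completes the oldest outstanding move drain the queue) can be sketched as follows. This is a simplified illustration, not the talk's code: `run_parallel`, `process` and `finalize` are hypothetical names, and `process` is assumed to be thread-safe and independent of the other moves, so no reproposal is shown.

```python
import heapq
import threading


def run_parallel(n_moves, process, finalize, n_workers=4):
    """Process moves on any worker; finalize them strictly in move order."""
    lock = threading.Lock()
    ready = []  # min-heap of (move_index, result) waiting to be finalized
    state = {"next_to_claim": 0, "next_to_finalize": 0}

    def worker():
        while True:
            with lock:
                i = state["next_to_claim"]
                if i >= n_moves:
                    return
                state["next_to_claim"] = i + 1
            result = process(i)  # the expensive part runs outside the lock
            with lock:
                heapq.heappush(ready, (i, result))
                # The worker whose move reaches the front drains the queue.
                while ready and ready[0][0] == state["next_to_finalize"]:
                    idx, res = heapq.heappop(ready)
                    finalize(idx, res)
                    state["next_to_finalize"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

No worker ever waits for another to finish processing: a result that arrives early simply stays in the heap. Because finalization always happens in index order, the sequence of committed moves is the same for any worker count, which is the serial-equivalency property in miniature.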


Parallel Results - 11 Circuits

[Chart: parallel speedup (axis from 1.0 to 2.0) on nb, opt-dc, opt-dp, c2-dc, c2-dp, opt-mc and c2-mc. Series: Pipelined; Parallel Moves - 2 Cores; Parallel Moves - 4 Cores.]


Algorithm Components – c2-mc

[Chart: microseconds (axis from 0 to 100) for serial, serial + timers, parallel equiv., parallel per move, parallel est. and parallel. Components: process, infrastructure, repropose/reevaluate, stall, all.]


Parallel Moves Summary

Memory still bottleneck
  Especially at 4 cores
  But less than in pipelined
Much more scalable (N instead of 1.7x)


Conclusions

Significant parallelism in existing placer
  Believe sufficient parallelism for 8-16 cores
  More independent moves could scale further
Determinism has a relatively low cost
Memory is largest parallel bottleneck
  Better hardware will help
  A first-order concern for algorithm developers