Spatial Computation
Mihai Budiu, CMU CS
Thesis committee: Seth Goldstein, Peter Lee, Todd Mowry, Babak Falsafi, Nevin Heintze
Ph.D. Thesis defense, December 8, 2003
SCS
2
Spatial Computation
A model of general-purpose computation based on Application-Specific Hardware.
3
Thesis Statement
Application-Specific Hardware (ASH):
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
4
Outline
• Introduction
• Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
5
CPU Problems
• Complexity
• Power
• Global Signals
• Limited ILP
6
Design Complexity
from Michael Flynn's FCRC 2003 talk (source: S. Malik, orig. Sematech)
[Chart, 1981–2009: logic transistors/chip grow at 58%/year while design productivity (transistors/staff-month) grows at only 21%/year. Design time: CAD productivity favors FPL.]
7
Communication vs. Computation
gate delay: 5ps; wire delay: 20ps
Power consumption on wires is also dominant
8
Our Approach: ASH
Application-Specific Hardware
9
Resource Binding Time
[Diagram: CPU vs. ASH — the times at which programs are bound to hardware resources differ.]
10
Hardware Interface
CPU: software above the ISA, hardware below it.
ASH: software above a virtual ISA, hardware (gates) below it.
11
Application-Specific Hardware
C program → Compiler → Dataflow IR → Reconfigurable/custom hw
12
Contributions
[Diagram spanning theory and systems: compilation, computer architecture, reconfigurable computing, embedded systems, asynchronous circuits, high-level synthesis, dataflow machines, nanotechnology.]
13
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
14
Computation = Dataflow
• Operations → functional units
• Variables → wires
• No interpretation

Programs:
x = a & 7;
...
y = x >> 2;

Circuits: [dataflow graph: a and 7 feed &, producing x; x and 2 feed >>, producing y]
15
Basic Operation
[A functional unit (+) with data, valid, and ack signals and an output latch.]
16
Asynchronous Computation
[Animation, steps 1–8: data flows through a chain of + units; each transfer uses the data/valid/ack handshake between latches.]
17
Distributed Control Logic
[Diagram: instead of a global FSM, rdy/ack asynchronous control between units — short, local wires.]
18
Forward Branches
if (x > 0) y = -x;
else y = b*x;
[Circuit: both − and * are always evaluated; the > comparison drives a multiplexer that selects y. The multiplier is on the critical path.]
Conditionals → Speculation
19
Control Flow → Data Flow
[Primitives: Merge (label) combines data from alternative paths; Gateway admits data under a predicate; Split (branch) steers data according to predicate p.]
20
Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Circuit: merges feed i and sum around back edges; +1 and < 100 update and test i; * and + accumulate sum; when the predicate fails, sum is steered to ret.]
21
Predication and Side-Effects
[A Load unit with addr, data, pred, and token ports; tokens flow to memory.]
• no speculation
• sequencing of side-effects
22
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
23
Outline
• Introduction
• CASH: Compiling for ASH
  – An optimization on the SIDE
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
24
Availability Dataflow Analysis
y = a*b;
...
if (x) {
    ...
    ... = a*b;   /* available: reuse y */
}
25
Dataflow Analysis Is Conservative
if (x) {
    ...
    y = a*b;
}
...
... = a*b;   /* y? — available only if x was true */
26
Static Instantiation, Dynamic Evaluation
flag = false;
if (x) {
...
y = a*b;
flag = true;
}
...
... = flag ? y : a*b;
27
SIDE Register Promotion Impact
[Charts: percent reduction in dynamic stores (top, "%st promo" vs. "%st PRE", 0–30%, one bar at 53%) and loads (bottom, "% ld promo" vs. "% ld PRE", 0–45%) across Mediabench (adpcm, gsm, epic, mpeg2, jpeg, pegwit, g721, pgp, rasta, mesa) and SpecInt (099.go through 300.twolf) benchmarks.]
28
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
29
Performance Evaluation
[Setup: ASH connected through an LSQ with limited bandwidth to L1 (8K), L2 (1/4M), and memory; baseline CPU: 4-way OOO with the same memory hierarchy.]
Assumption: all operations have the same latency.
30
Media Kernels, vs 4-way OOO
[Chart: "times faster" than the 4-way OOO per Mediabench kernel (adpcm, epic, g721, gsm, jpeg, mesa, mpeg2, pegwit, rasta), mostly 0.5×–3×; a few bars exceed the scale (annotated 12, 5.8, 5.8).]
31
Media Kernels, IPC
[Chart: base (OOO) IPC vs. ASH IPC per kernel; base IPC stays under 4, while ASH IPC reaches about 25.]
32
Speed-up IPC Correlation
[Chart: per-kernel speed-up and IPC ratio ("times bigger") track each other closely, mostly within 0–10 (one value at 12).]
33
Low-Level Evaluation
C → CASH core → Verilog back-end → Synopsys/Cadence place & route → ASIC
180nm std. cell library, 2V (~1999 technology).
Results shown so far (and all results in the thesis) use the CASH core; the next two slides use the full ASIC flow.
34
Area
[Chart: synthesized area per kernel, 0–12 square mm. Reference: P4 in 180nm has 217 mm².]
35
Power
vs. 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), ~6W
[Power ratio ("times smaller than OOO"): adpcm_d 70, g721_d 41, g721_e 41, gsm_d 129, gsm_e 147, jpeg_d 94, mpeg2_d 121, mpeg2_e 136, pegwit_d 303, pegwit_e 303.]
36
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
37
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
  – dataflow pipelining
• ASH vs. superscalar processors
• Conclusions
38
Pipelining
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Circuit as before, with a pipelined multiplier (8 stages); animation starts at cycle=1.]
39
[Animation, cycles 2–5: successive iterations (i=0, i=1, ...) flow through the multiplier pipeline; pipeline balancing is needed.]
43
Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions
44
This Is Obvious!
ASH runs at full dataflow speed, so a CPU cannot do any better (if the compilers are equally good).
45
SpecInt95, ASH vs 4-way OOO
[Chart: percent slower/faster than the 4-way OOO on SpecInt95 (099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex), ranging from about −50% to +30%.]
Branch Prediction
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
[Circuit: i+1 and the < comparison are ANDed with !exception to form the loop predicate.]
Predicted not taken: effectively a noop for the CPU! The result is available before the inputs.
Predicted taken.
[The ASH critical path goes through the exception check; the CPU critical path does not.]
47
SpecInt95, perfect prediction
[Chart: percent slower/faster on SpecInt95, baseline vs. perfect prediction, ranging from about −60% to +60%; no data for some benchmarks.]
ASH Problems
• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is "far"
• Fully static:
  – No branch prediction
  – No dynamic unrolling
  – No register renaming
• Calls/returns not lenient
• ...
49
Thesis Statement
Application-Specific Hardware:
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.
50
Outline
Introduction
+ CASH: Compiling for ASH
+ Media processing on ASH
+ ASH vs. superscalar processors
= Conclusions
51
Strengths
ASH: low power; simple verification?; specialized to app.; unlimited ILP; simple hardware; no fixed window.
CPU: economies of scale; highly optimized; branch prediction; control speculation; full-dataflow; global signals/decisions.
52
Conclusions
• Compiling “around the ISA” is a fruitful research approach.
• Distributed computation structures require more synchronization overhead.
• Spatial Computation efficiently implements high-ILP computation with very low power.
53
Backup Slides
• Control logic
• Pipeline balancing
• Lenient execution
• Dynamic Critical Path
• Memory PRE
• Critical path analysis
• CPU + ASH
54
Control Logic
[C-elements and a register generate rdy_out/ack_out from rdy_in/ack_in, latching data_in to data_out.]
55
Last-Arrival Events
[A + unit with data, valid, and ack signals.]
• The event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges
56
Dynamic Critical Path
1. Start from the last node
2. Trace back along last-arrival edges
3. Some edges may repeat
57
Critical Paths
if (x > 0) y = -x;
else y = b*x;
[Circuit: the critical path runs through the multiplier even when the "then" side is selected.]
58
Lenient Operations
if (x > 0) y = -x;
else y = b*x;
[A lenient multiplexer produces y as soon as the selected input is known.]
Solve the problem of unbalanced paths.
59
Pipelining (continued)
[Animation, cycles 6–7: i's loop runs ahead of sum's loop through the long-latency multiplier pipe; the predicate ack edge is on the critical path.]
62
Pipeline Balancing
[A decoupling FIFO between i's loop and sum's loop takes the predicate ack edge off the critical path (cycle=7).]
64
Register Promotion
*p=… (p1);  …=*p (p2)   ⇒   *p=… (p1);  …=*p (p2 ∧ ¬p1)
The load is executed only if the store is not.
65
Register Promotion (2)
*p=… (p1);  …=*p (p2)   ⇒   *p=… (p1);  …=*p (false)
• When p2 ⇒ p1 the load becomes dead...
• ...i.e., when the store dominates the load in the CFG
66
≈ PRE
…=*p (p1);  …=*p (p2)   ⇒   …=*p (p1 ∨ p2)
This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
67
Store-store (1)
*p=… (p1);  *p=… (p2)   ⇒   *p=… (p1 ∧ ¬p2);  *p=… (p2)
• When p1 ⇒ p2 the first store becomes dead...
• ...i.e., when the second store post-dominates the first in the CFG
68
Store-store (2)
*p=… (p1);  *p=… (p2)   ⇒   *p=… (p1 ∧ ¬p2);  *p=… (p2)
• The token edge is eliminated, but...
• ...the transitive closure of tokens is preserved
69
A Code Fragment
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    Y[i] = X[j].q;
}
SpecINT95: 124.m88ksim, init_processor, stylized
70
Dynamic Critical Path
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
[Annotated circuit: the load predicate, loop predicate, sizeof(X[j]) increment, and the definition of X[j].r lie on the dynamic critical path.]
71
MIPS gcc Code
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

LOOP:
L1: beq   $v0,$a1,EXIT  ; X[j].r == i
L2: addiu $v1,$v1,20    ; &X[j+1].r
L3: lw    $v0,0($v1)    ; X[j+1].r
L4: addiu $a0,$a0,1     ; j++
L5: bne   $v0,$a3,LOOP  ; X[j+1].r == 0xF
EXIT:

L1 → L2 → L3 → L5 → L1: a 4-instruction loop-carried dependence.
72
If Branch Prediction Correct
L1 → L2 → L3 → L5 → L1: the superscalar is issue-limited! 2 cycles/iteration sustained.
for (j = 0; X[j].r != 0xF; j++)
if (X[j].r == i)
break;
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r == 0xF
EXIT:
73
Critical Path with Prediction
Loads are not speculative.
for (j = 0; X[j].r != 0xF; j++)
if (X[j].r == i)
break;
74
Prediction + Load Speculation
~4 cycles! The load is not pipelined (self-anti-dependence through its ack edge).
for (j = 0; X[j].r != 0xF; j++)
if (X[j].r == i)
break;
75
OOO Pipe Snapshot
[Pipeline occupancy table (stages IF, DA, EX, WB, CT): several copies of L1, L2, L3, L5 in flight at once, enabled by register renaming.]
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r == 0xF
EXIT:
76
Unrolling?
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}
[Annotation: when 1 iteration.]
77
Ideal Architecture
[CPU handles low-ILP computation + OS + VM; ASH handles high-ILP computation; both connect to Memory.]