32
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #13 – Other Spatial Styles

CprE / ComS 583 Reconfigurable Computing

  • Upload
    kris

  • View
    25

  • Download
    0

Embed Size (px)

DESCRIPTION

CprE / ComS 583 Reconfigurable Computing. Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #13 – Other Spatial Styles. Systolic Architectures. Goal – general methodology for mapping computations into hardware (spatial computing) structures - PowerPoint PPT Presentation

Citation preview

Page 1: CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583Reconfigurable Computing

Prof. Joseph ZambrenoDepartment of Electrical and Computer EngineeringIowa State University

Lecture #13 – Other Spatial Styles

Page 2: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.2

Systolic Architectures

• Goal – general methodology for mapping computations into hardware (spatial computing) structures

• Composition:• Simple compute cells (e.g. add, sub, max, min)• Regular interconnect pattern• Pipelined communication between cells• I/O at boundaries

xx + x min

x c

Page 3: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.3

Example – Matrix-Vector Product

2

1

2

1

21

22221

11211

nnnnnn

n

n

y

y

y

x

x

x

aaa

aaa

aaa

for (i=1; i<=n; i++) for (j=1; j<=n; j++) y[i] += a[i][j] * x[j];

Page 4: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.4

Matrix-Vector Product (cont.)

t = 3

t = 2

t = 1

t = 4

a31

a21

a11

a41

a22

a12

a23

a13

a23

a14

x1 x2 x3 x4

y2

y3

y4

y1

t = n+1

t = n+2

t = n+3

t = nxn…

Page 5: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.5

Banded Matrix-Vector Product

0

0

4

3

2

1

4

3

2

1

45444342

34333231

232221

1211

nn y

y

y

y

y

x

x

x

x

x

aaaa

aaaa

aaa

aa

q

p

for (i=1; i<=n; i++) for (j=1; j<=p+q-1; j++) y[i] += a[i][j-q-i] * x[j];

Page 6: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.6

Banded Matrix-Vector Product (cont.)

t = 3

t = 2

t = 1

t = 4

a12

a11

a22

a21

a31

t = 3

t = 4

t = 5

t = 2

x2

x3

yout yin

xin xout

ain

t = 1 x1

t = 5 a23 – a32 –

yi

Page 7: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.7

Outline

• Recap• Non-Numeric Systolic Examples• Systolic Loop Transformations

• Data dependencies• Iteration spaces• Example transformations

• Reading – Cellular Automata• Reading – Bit-Serial Architectures

Page 8: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.8

Example – Relational Database

• Relation is a collection of tuples that all have the same attributes • Tuple is a fixed number of objects• Represented in a table

tuple # Name School Age QB Rating

0 T. Brady Michigan 30 141.8

1 T. Romo E. Illinois 27 112.9

2 J. Delhomme LA-Lafayette 32 111.9

3 P. Manning Tennessee 31 110.4

4 R. Grossman Florida 27 45.2

Page 9: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.9

• Intersection: A ∩ B – all records in both relation A and B

• Must compare all |A| x |B| tuples• Compare via sequence compare

• Or along row or column to get inclusion bitvector

Database Operations

B1

A1 T11

A2 T21

A3 T31

B2

T12

T22

T32

B3

T13

T23

T33

A1

A2

A3

B3

B2

B1

T12

T21

Page 10: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.10

Database Operations (cont.)

• Tuple Comparison• Problem – tuples are long, comparison time might limit

computation rate• Strategy – perform comparison in pipelined manner by

fields• Stagger fields• Arrange to compute field i on cycle after i-1

• Cell: tout = tin and ain xnor bin

True

B[j,1]B[j,2]

B[j,3]B[j,4]

A[i,1]A[i,2]

A[i,3]A[i,4]

Page 11: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.11

Database Intersection

b51

a21True

True

True

True

True

True

True

b52b43

a13

b44

b42

a22

b34

a14

b41

a31

b33

a23

b31

a41

b23

a23

b21

a51

b13

a43

b32

a32

b24

a24

b22

a42

b14

a34

a52 a44

T12

T21

T22

Page 12: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.12

Database Intersection (cont.)

b51

a21True

True

True

True

True

True

True

b52b43

a13

b44

b42

a22

b34

a14

b41

a31

b33

a23

b31

a41

b23

a23

b21

a51

b13

a43

b32

a32

b24

a24

b22

a42

b14

a34

a52 a44

OR

FALSE

OR

OR

OR

OR

OR

OR

T3

T2

T1

Page 13: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.13

Database Operations (cont.)

• Unique: remove duplicate elements in multirelation A• Intersect A with A

• Union: A U B – one copy of all tuples in A and B• Concatenate A and B• Use Unique to remove duplicates

• Projection: collapse A by removing select fields of every tuple• Sample fields in A’• Use Unique to remove duplicates

Page 14: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.14

Database Join

• Join: AJCA,CBB – where columns CA in A

intersect columns CB in B, concatenate tuple Ai and Bj

• Match CA of A with CB of B

• Keep all Ti,j

• Filter i,j for which Ti,j = true

• Construct join from matched pairs

• Claim: Typically, | AJCA,CBB | << | A | | B |

Page 15: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.15

Database Summary

• Input database – O(n) data• Operations require O(n2) data

• O(n) if sorted first• O(n log(n)) to sort

• Systolic implementation – works on O(n) processing elements in O(n) time

• Typical database [KunLoh80A]:• 1500 bit tuples• 10,000 records in a relation• ~1 4-LUT per bit-compare

• ~1600 XC4062 FPGAs• ~84 XC4LX200 FPGAs

Page 16: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.16

Systolic Loop Transformations

• Automatically re-structure code for• Parallelism• Locality

• Driven by dependency analysis

Page 17: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.17

Defining Dependencies

• Flow Dependence• Anti-Dependence• Output Dependence• Input Dependence

S1) a = 0;S2) b = a;S3) c = a + d + e;S4) d = b;S5) b = 5+e

W R δf

R W δa

W W δo

R R δi

true

false

Page 18: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.18

Example Dependencies

S1) a = 0;S2) b = a;S3) c = a + d + e;S4) d = b;S5) b = 5+e S1 δf S2 due to a

S1 δf S3 due to a

S2 δf S4 due to b

S3 δa S4 due to d

S4 δa S5 due to b

S2 δo S5 due to b

S3 δi S5 due to e

2

3

4

5

1

Page 19: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.19

Data Dependencies in Loops

• Dependence can flow across iterations of the loop

• Dependence information is annotated with iteration information

• If dependence is across iterations it is loop carried otherwise loop independent

for (i=0; i<n; i++) { A[i] = B[i]; B[i+1] = A[i];}

δf loop carried

δf loop independent

Page 20: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.20

Unroll Loop to Find Dependencies

for (i=0; i<n; i++) { A[i] = B[i]; B[i+1] = A[i];}

δf loop carried

δf loop independent

A[0] = B[0];

B[1] = A[0];

A[1] = B[1];

B[2] = A[1];

A[2] = B[2];

B[3] = A[2];

...

i = 0

i = 1

i = 2

Distance/direction of dependence is

also important

Page 21: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.21

Thought Exercise

• Consider the Laplace Transformation:

• In teams of two, try to determine the flow dependencies, anti dependencies, output dependencies, and input dependencies• Use loop unrolling to find dependencies

• Most dependencies found gets a prize

0 )( )( )( dttfssFfL st

for (i=1; i<N; i++) for (j=1; j<N; j++) c = -4*a[i][j] + a[i-1][j] + a[i+1][j]; c += a[i][j+1] + a[i][j-1] b[i][j] = c; }}

Page 22: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.22

Iteration Space

• Every iteration generates a point in an n-dimensional space, where n is the depth of the loop nest

for (i=0; i<n; i++) {

...}

for (i=0; i<n; i++) { for (j=0; j<5; j++) {

... }}

[3; 2]

[4]

Page 23: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.23

Distance Vectorsfor (i=0; i<n; i++) { A[i] = B[i]; B[i+1] = A[i];}

A[0] = B[0];

B[1] = A[0];

A[1] = B[1];

B[2] = A[1];

A[2] = B[2];

B[3] = A[2];

...

i = 0

i = 1

i = 2

Distance vector is the difference

between the target and source iterations

d = It - Is

Exactly the distance of the dependence, i.e.,

Is + d = It

Page 24: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.24

A0,1= =A0,1

B0,2= =B0,1

C1,1= =C0,2

A0,1= =A0,1

B0,2= =B0,1

C1,1= =C0,2

for (i=0; i<n; i++) { for (j=0; j<m; j++) { A[i,j] = ; = A[i,j]; B[i,j+1] = ; = B[i,j]; C[i+1,j] = ; = C[i,j+1];}

Distance Vectors Example

A0,2= =A0,2

B0,3= =B0,2

C1,2= =C0,3

A1,2= =A1,2

B1,3= =B1,2

C2,2= =C1,3

A2,2= =A2,2

B2,3= =B2,2

C3,2= =C2,3

A1,1= =A1,1

B1,2= =B1,1

C2,1= =C1,2

A2,1= =A2,1

B2,2= =B2,1

C3,1= =C2,2

A0,0= =A0,0

B0,1= =B0,0

C1,0= =C0,1

A1,0= =A1,0

B1,1= =B1,0

C2,0= =C1,1

A2,0= =A2,0

B2,1= =B2,0

C3,0= =C2,1

j

i

A0,2= =A0,2

B0,3= =B0,2

C1,2= =C0,3

A1,1= =A1,1

B1,2= =B1,1

C2,1= =C1,2

A0,0= =A0,0

B0,1= =B0,0

C1,0= =C0,1

A2,0= =A2,0

B2,1= =B2,0

C3,0= =C2,1

B yields [0; 1]

A yields [0; 0]

C yields [1; -1]

Page 25: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.25

FIR Distance Vectors

for (i=0; i<n; i++) for (j=0; j<m; j++) Y[i] = Y[i]+X[i+j]*W[j];

Y yields: δa [0; 0]

Y yields: δf [0; 1]

X yields: δi [1; -1]

W yields: δi [1; 0]

Y0= =Y0

=X3

=W3

Y1= =Y1

=X4

=W3

Y2= =Y2

=X5

=W3

Y3= =Y3

=X6

=W3Y0= =Y0

=X2

=W2

Y1= =Y1

=X3

=W2

Y2= =Y2

=X4

=W2

Y3= =Y3

=X5

=W2Y0= =Y0

=X1

=W1

Y1= =Y1

=X2

=W1

Y2= =Y2

=X3

=W1

Y3= =Y3

=X4

=W1Y0= =Y0

=X0

=W0

Y1= =Y1

=X1

=W0

Y2= =Y2

=X2

=W0

Y3= =Y3

=X3

=W0

Page 26: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.26

Re-label / Pipeline Variables

• Remove anti-dependencies and input dependencies by relabeling or pipelining variables

• Creates new flow dependencies• Removes anti/input dependencies

for (i=0; i<n; i++) { for (j=0; j<m; j++) { Wi[j] = Wi-1[j]; Xi[i+j]=Xi-1[i+j]; Yj[i] = Yj-1[i]+Xi[i+j]*Wi[j]; }}

1- 0 1

1 1 0 D

Y W X

Page 27: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.27

FIR Dependencies

Y20= =Y1

0

X02= =X-1

2

W02= =W-1

2

1- 0 1

1 1 0 D

Y W X

Y21= =Y1

1

X13= =X0

3

W12= =W0

2

Y22= =Y1

2

X24= =X1

4

W22= =W1

2

Y10= =Y0

0

X01= =X-1

1

W01= =W-1

1

Y11= =Y0

1

X12= =X0

2

W11= =W0

1

Y12= =Y0

2

X23= =X1

3

W21= =W1

1

Y00= =Y-1

0

X00= =X-1

0

W00= =W-1

0

Y01= =Y-1

1

X11= =X0

1

W02= =W0

0

Y02= =Y-1

2

X22= =X1

2

W20= =W1

0

j

i

Page 28: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.28

Transforming to Time and Space

• Using data dependencies, find T• T defines a mapping of the iteration space into

a time component π, and a space component, S

• T = [π; S]• If π·I1 = π·I2, then I1 and I2 execute at the same

time• π·d – amount of time units to move data items

(π·d > 0)• Any S can be picked that makes T a bijection

• See [Mol83A] for more details

Page 29: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.29

Calculating T for FIR

• For π = [p1 p2]

• Since π·d > 0, we see that:• p2 != 0 (from Y)

• p1 != 0 (from W)

• p1 > p2 (from X)

• Smallest solution π = [2 1]• S can be [1 0], [0 1], [1 1]

1- 0 1

1 1 0 D

Y W X

Page 30: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.30

An Example Transformation

Space

Tim

e

(0,0)

(0,1)

(0,2)

(0,3)

(1,0)

(1,1)

(1,2)

(1,3)

(2,0)

(2,1)

(2,2)

(2,3)

(3,0)

(3,1)

(3,2)

(3,3)

(0,4) (1,4) (2,4) (3,4)

1- 0 1

1 1 0 D

Y W X

XY

W

Page 31: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.31

An Example Transformation (cont.)

ji

j2i

j

iT

0 1 1

1 2 1 DT

1 1

1 2 T

Space

Tim

e

(0,0)

(0,1)

(0,2)

(0,3)

(1,0)

(1,1)

(1,2)

(1,3)

(2,0)

(2,1)

(2,2)

(2,3)

(3,0)

(3,1)

(3,2)

(3,3)

(0,4) (1,4) (2,4) (3,4)

XY

W

Y W X

Page 32: CprE / ComS 583 Reconfigurable Computing

CprE 583 – Reconfigurable ComputingOctober 2, 2007 Lect-13.32

Summary

• Non-numeric (database ops) example of systolic computing• Multiple use of each input data item• Concurrency• Regular data and control flow

• Loop transformations• Data dependency analysis• Restructure code for parallelism, locality