
MAP C - CSUcs475/f14/Lectures/Streams.pdf


Page 1

RSS!2006 Copyright © 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.

The Power of Streams on the SRC MAP®

Wim Bohm, Colorado State University


MAP C

■ Pure C runs on the MAP
  – Generated code: circuits
    • Basic blocks in outer loops become special-purpose hardware "function units"
    • Basic blocks in inner loop bodies are merged and become pipelined circuits
■ Sequential semantics obeyed
  – One basic block executed at a time
  – Pipelined inner loops are slowed down to disambiguate read / write conflicts if necessary
  – The MAP C compiler identifies (the cause of) loop slowdown
■ You want more parallelism?
  – Unroll loops
  – Use (OpenMP-style) parallel sections (see the sketch below)
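The slides do not show MAP C syntax for these two transformations, so the sketch below uses plain C with OpenMP as a software stand-in: a manually 4-way-unrolled loop and two independent tasks in parallel sections. The pragmas, names, and the unroll factor are illustrative assumptions, not MAP C code.

    #define N 1024

    /* Manual 4-way unroll of an inner loop body (on the MAP, the unrolled
     * copies would become parallel hardware inside one pipelined circuit). */
    static void scale(double *a, double s)
    {
        for (int i = 0; i < N; i += 4) {
            a[i]     *= s;
            a[i + 1] *= s;
            a[i + 2] *= s;
            a[i + 3] *= s;
        }
    }

    /* OpenMP-style parallel sections: two independent tasks running side
     * by side, analogous to MAP C parallel sections. Compile with OpenMP
     * enabled (e.g. -fopenmp); without it the pragmas are simply ignored. */
    static void two_tasks(double *a, double *b)
    {
    #pragma omp parallel sections
        {
    #pragma omp section
            scale(a, 2.0);
    #pragma omp section
            scale(b, 0.5);
        }
    }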

Page 2


Memory Model

■ Variable = register
■ Array = memory
  – OBM: declare OBMx to be an array with element type t, where t is some 64-bit type (SRC6: 6 OBMs)
  – Block RAM: an array lives by default in Block RAM(s); more freedom in element type (SRC6: 444 Block RAMs on the VP100)
■ 1 array access = 1 clock cycle
  – but the compiler can slow down loops for read / write disambiguation
  – loop slowdown: the loop runs at a rate of more than 1 clock cycle per iteration (a simple picture of the model follows)
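A minimal plain-C picture of the model above. The MAP C bank-binding declarations are not shown in the slides, so none are invented here; the comments only mark where the mapping would apply.

    #include <stdint.h>

    #define N 8192

    /* On the MAP, the scalar 'acc' maps to a register, while the array
     * 'buf' would live in an OBM (64-bit element type) or in Block RAM;
     * the actual bank declarations are MAP C specific and omitted.      */
    int64_t sum_buffer(const int64_t buf[N])
    {
        int64_t acc = 0;                /* register                        */
        for (int i = 0; i < N; i++)     /* one array access per iteration, */
            acc += buf[i];              /* ideally one clock per iteration */
        return acc;
    }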


Programming with loops and arrays

■ Transformational approach
  – Start with pure C code
    • the compiler indicates (reasons for) loop slowdowns
  – Performance tune (removing inefficiencies)
    • avoid re-reading data from OBMs by using delay queues
    • avoid read / write conflicts in the same iteration
  – Partition code and data
    • distribute data over OBMs and Block RAMs, unroll loops
    • distribute code over two FPGAs, MPI-style communication
    • only one chip at a time can access a particular OBM
■ Goal 1: 1 loop iteration = 1 clock cycle
  – can very often be achieved
■ Goal 2: exploit as much bandwidth as possible
  – through loop unrolling and task parallelism
  – a balancing act: chip area versus bandwidth

Page 3


How to performance tune: Macros

■ C code can be extended using macros, allowing program transformations that cannot be expressed straightforwardly in C
■ Macros have semantics unlike C functions: they have a rate and a delay, so they can be integrated into loops
  – rate: number of clocks between inputs
  – delay: number of clocks between input and output
  – macros can have state, kept between macro calls (see the analogy below)
■ Two types of macros
  – system provided
  – user provided (Verilog or schematic capture)


Delay Queues for efficient Window Access

■ Keep data on chip using delay queues
  – delay queue = system macro, implemented with
    • SRL16s (small, fixed size)
    • BRAMs (large, variable size)
■ Simple example: a 3x3 window stepping 1 by 1 in column-major order over an image 16 deep
  – of the 9 points in the stencil, 8 have been seen before
  – per window, the leading point should be the only data access (a plain-C sketch follows)

(figure: input array traversal)
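A plain-C sketch of the technique for this example: a 16-deep image traversed in column-major order, two 16-word delay queues modeled as circular buffers, and the leading point input[i] as the only memory read per window. This illustrates the idea only; it is not the SRC system macro.

    #include <stdio.h>

    #define DEPTH 16   /* image depth (rows), as in the slide example   */
    #define WIDTH 16   /* image width (columns), arbitrary for the demo */

    int main(void)
    {
        int input[DEPTH * WIDTH];
        for (int i = 0; i < DEPTH * WIDTH; i++) input[i] = i;   /* fi = i */

        int q1[DEPTH] = {0}, q2[DEPTH] = {0};      /* two 16-word delay queues  */
        int c0[3] = {0}, c1[3] = {0}, c2[3] = {0}; /* last 3 samples per column */

        for (int i = 0; i < DEPTH * WIDTH; i++) {
            int row = i % DEPTH, col = i / DEPTH;
            int x  = input[i];      /* leading point: the only memory read */
            int d1 = q1[row];       /* same row, one column back           */
            int d2 = q2[row];       /* same row, two columns back          */
            q2[row] = d1;           /* shift previous column two back      */
            q1[row] = x;            /* current sample becomes one back     */

            /* keep the last three samples of each column (the window rows) */
            c0[2] = c0[1]; c0[1] = c0[0]; c0[0] = x;
            c1[2] = c1[1]; c1[1] = c1[0]; c1[0] = d1;
            c2[2] = c2[1]; c2[1] = c2[0]; c2[0] = d2;

            if (col >= 2 && row >= 2) {   /* a full 3x3 window is available */
                int sum = c0[0] + c0[1] + c0[2]
                        + c1[0] + c1[1] + c1[2]
                        + c2[0] + c2[1] + c2[2];
                if (i == 2 * DEPTH + 2)   /* 35th access: first window      */
                    printf("first window sum = %d\n", sum);
            }
        }
        return 0;
    }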

Page 4


Delay Queues 1
(figure: the first data access input(i) enters the 3x3 window; two 16-word delay queues shift and store the previous columns)

Delay Queues 2
(figure: the second data access; window and delay queues shift by one)

Page 5


Delay Queues 3
(figure: the third data access; window and delay queues continue to shift)

… Delay Queues 16
(figure: after 16 accesses the first delay queue is full)

Page 6


…Delay Queues 32
(figure: after 32 accesses both delay queues are full)

…Delay Queues 35
(figure: after 35 accesses the first complete 3x3 window is available; compute first window)

Page 7


Delay Queues 36
(figure: compute the next window)

Delay Queues 37
(figure: and the next; 1 clock cycle per window)

Page 8


Delay Queues: Performance

3x3 window access code, 512 x 512 pixel image

Routine style      Number of clocks   Comment
Straight window    2,376,617          close to 9 clocks per iteration
                                      (2,340,900; the difference is the pipeline prime effect)
Delay Queue        279,999            close to 1 clock per iteration
                                      (262,144 is the theoretical limit)


Streams

■ A stream is a flexible communication mechanism
  – providing parallelism, locality and implicit synchronization
  – freeing the programmer from timing headaches
  – on the MAP it provides a uniform communication model
    • from µ-proc to MAP (DMA stream)
    • from Global Common Memory to MAP
    • from MAP to MAP (GPIO stream)
    • from FPGA to FPGA (Bridge stream)
    • inside an FPGA, between parallel sections
■ Stream buffer = FIFO (a toy model follows)
  – an element can travel through in one clock
  – processes, FPGAs, MAPs and GCM are all a few clock cycles away from each other
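A toy software model of a stream buffer as a bounded FIFO, only to make the concept concrete. The type and function names are invented for this sketch; the real MAP C stream interface (for example put_stream, used on page 26) is not reproduced here.

    #include <stdbool.h>
    #include <stdint.h>

    /* Bounded FIFO modelling a stream buffer; on the MAP an element
     * travels through such a buffer in about one clock.             */
    #define FIFO_SIZE 64

    typedef struct {
        int64_t  data[FIFO_SIZE];
        unsigned head, tail, count;
    } stream_t;

    static bool stream_put(stream_t *s, int64_t v)   /* producer side */
    {
        if (s->count == FIFO_SIZE) return false;     /* full: producer stalls */
        s->data[s->tail] = v;
        s->tail = (s->tail + 1) % FIFO_SIZE;
        s->count++;
        return true;
    }

    static bool stream_get(stream_t *s, int64_t *v)  /* consumer side */
    {
        if (s->count == 0) return false;             /* empty: consumer waits */
        *v = s->data[s->head];
        s->head = (s->head + 1) % FIFO_SIZE;
        s->count--;
        return true;
    }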

Page 9


Streams cont'd

■ Stream producers and consumers
  – live in their own parallel sections
  – are loop based
■ MAP streams are producer driven
  – the producer can conditionally put elements into the stream
  – the consumer cannot conditionally get
    • but can run at its own speed
  – conditional gets would cause loop slowdown (a sketch of this discipline follows)
■ Two case studies
  – Dilate / Erode (image processing kernel)
  – LU Decomposition (numerical kernel)


Dilate / Erode image operators

■ Dilate enlarges light objects and shrinks dark objects by taking the max pixel value in a (masked) window
■ Erode enlarges dark objects and shrinks light objects by taking the min pixel value in a (masked) window
■ Together they simplify the image, leaving only the dominant trends
■ Focus Of Attention (FOA) codes often use a number of dilates followed by the same number of erodes
  – an ideal scenario for multiple stream-connected loops
  – internally the loops use delay queues for windows (a basic C version is sketched below)
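A straightforward C sketch of the two operators with an unmasked 3x3 square window on an 8-bit image (border pixels are skipped for brevity). The masked-window, delay-queue, stream-connected versions on the MAP refine this same computation.

    #include <stdint.h>

    /* Dilate: each output pixel is the max of its 3x3 neighbourhood. */
    static void dilate3x3(const uint8_t *in, uint8_t *out, int w, int h)
    {
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                uint8_t m = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        uint8_t v = in[(y + dy) * w + (x + dx)];
                        if (v > m) m = v;
                    }
                out[y * w + x] = m;
            }
    }

    /* Erode: each output pixel is the min of its 3x3 neighbourhood. */
    static void erode3x3(const uint8_t *in, uint8_t *out, int w, int h)
    {
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                uint8_t m = 255;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        uint8_t v = in[(y + dy) * w + (x + dx)];
                        if (v < m) m = v;
                    }
                out[y * w + x] = m;
            }
    }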

Pages 10-18


Erode dilate example

(figure sequence: the dilate / erode pipeline is built up frame by frame; each dilate stage takes D = MAX( window ), each erode stage takes E = MIN( window ); the final frames show 8 dilate stages followed by 8 erode stages)

Dilate / erode on 8 3x3 windows (figure)

Page 19


Dilate / Erode Pipeline: 1 word (8 pixels)
(figure: pipeline stages connected by streams, with delay queues inside the stages)


Down Sampling

■ Complication: in FOA the image is often first down-sampled, for (sequential) efficiency reasons
■ When the dilate / erode pipe is preceded by a down-sample, elements are accessed out of (streaming) order. This can be achieved by
  – storing the image in 1 or 4 memories
    • this requires DMA-ing the data in before the computation starts
    • 1 memory: loop slowdown
  or by
  – streaming the image through delay queues
    • the computation then runs fully pipelined at stream speed (a basic down-sample step is sketched below)
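The slides do not spell out the down-sample operator; since the results slide counts max operations for it, the sketch below assumes a 2x2 max-reduction (e.g. 640x480 down to 320x240). Treat the choice of operator as an assumption.

    #include <stdint.h>

    /* 2:1 down-sample by taking the max of each 2x2 block (assumed operator);
     * w and h are the input dimensions and must be even.                     */
    static void downsample2x2_max(const uint8_t *in, uint8_t *out, int w, int h)
    {
        for (int y = 0; y < h; y += 2)
            for (int x = 0; x < w; x += 2) {
                uint8_t a = in[y * w + x],       b = in[y * w + x + 1];
                uint8_t c = in[(y + 1) * w + x], d = in[(y + 1) * w + x + 1];
                uint8_t m = a > b ? a : b;             /* 3 max operations */
                if (c > m) m = c;                      /* per output pixel */
                if (d > m) m = d;
                out[(y / 2) * (w / 2) + (x / 2)] = m;
            }
    }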

Page 20


Erode Dilate Results

■ No problem with producer-controlled streams
■ Pipeline parallelism works great here
  – 8 D/Es (16 sections) run about as fast as 1 D/E
  – the delay is ~2 microseconds per D/E
■ 16 operators plus 1 down-sample in a parallel pipeline, each loop-unrolled 8 iterations (it all fits in one FPGA)
  – 8 down-samples: 24 maxes
  – 128 dilates / erodes every clock cycle
    • dilate / erode: 4 mins / maxes for the cross mask, 8 mins / maxes for the square mask
    • totalling 792 ops per clock cycle; at 100 MHz: 79.2 GigaOps (ideal)
■ No time difference for 2 FPGAs vs 1 FPGA
  – the bridge stream is just as fast as an internal stream
  – but there is no need for the second FPGA (27% LUTs on 1 FPGA)


Downsample, dilate, erode performance on a 640 x 480 pixel image

Down sample:      4*640*480     =  1,228,800
Dilates Erodes:   16*6*320*240  = 14,645,600
Total number of ops:              15,974,400

Code version               DMA into    DMA into     stream in
                           one OBM     four OBMs    from GCM     (units)
MAP                        782         343          230          (µ-secs)
                           > 20        > 46         > 69         (GOPS)
MAP speedup vs Pentium     > 46        > 104        > 156

Pentium speed: 36 milliseconds (~444 MOPS)

Page 21


LU Decomposition

■ You know it [Cormen et al., pp. 750…]:

      for k = 1 to n {
          for i = k+1 to n
              Aik /= Akk
          for i = k+1 to n
              for j = k+1 to n
                  Aij -= Aik * Akj
      }

  (a plain-C version follows)

(figure: the matrix A)
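For reference, a compact 0-indexed, row-major C version of the same unpivoted in-place factorization; only the pseudocode above is from the slides, the rest is ordinary C.

    /* In-place LU decomposition without pivoting (Doolittle form):
     * afterwards A holds L strictly below the diagonal (unit diagonal
     * implied) and U on and above the diagonal. A is n x n, row-major. */
    static void lu_decompose(double *A, int n)
    {
        for (int k = 0; k < n; k++) {
            for (int i = k + 1; i < n; i++)
                A[i * n + k] /= A[k * n + k];                    /* column of L */
            for (int i = k + 1; i < n; i++)
                for (int j = k + 1; j < n; j++)
                    A[i * n + j] -= A[i * n + k] * A[k * n + j]; /* update      */
        }
    }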


Pipelining

■ Iteration k+1 can start after iteration k has finished row k+1
  – neither row k nor column k is read or written anymore
■ This naturally leads to a streaming algorithm
  – every iteration becomes a process
  – results from iteration / process k flow to process k+1
  – process k uses row k to update the sub-matrix A[k+1:n, k+1:n]
■ Actually, only a fixed number P of processes run in parallel
  – after that, data is re-circulated

Page 22


Streaming LU

■ Pipeline of P processes
  – parallel sections in OpenMP lingo
  – the algorithm runs in ceil(N/P) sweeps
■ In one sweep, P outer k iterations run in parallel
■ After each sweep, P rows and columns are done
  – the problem size n is reduced to (n-P)²
  – the rest of the (updated) array is re-circulated
■ Each process owns a row
  – process p owns row p + (i-1)P in sweep i, i = 1 to ceil(N/P) (see the sketch below)
  – array elements are updated while streaming through the pipe of processes
■ Total work O(n³), speedup ~P
■ Forward / backward substitution are simple loops
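A tiny sketch of the bookkeeping stated above, just to make the sweep / ownership rule concrete (process and row indices are taken 0-based here, and the sizes are illustrative only).

    #include <stdio.h>

    /* Print which pipeline process owns which row (outer k iteration)
     * in each sweep: process p owns row p + (i-1)*P in sweep i.       */
    int main(void)
    {
        const int N = 8, P = 3;                  /* illustrative sizes */
        const int sweeps = (N + P - 1) / P;      /* ceil(N/P)          */

        for (int i = 1; i <= sweeps; i++)
            for (int p = 0; p < P; p++) {
                int row = p + (i - 1) * P;
                if (row < N)
                    printf("sweep %d: process %d owns row %d\n", i, p, row);
            }
        return 0;
    }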


First, flip the matrix vertically
(figure: the flipped matrix next to processes P1 and P2)

Page 23


Feed the matrix to Processor 1 (figure)


Page 24


Feed the matrix to Processor 2 (figure)


Page 25



Process Pipeline

(figure: processes 0, 1, …, P-1 connected by streams S0, S1, …, with OBM1, OBM2 and OBM3 at the ends; P=12 on the SRC6)

■ OBMs:
  – OBM1 -> OBM2 on even sweeps, OBM2 -> OBM1 on odd sweeps
  – OBM3: siphoning off finished results
■ Processes are grouped in a parallel sections construct; each process is a section
  – process 0 reads from either OBM1 or OBM2 and writes to stream S0
  – process i reads from stream Si-1 and writes to stream Si
  – process P-1 reads from the last stream, writes finished data to OBM3 and writes the rest back to OBM1 or OBM2

Page 26


Inside a process

    for (i = (s-1)*P; i < n; i++) {
        for (j = (s-1)*P; j < n; j++) {
            // ping-pong: read from OBM1 or OBM2
            if (k & 0x1) w = OBM1[i*n + j];
            else         w = OBM2[i*n + j];
            // if (i < me) leave the data unchanged
            if (i == me) {                      // store my row
                if (j == i) piv = w;
                myRow[j] = w;
            } else if (i > me) {                // update this row with my row
                // if (j < me) leave the data unchanged
                if (j == me)     { w /= piv; mul = w; }
                else if (j > me) w -= mul * myRow[j];
            }
            put_stream(&S0, w);
        }
    }

(figure: processes P0, P1, … in a pipeline between OBM1 and OBM2; P=12 on the SRC6)


LU (plus solvers) Performance

                          n=64    n=128   n=256   n=512   (unit)
Pentium IV (2.8 GHz)      1205    1469    1109    694     (MFlops)
MAP (P=12)                492     1104    1718    2087    (MFlops)
MAP (P=23*)               591     1405    2599    3563    (MFlops)
MAP speedup (P=23)        0.49    0.95    2.3     5.1

*: FPGA1: 10 LU processes + forward and backward solvers; FPGA2: 13 LU processes

Page 27


Conclusions

■ C (and Fortran) run on an FPGA-based HPEC system
  – DEBUG mode allows most code development and performance tuning on a workstation
  – we can apply standard software design
    • stepwise refinement using macros
    • parallel tasks connected by streams (put / get: implicit timing)
■ Streams allow for massive bandwidth (800 MB/s per stream) and very short latency (a few clock cycles)
  – between on-chip processes
  – between processors and processors
  – between processors and memories
■ Future work
  – multi-MAP stream and barrier codes