RSS!2006 Copyright © 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.
The Power of Streams on the
SRC MAP®
Wim Bohm Colorado State University
MAP C
- Pure C runs on the MAP
  - Generated code: circuits
    - Basic blocks in outer loops become special-purpose hardware "function units"
    - Basic blocks in inner loop bodies are merged and become pipelined circuits
- Sequential semantics obeyed
  - One basic block executed at a time
  - Pipelined inner loops are slowed down to disambiguate read/write conflicts if necessary
  - The MAP C compiler identifies the (cause of) loop slowdown
- You want more parallelism?
  - Unroll loops
  - Use (OpenMP-style) parallel sections
Memory Model
- Variable = register
- Array = memory
  - OBM: declare OBMx to be an array with element type t, where t is some 64-bit type. SRC6: 6 OBMs
  - Block RAM: an array lives by default in Block RAM(s); more freedom for element type. SRC6: 444 Block RAMs (VP100)
- 1 array access = 1 clock cycle
  - but the compiler can slow down loops for read/write disambiguation
  - loop slowdown: the loop runs at a rate of more than 1 clock cycle per iteration
Programming with loops and arrays
- Transformational Approach
  - Start with pure C code
    - Compiler indicates (reasons for) loop slowdowns
  - Performance-tune (removing inefficiencies)
    - avoid re-reading of data from OBMs by using Delay Queues
    - avoid read/write conflicts in the same iteration
  - Partition code and data
    - distribute data over OBMs and Block RAMs, unroll loops
    - distribute code over two FPGAs, with MPI-style communication
      - only one chip at a time can access a particular OBM
- Goal 1: 1 loop iteration = 1 clock cycle
  - Can very often be achieved
- Goal 2: Exploit as much bandwidth as possible
  - Through loop unrolling and task parallelism
  - Balancing act: chip area versus bandwidth
How to performance-tune: Macros
- C code can be extended using macros, allowing program transformations that cannot be expressed straightforwardly in C
- Macros have semantics unlike C functions: they have a rate and a delay, so they can be integrated in loops
  - rate (number of clocks between inputs)
  - delay (number of clocks between input and output)
  - macros can have state (kept between macro calls)
- Two types of macros
  - system-provided
  - user-provided (Verilog or schematic capture)
Delay Queues for efficient Window Access
- Keep data on chip using Delay Queues
  - Delay queue = system macro, implemented with SRL16s (small, fixed size) or BRAMs (large, variable size)
- Simple example:
  - 3x3 window, stepping 1 by 1 in column-major order
  - image 16 deep
  - of the 9 points in the stencil, 8 have been seen before
- Per window, the leading point should be the only data access.

[Figure: input array traversal]
Delay Queues 1 … 37

[Figure sequence: at each clock, the data access input(i) feeds two 16-word Delay Queues that shift and store the previous columns. At step 35 the first 3x3 window can be computed; at step 36 the next window; and so on, at 1 clock cycle per window.]
Delay Queues: Performance

3x3 window access code, 512 x 512 pixel image

Routine Style      Number of clocks   Comment
Straight Window    2,376,617          close to 9 clocks per iteration (2,340,900; the difference is the pipeline prime effect)
Delay Queue        279,999            close to 1 clock per iteration (262,144: theoretical limit)
Streams
- A stream is a flexible communication mechanism
  - Providing parallelism, locality, and implicit synchronization
  - Freeing the programmer of timing headaches
  - On the MAP it provides a uniform communication model
    - From µ-proc to MAP (DMA stream)
    - From Global Common Memory to MAP
    - From MAP to MAP (GPIO stream)
    - From FPGA to FPGA (Bridge stream)
    - Inside an FPGA, between parallel sections
- Stream buffer = FIFO
  - An element can travel through in one clock
  - Processes, FPGAs, MAPs, and GCM are all a few clock cycles away from each other
Streams cont'd
- Stream producers and consumers
  - Live in their own parallel sections
  - Loop-based
- MAP streams are producer-driven
  - Producer can conditionally put elements in the stream
  - Consumer cannot conditionally get
    - but can run at its own speed
  - Conditional gets would cause loop slowdown
- Two case studies
  - Dilate / Erode (image processing kernel)
  - LU Decomposition (numerical kernel)
Dilate / Erode image operators
- Dilate enlarges light objects and shrinks dark objects by taking the max pixel value in a (masked) window
- Erode enlarges dark objects and shrinks light objects by taking the min pixel value in a (masked) window
- Together they simplify the image, leaving only dominant trends
- Focus Of Attention (FOA) codes often use a number of dilates followed by the same number of erodes
  - ideal scenario for multiple stream-connected loops
  - internally the loops use delay queues for windows
Erode / dilate example

[Figure: dilate takes D = MAX(window); erode takes E = MIN(window)]
[Figure sequence: the image streams through a pipeline of eight dilate (D) stages followed by eight erode (E) stages; each stage starts consuming as soon as its predecessor produces data.]
Dilate / erode on 8 3x3 windows
Dilate / Erode Pipeline

[Figure: 1 word (8 pixels) per clock flows through stream-connected dilate/erode stages; each stage uses Delay Queues internally, with Streams between the stages]
Down Sampling
- Complication: in FOA, the image is often first down-sampled, for (sequential) efficiency reasons.
- When the Dilate/Erode pipe is preceded by a down sample, elements are accessed out of (streaming) order. This can be handled by
  - Storing the image in 1 or 4 memories
    - this requires DMA-ing the data in before the computation starts
    - 1 memory: loop slowdown
  or
  - Streaming the image through delay queues
    - The computation then runs fully pipelined at stream speed
Erode Dilate Results
- No problem with producer-controlled streams
- Pipeline parallelism works great here
  - 8 D/Es (16 sections) run about as fast as 1 D/E
  - the delay is ~2 microseconds per D/E
- 16 operators plus 1 down sample in a parallel pipeline, each loop unrolled 8 iterations (it all fits in one FPGA)
  - 8 down samples: 24 maxes per clock cycle
  - 128 dilates/erodes every clock cycle
    - dilate/erode: 4 mins/maxes for a cross mask, 8 mins/maxes for a square mask
  - totalling 792 ops per clock cycle; at 100 MHz: 79.2 GigaOpS (ideal)
- No time difference for 2 FPGAs vs 1 FPGA
  - bridge stream just as fast as internal stream
  - but there is no need for the second FPGA (27% LUTs on 1 FPGA)
Downsample, dilate, erode performance on a 640 x 480 pixel image

Down sample:      4 * 640 * 480 = 1,228,800 ops
Dilates + Erodes: 16 * 12 * 320 * 240 = 14,745,600 ops
Total number of ops: 15,974,400

Code version:             DMA into one OBM   DMA into four OBMs   stream in from GCM   (units)
MAP                       782                343                  230                  (µ-secs)
                          > 20               > 46                 > 69                 (GOPS)
MAP speedup vs Pentium    > 46               > 104                > 156

Pentium speed: 36 milliseconds (~444 MOPS)
LU Decomposition
- You know it [Cormen et al., pp. 750…]:

  for k = 1 to n {
      for i = k+1 to n
          A[i][k] /= A[k][k]
      for i = k+1 to n
          for j = k+1 to n
              A[i][j] -= A[i][k] * A[k][j]
  }

[Figure: matrix A]
Pipelining
- Iteration k+1 can start after iteration k has finished row k+1
  - neither row k nor column k is read or written anymore
- This naturally leads to a streaming algorithm
  - every iteration becomes a process
  - results from iteration/process k flow to process k+1
  - process k uses row k to update the sub-matrix A[k+1:n, k+1:n]
- Actually, only a fixed number P of processes run in parallel
  - after that, data is re-circulated
Streaming LU
- Pipeline of P processes
  - parallel sections in OpenMP lingo
  - algorithm runs in ceil(n/P) sweeps
- In one sweep, P outer k iterations run in parallel
- After each sweep, P rows and columns are done
  - Problem size n x n is reduced to (n-P) x (n-P)
  - The rest of the (updated) array is re-circulated
- Each process owns a row
  - Process p owns row p + (i-1)P in sweep i, i = 1 to ceil(n/P)
  - Array elements are updated while streaming through the pipe of processes
- Total work O(n^3), speedup ~P
- Forward / backward substitution are simple loops
First, flip the matrix vertically

[Figure sequence: the flipped matrix is fed to process P1; as each process finishes its row, the data streams on to P2, P3, P4, …]
Process Pipeline (P = 12 on SRC6)

[Figure: processes 0, 1, …, P-1 connected by streams S0, S1, …; memories OBM1, OBM2, OBM3]

OBMs:
  OBM1 -> OBM2 on even sweeps
  OBM2 -> OBM1 on odd sweeps
  OBM3: siphoning off finished results
Processes are grouped in a parallel sections construct; each process is a section.
  Process 0 reads from either OBM1 or OBM2 and writes to stream S0
  Process i reads from S(i-1) and writes to S(i)
  Process P-1 reads from S(P-2), writes finished data to OBM3, and writes the rest to OBM1 or OBM2
Inside a process (P = 12 on SRC6)

  for (i = (s-1)*P; i < n; i++) {
      for (j = (s-1)*P; j < n; j++) {
          // ping-pong: read from OBM 1 or 2
          if (k & 0x1) w = OBM1[i*n + j];
          else         w = OBM2[i*n + j];
          // if (i < me) leave data unchanged
          if (i == me) {            // store my row
              if (j == i) piv = w;
              myRow[j] = w;
          } else if (i > me) {      // update this row with my row
              // if (j < me) leave data unchanged
              if (j == me) { w /= piv; mul = w; }
              else if (j > me) w -= mul * myRow[j];
          }
          put_stream(&S0, w);
      }
  }

[Figure: OBM1 and OBM2 feed process P0, which streams to P1]
LU (plus solvers) Performance

                          n=64   n=128  n=256  n=512   (unit)
Pentium IV (2.8 GHz)      1205   1469   1109    694    (MFlops)
MAP (P=12)                 492   1104   1718   2087    (MFlops)
MAP (P=23*)                591   1405   2599   3563    (MFlops)
MAP speedup (P=23)        0.49   0.95   2.3    5.1

*: FPGA1: 10 LU processes + forward and backward solvers
   FPGA2: 13 LU processes
Conclusions � C (and FTN) run on FPGA based HPEC system
– DEBUG Mode allows most code development and performance tuning on workstation
– We can apply standard software design • stepwise refinement using macros • parallel tasks connected by streams (put get: implicit timing)
� Streams allow for massive bandwidth (800MBs/stream) and very short latency (few clock cycles)
– between on chip processes – between processors and processors – between processors and memories
� Future work – multi-MAP stream and barrier codes