View
68
Download
0
Category
Preview:
DESCRIPTION
ROUTING ? Sorting? Image Processing? Sparse Matrices?. Reconfigurable Meshes !. Heiko Schröder, 2003. Reconfigurable architectures. FPGAs reconfigurable multibus reconfigurable networks (Transputers, PVM) dynamically reconfigurable mesh Aim: efficiency - PowerPoint PPT Presentation
Citation preview
Heiko Schröder, 2003
ROUTING ?Sorting?Image Processing?Sparse Matrices?
Reconfigurable Meshes !
Heiko Schröder, 2003 Reconfigurable mesh 2
Reconfigurable architecturesReconfigurable architectures
• FPGAs
• reconfigurable multibus
• reconfigurable networks (Transputers, PVM)
• dynamically reconfigurable mesh
Aim:efficiency
special purpose --> general purpose architectures
Heiko Schröder, 2003 Reconfigurable mesh 3
contentscontents
1.) Motivation for the reconfigurable mesh
2.) Routing (and sorting):• better than PRAM
• better than mesh
3.) Image processing
4.) Sparse matrix
multiplication
5.) Bounded bus length
Heiko Schröder, 2003 Reconfigurable mesh 4
PRAMPRAM
0 1 2 3 4 5 6 7 8 9
8 976 5432 0 1
0 1 2 3 4 5 6 7 8 9
diameter O(1) bisection width (n)
cut
EREW CRCW
Heiko Schröder, 2003 Reconfigurable mesh 5
Mesh/TorusMesh/Torus
Diameter ( ) bisection width ( )
nn
2D mesh
Heiko Schröder, 2003 Reconfigurable mesh 6
HypercubeHypercube
0-D0
11-D
00
01
10
112-D
000 010
001 011
100 110
101 111
3-D
0 1
4-D
diameter O(log n)bisection width (n)
Heiko Schröder, 2003 Reconfigurable mesh 7
reconfigurable meshreconfigurable mesh
reconfigurable mesh = mesh + interior connections
15 positionsdiameter 1 !!
low cost
Heiko Schröder, 2003 Reconfigurable mesh 8
global ORglobal OR
1 0 000 1 0
* * “V”
Time: O(1) on RM-- (log n) on EREW-PRAM
Heiko Schröder, 2003 Reconfigurable mesh 9
Prefix sumPrefix sum
0 1 1 0 1 0 0 1 1 1
*
6
012345
789
Fast butexpensive
Time : O(1)Area: (nxn)
Heiko Schröder, 2003 Reconfigurable mesh 10
Modulo 3 counterModulo 3 counter
10 11 10
*1 mod 3
Time: O(1) on RM (log n / log log n) on CRCW-PRAM
Heiko Schröder, 2003 Reconfigurable mesh 11
• 2 digit numbers to the basis of k represent all numbers smaller than k2.
• 1.) determine x mod k (=lsd)
• 2.) count number of “wraps” (=msd).
modulo k2 counter (ranking)modulo k2 counter (ranking)
10 11 10
*1 mod k
--> modulo k2 counting in 2 steps on a k x k2 array
Heiko Schröder, 2003 Reconfigurable mesh 12
enumeration / prefix sumenumeration / prefix sum
1 1 1 1 1 1 1 11 2 1 2 1 2 1 21 2 3 4 1 2 3 41 2 3 4 5 6 7 8
time: O(log n)
wire efficiency ! -- (compared with tree)1/2 number of processors
Heiko Schröder, 2003 Reconfigurable mesh 13
permutation routing - 2 stepspermutation routing - 2 steps
n x n
2 steps !!!
Heiko Schröder, 2003 Reconfigurable mesh 14
Kunde’s all-to-all mappingKunde’s all-to-all mapping
Sorting:sort blocksall-to-all (columns)sort blocks all-to-all (rows)o-e-sort blocks
Heiko Schröder, 2003 Reconfigurable mesh 15
sorting in constant timesorting in constant time
n2
3
n1
3
Complete sort: sort blocks all-to-all (2) sort blocks all-to-all (2) o-e-sort blocks
block
broadcast (1)
Sort blocks:
broadcast (1)
rank (2)
Heiko Schröder, 2003 Reconfigurable mesh 16
• better than PRAM --- but useless!!!
Heiko Schröder, 2003 Reconfigurable mesh 17
Kunde’s all-to-all mappingKunde’s all-to-all mapping
n2
3
n x n
Heiko Schröder, 2003 Reconfigurable mesh 18
vertical all-to-allvertical all-to-all
Heiko Schröder, 2003 Reconfigurable mesh 19
horizontal all-to-allhorizontal all-to-all
Heiko Schröder, 2003 Reconfigurable mesh 20
Use of bus -- no conflictUse of bus -- no conflict
1 step
2 steps
3 steps
k/2 steps
3 steps
2 steps
1 step
(k/2)2 steps
Heiko Schröder, 2003 Reconfigurable mesh 21
sorting in optimal time Kunde / Schröder
sorting in optimal time Kunde / Schröder
(k/2)2 stepsk=n1/3
each step takes n1/3 time --> T= n/4
x 2
T = n/2all-to-all
Sorting:sort blocks (O(n2/3))all-to-all (n/2)sort blocks (O(n2/3))all-to-all (n/2)o-e sort blocks (O(n2/3))(snake like order of blocks)
time: n + o(n)
x 2
/2
Heiko Schröder, 2003 Reconfigurable mesh 22
Why optimal?Why optimal?
Sorter for n keys
Bisection of data with k wires
Sorting time n/k
Heiko Schröder, 2003 Reconfigurable mesh 23
Use of theoremUse of theorem
1.) n keys on a kxk RM:Time n/k
Proof:Wherever the data is stored there is always a bisection of length k-- this can be demonstrated sweeping left right through the array.Q.e.d.
2.) nxn keys on an nxn RM:Time n.
Proof: trivial
Heiko Schröder, 2003 Reconfigurable mesh 24
n + o(n)n + o(n)
Optimal --- but ...
Heiko Schröder, 2003 Reconfigurable mesh 25
enumeration / prefix sumenumeration / prefix sum
1 1 1 1 1 1 1 11 2 1 2 1 2 1 21 2 3 4 1 2 3 41 2 3 4 5 6 7 8
time: O(log n)
wire efficiency ! -- (compared with tree)1/2 number of processors
Heiko Schröder, 2003 Reconfigurable mesh 26
• move and smooth
ABCD-routingABCD-routing
A
BC
D
Row-major enumeration of A, B, C and D packets within each quadrant in time 4 log n.Determine destination position of each packet.
Heiko Schröder, 2003 Reconfigurable mesh 27
elementary stepselementary steps
21
108
5
7
9
36
4
21
108
5
7
9
36
4
move
21
109
5
7
8
36
4
smooth
21109
5 78
3 64
collect
Heiko Schröder, 2003 Reconfigurable mesh 28
time analysistime analysis
A B C D
move smooth
A B C D
collect
time: 3 x n/2
T=3n+o(n)
Heiko Schröder, 2003 Reconfigurable mesh 29
T < 2nT < 2n
4 destination squarestime: 3n + 4 log n
16 destination squarestime: 2n + 16 log n
64 destination squarestime: 12/7 n + 64 log n
mesh-diameter: 2n
Heiko Schröder, 2003 Reconfigurable mesh 30
enough of routing/sortingenough of routing/sorting
Constant factor !Can we do better ?What kind of problems ?
Image processingSparse problems !
Heiko Schröder, 2003 Reconfigurable mesh 31
Image processingImage processing
•Border following
•Edge detection
•Component labeling
•Skeletons
•Transforms
Heiko Schröder, 2003 Reconfigurable mesh 32
Component labellingComponent labelling
ObjectDefine border (candidates)Set bus
While own label is not received:1.) Candidates brake busand send their label a) clockwiseb) anti-clockwise2.) Candidates switch offand restore bus if they see smaller labelTime: O(1) -- O(log n)
Heiko Schröder, 2003 Reconfigurable mesh 33
TransformsTransforms
• Wavelet transform: Time log n on RM
-- time n on mesh
• FFT: Time n on RM and mesh
• Hough transform: Time m x log n on RM
-- time m x n on mesh
Heiko Schröder, 2003 Reconfigurable mesh 34
systolic matrix multiplicationsystolic matrix multiplication
B
A C
c a bij ik kjk
n
1
time: ni
j
ijc
Heiko Schröder, 2003 Reconfigurable mesh 35
sparse matrix multiplicationsparse matrix multiplication
Ax
B=
C
c a bij ik kjk
n
1
Time: n (nxn mesh)
A and B column sparse (k2)A and B row sparse (k2)A row sparse, B column sparse (k2)A column sparse, B row sparse k n
Heiko Schröder, 2003 Reconfigurable mesh 36
unlimited bus lengthunlimited bus length
• ring broadcast
1 2 32 2 2 2 2 2 2 2 2 3 3 3 3 33 3 3 3 1
2 2 2 2 3 1 1 1 1 1 1 1 1 1 2 2 2 2 21 1 1 1 2 3 3 3 3 3 3 3 3 3 1 1 1 1 1
Heiko Schröder, 2003 Reconfigurable mesh 37
A row-sparse B column-sparseA row-sparse B column-sparse
Repeat r times
Begin
horizontal ring broadcast aik
Repeat c times
vertical ring broadcast bkj
End. B
A C
r
c
Heiko Schröder, 2003 Reconfigurable mesh 38
A and B column-sparseA and B column-sparse
Repeat c1 times
Begin
horizontal ring broadcast aik (meets bkj)
Repeat c2 times
vertical ring broadcast product to final position
End.
A B/C
c1
c2 elements
T
i
i}k{
j
Heiko Schröder, 2003 Reconfigurable mesh 39
lower bound (c,r) c=r=klower bound (c,r) c=r=k
c a bij ik kjk
n
1
A
B
C
k=3
nk
nk
n=48
t nk
Heiko Schröder, 2003 Reconfigurable mesh 40
splitting the problemsplitting the problem
Repeat k times
Begin
vertical ring broadcast
Repeat s times
horizontal ring broadcast
End.
A B/C
first s
s
k B-elements
T
A=As +Ar
C=AsB+ArB
s
time: ks
s
C=C+Cs r
Heiko Schröder, 2003 Reconfigurable mesh 41
CRCR
A
first s
s
s
A has nk non-zero elements Ar has at most nk/s non-zero rows for s= n Ar has at most k n non-zero rows.
As B is a RR- problem it takes time k n .
A=As +Ar
Heiko Schröder, 2003 Reconfigurable mesh 42
Ar B calculating productsAr B calculating products
Ar
k A-elements
Ar
k B-elements
B/CT
time: k2
elementsk n2
Heiko Schröder, 2003 Reconfigurable mesh 43
column sumcolumn sum
ii+1i-1
j
row itime: log n
k n only elements per column rout time: k n
Heiko Schröder, 2003 Reconfigurable mesh 44
routing within columnsrouting within columns
rout time: k n
Heiko Schröder, 2003 Reconfigurable mesh 45
Reconfigurable architecturesReconfigurable architectures
Reconfigurable mesh ?
constant diameter !
No !!!
Physical laws!
Heiko Schröder, 2003 Reconfigurable mesh 46
Physical limitsPhysical limits
c=300 000 km/sec • 30cm/ns
• on chip: 1cm/ns
• --> bounded bus length
good idea !
Heiko Schröder, 2003 Reconfigurable mesh 47
bounded broadcastbounded broadcast
1 2 31 2 2 3 3 3
1 1 2 2 2 2 2 3 33 33 3 1 1 1 1 1 2 2
3 1 1 2 2 23 3 1 1 1 2 22 2
2 2 3 3 3 3 3 1 11 11 1 2 3 3
time: k + n/l
Heiko Schröder, 2003 Reconfigurable mesh 48
creating main stationscreating main stations
1 2 3 1 2 3 1 2 3
1 1 1 1 1 1 1 1 1 1 1 1 1 1 12 2 2 2 2 2 2 2 2 2 2 2 2 2 23 3 3 3 3 3 3 3 3 3 3 3 3 3 3
time: k
Heiko Schröder, 2003 Reconfigurable mesh 49
Create main stations 1,…,k for A and B (time: n/l+k)
For i=1,…,k do
Begin
horizontal ring broadcast i of A
For j=1,…,k do
vertical ring broadcast j of B
End.
A row-sparse B column-sparseA row-sparse B column-sparse
B
A C
k
k
Heiko Schröder, 2003 Reconfigurable mesh 50
Create main stations 1, … , k for A (time: n/l+k)
For i=1,…,k do
Begin
horizontal ring broadcast i
k bounded vertical broadcasts of products
merging new products
End.
A and B column-sparseA and B column-sparse
A B/C
k
k elements
T
i
i
Heiko Schröder, 2003 Reconfigurable mesh 51
remove minor stationsremove minor stations
1 2 32 21 3 3 31 1 2 2 2 2 2 3 33 3
1 1 1 1 1 2 2 2 2 23 3 3 3 33 3 3 3 3 1 1 1 1 1 2 22 2
2 2 2 2 2 3 3 3 3 3 1 11 1
Heiko Schröder, 2003 Reconfigurable mesh 52
resultsresults
Time: n (nxn mesh)A and B column sparse (k2) (k2+2n/l)A and B row sparse (k2) (k2 +2n/l)A row sparse, B column sparse (k2) (k2 +n/l)A column sparse, B row sparse (+11n/l) 3k n
•image processing
•sorting
•routing
•load balancingbetter than the mesh !
(Kunde, Middendorf, Schmeck, Schröder, Turner)
(Kapoor, Kunde, Kaufmann, Schroeder, Sibeyn)
•The RM is in some cases “better” than PRAM•The RM is always at least as “good” as mesh •The RM is often “better” than the mesh
Heiko Schröder, 2003 Reconfigurable mesh 54
Fault tolerant On-board computingFault tolerant On-board computing
10 km/s 1 image/s 100 Mbit/image 4000 s/orbit 400 Gbit/orbit download: 400 Mbit/orbit
On-board imageanalysis andcompression
800 km
Singapore
Heiko Schröder, 2003 Reconfigurable mesh 55
Due to radiation:•Single event upsets (many)•latch ups (extra hardware)•total loss (rare at 800 km)
1 task per processorseveral tasks/instruments/sources per processor1 task per 3 to 4 processors
!
?16 processors (+ spares?)fault tolerant reconfigurable network
Heiko Schröder, 2003 Reconfigurable mesh 56
Methods currently usedMethods currently used
shadow-processorsmajorityvoting
Byzantine systems ASTRIUM, deep space
Heiko Schröder, 2003 Reconfigurable mesh 57
1 CAN2 CANs •Industrial spec.
•mil-spec.•radiation tolerant•radiation hardened
386 is modern
Heiko Schröder, 2003 Reconfigurable mesh 58
Fault tolerance through reconfigurationin regular networks
Heiko Schröder, 2003 Reconfigurable mesh 59
Every fault pattern, that does not contain a 2x2 array of faulty PEs survives.
PS(7)=0.7
A simple solution with high fault tolerance (torus)
processor Data sourceinstrument
“atomic fault pattern”
Heiko Schröder, 2003 Reconfigurable mesh 60
To the right
up
Replacement paths
Replace to the right -- 1 fault per row
Faulty processors
Heiko Schröder, 2003 Reconfigurable mesh 61
To the right
Replacement paths
Heiko Schröder, 2003 Reconfigurable mesh 62
Preserving horizontal connections Preserving vertical connections
spares Replacement paths
Two separate networks for horizontal and vertical connections
Heiko Schröder, 2003 Reconfigurable mesh 63
Number of switches: 2, 4, 6, 30Wire area: 0, 2.8, 3.8, 4
S
N
E
W
P
I
kNp 1
1
PS
# faults
1
161 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Heiko Schröder, 2003 Reconfigurable mesh 64
… an arrary of SHARCs to provides throughput 160 Mb/s.… 2.5 billion floating point operations per second. … first demonstration of real-time image processing in space.
image cube froma 30 km wide swath of Korea’s coastline.(Launch: 2001?)
Nemo
Heiko Schröder, 2003 Reconfigurable mesh 65
1.) You need to know which algorithm you want to use.But in image processing (if you do not use FFT – wavelet can also be a problem)You can usually assume that every calculation you do depends only on data in the close neighbourhood. That results in the fact that at any stge of your processing you need to have in your memory only the data of 3 neighbouring planes. 2.) you have several choices of processing the data (each time you do all the processing related to a single plane in parallel). You can either slice your cube into le to cut “Horizontal planes or into vertical planes (sometimes it is desirable to cut “diagonally”).3.) It is important that you read every data element only once.4.) Such data processing you would call systolic, i.e. you move the raw data through the architecture with constant speed and constant direction.
In the picture below I have cut the data-cube vertically -- obviously there are many directions of slicing vertically.
If the memory of the processors is large enough, you might be able to hold the complete cub in memory – then you can also process FFTs and wavelets easily.
Heiko Schröder, 2003 Reconfigurable mesh 66
Compressionratio (CR=4loss-less)
Segmentation gain (SG=16, 1/16 of a useful image is useful)
Classification gain(CG=5, 1 in 5 images contain useful information)
U=.8
U=.2
U=4
U=1 U=16
The satellite efficiency cube
Not likely
LOSSY=60U=32
U=64
(0,0,0)
Heiko Schröder, 2003 Reconfigurable mesh 67
Our aim: High performance via COTS16 processors (+ spares) off-the-shelfconnected via afault tolerant reconfigurable network
In X-SAT restricted to image processing
Mesh/torus
Heiko Schröder, 2003 Reconfigurable mesh 68
processorsfault
tolerantmesh
on-board
Heiko Schröder, 2003 Reconfigurable mesh 69
switch
current communication
FPGA
ctrlh/vo/er/w
Instructionsto PEs
link to PE
Heiko Schröder, 2003 Reconfigurable mesh 70
spares
C3 -- torus
spares
Replacement algorithm exists for up to 4 faults.Reconfiguration software runs on FPGA.Could be repaired within << 1sec.
Heiko Schröder, 2003 Reconfigurable mesh 71
ctrlh/vo/er/w
Instructionsto PEs
Diagnosticset switches
Heiko Schröder, 2003 Reconfigurable mesh 72
4 FPGAsConnected to k*(k+1) SA-processors each
k+1 horizontal and vertical connections plus diagonals
Theorem:Given a 2x2 array of FPGAs, each connected to k2+k processors, with k+1 vertical and horizontal connections and 1 connection in the diagonals.
The processors can be connected to a 2kx2k mesh as long as the sum of working processors is at least 4k2.
Heiko Schröder, 2003 Reconfigurable mesh 73
Proof:There are 3 cases, which we treat separately:1. Two neighbouring FPGAs have more than k2 working processors 2. No two neighbouring FPGAs have more than k2 working processors but two opposite FPGAs have more than k2 working processors 3. Only one FPGA has more than k2 processors
Case 1: red and green are greater than k2
Figure 1 shows all possible combinations forred and green having more then k2 processors.(in the drawing k=8).The light red and green areas show the minimal Number of working processors. If more than k2 processors are Working in the red and green FPGAs, they are added in the order indicted by the arrows in the dark red and dark green areas. These placesare otherwise occupied by processors belonging to the yellow and blue areas.
The border between yellow and blue is determined by the 4 numbers and is within the orange area. The maximal number for the Yellow or green area is k2+k. If red has also this many elements then the yellow areaneeds to be extended to the right into the orange area. The maximal size of yellow plus orange is(k-1)(k+1)+2(k-2)=k2-3+2k= k(k+1)+k-5 which is greater or equal to k(k+1) for k 5. Please note that due to the above inequality for k=8 (as shown in the drawing) the length of the left and right arrow can be reduced by 3 each. It is easy to see from the drawing that for any possible sum of red and yellow this can be done with the length of the yellow/blue borderline not exceeding k+1. Remark: The length of the border between orange and blue is k+1. Also there is at most one diagonal connection required.
Let the sum of red and yellow be smaller than 2k(k+1), i.e. smaller or equal to 2k2+2k-1. The maximal number of red yellow (assuring that no border is longer than k+1) and orange elements is 2k2+3k-5, which is sufficient to cover all 2k2+2k-1 elements as long as 3k-5 2k-1, i.e. k 4. The case where red and yellow have together 2k2+2k elements can be solved easily as shown in Figure …
Figure 1
Heiko Schröder, 2003 Reconfigurable mesh 74
Proof:There are 3 cases, which we treat separately:1. Two neighbouring FPGAs have more than k2 working processors 2. No two neighbouring FPGAs have more than k2 working processors but two opposite FPGAs have more than k2 working processors 3. Only one FPGA has more than k2 processors
1
aAll
All cases with only 3 FPGAs >0All cases with 4 FPGAs>0 under case 1 above
Heiko Schröder, 2003 Reconfigurable mesh 75
Case 2: Two FPGAs have more than 64 processors, but no two neighbouring FPGAs do:Let the red and blue have between k2+1 and k2+k working processors. Case 2a: Assume that either green or yellow have more than k2-k elements.Lets assume that green has at least k2-k+1 elements (and due to the general assumption of case 2 it has at most k2 elements). If yellow has more than k2-k elements we produce the mirror image.
Under these assumptions red can have up to k2+k elements and green can have up to k2 elementsAs indicated by the horizontal arrows (see Figure 2).Green plus blue can range from 2k2-k+2 to 2k2+k. Thus 2k-4 k needs to be satisfied, thus k 4.
Case 2b: Neither green nor yellow are greater k2-k, then green and yellow have to be exactly k2-k and red and blue need to be k2+k in order to have 4k2 elements.This can be solved as shown in Figure 3.
Figure 2 Figure 3
Heiko Schröder, 2003 Reconfigurable mesh 76
Case 3: Only red is larger than k2.All other colours are at least k2-k and at most k2.Case 3a: No colour is k2-k.Yellow occupies the yellow area plus orange if required (see Figure 4). Red occupies the bright red area plus the rest of orange plus the dark red in the order indicated by the two vertical arrows. Blue occupies the light blue and dark blue area in the order indicated by the arrow.
Case 3b: One has k2-k elements. This can then either be a neighbour of red (say green) – see Figure 5, or it can be opposite red – see Figure 6.
This completes the proof that for k >4 there is always a Solution, as long as at least 4k2 processors are active.
Figure 5 Figure 6Figure 4
Heiko Schröder, 2003 Reconfigurable mesh 77
For the case of more than 4 FPGAsWe also assume that we have k +1Horizontal and vertical connections plusone diagonal connection. We also attachk2 +k processors to every FPGA.
Heiko Schröder, 2003 Reconfigurable mesh 78
Heiko Schröder, 2003 Reconfigurable mesh 79
27x27=7298x90=72021 yellow (9 lower bound)
24x24=5768x72=5768 yellow (0 lower bound)
Heiko Schröder, 2003 Reconfigurable mesh 80
30x30=9008x110=88029 yellow (20 lower bound)
Heiko Schröder, 2003 Reconfigurable mesh 81
Recommended