Heiko Schröder, 2003

ROUTING ?Sorting?Image Processing?Sparse Matrices?

Reconfigurable Meshes !

Heiko Schröder, 2003 Reconfigurable mesh 2

Reconfigurable architecturesReconfigurable architectures

• FPGAs

• reconfigurable multibus

• reconfigurable networks (Transputers, PVM)

• dynamically reconfigurable mesh

Aim:efficiency

special purpose --> general purpose architectures

contentscontents

1.) Motivation for the reconfigurable mesh

2.) Routing (and sorting):• better than PRAM

• better than mesh

3.) Image processing

4.) Sparse matrix

multiplication

5.) Bounded bus length

PRAMPRAM

0 1 2 3 4 5 6 7 8 9

8 976 5432 0 1

0 1 2 3 4 5 6 7 8 9

diameter O(1) bisection width (n)

EREW CRCW

Mesh/TorusMesh/Torus

Diameter ( ) bisection width ( )

2D mesh

HypercubeHypercube

000 010

001 011

100 110

101 111

diameter O(log n)bisection width (n)

reconfigurable meshreconfigurable mesh

reconfigurable mesh = mesh + interior connections

15 positionsdiameter 1 !!

low cost

global ORglobal OR

1 0 000 1 0

* * “V”

Time: O(1) on RM-- (log n) on EREW-PRAM

Prefix sumPrefix sum

0 1 1 0 1 0 0 1 1 1

012345

Fast butexpensive

Time : O(1)Area: (nxn)

Modulo 3 counterModulo 3 counter

10 11 10

*1 mod 3

Time: O(1) on RM (log n / log log n) on CRCW-PRAM

• 2 digit numbers to the basis of k represent all numbers smaller than k2.

• 1.) determine x mod k (=lsd)

• 2.) count number of “wraps” (=msd).

modulo k2 counter (ranking)modulo k2 counter (ranking)

10 11 10

*1 mod k

--> modulo k2 counting in 2 steps on a k x k2 array

enumeration / prefix sumenumeration / prefix sum

1 1 1 1 1 1 1 11 2 1 2 1 2 1 21 2 3 4 1 2 3 41 2 3 4 5 6 7 8

time: O(log n)

wire efficiency ! -- (compared with tree)1/2 number of processors

permutation routing - 2 stepspermutation routing - 2 steps

2 steps !!!

Kunde’s all-to-all mappingKunde’s all-to-all mapping

Sorting:sort blocksall-to-all (columns)sort blocks all-to-all (rows)o-e-sort blocks

sorting in constant timesorting in constant time

Complete sort: sort blocks all-to-all (2) sort blocks all-to-all (2) o-e-sort blocks

broadcast (1)

Sort blocks:

broadcast (1)

rank (2)

• better than PRAM --- but useless!!!

Kunde’s all-to-all mappingKunde’s all-to-all mapping

vertical all-to-allvertical all-to-all

horizontal all-to-allhorizontal all-to-all

Use of bus -- no conflictUse of bus -- no conflict

1 step

2 steps

3 steps

k/2 steps

3 steps

2 steps

1 step

(k/2)2 steps

sorting in optimal time Kunde / Schröder

(k/2)2 stepsk=n1/3

each step takes n1/3 time --> T= n/4

T = n/2all-to-all

Sorting:sort blocks (O(n2/3))all-to-all (n/2)sort blocks (O(n2/3))all-to-all (n/2)o-e sort blocks (O(n2/3))(snake like order of blocks)

time: n + o(n)

Why optimal?Why optimal?

Sorter for n keys

Bisection of data with k wires

Sorting time n/k

Use of theoremUse of theorem

1.) n keys on a kxk RM:Time n/k

Proof:Wherever the data is stored there is always a bisection of length k-- this can be demonstrated sweeping left right through the array.Q.e.d.

2.) nxn keys on an nxn RM:Time n.

Proof: trivial

n + o(n)n + o(n)

Optimal --- but ...

enumeration / prefix sumenumeration / prefix sum

1 1 1 1 1 1 1 11 2 1 2 1 2 1 21 2 3 4 1 2 3 41 2 3 4 5 6 7 8

time: O(log n)

wire efficiency ! -- (compared with tree)1/2 number of processors

• move and smooth

ABCD-routingABCD-routing

Row-major enumeration of A, B, C and D packets within each quadrant in time 4 log n.Determine destination position of each packet.

elementary stepselementary steps

smooth

collect

time analysistime analysis

A B C D

move smooth

A B C D

collect

time: 3 x n/2

T=3n+o(n)

T < 2nT < 2n

4 destination squarestime: 3n + 4 log n

16 destination squarestime: 2n + 16 log n

64 destination squarestime: 12/7 n + 64 log n

mesh-diameter: 2n

enough of routing/sortingenough of routing/sorting

Constant factor !Can we do better ?What kind of problems ?

Image processingSparse problems !

Image processingImage processing

•Border following

•Edge detection

•Component labeling

•Skeletons

•Transforms

Component labellingComponent labelling

ObjectDefine border (candidates)Set bus

While own label is not received:1.) Candidates brake busand send their label a) clockwiseb) anti-clockwise2.) Candidates switch offand restore bus if they see smaller labelTime: O(1) -- O(log n)

TransformsTransforms

• Wavelet transform: Time log n on RM

-- time n on mesh

• FFT: Time n on RM and mesh

• Hough transform: Time m x log n on RM

-- time m x n on mesh

systolic matrix multiplicationsystolic matrix multiplication

c a bij ik kjk

time: ni

sparse matrix multiplicationsparse matrix multiplication

c a bij ik kjk

Time: n (nxn mesh)

A and B column sparse (k2)A and B row sparse (k2)A row sparse, B column sparse (k2)A column sparse, B row sparse k n

unlimited bus lengthunlimited bus length

• ring broadcast

1 2 32 2 2 2 2 2 2 2 2 3 3 3 3 33 3 3 3 1

2 2 2 2 3 1 1 1 1 1 1 1 1 1 2 2 2 2 21 1 1 1 2 3 3 3 3 3 3 3 3 3 1 1 1 1 1

A row-sparse B column-sparseA row-sparse B column-sparse

Repeat r times

horizontal ring broadcast aik

Repeat c times

vertical ring broadcast bkj

End. B

A and B column-sparseA and B column-sparse

Repeat c1 times

horizontal ring broadcast aik (meets bkj)

Repeat c2 times

vertical ring broadcast product to final position

c2 elements

lower bound (c,r) c=r=klower bound (c,r) c=r=k

c a bij ik kjk

splitting the problemsplitting the problem

Repeat k times

vertical ring broadcast

Repeat s times

horizontal ring broadcast

first s

k B-elements

A=As +Ar

C=AsB+ArB

time: ks

C=C+Cs r

first s

A has nk non-zero elements Ar has at most nk/s non-zero rows for s= n Ar has at most k n non-zero rows.

As B is a RR- problem it takes time k n .

A=As +Ar

Ar B calculating productsAr B calculating products

k A-elements

k B-elements

time: k2

elementsk n2

column sumcolumn sum

ii+1i-1

row itime: log n

k n only elements per column rout time: k n

routing within columnsrouting within columns

rout time: k n

Reconfigurable architecturesReconfigurable architectures

Reconfigurable mesh ?

constant diameter !

No !!!

Physical laws!

Physical limitsPhysical limits

c=300 000 km/sec • 30cm/ns

• on chip: 1cm/ns

• --> bounded bus length

good idea !

bounded broadcastbounded broadcast

1 2 31 2 2 3 3 3

1 1 2 2 2 2 2 3 33 33 3 1 1 1 1 1 2 2

3 1 1 2 2 23 3 1 1 1 2 22 2

2 2 3 3 3 3 3 1 11 11 1 2 3 3

time: k + n/l

creating main stationscreating main stations

1 2 3 1 2 3 1 2 3

1 1 1 1 1 1 1 1 1 1 1 1 1 1 12 2 2 2 2 2 2 2 2 2 2 2 2 2 23 3 3 3 3 3 3 3 3 3 3 3 3 3 3

time: k

Create main stations 1,…,k for A and B (time: n/l+k)

For i=1,…,k do

horizontal ring broadcast i of A

For j=1,…,k do

vertical ring broadcast j of B

A row-sparse B column-sparseA row-sparse B column-sparse

Create main stations 1, … , k for A (time: n/l+k)

For i=1,…,k do

horizontal ring broadcast i

k bounded vertical broadcasts of products

merging new products

A and B column-sparseA and B column-sparse

k elements

remove minor stationsremove minor stations

1 2 32 21 3 3 31 1 2 2 2 2 2 3 33 3

1 1 1 1 1 2 2 2 2 23 3 3 3 33 3 3 3 3 1 1 1 1 1 2 22 2

2 2 2 2 2 3 3 3 3 3 1 11 1

resultsresults

Time: n (nxn mesh)A and B column sparse (k2) (k2+2n/l)A and B row sparse (k2) (k2 +2n/l)A row sparse, B column sparse (k2) (k2 +n/l)A column sparse, B row sparse (+11n/l) 3k n

•image processing

•sorting

•routing

•load balancingbetter than the mesh !

(Kunde, Middendorf, Schmeck, Schröder, Turner)

(Kapoor, Kunde, Kaufmann, Schroeder, Sibeyn)

•The RM is in some cases “better” than PRAM•The RM is always at least as “good” as mesh •The RM is often “better” than the mesh

Fault tolerant On-board computingFault tolerant On-board computing

10 km/s 1 image/s 100 Mbit/image 4000 s/orbit 400 Gbit/orbit download: 400 Mbit/orbit

On-board imageanalysis andcompression

800 km

Singapore

Due to radiation:•Single event upsets (many)•latch ups (extra hardware)•total loss (rare at 800 km)

1 task per processorseveral tasks/instruments/sources per processor1 task per 3 to 4 processors

?16 processors (+ spares?)fault tolerant reconfigurable network

Methods currently usedMethods currently used

shadow-processorsmajorityvoting

Byzantine systems ASTRIUM, deep space

1 CAN2 CANs •Industrial spec.

•mil-spec.•radiation tolerant•radiation hardened

386 is modern

Fault tolerance through reconfigurationin regular networks

Every fault pattern, that does not contain a 2x2 array of faulty PEs survives.

PS(7)=0.7

A simple solution with high fault tolerance (torus)

processor Data sourceinstrument

“atomic fault pattern”

To the right

Replacement paths

Replace to the right -- 1 fault per row

Faulty processors

To the right

Replacement paths

Preserving horizontal connections Preserving vertical connections

spares Replacement paths

Two separate networks for horizontal and vertical connections

Number of switches: 2, 4, 6, 30Wire area: 0, 2.8, 3.8, 4

# faults

161 2 3 4 5 6 7 8 9 10 11 12 13 14 15

… an arrary of SHARCs to provides throughput 160 Mb/s.… 2.5 billion floating point operations per second. … first demonstration of real-time image processing in space.

image cube froma 30 km wide swath of Korea’s coastline.(Launch: 2001?)

1.) You need to know which algorithm you want to use.But in image processing (if you do not use FFT – wavelet can also be a problem)You can usually assume that every calculation you do depends only on data in the close neighbourhood. That results in the fact that at any stge of your processing you need to have in your memory only the data of 3 neighbouring planes. 2.) you have several choices of processing the data (each time you do all the processing related to a single plane in parallel). You can either slice your cube into le to cut “Horizontal planes or into vertical planes (sometimes it is desirable to cut “diagonally”).3.) It is important that you read every data element only once.4.) Such data processing you would call systolic, i.e. you move the raw data through the architecture with constant speed and constant direction.

In the picture below I have cut the data-cube vertically -- obviously there are many directions of slicing vertically.

If the memory of the processors is large enough, you might be able to hold the complete cub in memory – then you can also process FFTs and wavelets easily.

Compressionratio (CR=4loss-less)

Segmentation gain (SG=16, 1/16 of a useful image is useful)

Classification gain(CG=5, 1 in 5 images contain useful information)

U=1 U=16

The satellite efficiency cube

Not likely

LOSSY=60U=32

(0,0,0)

Our aim: High performance via COTS16 processors (+ spares) off-the-shelfconnected via afault tolerant reconfigurable network

In X-SAT restricted to image processing

Mesh/torus

processorsfault

tolerantmesh

on-board

switch

current communication

ctrlh/vo/er/w

Instructionsto PEs

link to PE

spares

C3 -- torus

spares

Replacement algorithm exists for up to 4 faults.Reconfiguration software runs on FPGA.Could be repaired within << 1sec.

ctrlh/vo/er/w

Instructionsto PEs

Diagnosticset switches

4 FPGAsConnected to k*(k+1) SA-processors each

k+1 horizontal and vertical connections plus diagonals

Theorem:Given a 2x2 array of FPGAs, each connected to k2+k processors, with k+1 vertical and horizontal connections and 1 connection in the diagonals.

The processors can be connected to a 2kx2k mesh as long as the sum of working processors is at least 4k2.

Proof:There are 3 cases, which we treat separately:1. Two neighbouring FPGAs have more than k2 working processors 2. No two neighbouring FPGAs have more than k2 working processors but two opposite FPGAs have more than k2 working processors 3. Only one FPGA has more than k2 processors

Case 1: red and green are greater than k2

Figure 1 shows all possible combinations forred and green having more then k2 processors.(in the drawing k=8).The light red and green areas show the minimal Number of working processors. If more than k2 processors are Working in the red and green FPGAs, they are added in the order indicted by the arrows in the dark red and dark green areas. These placesare otherwise occupied by processors belonging to the yellow and blue areas.

The border between yellow and blue is determined by the 4 numbers and is within the orange area. The maximal number for the Yellow or green area is k2+k. If red has also this many elements then the yellow areaneeds to be extended to the right into the orange area. The maximal size of yellow plus orange is(k-1)(k+1)+2(k-2)=k2-3+2k= k(k+1)+k-5 which is greater or equal to k(k+1) for k 5. Please note that due to the above inequality for k=8 (as shown in the drawing) the length of the left and right arrow can be reduced by 3 each. It is easy to see from the drawing that for any possible sum of red and yellow this can be done with the length of the yellow/blue borderline not exceeding k+1. Remark: The length of the border between orange and blue is k+1. Also there is at most one diagonal connection required.

Let the sum of red and yellow be smaller than 2k(k+1), i.e. smaller or equal to 2k2+2k-1. The maximal number of red yellow (assuring that no border is longer than k+1) and orange elements is 2k2+3k-5, which is sufficient to cover all 2k2+2k-1 elements as long as 3k-5 2k-1, i.e. k 4. The case where red and yellow have together 2k2+2k elements can be solved easily as shown in Figure …

Figure 1

Proof:There are 3 cases, which we treat separately:1. Two neighbouring FPGAs have more than k2 working processors 2. No two neighbouring FPGAs have more than k2 working processors but two opposite FPGAs have more than k2 working processors 3. Only one FPGA has more than k2 processors

All cases with only 3 FPGAs >0All cases with 4 FPGAs>0 under case 1 above

Case 2: Two FPGAs have more than 64 processors, but no two neighbouring FPGAs do:Let the red and blue have between k2+1 and k2+k working processors. Case 2a: Assume that either green or yellow have more than k2-k elements.Lets assume that green has at least k2-k+1 elements (and due to the general assumption of case 2 it has at most k2 elements). If yellow has more than k2-k elements we produce the mirror image.

Under these assumptions red can have up to k2+k elements and green can have up to k2 elementsAs indicated by the horizontal arrows (see Figure 2).Green plus blue can range from 2k2-k+2 to 2k2+k. Thus 2k-4 k needs to be satisfied, thus k 4.

Case 2b: Neither green nor yellow are greater k2-k, then green and yellow have to be exactly k2-k and red and blue need to be k2+k in order to have 4k2 elements.This can be solved as shown in Figure 3.

Figure 2 Figure 3

Case 3: Only red is larger than k2.All other colours are at least k2-k and at most k2.Case 3a: No colour is k2-k.Yellow occupies the yellow area plus orange if required (see Figure 4). Red occupies the bright red area plus the rest of orange plus the dark red in the order indicated by the two vertical arrows. Blue occupies the light blue and dark blue area in the order indicated by the arrow.

Case 3b: One has k2-k elements. This can then either be a neighbour of red (say green) – see Figure 5, or it can be opposite red – see Figure 6.

This completes the proof that for k >4 there is always a Solution, as long as at least 4k2 processors are active.

Figure 5 Figure 6Figure 4

For the case of more than 4 FPGAsWe also assume that we have k +1Horizontal and vertical connections plusone diagonal connection. We also attachk2 +k processors to every FPGA.

27x27=7298x90=72021 yellow (9 lower bound)

Heiko Schröder, 2003

Documents

Heiko Schröder, 1998 Architectures Special Purpose Meshweb.cecs.pdx.edu/~mperkows/temp/May22/warwick.pdf · Performance (MIPS) Year 4004 8080 80286 80386 Pentium 2 Pentium 80486

1997-31 Hillu Schröder

Heiko Prigge Minibook

The Future of Parallel Computing Special Purpose Mesh Architectures Heiko Schröder, 1998 P A R C SA ISA PIPS RM OH

Schröder, Philipp; Schröder, Philipp; Aguiló-Aguayo, Noemí

HEIKO STEUER - uni-freiburg.de

Dirk Schröder Bildschirmkarten im Internet

Heiko Schröder, 1998 ROUTING ? Sorting? Image Processing? Sparse Matrices? Reconfigurable Meshes !

Friedrich Schröder Sonnenstern

Schröder Packfix Tragetaschen

1 Trends and options in parallel computing © Heiko Schröder, 2003

Klimaanlagen HEIKO - testsieger.de · Technische Daten. Heiko ist eine Marke, die mit zahlreichen proökologischen Lösungen für die Umwelt sorgt. Klimageräte von Heiko bevorzugen

W. Udo Schröder, 2019

Klassifikation der Trilobitentrilobiten.de/Res/Klassifikation_der_Trilobiten.pdf · Klassifikation der Trilobiten Jens Koppka & Heiko Sonntag 2003 (Systematik und Tafeln zur Klassifikation

Forschungsstelle Osteuropa Bremen Arbeitspapiere … · Herausgegeben von Heiko Pleines und Hans-Henning Schröder Forschungsstelle Osteuropa an der Universität Bremen Klagenfurter

Heiko Schröder, 2003 Real time image processing on-board satellites

Close Up - Schröder Küchen

Congress - static.cambridge.orgcambridge.org:id:... · Le Bourget 2 2004 Christine Boutin (Jean-Bernard Milliard; Vincent You) Bochum 3 2003 Gerhard Schröder Bochum 3 2003 Olaf Scholz

:: DIAsDEM :: Seminar: Web Mining WS 2003/2004 Ingo Kampe Heiko Scharff

Veit Echterhoff / Antonius Schröder