Domain Decomposition€¦ · Domain Decomposition Höchstleistungsrechenzentrum Stuttgart Adamidis/Bönisch How to split the Domain in the Dimensions (II) • That depends on: –

31. — Domain Decomposition – Parallelization of Mesh Based Applications — 3131-1

Höchstleistungsrechenzentrum Stuttgart

Domain Decomposition

Parallelization of Mesh Based Applications

Panagiotis AdamidisThomas Bönisch

University of StuttgartHigh-Performance Computing-Center Stuttgart (HLRS)

www.hlrs.de

Slide 2Domain Decomposition

Höchstleistungsrechenzentrum StuttgartAdamidis/Bönisch

Outline

• Introduction

• Basics

• Boundary Handling

• Example: Finite Volume Flow Simulation on Structured Meshes

• Example: Finite Element Approach on an Unstructured Mesh




Parallelization - Target

• High Application Performance

• Using real big MPP’s

– no loss in efficiency even when using 500 Processors and more

• Using Clusters of SMP’s

– no decrease in Performance due to using external Networkconnections



A Problem (I)

Flow around a cylinder: Numerical Simulation using FV, FE or FD

Data Structure: A(1:n,1:m)

Solve: (A+B+C)x=b




Parallelization strategies

Flow around a cylinder: Numerical Simulation using FV, FE or FD

Data Structure: A(1:n,1:m)

Solve: (A+B+C)x=b

Work decomposition

Domain decomposition

A(1:20,1:50)A(1:20,51:100)A(1:20,101:150)A(1:20,151:200)

Data decomposition

Scaling ?

Scalestoo much communication ?

Good Chance

do i=1,100Í i=1,25

i=26,50i=51,75i=76,100



Parallelization Problems

• Decomposition (Domain, Data, Work)

• Communication

du dx u u dxi i/ ( ) /= −+ −1 1

ui-1 ui+1




Concepts - Message Passing (II)

User defined communication



How to split the Domain in the Dimensions (I)

1 - Dimensional 2 (and 3) - Dimensional




How to split the Domain in the Dimensions (II)

• That depends on:

– computational speed i.e. processor:vectorprocessor or cache

– communication speed:• latency• bandwidth• topology

– number of subdomains needed

– load distribution (is the effort for every mesh cell equal)



Replication versus Communication (I)

• If we need a value from a neighbour we have basically twoopportunities

– getting the necessary value directly from the neighbour, whenneeded

Communication, Additional Synchronisation

– calculating the value of the neighbour again locally from valuesknown there

Additional Calculation

• Selection depends on the application




Replication versus Communication (II)

• Normally replicate the values

– Consider how many calculations you can execute while onlysending 1 Bit from one process to another(6 µs, 1.0 Gflop/s Í 6000 operations)

– Sending 16 kByte (20x20x5) doubles(with 300 MB/s bandwidth Í 53.3 µs Í 53 300 operations)

– very often blocks have to wait for their neighbours

– but extra work limits parallel efficiency

• Communication should only be used if one is quite sure that this isthe best solution



2- Dimensional DD with two Halo Cells

Mesh Partitioning

Subdomain for each Process



Example:

Parallelization of a 3D Finite Volume Flow Solver



Starting Point: Sequentiell Program

• Written in FORTRAN77

• Using structured meshes

• Parts of the program

– Preprocessing, reading Data

– Main Loop

• Setup of the equation system

• Preconditioning

• Solving step

– Postprocessing, writing Data




Dynamic Data Structures

• Pure FORTRAN77 is too static

– number of processors can vary from run to runsize of arrays even within the same case can vary

– dynamic data structurs• use Fortran90 dynamic arrays• use all local memory on a PE for a huge FORTARN77 array

and setup your own memory management• second method has a problem on SMP’s and cc-NUMA’s

we should only use as much memory as necessary



Main Loop

• Setup of the equation system

– each cell has 6 neighbours

– needs data from neighbour cells

– 2 halo cells at the inner boundaries

• Preconditioning

– no neighbour infomation needed at all

– completely done locally

• Solving step

– Jacobi Line relaxation with subiterations

– more complicated (next slides)




Heptadiagonalmatrix

D X X XX D X X X

X D X X XX X D X X X

X X D X X XX X D X X X




X X X D X X XX X X D X X X






X X X D X XX X X D X X




X X X D X XX X X D X

X X X D XX X X D



Parallelization - Solver (I)

rAu

ruArr

rr

1−==

• Parallelization problems:

– Matrix A is distributed

– Matrix inversion is a priori not parallelizable

• Sequential:




Parallelization - Solver (II)

−=

−=

−==+

+==

−−

−

)1(1)(

)1()(

)(

jj

jj

uMrLu

uMruL

uMruL

ruML

MLA

r uA

rrr

rrr

rrr

rr

rr



Parallelization - Solver (III)

=A +

1212

111111

101010

99

88

777

666

55

44

333

222

11

ml

uml

uml

umml

uml

uml

umml

uml

uml

um

9

8

5

4

lu

lu

M+=A L




34 35 36

31 32 33

28 29 30

25 26 27

A Real Heptadiagonalmatrix

D X X XX D X X X

X D X XX D X X X

X X D X X XX X D X X


X X D X XX D X X

X X D X XX X D X


X X D X XX X D X X X

X X X D X X XX X X D X X

X X D X X XX X X D X X X

X X X D X XX X D X X


X D X XX X D X X

X X D XX X D X X


X X D X XX X X D X X

X X X D XX X D X

X X X D XX X X D

1

2

1

1

2

2

22 23 24

19 20 21

16 17 18

13 14 15

10 11 12

7 8 9

4 5 6

1 2 3



Difference between strong / weak coupling

• Solver with weak coupling– Extra computational effort due to additional solving step

(but no factorization)– Additional update of right hand side– Two times communication

• one after each solving step

• Solver with stronger coupling– Jacobi line relaxation method with subiterations– collapse iterations from line relaxation and parallelzation– no additional iterations– much more communication




Comparison of Solvers - Convergence

Iterations

L2 -

Res

idua

lSolver 1 —Solver 2 —



Comparison of Solvers - Performance

Number of Processors

Tim

e in

Sec

onds

strong Ethernet

weak Ethernet

weak HPS

strong HPS




Results - Solving Method

• the presented solver with weak coupling works fine for this CFD

problems

• Solutions differ in the scale of one percent

• convergence rate is nearly equal to the sequential program







Speedup on Cray T3E


Spe

edup

Problemsize 110592

Problemsize 55296

Problemsize 27648



Scaleup on Cray T3E


Sca

leup



Domain Decomposition of Unstructured Grids



Unstructured FEM Grid with Global Numbering

87

1415

9

1

510

6

11

2

12

13

3

4




Shape of Corresponding System Matrix

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

xxxxx

xxxx

xxxxxx

xxxxx

xxxxx

xxxx

xxxx

xxxxxx

xxxxxxxx

xxxxxx

xxxxxxx

xxxx

xxxxx

xxxxx

xxxx = edge of FEM grid




1. Nonoverlapping Domain Decomposition

2. Grid Points Separated into Inner and Boundary Points

3. Renumbering of the Inner and Boundary Points




1. Nonoverlapping Domain Decomposition

P1

87

1415

9

1

510

6

11

2

12

13

3

4

P2

P3



87

1415

9

2. Grid Points Separated into Inner and Boundary Points

43

1

510

6

11

2

12

13

P1

P3

P2




3. Renumbering of the Inner and Boundary Points

9

32

1

21

1

2 3

7

5

6

4

8

P1

P3

P2

10

• Inner points are numbered from 1 up to the number of inner points

• Boundary points are numbered globally starting from the maximum of all inner points of all partitions plus 1



Arrow Shaped System Matrix after Renumbering

P1 P2 P3 logical boundary 1 2 3 4 5 1 2 1 2 3 6 7 8 9 10

10

9

8

7

6

3

2

1

2

1

5

4

3

2

1

xxxxxx

xxxxxxxx

xxxxxx

xxxxxx

xxxxx

xxxx

xxxxx

xxx

xxx

xxxxx

xxxxx

xxxxx

xxxx

xxxxxxx

xxx




Data Distribution

1

2 3

7

5

6

4

8

P1

P2

9

32

1

10

7

8

P3

2

1

10 9

7

6



Linear System

• Available on local memories

BBi

BIi

IBi

IIi

AA

AAi = 1, ... ,n

=BB

A ∑n

BB

i A =i 1

with

BBBIn

BIBI

IBn

IIn

IBII

IBII

AAAA

AA

AA

AA

..

..

..

21

22

11

B

In

I

I

x

x

x

x

.

.

2

1

=

B

In

I

I

b

b

b

b

.

.

2

1




Direct Substructuring

1. Direct factorization

2. Assembling of the Schur complement system

4. Solving of the interior unknowns

3. Solving of the Schur complement system



Transformations of the Original System

A=

I

I . . ..

. . .. AAI IBII

I. ... .. . ..

..1-1

)1(

IB BI

iA II

Ai ABB

iAiS i )(

1−

− =

IAA

IIA

IIA

BIBI.

....

..2.

...1

21

+

−

−

n+SS

AAI

AAI

IBII

IBII

.1

.......

..

..21)2(

11)1(

b

I

I

x

x

x

.

2

1

=

b

I

I

b

b

b

.

2

1

IIA

. . ...

. . . IIA 2

.. . ..

1

+ + n SS .1. .

-1A . I

II A A

II A

BIBI. .)2( 2)1 (1 -1

(2AAI IBII..

2)-1

. . ...




Schurcomplement System

B SHbxSH

A =

I

ibI

ixII

iA =B

xIB

iA−

or

with

IB

iA( II

i -1 A )

n

= ASHb Bb − ∑=i 1

BI

i

I

ib

n

where or

SHA = −

BI

iA ∑ BB

Ai

=i 1

iZSHA = −

BI

i

A ∑ BBA

i

=i 1

n

= iZ IB

iA

II

iA i Z =

IB

iA

iiz = I

ib iz = IIA

I

ib i z= ASHb Bb − ∑

=i 1

BI

in

( II

i -1 A )

( II

i -1 A )

( II

i -1 A )



1. Direct factorization

1

2 3

7

5

6

4

8

P1

P2

9

32

1

10

7

8

P3

2

1

10 9

7

6

IIiA i

Z = IBiA

IIiA iz = I

ibIIiA i

Z = IBiA

IIiA iz = I

ib

IIiA i

Z = IBiA

IIiA iz = I

ib




2. Assembling of the Schur complement system (I)

1

2 3

7

5

6

4

8

P1

P2

9

32

1

10

7

8

P3

2

1

10 9

7

6

− BI

iABB

Ai

iZBIi

A iz − BI iA

BBA

iiZ

BIi

A iz

− BI iA

BBA

iiZ

BIi

A iz



2. Assembling of the Schur complement system (II)

1

2 3

7

5

6

4

8

P1

P2

9

32

1

10

7

8

P3

2

1

10 9

7

6

iZSHA = −

BI

i

A ∑ BBA

i

=i 1

n

=SHb Bb − ∑

=i 1

i z A BI

in




3. Solving of the Schur complement system

1

2 3

7

5

6

4

8

P1

P2

9

32

1

10

7

8

P3

2

1

10 9

7

6

B SHbxSH

A =



4. Solving of the interior unknowns

1

2 3

7

5

6

4

8

P1

P2

9

32

1

10

7

8

P3

2

1

10 9

7

6

I

ibI

ixII

iA =B

xIB

iA−

I

ibI

ixII

iA =B

xIB

iA−

I

ibI

ixII

iA =B

xIB

iA−




Parallel Computations on the Subdomains



Overlapping Domain Decomposition

1

2

3

4

5

18

6

7

8

9

10

11

12

13

11 14

10 13

12

16

15 17

4 7

5 8

96

1

2

3

5

6

4

8

9

10

11

12

13

14

15

16

17

18

7

1γ2γ

2Ω

1Ω




Literature

• Barry F. Smith, Petter E. Bjoerstad, William D. Gropp:Domain Decomposition, Parallel Multilevel Methods for EllipticPartial Differential Equations, Cambridge University Press, 1996,ISBN 0-521-49589-X

• David E. Keyes, Youcef Saad, Donald G. Truhlar:Domain-Based Parallelism and Problem Decomposition Methods inComputational Science and Engineering, SIAM, 1995,ISBN 0-89871-348-X

Documents

Domain Decomposition€¦ · Domain Decomposition Höchstleistungsrechenzentrum Stuttgart Adamidis/Bönisch How to split the Domain in the Dimensions (II) • That depends on: –