31. — Domain Decomposition – Parallelization of Mesh Based Applications — 3131-1
Höchstleistungsrechenzentrum Stuttgart
Domain Decomposition
Parallelization of Mesh Based Applications
Panagiotis Adamidis, Thomas Bönisch
University of Stuttgart, High-Performance Computing-Center Stuttgart (HLRS)
www.hlrs.de
Slide 2
Outline
• Introduction
• Basics
• Boundary Handling
• Example: Finite Volume Flow Simulation on Structured Meshes
• Example: Finite Element Approach on an Unstructured Mesh
Slide 3
Parallelization - Target
• High Application Performance
• Using really large MPPs
– no loss in efficiency even when using 500 processors and more
• Using clusters of SMPs
– no decrease in performance due to using external network connections
Slide 4
A Problem (I)
Flow around a cylinder: Numerical Simulation using FV, FE or FD
Data Structure: A(1:n,1:m)
Solve: (A+B+C)x=b
Slide 5
Parallelization strategies
Flow around a cylinder: Numerical Simulation using FV, FE or FD
Data Structure: A(1:n,1:m)
Solve: (A+B+C)x=b
• Work decomposition: do i=1,100 → i=1,25 | i=26,50 | i=51,75 | i=76,100
– Scaling?
• Data decomposition: A(1:20,1:50) | A(1:20,51:100) | A(1:20,101:150) | A(1:20,151:200)
– Scales, too much communication?
• Domain decomposition
– Good chance
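As a minimal sketch of the work decomposition on this slide, the loop range 1..100 can be cut into contiguous chunks, one per process. The helper name `split_range` is hypothetical and not taken from the original program; Python is used only for illustration.

```python
def split_range(first, last, nparts):
    """Split the inclusive range first..last into nparts contiguous chunks.

    Chunk sizes differ by at most one iteration.
    """
    n = last - first + 1
    chunks = []
    for p in range(nparts):
        lo = first + p * n // nparts
        hi = first + (p + 1) * n // nparts - 1
        chunks.append((lo, hi))
    return chunks


# The loop "do i=1,100" distributed over 4 processes:
print(split_range(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

The same helper reproduces the column ranges of the data decomposition when applied to 1..200.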
Slide 6
Parallelization Problems
• Decomposition (Domain, Data, Work)
• Communication
$du/dx = (u_{i+1} - u_{i-1}) / dx$
[Figure: the value at point $i$ depends on the neighbour values $u_{i-1}$ and $u_{i+1}$]
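The communication problem can be sketched as follows: once the array is split, the first and last point of each chunk need a value that lives on a neighbouring process. The example emulates the exchange inside one Python process by copying one halo value per side. The function names are hypothetical, and `dx` is used as in the slide's formula, which assumes it denotes the distance between points i-1 and i+1.

```python
def split_with_halo(u, nparts):
    """Split list u into contiguous chunks, each padded with one halo value per side.

    Returns a list of (global_start_index, values). The halo is None at a
    physical boundary, where no neighbour exists.
    """
    n = len(u)
    chunks = []
    for p in range(nparts):
        lo = p * n // nparts
        hi = (p + 1) * n // nparts
        left = u[lo - 1] if lo > 0 else None   # would be received from the left neighbour
        right = u[hi] if hi < n else None      # would be received from the right neighbour
        chunks.append((lo, [left] + u[lo:hi] + [right]))
    return chunks


def local_difference(chunk, dx):
    """Evaluate (u[i+1] - u[i-1]) / dx for every owned point that has both neighbours."""
    lo, vals = chunk
    result = {}
    for k in range(1, len(vals) - 1):
        if vals[k - 1] is None or vals[k + 1] is None:
            continue  # point on the physical boundary
        result[lo + k - 1] = (vals[k + 1] - vals[k - 1]) / dx
    return result


u = [float(i * i) for i in range(8)]
merged = {}
for chunk in split_with_halo(u, 4):
    merged.update(local_difference(chunk, dx=2.0))
print(merged)
```

In a message passing program the two halo values would be filled by a send/receive pair with each neighbour before the local loop starts.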
Slide 7
Concepts - Message Passing (II)
User defined communication
Slide 8
How to split the Domain in the Dimensions (I)
[Figure: 1-dimensional versus 2- (and 3-) dimensional splitting of the domain]
Slide 9
How to split the Domain in the Dimensions (II)
• That depends on:
– computational speed, i.e. the processor type: vector processor or cache-based
– communication speed: latency, bandwidth, topology
– number of subdomains needed
– load distribution (is the effort for every mesh cell equal?)
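To make the trade-off between latency and bandwidth concrete, the following sketch counts, for an interior subdomain of a square n x n mesh, how many messages and how many halo cells one exchange needs when the mesh is cut into strips (1-dimensional) or into square blocks (2-dimensional). The function names are hypothetical, corner cells are ignored, and the numbers are illustrative rather than measurements.

```python
import math


def exchange_cost_strips(n, p, halo=1):
    """(messages, halo cells) per exchange for an interior strip of an n x n mesh."""
    if p < 3:
        raise ValueError("an interior strip needs at least 3 strips")
    return 2, 2 * halo * n


def exchange_cost_blocks(n, p, halo=1):
    """(messages, halo cells) per exchange for an interior block of a q x q process grid."""
    q = math.isqrt(p)
    if q * q != p or n % q != 0 or q < 3:
        raise ValueError("this sketch assumes p = q*q with q >= 3 and q dividing n")
    side = n // q
    return 4, 4 * halo * side


print(exchange_cost_strips(1200, 16))  # (2, 2400)
print(exchange_cost_blocks(1200, 16))  # (4, 1200)
```

Strips send fewer but larger messages, which favours networks with high latency; blocks move less data in total, which favours networks limited by bandwidth.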
Slide 10
Replication versus Communication (I)
• If we need a value from a neighbour, we basically have two options:
– getting the necessary value directly from the neighbour when needed
→ communication, additional synchronisation
– calculating the value of the neighbour again locally from values known locally
→ additional calculation
• Selection depends on the application
Slide 11
Replication versus Communication (II)
• Normally replicate the values
– Consider how many calculations you can execute while sending only 1 bit from one process to another (6 µs latency at 1.0 Gflop/s → 6000 operations)
– Sending 16 kByte (20x20x5 doubles) at 300 MB/s bandwidth takes 53.3 µs → 53 300 operations
– very often blocks have to wait for their neighbours
– but extra work limits parallel efficiency
• Communication should only be used if one is quite sure that this is the best solution
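The two estimates on this slide are simple arithmetic and can be reproduced with a few lines. The function names are hypothetical; the latency, bandwidth and peak rate are the values quoted on the slide, and the transfer time ignores latency, as the slide's 53.3 µs figure does.

```python
def operations_lost(seconds, flops_per_second):
    """Number of floating point operations that fit into the given time."""
    return seconds * flops_per_second


def transfer_time(nbytes, bandwidth_bytes_per_second):
    """Pure bandwidth cost of a message in seconds (latency not included)."""
    return nbytes / bandwidth_bytes_per_second


PEAK = 1.0e9          # 1.0 Gflop/s
LATENCY = 6.0e-6      # 6 microseconds
BANDWIDTH = 300.0e6   # 300 MB/s

message_bytes = 20 * 20 * 5 * 8          # 2000 doubles = 16000 bytes
t = transfer_time(message_bytes, BANDWIDTH)

print(operations_lost(LATENCY, PEAK))    # about 6000 operations
print(t * 1e6)                           # about 53.3 microseconds
print(operations_lost(t, PEAK))          # about 53 300 operations
```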
Slide 12
2-Dimensional DD with Two Halo Cells
[Figure: mesh partitioning into one subdomain for each process, each subdomain extended by two halo cells]
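A minimal sketch of the bookkeeping behind such a partitioning: for a mesh A(1:n,1:m) and a px x py process grid, each process owns a block of indices and stores that block plus two halo cells towards every neighbour. The function names are hypothetical, indices are 1-based and inclusive as in the Fortran notation, and it is an assumption of this sketch that no halo is stored beyond the physical boundary.

```python
def owned_range(n, parts, idx):
    """1-based inclusive index range owned by part idx (0-based) out of parts."""
    lo = idx * n // parts + 1
    hi = (idx + 1) * n // parts
    return lo, hi


def with_halo(index_range, n, halo):
    """Extend a range by halo cells on both sides, clipped to 1..n."""
    lo, hi = index_range
    return max(1, lo - halo), min(n, hi + halo)


def subdomain(n, m, px, py, ip, jp, halo=2):
    """Owned and stored index ranges of process (ip, jp) in a px x py process grid."""
    rows = owned_range(n, px, ip)
    cols = owned_range(m, py, jp)
    return {
        "owned": (rows, cols),
        "stored": (with_halo(rows, n, halo), with_halo(cols, m, halo)),
    }


# Second of four column blocks of A(1:20,1:200):
print(subdomain(20, 200, 1, 4, 0, 1))
# {'owned': ((1, 20), (51, 100)), 'stored': ((1, 20), (49, 102))}
```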
Example:
Parallelization of a 3D Finite Volume Flow Solver
Slide 14
Starting Point: Sequential Program
• Written in FORTRAN77
• Using structured meshes
• Parts of the program
– Preprocessing, reading Data
– Main Loop
• Setup of the equation system
• Preconditioning
• Solving step
– Postprocessing, writing Data
Slide 15
Dynamic Data Structures
• Pure FORTRAN77 is too static
– number of processors can vary from run to run
– size of arrays can vary even within the same case
– dynamic data structures: either (1) use Fortran90 dynamic arrays, or (2) use all local memory on a PE for one huge FORTRAN77 array and set up your own memory management
– the second method has a problem on SMPs and cc-NUMAs: we should only use as much memory as necessary
Slide 16
Main Loop
• Setup of the equation system
– each cell has 6 neighbours
– needs data from neighbour cells
– 2 halo cells at the inner boundaries
• Preconditioning
– no neighbour information needed at all
– completely done locally
• Solving step
– Jacobi Line relaxation with subiterations
– more complicated (next slides)
Slide 17
Heptadiagonal Matrix
[Figure: sparsity pattern of the system matrix: the main diagonal D plus six off-diagonals X, one for each of the 6 neighbours of a cell]
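Where the seven diagonals come from can be sketched by listing, for every cell of a structured 3D mesh, the matrix columns of the cell itself and of its up to 6 neighbours. The function name, the 0-based numbering with x running fastest, and the 3 x 4 x 3 mesh size are assumptions made for this illustration.

```python
def heptadiagonal_pattern(nx, ny, nz):
    """Return {row: sorted nonzero columns} for an nx x ny x nz structured mesh."""

    def cell(i, j, k):
        # 0-based cell number, x index running fastest
        return i + nx * (j + ny * k)

    offsets = ((-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1))
    pattern = {}
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                cols = [cell(i, j, k)]  # diagonal entry D
                for di, dj, dk in offsets:
                    ii, jj, kk = i + di, j + dj, k + dk
                    # a neighbour outside the mesh leaves a gap in that band
                    if 0 <= ii < nx and 0 <= jj < ny and 0 <= kk < nz:
                        cols.append(cell(ii, jj, kk))
                pattern[cell(i, j, k)] = sorted(cols)
    return pattern


pattern = heptadiagonal_pattern(3, 4, 3)
print(pattern[16])  # interior cell: [4, 13, 15, 16, 17, 19, 28]
print(pattern[0])   # corner cell:   [0, 1, 3, 12]
```

Interior cells produce the full seven entries at offsets 0, ±1, ±nx and ±nx*ny; cells on the mesh boundary produce fewer.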
Slide 18
Parallelization - Solver (I)
• Sequential: $A\vec{u} = \vec{r}$, $\vec{u} = A^{-1}\vec{r}$
• Parallelization problems:
– Matrix A is distributed
– Matrix inversion is a priori not parallelizable
Slide 19
Parallelization - Solver (II)
$$
\begin{aligned}
A\vec{u} &= \vec{r}, \qquad A = L + M \\
(L + M)\,\vec{u} &= \vec{r} \\
L\vec{u} &= \vec{r} - M\vec{u} \\
L\vec{u}^{(j)} &= \vec{r} - M\vec{u}^{(j-1)} \\
\vec{u}^{(j)} &= L^{-1}\left(\vec{r} - M\vec{u}^{(j-1)}\right)
\end{aligned}
$$
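Slide 20
Parallelization - Solver (III)
$A = L + M$
[Figure: splitting of a tridiagonal 12 x 12 matrix with entries $l_i$, $m_i$, $u_i$ (lower, main and upper diagonal of row $i$). $L$ contains the three tridiagonal 4 x 4 diagonal blocks for rows 1 to 4, 5 to 8 and 9 to 12. $M$ contains only the coupling entries between the blocks: $u_4$, $l_5$, $u_8$, $l_9$]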
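The iteration $L\vec{u}^{(j)} = \vec{r} - M\vec{u}^{(j-1)}$ can be sketched for a tridiagonal matrix split into diagonal blocks: each block is solved locally, and the coupling entries collected in M only use values from the previous iterate. This is a simplified illustration and not the authors' solver; the function names, the Thomas algorithm for the local solves and the test matrix are assumptions.

```python
def thomas(l, m, u, rhs):
    """Solve a tridiagonal system with lower l, main m, upper u diagonal.

    l[0] and u[-1] lie outside the matrix and are ignored.
    """
    n = len(m)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = u[0] / m[0]
    d[0] = rhs[0] / m[0]
    for i in range(1, n):
        denom = m[i] - l[i] * c[i - 1]
        c[i] = u[i] / denom
        d[i] = (rhs[i] - l[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def block_jacobi(l, m, u, r, nblocks, iters):
    """Iterate L x(j) = r - M x(j-1); L is block diagonal, M holds the couplings."""
    n = len(m)
    size = n // nblocks  # assumes nblocks divides n
    x = [0.0] * n
    for _ in range(iters):
        old = x[:]  # previous iterate, the only data needed from neighbours
        for b in range(nblocks):
            lo, hi = b * size, (b + 1) * size
            rhs = r[lo:hi]
            if lo > 0:
                rhs[0] -= l[lo] * old[lo - 1]    # coupling to the block on the left
            if hi < n:
                rhs[-1] -= u[hi - 1] * old[hi]   # coupling to the block on the right
            x[lo:hi] = thomas(l[lo:hi], m[lo:hi], u[lo:hi], rhs)
    return x


def tridiag_matvec(l, m, u, x):
    """Multiply the full tridiagonal matrix with x (used to build a test case)."""
    n = len(m)
    return [
        (l[i] * x[i - 1] if i > 0 else 0.0) + m[i] * x[i]
        + (u[i] * x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


n = 12
l, m, u = [-1.0] * n, [4.0] * n, [-1.0] * n
x_true = [float(i + 1) for i in range(n)]
r = tridiag_matvec(l, m, u, x_true)
x = block_jacobi(l, m, u, r, nblocks=3, iters=60)
print(max(abs(a - b) for a, b in zip(x, x_true)))
```

Each block solve is independent of the others within one iteration, so the three blocks could run on three processes that exchange one value per neighbour and iteration.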
Slide 21
A Real Heptadiagonal Matrix
[Figure: sparsity pattern for a small mesh with cells numbered 1 to 36; the off-diagonal bands are interrupted wherever a cell lies on the mesh boundary and has no neighbour in that direction]
Slide 22
Difference between strong / weak coupling
• Solver with weak coupling
– Extra computational effort due to additional solving step (but no factorization)
– Additional update of right hand side
– Two times communication (one after each solving step)
• Solver with stronger coupling
– Jacobi line relaxation method with subiterations
– collapse iterations from line relaxation and parallelization
– no additional iterations
– much more communication
Slide 23
Comparison of Solvers - Convergence
[Figure: L2 residual versus iterations for Solver 1 and Solver 2]
Slide 24
Comparison of Solvers - Performance
[Figure: time in seconds versus number of processors; curves: strong Ethernet, weak Ethernet, weak HPS, strong HPS]
Slide 25
Results - Solving Method
• The presented solver with weak coupling works well for these CFD problems
• Solutions differ on the order of one percent
• Convergence rate is nearly equal to that of the sequential program
Slide 26
Domain Decomposition
Slide 27
Speedup on Cray T3E
[Figure: speedup versus number of processors for problem sizes 110592, 55296 and 27648]
Slide 28
Scaleup on Cray T3E
[Figure: scaleup versus number of processors]
Domain Decomposition of Unstructured Grids
Slide 30
Unstructured FEM Grid with Global Numbering
[Figure: unstructured FEM grid whose points are numbered globally from 1 to 15]
Slide 31
Shape of Corresponding System Matrix
[Figure: 15 x 15 sparsity pattern of the system matrix, rows and columns numbered 1 to 15; x = edge of FEM grid]
Slide 32
Domain Decomposition
1. Nonoverlapping Domain Decomposition
2. Grid Points Separated into Inner and Boundary Points
3. Renumbering of the Inner and Boundary Points
Slide 33
1. Nonoverlapping Domain Decomposition
[Figure: the FEM grid with global point numbers 1 to 15, divided into the partitions P1, P2 and P3]
Slide 34
2. Grid Points Separated into Inner and Boundary Points
[Figure: the partitions P1, P2 and P3 with their grid points marked as inner points or boundary points]
Slide 35
3. Renumbering of the Inner and Boundary Points
[Figure: renumbered grid; inner points of P1 are numbered 1 to 5, of P2 1 to 2, of P3 1 to 3; boundary points are numbered 6 to 10]
• Inner points are numbered from 1 up to the number of inner points
• Boundary points are numbered globally starting from the maximum of all inner points of all partitions plus 1
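The two numbering rules can be sketched in a few lines. The sketch assumes an element-based nonoverlapping partitioning, in which a point is a boundary point when it belongs to elements of more than one partition; the function name and the small six-point mesh are hypothetical and are not the grid shown in the figures.

```python
def renumber(elements, element_partition):
    """Renumber grid points after a nonoverlapping, element-based partitioning.

    elements: list of tuples of global point numbers
    element_partition: partition id of each element
    Returns (inner numbering per partition, global numbering of boundary points).
    """
    partitions_of_point = {}
    for points, part in zip(elements, element_partition):
        for v in points:
            partitions_of_point.setdefault(v, set()).add(part)

    inner = {}     # partition id -> inner points in ascending old order
    boundary = []  # points shared by at least two partitions
    for v in sorted(partitions_of_point):
        parts = partitions_of_point[v]
        if len(parts) == 1:
            inner.setdefault(next(iter(parts)), []).append(v)
        else:
            boundary.append(v)

    # Inner points: 1 .. number of inner points, separately in each partition.
    new_inner = {p: {v: k + 1 for k, v in enumerate(vs)} for p, vs in inner.items()}
    # Boundary points: start after the largest inner number of any partition.
    start = max((len(vs) for vs in inner.values()), default=0) + 1
    new_boundary = {v: start + k for k, v in enumerate(boundary)}
    return new_inner, new_boundary


# Hypothetical mesh: four triangles on six points, two partitions.
elements = [(1, 2, 4), (2, 5, 4), (2, 3, 5), (3, 6, 5)]
element_partition = [0, 0, 1, 1]
print(renumber(elements, element_partition))
# ({0: {1: 1, 4: 2}, 1: {3: 1, 6: 2}}, {2: 3, 5: 4})
```

Because the boundary numbers start above every inner number, all boundary unknowns end up in the last rows and columns of the system matrix.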
Slide 36
Arrow Shaped System Matrix after Renumbering
[Figure: sparsity pattern of the renumbered system matrix with the block columns P1 (1 to 5), P2 (1 to 2), P3 (1 to 3) and logical boundary (6 to 10); the inner blocks of the partitions lie on the diagonal and are coupled only with the boundary block, which gives the matrix its arrow shape]
Slide 37
Data Distribution
[Figure: the partitions P1, P2 and P3 drawn separately, each with its own inner points and the boundary points adjacent to it]
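Slide 38
Linear System
• Available on local memories:
$$
\begin{pmatrix} A_{II}^{i} & A_{IB}^{i} \\ A_{BI}^{i} & A_{BB}^{i} \end{pmatrix},
\quad i = 1, \dots, n,
\qquad \text{with} \quad A_{BB} = \sum_{i=1}^{n} A_{BB}^{i}
$$
$$
\begin{pmatrix}
A_{II}^{1} & & & A_{IB}^{1} \\
 & A_{II}^{2} & & A_{IB}^{2} \\
 & & \ddots & \vdots \\
 & & A_{II}^{n} & A_{IB}^{n} \\
A_{BI}^{1} & A_{BI}^{2} & \cdots \; A_{BI}^{n} & A_{BB}
\end{pmatrix}
\begin{pmatrix} x_I^{1} \\ x_I^{2} \\ \vdots \\ x_I^{n} \\ x_B \end{pmatrix}
=
\begin{pmatrix} b_I^{1} \\ b_I^{2} \\ \vdots \\ b_I^{n} \\ b_B \end{pmatrix}
$$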
Slide 39
Direct Substructuring
1. Direct factorization
2. Assembling of the Schur complement system
3. Solving of the Schur complement system
4. Solving of the interior unknowns
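Slide 40
Transformations of the Original System
$$
A =
\begin{pmatrix}
I & & & \\
 & \ddots & & \\
 & & I & \\
A_{BI}^{1}(A_{II}^{1})^{-1} & \cdots & A_{BI}^{n}(A_{II}^{n})^{-1} & I
\end{pmatrix}
\begin{pmatrix}
A_{II}^{1} & & & \\
 & \ddots & & \\
 & & A_{II}^{n} & \\
 & & & S^{(1)} + \cdots + S^{(n)}
\end{pmatrix}
\begin{pmatrix}
I & & & (A_{II}^{1})^{-1}A_{IB}^{1} \\
 & \ddots & & \vdots \\
 & & I & (A_{II}^{n})^{-1}A_{IB}^{n} \\
 & & & I
\end{pmatrix}
$$
with
$$
S^{(i)} = A_{BB}^{i} - A_{BI}^{i}\,(A_{II}^{i})^{-1}A_{IB}^{i}
$$
applied to the system
$$
A \begin{pmatrix} x_I^{1} \\ x_I^{2} \\ \vdots \\ x_B \end{pmatrix}
= \begin{pmatrix} b_I^{1} \\ b_I^{2} \\ \vdots \\ b_B \end{pmatrix}
$$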
Slide 41
Schur Complement System
$$
A_{SH}\,x_B = b_{SH}, \qquad A_{II}^{i}\,x_I^{i} = b_I^{i} - A_{IB}^{i}\,x_B
$$
with
$$
A_{SH} = \sum_{i=1}^{n}\left(A_{BB}^{i} - A_{BI}^{i}\,(A_{II}^{i})^{-1}A_{IB}^{i}\right),
\qquad
b_{SH} = b_B - \sum_{i=1}^{n} A_{BI}^{i}\,(A_{II}^{i})^{-1}\,b_I^{i}
$$
or
$$
A_{SH} = \sum_{i=1}^{n}\left(A_{BB}^{i} - A_{BI}^{i}\,Z^{i}\right),
\qquad
b_{SH} = b_B - \sum_{i=1}^{n} A_{BI}^{i}\,z^{i}
$$
where
$$
Z^{i} = (A_{II}^{i})^{-1}A_{IB}^{i} \;\;\text{or}\;\; A_{II}^{i}\,Z^{i} = A_{IB}^{i},
\qquad
z^{i} = (A_{II}^{i})^{-1}\,b_I^{i} \;\;\text{or}\;\; A_{II}^{i}\,z^{i} = b_I^{i}
$$
Slide 42
1. Direct factorization
[Figure: data distribution over P1, P2 and P3; on each process $i$: $A_{II}^{i}\,Z^{i} = A_{IB}^{i}$ and $A_{II}^{i}\,z^{i} = b_I^{i}$]
Slide 43
2. Assembling of the Schur complement system (I)
[Figure: data distribution over P1, P2 and P3; on each process $i$: $A_{BB}^{i} - A_{BI}^{i}\,Z^{i}$ and $A_{BI}^{i}\,z^{i}$]
Slide 44
2. Assembling of the Schur complement system (II)
[Figure: data distribution over P1, P2 and P3]
$$
A_{SH} = \sum_{i=1}^{n}\left(A_{BB}^{i} - A_{BI}^{i}\,Z^{i}\right),
\qquad
b_{SH} = b_B - \sum_{i=1}^{n} A_{BI}^{i}\,z^{i}
$$
Slide 45
3. Solving of the Schur complement system
[Figure: data distribution over P1, P2 and P3]
$$
A_{SH}\,x_B = b_{SH}
$$
Slide 46
4. Solving of the interior unknowns
[Figure: data distribution over P1, P2 and P3; on each process $i$: $A_{II}^{i}\,x_I^{i} = b_I^{i} - A_{IB}^{i}\,x_B$]
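The four steps of direct substructuring can be sketched with small dense matrices. This is an illustration under stated assumptions, not the authors' implementation: all names are hypothetical, matrices are lists of rows, vectors are one-column matrices, and the local factorization of step 1 is replaced by a fresh Gaussian elimination in every solve. The loop over subdomains is what the processes would execute in parallel.

```python
def solve(a, b):
    """Solve a X = b by Gaussian elimination with partial pivoting."""
    n, k = len(a), len(b[0])
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda row: abs(aug[row][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for row in range(c + 1, n):
            f = aug[row][c] / aug[c][c]
            for t in range(c, n + k):
                aug[row][t] -= f * aug[c][t]
    x = [[0.0] * k for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(k):
            s = aug[i][n + j] - sum(aug[i][t] * x[t][j] for t in range(i + 1, n))
            x[i][j] = s / aug[i][i]
    return x


def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def substructure_solve(subdomains, b_B):
    """subdomains: dicts with A_II, A_IB, A_BI, A_BB, b_I. Returns (x_I list, x_B)."""
    nb = len(b_B)
    A_SH = [[0.0] * nb for _ in range(nb)]
    b_SH = [row[:] for row in b_B]
    for s in subdomains:
        # Step 1: local solves A_II Z = A_IB and A_II z = b_I
        Z = solve(s["A_II"], s["A_IB"])
        z = solve(s["A_II"], s["b_I"])
        # Step 2: add the local contribution to the Schur complement system
        A_SH = add(A_SH, sub(s["A_BB"], matmul(s["A_BI"], Z)))
        b_SH = sub(b_SH, matmul(s["A_BI"], z))
    # Step 3: solve the Schur complement system for the boundary unknowns
    x_B = solve(A_SH, b_SH)
    # Step 4: local solves A_II x_I = b_I - A_IB x_B for the interior unknowns
    x_I = [solve(s["A_II"], sub(s["b_I"], matmul(s["A_IB"], x_B))) for s in subdomains]
    return x_I, x_B


# Two subdomains with one inner unknown each and one shared boundary unknown.
sub1 = {"A_II": [[2.0]], "A_IB": [[-1.0]], "A_BI": [[-1.0]], "A_BB": [[1.0]], "b_I": [[-1.0]]}
sub2 = {"A_II": [[2.0]], "A_IB": [[-1.0]], "A_BI": [[-1.0]], "A_BB": [[1.0]], "b_I": [[1.0]]}
x_I, x_B = substructure_solve([sub1, sub2], [[3.0]])
print(x_I, x_B)  # the exact solution is x_I = 1 and 2, x_B = 3
```

Only step 2 (the sum over all subdomains) and step 3 need data from more than one process; steps 1 and 4 are purely local.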
Slide 47
Parallel Computations on the Subdomains
Slide 48
Overlapping Domain Decomposition
[Figure: grid with points numbered 1 to 18, shown once with global numbering and once decomposed into the overlapping subdomains $\Omega_1$ and $\Omega_2$; the curves $\gamma_1$ and $\gamma_2$ are marked in the overlap region]