Paolo Miocchi
in collaboration with
R. Capuzzo-Dolcetta, P. Di Matteo, A. Vicari
Dept. of Physics, Univ. of Rome “La Sapienza” (Rome, Italy)
Work supported by the INAF-CINECA agreement (http://inaf.cineca.it, grant inarm033).
The use of High Performance Computing in Astrophysics: an experience report
The needs of HPC in Globular Cluster dynamics
Theoretical study of a system made up of N ~ 10^5 – 10^7 gravitationally bound stars (a self-gravitating system).
O(N^2) force computations to do.
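The O(N^2) cost can be made concrete with a minimal direct-summation sketch in Python (a generic illustration, not the code described in these slides; the softening length eps is an assumed parameter):

```python
import numpy as np

def direct_forces(pos, mass, G=1.0, eps=0.0):
    """Direct-summation gravitational accelerations: O(N^2) pair interactions.

    pos  : (N, 3) particle positions
    mass : (N,) particle masses
    eps  : softening length (assumed parameter, avoids force divergences)
    """
    n = len(pos)
    acc = np.zeros_like(pos)
    for i in range(n):
        d = pos - pos[i]                    # separation vectors to all particles
        r2 = (d * d).sum(axis=1) + eps**2   # softened squared distances
        r2[i] = np.inf                      # exclude the self-interaction
        acc[i] = G * (mass[:, None] * d / r2[:, None]**1.5).sum(axis=0)
    return acc
```

Doubling N quadruples the work, which is why N ~ 10^6 stars is out of reach without an approximate method.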
Gravity is a long-range and attractive force, hence:
- very unstable dynamical states;
- inhomogeneous mass distributions, i.e. a very wide range of time-scales, ~ (Gρ)^(–1/2): a numerically 'expensive' time integration of particle motion, for which individual and variable time-steps should be adopted;
- fully 3-D problems: an arduous analytical approach!
Dynamical evolution of self-gravitating systems with N > 10^5 stars: more than tens of Gflops needed, so code PARALLELIZATION is required!
The tree-code
Force exerted on a particle of mass m by a distant box of n particles, at relative position r from the box centre of mass:
F(r) ≈ G m [ M r / r^3 + Q·r / r^5 − (5/2) (r·Q·r) r / r^7 ]
with M = total mass of the box and Q = its quadrupole tensor: the computational cost is independent of n.
See Barnes & Hut 1986, Nature 324, 446.
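The gain can be sketched in Python, truncated at the monopole term for brevity (an illustration, not the authors' code): a far box of n particles is replaced by its total mass M at the centre of mass, so one box interaction no longer costs O(n).

```python
import numpy as np

def box_force_monopole(r_p, box_pos, box_mass, G=1.0):
    """Monopole approximation: force per unit mass at r_p from a distant box,
    using only its total mass M and centre of mass. Cost independent of n."""
    M = box_mass.sum()
    r_cm = (box_mass[:, None] * box_pos).sum(axis=0) / M
    d = r_cm - r_p
    r = np.linalg.norm(d)
    return G * M * d / r**3

def box_force_direct(r_p, box_pos, box_mass, G=1.0):
    """Exact O(n) sum over the box particles, for comparison."""
    d = box_pos - r_p
    r = np.linalg.norm(d, axis=1)
    return G * (box_mass[:, None] * d / r[:, None]**3).sum(axis=0)
```

For a well-separated box the relative error is of order (box size / distance)^2, which the quadrupole term Q reduces further.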
Recursive subdivision of space into 'boxes' gives the 'tree' logical structure: each node corresponds to a box.
Multipolar coefficients are evaluated for each box: O(N log N) computations in all.
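The recursive subdivision can be sketched as follows (Python; an illustrative octree with monopole coefficients only, and leaf_max as an assumed terminal-box threshold):

```python
import numpy as np

class Box:
    """A tree node: a cubic region with its particle indices and monopole data."""
    def __init__(self, center, size, idx):
        self.center, self.size, self.idx = center, size, idx
        self.children = []
        self.M = 0.0      # total mass of the box (monopole coefficient)
        self.cm = None    # centre of mass of the box

def build_tree(pos, mass, center, size, idx=None, leaf_max=8):
    """Recursively subdivide a cubic box into 8 sub-boxes until each
    terminal box holds at most leaf_max particles."""
    if idx is None:
        idx = np.arange(len(pos))
    box = Box(center, size, idx)
    box.M = mass[idx].sum()
    if box.M > 0:
        box.cm = (mass[idx][:, None] * pos[idx]).sum(axis=0) / box.M
    if len(idx) > leaf_max:
        # assign each particle to one of 8 octants by comparison with the centre
        octant = ((pos[idx] >= center) * np.array([1, 2, 4])).sum(axis=1)
        for o in range(8):
            sub = idx[octant == o]
            if len(sub):
                offs = np.array([(o >> k) & 1 for k in range(3)]) - 0.5
                box.children.append(build_tree(pos, mass, center + offs * size / 2,
                                               size / 2, sub, leaf_max))
    return box
```

Evaluating the coefficients while descending costs O(N log N) overall, as stated above.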
Problems in the tree-code parallelization
Gravity is a long-range interaction: inter-processor data transfer is unavoidable (a heavy overhead on distributed-memory platforms, DMP).
Inhomogeneous mass distributions: the assignment of particles to PEs has to be done according to the work-load.
Hierarchical force evaluation: most force contributions come from the closer bodies, which suggests a spatial domain decomposition.
The 'Adaptive Tree Decomposition' (ATD) method
Domain decomposition is performed 'on the fly' during the tree construction, at a low computational cost.
The adaptivity of the tree structure is exploited to obtain good load-balancing and data-locality in the force evaluation.
The locally essential tree is built 'dynamically' during the tree-walking: remote boxes are linked only when really needed.
Two different parallelization strategies:
UPPER-TREE: many boxes with few particles inside.
LOWER-TREE: few boxes containing many particles.
see Miocchi & Capuzzo-Dolcetta 2002, A&A 382, 758
Some definitions
UPPER-tree: made up of boxes with fewer than kp particles inside;
LOWER-tree: made up of boxes with more than kp particles;
a pseudo-terminal (PTERM) box is a box in the upper-tree whose 'parent box' is in the lower-tree;
here p = number of processors and k = a fixed coefficient.
Load balancing: in this stage it is ensured by setting k sufficiently large that a box always contains many more particles than there are processors.
Parallelization of the lower-tree construction
1. Preliminary 'random' distribution of the particles to the PEs.
2. All PEs work, starting from the root box, constructing the same lower boxes in synchrony (by a recursive procedure).
3. When a PTERM box is found, it is assigned to a certain PE (so as to preserve good load-balancing in the subsequent force evaluation) and no further 'branches' are built up.
Domain decomposition: communications among PEs during the tree-walking are minimized by the particular order in which the PTERM boxes are met. The lower-tree is stored in the local memories of ALL PEs.
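The PTERM assignment of step 3 can be sketched as a greedy balance over the traversal order (Python; a simplified illustration of the idea, not the exact criterion used by the ATD code):

```python
def assign_pterm_boxes(pterm_counts, n_pes):
    """Assign PTERM boxes, met in tree-traversal order, to processors so that
    each PE receives a nearly equal share of the particles. Because the boxes
    are taken contiguously in traversal order, each PE's spatial domain stays
    (nearly) contiguous, which limits inter-PE data transfer.

    pterm_counts : particle count of each PTERM box, in the order they are met
    returns      : owner PE index for each PTERM box
    """
    total = sum(pterm_counts)
    owner, pe, done = [], 0, 0
    for c in pterm_counts:
        owner.append(pe)
        done += c
        # advance to the next PE once its cumulative share is reached
        if pe < n_pes - 1 and done >= total * (pe + 1) / n_pes:
            pe += 1
    return owner
```

A real code would weight each box by an estimated force-evaluation cost rather than by a plain particle count.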
PTERM order
Example of a uniform 2-D distribution with PTERM boxes at the 3rd subdivision level: every spatial domain is (nearly) contiguous, so the data transfer among PEs is minimized.
Example of domain decomposition: Plummer distribution of 16K particles on 4 processors.
Parallelization of the upper-tree construction
The PTERM boxes have already been distributed to the PEs. Each PE works independently and asynchronously, starting from every PTERM box in its domain and building the descendant portion of the upper-tree, down to the terminal boxes.
Parallelization of the tree-walking
Each PE evaluates independently the forces on the particles belonging to its domain (i.e. those contained in the PTERM boxes previously assigned to it).
Each PE keeps in its memory the local tree, i.e. the whole lower-tree plus the portion of the upper-tree descending from the PTERM boxes of its domain.
When a 'remote' box is met, it is linked to the local tree by copying it into local memory.
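The walk can be sketched with the usual opening criterion (Python; illustrative, monopole only, with a hand-built Box type; θ plays the role of the opening angle):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Box:
    M: float                 # total mass of the box
    cm: np.ndarray           # centre of mass
    size: float              # box side length
    children: list = field(default_factory=list)

def tree_walk_acc(box, r, theta=0.7):
    """Hierarchical force evaluation: a box is accepted (its monopole used)
    when it subtends an angle smaller than theta, otherwise it is opened and
    its children are visited. In the parallel code, a remote box met here
    would first be copied into local memory (the locally essential tree)."""
    d = box.cm - r
    dist = np.linalg.norm(d)
    if not box.children or box.size < theta * dist:
        return box.M * d / dist**3 if dist > 0 else np.zeros(3)
    return sum(tree_walk_acc(c, r, theta) for c in box.children)
```

Setting theta = 0 forces the walk down to the terminal boxes and recovers the direct sum, which is a convenient accuracy check.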
Code performance on an IBM SP4
Performance on one 'main' time-step (T) with complete force evaluation and time integration of motion, for a self-gravitating system with N = 10^6 particles.
WARNING: each particle has its own variable time-step, depending on the local mass density and the typical velocity. Time-steps follow the block scheme (Aarseth 1985): a particle's step can be T/2^n. Dynamical tree reconstruction is implemented: the tree is re-built whenever the number of interactions evaluated exceeds N/10 (Springel et al. 2001, New Astr., 6, 51).
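The block scheme can be sketched as a quantization of each particle's desired step onto power-of-two fractions of the main step T (Python; n_max = 8 is an assumed cap, matching the deepest level T/256 reached in these runs):

```python
import math

def block_step(dt_required, T, n_max=8):
    """Quantize an individually required time-step onto the block scheme
    (Aarseth 1985): return the largest T / 2**n, n = 0..n_max, that does not
    exceed dt_required, so that all particle steps nest inside the main
    step T and particles on the same level stay synchronized."""
    if dt_required >= T:
        return T
    n = math.ceil(math.log2(T / dt_required))
    return T / 2 ** min(n, n_max)
```

A particle in a dense, fast region thus gets a small step while a halo particle keeps the full step T, instead of every particle paying for the smallest step in the system.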
Particle time-step distribution: histogram of the number of particles, log(n), per time-step bin from T down to T/256; in all, 2,100,000 particle time-advances were performed.
CPU-time (sec) on one 'main' time-step with complete force evaluation and time integration of motion, for a self-gravitating system with N = 10^6 particles (θ = 0.7, k = 256, up to 16 PEs per node): about 25,000 particles advanced per second.
The speedup behaviour is very good up to 16 PEs (speedup ≈ 10).
The load unbalancing is low (10% with 64 PEs). Data transfer and communications still penalize the overall performance at low N/PEs ratios (34% with 64 PEs).
An MPI-2 version could fully exploit the ATD parallelization strategy.
Merging of Globular Clusters in galactic central regions
Motivation: the study of the dynamical evolution and the fate of young GCs within the bulge.
To what extent can GCs survive the strong tidal interaction with the bulge? Do they merge in the end? What features will the final merging product have? To what extent can the bulge accrete the mass lost by the GCs?
30,000 CPU-hours on an IBM SP4 provided by the INAF-CINECA agreement for a scientific ‘key-project’ (under grant inarm033)
Features of the numerical approach
Accurate N-body (tree-code) simulations with a high number of 'particles' (10^6).
Dynamical friction and a stellar mass function are included, together with a self-consistent triaxial bulge model (Schwarzschild).
Initial conditions of the simulated clusters (Simulation A: clusters a, b; Simulation B: clusters c, d, with higher concentration):

Simulation  cluster  M (10^6 M☉)  c      rc (pc)  tcr (Kyr)  σ (km/s)
A           a        2.0          0.895  14       170        33
A           b        1.5          0.972  9        100        33
B           c        2.0          1.298  5.5      42         37
B           d        1.5          1.377  3.8      28         37
Quasi-radial orbits: the clusters cross each other at every passage (twice per period). Plot of the cluster positions, x (pc), versus time, t (Myr).
Tidal tails structure and formation
'Tidal tails' around Pal 5 (after Odenkirchen et al. 2002) compared with our simulation of a cluster on a circular orbit: the tidal tails are reproduced by the simulation.
'Ripples' around a cluster in our simulations resemble the 'ripples' observed around NGC 3923.
What are these 'ripples'? How do they form? 3-D visualization tools can help to give answers!
Density profiles of the most compact cluster (solid lines), fitted with a single-mass King model (dotted lines), at t = 0 and t = 17 Myr; the least compact cluster is shown at t = 15 Myr (dashed black line: bulge central density). The tidal tails are visible in the outer profile.
Fraction of mass lost
ε_p = fraction of mass lost, counting the stars where ρ/ρ_c < p/100 (ρ_c = central cluster density, ρ = bulge stellar density);
ε_E = fraction of mass lost, counting the stars with E_i > 0;
shown as a function of the cluster concentration, c = 0.8, 0.9, 1.2, 1.3.