Paolo Miocchi
in collaboration with
R. Capuzzo-Dolcetta, P. Di Matteo, A. Vicari
Dept. of Physics, Univ. of Rome “La Sapienza” (Rome, Italy)
Work supported by the INAF-CINECA agreement (http://inaf.cineca.it, grant inarm033).
The use of High Performance Computing in Astrophysics: an experience report
The needs of HPC in Globular Cluster dynamics
Theoretical study of a system made up of N ~ 10^5 – 10^7 gravitationally bound stars (a self-gravitating system).
O(N^2) force computations to do.
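The O(N^2) cost can be made concrete with a minimal direct-summation sketch in Python (a generic illustration, not the code described in these slides; the softening length eps is an assumed parameter):

```python
import numpy as np

def direct_forces(pos, mass, G=1.0, eps=0.0):
    """Direct-summation gravitational accelerations: O(N^2) pair interactions.

    pos  : (N, 3) particle positions
    mass : (N,) particle masses
    eps  : softening length (assumed parameter, avoids force divergences)
    """
    n = len(pos)
    acc = np.zeros_like(pos)
    for i in range(n):
        d = pos - pos[i]                    # separation vectors to all particles
        r2 = (d * d).sum(axis=1) + eps**2   # softened squared distances
        r2[i] = np.inf                      # exclude the self-interaction
        acc[i] = G * (mass[:, None] * d / r2[:, None]**1.5).sum(axis=0)
    return acc
```

Doubling N quadruples the work, which is why N ~ 10^6 stars is out of reach without an approximate method.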
Gravity is a long-range and attractive force, hence:
- very unstable dynamical states;
- inhomogeneous mass distributions, i.e. a very wide range of time-scales, ~ (Gρ)^(–1/2): a numerically 'expensive' time integration of particle motion, for which individual and variable time-steps should be adopted;
- fully 3-D problems: an arduous analytical approach!
Dynamical evolution of self-gravitating systems with N > 10^5 stars: more than tens of Gflops needed, so code PARALLELIZATION is required!
The tree-code
Force exerted on a particle of mass m by a distant box of n particles, at relative position r from the box centre of mass:
F(r) ≈ G m [ M r / r^3 + Q·r / r^5 − (5/2) (r·Q·r) r / r^7 ]
with M = total mass of the box and Q = its quadrupole tensor: the computational cost is independent of n.
See Barnes & Hut 1986, Nature 324, 446.
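The gain can be sketched in Python, truncated at the monopole term for brevity (an illustration, not the authors' code): a far box of n particles is replaced by its total mass M at the centre of mass, so one box interaction no longer costs O(n).

```python
import numpy as np

def box_force_monopole(r_p, box_pos, box_mass, G=1.0):
    """Monopole approximation: force per unit mass at r_p from a distant box,
    using only its total mass M and centre of mass. Cost independent of n."""
    M = box_mass.sum()
    r_cm = (box_mass[:, None] * box_pos).sum(axis=0) / M
    d = r_cm - r_p
    r = np.linalg.norm(d)
    return G * M * d / r**3

def box_force_direct(r_p, box_pos, box_mass, G=1.0):
    """Exact O(n) sum over the box particles, for comparison."""
    d = box_pos - r_p
    r = np.linalg.norm(d, axis=1)
    return G * (box_mass[:, None] * d / r[:, None]**3).sum(axis=0)
```

For a well-separated box the relative error is of order (box size / distance)^2, which the quadrupole term Q reduces further.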
Recursive subdivision of space into 'boxes' gives the 'tree' logical structure: each node corresponds to a box.
Multipolar coefficients are evaluated for each box: O(N log N) computations in all.
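The recursive subdivision can be sketched as follows (Python; an illustrative octree with monopole coefficients only, and leaf_max as an assumed terminal-box threshold):

```python
import numpy as np

class Box:
    """A tree node: a cubic region with its particle indices and monopole data."""
    def __init__(self, center, size, idx):
        self.center, self.size, self.idx = center, size, idx
        self.children = []
        self.M = 0.0      # total mass of the box (monopole coefficient)
        self.cm = None    # centre of mass of the box

def build_tree(pos, mass, center, size, idx=None, leaf_max=8):
    """Recursively subdivide a cubic box into 8 sub-boxes until each
    terminal box holds at most leaf_max particles."""
    if idx is None:
        idx = np.arange(len(pos))
    box = Box(center, size, idx)
    box.M = mass[idx].sum()
    if box.M > 0:
        box.cm = (mass[idx][:, None] * pos[idx]).sum(axis=0) / box.M
    if len(idx) > leaf_max:
        # assign each particle to one of 8 octants by comparison with the centre
        octant = ((pos[idx] >= center) * np.array([1, 2, 4])).sum(axis=1)
        for o in range(8):
            sub = idx[octant == o]
            if len(sub):
                offs = np.array([(o >> k) & 1 for k in range(3)]) - 0.5
                box.children.append(build_tree(pos, mass, center + offs * size / 2,
                                               size / 2, sub, leaf_max))
    return box
```

Evaluating the coefficients while descending costs O(N log N) overall, as stated above.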
Problems in the tree-code parallelization
Gravity is a long-range interaction: inter-processor data transfer is unavoidable (a heavy overhead on distributed-memory platforms, DMP).
Inhomogeneous mass distributions: the assignment of particles to PEs has to be done according to the work-load.
Hierarchical force evaluation: most force contributions come from the closer bodies, which suggests a spatial domain decomposition.
The 'Adaptive Tree Decomposition' (ATD) method
Domain decomposition is performed 'on the fly' during the tree construction, at a low computational cost.
The adaptivity of the tree structure is exploited to obtain good load-balancing and data-locality in the force evaluation.
The locally essential tree is built 'dynamically' during the tree-walking: remote boxes are linked only when really needed.
Two different parallelization strategies:
UPPER-TREE: many boxes with few particles inside.
LOWER-TREE: few boxes containing many particles.
see Miocchi & Capuzzo-Dolcetta 2002, A&A 382, 758
Some definitions
UPPER-tree: made up of boxes with fewer than kp particles inside;
LOWER-tree: made up of boxes with more than kp particles;
a pseudo-terminal (PTERM) box is a box in the upper-tree whose 'parent box' is in the lower-tree;
here p = number of processors and k = a fixed coefficient.
Load balancing: in this stage it is ensured by setting k sufficiently large that a box always contains many more particles than there are processors.
Parallelization of the lower-tree construction
1. Preliminary 'random' distribution of the particles to the PEs.
2. All PEs work, starting from the root box, constructing the same lower boxes in synchrony (by a recursive procedure).
3. When a PTERM box is found, it is assigned to a certain PE (so as to preserve good load-balancing in the subsequent force evaluation) and no further 'branches' are built up.
Domain decomposition: communications among PEs during the tree-walking are minimized by the particular order in which the PTERM boxes are met. The lower-tree is stored in the local memories of ALL PEs.
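The PTERM assignment of step 3 can be sketched as a greedy balance over the traversal order (Python; a simplified illustration of the idea, not the exact criterion used by the ATD code):

```python
def assign_pterm_boxes(pterm_counts, n_pes):
    """Assign PTERM boxes, met in tree-traversal order, to processors so that
    each PE receives a nearly equal share of the particles. Because the boxes
    are taken contiguously in traversal order, each PE's spatial domain stays
    (nearly) contiguous, which limits inter-PE data transfer.

    pterm_counts : particle count of each PTERM box, in the order they are met
    returns      : owner PE index for each PTERM box
    """
    total = sum(pterm_counts)
    owner, pe, done = [], 0, 0
    for c in pterm_counts:
        owner.append(pe)
        done += c
        # advance to the next PE once its cumulative share is reached
        if pe < n_pes - 1 and done >= total * (pe + 1) / n_pes:
            pe += 1
    return owner
```

A real code would weight each box by an estimated force-evaluation cost rather than by a plain particle count.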
PTERM order
Example of a uniform 2-D distribution with PTERM boxes at the 3rd subdivision level: every spatial domain is (nearly) contiguous, so the data transfer among PEs is minimized.
Example of domain decomposition: Plummer distribution of 16K particles on 4 processors.
Parallelization of the upper-tree construction
The PTERM boxes have already been distributed to the PEs. Each PE works independently and asynchronously, starting from every PTERM box in its domain and building the descendant portion of the upper-tree, down to the terminal boxes.
Parallelization of the tree-walking
Each PE evaluates independently the forces on the particles belonging to its domain (i.e. those contained in the PTERM boxes previously assigned to it).
Each PE keeps in its memory the local tree, i.e. the whole lower-tree plus the portion of the upper-tree descending from the PTERM boxes of its domain.
When a 'remote' box is met, it is linked to the local tree by copying it into local memory.
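The walk can be sketched with the usual opening criterion (Python; illustrative, monopole only, with a hand-built Box type; θ plays the role of the opening angle):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Box:
    M: float                 # total mass of the box
    cm: np.ndarray           # centre of mass
    size: float              # box side length
    children: list = field(default_factory=list)

def tree_walk_acc(box, r, theta=0.7):
    """Hierarchical force evaluation: a box is accepted (its monopole used)
    when it subtends an angle smaller than theta, otherwise it is opened and
    its children are visited. In the parallel code, a remote box met here
    would first be copied into local memory (the locally essential tree)."""
    d = box.cm - r
    dist = np.linalg.norm(d)
    if not box.children or box.size < theta * dist:
        return box.M * d / dist**3 if dist > 0 else np.zeros(3)
    return sum(tree_walk_acc(c, r, theta) for c in box.children)
```

Setting theta = 0 forces the walk down to the terminal boxes and recovers the direct sum, which is a convenient accuracy check.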
Code performance on an IBM SP4
Performance on one 'main' time-step (T) with complete force evaluation and time integration of motion, for a self-gravitating system with N = 10^6 particles.
WARNING: each particle has its own variable time-step, depending on the local mass density and the typical velocity. Time-steps follow the block scheme (Aarseth 1985): a particle's step can be T/2^n. Dynamical tree reconstruction is implemented: the tree is re-built whenever the number of interactions evaluated exceeds N/10 (Springel et al. 2001, New Astr., 6, 51).
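The block scheme can be sketched as a quantization of each particle's desired step onto power-of-two fractions of the main step T (Python; n_max = 8 is an assumed cap, matching the deepest level T/256 reached in these runs):

```python
import math

def block_step(dt_required, T, n_max=8):
    """Quantize an individually required time-step onto the block scheme
    (Aarseth 1985): return the largest T / 2**n, n = 0..n_max, that does not
    exceed dt_required, so that all particle steps nest inside the main
    step T and particles on the same level stay synchronized."""
    if dt_required >= T:
        return T
    n = math.ceil(math.log2(T / dt_required))
    return T / 2 ** min(n, n_max)
```

A particle in a dense, fast region thus gets a small step while a halo particle keeps the full step T, instead of every particle paying for the smallest step in the system.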
Particle time-step distribution: histogram of the number of particles, log(n), per time-step bin from T down to T/256; in all, 2,100,000 particle time-advances were performed.
CPU-time (sec) on one 'main' time-step with complete force evaluation and time integration of motion, for a self-gravitating system with N = 10^6 particles (θ = 0.7, k = 256, up to 16 PEs per node): about 25,000 particles advanced per second.
The speedup behaviour is very good up to 16 PEs (speedup ≈ 10).
The load unbalancing is low (10% with 64 PEs). Data transfer and communications still penalize the overall performance at low N/PEs ratios (34% with 64 PEs).
An MPI-2 version could fully exploit the ATD parallelization strategy.
Merging of Globular Clusters in galactic central regions
Motivation: the study of the dynamical evolution and the fate of young GCs within the bulge.
To what extent can GCs survive the strong tidal interaction with the bulge? Do they merge in the end? What features will the final merging product have? To what extent can the bulge accrete the mass lost by the GCs?
30,000 CPU-hours on an IBM SP4 provided by the INAF-CINECA agreement for a scientific ‘key-project’ (under grant inarm033)
Features of the numerical approach
Accurate N-body (tree-code) simulations with a high number of 'particles' (10^6).
Dynamical friction and a stellar mass function are included, together with a self-consistent triaxial bulge model (Schwarzschild).
Initial conditions of the simulated clusters (Simulation A: clusters a, b; Simulation B: clusters c, d, with higher concentration):

Simulation  cluster  M (10^6 M☉)  c      rc (pc)  tcr (Kyr)  σ (km/s)
A           a        2.0          0.895  14       170        33
A           b        1.5          0.972  9        100        33
B           c        2.0          1.298  5.5      42         37
B           d        1.5          1.377  3.8      28         37
Quasi-radial orbits: the clusters cross each other at every passage (twice per period). Plot of the cluster positions, x (pc), versus time, t (Myr).
Tidal tails structure and formation
'Tidal tails' around Pal 5 (after Odenkirchen et al. 2002) compared with our simulation of a cluster on a circular orbit: the tidal tails are reproduced by the simulation.
'Ripples' around a cluster in our simulations resemble the 'ripples' observed around NGC 3923.
What are these 'ripples'? How do they form? 3-D visualization tools can help to give answers!
Density profiles of the most compact cluster (solid lines), fitted with a single-mass King model (dotted lines), at t = 0 and t = 17 Myr; the least compact cluster is shown at t = 15 Myr (dashed black line: bulge central density). The tidal tails are visible in the outer profile.
Fraction of mass lost
ε_p = fraction of mass lost, counting the stars where ρ/ρ_c < p/100 (ρ_c = central cluster density, ρ = bulge stellar density);
ε_E = fraction of mass lost, counting the stars with E_i > 0;
shown as a function of the cluster concentration, c = 0.8, 0.9, 1.2, 1.3.