Magneto-hydrodynamics simulation in astrophysics
by
Bijia Pang
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Physics
University of Toronto

Copyright © 2011 by Bijia Pang
Abstract
Magneto-hydrodynamics simulation in astrophysics
Bijia Pang
Doctor of Philosophy
Graduate Department of Physics
University of Toronto
2011
Magnetohydrodynamics (MHD) studies the dynamics of an electrically conducting fluid
under the influence of a magnetic field. Many astrophysical phenomena are related to
MHD, and computer simulations are used to model these dynamics. In this thesis, we
conduct MHD simulations of non-radiative black hole accretion as well as fast magnetic
reconnection. By performing large scale three dimensional parallel MHD simulations on
supercomputers and using a deformed-mesh algorithm, we were able to conduct very high
dynamical range simulations of black hole accretion of Sgr A* at the Galactic Center.
We find a generic set of solutions, and make specific predictions for currently feasible
observations of rotation measure (RM). The magnetized accretion flow is subsonic and
lacks outward convection flux, making the accretion rate very small and having a density
slope of around −1. There is no tendency for the flows to become rotationally supported,
and the slow time variability of the RM is a key quantitative signature of this accretion
flow.
We also provide a constructive numerical example of fast magnetic reconnection in a
three-dimensional periodic box. Reconnection is initiated by a strong, localized perturba-
tion to the field lines and the solution is intrinsically three-dimensional. Approximately
30% of the magnetic energy is released in an event which lasts about one Alfven time,
but only after a delay during which the field lines evolve into a critical configuration. In
the co-moving frame of the reconnection regions, reconnection occurs through an X-like
point, analogous to the Petschek reconnection. The dynamics appear to be driven by
global flows rather than local processes.
In addition to issues pertaining to physics, we present results on the acceleration of
MHD simulations using heterogeneous computing systems [83]. We have implemented
the MHD code on a variety of heterogeneous and multi-core architectures (multi-core x86,
Cell, Nvidia and ATI GPU) using different languages (FORTRAN, C, Cell, CUDA and
OpenCL). Initial performance results for these systems are presented, and we conclude
that substantial gains in performance over traditional systems are possible. In particular,
it is possible to extract a greater percentage of peak theoretical performance from some
heterogeneous systems when compared to x86 architectures.
Acknowledgements
It is a pleasure to thank the many people who made this thesis possible.
First I would like to thank my supervisor, Prof. Ue-Li Pen, for his enthusiastic
teaching and inspiring guiding throughout my Ph.D. study.
I want to thank my committee members, Prof. Christopher D. Matzner, Prof.
Stephen W. Morris, Prof. Sabine Stanley, and Prof. Ralph E. Pudritz for their in-
teresting questions and helpful suggestions for the thesis. Specially, I am grateful to
Prof. Matzner, who has devoted a lot of time on my project. Discussion with him always
enlightens me on the research.
I want to thank Kiyoshi Masui and Joachim Harnois-Deraps for editing my draft, and
Gregory Paciga for correcting my presentation slides.
I also want to thank my friends, Xingxing Xing, Bin Guo, Sing-Leung Cheung, Chao
Zhuang, Nan Chen, Jing Wang, Lu Wang, MinXue Liu, Xiaomin Du, Xingyu Liu, and
Jun Hong Liang, who made my life not that boring during my Ph.D. study.
Finally, and most importantly, I want to thank my parents, Li Qin and Li Pang. I
thank them for bringing me to this colourful world, raising me, supporting me, and loving
me. I dedicate this thesis to them.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Black hole accretion . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Fast magnetic reconnection . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Accelerate MHD simulation . . . . . . . . . . . . . . . . . . . . . 12
1.3 MHD equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 The properties of MHD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Frozen-in effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2 Magnetic energy and stress . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Tools for the research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Black hole accretion 20
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Constraining the accretion flow . . . . . . . . . . . . . . . . . . . 21
2.2 Simulation detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Physical setup and dimensionless physical parameters . . . . . . . 24
2.2.2 Grid setup and numerical parameters . . . . . . . . . . . . . . . . 26
2.3 Simulations and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Character of saturated accretion flows . . . . . . . . . . . . . . . 29
2.4 Rotation measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Observational Probes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 Fast magnetic reconnection 48
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 Simulation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Physical setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Numerical setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1 Global fast magnetic reconnection . . . . . . . . . . . . . . . . . . 52
3.3.2 What happens on the current sheet? . . . . . . . . . . . . . . . . 58
3.3.3 What happens globally? . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5 Ideal vs resistive MHD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4 Accelerate MHD 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 The algorithms of MHD . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Implementation on heterogeneous systems . . . . . . . . . . . . . . . . . 74
4.3.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.2 Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 Nvidia GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.4 ATI GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Comparative Results and Discussion . . . . . . . . . . . . . . . . . . . . 83
4.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 Conclusion 91
A Rotation measure constraint on accretion flow 101
B Inner boundary conditions 105
B.0.1 Magnetic field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
B.0.2 Density and pressure . . . . . . . . . . . . . . . . . . . . . . . . . 108
C Supporting Movie for black hole accretion 110
Bibliography 110
List of Tables
2.1 Simulations described in this paper. Columns: Run number; Maximum
resolution relative to the Bondi radius; Radial dynamic range within RB;
grid expansion factor within RB; effective resolution at RB; magnetization
parameter; rotation parameter; range of simulation times over which flow
properties were measured; mean mass accretion rate over this period; and
typical density power law slope (ρ ∝ r^-k) over this period. . . . . . . . . 29
4.1 Performance on the multi-core x86 for different box sizes; timings in mil-
liseconds. x86(1) refers to single-core performance; x86(8) to all eight cores. 76
4.2 Cell performance while using PPE or varying numbers of SPEs for different
box sizes; timings in milliseconds. . . . . . . . . . . . . . . . . . . . . . 78
4.3 x86 vs NVidia GPU performance for different box sizes; timings in mil-
liseconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 x86 vs ATI GPU performance for different box sizes; timings in millisec-
onds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Performance comparison for different architectures; timings in millisec-
onds. N-GPU represents Fermi; A-GPU represents ATI HD5870; peak
Gflops represents theoretical peak floating-point performance; peak GB/s
represents the theoretical on-chip bandwidth. . . . . . . . . . . . . . . . 85
List of Figures
1.1 X-ray image of Sgr A*. The luminosity of the supermassive black hole is
10^8 times dimmer than simple theoretical predictions. NASA/CXC/MIT/F.K.
Baganoff et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Image of Submillimeter Array (SMA). Successful measurements of RM
have been done by [56] using Submillimeter Array in 2006. Image courtesy
SMA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Image of a solar flare. The time scale of the solar flare is 10^5 times faster
than the theoretical (Sweet-Parker) model predicts. Courtesy of NASA/SDO
and the AIA, EVE, and HMI science teams. . . . . . . . . . . . . . . . . 10
1.4 Geometry of Sweet-Parker reconnection. The flows enter the thin recon-
nection region from above and below, and exit horizontally in the two other
directions. The speed of magnetic reconnection is limited by the ratio
L/δ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Comparison between a conventional supercomputer and a heterogeneous
platform. On the left is a picture of the SciNet supercomputer, which has
a price-to-performance ratio of $100,000 per teraflop. On the right is my
desktop computer (with ATI GPUs inside), which has a price-to-performance
ratio of $400 per teraflop. Programming on a heterogeneous platform
takes more time and effort. . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 2D slice of the simulation for the 600^3 box at 15 Bondi times. Colour
represents the entropy, and arrows represent the magnetic field vector. The
right panel is the equatorial plane (yz), while the left panel is a perpendicular
slice (xy). White circles represent the Bondi radius (r_B = 1000). The
fluid is slowly moving, in a state of magnetically frustrated convection. A
movie of this flow is available in the supporting information section of the
electronic edition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Density versus radius. The dotted line represents the density profile for
the Bondi solution, which is the steepest plausible slope at k = 1.5. The
dashed line represents the density scaling for CDAF solution, which is the
shallowest proposed slope with k = 0.5. The solid line is the density profile
from one of our simulations, which is intermediate to the two. . . . . . . 32
2.3 log(β), entropy and radial velocity versus radius. The dashed line vr/cs
represents the radial velocity in units of Mach number. The dots vr/cms
represent the radial velocity in units of magnetosonic Mach number. The
solid line is the entropy, and we see the entropy inversion which leads to
the slow, magnetically frustrated convection. Inside the inner boundary,
the sound speed is lowered, leading to the lower entropy. The + symbols
show the plasma β, which measures the magnetic field strength. . . . . . 33
2.4 Rotation measure vs time (in units of tB). We chose Rrel = 17, corre-
sponding to Rrel/RB=0.068. Six lines represent three axes: upper set is X
(centered at +3), center is Y (centered at 0) and lower is Z (centered at
-3), with positive and negative directions drawn as solid and dashed lines,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 PDF of RM in Figure 2.4. The dashed line represents a Gaussian distribu-
tion. The horizontal axis has been normalized by the standard deviation
in figure 2.4, σRM = 0.63. . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6 The rotation measure integrand ρBr vs radius and time. The central dark
bar represents the inner boundary; the vertical axis is the Z axis. The
horizontal axis is time, in units of tB. Greyscale represents sign(Br)(ρ|Br|)^{1/4},
which was scaled to be more visually accessible. The coherence time is
longer at large radii and at late times. Several Bondi times are needed to
achieve the steady-state regime. . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Autocorrelation for Figure 2.4. X axis represents time lags; Y axis rep-
resents autocorrelation for different Rin. The dotted, dashed, dashed-dot
and solid lines correspond to Rin = 43, 34, 26, 17 respectively. . . . . . . 40
2.8 RM coherence time τ as a function of the inner truncation radius Rrel;
points refer to Rrel = 17, 26, 34 and 43. The bootstrap error of 0.17 dex
is based on the six data points, two for each coordinate direction, at each
Rrel. The normalization for Rrel = RB is log10(tlags/tB) = 2.15. . . . . . 41
3.1 Numerical setup: the sphere in the centre of the box represents the region
of the rotational perturbation. The upper-left inset shows the rotational
perturbation viewed in the YZ plane. . . . . . . . . . . . . . . . . . . . . 53
3.2 Reconnection for different initial conditions. The total magnetic energy
is an indication of reconnection. The dash-dot line has non-zero mean
magnetic field perturbation, and the reconnected field asymptotes to a
slightly different value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Reconnection for different resolutions. . . . . . . . . . . . . . . . . . . . 55
3.4 Reconnection for different resolutions near the reconnection point. This
plot recenters Figure 3.3 to the time of maximum magnetic energy release,
and scales the horizontal and vertical axes to the fractional energy release
and mean Alfvén time. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 2D snapshot during reconnection, with the current as the background colour. 57
3.6 Geometry of the Petschek solution . . . . . . . . . . . . . . . . . . . . . 60
3.7 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 37 CT . . . . . . . . 60
3.8 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 39 CT . . . . . . . . 61
3.9 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 41 CT . . . . . . . . 61
3.10 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 0 CT for 400 cells . . 62
3.11 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 10 CT for 400 cells . . 62
3.12 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 38 CT for 400 cells . . 63
3.13 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 40 CT for 400 cells . . 63
3.14 Snapshot of magnetic field lines on the background of current, and snapshot
of both magnetic and velocity field lines, and B^2 at 42 CT for 400 cells . . 69
3.15 Geometry of global configuration . . . . . . . . . . . . . . . . . . . . . . 70
4.1 Time vs box size for GPU comparison. X axis represents the length of the
box; Y axis represents the time for one time step; timings in milliseconds.
Dotted diamond is OpenCL on ATI; dashed circle is OpenCL on Nvidia;
dash-dot asterisk is CUDA on Nvidia; solid is a linear fit with slope = 3. 88
5.1 Atacama Large Millimeter/Submillimeter Array (ALMA). ALMA has much
higher sensitivity and higher resolution compared with current sub-millimeter
telescopes. Image courtesy ALMA (ESO/NAOJ/NRAO). . . . . . . . . 92
5.2 The three-dimensional simulation box for fast magnetic reconnection. Fast
magnetic reconnection is a three-dimensional effect, and the global
geometry determines the reconnection. . . . . . . . . . . . . . . . . . . . 93
5.3 Roadmap for Nvidia GPU. DP represents double precision. FLOPS rep-
resents FLoating point Operations per Second, which is a measure for
computing performance. X axis represents the time; Y axis represents the
computing performance. Tesla, Fermi, Kepler, and Maxwell are the family
names of successive generations of GPUs from Nvidia. . . . . . . . . . . 97
A.1 The logarithm of the relativistic RM factor, log10 F (k, kT ). The true RM
integral is modified by a factor F (k, kT ) relative to an estimate in which
the nonrelativistic formula is used, but the inner bound of integration is
set to the radius Rrel at which electrons become relativistic; see equation
A.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
B.1 Vacuum solution of the magnetic field is calculated in the central region.
The field lines outside of the central region show the boundary condition. 108
Chapter 1
Introduction
1.1 Motivation
Computer technology has been developing rapidly during the last decades and it has
revolutionized practically all aspects of human life. Scientists control satellites using
computers to observe the universe; financial departments use computers to estimate
economic growth; transportation companies use computers to guide trains and monitor
air traffic; computers allow people to pay their bills and order tickets online; children
play computer games and watch videos; people use electronic mail to contact each other
and video-conference makes long distances meetings possible.
While we now use computers for a variety of tasks, they were originally designed for
computing, that is, to perform simple mathematical operations. Today we continue to
use them for such purposes. Thanks to the technology revolution, a typical personal
computer can currently calculate a simple operation (e.g. 1+1=2) 10 billion times in
just one second. Furthermore, the fastest computer cluster to date has a theoretical
performance of 1 petaflops (FLoating point Operations Per Second), which means that
it performs this simple operation a quadrillion times in one single second!
Obviously, we do not just use computers to calculate ‘1+1=2’. To utilize their full
power, people now take advantage of computers to do far more complicated numerical
simulations in different areas, including physics, chemistry, biology, economics, and
engineering.
A computer simulation tries to solve a mathematical model, which is usually expressed
as equations. Different output behaviours depend on the model and on the input
data, and the results of the simulations allow users to set up an experiment and predict
its results at very low cost, since the only apparatus needed is the computer. The
mathematical equations are first translated into programming languages (i.e. code). The
results are then calculated by computers and presented using graphs, videos or other
readable formats for analysis.
Computer simulations are commonly applied in fluid dynamics, a field known as
computational fluid dynamics (CFD). CFD simulations have become a very important tool
for engineers because of their low cost. In aerodynamics, for example, one can study
how air flow affects objects by constructing a wind tunnel for testing. However,
building such an apparatus costs a great deal of money and time. A good example is HiMAT
(Highly Maneuverable Aircraft Technology), the experimental NASA (National
Aeronautics and Space Administration) aircraft [25], which was designed to test
high maneuverability for the next generation of fighter planes. NASA found that their
wind tunnel tests of HiMAT suffered from a drag problem related to the wings. Redesigning
the aircraft model would have cost about $150,000 and caused an unacceptable time delay.
However, they were able to redesign it using a computer simulation, costing only $6,000.
The situation is worse in astronomy, since in many fields an experiment is almost
impossible. For example, there is strong evidence that a supermassive black hole sits at
our Galactic centre. The distance between the black hole and the Earth is about 10^20
meters, such that it takes more than 2.6×10^4 years for light emitted in its neighbourhood
to reach us. We know that black holes have very strong gravitational potentials, making
it difficult for nearby objects to avoid falling in. However, due to the gas dynamics in
the surrounding region, not all of the matter falls into the black hole. How then can one
investigate how much gas finally falls in? The region of gravitational influence of the
black hole at the Galactic centre has a length scale of 10^14 meters, and it is impossible
to reproduce this huge system experimentally on Earth. The only currently available way
to study the system is to use computers to mimic the black hole in a simulation. Equations
represent the evolution of the gas dynamics, and numerical parameter constraints
represent the black hole and all the underlying physics. Both of these are
translated into computer languages, and after long simulation runs, we can show
how the matter around such a black hole evolves and use data analysis to find the
percentage of the gas that finally falls into the black hole.
In this thesis, we apply magnetohydrodynamics (MHD), an extension of fluid
dynamics (i.e., additional terms are added to the equations of fluid dynamics), to study
phenomena in the universe.
Magnetohydrodynamics (MHD) studies the dynamics of an electrically conducting
fluid under the influence of a magnetic field. If there is no magnetic field present, the
problem reduces to traditional fluid dynamics. However, in most astrophysical settings,
the fluids are highly conductive and observed to be magnetized. Electro-motive forces
generated by magnetic fields will modify the flow, which will in turn affect the field. As
a result, one has to solve both the hydrodynamics and electromagnetics simultaneously.
The interactions between the field and the motion are a source of great interest and many
challenges in the MHD domain.
To simulate MHD in astrophysics is not a simple task, due to its complexity and the
large scales of astrophysical objects. As a result, a highly efficient simulation code and
powerful supercomputers are needed. Fortunately both of these resources are available
here, allowing us to perform this research.
1.2 Thesis projects
The majority of matter in the universe exists in the form of plasma. Plasma is
electrically conducting, and can be treated as a fluid if length scales are much larger than
the mean free path of the particles. Additionally, it is widely believed that magnetic fields
are present throughout the universe; for instance, magnetic fields are inferred from
polarization measurements. Therefore, magnetohydrodynamics can be applied in the field
of astrophysics. Compared to laboratory MHD, astrophysical objects have such large
sizes that the electric currents are generated by self-induction rather than by electrical
resistance [28]. A consequence of the highly conducting fluids in astrophysical
environments is that ideal MHD can be safely adopted.
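The safety of the ideal-MHD (perfectly conducting) approximation can be illustrated with a magnetic Reynolds number estimate, R_m = LV/η: when R_m ≫ 1, the resistive term in the induction equation is negligible. The numbers below are rough illustrative assumptions (a Bondi-scale length, a thermal-speed velocity, and a nominal microscopic diffusivity), not values taken from this thesis:

```python
# Order-of-magnitude check that ideal MHD holds at astrophysical scales.
# All inputs are illustrative assumptions, not thesis data.
L = 1.0e14      # length scale in metres (roughly the Bondi-scale region quoted above)
V = 5.5e5       # velocity scale in m/s (~ sound speed near the Bondi radius)
eta = 1.0e3     # assumed magnetic diffusivity in m^2/s (hot-plasma ballpark)

R_m = L * V / eta   # magnetic Reynolds number
print(f"R_m ~ {R_m:.1e}")   # vastly greater than 1: resistivity is negligible
```

Even with a diffusivity orders of magnitude larger than assumed here, R_m remains enormous, which is why ideal MHD is adopted throughout this work.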
1.2.1 Black hole accretion
A supermassive black hole, with mass M_BH ≃ 4.3×10^6 M_⊙ [37], is situated at the Galactic
centre, near the compact radio source Sgr A* [60] (Figure 1.1). This black hole subtends
the largest angular size of any known black hole, and the current best estimate
for the distance to the Galactic centre is R_0 = 8.33 ± 0.35 kpc [37]. This means that
one arcsecond corresponds to only 0.04 parsec (i.e. ∼1.2×10^17 cm). Therefore, in
comparison with other astrophysical objects, which are much further away, finer details
can be resolved through observational means so as to constrain the fluid dynamics, which
also proves helpful for theoretical modeling.
Figure 1.1: X-ray image of Sgr A*. The luminosity of the supermassive black hole is
10^8 times dimmer than simple theoretical predictions. NASA/CXC/MIT/F.K. Baganoff
et al.

What is curious about this black hole is that it has a very low luminosity, despite
being embedded in a huge amount of matter (e.g. n_e ≃ 130 cm^-3 at the accretion radius,
which is 0.06 pc). This is paradoxical because, as matter accretes onto it due to the strong
gravitational field, a large amount of energy would be released, which should then be
a source of radiation [14]. The bolometric luminosity is determined by the product of the
accretion rate and the radiative efficiency. Taking the accretion rate of the Bondi solution
[19], which describes spherical accretion under the influence of a gravitational potential,
and a typical radiative efficiency (∼10%) [33], the theoretical luminosity is about
10^41 erg s^-1. On the other hand, the reported luminosity is about 10^33 erg s^-1 [13; 35].
Therefore, there exists a discrepancy between theory and observation of 8 orders of magnitude.
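The size of this discrepancy is easy to verify with one line of arithmetic, L = ε Ṁ c²; the sketch below redoes the estimate using the Bondi rate and 10% efficiency quoted in the text, with approximate CGS constants:

```python
# Rough check of the theoretical luminosity L = efficiency * Mdot * c^2 (CGS units).
Msun = 2.0e33               # solar mass in grams (approximate)
yr = 3.15e7                 # seconds per year
c = 3.0e10                  # speed of light in cm/s

Mdot = 1.4e-5 * Msun / yr   # Bondi accretion rate quoted in the text, in g/s
eff = 0.10                  # typical radiative efficiency [33]

L_theory = eff * Mdot * c**2
print(f"L_theory ~ {L_theory:.1e} erg/s")        # of order 1e41 erg/s
print(f"ratio to observed 1e33 erg/s: {L_theory / 1e33:.0e}")
```

The result is of order 10^41 erg s^-1, roughly 10^8 times the observed luminosity, matching the discrepancy stated above.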
While it is still unclear how to calculate the radiative efficiency, the accretion rate
can be constrained by the properties of linear polarization [76]. Because the
magnetized plasma has an anisotropic index of refraction, the position angle of linearly
polarized light rotates by an amount that depends on frequency,

θ = RM λ^2 (1.1)
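Equation (1.1) is also how RM is measured in practice: one observes the position angle at two or more wavelengths and fits the slope against λ². A minimal sketch with made-up numbers (the wavelengths and RM value here are illustrative, not data from this thesis):

```python
# Recover a rotation measure from synthetic position angles, theta = RM * lambda^2.
# All numbers are illustrative; real measurements must also resolve n*pi ambiguities.
RM_true = 5.0e5                 # rad/m^2, of order the Sgr A* measurements
lam1, lam2 = 1.3e-3, 0.87e-3    # observing wavelengths in metres (sub-mm band)

theta1 = RM_true * lam1**2      # position angles predicted by eq. (1.1), noiseless
theta2 = RM_true * lam2**2

# The slope of theta versus lambda^2 returns the RM.
RM_fit = (theta1 - theta2) / (lam1**2 - lam2**2)
print(f"RM = {RM_fit:.3e} rad/m^2")
```

With more than two wavelengths one fits a straight line of θ against λ², which also exposes any deviation from the λ² law.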
where RM is the rotation measure. In a nonrelativistic plasma, the RM is defined as
[80; 34]

RM = (8.1 × 10^5) ∫ n_e B · dl  rad m^-2 (1.2)

where n_e is the electron density in units of cm^-3, dl is the path length element in units
of parsec, and B is the magnetic field in units of Gauss. In the case of an ultrarelativistic
thermal plasma, the RM is suppressed by a factor of log(γ)/2γ^2 [76], with γ = kT_e/m_e c^2.
It can be shown that the rotation measure depends only on the electron density
and the magnetic field along the light path from the source to the observer. As
a result, once the RM value is measured, the electron density (i.e. the accretion rate)
can be determined under certain assumptions for the electron density and magnetic field.
This can be done by assuming a power-law density n(r) to obtain the expression for the
electron density, and taking the condition of equipartition between magnetic, kinetic
and gravitational energy [59] to obtain the expression for the magnetic field. The value of
the rotation measure was observed to be around 10^5 rad m^-2 [56; 55; 58], by the Submillimeter
Array (Figure 1.2) [56], and by the Berkeley-Illinois-Maryland Association (BIMA) Array
[55; 21]. This number leads to an accretion rate of 10^-9 − 10^-7 M_⊙ yr^-1, which is
significantly smaller than the Bondi solution (∼10^-5 M_⊙ yr^-1).
With the aforementioned constraint on the accretion rate, we can now revisit previous
accretion models. To begin, the Bondi solution is a spherically symmetric accretion
model under a gravitational field [19]. Matter inside the capture radius, r_B = GM/c_s0^2,
falls into the centre due to gravity. The expression for the mass accretion rate is
4π λ_c r_B^2 ρ c_s0, with λ_c = 0.25. Taking the observational data from [14], r_B = 0.06 pc,
n_e = 130 n_f^{-1/2} cm^-3, and c_s0 ≈ 550 km s^-1 at the Bondi radius, we find an accretion rate
of Ṁ_Bondi ∼ 1.4 × 10^-5 M_⊙ yr^-1. Clearly, this value is too large and does not agree with
the constraint from rotation measures.
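As a cross-check, the quoted rate follows from Ṁ = 4π λ_c r_B^2 ρ c_s0 with simple CGS arithmetic. The sketch below assumes a pure-hydrogen mass density ρ ≈ m_p n_e with the filling factor set to one, so it reproduces only the order of magnitude; the exact prefactor depends on the assumed composition:

```python
import math

# Order-of-magnitude Bondi accretion rate, Mdot = 4*pi*lambda_c*r_B^2*rho*c_s0 (CGS).
pc = 3.086e18          # parsec in cm
m_p = 1.67e-24         # proton mass in g
Msun = 2.0e33          # solar mass in g (approximate)
yr = 3.15e7            # seconds per year

lambda_c = 0.25
r_B = 0.06 * pc        # Bondi radius quoted from [14]
n_e = 130.0            # electron density in cm^-3 (filling factor set to 1 here)
rho = m_p * n_e        # pure-hydrogen mass density: an assumption of this sketch
c_s0 = 550.0e5         # sound speed at the Bondi radius in cm/s

Mdot = 4 * math.pi * lambda_c * r_B**2 * rho * c_s0    # g/s
Mdot_solar = Mdot * yr / Msun
print(f"Mdot ~ {Mdot_solar:.1e} Msun/yr")   # order 1e-5, consistent with the text
```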
Figure 1.2: Image of the Submillimeter Array (SMA). Successful measurements of RM
were made by [56] using the Submillimeter Array in 2006. Image courtesy SMA.

Obviously, the Bondi solution is insufficient, as it does not include magnetic fields or
Keplerian rotation, and is also a one-dimensional model. Many other models were later
designed for this low-luminosity accretion flow, including Advection-Dominated
Accretion Flow (ADAF) [65; 63; 27; 61; 72; 64], adiabatic inflow-outflow solution
(ADIOS) [18], Convection-Dominated Accretion Flows (CDAF) [62; 77], Convection-
Dominated Bondi Flow (CDBF) [42], Thermal conductive flow [45; 85], and Stellar wind
[54]. In Advection-Dominated Accretion Flow (ADAF) [65], the entropy generated by
the accretion cannot radiate out of the disk surface and must advect into the black hole
with hot ions. The ions and electrons interact only through inefficient Coulomb collisions,
resulting in a very low radiative efficiency [66; 64]. Unfortunately, the accretion rate in
this model is once again inconsistent with the RM values. An Adiabatic inflow-outflow
solution (ADIOS) [18] was proposed in which a large fraction of the released energy will
be driven away by wind and only a small amount of mass will fall into the centre of the
black hole. This process leads to a much smaller accretion rate compared to the Bondi
solution, however, there is no simulation that can produce this effect. ADAF is convec-
tively unstable [41; 40; 89], and later Convection-Dominated Accretion Flows (CDAF)
[62; 77] and Convection-Dominated Bondi Flow (CDBF) [42] were proposed in which
convection plays an important role in the flow. In both models, the matter in a spherical
shell circulates indefinitely in convective eddies, causing the accretion rate to be very
small. Thermal conduction can be important if the conduction time of the plasma is
shorter than the electron cooling time [45; 85]; these simulations showed that thermal
conduction transports energy outward, resulting in a reduced accretion rate.
Stellar winds from the stars near the black hole may provide the direct matter input
responsible for the low luminosity of Sgr A* [54]; these simulations derived an
accretion rate of ∼10^-8 M_⊙ yr^-1, which is comparable to the observations. However, a
large amount of mass could still accrete onto the black hole even if the stars do not
contribute.
An additional flow was proposed by Pen et al. [70], who presented a very subsonic flow
referred to as magnetically frustrated convection, in which the flow is quasi-hydrostatically
supported by thermal pressure. Pen et al. [70] conducted a 1400^3-zone MHD
simulation, in which they found the density slope to be n ∼ 0.72. There is only a very small
inward energy flux, and the buoyant motions are resisted by magnetic shear stresses.
Many simulations have been performed for black hole accretion flows, but all suffered
from various difficulties, for example: non-conservation during magnetic reconnection,
reduced dimensionality (one- or two-dimensional simulations instead of three), boundary
problems, poor dynamic range, or insufficient run time to achieve a stable result.
Here, we address these problems and continue the effort on magnetically frustrated
convection through the use of our three-dimensional, large-scale MHD simulations. Our
simulations have an expanding Cartesian grid (i.e. the grid spacing increases as one
moves away from the centre), which achieves a larger box with a smaller number of grid
points. An inner boundary at the centre represents the black hole, which removes
mass and energy at each time step. The box edge is placed far enough from the Bondi
radius (r_B = GM/c_s0^2) to minimize the outer-boundary effects that afflicted previous
simulations. The simulations begin with a static flow (apart from an optional Keplerian
rotation), to which a uniform
magnetic field is subsequently added. Our simulations are able to run long enough to
attain a stable result.
1.2.2 Fast magnetic reconnection
Magnetic reconnection is a process in which magnetic field lines reconnect, causing
topological rearrangements of the field lines and the conversion of magnetic energy into
other forms of energy (e.g. heat, kinetic energy). Some energy-release events in
astrophysics, such as solar flares, are believed to be driven by magnetic reconnection.
However, theoretical predictions of the reconnection speed are far too slow when compared
to observations. For example, the Sweet-Parker model [90; 68], which describes opposing
magnetic fields interacting in a thin current sheet, implies a time scale for solar flare
(Figure 1.3) reconnection that is 10^5 times slower than observed [29].
In the Sweet-Parker model (Figure 1.4), two oppositely directed inflows push oppositely
directed magnetic fields toward a neutral line, causing magnetic reconnection
in a thin region of thickness 2δ and length 2L (L ≫ δ). The conversion of
magnetic energy increases the pressure, and the magnetic tension force expels the inflowing
plasma. Mass conservation in steady reconnection requires that the inward mass flux
equal the outward mass flux:

v_in L = v_out δ    (1.3)

As a result, even though the outflow plasma can be accelerated to the Alfven speed
(v_A = B/√(4πρ)), the inflow speed remains very small because of the large ratio L/δ in
astrophysical settings, limiting the reconnection rate.
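The severity of this bottleneck is easy to quantify: in resistive MHD the Sweet-Parker scaling is v_in/v_A = S^(−1/2), where S = L v_A/η is the Lundquist number. The following sketch uses order-of-magnitude solar-corona numbers; the values of L, v_A, and η below are illustrative assumptions, not quantities from this thesis:

```python
import math

def sweet_parker_inflow(L_cm, v_A_cms, eta_cm2s):
    """Sweet-Parker reconnection: v_in / v_A = S**-0.5,
    where S = L * v_A / eta is the Lundquist number."""
    S = L_cm * v_A_cms / eta_cm2s
    return v_A_cms / math.sqrt(S), S

# Illustrative solar-corona numbers (assumed, order of magnitude only):
L = 1e9          # current-sheet length, cm
v_A = 1e8        # Alfven speed, cm/s
eta = 1e4        # Spitzer-like magnetic diffusivity, cm^2/s

v_in, S = sweet_parker_inflow(L, v_A, eta)
t_alfven = L / v_A            # Alfven crossing time, s
t_sp = L / v_in               # Sweet-Parker reconnection time, s

print(f"S = {S:.1e}, v_in/v_A = {v_in / v_A:.1e}")
print(f"reconnection is ~{t_sp / t_alfven:.0e} x slower than Alfvenic")
```

With these assumed numbers the reconnection time exceeds the Alfven time by more than a factor of 10^6, in line with the many-orders-of-magnitude discrepancy quoted above.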
To solve this problem, Petschek later proposed the X-point reconnection configuration
[71], in which standing shock waves direct the reconnection outflow. Instead of a thin
current sheet, a much shorter diffusion region of length L′ is invoked, and the reconnection
rate is increased by a factor of √(L/L′), approaching the Alfven speed. However,
two-dimensional simulations cannot reproduce Petschek's fast reconnection [16]. Later,
Priest et al. [74] pointed out that
Figure 1.3: Image of a solar flare. The observed time scale of solar flares is 10^5 times
shorter than the theoretical (Sweet-Parker) model predicts. Courtesy of NASA/SDO and
the AIA, EVE, and HMI science teams.
Figure 1.4: Geometry of Sweet-Parker reconnection. The flows enter the thin reconnection
region from above and below, and exit horizontally in the two opposite directions.
The speed of magnetic reconnection is limited by the ratio L/δ.
the boundary conditions are crucial for the occurrence of fast magnetic reconnection.
However, no ideal MHD simulation has yet realized fast reconnection, except with
artificially enhanced local resistivity [82] or with collisionless MHD [15]. Unfortunately,
these cases still cannot explain fast reconnection in astrophysical environments, such as
solar flares, which require ideal collisional MHD.
Both the Sweet-Parker and Petschek models are two-dimensional, whereas magnetic
reconnection in nature occurs in three-dimensional space, so the previous two-dimensional
simulations may be limited. Furthermore, Petschek's configuration suffers from a
geometrical problem: it emphasizes the significance of the microscopic X-point for
fast reconnection, but ignores the fact that most of the energy and mass reside in the
global flow.
Here, we use our three-dimensional ideal MHD simulations to address these issues.
The simulation starts with two oppositely aligned magnetic fields in a periodic box.
An initial perturbation is added at the centre of the box, between the two opposing
field regions. Consequently, we were able to demonstrate three-dimensional fast magnetic
reconnection.
1.2.3 Accelerating MHD simulations
Computational simulations use discretization to model real-world physics problems.
Usually, a higher grid resolution yields more accurate results. Additionally,
astrophysical objects have large length scales, so a large simulation box
is needed. Both high resolution and a large simulation box require greater computing
power. Clusters of linked computers and massive parallelization have therefore been
used to process large amounts of data simultaneously. However, simply accumulating
computers is problematic because of the high monetary cost, high power
consumption, and large space requirements.
Thanks to improvements in technology, computers are becoming cheaper and
faster. Furthermore, with heterogeneous systems now being developed [83], it is becoming
possible to achieve higher performance by exploiting these new architectures.
Heterogeneous systems typically have a single controlling processor and many com-
puting cores. Examples include Cell processors from IBM, and graphics processing units
(GPU) from both Nvidia and ATI. The controlling processor is responsible for task
assignment, while the computing cores do the calculation work. As there are many cores
on one heterogeneous platform, calculations can be greatly sped up. Furthermore, heterogeneous
platforms are inexpensive, consume less power, and occupy less space,
yielding an excellent price-to-performance ratio.
To demonstrate these economic benefits, we make a rough comparison between the Scinet
supercomputer and the GPUs in my desktop computer (Figure 1.5). The Scinet supercomputer
has 30,240 cores, totalling 306 TFlops of performance. It cost about 30 million Canadian
dollars, or roughly $100,000 per teraflop. My desktop contains two ATI HD5870
GPUs, which have about 5 TFlops of combined performance. The total cost is about $2,000,
giving $400 per teraflop. This ratio is very impressive.

Figure 1.5: Comparison between a conventional supercomputer and a heterogeneous
platform. On the left is the Scinet supercomputer, with a price-to-performance
ratio of about $100,000 per teraflop. On the right is my desktop computer (ATI GPUs
inside), with a price-to-performance ratio of $400 per teraflop. Programming
on a heterogeneous platform takes more time and effort.

To port the simulations to a heterogeneous system, the programmers have to rewrite
the code; however, given the attractive price-to-performance ratio, it is worth doing so.
Here, we show our progress in programming MHD simulations on three different
heterogeneous systems (i.e. the Cell/B.E. [1], Nvidia GPUs, and ATI GPUs). As different
heterogeneous platforms use different programming languages, we port our FORTRAN
MHD code to C, Cell SDK, CUDA and OpenCL. We present the speed-ups obtained on the
different heterogeneous platforms as a guide for the future acceleration of MHD
simulations.
1.3 MHD equations
Hannes Alfven introduced the concept of MHD in 1942 [11]. The governing MHD
equations combine the Euler equations with Maxwell's equations, modified
to represent the interaction between the magnetic field and the fluid motion.
Because a continuum description is used for the electromagnetism, the mean free path of
the electrons is assumed to be small in comparison with the Larmor radius, which is the
radius of curvature of the electrons' orbits in the magnetic field [50]. The Larmor radius
is inversely proportional to the magnetic field, and the mean free path is inversely
proportional to the density. As a result, MHD cannot be applied in a rarefied medium or
in a very strong magnetic field. We are also more concerned with the conducting fluid
than with electromagnetic oscillations; assuming the fields vary slowly, Maxwell's
displacement current is ignored. Furthermore, only ideal MHD is considered here, so
diffusion, viscosity, heat conduction, and resistivity are neglected.
The equations of ideal magnetohydrodynamics are listed below [88]. The equation of
continuity is the same as in fluid dynamics [49],

∂ρ/∂t + ∇·(ρu) = 0    (1.4)

where u denotes the velocity and ρ the density.
The equation of motion is modified by the inclusion of the Lorentz force,

∂u/∂t + (u·∇)u = −(1/ρ)∇p + (1/4πρ)(∇×B)×B    (1.5)

where p represents the gas pressure and B the magnetic field. The last term is the
Lorentz force, j_e × B/c = (∇×B)×B/4π.
We write the equation of heat transfer in the form of conservation of energy, with
additional electromagnetic terms in both the density and the flux,

∂/∂t ( ρu^2/2 + ρε + B^2/8π ) = −∇·[ ρu( u^2/2 + ε + p/ρ ) + (1/4π) B×(u×B) ]    (1.6)

where ε is the internal energy per unit mass. The energy density includes the magnetic
energy B^2/8π, and the energy flux includes the Poynting vector cE×H/4π, whose
dissipative part has already been neglected in the equation.
To relate the pressure, density and temperature, the equation of state is used,
p = p(ρ, T ) (1.7)
The electromagnetic aspect also needs to be included in the MHD equations. The
equations that describe the electromagnetic field in a moving conductor are:
∇ · B = 0 (1.8)
∂B/∂t = ∇× (u× B) (1.9)
Equations 1.4 to 1.9 are thus the equations of ideal magnetohydrodynamics.
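Because Equation 1.4 is in conservative (divergence) form, it discretizes naturally as a finite-volume update in which whatever leaves one cell enters its neighbour, so with periodic boundaries the total mass is preserved to machine precision. A minimal first-order upwind sketch in Python (illustrative only, not the thesis code):

```python
import math

def advect_density(rho, u, dx, dt, steps):
    """Finite-volume update of d(rho)/dt + d(rho*u)/dx = 0 on a
    periodic 1D grid with constant velocity u > 0 (first-order upwind)."""
    n = len(rho)
    rho = list(rho)
    for _ in range(steps):
        flux = [u * rho[i] for i in range(n)]              # F_{i+1/2} = u * rho_i
        rho = [rho[i] - dt / dx * (flux[i] - flux[i - 1])  # flux[-1] wraps (periodic)
               for i in range(n)]
    return rho

n = 64
dx = 1.0 / n
u = 0.5
dt = 0.5 * dx / u                                          # CFL number 0.5
rho0 = [1.0 + 0.3 * math.sin(2.0 * math.pi * i * dx) for i in range(n)]
rho1 = advect_density(rho0, u, dx, dt, steps=100)

mass0, mass1 = sum(rho0) * dx, sum(rho1) * dx
print(f"total mass drift after 100 steps: {abs(mass1 - mass0):.2e}")
```

The interface fluxes cancel in pairs when summed over the periodic grid, so the mass drift is pure floating-point roundoff; the same telescoping argument underlies the conservation properties of the TVD code described later.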
1.4 The properties of MHD
1.4.1 Frozen-in effect
By combining the equation of continuity with the evolution equation for the magnetic
field (i.e. Equation 1.9), we obtain the following formula,

d/dt ( B/ρ ) = ( ∂/∂t + u·∇ ) ( B/ρ ) = ( (B/ρ)·∇ ) u    (1.10)
This formula effectively represents the frozen-in effect of ideal MHD, explained in detail
below [50]. Imagine an element of length δl on a fluid line, which represents a line
that moves with the fluid particles. If the velocity at one end of the element is u, then
the velocity at the other end can be expressed as u + (δl·∇)u. The length of the element
therefore changes by dt (δl·∇)u over a time interval dt, which means that

d/dt (δl) = (δl·∇) u    (1.11)
This expression is exactly the same as Equation 1.10 with δl replaced by B/ρ. Therefore
the two vectors remain parallel if their initial directions are the same, and the ratio of
their lengths does not change. Due to the frozen-in effect, particles an infinitesimal
distance apart always move along the same magnetic field line. This is a characteristic
of ideal MHD.
1.4.2 Magnetic energy and stress
The energy of a magnetic field is B^2/2µ per unit volume. The total energy
W_M can be defined as [28]:

W_M = ∫ B^2/(2µ) dτ    (1.12)
Together with Equation 1.9, the rate of change of the magnetic energy in ideal MHD is:

dW_M/dt = µ^−1 ∫ B·[∇×(u×B)] dτ    (1.13)

which, since the divergence term integrates to a vanishing surface term, can be further
simplified as:

dW_M/dt = µ^−1 ∫ { ∇·[(u×B)×B] + (u×B)·(∇×B) } dτ = −∫ u·(j×B) dτ    (1.14)
Therefore, the change of magnetic energy is due to the work done by the magnetic
force j×B. More specifically, this magnetic force can be expressed as:

j×B = −∇( B^2/2µ ) + (B·∇) B/µ = −∇( B^2/2µ ) + ∇·( BB/µ )    (1.15)

The term B^2/2µ represents a hydrostatic pressure wherever there is a magnetic field
gradient, and the term B^2/µ represents the tension along magnetic flux tubes.
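Equation 1.15 can be verified numerically for a concrete field. For the illustrative choice B = (B_0 cos(ky), 0, 0) in units with µ = 1 (an assumption for this sketch), the y-component of (∇×B)×B should equal the pressure-gradient term, the tension term vanishing because B varies only across itself:

```python
import math

B0, k, h = 2.0, 3.0, 1e-6

def Bx(y):                      # B = (Bx(y), 0, 0)
    return B0 * math.cos(k * y)

def ddy(f, y):                  # central finite difference
    return (f(y + h) - f(y - h)) / (2.0 * h)

def lorentz_y(y):
    """y-component of j x B = (curl B) x B, with mu = 1."""
    jz = -ddy(Bx, y)            # (curl B)_z = -dBx/dy
    return jz * Bx(y)

def pressure_tension_y(y):
    """y-component of -grad(B^2/2) + (B.grad)B; the tension term is
    zero here because B points in x but varies only with y."""
    return -ddy(lambda s: 0.5 * Bx(s) ** 2, y)

y = 0.4
lhs, rhs = lorentz_y(y), pressure_tension_y(y)
print(f"j x B = {lhs:.6f},  -grad(B^2/2) + tension = {rhs:.6f}")
```

Both evaluations agree to finite-difference accuracy, confirming the decomposition of the Lorentz force into magnetic pressure and tension.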
1.5 Tools for the research
Software – Simulation code
The MHD code we are using was written by Ue-li Pen in 2003, and later expanded
by Phil Arras, ShingKwong Wong, Hugh Merz, Matthias Liebendorfer, Stephen Green,
and Bijia Pang.
The code [69] is a three-dimensional, second-order accurate (in space and time),
high-resolution total variation diminishing (TVD) MHD parallel code. Kinetic, thermal, and
magnetic energy are conserved, and the divergence of the magnetic field is kept at zero
by flux-constrained transport. There is no explicit magnetic or viscous dissipation in
the code; the TVD constraints result in non-linear viscosity and resistivity on the grid
scale.
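The practical meaning of the TVD constraint can be seen in a miniature example: a second-order MUSCL scheme with a minmod slope limiter keeps the total variation of an advected profile from growing, and the limiting acts like a non-linear viscosity on the grid scale. A 1D linear-advection sketch (illustrative only, far simpler than the MHD code itself):

```python
def minmod(a, b):
    """Slope limiter: zero at extrema, else the smaller-magnitude slope."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b

def tvd_advect(q, c, steps):
    """Linear advection on a periodic grid at CFL number c (0 < c <= 1),
    using MUSCL reconstruction with minmod limiting (second order)."""
    n = len(q)
    q = list(q)
    for _ in range(steps):
        slope = [minmod(q[i] - q[i - 1], q[(i + 1) % n] - q[i]) for i in range(n)]
        # upwind interface value q_{i+1/2} for u > 0
        qf = [q[i] + 0.5 * (1.0 - c) * slope[i] for i in range(n)]
        q = [q[i] - c * (qf[i] - qf[i - 1]) for i in range(n)]
    return q

def total_variation(q):
    return sum(abs(q[i] - q[i - 1]) for i in range(len(q)))

q0 = [1.0 if 10 <= i < 20 else 0.0 for i in range(50)]    # square pulse
q1 = tvd_advect(q0, c=0.5, steps=200)
print(f"TV before: {total_variation(q0):.3f}, TV after: {total_variation(q1):.3f}")
```

The flux-form update also conserves the sum of q exactly on the periodic grid, mirroring the conservation properties quoted above for the MHD code.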
The code is parallelized with MPI [46] and OpenMP, and is therefore suited to large-scale
simulations. Furthermore, the code is exceptionally fast. By combining it with
powerful computer clusters, we can run some of the largest and longest MHD simulations
to date.
Hardware – Computer clusters
If you attempt to run large-box simulations on your desktop machine, you will never
graduate. Fortunately, we have access to two powerful computer clusters, each consisting
of hundreds to thousands of linked computers. One is SunnyVale, at the Canadian
Institute for Theoretical Astrophysics (CITA). The other is Scinet, home to the most
powerful supercomputer in Canada. Access to these two clusters gives us the opportunity
to perform high-performance parallel computing for MHD simulations.
SunnyVale is the Beowulf cluster of the Canadian Institute for Theoretical Astrophysics
(CITA) [2]. It consists of 200 Dell PE1950 compute nodes; each node contains two
quad-core Intel Xeon E5310 processors at 1.60 GHz, 4 GB of RAM, two GigE network
interfaces, and a 40 GB disk. As a result, there are 1,600 cores in total.
Scinet [3] hosts the most powerful supercomputer in Canada to date. It has 306
TFlops of theoretical peak performance, which earned it rank No. 16 on the June 2009
TOP500 list. Scinet has two large clusters, GPC and TCS: the former is a very large
x86-based commodity cluster, designed for a wide variety of serial and parallel
applications; the latter is for jobs with large numbers of processes/threads, or needing
large memory and a low-latency interconnect. The GPC contains 30,240 Intel
Xeon E5540 'Nehalem' (2.53 GHz) cores, with 8 cores and 16 GB of
RAM per node; the interconnect is a hybrid of GigE and InfiniBand.
In addition to computer clusters, a wide variety of heterogeneous platforms [83] are
also available for our research on accelerating MHD simulations.[1]

[1] More details about heterogeneous platforms are provided in Chapter 4.
CITA provides many Nvidia GPUs, ranging from the early low-end 'Quadro' series
to the latest 'Fermi' family. Moreover, two dual ATI HD 5870 GPUs are also available
for testing. Scinet also provides the newest Nvidia GPUs, and a cluster of Cell blades
is available.
1.6 Contribution
Chapters 2 to 4 are three technical papers.
The authors for the black hole accretion section include Bijia Pang, Ue-Li Pen,
Christopher D. Matzner, Stephen Green and Matthias Liebendorfer. This section has
been submitted to Monthly Notices of the Royal Astronomical Society. Ue-Li Pen and
Christopher D. Matzner were involved in this project for several years; both provided
extremely valuable suggestions and devoted a great deal of time to the subject.
Christopher D. Matzner also spent a considerable amount of time editing the
manuscript. Stephen Green contributed the non-uniform grid and the inner-boundary
setting of the code. I received the MPI version of the code from Matthias Liebendorfer
several years ago, and he provided tremendous assistance in my early efforts to
familiarize myself with the code.
Regarding the section on fast magnetic reconnection, the authors are Bijia Pang,
Ue-Li Pen and Ethan T. Vishniac; it was published in Physics of Plasmas. Ue-Li Pen
and Ethan T. Vishniac contributed a great deal of time to the discussion and the writing
of the draft. Ethan T. Vishniac is an expert on MHD reconnection and, despite his busy
schedule, travelled to Toronto frequently to meet with us during this project.
Prof. Vishniac also spent a considerable amount of time editing the manuscript.
The MHD acceleration section was authored by Bijia Pang, Ue-Li Pen and Michael
Perrone. It has been posted on the arXiv.org e-print archive, and we plan to submit it
to a computer science conference. Ue-Li Pen contributed to the discussion and the
writing of the draft; Michael Perrone was the manager of the multi-core department
at the IBM T. J. Watson Research Center when I was there, and helped with the
discussion.
Chapter 2
Black hole accretion
2.1 Introduction
The radio source Sgr A* at the Galactic centre (GC) is now accepted to be a supermassive
black hole [M_BH ≃ 4.3 × 10^6 M⊙: 37], accreting hot gas from its environment [n_e ≃
130 cm^−3, k_B T ≃ 2 keV at 1 arc second: 14]. Interest in the Sgr A* accretion flow is
stimulated by its remarkably low luminosity; by its similarity to other low-luminosity
AGN; by circumstantial evidence for past episodes of bright X-ray emission [79, but see
98] and nearby star formation [53]; and foremost, by its status as an outstanding physical
puzzle.
Supermassive black holes are enigmatic in many respects; for the GC black hole
(GCBH) the enigma is sharpened by a wealth of observational constraints, which permit
detailed, sensitive and spatially resolved studies of its accretion dynamics. Within a
naive model such as Bondi flow, matter would flow inward at the dynamical rate from
its gravitational sphere of influence, which at ∼ 1″ is resolved by Chandra. Converted to
radiation with an efficiency ηc^2, the resulting luminosity would exceed what is actually
observed by a factor ∼ 10^5 (η/0.1). This wide discrepancy between expectation and
observation has stimulated numerous theoretical explanations, including convection [62;
77], outflow [18], domination by individual stars' winds [54], and conduction [91; 45; 85;
87].
2.1.1 Constraining the accretion flow
Because many of its parameters are uncertain, the central density and accretion rate of
the GCBH flow are not strongly constrained by its emission spectrum [78]; the most
stringent constraints come from observations of the rotation measure [76], now known to
be roughly −5.4 × 10^5 rad m^−2 [58]. Interpreting this as arising within a quasi-spherical
flow with magnetic fields in rough equipartition with gas pressure, and adopting the
typical assumption that magnetic fields do not reverse rapidly, we derive a gas density
n_H ∼ 10^5.5 cm^−3 (R_S/R_rel)^1/2 at the radius R_rel which dominates the RM integral, namely
where electrons become relativistic; see §A for more detail. If this radius is about 10^2
Schwarzschild radii (10^2 R_S), as in the spectral models of [78], then a comparison between
this density and conditions at the Bondi radius R_B ≃ 0.053 pc indicates a density power
law ρ ∝ r^−k with k = 1.1−1.3; the derived value is rather insensitive to the black
hole mass, the degree of equipartition, and the precise radius at which electrons become
relativistic. (If rapid conduction causes electrons to be nonrelativistic at all radii, the
implied slope falls to 0.8.)
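The statement that R_rel dominates the RM integral can be sketched quantitatively with the standard rotation-measure integral, RM = 0.81 ∫ n_e B_∥ dl (rad m^−2 for n_e in cm^−3, B in µG, dl in pc). For a k ≈ 1 density profile with an equipartition field also falling roughly as r^−1 (both profile shapes are assumptions of this sketch), the integrand scales as r^−2, so the integral is dominated by the smallest radii:

```python
def rm_integrand(r, r_in):
    """Assumed k = 1 profiles: n_e and the (equipartition) field B both
    scale as r^-1 inside the Bondi radius; normalized at r_in."""
    return (r / r_in) ** -1 * (r / r_in) ** -1

def rm_integral(r_in, r_out, npts=20000):
    """RM = 0.81 * integral(n_e * B_par dl): rad m^-2 for n_e in cm^-3,
    B in microgauss, path in pc.  Profiles are normalized, so only the
    shape of the integral matters here (trapezoid rule, log-spaced)."""
    total = 0.0
    for i in range(npts):
        r0 = r_in * (r_out / r_in) ** (i / npts)
        r1 = r_in * (r_out / r_in) ** ((i + 1) / npts)
        total += 0.5 * (rm_integrand(r0, r_in) + rm_integrand(r1, r_in)) * (r1 - r0)
    return 0.81 * total

R_rel, R_B = 1.0, 1000.0   # R_B/R_rel ~ 10^3 for R_rel ~ 10^2 R_S, R_B ~ 10^5 R_S
full = rm_integral(R_rel, R_B)
inner = rm_integral(R_rel, 10.0 * R_rel)
print(f"first decade above R_rel contributes {inner / full:.0%} of the RM")
```

Roughly 90% of the RM accumulates within the first decade of radius above the inner cutoff, which is why the measured RM probes conditions near R_rel rather than near R_B.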
An independent but weak constraint on k comes from recent multi-wavelength
observations of flares in the emission from Sgr A*. Yusef-Zadeh et al. [97] favor an
interpretation in which these flares originate within regions where electrons have been
transiently heated and accelerated; using equipartition arguments, they estimate a
magnetic field strength B ∼ 13−15 G at 4−10 Schwarzschild radii, implying a total pressure
P > 20 dyn cm^−2 at those radii. Because P ∝ r^−(k+1), a comparison to the conditions at
R_B requires k > 0.6−0.8. This constraint could be violated if the emitting regions were
sufficiently over-pressured relative to the surrounding gas; however, the subsonic rate of
expansion inferred by [96] suggests this is not the case.
The density power law k is an important diagnostic, both because it allows one to
estimate the mass accretion rate onto the black hole, and because k takes definite values
within proposed classes of accretion flows. Bondi [19] accretion and ADAFs [advection-
dominated accretion flows, 65], in which gas undergoes a modified free fall, imply k =
3/2 and have long been ruled out [10] by limits on the rotation measure [20]. CDAFs
[convection-dominated accretion flows, 62; 77] and related flows like CDBFs [convection-
dominated Bondi flows, 42], in which convection carries a finite outward luminosity, all
have k = 1/2 outside some small radius: otherwise, convection becomes supersonic [38].
Three classes of flows are known to have intermediate values, 1/2 < k < 3/2, as
suggested by the observations. One of these is the ADIOS [advection-dominated inflow-
outflow solutions, 18], in which mass is lost via a wind from all radii within a rotating
ADAF; however these flows appear to require that low angular momentum material has
been removed from the axis. Another is a class of conductive flows, in which heat is
carried outward by electrons and stifles accretion at large radii [45]. A third consists of
flows which lack any significant outward convective or conductive luminosity [38], but are
nevertheless hydrostatic rather than infalling; this behavior is seen within some numerical
simulations in which magnetized gas is accreted, such as those of Igumenshchev et al. [43]
and Pen et al. [70], who termed the flow “magnetically-frustrated convection”.
We are concerned with the last flow class, as it is physically simple, realizable within
simulations, and consistent with observational constraints. Whether it is physically rel-
evant depends on the strength of conduction in the accretion flow, a question we return
to in § 2.5. Although it is of interest, previous simulations do not suffice to make any
quantitative comparisons between it and the Sgr A* accretion flow. Igumenshchev et
al. [43] have already discussed several shortcomings which afflicted prior numerical work,
such as (1) a lack of energy conservation during magnetic reconnection and (2) simulation
durations too short to capture steady states or secular trends. There are a number of
other roadblocks: (3) Dynamical range: RB is 105 Schwarzschild radii, but the largest
simulations yet done have only a factor of ∼ 102 separating their inner and outer bound-
aries; (4) Resolution: numerical solutions are rarely close enough to the continuum limit
to allow turbulent phenomena to be predicted with confidence; (5) Outer boundary con-
ditions: although matter is presumably fed into the accretion flow by stellar winds from
the nuclear star cluster [36], the flow structure and magnetization of this gas is not well
constrained; (6) Inner boundary conditions: the hole interacts with the flow in a manner
which is not fully characterized, and which is likely to dominate the energetics; (7) Mass
injection: stars within RB produce fresh wind material, which have the potential to affect
the final solution [54]; and (8) Plasma physics: close to the black hole, the flow is only
weakly collisional, leading to effects such as anisotropic pressure and conduction, which
may alter the nature of fluid instabilities and the character of heat transport [85]. Po-
tential deviations from ideal MHD become stronger as one approaches the event horizon,
and are discussed further in section 2.5.
In this paper we describe a numerical parameter survey designed to partially overcome
difficulties (1)-(5) in the above list, while making an educated guess regarding (7) and
leaving (6) and (8) to future work. Specifically, we conduct three dimensional, explicitly
energy conserving simulations to the point of saturation – often tens of dynamical times
at RB. We vary the dynamical range and resolution in order to gather information about
the astrophysical limits of these parameters, although they lie beyond our numerical
reach. We push numerical outer boundaries far enough from RB to minimize their effect
on the flow, and we vary the conditions exterior to RB in order to gauge the importance
of magnetization and rotation in the exterior fluid. Our simulations obey ideal MHD, but
are viscous and resistive on the grid scale for numerical reasons; we make no attempt to
capture non-ideal plasma effects. We do not account for stellar mass injection within the
simulation volume. Our gravity is purely Newtonian, and at its base we have a region of
accretion and reconnection rather than a black hole (although we are currently pursuing
relativistic simulations to overcome this limitation). Our numerical approach is described
more thoroughly below.
By varying the conditions of gas outside RB and by varying the allocation of grid
zones within RB we are able to disentangle, to some degree, physical and numerical
factors within our results. We also compute integrated quantities related to the value
and time evolution of RM, and draw conclusions regarding the importance of RM(t) as
a powerful discriminant between physical models.
We reiterate that our simulations have two simplifications which could substantially
change the behaviour. 1. Our black hole boundary condition is Newtonian. Since the
deepest potential dominates the dynamics and energy of the flow, a change in this as-
sumption might alter the solution. 2. We assume ideal MHD to hold. As one approaches
the black hole, the Coulomb collision rate is insufficient to guarantee LTE. Plasmas can
thermalize through other plasma processes, but if these fail, strong non-ideal effects could
dominate and lead to rapid conduction. These effects are both strongest at small radii,
potentially modifying the extrapolation to the actual physical parameters. We address
these issues in more detail in section 2.5.
2.2 Simulation detail
2.2.1 Physical setup and dimensionless physical parameters
We wish our simulations to be reasonably realistic with regard to the material which
accretes onto the black hole, but also easily described by a few physical and numerical
parameters. We therefore do not treat the propagation and shocking of individual stellar
winds or turbulent motions, but take the external medium to be initially of constant
density ρ0 and adiabatic sound speed cs0, and imbued with a characteristic magnetic field
B0 and characteristic rotational angular momentum j0 (but no other initial velocity). A
Keplerian gravity field −GM/r2 accelerates material toward a central “black hole” of
mass M surrounded by a central accretion zone. The Bondi accretion radius is therefore
R_B = GM / c_s0^2 .    (2.1)
We adopt the Bondi time t_B = R_B/c_s0 as our basic time unit; this is 100 years for
the adopted conditions at Sgr A*. All of the initial flow quantities evolve during the
course of the simulation, and we run for many Bondi times in order to allow the accretion
flow to settle into a final state quite different from our initial conditions. From the above
dimensional quantities we define several dimensionless physical parameters.
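These scales can be checked against the observational numbers quoted earlier (M_BH ≃ 4.3 × 10^6 M⊙, kT ≃ 2 keV). In the sketch below, the mean molecular weight µ and the use of the isothermal sound speed are our own illustrative assumptions; they roughly reproduce R_B ≃ 0.05 pc and t_B ≃ 100 yr:

```python
import math

# CGS constants and assumed Sgr A* conditions (M and kT from the text;
# mu and the isothermal sound speed are assumptions for this estimate).
G = 6.674e-8                  # cm^3 g^-1 s^-2
Msun = 1.989e33               # g
m_p = 1.673e-24               # g
keV = 1.602e-9                # erg
pc = 3.086e18                 # cm
yr = 3.156e7                  # s

M = 4.3e6 * Msun
kT = 2.0 * keV
mu = 0.62                     # assumed mean molecular weight
cs = math.sqrt(kT / (mu * m_p))   # isothermal sound speed; an adiabatic
                                  # factor gamma would shrink R_B by ~40%

R_B = G * M / cs ** 2
t_B = R_B / cs
print(f"R_B ~ {R_B / pc:.3f} pc, t_B ~ {t_B / yr:.0f} yr")
```

The result lands within tens of percent of the quoted R_B ≃ 0.053 pc and t_B = 100 yr, with the residual difference controlled by the assumed µ and adiabatic factor.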
The adiabatic index is γ = 5/3; the initial plasma-β parameter, or ratio of gas to
magnetic pressure, is

β_0 = 8πγρ_0 c_s0^2 / B_0^2 ;    (2.2)

we consider models with β_0 = (1, 10, 100, 1000, ∞) to capture a wide range of plausible
magnetizations. In our main sequence of simulations we adopt a uniform magnetic field
B_0.[1]
The initial velocity field is v_0 = (j_0 × r)/r, where r is the separation from the black
hole. The specific vector angular momentum is thus j_0 at the rotational equator, with
solid-body rotation on spherical shells away from the equator. A dimensionless rotation
parameter is therefore

R_K / R_B = ( j_0 c_s0 / GM )^2 ;    (2.3)

here R_K = j_0^2/(GM) is the Keplerian circularization radius of the equatorial inflow. (Our
flows never do circularize at R_K, both because angular momentum transport alters the
distribution of j, and because gas pressure can never be neglected.)
We impose mass accretion and magnetic field reconnection within a zone of char-
acteristic radius Rin, described below, which introduces the dynamic range parameter
[1] We also investigated scenarios with Gaussian random field components, in which the dominant
wavelengths were some multiple of R_B; however, we abandoned these, as such fields decay on an Alfven
crossing time, confounding our attempts to quantify the accretion flow, and we did not wish to add a
turbulent driver to maintain a steady state.
RB/Rin. Because it sets the separation between small and large scales and the maximum
depth of the potential well, this ratio has a strong influence on flow properties. One of
our goals is to test how well the flow quantities at high dynamic range can be predicted
from simulations done at lower dynamic range, as the dynamic range appropriate to Sgr
A* is beyond what we can simulate.
2.2.2 Grid setup and numerical parameters
We employ a fixed, variable-spacing Cartesian mesh in which the grid spacing increases
with distance away from the black hole. To simplify our boundary conditions, we hold
the spacing fixed within the inner accretion zone and near the outer boundary. The total
box size is 4000^3 in units of the minimum grid spacing; however, this is achieved within a
numerical grid of only 300^3 to 600^3 zones. Our grid geometry allows for a large number
of long-duration runs to be performed at respectable values of the dynamic range, while
avoiding coordinate singularities and resolution boundaries. These advantages come at
the cost of introducing an anisotropy into the grid resolution; however we have tested the
code for conservation of angular momentum and preservation of magnetosonic waves, and
found it to be comparable in accuracy to fixed-grid codes with the same resolution. Our
grid expansion factor s = δdxi/dxi takes one value for xi < RB and another, larger value
for xi > RB; this allows us to devote most of our computational effort to the accretion
region of interest, while also pushing the (periodic) outer boundary conditions far away
from this region. The inner expansion factor sin is therefore an important numerical
parameter, related to both the grid’s resolution and its anisotropy where we care most
about the flow.
Within our inner accretion region, magnetic fields are reconnected (relaxed to the
vacuum solution consistent with the external field, see appendix B) and mass and heat
are adjusted (invariably, removed) so that the sound speed and Alfven velocity both
match the Keplerian velocity at RB. The accretion zone is a cube, whose width we hold
fixed at 15 in units of the local (uniform) grid separation, so we define Rin = 7.5 dxmin
(but note, the volume of this region is equivalent to a sphere of radius 9.3dxmin.) We
consider it too costly to vary the numerical parameter Rin/dxmin.
Our grid geometry imposes a local dimensionless resolution parameter

ℜ ≡ r / max_i(dx_i)    (2.4)

(the maximum being over coordinate directions), which depends both on radius and on
angle within the simulation volume. At the inner boundary ℜ ≃ 7.5−9.3; ℜ increases
to nearly s_in^−1 ≃ 10^2 at R_B, then decreases toward s_out^−1 in the exterior region. In §2.3 we
report the effective resolution at the Bondi radius, ℜ_B = ℜ(R_B), along with our results.
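The reach of such a variable-spacing mesh can be illustrated with a one-dimensional toy model. The 15-cell uniform inner region and the expansion factor s_in ≈ 0.013 below are taken from the runs described in Table 2.1; the larger outer expansion factor beyond R_B is omitted for simplicity:

```python
def expanding_grid_halfwidth(n_cells, n_uniform, s):
    """Half-width (in units of dx_min) spanned by n_cells cells along one
    axis: the first n_uniform cells are uniform (dx = 1), after which each
    cell is a factor (1 + s) wider than the last, a sketch of the
    variable-spacing Cartesian mesh."""
    width, dx = 0.0, 1.0
    for i in range(n_cells):
        if i >= n_uniform:
            dx *= 1.0 + s
        width += dx
    return width

# With a modest expansion factor, 300 cells per half-axis reach an order
# of magnitude farther than a uniform grid with the same cell count:
uniform = expanding_grid_halfwidth(300, 300, 0.0)        # 300 dx_min
expanded = expanding_grid_halfwidth(300, 15, 0.013)
print(f"uniform: {uniform:.0f} dx_min, expanded: {expanded:.0f} dx_min")
```

This tenfold gain in reach per axis at fixed zone count is what lets a 300^3 to 600^3 grid cover a 4000^3 dx_min box while concentrating resolution near the accretion zone.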
2.3 Simulations and results
Our suite of simulations is described in Table 2.1, along with some selected results. We
independently varied the magnetization, rotation, and dynamic range of the flow, as well
as the effective resolution at RB. In order to suppress the lingering effects of our initial
conditions, we ran each simulation for long enough that a total mass equivalent to all the
matter initially within RB was eventually accreted, before assessing the flow structure.
Because most of our runs exhibited a significant suppression of the mass accretion rate Ṁ
relative to the Bondi value, this constraint required us to simulate for many t_B (typically
20 t_B). This requirement put strenuous demands on our simulations (each of which
required ∼ 3 weeks to complete), and will be a serious limitation on any future simulations
performed at higher dynamic range.
Run | R_B/dx_min | R_B/R_in | 1+s_in | ℜ_B | β_0 | R_K/R_B | t_sim/t_B | Ṁ/Ṁ_Bondi | k_eff (2)
1 | 500 | 67 | 1.023 | 40.15 | ∞ | 0 | 8 | 1.02 | 1.5047
2 | 250 | 33 | 1.013 | 59.29 | ∞ | 0 | 3 | 1.10 | 1.5273
3 | 125 | 17 | 1.013 | 48.11 | 100 | 0 | 6-20 | 0.49 | 1.2482
4 | 250 | 33 | 1.013 | 59.29 | 100 | 0 | 6-20 | 0.31 | 1.1650
5 | 500 | 67 | 1.023 | 40.15 | 100 | 0 | 6-20 | 0.22 | 1.1399
6 | 1000 | 133 | 1.0315 | 30.82 | 100 | 0 | 6-10 | 0.16 | 1.1253
7 | 250 | 33 | 1.013 | 59.29 | 1 | 0 | 6-20 | 0.15 | 0.9574
8 | 250 | 33 | 1.013 | 59.29 | 10 | 0 | 6-20 | 0.26 | 1.1147
9 | 250 | 33 | 1.013 | 59.29 | 1000 | 0 | 6-20 | 0.40 | 1.2379
10 | 250 | 33 | 1.013 | 59.29 | 100 | 0.1 | 6-20 | 0.289 | 1.1450
11 | 250 | 33 | 1.013 | 59.29 | 100 | 0.5 | 6-20 | 0.286 | 1.1420
12 | 250 | 33 | 1.013 | 59.29 | 100 | 1.0 | 6-20 | 0.31 | 1.1650
13 (3) | 62.5 | 33 | 1.06 | 14.24 | 100 | 0 | 6-20 | 0.30 | 1.1557
14 (4) | 125 | 33 | 1.037 | 28.94 | 100 | 0 | 6-20 | 0.33 | 1.1829
15 | 250 | 33 | 1.013 | 59.29 | ∞ | 0.1 | 6-20 | 0.615 | 1.3610
16 | 250 | 33 | 1.013 | 59.29 | ∞ | 0.5 | 6-20 | 0.621 | 1.3637
17 | 250 | 33 | 1.013 | 59.29 | ∞ | 1.0 | 6-20 | 0.759 | 1.4211
18 | 250 | 33 | 1.013 | 59.29 | 1000 | 0.1 | 6-20 | 0.400 | 1.2379
19 (5) | 250 | 33 | 1.013 | 59.29 | 1000 | 0.1 | 6-20 | 0.469 | 1.2835
20 | 250 | 33 | 1.013 | 59.29 | 100 | 0.1 | 6-20 | 0.300 | 1.1557
21 | 250 | 33 | 1.013 | 59.29 | 10 | 0.1 | 6-20 | 0.233 | 1.0834
22 | 250 | 33 | 1.013 | 59.29 | 1 | 0.1 | 6-20 | 0.188 | 1.0220
23 | 250 | 33 | 1.013 | 59.29 | 100 | 0 | 6-20 | 0.340 | 1.1915
24 (6) | 500 | 67 | 1.0315 | 31.65 | 100 | 0.1 | 6-20 | 0.18 | 1.2434
25 (7) | 1000 | 58.9 | 1.015 | 64 | 100 | 0.1 | 6-20 | 0.19 | 1.0925

(2) Values are taken from Equation 2.5.
(3) Case of 75^3 grid resolution.
(4) Case of 150^3 grid resolution.
(5) Runs 19-23: B field is along the [0 0 1] axis.
(6) Runs 24-25: B field is along the [1 2 0] axis.
(7) Case of 600^3 grid resolution.
Chapter 2. Black hole accretion 29
Table 2.1: Simulations described in this paper. Columns:
Run number; Maximum resolution relative to the Bondi
radius; Radial dynamic range within RB; grid expansion
factor within RB; effective resolution at RB; magnetiza-
tion parameter; rotation parameter; range of simulation
times over which flow properties were measured; mean
mass accretion rate over this period; and typical density
power law slope (ρ ∝ r−k) over this period.
2.3.1 Character of saturated accretion flows
Figure 2.1 shows 2D slices from the simulation in our highest-resolution 600^3 box at 15 Bondi times (case 25); movies are also available in various formats at http://www.cita.utoronto.ca/~pen/MFAF/blackhole_movie/index.html. The remaining figures are all based on case 10, which is the most representative of the whole set of simulations. Figures 2.2 and 2.3 display spherically-averaged properties: Figure 2.2 shows the spherically-averaged density of the run, while Figure 2.3 shows the spherically-averaged radial velocity, β, and entropy (normalized to the Bondi entropy). The entropy inversion is clearly visible; it leads to the slow, magnetically frustrated convection.

We draw several general conclusions from the runs listed in Table 2.1:

- In the presence of magnetic fields, the flow develops a super-adiabatic temperature gradient and flattens to k ∼ 1. Gas pressure remains the dominant source of support at all radii, although magnetic forces are always significant at the inner radius.

- Mass accretion diminishes with increasing dynamic range, taking values M ≃ (2−4) MB (Rin/RB)^{3/2−k}.
- Even significant rotation at the Bondi radius has only a minor impact on the mass
accretion rate, as the flows do not develop rotationally supported inner regions.
- Our results depend only weakly on the effective resolution ℜB.
- In the absence of magnetic fields and rotation, a Bondi flow develops. ([70] further
demonstrated a reversion to Bondi inflow if magnetic fields are suddenly eliminated;
we have not repeated this experiment.)
2.3.1.1 Lack of rotational support
The non-rotating character of the flow casts some doubt on models which depend on
equatorial inflow and axial outflow. Our nonrelativistic simulations cannot rule out an
axial outflow from a spinning black hole, but they certainly show no tendency to develop
rotational support in their inner regions, even after many tens of dynamical times. In
a rotating run, angular momentum is important at first in preventing the accretion of matter from the equator. Axial, low-j material does accrete, but some of it shocks and drives an outflow along the equator (as reported by [70] and [75]). After a few tB this quadrupolar flow disappears, leaving behind a nearly hydrostatic, slowly rotating envelope which persists for our entire simulation time, i.e. tens of tB. We attribute
the persistence of this rotational profile to magnetic braking, as the Alfven crossing time
of the envelope is always shorter than its accretion time. Magnetic fields thus play a role
here which is rather different than in simulations which start from a rotating torus, where
the magneto-rotational instability is the controlling phenomenon; the critical distinction
is the presence of low-angular-momentum gas.
Unlike compact object disks, which accrete high-angular-momentum material and are
guaranteed to cool in a fraction of their viscous time, the GCBH feeds upon low-angular-
momentum matter, and its accretion envelope cannot cool. For both of these reasons
Figure 2.1: 2D slices of the 600^3 simulation at 15 Bondi times. Colour represents the entropy, and arrows represent the magnetic field vector. The right panel is the equatorial plane (yz), while the left panel is a perpendicular slice (xy). White circles represent the Bondi radius (rB = 1000). The fluid is slowly moving, in a state of magnetically frustrated convection. A movie of this flow is available in the supporting information section of the electronic edition.
[Figure 2.2 plot: log10(ρ) versus log10(R/RB), with arrows marking the inner boundary and the Bondi radius.]
Figure 2.2: Density versus radius. The dotted line represents the density profile of the Bondi solution, the steepest plausible slope, with k = 1.5. The dashed line represents the density scaling of the CDAF solution, the shallowest proposed slope, with k = 0.5. The solid line is the density profile from one of our simulations, which is intermediate between the two.
[Figure 2.3 plot: log10(β), radial velocity, and entropy versus log10(R/RB), with arrows marking the inner boundary and the Bondi radius.]
Figure 2.3: log(β), entropy, and radial velocity versus radius. The dashed line represents the radial velocity vr/cs in units of the sonic Mach number; the dots represent vr/cms, the radial velocity in units of the magnetosonic Mach number. The solid line is the entropy; the entropy inversion which leads to the slow, magnetically frustrated convection is visible. Inside the inner boundary the sound speed is lowered, leading to the lower entropy. The + symbols show the plasma β.
it is not surprising to discover a thick, slowly rotating accretion envelope rather than a
thin accretion disk. We stress that global simulations, which resolve the Bondi radius
and beyond and continue for many dynamical times, are required to capture the physical
processes which determine the nature of the flow.
2.3.1.2 Dependence on parameters; Richardson extrapolation
We now investigate whether our results for the accretion rate can be distilled into a single,
approximate expression. It is clear from the results in Table 2.1 that rotation affects the
accretion rate in a non-monotonic fashion. However, as we have just noted, rotation plays a minor role in our final results, so we are justified in fitting only the non-rotating runs. Rather than M/MBondi, we fit an effective density slope keff, defined by
    M/MBondi = (Rin/RB)^{3/2−keff}.   (2.5)
There are three major variables: the magnitude of the ambient magnetic field (β0), the radial dynamic range (RB/Rin), and the resolution of the Bondi scale (ℜB). Our fit is

    keff = 1.50 − 0.56 β0^{−0.098} + 6.51 (RB/Rin)^{−1.4} − 0.11 ℜB^{−0.48};   (2.6)
all seven numerical coefficients and exponents were optimized against the 25 runs in Table 2.1. The form of equation (2.6) is significantly better than others we tested, including forms involving log(RB/Rin) and log(ℜB). It predicts the entries in Table 2.1 to within a root-mean-square error of only 0.017.
Somewhat unexpectedly, this nonlinear fit to our simulation output recovers the Bondi
solution in the continuum, unmagnetized limit (keff → 3/2 as β0 → ∞, RB/Rin → ∞,
and ℜB → ∞). Moreover, the form of the expression allows us to extrapolate, in the manner of Richardson extrapolation, to the conditions we expect to be relevant to Sgr A*: ℜB → ∞, RB/Rin ∼ 10^5, and β0 ∼ 1−5; this gives keff ∼ 0.94−1.0.
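As a sanity check, the fit and the quoted extrapolation can be reproduced in a few lines; a minimal Python sketch (the coefficients are those of equation 2.6, and the Sgr A* inputs are the ones quoted above):

```python
import numpy as np

def k_eff(beta0, rb_over_rin, res_b):
    """Effective density slope from the fit of equation (2.6).

    beta0       : ambient magnetization parameter (np.inf = unmagnetized)
    rb_over_rin : radial dynamic range R_B / R_in
    res_b       : effective resolution at the Bondi radius (np.inf = continuum)
    """
    return (1.50
            - 0.56 * beta0 ** -0.098
            + 6.51 * rb_over_rin ** -1.4
            - 0.11 * res_b ** -0.48)

# Continuum, unmagnetized limit recovers the Bondi slope k = 3/2:
print(k_eff(np.inf, np.inf, np.inf))    # -> 1.5

# Extrapolation to Sgr A* conditions (R_B/R_in ~ 1e5, resolved Bondi scale):
print(k_eff(1.0, 1e5, np.inf))          # ~ 0.94
print(k_eff(5.0, 1e5, np.inf))          # ~ 1.02
```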
It is encouraging that this result lies in the vicinity of observational constraints,
lending additional credence to the notion that Sgr A* is surrounded by a “magnetically-frustrated” accretion flow. We must recall, however, that this is only an extrapolation
based on simulations which lack potentially important physics such as a relativistic in-
ner boundary and a non-ideal plasma. The absence of an imposed outward convective
luminosity is likely to be the essential element which allows for a lower value of k.
2.4 Rotation measure
The magnitude of RM constrains the density of the inner accretion flow, thereby also
constraining the mass accretion rate and power law index k. Future observations should
provide time series of RM(t), a rich data set which encodes important additional infor-
mation about the nature of the flow. Our goal in this section will be to characterize RM
variability within our own simulations sufficiently well to distinguish them from other
proposed flow classes.
We pause first to consider why RM should vary at all. The rotation of polarization is determined by an integral (eq. A.1, [86]) proportional to ∫ ne B · dl over the zone of nonrelativistic electrons. The integral is typically dominated by conditions
at Rrel, the radius where kTe = me c^2. Even if ne is reasonably constant, B likely will
change in magnitude and direction as the flow evolves. Given that the dynamical time
at Rrel is under a day, any strongly convective flow should exhibit significant day-to-day
fluctuations in RM; measurements by [58] appear to rule this out. Rotational support
also implies rapid RM fluctuations unless B is axisymmetric. In the highly subsonic
flow of magnetically-frustrated convection, however, RM may vary on much longer time
scales.
Two proposals have been advanced in which RM(t) would be roughly constant.
Within their simulations of thick accretion disks, [84] show that trapping of poloidal
flux lines leads to a rather steady value of RM for observers whose lines of sight are
out of the disk plane. [85] point to the constancy of RM in the steady, radial magnetic
configuration which develops due to the saturation of the magneto-thermal instability
(in the presence of anisotropic electron conduction). We suspect that noise at the dy-
namical frequency is to be expected in both these scenarios, which need not exist in a
magnetically frustrated flow. We also note that both scenarios lead to systematically low
values of RM for a given accretion rate, and therefore imply somewhat higher densities
than we inferred from a spherical model; this may be observationally testable.
Our calculation of RM(t) is based on case 10 in Table 2.1. In Figure 2.4 we plot
RM(t) against an analytical estimate of its magnitude. For this purpose we estimate RM
as,
    RM ≡ [e^3 / (2π m_e^2 c^4)] ∫_{Rrel}^{RB} n_e B dr,   (2.7)

integrated along radial rays (two per coordinate axis) through the simulation volume. We neglect the difference between this expression and one which accounts for the relativistic nature of electrons within Rrel. We therefore normalize RM to the estimate RMest,

    RMest = [e^3 / (2 c^4 m_e^2)] [G M Rrel μe ne(Rrel)^3 / (11π)]^{1/2},   (2.8)
given by equation (A.5) with F (k, kT ) → 1, 〈cos(θ)〉 → 1/2, β → 10, and k → 1. Because
we do not calculate electron temperature within our simulations, we have the freedom to
vary Rrel and to probe the dependence of coherence time on this parameter. In practice
we chose Rrel = (17, 26, 34, 43)δxmin in order to separate this radius from the accretion
zone (7.5 δxmin) and the Bondi radius (250 δxmin in this case). Figure 2.4 illustrates RM(t) along each coordinate axis for the case Rrel = 17 δxmin. As this figure shows, RM changes slowly and its amplitude agrees with our estimate RMest. In our simulations, we
can measure the full PDF, shown in figure 2.5.
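The integral of equation (2.7) is straightforward to evaluate numerically along a discretized ray; a minimal cgs sketch, with purely illustrative power-law profiles (the radii, densities, and fields below are hypothetical, not taken from the simulations):

```python
import numpy as np

# Physical constants in cgs units.
e_cgs = 4.803e-10      # electron charge [esu]
m_e   = 9.109e-28      # electron mass [g]
c     = 2.998e10       # speed of light [cm/s]

# Prefactor of equation (2.7); in cgs it evaluates to ~2.63e-17.
prefactor = e_cgs**3 / (2.0 * np.pi * m_e**2 * c**4)

def rotation_measure(r, n_e, B_par):
    """Trapezoidal evaluation of RM = prefactor * int n_e B_par dr (eq. 2.7).

    r     : radii [cm], increasing from R_rel to R_B
    n_e   : electron density at each radius [cm^-3]
    B_par : line-of-sight magnetic field at each radius [G]
    """
    f = n_e * B_par
    return prefactor * 0.5 * np.sum((f[1:] + f[:-1]) * np.diff(r))

# Illustrative (hypothetical) k = 1 profiles: n_e ~ r^-1 and B ~ r^-1.
r   = np.logspace(13.0, 16.5, 2000)                 # cm
n_e = 1.0e7 * (r / r[0]) ** -1.0
B   = 1.0e-2 * (r / r[0]) ** -1.0
rm_rad_per_m2 = rotation_measure(r, n_e, B) * 1e4   # rad cm^-2 -> rad m^-2
```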
We can ask how well a single measurement of RM constrains the characteristic RM,
say the ensemble-averaged root-mean-square value RMrms. This is a question of how
well a standard deviation is measured from a single observation. From figure 2.5 we
see that the distribution in our simulations is roughly Gaussian with standard deviation
[Figure 2.4 plot: RM/RMest versus t/tB.]

Figure 2.4: Rotation measure versus time (in units of tB), for Rrel = 17 δxmin, corresponding to Rrel/RB = 0.068. Six lines represent the three axes: the upper set is X (centered at +3), the center is Y (centered at 0), and the lower is Z (centered at -3), with positive and negative directions drawn as solid and dashed lines, respectively.
[Figure 2.5 plot: RM distribution versus (RM/RMest)/σRM.]

Figure 2.5: PDF of the RM in Figure 2.4. The dashed line represents a Gaussian distribution. The horizontal axis has been normalized by the standard deviation in Figure 2.4, σRM = 0.63.
Figure 2.6: The rotation measure integrand ρBr versus radius and time. The central dark bar represents the inner boundary; the vertical axis is the Z axis. The horizontal axis is time, in units of tB. Greyscale represents sign(Br) (ρ|Br|)^{1/4}, scaled to be more visually accessible. The coherence time is longer at large radii and at late times. Several Bondi times are needed to reach the steady-state regime.
[Figure 2.7 plot: autocorrelation versus log10(tlags/tB).]

Figure 2.7: Autocorrelation of the RM(t) curves of Figure 2.4. The horizontal axis is the time lag; the vertical axis is the autocorrelation for different Rrel. The dotted, dashed, dash-dotted, and solid lines correspond to Rrel = 43, 34, 26, and 17 δxmin, respectively.
[Figure 2.8 plot: log10(tlags/tB) versus log10(Rrel/RB); the straight line is the best fit with fixed slope 2.]
Figure 2.8: RM coherence time τ as a function of the inner truncation radius Rrel; points refer to Rrel = 17, 26, 34, and 43 δxmin. The bootstrap error of 0.17 dex is based on the six data points, two for each coordinate direction, at each Rrel. The normalization for Rrel = RB is log10(tlags/tB) = 2.15.
σRM = 0.63 RMest. One needs to apply Bayes’ Theorem to infer the variance of a Gaussian from N independent measurements:

    ΔRMrms = (2/N)^{1/2} σRM.   (2.9)
To date, no sign change in RM has been observed, suggesting that we have only one independent measurement. Estimating RMrms from a single data point requires a Bayesian inversion. Using the distribution from our simulation with a flat prior, the 95% confidence interval for the ensemble characteristic RM given this one data point spans two orders of magnitude!
In other words, if in fact RMrms = 5.4 × 10^6, it is not very surprising that we have observed RM ≃ −5.4 × 10^5. The maximum likelihood estimate is RMrms = RM. The 95% upper bound is RMrms = 33 RM, and the lower bound is RMrms = 0.33 RM. More data are essential to constrain this very large uncertainty.
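The inversion can be sketched numerically. Here we assume a Jeffreys (1/σ) prior as one concrete reading of the flat prior, so the resulting interval is illustrative and need not reproduce the quoted 0.33-33 bounds exactly:

```python
import numpy as np

def sigma_interval_one_point(x, level=0.95):
    """Credible interval for the standard deviation sigma of a zero-mean
    Gaussian given a single draw x, under an assumed Jeffreys prior
    p(sigma) ~ 1/sigma (one reading of a 'flat' prior)."""
    sig = np.logspace(-3, 3, 200000) * abs(x)
    # log posterior = log prior + log likelihood = -2 log(sigma) - x^2/(2 sigma^2)
    logp = -2.0 * np.log(sig) - x**2 / (2.0 * sig**2)
    p = np.exp(logp - logp.max())
    cdf = np.cumsum(p * np.gradient(sig))
    cdf /= cdf[-1]
    lo = sig[np.searchsorted(cdf, (1.0 - level) / 2.0)]
    hi = sig[np.searchsorted(cdf, 1.0 - (1.0 - level) / 2.0)]
    return lo, hi

lo, hi = sigma_interval_one_point(1.0)
# The interval spans roughly two orders of magnitude (about 0.45 to 32 here),
# and the likelihood alone is maximized at sigma = |x|, i.e. RM_rms = RM.
```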
A visual description of the RM integrand through the flow is shown in Figure 2.6. The variability time scale is shorter at small radii and shorter at the beginning of the simulation. Simulations lasting many Bondi times, with boundaries many Bondi radii away, are necessary to see the characteristic flow patterns.
To be more quantitative, we plot in Figure 2.7 the autocorrelation of RM(t) for
different Rrel. We define the coherence time τ to be the lag at which the autocorrelation
of RM falls to 0.5.
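A minimal sketch of this diagnostic (the sine-wave input below is only a sanity check, not simulation data):

```python
import numpy as np

def coherence_time(x, dt=1.0):
    """Lag at which the normalized autocorrelation of x first falls below 0.5."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[x.size - 1:]   # lags 0..N-1
    acf = acf / acf[0]
    below = np.nonzero(acf < 0.5)[0]
    return below[0] * dt if below.size else np.nan

# Sanity check: the autocorrelation of sin(w t) is ~cos(w tau), which first
# falls to 0.5 at tau = T/6 (T = period).
T = 200                                   # samples per period (arbitrary)
t = np.arange(20 * T)
tau = coherence_time(np.sin(2.0 * np.pi * t / T))
# tau is close to T/6 ~ 33 samples
```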
The actual RM radius Rrel is not resolved in our simulations. In order to extrapolate to physically interesting regimes, we fit a trend to our limited dynamic range. The characteristic variability time scale is set by the flow speed, so τ ∝ Rrel^3 ρ(Rrel)/M. For our characteristic value k ∼ 1, we have τ ∝ Rrel^2, which we fit to our coherence time, as shown in Figure 2.8.
For density profiles shallower than Bondi, the characteristic RM time scale τ is significantly longer than the dynamical time, ∼ (Rrel/RB)^{3/2} tB. In our fit, τ is given by the accretion time

    τ ∼ 20 (Rrel/RB)^2 (Rin/RB)^{−1/2} tB,   (2.10)
with a relatively large dimensionless prefactor. This indicates a coherence time of order
one year for the conditions at Sgr A*. The actual value of Rrel is uncertain by a factor
of 6, so the expected range could be two months to a year.
This is sufficiently distinct from one day that the difference between frustrated and dynamical flows should be readily apparent once observations span year-long baselines. We will discuss this point further in Section 2.6 below.
2.5 Discussion
In this section we wish to revisit several of the physical processes which are missing from
the current numerical simulations: stellar winds from within RB, the transport of en-
ergy and momentum by nearly collisionless electrons, and the inner boundary conditions
imposed by a central black hole.
Stellar wind input. Our simulations account for the accretion of matter from outside
the Bondi radius inferred from X-ray observations, but not for the direct input of matter
from individual stars in the vicinity of the black hole. [54] raises the possibility that
individual stars may in fact dominate the accretion flow. The wind from a single star at
radius r dominates the flow when its momentum output Mw vw satisfies

    Mw vw > 4π r^2 p(r) → 3.3 (10^{−5} M⊙ yr^{−1}) (1000 km s^{−1}),   (2.11)

where the evaluation is for a model consistent with RM constraints, in which the density follows nH ≃ 10^{7.3} (r/RS)^{−1} cm^{−3} and the pressure follows p ≃ 10^{4} (r/RS)^{−2} dyn cm^{−2}; note that the criterion is independent of radius for k = 1. The required momentum output, equivalent to 10^{6.2} L⊙/c, is well above the wind force of any of the OB stars observed within RB. While stars within RB add fresh matter faster than it is accreted by the
hole, we can be confident that no single star dominates the flow. If the density slope were substantially shallower, as for example in a CDAF with k = 1/2, stellar winds would be a more important factor.
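The numbers quoted above can be checked to order of magnitude in a few lines; we assume a 4 × 10^6 M⊙ black hole, which is not stated in this section:

```python
import numpy as np

# cgs constants and an assumed 4e6 Msun black hole (an illustrative
# assumption -- the mass is not given in this section).
G    = 6.674e-8          # cm^3 g^-1 s^-2
c    = 2.998e10          # cm/s
Msun = 1.989e33          # g
yr   = 3.156e7           # s
R_S  = 2 * G * (4.0e6 * Msun) / c**2        # Schwarzschild radius, ~1.2e12 cm

def confinement_force(r):
    """4 pi r^2 p(r) for the quoted k = 1 profile p ~ 1e4 (r/R_S)^-2 dyn cm^-2.
    Independent of r, as the text asserts."""
    p = 1.0e4 * (r / R_S) ** -2.0
    return 4.0 * np.pi * r**2 * p            # dyn

# Wind momentum output on the right-hand side of (2.11):
mdot_v = 3.3 * (1.0e-5 * Msun / yr) * 1.0e8  # dyn, ~2e29

# confinement_force(r) ~ 1.8e29 dyn at every radius, comparable to mdot_v.
```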
Collisionless transport. In the context of a dilute plasma where Coulomb collisions
are rare, electron thermal conduction has the potential to profoundly alter the flow pro-
file. The importance of this effect depends on the electrons’ ability to freely stream down
their temperature gradient [85], despite the wandering and mirroring induced by an in-
homogeneous magnetic field. The field must be weak for the magneto-thermal instability
to develop, yet weak fields are less resistant to tangling. Thermal conduction is expected to be strongest in the deep interior of the flow. If electrons actually free stream
inside of 1000 Schwarzschild radii, the electrons could be non-relativistic all the way to
the emission region, changing the interpretation of the RM. This would favour even shal-
lower density profiles, for example the CDAF models. In such a model, the RM might
be expected to vary on time scales of minutes, which appears inconsistent with current
data. If, on the other hand, the free streaming length is short on the inside, it more likely
places the fluid in an ideal regime for the range of radii in our simulations. We therefore
remain agnostic as to the role of thermal conduction in hot accretion flows, although it
remains a primary caveat of the current study. Observations of the time variability of RM will substantially improve our understanding.
Black hole inner boundary. Our current inner boundary conditions do not resemble
a black hole very closely, apart from the fact that they also allow gas to accrete. As the
inner region dominates the energetics of the flow, we consider it critical to learn how the
black hole modifies our results. We are currently engaged in a follow-up study with a
relativistic inner boundary, to be described in a future paper.
2.6 Observational Probes
RM can be measured by several techniques. Currently, efforts have concentrated at high frequencies, ν ∼ 200−300 GHz [57], where the polarization angle varies slowly with frequency. Accurate measurements over long time baselines allow discrimination between models. At high frequencies, the SMA and ALMA would allow a steady synoptic monitoring program. The full two-point correlation function in time extends the measurement space by one dimension.
At lower frequencies, high spectral resolution is needed to resolve the winding rate, which is now tractable with broad-band, high-resolution instruments such as the EVLA and ATCA/CABB. The higher winding rate would allow a much more sensitive measurement of small changes in the RM, which would also be a discriminant between models. The challenge here is that the polarization fraction drops significantly with frequency, requiring a more accurate instrumental polarization model. On the other hand, the very characteristic λ^2 dependence of RM should allow a robust rejection of instrumental effects.
At lower frequencies, the spatial extent of the emission region is also expected to increase. When the emission region approaches the rotation measure screen, one expects depolarization. Direct polarized VLBI imaging could shed light on this matter, though it is complicated by interstellar scattering, which also increases the apparent angular size. The changing emission location as a function of frequency may complicate the RM inferences [32; 24], and can skew the actual value of the inferred RM, resulting in an underestimate. The sign of RM would generically be a more robust quantity, and looking for changes in the sign of RM could be a proxy for the correlation function.
A separate approach is to use other polarized sources as probes of the flow. One candidate population is pulsars. At the Galactic Center, interstellar scattering (Lazio
2000) smears out the pulses, making them difficult to detect directly, but the pulse-averaged flux should still be present. Over a pulsar's orbit around the black hole, one can measure the time variation of the RM, probing the spatial RM variations in the accretion flow. Some pulsars, such as the Crab, exhibit giant pulses, which could still be visible despite the scattering-delay smearing; these could be used to measure the dispersion measure (DM) along the orbit. The GMRT at 610 MHz would have optimal sensitivity for detecting the non-pulsed emission from pulsars, and would be able to deconfuse them from the dominant synchrotron emission using rotation measure synthesis [22].
2.7 Summary
A series of new, large-dynamical-range, secular MHD simulations is presented to help understand the low luminosity of the supermassive black hole in the Galactic Center. These are the first global 3-D MHD simulations which avoid imposing boundary conditions at outer radii, impose ingoing boundaries only at the interior, and run for many Bondi times. We confirm a class of magnetically frustrated accretion flows whose bulk properties are only weakly dependent on physical and numerical parameters, including resolution, rotation, and magnetic fields. No significant rotational support or outward flow is observed in our simulations. An extrapolation formula is proposed, and the resulting accretion rate is consistent with observational data.
A promising probe of the nature of the accretion flow is the rotation measure and its time variability. In this comparison, the dominant free parameter is the electron temperature. We argued that over the plausible range, from thermal to adiabatic, the relativistic radius Rrel varies from 40 to 250 Schwarzschild radii. The RM variations in the simulations are intermittent, requiring many measurements to determine this last free parameter.
We propose that temporal rotation measure variations are a generic prediction to
distinguish between the wide variety of theoretical models currently under consideration,
ranging from CDAF through ADIOS to ADAF. RM is dominated by the radius at which
electrons turn relativistic, when the flow is still very subrelativistic, and is thus much
further out than the Schwarzschild radius. Most models, other than the ones found in our simulations, involve rapidly flowing plasmas with Mach numbers near unity. These generically result in rapid RM variations on time scales of hours to weeks (or, in special cases, an infinite time scale). In contrast, our simulations predict variability on time scales of weeks to years. A major uncertainty in this prediction is the poorly constrained standard deviation of the RM, which long-term RM monitoring is required to quantify.
Future observations of RM time variability, or spatially resolved measurements using
pulsars, will provide valuable information.
Chapter 3
Fast magnetic reconnection
3.1 Introduction
In ideal hydrodynamics, irreversible processes, such as shock waves and vorticity recon-
nection, occur at dynamical speeds, independent of microscopic viscosity parameters.
Weak solutions describe these irreversible discontinuous solutions of the Euler equations.
While smooth flows conserve entropy and vorticity, the infinitesimal discontinuity sur-
faces generate entropy and reconnect vorticity. This can also be understood as a limiting
case starting with finite viscosity, where these surfaces have a finite width.
The ideal limit of MHD poses a new class of problems in dissipative processes.
If two opposing field lines sit nearby, a state of higher entropy can be reached by
reconnecting the field lines, and converting their magnetic energy into fluid entropy. In
the presence of resistivity, this process occurs on the resistive time scale of some relevant length scale. This statement exaggerates the problem somewhat: extensive theoretical research on magnetic reconnection ([17], [73]) has shown that scales intermediate between the size of a system and the resistive scale can be important. Nevertheless, in many astrophysical settings, simple models for reconnection give time scales that are very long, while reconnection is observed or inferred to occur on much shorter time scales; for solar flares, more than 10^10 times faster than the theory predicts [29]. This has led to the suggestion that magnetic reconnection in the limit of vanishing resistivity might also approach a weak (discontinuous) solution, occurring at a finite speed which is insensitive to the value of the resistivity.
The problem is best illustrated by the Sweet-Parker configuration ([90], [68]), in which opposing magnetic fields interact in a thin current sheet, the reconnection layer. This unmagnetized layer becomes a barrier to further reconnection. In a finite reconnection region, fluid can escape the reconnection region at Alfvenic speeds. Because the reconnection region is thin, the reconnection speed is reduced from the Alfven speed by the ratio of the current sheet width to the transverse system size. In the Sweet-Parker model this factor is the inverse square root of the Lundquist number VA L/η, with η the plasma resistivity. The predicted sheet widths are typically extremely thin.
Petschek proposed a fast magnetic reconnection solution ([71]) based on the idea that
magnetic reconnection happens in a much smaller diffusive region, called the X-point,
instead of a thin sheet. The global structure is determined by the log of the Lundquist
number, and stationary shocks allow the fluid to convert magnetic energy to entropy.
However, Biskamp’s simulations ([16]) showed that Petschek’s solution is unstable when the Ohmic resistivity becomes very small. In these two-dimensional incompressible resistive MHD simulations, plasma and magnetic flux were injected and ejected across the boundary, and the boundary condition was changed during the simulation to eliminate the boundary current layer. Considering the current sheet formed in the simulation, however, the computational domain may not have been big enough. After reproducing different scaling simulation results ([16], [52]), Priest and Forbes [74] pointed out that it is the boundary conditions that determine what happens (including the instability in Biskamp’s Petschek-like simulations), and that sufficiently free boundary conditions can make fast reconnection happen. However, no self-consistent simulation of fast reconnection has been reported, except with artificially enhanced local resistivity [82].
Reconciling the observed fast reconnection with its absence in simulations leads to two possible resolutions: 1) ideal MHD is not the correct set of equations, and long-range collisionless effects are required; or 2) the assumptions made about the reconnection regions are too restrictive, including two-dimensionality and the choice of boundary conditions.
In exploring the first possibility, it was found that integrating the MHD equations with the Hall term, or using a kinetic description ([15]), made it possible to obtain fast reconnection. However, this offers no help for collisional systems, which exhibit fast magnetic reconnection whether or not the Hall term is present; moreover, enhanced local resistivity is not generic in astrophysical environments, which mostly involve highly conducting fluids.
For the second possibility, we note that Lazarian & Vishniac (LV99) [51] proposed a
model of fast magnetic reconnection with low amplitude turbulence. Subsequent simu-
lation results [48] support this model. They found that the reconnection rate depends
on the amplitude of the fluctuations and the injection scale, and that Ohmic resistivity
and anomalous resistivity do not affect the reconnection rate. The result that only the
characteristics of turbulence determine the reconnection speed provides a good fit for
reconnection in astrophysical systems.
LV99 offered a solution to fast magnetic reconnection in collisional systems with turbulence. In this paper, we consider a different problem: whether fast reconnection is still possible without turbulence. We present an example of fast magnetic reconnection in an ideal three-dimensional MHD simulation in the absence of turbulence, exploring a different aspect: 3-D effects and boundary conditions. Traditionally, simulations have searched for stationary 2-D solutions, or scaling solutions. In the case of fast reconnection, the geometry changes on an Alfven time, so these assumptions might not be applicable. Specifically, we bypass the choice of boundary condition by using a periodic box.
The primary constructive fast reconnection solution, the Petschek solution, has some peculiar aspects. The global geometry of the flow, and the reconnection speed, depend on the details of a microscopic X-point. This X-point involves only an infinitesimal amount of matter and energy, so it seems rather surprising that this tiny volume could affect the global flow. Instead, one might expect the global flow of the system, which dominates the energy, to be the controlling factor. We will see that this is particularly important in our simulations.
3.2 Simulation setup
3.2.1 Physical setup
The purpose of the simulation is to study magnetic reconnection and its dynamics. We
start by dividing the volume in two, with each subvolume containing a uniform magnetic
field. In a periodic volume, this results in two current sheets where reconnection can
occur. An initial perturbation is added to trigger the reconnection.
3.2.2 Numerical setup
We have a reference setup, and vary numerical parameters relative to that. Initially the
upper and lower halves of the simulation volume are filled with uniform magnetic fields
whose directions differ by 135 degrees (Figure 3.1). The magnitude of the magnetic field
is the same for every cell, and β, the ratio of gas pressure to magnetic pressure, is set to
one.
There is a rotational perturbation at the interface of the magnetic fields, at the center
of the box, inside a sphere of radius 0.05 relative to the box size. The rotation axis
is nearly along the X axis, with a small deviation used to break any residual
symmetry. We use constant specific angular momentum at the equator, with solid-body
rotation on shells, generated by the same initial condition generator as [70]. The
rotational speed is set equal to the sound speed at a radius of 0.02, corresponding to
0.4 times the sound speed at the sphere's equatorial surface.
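The relation between the two quoted speeds follows from the constant specific angular momentum. The following is an illustrative sketch (the variable names and the exact profile are our own reconstruction, not the actual initial-condition generator of [70]):

```python
import math

# Illustrative sketch: constant specific angular momentum l0 = v_phi * r at
# the equator, with solid-body rotation on each spherical shell.
c_s = 1.0        # initial sound speed, code units
r_match = 0.02   # radius where the rotation speed equals the sound speed
r_sphere = 0.05  # radius of the perturbed sphere, relative to the box size

l0 = c_s * r_match                 # specific angular momentum (constant)

def v_equator(r):
    """Equatorial rotation speed of the shell at spherical radius r."""
    return l0 / r

def omega(r):
    """Solid-body angular velocity of the shell at radius r."""
    return v_equator(r) / r

# The two speeds quoted in the text: c_s at r = 0.02, and 0.4 c_s at the
# sphere's equatorial surface r = 0.05.
assert math.isclose(v_equator(r_match), c_s)
assert math.isclose(v_equator(r_sphere), 0.4 * c_s)
```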
We also tried adding a localized magnetic field perturbation: a random Gaussian
magnetic field, with β = 1 and a correlation length of half the box size, was added in the
same region as the rotational perturbation. Since the only dissipation is numerical, on
the grid scale, a translational velocity [93] was added to all cells in the box to increase
the numerical diffusion. The reference value of the translational velocity is equal to the
sound speed, and we measure time in units of the crossing time (CT), the box size
divided by the initial sound speed. Varying the translational velocity by a factor of 2 up
or down does not change the results. At the beginning, the Alfvén speed is the same as
the sound speed.
Different resolutions were tested, from 50³ cells to 800³ cells.
3.3 Simulation results
3.3.1 Global fast magnetic reconnection
We use the total magnetic energy as a global diagnostic of the system. Figure 3.2 shows
the evolution of the magnetic energy. The generic feature is a sudden drop of magnetic
energy, occurring within an Alfvénic box crossing time, during which much of the magnetic
energy is dissipated. The onset of this event depends on numerical parameters. Due to
symmetries in the code, in the absence of any initial perturbation the initial conditions
would be maintained indefinitely.
We can see that when there is no forced diffusion and no initial perturbation, the
magnetic energy is almost stationary. When diffusion is added, the magnetic energy
decays gradually throughout the simulation.
When explicit velocity perturbations are present, all the simulations show a sudden
decrease of magnetic energy, which indicates fast magnetic reconnection. The common
property is that they all have some initial perturbation, either rotational or a strong
localized field perturbation; the background diffusion only affects how early reconnection
happens. To make sure this fast reconnection is not a resolution effect, we ran different
resolutions, from a 50³ box to an 800³ box (Figure 3.3). All show
fast reconnection, and the resolution only affects the time elapsed before fast reconnec-
Figure 3.1: Numerical setup: the sphere at the center of the box represents the region of
the rotational perturbation. The upper-left panel shows the rotational perturbation viewed in the YZ plane.
[Figure 3.2 plot: log10(B²/2p0) versus t/CT. Legend: dash-dot: diffusion and field perturbation; dashed: no diffusion or perturbation; dotted: rotational perturbation only; solid: diffusion only; asterisks: diffusion and rotational perturbation.]
Figure 3.2: Reconnection for different initial conditions. The total magnetic energy
is an indicator of reconnection. The dash-dot line has a non-zero mean magnetic field
perturbation, so the reconnected field asymptotes to a slightly different value.
[Figure 3.3 plot: log10(B²/2p0) versus t/CT. Legend: dash-dot: 800³ cells; dashed: 400³; dotted: 200³; solid: 100³; asterisks: 50³.]
Figure 3.3: Reconnection for different resolutions.
[Figure 3.4 plot: (B²/B²_rec) − 1 versus (t − t_rec)/ct_rec, the slope of the magnetic energy drop versus time. Legend: dash-dot: 800³ cells; dashed: 400³; dotted: 200³; solid: 100³; asterisks: 50³.]
Figure 3.4: Reconnection for different resolutions near the reconnection point. This plot
recenters Figure 3.3 to the time of maximum magnetic energy release, and rescales the
horizontal and vertical axes by the mean Alfvén time and the fractional energy release, respectively.
Figure 3.5: 2D snapshot during reconnection, with the current magnitude as the background color.
tion happens, though the details of how the delay depends on resolution are still unclear.
To show the energy drop in more detail, we plot the evolution of the magnetic energy
near the reconnection point (±2 CT) in Figure 3.4. We see a rapid reconnection event
in which roughly 30% of the magnetic energy is released in one Alfvénic crossing time,
at a rate that does not depend substantially on resolution; this is clearly fast reconnection
by any reasonable criterion.
Figure 3.5 shows a rough two-dimensional snapshot of the current (∝ ∇ × B) during
fast reconnection, with color representing the current magnitude. There are clearly
regions of high current where reconnection occurs, which we analyze at higher
resolution below.
3.3.2 What happens on the current sheet?
There are regions with large currents, where the reconnection should happen. We now
use the high resolution run (800³ cells) to investigate what exactly happens there. We
show snapshots close to the current sheet to see how the flow evolves and what the
magnetic field geometry looks like near the current sheet. We subtract the average value
of both the magnetic field and the velocity in the region close to the current sheet, which
places us in the frame comoving with the fluid. The mean magnetic field does not
participate in the dynamics of reconnection, so its removal allows us to see the dynamics
more clearly.
We present snapshots of three different times during the reconnection: the beginning,
the middle, and the end. Each snapshot contains three panels: the upper left shows the
current magnitude as the background color, with white lines representing magnetic field
lines; the lower left shows both the magnetic field (blue dashed) and the velocity field
(red solid); and the right shows the corresponding magnetic energy. Figure 3.7 shows
the beginning; Figure 3.8 the middle; and Figure 3.9 the end.
The snapshot of the magnetic and velocity field lines in Figure 3.7 resembles Figure
3.6 [71], the geometry of Petschek's solution for fast magnetic reconnection. The X-point,
which is the reconnection region, is small and lies at the center. The tangent of the angle
α represents the ratio of inflow to outflow speed.
3.3.3 What happens globally?
We show the long-term, global 2D evolution of both the velocity field lines and the
magnetic field for the 400³ simulation, from the beginning until reconnection completes
(Figures 3.10 to 3.14). These plots are analogous to those in the previous section: the
left panel is a snapshot of both magnetic and velocity field lines; the center panel shows
the magnetic field lines with current as the background color; and the corresponding
magnetic energy is included on the right. At the beginning, the magnetic field lines are
opposed and there is no velocity field. The initial rotational perturbation then induces
two reconnected regions with closed magnetic field loops, one at each interface. The
closed loops are fed by a slow X-point at each interface. Since there is a mean field
perpendicular to the plotted surface, these loops are actually twists in the perpendicular
magnetic field. In the bulk region between the interfaces, the parallel magnetic fields are
not yet much disturbed by the perturbation.
In Figure 3.12 we see the loops move into the X-point of the opposing loop, and
strong interactions occur. The fluid forms two large circular cells, offset from the
magnetic loops. The energy driving the fluid flow comes from the reconnection energy
of the magnetic field, and this flow pattern in turn enhances the reconnection by driving
fluid through the X-point.
We illustrate the fast reconnection flows in Figure 3.15. Blue dashed circles with arrows
represent the magnetic loops. The red field lines with arrows represent the velocity field.
The two big black X's in the global frame mark the X-points for reconnection. Because
we are using periodic boundary conditions, we extend the simulation
Figure 3.6: Geometry of the Petschek solution.
Figure 3.7: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 37 CT.
Figure 3.8: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 39 CT.
Figure 3.9: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 41 CT.
Figure 3.10: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 0 CT for 400³ cells.
Figure 3.11: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 10 CT for 400³ cells.
Figure 3.12: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 38 CT for 400³ cells.
Figure 3.13: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 40 CT for 400³ cells.
box picture in two other directions, to make the global flow easier to understand. Red
solid lines represent the velocity field in the real box, and red dash-dot lines represent
the field lines in the extended boxes.
Reconnection is a local process in the global flow field. To see this, we need to boost
into the comoving frame. Take the right magnetic twist for example: in the global frame,
the flow on the right all moves downwards, with the magnetic twist moving at the highest
speed. The X-point is like a saddle point for the flow: the fluid converges vertically and
diverges horizontally. In the X-point frame, setting the velocity at B to zero, A moves
down and C moves up, which supports the conditions for reconnection.
3.4 Discussion
To summarize, we have found a global flow pattern which reinforces X-point reconnection,
and the resulting fast reconnection in turn drives the global flow pattern. The basic
picture is two dimensional. We did find that a pure 2-D simulation does not show this
fast reconnection. This is easy to understand, since the reconnected field loops are loaded
with matter, and would require resistivity to dissipate. In 3-D, these loops are twists
which are unstable to a range of instabilities, allowing the field loops to collapse. So
three basic ingredients are needed: 1. a global flow which keeps the field lines outside
the X-point at a large opening angle, allowing the reconnected fluid to escape and
avoiding the Sweet-Parker time scale; 2. reconnection energy which drives this global
flow; 3. a three-dimensional instability which allows closed (reconnected) field lines to
collapse, releasing the energy stored in the field.
The problem described here has two geometric dimensionless parameters: the two axis
ratios of the periodic box. In addition, there are a number of numerical parameters. We
have varied them to study their effects.
Extending the box in the Y direction (the separation between reconnection regions) shuts
off this instability, as might be expected: no global flows are possible if the two
interaction regions are too far separated. We found the threshold to be Y < 1.2Z. In
the other direction, there appears to be no limit as Y ≪ Z: increasing the size
of the Z dimension does not diminish this instability. There is also a dependence on X
(the extent along the field symmetry axis). Shortening it to one grid cell protects the topology
of the field loops, and reconnection is not observed in such effectively 2-D simulations.
We varied the initial conditions to test whether the fast reconnection is sensitive
to the initial setup. After changing the angle between the opposing magnetic fields (from
just over 90 degrees to 180 degrees), the strength of the rotational perturbation, and the
axis of the rotational perturbation, we found that the fast reconnection still appeared. The
boundary conditions were kept periodic, and the evolution of the fluid dynamics was
similar across the different initial conditions.
The fast reconnection happens at the two interfaces of the straight magnetic field at
the same time, with a magnetic twist moving towards each interface. The twists do not
collide with the magnetic field head-on, but are slightly separated in the transverse
direction. This special geometry helps the magnetic reconnection happen fast: as each
magnetic twist pushes the field lines, it also affects the velocity field on the other side,
which helps to increase the outflow speed. Looking back at the Sweet-Parker solution
([90], [68]), the main problem is that the current sheet is so thin that even if the outflow
is accelerated to the Alfvén speed, the mass flux of the outflow is still small, which slows
down the reconnection. Petschek's configuration [71] resolves this problem with a
small reconnection region and a finite opening angle for the outflow. In our simulation the
speed of the outflow is further increased by the feedback between the two reconnection
regions.
The solar flare reconnection time scale is of order the Alfvén time scale [29], i.e.
seconds to minutes.
If only magnetic diffusivity (η) is present, the diffusive time is τ_D = L²/η, with L
the characteristic length. Taking the values from [29], L = 1000 km and η = 10⁻³ m² s⁻¹,
gives τ_D ≈ 10¹⁵ s.
Sweet-Parker’s thin current sheet proposed a reconnection time as τSP = L/(VAi/R1/2mi ),
with Rmi = LυAi/η. This makes the reconnection time about 105 Alfven times.
Petschek’s configuration has a reconnection time as τP = L/(αυA), with α is between
0.01 and 0.1 and Alfven speed ∼ 100km/s, and this makes the time scale as 100−1000s.
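These order-of-magnitude estimates can be checked with a few lines of arithmetic (a back-of-the-envelope sketch using the values quoted above from [29]; it is not part of the simulation code):

```python
import math

# Time-scale estimates from the text: L = 1000 km, eta = 1e-3 m^2/s,
# v_A ~ 100 km/s (values from [29]).
L = 1.0e6      # characteristic length, m
eta = 1.0e-3   # magnetic diffusivity, m^2/s
v_A = 1.0e5    # Alfven speed, m/s

tau_D = L**2 / eta                # purely diffusive time
tau_A = L / v_A                   # Alfven crossing time
tau_P_fast = L / (0.1 * v_A)      # Petschek, alpha = 0.1
tau_P_slow = L / (0.01 * v_A)     # Petschek, alpha = 0.01

assert math.isclose(tau_D, 1e15)        # ~1e15 s, as in the text
assert math.isclose(tau_A, 10.0)        # ~10 s, same order as observed 20-60 s
assert math.isclose(tau_P_fast, 100.0)  # Petschek range: 100 s ...
assert math.isclose(tau_P_slow, 1000.0) # ... to 1000 s
```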
Our fast reconnection time is of order the Alfvén time, τ_A = L/v_A, which is the same
order as the observed time scales of 20−60 s [29]. Furthermore, in contrast to LV99, no
turbulence is needed or added in our simulations. Our fast magnetic reconnection time
scale is qualitatively similar to the energy release time scale for solar flares.
3.5 Ideal vs resistive MHD
Our code solves the equations of ideal MHD, without explicit resistivity. Ideal MHD,
like the ideal Euler equations, admits weak solutions in which the flow becomes
discontinuous. The conserved quantities remain conserved across these discontinuities,
but most derived conserved quantities do not. For example, entropy increases across
shock fronts, vorticity can be generated, and vortex lines as well as magnetic field lines
can reconnect. These features are all understood from analytical theory.
The TVD algorithm is designed to capture these discontinuous effects by effectively
introducing non-linear viscous and resistive terms on the grid scale, while enforcing the
conservation equations and preventing the numerical oscillations that arise from
differentiating discontinuous solutions.
Our TVD code solves the ideal MHD equations by computing the conservation laws
across cell boundaries, and using the TVD scheme to control oscillations. While there is
no explicit resistivity or viscosity, the solution does capture shocks and discontinuities,
which in the end generates entropy, vorticity and reconnection. This class of algorithms
has a single free numerical parameter, which is the resolution. When we study conver-
gence, we are testing if the results depend on the numerical parameter of cell size. At
each resolution we are solving the ideal MHD equations.
Other approaches could also be taken. One could add sufficient resistivity, diffusivity
and viscosity that shocks and discontinuities do not form on the grid scale, and test
the dependence of the results on changes in resistivity and viscosity. There are at least
four numerical and physical parameters that can be varied independently, corresponding to
the Reynolds number, Prandtl number, Schmidt number, resolution, etc. One would like to
see whether the solution converges to the ideal MHD limit for any ratio of these parameters,
as their dimensionful counterparts go to zero. This is a high dimensional space, and
numerically challenging to explore. In this work, we explore only a subset of this space
in the context of ideal MHD.
Physically, this corresponds to the question of whether all systems always exhibit
fast reconnection, independent of micro-physics. As a first step, we offer a constructive
existence exploration, without addressing this broader question.
3.6 Summary
We present evidence for fast magnetic reconnection in a global three dimensional ideal
magnetohydrodynamics simulation without any sustained external driving. These global
simulations are self-contained, and do not rely on specified boundary conditions. We have
quantified ranges in parameter space where fast reconnection is generic. The reconnection
is Petschek-like, and fast, meaning that ∼30% of the magnetic energy is released in one
Alfvén time.
This example of fast reconnection relies on two interacting reconnection
regions in a periodic box. It is an intrinsically three dimensional effect. Our interpretation
is that the Petschek-like X-point angles are not determined by microscopic properties at
an infinitesimal boundary, which carries negligible energy, but rather by the global flow
far away from the X-point. Whether or not such configurations are natural in an open
system remains to be seen.
Figure 3.14: Snapshot of magnetic field lines on a background of current, snapshot of
both magnetic and velocity field lines, and B² at 42 CT for 400³ cells.
Figure 3.15: Geometry of global configuration
Chapter 4
Accelerate MHD
4.1 Introduction
The magnetohydrodynamics equations are nonlinear, and cannot in general be solved
analytically. Thanks to the increasing power of computers, three dimensional simulations
can be used to model these equations numerically. Numerical simulations are crucial
both for understanding the theory of such fluids, and for directing real world
experiments. However, in order to achieve realistic representations of real world problems,
numerical experiments must push computational hardware resources to their limits.
Driven by considerations of compute power per watt and per dollar, new architectures
are being adopted for these calculations. In particular, we examine here heterogeneous
systems, which consist of two different kinds of processors: one or more general-purpose
conventional processors which control the overall computation, and specialized, usually
multi-core, processor units to which the numerically intensive computing is offloaded [83].
There are currently several heterogeneous platforms in use; we focus here on the
Cell/B.E. [1] and graphics processing units (GPUs), and use a multi-core x86 system
for comparison.
The comparison between these systems is not straightforward, because they are typically
programmed using different platform-specific languages that may also affect the
performance. The Open Computing Language (OpenCL) [4] is a cross-platform application
programming interface (API) designed for heterogeneous systems, including GPUs and
Cell. Released by the Khronos Group, it ameliorates this problem to some degree; we
will see here that OpenCL can perform as well as CUDA on an Nvidia GPU, which makes
OpenCL a good choice for heterogeneous programming, though it is more complicated
to use than CUDA.
We discuss our reference implementation of a solver for the MHD equations in Section
4.2; in Section 4.3 we discuss our implementation on several architectures. We summarize
our results and discuss future work in Section 4.4; in Section 4.5 we conclude.
4.2 The algorithms of MHD
There are many algorithms for solving these equations, which we will not attempt to
review here. We follow the approach of [69], as the conciseness of its implementation
lends itself to re-implementation for the different architectures, and its memory-access
patterns are an excellent match to the heterogeneous architectures discussed here.
The method is a second-order accurate (in space and time) high-resolution total
variation diminishing (TVD) [39] scheme. The kinetic, thermal, and magnetic energy
are conserved identically and there is no explicit magnetic or viscous dissipation. The
TVD constraints result in non-linear viscosity and resistivity on the grid scale. The TVD
constraint allows the capture of shocks for compressible flows, where the flow becomes
discontinuous.
The code solves the magnetohydrodynamics equations in a finite-difference, finite-volume
scheme. Unlike the differential form of the MHD equations listed in Chapter 1, the
integral (flux-conservation) form is used in the calculation. The MHD equations as solved
in our code are:
∂_t ρ + ∇·(ρv) = 0 (4.1)

∂_t(ρv) + ∇·(ρvv + P_* δ − bb) = 0 (4.2)

∂_t e + ∇·[(e + P_*)v − b(b·v)] = 0 (4.3)

∂_t b = ∇×(v×b) (4.4)

∇·b = 0 (4.5)

Here, for numerical convenience, the magnetic field b is normalized by a factor of √(4π).
P_* is the total pressure, equal to the sum of the gas pressure p and the magnetic
pressure b²/2; ρ and e are the mass and total energy densities, where the latter is the
sum of the kinetic energy (ρv²/2), internal energy (p/(γ − 1)), and magnetic energy (b²/2).
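As a concrete reading of these definitions, a minimal sketch (the variable names and the choice γ = 5/3 are our own illustration, not taken from the production code):

```python
import math

def total_pressure(p, b2):
    """P* = gas pressure + magnetic pressure b^2/2 (b2 = |b|^2)."""
    return p + 0.5 * b2

def total_energy(rho, v2, p, b2, gamma=5.0 / 3.0):
    """e = kinetic + internal + magnetic energy density (v2 = |v|^2)."""
    return 0.5 * rho * v2 + p / (gamma - 1.0) + 0.5 * b2

# For the beta = 1 setup of Chapter 3 (gas pressure = magnetic pressure),
# the total pressure is twice the gas pressure:
p, b2 = 1.0, 2.0          # b^2/2 = 1, so beta = p / (b^2/2) = 1
assert total_pressure(p, b2) == 2.0 * p
# at rest: e = p/(gamma - 1) + b^2/2 = 1.5 + 1.0
assert math.isclose(total_energy(rho=1.0, v2=0.0, p=1.0, b2=2.0), 2.5)
```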
The code solves the magnetic component and the fluid dynamics separately. The former
is solved by a two-dimensional advection-constraint step [69], while for the latter, a
monotone upwind scheme for conservation laws (MUSCL) is used for a one-dimensional
fluid advection update [92]. The time step is set by the Courant-Friedrichs-Lewy
(CFL) constraint, which ensures that the fastest wave cannot travel more than one
grid spacing in a single time step. The approach is 'dimensionally split' in the sense that
updates are first made along the x direction, then y, and then z; memory transposes are
used to reorient the grid between each sweep. This both greatly simplifies the numerical
kernel (which only has to be implemented once) and ensures regular memory access for
each sweep.
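The transpose bookkeeping can be illustrated with a toy sketch (NumPy, our own illustration; the placeholder kernel is not the actual MUSCL update, the point is only that one kernel written for the first axis serves all three sweeps):

```python
import numpy as np

def sweep_x(u):
    # placeholder 1D update along axis 0 (periodic shift by one cell);
    # stands in for the real MUSCL kernel, which also acts only on axis 0
    return np.roll(u, 1, axis=0)

def step(u):
    u = sweep_x(u)                      # x sweep
    u = sweep_x(u.transpose(1, 2, 0))   # y sweep: transpose brings y to front
    u = sweep_x(u.transpose(1, 2, 0))   # z sweep: transpose brings z to front
    return u.transpose(1, 2, 0)         # restore (x, y, z) ordering

u = np.random.rand(4, 4, 4)
v = step(u)
assert v.shape == u.shape               # grid orientation is restored
assert np.isclose(u.sum(), v.sum())     # the placeholder kernel is conservative
```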
The dimensional splitting reduces the fluid update to one-dimensional dynamics:

∂_t u + ∇_x F = 0. (4.6)

This is discretized into finite volumes, ensuring conservation. The fluxes are calculated
using MUSCL, a first-order upwind scheme with a second-order TVD (Van Leer limiter)
correction. Time integration is performed using a second-order Runge-Kutta scheme. To
solve the complex upwind problem involved with the momentum and energy fluxes,
relaxing TVD [44] is used for the Euler equations.
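A scalar analogue of this construction can be written in a few lines (our own illustration, for linear advection u_t + a u_x = 0 with a > 0 on a periodic grid; the production code applies the analogous limited-upwind construction to the relaxed Euler fluxes):

```python
import numpy as np

def van_leer(r):
    # van Leer limiter: phi(r) = (r + |r|) / (1 + |r|)
    return (r + np.abs(r)) / (1.0 + np.abs(r))

def muscl_step(u, c):
    """One periodic MUSCL update for u_t + a u_x = 0; c = a dt/dx."""
    du_m = u - np.roll(u, 1)               # u_i - u_{i-1}
    du_p = np.roll(u, -1) - u              # u_{i+1} - u_i
    safe = np.where(du_p == 0.0, 1.0, du_p)
    r = np.where(du_p == 0.0, 0.0, du_m / safe)
    u_face = u + 0.5 * van_leer(r) * du_p  # limited state at interface i+1/2
    flux = c * u_face                      # upwind flux (a > 0)
    return u - (flux - np.roll(flux, 1))   # conservative update

u = 2.0 + np.sin(2.0 * np.pi * np.arange(64) / 64.0)
total0 = u.sum()
for _ in range(64):
    u = muscl_step(u, 0.4)
assert np.isclose(u.sum(), total0)   # exact conservation (to roundoff)
assert u.min() >= 0.99               # TVD: no spurious undershoots
```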
The magnetic update is reduced to a two-dimensional advection-constraint step consistent
with Equation 4.4 that also ensures the constraint given by Equation 4.5. In constrained
transport [30], one stores the magnetic flux at the cell faces, which can then be used to
accurately maintain a zero divergence of the magnetic field.
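The divergence-preserving property of the staggering can be demonstrated in two dimensions with a few lines (our own sketch: the EMF here is random noise, because the point is that the discrete curl telescopes to zero in the discrete divergence, whatever the EMF):

```python
import numpy as np

def div_b(bx, by, dx=1.0, dy=1.0):
    # discrete divergence on a periodic grid; bx lives on x-faces, by on y-faces
    return ((np.roll(bx, -1, axis=0) - bx) / dx +
            (np.roll(by, -1, axis=1) - by) / dy)

def ct_update(bx, by, ez, dt, dx=1.0, dy=1.0):
    # 2D constrained transport: dBx/dt = -dEz/dy, dBy/dt = +dEz/dx,
    # with the edge-centred EMF ez differenced along cell edges
    bx = bx - dt * (np.roll(ez, -1, axis=1) - ez) / dy
    by = by + dt * (np.roll(ez, -1, axis=0) - ez) / dx
    return bx, by

rng = np.random.default_rng(0)
n = 16
bx = rng.standard_normal((n, n))
by = rng.standard_normal((n, n))
d0 = div_b(bx, by)
bx, by = ct_update(bx, by, rng.standard_normal((n, n)), dt=0.1)
# the update changes B but leaves the discrete divergence untouched
assert np.allclose(div_b(bx, by), d0)
```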
In addition, [69] proposed not storing all the computed electromotive forces (EMFs),
instead applying the individual pieces of the EMF during the advection-constraint steps.
This saves a significant amount of memory and also reduces unnecessary memory
access. As a result, the code is very memory efficient, and transposing the grid in memory
between sweeps ensures short strides along sweep directions and thus low memory-access
latency. One must remain aware of the grid-imposed data dependencies of the method,
however. The one-dimensional fluid update stencil is a standard 7-point stencil requiring
data from all 4 neighbouring 'pencils' in the direction of a sweep; the magnetic update,
to ensure the consistency of the magnetic field constraint, additionally needs the adjacent
'pencils' to be updated by the flux.
4.3 Implementation on heterogeneous systems
Heterogeneous systems have processors with different roles. As a result, a new memory
system design is needed, which poses a challenge for programmers.
In this section, we discuss our implementation and performance results for the
MHD scheme described above on different platforms: multi-core x86, Cell/B.E., an Nvidia
GPU and an ATI GPU. Each platform has a corresponding language or library: OpenMP
for multi-core x86, native Cell programming for the Cell blade, CUDA for the Nvidia GPU,
and OpenCL for the ATI GPU.
In all cases, we implement the full 3D version of the method described above. Our
performance tests measure the time taken to evolve a 3D domain of varying
size (16³, 32³, 64³, and 128³ zones) by one timestep (only the evolution step; no extra
memory transfer is included); by varying the size of the domain we can see the effects of
overheads such as memory transfer. Note that all calculations here are performed
in single precision to make comparisons more readily meaningful. All timing data
are in milliseconds.
4.3.1 x86
As a basis of comparison, we first examine the performance of the original FORTRAN
code on a multi-core x86 architecture. We use two Intel Xeon E5506 CPUs @ 2.13 GHz,
each with 4 processor cores, for this experiment.
Parallelization is done with OpenMP. Programming with OpenMP is straightforward: the
programmer only needs to add directives to the loops and the API partitions the
loops automatically. The original version of the code under consideration here already
had OpenMP parallelization, incurring only minimal overhead in code length and
complexity. The parallelization is done over 2D slabs, with parallelization occurring over
the outermost loop in the solvers.
Result

Data for different box sizes are provided in Table 4.1, with the numbers inside the
brackets indicating the number of cores. For problem sizes larger than 16³, a steady 6.7
times speedup is achieved.
4.3.2 Cell
The Cell Broadband Engine (Cell/B.E.) [1] is a collaboration of Sony, Toshiba and IBM.
It was originally designed for a gaming machine, Sony's PlayStation 3; however,
it is also a good candidate for high performance computing due to its specialized multi-
Table 4.1: Performance on the multi-core x86 for different box sizes; timings in
milliseconds. x86(1) refers to single-core performance; x86(8) to 8 cores.

                           Domain size
Architecture       16³     32³     64³    128³
x86(1)            17.8     140    1096    8770
x86(8)             4.0    20.7   163.6    1315
speedup (8:1)      4.4     6.7     6.7     6.7
core architecture. Cell/B.E.'s design, a combination of one Power Processor Element
(PPE) and eight Synergistic Processing Elements (SPEs), aims to overcome three walls:
the power wall, the memory wall and the frequency wall [12].

The PPE is a 3.2 GHz PowerPC-like processor, and is used to control the eight 3.2 GHz
SPEs, which are used for data intensive computing. An SPE can perform four single-precision
floating-point operations in a single clock cycle. With dual pipelines, this gives
3.2 × 4 × 2 = 25.6 Gflops peak single-precision performance per SPE [12]. There
are three levels of memory: the PPE's main storage, the SPE's 256 kB SRAM local
storage, and the SPE's 128-bit, 128-entry unified register file. It is the programmer's job
to handle Direct Memory Access (DMA) to transfer data between the PPE and the SPEs.
The transfer is performed over the Element Interconnect Bus (EIB), a high speed internal
bus with 204.8 GB/s peak data bandwidth [26].
Cell processors have a high-level C-like programming language.
Parallelization/Partition
In the first stage of the parallelization, the PPE assigns threads and memory to
the SPEs and performs synchronization. Once the calculation begins, the PPE is no
longer involved, and all work is done by the SPEs. DMA is used to
transfer data between main memory and local storage. Since the PPE is not used during
the calculation, the signal-notification channel is used for synchronization. One SPE is
assigned as the master. Once a synchronization point is reached, the slave SPEs send a
message to the master SPE. Upon receiving all the messages, the master releases the
slaves using a binary synchronization tree.
The fluid updates are performed along one-dimensional pencils of grid points (e.g. along
the X direction), transferred separately to each SPE for calculation; this makes best use
of the fairly modest 256 kB of local storage on each SPE. By ensuring the domain sides
are always multiples of 4, the starting address of each transfer is correctly aligned. We
follow the update order of the FORTRAN version for the fluid part. For the magnetic
update, any pencil that sits in one SPE has to update the pencils next to it (in both the
Y and Z directions). As a result, we separate the magnetic update into four
sub-functions. The intermediate values to be updated by the next pencil are sent back to
the PPE; after the synchronization of the previous function, the values are sent to the
SPEs to finish the update.
Implementing the grid transpose efficiently requires some care. For every DMA transfer,
the starting address has to be aligned to 16 bytes. To achieve higher performance,
the data for each transfer should approach 16 kB. For the regular memory accesses
involved in the fluid and magnetic updates this is straightforward, but balancing these
constraints for the non-contiguous memory access of the transpose is more difficult. As a
result, DMA lists, commands that cause execution of a list of transfer requests, are
used for this task. For every SPE, there is a 16³ cube of data elements per component
transferred by DMA list. The incoming lists hold the starting addresses of two-dimensional
planes of data and the data size. After the transpose inside the SPEs, the outgoing
lists hold the starting addresses of the transposed planes and the same data
size. Because the size is a multiple of 4, with at least single precision (4 bytes) per
element, the starting address is always a multiple of 16 bytes.
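The alignment argument can be spelled out (a sketch with our own offset formula for row starts in a contiguous single-precision array; not code from the SPE implementation):

```python
# If the domain side n is a multiple of 4 and elements are 4-byte floats,
# the byte offset of every row (and every n*n plane) in a contiguous array
# is a multiple of 16, satisfying the DMA alignment requirement.
ELEM = 4  # bytes per single-precision float

def row_offset(row, n, elem=ELEM):
    return row * n * elem  # start of a contiguous row of n elements

for n in (16, 32, 64, 128):           # sides are multiples of 4
    assert all(row_offset(r, n) % 16 == 0 for r in range(n))
```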
Optimization
Further performance gains can be achieved by taking advantage of SIMD capabilities
Table 4.2: Cell performance using the PPE or varying numbers of SPEs, for different
box sizes; timings in milliseconds.

                               Domain size
Architecture            16³     32³     64³     128³
PPE                      52     448    3745    32300
1 SPE                  22.3   163.8    1257     9901
4 SPE                   6.5    43.8     327     2607
16 SPE                  3.5      14     112      864
speedup (16 SPE:PPE)   14.9    32.0    33.4     37.4
speedup (16 SPE:1 SPE)  6.4    11.7    11.2     11.5
of the SPEs, and overlapping communication and computation.
To exploit the SIMD capabilities of the SPEs, our code's data structures are arranged
as a structure-of-arrays (SOA): the different components of the fluid and magnetic fields
are stored in separate arrays. Each single-precision SIMD operation then processes one
component of four adjacent cells at once.
Since the SPEs have no cache, double buffering is used to overlap communication with
computation and hide the memory latency between the PPE and the SPEs.
Result
Timing data for different box sizes are provided in Table 4.2; the last two rows give
the speed-up of 16 SPEs relative to the PPE alone and to a single SPE.
4.3.3 Nvidia GPU
Graphics Processing Units (GPUs) were originally developed for 3D graphics rendering,
but their naturally parallel architecture also suits high performance computing. Current
GPUs use unified shaders for rendering, and these shaders serve as the 'cores' for GPU
computing.
We use a Tesla C2050 in our tests. It has 448 cores at 1.15GHz (one FMA operation,
i.e. two flops, per clock cycle), partitioned into 14 streaming multiprocessors (SMs).
This gives a single-precision peak performance of 448 × 1.15 × 2 = 1030.4 Gflops [5].
There are three levels of memory: 3GB of GPU global memory; 64kB of on-chip memory per
SM, shared between shared memory and L1 cache in a configurable split of 48kB/16kB or
16kB/48kB; and 32768 32-bit registers per SM. The bandwidth between global memory and
on-chip memory is 144 GB/s, while the CPU and GPU are connected by PCI-e, with 8GB/s of
bandwidth.
For this architecture, we re-implement the MHD solver in the Compute Unified
Device Architecture (CUDA), a high-level C-like language that can program any Nvidia
GPU from the G80 generation onward.
Parallelization/Partition
On this architecture, the CPU initializes the work, assigns threads and memory,
and performs the necessary synchronization. To minimize the impact of the relatively low
PCI-e bandwidth, no further data transfer is performed after the initialized data has
been copied to GPU global memory. There remains, however, the long latency (several
hundred cycles) of fetching data from the card's global memory to the arithmetic units,
which must be hidden by oversubscribing the cores.
CUDA uses SIMT (Single Instruction, Multiple Thread): every thread in the same block
executes the same instruction at the same time. SIMT differs from SIMD in that the width
(the number of threads) is not fixed, which in turn affects the number of registers
(a fixed pool per SM) available to each thread.
In our implementation, each CUDA thread block is assigned one one-dimensional
pencil, and the corresponding data is copied into the block's shared memory. Each thread
within the block corresponds to one zone. Synchronization is provided within a block;
when synchronization among blocks is needed, we return all
blocks back to CPU control by ending the CUDA kernel.
To further reduce global-memory latency, we stagger the magnetic update: odd-indexed
blocks are updated first, then even-indexed ones. This avoids reading and writing
intermediate fluxes to global memory and yields roughly a 10% speed-up.
Finally, the CUDA SDK provides transpose examples, which we adapt for our purposes.
Because the memory transpose is three dimensional, the tile size is limited by the shared
memory per block. In our simulation, only 8³ grid points of a single component of the
fluid or magnetic field are transposed at a time; we found this the best balance between
shared-memory usage and transpose size.
Optimization
We can further improve the performance on this architecture by being aware of the
underlying memory architecture, and choosing block sizes to maximize occupancy.
Because of the size of the stencil, and the structure of the magnetic field update,
adjacent cells are needed for evolving any zone. Repeated access to global memory is
avoided by using shared memory in CUDA to cache the needed values. We did not use
constant, texture or pinned memory, as there is no large amount of ‘read-only’ data which
could benefit from being stored here.
The global memory accesses of the updates are automatically coalesced thanks to the
memory transposes, so no special work is needed here.
A further concern is occupancy: keeping each SM as fully occupied with thread
blocks as possible. Occupancy is the ratio of active warps to the maximum number of
warps an SM supports. Increasing occupancy does not directly guarantee good performance,
but low occupancy certainly hides memory latency poorly. Three factors affect occupancy:
threads per block, shared-memory usage, and register usage. Empirically, we found that
organizing thread blocks by pencils and assigning between 128 and 192 threads (and thus
zones) per block maximizes performance. Further improvement in occupancy is limited by
Table 4.3: x86 vs Nvidia GPU performance for different box sizes; timings in milliseconds.

                             Domain size
Architecture             16³     32³     64³     128³
x86 (1 core)             17.8    140     1096    8770
Nvidia (CUDA)            1.3     2.3     8.8     64
Nvidia (OpenCL)          1.5     2.5     9.3     65
Nvidia (CUDA, double)    1.9     3.7     17.9    136
Speedup (CUDA:x86)       13.1    61      125     137
register usage in the fluid evolution and shared-memory usage in the magnetic evolution.
Results and comparison with previous work
Timing data for different domain sizes are provided in Table 4.3. For sufficiently large
domains, we achieve a factor of 100 speed-up over a single x86 core. For the reader's
interest, we also ran the test in double precision on the Tesla C2050; the
double-precision performance is roughly half the single-precision performance, the same
ratio as Nvidia's quoted peak rates.
For this architecture, other work can be used to gauge the efficiency of our
implementation. Two other groups ([94], [81]) have used CUDA to implement Pen's [69]
TVD code for MHD or pure hydrodynamics. In [94], CUDA was used for MHD, achieving a
speed-up of 84 times in 3D on a GTX 295 (480 cores) over an Intel Core i7 965 at
3.20GHz. Our speed-up of 137 with a Tesla C2050 over a Xeon E5506 at 2.13GHz is
comparable.
In [81], a relaxing TVD scheme was used for three-dimensional hydrodynamics, together
with adaptive mesh refinement (AMR) and a multi-level relaxation scheme, running on a
multi-GPU cluster. Since this setup differs significantly from ours, no direct
comparison is presented here; they report a speed-up of 12.19 for one GPU.
4.3.4 ATI GPU
The ATI GPU uses superscalar shader cores, a variation on SIMD. Each superscalar unit
contains one 4D vector unit and one 1D scalar unit, so in a single cycle it can issue
one 4D operation and one 1D operation. To compensate for its weaker scalar performance,
many more cores are placed on the chip; the Radeon HD 5800 series, for example, has
1600 cores.
We used an ATI HD 5870 for our simulation. Its 1600 shader cores at 0.85 GHz are
organized into 20 SIMD units of 80 cores each. A 4D+1D core can perform two
single-precision operations per clock cycle, giving the HD 5870 a single-precision peak
performance of 1600 × 0.85 × 2 = 2720 Gflops [6]. There are three levels of memory:
1GB of GPU global memory, 32kB of shared memory per block (i.e. SIMD unit), and 16384
128-bit registers per block. The bandwidth between global memory and on-chip memory is
153.6 GB/s, while the CPU-GPU connection is PCI-e, as for the Nvidia card.
For this architecture, we re-implement the MHD solver using OpenCL.
Parallelization/Partition/Optimization
The parallelization is similar to that for the Nvidia GPU, except that we vectorize
the code to obtain maximum performance. ATI GPUs contain both SIMD and SIMT units, and
each thread operates on its own data, so cross-grid calculation is not possible. We
therefore use an array-of-structures (AOS) layout, in contrast to the SOA layout used on
Cell: the first four of the five fluid components are stored as a 'float4' and the last
as a 'float', while the magnetic components are packaged as a 'float4' with the fourth
element unused. The memory transpose uses two subroutines, one for the first four fluid
components and one for the fifth fluid component together with the magnetic components.
Otherwise, there are few differences in parallelization and optimization between CUDA
and OpenCL.
Result
Table 4.4: x86 vs ATI GPU performance for different box sizes; timings in milliseconds.

                          Domain size
Architecture          16³     32³     64³     128³
x86 (1 core)          17.8    140     1096    8770
ATI GPU               10      26      37      128
Speedup (ATI:x86)     1.78    5.4     29.6    68.5
Timing data for different box sizes are provided in Table 4.4.
4.4 Comparative Results and Discussion
4.4.1 Results
We compare results across architectures using four criteria:

1. Code speed-up: the speed-up on the heterogeneous architecture relative to a single
x86 core;

2. Fractional speed-up: the ratio of the code speed-up (heterogeneous to single-core
x86) to the theoretical peak-performance ratio (heterogeneous to single-core x86);

3. Floating-point operations per second (FLOPS) fraction: the ratio of achieved FLOPS
to the theoretical peak for each architecture;

4. Bandwidth fraction: the ratio of achieved data transfer (reads plus writes) to the
theoretical on-chip bandwidth.
All these values are measured with the respective native language for each platform:
OpenMP for multi-core x86, the Cell SDK for the QS22, CUDA for the Nvidia GPU, and
OpenCL for ATI. OpenCL results are also provided as a cross-architecture reference.
We count the total number of operations in one time step of our FORTRAN version,
including the CFL, fluid, and magnetic updates. For a single cell in one time step
(ignoring O(n²) terms) there are 466 additions, 598 subtractions, 1174 multiplications,
125 divisions, and 3 square roots. Since divisions and square roots are a small fraction
of the total, we follow [67] and count each as 1 flop for simplicity. The FORTRAN code
therefore performs 4.62 Giga floating-point operations per time step for a 128³ box,
where one time step comprises 1 CFL function, 6 fluid updates, and 6 magnetic updates.
Combining the code run times with this value, one can calculate the achieved FLOPS on
each architecture.
We similarly count the total data read and written in one time step of the FORTRAN
version, including the CFL, fluid, and magnetic updates and the memory transposes. In
single precision, a single cell in one time step incurs 11 float reads in the CFL
function, 10 float reads and 5 float writes in each fluid update, 14 float reads and
6 float writes in each magnetic update, and 8 float reads and 8 float writes in each
memory transpose. The FORTRAN code therefore moves 2.23 GBytes per time step for a 128³
box (1.46 GBytes read and 0.77 GBytes written), where one time step comprises 1 CFL
function, 6 fluid and 6 magnetic updates, and 4 memory transposes. Combining the code
run times with this value, one can calculate the achieved bandwidth on each architecture.
Table 4.5 presents the comparison across architectures for a box size of 128³, for
both the respective native language and OpenCL. Code speed-up, fractional speed-up, and
FLOPS fraction are included, along with the theoretical single-precision peak
performance, memory bandwidth, and our measured power consumption in watts. No OpenCL
data are given for a single x86 core or for Cell: OpenCL treats a multi-core x86 as a
heterogeneous system and uses all available compute units, and our OpenCL code does not
yet run on the Cell cluster, possibly due to limitations of the beta release of OpenCL
on Cell. Power usage for a single core is not available because the Xeon is a multi-core
processor.
It can be seen that CUDA on the Nvidia GPU achieves the best result in both code and
fractional speed-up.
Table 4.5: Performance comparison for different architectures; timings in milliseconds.
N-GPU is the Nvidia Fermi (Tesla C2050); A-GPU is the ATI HD 5870; peak Gflops is the
theoretical peak single-precision floating-point performance; peak GB/s is the
theoretical on-chip bandwidth.

Architecture          x86(1)   x86(8)   Cell    N-GPU   A-GPU
Respective time       8770     1315     864     64      128
OpenCL time           N/A      6435     N/A     65      128
Peak Gflops           17       136      409.6   1030    2720
Peak GB/s             19.2     19.2     204.8   144     153.6
Power (Watts)         N/A      170      440     550     360
Code speed-up         1.0      6.7      10.2    137     68.5
Fractional speed-up   1.0      0.83     0.42    2.0     0.43
FLOPS fraction        3.1%     2.6%     1.3%    7.0%    1.3%
Bandwidth fraction    1.3%     8.8%     1.3%    24.2%   11.3%
4.4.2 Discussion
Speed-up and fractional parameters on different architectures:
The code speed-up quantifies the total performance gain on each architecture. The
fractional speed-up additionally accounts for the theoretical peak-performance ratio,
and thus reflects the programmer's optimization effort relative to the original code.
The FLOPS fraction tells us how much of the peak FLOPS is achieved, and the bandwidth
fraction what percentage of the peak bandwidth the code uses; comparing the two
fractions indicates which resource is the performance bottleneck. Our results indicate
that CUDA on the Nvidia GPU is a good starting point for heterogeneous computing: it
reaches up to 137 times code speed-up and a 2.0 fractional speed-up¹, and it also has
the highest FLOPS and bandwidth fractions, showing that it uses its compute capability
and bandwidth efficiently.
More detail on CUDA and OpenCL on GPUs:
The Nvidia GPU has scalar shader cores and high computational efficiency. The ATI GPU
has many more cores, giving it a much higher peak floating-point rate, but lower
computational efficiency; this may reflect the difficulty of mapping the algorithm
efficiently onto the 4D+1D vector core design. Since OpenCL also runs on the Nvidia GPU,
we make some further comparisons.
The comparison of CUDA on Nvidia, OpenCL on Nvidia, and OpenCL on ATI is shown in
Figure 4.1. The X axis is the box side length and the Y axis the time for one time step
in milliseconds, both on log scales. CUDA on Nvidia performs well at small box sizes,
OpenCL on ATI catches up to CUDA as the box size grows, and OpenCL on Nvidia performs
almost the same as CUDA on Nvidia. We did not simulate box sizes larger
1. These numbers depend in part on the programmer's optimization effort; we also note
that neither the CPU code nor the CUDA code is fully vectorized.
than 144³ due to the memory limit on the ATI card.
Summary of the code
The code is a three-dimensional finite-difference, finite-volume code. Dimensional
splitting reduces the three-dimensional problem to a sequence of one-dimensional
updates. With the help of matrix transposes, each dimension can be updated separately
with linear memory access; this built-in memory coalescing is very helpful for
heterogeneous computing. Each one-dimensional update fits into one block, and with SIMT
on the GPU this yields a large speed-up. Any cell in a one-dimensional update depends
only on its neighbouring cells, and this simple dependency simplifies the algorithm. The
code is memory bound, and shared memory is used to avoid redundant data fetching.
Insights for the programmer
We provide several insights here for potential heterogeneous programmers:
• Memory management: the control-compute structure of a heterogeneous system comes
with a correspondingly complicated three-level (global-shared-local) memory hierarchy.
Managing it well is crucial for performance: hide the latency of on-chip bandwidth
(e.g. double or multiple buffering on Cell, more active warps on GPUs), use shared
memory to avoid redundant data fetching, and coalesce memory accesses for efficient
reading.
• Maximize computation on the GPU: a 100x speed-up was achieved in our simulation,
but note that almost all computation stays on the GPU, so the low PCI-e bandwidth does
not affect the result. Indeed, our preliminary MPI + CUDA results show that CPU and
PCI-e communication reduces the speed considerably.
• Programming effort: in the author's view, CUDA is the simplest of the three
programming environments, the Cell SDK the most difficult, and OpenCL in between but
quite similar to CUDA. Considering that cross-platform OpenCL
Figure 4.1: Time vs. box size for the GPU comparison (log-log). The X axis is the box
side length; the Y axis is the time for one time step, in milliseconds. Dotted diamonds:
OpenCL on ATI; dashed circles: OpenCL on Nvidia; dash-dotted asterisks: CUDA on Nvidia;
the solid line is a linear fit with slope 3.
performs almost the same as CUDA on the Nvidia GPU, it may be a good idea to start with
CUDA and then port to OpenCL.
Future work
• Cell: only SIMD and double buffering are included in our implementation; more could
be done to exploit the power of the Cell/B.E.
• Nvidia GPU: the register restriction on the fluid update and the shared-memory
restriction on the magnetic update limit occupancy; reorganizing these algorithms might
speed up the code.
• ATI GPU: the ATI SIMD units suffer from low efficiency in vectorized computation;
we will investigate further to exploit the 2.7 Tflops peak of the ATI GPU.
• MPI: we will develop an MPI version of the code for use on GPU clusters.
4.5 Summary
We presented magneto-hydrodynamics simulations on heterogeneous systems: the Cell/B.E.
and Nvidia and ATI GPUs. These systems share a similar structure: a control processor
for task management and many compute-intensive processors for calculation.
Correspondingly, the memory system is complicated, which is a challenge for programmers.
We presented results on the different architectures for comparison; speed-ups of 10, 137,
and 68 times were achieved for Cell, the Nvidia GPU, and the ATI GPU respectively. CUDA
on the Nvidia GPU has the best performance in both code and fractional speed-up, and the
ATI GPU improves with larger simulation sizes. Notably, CUDA and OpenCL perform
similarly on the Nvidia GPU. The 2.0 fractional speed-up for CUDA on the Nvidia GPU
shows that a greater percentage of theoretical peak performance was achieved than on the
x86 architecture.
These performance numbers were obtained with an algorithm translated directly from a
CPU code; designing algorithms with heterogeneous architectures in mind may improve
performance further.
Chapter 5
Conclusion
Black hole accretion Here, we present several new, large-dynamical-range MHD
simulations of black hole accretion in the Galactic Center. This is the first
three-dimensional large-scale MHD simulation that does not encounter problems with the
outer boundary and runs long enough to achieve a stable state. The simulation is
designed to help explain the low luminosity of the supermassive black hole, and it
confirms the class of magnetically frustrated accretion flows. Multiple physical and
numerical parameters are tested, including the magnetic field strength, rotation, the
ratio of the Bondi radius to the inner boundary radius, and resolution. An extrapolation
formula for the accretion rate based on these free parameters is proposed, which appears
consistent with the observational data. The accretion rate is very small, the density
slope is around −1, and the accretion flow is subsonic with no outward flux or
rotational support.
The rotation measure (RM) is an efficient tool for exploring the characteristics of the
accretion flow; its value is closely tied to the radius at which the electrons become
relativistic. We argue that this radius varies from 40 to 250 Schwarzschild radii
between the thermal and adiabatic limits, so more observations are needed to determine
it. We also propose that the variability of the RM can be an effective constraint
Figure 5.1: Atacama Large Millimeter/Submillimeter Array (ALMA). ALMA has much
higher sensitivity and higher resolution compared with current sub-millimeter telescopes.
Image courtesy ALMA (ESO/NAOJ/NRAO).
for the accretion models. Our subsonic, magnetically frustrated accretion flow without
rotational support implies a slowly varying RM, with a time scale of months to a year in
the simulations. By contrast, the models from ADAF to CDAF through ADIOS all involve
fast-flowing plasma, and some also have rotational support, implying rapid RM variation
on time scales of hours to weeks. To measure the RM variability accurately, a time
series of data points is needed.
Several groups have already detected RM values, using the Submillimeter Array [56] and
the Berkeley-Illinois-Maryland Association (BIMA) array [55; 21]. Furthermore, the
Atacama Large Millimeter/submillimeter Array (ALMA), with much higher sensitivity and
resolution, will come into operation in 2012 (Figure 5.1). This development will be
extremely helpful for RM observations: if a slowly varying RM is confirmed, our model
can be distinguished from the others.
Fast magnetic reconnection Here, we present the first global three-dimensional ideal
MHD simulation of fast magnetic reconnection. Fast reconnection is an intrinsically
three-dimensional effect (Figure 5.2)¹, in contrast to the two-dimensional picture of
Sweet-Parker
1. A three-dimensional movie of the evolution of the magnetic field lines can be seen at
http://www.cita.utoronto.ca/~bpang/mhd_simulation_in_astrophysics/long_term_01boxV1rK2.wmv
Figure 5.2: The three-dimensional simulation box for fast magnetic reconnection. Fast
reconnection is a three-dimensional effect, and the global geometry determines the
reconnection.
and Petschek's models. Fast reconnection is determined by the global geometry rather
than by the micro-physics of the X-point in Petschek's solution, and it does not rely on
specific boundary conditions, external driving, or anomalous resistivity. This
Petschek-like reconnection is self-contained and generic: about 30% of the magnetic
energy is released in one Alfven time, which qualifies as fast reconnection.
The reconnection is initiated by a strong localized perturbation to the field lines in
a periodic box, and the two reconnection regions interact with each other, helping the
reconnection occur rapidly. We conclude that Petschek-like X-point reconnection is not
determined by the microphysics at the infinitesimal boundary, where no energy resides,
but by the global flow far from the X-point. In addition, we simulated two-dimensional
reconnection and found no fast reconnection, supporting the conclusion that fast
reconnection is indeed a three-dimensional
effect.
Accelerate MHD We present the first and broadest speed comparison of MHD simulation
across heterogeneous systems. By porting the FORTRAN MHD code to multi-core x86, Cell,
and Nvidia and ATI GPUs, we show that the Nvidia GPU performs best in both code and
fractional speed-up; in fact, more than a 100-fold speed-up is achieved.
The code speed-ups on the different architectures are: 6.7 for multi-core x86, 10 for
Cell, 137 for the Nvidia GPU, and 68 for the ATI GPU. Taking theoretical peak
performance into account, the fractional speed-ups are: 0.83 for multi-core x86, 0.42
for Cell, 2.0 for the Nvidia GPU, and 0.43 for the ATI GPU.
The Nvidia GPU achieves the best performance, with a factor of 100 speed-up, and its
fractional speed-up of 2.0 shows that a larger percentage of theoretical peak
performance was reached than on x86. The ATI GPU also achieves nearly a 70-fold
speed-up, but given its large claimed theoretical peak, its fractional speed-up of 0.43
is poor. Cell achieves a 10-fold speed-up, more than the 6.7 from OpenMP on x86;
however, porting a program to Cell may not be worthwhile given the cost of a Cell blade
and the difficulty of programming with the Cell SDK. CUDA and OpenCL perform equally
well on the Nvidia GPU, suggesting it may be a good idea to program in OpenCL to exploit
its cross-platform portability.
As our work shows, a heterogeneous system usually consists of one controlling processor
that manages task assignment and many computing processors that perform the heavy
parallel calculation; this is what enables the heterogeneous platform to accelerate.
This structure requires an equally complicated memory system: the main data stay in the
controlling processor's memory and are transferred to local memory for calculation.
Because local memory is limited, the calculations must be organized carefully to avoid a
communication bottleneck. The key to programming heterogeneous systems is to keep
parallelism in mind at all times; unlike a parallel job on an ordinary computer cluster,
one must now partition both the data and the algorithms across the computing cores on
the platform.
Future of MHD simulation In this thesis we have presented examples of simulations in
scientific computing. Using supercomputers, we attempt to reproduce physical phenomena
and propose physical explanations. A great number of problems were solved in the
process, but new complications arose.
A common concern is the accuracy of the simulations: how closely a simulation can model
reality, and whether the results are adequate. At the same time, simulations are
expected to deliver results as quickly as possible.
Our MHD simulations are an instructive example. We have a fast MHD code and extremely
powerful supercomputers, yet to model the black hole accretion with a 300³ deformed-mesh
grid box, which achieves an effective box size of 4000³, at least three weeks are needed
on a 216-CPU parallel job to obtain a long-term, stable result. Even this simulation
only reaches a ratio RB/Rin of 100, while the real-world ratio is about 10000, so the
result cannot fully represent the real dynamics. Although we obtained an expression for
the density slope from our simulation results, and the formula fits the observational
data very well, it is based on extrapolation². An alternative would be to enlarge the
ratio toward the real value, but the update time step is inversely proportional to the
grid size; the time step would become so small that running such a large simulation to a
stable state would be almost impossible. We are left with no choice but to run the
feasible simulation and accept results that are not completely

2. Extrapolation over large scales is very common in astrophysics, even though it is not
entirely accurate.
satisfactory.
Similar situations occur in other fields, prompting questions such as: is the assumed
mathematical model correct? Is the algorithm suitable for the problem? Is the
discretization fine enough to represent the problem? Increasing the accuracy can greatly
increase the cost of calculation: more complicated models, higher-order algorithms, and
finer grids all require more computation.
Here we reach an impasse: it appears very difficult to achieve adequate results on a
satisfying time scale. Is a solution possible?
The solution is actually quite simple: increase the speed of the simulation, so that a
more complicated calculation can be completed in an affordable time. Five to ten years
ago, two-dimensional simulations were mainstream; now, more complicated
three-dimensional simulations can be performed easily, thanks to growing computing
power and hence simulation speed.
Here we have provided examples of fast computing through parallelization on a
supercomputer, and demonstrated an impressive speed-up using heterogeneous systems. We
can foresee an even greater speed-up from combining both: parallel computing on
heterogeneous systems, comprising host nodes, GPUs, and interconnects. The idea is not
new; five years ago, scientists at the State University of New York [31] built a 30-node
GPU cluster and sped up their simulation by 4.6 times compared with a traditional CPU
cluster, even though their GPUs were far less powerful and could only be programmed in
Cg [7]. GPUs have become much more powerful in a relatively short time: Nvidia has
developed GPUs focused on high performance computing, e.g. Tesla and Fermi. The Nvidia
GPU roadmap is shown in Figure 5.3; the projected double-precision performance will
double in the next two years and quadruple by 2013. Nvidia has also developed CUDA,
designed for data-parallel programming, which increases the
Figure 5.3: Roadmap for Nvidia GPUs. DP denotes double precision; FLOPS (floating-point
operations per second) is a measure of computing performance. The X axis is time; the Y
axis is computing performance. Tesla, Fermi, Kepler, and Maxwell are the family names of
successive generations of Nvidia GPUs.
ease of programming.
Presently in Ontario, SHARCNET [8], one of seven high-performance computing consortia in
Canada, has a GPU cluster containing 11 Nvidia Tesla S1070 GPU servers with a
single-precision peak performance of over 40 Tflops. Each GPU server has 4 GPUs and 16GB
of global memory and is connected to 2 HP DL160G5 CPU servers; the CPU-CPU connections
are via 4X DDR InfiniBand. CITA is also seeking to acquire a new GPU cluster.
Meanwhile, we have some preliminary results on GPU clusters. Using our original FORTRAN
MPI code, with the magnetic and fluid update subroutines replaced by CUDA versions, we
ran the simulation on a mini GPU cluster of 2 Tesla C1060 GPU nodes and obtained a
speed-up of 3.4 compared with OpenMP on a conventional CPU cluster. GPU clusters offer
further benefits over traditional CPU clusters: GPUs are cheaper, occupy less space, and
consume less power, which also lowers cooling demands.
However, these techniques are not yet mature and many problems remain. The most
pertinent issue is communication, which took more than half of the running time in our
GPU cluster test. Increasing the efficiency of communication would not only raise the
speed but also lower the power consumption: [47] showed that global memory accesses
(communication) consume more power than on-chip register or shared-memory accesses.
Much like increasing the speed in computer simulation, the simple solution for parallel
heterogeneous computing is to increase the efficiency of the communication.
There is a multitude of methods for addressing this problem, and they are proving successful. In a GPU cluster, communication involves both GPU-CPU and CPU-CPU connections. To increase its efficiency, programmers need to minimize the I/O between the host and the GPU, in order to keep the GPUs computing. Efficient hardware design is also needed: for example, InfiniBand [9] between CPUs, and better connections between GPU and CPU. PCI-e 3.0, due at the end of 2010, doubles the bandwidth of the current PCI-e 2.0, which will improve the communication speed.
Here we make a simple, rough estimate for the future CITA GPU cluster. The expected cluster will have 360 Fermi family GPUs from Nvidia. Each Fermi delivers about 1 Tflops of computing power, which translates into a theoretical peak performance of about 0.4 Pflops. However, compared with a traditional CPU cluster, it costs only one tenth as much and consumes one twentieth of the power, which is very attractive. Based on the results of our preliminary test and the improved communication from PCI-e 3.0,
this GPU cluster should achieve a 10-20 times speed-up over a CPU cluster on a per-node basis (assuming a one-to-one CPU-GPU structure for the GPU cluster). This means that, running the same program on the same number of nodes, we can expect at least a factor of 10 speed-up at lower cost and power consumption. For example, it took about three weeks to finish the black hole accretion run using 27 nodes of the CPU cluster, but it would require only about two days on a GPU cluster with the same number of nodes. We will be
able to try larger simulations; for example, RB/Rin = 1000 would reach a satisfyingly stable result in three to four months with only 27 nodes on the GPU cluster, whereas five or more years would be needed for 27 nodes on a CPU cluster. If all 360 GPU nodes could be used, we would be able to simulate the real black hole (RB/Rin = 10000) in about three years. This may appear long, but it is at least within the realm of possibility.
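The arithmetic behind these projections is elementary; the following sketch simply rechecks the figures quoted above (the 10x speed-up is the conservative end of our estimated range, and all inputs are the estimates stated in the text):

```python
# Back-of-envelope check of the GPU cluster projections quoted above.
n_gpus = 360
tflops_per_fermi = 1.0                  # ~1 Tflops per Fermi GPU
peak_pflops = n_gpus * tflops_per_fermi / 1000.0
assert abs(peak_pflops - 0.36) < 1e-12  # quoted as roughly 0.4 Pflops

speedup = 10                            # conservative end of the 10-20x range
cpu_run_days = 3 * 7                    # three weeks on 27 CPU nodes
gpu_run_days = cpu_run_days / speedup
assert abs(gpu_run_days - 2.1) < 1e-12  # "about two days" on 27 GPU nodes
```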
As we approached the completion of this thesis, we became aware that a GPU supercomputer, Tianhe-1, officially became the fastest supercomputer in the world on October 28, 2010. Its 2.5 Pflops performance surpassed the former number one CPU supercomputer (Jaguar) by 40 percent, securing first place in the coming TOP500 list. This supercomputer contains over 7,000 Fermi GPUs, a very good example supporting our view that great speed-ups will be achieved by parallel computing on heterogeneous systems.
We have witnessed the development of computer simulations, from the realization of simple to complicated mathematical models. We have witnessed the development of conventional computer clusters, from several linked office computers to the former number one conventional supercomputer, Jaguar, with a theoretical peak performance of 1.75 Pflops. We have also witnessed the development of heterogeneous systems, including the hardware (for example, Nvidia Fermi and ATI HD 5870) and the software (for example, the programming languages CUDA and OpenCL; subroutine and template libraries such as CUBLAS and CUFFT; and compiler-based approaches such as PyCUDA and RapidMind). Most recently, we witnessed the success of Tianhe-1, a GPU cluster that surpassed all conventional CPU supercomputers.

For all of these reasons, we believe that parallel heterogeneous computing will play a much more important role in scientific research in the future.
Summary. We present three-dimensional, large scale magnetohydrodynamics simulations of black hole accretion in the Galactic centre. Our simulations are significant because they are free of outer boundary problems and reach a stable state over long runs. The subsonic, magnetically frustrated accretion flow predicts a low, slowly variable rotation measure. Furthermore, we present the first fast magnetic reconnection in three-dimensional ideal magnetohydrodynamics simulations: the reconnection is an intrinsically three-dimensional effect, and about 30% of the magnetic energy is released in one Alfven time. Finally, we present the first and widest speed comparison of magnetohydrodynamics simulations across heterogeneous systems; CUDA on Nvidia GPUs performs best, achieving more than a one-hundred-fold speed-up.
Appendix A
Rotation measure constraint on
accretion flow
In traversing the accretion flow, linearly polarized radio waves of wavelength λ are rotated by RM λ² radians, where
\[
\mathrm{RM} = \frac{e^3}{2\pi m_e^2 c^4} \int n_e\, f(T_e)\, B \cos\theta \, dl . \tag{A.1}
\]
Here f(T_e) is a ratio of modified Bessel functions, f(T_e) = K_0(m_e c²/k_B T_e)/K_2(m_e c²/k_B T_e) [86], which suppresses RM by a factor ∝ T_e^{-2} wherever electrons are relativistic. The integral here covers the entire path from source to observer; θ is the angle between B and the line of sight. This expression is appropriate for the frequencies at which RM has been observed; at lower frequencies, where propagation is "superadiabatic" [23], cos θ → ±|cos θ|.
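The limiting behaviour of f(T_e) is easy to verify numerically. A minimal sketch, assuming SciPy's integer-order modified Bessel functions are available (θ_e below denotes the dimensionless temperature k_B T_e/m_e c², our notation):

```python
from scipy.special import kn  # modified Bessel functions of the second kind, K_n

def f_suppression(theta_e):
    """f(T_e) = K0(1/theta_e) / K2(1/theta_e), theta_e = k_B T_e / (m_e c^2)."""
    x = 1.0 / theta_e
    return kn(0, x) / kn(2, x)

# Nonrelativistic electrons (theta_e << 1): essentially no suppression.
assert 0.8 < f_suppression(0.05) < 1.0
# Relativistic electrons (theta_e >> 1): strong suppression of the RM.
assert f_suppression(10.0) < 0.05
```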
We adopt a power-law solution with negligible rotational support in which ρ ∝ r^{-k}, and the total pressure P ∝ r^{-k_P} with k_P = k + 1; moreover we take T_e ∝ r^{-k_T} for the relativistic electrons. The hydrostatic equation dP/dr = -GMρ/r² becomes
\[
P = P_g + P_B = \frac{GM}{(k+1)} \frac{\rho}{r} , \tag{A.2}
\]
and with P_g = βP_B = βB²/(8π) and ρ = n_e μ_e (where μ_e = 1.2 m_p is the mass per electron),
\[
B = \left[ \frac{8\pi}{(\beta+1)(k+1)} \frac{GM \mu_e n_e}{r} \right]^{1/2} . \tag{A.3}
\]
So long as k > 1/3 (so that RM converges at large radii) and k < (1 + 4k_T)/3 (so that it converges inward as well), the RM integral is dominated by radii around R_rel. Taking a radial line of sight (dl → dr), we write
\[
\int_0^\infty n_e f(T_e) B \cos\theta \, dr
= F(k,k_T) \int_{R_{\rm rel}}^\infty n_e B \cos\theta \, dr
= \frac{2}{3k-1}\, \langle\cos\theta\rangle\, F(k,k_T)\, \left[ n_e B r \right]_{R_{\rm rel}} , \tag{A.4}
\]
where ⟨cos θ⟩ encapsulates the difference between the true integral and what it would have been if θ = 0 all along the path, and F(k, k_T) encapsulates the difference between a smooth cutoff and a sharp one. We plot F(k, k_T) in Figure A.1; it is of order unity except as k_T approaches (3k - 1)/4. All together,
\[
\mathrm{RM} = \frac{4 e^3 G M}{m_e^2 c^5}\,
\frac{\langle\cos\theta\rangle\, F(k,k_T)}{3k-1}
\left[ \frac{\mu_e\, n_e(R_{\rm rel})^3}{\pi (k+1)(\beta+1)}\,
\frac{R_{\rm rel}}{R_S} \right]^{1/2} . \tag{A.5}
\]
To estimate n_e(R_rel) from RM, one must make assumptions about the uncertain parameters β, ⟨cos θ⟩, k_T, and R_rel/R_S; then k can be derived self-consistently from the observed n_e(R_B) and RM. Our fiducial values of these parameters are 10, 0.5, 0.5, and 100, respectively, of which we consider the last to be the most uncertain. We now discuss each in turn.
Although the magnetization parameter β could conceivably take a very wide range
of values, we consistently find β ≃ 10 in our simulations, with some tendency for β to
decrease inward. We consider it unlikely for the flow to be much less magnetized, given
the magnetization of the galactic center and the fact that weak fields are enhanced in
most of the flow models under consideration.
Figure A.1: The logarithm of the relativistic RM factor, log10 F(k, k_T). The true RM integral is modified by a factor F(k, k_T) relative to an estimate in which the nonrelativistic formula is used but the inner bound of integration is set to the radius R_rel at which electrons become relativistic; see equation (A.1).

If B wanders little in the region where the integrand is large (a zone of width ∼ R_rel around R_rel), and is randomly oriented relative to the line of sight, then ⟨cos θ⟩ ≃ cos θ(R_rel), typically 1/2 in absolute value. If the field were purely radial, ⟨cos θ⟩ would be unity.
Conversely, if B reverses frequently in this region (the number of reversals N_r is large), then ⟨cos θ⟩_rms ≃ 1/(2√(N_r + 1)) will be small. However, N_r cannot be too large, or magnetic forces are unbalanced. We gauge its maximum value by equating the square of the buoyant growth rate, N² = [(3 - 2k)/5] GM/r³, against the square of the Alfven frequency, N_r² v_A²/r². Noting that v_A² = GM/[2(β + 1)(k + 1)r], we find N_r² ≃ (2/5)(β + 1)(k + 1)(3 - 2k). For β = 10 and k = 1 this implies ⟨cos θ⟩_rms ≃ 0.25: a very minor suppression. We can therefore be confident that ⟨cos θ⟩ = 0.5 to within a factor of 2, unless β ≫ 10 for some reason.
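As a sanity check, this estimate is simple enough to evaluate directly (a sketch; the variable names are ours):

```python
import math

# Maximum number of field reversals before magnetic forces are unbalanced:
# N_r^2 ~ (2/5)(beta + 1)(k + 1)(3 - 2k), at the fiducial beta = 10, k = 1.
beta, k = 10.0, 1.0
Nr2 = 0.4 * (beta + 1.0) * (k + 1.0) * (3.0 - 2.0 * k)
# RMS suppression of the line-of-sight factor: <cos θ>_rms ~ 1/(2 sqrt(N_r + 1))
cos_rms = 1.0 / (2.0 * math.sqrt(math.sqrt(Nr2) + 1.0))

assert abs(Nr2 - 8.8) < 1e-9     # N_r ~ 3 reversals at most
assert abs(cos_rms - 0.25) < 0.01  # a very minor suppression
```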
The precise value of k_T is not important unless it approaches or falls below the minimum value (3k - 1)/4. If electron conduction is very strong this is unavoidable, as rapid transport implies k_T ≃ 0; however, in this case the relativistic region disappears, as discussed below. Alternately, if relativistic electrons are trapped and adiabatic, T_e ∝ ρ^{1/3} and k_T = k/3; avoiding k_T < (3k - 1)/4 then requires k ≤ 3/5, which can only be realized within the CDAF model. Finally, if electrons remain strongly coupled to ions, k_T = 1 and we only require k < 5/3.
The location at which electrons become relativistic, R_rel, is quite uncertain. Models such as those of [95], in which electrons are heated while advecting inward, predict R_rel ≃ 10² R_S. The maximum conceivable R_rel corresponds to adiabatic compression of the electrons, inward from the radius at which they decouple from ions; this yields about 500/(1 + k) R_S. If conduction is very strong, however, electrons should remain cold throughout the flow; in this case we should replace R_rel/R_S → 1 and F(k, k_T) → 1 in equation (A.5).
Adopting our fiducial values for the other variables, and taking F(k, k_T) → 1 for lack of knowledge regarding k_T, we may solve for the self-consistent value of k that connects the density at R_B with the n_e(R_rel) derived from equation (A.5). We find k → (0.90, 1.23, 1.32) for R_rel/R_S → (200, 100, 1), respectively. As noted in the text, the current small set of RM measurements allows a two order of magnitude range in RM_est, and k ∼ 1 is consistent with the data. Longer monitoring of the RM time variability and amplitude will improve the constraints.
Appendix B
Inner boundary conditions
The inner boundary conditions were determined by first solving for the vacuum solution
of the magnetic field inside the entire inner boundary cube. Then inside the largest
possible sphere within this cube, matter and energy were removed.
To simplify the programming, we put the entire inner boundary region on one node.
This meant that the grid had to be divided over an odd number of nodes in each Cartesian
direction.
B.0.1 Magnetic field
In order to determine the vacuum magnetic field solution, we use the following two
Maxwell equations for zero current:
\[
\nabla \cdot \mathbf{B} = 0 , \tag{B.1}
\]
\[
\nabla \times \mathbf{B} = 0 . \tag{B.2}
\]
Equation (B.2) enables us to write B = ∇φ for some scalar function φ. Combining this with (B.1), we obtain Laplace's equation
\[
\nabla^2 \phi = 0 , \tag{B.3}
\]
which we solve with Neumann boundary conditions (the normal derivative n · ∇φ speci-
fied) given by B · n on the boundary of the cube.
Since the MHD code stores the values of B on the left-hand cell faces, we must solve
for φ in cell centers and then take derivatives to get the value of B on the cell boundary.
Let the inner boundary cube be of side length N , consisting of cells numbered 1, . . . , N
in all three directions. In order to simplify the problem we set B · n = 0 on five of the
six faces of the cube, and find the contribution to φ from one face at a time.
Suppose B · n = 0 on all of the faces except the i = N + 1 face (i.e., B_x^{N+1,j,k} can be non-zero). The Laplace equation (B.3) with Neumann boundary conditions only has a solution if the net flux of field into the cube is zero. Since all of the boundary conditions are zero except on the i = N + 1 face, that face must have zero net flux through it. Defining
\[
\bar{B}_x^{N+1} = \frac{1}{N^2} \sum_{j=1}^{N} \sum_{k=1}^{N} B_x^{N+1,j,k} \tag{B.4}
\]
to be the average of B_x on the i = N + 1 face, and letting b_x^{N+1,j,k} = B_x^{N+1,j,k} - \bar{B}_x^{N+1}, we can use b_x^{N+1,j,k} as the boundary condition and add \bar{B}_x^{N+1} back in later.
We use separation of variables to solve for φ. Set φ^{ijk} = X^i Y^j Z^k, substitute into (B.3), and rearrange to get
\[
\frac{X^{i+1} - 2X^i + X^{i-1}}{X^i}
+ \frac{Y^{j+1} - 2Y^j + Y^{j-1}}{Y^j}
+ \frac{Z^{k+1} - 2Z^k + Z^{k-1}}{Z^k} = 0 . \tag{B.5}
\]
Now let
\[
\frac{Y^{j+1} - 2Y^j + Y^{j-1}}{Y^j} = -\eta^2 , \quad \text{and} \tag{B.6}
\]
\[
\frac{Z^{k+1} - 2Z^k + Z^{k-1}}{Z^k} = -\omega^2 . \tag{B.7}
\]
Solving equations (B.6) and (B.7) with the boundary conditions,
\[
Y_m^j = \cos\frac{m\pi(j - \frac{1}{2})}{N} , \quad
Z_n^k = \cos\frac{n\pi(k - \frac{1}{2})}{N} , \tag{B.8}
\]
\[
\eta_m^2 = 4\sin^2\frac{m\pi}{2N} , \quad
\omega_n^2 = 4\sin^2\frac{n\pi}{2N} . \tag{B.9}
\]
Substituting (B.6), (B.7), and (B.9) into (B.5) and solving yields
\[
X_{mn}^i = \cosh\frac{\alpha_{mn}\pi(i - \frac{1}{2})}{N} , \tag{B.10}
\]
where
\[
\alpha_{mn} = \frac{2N}{\pi}\,\mathrm{arcsinh}\sqrt{\sin^2\frac{n\pi}{2N} + \sin^2\frac{m\pi}{2N}} . \tag{B.11}
\]
Finally, putting this all together,
\[
\phi^{ijk} = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} A_{mn}
\cos\frac{m\pi(j - \frac{1}{2})}{N}
\cos\frac{n\pi(k - \frac{1}{2})}{N}
\cosh\frac{\alpha_{mn}\pi(i - \frac{1}{2})}{N} , \tag{B.12}
\]
where we define A_{00} = 0.
To determine the coefficients A_{mn} we add in the final boundary condition (i = N + 1), and get
\[
A_{mn} = \frac{4}{N^2}\,
\frac{1}{2\sinh(\alpha_{mn}\pi)\sinh(\alpha_{mn}\pi/2N)}
\sum_{j=1}^{N} \sum_{k=1}^{N} b_x^{N+1,j,k}
\cos\frac{m\pi(j - \frac{1}{2})}{N}
\cos\frac{n\pi(k - \frac{1}{2})}{N} . \tag{B.13}
\]
A similar calculation may be performed for the case when the i = 1 boundary has
non-zero field. After finding the contribution from each face, store their sum in φ.
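The single-face construction above can be sketched compactly. The following NumPy translation of equations (B.8)-(B.13) is illustrative only, not the thesis' FORTRAN implementation; it treats one non-zero face with zero-mean (zero net flux) data, makes the standard cosine-transform weights for the m = 0 or n = 0 modes explicit where equation (B.13) quotes the generic factor 4/N², and checks that the one-sided difference of the resulting potential (cf. equation (B.15)) reproduces the prescribed boundary field:

```python
import numpy as np

N = 8
rng = np.random.default_rng(1)
b = rng.standard_normal((N, N))
b -= b.mean()                      # zero net flux through the face, cf. eq. (B.4)

j = np.arange(1, N + 1)
cosmode = lambda m: np.cos(m * np.pi * (j - 0.5) / N)   # cosine modes, eq. (B.8)

i = np.arange(1, N + 2)            # cell centers i = 1..N plus one layer beyond
phi = np.zeros((N + 1, N, N))
for m in range(N):
    for n in range(N):
        if m == 0 and n == 0:
            continue               # A_00 = 0
        alpha = (2 * N / np.pi) * np.arcsinh(np.sqrt(
            np.sin(n * np.pi / (2 * N))**2
            + np.sin(m * np.pi / (2 * N))**2))          # eq. (B.11)
        wm, wn = (2.0 if m else 1.0), (2.0 if n else 1.0)  # DCT weights
        A = (wm * wn / N**2) * np.sum(b * np.outer(cosmode(m), cosmode(n))) \
            / (2 * np.sinh(alpha * np.pi)
               * np.sinh(alpha * np.pi / (2 * N)))      # eq. (B.13)
        phi += A * np.einsum('i,j,k->ijk',
                             np.cosh(alpha * np.pi * (i - 0.5) / N),  # eq. (B.10)
                             cosmode(m), cosmode(n))

# Normal field on the i = N+1 face via the one-sided difference of eq. (B.15):
# it should reproduce the prescribed boundary data b.
Bx_face = phi[N] - phi[N - 1]
assert np.allclose(Bx_face, b)
```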
To deal with the subtracted face-average fields, let
\[
\phi_0^{ijk} = \bar{B}_x^{1}\, i + \bar{B}_y^{1}\, j + \bar{B}_z^{1}\, k
+ \frac{\bar{B}_x^{N+1} - \bar{B}_x^{1}}{2N}\,(i^2 - j^2)
+ \frac{\bar{B}_x^{N+1} + \bar{B}_y^{N+1} - \bar{B}_x^{1} - \bar{B}_y^{1}}{2N}\,(j^2 - k^2) , \tag{B.14}
\]
where the bars denote face averages defined as in equation (B.4), and add this to φ. φ_0 is the potential of a cube in which each face carries the uniform magnetic field given by the average of the magnetic field on the corresponding face of the inner boundary cube.
Figure B.1: Vacuum solution of the magnetic field, calculated in the central region. The field lines outside of the central region show the boundary condition.
To find B, set
\[
B_x^{ijk} = \phi^{ijk} - \phi^{i-1,j,k} , \tag{B.15}
\]
\[
B_y^{ijk} = \phi^{ijk} - \phi^{i,j-1,k} , \tag{B.16}
\]
\[
B_z^{ijk} = \phi^{ijk} - \phi^{i,j,k-1} . \tag{B.17}
\]
In Figure B.1 we used the magnetic field solver with a boundary condition consisting
of field going in one side and out an adjacent side of the box. This boundary condition
tests both the φ0 component of the solution (since faces have non-zero net flux) as well
as the Fourier series component (since faces have non-constant magnetic field).
B.0.2 Density and pressure
Inside the largest sphere that can be inscribed within the inner boundary cube, we adjust the density and pressure so that the Alfven speed and the sound speed are both equal to the circular speed. We accomplish this by setting
\[
\rho = \frac{B^2 r}{G M_{\rm BH}} , \tag{B.18}
\]
\[
p = \frac{G M_{\rm BH}\, \rho}{\gamma r} . \tag{B.19}
\]
We then reduce p by a factor of 10. To ensure stability, ρ and p are assigned minimum values of 0.1 times the average value of ρ outside the sphere, and 0.001, respectively.
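As a concrete illustration, the reset reads as follows in hypothetical code units (GM_BH = 1 and γ = 5/3 are our assumed values here; the simulation's actual unit choices may differ):

```python
# Sketch of the inner-sphere density/pressure reset, eqs. (B.18)-(B.19),
# in hypothetical code units.
GM_BH = 1.0
gamma = 5.0 / 3.0

def inner_state(B2, r, rho_avg_outside):
    rho = B2 * r / GM_BH                   # Alfven speed equals the circular speed
    p = 0.1 * GM_BH * rho / (gamma * r)    # sound speed matched, then p -> 0.1 p
    rho = max(rho, 0.1 * rho_avg_outside)  # density floor for stability
    p = max(p, 1e-3)                       # pressure floor for stability
    return rho, p

rho, p = inner_state(B2=1.0, r=1.0, rho_avg_outside=0.5)
assert rho == 1.0 and abs(p - 0.06) < 1e-12
```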
Appendix C
Supporting Movie for black hole
accretion
Animation of the magnetically frustrated convection simulation.

The qualitative behavior of the accretion flow is best illustrated in the form of a movie. This movie shows case 25. The raw simulation used 600³ grid cells. The Bondi radius is at 1000 grid units, where one grid unit is the smallest central grid spacing. The full box size is 8000³ grid units. Colour represents the entropy, and arrows represent the magnetic field vector. The right side shows the equatorial plane (yz); the left side shows a perpendicular plane (xy). The moving white circles represent the flow of an unmagnetized Bondi solution, starting at the Bondi radius. On average, the fluid is slowly moving inward, in a state of magnetically frustrated convection. Other formats can be found at http://www.cita.utoronto.ca/~pen/MFAF/blackhole_movie/index.html.
Bibliography
[1] http://www.research.ibm.com/cell/.
[2] http://wiki.cita.utoronto.ca/mediawiki/index.php/Sunnyvale.
[3] http://www.scinet.utoronto.ca/.
[4] http://www.khronos.org/opencl/.
[5] http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide
[6] http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-
5870/Pages/ati-radeon-hd-5870-specifications.aspx.
[7] http://developer.nvidia.com/page/cg_main.html.
[8] https://www.sharcnet.ca/.
[9] http://en.wikipedia.org/wiki/InfiniBand.
[10] E. Agol. Sagittarius A* Polarization: No Advection-dominated Accretion Flow,
Low Accretion Rate, and Nonthermal Synchrotron Emission. ApJL, 538:L121–L124,
August 2000.
[11] H. Alfven. Existence of Electromagnetic-Hydrodynamic Waves. Nature, 150:405–
406, October 1942.
[12] A. Arevalo, R.M. Matinata, M. Pandian, E. Peri, K. Ruby, F. Thomas, and C. Al-
mond. Programming the Cell Broadband Engine Architecture: Examples and Best
Practices. IBM Red-Books, 2008.
[13] F. K. Baganoff, M. W. Bautz, W. N. Brandt, G. Chartas, E. D. Feigelson, G. P.
Garmire, Y. Maeda, M. Morris, G. R. Ricker, L. K. Townsley, and F. Walter. Rapid
X-ray flaring from the direction of the supermassive black hole at the Galactic Cen-
tre. Nature, 413:45–48, September 2001.
[14] F. K. Baganoff, Y. Maeda, M. Morris, M. W. Bautz, W. N. Brandt, W. Cui, J. P.
Doty, E. D. Feigelson, G. P. Garmire, S. H. Pravdo, G. R. Ricker, and L. K. Towns-
ley. Chandra X-Ray Spectroscopic Imaging of Sagittarius A* and the Central Parsec
of the Galaxy. ApJ, 591:891–915, July 2003.
[15] J. Birn, J. F. Drake, M. A. Shay, B. N. Rogers, R. E. Denton, M. Hesse,
M. Kuznetsova, Z. W. Ma, A. Bhattacharjee, A. Otto, and P. L. Pritchett. Geospace
Environmental Modeling (GEM) magnetic reconnection challenge. JGR, 106:3715–
3720, March 2001.
[16] D. Biskamp. Magnetic Reconnection Via Current Sheets (Invited paper). In M. A.
Dubois, D. Gresellon, and M. N. Bussac, editors, Magnetic Reconnection and Tur-
bulence, pages 19–+, 1986.
[17] D. Biskamp. Magnetic Reconnection in Plasmas. November 2000.
[18] R. D. Blandford and M. C. Begelman. On the fate of gas accreting at a low rate on
to a black hole. MNRAS, 303:L1–L5, February 1999.
[19] H. Bondi. On spherically symmetrical accretion. MNRAS, 112:195, 1952.
[20] G. C. Bower, D. C. Backer, J.-H. Zhao, M. Goss, and H. Falcke. The Linear Polarization of Sagittarius A*. I. VLA Spectropolarimetry at 4.8 and 8.4 GHz. ApJ, 521:582–586, August 1999.
[21] G. C. Bower, H. Falcke, M. C. Wright, and D. C. Backer. Variable Linear Polarization
from Sagittarius A*: Evidence of a Hot Turbulent Accretion Flow. ApJL, 618:L29–
L32, January 2005.
[22] M. A. Brentjens and A. G. de Bruyn. Faraday rotation measure synthesis. A&A,
441:1217–1228, October 2005.
[23] A. E. Broderick and R. D. Blandford. Understanding the Geometry of Astrophysical
Magnetic Fields. ApJ, 718:1085–1099, August 2010.
[24] A. E. Broderick and J. C. McKinney. Parsec-scale Faraday Rotation Measures
from General Relativistic MHD Simulations of Active Galactic Nuclei Jets. ArXiv
e-prints, June 2010.
[25] P.E. Ceruzzi. Beyond the limits: flight enters the computer age. The MIT Press,
1989.
[26] T. Chen, R. Raghavan, JN Dale, and E. Iwata. Cell broadband engine architecture
and its first implementation: a performance view. IBM Journal of Research and
Development, 51(5):559–572, 2007.
[27] X. Chen, M. A. Abramowicz, and J.-P. Lasota. Advection-dominated Accretion:
Global Transonic Solutions. ApJ, 476:61–+, February 1997.
[28] T. G. Cowling. Magnetohydrodynamics. 1976.
[29] K. P. Dere, J.-D. F. Bartoe, G. E. Brueckner, J. Ewing, and P. Lund. Explosive
events and magnetic reconnection in the solar atmosphere. JGR, 96:9399–9407, June
1991.
[30] C. R. Evans and J. F. Hawley. Simulation of magnetohydrodynamic flows - A constrained transport method. ApJ, 332:659–677, September 1988.
[31] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU cluster for high perfor-
mance computing. In Proceedings of the 2004 ACM/IEEE conference on Supercom-
puting, page 47. IEEE Computer Society, 2004.
[32] V. L. Fish, S. S. Doeleman, A. E. Broderick, A. Loeb, and A. E. E. Rogers. Detecting
Changing Polarization Structures in Sagittarius A* with High Frequency VLBI. ApJ,
706:1353–1363, December 2009.
[33] J. Frank, A. King, and D. Raine. Accretion power in astrophysics. 1992.
[34] F. F. Gardner and J. B. Whiteoak. The Polarization of Cosmic Radio Waves.
ARA&A, 4:245–+, 1966.
[35] R. Genzel, R. Schodel, T. Ott, A. Eckart, T. Alexander, F. Lacombe, D. Rouan,
and B. Aschenbach. Near-infrared flares from accreting gas around the supermassive
black hole at the Galactic Centre. Nature, 425:934–937, October 2003.
[36] R. Genzel, R. Schodel, T. Ott, F. Eisenhauer, R. Hofmann, M. Lehnert, A. Eckart,
T. Alexander, A. Sternberg, R. Lenzen, Y. Clenet, F. Lacombe, D. Rouan, A. Ren-
zini, and L. E. Tacconi-Garman. The Stellar Cusp around the Supermassive Black
Hole in the Galactic Center. ApJ, 594:812–832, September 2003.
[37] S. Gillessen, F. Eisenhauer, S. Trippe, T. Alexander, R. Genzel, F. Martins, and
T. Ott. Monitoring Stellar Orbits Around the Massive Black Hole in the Galactic
Center. ApJ, 692:1075–1109, February 2009.
[38] A. Gruzinov. 1/2 Law for Non-Radiative Accretion Flow. ArXiv Astrophysics e-
prints, April 2001.
[39] A. Harten. High resolution schemes for hyperbolic conservation laws. Journal of
computational physics, 135(2):260–278, 1997.
[40] I. V. Igumenshchev and M. A. Abramowicz. Rotating accretion flows around black
holes: convection and variability. MNRAS, 303:309–320, February 1999.
[41] I. V. Igumenshchev, X. Chen, and M. A. Abramowicz. Accretion discs around
black holes: two-dimensional, advection-cooled flows. MNRAS, 278:236–250, Jan-
uary 1996.
[42] I. V. Igumenshchev and R. Narayan. Three-dimensional Magnetohydrodynamic
Simulations of Spherical Accretion. ApJ, 566:137–147, February 2002.
[43] I. V. Igumenshchev, R. Narayan, and M. A. Abramowicz. Three-dimensional Mag-
netohydrodynamic Simulations of Radiatively Inefficient Accretion Flows. ApJ,
592:1042–1059, August 2003.
[44] S. Jin and Z. Xin. The relaxation schemes for systems of conservation laws in arbitrary space dimensions. Comm. Pure Appl. Math, 48:235–277, 1995.
[45] B. M. Johnson and E. Quataert. The Effects of Thermal Conduction on Radiatively
Inefficient Accretion Flows. ApJ, 660:1273–1281, May 2007.
[46] R. Kaeppeli, S. C. Whitehouse, S. Scheidegger, U.-L. Pen, and M. Liebendoerfer. FISH: A 3D parallel MHD code for astrophysical applications. ArXiv e-prints, October 2009.
[47] V.V. Kindratenko, J.J. Enos, G. Shi, M.T. Showerman, G.W. Arnold, J.E. Stone,
J.C. Phillips, and W. Hwu. GPU clusters for high-performance computing. In Pro-
ceedings on the IEEE cluster2009 workshop on parallel programming on accelerator
clusters (PPAC09), pages 1–8, 2009.
[48] G. Kowal, A. Lazarian, E. T. Vishniac, and K. Otmianowska-Mazur. Numerical
Tests of Fast Reconnection in Weakly Stochastic Magnetic Fields. ArXiv e-prints,
March 2009.
[49] L. D. Landau and E. M. Lifshitz. Fluid mechanics. 1959.
[50] L. D. Landau and E. M. Lifshitz. Electrodynamics of continuous media. 1960.
[51] A. Lazarian and E. T. Vishniac. Reconnection in a Weakly Stochastic Field. ApJ,
517:700–718, June 1999.
[52] L. C. Lee and Z. F. Fu. Multiple X line reconnection. I - A criterion for the transition
from a single X line to a multiple X line reconnection. JGR, 91:6807–6815, June
1986.
[53] Y. Levin and A. M. Beloborodov. Stellar Disk in the Galactic Center: A Remnant
of a Dense Accretion Disk? ApJL, 590:L33–L36, June 2003.
[54] A. Loeb. Direct feeding of the black hole at the Galactic Centre with radial gas
streams from close-in stellar winds. MNRAS, 350:725–728, May 2004.
[55] J.-P. Macquart, G. C. Bower, M. C. H. Wright, D. C. Backer, and H. Falcke. The Ro-
tation Measure and 3.5 Millimeter Polarization of Sagittarius A*. ApJL, 646:L111–
L114, August 2006.
[56] D. P. Marrone, J. M. Moran, J.-H. Zhao, and R. Rao. Interferometric Measurements
of Variable 340 GHz Linear Polarization in Sagittarius A*. ApJ, 640:308–318, March
2006.
[57] D. P. Marrone, J. M. Moran, J.-H. Zhao, and R. Rao. The Submillimeter Polarization
of Sgr A*. Journal of Physics Conference Series, 54:354–362, December 2006.
[58] D. P. Marrone, J. M. Moran, J.-H. Zhao, and R. Rao. An Unambiguous Detection
of Faraday Rotation in Sagittarius A*. ApJL, 654:L57–L60, January 2007.
[59] F. Melia. An accreting black hole model for Sagittarius A. ApJL, 387:L25–L28,
March 1992.
[60] F. Melia and H. Falcke. The Supermassive Black Hole at the Galactic Center.
ARA&A, 39:309–352, 2001.
[61] K. E. Nakamura, M. Kusunose, R. Matsumoto, and S. Kato. Optically Thin,
Advection-Dominated Two-Temperature Disks. PASJ, 49:503–512, August 1997.
[62] R. Narayan, I. V. Igumenshchev, and M. A. Abramowicz. Self-similar Accretion
Flows with Convection. ApJ, 539:798–808, August 2000.
[63] R. Narayan, S. Kato, and F. Honma. Global Structure and Dynamics of Advection-
dominated Accretion Flows around Black Holes. ApJ, 476:49–+, February 1997.
[64] R. Narayan, R. Mahadevan, J. E. Grindlay, R. G. Popham, and C. Gammie.
Advection-dominated accretion model of Sagittarius A*: evidence for a black hole
at the Galactic center. ApJ, 492:554–568, January 1998.
[65] R. Narayan and I. Yi. Advection-dominated accretion: A self-similar solution. ApJL,
428:L13–L16, June 1994.
[66] R. Narayan, I. Yi, and R. Mahadevan. Explaining the spectrum of Sagittarius A* with a model of an accreting black hole. Nature, 374:623–625, April 1995.
[67] L. Nyland, M. Harris, and J. Prins. Fast n-body simulation with CUDA. GPU gems,
3:677–695, 2007.
[68] E. N. Parker. Sweet’s Mechanism for Merging Magnetic Fields in Conducting Fluids.
JGR, 62:509–520, December 1957.
[69] U.-L. Pen, P. Arras, and S. Wong. A Free, Fast, Simple, and Efficient Total Variation
Diminishing Magnetohydrodynamic Code. ApJS, 149:447–455, December 2003.
[70] U.-L. Pen, C. D. Matzner, and S. Wong. The Fate of Nonradiative Magnetized Ac-
cretion Flows: Magnetically Frustrated Convection. ApJL, 596:L207–L210, October
2003.
[71] H. E. Petschek. Magnetic Field Annihilation. NASA Special Publication, 50:425–+,
1964.
[72] R. Popham and C. F. Gammie. Advection-dominated Accretion Flows in the Kerr
Metric. II. Steady State Global Solutions. ApJ, 504:419–+, September 1998.
[73] E. Priest and T. Forbes. Magnetic Reconnection. June 2000.
[74] E. R. Priest and T. G. Forbes. Does fast magnetic reconnection exist? JGR,
97:16757–+, November 1992.
[75] D. Proga and M. C. Begelman. Accretion of Low Angular Momentum Material
onto Black Holes: Two-dimensional Magnetohydrodynamic Case. ApJ, 592:767–
781, August 2003.
[76] E. Quataert and A. Gruzinov. Constraining the Accretion Rate onto Sagittarius A*
Using Linear Polarization. ApJ, 545:842–846, December 2000.
[77] E. Quataert and A. Gruzinov. Convection-dominated Accretion Flows. ApJ,
539:809–814, August 2000.
[78] E. Quataert and R. Narayan. Spectral Models of Advection-dominated Accretion
Flows with Winds. ApJ, 520:298–315, July 1999.
[79] M. G. Revnivtsev, E. M. Churazov, S. Y. Sazonov, R. A. Sunyaev, A. A. Lutovinov,
M. R. Gilfanov, A. A. Vikhlinin, P. E. Shtykovsky, and M. N. Pavlinsky. Hard X-ray
view of the past activity of Sgr A* in a natural Compton mirror. A&A, 425:L49–L52,
October 2004.
[80] G. B. Rybicki and A. P. Lightman. Radiative processes in astrophysics. 1979.
[81] Hsi-Yu Schive, Yu-Chih Tsai, and Tzihong Chiueh. GAMER: a GPU-Accelerated
Adaptive Mesh Refinement Code for Astrophysics. Astrophys. J. Suppl., 186:457–
484, 2010.
[82] M. Scholer. Undriven magnetic reconnection in an isolated current sheet. JGR,
94:8805–8812, July 1989.
[83] A. Shan. Heterogeneous processing: a strategy for augmenting moore’s law. Linux
Journal, 2006(142):7, 2006.
[84] P. Sharma, E. Quataert, and J. M. Stone. Faraday Rotation in Global Accretion
Disk Simulations: Implications for Sgr A*. ApJ, 671:1696–1707, December 2007.
[85] P. Sharma, E. Quataert, and J. M. Stone. Spherical accretion with anisotropic
thermal conduction. MNRAS, 389:1815–1827, October 2008.
[86] R. V. Shcherbakov. Propagation Effects in Magnetized Transrelativistic Plasmas.
ApJ, 688:695–700, November 2008.
[87] R. V. Shcherbakov and F. K. Baganoff. Inflow-Outflow Model with Conduction and
Self-Consistent Feeding for Sgr A*. ArXiv e-prints, April 2010.
[88] F. Shu. Physics of Astrophysics, Vol. II: Gas Dynamics. University Science Books,
1991.
[89] J. M. Stone, J. E. Pringle, and M. C. Begelman. Hydrodynamical non-radiative
accretion flows in two dimensions. MNRAS, 310:1002–1016, December 1999.
[90] P. A. Sweet. The Neutral Point Theory of Solar Flares. In B. Lehnert, editor, Elec-
tromagnetic Phenomena in Cosmical Physics, volume 6 of IAU Symposium, pages
123–+, 1958.
[91] T. Tanaka and K. Menou. Hot Accretion with Conduction: Spontaneous Thermal
Outflows. ApJ, 649:345–360, September 2006.
[92] H. Trac and U.-L. Pen. A Primer on Eulerian Computational Fluid Dynamics for Astrophysics. PASP, 115:303–321, March 2003.
[93] H. Trac and U.-L. Pen. A moving frame algorithm for high Mach number hydrody-
namics. New Astronomy, 9:443–465, July 2004.
[94] H.C. Wong, U.H. Wong, X. Feng, and Z. Tang. Magnetohydrodynamics simulations
on graphics processing units. Imprint, 2009.
[95] F. Yuan, E. Quataert, and R. Narayan. Nonthermal Electrons in Radiatively In-
efficient Accretion Flow Models of Sagittarius A*. ApJ, 598:301–312, November
2003.
[96] F. Yusef-Zadeh, H. Bushouse, M. Wardle, and 11 coauthors. Simultaneous Multi-
Wavelength Observations of Sgr A* during 2007 April 1-11. ArXiv e-prints, July
2009.
[97] F. Yusef-Zadeh, H. Bushouse, M. Wardle, C. Heinke, D. A. Roberts, C. D. Dowell,
A. Brunthaler, M. J. Reid, C. L. Martin, D. P. Marrone, D. Porquet, N. Grosso,
K. Dodds-Eden, G. C. Bower, H. Wiesemeyer, A. Miyazaki, S. Pal, S. Gillessen,
A. Goldwurm, G. Trap, and H. Maness. Simultaneous Multi-Wavelength Observa-
tions of Sgr A* During 2007 April 1-11. ApJ, 706:348–375, November 2009.
[98] F. Yusef-Zadeh, M. Muno, M. Wardle, and D. C. Lis. The Origin of Diffuse X-Ray
and γ-Ray Emission from the Galactic Center Region: Cosmic-Ray Particles. ApJ,
656:847–869, February 2007.