CS 224W Final Project: Predicting ... - snap.stanford.edu

CS 224W Final Project: Predicting Galactic Properties from Network

Structure of Cosmological Simulations

Ryan Gao (rgao)

December 8, 2015

1 Introduction

This project focuses on an application of networkscience to the field of cosmology, one of theless well-explored cross-domain collaborations.Cosmological simulations aim to reproduce theevolution of the universe, from its early homoge-neous state 13.8 billion years ago, to the currenthighly inhomogeneous arrangement of galaxiesseparated by large voids of intergalactic space. Inrecent years, as computational power has increasedexponentially quickly, the granularity and detailof cosmological simulations has likewise improved.Since the earliest large-scale gravitational N-bodysimulations in 1985 [1], cosmologists have beentweaking their models to better simulate theobserved universe. The influential Millenniumsimulations [7] used N-body simulations coupledwith smoothed particle hydrodynamics, whichmodels the fluid behavior of gases, to better resolvesmall-scale structure.

The prevailing standard model of cosmology,the Lambda Cold Dark Matter (ΛCDM) model,describes a universe dominated by the unexplainedmechanisms of dark matter and dark energy, witha small contribution from the familiar baryonicmatter made of protons and neutrons. There are 6unknown parameters of the ΛCDM model, and bytuning simulations to observations, we can find thecosmological parameters that best fit our universe.However, not only do empirical observations informour understanding of the simulation results, so toodo the simulations inform our understanding ofwhat we see in the night sky. Simulations haveexplained how galaxies come to be arranged in a“Cosmic Web,” a series of filaments and sheets thatintersect at dense clusters and are separated bysparse voids. This was first noted in observational

data by de Lapparent et al. in 1985 [3].

This project aims to predict properties of galaxiesseen in cosmological simulations, based on thegalaxy’s network properties within the CosmicWeb. High-quality observations of galaxies aretime-consuming and expensive, especially ones athigh redshift, i.e., distant in both time and space.Thus, we can apply the patterns learned fromsimulation data to observational data, in orderto fill in missing fields or to improve on noisyobservations.

We will review the literature for cosmological ap-plications of network analysis in Section 2, describethe Illustris simulation dataset in Section 3, and ex-plain the methods and results in Section 4.

2 Literature Review

Here, we review two of the few instances in theliterature of network theory applied to cosmology.

Ueda et al. [8] compare some network character-istics derived from the eigenvalues of a simulatedadjacency matrix to the values seen in real skydata. This adjacency matrix represents the undi-rected constellation graph (k-nearest neighbornetwork for k = 1) of a 2-D projection of galaxies.This network is an example of a proximity graph,or a set of points with edges based solely on thegeometric layout of the nodes. The network edgeshave weights wst proportional to rjst, where rst isthe Euclidean distance between nodes s and t.

As described in Section 4, we will use a connectedproximity graph, rather than a constellation graph.Proximity graphs like Delaunay triangulation arealways connected, whereas k = 1 nearest neighbor

1

graphs tend to be highly disconnected, even whenthere is large-scale spatial structure. This choicewill also keep the network resilient against miss-ing or additional nodes, which happens often inobservational data when dim or obscured galaxiesgo undetected. We employ a similar system forweighting edges, with j ≤ 0, to indicate thatproximate nodes should have a stronger connectionthan distant nodes.

In a more recent and thorough study, Hong andDey [2] measure degree centrality, closeness cen-trality, and betweenness centrality on a networkderived from observational Cosmological EvolutionSurvey (COSMOS) data. Nodes are galaxies, andtwo nodes are connected by an edge if they arecloser than a linking length parameter `. Thisproximity graph requires parameter-tuning andcan produce unconnected networks, so we preferconnected proximity graphs, like the Delaunaytriangulation.

Using the three measures of centrality, Hongand Dey classify nodes into 8 categories, whichcorrespond to different topological features ofthe Cosmic Web. Degree centrality is used toassign the categories ‘cluster,’ ‘wall,’ and ‘void,’ asit’s a proxy for local node density. Betweennesscentrality assigns the labels ‘main branch’ and‘dangling leaf,’ which distinguishes between galax-ies forming the main filaments and galaxies at theperiphery. Closeness centrality is used to assignlabels ‘fracture’ (non-giant component nodes),‘backbone’ (majority of giant component nodes),and ‘kernel’ (highest closeness giant componentnodes).

Hong and Dey then look for differences in the dis-tributions of galaxy properties like color, star for-mation rate, and stellar mass between the 8 galaxytypes. They compare this network-based predictoragainst a baseline predictor that uses only Voronoitessellation density, a measure of local node den-sity. Their result was that network centrality worksbetter for older, quiescent, “red” galaxies, while thebaseline Voronoi cell density is a better predictor formore active, star-forming, “blue” galaxies. We willtake a similar approach for this project, however, ona simulated dataset with other kinds of proximitygraphs. We borrow their evaluation methodology of

comparing new predictors to the baseline Voronoipredictor, using the Kolmogorov-Smirnov statistic(described in section 4.3).

3 Data

Figure 1: Illustris simulation at current epoch z =0, showing dark matter density (in blue) overlaidwith the gas velocity field (in orange).

The Illustris cosmological simulations [9] model afull suite of physical phenomena that can affect theevolution of galaxies, including gravity, hydrody-namics, gaseous chemistry, magnetic fields, stellarformation, galactic winds, and black hole-drivenheating. There were six runs of Illustris, threewith only dark matter under the force of gravity,and three with full hydrodynamical modeling, atvarying levels of granularity. Due to computationalcomplexity, we will use data from the coarsestgranularity simulations, Illustris-3-Dark (only darkmatter) and Illustris-3 (full hydrodynamic). For asense of the model complexity, Illustris-3 trackedmore than 282 million particles in a volume of(106.5 Mpc)3, equivalent to a cube with side length350 million light-years.

In early 2015, the Illustris team publicly released all265 TB of Illustris simulation data [5]. An Illustris-3 snapshot, or all data at a given time, occupiesover 20 GB, which is prohibitively large. Instead,

2

we use group catalog data, which has aggregatedthe roughly 282 million particles into a more man-ageable 112, 000 clusters. These clusters, called sub-haloes, correspond roughly to galaxies. Clusteringis done via the subfind algorithm [6], which iden-tifies locally overdense, gravitationally self-boundgroups of particles. The group catalog has com-puted properties for each subhalo, such as mass,velocity (mass-weighted average of all constituentparticle velocities), luminosity (magnitude of stel-lar light, measured at eight wavelengths), black holemass, star formation rate (rate at which stars arenewly formed), and metallicity (the proportion ofmatter consisting of elements heavier than helium).

4 Methods and Results

4.1 Data Cleaning

The process by which these subhaloes form, gravi-tational agglomeration, is a rich-get-richer process.The more massive a subhalo, the higher the rate ofaccretion, i.e., the more smaller subhaloes it sub-sumes. Thus, it’s unsurprising that subhalo massesfollow a power law distribution, as shown in greenin figure 2. However, the power law fails at the lowend, as there are too few subhaloes with mass belowabout 1010M� (solar masses). This is likely due topeculiarities in the subhalo clustering algorithm, sowe clean the data by removing all subhaloes withmass less than 2 · 1010M�, shown in blue in figure2.

Figure 2: Distribution of subhalo masses. For com-parison, the mass of the Milky Way is estimated tobe 8.5 · 1011M�.

In order to better visualize the data, we work with1/10th slices of Illustris-3-Dark data. Suppose thesimulation coordinates are [0, L]×[0, L]×[0, L]; thenall points with iL/10 ≤ z < (i + 1)L/10 wouldcomprise the i-th slice. To simplify the analysis, weproject points into 2D by ignoring the z-coordinate.

4.2 Predicting Subhalo Mass

One of the most fundamental characteristics of asubhalo is its mass. Traditionally, galaxy mass mea-surements are computed from the orbital speed ofstars around the galaxy center, which is inferredfrom Doppler shifts in the galaxy’s spectrum. Ourgoal is to provide a way to differentiate galaxymasses based on only the most fundamental charac-teristic, the galaxy position. We encode the galaxypositions in proximity graphs; some examples werediscussed in section 2. For this project, we considerthe Delaunay triangulation in section 4.4 and theGabriel graph in section 4.7.

4.3 Voronoi Tessellation Density Base-line

Given n nodes {xi}, the Voronoi tessellation isa unique partitioning of the plane into n regions{Ri}, such that all points in Ri are closer to nodexi than to any other node. The volume of theVoronoi cell associated with a node provides anestimate for the local node density. Larger Voronoicells mean a more sparse, void-like region. Just asHong and Dey did, we will take Voronoi cell densityto be the baseline, purely geometric, non-networkpredictor of subhalo properties.

Applied to the Illustris data, Voronoi cell areaseems to do a reasonable job at differentiatingthe clusters (high density regions, in blue), thefilaments (intermediate density regions, in red),and the void (low density regions, in green), asshown in Figure 3.

We use the 2-sample Kolmogorov-Smirnov (K-S)statistical test to evaluate the degree to whicha variable x is good for distinguishing subhalomasses. The two samples are masses of subhaloesin the lowest quartile of x and subhalo masses inthe highest quartile of x. The K-S test computesthe maximum difference between the cumulativedistribution functions (CDFs) of the two samples.

3

Figure 3: Nodes colored by Voronoi cell volume.Green represents large Voronoi cells and blue rep-resents small Voronoi cells.

Figure 4: Distribution of subhalo mass, by Voronoicell density

Higher K-S statistic means that the distributionsare more different, or specifically, x is a betterdistinguishing variable for subhalo mass.

As shown in Figure 4, the mass distribution oflowest quartile Voronoi cell volume subhaloes (inblue) - those in locally dense regions - is centeredhigher than that of high Voronoi cell volumesubhaloes (in red) - those in locally sparse regions.This is consistent with the rich-get-richer feedbackof gravitational agglomeration. High-mass sub-haloes tend to gravitationally bind more smallersubhaloes, and eventually, merge with them togrow even larger. Thus, we should expect to seehigh-mass subhaloes in denser regions, like clusters

and filaments. The test statistic of the 2-sampleK-S test is 0.1249, which translates to a p-value of5.2 · 10−10. This means that the difference betweenmass distributions in high and low density regionsis statistically significant.

4.4 Delaunay Triangulation

The Delaunay triangulation is a unique partition-ing of the entire space into triangles with the nodesas vertices, such that no node lies within the cir-cumcircle of any triangle. We use existing Pythonlibraries, but implement a modification that allowsspace to have toroidal topology, i.e., the bottomand top edges are associated and the left and rightedges are associated. This matches the topology ofspace in the Illustris simulation. If a galaxy clustercontains subhaloes near x = 0, it will also likely con-tain subhaloes near x = 75000 (side length of theIllustris simulation cube). Our modified Delaunaytriangulation allows edges between those groups ofsubhaloes, while the standard implementation doesnot, as they’re considered very distant.

Figure 5: Delaunay triangulation of 2D slice ofIllustris-3-Dark run.

4.5 Degree Centrality

We take the edges of the Delaunay triangulation tobe edges in an undirected network. We also assign

4

edge weights wst proportional to rjst, where rst isthe Euclidean distance between nodes s and t.

Figure 6: Unweighted degree distribution (j = 0)of Delaunay triangulation, for each of 10 slices.

Figure 7: Weighted degree distribution (j = −1) ofDelaunay triangulation, for each of 10 slices.

In figures 6 and 7, we compare weighted (j = −1)and unweighted (j = 0) degree distributions ofthe Delaunay triangulation of Illustris subhaloeswith that of an homogenous Poisson process of thesame size, i.e. equal number of uniformly randomlydistributed subhaloes. It’s clear that for Illustrisdata, the degree distributions (both weighted andunweighted) are wider and heavier-tailed than forthe Poisson process. This is exactly the highlyskewed degree distribution that arises from arich-get-richer, preferential attachment model.There are large “hub” subhaloes, which tend to bein dense regions and close to other nodes, which

Figure 8: Distribution of subhalo mass, by un-weighted degree.

Figure 9: Distribution of subhalo mass, by weighteddegree.

leads to higher weighted degree.

From figure 8, we get no statistically significantdifferentiation of subhalo mass when categorizingby unweighted degree: K-S statistic 0.0243, p-value0.80. However, as shown in figure 9, we getgreat differentiation by j = −1 weighted degree:K-S statistic 0.131, p-value 5.1 · 10−11. In fact,this is better differentiation than the baseline ofcategorizing by Voronoi cell volume.

We perform the same analysis from j = −9 (heav-ily discount long edges) to j = 0 (all edges weightedequally), and across all 10 slices of the simulationcube. We compare the K-S statistic for subhalomass distribution using weighted degree centrality,versus using the baseline Voronoi cell volume. The

5

difference in K-S statistics is plotted in figure 10,with higher values indicating that degree central-ity is better at distinguishing subhalo mass thanVoronoi volume. The advantage peaks near j = −2,but is fairly stable for j < −1.

Figure 10: K-S test statistic advantage for distin-guishing subhalo mass; weighted degree centralitycompared to baseline Voronoi cell density. The av-erage advantage over all 10 slices is shown in red.

4.6 Predicting Subhalo Luminosity: asecond-order effect

Many of the other subhalo properties one mightwant to predict, such as luminosity, black holemass, star formation rate, metallicity, and max-imum radial velocity (speed of stars orbitinggalactic center), are strongly correlated with mass.We’ve shown that degree centrality is a goodpredictor of subhalo mass, so it will also likely be agood predictor of these other properties. However,we now wish to investigate a second-order effect:how well does degree centrality distinguish residualluminosity (residual compared to other galaxies ofthe same mass)?

First, we filter out all subhaloes without luminos-ity estimates, and all subhaloes with mass lessthan 1011M� (luminosity estimates are often un-reliable at low masses). In figure 11, we see theexpected strong correlation between luminosity andmass - more massive galaxies are brighter. The redcurve is computed using locally weighted scatterplotsmoothing (LOWESS), and we take the residual lu-minosity with respect to that LOWESS estimate.

Figure 11: Luminosity, as a function of subhalomass. Note that luminosity units (mag) are on alogarithmic scale, and lower values indicate higherluminosity.

Figure 12: Residual luminosity in a slice of Illustris-3 data. Nodes are larger if the residual is furtherfrom zero.

In figure 12, we see that subhaloes with nega-tive residual luminosity (brighter than expectedfor its mass) tend to lie in higher density clus-ters, while dimmer-than-expected subhaloes tendto lie in lower density filaments and voids. Thiscan be explained physically, because high densityclusters tend to have higher densities of coalescinggas, which drives star formation. Star-forming re-gions contain many hot, bright stars, which burnout quickly, while quiescent regions contain mostlyold, cool, dim stars. Figure 13 shows that even forthis kind of second-order property, degree centralityoutperforms the baseline Voronoi predictor.

6

Figure 13: K-S test statistic advantage for distin-guishing residual luminosity; weighted degree cen-trality compared to baseline Voronoi cell density.

Figure 14: K-S test statistic advantage for distin-guishing residual luminosity; weighted degree cen-trality compared to subhalo mass.

One might posit that the heteroskedasticity evi-dent in figure 11 might result in high K-S statis-tic. This might be due to differences in distribu-tion variance, rather than differences in distribu-tion mean, which are much more useful when try-ing to distinguish two populations. However, figure14 refutes this claim, by comparing the distinguish-ing power of degree centrality compared to subhalomass. Categorizing by degree centrality results ina much higher K-S statistic than categorizing bymass, so the heteroskedasticity contributes negligi-bly to the K-S statistic.

Figure 15: Gabriel graph of 2D slice of Illustris-3-Dark run.

4.7 Gabriel Graph

The Gabriel graph is a subgraph of the Delaunaytriangulation. It is also a proximity graph, butwith the rule that an edge between node S andnode T exists iff the closed disk with diameterST intersects no other nodes. We implementedin Python a linear time construction, proposedby Matula & Sokul [4], that takes the Delaunaytriangulation as input.

Comparing figures 5 and 15, we see that theGabriel graph is considerably more sparse. Forplanar uniformly distributed points, the averagedegree of the Delaunay triangulation is 6, whilethe average Gabriel graph degree is 4. However,it’s clear visually that mostly long edges have beenremoved, so the important filamentary structure isstill expressed.

Shown in figure 16, weighted degree centrality onthe Gabriel graph shows slightly better K-S advan-tage than on the Delaunay triangulation. However,this difference is not statistically significant. This isexpected, as the removal of long edges has little ef-fect on weighted degree centrality for small j << 0.Thus, there is little difference between using theDelaunay triangulation and the Gabriel graph.

7

Figure 16: K-S test statistic advantage, weighteddegree centrality on Gabriel graph compared tobaseline Voronoi cell density. The average advan-tage over all 10 slices is shown in red.

5 Conclusion

We have shown that, for the task of distinguishingsubhaloes with different masses, a spatial network-based subhalo property, like weighted degree cen-trality on the Delaunay triangulation, will outper-form the baseline non-network-based Voronoi cellvolume property. We have shown that degree cen-trality distinguishes both first-order properties likemass, as well as second-order properties like resid-ual luminosity. This work hopefully demonstratesthat there is room to explore the sparse juncturebetween network analysis and cosmology.

References

[1] Marc Davis et al., “The Evolution of Large-Scale Structure in a Universe Dominated byCold Dark Matter”. The Astrophysical Journal292 (May 1985), 371-394.

[2] Sungryong Hong and Arjun Dey, “Networkanalysis of cosmic structures: network central-ity and topological environment”. Monthly No-tices of the Royal Astronomical Society 450(June 2015), 1999-2015.

[3] Valerie de Lapparent, Margaret J. Geller, JohnP. Huchra, “A Slice of the Universe”. The As-trophysical Journal 302 (March 1986), L1-L5.

[4] David W. Matula and Robert R. Sokal, “Prop-erties of Gabriel Graphs Relevant to Geo-

graphic Variation Research and the Clusteringof Points in the Plane”. Geographical Analysis12 (July 1980), 205-222.

[5] Dylan Nelson et al., “The Illustris Simulation:Public Data Release”. Astronomy and Com-puting 13 (November 2015), 12-37.

[6] Volker Springel et al., “Populating a cluster ofgalaxies - I. Results at z = 0”. Monthly Noticesof the Royal Astronomical Society 328 (2001),726-750.

[7] Volker Springel et al., “Simulations of the for-mation, evolution and clustering of galaxiesand quasars”. Nature 435 (June 2005), 629-636.

[8] H. Ueda, T. T. Takeuchi, and M. Itoh, “Agraph-theoretical approach for comparison ofobservational galaxy distributions with cosmo-logical N-body simulations”. Astronomy & As-trophysics 399 (2003), 1-7.

[9] Mark Vogelsberger et al., “Properties of galax-ies reproduced by a hydrodynamic simulation”.Nature 509 (May 2014), 177-182.

8

Documents

CS 224W Final Project: Predicting ... - snap.stanford.edu