
Hogg & Fergus / CDI-Type I: A unified probabilistic model of astronomical imaging

1 Introduction

The hierarchical structure of astrophysics: All astrophysics knowledge comes, ultimately, from photons collected in imaging cameras (and spectrographs, which are imaging cameras behind dispersive optics). That is, we have been able to discover a remarkable amount about the Universe (its mass density, age, and contents, and the mechanisms by which stars shine and black holes accrete, and so on) simply by performing theory-informed analyses of those images.

Fortunately, this knowledge is also extremely hierarchical. That is, the formation of images in an astronomical camera (really a camera+telescope setup) depends only on a small number of aspects of the system, including properties of the hardware and atmosphere and properties of the photon spectra emitted by the relatively small number of stars that happen to be in the field of view. In turn, the properties of the photon spectra depend on the physical properties of stars, and the physical properties of stars depend on the contents and formation of structure in the Universe. This hierarchy is depicted in Fig. 1.

In this proposal, we argue that the hierarchical structure of astrophysical image formation makes full probabilistic modeling of enormous collections of astronomical imaging possible. But first we ought to say why that is such a good idea. How would it lead to new capabilities?

Present-day astronomical data analysis doesn’t propagate uncertainty: In standard astronomical practice, astronomers think of their “data” as being measurements made on the images, and they think of measurements as being point estimates of star and galaxy properties (like positions, brightnesses, or shapes). Bayesian inference is used in modern astrophysics, but only to model the measurements—which are themselves point estimates made from the images—not the images themselves. In many cases, the inference is applied not to the measurements, but to, say, second-order statistics (variances) measured (again, with point estimates) from those measurements.

In the above and what follows, a “point estimate” is any hard estimator—any arithmetic operation on the data that gives a single number as its output—for any quantity of interest. A good example of a common point estimate in astronomy is what’s used for the brightness of a star: Usually astronomers just use either the total counts above background near the star’s position, or else the amplitude of a point-spread function fit. Some point estimators are maximum-likelihood estimators, some are maximum-a-posteriori estimators, and some have no probabilistic justification. But in astronomical data analysis they abound.
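To make the two brightness point estimates just mentioned concrete, here is a minimal sketch (the array and function names are ours, and the PSF model is assumed to be known and centered on the star); note that neither estimator carries any uncertainty forward.

```python
import numpy as np

def aperture_counts(image, x0, y0, radius, background):
    """Sum of background-subtracted counts within a circular aperture."""
    ny, nx = image.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    mask = (xx - x0) ** 2 + (yy - y0) ** 2 <= radius ** 2
    return np.sum(image[mask] - background)

def psf_fit_amplitude(image, psf_model, background):
    """Amplitude of a known, centered PSF model fit by linear least squares."""
    data = (image - background).ravel()
    design = psf_model.ravel()
    return np.dot(design, data) / np.dot(design, design)
```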

As a case in point: Perhaps the most remarkable quantitative success in the history of astrophysics is the determination of the age, density, and initial conditions of the Universe and the confirmation of the simple but remarkable ΛCDM standard cosmological model. This model has been fit (using Bayesian methods) to point estimates of fluctuations in the cosmic microwave background [24] (which are reasonable, since the fluctuations and the noise are both close to Gaussian), to point estimates of the clustering of galaxies [46], and to point estimates of the distribution of massive galaxy-cluster masses [50]. There are many small anomalies in these fits and in the results, but we think it is fair to say that the astrophysics community does not really understand yet how to push the Bayesian inference “down” a layer closer to the pixel level, or why that would matter.

Figure 1 (graphical model; node groups color-coded as Telescope / Atmosphere / Detector / Star / Galaxy): Our unified graphical model (also known as a Bayesian network [33]) for astronomical image data. It integrates in a principled framework: large-scale cosmological models of galaxy and Milky Way formation; galaxy appearance models; spectral emission models; and detailed camera, sky, and telescope models. The shaded oval nodes are observed variables (i.e., their values are known) while the unshaded ones are unobserved and hence will be inferred from the raw astronomical data. The square nodes represent priors, typically informed by well-understood physics models. The arrows represent dependencies between variables in the model (and the lack thereof correspond to assumptions of independence). The conditional probability distributions within the model (which detail how a particular node depends on those variables which point to it) are not shown, but will be outlined in the text. The rectangles refer to replications of variables, e.g. an image will contain many stars/galaxies. The realization of this model is the ultimate goal of the project, but initial work will focus on sub-pieces of the model. This figure is best viewed in color.

There are many indications in contemporary astrophysics that conventional point-estimate-based data analysis is not reaching the information limit provided by the photon noise in the raw data:

• Very faint sources: Current and next-generation surveys derive their sensitivity from repeat observations. Astronomers, in current practice, co-add these repeat observations to detect the faint sources. Co-addition is information-preserving only under extremely strong assumptions about the noise model, about data defects (or their identifiability), and about source variability. In particular, the most interesting sources are those that vary or move during the survey; there is no way in the co-added data to measure their time-variable properties. The PI has demonstrated that a pixel-level modeling approach to these problems works where co-addition can’t, and in the process has discovered new, faint brown-dwarf (non-hydrogen-burning) stars [25].

• Star–galaxy separation: Because the Milky Way is finite in extent, at faint magnitudes in astronomical imaging, external galaxies enormously outnumber stars. Furthermore, very distant galaxies are only barely resolved in the ground-based imaging of interest for this proposal. Morphological (resolution-based) star–galaxy separation (the only technique used in large, contemporary surveys) breaks down at these faint magnitudes; it becomes impossible to use the faint stars—you just can’t find them among the far more numerous galaxies. In unpublished work by the PI, it has become apparent that simultaneous modeling of multi-band data, plus informative priors about the different spectra of stars and galaxies, learned from the data hierarchically, can overcome these challenging problems for faint-star science.

• Time-domain astrophysics: The bulk of future imaging surveys—and especially the huge LSST Survey—have a time-domain component, or are entirely built around time-domain science. One example is the set of important scientific results now coming from measurements of stellar proper motions at the far-sub-pixel level. The most convincing work in this area has involved detailed modeling of the data at the pixel level [2]. What isn’t known is how to make these (very slow) methods applicable to large surveys or to make these (very hand-crafted) methods run automatically on arbitrary data.

• Weak lensing: One of the key science goals of the next generation of imaging experiments is to measure weak lensing, or the tiny (few-percent-level) distortions of galaxies from gravitational lensing by foreground structure. The observed galaxy shapes are determined in part by latent (unobserved) foreground mass structures that we wish to infer. The issue is that the distortions are so small that it is easy to measure them with substantial bias or variance relative to what is theoretically possible given photon statistics. This community is coming around to the view that the data must be modeled directly [5], but is still thinking in terms of point estimates (like maximum-likelihood estimators), which won’t transmit the full information from the images to the distortions of interest.

• Ultra large-scale structure: In 2005, the PI was involved in the discovery of the baryon acoustic feature in the second-order statistics (clustering) of massive galaxies [12]. This discovery and analysis was based on point estimates of the clustering of galaxy positions, which are themselves the result of point estimates in the imaging; in 2010, a new analysis with a superset of the data—more than twice as much data—found the feature with lower significance [23], despite using essentially the same estimators. Large-scale structure on these scales is hard to measure and hard to model, but it is possible that this significance anomaly comes from the crudeness of the estimation.

In some of these areas, there have been some attempts to get closer to precise data modeling. However, there is no pipeline-level or comprehensive modeling planned for existing or future large data sets, at least none that propagates uncertainty from the image-pixel noise-model level up to the quantities of astrophysical interest.

Machine learning needs a non-toy sandbox: Building accurate generative models of real-world image data is challenging. Correspondingly, machine learning practitioners often use fairly constrained types of image (e.g., isolated hand-written digits from the MNIST database) when demonstrating end-to-end (i.e., class-to-pixels) generative models. With more complex data, such as real-world images containing natural scenes, current generative models fall into two groups. One group provides a generative model of low-level structures (e.g., edges) but lacks a high-level model of the scene [32, 37]. The second group accurately models high-level structure but does not directly model the pixels themselves, instead using hand-constructed local feature descriptors extracted from the image [41, 45]. Thus the models are generative in the descriptors but not the pixels. Ideally, the models would capture both high- and low-level structures, but creating a unified model for natural images is very much an open problem in computer vision and machine learning [38, 54].

We intend to bring generative image models developed by computer vision and machine learning researchers to astronomical images. This domain represents a “sweet spot”: the data are simple enough to be modeled in a fully generative fashion, yet have real scientific relevance. Developing a successful implementation of a complex generative model for astronomical data provides a springboard for tackling the much harder problem of modeling natural scenes. Many techniques we aim to develop, such as large-scale inference, accurate appearance models, and their fusion with low-level image-formation models, have direct relevance to image recognition, a core problem in computer vision and machine learning.

Astronomy tutorial for computer scientists: Astronomical images (for our purposes) are taken by CCD cameras in the focal plane of a telescope, which is an optical system always focused at infinity. In these images, stars are unresolved delta-function point sources; they have finite angular size in the images because there is a finite point-spread function (PSF) caused by the finiteness of the telescope aperture and atmospheric turbulence. Galaxies are resolved as fuzzy blobs; internally they contain billions of stars, but it is not usually useful to think of them purely as a mixture of stars, both because there are so many stars, and because there are other emitting components (gas, dust, black holes, and so on). There are many galaxies, but we live in the Galaxy, called the Milky Way. In addition to stars and galaxies there are various other sources, described in Section 3.

Astronomical brightnesses are discussed in magnitudes, which are on a negative logarithmic scale. Because distances are hard to estimate, we think of astronomical sources as living on the celestial sphere of pure angles. The conversion of detector counts in the CCD to real intensity units for the incident intensity field is called “photometric calibration”. The conversion of position in the detector to position on the celestial sphere is called “astrometric calibration”. The apparent brightness of any source is related to its intrinsic luminosity and the distance to the object. A similar thing can be said about the angular size of any galaxy (but because of special relativity and the expansion of the Universe, these are governed by different distances [18]). These distances, along with the age and formation history of the Universe, are governed by cosmology. The current standard cosmological model, “ΛCDM”, is very successful. It is a model of means and variances; the Universe’s initial conditions are very close to a true Gaussian Process [34].
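As a concrete illustration of photometric calibration and the magnitude scale, a minimal sketch (the function names are ours; the zeropoint encodes the sensitivity of the camera+telescope+atmosphere for a given image):

```python
import numpy as np

def counts_to_magnitude(counts, zeropoint):
    """Photometric calibration: convert detector counts to a magnitude.
    Magnitudes are a negative logarithmic scale, so brighter sources have
    smaller magnitudes."""
    return zeropoint - 2.5 * np.log10(counts)

def magnitude_to_counts(mag, zeropoint):
    """Inverse mapping: expected detector counts for a given magnitude."""
    return 10.0 ** (-0.4 * (mag - zeropoint))
```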

In addition we may use the terms SED (spectral energy distribution), which is the spectrum of photons (in intensity or luminosity or brightness units) emitted by a source, and WCS (world coordinate system), which is the astrometric mapping from detector pixels to universally agreed-upon celestial coordinates RA and Dec.

Inference tutorial for astronomers: Complex real-world data are often hard to model with parametric representations, where the functional forms of the probability distributions must be precisely specified (e.g., the number of stars in an image or the shape of a galaxy). By contrast, non-parametric models avoid such assumptions by allowing the complexity of the model to grow as more data are observed. Counter-intuitively, “non-parametric” does not mean that the model lacks parameters—in fact, in a Bayesian context the model has a potentially infinite number of parameters. However, the complexity of the model is constrained by prior distributions, so that the number of parameters is finite in practice. Crucially, the complexity depends on the quantity of data: small datasets will only produce simple predictions, while large ones will support more sophisticated ones.

Real-world observations typically result from many causal variables, which are hard to infer. In astronomy, the low signal-to-noise of the observations means the underlying variables (causes) will typically have some uncertainty attached to them. However, to make analysis tractable, current approaches collapse these to point estimates—a potentially dangerous operation. By contrast, we propose to use fully Bayesian methods that propagate uncertainty in a principled manner, through all stages of the model shown in Fig. 1, from pixel noise up to large-scale galaxy structure. This will be performed using sophisticated probabilistic sampling techniques, elaborated on in Section 5. Such a model can be used to perform science. For example, if one wishes to know the angular distribution of galaxies, this would involve integrating out all other unobserved (unshaded) nodes and their associated distributions in Fig. 1, given all astronomical observations (shaded round nodes) and priors (square nodes). This can be shown [22] to produce the “true” estimate of the quantity of interest, subject to certain basic axioms. The priors in the model (square nodes in Fig. 1) provide us with a mechanism for incorporating our physical understanding into the framework. Where such knowledge is not available, maximally uninformative distributions will be used instead, so that they do not induce any untoward biases.

Ultimately, these methods allow us to combine uncertainties in a principled way, unlike the point estimates currently used. For example, given multiple images of a weak source, existing methods would compute point estimates and average them. In contrast, our approach would combine them, taking into account the uncertainty in each individual observation.
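A toy numerical sketch of this difference, with made-up per-image flux estimates and variances: multiplying per-image Gaussian likelihoods yields an inverse-variance-weighted mean together with a combined uncertainty, whereas an unweighted average of point estimates retains neither.

```python
import numpy as np

# Hypothetical per-image flux estimates and their (heteroscedastic) variances.
flux_hat = np.array([2.1, 0.4, 3.0, 1.2])   # point estimates from four images
var = np.array([1.0, 4.0, 0.25, 1.0])       # per-image noise variances

# Current practice: unweighted average of point estimates, no uncertainty kept.
naive = flux_hat.mean()

# Probabilistic combination: product of Gaussian likelihoods gives an
# inverse-variance-weighted mean *and* a posterior variance that can be
# propagated to the next layer of the model.
w = 1.0 / var
post_mean = np.sum(w * flux_hat) / np.sum(w)
post_var = 1.0 / np.sum(w)
```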

This proposal: Our program is to combine what the machine learning world knows about inference using hierarchical, non-parametric Bayesian models with what astronomers have as their core problems and core data, to make transformative contributions to both fields. We will be exercising high-end machine learning methods on a real (that is, not a toy) problem and building an end-to-end model that has enormous scientific importance. We will improve the precision and capability of existing and future astrophysics experiments, especially the existing SDSS Project [52, 1] and the upcoming LSST Project [21].

2 Bottom layer: Astronomical data

The data we use: We will restrict our attention to the following (enormous) data collections: We will use the Sloan Digital Sky Survey (SDSS) ground-based, five-band, visible, quarter-sky imaging data, including the multi-epoch Stripe-82 subset [1]. We will use the NASA GALEX space-based, ultraviolet, all-sky imaging data [28]. We will also use the NASA 2MASS ground-based, three-band, infrared, all-sky imaging data [42]. These data sets, combined, make up about 10^12 pixels (30 TB), and comprise more than half of all the imaging data available at the present day. The PI has all of these data sets spinning at NYU, and is very familiar with all of them, especially the SDSS, which he was involved in calibrating and exploiting. For all of these data sets we have good noise models and good PSF models, and astrometric registration to the world coordinate system is excellent. That is, we have a good basis for all the components in the “Image” plate in Fig. 1.

In addition, we plan to use public astronomical imaging data contributed by amateur users to the Astrometry.net and Open Source Sky Survey (OSSS) projects [26]. These projects—run by the PI—gather and calibrate data from amateur, hobbyist, and educational telescopes for scientific use. These projects have begun asking contributors to license their data with open Creative Commons licenses to make their data re-usable, re-analyzable, and re-publishable in other formats and forums. Right now these projects have more than a thousand total contributing users. The open licensing requests and discussion are only just starting, so it is hard to estimate how much data this will be by the end of the project, but it could be enormous, given that scientific use of hobbyist data typically increases participation and interest. All these amateur data come to us in formats we understand and (with Astrometry.net) calibrated well enough for our purposes. The inclusion of citizen-science data inputs into the model constitutes one of this proposal’s broader impacts.

Noise model: Essentially all detectors in astronomical instruments are CCDs, which have very well-understood noise characteristics. These consist of: read noise (Gaussian), photon noise (Poisson), and light pollution from the night sky (Gaussian). The per-pixel variances of the first two sources are usually known, having been determined through calibration of the detector. The sky noise varies from image to image, and thus will be inferred by considering non-object parts of the image; this inference may be helped by knowledge of the ambient air temperature (for IR wavelengths).
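A minimal sketch of the per-pixel noise model just described (the function names are ours; the Poisson terms are approximated as Gaussian with variance equal to the expected counts, a standard approximation at all but the lowest count levels):

```python
import numpy as np

def pixel_variance(model_electrons, sky_electrons, read_noise_electrons):
    """Per-pixel variance (electrons^2): Gaussian read noise plus Poisson
    photon noise from the source and the sky, the latter two approximated
    as Gaussian with variance equal to the expected counts."""
    return read_noise_electrons ** 2 + model_electrons + sky_electrons

def image_log_likelihood(data, model, sky, read_noise):
    """Gaussian-approximation log-likelihood of an image given a pixel model."""
    var = pixel_variance(model, sky, read_noise)
    resid = data - (model + sky)
    return -0.5 * np.sum(resid ** 2 / var + np.log(2.0 * np.pi * var))
```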

Cosmic rays (CRs) produce distinct artifacts in images, in the form of small lines of saturated pixels (with charge bleeding into surrounding ones). These can easily be modeled as a generative process consisting of: a prior over their frequency (which may depend on the altitude of the telescope); a line-drawing model (latent variables: centroid, length, orientation); and a bleeding model (a Gaussian basis function around the line, of fixed or inferred variance). The current approach simply masks out the saturated pixels and then interpolates over them when fitting sources, which is certainly flawed.

Camera & telescope calibration: In order to locate the stars and galaxies consistently from image to image, that is, to build a consistent model of multiple images of the sky, we require a mapping from pixel space (where the data exist) to sky coordinates (where the stars and galaxies live, at least at the lowest level of modeling). This mapping—usually encoded in the astrophysics WCS standard—depends primarily on the telescope location, orientation, field-of-view and its optical design, which we consider to be observed variables. For amateur data, these parameters can be estimated using the PI’s Astrometry.net software.
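For readers unfamiliar with the WCS standard, the mapping is routinely exercised like this (a sketch using the astropy library and a hypothetical calibrated file; Astrometry.net emits exactly this kind of WCS header, but we are not committing to a particular library for the pipeline):

```python
from astropy.io import fits
from astropy.wcs import WCS

# Hypothetical file with a WCS solution in its primary header.
with fits.open("calibrated_image.fits") as hdul:
    wcs = WCS(hdul[0].header)

# Map detector pixel coordinates to RA/Dec on the celestial sphere, and back.
sky = wcs.pixel_to_world(250.0, 320.0)   # -> SkyCoord (RA, Dec)
x, y = wcs.world_to_pixel(sky)           # inverse mapping
```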

Secondary effects include focus changes, temperature changes, and gravitational loading changes as the telescope tracks. These effects can be inferred with the help of any additional meta-data about the camera/telescope configuration (e.g. pointing angle) and priors distilled from image collections taken with a specific camera/telescope setup.

The photometric sensitivity (calibration or zeropoint) of the image is a function of the camera+telescope setup and atmospheric transparency, which is variable over a wide range.

Point-spread function: The finite aperture of the telescope, plus atmospheric distortion, leads to a finite point-spread function (PSF). There is a great deal of work in astrophysics on modeling PSFs with Hermite polynomials, mixtures of Gaussians, Moffat profiles, and various other bases. For some data the PSF is known, but for the remainder we intend to infer the PSFs directly from the data. Recent work in computer vision by the co-PI and others has introduced reliable blind image deconvolution methods for natural images corrupted by camera shake [13] or atmospheric distortion [17]. These can be used to automatically infer spatially varying PSFs for each image, without employing strong constraints on the shape of the PSF. Given sufficient images we can also build a prior on the PSFs themselves, which will improve the robustness of the PSF estimation.
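As one concrete and commonly used basis, a mixture-of-Gaussians PSF is cheap to evaluate and to render into pixels; a minimal sketch with illustrative numbers and circular components (the function name and parameters are ours):

```python
import numpy as np

def mog_psf(dx, dy, weights, sigmas):
    """Evaluate a circular mixture-of-Gaussians PSF at offsets (dx, dy) from
    the source centre.  weights should sum to one; sigmas are in pixels."""
    r2 = dx ** 2 + dy ** 2
    value = np.zeros_like(r2, dtype=float)
    for w, s in zip(weights, sigmas):
        value += w * np.exp(-0.5 * r2 / s ** 2) / (2.0 * np.pi * s ** 2)
    return value

# Render a star of given total flux at a sub-pixel position into a 64x64 grid.
yy, xx = np.mgrid[0:64, 0:64]
star = 1000.0 * mog_psf(xx - 31.2, yy - 30.7, weights=[0.7, 0.3], sigmas=[1.2, 3.0])
```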

3 Middle layer: Stars and galaxies

Stars: By far the most prominent features in most optical astronomical images are the stars, which are—for all imaging of interest here—unresolved point sources. They are therefore trivial to model at the pixel level in an individual image, once there is a point-spread function. Multi-image modeling must take into account the SEDs of stars, their (small) motions on the sky with time, and their modes of variability.

An enormous amount is known about stars at the present day, including wavelength-for-wavelength models of their emission, with thousands of atomic and molecular absorption lines quantitatively explained and minute departures from blackbody emission understood. The approach of astronomers to modeling the SEDs of stars has traditionally fallen into one of two camps. In one, the models are assumed to be absolutely correct, and every star is slammed onto an exact and perfect model. In the other, every star is permitted an absolutely arbitrary SED and variability, and each image and band is analyzed completely independently. (The SDSS pipelines, for example, take the second approach.)

The approach we (would like to) take here is intermediate. We would like to give the stellar SEDs great freedom but then have the hierarchical machinery “discover” the regularities we expect it to find; it should discover that stellar spectral energy distributions and time variability are not arbitrarily diverse, but obey regularities. In a previous project on stellar and quasar SEDs [6], we encoded SED space as a set of broad-band intensity ratios and then performed mixture-of-Gaussians modeling of the density in the flux-ratio space. This was very successful; here we would upgrade the fitting from maximum likelihood to full probabilistic inference, and from fixed complexity to variable (as discussed in Section 5).
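A minimal sketch of the flux-ratio density model just described, with placeholder synthetic fluxes standing in for real five-band photometry; the fit here is maximum-likelihood with fixed complexity, which is exactly what the proposal would upgrade to full inference with variable complexity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
fluxes = rng.lognormal(mean=0.0, sigma=0.5, size=(1000, 5))  # placeholder 5-band fluxes

def flux_ratios(fluxes):
    """Log flux ratios relative to the middle band: four 'colours' per star."""
    return np.log(fluxes / fluxes[:, [2]])[:, [0, 1, 3, 4]]

colours = flux_ratios(fluxes)
# Maximum-likelihood mixture-of-Gaussians density in colour space.
sed_model = GaussianMixture(n_components=8, covariance_type="full").fit(colours)
log_density = sed_model.score_samples(colours)   # per-star log prior density
```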

The properties of the model in SED space will be interpretable in terms of stellar blackbodies and the (highly constrained) mixtures of blackbodies that constitute binary-star SEDs. If we go “higher up” in our modeling, therefore, models could make use of the physical understanding of stars (plus a free model for outliers and anomalies) that pulls parameters towards physically plausible SED distributions. That is, it could set priors for the components of the Gaussians in the lower layers. And so on. We are being somewhat vague here, because one of the key goals of this proposal is to work out what the right framework is for this kind of multi-layer modeling in the real, scientific domain of astrophysics.

Stars have non-trivial time-domain properties; parallax motion follows a celestial-sphere ellipse fully specified by the object’s mean position, and proper motions are, for our purposes, constant in angular velocity. Above that, the model will be permitted to set power-law-like smooth prior probability distribution functions over both properties (with a lot of weight near zero). In addition to being able to move, some stars vary. Again, the time-variability properties are strongly constrained, with large-amplitude variations only happening for certain kinds of (rare) eclipsing binaries and for stars with colors and brightnesses consistent with being part of the “instability strip” in the temperature–luminosity plane. Once again, our approach would be to permit a great deal of freedom at the first layer above the low level, but then restrict the freedom with priors. Physical knowledge could come in above that intermediate layer, putting priors on the prior parameters for the lowest level.
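A deliberately simplified sketch of this kinematic model (our own parameterization; it assumes a star near the ecliptic pole, for which the parallax path is a circle rather than the general ellipse fixed by the mean position):

```python
import numpy as np

def sky_offsets(t_years, pm_alpha, pm_delta, parallax):
    """Angular offsets (arcsec) from a star's mean position as a function of
    time: constant proper motion (arcsec/yr) plus an annual parallax loop.
    Simplification: the star is assumed to lie near the ecliptic pole, so the
    parallax path is a circle of radius `parallax`."""
    phase = 2.0 * np.pi * t_years
    d_alpha = pm_alpha * t_years + parallax * np.cos(phase)
    d_delta = pm_delta * t_years + parallax * np.sin(phase)
    return d_alpha, d_delta

# e.g. offsets of a nearby, fast-moving star over five annual epochs
t = np.arange(5.0)
offsets = sky_offsets(t, pm_alpha=0.3, pm_delta=-0.1, parallax=0.05)
```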

Galaxies: After the stars come galaxies. Galaxies are technically very close to mixtures—transparent superpositions—of stars, though they have some significant differences. In principle a sophisticated-enough, hierarchical model of the stars could capture galaxies too, but there are various reasons that this is not a good idea in practice: Galaxies are so distant that, unlike stars, they do not move detectably even over the entire history of observational astronomy. A typical galaxy contains many billions of stars and therefore is not well represented by any small superposition of simple stars. Galaxies contain nebular gas and dust that absorb stellar light and emit continuous and line (monochromatic) spectra.

Figure 2 (panels, left to right: Data, SDSS, GalFit 1–5): Galaxy models of different complexity. Left: a 400 × 400-pixel cutout from the SDSS. The SDSS Catalog is limited to simple radial-profile galaxy models that are unable to capture the spiral-arm structure in this galaxy. As a result, the core of the galaxy has been represented as a smooth profile, and 23 false stars and small galaxies have been added to “explain” the spiral arms. A series of models of increasing complexity, fit using the GalFit software [35], is also shown. The simplest (1) includes two Sersic components, and better fits the overall structure of the galaxy. Model (2) adds a third component. Model (3) adds spiral arms, (4) allows the spiral arms to fit independently of the core, and (5) adds azimuthal Fourier modes which allow the spiral arms to move slightly. Each of these increasingly complex models results in a very large improvement in the fit quality with only a few extra parameters. Real galaxies will only populate a small region of the available parameter space; automatically discovering this low-dimensional manifold will allow us to find feasible models more easily, will effectively reduce the number of parameters required, and may yield astrophysical insight.

Lots of work has gone into modeling galaxies, some of it very constrained to simple functions, and some of it very free—too free. In detail, here and in many cases below, we are going to need to use a free model and then use hierarchical priors to constrain the freedom in a data-driven way, which is the point of the next Section. Fig. 2 shows a series of galaxy models of different complexity for a small spiral galaxy.

Similar to the situation with stars, galaxies are usually given by astronomers either way too little freedom (they are fit as elliptically symmetric exponential radial profiles) or else way too much freedom (they are fit with unconstrained Hermite polynomials [40] or an unconstrained grid-based mixture of Gaussians with hundreds or thousands of parameters). The complexities of these models have always been hard-set by the investigators. Again, our approach will be hierarchical; we will give a large amount of freedom (with a non-parametric morphological model like a mixture of delta functions or a Hermite-polynomial-like basis) but then permit prior parameters (such as a mixture of Gaussians) on the parameters, which can be used to learn where true galaxies live in the high-dimensional space. Work on this has begun with our collaborator Dustin Lang (Princeton), and he intends to continue this collaboration in the context of the projects proposed here (see supporting documentation).
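A minimal sketch of one such free morphological model (a mixture of elliptical Gaussians, with our own parameterization and illustrative numbers): its convolution with a Gaussian PSF is analytic, since convolution simply adds the PSF covariance to each component, which is one reason this basis is attractive for pixel-level fitting.

```python
import numpy as np

def gaussian_2d(xx, yy, flux, x0, y0, cov):
    """A single elliptical Gaussian component with total flux `flux`."""
    inv = np.linalg.inv(cov)
    dx, dy = xx - x0, yy - y0
    q = inv[0, 0] * dx ** 2 + 2 * inv[0, 1] * dx * dy + inv[1, 1] * dy ** 2
    return flux * np.exp(-0.5 * q) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

def render_galaxy(xx, yy, components, psf_cov):
    """Mixture-of-Gaussians galaxy convolved with a Gaussian PSF."""
    image = np.zeros_like(xx, dtype=float)
    for flux, x0, y0, cov in components:
        image += gaussian_2d(xx, yy, flux, x0, y0, np.asarray(cov) + psf_cov)
    return image

yy, xx = np.mgrid[0:64, 0:64].astype(float)
galaxy = render_galaxy(xx, yy,
                       components=[(500.0, 32.0, 32.0, [[4.0, 1.0], [1.0, 2.0]]),
                                   (300.0, 32.0, 32.0, [[25.0, 5.0], [5.0, 12.0]])],
                       psf_cov=np.array([[1.5, 0.0], [0.0, 1.5]]))
```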

Among other things, any successful galaxy model should discover that red galaxies have simpler morphologies than blue galaxies. It should discover that red galaxies are more centrally concentrated than blue ones, that blue ones show spiral structure, and that some galaxies contain bright central point sources. If we are awesome, a higher-level model should also find structure in the space that corresponds to Euler angles on an even simpler model in three-space, or that a smooth weak-lensing map applied to the morphologies simplifies the underlying prior distribution function.

The distribution in brightness and color is simple to model. Here the model is doing what astronomers were doing from the seventies to the nineties! Either we can be informed by that work or else rediscover it. In the brightness distribution, as with stars, power-law-like distribution functions will either be imposed or emerge from the data.

Positions of galaxies we could leave free at this stage, but if we want to do inference on them, there is an angular clustering function (variance as a function of scale, or correlation function) that is a strong function of color and brightness. This in turn comes from strong clustering in three-dimensional space, a cosmological model, and a galaxy-evolution model. In this area, Gaussian Processes are highly relevant, and we have begun discussions with Iain Murray (Edinburgh) about methods by which we might proceed.

Other astrophysical sources: There are also major and minor planets, nebulosity in our Galaxy, twilight and moonshine, artificial satellites, meteors, and so on. These effects put flux in an incredibly small fraction (< 10^-8) of all pixels, but we will see them as anomalies in the model: places where the data-driven model cannot easily explain the data, or assigns very low prior probability to the observed pixel values. Re-discovery of, for example, known Solar-System and nebular sources will provide validation of our approaches. Future extensions of this project can include these sources in the model (enlarging and complexifying Fig. 1).

4 Top layer: Astrophysics and cosmology

The top layers of the model of Fig. 1, which include models of the distribution of stars in age, metallicity, and mass, and of the distribution of galaxies in position (clustering), age, and morphology, are the fundamental goals of contemporary astrophysics. The parameters of the cosmological model that produces those distributions are, in turn, among the fundamental goals of contemporary cosmology. We are not proposing to perform inference all the way up to these layers; at least we are not proposing to build fully sampled posterior probability distributions for the parameters of these hyper-priors.

On the other hand, the existence of these layers at the top of the model demonstrates that the approach to astrophysical imaging data laid out in this proposal represents a potentially transformative way of thinking about the field of astrophysics and about empirical science in general. This represents one of the most important broader impacts of this proposal.

In astrophysics at the present day, some of the top-level links in the graphical model have been implemented with fully Bayesian probabilistic machinery. However, because the quantities that have been acted upon by these models (stellar masses and ages, for example) have been the result of point estimates made from photometry which was itself a set of point estimates from imaging, the full information content of the images has not, at present, been transmitted up the model. One way of seeing this is that if point estimates are biased or high in variance or both, taking more data will not improve the accuracy of any histogram of point-estimated values. Taking more data will, however, improve the accuracy of an inferred prior distribution if that prior is inferred hierarchically from probabilistic inputs from the layers below.

5 Implementation

The model outlined in Fig. 1 is very complex and hence we intend to implement it in stages. We give a timeline in the “Coordination plan” at the end of the proposal, but briefly it is as follows: Initially we will focus on the low-level image-formation model, where we have clearly defined priors and convenient functional forms. The mid-phase of the project will be concerned with appropriate models for stars and galaxies. The final phase of the project will address the high-level star and galaxy priors in the model. These should build on top of the existing model layers and produce important astrophysical results, such as the mass function of stars or the star-formation history in galaxies, but now with a proper probabilistic basis.

Throughout the project we will first apply the model to relatively small datasets (tens to hundreds of images), before exploring ways to scale the model to survey-sized datasets. A more detailed timeline is shown in Fig. 6. In realizing the model we draw on a range of state-of-the-art non-parametric Bayesian methods from machine learning, as well as computer vision methods for decomposing the image formation process.

Models of variable complexity: As outlined in the introduction, a key feature of our proposed model is the ability to automatically choose a complexity appropriate to the data. For example, astronomical images vary enormously in their content, hence fixing a priori the number of objects (e.g. stars, galaxies) present in an image is highly restrictive. Instead, we would prefer to have the model automatically choose how many objects are present based on the data. The non-parametric Bayesian models that we now describe allow us to do this.

Since our model is probabilistic, the model for a fixed number of objects defines a distribution over possible images. To allow the number of objects to vary in a probabilistic fashion, we need a distribution on probability distributions. Statisticians have discovered a number of such models, one of which is the Dirichlet process mixture model shown in Fig. 3. This uses a Dirichlet process to model the weighting of components in a Gaussian mixture model [3, 39]. The model has a countably infinite number of components with non-zero mixture weights under the prior and posterior. However, because some components carry large mixture weights, the posterior explanation of a finite data set typically involves many fewer components than data points. Given data x, we infer through Gibbs sampling both the component parameters (mean & covariance of each Gaussian) and also the distribution over the number of components, shown in Fig. 3(b). Fig. 3(c) & (d) show two different samples from the model, illustrating its ability to dynamically change its complexity. The hyper-parameter α in Fig. 3(a), known as the concentration parameter, controls the clustering and indirectly the shape of the distribution in Fig. 3(b). In our unified model of Fig. 1, Dirichlet processes can be deployed in numerous places, permitting different numbers of: stars/galaxies per image; cosmic rays per image; and bodies in a star system (single or binary).


Figure 3: (a): A Dirichlet process Gaussian mixture model, which has a potentially infinite number of components, applied to a toy 2D dataset. N data points xi have latent cluster assignments zi, drawn from a Dirichlet process prior. The parameters θk for each component (potentially an infinite number of them) are drawn from a conjugate prior λ. The parameters are inferred using Gibbs sampling. (b): The posterior distribution on the number of components. (c) & (d): The data and model at two different iterations of Gibbs sampling. Figures from Sudderth [43].

Figure 4: (a): A non-parametric Bayesian model for scene recognition, proposed by the co-PI [41]. We aim to label objects in an input image (b) by clustering objects o in a labeled database (c) based on their similarity in appearance g and location x to the input. Note that this model is similar to many computer vision models (e.g. [45]) in that it does not generate actual pixels, only pre-computed Gist [31] descriptors g.

Dirichlet processes have been applied to a range of real-world problems, including those in text analysis and computer vision [45, 53]. In Fig. 4 we show their use as part of a complex model for object recognition [41] (previous work by the co-PI).

The middle layers of the model in Fig. 1 capture the properties of stars and galaxies, which in turn will be used to generate the observed image pixels. The characteristics of stars (and galaxies) fall into groups and, although these can be analyzed separately, it is natural to model these groups as produced by related but distinct generative processes. In practice, this is done by sharing parameters between the groups, effectively transferring information between related stars. This can be performed by a hierarchical Dirichlet process (HDP) [48], as demonstrated on the toy 2D dataset in Fig. 5. The hierarchical sharing mechanisms used in the HDP can be extended to multiple layers, enabling us to build rich models with many levels of structure. Note that at all levels of the model, distributions are used rather than point estimates, enabling uncertainty to be propagated throughout the model.


Figure 5: (a): A hierarchical Dirichlet process mixture model, applied to toy data. We observe J = 2 related groups of data (images), each with a different number of points Nj. Instead of modeling each image independently (as per Fig. 3), we instead share the parameters θ of the component models (Gaussians) between images. However, the relative weighting of each component (indicated by the ellipse thickness) is unique to each image. This demonstrates how an underlying process, common to all images, can be inferred while allowing other elements to vary on a per-image basis. (b): The posterior distribution on the number of components common to both images. (c): Our model superimposed on the data.

The Dirichlet distributions used in the models above have a fairly constrained form. The Pitman-Yor process [36, 47, 44] is a generalization of the Dirichlet process with a very attractive property: it is a distribution over power-law distributions, which are highly common in astronomy. For example, the distribution of observed stellar brightnesses follows something close to a power law, as does the distribution of proper motions, and the distribution of galaxy luminosities below L∗.
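A minimal sketch of the (truncated) stick-breaking construction of Pitman-Yor mixture weights, which makes the power-law behaviour easy to see numerically; the function name and truncation level are ours.

```python
import numpy as np

def pitman_yor_weights(alpha, d, n_sticks, rng):
    """Truncated stick-breaking construction of Pitman-Yor mixture weights.
    d = 0 recovers the Dirichlet process; 0 < d < 1 gives power-law weights."""
    betas = np.array([rng.beta(1.0 - d, alpha + (k + 1) * d)
                      for k in range(n_sticks)])
    # Weight k is beta_k times the stick length remaining after sticks < k.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
weights = pitman_yor_weights(alpha=1.0, d=0.5, n_sticks=1000, rng=rng)
```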

The models introduced above are suitable for continuous data, but other parts of our model require the modeling of a set of binary features which may or may not be present, for example the presence or absence of magnetic flares in M-type stars, or the existence of a central AGN point source in each galaxy center. For this we can use the Indian Buffet Process [16, 15], for which there is a hierarchical form that relies on the Beta process [49].

Inference: Hierarchical forms of Dirichlet, Pitman-Yor and Indian Buffet processes will allow us to implement the proposed model in Fig. 1. However, the major technical challenge is performing inference within it. There are three main issues:

• Because we wish to perform science with the model, we must be very careful about imposing functional forms on variables for the sake of computational convenience. For example, while conjugate priors might aid marginalization, if they are not appropriate, they will bias the resulting posterior distribution.

• In view of this, we intend to only use priors and functional forms that have a strong physics grounding. The difficulty is that these are not likely to permit efficient inference, thus for large parts of the model we will have no option but to sample.

• The inference needs to be performed on a vast amount of data (30 TB), with potentially billions of stars and galaxies whose properties must be inferred. Thus some form of parallelism is crucial.

Various effective sampling schemes have been developed for HDP and IBP models, based on Rao-Blackwellization, collapsed and block Gibbs sampling [20, 27, 30]. More general MCMC techniques will also be used, with Hamiltonian MCMC [11, 29] adopted where gradient information is available (which should be the case if we have a physics model). The PI collaborates closely with Iain Murray (Edinburgh), an MCMC expert who can assist in developing efficient sampling methods. Once suitable models and sampling methods have been selected, parallel variants will be implemented. Existing work shows that several options are available [51]. Moreover, large-scale distributed implementations of complex hierarchical and non-parametric models have recently been demonstrated [4, 10].
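For concreteness, a minimal sketch of a single Hamiltonian Monte Carlo update (plain numpy, leapfrog integration, user-supplied log-density and gradient); a production sampler would add step-size adaptation, a mass matrix, and the HDP/IBP-specific moves cited above.

```python
import numpy as np

def hmc_step(q, log_prob, grad_log_prob, step_size, n_leapfrog, rng):
    """One HMC step for a differentiable log-density.  `q` is the current
    parameter vector (numpy float array); returns the new state, which may
    equal the old one if the proposal is rejected."""
    p = rng.normal(size=q.shape)                        # resample momenta
    current_h = -log_prob(q) + 0.5 * np.dot(p, p)

    # Leapfrog integration of Hamiltonian dynamics.
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(q_new)     # initial half step
    for _ in range(n_leapfrog):
        q_new += step_size * p_new
        p_new += step_size * grad_log_prob(q_new)
    p_new -= 0.5 * step_size * grad_log_prob(q_new)     # trim to a half step

    proposed_h = -log_prob(q_new) + 0.5 * np.dot(p_new, p_new)
    if rng.random() < np.exp(current_h - proposed_h):   # Metropolis accept
        return q_new
    return q
```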

Model Validation: We aim to perform science with our model, thus it is vital that we have mechanisms for determining whether our model and sampling schemes are indeed producing correct estimates of the quantities that we are interested in. There are two main approaches to doing this:

1. Held-out data: We can use the standard technique [14] of holding out a portion of the data and computing its log probability. This should be comparable to the log probability of the data used for training, and will allow us to debug issues relating to sampler convergence and over-fitting (a minimal sketch of this check appears after this list).

2. Held-out variables: More uniquely, we can infer posterior distributions for variables in our model which have well-understood and well-characterized physical models. Our model should be able to infer these directly from the data. Failure in this regard would point to issues with the structure of the model and/or the prior distributions (assuming the first test is passed).
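The held-out-data check of item 1, as a sketch against any model object exposing scikit-learn-style fit and score_samples methods (the split fraction and function name are ours):

```python
import numpy as np

def heldout_check(model, data, frac_heldout=0.2, rng=None):
    """Fit on a training split, then compare average per-datum log probability
    on the training and held-out splits; a large gap signals over-fitting or
    convergence problems."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(data))
    n_test = int(frac_heldout * len(data))
    test, train = data[idx[:n_test]], data[idx[n_test:]]
    model.fit(train)
    return model.score_samples(train).mean(), model.score_samples(test).mean()
```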

6 Required NSF content

Prior NSF support, grant AST-0908357: This award, for the project Dynamical models from kinematic data: The Milky Way Disk and Halo, supports the investigation of principled probabilistic approaches to inferring dynamical properties of the Milky Way from extant and forthcoming data on stars. This project is closely related to the current project only in that it involves modeling of data and inference. The funding for this two-year project ends before the start of the proposed project. This award has supported the research of NYU graduate student Jo Bovy, who will complete his PhD in 2011, and who has accepted a postdoctoral position at the Institute for Advanced Study to begin in Fall 2011. In the first year of this award, it supported ten refereed publications relating to the structure and dynamics of the Galaxy [7] and related systems, plus projects on related aspects of probabilistic inference [9], including one of the first (perhaps the first) uses of hierarchical probabilistic inference in astrophysics [19]. It has also supported more than ten presentations at meetings and in astrophysics seminar series, by Bovy and Hogg.

Broader impacts: This project aims at making all astrophysics data analysis more precise and more informative. Success in this endeavor has extremely broad impact: It will be equivalent to increasing the aperture or budget of every large future astrophysics project, and it will provide information for free from projects that already exist. Because all of our code and model parameters will be made available on the Web at all times, all people and projects can benefit from our successes.

In bringing a real scientific (that is, non-toy) data set to the machine learning community, we hope to have an impact on the practices of this community, where in the past most methods have been tested and vetted on toy problems in simple domains. If it becomes standard in the machine learning community to test methods on astrophysics data, this will have broad impacts in both areas: It will provide much more realistic challenges to machine learning developers, and it will make every machine learning project a scientific discovery operation in an important field.

Over the past five or so years, the PI has been introducing inference methods to the astrophysics community, including mixtures-of-Gaussians hierarchical modeling [8], hierarchical probabilistic modeling of distribution functions [19], and non-parametric methods for nuisance parameters [9]. This project continues this work, as it will bring a whole new set of data analysis and inference tools to the astronomy community. Many astrophysicists are Bayesians, but only at the very last stage of their data analysis—the stage that relates the high-level theory to high-level derived “observables”. The work proposed here could lead the way to field-wide adoption of much more sophisticated, information-preserving methods. Contemporary astrophysics is an expensive endeavor, involving enormous optical systems, many of which are in space. Any improvement in data analysis is equivalent to enlarging expensive facilities, or reducing their end-to-end costs.

Finally, and perhaps most importantly, this project touches citizen science. The models and methods of this project will be running not just on professional astrophysics data but also on data from amateurs, hobbyists, and educators, providing back to them fully calibrated data—with astrometric, photometric, and point-spread-function calibration meta-data—plus an estimated noise model and flagged anomalies. This extends and enriches the already-successful Astrometry.net citizen-science outreach project [26]. It makes amateur data—which have been used to discover hundreds of supernovae, hundreds of near-Earth asteroids, and a few microlensing-discovered extrasolar planets, among many other things—even more useful to the scientific endeavor. More than this, however, if we empower amateurs with the capabilities of interacting with a flexible astrophysics model for their data, we can make them capable of performing new and more precise kinds of experiments with their data. This is true for individual hobbyist astronomers, but even more true for collections and collaborations of them.

The co-PI is actively involved with a New York City high-school outreach program to promote Computer Science. Recent visits include Bronx High School of Science, Brooklyn Technical High School, and Stuyvesant High School, and more are scheduled. Astronomy is a particularly accessible and attractive area of science for high-school students, and hence this project makes an ideal topic on which to speak at the various schools.

CDI themes: Primarily this project fits into the CDI theme of “From Data to Knowledge”, since we are explicitly building models from enormous amounts of data.


CDI Coordination plan: The project will support two junior researchers. The first is a CS PhD student, Li Wan, who is in his 2nd year and is well versed in non-parametric Bayesian techniques and sampling methods. The second is a Physics postdoc who will be hired for the project. Given the large number of technically sophisticated projects underway in astrophysics, there are many good candidates for this position. They will work according to the timeline shown in Fig. 6.

In the first year, both will work on the low-level image formation model, with the Postdoc leveraging the PI’s Astrometry.net techniques for camera calibration. Meanwhile, the PhD student will work on PSF estimation, using the co-PI’s models in this area. Both will then join forces to develop the full noise model and integrate their respective components to produce a complete image formation model. This will be evaluated on modest dataset sizes.

In the second year, the PhD student will work to scale the low-level formation model to large datasets. The Postdoc will commence work on building a detailed star model, using existing knowledge of stars as priors. Once the framework of the star model has been developed, the two will combine to actually implement it and integrate it with the previously developed formation model. The same process will then be repeated for the galaxy model.

Figure 6: The proposed timeline for the project, running from 9/11 to 7/14. Tasks (divided between the CS PhD student and the Physics Postdoc): camera calibration, point-spread function, noise model, star model, galaxy model, anomaly detection, star priors, galaxy priors, and fast inference/large scale.

In the third year, the PhD student will focus exclusively on fast sampling methods and parallelism, so that the low- and mid-level models can be applied to entire surveys. The Postdoc will focus on building high-level priors on stars and galaxies, and relating them to existing cosmological models.

Both the PhD student and the Postdoc will be co-advised by both of the senior personnel. We believe that this will provide a rich new educational and professional development environment that is uniquely valuable for individuals pursuing either academic or industrial careers: We are using enormous data to solve a set of difficult problems in real-world settings and to produce functional, useful, and (eventually) commercially valuable tools in the process.

Both the PI and the co-PI hold weekly group meetings to discuss research findings and practice presentation of finished results and work in progress. All personnel (senior and junior) will participate in both group meetings, as attendees and as presenters. This will introduce all parties to the languages and practices of both communities. The PI and co-PI already share an open, web-exposed SVN code repository, used for developing web services, handling basic image-processing and format operations, and making figures and writing.

The senior personnel have already begun some interdisciplinary collaborative work between their groups at NYU, and have co-advised a student (Jon Barron) at NYU who worked on baby steps towards the generative models in both domains. It goes without saying that we expect to co-author publications. Coordination will be straightforward and rich, because our groups already interact.