Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)


Page 1: Data-Intensive Statistical Challenges in Astrophysics

Data-Intensive Statistical Challenges in Astrophysics

Alex Szalay
The Johns Hopkins University

Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)

Page 2: Data-Intensive Statistical Challenges in Astrophysics

The Age of Surveys

CMB Surveys (pixels)
• 1990 COBE 1000
• 2000 Boomerang 10,000
• 2002 CBI 50,000
• 2003 WMAP 1 Million
• 2008 Planck 10 Million

Galaxy Redshift Surveys (obj)
• 1986 CfA 3500
• 1996 LCRS 23000
• 2003 2dF 250000
• 2008 SDSS 1000000
• 2012 BOSS 2000000
• 2012 LAMOST 2500000

Angular Galaxy Surveys (obj)
• 1970 Lick 1M
• 1990 APM 2M
• 2005 SDSS 200M
• 2011 PS1 1000M
• 2020 LSST 30000M

Time Domain
• QUEST
• SDSS Extension survey
• Dark Energy Camera
• Pan-STARRS
• LSST…

Petabytes/year …

Page 3: Data-Intensive Statistical Challenges in Astrophysics

Sloan Digital Sky Survey

• “The Cosmic Genome Project”
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 2.5 Terapixels of images => 5 Tpx
  – 10 TB of raw data => 120TB processed
  – 0.5 TB catalogs => 35TB in the end
• Started in 1992, finished in 2008
• Extra data volume enabled by
  – Moore’s Law
  – Kryder’s Law

Page 4: Data-Intensive Statistical Challenges in Astrophysics

Analysis of Galaxy Spectra

• Sparse signal in large dimensions
• Much noise, and very rare events
• 4Kx1M SVD problem, perfect for randomized algorithms
• Motivated our work on robust incremental PCA
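To make the "randomized algorithms" remark concrete, here is a minimal sketch of a randomized truncated SVD based on a random range finder with a few power iterations. It is not the method used in the talk, which the slides do not spell out; the function name `randomized_svd` and the parameters `oversample` and `n_iter` are illustrative assumptions.

```python
import numpy as np

def randomized_svd(X, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k SVD of X with a random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Project onto k + oversample random directions to capture the dominant range of X.
    omega = rng.standard_normal((X.shape[1], k + oversample))
    Q, _ = np.linalg.qr(X @ omega)
    # Power iterations sharpen the captured subspace; QR keeps them numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    # Exact SVD of the small projected matrix, mapped back to the original space.
    B = Q.T @ X
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

For a tall spectra matrix (many galaxies, a few thousand wavelength bins) the expensive steps are the products with `X`, which can be streamed over blocks of rows.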

Page 5: Data-Intensive Statistical Challenges in Astrophysics

Galaxy Properties from Galaxy Spectra

[Figure labels: Continuum Emissions; Spectral Lines]

Page 6: Data-Intensive Statistical Challenges in Astrophysics

Galaxy Diversity from PCA

[Figure: eigenspectra, 1st to 5th PC. Panel annotations, in order:]
• [Average Spectrum]
• [Stellar Continuum]
• [Finer Continuum Features + Age]
• [Age] Balmer series hydrogen lines
• [Metallicity] Mg b, Na D, Ca II Triplet

Page 7: Data-Intensive Statistical Challenges in Astrophysics

Streaming PCA

• Initialization
  – Eigensystem of a small, random subset
  – Truncate at p largest eigenvalues
• Incremental updates
  – Mean and the low-rank A matrix
  – SVD of A yields new eigensystem
• Randomized algorithm!

T. Budavari, D. Mishin 2011
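The update loop on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the forgetting factor `gamma`, the exponentially weighted mean, and the function names are hypothetical. The idea is that the weighted covariance γ·EΛEᵀ + (1−γ)·yyᵀ equals AAᵀ for a thin matrix A, so a small SVD of A gives the updated eigensystem.

```python
import numpy as np

def init_eigensystem(batch, p):
    """Eigensystem of a small initial batch, truncated at the p largest eigenvalues."""
    mean = batch.mean(axis=0)
    _, s, vt = np.linalg.svd(batch - mean, full_matrices=False)
    return mean, vt[:p].T, s[:p] ** 2 / len(batch)

def update_eigensystem(mean, E, lam, x, gamma=0.995):
    """Fold one new observation x into the running mean and rank-p eigensystem.

    gamma (assumed) down-weights the past; E holds eigenvectors as columns,
    lam the matching eigenvalues.
    """
    mean = gamma * mean + (1.0 - gamma) * x
    y = x - mean
    # A is d x (p + 1): scaled old eigenvectors plus the new centered observation,
    # chosen so that A @ A.T = gamma * E diag(lam) E.T + (1 - gamma) * y y.T.
    A = np.column_stack([E * np.sqrt(gamma * lam), np.sqrt(1.0 - gamma) * y])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    p = E.shape[1]
    # Keep the p leading directions; squared singular values are the eigenvalues.
    return mean, U[:, :p], s[:p] ** 2
```

Each update costs one SVD of a d × (p + 1) matrix, independent of how many spectra have already been processed.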

Page 8: Data-Intensive Statistical Challenges in Astrophysics

Robust PCA

• PCA minimizes σRMS of the residuals r = y – Py
  – Quadratic formula: r² extremely sensitive to outliers
• We optimize a robust M-scale σ² (Maronna 2005)
  – Implicitly given by
• Fits in with the iterative method!
• Outliers can be processed separately
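The equation that followed "Implicitly given by" did not survive in this transcript. In the robust-statistics literature an M-scale σ is usually defined as the solution of (1/N) Σ ρ(r_i²/σ²) = δ for a bounded function ρ, and the sketch below assumes that form. The Tukey bisquare ρ, the constants `c = 1.547` and `delta = 0.5`, and the fixed-point iteration are illustrative choices, not taken from the slides.

```python
import numpy as np

def rho_bisquare(t, c=1.547):
    """Bounded Tukey bisquare rho, written as a function of t = r^2 / sigma^2."""
    u = np.minimum(t / c**2, 1.0)
    return 1.0 - (1.0 - u) ** 3

def m_scale(residuals, delta=0.5, n_iter=200, tol=1e-10):
    """Solve mean(rho(r^2 / sigma^2)) = delta for sigma by fixed-point iteration.

    Assumes at least half of the residuals are nonzero so the starting value is positive.
    """
    r2 = np.asarray(residuals, dtype=float) ** 2
    sigma2 = np.median(r2)
    for _ in range(n_iter):
        new_sigma2 = sigma2 * rho_bisquare(r2 / sigma2).mean() / delta
        converged = abs(new_sigma2 - sigma2) <= tol * sigma2
        sigma2 = new_sigma2
        if converged:
            break
    return np.sqrt(sigma2)
```

Because ρ is bounded, a gross outlier contributes at most 1 to the sum no matter how large it is, whereas it contributes r² to the RMS.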

Page 9: Data-Intensive Statistical Challenges in Astrophysics

Eigenvalues in Streaming PCA

[Figure panels: Classic; Robust]

Page 10: Data-Intensive Statistical Challenges in Astrophysics

Examples with SDSS Spectra

Built on top of the Incremental Robust PCA

• Principal Component Pursuit (I. Csabai et al)
• Importance sampling (C-W Yip et al)

Page 11: Data-Intensive Statistical Challenges in Astrophysics

Principal component pursuit

• Low rank approximation of data matrix: X
• Standard PCA:

$$\min_{E} \|X - E\|_2 \quad \text{subject to} \quad \operatorname{rank}(E) \le k$$

  – works well if the noise distribution is Gaussian
  – outliers can cause bias
• Principal component pursuit
  – “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low

$$\min_{A} \|A\|_0 \quad \text{subject to} \quad X = N + A,\ \operatorname{rank}(N) \le k$$

  – NP-hard problem
• The L1 trick:

$$\min_{N,A} \|N\|_* + \lambda \|A\|_1 \quad \text{subject to} \quad X = N + A$$

  – numerically feasible convex problem (Augmented Lagrange Multiplier)

$$\min_{N,A} \|N\|_* + \lambda \|A\|_1 \quad \text{subject to} \quad \|X - (N + A)\|_2 \le \varepsilon$$

* E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009. Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection)
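A minimal sketch of the equality-constrained convex problem, solved with an augmented Lagrange multiplier iteration, may help. It follows the slide's notation (N low rank, A sparse, X = N + A). The default `lam = 1/sqrt(max(m, n))`, the fixed penalty `mu`, and the stopping tolerance are commonly used choices and are assumptions here; the talk itself reports λ = 0.6/sqrt(n) and ε = 0.03 for the spectra.

```python
import numpy as np

def shrink(M, tau):
    """Soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def pcp_alm(X, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split X into low-rank N and sparse A by minimizing ||N||_* + lam * ||A||_1."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = m * n / (4.0 * np.abs(X).sum()) if mu is None else mu
    N = np.zeros_like(X)
    A = np.zeros_like(X)
    Y = np.zeros_like(X)  # Lagrange multipliers for the constraint X = N + A
    for _ in range(max_iter):
        # Low-rank step: singular value thresholding at level 1/mu.
        U, s, Vt = np.linalg.svd(X - A + Y / mu, full_matrices=False)
        N = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Sparse step: entrywise soft-thresholding at level lam/mu.
        A = shrink(X - N + Y / mu, lam / mu)
        # Dual ascent on the constraint residual.
        R = X - N - A
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return N, A
```

The nuclear norm stands in for the rank and the L1 norm for the count of outliers, which is what turns the NP-hard problem into a convex one.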

Page 12: Data-Intensive Statistical Challenges in Astrophysics

Testing on Galaxy Spectra

• Slowly varying continuum + absorption lines
• Highly variable “sparse” emission lines
• This is the simple version of PCP: the positions of the lines are known
• but there are many of them, so automatic detection can be useful
• spiky noise can bias standard PCA

DATA: Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.)
• SDSS 1M galaxy spectra
• Morphological subclasses
• Robust averages + first few PCA directions

Page 13: Data-Intensive Statistical Challenges in Astrophysics

PCA

PCA reconstruction

Residual

Page 14: Data-Intensive Statistical Challenges in Astrophysics

Principal component pursuit

Low rank

Sparse

Residual

λ=0.6/sqrt(n), ε=0.03

Page 15: Data-Intensive Statistical Challenges in Astrophysics

Not Every Data Direction is Equal

A = C X

[Figure: A (Galaxy ID × Wavelength) = C (Galaxy ID × Selected Wavelengths) · X (Selected Wavelengths × Wavelength)]

Procedure:
1. Perform SVD of $A = U \Sigma V^T$
2. Pick number of eigenvectors = K
3. Calculate Leverage Score = $\sum_i \|V^T_{ij}\|^2 / K$

Mahoney and Drineas 2009
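The three-step procedure can be sketched in a few lines of NumPy. The function names and the fixed-size sampling without replacement are illustrative assumptions; the published CUR scheme may draw columns differently.

```python
import numpy as np

def leverage_scores(A, K):
    """Normalized leverage score of each column (wavelength) of A from the top-K right singular vectors."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Sum of squared entries over the first K rows of V^T, divided by K.
    return (Vt[:K] ** 2).sum(axis=0) / K

def sample_columns(A, K, c, seed=0):
    """Draw c distinct columns with probability proportional to leverage (assumed sampling rule)."""
    p = leverage_scores(A, K)
    p = p / p.sum()  # guard against floating-point drift
    rng = np.random.default_rng(seed)
    cols = np.sort(rng.choice(A.shape[1], size=c, replace=False, p=p))
    return cols, A[:, cols]
```

Because the rows of Vᵀ are orthonormal, the scores are non-negative and sum to one, so they can be read directly as a sampling probability over wavelengths.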

Page 16: Data-Intensive Statistical Challenges in Astrophysics

Wavelength Sampling Probability

[Figure panels: k = 2, c = 7; k = 4, c = 16; k = 6, c = 25; k = 8, c = 29]

Page 17: Data-Intensive Statistical Challenges in Astrophysics

Ranking Astronomical Line Indices

(Yip et al. 2012 in prep.)
(Worthey et al. 94; Trager et al. 98)

Subspace Analysis of Spectra Cutouts:
- Orthogonality
- Divergence
- Commonality

Page 18: Data-Intensive Statistical Challenges in Astrophysics

Identify Informative Regions

“NewMethod”
1. Pick the λ with largest Pλ
2. Define its region of influence using λ Pλ convergence. Mask λ’s from future selection.
3. Go back to Step 1, or quit.

“MahoneySecond”
4. Over-select λ’s from the targeted number.
5. Merge selected λ if two pixels lie within a certain distance
6. Quit.
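The “MahoneySecond” steps are specific enough to sketch. The version below assumes wavelengths sorted in ascending order, picks the highest-probability pixels deterministically, and treats "within a certain distance" as a gap between neighbouring selected pixels below `max_gap`; the function name and all parameter names are hypothetical.

```python
import numpy as np

def merge_selected(wavelengths, prob, n_target, overselect=10, max_gap=30.0):
    """Over-select high-probability pixels, then merge close neighbours into regions.

    Returns a list of (start, end) wavelength pairs. Assumes `wavelengths` is ascending.
    """
    wavelengths = np.asarray(wavelengths, dtype=float)
    prob = np.asarray(prob, dtype=float)
    # Step 4: take overselect times the targeted number of pixels, highest probability first.
    n_pick = min(len(prob), n_target * overselect)
    idx = np.sort(np.argsort(prob)[::-1][:n_pick])
    picked = wavelengths[idx]
    # Step 5: walk the selected pixels in wavelength order and merge those closer than max_gap.
    regions = []
    start = prev = picked[0]
    for w in picked[1:]:
        if w - prev >= max_gap:
            regions.append((float(start), float(prev)))
            start = w
        prev = w
    regions.append((float(start), float(prev)))
    return regions
```

The defaults mirror the settings quoted later in the deck (over-selecting 10 X, combining if < 30 Å), but the merging rule itself is only one plausible reading of the slide.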

Page 19: Data-Intensive Statistical Challenges in Astrophysics

Identifying New Line Indices, Objectively

(Yip et al. 2012 in prep.)

Page 20: Data-Intensive Statistical Challenges in Astrophysics

New Spectral Regions (MahoneySecond; k = 5; Overselecting 10 X; Combining if < 30 Å)

Page 21: Data-Intensive Statistical Challenges in Astrophysics

NewMethod vs MahoneySecond

NM

M2

Page 22: Data-Intensive Statistical Challenges in Astrophysics

(Gunawan & Neswan 2000)

Page 23: Data-Intensive Statistical Challenges in Astrophysics

Angle between Subspaces

JHU Lick
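One common way to quantify the angle between two subspaces is through principal angles, computed from the singular values of the product of their orthonormal bases. The sketch below uses that definition as an assumption; the slides cite Gunawan & Neswan for their measure, which may be defined differently. The function name is illustrative.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))
```

Angles near zero indicate that two index sets span nearly the same directions, while angles near π/2 indicate orthogonal, non-redundant information.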

Page 24: Data-Intensive Statistical Challenges in Astrophysics

λ Pλ

JHU Lick

Page 25: Data-Intensive Statistical Challenges in Astrophysics

Line Indices for Galaxy Parameter Estimations

Page 26: Data-Intensive Statistical Challenges in Astrophysics

Importance Sampling and Galaxies

• Lick indices are ad hoc
• The new indices are objective
  – Recover atomic lines
  – Recover molecular bands
  – Recover Lick indices
  – Informative regions are orthogonal to each other, in contrast to Lick
• Future
  – Emission line indices
  – More accurate parameter estimation of galaxies

Page 27: Data-Intensive Statistical Challenges in Astrophysics

Summary

Non-Incremental changes on the way
• Science is moving increasingly from hypothesis-driven to data-driven discoveries
• Need randomized, incremental algorithms
  – Best result in 1 min, 1 hour, 1 day, 1 week
• New computational tools and strategies
… not just statistics, not just computer science, not just astronomy, not just genomics…

Astronomy has always been data-driven….now becoming more generally accepted