Applications of UDFs in Astronomical Databases and Research Manuchehr Taghizadeh-Popp Johns Hopkins University


Applications of UDFs in Astronomical Databases and Research

Manuchehr Taghizadeh-Popp

Johns Hopkins University


User Defined Functions (UDFs)

Motivation:

-Scientists need to execute their own code/functions where the data is stored (in the database).

-Need fast code/algorithms no more complex than O(N log N), parallelizable if possible in 10^4+ threads.

For astronomers:

-Basic astronomical UDFs bring a 3-dimensional and temporal view of the universe.

-Created a cosmological functions library (CfunBASE), written in C# (.NET Framework). The library is uploaded into SQL Server and its code is executed through CLR integration.

-Used in the CasJobs/SkyServer service, which hosts the SDSS data archive.

-Functions and stored procedures are executed with simple SQL commands.


Functions for SQL Server

-Cosmological Functions:

- Volume, distances and times as a function of redshift "z" (F = F(z)).

- Inverse functions z = F^-1(F(z)) also implemented.

-Basic data exploratory and statistical functions also included:

- Cumulative distribution and quantile functions (both scalar and aggregate)

- Binning and grids (1-D streaming table-valued function, linear/log-scaled), for aggregation, table creation, etc.

- N-Dimensional weighted histogram.

-Numerical Methods:

Integration, root finding, interpolation. Customizable for speed/precision.

-Many functions in astronomy contain integrals/sums:

many problems parallelizable with CUDA/GPU (to be done…)
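As a rough illustration of the cosmological and numerical functions listed above, the following Python sketch computes a distance F(z) by numerical integration and inverts it by root finding. It is not CfunBASE code: the flat LambdaCDM model, the parameter values (H0 = 70 km/s/Mpc, Omega_m = 0.3) and all function names are assumptions made for this example.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def e_of_z(z, omega_m=0.3):
    """Dimensionless Hubble rate E(z) for an assumed flat LambdaCDM model."""
    return math.sqrt(omega_m * (1.0 + z) ** 3 + (1.0 - omega_m))


def comoving_distance(z, h0=70.0, omega_m=0.3, steps=512):
    """Comoving distance in Mpc: (c/H0) * integral_0^z dz'/E(z'), Simpson's rule.

    `steps` trades speed for precision.
    """
    if z == 0.0:
        return 0.0
    n = steps if steps % 2 == 0 else steps + 1  # Simpson needs an even count
    h = z / n
    total = 1.0 / e_of_z(0.0, omega_m) + 1.0 / e_of_z(z, omega_m)
    for i in range(1, n):
        weight = 4.0 if i % 2 else 2.0
        total += weight / e_of_z(i * h, omega_m)
    return (C_KM_S / h0) * total * h / 3.0


def redshift_from_distance(d_mpc, h0=70.0, omega_m=0.3, z_max=20.0, tol=1e-8):
    """Inverse function z = F^-1(d) by bisection (F grows monotonically with z)."""
    lo, hi = 0.0, z_max
    if not 0.0 <= d_mpc <= comoving_distance(hi, h0, omega_m):
        raise ValueError("distance outside the bracketed redshift range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if comoving_distance(mid, h0, omega_m) < d_mpc:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    d = comoving_distance(0.1)
    print(d, redshift_from_distance(d))
```

The integrand evaluations are independent of each other, which is the kind of sum that could be spread over many threads.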


Advanced Astronomical Examples

-Galaxy clusters from Friends-of-Friends algorithm: 3D view of the Large Scale Structure.

-Luminosity Function (1-D weighted histogram)

SELECT dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1),
       sum(1/v.Vmax)/0.1,
       sqrt(sum(1/(v.Vmax*v.Vmax)))/0.1,
       count(*)
FROM (SELECT dbo.fCosmfAbsMag(m_r, z) AS AbsMag_r, Vmax FROM DR7) AS v
GROUP BY dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1)
ORDER BY dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1)

-Color-Magnitude Diagram (2-D weighted histogram)

EXECUTE spMathHistogramNDim 'SELECT dbo.fCosmfAbsMag(m_r,z), Color_u_r, 1.0/Vmax FROM DR7', 2, '-25,0', '-15,5', '50,50', 1

-Use a query-parsing function to prevent SQL injection when functions run a user's query.
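The luminosity function query above is a 1/Vmax-weighted histogram: each galaxy adds 1/Vmax to its magnitude bin, the sum is divided by the bin width (0.1 mag for 100 bins between -25 and -15), and the error is the square root of the summed squared weights. A minimal Python sketch of the same arithmetic follows. The `bin_index` helper is a hypothetical stand-in (the actual return value and the last two arguments of dbo.fMathBin are not described in the slides), and any input data are made up.

```python
import math
from collections import defaultdict


def bin_index(x, x_min, x_max, n_bins):
    """Index of the linear bin containing x, or None if x is out of range.

    Hypothetical stand-in for a binning UDF; not the actual dbo.fMathBin.
    """
    if not x_min <= x < x_max:
        return None
    width = (x_max - x_min) / n_bins
    return min(int((x - x_min) / width), n_bins - 1)


def luminosity_function(abs_mags, vmaxes, x_min=-25.0, x_max=-15.0, n_bins=100):
    """1/Vmax estimator: {bin index: (density per mag, error, galaxy count)}."""
    width = (x_max - x_min) / n_bins
    sum_w = defaultdict(float)   # sum of 1/Vmax per bin
    sum_w2 = defaultdict(float)  # sum of 1/Vmax^2 per bin
    counts = defaultdict(int)
    for mag, vmax in zip(abs_mags, vmaxes):
        b = bin_index(mag, x_min, x_max, n_bins)
        if b is None:
            continue  # galaxies outside the magnitude range are skipped
        sum_w[b] += 1.0 / vmax
        sum_w2[b] += 1.0 / (vmax * vmax)
        counts[b] += 1
    return {
        b: (sum_w[b] / width, math.sqrt(sum_w2[b]) / width, counts[b])
        for b in sorted(sum_w)
    }
```

The 2-D color-magnitude case follows the same pattern with a pair of bin indices as the dictionary key.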


Extreme Value Statistics (EVS) as a tool

-Used widely in calculations of risk and the study of tails of distributions.

-EVS predicts the biggest/smallest value we will ever observe.

- The distribution φ(x) of the extremes of n i.i.d. random variables (with parent distribution P(x)) is known in the limit n → ∞:

$$\varphi(x \mid \mu, \sigma, \xi) = \frac{1}{\sigma}\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi - 1} \exp\left\{-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\}$$

- ξ defines 3 universal distributions depending on the tail of the parent distribution P(x):

(1) $P(x) \sim x^{-1/\xi - 1}$ (power law tail): ξ > 0 [φ(x) is called the Fréchet distribution]

(2) $P(x) \sim \exp(-x)$ (exponential tail): ξ = 0 [φ(x) is called the Gumbel distribution]

(3) $P(x) \sim (x_0 - x)^{-1/\xi - 1}$ for $x < x_0$ (finite cutoff tail): ξ < 0 [φ(x) is called the Weibull distribution]

With large data sets, questions to answer:

-Are maximal galaxy luminosities really Gumbel distributed [P(L) ~ exp(-L)]?

-Having lots of galaxies, can we observe the finite size correction of φ(x) due to having finite n?
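To make the three tail classes concrete, here is a small Python sketch of the generalized extreme value (GEV) distribution, plus a helper that draws block maxima from any sampler. The (mu, sigma, xi) parameterization is the standard textbook one and is an assumption here, as are all function names.

```python
import math
import random


def _gev_u(x, mu, sigma, xi):
    """Return (u, t) such that the CDF is exp(-u), with t = 1 + xi*(x - mu)/sigma.

    In the Gumbel limit (xi -> 0) u = exp(-s) and t = 1.
    u is None when x lies outside the support.
    """
    s = (x - mu) / sigma
    try:
        if abs(xi) < 1e-12:
            return math.exp(-s), 1.0
        t = 1.0 + xi * s
        if t <= 0.0:
            return None, t
        return t ** (-1.0 / xi), t
    except OverflowError:
        return math.inf, 1.0  # CDF and density are both 0 this far down


def gev_cdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """GEV cumulative distribution function."""
    u, _ = _gev_u(x, mu, sigma, xi)
    if u is None:
        # Below the support for xi > 0, above the finite cutoff for xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-u)


def gev_pdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """GEV density: (1/sigma) * t**(-1/xi - 1) * exp(-t**(-1/xi))."""
    u, t = _gev_u(x, mu, sigma, xi)
    if u is None:
        return 0.0
    decay = math.exp(-u)
    if decay == 0.0:
        return 0.0
    return (u / t) * decay / sigma


def block_maxima(sampler, n, blocks, rng):
    """Largest of n draws from sampler(rng), repeated for `blocks` realizations."""
    return [max(sampler(rng) for _ in range(n)) for _ in range(blocks)]
```

For example, maxima of n exponential draws, shifted by log n, should follow `gev_cdf` with xi = 0 increasingly well as n grows: `block_maxima(lambda r: r.expovariate(1.0), 200, 2000, random.Random(1))`.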


Sampling luminosities from HEALPix cells

-HEALPix tessellation library uploaded into the database.

-Can be used for spatial indexing (use tree schema and bitshift on HEALPix ID).

-Equal area cells.

Applications for EVS:

-Build the HEALPix SDSS footprint on the sky. Use the HTM spatial indexing library.

-Each cell has 1 "realization" of the random variable (luminosity).

-Sample the highest luminosity in each one of all n cells.

-3 different spatial resolutions: Nside = (16, 32, 64), with n ~ (296, 1450, 6642) cells.
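The tree schema and per-cell sampling can be sketched in Python as follows. The sketch assumes pixel IDs in HEALPix NESTED ordering, where each pixel splits into four children at the next resolution, so moving up one level is a right shift by two bits. It also assumes each galaxy's pixel ID has already been computed; the function names are illustrative and not those of the uploaded library.

```python
def npix(nside):
    """Number of equal-area HEALPix pixels covering the full sky."""
    return 12 * nside * nside


def degrade_nested(ipix, nside_in, nside_out):
    """Ancestor of a NESTED-ordered pixel at the coarser resolution nside_out."""
    ratio, remainder = divmod(nside_in, nside_out)
    if remainder or ratio < 1 or ratio & (ratio - 1):
        raise ValueError("nside_in must be nside_out times a power of two")
    levels = ratio.bit_length() - 1  # log2 of the resolution ratio
    return ipix >> (2 * levels)      # two bits per level of the quad tree


def max_luminosity_per_cell(pixel_ids, luminosities, nside_in, nside_out):
    """Largest luminosity in each coarse cell: one extreme-value sample per cell."""
    maxima = {}
    for ipix, lum in zip(pixel_ids, luminosities):
        cell = degrade_nested(ipix, nside_in, nside_out)
        if cell not in maxima or lum > maxima[cell]:
            maxima[cell] = lum
    return maxima
```

Storing IDs at the finest resolution (Nside = 64 here) lets the coarser samples at Nside = 32 and 16 be derived with shifts alone, without recomputing positions on the sky.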


RESULTS: tail classes and finite size correction

-Tail index ξ from DEdH estimator

η = normalized order statistics

Test 4 different galaxy samples:

Generally close to ξ = 0 [P(L) ~ exp(-L^β)]

-First observation of the finite size correction

- x = Standardized maximal luminosities

- Finite size correction Δ due to finite n:

Δ = P(x) – StandardGumbel

- Slow theoretical convergence:

Δ(n) ~ 1/log n

RESULT: Correction appears when n > 6000 (tradeoff between noise and convergence)
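For reference, a tail index estimate of this kind can be sketched in a few lines of Python. The sketch assumes "DEdH" refers to the Dekkers-Einmahl-de Haan moment estimator, which works on the logarithms of the k largest values. The choice of k and the synthetic test distributions are illustrative assumptions, not the settings used for the galaxy samples.

```python
import math
import random


def dedh_tail_index(sample, k):
    """Moment estimator of the tail index xi from the k largest values.

    Needs positive data near the top of the sample and 1 <= k < len(sample).
    """
    xs = sorted(sample, reverse=True)
    if not 1 <= k < len(xs) or xs[k] <= 0.0:
        raise ValueError("need 1 <= k < n and a positive (k+1)-th largest value")
    log_threshold = math.log(xs[k])
    excess = [math.log(x) - log_threshold for x in xs[:k]]
    m1 = sum(excess) / k                 # first moment of the log-excesses
    m2 = sum(e * e for e in excess) / k  # second moment of the log-excesses
    if m2 == 0.0 or m1 * m1 >= m2:
        raise ValueError("degenerate sample: the top values are all equal")
    return m1 + 1.0 - 0.5 / (1.0 - m1 * m1 / m2)


if __name__ == "__main__":
    rng = random.Random(42)
    n, k = 20000, 500
    print("power law :", dedh_tail_index([rng.paretovariate(2.0) for _ in range(n)], k))
    print("exponential:", dedh_tail_index([rng.expovariate(1.0) for _ in range(n)], k))
    print("cutoff     :", dedh_tail_index([rng.random() for _ in range(n)], k))
```

The three synthetic parents should give estimates near 0.5, 0 and -1 respectively, matching the power law, exponential and finite cutoff tail classes.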


Mining the space of Galaxy Properties

How to classify galaxies in the n-dimensional cloud of Photometric/Spectral properties?

- Use Principal Components Analysis (PCA) on properties and consider important eigenvectors.

- Build PRINCIPAL CURVE: Smooth fit/projection to the cloud's spine. Complexity of ~O(N^2)

- Explore diverse statistics as a function of arc length.

- Scalability for big N:

Streaming PCA (T. Budavari) and randomized sampling for the principal curve

(Principal curve not yet implemented in SQLCLR)
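As a point of reference for the PCA step, the leading eigenvector of the covariance matrix can be found with plain power iteration. The Python sketch below is a small batch version for illustration only: it is neither the streaming PCA mentioned above nor a principal curve fit, and the function names and the random starting vector are assumptions.

```python
import math
import random


def leading_component(rows, iterations=200, seed=0):
    """Return (means, unit eigenvector, eigenvalue) of the sample covariance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # A random start makes it unlikely to begin orthogonal to the leading vector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero covariance: every direction is equivalent
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    return means, v, eigenvalue


def project(rows, means, v):
    """Coordinate of each row along the leading component."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]
```

Building the covariance matrix takes a single pass over the N rows, while the iteration itself only touches the d x d matrix.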


Final remarks

- Algorithms are useful if they are randomized, ~O(N log N), streaming-capable and parallelizable

- For analysis, an astronomer would like

- A programming layer on the database (with the functionality of, e.g., R)

- implementing matrix algebra, calculus, statistics, etc.

- Including data visualization.