A Web service for Distributed Covariance Computation on Astronomy Catalogs Presented by Haimonti Dutta CMSC 691D

Page 2

ROADMAP

• Background Information

• Interesting Astronomy Data Mining Problems

• What has and has not been done (literature review)

• My project objectives

• The problem of Alignment in astronomy catalogs

• The Fundamental Plane

• A case study for recreating the Fundamental Plane from astronomy catalogs

• Experimental Results

• Efforts towards building Web services

Page 3

Background Information

Next generation Astronomy catalogs will contain data for most of the sky

Existing astronomy sky surveys: SDSS, 2MASS, FIRST, etc.

Terabytes and petabytes of data

Data Avalanche in Astronomy

Getting useful information is like looking for a needle in a haystack

National Virtual Observatory (NVO) has been set up to facilitate scientific discovery

Obvious need for Distributed Data Mining

Page 4

What kinds of data mining activities are astronomers interested in?

Detection of transient objects such as supernovae (online transient object detection in real time)

Obtaining statistics of variable and moving objects (model variability, refine existing models, fit models to irregularly sampled data)

Parameterizing shapes of objects using rotationally invariant quantities

Efficient cluster and outlier detection

Supervised data mining problems (match objects detected in multiple bands, derive photometric redshifts)

Page 5

What has / has not been done

A lot of effort in centralized data mining (NVO, FMass, Class X, FIRST, etc.)

Some grid mining (notably the GRIST project)

Very few distributed data mining efforts, all in their preliminary stages

(http://www.cs.queensu.ca/home/mcconell/DDMAstro.html)

Page 6

Objectives of this project

Aligning of Catalogs (The Fundamental Plane Problem)

Implementation of algorithms for Distributed Data Mining on Astronomy Catalogs

Development of web services for the catalogs, and investigation into what needs to be done to integrate them into the NVO

Page 7

Alignment of Astronomy Catalogs

Cross matching is a nontrivial problem in itself. We assume cross matching happens offline and that there exists an indexing scheme by which catalogs know the exact cross matched tuples

Page 8

Some interesting numbers

Size of current SDSS catalogs: 3.0 TB, containing about 180 million objects (as per Data Release 4)

2MASS has already observed 99% of the sky and reports 470,992,970 point sources and 1,647,599 extended sources

Portion of the sky observed by SDSS

Page 9

Problems

Cross matching is an inherently difficult problem for the astronomy catalogs

We assume data sets are cross matched and this computation is done offline

This is a strong assumption and often may not be acceptable to astronomers

Page 10

A real life cross matching exercise

Problems encountered

Which catalogs to use? We tried several: SDSS, 2MASS, HyperLeda, CfA Redshift Catalog

Catalogs have different indexing schemes: more recent ones use HTM (Hierarchical Triangular Mesh), others use (ra, dec) or even names of objects

Some attributes are really not available! (SDSS has -9999 for most of its redshift values)

Different catalogs observe different portions of the sky (SDSS covers only about 16% of the sky in the latest release while 2MASS covers the entire sky). Select subsets to cross match wisely!

Page 11

The successful cross matching

Chose a region of the sky between 0 and 15 degrees (dec) and 150 and 200 degrees (ra), observed by both SDSS and 2MASS

Used a web interface provided by SDSS to do the cross matching

Selected the K-band for obtaining redshift and surface brightness (astronomical significance)

Case Study

Centralized database: 1249 cross matched objects

Attributes are size, surface brightness, velocity dispersion

Does not really make a case for a distributed data mining scenario!

Solution: try a larger subset of the data from both catalogs
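The subset selection described on this slide can be sketched as a simple filter. This is a minimal illustration, not the procedure actually used (the slide says the cross matching went through an SDSS web interface). The record layout, the field names `ra`, `dec` and `redshift`, and the sample rows are all hypothetical; only the region bounds and the -9999 sentinel come from the slides.

```python
# Sentinel the slides report SDSS uses for unavailable redshift values.
MISSING = -9999


def in_overlap_region(ra, dec):
    """True if (ra, dec), in degrees, lies in the region chosen on the slide."""
    return 150.0 <= ra <= 200.0 and 0.0 <= dec <= 15.0


def usable(records):
    """Keep records inside the overlap region whose redshift is present."""
    return [
        r for r in records
        if in_overlap_region(r["ra"], r["dec"]) and r["redshift"] != MISSING
    ]


# Hypothetical sample rows, for illustration only (not catalog data).
sample = [
    {"id": 1, "ra": 160.2, "dec": 3.1, "redshift": 0.031},
    {"id": 2, "ra": 160.9, "dec": 4.0, "redshift": MISSING},  # missing value
    {"id": 3, "ra": 120.0, "dec": 5.0, "redshift": 0.050},    # outside region
]
kept_ids = [r["id"] for r in usable(sample)]
print(kept_ids)
```

Filtering on the sentinel before cross matching avoids carrying placeholder values into the covariance computation later on.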

Page 12

The Fundamental Plane

Interesting problem in astronomy: identify correlations in high dimensional spaces

For the class of elliptical and spiral galaxies

Observed features: radius, mean surface brightness and central velocity dispersion

A two dimensional plane exists in the observed space of 3D parameters, called THE FUNDAMENTAL PLANE

Page 13

An illustration of the Fundamental Plane

Page 14

Experimental Results

First PC captured 69.4193% of variance

Second PC captured 12.1333% of the variance

The astronomy literature suggests 1st and 2nd PC together should capture about 88% of variance

Reasonably close recreation of the Fundamental Plane from two cross matched data sets in the centralized setting
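To make the comparison concrete, the reported percentages can be read with the usual PCA definition of variance captured. The notation below is an assumption of this note, not taken from the slides: λ1 ≥ λ2 ≥ λ3 are taken to be the eigenvalues of the 3 x 3 covariance matrix of the three attributes (size, surface brightness, velocity dispersion).

```latex
\text{variance captured by PC } k \;=\; \frac{\lambda_k}{\lambda_1 + \lambda_2 + \lambda_3},
\qquad
\underbrace{69.4193\%}_{\text{PC 1}} + \underbrace{12.1333\%}_{\text{PC 2}} \;=\; 81.5526\%
```

The first two components therefore capture about 81.6% of the variance, roughly 6.4 percentage points below the figure of about 88% cited from the astronomy literature.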

Page 15

Algorithm for Distributed Covariance Computation

A central co-ordination site S sends A and B a random number generation seed

A and B generate an n x l random matrix R, where l << n

A and B send S the projections R^T A and R^T B

S computes (R^T A)^T (R^T B) / n
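The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the function names are hypothetical, the data is synthetic, and R is assumed to have independent standard normal entries. Under that assumption the expected value of R R^T is l times the identity, so the sketch divides by l to estimate A^T B. The slide does not state how R is scaled, so its normalization constant may follow a different convention.

```python
import random


def random_matrix(seed, n, l):
    """n x l matrix of i.i.d. standard normal entries, reproducible from seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(l)] for _ in range(n)]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]


def site_projection(seed, data, l):
    """Run at a data site: return R^T * data (l rows), never the raw data."""
    R = random_matrix(seed, len(data), l)
    return matmul(transpose(R), data)


def coordinator_estimate(proj_a, proj_b, l):
    """Run at S: estimate A^T B as (R^T A)^T (R^T B) / l (assumed scaling)."""
    prod = matmul(transpose(proj_a), proj_b)
    return [[v / l for v in row] for row in prod]


# Synthetic demo: 2 attributes at site A and 1 at site B, as in the case study.
n, l, seed = 1000, 100, 2024
gen = random.Random(0)
A = [[gen.gauss(0, 1), gen.gauss(0, 1)] for _ in range(n)]
B = [[row[0] + 0.1 * gen.gauss(0, 1)] for row in A]  # correlated with A's first column

estimate = coordinator_estimate(
    site_projection(seed, A, l), site_projection(seed, B, l), l
)
exact = matmul(transpose(A), B)
print("estimate:", estimate)
print("exact:   ", exact)
```

Each site ships an l x (number of attributes) matrix instead of n rows, which is where the communication saving comes from when l << n. The estimate is noisy for small l and tightens as l grows.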

Page 16

Experimental Results: Distributed Setting

Case Study

1249 cross matched objects (tuples) at sites A and B

2 attributes at site A and 1 attribute at site B

Page 17

More results

Page 18

Development of a Web Service

Architecture of the Proposed System

[Architecture diagram: CLIENT, SITE A, SITE B, and the WEB SERVICE for Distributed Covariance Computation; two links are labeled "SOAP message"]

Page 19

Current Implementation

Using Apache Axis (SOAP engine: a framework for making SOAP processors such as clients and servers)

Tomcat version 4.1

SOAP version 1.2

Short demo

Further system development issues (use of SOAP with attachments)
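To give a feel for what a request to such a service might look like, the sketch below builds a SOAP 1.2 envelope with Python's standard library. Only the SOAP 1.2 envelope namespace is standard. The service namespace, the operation name `computeCovariance`, and the parameters `seed` and `projectionDim` are hypothetical placeholders, since the slides do not show the service's actual interface (which was built with Apache Axis).

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SVC_NS = "urn:example:distributed-covariance"          # hypothetical service namespace


def build_request(seed, l):
    """Return a SOAP 1.2 envelope (as text) for a hypothetical covariance request."""
    ET.register_namespace("env", SOAP12_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    # Hypothetical operation and parameter names, for illustration only.
    op = ET.SubElement(body, f"{{{SVC_NS}}}computeCovariance")
    ET.SubElement(op, f"{{{SVC_NS}}}seed").text = str(seed)
    ET.SubElement(op, f"{{{SVC_NS}}}projectionDim").text = str(l)
    return ET.tostring(envelope, encoding="unicode")


message = build_request(seed=42, l=100)
print(message)
```

Small parameters such as the seed fit naturally in the envelope body. The projected matrices are bulkier, which is presumably why the slide lists SOAP with attachments as an open development issue.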

Page 20

QUESTIONS?