The Next Generation Virgo Cluster Survey (NGVS) is one of six science projects integrated into CANFAR during its development phase. It is a 104 square degree survey of the Virgo Cluster of galaxies in 5 optical bands (ugriz), utilizing the MegaCam camera on the Canada-France-Hawaii Telescope, with a limiting magnitude of 25.7 (10σ point source) in the g band. The survey will revolutionize the science of this prototypical high-density environment in the local universe.
The survey data size, while not extremely large by modern standards, still represents a substantial dataset that will be amenable to data mining. The expected final dataset is 50 TB, processed by two independent pipelines: MegaPipe at CADC, and TERAPIX at the Institut d'Astrophysique in Paris.
Currently, the problem of deducing cluster membership absent spectroscopic redshifts remains unsolved.
K-Means Clustering
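The aim of K-means is to assign points in a parameter space to clusters, in an unsupervised manner, such that the sum of squared distances from each point to the mean of its cluster is minimized:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$
for observations $x_j$, $k$ clusters $S_i$, and cluster means $\mu_i$.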
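The K-means objective, the sum of squared distances from each observation to its assigned cluster mean, is commonly minimized with Lloyd's algorithm, which alternates between assigning points to the nearest mean and recomputing the means. The following is a minimal sketch in plain Python, not the SkyTree implementation. The function names `sq_dist`, `kmeans`, and `objective`, and the choice to pass the initial means explicitly, are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_means, max_iter=100):
    """Lloyd's algorithm. Returns (means, labels)."""
    means = [list(m) for m in init_means]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(len(means)), key=lambda i: sq_dist(p, means[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the means are unchanged too
        labels = new_labels
        # Update step: each mean becomes the centroid of its members.
        for i in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous mean
                means[i] = [sum(c) / len(members) for c in zip(*members)]
    return means, labels


def objective(points, means, labels):
    """Within-cluster sum of squared distances (the quantity K-means minimizes)."""
    return sum(sq_dist(p, means[lab]) for p, lab in zip(points, labels))


if __name__ == "__main__":
    # Two well-separated toy groups in a 2D parameter space (illustrative data).
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    means, labels = kmeans(pts, init_means=[pts[0], pts[4]])
    print(labels, means, objective(pts, means, labels))
```

Lloyd's algorithm converges only to a local minimum that depends on the initial means, so in practice it is usually restarted from several initializations and the run with the lowest objective is kept.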
Here, we perform dimension reduction (currently PCA) and run the SkyTree kmeans algorithm on the first three principal components (PCs).
We describe ongoing astroinformatics work at the Canadian Astronomy Data Centre (CADC). With a collection of over 0.5 petabytes of information, and serving nearly 3000 astronomers worldwide, CADC is one of the world's largest astronomy data centres. Its staff combines astronomers and computer specialists, and the resulting interaction between experts is well suited to fostering developments within astroinformatics. One of CADC's ongoing goals is to retain science drivers as the primary motivator at each step of the process, from the receipt of raw data from telescopes, to the release of that data, and its use by scientists. Thus, the developments remain guided by maximal benefit to the astronomy community.
The Canadian Advanced Network for Astronomical Research (CANFAR) is a University of Victoria and CADC project that builds on the existing CADC infrastructure to provide the storage, processing, and analysis tools needed to enable astronomers to perform data-intensive astronomy on current and next generation datasets. CANFAR provides a Virtual Cluster, accessed via a Virtual Machine environment, over which the user has complete control, and access to Cloud Computing on the Compute Canada Grid. Its services are compliant with the International Virtual Observatory Alliance standards. Hence, rather than build a new infrastructure for a project such as a sky survey, an individual or collaboration may utilize CANFAR.
Although the infrastructure provided by CANFAR is vital, its main focus is on the basic storage and processing of data. To apply methods such as KDD, machine learning, and data mining, further software must be run. Just as CANFAR provides the generic hardware portions of a data processing pipeline, we implement fast, scalable data mining algorithms that simplify the generic portions of KDD within current and future datasets, further enabling practical data-intensive astronomy.
We show an example of the use of the SkyTree software to perform K-means clustering to determine which galaxies in the Next Generation Virgo Cluster Survey (NGVS) are cluster members. This problem is unsolved within the survey.
Astroinformatics at the Canadian Astronomy Data Centre
Nicholas M. Ball
Canadian Astronomy Data Centre, Herzberg Institute of Astrophysics, Victoria, BC, Canada
http://sites.google.com/site/nickballastronomer [email protected]
Introduction
Virgo Cluster Membership via K-Means
Dowler, P., et al., 2008, Common Archive Observation Model. ADASS XVII, ASP Conference Proceedings, Vol. 394, eds. Argyle R.W., Bunclark P.S., Lewis J.R., pp. 426–429
Gaudet, S., the CADC team, 2011, Virtualization and Grid Utilization within the CANFAR Project. ADASS XX, ASP Conference Proceedings, Vol. 442, eds. Evans I.N., Accomazzi A., Mink D.J., Rots A.H., pp. 61–64
This research used the facilities of the Canadian Astronomy Data Centre, operated by the National Research Council of Canada with the support of the Canadian Space Agency. Funding for CANFAR was provided by CANARIE via the Network Enabled Platforms Supporting Virtual Organisations program.
The IVOA Interest Group in Knowledge Discovery in Databases (led by G. Longo) aims to deploy practical data mining algorithms of use to astronomers:
“We will develop and test scalable data mining algorithms and the accompanying new standards for VO interfaces and protocols, so that these algorithms can be discovered and used transparently within VO science workflows or in standalone data exploration applications.”
KDD-IG Charter, 2010
As part of achieving these aims, we are constructing an online guide to data mining in astronomy. Prior to this guide, no such tool existed. The guide is designed for the astronomer who is interested in using the methods of data mining to improve their science return, but whose main priority remains getting their science done. The guide is currently situated at http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide
References & Acknowledgments
Figure 2: K-means clustering results, showing normalized cluster membership for several subsets of objects as a function of cluster number (35 clusters in this case).
Astroinformatics at CADC
Astroinformatics will become the only way to render future datasets comprehensible. It will become increasingly impractical to download data, hence an infrastructure is required in which the data analysis can be done in situ, without the need for downloading and local processing.
A significant proportion of the KDD component of CADC's astroinformatics has been in the context of the science requirements of the Next Generation Virgo Cluster Survey (NGVS), guided by the author's science interests, e.g., the galaxy luminosity function. We show an example here.
Canadian Astronomy Data Centre
The Canadian Astronomy Data Centre (CADC), based at the Herzberg Institute of Astrophysics in Victoria, BC, is one of the largest astronomy data centres in the world. Founded in 1986, it currently holds over 500 TB of data, and has served over 100 TB to more than 5000 distinct IP addresses worldwide. CADC combines the expertise of astronomers and computer specialists, and is hence well placed to realize the scientific benefits of astroinformatics.
Clustering is run on the first three principal components. The kmeans implementation includes a facility to determine the optimal number of clusters via cross-validation.
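As an illustration of the dimension reduction step, the sketch below projects a catalogue of objects onto its first three principal components before any clustering is run. It is not the pipeline used for the NGVS: it uses only the Python standard library, finding each component by power iteration with deflation, whereas a real analysis would normally rely on a linear algebra library. All function names and the toy data are assumptions for illustration. The cross-validation used to select the number of clusters is not sketched, since its details are not given here.

```python
def centre(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centred):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centred), len(centred[0])
    return [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def mat_vec(mat, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in mat]


def leading_eigenpair(mat, n_iter=500):
    """Power iteration: dominant eigenvalue and eigenvector of a symmetric
    positive semi-definite matrix such as a covariance matrix."""
    v = [1.0 / (j + 1) for j in range(len(mat))]  # arbitrary non-zero start
    for _ in range(n_iter):
        w = mat_vec(mat, v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break  # nothing left to find (matrix is numerically zero)
        v = [x / norm for x in w]
    eigval = sum(x * y for x, y in zip(v, mat_vec(mat, v)))
    return eigval, v


def pca_scores(rows, n_components=3):
    """Return (scores, components): each row projected onto the leading PCs."""
    centred = centre(rows)
    cov = covariance(centred)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigval, v = leading_eigenpair(cov)
        components.append(v)
        # Deflation: remove the variance along the component just found.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]
    scores = [
        [sum(x * c for x, c in zip(row, comp)) for comp in components]
        for row in centred
    ]
    return scores, components


if __name__ == "__main__":
    # Toy 4-parameter catalogue whose variance is largest along the first axis.
    catalogue = [(4, 0, 0, 0), (-4, 0, 0, 0), (0, 3, 0, 0), (0, -3, 0, 0),
                 (0, 0, 2, 0), (0, 0, -2, 0), (0, 0, 0, 1), (0, 0, 0, -1)]
    scores, _ = pca_scores(catalogue, n_components=3)
    print(scores[0])
```

The resulting three-column `scores` table is what a clustering algorithm would then take as input.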
Results
Initial results show that the procedure is able to discern meaningful groups of galaxies within the PC1-PC2-PC3 space (Figure 2), e.g., cluster members and background galaxies confirmed by spectroscopy, saturated stars, and bright, low surface brightness artifacts.
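One plausible reading of the "normalized cluster membership" plotted in Figure 2 is, for each subset of objects, the fraction of that subset assigned to each cluster. The sketch below computes that quantity under this assumption; the function name `normalized_membership`, the subset names, and the toy assignments are hypothetical and are not taken from the NGVS analysis.

```python
from collections import Counter


def normalized_membership(cluster_ids, subset_names, n_clusters):
    """For each subset, return the fraction of its objects in each cluster.

    cluster_ids[i] is the cluster assigned to object i and subset_names[i]
    is the subset (e.g. a spectroscopic class) that object i belongs to.
    """
    counts = {}
    for cid, name in zip(cluster_ids, subset_names):
        counts.setdefault(name, Counter())[cid] += 1
    result = {}
    for name, counter in counts.items():
        total = sum(counter.values())
        # A Counter returns 0 for clusters that contain no object of this subset.
        result[name] = [counter[i] / total for i in range(n_clusters)]
    return result


if __name__ == "__main__":
    # Toy example with 3 clusters and 3 subsets (illustrative values only).
    ids = [0, 0, 1, 2, 2, 2]
    subsets = ["member", "member", "member", "background", "background", "star"]
    for name, fractions in normalized_membership(ids, subsets, 3).items():
        print(name, fractions)
```

Each row of fractions sums to one, so subsets of very different sizes can be compared on the same axis.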
Future Work
Obvious refinements include:
• More thorough removal of image/catalogue artifacts
• Testing of non-linear dimension reduction, e.g., kernel PCA
• Use of prior knowledge, e.g., constrained K-means, guided by object spectra
• Probabilistic cluster membership
• Detailed characterization of the objects contained in each cluster
CANFAR & CVO
The Canadian Advanced Network for Astronomical Research (CANFAR; http://canfar.phys.uvic.ca), led by Chris Pritchet at the University of Victoria, and contracted to CADC, is:
“a project ... to provide the delivery, processing, storage, analysis, and distribution of astronomical datasets of unprecedented size. ... The project builds on CADC's existing infrastructure to provide IVOA-compliant tools and services for astronomers, and access to Cloud Computing on the Compute Canada Grid, via a Virtual Machine environment.”
CANFAR Statement of Work, 2008
CANFAR Usage
CANFAR provides a Virtual Cluster, accessible to an individual user or collaboration (Virtual Organization). Each user operates within their own Virtual Machine environment, over which they have complete control. This provides access to CANFAR services, and the Cloud Computing resources of Compute Canada (Figure 1).
VO-Compliant Web Services
Outward-facing CANFAR services use IVOA-compliant protocols, e.g., TAP and VOSpace for data services, UWS for processing, and TLS and X.509 grid certificates for security. Inward-facing infrastructure builds on existing CADC resources. For cloud computing, Cloud Scheduler, Condor, Nimbus, and iRODS are used.
In addition, CADC data are available in IVOA-compliant form via the Common Archive Observation Model of the Canadian Virtual Observatory (Dowler et al. 2008).
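To make the role of TAP concrete, the sketch below builds the URL for a synchronous TAP query. In the IVOA TAP protocol, a synchronous ADQL query is sent to the `sync` resource under the service base URL with the parameters `REQUEST=doQuery`, `LANG=ADQL`, and `QUERY`. The base URL `https://example.org/tap` and the table name `demo.observations` are placeholders, not CADC endpoints or CAOM table names, and the helper `tap_sync_url` is hypothetical. No network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def tap_sync_url(base_url, adql, maxrec=None):
    """Build the URL of a synchronous TAP query (GET form)."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    if maxrec is not None:
        params["MAXREC"] = str(maxrec)  # optional cap on the number of rows
    return base_url.rstrip("/") + "/sync?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder service and table: substitute a real TAP service to run a query.
    url = tap_sync_url(
        "https://example.org/tap",
        "SELECT TOP 5 * FROM demo.observations",
        maxrec=5,
    )
    print(url)
    print(parse_qs(urlsplit(url).query))
```

Because the protocol is standardized, the same URL pattern applies to any TAP-compliant service, which is the practical benefit of the IVOA-compliant interfaces described above.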
Figure 1: CANFAR infrastructure: Virtual Organizations, such as individual users or survey teams, access CANFAR and Compute Canada resources via a Virtual Machine environment. Green boxes show new components resulting from CANFAR. From Gaudet et al. (2011).
CANFAR Science
Several hundred thousand processor hours have been logged on CANFAR in aid of science projects. In particular, six projects, including the NGVS, have been integral to CANFAR's development. Extensive analysis that would not be possible on a desktop, e.g., running the NGVS MegaPipe pipeline or fitting galaxy profiles, is now being performed, including by astronomers who are not data specialists.
Guide to Data Mining in Astronomy
Fast Data Mining Algorithms
Until recently, most data mining algorithms have scaled as N², rendering them intractable for modern datasets. However, fast libraries which implement data mining algorithms scaling as N log N, or better, are now available.
Thus the installation of such libraries on the CADC infrastructure will enable the practical use of these algorithms by astronomers who are not data mining specialists, enabling useful science.
While each specific usage of such software will remain science-driven, the underlying tools are not dataset-specific, hence the effort to make available such generic tools is appropriate.
We have installed and are running the SkyTree software (http://www.fast-lab.org), and have confirmed that its algorithms scale as required.
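The difference between N² and N log N scaling can be illustrated with a toy problem: counting the pairs of values that lie within a distance r of each other. The sketch below is a one-dimensional illustration only and does not represent the SkyTree algorithms. The brute-force version compares every pair, while the second version sorts once and uses binary search. Fast libraries typically obtain comparable savings in higher dimensions with spatial tree structures. Both function names are assumptions for illustration.

```python
from bisect import bisect_right


def count_pairs_brute(xs, r):
    """Count pairs with |x_i - x_j| <= r by testing every pair: O(N^2)."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(xs[i] - xs[j]) <= r
    )


def count_pairs_sorted(xs, r):
    """Same count using one sort plus a binary search per element: O(N log N)."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last element that is <= x + r.
        upper = bisect_right(s, x + r)
        # Elements strictly after position i and before `upper` pair with x.
        total += upper - (i + 1)
    return total


if __name__ == "__main__":
    values = [5, 1, 9, 3, 4, 10]  # illustrative integer data
    print(count_pairs_brute(values, 2), count_pairs_sorted(values, 2))
```

For a million objects, the brute-force count needs of order 10¹² comparisons, whereas the sorted version needs of order 10⁷ operations, which is the practical gap the fast libraries are designed to close.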
[Figure 1 diagram; layout not recoverable. Component labels include: Astronomer's Desktop, CANFAR enabled applications, HTTP clients, VO enabled applications, Collaboration User Interfaces, Wiki, Survey Web Page, Virtual Organization Management, Existing External Services, Data Services, Processing Services, Data Web Service, Condor Web Service, VOSpace, TAP, UWS, GMS, Virtual Cluster Scheduler (Condor), Cloud Scheduler, Nimbus, Virtual Resource Advertisement, Processing State Management, Data Storage (AD), Data Storage (iRODS), CADC Storage Cluster, CADC Processing Cluster, CADC RDBMS, Grid Storage Clusters, Grid Processing Clusters.]