Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey
Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Dept. of Physics and Astronomy, The Johns Hopkins University
Jim Gray, Don Slutz, Microsoft Research
Robert J. Brunner, California Institute of Technology
Towards the Digital Sky
Goal: interactive exploration of astronomical data
    efforts underway to capture digital images of the sky
    multiple wavelengths: x-rays, ultraviolet, visible, infrared
    diverse data types: images, text, numerical attributes
    data is big: set of multi-TB archives
    no need to wait for access to a telescope
NGC 5033, from “Image of the week”; 1/5000 of first light image, May 27-28, 1998
Astronomy 101
Celestial Sphere
©Sky Publishing Corp
Declination (degrees)
Right ascension (time: h, m, s)
Surface area: “square” degrees
    unit of solid angle
    sphere = 41252.96 deg²
Arcminute = 1/60 degree
Arcsecond = 1/60 arcminute
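The units on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `ra_degrees_to_hms` is hypothetical, and the conversion assumes the usual convention that 24 hours of right ascension span 360 degrees (15 degrees per hour), which the slide implies but does not state.

```python
import math

# Whole-sphere solid angle in square degrees: 4*pi steradians,
# with one steradian equal to (180/pi)^2 square degrees.
SPHERE_DEG2 = 4 * math.pi * (180 / math.pi) ** 2

ARCMIN_PER_DEG = 60          # 1 arcminute = 1/60 degree
ARCSEC_PER_DEG = 60 * 60     # 1 arcsecond = 1/60 arcminute


def ra_degrees_to_hms(ra_deg):
    """Convert right ascension in degrees to (hours, minutes, seconds).

    Assumes 24 h of right ascension = 360 degrees (15 degrees per hour).
    """
    hours_total = (ra_deg % 360.0) / 15.0
    hours = int(hours_total)
    minutes_total = (hours_total - hours) * 60.0
    minutes = int(minutes_total)
    seconds = (minutes_total - minutes) * 60.0
    return hours, minutes, seconds


print(round(SPHERE_DEG2, 2))       # 41252.96
print(ra_degrees_to_hms(187.5))    # (12, 30, 0.0)
```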
Sloan Digital Sky Survey
Goals (1999)
➲ Map ~10000 deg² of northern sky (~1/4 celestial sphere)
➲ Determine position and brightness of 100M celestial objects
➲ Measure distance to 1M galaxies, create 3D model
➲ Measure distance to 100K quasars
➲ Make data available to the public
As of data release 6 (data through June 2006)
➲ Images, attributes of ~287M objects over 9583 deg²
➲ 1.27 million spectra of stars, galaxies, quasars and blank sky (for sky subtraction) over 7425 deg²
➲ Additional estimates of stellar temperatures, gravities, metallicities
➲ Data, search tools available on web (http://skyserver.sdss.org)
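As a rough check of the coverage figures, the areas quoted above can be expressed as fractions of the whole sphere (41252.96 square degrees). The sketch below uses only numbers from the slides; the helper name `sky_fraction` is hypothetical.

```python
SPHERE_DEG2 = 41252.96  # whole celestial sphere, in square degrees


def sky_fraction(area_deg2):
    """Fraction of the celestial sphere covered by an area in square degrees."""
    return area_deg2 / SPHERE_DEG2


goal_fraction = sky_fraction(10000)     # 1999 imaging goal
dr6_imaging = sky_fraction(9583)        # DR6 imaging footprint
dr6_spectra = sky_fraction(7425)        # DR6 spectroscopic footprint

print(round(goal_fraction, 3))  # 0.242, i.e. roughly 1/4 of the sphere
print(round(dr6_imaging, 3))    # 0.232
print(round(dr6_spectra, 3))    # 0.18
```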
Where is the data acquired?
➲ Apache Point Observatory (APO), Sunspot, NM
    far away from large cities: dark night sky
    altitude: 9200 feet
    little water vapor
    few pollutants
    many cloudless, moonless nights!
Photo: Fermilab Visual Media Services
Telescopes
➲ 2.5 meter reflecting light telescope
    wide angle: 3° field of view (diameter of ~30 full moons)
    camera: 120 Mpixel, 30 CCDs, each 2” square, 5 color filters
    2 spectrographs measure spectra of ~600 objects at once
    generates up to 200 GB/night
Photos: Fermilab Visual Media Services
Telescopes
➲ 0.5 meter photometric telescope
    used to monitor atmosphere during survey (temperature, pressure)
    calibrate brightness of objects captured by main telescope
Photos: Fermilab Visual Media Services
Drift scan imaging
➲ Telescope is positioned once
➲ Images taken as sky moves past
    Reading of CCD lines synchronized with sky movement
    Exposure time: 55 sec
    Two scans (runs) form a stripe
    5-color columns split into fields, 2048x1489
    2B/pixel 5-color images (+ ~60 attributes)
➲ Output: photometric catalog
    Atlas images, 500+ attributes for each of 100M galaxies, 100M stars, 1M quasars
    Attributes: position, magnitude, size, color, ...
Image: Christoph Flohr, www.driftscan.com
M45 The Pleiades
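The field dimensions on the drift scan slide imply a fixed raw size per field. A back-of-the-envelope sketch, using only the 2048x1489 field size, 2 bytes per pixel and 5 color bands quoted on the slide (the variable names are illustrative, and attribute storage is not counted):

```python
FIELD_WIDTH_PX = 2048
FIELD_HEIGHT_PX = 1489
BYTES_PER_PIXEL = 2
COLOR_BANDS = 5

pixels_per_field = FIELD_WIDTH_PX * FIELD_HEIGHT_PX
bytes_per_band = pixels_per_field * BYTES_PER_PIXEL
bytes_per_field = bytes_per_band * COLOR_BANDS

print(pixels_per_field)           # 3049472
print(bytes_per_band)             # 6098944
print(bytes_per_field)            # 30494720
print(bytes_per_field / 1e6)      # about 30.5 MB of pixel data per 5-color field
```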
Spectroscopic survey
➲ Target specific objects automatically chosen from photometric survey
    1M galaxies, 100K stars, 100K quasars
    Up to 5000 spectra collected per night
➲ Classify objects (stars, galaxies, quasars...)
    template matching against standard spectra for each object class
    examine spectra for object properties (e.g., chemical composition)
➲ Create 3D map of galaxy distribution
    Measure distance using Doppler shift
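How a Doppler shift turns into a distance can be sketched in a few lines. The slides do not give the method, so everything here is an assumption for illustration: the low-redshift approximation v ≈ c·z, Hubble's law d = v / H0, and an assumed Hubble constant of 70 km/s/Mpc. The function names are hypothetical.

```python
SPEED_OF_LIGHT_KM_S = 299792.458
HUBBLE_CONSTANT = 70.0  # km/s/Mpc, an assumed value, not taken from the slides


def redshift(observed_wavelength, rest_wavelength):
    """Redshift z from an observed and a rest-frame wavelength (same units)."""
    return (observed_wavelength - rest_wavelength) / rest_wavelength


def distance_mpc(z, h0=HUBBLE_CONSTANT):
    """Approximate distance in megaparsecs; valid only for small z (v ~ c*z)."""
    velocity_km_s = SPEED_OF_LIGHT_KM_S * z
    return velocity_km_s / h0


# Example: a spectral line with rest wavelength 6562.8 observed at 7219.08
z = redshift(7219.08, 6562.8)
print(z, distance_mpc(z))
```

Together with right ascension and declination, the distance gives the third coordinate needed for a 3D map.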
Data archives
raw data
    FedEx tapes to FermiLab for processing, reduction
operational archive
    processed data in instrumental form
    perform calibration
    information for target selection
science archive
    object catalog: positions, magnitudes, colors, sizes, radial profiles, classifications, etc. for over 100 million objects
    housekeeping data: calibrations and logs
    atlas images in 5 colors for all identified objects
    one-dimensional spectra of all spectroscopic targets
local archive
    replica of science archive
public archive
    scientifically verified
    recalibrated (if necessary)
Typical queries
Q1: Find all galaxies without saturated pixels within 1 arcsecond of a given point in the sky (right ascension and declination).
spatial lookup
Q2: Find all galaxies with blue surface brightness between 30 and 40, and -10 < super galactic latitude (sgb) < 10, and declination less than zero.
search for galaxies with a specified blue brightness in a given region of sky
coordinate system needs translation
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.
local extinction indicates amount of dust in a given direction (dust masks light)
Q15: Provide a list of moving objects consistent with an asteroid.
Objects are classified as moving: 5 successive observations from the 5 color bands.
SQL: select moving object where sqrt((deltax5-deltax1)^2 + (deltay5-deltay1)^2) < 2 arcseconds.
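Two of the queries above, Q1 and Q15, can be sketched outside the database as plain Python filters. This is not the SkyServer SQL; it is an illustration under stated assumptions: objects are dictionaries with hypothetical keys (`type`, `saturated`, `moving`, `dx1`, `dy1`, `dx5`, `dy5`), Q1 is read as excluding objects with saturated pixels, and the sample coordinates are made up.

```python
import math


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def q1_cone_search(objects, ra, dec, radius_arcsec=1.0):
    """Q1: unsaturated galaxies within radius_arcsec of (ra, dec)."""
    return [o["id"] for o in objects
            if o["type"] == "galaxy" and not o["saturated"]
            and separation_arcsec(o["ra"], o["dec"], ra, dec) <= radius_arcsec]


def q15_moving(objects, limit_arcsec=2.0):
    """Q15: moving objects whose band-1 to band-5 displacement is below the limit."""
    return [o["id"] for o in objects
            if o["moving"]
            and math.hypot(o["dx5"] - o["dx1"], o["dy5"] - o["dy1"]) < limit_arcsec]


# Made-up sample data (hypothetical field names and values).
sky = [
    {"id": 1, "type": "galaxy", "saturated": False, "ra": 180.0, "dec": 0.0},
    {"id": 2, "type": "galaxy", "saturated": True, "ra": 180.0, "dec": 0.0},
    {"id": 3, "type": "star", "saturated": False, "ra": 180.0, "dec": 0.0},
    {"id": 4, "type": "galaxy", "saturated": False, "ra": 180.0, "dec": 0.01},
]
movers = [
    {"id": 10, "moving": True, "dx1": 0.0, "dy1": 0.0, "dx5": 1.0, "dy5": 1.0},
    {"id": 11, "moving": True, "dx1": 0.0, "dy1": 0.0, "dx5": 3.0, "dy5": 0.0},
    {"id": 12, "moving": False, "dx1": 0.0, "dy1": 0.0, "dx5": 0.5, "dy5": 0.5},
]

print(q1_cone_search(sky, 180.0, 0.0))  # [1]
print(q15_moving(movers))               # [10]
```

A linear scan like this touches every object, which is why the archive relies on a spatial index for such lookups.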
Database design
Original design based on OODB (ObjectivityDB), changed to relational DB (Reported in SIGMOD 2002)
Alexander S. Szalay, Jim Gray, Ani R. Thakar, Peter Z. Kunszt, Tanu Malik, Jordan Raddick, Christopher Stoughton, Jan vandenBerg. “The SDSS SkyServer – Public Access to the Sloan Digital Sky Server Data”, SIGMOD 2002
80 million objects, 5 color images
target selection
follow-up on selected targets
Schema: photographic objects
PhotoObj: star & galaxy attributes
    records for 80 million objects
    each ~470 attributes (~2KB)
    heavily indexed (“tens of indices”)
    30% of storage space devoted to indices
Field
    processing used for objects in field, all frames
Neighbors
    computed after the data is loaded
    For every object, list of objects within 1/2 arcminute (~10 objects)
Views
    PhotoPrimary: PhotoObj with mode=1 (best instance of deblended object)
    Stars: PhotoPrimary with type='star'
    Galaxies: PhotoPrimary with type='galaxy'
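The layering of views on PhotoObj can be illustrated with an in-memory SQLite database from Python's standard library. This is a sketch, not the SkyServer schema: the table has only three hypothetical columns (`objID`, `mode`, `type`), `type` is stored as text to match the slide, and Galaxies is assumed to be defined over the same PhotoPrimary view as Stars.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PhotoObj (
        objID INTEGER PRIMARY KEY,
        mode  INTEGER,   -- 1 marks the best (primary) instance of an object
        type  TEXT       -- 'star' or 'galaxy' in this toy example
    );
    -- Each view is a stored query layered on the one below it.
    CREATE VIEW PhotoPrimary AS SELECT * FROM PhotoObj WHERE mode = 1;
    CREATE VIEW Stars        AS SELECT * FROM PhotoPrimary WHERE type = 'star';
    CREATE VIEW Galaxies     AS SELECT * FROM PhotoPrimary WHERE type = 'galaxy';
""")

conn.executemany(
    "INSERT INTO PhotoObj (objID, mode, type) VALUES (?, ?, ?)",
    [(1, 1, "star"), (2, 1, "galaxy"), (3, 2, "galaxy"), (4, 1, "galaxy")],
)

star_ids = [row[0] for row in conn.execute("SELECT objID FROM Stars ORDER BY objID")]
galaxy_ids = [row[0] for row in conn.execute("SELECT objID FROM Galaxies ORDER BY objID")]

print(star_ids)    # [1]
print(galaxy_ids)  # [2, 4]  (object 3 is not a primary instance)
```

Queries against Stars or Galaxies then never need to repeat the mode and type predicates.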
Spatial Data Access
Coordinate systems
    right-ascension and declination
    hierarchical triangular mesh (HTM): recursive partitioning of celestial sphere
HTM recursively assigns a number to each point on the sphere
Recursion 20 levels deep: smallest triangles < 0.1 arcsecond on a side
HTM index is built as an extension of SQL Server’s B-trees
Spatial queries use the HTM index to limit searches to small set of triangles
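The recursive partitioning can be sketched as follows. This is a simplified illustration of the idea, not the SDSS HTM library: it starts from the 8 faces of an octahedron, splits each spherical triangle into 4 at the edge midpoints, and records which child contains the point at each level. The root ordering, child ordering, and the way the path is packed into an integer are assumptions made for this example and do not reproduce the real HTM numbering.

```python
import math


def radec_to_xyz(ra_deg, dec_deg):
    """Unit vector for a right ascension / declination given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def midpoint(a, b):
    """Midpoint of a and b, pushed back onto the unit sphere."""
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    n = math.sqrt(dot(s, s))
    return (s[0] / n, s[1] / n, s[2] / n)


def root_triangles():
    """The 8 octahedron faces, each with counterclockwise vertex order."""
    roots = []
    for sx in (1, -1):
        for sy in (1, -1):
            for sz in (1, -1):
                v0, v1, v2 = (sx, 0, 0), (0, sy, 0), (0, 0, sz)
                if sx * sy * sz < 0:       # restore counterclockwise orientation
                    v1, v2 = v2, v1
                roots.append((v0, v1, v2))
    return roots


def inside_score(tri, p):
    """Non-negative when p lies inside the spherical triangle tri."""
    v0, v1, v2 = tri
    return min(dot(cross(v0, v1), p), dot(cross(v1, v2), p), dot(cross(v2, v0), p))


def children(tri):
    """Split a triangle into 4 children at its edge midpoints."""
    v0, v1, v2 = tri
    w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
    return [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]


def htm_path(ra_deg, dec_deg, depth):
    """Root index followed by one child index (0-3) per level of recursion."""
    p = radec_to_xyz(ra_deg, dec_deg)
    roots = root_triangles()
    index = max(range(8), key=lambda i: inside_score(roots[i], p))
    tri, path = roots[index], [index]
    for _ in range(depth):
        kids = children(tri)
        k = max(range(4), key=lambda i: inside_score(kids[i], p))
        path.append(k)
        tri = kids[k]
    return path


def path_to_id(path):
    """Pack a path into one integer, 2 bits per level (illustrative encoding)."""
    value = path[0]
    for k in path[1:]:
        value = value * 4 + k
    return value


path = htm_path(185.0, 15.0, 20)
print(path[:5], path_to_id(path))
```

Because a deeper path always extends the shallower one, all points inside a given triangle share a common prefix. A prefix corresponds to a contiguous range of IDs, which is what allows a B-tree range scan to serve as a spatial filter.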
Thoughts on server architecture
➲ Use commodity servers and storage
    Processors, memory costs 10x lower than high end
    Storage cost 3x lower
    Deploy as much processing as one can afford
➲ Partition data spatially
    Repartition as servers added, removed
➲ Replicate high traffic data
➲ Exploit parallelism
➲ Deploy as network service initially
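Spatial partitioning with repartitioning can be sketched very simply. The slides do not say how the data would be split, so the scheme below (equal-width right ascension slices, one per server) is purely an assumption for illustration, and the function names are hypothetical.

```python
def assign_server(ra_deg, n_servers):
    """Map a right ascension (degrees) to one of n equal-width RA slices."""
    slice_width = 360.0 / n_servers
    return min(int((ra_deg % 360.0) / slice_width), n_servers - 1)


def partition(objects, n_servers):
    """Group (object_id, ra_deg) pairs into one list of IDs per server."""
    shards = [[] for _ in range(n_servers)]
    for object_id, ra_deg in objects:
        shards[assign_server(ra_deg, n_servers)].append(object_id)
    return shards


catalog = [(1, 10.0), (2, 130.0), (3, 250.0), (4, 359.9)]

three_servers = partition(catalog, 3)   # 120-degree slices
four_servers = partition(catalog, 4)    # repartition after adding a server

print(three_servers)  # [[1], [2], [3, 4]]
print(four_servers)   # [[1], [2], [3], [4]]
```

Equal-width slices are easy to reason about but move a lot of data when the server count changes, and they ignore the uneven density of objects on the sky. A real deployment would likely need something more balanced.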
SkyServer
SDSS DR1 is about 900GB (3.4B rows)
SkyServer cluster
➲ Web front ends (3)
    Hardware: Dell PowerEdge 1750 servers, 2GB memory, dual Gbit Ethernet, 2 x 36GB Ultra320 SCSI disks, RAID1
    Software: Windows Server 2003, IIS 6.0, Microsoft Network Load Balancing
➲ Database servers (3)
    1 DB server - short queries on the public website
    2 DB servers - longer queries for registered users, failover
    Hardware: Dell 4600 database servers, 4GB memory, 1.2 TB of 10k rpm Ultra SCSI drives, 4 drives/SCSI channel, RAID0
    Software: Windows Server 2003 and SQL Server 2000
    Data rates: 400MBps (simple query), 160-200 Mbps (typical multi user load)
➲ Log server (1)
    same configuration as DB server?
    all back-ends on private network
http://skyserver.sdss.org
Table          Records  Bytes
Field          14k      60MB
Frame          73k      6GB
PhotoObj       14m      31GB
Profile        14m      9GB
Neighbors      111m     5GB
Plate          98       80KB
SpecObj        63k      1GB
SpecLine       1.7m     225MB
SpecLineIndex  1.8m     142MB
xcRedShift     1.9m     157MB
elRedShift     51k      3MB
Major tables, records and sizes. Indices double the storage. (SIGMOD 2002)
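The table figures can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 KB = 1000 bytes) and reads "14m" as 14 million records; both are interpretations of the slide rather than stated facts.

```python
UNIT_BYTES = {"KB": 10**3, "MB": 10**6, "GB": 10**9}

# (size, unit) per table, as listed on the slide.
TABLE_SIZES = {
    "Field": (60, "MB"), "Frame": (6, "GB"), "PhotoObj": (31, "GB"),
    "Profile": (9, "GB"), "Neighbors": (5, "GB"), "Plate": (80, "KB"),
    "SpecObj": (1, "GB"), "SpecLine": (225, "MB"), "SpecLineIndex": (142, "MB"),
    "xcRedShift": (157, "MB"), "elRedShift": (3, "MB"),
}


def size_bytes(table):
    """Size of one table in bytes, using decimal units."""
    value, unit = TABLE_SIZES[table]
    return value * UNIT_BYTES[unit]


total_bytes = sum(size_bytes(name) for name in TABLE_SIZES)
total_with_indices = 2 * total_bytes           # "indices double the storage"
photoobj_bytes_per_record = size_bytes("PhotoObj") / 14_000_000

print(total_bytes / 1e9)                 # about 52.6 GB of table data
print(total_with_indices / 1e9)          # about 105 GB including indices
print(round(photoobj_bytes_per_record))  # 2214, consistent with ~2KB per PhotoObj record
```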