Stephen Gwyn, Canadian Astronomy Data Centre
Aggregating Metadata from Multiple Archives: a Non-VO Approach
- Astronomy is using more and more archival data
  - More than 50% of HST papers are archival
  - Similar trends for other telescopes
- Harder for solar system astronomy
SSOIS: Solar System Object Image Search allows users to search for images of moving targets
Initially, only data from CFHT/MegaCam was searched
Next added data from external telescope archives:
NEAT, CFHT, Subaru, ESO, Gemini, AAT, SDSS, NOAO, ING
Scraping external archives:
For each image, we need:
- position (RA, Dec)
- field of view
- MJD of mid-exposure
- filter
- exposure time
- target name
- URL to data
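The per-image fields listed above could be held in a small record type. The following is a minimal sketch, not the SSOIS schema: the class name, field names, and units are assumptions chosen for illustration. The helper shows one way to derive the MJD of mid-exposure when an archive only reports the exposure start and duration.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86400.0


@dataclass
class ImageRecord:
    """One harvested image (hypothetical field names and units)."""
    ra_deg: float       # position: right ascension, degrees
    dec_deg: float      # position: declination, degrees
    fov_deg: float      # field of view, degrees
    mjd_mid: float      # MJD of mid-exposure
    filter_name: str    # filter
    exptime_s: float    # exposure time, seconds
    target: str         # target name
    url: str            # URL to data


def mid_exposure_mjd(mjd_start: float, exptime_s: float) -> float:
    """Convert an exposure start MJD and duration to the mid-exposure MJD."""
    return mjd_start + 0.5 * exptime_s / SECONDS_PER_DAY
```

Keeping every archive's rows in one shape like this means the downstream search does not need to know which archive a row came from.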
There are a variety of data archive interfaces....
Scraping external archives:
- In an ideal world: one query to get all metadata
- In real life: row limits
- As the archives are updated, they need to be re-scraped periodically
- Programmatic retrieval is required
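One common way to live with a row limit is to split a query range whenever the result looks truncated. The sketch below assumes a hypothetical `fetch(lo, hi)` callable that returns the rows whose query parameter (for example, observation date) falls in the half-open range [lo, hi), capped at `row_limit` rows. It illustrates the idea and is not how any particular archive client works.

```python
def harvest(fetch, lo, hi, row_limit, min_width=1e-6):
    """Collect all rows in [lo, hi), halving the range when a query is truncated.

    fetch      -- hypothetical callable: fetch(lo, hi) -> list of rows
    row_limit  -- maximum number of rows the archive returns per query
    min_width  -- stop splitting below this width to guarantee termination
    """
    rows = fetch(lo, hi)
    # Fewer rows than the limit means the result is complete.
    if len(rows) < row_limit or (hi - lo) <= min_width:
        return rows
    # Otherwise the result may be truncated: query each half separately.
    mid = (lo + hi) / 2.0
    left = harvest(fetch, lo, mid, row_limit, min_width)
    right = harvest(fetch, mid, hi, row_limit, min_width)
    return left + right
```

Because the ranges are half-open, no row is fetched twice, and a periodic re-scrape only needs to call `harvest` on the range added since the last run.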
Use SIAP?
Advantages:
- A single tool can scrape multiple archives
Disadvantages:
- Not all archives have an SIAP interface
- Many SIAP services do not conform to the VO standard
- Not all SIAP services contain all the necessary metadata
- Most archives have at least one heavily observed patch of sky: hit the row limit again
- SIAP services vary in their ability to handle positional queries:
  - maximum search area
  - search is circle or box
- May require 10^5 queries, which may be perceived as a DoS attack
Far better off scraping by day/night/MJD:
- Almost all telescopes take <10000 observations per 24 hours
- Can re-scrape with fewer queries
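Scraping by date amounts to walking through one-day MJD windows and issuing one query per window. A minimal sketch of generating those windows follows. The function names are illustrative; the only fact relied on is that MJD 0 falls on 1858-11-17.

```python
from datetime import date

MJD_EPOCH = date(1858, 11, 17)  # MJD 0 corresponds to this calendar date


def date_to_mjd(day: date) -> int:
    """Integer MJD at 0h UT of the given calendar date."""
    return (day - MJD_EPOCH).days


def daily_windows(start: date, end: date):
    """Yield (mjd_start, mjd_end) one-day windows covering [start, end)."""
    for mjd in range(date_to_mjd(start), date_to_mjd(end)):
        yield (mjd, mjd + 1)
```

With fewer than 10000 observations per 24 hours, each window should stay under a typical row limit, and a re-scrape only needs the windows since the last harvest.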
Scraping by RA/Dec
Scraping by Date
Older archive interfaces:
- Query page + simple CGI result page
- View source on the query page
- Get form inputs
- Issue repeated queries to the CGI result page using GET or POST with wget/curl/scripting API
- Easy
http://astronomydata.edu/query?ra=12.87&dec=13.52&mjd=57323
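The repeated GET queries can also be issued from a scripting language. The sketch below rebuilds the example URL above with Python's standard library. The host and the `ra`, `dec`, and `mjd` parameter names come from that example and stand in for whatever form inputs a real query page exposes; the function names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder CGI endpoint, taken from the example URL above.
BASE_URL = "http://astronomydata.edu/query"


def build_query_url(ra, dec, mjd, base_url=BASE_URL):
    """Assemble a GET URL from the form inputs found in the page source."""
    params = {"ra": ra, "dec": dec, "mjd": mjd}
    return base_url + "?" + urlencode(params)


def fetch_result_page(url, timeout=30):
    """Download one result page as text (performs a real HTTP request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Looping `build_query_url` over a list of MJD values, with a pause between requests, gives the by-date scrape without overloading the archive.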
Newer archive interfaces:
- AJAX/HTML5/etc. page
- Download the JavaScript and run it through a de-obfuscator
- Locate the relevant XMLHttpRequest
- Determine if cookies are necessary
- Issue repeated queries to the XMLHttpRequest URLs
- Much harder
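Once the relevant XMLHttpRequest has been located, replaying it might look like the sketch below. Everything specific here is an assumption: the endpoint URL, the JSON payload, and the `X-Requested-With` header are placeholders for whatever the de-obfuscated JavaScript actually sends. The cookie jar covers the case where the archive sets a session cookie on the first page load.

```python
import json
import urllib.request
from http.cookiejar import CookieJar


def make_opener():
    """Return an opener that stores and resends cookies across requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_xhr_request(url, payload):
    """Build a POST request imitating the page's XMLHttpRequest (hypothetical shape)."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # some servers check for this header
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

A typical sequence would be to open the query page once with the opener, so that any cookies are stored, and then call `opener.open(build_xhr_request(...))` for each query.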
Easiest of all...
http://smoka.nao.ac.jp/status/obslog/SUP_2007.txt
A script to get all Subaru/SuprimeCam metadata...
#!/bin/bash
# Fetch one SuprimeCam observing log per year, 1999 through 2014.
for year in {1999..2014}; do
    wget "http://smoka.nao.ac.jp/status/obslog/SUP_${year}.txt"
done
The second easiest: CADC's Advanced Search
http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync?LANG=ADQL&REQUEST=doQuery&QUERY=SELECT%20Observation.observationURI%20AS%20%22Preview%22%2C%20Observation.collection%20AS%20%22Collection%22%2C%20Observation.observationID%20AS%20%22Obs.%20ID%22%2C%20COORD1(CENTROID(Plane.position_bounds))%20AS%20%22RA%20(J2000.0)%22%2C%20COORD2(CENTROID(Plane.position_bounds))%20AS%20%22Dec.%20(J2000.0)%22%2C%20Plane.time_bounds_cval1%20AS%20%22Start%20Date%22%2C%20Observation.instrument_name%20AS%20%22Instrument%22%2C%20Plane.time_exposure%20AS%20%22Int.%20Time%22%2C%20Observation.target_name%20AS%20%22Target%20Name%22%2C%20Plane.energy_bandpassName%20AS%20%22Filter%22%2C%20Plane.calibrationLevel%20AS%20%22Cal.%20Lev.%22%2C%20Observation.type%20AS%20%22Obs.%20Type%22%2C%20Plane.energy_bounds_cval1%20AS%20%22Min.%20Wavelength%22%2C%20Plane.energy_bounds_cval2%20AS%20%22Max.%20Wavelength%22%2C%20Observation.proposal_id%20AS%20%22Proposal%20ID%22%2C%20Observation.proposal_pi%20AS%20%22P.I.%20Name%22%2C%20Plane.productID%20AS%20%22Product%20ID%22%2C%20Plane.dataRelease%20AS%20%22Data%20Release%22%2C%20AREA(Plane.position_bounds)%20AS%20%22Field%20of%20View%22%2C%20Plane.position_sampleSize%20AS%20%22Pixel%20Scale%22%2C%20Plane.dataProductType%20AS%20%22Data%20Type%22%2C%20Plane.position_timeDependent%20AS%20%22Moving%20Target%22%2C%20Plane.provenance_name%20AS%20%22Provenance%20Name%22%2C%20Plane.provenance_keywords%20AS%20%22Provenance%20Keywords%22%2C%20Observation.intent%20AS%20%22Intent%22%2C%20Observation.target_type%20AS%20%22Target%20Type%22%2C%20Observation.target_standard%20AS%20%22Target%20Standard%22%2C%20Plane.metaRelease%20AS%20%22Meta%20Release%22%2C%20Observation.sequenceNumber%20AS%20%22Sequence%20Number%22%2C%20Observation.algorithm_name%20AS%20%22Algorithm%20Name%22%2C%20Observation.proposal_title%20AS%20%22Proposal%20Title%22%2C%20Observation.proposal_keywords%20AS%20%22Proposal%20Keywords%22%2C%20Observation.proposal_project%20AS%20%22Proposal%20Project%22%2C%20Plane.position_bounds%20AS%20%22Polygon%22%2C%20Plane.energy_emBand%20AS%20%22Band%22%2C%20Plane.provenance_reference%20AS%20%22Prov.%20Reference%22%2C%20Plane.provenance_version%20AS%20%22Prov.%20Version%22%2C%20Plane.provenance_project%20AS%20%22Prov.%20Project%22%2C%20Plane.provenance_producer%20AS%20%22Prov.%20Producer%22%2C%20Plane.provenance_runID%20AS%20%22Prov.%20Run%20ID%22%2C%20Plane.provenance_lastExecuted%20AS%20%22Prov.%20Last%20Executed%22%2C%20Plane.provenance_inputs%20AS%20%22Prov.%20Inputs%22%2C%20Plane.energy_restwav%20AS%20%22Rest-frame%20Spectral%20Coverage%22%2C%20Plane.planeID%20AS%20%22planeID%22%2C%20isDownloadable(Plane.planeURI)%20AS%20%22DOWNLOADABLE%22%2C%20Plane.planeURI%20AS%20%22CAOM%20Plane%20URI%22%2C%20Observation.instrument_keywords%20AS%20%22Instrument%20Keywords%22%2C%20Plane.energy_transition_species%20AS%20%22Molecule%22%2C%20Plane.energy_transition_transition%20AS%20%22Transition%22%2C%20Plane.position_resolution%20AS%20%22IQ%22%20FROM%20caom2.Plane%20AS%20Plane%20JOIN%20caom2.Observation%20AS%20Observation%20ON%20Plane.obsID%20%3D%20Observation.obsID%20WHERE%20%20(%20Observation.instrument_name%20%3D%20%27MegaPrime%27%20AND%20Observation.collection%20%3D%20%27CFHT%27%20)&FORMAT=tsv
SELECT
  Observation.observationURI AS "Preview",
  Observation.collection AS "Collection",
  Observation.observationID AS "Obs. ID",
  COORD1(CENTROID(Plane.position_bounds)) AS "RA (J2000.0)",
  COORD2(CENTROID(Plane.position_bounds)) AS "Dec. (J2000.0)",
  Plane.time_bounds_cval1 AS "Start Date",
  Observation.instrument_name AS "Instrument",
  Plane.time_exposure AS "Int. Time",
  Observation.target_name AS "Target Name",
  Plane.energy_bandpassName AS "Filter",
  Plane.calibrationLevel AS "Cal. Lev.",
  Observation.type AS "Obs. Type",
  Plane.energy_bounds_cval1 AS "Min. Wavelength",
  Plane.energy_bounds_cval2 AS "Max. Wavelength",
  Observation.proposal_id AS "Proposal ID",
  Observation.proposal_pi AS "P.I. Name",
  Plane.productID AS "Product ID",
  Plane.dataRelease AS "Data Release",
  AREA(Plane.position_bounds) AS "Field of View",
  Plane.position_sampleSize AS "Pixel Scale",
  Plane.dataProductType AS "Data Type",
  Plane.position_timeDependent AS "Moving Target",
  Plane.provenance_name AS "Provenance Name",
  Observation.intent AS "Intent",
  Observation.target_type AS "Target Type",
  Observation.target_standard AS "Target Standard",
  Observation.sequenceNumber AS "Sequence Number",
  Observation.algorithm_name AS "Algorithm Name",
  Observation.proposal_title AS "Proposal Title",
  Observation.proposal_keywords AS "Proposal Keywords",
  Plane.energy_emBand AS "Band",
  Plane.provenance_version AS "Prov. Version",
  Plane.provenance_project AS "Prov. Project",
  Plane.provenance_runID AS "Prov. Run ID",
  Plane.provenance_lastExecuted AS "Prov. Last Executed",
  Plane.energy_restwav AS "Rest-frame Spectral Coverage",
  isDownloadable(Plane.planeURI) AS "DOWNLOADABLE",
  Plane.planeURI AS "CAOM Plane URI",
  Observation.instrument_keywords AS "Instrument Keywords",
  Plane.energy_transition_species AS "Molecule",
  Plane.energy_transition_transition AS "Transition",
  Plane.position_resolution AS "IQ"
FROM caom2.Plane AS Plane JOIN caom2.Observation AS Observation ON Plane.obsID = Observation.obsID
WHERE ( Observation.collection = 'CFHT' )
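A harvesting script does not need every column the web interface requests. The sketch below builds a shorter ADQL query and the matching synchronous TAP URL. The endpoint, table names, and column names are taken from the query above; the time-window restriction, the assumption that `Plane.time_bounds_cval1` is an MJD, the column aliases, and the function names are illustrative assumptions.

```python
from urllib.parse import urlencode, quote

# Endpoint as it appears in the URL above.
TAP_SYNC = "http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync"


def build_adql(collection, mjd_start, mjd_end):
    """ADQL for the per-image fields, limited to one time window.

    Assumes Plane.time_bounds_cval1 holds the exposure start as an MJD.
    """
    return (
        "SELECT Plane.planeURI, "
        "COORD1(CENTROID(Plane.position_bounds)) AS ra_deg, "
        "COORD2(CENTROID(Plane.position_bounds)) AS dec_deg, "
        "AREA(Plane.position_bounds) AS fov_area, "
        "Plane.time_bounds_cval1, Plane.time_exposure, "
        "Plane.energy_bandpassName, Observation.target_name "
        "FROM caom2.Plane AS Plane JOIN caom2.Observation AS Observation "
        "ON Plane.obsID = Observation.obsID "
        f"WHERE Observation.collection = '{collection}' "
        f"AND Plane.time_bounds_cval1 >= {mjd_start} "
        f"AND Plane.time_bounds_cval1 < {mjd_end}"
    )


def build_tap_url(adql, fmt="tsv"):
    """Percent-encode the query into a synchronous TAP request URL."""
    params = {"LANG": "ADQL", "REQUEST": "doQuery", "QUERY": adql, "FORMAT": fmt}
    # quote_via=quote encodes spaces as %20, matching the URL shown above.
    return TAP_SYNC + "?" + urlencode(params, quote_via=quote)
```

For example, `build_tap_url(build_adql("CFHT", 57323, 57324))` gives a URL for one night of CFHT data that can be fetched with wget, curl, or `urllib.request.urlopen`.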
The other hard part:
- Parsing downloaded metadata
- Which observations are images?
- Quality control
  - Is the MJD right?
  - Are coordinates 2000.0 or 1950.0?
- Sorting out filters:
  - remove narrow-band filter data
  - remove bad filters
  - remove grism data
  - maybe homogenize filter names (B vs Bj vs Bjohnson vs Johnson B vs ...)
- Telescope footprint not typically part of the metadata
- Work out links back to original images
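Two of the clean-up steps above lend themselves to small helpers. The sketch below is illustrative only: the alias table is a hypothetical starting point built from the B-band spellings mentioned above, and the date check relies only on the standard relation MJD = JD - 2400000.5 to catch archives that report a full Julian Date where an MJD is expected.

```python
import re

# Hypothetical alias table; a real one would be built per archive.
FILTER_ALIASES = {
    "b": "B",
    "bj": "B",
    "bjohnson": "B",
    "johnsonb": "B",
}

JD_MINUS_MJD = 2400000.5


def normalize_filter(name: str) -> str:
    """Map spelling variants of a filter name onto one canonical name."""
    key = re.sub(r"[\s_.\-]+", "", name).lower()
    # Unknown names pass through unchanged (apart from trimming whitespace).
    return FILTER_ALIASES.get(key, name.strip())


def to_mjd(value: float) -> float:
    """Return an MJD, converting from JD if the value is too large to be an MJD."""
    return value - JD_MINUS_MJD if value > JD_MINUS_MJD else value
```

Checks like these are cheap to run on every harvested row, and rows that fail can be set aside for inspection rather than silently ingested.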
SSOIS saves the Earth....
Summary:
- SSOIS allows multi-archive searches for moving objects
- Metadata is harvested from external archives
- Lessons learned:
  - SIAP is not useful for metadata harvesting
  - multiple queries by time, not by position
  - older interfaces are easier to scrape
  - parsing metadata is often harder than retrieving it