Aggregating Metadata from Multiple Archives: a Non-VO Approach
Stephen Gwyn, Canadian Astronomy Data Centre (CADC)

- Astronomy is using more and more archival data
  - More than 50% of HST papers are archival
  - Similar trends for other telescopes
- Harder for solar system astronomy

SSOIS: Solar System Object Image Search allows users to search for images of moving targets

Initially, only data from CFHT/MegaCam was searched

Next added data from external telescope archives: NEAT, CFHT, Subaru, ESO, Gemini, AAT, SDSS, NOAO, ING

Scraping external archives:

For each image, we need:

- position (RA, Dec)
- field of view
- MJD of mid-exposure
- filter
- exposure time
- target name
- URL to data
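The per-image fields listed above map naturally onto one record type. The following is a minimal sketch, not the SSOIS schema: the class name, field names and units are assumptions chosen for illustration, and the helper assumes an archive that reports only the exposure start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """One row of harvested metadata (field names are illustrative)."""
    ra_deg: float       # right ascension of the image centre, degrees
    dec_deg: float      # declination of the image centre, degrees
    fov_deg: float      # field of view, assumed here to be a width in degrees
    mjd_mid: float      # MJD of mid-exposure
    filter_name: str
    exptime_s: float    # exposure time, seconds
    target: str
    url: str            # link back to the data


def mjd_mid_exposure(mjd_start: float, exptime_s: float) -> float:
    """Shift an exposure-start MJD by half the exposure time (86400 s per day)."""
    return mjd_start + 0.5 * exptime_s / 86400.0


# Hypothetical example row; the URL is a placeholder, not a real data link.
record = ImageRecord(
    ra_deg=12.87, dec_deg=13.52, fov_deg=1.0,
    mjd_mid=mjd_mid_exposure(57323.0, 300.0),
    filter_name="r", exptime_s=300.0, target="example field",
    url="http://archive.example.org/data/0001.fits",
)
```

Keeping every archive's output in one shape like this is what lets a single moving-object search run over all of them.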

There are a variety of data archive interfaces....

Scraping external archives:

- In an ideal world: one query to get all metadata
- In real life: row limits
- As the archives are updated, they need to be re-scraped periodically
- Programmatic retrieval is required

Use SIAP?

Advantages:
- A single tool can scrape multiple archives

Disadvantages:
- Not all archives have an SIAP interface
- Many SIAP services do not conform to the VO standard
- Not all SIAP services contain all the necessary metadata
- Most archives have at least one heavily observed patch of sky: hit the row limit again
- SIAP services vary in ability for positional queries
  - maximum search area
  - search is circle or box
- May require 10^5 queries: may be perceived as a DoS attack

Far better off scraping by day/night/MJD
- Almost all telescopes take <10000 observations per 24 hours
- Can re-scrape with fewer queries
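Scraping by date amounts to walking the archive in fixed MJD windows, each small enough to stay under the row limit. A rough sketch follows; the function names and the `overlap_days` safety margin are assumptions for illustration, not part of SSOIS.

```python
from typing import Iterator, List, Tuple


def mjd_windows(mjd_start: int, mjd_end: int,
                step_days: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield half-open [lo, hi) MJD windows covering mjd_start..mjd_end."""
    if step_days < 1:
        raise ValueError("step_days must be at least 1")
    lo = mjd_start
    while lo < mjd_end:
        hi = min(lo + step_days, mjd_end)
        yield lo, hi
        lo = hi


def windows_to_rescrape(last_scraped_mjd: int, today_mjd: int,
                        overlap_days: int = 30) -> List[Tuple[int, int]]:
    """Windows for a periodic re-scrape: only nights since the last run,
    plus an assumed overlap in case the archive added older data late."""
    return list(mjd_windows(last_scraped_mjd - overlap_days, today_mjd))
```

With one-night windows, 20 years of observations is about 7300 queries for the first pass, and each later re-scrape only touches the recent windows.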

Scraping by RA/Dec

Scraping by Date

Older archive interfaces:
- Query page + simple CGI result page
- view source on the query page
- get form inputs
- issue repeated queries to CGI result page using GET or POST with wget/curl/scripting API
- Easy

http://astronomydata.edu/query?ra=12.87&dec=13.52&mjd=57323
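Once the form inputs are known, the repeated queries are just the same URL with different parameters. A minimal sketch with the Python standard library, using the placeholder host and the parameter names (`ra`, `dec`, `mjd`) from the example URL above; a real archive will have its own input names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://astronomydata.edu/query"  # placeholder host from the example URL


def build_query_url(ra: float, dec: float, mjd: int,
                    base_url: str = BASE_URL) -> str:
    """Encode the form inputs as a GET query string."""
    return base_url + "?" + urlencode({"ra": ra, "dec": dec, "mjd": mjd})


def fetch(url: str, timeout: float = 60.0) -> str:
    """Retrieve one result page. Not called here because it needs network access."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# One URL per night for three consecutive nights.
urls = [build_query_url(12.87, 13.52, mjd) for mjd in range(57323, 57326)]
```

A polite scraper would also pause between calls to `fetch`, so that a long run of queries does not look like an attack.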

Newer archive interfaces:
- AJAX/HTML5/etc page
- Download JavaScript and run through de-obfuscator
- locate relevant XMLHttpRequest
- determine if cookies are necessary
- issue repeated queries to XMLHttpRequest URLs
- Much harder
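After the XMLHttpRequest has been located, replaying it from a script might look like the sketch below. Everything specific here is hypothetical: the endpoint, the JSON payload and the headers stand in for whatever the de-obfuscated page actually sends.

```python
import json
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Hypothetical endpoint; the real one comes from reading the page's JavaScript.
XHR_URL = "http://archive.example.org/api/search"


def make_session():
    """Opener that keeps cookies set by earlier responses and resends them."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def build_xhr_request(mjd: int, url: str = XHR_URL) -> Request:
    """Build the POST request a browser might send for one night (assumed payload)."""
    payload = json.dumps({"mjd": mjd}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, data=payload, headers=headers, method="POST")


# Usage (needs network access, so left commented out):
# session = make_session()
# session.open("http://archive.example.org/")            # pick up any cookies
# reply = session.open(build_xhr_request(57323)).read()
```

Visiting the landing page first with the same opener is one way to find out whether cookies are necessary: if the query only works after that visit, they are.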

Easiest of all...

http://smoka.nao.ac.jp/status/obslog/SUP_2007.txt

A script to get all Subaru/SuprimeCam metadata...

#!/bin/bash
wget http://smoka.nao.ac.jp/status/obslog/SUP_1999.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2000.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2001.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2002.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2003.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2004.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2005.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2006.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2007.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2008.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2009.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2010.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2011.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2012.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2013.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2014.txt

The second easiest: CADC's Advanced Search

http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync?LANG=ADQL&REQUEST=doQuery&QUERY=SELECT%20Observation.observationURI%20AS%20%22Preview%22%2C%20Observation.collection%20AS%20%22Collection%22%2C%20Observation.observationID%20AS%20%22Obs.%20ID%22%2C%20COORD1(CENTROID(Plane.position_bounds))%20AS%20%22RA%20(J2000.0)%22%2C%20COORD2(CENTROID(Plane.position_bounds))%20AS%20%22Dec.%20(J2000.0)%22%2C%20Plane.time_bounds_cval1%20AS%20%22Start%20Date%22%2C%20Observation.instrument_name%20AS%20%22Instrument%22%2C%20Plane.time_exposure%20AS%20%22Int.%20Time%22%2C%20Observation.target_name%20AS%20%22Target%20Name%22%2C%20Plane.energy_bandpassName%20AS%20%22Filter%22%2C%20Plane.calibrationLevel%20AS%20%22Cal.%20Lev.%22%2C%20Observation.type%20AS%20%22Obs.%20Type%22%2C%20Plane.energy_bounds_cval1%20AS%20%22Min.%20Wavelength%22%2C%20Plane.energy_bounds_cval2%20AS%20%22Max.%20Wavelength%22%2C%20Observation.proposal_id%20AS%20%22Proposal%20ID%22%2C%20Observation.proposal_pi%20AS%20%22P.I.%20Name%22%2C%20Plane.productID%20AS%20%22Product%20ID%22%2C%20Plane.dataRelease%20AS%20%22Data%20Release%22%2C%20AREA(Plane.position_bounds)%20AS%20%22Field%20of%20View%22%2C%20Plane.position_sampleSize%20AS%20%22Pixel%20Scale%22%2C%20Plane.dataProductType%20AS%20%22Data%20Type%22%2C%20Plane.position_timeDependent%20AS%20%22Moving%20Target%22%2C%20Plane.provenance_name%20AS%20%22Provenance%20Name%22%2C%20Plane.provenance_keywords%20AS%20%22Provenance%20Keywords%22%2C%20Observation.intent%20AS%20%22Intent%22%2C%20Observation.target_type%20AS%20%22Target%20Type%22%2C%20Observation.target_standard%20AS%20%22Target%20Standard%22%2C%20Plane.metaRelease%20AS%20%22Meta%20Release%22%2C%20Observation.sequenceNumber%20AS%20%22Sequence%20Number%22%2C%20Observation.algorithm_name%20AS%20%22Algorithm%20Name%22%2C%20Observation.proposal_title%20AS%20%22Proposal%20Title%22%2C%20Observation.proposal_keywords%20AS%20%22Proposal%20Keywords%22%2C%20Observation.proposal_project%20AS%20%22Proposal%20Project%22%2C%20Plane.position_bounds%20AS%20%22Polygon%22%2C%20Plane.energy_emBand%20AS%20%22Band%22%2C%20Plane.provenance_reference%20AS%20%22Prov.%20Reference%22%2C%20Plane.provenance_version%20AS%20%22Prov.%20Version%22%2C%20Plane.provenance_project%20AS%20%22Prov.%20Project%22%2C%20Plane.provenance_producer%20AS%20%22Prov.%20Producer%22%2C%20Plane.provenance_runID%20AS%20%22Prov.%20Run%20ID%22%2C%20Plane.provenance_lastExecuted%20AS%20%22Prov.%20Last%20Executed%22%2C%20Plane.provenance_inputs%20AS%20%22Prov.%20Inputs%22%2C%20Plane.energy_restwav%20AS%20%22Rest-frame%20Spectral%20Coverage%22%2C%20Plane.planeID%20AS%20%22planeID%22%2C%20isDownloadable(Plane.planeURI)%20AS%20%22DOWNLOADABLE%22%2C%20Plane.planeURI%20AS%20%22CAOM%20Plane%20URI%22%2C%20Observation.instrument_keywords%20AS%20%22Instrument%20Keywords%22%2C%20Plane.energy_transition_species%20AS%20%22Molecule%22%2C%20Plane.energy_transition_transition%20AS%20%22Transition%22%2C%20Plane.position_resolution%20AS%20%22IQ%22%20FROM%20caom2.Plane%20AS%20Plane%20JOIN%20caom2.Observation%20AS%20Observation%20ON%20Plane.obsID%20%3D%20Observation.obsID%20WHERE%20%20(%20Observation.instrument_name%20%3D%20%27MegaPrime%27%20AND%20Observation.collection%20%3D%20%27CFHT%27%20)&FORMAT=tsv

SELECT
  Observation.observationURI AS "Preview",
  Observation.collection AS "Collection",
  Observation.observationID AS "Obs. ID",
  COORD1(CENTROID(Plane.position_bounds)) AS "RA (J2000.0)",
  COORD2(CENTROID(Plane.position_bounds)) AS "Dec. (J2000.0)",
  Plane.time_bounds_cval1 AS "Start Date",
  Observation.instrument_name AS "Instrument",
  Plane.time_exposure AS "Int. Time",
  Observation.target_name AS "Target Name",
  Plane.energy_bandpassName AS "Filter",
  Plane.calibrationLevel AS "Cal. Lev.",
  Observation.type AS "Obs. Type",
  Plane.energy_bounds_cval1 AS "Min. Wavelength",
  Plane.energy_bounds_cval2 AS "Max. Wavelength",
  Observation.proposal_id AS "Proposal ID",
  Observation.proposal_pi AS "P.I. Name",
  Plane.productID AS "Product ID",
  Plane.dataRelease AS "Data Release",
  AREA(Plane.position_bounds) AS "Field of View",
  Plane.position_sampleSize AS "Pixel Scale",
  Plane.dataProductType AS "Data Type",
  Plane.position_timeDependent AS "Moving Target",
  Plane.provenance_name AS "Provenance Name",
  Observation.intent AS "Intent",
  Observation.target_type AS "Target Type",
  Observation.target_standard AS "Target Standard",
  Observation.sequenceNumber AS "Sequence Number",
  Observation.algorithm_name AS "Algorithm Name",
  Observation.proposal_title AS "Proposal Title",
  Observation.proposal_keywords AS "Proposal Keywords",
  Plane.energy_emBand AS "Band",
  Plane.provenance_version AS "Prov. Version",
  Plane.provenance_project AS "Prov. Project",
  Plane.provenance_runID AS "Prov. Run ID",
  Plane.provenance_lastExecuted AS "Prov. Last Executed",
  Plane.energy_restwav AS "Rest-frame Spectral Coverage",
  isDownloadable(Plane.planeURI) AS "DOWNLOADABLE",
  Plane.planeURI AS "CAOM Plane URI",
  Observation.instrument_keywords AS "Instrument Keywords",
  Plane.energy_transition_species AS "Molecule",
  Plane.energy_transition_transition AS "Transition",
  Plane.position_resolution AS "IQ"
FROM caom2.Plane AS Plane
  JOIN caom2.Observation AS Observation ON Plane.obsID = Observation.obsID
WHERE ( Observation.collection = 'CFHT' )
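A query like this can be issued from a script by URL-encoding the ADQL into the same endpoint. The sketch below reuses the host and the `LANG`, `REQUEST`, `QUERY` and `FORMAT` parameters that appear in the URL on the slide (the service address may have changed since); the shortened column list and the sample reply are made up for illustration.

```python
import csv
import io
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint and parameter names as they appear in the slide's URL.
TAP_SYNC = "http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync"

# A trimmed-down query with only a few of the columns, for readability.
ADQL = (
    "SELECT Observation.observationID, Plane.time_bounds_cval1, "
    "Plane.time_exposure, Plane.energy_bandpassName "
    "FROM caom2.Plane AS Plane JOIN caom2.Observation AS Observation "
    "ON Plane.obsID = Observation.obsID "
    "WHERE Observation.collection = 'CFHT'"
)


def tap_url(adql: str) -> str:
    """Build the synchronous TAP request URL for one ADQL query."""
    params = {"LANG": "ADQL", "REQUEST": "doQuery", "QUERY": adql, "FORMAT": "tsv"}
    return TAP_SYNC + "?" + urlencode(params)


def parse_tsv(text: str):
    """Turn a tab-separated reply into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Made-up sample of a TSV reply, for illustration only.
sample = "observationID\ttime_exposure\n12345\t300.0\n"
rows = parse_tsv(sample)
```

Adding a constraint on `Plane.time_bounds_cval1` to the WHERE clause would restrict each request to a window of dates.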

The other hard part:

- Parsing downloaded metadata

- Which observations are images?

- Quality control
  - is MJD right?
  - Are coordinates 2000.0 or 1950.0?

- Sorting out filters:
  - remove narrow band filter data
  - remove bad filters
  - remove grism data
  - maybe homogenize filter names (B vs Bj vs Bjohnson vs Johnson B vs ...)

- Telescope footprint not typically part of the metadata

- Work out links back to original images
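Two of the clean-up steps above, filter homogenization and a sanity check on MJD, can be sketched as follows. The alias table and rejection patterns are assumptions for illustration; in practice each archive needs its own, built by inspecting its metadata.

```python
import re
from typing import Optional

# Illustrative alias table covering the variants named above.
FILTER_ALIASES = {"b": "B", "bj": "B", "bjohnson": "B", "johnsonb": "B"}

# Assumed patterns for entries to drop (grism and narrow-band data).
REJECT = re.compile(r"grism|narrow|halpha|oiii", re.IGNORECASE)


def normalize_filter(raw: str) -> Optional[str]:
    """Return a homogenized filter name, or None if the image should be dropped."""
    if not raw.strip() or REJECT.search(raw):
        return None
    key = re.sub(r"[^a-z0-9]", "", raw.lower())
    return FILTER_ALIASES.get(key, raw.strip())


def mjd_plausible(mjd: float, first_light_mjd: float, today_mjd: float) -> bool:
    """Reject dates outside the telescope's lifetime, e.g. a JD stored as an MJD."""
    return first_light_mjd <= mjd <= today_mjd
```

For example, `normalize_filter("Johnson B")` and `normalize_filter("Bj")` both give `"B"`, while an unrecognised name is passed through unchanged so that nothing is silently lost.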

SSOIS saves the Earth....

Summary:

- SSOIS allows multi-archive searches for moving objects
- Metadata is harvested from external archives
- Lessons learned:
  - SIAP is not useful for metadata harvesting
  - multiple queries by time not by position
  - older interfaces are easier to scrape
  - parsing metadata often harder than retrieving it
