File Formats, Conventions, and Data Level Interoperability
ESDSWG, New Orleans, Oct 20, 2010
Joe Glassy, Chris Lynnes, ESDSWG Tech Infusion
Introduction & overview
• Outline of objectives:
– Discuss the role of standard, self-describing “file formats” in data level interoperability
– Summarize common file formats in use, their properties, and benefits (“data life cycle economics”)
– Discuss criteria for choosing a file format and matching it to the needs of consumers and producers
– Discuss the critical role of conventions: any file format needs good recipes to make it interoperable!
– Examples: NASA Measures F/T, SMAP, AIRS, Aura
Role(s) Of File Formats in Interoperability
• File formats represent versatile “packages” for multi-dimensional science data and metadata.
• Offer self-describing “well-known structures” to codify desired, common conventions and practices.
• Offer well-documented reference cases to encapsulate specific data models.
• Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability
• Enhance Mission-to-Mission continuity
Why (and how) are file formats important?
• Standard formats
– Come with thorough documentation
– Provide good reference implementations
• Common formats
– More datasets in a format lead to more tools that read that format
• Canonical structures and names
– Enable general-purpose handlers for coordinates, etc., which lead to smarter tools
A generic work flow…
• Consider user community needs and culture, fit within architecture, institutional policies & preferences
• Choose a standard file format (or sub-variant)
• Design a convention-enabled, specific internal layout with metadata interfaces
• Prototype: implement in prototype, evaluate
• Implement in production context
• Integrate within discovery and catalog environments (catalog interoperability…)
Examples of standard file formats
• HDF5 – a file format on its own, as well as a broad foundation for others
• netCDF v4 (stable at v4.1.1, newest: v4.1.2-beta1)
– v4 Classic (widespread adoption, some limitations…)
– v4 Enhanced (supports groups, user-defined and variable-length types, and more)
• netCDF v3 Classic (legacy+, tools+, but limited)
• HDFEOS2, HDFEOS5 – EOS Terra, Aqua, Aura…
• HDF4 – legacy, extensive use by MODIS Terra, Aqua
• Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)
Some selection criteria…
• Do the file format’s capabilities support the required functionality?
• What is the breadth of acceptance and adoption within the larger community? (And/or, does institutional policy dictate a specific format?)
• Presence and quality of documentation (reference, examples, and especially tutorials), API software, and community support?
• Contribution to investment, data life-cycle economics?
• What is the level of standardization?
• Adaptability of the format to widely used conventions like CF 1.x, or other accepted convention(s)?
Internal Layout / Design (once format is chosen & adopted…)
• Define & refine high-level organization / structure
– /DATA
– /METADATA
• Distinguish ‘data’ from ‘metadata’, core structure vs. ‘attributes’
– Dimensions, coordinate variables, projection attributes
– Missing_data, _FillValue vs. internal fill value
– Units, gain, offset, min, max, range, etc.
• Prototype it!
– Leverage script environments (Python h5py, PyTables, etc.)
– Panoply and HDFView are also quick and useful for prototyping and feedback
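As a hedged illustration of the prototyping step, the sketch below uses h5py to build an in-memory HDF5 file with /DATA and /METADATA groups plus a few attributes. The variable name example_var, its shape, units, fill value, and the attribute text are hypothetical placeholders, not taken from any real product.

```python
import h5py


def build_prototype(path="prototype_layout.h5"):
    """Create an in-memory HDF5 file with /DATA and /METADATA groups."""
    # driver="core" with backing_store=False keeps the file in memory only,
    # which is convenient for throwaway layout experiments.
    f = h5py.File(path, "w", driver="core", backing_store=False)
    f.attrs["Conventions"] = "CF-1.4"
    f.attrs["title"] = "Prototype layout (illustrative)"

    data = f.create_group("DATA")
    meta = f.create_group("METADATA")

    # Hypothetical 2 x 3 grid; unwritten cells read back as the fill value.
    var = data.create_dataset("example_var", shape=(2, 3), dtype="i2",
                              fillvalue=-9999)
    var.attrs["units"] = "K"
    var.attrs["long_name"] = "example variable"

    meta.attrs["history"] = "created by a prototype script"
    return f


f = build_prototype()
names = []
f.visit(names.append)  # collect every group and dataset path in the file
fill_on_read = int(f["DATA/example_var"][0, 0])
units = f["DATA/example_var"].attrs["units"]
f.close()

print(sorted(names))
print(fill_on_read, units)
```

Opening the resulting file in Panoply or HDFView (after writing it to disk instead of memory) is one way to get quick feedback on the layout before committing to it.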
Using “Groups”
• HDF5 (and NetCDF v4-Enhanced) support full use of groups e.g. /DATA vs. /METADATA, etc.
• Groups useful in partitioning out functionally related sets of data or attributes; Hierarchical view mimics file-system
• Facilitates appropriate information hiding: highlights needed info and shields the rest (principle of least privilege…)
• Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs.
Example(s) of File Formats In Action
• HDF5 – NASA Measures
– NASA Measures Freeze/Thaw (soon available at NSIDC)
– http://measures.ntsg.umt.edu/sample_2007_day180.zip
• AQUA AIRS Level 2 (from earlier talk):
– http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf
• Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)
Example: NetCDF sea surface temperature (tos) data collected by PCMDI for use by the IPCC, illustrating the CF v1.0 layout
CF Conventions & file formats: how they contribute to interoperability
• CF v1.4.x -- the term “CF” is now broader than just climate-forecasting!
• Standard Name Table -- a step towards wider adoption of names, controlled vocabularies, units terminology
• CF v1.4.x provides tool-makers with helpful “lingua-franca” guidance.
• Within a file-format, adopting conventions like CF promotes common layout, names, semantics, for dataset-to-dataset compatibility -- a key to wider data level interoperability.
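To make the idea of a shared convention concrete, here is a minimal sketch of the kind of check a tool-maker might run on a dataset's attributes. It looks only at a loose subset of CF-style descriptive attributes and is not a full CF compliance checker; the function name find_missing_attributes and the sample attribute dictionaries are hypothetical.

```python
# Descriptive global attributes that CF recommends (a loose subset).
RECOMMENDED_GLOBALS = ("title", "institution", "source", "history", "references")


def find_missing_attributes(global_attrs, variables):
    """Return human-readable notes about absent CF-style attributes."""
    notes = []
    if "Conventions" not in global_attrs:
        notes.append("global attribute 'Conventions' is missing")
    for name in RECOMMENDED_GLOBALS:
        if name not in global_attrs:
            notes.append(f"global attribute '{name}' is missing")
    for var_name, attrs in variables.items():
        if "units" not in attrs:
            notes.append(f"variable '{var_name}' has no units")
        if "standard_name" not in attrs and "long_name" not in attrs:
            notes.append(
                f"variable '{var_name}' has neither standard_name nor long_name"
            )
    return notes


# Hypothetical metadata, as it might be read from a file's attributes.
global_attrs = {
    "Conventions": "CF-1.4",
    "title": "Example grid",
    "history": "created for illustration",
}
variables = {
    "tos": {"standard_name": "sea_surface_temperature", "units": "K"},
    "mystery": {},
}

notes = find_missing_attributes(global_attrs, variables)
for note in notes:
    print(note)
```

Because the names and units follow a shared vocabulary, the same check works on any dataset that adopts the convention, which is the interoperability benefit in miniature.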
Attributes vs. Metadata? One man’s ceiling is another man’s floor…
• Collection level vs. data set vs. granule level
• Structural vs. science-content
• Swath vs. grid vs. point
• Commonly used attributes:
– Conventions attribute, communicates which convention was used
– Basic globals: title, history, institution, source, references
– Coordinate variables, axis, formula_terms
– Units, _FillValue, missing_data, valid_range
– Short_name, long_name, other provenance
– (gain, offset / scale_factor, add_offset), etc.
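The packing attributes listed above can be illustrated with a short sketch. Under the CF convention, a stored (packed) value is converted with unpacked = packed * scale_factor + add_offset, and cells equal to _FillValue are treated as missing. The function name unpack and the sample numbers below are hypothetical.

```python
def unpack(raw_values, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Convert packed integers to physical values, masking fill cells as None."""
    unpacked = []
    for raw in raw_values:
        if fill_value is not None and raw == fill_value:
            # Compare against the fill value before scaling, on the raw value.
            unpacked.append(None)
        else:
            unpacked.append(raw * scale_factor + add_offset)
    return unpacked


# Hypothetical packed temperatures with a fill value of -9999.
packed = [10, 21, -9999]
physical = unpack(packed, scale_factor=0.5, add_offset=270.0, fill_value=-9999)
print(physical)
```

Checking for the fill value before applying the scale and offset matters: once a fill cell has been scaled it no longer equals _FillValue and would silently pass as data.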
Challenges? (just a few remain…)
• Evolution, bifurcation, and asymmetric support can result in occasional user confusion:
– HDF5 v1.8.x vs. v1.6.x families?
– NetCDF v4 Enhanced vs. NetCDF v4 Classic vs. v3?
– HDFEOS5 vs. HDFEOS2?
• Both GUI tool and API support tend to vary by platform (Linux, Mac, Win7) and sub-flavor…
• Multi-library dependency stacks beg for a fully bundled, version-matched, end-to-end install package!
• Conventions community (CF v1.4.x) and metadata standards communities also in motion (but that’s good too…)
Resources: URLs
• Climate and Forecast (CF) Conventions (now at 1.4.x):
– http://cf-pcmdi.llnl.gov/
– http://cf-pcmdi.llnl.gov/documents/cf-conventions
• HDF:
– http://www.hdfgroup.org/HDF5/doc/index.html
• HDFEOS:
– http://www.hdfgroup.org/hdfeos.html
– http://hdfeos.org/software/aug_hdfeos5.php
• NetCDF:
– http://www.unidata.ucar.edu/software/netcdf/
– http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html
• General:
– http://www.oceanteacher.org/OTMediawiki/index.php/Self-Describing_Formats
– http://en.wikipedia.org/wiki/List_of_file_formats
Resources: File format related Tools
• Panoply: http://www.giss.nasa.gov/tools/panoply/
• HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/
• OPeNDAP: http://opendap.org
• IDV: http://www.unidata.ucar.edu/software/idv/
• McIDAS: http://www.unidata.ucar.edu/software/mcidas/
• Python:
– h5py: http://code.google.com/p/h5py/, http://h5py.alfven.org/
– PyTables: http://www.pytables.org/moin
• Perl: PDL-IO-HDF5, and Biohdf?
• Many others: HEG, MTD, HDFEOS plug-in for HDFView, HDFLook, (ncdump, h5dump, and cousins), GrADS, Matlab, binary APIs
A provisional DOI, UUID Strategy
• What we used for NASA Measures Freeze/Thaw, daily (v2), just delivered:
– DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing
– UUID recipe (Python):
    import uuid
    seedString = "www.our.url/GranuleName/Datetime8601Stamp"
    # uuid5 requires a namespace argument; NAMESPACE_URL is an assumption here
    granuleUuid = uuid.uuid5(uuid.NAMESPACE_URL, seedString)
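A minimal sketch of the UUID recipe, for readers who want to try it. Python's uuid.uuid5 takes a namespace and a name; the slide does not say which namespace was used, so uuid.NAMESPACE_URL is an assumption here. The helper name granule_uuid and the sample timestamp are hypothetical as well.

```python
import uuid


def granule_uuid(base_url, granule_name, timestamp):
    """Derive a name-based (version 5) UUID from a granule's identifying parts."""
    seed = f"{base_url}/{granule_name}/{timestamp}"
    # Assumption: NAMESPACE_URL is used; the original recipe names no namespace.
    return uuid.uuid5(uuid.NAMESPACE_URL, seed)


first = granule_uuid("www.our.url", "GranuleName", "2007-06-29T00:00:00Z")
again = granule_uuid("www.our.url", "GranuleName", "2007-06-29T00:00:00Z")
other = granule_uuid("www.our.url", "GranuleName", "2007-06-30T00:00:00Z")

print(first)
print(first == again)  # same seed string, same UUID
print(first == other)  # different timestamp, different UUID
```

Because version 5 UUIDs are computed from a hash of the namespace and seed string, the identifier can be regenerated from the granule name and timestamp alone, with no registry lookup.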