Peter Cornillon Graduate School of Oceanography University of Rhode Island Presented at the NSF Sponsored Cyberinfrastructure Meeting 31 October 2002 OPeNDAP:

Peter CornillonPeter CornillonGraduate School of OceanographyGraduate School of Oceanography

University of Rhode IslandUniversity of Rhode Island

Presented at the NSF Sponsored Presented at the NSF Sponsored Cyberinfrastructure MeetingCyberinfrastructure Meeting

31 October 200231 October 2002

OPeNDAP: Accessing Data in a Distributed, Heterogeneous

Environment

OutlineOutline

DODS NVODS & OPeNDAP

Interoperability: The Core Infrastructure

How OPeNDAP is being used

Lessons learned – also throughout

Distributed Oceanographic Data System (DODS)Distributed Oceanographic Data System (DODS)

Conceived in 1992 at a workshop held at URI.

Objectives were:– to facilitate access to PI held data as well as data

held in national archives and– to allow the data user to analyze data using the

application package with which he or she is the most familiar.

Basic system designed and implemented in 1993-1994 by Gallagher and Flierl

Distributed Oceanographic Data SystemDistributed Oceanographic Data System

DODS consisted of two fundamental parts:

a discipline independent core infrastructure for moving data on the net,

a discipline specific portion related to data –population, location, specialized clients, etc.

DODS DODS OPeNDAP & NVODS OPeNDAP & NVODS

To isolate the discipline independent part of the system from the discipline specific part, two entities have been formed:

Open Source Project for a Network Data Access Protocol (OPeNDAP)

National Virtual Ocean Data System (NVODS)

DODS NVODS/OPeNDAP

OPeNDAP was formed to maintain and evolve the DODS core infrastructure

OPeNDAP is a non-profit corporation

OPeNDAP focuses on the discipline neutral parts of the DODS data access protocol

Objective of OPeNDAP

To provide a data access protocol allowing for machine-to-machine interoperability with semantic meaning in a distributed, heterogeneous data environment

The scripted exchange of data between computers, without human intervention.

Considerations with regard to the development OPeNDAP

Many data providers

Many data formats

Many different semantic representations of the data

Many different client types

The Core InfrastructureInteroperability

Interoperability - Metadata

The degree to which machine-to-machine interoperability is achieved depends on the metadata associated with the data.

OPeNDAP and Metadata

Metadata Types

We define two classes of metadata:

• Use metadata –needed to actually use the data.

• Search metadata – used to locate data sets of interest in a distributed data system.

Use Metadata

We divide use metadata into two classes:

• Syntactic use metadata

• Semantic use metadata

Syntactic Use Metadata

Information about the data types and structures at the computer level - the syntax of the data;

– e.g., variable T represents a 20x40 element floating point array.

Semantic Use Metadata

Information about the contents of the data set.

e.g., variable T represents • sea surface temperature • with units of ºC

Semantic Use Metadata

We divide semantic use metadata into two classes:

• Descriptive Semantic Use Metadata

• Translational Semantic Use Metadata

Translational Semantic Use Metadata

Metadata required to make use of the data; e.g., to properly label a plot of the data

Define the translation from received values to semantically meaningful values

Examples

• Units of the data: 0.125 C+4 C

• Variable names in the data set: t SST

• Missing value flags: -99 missing value

OPeNDAP and Metadata

Interoperability – Data Exchange

Interoperability may be defined at any one of a number of levels ranging from:

the lowest (hardware) - how computers are linked electronically, to

the highest – semantically meaningful, machine-to-machine exchanges.

Organizational Complexity

Example: Consider the different ways of organizing a multi-year data set consisting of one global sea surface temperature (SST) field per day:

one 2-d file per day sst(lat,lon) - URI

one 3-d file sst(lon,lat,time) - PMEL

one file per year with one variable per day 365 variables per file, n files for n year - GSFC

Structure Layer

Provide the capability to reorganize data so that they are in a consistent structural form.

Objective is to reduce the granularity of the data set

Example: one 3-d file sst(lon,lat,time)

Format Layer

Data values are not modified

Format transformation only between server and client

The organizational structure of the data is not modified

Structure Layer

Data values are not modified

The organizational structure of the data is modified

An OPeNDAP Structural Layer Component – The Aggregation Server

Developed by John Caron of Unidata

Is for the aggregation of grids and arrays only

Operates in the Syntactic Structural Level

OPeNDAP - NVODS Status

OPeNDAP/NVODS Server SitesOPeNDAP Server Sites

OPeNDAP Client and Server Status

Special Servers

Projects Using OPeNDAP

GODAE (Global Ocean Data Assimilation Experiment)

NOMADS (NOAA Operational Model Archive and Distribution System) AOIMPS

ESG II - Earth System Grid II

Ocean. US (US-GOOS)

High Altitude Observatory Community

Institutions Making Heavy use of OPeNDAP2

Ingrid - Columbia University

COLA - Center for Ocean-Land-Atmosphere

Goddard DAAC

CDC - Climate Diagnostic Center

PMEL - Pacific Marine Environment Lab

OPeNDAP Monthly Accesses (2002)

Site/Month April May June July August

URI 4,856 19,504 3,691 26,693 7,440

LDEO 80,709 62,930 46,092 93,088 32,084

CDC 102,518 153,362 62,395 181,974 107,512

JPL 3,068 34,028 63,309 8,260 13,282

COLA 347,506 412,991 337,310 400,314 638,376

TOTAL 535,589 648,787 502,797 702,069 785,412

OPeNDAP Unique Users (2002)

Site April May June July August

URI 73 68 72 44 69

CDC 124 105 91 116 111

JPL 122 152 173 198 174

COLA 158 199 317 197 408

Interesting OPeNDAP Access Statistics

• IRI data accesses for 1st quarter of 2002

Type Requests % Volume (gb) %

OPeNDAP

191,611 8.5 375.2 69.4

Other 2,062,681 91.5 165.2 30.6

Total 2,254,292 100.0 540.4 100.0

•PMEL OPeNDAP2 ~ 35,000 with ~26,000 internal.

Lessons (Re)Learned

Lessons (Re)Learned

1. Modularity provides for flexibility

The more modular the underlying infrastructure the more flexible the system. This is particularly important for network based systems for which the technology, software and hardware, are changing rapidly.

Lessons (Re)Learned

2. Data of interest will be stored in a variety of formats.

Regardless of how much one might want to define the format to be used by system participants, in the end the data will be stored in a variety of formats.

2a. The same is true of translational use metadata!

Lessons Learned

3. Structural representation of sequence data sets is a major obstacle to interoperability

Care must be given to the organizational structure (as opposed to the format) of the data. This is the single largest constraint to the use of profile data in NVODS.

Lessons (Re)Learned

4. “Not invented here”

Avoid the “not invented here” trap. The basic concepts of a data system are relatively straightforward to define. Implementing these concepts ALWAYS involves substantially more work than originally anticipated. The “Devil’s in the details”.

Take advantage of existing software wherever possible.

Lessons (Re)Learned

5. Work with those who adopt the system for their own needs.

Take advantage of those who are interested in contributing to the system because the system addresses their needs as opposed to those who are simply doing the work for the associated funding. => Open source.

Lessons Learned

6. There is no well defined funding structure for community based operational systems.

It is much easier to obtain funding to develop a system than it is to obtain funding to maintain and evolve a system. This is a major obstacle to development of a stable cyberinfrastructure that meets the needs of the research community.

Lessons Learned

7. It is relatively more difficult to obtain funding for applied system development than for research related to data systems.

This is another obstacle to the development of cyberinfrastructure that meets the needs of the research community.

Lessons (Re)Learned

8. “Tough to teach old dogs new tricks”

Introducing new technology often requires a cultural change in usage that is difficult to effect. This can negatively effect system development.

Lesser Lessons Learned

9. Some surprises encountered in the NVODS/ OPeNDAP effort

Heavy within organization usage.

Metadata focus in the past is appropriate for interoperability at the data level.

Number of variables increases almost linearly with the number of data sets.

Users will take advantage of all of the flexibility offered by a system sometimes to the disadvantage of all.

Incredible variability in the structural organization of data.

Lessons Learned

10. Metrics suggest Increasing use of scripted requests Large volume transfers

As data systems offering machine-to- machine interoperability with semantic meaning take hold, we could well see an explosive growth in the use of the web.

Lessons Learned

11. Time to maturity is order 10 years not 3

Developing new infrastructure takes time, both to iron out all of the %^*% little details and adoption of the infrastructure takes time.

Peter’s Law

The more metadata required the less data delivered

Of course, the less metadata, the harder it is to use the data

http://unidata.ucar.edu/packages/dods

http://nvods.org

Documents

Peter Cornillon Graduate School of Oceanography University of Rhode Island Presented at the NSF Sponsored Cyberinfrastructure Meeting 31 October 2002 OPeNDAP: