


Citation and Recognition of Contributions Using Semantic Provenance Knowledge Captured in the OPeNDAP Software Framework

Patrick West¹ (westp@rpi.edu), James Michaelis¹, Tim Lebo¹, Deborah L. McGuinness¹, Peter Fox¹

(¹Rensselaer Polytechnic Institute, 110 8th St., Troy, NY, 12180, United States)

Poster: IN31C-3738

Glossary:

OPeNDAP - Open-source Project for a Network Data Access Protocol

Provenance - information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness

RPI - Rensselaer Polytechnic Institute

TWC - Tetherless World Constellation at Rensselaer Polytechnic Institute

Sponsors:

Tetherless World Constellation

Abstract

Providing proper citation and attribution for published data, derived data products, and the software tools used to generate them has always been an important aspect of scientific research. However, this kind of detailed citation and attribution is often lacking. This is partly because it usually requires manual markup: the tools used to access, manipulate, transform, and visualize data do not typically generate this provenance information dynamically. In addition, the tools themselves lack the information needed for them to be properly cited.

The OPeNDAP Hyrax Software Framework provides access to data in different formats, along with the ability to constrain, manipulate, and transform those data into a common format, the DAP (Data Access Protocol), in order to derive new data products. A user, or another software client, specifies an HTTP URL to access a particular piece of data and transform it to suit a specific purpose. The resulting data products, however, contain no information about the data used to create them or the software process used to generate them, let alone the information that downstream researchers and tool developers would need for proper citation and attribution.

We will present our approach to provenance capture in Hyrax, including a mechanism for reporting derived products, such as publications and reports, back to the hosting site using the W3C PROV Recommendation pingback service. We will demonstrate our use of Semantic Web and Web standards, the development of an information model that extends the PROV model for provenance capture, and the development of the pingback service. We will present our findings, as well as our practices for providing and visualizing provenance information and for developing pingback services, to better enable scientists and tool developers to be recognized and properly cited for their contributions.

Host: opendap.tw.rpi.edu    Client: coyote.example.com

C: GET http://opendap.tw.rpi.edu/opendap/CA_OrangeCo_2011_000402.nc.ascii?constraint
S: 200 OK
S: Link: <http://opendap.tw.rpi.edu/disney/provenance_record>; rel="http://www.w3.org/ns/prov#has_provenance"
S: Link: <http://opendap.tw.rpi.edu/disney/pingback>; rel="http://www.w3.org/ns/prov#pingback"

(CA_OrangeCo_2011_000402 ascii representation)
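A client has to pull the two relation targets out of the Link headers before it can fetch the provenance record or send a pingback. The following is a minimal sketch of that step, not the poster's implementation: the function name `links_by_relation` is hypothetical, the header values are modeled on the response above, and the regular expression only handles the simple `<target>; rel="relation"` form.

```python
import re

# Relation URIs used in the exchange above (defined in the W3C PROV-AQ note).
HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance"
PINGBACK = "http://www.w3.org/ns/prov#pingback"

# Matches one link-value of the simple form <target>; rel="relation".
# Deliberately narrow: other parameters (anchor, type, ...) are not handled.
_LINK = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def links_by_relation(link_headers):
    """Group Link header targets by their rel value (hypothetical helper)."""
    found = {}
    for header in link_headers:
        for target, rel in _LINK.findall(header):
            found.setdefault(rel, []).append(target.strip())
    return found


# Header values modeled on the server response shown above.
response_links = [
    '<http://opendap.tw.rpi.edu/disney/provenance_record>; '
    'rel="http://www.w3.org/ns/prov#has_provenance"',
    '<http://opendap.tw.rpi.edu/disney/pingback>; '
    'rel="http://www.w3.org/ns/prov#pingback"',
]
links = links_by_relation(response_links)
provenance_uri = links[HAS_PROVENANCE][0]  # where the provenance record lives
pingback_uri = links[PINGBACK][0]          # where derivations are reported
```

A production client would more likely rely on an HTTP library's Link header parsing than on a hand-written pattern.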

Host: opendap.tw.rpi.edu    Client: coyote.example.com

C: POST http://opendap.tw.rpi.edu/disney/pingback HTTP/1.1
C: Content-Type: text/uri-list
C:
C: http://coyote.example.org/diagram_abc123/provenance
C: http://coyote.example.org/journal_article_def456/provenance

S: 204 No Content
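The pingback itself is an ordinary POST whose body lists one provenance URI per line. As a rough sketch of how a client might assemble it with the Python standard library (the helper name `build_pingback_request` is an assumption, and the request is built but not sent):

```python
from urllib.request import Request


def build_pingback_request(pingback_uri, provenance_uris):
    """Build (but do not send) a pingback POST with a text/uri-list body."""
    # text/uri-list carries one URI per line with CRLF line endings (RFC 2483).
    body = "".join(uri + "\r\n" for uri in provenance_uris).encode("ascii")
    return Request(
        pingback_uri,
        data=body,
        headers={"Content-Type": "text/uri-list"},
        method="POST",
    )


# URIs taken from the exchange above.
request = build_pingback_request(
    "http://opendap.tw.rpi.edu/disney/pingback",
    [
        "http://coyote.example.org/diagram_abc123/provenance",
        "http://coyote.example.org/journal_article_def456/provenance",
    ],
)
# Passing `request` to urllib.request.urlopen would send it; in the exchange
# above, the server acknowledges the pingback with 204 No Content.
```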

Motivations and Challenges

• Proper data management hinges on recording and maintaining the “steps” applied to create data.

• Consumers require methods to assess whether available data is fit for their usage.

    • Was this dataset produced by a trustworthy source?

• Producers are often expected to justify their efforts in generating new datasets.

    • Who is using our data?

    • What are they using it for? And why?

• However, most current-generation data analysis and manipulation tools fail to capture appropriate meta-information to address these needs.

Use Cases

• A PROV pingback-enabled community collaborates to categorize the points in a LiDAR scan of Disneyland.

    • A client accesses a data point from a LiDAR scan of Disneyland.

    • The client categorizes the point as “water”, which is a new derivation of that point.

    • The client pings back about this new derivation.

• A researcher generates a data product using OPeNDAP and uses it in a derivation. Another researcher, visualizing that derivation, wishes to access the provenance of the data product. What were the original data sources? Can they use them?

• A scientist wishes to discover any derivations of data sources they created.

• OPeNDAP servers are widely used, but are rarely recognized.

W3C PROV Recommendation

Simple Concepts and properties representing the concepts of the PROV Model

Representing OPeNDAP provenance trace for use case 1 using PROV Model.

OPeNDAP Back-End Server Design

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# The namespaces behind the ":" and "opendap:" prefixes are not given on the poster.

:BES_Plan
    rdf:type prov:Plan, prov:Collection ;
    prov:qualifiedInfluence [
        a prov:Influence ;
        prov:entity opendap:NC_Module ;
        prov:hadRole opendap:Read ;
        opendap:order 1 ;
    ] ;
    prov:qualifiedInfluence [
        a prov:Influence ;
        prov:entity opendap:DAP_Module ;
        prov:hadRole opendap:Constrain ;
        opendap:order 2 ;
    ] ;
    prov:qualifiedInfluence [
        a prov:Influence ;
        prov:entity opendap:ASCII_Module ;
        prov:hadRole opendap:Transmit ;
        opendap:order 3 ;
    ] ;
    .

:CA_OrangeCo_2011_000402.nc.ascii
    rdf:type prov:Entity ;
    prov:wasDerivedFrom :NC_File ;
    prov:wasGeneratedBy :BES_Process ;
    .

:BES_Process
    rdf:type prov:Activity ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent :BES_Agent ;
        prov:hadPlan :BES_Plan ;
        rdfs:comment "Execution of BES Server"@en
    ] ;
    .

:BES_Agent
    rdf:type prov:Agent ;
    foaf:name "BES Server"
    .

RDF Representation of provenance collected in first use case

Initial request for data

Pingback of derived data
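The BES plan in the RDF listing is essentially an ordered list of modules, each with a role. One way such a description could be produced is sketched below; this is an illustration only, and the function `plan_to_turtle` is hypothetical rather than part of Hyrax. It emits only the plan resource, not the prefix declarations or the other resources.

```python
def plan_to_turtle(plan_name, steps):
    """Render an ordered list of (module, role) pairs as a Turtle plan."""
    lines = [f":{plan_name}", "    rdf:type prov:Plan, prov:Collection ;"]
    # The position in the list becomes the opendap:order value, starting at 1.
    for order, (module, role) in enumerate(steps, start=1):
        lines += [
            "    prov:qualifiedInfluence [",
            "        a prov:Influence ;",
            f"        prov:entity opendap:{module} ;",
            f"        prov:hadRole opendap:{role} ;",
            f"        opendap:order {order} ;",
            "    ] ;",
        ]
    lines.append("    .")
    return "\n".join(lines)


# Module and role names taken from the RDF listing.
bes_plan = plan_to_turtle(
    "BES_Plan",
    [("NC_Module", "Read"), ("DAP_Module", "Constrain"), ("ASCII_Module", "Transmit")],
)
print(bes_plan)
```

String templating keeps the sketch dependency-free; an RDF library would be the safer choice for anything beyond fixed, trusted names.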

A major goal is to provide visualizations of provenance.

The initial visualization shows the provenance trace, allowing users to get back to the original data, the actions taken on the data, and the agents that performed those actions.

Users should be able to see the software modules that acted on the data in producing the derived data.

Ultimately, the visualization should go all the way to showing who implemented the code that acted on the data.

Take Away

• Tool developers want and deserve credit for the tools that are used in the derivation of data.

• Users of derived data increasingly want to discover how the products were generated and which original data were used, so that they can determine whether they can use the original or derived data.

• Credit should be given to data creators, data curators, data providers, and data users.

Acknowledgements:

OPeNDAP.org for their support and being open source, especially James Gallagher and Nathan Potter.

Our first attempt was to capture provenance after the fact, but in the end we did not have enough information.

Our second attempt collects the information on the go. It does not force module developers to implement anything, but it provides hooks that allow them to contribute.
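The second approach can be illustrated with a small sketch. Everything here is hypothetical (the class names, the `describe_provenance` hook, and the record fields) and it is written in Python for brevity rather than mirroring the actual BES code: the framework records each module as it runs, and a module contributes extra detail only if it chooses to define the hook.

```python
class ProvenanceRecorder:
    """Collects one record per module while the pipeline runs."""

    def __init__(self):
        self.records = []

    def record(self, module, role):
        # Captured by the framework itself; the module does nothing for this.
        entry = {
            "order": len(self.records) + 1,
            "module": type(module).__name__,
            "role": role,
        }
        # Optional hook: a module may volunteer extra detail, but need not.
        hook = getattr(module, "describe_provenance", None)
        if callable(hook):
            entry.update(hook())
        self.records.append(entry)


class NetCDFReader:  # hypothetical module with no provenance code at all
    def run(self, data):
        return data


class AsciiTransmitter:  # hypothetical module that opts in through the hook
    def run(self, data):
        return str(data)

    def describe_provenance(self):
        return {"implemented_by": "example developer"}


def run_pipeline(steps, data, recorder):
    """Run each (module, role) step in order, recording provenance as it goes."""
    for module, role in steps:
        data = module.run(data)
        recorder.record(module, role)
    return data


recorder = ProvenanceRecorder()
result = run_pipeline(
    [(NetCDFReader(), "Read"), (AsciiTransmitter(), "Transmit")], [1, 2, 3], recorder
)
```

Because the recorder looks the hook up at run time, existing modules keep working unchanged, which is the property the second attempt is after.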