Best practices for generating linked data Tutorial @ ICBO 2013

Best practices for generating Bio2RDF linked data

Tutorial Roadmap

Bio2RDF Best Practices

1. Assign a URI for all things
2. Assign labels and identifiers
3. Declare and assign types
4. Provide dataset provenance

1. Assign URIs for all things

● The base Bio2RDF URI pattern: http://bio2rdf.org/namespace:identifier

● Data provider record identifiers are maintained from source

● Linked Data = no blank nodes!

● Data provider records are maintained from source
○ e.g. DrugBank's resource IRI for Leucovorin: http://bio2rdf.org/drugbank:DB00650

● Vocabulary namespaces are used for dataset specific types and predicates

http://bio2rdf.org/drugbank_vocabulary:Drug

● Resource namespaces are used to assign an identifier when one isn't provided by the source

○ generate a unique identifier from a UUID, hash, counter, concatenated strings, etc.

http://bio2rdf.org/drugbank_resource:DB00440_DB00650

● All valid namespaces are listed in the Bio2RDF Life Sciences Registry

○ ensures that URIs are consistent across all Bio2RDF datasets

○ registry is publicly available at http://tinyurl.com/dataregistry
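
The three URI patterns above can be sketched in a few lines of PHP. The function names recordUri(), vocabularyUri() and resourceUri() are hypothetical and exist only for this illustration; they are not part of the Bio2RDF PHP API, and the sketch does not check namespaces against the registry.

```php
<?php
// Hypothetical helpers illustrating the URI patterns; not the Bio2RDF PHP API.
const BIO2RDF_BASE = 'http://bio2rdf.org/';

// Record URI: http://bio2rdf.org/namespace:identifier
function recordUri(string $namespace, string $identifier): string
{
    return BIO2RDF_BASE . $namespace . ':' . $identifier;
}

// Vocabulary URI for dataset-specific types and predicates.
function vocabularyUri(string $namespace, string $term): string
{
    return BIO2RDF_BASE . $namespace . '_vocabulary:' . $term;
}

// Resource URI for things the source does not identify itself;
// here the identifier is built by concatenating source identifiers.
function resourceUri(string $namespace, array $parts): string
{
    return BIO2RDF_BASE . $namespace . '_resource:' . implode('_', $parts);
}

echo recordUri('drugbank', 'DB00650'), "\n";
echo vocabularyUri('drugbank', 'Drug'), "\n";
echo resourceUri('drugbank', ['DB00440', 'DB00650']), "\n";
```

Concatenation is only one of the options listed above; a hash or UUID could be substituted inside resourceUri() without changing the callers.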

2. Assign labels and identifiers

● Use rdfs:label to assign a language-tagged label to all resources
○ can be a source-provided title, a script-generated phrase, or a phrase provided in a third-party dataset
○ Pattern: rdfs:label "label [ns:id]"@lang

● Use Dublin Core predicates for source-provided labels and identifiers
○ Pattern: dc:title "label"@lang (assign a language tag only when one is provided)
○ Pattern: dc:identifier "ns:id"^^xsd:string

● Use Bio2RDF predicates to assign Bio2RDF namespace and Bio2RDF identifiers:

○ Pattern: bio2rdf_vocabulary:namespace "ns"^^xsd:string

○ Pattern: bio2rdf_vocabulary:identifier "id"^^xsd:string

Example: DrugBank entry for Nitrazepam

drugbank:DB0159
    rdfs:label "Nitrazepam [drugbank:DB0159]"@en ;
    dc:title "Nitrazepam"@en ;
    dc:identifier "drugbank:DB0159"^^xsd:string ;
    bio2rdf_vocabulary:namespace "drugbank"^^xsd:string ;
    bio2rdf_vocabulary:identifier "DB0159"^^xsd:string .
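
As a rough sketch, the label and identifier patterns could be generated from a namespace, an identifier and a source title. The function labelStatements() is hypothetical, not a Bio2RDF API call, and it assumes a language tag is always available, which the dc:title pattern does not require.

```php
<?php
// Hypothetical helper for illustration; it is not part of the Bio2RDF PHP API.
function labelStatements(string $ns, string $id, string $title, string $lang): string
{
    $qname = $ns . ':' . $id;
    // Escape backslashes first, then double quotes, so the literal stays valid Turtle.
    $safe = str_replace(['\\', '"'], ['\\\\', '\\"'], $title);

    $lines = [
        sprintf('%s rdfs:label "%s [%s]"@%s ;', $qname, $safe, $qname, $lang),
        sprintf('    dc:title "%s"@%s ;', $safe, $lang),
        sprintf('    dc:identifier "%s"^^xsd:string ;', $qname),
        sprintf('    bio2rdf_vocabulary:namespace "%s"^^xsd:string ;', $ns),
        sprintf('    bio2rdf_vocabulary:identifier "%s"^^xsd:string .', $id),
    ];
    return implode("\n", $lines) . "\n";
}

// Reproduces the Nitrazepam statements shown in the example.
echo labelStatements('drugbank', 'DB0159', 'Nitrazepam', 'en');
```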

3. Declare and assign types

● All resources should be typed as being resources of the dataset
○ Pattern: rdf:type namespace_vocabulary:Resource

● Instances of a dataset vocabulary type should also be typed as owl:NamedIndividual
○ Pattern: rdf:type namespace_vocabulary:Type
○ Pattern: rdf:type owl:NamedIndividual

● Classes should be typed as owl:Class
○ Pattern: rdf:type owl:Class
○ If the superclass has been described using the namespace_vocabulary pattern, then link the class to it using rdfs:subClassOf

● Object properties and datatype properties should also be typed
○ Pattern: rdf:type owl:ObjectProperty
○ Pattern: rdf:type owl:DatatypeProperty

● Examples:

drugbank:DB0159
    rdf:type drugbank_vocabulary:Resource ;
    rdf:type owl:Class ;
    rdfs:subClassOf drugbank_vocabulary:Drug .

drugbank_vocabulary:ddi-interactor-in
    rdf:type owl:ObjectProperty .
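
The typing rules for individuals and classes can be summarised in a small sketch. typeStatements() is a hypothetical function written for this illustration only; how the actual Bio2RDF scripts decide between a class and an individual is not shown in the slides.

```php
<?php
// Hypothetical helper, not part of the Bio2RDF PHP API.
// $vocabType is a local name in the dataset vocabulary (e.g. "Drug"), or null.
function typeStatements(string $qname, string $ns, bool $isClass, ?string $vocabType = null): array
{
    // Every resource is typed as a resource of its dataset.
    $out = ["$qname rdf:type {$ns}_vocabulary:Resource ."];

    if ($isClass) {
        $out[] = "$qname rdf:type owl:Class .";
        if ($vocabType !== null) {
            $out[] = "$qname rdfs:subClassOf {$ns}_vocabulary:$vocabType .";
        }
    } else {
        if ($vocabType !== null) {
            $out[] = "$qname rdf:type {$ns}_vocabulary:$vocabType .";
        }
        $out[] = "$qname rdf:type owl:NamedIndividual .";
    }
    return $out;
}

echo implode("\n", typeStatements('drugbank:DB0159', 'drugbank', true, 'Drug')), "\n";
```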

4. Provide dataset provenance

Figure: data item → (void:inDataset) → Bio2RDF dataset → (prov:wasDerivedFrom) → Source dataset

● Features
○ Entity-dataset link
○ Creator
○ Publisher
○ Date created
○ License & rights
○ Source
○ Availability: SPARQL endpoint, data dump

● Vocabularies
○ VoID
○ Dublin Core
○ W3C Provenance
○ Bio2RDF vocabulary

● Link every resource to the versioned/dated Bio2RDF dataset in which it is described

○ Pattern: void:inDataset <http://bio2rdf.org/dataset:namespace-dd-mm-yyyy.rdf>

○ Example: drugbank:DB0159 void:inDataset <http://bio2rdf.org/dataset:drugbank-03-07-2013> .
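
A dated dataset URI of this shape can be built with PHP's date formatting. The function datasetUri() is hypothetical. It follows the example above (no ".rdf" suffix) rather than the pattern, and it assumes the date in the example is 3 July 2013 written as dd-mm-yyyy.

```php
<?php
// Hypothetical helper, not part of the Bio2RDF PHP API.
function datasetUri(string $namespace, DateTimeInterface $date): string
{
    // 'd-m-Y' yields zero-padded day and month followed by a four-digit year.
    return 'http://bio2rdf.org/dataset:' . $namespace . '-' . $date->format('d-m-Y');
}

$uri = datasetUri('drugbank', new DateTimeImmutable('2013-07-03'));
echo 'drugbank:DB0159 void:inDataset <' . $uri . "> .\n";
```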

A crash course in PHP

PHP : Hypertext Preprocessor

● A general-purpose open source scripting language
○ homepage: http://php.net

● PHP scripts can be executed from the command line or embedded in HTML documents

● Syntactically similar to C/C++/Java, but dynamically and loosely typed

A hello world PHP script

● All PHP scripts are surrounded by the <?php and ?> tags
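
The original slide's code is not preserved in this text, so the following is a minimal stand-in. The file name hello.php is an assumption. The closing ?> tag is left out here because PHP allows it to be omitted when a file contains only PHP code.

```php
<?php
// Run from the command line with: php hello.php
$greeting = "Hello, world!";
echo $greeting . "\n";
```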

Declaring and instantiating classes
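
The code from this slide is also missing from the text, so here is a small sketch of the idea. The Drug class and its members are invented for illustration and do not come from the Bio2RDF code base.

```php
<?php
// An illustrative class; the name and members are hypothetical.
class Drug
{
    private $id;
    private $name;

    // The constructor runs when an object is created with "new".
    public function __construct(string $id, string $name)
    {
        $this->id = $id;
        $this->name = $name;
    }

    // Methods reach the object's own properties through $this.
    public function getLabel(): string
    {
        return $this->name . ' [drugbank:' . $this->id . ']';
    }
}

// Instantiate the class and call a method on the new object.
$drug = new Drug('DB00650', 'Leucovorin');
echo $drug->getLabel(), "\n";
```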

Using the Bio2RDF PHP API to create an RDFizer

● Basic structure of a Bio2RDFizer script:

○ Initialize script parameters - input file(s), default dataset namespace, etc.

○ Define a Run() function that handles downloading and iterating over input files, as well as function calls to parse and convert input data to RDF

○ Define function(s) to convert input data to RDF using Bio2RDF API helper functions

● The Bio2RDF PHP API defines helper functions that implement Bio2RDF best practices:
○ getNamespace()
○ getVoc()
○ getRes()
○ triplify($subject, $predicate, $object) // object is an RDF resource
○ triplifyString($subject, $predicate, "string") // object is a literal
○ describeIndividual($uri, $label, $type, $title, $description, $language)
○ describeClass( ... )
○ describeProperty( ... )
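
To make the script structure concrete, here is a self-contained sketch that mimics it without using the real library. The class SketchRdfizer is hypothetical. Its method names echo the helper list above, but their bodies and return values are assumptions made for this example and may differ from the actual Bio2RDF PHP API. Input is an in-memory array of tab-separated lines instead of a downloaded file.

```php
<?php
// Stand-in for the real API: method names follow the list above,
// but every body here is an assumption written only for this sketch.
class SketchRdfizer
{
    private $ns;

    // Initialize script parameters (here only the dataset namespace).
    public function __construct(string $ns)
    {
        $this->ns = $ns;
    }

    public function getNamespace(): string { return $this->ns . ':'; }
    public function getVoc(): string { return $this->ns . '_vocabulary:'; }

    // Statement whose object is an RDF resource.
    public function triplify(string $s, string $p, string $o): string
    {
        return "$s $p $o .\n";
    }

    // Statement whose object is a literal.
    public function triplifyString(string $s, string $p, string $literal): string
    {
        $escaped = str_replace(['\\', '"'], ['\\\\', '\\"'], $literal);
        return $s . ' ' . $p . ' "' . $escaped . '"' . " .\n";
    }

    // Run(): iterate over the input records and convert each one.
    public function run(array $lines): string
    {
        $rdf = '';
        foreach ($lines as $line) {
            $rdf .= $this->convert($line);
        }
        return $rdf;
    }

    // Convert one tab-separated record (identifier, name) to RDF.
    private function convert(string $line): string
    {
        list($id, $name) = explode("\t", $line);
        $uri = $this->getNamespace() . $id;
        return $this->triplify($uri, 'rdf:type', $this->getVoc() . 'Drug')
             . $this->triplifyString($uri, 'rdfs:label', $name);
    }
}

$rdfizer = new SketchRdfizer('drugbank');
echo $rdfizer->run(["DB00650\tLeucovorin"]);
```

A real Bio2RDFizer would also download its input files and write the statements to an output file; those steps are left out to keep the sketch short.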

Example: The Comparative Toxicogenomics Database

CTD Bio2RDFizer script is available on GitHub

Using and contributing to the Bio2RDF project on GitHub

1. Fork the bio2rdf-scripts and php-lib repositories on GitHub: https://help.github.com/articles/fork-a-repo
2. Write some code!
3. Commit code to your fork
4. Make a pull request to the bio2rdf-scripts repo