Introduction to RDF and the Semantic Web for the life sciences
Simon Jupp
Sample Phenotypes and Ontologies Team
European Bioinformatics Institute
Practical sessions
• Converting data to RDF
• Three questions
1. What types of things are in my data?
2. Can I identify these things?
3. How are these things related to other things?
Gene expression data example
Experiment   Gene name  Ensembl id          organism           organism_part  expression  t-stat       p-value
E-TABM-865   Ms4a1      ENSMUSG00000024673  mus musculus       liver          DOWN        -140.00183   8.40E-34
E-TABM-865   Ms4a1      ENSMUSG00000024673  mus musculus       spleen         UP          140.00183    8.40E-34
E-TABM-865   Mttp       ENSMUSG00000028158  mus musculus       liver          UP          138.82608    8.40E-34
E-TABM-865   Mttp       ENSMUSG00000028158  mus musculus       spleen         DOWN        -138.82608   8.40E-34
E-TABM-865   Akr1c14    ENSMUSG00000033715  mus musculus       liver          UP          132.92674    1.69E-33
E-TABM-865   Akr1c14    ENSMUSG00000033715  mus musculus       spleen         DOWN        -132.92674   1.69E-33
E-TABM-865   Gulo       ENSMUSG00000034450  mus musculus       liver          UP          126.44113    4.51E-33
E-TABM-865   Gulo       ENSMUSG00000034450  mus musculus       spleen         DOWN        -126.44113   4.51E-33
E-TABM-865   Marc1      ENSMUSG00000026621  mus musculus       liver          UP          124.45381    4.66E-33
E-TABM-865   Marc1      ENSMUSG00000026621  mus musculus       spleen         DOWN        -124.45381   4.66E-33
E-GEOD-2852  Gulo       ENSRNOG00000016648  rattus norvegicus  kidney         DOWN        -32.518154   1.09E-42
E-GEOD-2852  Gulo       ENSRNOG00000016648  rattus norvegicus  liver          UP          32.518154    1.09E-42
E-GEOD-2852  Akr1c14    ENSRNOG00000017672  rattus norvegicus  kidney         DOWN        -28.861328   2.29E-39
E-GEOD-2852  Akr1c14    ENSRNOG00000017672  rattus norvegicus  liver          UP          28.861328    2.29E-39
E-GEOD-2852  Akr1c14    ENSRNOG00000017672  rattus norvegicus  kidney         DOWN        -16.854948   2.25E-25
E-GEOD-2852  Akr1c14    ENSRNOG00000017672  rattus norvegicus  liver          UP          16.854948    2.25E-25
E-GEOD-2852  Amacr      ENSRNOG00000018662  rattus norvegicus  kidney         DOWN        -6.296967    7.45E-08
E-GEOD-2852  Amacr      ENSRNOG00000018662  rattus norvegicus  liver          UP          6.296967     7.45E-08
What is it?
• What concepts do we have in this dataset?
• Some hints are already in the column names
Exercise 1 – Concept maps
• Write down the concepts represented in this data (e.g. Experiment)
• Organise the concepts into a graph and write down some relationships between the concepts
Exercise 1 solution
[Figure: concept map. Concepts: Experiment, Expression Value, Experimental factor, Organism, T-statistic, P-value, Ensembl id, Gene name. Relationships: has result, factor value (twice), T-statistic, P-value, Ensembl gene, gene name.]
Congratulations on building your first ontology!
Instance vs Types
• The world (of information) is made up of things, and lots of them
• Instances are also called individuals, objects, tokens or particulars
• The Earth is an instance of Planet
• Simon Jupp (NE 67 41 58 A) is a Person
• E-MTAB-62 is an Experiment
• Your liver is an Organ
Exercise 2 – Identify Types vs Instance data
Item                 Instance  Type
Experiment                     Y
E-TABM-865
Ms4a1
Gene name
ensembl id
ENSMUSG00000024673
organism
mus musculus
organism_part
liver
expression
DOWN
-140.00183
t-stat
p-value
8.40E-34
Exercise 2 solution
Item                 Instance  Type
Experiment           -         Y
E-TABM-865           Y         -
Ms4a1                Y         -
Gene name            -         Y
ensembl id           -         Y
ENSMUSG00000024673   Y         -
organism             -         Y
mus musculus         -         Y
organism_part        -         Y
liver                -         Y
expression           -         Y
DOWN                 -         Y
-140.00183           Y         -
t-stat               -         Y
p-value              -         Y
8.40E-34             Y         -
Giving things identity
• Choose a URI scheme for resources.
• Re-use URIs for types of things where possible
• Shared URIs for the same things make integration happen
• General rule
1. If it’s your data, give it a URI in your namespace.
2. If it’s someone else's data (e.g. UniProt) use a URI from them (if they have one)
Exercise 3 – your data vs shared data
Item                 Instance  Type  Mine
Experiment           -         Y
E-TABM-865           Y         -
Ms4a1                Y         -
Gene name            -         Y
ensembl id           -         Y
ENSMUSG00000024673   Y         -     N
organism             -         Y
mus musculus         -         Y
organism_part        -         Y
liver                -         Y
expression           -         Y
DOWN                 -         Y
-140.00183           Y         -
t-stat               -         Y
p-value              -         Y
8.40E-34             Y         -
Exercise 3 solution
Item                 Instance  Type  Mine
Experiment           -         Y     N
E-TABM-865           Y         -     Y
Ms4a1                Y         -     N
Gene name            -         Y     N
ensembl id           -         Y     N
ENSMUSG00000024673   Y         -     N
organism             -         Y     N
mus musculus         -         Y     N
organism_part        -         Y     N
liver                -         Y     N
expression           -         Y     N
DOWN                 -         Y     N
-140.00183           Y         -     Y
t-stat               -         Y     N
p-value              -         Y     N
8.40E-34             Y         -     Y
Types of things usually belong in external reference ontologies. It is good practice to try to connect your data to these ontologies.
Your data is usually the instance data (the experiment or the results)
URI for an instance
• Ensembl Gene ENSMUSG00000024673
• http://www.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000024673
• Is this a good URI?
• Is it stable? What does it represent?
• This is a URL for a web page; it may change
• It doesn’t return RDF
Identifiers.org
• http://identifiers.org
• Identifiers.org is a system providing resolvable persistent URIs used to identify data for the scientific community, with a current focus on the Life Sciences domain. The provision of resolvable identifiers (URLs) fits well with the Semantic Web vision and the Linked Data initiative.
Exercise 4
• Use the identifiers.org website to find the URI for ENSMUSG00000024673
Exercise 4 solution
• Search identifiers.org for ensembl
• Go to http://www.ebi.ac.uk/miriam/main/collections/MIR:00000003
• Find root URL
• http://identifiers.org/ensembl/ENSMUSG00000024673
• See what it resolves to
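The URI pattern in this solution (root URL, collection name, accession) can be applied to any accession in the same collection. A minimal Python sketch follows; the function name identifiers_uri is made up for illustration and is not part of any identifiers.org client library.

```python
def identifiers_uri(collection: str, accession: str) -> str:
    """Build a URI following the http://identifiers.org/<collection>/<accession> pattern."""
    return f"http://identifiers.org/{collection}/{accession}"


# The gene from the example table.
gene_uri = identifiers_uri("ensembl", "ENSMUSG00000024673")
print(gene_uri)  # http://identifiers.org/ensembl/ENSMUSG00000024673
```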
URI for types
• Experimental factor “liver”
• “liver” is an organ. We would expect to find an ontology term that describes what a liver is
• BioPortal is a repository of biomedical ontologies
• https://bioportal.bioontology.org
Exercise 5
• Go to https://bioportal.bioontology.org and find ontologies that contain terms for “liver”, “spleen” and “kidney”
• Get the URIs for liver, spleen and kidney from the Experimental Factor Ontology (EFO)
Exercise 5 solution
• “liver”
• http://purl.obolibrary.org/obo/UBERON_0002107
• “spleen”
• http://purl.obolibrary.org/obo/UBERON_0002106
• “kidney”
• http://purl.obolibrary.org/obo/UBERON_0002113
Exercise 5 – Find URIs using BioPortal for types and identifiers.org for instances
• Restrict the types search to the EFO, UBERON, SIO, OBI and EDAM ontologies
Item                 Instance  Type  Mine  URI
Experiment           -         Y     N
E-TABM-865           Y         -     Y
Ms4a1                Y         -     N
Gene name            -         Y     N
ensembl id           -         Y     N
ENSMUSG00000024673   Y         -     N     http://identifiers.org/ensembl/ENSMUSG00000024673
organism             -         Y     N
mus musculus         -         Y     N
organism_part        -         Y     N
liver                -         Y     N     http://purl.obolibrary.org/obo/UBERON_0002107
expression           -         Y     N
DOWN                 -         Y     N
-140.00183           Y         -     Y
t-stat               -         Y     N
p-value              -         Y     N
8.40E-34             Y         -     Y
Exercise 5 solution – find some more URIs
Item                 Instance  Type  Mine  URI
Experiment           -         Y     N     http://www.ebi.ac.uk/efo/EFO_0004033
E-TABM-865           Y         -     Y     N/A
Ms4a1                Y         -     N     N/A (this is just a label for the Ensembl gene)
Gene name            -         Y     N     http://edamontology.org/data_2299
ensembl id           -         Y     N     http://edamontology.org/data_2610
ENSMUSG00000024673   Y         -     N     http://identifiers.org/ensembl/ENSMUSG00000024673
organism             -         Y     N     http://purl.obolibrary.org/obo/OBI_0100026
mus musculus         -         Y     N     http://purl.obolibrary.org/obo/NCBITaxon_10090
organism_part        -         Y     N     http://www.ebi.ac.uk/efo/EFO_0000635
liver                -         Y     N     http://purl.obolibrary.org/obo/UBERON_0002107
expression           -         Y     N     http://edamontology.org/topic_0203
DOWN                 -         Y     N     http://semanticscience.org/resource/SIO_001078
-140.00183           Y         -     Y     N/A
t-stat               -         Y     N     http://semanticscience.org/resource/SIO_001074
p-value              -         Y     N     http://semanticscience.org/resource/SIO_000765
8.40E-34             Y         -     Y     N/A
Building the RDF graph
• We have identified our types with URIs
• We know what data is ours
• Now we need to translate each row in the file to an RDF representation using N-Triples
• <Subject> <Predicate> <Object>
• Remember the Object can be a URI or a value
• For predicates create URIs in our own namespace
• http://www.mydomain.com/mydata#
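As a rough illustration of the subject/predicate/object pattern, the Python sketch below writes two statements in N-Triples syntax: URIs go in angle brackets, values go in double quotes, and each statement ends with a full stop. The helper names (uri, literal, triple) are made up for this example; the mydata namespace, the result1 node and the p-value predicate follow the naming used in these slides.

```python
MYDATA = "http://www.mydomain.com/mydata#"
EFO = "http://www.ebi.ac.uk/efo/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value: str) -> str:
    """Wrap a full URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain value, escaping backslashes, quotes and line breaks."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Join the three parts and terminate the statement with ' .'."""
    return f"{subject} {predicate} {obj} ."


lines = [
    # The object is a URI: the experiment is typed as an EFO Experiment.
    triple(uri(MYDATA + "E-TABM-865"), uri(RDF_TYPE), uri(EFO + "EFO_0004033")),
    # The object is a value: a p-value attached to a result node.
    triple(uri(MYDATA + "result1"), uri(MYDATA + "p-value"), literal("8.40E-34")),
]
print("\n".join(lines))
```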
Example row conversion to RDF
Row: E-TABM-865  Ms4a1  ENSMUSG00000024673  mus musculus  liver  DOWN  -140.00183  8.40E-34
[Figure: E-TABM-865 --type--> Experiment]
RDF triples (SUBJECT PREDICATE OBJECT)
mydata:E-TABM-865 rdf:type efo:EFO_0004033
Example row conversion to RDF
Row: E-TABM-865  Ms4a1  ENSMUSG00000024673  mus musculus  liver  DOWN  -140.00183  8.40E-34
[Figure: E-TABM-865 --type--> Experiment; E-TABM-865 --has result--> mydata:result1; mydata:result1 --type--> Down Expression Value]
RDF triples (SUBJECT PREDICATE OBJECT)
mydata:E-TABM-865 rdf:type efo:EFO_0004033
mydata:E-TABM-865 mydata:hasResult mydata:result1
mydata:result1 rdf:type sio:SIO_001078
Exercise 6
• Using the following schema, write out some RDF in N-Triples to represent this single row of data
[Figure: schema. Concepts: Experiment, Expression Value, Experimental factor, Organism, T-statistic, P-value, Ensembl id, Gene name. Relationships: has result, factor value (twice), T-stat, P-value, dbxref, label.]
Row: E-TABM-865  Ms4a1  ENSMUSG00000024673  mus musculus  liver  DOWN  -140.00183  8.40E-34
Exercise 6 solution
RDF triples (SUBJECT PREDICATE OBJECT)
mydata:E-TABM-865 rdf:type efo:EFO_0004033
mydata:E-TABM-865 mydata:hasResult mydata:result1
mydata:result1 rdf:type sio:SIO_001078
mydata:result1 mydata:factorValue obo:NCBITaxon_10090
mydata:result1 mydata:factorValue obo:UBERON_0002107
mydata:result1 mydata:t-stat "-140.00183"
mydata:result1 mydata:p-value "8.40E-34"
mydata:result1 mydata:dbxref identifiers:ENSMUSG00000024673
identifiers:ENSMUSG00000024673 rdfs:label "Ms4a1"
Row: E-TABM-865  Ms4a1  ENSMUSG00000024673  mus musculus  liver  DOWN  -140.00183  8.40E-34
Generating RDF
• CSV2RDF
• OpenRefine
• Scripts
• Output serialised RDF
• Simple to print out N3 to files
• Use an RDF API
• Most programming languages have RDF libraries
• Other options
• RDB2RDF: Work directly off your relational database
A simple CSV 2 RDF in Perl
• Example script data2rdf.pl
• Read input file (raw-data.csv)
• Convert rows into triple statements according to my schema
• Generate appropriate URIs for things
• Print out triple statement in simple N3 format
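The data2rdf.pl script itself is not reproduced in these slides, so the following is only a Python sketch of the same four steps, not the author's implementation. It assumes a comma-separated file with the eight columns of the example table and no header row, reuses the predicate names from the Exercise 6 solution, and numbers result nodes by row (result1, result2, ...), which is an arbitrary choice made for this example.

```python
import csv
import io

MYDATA = "http://www.mydomain.com/mydata#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EXPERIMENT = "http://www.ebi.ac.uk/efo/EFO_0004033"
SIO = "http://semanticscience.org/resource/"
OBO = "http://purl.obolibrary.org/obo/"

# Lookup tables built from the URIs given elsewhere in these slides.
EXPRESSION = {"UP": SIO + "SIO_001081", "DOWN": SIO + "SIO_001078"}
ORGANISM_PART = {
    "liver": OBO + "UBERON_0002107",
    "spleen": OBO + "UBERON_0002106",
    "kidney": OBO + "UBERON_0002113",
}

# Two rows of the example table; the comma-separated layout is an assumption.
SAMPLE_CSV = """\
E-TABM-865,Ms4a1,ENSMUSG00000024673,mus musculus,liver,DOWN,-140.00183,8.40E-34
E-TABM-865,Ms4a1,ENSMUSG00000024673,mus musculus,spleen,UP,140.00183,8.40E-34
"""


def rows_to_ntriples(handle):
    """Turn each CSV row into a list of N-Triples statements."""
    triples = []
    for number, row in enumerate(csv.reader(handle), start=1):
        experiment, name, ensembl_id, _organism, part, direction, tstat, pvalue = row
        exp = f"<{MYDATA}{experiment}>"
        result = f"<{MYDATA}result{number}>"  # one result node per row
        gene = f"<http://identifiers.org/ensembl/{ensembl_id}>"
        triples += [
            f"{exp} <{RDF_TYPE}> <{EXPERIMENT}> .",
            f"{exp} <{MYDATA}hasResult> {result} .",
            f"{result} <{RDF_TYPE}> <{EXPRESSION[direction]}> .",
            f"{result} <{MYDATA}factorValue> <{ORGANISM_PART[part]}> .",
            f'{result} <{MYDATA}t-stat> "{tstat}" .',
            f'{result} <{MYDATA}p-value> "{pvalue}" .',
            f"{result} <{MYDATA}dbxref> {gene} .",
            f'{gene} <{RDFS_LABEL}> "{name}" .',
        ]
    return triples


ntriples = rows_to_ntriples(io.StringIO(SAMPLE_CSV))
print("\n".join(ntriples))
```

Some statements (for example the experiment's type) are repeated for every row. That is harmless, because an RDF graph is a set of triples and a store keeps each one only once.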
Exercise 7
• Look at the N-Triples file generated (raw-data.rdf)
• See if you understand how that translates to the schema
• Convert this file to RDF/XML using an online converter
• http://www.rdfabout.com/demo/validator
Blank nodes (bnode)
• You can use an anonymous resource in RDF
• They can be the subject or object of any triple
• Denote the existence of a “thing” but you don’t have to explicitly give it a URI
• In our scenario we created a URI for the gene expression value, but we didn't have to
• Using Turtle syntax we could have said:
mydata:E-TABM-865 rdf:type efo:EFO_0004033 .
mydata:E-TABM-865 mydata:hasResult [
    rdf:type sio:SIO_001078 ;
    mydata:factorValue obo:UBERON_0002107 ;
    mydata:t-stat "-140.00183"
] .
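N-Triples has no square-bracket shorthand; there a blank node is written with a local label of the form _:name. The Python sketch below shows one way a script might emit the same result using a blank node instead of a minted URI. The function names and the b1, b2, ... labelling scheme are assumptions made for this example.

```python
from itertools import count

MYDATA = "http://www.mydomain.com/mydata#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

_ids = count(1)


def new_bnode() -> str:
    """Return a fresh blank node label; it is only meaningful within one file."""
    return f"_:b{next(_ids)}"


def result_triples(experiment: str, expression_uri: str, factor_uri: str, tstat: str):
    """Describe one result with a blank node rather than a URI of our own."""
    node = new_bnode()
    return [
        f"<{MYDATA}{experiment}> <{MYDATA}hasResult> {node} .",
        f"{node} <{RDF_TYPE}> <{expression_uri}> .",
        f"{node} <{MYDATA}factorValue> <{factor_uri}> .",
        f'{node} <{MYDATA}t-stat> "{tstat}" .',
    ]


lines = result_triples(
    "E-TABM-865",
    "http://semanticscience.org/resource/SIO_001078",
    "http://purl.obolibrary.org/obo/UBERON_0002107",
    "-140.00183",
)
print("\n".join(lines))
```

The trade-off is that a blank node cannot be linked to from outside the file, so a URI is the better choice when other datasets may need to refer to the result.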
Querying RDF
• Specialised databases for indexing RDF graphs
• Examples: Stardog, Apache Jena, Sesame, Virtuoso, AllegroGraph, OWLIM
OpenRDF Sesame
• http://www.openrdf.org
• OpenRDF Sesame is a de facto standard framework for processing RDF data. This includes parsers, storage solutions (RDF databases, a.k.a. triplestores), reasoning and querying, using the SPARQL query language. It offers a flexible and easy-to-use Java API that can be connected to all leading RDF storage solutions.
• Easy to deploy (Java servlet)
• Provides SPARQL endpoint and workbench for administration tasks
• Scalable to millions of triples
• Other more scalable implementations of the storage and inference layer available
• OWLIM
• Virtuoso
• Bigdata
The Sesame workbench
• We have a workbench online for you to play with
• http://goo.gl/K5wmIe
• (http://ec2-54-72-241-21.eu-west-1.compute.amazonaws.com/openrdf-workbench)
• Use this to create a repository
• Upload data
• Test queries
Exercise 8
• Create a new in-memory store repository for your data
Exercise 9
• Load the RDF data file (use raw-data.rdf from the Dropbox folder)
• Set Data format to N-Triples
• Set base URI to
• http://www.mydomain.com/mydata#
SPARQL endpoint
Exploring a SPARQL endpoint
• Show me some triples
• Select all data = not a very friendly query!
• Find the types of things
• http://www.w3.org/TR/rdf-sparql-query/
SELECT * WHERE { ?subject ?predicate ?object }

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?type
WHERE { ?subject rdf:type ?type }
LIMIT 10
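Outside the workbench, the same queries can be sent to an endpoint over HTTP: the SPARQL protocol accepts the query text in a "query" parameter of a GET request. The Python sketch below only builds such a request and does not send it. The endpoint URL is a placeholder, not the address of the course server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address; substitute the URL of your own repository.
ENDPOINT = "http://localhost:8080/openrdf-sesame/repositories/mydata"

QUERY = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?type
WHERE { ?subject rdf:type ?type }
LIMIT 10
"""


def build_request(endpoint: str, query: str) -> Request:
    """Encode the query into the URL and ask for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
# Against a live endpoint the request could be sent with
# urllib.request.urlopen(request).read()
```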
Describing a resource
• What is known about
• http://www.mydomain.com/mydata#E-TABM-865
DESCRIBE <http://www.mydomain.com/mydata#E-TABM-865>
Exercise 11 – SPARQL endpoint
• Try some of the previous queries on the SPARQL endpoint
• Explore clicking around URIs to follow links through the data
• Explore download formats
• SPARQL query results XML, JSON, CSV
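The JSON download format is convenient for scripts. In the W3C SPARQL results JSON layout, the variable names sit under head/vars and each solution is an entry under results/bindings. The Python sketch below flattens such a document into plain dictionaries; the sample document is hand-written for illustration and is not output captured from the course endpoint.

```python
import json

# Hand-written sample in the SPARQL results JSON layout (illustrative only).
SAMPLE = """
{
  "head": {"vars": ["type"]},
  "results": {
    "bindings": [
      {"type": {"type": "uri", "value": "http://www.ebi.ac.uk/efo/EFO_0004033"}},
      {"type": {"type": "uri", "value": "http://semanticscience.org/resource/SIO_001078"}}
    ]
  }
}
"""


def rows(document: dict):
    """Yield one {variable: value} dictionary per query solution."""
    variables = document["head"]["vars"]
    for binding in document["results"]["bindings"]:
        # A variable can be unbound in a given solution, so check before reading it.
        yield {name: binding[name]["value"] for name in variables if name in binding}


table = list(rows(json.loads(SAMPLE)))
for row in table:
    print(row["type"])
```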
Binding variables
• Get all things that are types of experiment
• Experiment URI http://www.ebi.ac.uk/efo/EFO_0004033
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?thing
WHERE { ?thing rdf:type efo:EFO_0004033 }
LIMIT 10
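To see what "binding" means, it can help to imitate it on a handful of triples. The toy Python matcher below is only a teaching sketch, not how a triple store is implemented: anything starting with "?" is a variable, and a solution is a set of variable values that makes every pattern match some triple in the data. The three triples are the ones from the row conversion example.

```python
TRIPLES = [
    ("mydata:E-TABM-865", "rdf:type", "efo:EFO_0004033"),
    ("mydata:E-TABM-865", "mydata:hasResult", "mydata:result1"),
    ("mydata:result1", "rdf:type", "sio:SIO_001078"),
]


def match(pattern, triple, bindings):
    """Try to extend the bindings so that the pattern equals the triple."""
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns, triples):
    """Return every combination of bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for solution in solutions
            for triple in triples
            if (extended := match(pattern, triple, solution)) is not None
        ]
    return solutions


experiments = query([("?thing", "rdf:type", "efo:EFO_0004033")], TRIPLES)
print(experiments)

# Two patterns sharing ?result behave like a join.
typed_results = query(
    [("?exp", "mydata:hasResult", "?result"), ("?result", "rdf:type", "?type")],
    TRIPLES,
)
print(typed_results)
```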
Exercise 12
• Write a SPARQL query to get the labels for all experiments (hint: Use the rdfs:label relation)
• Tip: Store SPARQL queries that work in a text file, easier to edit and re-use previous queries
Exercise 12 solution
• Select labels for all experiments
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?label
WHERE {
  ?thing rdf:type efo:EFO_0004033 .
  ?thing rdfs:label ?label
}
Exercise 13
• Explore the raw-data.rdf files and try and write a SPARQL query that would show you all the genes UP in “liver” samples
• Hint:
• UP = http://semanticscience.org/resource/SIO_001081
• “liver” = http://purl.obolibrary.org/obo/UBERON_0002107
Exercise 13 solution
• Get genes up regulated in liver samples
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?geneid ?label
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  ?result rdf:type sio:SIO_001081 .
  ?result mydata:hasFactorValue obo:UBERON_0002107
}
Filtering SPARQL queries
• Restrict values in results from matches in the graph patterns
• String matching
• FILTER regex(?x, "pattern" [, "flags"])
• E.g. FILTER regex(?label, "E-TABM-865")
• Testing values
• FILTER (?tstat >0 24)
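The two kinds of filter have close analogues in ordinary code, which may help when reasoning about what a query will return. The Python sketch below applies a case-insensitive pattern match and a numeric test to four rows of the example table. Note the explicit float() conversion: if t-statistics are stored as plain strings, as in the Exercise 6 solution, a numeric comparison in SPARQL would likewise need typed literals or a cast such as xsd:double(?tstat).

```python
import re

# Four rows taken from the example table: (experiment label, gene, t-statistic).
ROWS = [
    ("E-TABM-865", "Ms4a1", "-140.00183"),
    ("E-TABM-865", "Gulo", "126.44113"),
    ("E-GEOD-2852", "Gulo", "-32.518154"),
    ("E-GEOD-2852", "Amacr", "6.296967"),
]

# Analogue of FILTER regex(?label, "geod", "i"): the "i" flag ignores case.
geod_rows = [row for row in ROWS if re.search("geod", row[0], re.IGNORECASE)]

# Analogue of FILTER (?tstat < 0): compare as numbers, not as strings.
negative_genes = [gene for _label, gene, tstat in ROWS if float(tstat) < 0]

print(geod_rows)
print(negative_genes)
```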
Exercise 14
• Get all experiments whose label contains “GEOD”
• Get all genes up regulated with a t-statistic < 0
Exercise 14 solutions
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?label
WHERE {
  ?thing rdf:type efo:EFO_0004033 .
  ?thing rdfs:label ?label .
  FILTER regex(?label, "geod", "i")
}

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
SELECT DISTINCT ?geneid ?label ?tstat
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  ?result mydata:hasTStatistic ?tstat .
  FILTER (?tstat < 0)
}
Enriching data
• Our dataset is still a bit sparse
• e.g. no labels or descriptions for sample information
• We used URIs from external ontologies to define some concepts
• Let’s integrate our dataset with those ontologies and do some querying
Exercise 15
• Find the Experimental Factor Ontology (EFO) file
• Can get from Web or efo.owl in the course material
• Load the ontology file into the same repository as your raw data RDF
• Now describe the liver URI
• http://purl.obolibrary.org/obo/UBERON_0002107
• Create a SPARQL query to pull out labels for all of the factor values
Exercise 15 solution
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
SELECT DISTINCT ?factor ?label
WHERE {
  ?result mydata:hasFactorValue ?factor .
  ?factor rdfs:label ?label
}
DESCRIBE <http://purl.obolibrary.org/obo/UBERON_0002107>
Exploiting knowledge
• As an ontology, EFO contains lots of biological domain knowledge
• E.g. classification of diseases, organism parts etc.
• We can exploit this knowledge to enhance queries over our datasets
• E.g. what are all the parent types (or categories) for liver in EFO?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?parent ?label
WHERE {
  obo:UBERON_0002107 rdfs:subClassOf ?parent .
  ?parent rdfs:label ?label
}
Property paths
• We can query along paths of relations using SPARQL
• This is useful for exploiting transitive relationships
• Special SPARQL 1.1 syntax for property paths “*”
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?parent ?label
WHERE {
  obo:UBERON_0002107 rdfs:subClassOf* ?parent .
  ?parent rdfs:label ?label
}
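The difference between rdfs:subClassOf and rdfs:subClassOf* is the difference between one step and zero or more steps up the hierarchy. The Python sketch below walks a small, made-up hierarchy to show this; the ex: class names are invented for illustration and do not reproduce the real UBERON or EFO classification.

```python
# Invented hierarchy for illustration only: child class -> direct parent classes.
SUBCLASS_OF = {
    "ex:liver": ["ex:abdominal_organ"],
    "ex:abdominal_organ": ["ex:organ"],
    "ex:organ": ["ex:organism_part"],
}


def direct_parents(cls, hierarchy):
    """One step up: what a plain rdfs:subClassOf pattern matches."""
    return list(hierarchy.get(cls, []))


def reachable_classes(cls, hierarchy):
    """Zero or more steps up: what rdfs:subClassOf* matches, the class itself included."""
    seen = []
    stack = [cls]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles and repeated paths
        seen.append(node)
        stack.extend(hierarchy.get(node, []))
    return seen


one_step = direct_parents("ex:liver", SUBCLASS_OF)
any_steps = reachable_classes("ex:liver", SUBCLASS_OF)
print(one_step)
print(any_steps)
```

Because "*" allows zero steps, the starting class is part of the result. SPARQL 1.1 also offers "+" for one or more steps when the starting class should be left out.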
Exercise 16 – Ontology query
• Get all genes expressed in your data where the factor value is a child of “organism part” (efo:EFO_0000635)
Exercise 16 solution
• Get all genes expressed in your data where the factor value is a child of “organism part” (efo:EFO_0000635)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?geneid ?label ?factor
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  ?result mydata:hasFactorValue ?factor .
  ?factor rdfs:subClassOf* efo:EFO_0000635
}
End of 1st practical session
• Introduced modeling data in RDF
• Three questions I always ask of data
• What is it (types)?
• What is it (id)?
• What is it related to?
• Generating RDF statements in N-Triples
• Loading RDF into a triple store
• Basic querying with SPARQL