Introduction to RDF and the Semantic Web for the life sciences Simon Jupp Sample Phenotypes and Ontologies Team European Bioinformatics Institute [email protected]


Introduction to RDF and the Semantic Web for the life sciences

Simon Jupp

Sample Phenotypes and Ontologies Team

European Bioinformatics Institute

[email protected]


Practical sessions

•  Converting data to RDF

•  Three questions

1.  What types of things are in my data?

2.  Can I identify these things?

3.  How are these things related to other things?


Gene expression data example

Experiment    Gene name   Ensembl id           organism            organism_part   expression   t-stat       p-value
E-TABM-865    Ms4a1       ENSMUSG00000024673   mus musculus        liver           DOWN         -140.00183   8.40E-34
E-TABM-865    Ms4a1       ENSMUSG00000024673   mus musculus        spleen          UP           140.00183    8.40E-34
E-TABM-865    Mttp        ENSMUSG00000028158   mus musculus        liver           UP           138.82608    8.40E-34
E-TABM-865    Mttp        ENSMUSG00000028158   mus musculus        spleen          DOWN         -138.82608   8.40E-34
E-TABM-865    Akr1c14     ENSMUSG00000033715   mus musculus        liver           UP           132.92674    1.69E-33
E-TABM-865    Akr1c14     ENSMUSG00000033715   mus musculus        spleen          DOWN         -132.92674   1.69E-33
E-TABM-865    Gulo        ENSMUSG00000034450   mus musculus        liver           UP           126.44113    4.51E-33
E-TABM-865    Gulo        ENSMUSG00000034450   mus musculus        spleen          DOWN         -126.44113   4.51E-33
E-TABM-865    Marc1       ENSMUSG00000026621   mus musculus        liver           UP           124.45381    4.66E-33
E-TABM-865    Marc1       ENSMUSG00000026621   mus musculus        spleen          DOWN         -124.45381   4.66E-33
E-GEOD-2852   Gulo        ENSRNOG00000016648   rattus norvegicus   kidney          DOWN         -32.518154   1.09E-42
E-GEOD-2852   Gulo        ENSRNOG00000016648   rattus norvegicus   liver           UP           32.518154    1.09E-42
E-GEOD-2852   Akr1c14     ENSRNOG00000017672   rattus norvegicus   kidney          DOWN         -28.861328   2.29E-39
E-GEOD-2852   Akr1c14     ENSRNOG00000017672   rattus norvegicus   liver           UP           28.861328    2.29E-39
E-GEOD-2852   Akr1c14     ENSRNOG00000017672   rattus norvegicus   kidney          DOWN         -16.854948   2.25E-25
E-GEOD-2852   Akr1c14     ENSRNOG00000017672   rattus norvegicus   liver           UP           16.854948    2.25E-25
E-GEOD-2852   Amacr       ENSRNOG00000018662   rattus norvegicus   kidney          DOWN         -6.296967    7.45E-08
E-GEOD-2852   Amacr       ENSRNOG00000018662   rattus norvegicus   liver           UP           6.296967     7.45E-08


What is it?

•  What concepts do we have in this dataset?

•  Some hints are already in the column names



Exercise 1 – Concept maps

•  Write down the concepts represented in this data (e.g. Experiment)

•  Organise the concepts into a graph and write down some relationships between the concepts


Exercise 1 solution

[Concept map diagram]

Concepts (nodes): Experiment, Expression Value, Experimental factor, Organism, Ensembl id, Gene name, T-statistic, P-value

Relationships (edge labels): Has result, Factor value (x2), T-statistic, P-value, Ensembl gene, Gene name

Congratulations on building your first Ontology!


Instance vs Types

•  The world (of information) is made up of things and lots of them

•  Instances, individuals, objects, tokens, particulars.

•  The Earth is a Planet

•  Simon Jupp (NE 67 41 58 A) is a Person

•  E-MTAB-62 is an Experiment

•  Your liver is an Organ


Exercise 2 – Identify Types vs Instance data

Item                 Instance / Type
Experiment           Y
E-TABM-865
Ms4a1
Gene name
ensembl id
ENSMUSG00000024673
organism
mus musculus
organism_part
liver
expression
DOWN
-140.00183
t-stat
p-value
8.40E-34


Exercise 2 solution

Item                 Instance / Type
Experiment           Y
E-TABM-865           Y
Ms4a1                Y
Gene name            Y
ensembl id           Y
ENSMUSG00000024673   Y
organism             Y
mus musculus         Y
organism_part        Y
liver                Y
expression           Y
DOWN                 Y
-140.00183           Y
t-stat               Y
p-value              Y
8.40E-34             Y


Giving things identity

•  Choose a URI scheme for resources.

•  Re-use URIs for types of things where possible

•  Shared URIs for the same things make integration happen

•  General rule

1.  If it’s your data, give it a URI in your namespace.

2.  If it’s someone else's data (e.g. UniProt) use a URI from them (if they have one)


Exercise 3 – your data vs shared data

Item                 Instance / Type   Mine
Experiment           Y
E-TABM-865           Y
Ms4a1                Y
Gene name            Y
ensembl id           Y
ENSMUSG00000024673   Y                 N
organism             Y
mus musculus         Y
organism_part        Y
liver                Y
expression           Y
DOWN                 Y
-140.00183           Y
t-stat               Y
p-value              Y
8.40E-34             Y


Exercise 3 solution

Item                 Instance / Type   Mine
Experiment           Y                 N
E-TABM-865           Y                 Y
Ms4a1                Y                 N
Gene name            Y                 N
ensembl id           Y                 N
ENSMUSG00000024673   Y                 N
organism             Y                 N
mus musculus         Y                 N
organism_part        Y                 N
liver                Y                 N
expression           Y                 N
DOWN                 Y                 N
-140.00183           Y                 Y
t-stat               Y                 N
p-value              Y                 N
8.40E-34             Y                 Y

Types of things usually belong in external reference ontologies. It is good practice to try to connect your data to these ontologies.

Your data is usually the instance data (the experiment or the results).


URI for an instance

•  Ensembl  Gene  ENSMUSG00000024673  

•  http://www.ensembl.org/Mus_musculus/Gene/Summary?g=ENSMUSG00000024673

•  Is this a good URI?

•  Is it stable? What does it represent?

•  This is a URL for the web page, it may change

•  It doesn’t return RDF


Identifiers.org

•  http://identifiers.org

•  Identifiers.org is a system providing resolvable persistent URIs used to identify data for the scientific community, with a current focus on the Life Sciences domain. The provision of resolvable identifiers (URLs) fits well with the Semantic Web vision and the Linked Data initiative.


Exercise 4

•  Use the identifiers.org website to find the URI for ENSMUSG00000024673


Exercise 4 solution

•  Search identifiers.org for ensembl

•  Go to http://www.ebi.ac.uk/miriam/main/collections/MIR:00000003

•  Find root URL

•  http://identifiers.org/ensembl/ENSMUSG00000024673

•  See what it resolves to
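The general rule from the "Giving things identity" slide and the URI pattern found in this solution can be sketched in a few lines of Python. This is only an illustration: the function names `my_uri` and `identifiers_uri` are made up for this example, and the own-data namespace is the placeholder `http://www.mydomain.com/mydata#` used later in the slides.

```python
from urllib.parse import quote

MYDATA = "http://www.mydomain.com/mydata#"   # placeholder namespace for our own data
IDENTIFIERS = "http://identifiers.org/"      # shared resolver for other people's records


def my_uri(local_id):
    """URI for something that is our own data, e.g. an experiment accession."""
    return MYDATA + quote(local_id, safe="")


def identifiers_uri(collection, accession):
    """URI for someone else's record, following the pattern found in Exercise 4."""
    return f"{IDENTIFIERS}{collection}/{quote(accession, safe='')}"


print(my_uri("E-TABM-865"))
print(identifiers_uri("ensembl", "ENSMUSG00000024673"))
```

Percent-encoding the local part is a precaution for identifiers that contain spaces or other reserved characters; the accessions in this dataset pass through unchanged.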


URI for types

•  Experimental factor “liver”

•  “liver” is an organ. We would expect to find an ontology term that describes what a liver is

•  BioPortal is a repository of biomedical ontologies

•  https://bioportal.bioontology.org


Exercise 5

•  Go to https://bioportal.bioontology.org and find ontologies that contain terms for “liver”, “spleen” and “kidney”

•  Get the URIs for liver, spleen and kidney from the Experimental Factor Ontology (EFO)


Exercise 5 solution

•  “liver”

•  http://purl.obolibrary.org/obo/UBERON_0002107

•  “spleen”

•  http://purl.obolibrary.org/obo/UBERON_0002106

•  “kidney”

•  http://purl.obolibrary.org/obo/UBERON_0002113


Exercise 5 – Find URIs using BioPortal for types and identifiers.org for instances

Restrict the types search to the EFO, UBERON, SIO, OBI and EDAM ontologies

Item                 Instance / Type   Mine   URI
Experiment           Y                 N
E-TABM-865           Y                 Y
Ms4a1                Y                 N
Gene name            Y                 N
ensembl id           Y                 N
ENSMUSG00000024673   Y                 N      http://identifiers.org/ensembl/ENSMUSG00000024673
organism             Y                 N
mus musculus         Y                 N
organism_part        Y                 N
liver                Y                 N      http://purl.obolibrary.org/obo/UBERON_0002107
expression           Y                 N
DOWN                 Y                 N
-140.00183           Y                 Y
t-stat               Y                 N
p-value              Y                 N
8.40E-34             Y                 Y


Exercise 5 solution – find some more URIs

Item                 Instance / Type   Mine   URI
Experiment           Y                 N      http://www.ebi.ac.uk/efo/EFO_0004033
E-TABM-865           Y                 Y      N/A
Ms4a1                Y                 N      N/A (this is just a label for the Ensembl gene)
Gene name            Y                 N      http://edamontology.org/data_2299
ensembl id           Y                 N      http://edamontology.org/data_2610
ENSMUSG00000024673   Y                 N      http://identifiers.org/ensembl/ENSMUSG00000024673
organism             Y                 N      http://purl.obolibrary.org/obo/OBI_0100026
mus musculus         Y                 N      http://purl.obolibrary.org/obo/NCBITaxon_10090
organism_part        Y                 N      http://www.ebi.ac.uk/efo/EFO_0000635
liver                Y                 N      http://purl.obolibrary.org/obo/UBERON_0002107
expression           Y                 N      http://edamontology.org/topic_0203
DOWN                 Y                 N      http://semanticscience.org/resource/SIO_001078
-140.00183           Y                 Y      N/A
t-stat               Y                 N      http://semanticscience.org/resource/SIO_001074
p-value              Y                 N      http://semanticscience.org/resource/SIO_000765
8.40E-34             Y                 Y      N/A


Building the RDF graph

•  We have identified our types with URIs

•  We know what data is ours

•  Now we need to translate each row in the file to an RDF representation using N-Triples

•  <Subject> <Predicate> <Object>

•  Remember the Object can be a URI or a value

•  For predicates create URIs in our own namespace

•  http://www.mydomain.com/mydata#
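As a rough sketch of what one N-Triples statement looks like when written from a script, the helpers below render a subject, predicate and object, where the object is either a URI or a plain value. The helper names `nt_term` and `nt_triple` and the predicate `mydata#p-value` are illustrative choices for this example, not part of any library.

```python
MYDATA = "http://www.mydomain.com/mydata#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def nt_term(value, is_uri):
    """Render one term: URIs go in angle brackets, values become quoted literals."""
    if is_uri:
        return f"<{value}>"
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def nt_triple(subject, predicate, obj, obj_is_uri=True):
    """One N-Triples statement; each statement ends with a space and a full stop."""
    return f"<{subject}> <{predicate}> {nt_term(obj, obj_is_uri)} ."


# Object is a URI: the experiment is typed with the EFO term for Experiment.
print(nt_triple(MYDATA + "E-TABM-865", RDF_TYPE, "http://www.ebi.ac.uk/efo/EFO_0004033"))
# Object is a value: a literal attached with a predicate from our own namespace.
print(nt_triple(MYDATA + "result1", MYDATA + "p-value", "8.40E-34", obj_is_uri=False))
```

Subjects and predicates are always written as full URIs here, because N-Triples has no prefix mechanism.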


Example row conversion to RDF

E-TABM-865   Ms4a1   ENSMUSG00000024673   mus musculus   liver   DOWN   -140.00183   8.40E-34

[Diagram: E-TABM-865 --type--> Experiment]

RDF Triples

SUBJECT             PREDICATE   OBJECT
mydata:E-TABM-865   rdf:type    efo:EFO_0004033


Example row conversion to RDF

E-TABM-865   Ms4a1   ENSMUSG00000024673   mus musculus   liver   DOWN   -140.00183   8.40E-34

[Diagram: E-TABM-865 --type--> Experiment; E-TABM-865 --has result--> mydata:result1; mydata:result1 --type--> Down Expression Value]

RDF Triples

SUBJECT             PREDICATE          OBJECT
mydata:E-TABM-865   rdf:type           efo:EFO_0004033
mydata:E-TABM-865   mydata:hasResult   mydata:result1
mydata:result1      rdf:type           sio:SIO_001078
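The triples on this slide use prefixed names such as mydata: and efo:, whereas N-Triples needs full URIs. A minimal sketch of the expansion is shown below; the prefix-to-namespace mappings are the ones declared in the SPARQL queries later in the slides, and the `expand` helper is a hypothetical name for this example.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "efo": "http://www.ebi.ac.uk/efo/",
    "sio": "http://semanticscience.org/resource/",
    "mydata": "http://www.mydomain.com/mydata#",
}


def expand(name):
    """Turn a prefixed name such as efo:EFO_0004033 into a full URI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local


triples = [
    ("mydata:E-TABM-865", "rdf:type", "efo:EFO_0004033"),
    ("mydata:E-TABM-865", "mydata:hasResult", "mydata:result1"),
    ("mydata:result1", "rdf:type", "sio:SIO_001078"),
]

# Every term here is a URI, so each one is wrapped in angle brackets.
lines = [" ".join(f"<{expand(term)}>" for term in triple) + " ." for triple in triples]
print("\n".join(lines))
```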


Exercise 6

•  Using the following schema, write out some RDF in N-Triples to represent this single row of data

[Schema diagram]

Concepts (nodes): Experiment, Expression Value, Experimental factor, Organism, Ensembl id, Gene name, T-statistic, P-value

Relationships (edge labels): has result, Factor value (x2), T-stat, P-value, dbxref, label

E-TABM-865   Ms4a1   ENSMUSG00000024673   mus musculus   liver   DOWN   -140.00183   8.40E-34


Exercise 6 solution

E-TABM-865   Ms4a1   ENSMUSG00000024673   mus musculus   liver   DOWN   -140.00183   8.40E-34

RDF Triples

SUBJECT                          PREDICATE            OBJECT
mydata:E-TABM-865                rdf:type             efo:EFO_0004033
mydata:E-TABM-865                mydata:hasResult     mydata:result1
mydata:result1                   rdf:type             sio:SIO_001078
mydata:result1                   mydata:factorValue   obo:NCBITaxon_10090
mydata:result1                   mydata:factorValue   obo:UBERON_0002107
mydata:result1                   mydata:t-stat        "-140.00183"
mydata:result1                   mydata:p-value       "8.40E-34"
mydata:result1                   mydata:dbxref        identifiers:ENSMUSG00000024673
identifiers:ENSMUSG00000024673   rdfs:label           "Ms4a1"


Generating RDF

•  CSV2RDF

•  OpenRefine

•  Scripts

•  Output serialised RDF

•  Simple to print out N3 to files

•  Use an RDF API

•  Most programming languages have RDF libraries

•  Other options

•  RDB2RDF: Work directly off your relational database


A simple CSV 2 RDF in Perl

•  Example script data2rdf.pl

•  Read input file (raw-data.csv)

•  Convert rows into triple statements according to my schema

•  Generate appropriate URIs for things

•  Print out triple statement in simple N3 format
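The steps above can be sketched in Python as follows. This is not the data2rdf.pl script from the course, only an illustration of the same idea under stated assumptions: the input is assumed to be comma-separated with the column names from the example table, the predicate names follow the Exercise 6 solution (the SPARQL queries later in the slides use slightly different names, such as mydata:hasFactorValue, so the real script evidently differs), and typing the statistics as xsd:double is an addition so that numeric filters can compare them.

```python
import csv
import io

MYDATA = "http://www.mydomain.com/mydata#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
EXPERIMENT = "http://www.ebi.ac.uk/efo/EFO_0004033"
EXPRESSION = {
    "DOWN": "http://semanticscience.org/resource/SIO_001078",
    "UP": "http://semanticscience.org/resource/SIO_001081",
}
ORGANISM_PART = {
    "liver": "http://purl.obolibrary.org/obo/UBERON_0002107",
    "spleen": "http://purl.obolibrary.org/obo/UBERON_0002106",
    "kidney": "http://purl.obolibrary.org/obo/UBERON_0002113",
}

# Assumed input layout: comma-separated, headers as in the example table.
SAMPLE_CSV = """Experiment,Gene name,Ensembl id,organism,organism_part,expression,t-stat,p-value
E-TABM-865,Ms4a1,ENSMUSG00000024673,mus musculus,liver,DOWN,-140.00183,8.40E-34
"""


def row_to_triples(row, number):
    """Turn one table row into N-Triples lines following the Exercise 6 schema."""
    experiment = f"<{MYDATA}{row['Experiment']}>"
    result = f"<{MYDATA}result{number}>"
    gene = f"<http://identifiers.org/ensembl/{row['Ensembl id']}>"
    return [
        f"{experiment} <{RDF_TYPE}> <{EXPERIMENT}> .",
        f"{experiment} <{MYDATA}hasResult> {result} .",
        f"{result} <{RDF_TYPE}> <{EXPRESSION[row['expression']]}> .",
        f"{result} <{MYDATA}factorValue> <{ORGANISM_PART[row['organism_part']]}> .",
        # Typed literals are an assumption; the slides show plain quoted values.
        f'{result} <{MYDATA}t-stat> "{row["t-stat"]}"^^<{XSD_DOUBLE}> .',
        f'{result} <{MYDATA}p-value> "{row["p-value"]}"^^<{XSD_DOUBLE}> .',
        f"{result} <{MYDATA}dbxref> {gene} .",
        f'{gene} <{RDFS_LABEL}> "{row["Gene name"]}" .',
    ]


triples = []
for number, row in enumerate(csv.DictReader(io.StringIO(SAMPLE_CSV)), start=1):
    triples.extend(row_to_triples(row, number))
print("\n".join(triples))
```

To run it against a real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle. The organism column is left out here because the slides only give a taxonomy URI for mus musculus.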


Exercise 7

•  Look at the N-Triples file generated (raw-data.rdf)

•  See if you understand how that translates to the schema

•  Convert this file to RDF/XML using online converter

•  http://www.rdfabout.com/demo/validator


Blank nodes (bnode)

•  You can use an anonymous resource in RDF

•  They can be the subject or object of any triple

•  Denote the existence of a “thing” but you don’t have to explicitly give it a URI

•  In our scenario we created a URI for the gene expression value, but we didn’t have to

•  Using Turtle syntax we could have said

mydata:E-TABM-865 rdf:type efo:EFO_0004033 .
mydata:E-TABM-865 mydata:hasResult
    [ rdf:type sio:SIO_001078 ;
      mydata:factorValue obo:UBERON_0002107 ;
      mydata:t-stat "-140.00183" ] .
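N-Triples has no square-bracket shorthand, so a script that prints N-Triples would write the anonymous result with a blank node label of the form `_:name` instead. The sketch below shows one way to do that; `new_bnode` and `result_triples` are hypothetical helper names, and the predicates are the ones from the Turtle example.

```python
from itertools import count

MYDATA = "http://www.mydomain.com/mydata#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

_counter = count()


def new_bnode():
    """Return a fresh blank node label such as _:b0 (meaningful only within this file)."""
    return f"_:b{next(_counter)}"


def result_triples(experiment, result_type, factor, tstat):
    """Describe one result without minting a URI for it."""
    node = new_bnode()
    return [
        f"<{experiment}> <{MYDATA}hasResult> {node} .",
        f"{node} <{RDF_TYPE}> <{result_type}> .",
        f"{node} <{MYDATA}factorValue> <{factor}> .",
        f'{node} <{MYDATA}t-stat> "{tstat}" .',
    ]


lines = result_triples(
    MYDATA + "E-TABM-865",
    "http://semanticscience.org/resource/SIO_001078",
    "http://purl.obolibrary.org/obo/UBERON_0002107",
    "-140.00183",
)
print("\n".join(lines))
```

The trade-off is that a blank node cannot be referred to from outside the file, so a URI is still the better choice for anything other datasets may want to link to.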


Querying RDF

•  Specialised databases for indexing RDF graphs

Stardog

Apache Jena

Sesame

Virtuoso

AllegroGraph

OWLIM


OpenRDF Sesame

•  http://www.openrdf.org

•  OpenRDF Sesame is a de-facto standard framework for processing RDF data. This includes parsers, storage solutions (RDF databases, a.k.a. triplestores), reasoning and querying, using the SPARQL query language. It offers a flexible and easy to use Java API that can be connected to all leading RDF storage solutions.

•  Easy to deploy (Java servlet)

•  Provides SPARQL endpoint and workbench for administration tasks

•  Scalable to millions of triples

•  Other more scalable implementations of the storage and inference layer available

•  OWLIM

•  Virtuoso

•  Bigdata


The Sesame workbench

•  We have a workbench online for you to play with

•  http://goo.gl/K5wmIe

•  (http://ec2-54-72-241-21.eu-west-1.compute.amazonaws.com/openrdf-workbench)

•  Use this to create a repository

•  Upload data

•  Test queries


Exercise 8

•  Create a new in-memory store repository for your data


Exercise 9

•  Load RDF Data file (use raw-data.rdf from the Dropbox folder)

•  Set Data format to N-Triples

•  Set base URI to

•  http://www.mydomain.com/mydata#


SPARQL endpoint


Exploring a SPARQL endpoint

•  Show me some triples

•  Select all data = not a very friendly query!

•  Find the types of things

•  http://www.w3.org/TR/rdf-sparql-query/

SELECT *
WHERE { ?subject ?predicate ?object }

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?type
WHERE { ?subject rdf:type ?type }
LIMIT 10


Describing a resource

•  What is known about

•  http://www.mydomain.com/mydata#E-TABM-865

DESCRIBE <http://www.mydomain.com/mydata#E-TABM-865>


Exercise 11 – SPARQL endpoint

•  Try some of the previous queries on the SPARQL endpoint

•  Explore clicking around URIs to follow links through the data

•  Explore download formats

•  SPARQL query results XML, JSON, CSV
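The same queries can also be sent from a script, with the result format chosen through the HTTP Accept header. The sketch below builds such a request and reads a result in the SPARQL JSON results format; nothing is actually sent. The endpoint URL is a hypothetical local address, and `build_request` and `rows` are names invented for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address; substitute the repository URL of your own store.
ENDPOINT = "http://localhost:8080/openrdf-sesame/repositories/mydata"
QUERY = "SELECT * WHERE { ?subject ?predicate ?object } LIMIT 10"


def build_request(endpoint, query, accept="application/sparql-results+json"):
    """Build a GET request that carries the query in the 'query' parameter."""
    return Request(endpoint + "?" + urlencode({"query": query}),
                   headers={"Accept": accept})


def rows(results_json):
    """Flatten a SPARQL JSON result document into a list of {variable: value} dicts."""
    document = json.loads(results_json)
    names = document["head"]["vars"]
    return [{name: binding[name]["value"] for name in names if name in binding}
            for binding in document["results"]["bindings"]]


# A hand-written example of what a JSON result for the "types" query looks like.
sample = json.dumps({
    "head": {"vars": ["type"]},
    "results": {"bindings": [
        {"type": {"type": "uri", "value": "http://www.ebi.ac.uk/efo/EFO_0004033"}}
    ]},
})

request = build_request(ENDPOINT, QUERY)
print(request.full_url)
print(rows(sample))
```

Passing the request to `urllib.request.urlopen` would send it; asking for `application/sparql-results+xml` or `text/csv` in the Accept header selects the other two formats listed above.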


Binding variables

•  Get all things that are types of experiment

•  Experiment URI http://www.ebi.ac.uk/efo/EFO_0004033

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo:<http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?thing
WHERE { ?thing rdf:type efo:EFO_0004033 }
LIMIT 10


Exercise 12

•  Write a SPARQL query to get the labels for all experiments (hint: Use the rdfs:label relation)

•  Tip: Store SPARQL queries that work in a text file, easier to edit and re-use previous queries


Exercise 12 solution

•  Select labels for all experiments

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo:<http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?label
WHERE {
  ?thing rdf:type efo:EFO_0004033 .
  ?thing rdfs:label ?label
}


Exercise 13

•  Explore the raw-data.rdf file and try to write a SPARQL query that would show you all the genes UP in “liver” samples

•  Hint:

•  UP = http://semanticscience.org/resource/SIO_001081

•  “liver” = http://purl.obolibrary.org/obo/UBERON_0002107


Exercise 13 solution

•  Get genes up-regulated in liver samples

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX sio:<http://semanticscience.org/resource/>
PREFIX obo:<http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?geneid ?label
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  ?result rdf:type sio:SIO_001081 .
  ?result mydata:hasFactorValue obo:UBERON_0002107
}


Filtering SPARQL queries

•  Restrict values in results from matches in the graph patterns

•  String matching

•  FILTER regex(?x, "pattern" [, "flags"])

•  E.g. FILTER regex(?label, "E-TABM-865")

•  Testing values

•  FILTER (?tstat >0 24)
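To get a feel for what the two kinds of filter keep and discard, their behaviour can be mimicked over a couple of result rows in Python. This is only an analogy, not how a triple store evaluates filters: SPARQL uses XPath-style regular expressions, which agree with Python's `re` module for simple patterns like these, and the `float()` conversion stands in for a value that carries a numeric datatype in the store. The sample rows are taken from the example table; the function names are made up for this sketch.

```python
import re

# Two rows as they might come back from a query for ?label and ?tstat.
bindings = [
    {"label": "E-TABM-865", "tstat": "-140.00183"},
    {"label": "E-GEOD-2852", "tstat": "32.518154"},
]


def regex_filter(rows, var, pattern, flags=""):
    """Keep rows where the variable matches, like FILTER regex(?var, pattern, flags)."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    return [row for row in rows if re.search(pattern, row[var], re_flags)]


def less_than(rows, var, limit):
    """Keep rows whose numeric value is below the limit, like FILTER (?var < limit)."""
    return [row for row in rows if float(row[var]) < limit]


print(regex_filter(bindings, "label", "geod", "i"))
print(less_than(bindings, "tstat", 0))
```

Without the "i" flag the lower-case pattern "geod" would match nothing here, since the labels are upper case.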


Exercise 14

•  Get all experiments where the label contains “GEOD”

•  Get all genes up regulated with a t-statistic < 0


Exercise 14 solutions

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo:<http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?label
WHERE {
  ?thing rdf:type efo:EFO_0004033 .
  ?thing rdfs:label ?label .
  FILTER regex(?label, "geod", "i")
}

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
SELECT DISTINCT ?geneid ?label ?tstat
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  ?result mydata:hasTStatistic ?tstat .
  FILTER (?tstat < 0)
}


Enriching data

•  Our dataset is still a bit sparse

•  e.g. no labels or descriptions for sample information

•  We used URIs from external ontologies to define some concepts

•  Let’s integrate our dataset with those ontologies and do some querying
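Why loading the ontology next to the data helps can be shown with a toy model in which a graph is just a set of (subject, predicate, object) tuples. Because both sources use the same URI for liver, a plain set union is enough to connect a result to its label. The single ontology triple below is a stand-in for the full ontology file, and `factor_labels` is a hypothetical helper for this sketch.

```python
MYDATA = "http://www.mydomain.com/mydata#"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
LIVER = "http://purl.obolibrary.org/obo/UBERON_0002107"

# Our own data: a result that points at the liver URI, with no label of its own.
my_triples = {(MYDATA + "result1", MYDATA + "hasFactorValue", LIVER)}

# Stand-in for the ontology: it describes the same URI.
ontology_triples = {(LIVER, LABEL, "liver")}

# Loading both into one repository amounts to taking the union of the triples.
graph = my_triples | ontology_triples


def factor_labels(triples):
    """Pair each factor value URI with its label, when a label is known."""
    labels = {s: o for s, p, o in triples if p == LABEL}
    return sorted((o, labels[o]) for s, p, o in triples
                  if p == MYDATA + "hasFactorValue" and o in labels)


print(factor_labels(my_triples))   # no labels before the ontology is loaded
print(factor_labels(graph))        # the shared URI links the result to "liver"
```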


Exercise 15

•  Find the Experimental Factor Ontology (EFO) file

•  Available from the Web, or as efo.owl in the course material

•  Load the ontology file into the same repository as your raw data RDF

•  Now describe the liver URI

•  http://purl.obolibrary.org/obo/UBERON_0002107

•  Create a SPARQL query to pull out labels for all of the factor values


Exercise 15 solution

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
SELECT DISTINCT ?factor ?label
WHERE {
  ?result mydata:hasFactorValue ?factor .
  ?factor rdfs:label ?label
}

DESCRIBE <http://purl.obolibrary.org/obo/UBERON_0002107>


Exploiting knowledge

•  As an ontology, EFO contains lots of biological domain knowledge

•  E.g. classification of diseases, organism parts etc.

•  We can exploit this knowledge to enhance queries over our datasets

•  E.g. What are all the parent types (or categories) for liver in EFO?

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo:<http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?parent ?label
WHERE {
  obo:UBERON_0002107 rdfs:subClassOf ?parent .
  ?parent rdfs:label ?label
}


Property paths

•  We can query along paths of relations using SPARQL

•  This is useful for exploiting transitive relationships

•  Special SPARQL 1.1 syntax for property paths “*”

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo:<http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?parent ?label
WHERE {
  obo:UBERON_0002107 rdfs:subClassOf* ?parent .
  ?parent rdfs:label ?label
}
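What the "*" adds can be illustrated with a small traversal in Python. The hierarchy below is a toy example invented for this sketch and does not reproduce the real EFO or UBERON axioms; `ancestors_or_self` is a hypothetical helper name.

```python
# Toy subclass hierarchy (child -> direct parents); NOT the real ontology structure.
TOY_SUBCLASS_OF = {
    "liver": ["organ"],
    "organ": ["organism part"],
}


def ancestors_or_self(term, parents):
    """Follow subclass links zero or more times, as rdfs:subClassOf* does."""
    seen = [term]      # zero steps: the starting term itself is part of the answer
    queue = [term]
    while queue:
        current = queue.pop(0)
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


print(TOY_SUBCLASS_OF["liver"])                      # one step, like rdfs:subClassOf
print(ancestors_or_self("liver", TOY_SUBCLASS_OF))   # any number of steps, like rdfs:subClassOf*
```

Because "*" also allows zero steps, the starting class itself is matched; SPARQL 1.1 offers "+" for one or more steps when the start should be excluded.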


Exercise 16 – Ontology query

•  Get all genes expressed in your data where the factor value is a child of “organism part” (efo:EFO_0000635)


Exercise 16 solution

•  Get all genes expressed in your data where the factor value is a child of “organism part” (efo:EFO_0000635)

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata:<http://www.mydomain.com/mydata#>
PREFIX efo:<http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?geneid ?label ?factor
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  ?result mydata:hasFactorValue ?factor .
  ?factor rdfs:subClassOf* efo:EFO_0000635
}


End of 1st practical session

•  Introduced modeling data in RDF

•  Three questions I always ask of data

•  What is it (types)?

•  What is it (id)?

•  What is it related to?

•  Generating RDF statements in N-Triples

•  Loading RDF into a triple store

•  Basic querying with SPARQL