12th June, 2016. BioHackathon 2016 Symposium, Japan. Facilitating Semantic Alignment of EBI Resources. Simon Jupp, Ontology Project Lead, Samples, Phenotypes and Ontologies Team. www.ebi.ac.uk


Page 1: Facilitating semantic alignment.-biohackathon-jupp

12th June, 2016
BioHackathon 2016 Symposium, Japan

Facilitating Semantic Alignment of EBI Resources

Simon Jupp
Ontology Project Lead
Samples, Phenotypes and Ontologies Team
www.ebi.ac.uk

Page 2: Facilitating semantic alignment.-biohackathon-jupp

SPOT team - Adding value with ontologies

Data exploration and cleanup
Data structuring
Ontology annotation
Data cleaning and mapping
Ontology building
Structured data

Page 3: Facilitating semantic alignment.-biohackathon-jupp

Data Enrichment Services
• Building an interoperability toolkit for Europe (Elixir)
• Micro-service architecture
• Technology-agnostic
• Pushing boundaries of ontology “embedding”

New ontology lookup service!

Page 4: Facilitating semantic alignment.-biohackathon-jupp

Building an ontology toolkit

Data exploration and cleanup
Data structuring
Ontology annotation
Data cleaning and mapping
Ontology building
Webulous
OxO mapping service

Page 5: Facilitating semantic alignment.-biohackathon-jupp

Building metadata rich resources
• Ontology markup of experimental variables/samples
• Focus on Phenotype/Disease annotation
• Linking common to rare disease

[Chart: EFO mapped coverage (%), axis 0 to 100; resources labelled include ArrayExpress and Gene Expression Atlas; bar values as extracted: 89, 77, 78, 100, 99]

Page 6: Facilitating semantic alignment.-biohackathon-jupp

OpenTargets Data Mapping Process

Resource | Data type | Ontology / vocabulary
Reactome | Metabolic pathways | DOID
GWAS catalog | Common Disease (GWAS) | EFO
Atlas | Expression | EFO
Uniprot | Rare Disease (Expert-reviewed OMIM) | OMIM + own controlled vocab
European Variation Archive | Rare Disease | OMIM + Orphanet + SNOMED + Genetic Alliance + HPO
ChEMBL | Bioactivity data | ATC classification (14 terms)
EuropePMC | Literature Mining | UMLS
IMPC | Mouse Models | MPO + HPO
Cancer Gene Census | Somatic Mutations | own controlled vocab + NCIT

Acquire

Clean

Map to Ontology

Curate

Add new terms

Iterate
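
The clean and map-to-ontology steps above could be sketched roughly as follows. This is a minimal illustration, not the Open Targets pipeline: the term identifiers, the tiny in-memory term table, and the function names `clean`, `build_index` and `map_values` are all assumptions made for the example. Values that fail an exact match against a cleaned label or synonym are returned separately, which is where the curate and add-new-terms steps would pick up.

```python
import re

# Hypothetical miniature term table: identifier -> (label, synonyms).
# The identifiers are placeholders, not real ontology accessions.
TERMS = {
    "EX:0001": ("asthma", ["bronchial asthma"]),
    "EX:0002": ("type II diabetes mellitus", ["type 2 diabetes", "T2D"]),
}


def clean(text):
    """Clean step: lower-case, turn punctuation into spaces, trim the ends."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def build_index(terms):
    """Index every cleaned label and synonym to its term identifier."""
    index = {}
    for term_id, (label, synonyms) in terms.items():
        for name in [label, *synonyms]:
            index[clean(name)] = term_id
    return index


def map_values(values, index):
    """Map step: return (mapped, unmapped); unmapped values go to a curator."""
    mapped, unmapped = {}, []
    for value in values:
        term_id = index.get(clean(value))
        if term_id is None:
            unmapped.append(value)
        else:
            mapped[value] = term_id
    return mapped, unmapped


index = build_index(TERMS)
mapped, unmapped = map_values(["Asthma", "Type-2 diabetes", "lupus"], index)
print(mapped)
print(unmapped)
```

Exact matching after normalisation is the simplest possible strategy; a production service would likely add fuzzy matching and ranking, which this sketch leaves out.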

Page 7: Facilitating semantic alignment.-biohackathon-jupp

Experimental Factor Ontology – Data Driven Application Ontology
• EFO is an application ontology, built for use in production services in OWL
• Imports from ~10 ontologies, isolates us from external churn
• Cross referenced to 25 additional ontologies
• Continuous integration build process, reasoning, manual error checking, multi-editor environment

Imported domains: Chemical Entities of Biological Interest (ChEBI), Gene Ontology, Cell Type, Anatomy, Phenotype, Disease

Page 8: Facilitating semantic alignment.-biohackathon-jupp

Managing data evolution in production

Diagram labels: Ontologies; Data; Ontology annotation; Provenance: who, when, context

Annotation domains: Disease, Anatomy, Cell types, Gene function, Phenotype (GO, HP, MP, UBERON, DO, ORDO)
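
One way to picture an ontology annotation that carries the "who, when, context" provenance named on this slide is a small immutable record. The sketch below is an assumption about shape only: the `Annotation` class, its field names, the `latest` helper and the example.org IRIs are hypothetical and are not taken from any EBI service.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Annotation:
    """One link from a data value to an ontology term, plus its provenance."""
    value: str       # raw value found in the data
    term_iri: str    # ontology term the value is annotated with
    who: str         # curator or pipeline that made the annotation
    when: datetime   # when the annotation was made
    context: str     # e.g. the dataset and field the value came from


def latest(annotations, value, context):
    """Return the most recent annotation for a value in a context, or None."""
    candidates = [a for a in annotations
                  if a.value == value and a.context == context]
    return max(candidates, key=lambda a: a.when, default=None)


# Hypothetical history: the same value re-annotated after an ontology change.
history = [
    Annotation("heart", "http://example.org/onto/OLD_1", "curator_a",
               datetime(2015, 3, 1, tzinfo=timezone.utc),
               "dataset-1/organism part"),
    Annotation("heart", "http://example.org/onto/NEW_1", "pipeline_b",
               datetime(2016, 5, 20, tzinfo=timezone.utc),
               "dataset-1/organism part"),
]
current = latest(history, "heart", "dataset-1/organism part")
print(current.term_iri, current.who)
```

Keeping superseded annotations instead of overwriting them is what lets a service answer "what was this value mapped to, by whom, and when" as both data and ontologies evolve.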

Page 9: Facilitating semantic alignment.-biohackathon-jupp

Ontologies in applications

Smarter searching

Data visualisation

Data analysis

Data integration

Page 10: Facilitating semantic alignment.-biohackathon-jupp

Open Targets

Which other diseases are associated with PDE4D?

View diseases grouped in therapeutic areas or organised in a tree

View more information about PDE4D

Filter by therapeutic area

Page 11: Facilitating semantic alignment.-biohackathon-jupp

BioSolr

“BioSolr aims to significantly advance the state of the art with regards to indexing and querying biomedical data with freely available open source software”

flaxsearch/BioSolr

Solr documents with ontology annotation

Enriched Solr with ontology content (synonyms, structure, relations)

Solr/Elastic plugin: query expansion and hierarchical faceting
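
The idea of ontology-driven query expansion can be sketched independently of any search engine. The example below is not the BioSolr plugin or its API; the toy `SYNONYMS` and `CHILDREN` tables and both function names are assumptions for illustration. It expands a search term into an OR query over the term, its descendants, and all of their synonyms.

```python
# Hypothetical toy ontology; labels and structure are illustrative only.
SYNONYMS = {
    "lung disease": ["pulmonary disease"],
    "asthma": ["bronchial asthma"],
}
CHILDREN = {
    "lung disease": ["asthma", "pneumonia"],
    "asthma": ["allergic asthma"],
}


def descendants(term, children):
    """Collect all subclasses of a term, depth-first, without repeats."""
    seen, stack = [], list(children.get(term, []))
    while stack:
        node = stack.pop(0)
        if node not in seen:
            seen.append(node)
            # Visit this node's children before its remaining siblings.
            stack = list(children.get(node, [])) + stack
    return seen


def expand_query(term, synonyms, children):
    """Build an OR query over the term, its descendants, and their synonyms."""
    names = []
    for t in [term, *descendants(term, children)]:
        for name in [t, *synonyms.get(t, [])]:
            if name not in names:
                names.append(name)
    return " OR ".join(f'"{n}"' for n in names)


print(expand_query("lung disease", SYNONYMS, CHILDREN))
```

The same parent-to-child table could also drive hierarchical faceting, by counting hits per term and rolling the counts up to each ancestor.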

Page 12: Facilitating semantic alignment.-biohackathon-jupp

Making it all FAIR

Page 13: Facilitating semantic alignment.-biohackathon-jupp

Data resources at EMBL-EBI

Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive

Gene, protein & metabolite expression: RNA Central, ArrayExpress, Expression Atlas, Metabolights, PRIDE

Protein sequences, families & motifs: InterPro, Pfam, UniProt

Chemical biology: ChEMBL, SureChEMBL, ChEBI

Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank

Reactions, interactions & pathways: IntAct, Reactome, MetaboLights

Systems: BioModels, Enzyme Portal, BioSamples

Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology

Product of previous biohackathons

Page 14: Facilitating semantic alignment.-biohackathon-jupp

EBI RDF Platform

Successes
• Novel queries possible over EBI datasets
• Production quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10-50 million hits per month)
• Lot of interest from industry
• Catalyst for new RDF efforts

Lessons
● Public SPARQL endpoints problematic
● Query federation not performant
● Inference support limited
● Not scalable for all EBI data e.g. Variation, ENA
● Lack of expertise in service teams
● Too much overhead to get started quickly in this space

Ian Dunlop
Who needs the inference support?
Simon Jupp
I think this is a key value proposition of RDF: we can infer relations based on OWL semantics. It is truly something this technology promises that you can't do well in traditional technologies like RDBMS or Neo4j. Sadly, it doesn't work for us at this scale.
Helen Parkinson
Is there a small-scale pilot we should undertake, or should we just acknowledge that this doesn't work at our scale?
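
To make the inference point in the comments above concrete, here is a minimal sketch of what "inferring relations" means for the simplest case, subclass reasoning. It implements only two RDFS-style rules in plain Python, which is far less than full OWL semantics, and it is not how any triple store does it. The `ex:` names and the `infer` function are hypothetical.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer(triples):
    """Apply two RDFS-style rules until no new triples appear.

    Rule 1: (A subClassOf B), (B subClassOf C)  =>  (A subClassOf C)
    Rule 2: (x type A), (A subClassOf B)        =>  (x type B)
    """
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in closed if p == SUBCLASS_OF]
        new = set()
        for a, b in subclass:
            for s, p, o in closed:
                # Anything pointing at class a also points at its superclass b.
                if p in (SUBCLASS_OF, TYPE) and o == a:
                    new.add((s, p, b))
        if not new <= closed:
            closed |= new
            changed = True
    return closed


# Hypothetical asserted triples.
asserted = {
    ("ex:asthma", SUBCLASS_OF, "ex:lung_disease"),
    ("ex:lung_disease", SUBCLASS_OF, "ex:disease"),
    ("ex:sample1", TYPE, "ex:asthma"),
}
inferred = infer(asserted) - asserted
for triple in sorted(inferred):
    print(triple)
```

With inference, a query for everything typed as `ex:disease` would return `ex:sample1` even though that triple was never asserted. This naive loop rescans every triple on each pass, so it is for illustration only.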
Page 15: Facilitating semantic alignment.-biohackathon-jupp

Challenges for RDF at EMBL-EBI
• Most EBI resources publish data in forms that support common use cases (pre-integrated)
• Individual teams do the hard work so you don’t have to
• RDF representation not optimised for performance
• Barrier to building real (killer) applications
• Technology not mature enough / developer frameworks lacking
• Doing RDF shouldn’t mandate a technology choice anyway
• RDF not yet a “core” activity for EMBL-EBI

Page 16: Facilitating semantic alignment.-biohackathon-jupp

Where we are going next with RDF
• Virtualised infrastructure for RDF
  • Simpler cloud deployment
• Building a single EBI RDF cache
  • Simpler to manage
  • More interesting queries
• Exploring cheaper paths to RDF
  • RDF from REST + JSON-LD
  • Via Wikidata
  • RDFa and schema.org (bioschemas)
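
The "RDF from REST + JSON-LD" path can be illustrated with a small sketch: take the JSON a REST API already returns, attach an `@context` that maps keys to property IRIs, and the same document can be read as RDF. Everything specific below is an assumption: the record, the example.org IRIs, and the helper names. The `to_ntriples` function handles only flat documents with string values and is not a JSON-LD processor.

```python
import json

# Hypothetical JSON record, as a REST API might return it.
record = {"id": "sample-1", "organism": "Homo sapiens", "disease": "asthma"}

# Hypothetical context mapping JSON keys to property IRIs (placeholders).
CONTEXT = {
    "organism": "http://example.org/vocab/organism",
    "disease": "http://example.org/vocab/disease",
}
BASE = "http://example.org/samples/"


def to_jsonld(rec, context, base):
    """Wrap a flat record as JSON-LD by adding @context and @id."""
    doc = {"@context": context, "@id": base + rec["id"]}
    doc.update({k: v for k, v in rec.items() if k in context})
    return doc


def to_ntriples(doc):
    """Serialise a flat JSON-LD document (string values only) as N-Triples."""
    context = doc["@context"]
    lines = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue
        # json.dumps quotes and escapes the string for use as a literal.
        literal = json.dumps(value, ensure_ascii=False)
        lines.append(f'<{doc["@id"]}> <{context[key]}> {literal} .')
    return lines


doc = to_jsonld(record, CONTEXT, BASE)
print(json.dumps(doc, indent=2))
for line in to_ntriples(doc):
    print(line)
```

The appeal of this route is that the service keeps its existing JSON and storage; only the context is new, so RDF becomes a view over the API and not a separate release.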

Page 17: Facilitating semantic alignment.-biohackathon-jupp

Acknowledgements
• Samples, Phenotypes and Ontologies Team
  • Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Tony Burdett, Helen Parkinson
• Funding
  • European Molecular Biology Laboratory (EMBL)
  • European Union projects: DIACHRON, BioMedBridges and CORBEL, Excelerate

Page 18: Facilitating semantic alignment.-biohackathon-jupp
Page 19: Facilitating semantic alignment.-biohackathon-jupp

Topics of interest for the hackathon
• Ontology Mapping
  • Disease (rare, common, phenotypes)
• Data annotation (automated, machine learning, text mining)
• Virtualised RDF data deployment
• RDF on the fly
  • RDF over Mongo, Neo4j, Solr, Elastic
  • REST + JSON-LD