12th June, 2016
BioHackathon 2016 Symposium, Japan
Facilitating Semantic Alignment of EBI Resources
Simon Jupp
Ontology Project Lead
Samples, Phenotypes and Ontologies Team
www.ebi.ac.uk
SPOT team - Adding value with ontologies
Data Exploration and Cleanup
Data structuring
Ontology Annotation
Data cleaning and mapping
Ontology building
Structured data
Data Enrichment Services
• Building an interoperability toolkit for Europe (Elixir)
• Micro-service architecture
• Technology-agnostic
• Pushing boundaries of ontology “embedding”
New ontology lookup service!
Building an ontology toolkit
Data Exploration and Cleanup
Data structuring
Ontology Annotation
Data cleaning and mapping
Ontology building
Webulous
OxO mapping service
Building metadata rich resources
• Ontology markup of experimental variables/samples
• Focus on Phenotype/Disease annotation
• Linking common to rare disease
[Chart: EFO mapped coverage (%) for ArrayExpress and Gene Expression Atlas; axis 0 to 100; plotted values 89, 77, 78, 100, 99]
OpenTargets Data Mapping Process
Resource | Data | Vocabulary
Reactome | Metabolic pathways | DOID
GWAS catalog | Common Disease (GWAS) | EFO
Atlas | Expression | EFO
Uniprot | Rare Disease (Expert-reviewed OMIM) | OMIM + own controlled vocab
European Variation Archive | Rare Disease | OMIM + Orphanet + SNOMED + Genetic Alliance + HPO
ChEMBL | Bioactivity data | ATC classification (14 terms)
EuropePMC | Literature Mining | UMLS
IMPC | Mouse Models | MPO + HPO
Cancer Gene Census | Somatic Mutations | own controlled vocab + NCIT
Acquire
Clean
Map to Ontology
Curate
Add new terms
Iterate
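The acquire, clean, map, curate, add-terms, iterate loop above can be sketched in a few lines. This is only an illustration under stated assumptions: the term IDs (`EX:…`), the tiny `ONTOLOGY` dictionary, and the helper names `clean`, `build_index` and `map_values` are all hypothetical, and exact matching on cleaned labels and synonyms stands in for whatever matching the real process uses.

```python
import re

# Hypothetical mini-ontology (illustrative IDs, not real EFO terms).
ONTOLOGY = {
    "EX:0001": {"label": "asthma", "synonyms": ["bronchial asthma"]},
    "EX:0002": {"label": "type 2 diabetes mellitus",
                "synonyms": ["t2d", "type ii diabetes"]},
}


def clean(value: str) -> str:
    """Clean step: lower-case, replace punctuation runs with a space, trim."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def build_index(ontology: dict) -> dict:
    """Index every cleaned label and synonym to its term ID."""
    index = {}
    for term_id, term in ontology.items():
        for name in [term["label"], *term["synonyms"]]:
            index[clean(name)] = term_id
    return index


def map_values(raw_values, ontology):
    """Map step: return (mapped, needs_curation) for a batch of raw values."""
    index = build_index(ontology)
    mapped, needs_curation = {}, []
    for raw in raw_values:
        term_id = index.get(clean(raw))
        if term_id is None:
            needs_curation.append(raw)  # handed to a curator
        else:
            mapped[raw] = term_id
    return mapped, needs_curation


# Acquire: a batch of free-text annotations.
raw = ["Asthma ", "T2D", "Type-II diabetes", "lung fibrosis"]
mapped, todo = map_values(raw, ONTOLOGY)

# Curate / add new terms: a curator adds the missing term, then we iterate.
ONTOLOGY["EX:0003"] = {"label": "pulmonary fibrosis",
                       "synonyms": ["lung fibrosis"]}
mapped_again, todo_again = map_values(raw, ONTOLOGY)
```

After the first pass `todo` holds the one value that could not be mapped; once the new term is added, the second pass maps the whole batch.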
Experimental Factor Ontology – Data Driven Application Ontology
• EFO is an application ontology, built for use in production services in OWL
• Imports from ~10 ontologies, isolates us from external churn
• Cross referenced to 25 additional ontologies
• Continuous integration build process, reasoning, manual error checking, multi-editor environment
Chemical Entities of Biological Interest (ChEBI)
Gene Ontology
Cell Type
Anatomy
Phenotype
Disease
Ontologies Data
Managing data evolution in production
Ontology Annotation
Provenance: who, when, context
Disease
Anatomy
Cell types
Gene function
(GO, HP, MP, UBERON, DO, ORDO)
Phenotype
…
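One way to picture an ontology annotation that carries provenance (who, when, context) is as an append-only record, so that re-annotating data against a newer ontology release adds history rather than overwriting it. The sketch below is an assumption about how this could be modelled, not a description of the production system; the class names, field names and `EX:…` term IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    who: str        # curator or pipeline that made the annotation
    when: datetime  # time the annotation was made
    context: str    # e.g. dataset or ontology release (free text here)


@dataclass(frozen=True)
class Annotation:
    value: str      # raw text found in the data
    term_id: str    # ontology term it was mapped to (hypothetical IDs)
    provenance: Provenance


class AnnotationLog:
    """Append-only log: re-annotating adds a record instead of overwriting."""

    def __init__(self):
        self._records = []

    def annotate(self, value, term_id, who, context, when=None):
        when = when or datetime.now(timezone.utc)
        record = Annotation(value, term_id, Provenance(who, when, context))
        self._records.append(record)
        return record

    def history(self, value):
        """All annotations for a value, oldest first."""
        matches = [r for r in self._records if r.value == value]
        return sorted(matches, key=lambda r: r.provenance.when)

    def current(self, value):
        """Most recent annotation for a value, or None if never annotated."""
        past = self.history(value)
        return past[-1] if past else None


log = AnnotationLog()
log.annotate("lung fibrosis", "EX:0009", "curator_a", "release 1",
             when=datetime(2016, 1, 10, tzinfo=timezone.utc))
log.annotate("lung fibrosis", "EX:0003", "curator_b", "release 2",
             when=datetime(2016, 5, 2, tzinfo=timezone.utc))
latest = log.current("lung fibrosis")
```

Keeping the older record means a changed mapping can always be traced back to who made it and under which release.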
Ontologies in applications
Smarter searching
Data visualisation
Data analysis
Data integration
Open Targets
Which other diseases are associated with PDE4D?
View diseases grouped in therapeutic areas or organised in a tree
View more information about PDE4D
Filter by therapeutic area
BioSolr
“BioSolr aims to significantly advance the state of the art with regards to indexing and querying
biomedical data with freely available open source software”
flaxsearch/BioSolr
Solr documents with ontology annotation
Enriched Solr with ontology content (synonyms, structure, relations)
Solr/Elastic plugin: query expansion and hierarchical faceting
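Ontology-driven query expansion can be illustrated without any search engine: take a term, collect its synonyms and the labels of its descendants, and OR them together. The sketch below is not the BioSolr plugin API; the three-term ontology fragment, the `EX:…` IDs and the function names are hypothetical, and the output is a plain Lucene-style `field:"phrase"` string.

```python
# Hypothetical ontology fragment (illustrative IDs and labels only).
LABELS = {"EX:1": "lung disease", "EX:2": "asthma", "EX:3": "childhood asthma"}
SYNONYMS = {"EX:1": ["pulmonary disease"], "EX:2": ["bronchial asthma"], "EX:3": []}
CHILDREN = {"EX:1": ["EX:2"], "EX:2": ["EX:3"], "EX:3": []}


def descendants(term_id):
    """All terms below term_id, each listed once."""
    seen, stack = [], list(CHILDREN.get(term_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(CHILDREN.get(current, []))
    return seen


def expand(term_id, include_descendants=True):
    """Labels and synonyms for a term and, optionally, its descendants."""
    ids = [term_id] + (descendants(term_id) if include_descendants else [])
    phrases = []
    for tid in ids:
        for name in [LABELS[tid], *SYNONYMS[tid]]:
            if name not in phrases:
                phrases.append(name)
    return phrases


def to_query(field, phrases):
    """Join phrases into a Lucene-style OR query over one field."""
    return " OR ".join(f'{field}:"{phrase}"' for phrase in phrases)


query = to_query("text", expand("EX:2"))
```

A search for the parent term then also retrieves documents that only mention a synonym or a more specific child term.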
Making it all FAIR
Data resources at EMBL-EBI
• Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
• Gene, protein & metabolite expression: RNA Central, ArrayExpress, Expression Atlas, Metabolights, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Chemical biology: ChEMBL, SureChEMBL, ChEBI
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
Product of previous biohackathons
EBI RDF Platform
Successes
• Novel queries possible over EBI datasets
• Production quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10-50 million hits per month)
• Lot of interest from industry
• Catalyst for new RDF efforts
Lessons
• Public SPARQL endpoints problematic
• Query federation not performant
• Inference support limited
• Not scalable for all EBI data e.g. Variation, ENA
• Lack of expertise in service teams
• Too much overhead to get started quickly in this space
Challenges for RDF at EMBL-EBI
• Most EBI resources publish data in forms that support common use cases (pre-integrated)
• Individual teams do the hard work so you don’t have to
• RDF representation not optimised for performance
• Barrier to building real (killer) applications
• Technology not mature enough / developer frameworks lacking
• Doing RDF shouldn’t mandate a technology choice anyway
• RDF not yet a “core” activity for EMBL-EBI
Where we are going next with RDF
• Virtualised infrastructure for RDF
  • Simpler cloud deployment
• Building a single EBI RDF cache
  • Simpler to manage
  • More interesting queries
• Exploring cheaper paths to RDF
  • RDF from REST + JSON-LD
  • Via Wikidata
  • RDFa and schema.org (bioschemas)
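The "RDF from REST + JSON-LD" path can be pictured as attaching a JSON-LD `@context` and `@id` to a JSON record that a REST API already returns, so the same payload can be read as RDF. The sketch below is an assumption-laden illustration: the record, the `example.org` IRIs and the helper names are hypothetical, and `to_triples` is a deliberately naive expansion for flat records only; a real JSON-LD processor handles far more.

```python
import json

# Hypothetical context: maps plain JSON keys to property IRIs.
CONTEXT = {
    "name": "http://schema.org/name",
    "identifier": "http://schema.org/identifier",
    # Values of "disease" are IRIs (ontology terms), not string literals.
    "disease": {"@id": "http://example.org/vocab/disease", "@type": "@id"},
}


def to_jsonld(record, base_iri):
    """Wrap a plain REST record with @context and @id."""
    doc = {"@context": CONTEXT, "@id": base_iri + record["identifier"]}
    doc.update(record)
    return doc


def to_triples(doc):
    """Naive expansion of a flat JSON-LD document into triples.

    Returns (subject, predicate, object, kind), kind being 'iri' or 'literal'.
    Keys missing from the context are dropped, as JSON-LD expansion does.
    """
    context, subject = doc["@context"], doc["@id"]
    triples = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue
        mapping = context.get(key)
        if mapping is None:
            continue
        if isinstance(mapping, dict):
            triples.append((subject, mapping["@id"], value, "iri"))
        else:
            triples.append((subject, mapping, value, "literal"))
    return triples


# A record as a REST endpoint might return it (made-up values).
record = {
    "identifier": "S1",
    "name": "lung biopsy",
    "disease": "http://example.org/terms/EX_0001",
    "internal_flag": True,  # not in the context, so not exported as RDF
}
doc = to_jsonld(record, "http://example.org/samples/")
triples = to_triples(doc)
payload = json.dumps(doc, indent=2)
```

The appeal of this route is that existing JSON clients keep working unchanged, while RDF-aware clients gain a mapping to shared vocabularies.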
Acknowledgements
• Sample Phenotypes and Ontologies Team
  • Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Tony Burdett, Helen Parkinson
• Funding
  • European Molecular Biology Laboratory (EMBL)
  • European Union projects: DIACHRON, BioMedBridges, CORBEL and Excelerate
Topics of interest for the hackathon
• Ontology Mapping
• Disease (rare, common, phenotypes)
• Data annotation (automated, machine learning, text mining)
• Virtualised RDF data deployment
• RDF on the fly
• RDF over Mongo, Neo4j, Solr, Elastic
• REST + JSON-LD