Migrating to the Semantic Web: Bioinformatics as a case study.
Phillip Lord,
Dept of Computer Science,
University of Manchester
What is the Semantic Web
OWL, RDF, XML
We are here!
The talk
• Three (and a half) example case studies.
• Two different technologies.
• Why we choose the different technologies.
RDF in a nutshell: Tim Berners-Lee’s original vision… 1989
OWL in a nutshell
The Motivation
“At the doctor’s office, Lucy instructed her semantic web agent. It promptly retrieved information about her Mom’s prescribed treatment, looked up a list of several providers within 20 miles of home, with a good trust rating.”
Scientific American, May 2001:
Beware of the Hype!
The Motivating Example
Lucy, Doctor
myGrid
• UK e-Science Pilot Project.
• Oct 2001 – April 2005.
• £3.4 million.
• £0.4 million studentships.
Sites: Newcastle, Nottingham, Manchester, Southampton, Hinxton, Sheffield
Data(type)-intensive bioinformatics
ID MURA_BACSU STANDARD; PRT; 429 AA.
DE PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE ENOLPYRUVYL TRANSFERASE) (EPT).
GN MURA OR MURZ.
OS BACILLUS SUBTILIS.
OC BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC BACILLUS.
KW PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT CONFLICT 374 374 S -> A (IN REF. 3).
SQ SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
Service Stack
[Figure: the myGrid service stack, built on a Web Service (Grid Service) communication fabric.]
Layers: Applications; Core services; External services
Roles: Bioinformaticians; Tool Providers; Service Providers
Components shown: Work bench; Taverna; Talisman; Web Portal; Gateway; Personalisation; Provenance; Event Notification; Service and Workflow Discovery; myGrid Information Repository; Ontology Mgt; Metadata Mgt; Registries; Ontologies; Views; FreeFluo Workflow Enactment Engine; OGSA-DQP Distributed Query Processor; AMBIT Text Extraction Service; Native Web Services; SoapLab; GowLab; Legacy apps
WBS Workflows:
[Figure: workflow diagram; the recoverable service labels and annotations are listed below.]
Input path: GenBank Accession No, GenBank Entry, Seqret, Nucleotide seq (Fasta)
Nucleotide sequence analyses:
• GenScan: coding sequence, ORFs
• sixpack: 6 ORFs
• restrict: restriction enzyme map
• cpgreport: CpG island locations and %
• RepeatMasker: repetitive elements
• prettyseq: translation/sequence file, good for records and publications
• ncbiBlastWrapper: Blastn vs nr, est databases
• transeq: amino acid translation
Protein sequence analyses:
• epestfind: identifies PEST seq
• pscan: identifies FingerPRINTS
• pepstats: MW, length, charge, pI, etc.
• pepcoil: predicts coiled-coil regions
• Pepwindow? Octanol?: hydrophobic regions
• SignalP, TargetP, PSORTII: predict cellular location
• InterPro, PFAM, Prosite, Smart: identify functional and structural domains/motifs
• ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
Other labels: URL inc GB identifier; RepeatMasker; Query nucleotide sequence; ncbiBlastWrapper; Sort for appropriate sequences only
Colour key: Pink: outputs/inputs of a service. Purple: tailor-made services. Green: Emboss soaplab services. Yellow: Manchester soaplab services. Grey: unknowns.
Semantic discovery
• Query-ontology – discovering workflows and services described in the registry by building a query in Taverna.
• A common ontology is used to annotate and query.
• Look for all workflows that accept an input of semantic type nucleotide sequence.
• Aim to have semantic discovery over public view on the Web.
Service annotation
• Adding structured metadata to a workflow registration to enable others to discover and reuse it more effectively. For example, which semantic type of input it accepts.
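The annotation and discovery steps above can be pictured with a small sketch. It assumes a hypothetical registry and a hypothetical is-a hierarchy of semantic types; none of the term or workflow names come from the myGrid ontology or registry. A query for a semantic type also returns entries annotated with a more specific type, which is the benefit of querying through a shared ontology rather than by keyword.

```python
# Hypothetical is-a hierarchy of semantic types (child -> parent).
# The term names are illustrative, not taken from the myGrid ontology.
IS_A = {
    "nucleotide_sequence": "biological_sequence",
    "dna_sequence": "nucleotide_sequence",
    "protein_sequence": "biological_sequence",
}

# Hypothetical registry entries: each workflow is annotated with the
# semantic type of the input it accepts.
REGISTRY = [
    {"name": "gene_prediction", "input": "nucleotide_sequence"},
    {"name": "restriction_map", "input": "dna_sequence"},
    {"name": "domain_search", "input": "protein_sequence"},
]


def is_subtype(term, ancestor):
    """Return True if term equals ancestor or reaches it via is-a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def discover(registry, semantic_type):
    """Find entries whose annotated input type falls under semantic_type."""
    return [e["name"] for e in registry if is_subtype(e["input"], semantic_type)]


matches = discover(REGISTRY, "nucleotide_sequence")
print(matches)
```

A plain string match on "nucleotide_sequence" would miss the entry annotated with the narrower type; walking the is-a links is what recovers it.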
Semantic Discovery
View annotations on workflow
Pedro data capture tool
Drag a workflow entry into the explorer pane and the workflow loads.
Drag a service/workflow to the scavenger window for inclusion into the workflow.
Biologist
Ontologist
Service Providers
Problems when doing In Silico Experiments
Experiments are performed repeatedly, at different sites, at different times, by different users or groups.
Scientists
In silico experiments:
A large repository of records about experiments!!
• verification of data;
• “recipes” for experiment designs;
• explanation for the impact of changes;
• ownership;
• performance of services;
• data quality;
The Current State of the Art
Tim Berners-Lee’s original vision… 1989
A Semantic Web of Provenance
[Figure: a provenance record of a workflow run (how/which/when/where) linked to related resources.]
• what: literature relevant to the provenance study or data in this workflow
• why: experiment notes
• how: interlinking graph of the workflow that generates the provenance logs
• who: web page of people who have related interests to the owner of the workflow
• DAML+OIL ontologies linking provenance documents
• Document formats shown: XML, HTML
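One way to picture the linked provenance described above is as subject-predicate-object statements, the data model that RDF uses. The following is a minimal sketch in plain Python. The identifiers (`run:42`, `person:alice`, and so on) and the predicate names are hypothetical, and the sketch does not reproduce the myGrid provenance schema.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and predicate names here are made up for illustration.
TRIPLES = [
    ("run:42", "executedWorkflow", "workflow:example"),
    ("run:42", "launchedBy", "person:alice"),
    ("run:42", "startedAt", "2004-03-01T10:15:00"),
    ("run:42", "hasNote", "note:rerun_after_database_update"),
    ("workflow:example", "relatedLiterature", "doc:review_article"),
]


def query(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def who_ran(triples, run):
    """Answer the 'who' question for one workflow run."""
    return [o for (_, _, o) in query(triples, subject=run, predicate="launchedBy")]


print(who_ran(TRIPLES, "run:42"))
print(query(TRIPLES, predicate="relatedLiterature"))
```

Because every statement has the same shape, the who, why, and how questions become pattern matches over one graph rather than lookups in separate documents.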
Population Semantic Data
[Figure labels: Web Services; Taverna; FreeFluo; Metadata Repository; Data Repository; LaunchPad; Haystack]
Haystack from IBM
[Figure labels: Biologist; Database]
Gene Ontology Next Generation Project (GONG)
• Demonstrate the utility of finer grained concept descriptions in DAML+OIL (OWL-DL)
• Develop methodologies and tools to support the process
Translating theory into practice
• Gene Ontology provides a service to the model organism database community
• Description logic (DL) is a technology born out of computer science research
• OWL is a standard ontology interchange language underpinned by DL
GONG - proof of concept
• Maintaining an exhaustive is-a structure
[Figure legend: GO concept; is-a relationship; parent]
Example: heparin biosynthesis
Axis 1: Chemicals
[chemical] biosynthesis (GO:0009058)
  [i] carbohydrate biosynthesis (GO:0016051)
    [i] aminoglycan biosynthesis (GO:0006023)
      [i] heparin biosynthesis (GO:0030210)
      [i] glycosaminoglycan biosynthesis (GO:0006024)
Axis 2: Process
[i] heparin metabolism (GO:0030202)
  [i] heparin biosynthesis (GO:0030210)
Is this important?
• Missing is-a not noticed by users
• BUT… improves fidelity of DB record retrieval.
– Asking for gene products involved in ‘glycosaminoglycan biosynthesis’ will lead to an additional result:
O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
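The effect on retrieval can be sketched as follows. The parent links are one reading of the heparin biosynthesis example, and the annotation of O94923 to heparin biosynthesis is an assumption made for illustration; the function names are hypothetical.

```python
# is-a links between GO terms (child -> set of parents).
# Assumption: this is one reading of the heparin biosynthesis example.
IS_A = {
    "heparin biosynthesis": {"aminoglycan biosynthesis", "heparin metabolism"},
    "glycosaminoglycan biosynthesis": {"aminoglycan biosynthesis"},
    "aminoglycan biosynthesis": {"carbohydrate biosynthesis"},
    "carbohydrate biosynthesis": {"biosynthesis"},
}

# Assumption: the gene product is annotated to the most specific term.
ANNOTATIONS = {"O94923": "heparin biosynthesis"}


def ancestors(term, is_a):
    """Return term plus every term reachable from it through is-a links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, ()))
    return seen


def gene_products(query_term, is_a, annotations):
    """Gene products annotated to query_term or to any of its descendants."""
    return sorted(
        gp for gp, term in annotations.items() if query_term in ancestors(term, is_a)
    )


before = gene_products("glycosaminoglycan biosynthesis", IS_A, ANNOTATIONS)

# Add the missing is-a link and repeat the same query.
with_link = {child: set(parents) for child, parents in IS_A.items()}
with_link["heparin biosynthesis"].add("glycosaminoglycan biosynthesis")
after = gene_products("glycosaminoglycan biosynthesis", with_link, ANNOTATIONS)

print(before, after)
```

The query itself does not change between the two calls; only the hierarchy does, which is why a missing link is easy to overlook yet still affects results.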
Paraphrased reasoning process
• heparin biosynthesis
– class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
• glycosaminoglycan biosynthesis
– class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
[Diagram label: Is-a]
Inferring a new is-a link
• heparin biosynthesis
– class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
• glycosaminoglycan biosynthesis
– class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
[Diagram labels: Is-a, Is-a]
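The reasoning paraphrased above can be approximated in a few lines. This is a deliberately simplified sketch, not what a description logic reasoner actually does: it only compares classes defined as a named superclass plus one acts_on restriction. The chemical hierarchy is an assumption supplied for illustration, and the function names are hypothetical.

```python
# Assumed chemical hierarchy (child -> parent); illustrative only.
CHEMICAL_IS_A = {
    "heparin": "glycosaminoglycan",
    "glycosaminoglycan": "aminoglycan",
    "aminoglycan": "carbohydrate",
}

# Each defined process class: (named superclass, filler of acts_on).
DEFINITIONS = {
    "heparin biosynthesis": ("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": ("biosynthesis", "glycosaminoglycan"),
}


def chemical_subsumed_by(chemical, ancestor):
    """True if chemical equals ancestor or reaches it via is-a links."""
    while chemical is not None:
        if chemical == ancestor:
            return True
        chemical = CHEMICAL_IS_A.get(chemical)
    return False


def inferred_is_a(definitions):
    """Return (child, parent) pairs where the child's definition is narrower.

    Simplification: the named superclasses must be identical, and the
    child's acts_on filler must fall under the parent's filler.
    """
    links = []
    for child, (c_super, c_filler) in definitions.items():
        for parent, (p_super, p_filler) in definitions.items():
            if child == parent:
                continue
            if c_super == p_super and chemical_subsumed_by(c_filler, p_filler):
                links.append((child, parent))
    return links


links = inferred_is_a(DEFINITIONS)
print(links)
```

The new link between the two process classes is never asserted directly; it follows from the is-a link between the chemicals they act on.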
Results
• Carbohydrate metabolism, ~250 concepts
– 22 additional is-a links, 17 of which are now in GO
• Amino acid metabolism, ~250 concepts
– A further 17 additional is-a links, now in GO
• GO team will be reviewing results for metabolism as a whole once we have the tools to support the process
• Useful results come from even a partial coverage
Build a practical environment
• Tools needed for:
– Creating OWL definitions
– Tracking changes
– Reporting reasoning results
– Viewing definitions
Reporting tools
OWL for GONG
Biologist, Ontologist
Conclusions
• Three problems, three different solutions, all making use of semantic web technologies.
• A little semantics can go a long way.
• The expressivity of the language has to be chosen at least in part based on the tasks to be performed, and the user base.
• Tools, tools, tools.
Acknowledgments
• Jane Lomax and Midori Harris of the GO editorial team for help and advice and for responding to the suggested changes
• UMLS and MeSH, which provided valuable resources for chemical information
• Sean Bechhofer for development on OilEd
• Project funded as a subcontract of the DARPA DAML programme
Chris Wroe, Robert Stevens, Carole Goble: University of Manchester, UK
Michael Ashburner: EBI, Hinxton, UK
Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net
myGrid People
Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)
Collaborators
• Keith Decker