Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications
Part III: Biological Applications
Keynote - the First Online Metadata and Semantics Research Conference
http://www.metadata-semantics.org, November 23, 2005
Amit Sheth
LSDIS Lab, Department of Computer Science, University of Georgia
http://lsdis.cs.uga.edu
Acknowledgement: NCRR-funded Bioinformatics of Glycan Expression project; collaborators and partners at CCRC (Dr. William S. York); and Satya S. Sahoo, Christopher Thomas, and Cartic Ramakrishnan.
Computation, data and semantics in life sciences
• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
• “Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins
• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions.
Bioinformatics Apps & Ontologies
• GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans)
  – Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
  – URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreO: a comprehensive process ontology modeling experimental proteomics
  – Contains 330 classes and 40,000+ instances
  – Models three phases of experimental proteomics: separation techniques, mass spectrometry, and data analysis
  – URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
• Automatic semantic annotation of high-throughput experimental data (in progress)
• Semantic Web Process with WSDL-S for semantic annotations of Web Services
  – http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)
GlycO – A domain ontology for glycans
Structural modeling and population challenges in GlycO
• Extremely large number of glycans occurring in nature
• But frequently there are only small differences in structural properties
• Modeling all possible glycans would involve a significant number of redundant classes
• Redundancy often results in fatal complexities in maintenance and upgrade
• Population
  – Manual
  – Extraction and integration from external knowledge sources
  – GlycoTree: exploiting structural composition rules
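The redundancy argument above can be illustrated with a minimal sketch: if each residue position of a canonical tree is modeled once, an individual glycan can be described as a subset of those positions instead of as a separate class. The tree, the position names and the validity rule below are hypothetical toy choices for illustration, not the GlycoTree of Takahashi and Kato and not the GlycO population code.

```python
from dataclasses import dataclass, field


@dataclass
class Residue:
    """One position in a canonical tree (hypothetical toy data)."""
    name: str                      # position identifier, e.g. "Man-3"
    kind: str                      # monosaccharide type, e.g. "Man"
    children: list = field(default_factory=list)


def build_toy_tree() -> Residue:
    # Illustrative core only: two GlcNAc, one branching Man, two arm Man.
    arm_a = Residue("Man-4", "Man")
    arm_b = Residue("Man-4'", "Man")
    branch = Residue("Man-3", "Man", [arm_a, arm_b])
    inner = Residue("GlcNAc-2", "GlcNAc", [branch])
    return Residue("GlcNAc-1", "GlcNAc", [inner])


def index_parents(root: Residue) -> dict:
    """Map every position name to its parent's name (None for the root)."""
    parents = {root.name: None}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            parents[child.name] = node.name
            stack.append(child)
    return parents


def is_valid_glycan(positions: set, parents: dict) -> bool:
    """Assumed rule: a residue may be present only if its parent is present."""
    if not positions:
        return False
    return all(
        p in parents and (parents[p] is None or parents[p] in positions)
        for p in positions
    )


parents = index_parents(build_toy_tree())
print(is_valid_glycan({"GlcNAc-1", "GlcNAc-2", "Man-3"}, parents))
print(is_valid_glycan({"GlcNAc-1", "Man-3"}, parents))
```

With this representation, five modeled positions describe every valid sub-structure of the toy tree, so no class has to be duplicated per glycan.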
Ontology population workflow
GlycoTree – A Canonical Representation of N-Glycans
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251 (2003)
[Figure: N-glycan structure built from D-GlcpNAc and D-Manp residues joined by (1-2), (1-3), (1-4) and (1-6) linkages]
Beyond expressiveness afforded in OWL
• Probabilistic
• more
Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Example: Mass spectrometry analysis
Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Mass Spectrometry Experiment
Each m/z value in mass spec diagrams can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak)
• Different linkage
• Different bond
• Different isobaric structures
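Why a single peak can stand for several structures can be sketched as follows: mass depends only on composition, so two structures that differ in one linkage are indistinguishable by mass alone. This is a minimal sketch assuming underivatized, neutral glycans and standard monoisotopic residue masses; the two structures are hypothetical, and the numbers are not meant to reproduce the peak values quoted on these slides.

```python
from collections import Counter

# Monoisotopic residue masses (Da) for underivatized monosaccharides.
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "dHex": 146.0579}
WATER = 18.0106  # one water for the free reducing end

# Hypothetical structures: lists of (residue type, linkage to parent).
# They differ only in the linkage of the last residue.
STRUCTURE_A = [("HexNAc", "root"), ("HexNAc", "1-4"), ("Hex", "1-4"), ("dHex", "1-6")]
STRUCTURE_B = [("HexNAc", "root"), ("HexNAc", "1-4"), ("Hex", "1-4"), ("dHex", "1-3")]


def composition(structure):
    """Count residues by type, ignoring how they are linked."""
    return Counter(residue for residue, _linkage in structure)


def neutral_mass(structure):
    """Sum of residue masses plus one water; linkage plays no role."""
    comp = composition(structure)
    return sum(RESIDUE_MASS[r] * n for r, n in comp.items()) + WATER


def is_isobaric(a, b, tolerance=1e-6):
    """True if the two structures have the same mass within the tolerance."""
    return abs(neutral_mass(a) - neutral_mass(b)) <= tolerance


print(STRUCTURE_A != STRUCTURE_B, is_isobaric(STRUCTURE_A, STRUCTURE_B))
```

Resolving which structure is actually present therefore needs information beyond the m/z value, which is the role the following slides assign to background knowledge.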
Very subtle differences
• Peak at 1219.1
• Same molecular composition
• One diverging link
• Found in different organisms
• Background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty
These are core-fucosylated high-mannose glycans
CBank: 16155 (Honeybee venom)
CBank: 16154 (Bovine)
Even in the same organism
• Both glycans found in bovine cells
• Both have a mass of 3425.11
• Same composition
• Different linkage
• Since expression levels of different genes can be measured in the cell, we can get the probability of each structure in the sample
Different enzymes lead to these linkages
CBank: 21821
CBank: 21982
Model 1: associate probability as part of Semantic Annotation
• Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge
P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
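Model 1 can be sketched as a peak annotation that carries every candidate structure together with a probability. The sketch below assumes the probabilities come from normalizing relative weights (for example, measured expression levels of the enzymes that produce each linkage); the weights 3.0 and 2.0, the class name and the function name are illustrative assumptions, chosen only so that the result matches the 0.6 / 0.4 split above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    structure_id: str    # identifier of a candidate structure, e.g. "S"
    probability: float   # P(structure | observed m/z)


def annotate_peak(mz, weights):
    """Attach all candidate structures to a peak, with normalized probabilities.

    `weights` maps structure id -> non-negative relative weight (assumed input).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive weight")
    candidates = [Candidate(s, w / total) for s, w in weights.items()]
    return {"m/z": mz, "candidates": candidates}


annotation = annotate_peak(3461.57, {"S": 3.0, "T": 2.0})
for c in annotation["candidates"]:
    print(c.structure_id, round(c.probability, 2))
```

Keeping all candidates in the annotation, rather than only the most likely one, lets later evidence revise the probabilities without re-annotating the spectrum.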
Model 2: Probability in ontological representation of Glycan structure
• Build a generalized probabilistic glycan structure that embodies several possible glycans
N-Glycosylation Process (NGP)
[Figure: workflow diagram. Materials and data: cell culture; glycoprotein fraction; glycopeptides fractions; peptide fractions; ms data and ms/ms data; ms peaklist and ms/ms peaklist; N-dimensional array; peptide list. Steps: extract; proteolysis; separation technique I; PNGase; separation technique II; mass spectrometry; data reduction; binning; peptide identification; signal integration; data correlation; glycopeptide identification and quantification. Cardinality labels: 1, n, n*m.]
Phase II: Ontology Population
Populate ProPreO with all experimental datasets?
Two levels of ontology population for ProPreO:
• Level 1: Populate the ontology with instances that are stable across experimental runs
  – Ex: Human tryptic peptides (40,000 instances in ProPreO)
• Level 2: Use of URIs to point to actual experimental datasets
Ontology-mediated Proteomics Protocol
[Figure: protocol diagram. Instrument: mass spectrometer. Data processing applications: conversion to PKL, preprocessing, DB search, post-processing. Files: RAW files, PKL files (XML-based format), ‘clean’ PKL files, RAW results file, output (*.dat). Final step: storing output in DB.]
ProPreO concepts:
• Micromass_Q_TOF_ultima_quadrupole_time_of_flight_mass_spectrometer
• Masslynx_Micromass_application
• mass_spec_raw_data
• Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_data
• produces_ms-ms_peak_list
All values of the produces_ms-ms_peak_list property are micromass pkl ms-ms peaklist.
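The statement that all values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklists is a universal restriction on a property. A minimal sketch of checking such a restriction over plain subject/property/value triples follows. It treats the restriction as a closed-world integrity check, which is a simplification of OWL semantics, and every instance name and type label in it is hypothetical.

```python
def restriction_violations(triples, types, prop, required_class):
    """Return (subject, value) pairs whose value is not typed as required_class."""
    return [
        (s, o)
        for s, p, o in triples
        if p == prop and required_class not in types.get(o, set())
    ]


# Hypothetical instance data: two application runs and the files they produced.
TYPES = {
    "run42.pkl": {"micromass_pkl_ms-ms_peaklist"},
    "run43.txt": {"text_file"},
}
TRIPLES = [
    ("masslynx_run_42", "produces_ms-ms_peak_list", "run42.pkl"),
    ("masslynx_run_43", "produces_ms-ms_peak_list", "run43.txt"),
]

print(restriction_violations(
    TRIPLES, TYPES, "produces_ms-ms_peak_list", "micromass_pkl_ms-ms_peaklist"
))
```

A check of this kind is one way an ontology-mediated protocol could flag a step whose output does not match what the next step expects.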
Semantic Annotation of Scientific Data
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
ms/ms peaklist data
<ms/ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
             mode="ms/ms"/>
  <parent_ion_mass>830.9570</parent_ion_mass>
  <total_abundance>194.9604</total_abundance>
  <z>2</z>
  <mass_spec_peak m/z="580.2985" abundance="0.3592"/>
  <mass_spec_peak m/z="688.3214" abundance="0.2526"/>
  <mass_spec_peak m/z="779.4759" abundance="38.4939"/>
  <mass_spec_peak m/z="784.3607" abundance="21.7736"/>
  <mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
  <mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
  <mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
  <mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
Annotated ms/ms peaklist data
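The step from the raw peaklist to its annotated form can be sketched with the Python standard library. The sketch assumes the raw layout shown on the slide (first line: parent ion mass, total abundance, charge; remaining lines: m/z and abundance). Because "/" is not allowed in XML names, it uses the adapted names ms_ms_peak_list and mz; those names and the function name are assumptions for illustration, not the format used by the authors.

```python
import xml.etree.ElementTree as ET

RAW = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043"""


def annotate_peaklist(raw, instrument):
    """Wrap a raw ms/ms peaklist in labeled XML elements."""
    rows = [line.split() for line in raw.strip().splitlines() if line.strip()]
    parent_mass, total_abundance, charge = rows[0]

    root = ET.Element("ms_ms_peak_list")
    ET.SubElement(root, "parameter", instrument=instrument, mode="ms/ms")
    ET.SubElement(root, "parent_ion_mass").text = parent_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "mass_spec_peak", mz=mz, abundance=abundance)
    return root


root = annotate_peaklist(
    RAW, "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
)
print(ET.tostring(root, encoding="unicode"))
```

Values are kept as the original strings so that the annotated file preserves the instrument's reported precision.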
Formalize description and classification of Web Services using ProPreO concepts
Service description using WSDL-S
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp"
    …..
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      …..
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
WSDL ModifyDB (above)
WSDL-S ModifyDB (below)
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp"
    ……
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
    xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      ……
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
[Figure: fragment of the ProPreO process ontology showing the concepts data, sequence and peptide_sequence, labeled "Concepts defined in process Ontology"]
Description of a Web Service using: Web Service Description Language
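How a tool might read the semantic annotation out of a WSDL-S description can be sketched with the standard library XML parser. The document below is a cut-down, well-formed stand-in for the ModifyDB excerpt: the wssem namespace and the modelReference value are taken from the slide, the wsdl and soapenc namespace URIs are the standard WSDL 1.1 and SOAP 1.1 encoding ones, and everything elided on the slide is simply omitted. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"

# Cut-down stand-in for the slide's WSDL-S excerpt (elided parts omitted).
DOC = """
<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
</wsdl:definitions>
"""


def model_references(wsdl_text):
    """Map each annotated message name to its ontology concept reference."""
    root = ET.fromstring(wsdl_text.strip())
    attr = f"{{{WSSEM_NS}}}modelReference"
    refs = {}
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        concept = message.get(attr)
        if concept is not None:
            refs[message.get("name")] = concept
    return refs


print(model_references(DOC))
```

Messages without a modelReference are skipped, so plain WSDL input yields an empty mapping rather than an error.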
Summary, Observations, Conclusions
• Ontology Schema: relatively simple in business/industry, highly complex in science
• Ontology Population: could have millions of assertions, or unique features when modeling complex life science domains
• Ontology population could be largely automated if access to high-quality, curated data/knowledge is available; ontology population involves disambiguation and results in a richer representation than the extracted sources; rule-based population
• Ontology freshness (and validation: not just schema correctness but knowledge, i.e., how it reflects the changing world)
Summary, Observations, Conclusions
• Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
More information at
• http://lsdis.cs.uga.edu/projects/glycomics