Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications. Part III: Biological Applications. Keynote - the First Online Metadata and Semantics Research Conference, http://www.metadata-semantics.org, November 23, 2005. Amit Sheth, LSDIS Lab, Department of Computer Science, University of Georgia, http://lsdis.cs.uga.edu. Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishan.


Page 1: Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the

Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications

Part III: Biological Applications

Keynote - the First Online Metadata and Semantics Research Conference

http://www.metadata-semantics.org

November 23, 2005

Amit Sheth

LSDIS Lab, Department of Computer Science, University of Georgia

http://lsdis.cs.uga.edu

Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishan.

Page 2:

Computation, data and semantics in life sciences

• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999

• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000

• “Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins

• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb

We will show how semantics is a key enabler for achieving the above predictions and visions.

Page 3:

Bioinformatics Apps & Ontologies

• GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)

– Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy

– URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco

• ProPreO: a comprehensive process ontology modeling experimental proteomics

– Contains 330 classes, 40,000+ instances

– Models three phases of experimental proteomics*: separation techniques, mass spectrometry, and data analysis

– URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo

• Automatic semantic annotation of high throughput experimental data (in progress)

• Semantic Web Process with WSDL-S for semantic annotations of Web Services

– http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)

Page 4:

GlycO – A domain ontology for glycans

Page 5:

GlycO

Page 6:

Structural modeling and population challenges in GlycO

• Extremely large number of glycans occurring in nature

• But frequently there are only small differences in structural properties

• Modeling all possible glycans would involve a significant number of redundant classes

• Redundancy results in often fatal complexities in maintenance and upgrade

• Population

– Manual

– Extraction and integration from external knowledge sources

– GlycoTree – exploiting structural composition rules

Page 7:

Ontology population workflow

[Figure: ontology population workflow, including GlycoTree (Takahashi, Kato 2003)]

Page 8:

GlycoTree – A Canonical Representation of N-Glycans

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251

[Figure: canonical N-glycan tree built from D-GlcpNAc and D-Manp residues joined by (1-2), (1-3), (1-4) and (1-6) linkages]

Page 9:

Beyond expressiveness afforded in OWL

• Probabilistic

• more

Page 10:

Example: Mass spectrometry analysis

[Figure: manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.]

Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Page 11:

Mass Spectrometry Experiment

Each m/z value in mass spec diagrams can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak)

• Different linkage

• Different bond

• Different isobaric structures

Page 12:

Very subtle differences

• Peak at 1219.1

• Same molecular composition

• One diverging link

• Found in different organisms

• Background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty

These are core-fucosylated high-mannose glycans

CBank: 16155 (honeybee venom)

CBank: 16154 (bovine)
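The disambiguation described on this slide can be sketched in a few lines of Python. This is only an illustration: the CBank IDs, the peak value and the two sources come from the slide, while the record layout, the function name `candidates_for_peak`, the use of 1219.1 as each record's mass and the 0.5 matching tolerance are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GlycanRecord:
    cbank_id: int   # CBank identifier (from the slide)
    mass: float     # mass matched against the peak (assumed equal to the peak value)
    source: str     # organism or tissue in which the glycan is found


# Two structures with the same composition that differ in one link.
KNOWN_GLYCANS = [
    GlycanRecord(16155, 1219.1, "honeybee venom"),
    GlycanRecord(16154, 1219.1, "bovine"),
]


def candidates_for_peak(mz: float,
                        tolerance: float = 0.5,
                        source: Optional[str] = None) -> List[GlycanRecord]:
    """Return glycans whose mass matches the peak, optionally filtered by sample source."""
    hits = [g for g in KNOWN_GLYCANS if abs(g.mass - mz) <= tolerance]
    if source is not None:
        # Background knowledge about the sample narrows the candidates.
        hits = [g for g in hits if g.source == source]
    return hits


ambiguous = candidates_for_peak(1219.1)                  # both structures match
resolved = candidates_for_peak(1219.1, source="bovine")  # only CBank 16154 remains
```

Without the source filter the peak stays ambiguous; with it, a single structure is left.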

Page 13:

Even in the same organism

• Both glycans found in bovine cells

• Both have a mass of 3425.11

• Same composition

• Different linkage

• Since expression levels of different genes can be measured in the cell, we can estimate the probability of each structure in the sample

Different enzymes lead to these linkages

CBank: 21821

CBank: 21982
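One simple way to turn measured expression levels into structure probabilities is to normalize them. The sketch below assumes that the probability of a structure is proportional to the expression level of the enzyme that produces its distinguishing linkage; that proportionality, the function name and the numeric levels are assumptions for illustration, not something the slide states.

```python
from typing import Dict


def structure_probabilities(enzyme_expression: Dict[str, float]) -> Dict[str, float]:
    """Normalize per-structure enzyme expression levels into probabilities."""
    if any(level < 0 for level in enzyme_expression.values()):
        raise ValueError("expression levels must be non-negative")
    total = sum(enzyme_expression.values())
    if total <= 0:
        raise ValueError("at least one expression level must be positive")
    return {structure: level / total
            for structure, level in enzyme_expression.items()}


# Hypothetical expression levels of the enzymes behind each linkage.
expression = {"CBank:21821": 30.0, "CBank:21982": 10.0}
probabilities = structure_probabilities(expression)
```

A real estimate would likely need a calibrated model of enzyme activity rather than a plain ratio.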

Page 14:

Model 1: associate probability as part of Semantic Annotation

• Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge

Page 15:

P(S | M = 3461.57) = 0.6

P(T | M = 3461.57) = 0.4

Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Page 16:

Model 2: Probability in ontological representation of Glycan structure

• Build a generalized probabilistic glycan structure that embodies several possible glycans
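A generalized probabilistic structure could, for example, keep a distribution over alternative linkages at each uncertain position and expand into concrete glycans on demand. The sketch below is one possible reading of Model 2, not the representation used in GlycO: the position names, linkage alternatives, probabilities and the independence assumption between positions are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical generalized structure: position -> {alternative linkage: probability}.
UNCERTAIN_POSITIONS: Dict[str, Dict[str, float]] = {
    "branch_A": {"(1-3)": 0.6, "(1-6)": 0.4},
    "branch_B": {"(1-2)": 0.5, "(1-4)": 0.5},
}


def enumerate_structures(
    positions: Dict[str, Dict[str, float]]
) -> List[Tuple[Dict[str, str], float]]:
    """Expand the generalized structure into concrete structures.

    Assumes the choices at different positions are independent, so the
    probability of a concrete structure is the product of its choices.
    """
    names = list(positions)
    results = []
    for combo in product(*(positions[name].items() for name in names)):
        assignment = {}
        probability = 1.0
        for name, (linkage, p) in zip(names, combo):
            assignment[name] = linkage
            probability *= p
        results.append((assignment, probability))
    return results


structures = enumerate_structures(UNCERTAIN_POSITIONS)
```

The single generalized structure stands in for every concrete glycan it can expand to, which avoids storing each variant as a separate class.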

Page 17:

N-Glycosylation Process (NGP)

[Figure: NGP workflow diagram. Materials and data shown: Cell Culture, Glycoprotein Fraction, Glycopeptides Fraction, Peptide Fraction, ms data, ms/ms data, ms peaklist, ms/ms peaklist, Peptide list, N-dimensional array. Steps shown: extract, proteolysis, Separation technique I, PNGase, Separation technique II, Mass spectrometry, Data reduction, binning, Peptide identification, Signal integration, Data correlation, Glycopeptide identification and quantification. Multiplicity labels: 1, n, n*m.]

Page 18:

Page 19:

Phase II: Ontology Population

Populate ProPreO with all experimental datasets?

Two levels of ontology population for ProPreO:

• Level 1: Populate the ontology with instances that are stable across experimental runs

– Ex: Human tryptic peptides – 40,000 instances in ProPreO

• Level 2: Use of URIs to point to actual experimental datasets
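The two levels can be pictured as a small store that holds stable instances directly and keeps only URIs for run-specific data. This is a minimal sketch under assumed names: `OntologyStore`, the instance identifier and the example URI are hypothetical, and only the class name `peptide_sequence` is taken from elsewhere in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OntologyStore:
    # Level 1: instances that are stable across experimental runs (id -> class name).
    stable_instances: Dict[str, str] = field(default_factory=dict)
    # Level 2: per-instance URIs that point at the actual experimental datasets.
    dataset_uris: Dict[str, List[str]] = field(default_factory=dict)

    def add_instance(self, instance_id: str, class_name: str) -> None:
        self.stable_instances[instance_id] = class_name

    def link_dataset(self, instance_id: str, uri: str) -> None:
        """Attach a dataset by reference; the data itself stays outside the ontology."""
        if instance_id not in self.stable_instances:
            raise KeyError(f"unknown instance: {instance_id}")
        self.dataset_uris.setdefault(instance_id, []).append(uri)


store = OntologyStore()
store.add_instance("peptide_0001", "peptide_sequence")               # hypothetical id
store.link_dataset("peptide_0001", "http://example.org/runs/42.pkl")  # hypothetical URI
```

Keeping datasets behind URIs means the ontology does not grow with every experimental run.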

Page 20:

Ontology-mediated Proteomics Protocol

[Figure: protocol diagram. Stages shown: Mass Spectrometer, Conversion To PKL, Preprocessing, DB Search, Post processing, Storing Output. Labels shown: Instrument, Data Processing Application, DB. Files shown: RAW Files, PKL Files (XML-based Format), ‘Clean’ PKL Files, RAW Results File, Output (*.dat).]

ProPreO concepts attached to the protocol:

– Micromass_Q_TOF_ultima_quadrupole_time_of_flight_mass_spectrometer

– Masslynx_Micromass_application

– mass_spec_raw_data

– Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_data

– produces_ms-ms_peak_list

All values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklist
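The statement that all values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklists reads like a universal ("all values from") restriction. The sketch below checks such a restriction over a small set of asserted triples. It is a closed-world check over the data at hand, which differs from how an OWL reasoner treats the restriction; the instance names `peaklist_001` and `results_002`, the underscored class name, and the choice of `Masslynx_Micromass_application` as the subject are assumptions.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def violations_of_all_values_from(triples: List[Triple],
                                  types: Dict[str, Set[str]],
                                  prop: str,
                                  required_class: str) -> List[Tuple[str, str]]:
    """Return (subject, value) pairs where a value of `prop` lacks `required_class`."""
    return [(s, o) for s, p, o in triples
            if p == prop and required_class not in types.get(o, set())]


PROP = "produces_ms-ms_peak_list"
REQUIRED = "micromass_pkl_ms-ms_peaklist"   # assumed spelling of the class name

TYPES = {
    "peaklist_001": {"micromass_pkl_ms-ms_peaklist"},
    "results_002": {"mass_spec_raw_data"},
}
GOOD = [("Masslynx_Micromass_application", PROP, "peaklist_001")]
BAD = GOOD + [("Masslynx_Micromass_application", PROP, "results_002")]

ok = violations_of_all_values_from(GOOD, TYPES, PROP, REQUIRED)
problems = violations_of_all_values_from(BAD, TYPES, PROP, REQUIRED)
```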

Page 21:

Semantic Annotation of Scientific Data

ms/ms peaklist data

830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043

Annotated ms/ms peaklist data

<ms/ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
  <parent_ion_mass>830.9570</parent_ion_mass>
  <total_abundance>194.9604</total_abundance>
  <z>2</z>
  <mass_spec_peak m/z="580.2985" abundance="0.3592"/>
  <mass_spec_peak m/z="688.3214" abundance="0.2526"/>
  <mass_spec_peak m/z="779.4759" abundance="38.4939"/>
  <mass_spec_peak m/z="784.3607" abundance="21.7736"/>
  <mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
  <mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
  <mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
  <mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
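The annotation step on this slide (raw peak list in, tagged peak list out) might be sketched as follows with Python's standard `xml.etree.ElementTree`. The function name `annotate_peaklist` is hypothetical. The sketch also assumes that the first line holds parent ion mass, total abundance and charge, and it renames `ms/ms_peak_list` to `ms_ms_peak_list` and `m/z` to `mz`, because names containing "/" are not legal in XML.

```python
import xml.etree.ElementTree as ET

RAW_PEAKLIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""


def annotate_peaklist(raw: str, instrument: str) -> ET.Element:
    """Wrap a raw ms/ms peak list in descriptive XML elements."""
    rows = [line.split() for line in raw.splitlines() if line.strip()]
    # Assumed layout: header row = parent ion mass, total abundance, charge.
    parent_mass, total_abundance, charge = rows[0]

    root = ET.Element("ms_ms_peak_list")
    ET.SubElement(root, "parameter", instrument=instrument, mode="ms/ms")
    ET.SubElement(root, "parent_ion_mass").text = parent_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    # Remaining rows are (m/z, abundance) pairs; values are kept as strings.
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "mass_spec_peak", mz=mz, abundance=abundance)
    return root


annotated = annotate_peaklist(
    RAW_PEAKLIST,
    "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer",
)
xml_text = ET.tostring(annotated, encoding="unicode")
```

Keeping the numbers as strings preserves the original precision of the instrument output.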

Page 23:

Service description using WSDL-S

Formalize description and classification of Web Services using ProPreO concepts

WSDL ModifyDB (description of a Web Service using: Web Service Description Language)

<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" ….. xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      …..
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>

WSDL-S ModifyDB

<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" …… xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics" xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      ……
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>

Concepts defined in the ProPreO process ontology: data, sequence, peptide_sequence
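A consumer of such a description could read the semantic annotations back out of the messages. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed, self-contained document. The `wssem` namespace URI and the message names are taken from the slide; the `wsdl` and `soapenc` namespace URIs are the usual WSDL 1.1 and SOAP encoding ones and are filled in here as an assumption, since the slide elides them. The function name `model_references` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

WSDL_S = """\
<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
</wsdl:definitions>
"""

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "wssem": "http://www.ibm.com/xmlns/WebServices/WSSemantics",
}


def model_references(wsdl_text: str) -> Dict[str, str]:
    """Map each annotated message name to the ontology concept it references."""
    root = ET.fromstring(wsdl_text)
    attr = "{%s}modelReference" % NS["wssem"]   # namespaced attribute key
    return {
        message.get("name"): message.get(attr)
        for message in root.findall("wsdl:message", NS)
        if message.get(attr) is not None
    }


references = model_references(WSDL_S)
```

Messages without a `wssem:modelReference` attribute, such as the response here, are simply left out of the result.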

Page 24:

Summary, Observations, Conclusions

• Ontology Schema: relatively simple in business/industry, highly complex in science

• Ontology Population: could have millions of assertions, or unique features when modeling complex life science domains

• Ontology population could be largely automated if access to high quality/curated data/knowledge is available; ontology population involves disambiguation and results in richer representation than extracted sources, rules based population

• Ontology freshness (and validation—not just schema correctness but knowledge—how it reflects the changing world)

Page 25:

Summary, Observations, Conclusions

• Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

Page 26:

More information at

• http://lsdis.cs.uga.edu/projects/glycomics