Improving the Sensitivity of Peptide Identification With Meta-Search and Machine Learning

Poster produced by Faculty & Curriculum Support (FACS), Georgetown University Medical Center

Nathan J. Edwards 1, Xue Wu 2, Chau-Wen Tseng 2

1 Georgetown University Medical Center; 2 University of Maryland, College Park

Introduction

We use a variety of techniques, from sequence enumeration and meta-search to machine learning, to increase the number of high-confidence peptide identifications from large tandem mass-spectra datasets.

These techniques seek to improve the number of peptide identifications made at a given level of statistical significance.

We show that these techniques can improve identification sensitivity significantly.

Conclusions

Application of enumeration, meta-search, and machine learning can significantly improve the sensitivity of peptide identification.

Peptide sequence databases, meta-search engine, and machine-learning combiner are available from: http://edwardslab.bmcb.georgetown.edu

References

1. Edwards. Novel Peptide Identification using Expressed Sequence Tags and Sequence Database Compression. Mol. Sys. Biol. 2007.

2. Wu, Tseng, Edwards. HMMatch: Peptide Identification by Spectral Matching of Tandem Mass Spectra using Hidden Markov Models. J. Comp. Biol. 2007.

3. Wu, Tseng, Rudnick, Balgley, Edwards. PepArML: An Unsupervised, Model-Free, Combining, Peptide Identification Arbiter for Tandem Mass Spectra via Machine Learning. In preparation.

Peptide Sequence Databases

PepSeqDB Release 1.2

All peptide sequences from:

Six-frame translation of EST and HTC sequences;
Three-frame translation of mRNA sequences;
All IPI, RefSeq, GenBank, Vega, EMBL, HInvDB, SwissProt and TrEMBL proteins;
SwissProt variants, splices, conflicts, mature isoforms;

grouped by gene cluster and compressed, as FASTA.

Organism     Size (AA)   Size (Entries)
Human        209Mb       75,043
Mouse        151Mb       55,929
Rat          67Mb        43,211
Zebrafish    90Mb        47,922

Schedule: Automated rebuild every few months.

Coming soon: Fast peptide-to-gene and source-sequence mapping using suffix trees and gene sequence-groups.
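The six-frame and three-frame translations that feed the peptide sequence database can be illustrated with a minimal Python sketch. This is not the PepSeqDB build code: the function names (translate, frame_translations, open_segments) and the min_len parameter are hypothetical, the standard genetic code is assumed, and the gene-cluster grouping and compression steps are not shown.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def frame_translations(seq, six_frame=True):
    """Return the three forward-frame translations, plus the three
    reverse-complement frames when six_frame is True (e.g. for ESTs)."""
    seq = seq.upper()
    strands = [seq]
    if six_frame:
        strands.append(seq.translate(COMPLEMENT)[::-1])
    return [translate(s[offset:]) for s in strands for offset in range(3)]


def open_segments(protein, min_len=7):
    """Split a translation at stop codons ('*') and keep segments that are
    long enough to contain a peptide (min_len is an illustrative cutoff)."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]
```

Under these assumptions, an EST or HTC sequence would go through frame_translations(seq), while an mRNA sequence, whose strand is known, would use frame_translations(seq, six_frame=False).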

Annual Meeting, 2008

Peptide Identification Meta-Search

Meta-search with four search engines; target & decoy searches automatically.

Web-service API for all data
Secure communication
Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs); UMIACS (250+ CPUs); Edwards Lab (scheduler & 48+ CPUs)
Simple search description
Scales to 100's of simultaneous searches
Free, instant registration

PepArML - Unsupervised Machine-Learning Combiner

[Figure: PepArML results for the Q-TOF, LTQ, and MALDI datasets. Axis labels: False Positive Rate; Iteration. Curves: H, C-TMO, U-TMO, U*-TMO.]

Legend: Heuristic: H; Classifier w/ 5-fold-CV: C-T, C-M, C-O, C-TM, C-TO, C-MO, C-TMO; Unsupervised classifier w/ 5-fold-CV: U-TMO; Unsupervised classifier w/ no-CV: U*-TMO.
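Counting identifications at a fixed error rate, using the decoy searches run alongside the target searches, can be sketched as follows. This is a generic target-decoy estimate, not necessarily the procedure behind the poster's plots. It assumes that higher scores are better and that the target and decoy databases are the same size. The function names and example scores are hypothetical.

```python
def decoy_error_rate(target_scores, decoy_scores, threshold):
    """Estimate the error rate among target hits scoring >= threshold as
    (decoy hits passing) / (target hits passing)."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0


def identifications_at(target_scores, decoy_scores, max_rate=0.05):
    """Largest number of target hits accepted by any score threshold whose
    estimated error rate does not exceed max_rate."""
    best = 0
    for threshold in sorted(set(target_scores), reverse=True):
        if decoy_error_rate(target_scores, decoy_scores, threshold) <= max_rate:
            accepted = sum(s >= threshold for s in target_scores)
            best = max(best, accepted)
    return best


# Made-up scores, for illustration only.
example_targets = [9.1, 8.4, 7.7, 6.0, 5.2, 3.1]
example_decoys = [5.5, 3.0, 2.2, 1.0]
```

A method with better sensitivity, in the sense used in the Introduction, returns a larger count from identifications_at for the same spectra and the same max_rate.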

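HMMatch Spectral Matching

[Figure: HMMatch state diagram with Begin and End states, states I0 to I6, and ion states b1, y1, b2, y2, b3, y3. Percentage labels in the figure: 11%, 17%, 6%, 94%, 8%, 0%, 11%, 86%, 17%, 0%, 6%, 92%, 19%.]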