Improving the Reliability of Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

2

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

3

Mass Spectrometer

[Diagram: Sample → Ionizer → Mass Analyzer → Detector]

• Ionizer: MALDI, Electrospray Ionization (ESI)

• Mass Analyzer: Time-of-Flight (TOF), Quadrupole, Ion Trap

• Detector: Electron Multiplier (EM)

4

High Bandwidth

[Figure: mass spectrum; x-axis: m/z (0 to 1000); y-axis: % Intensity (0 to 100)]

5

Mass is fundamental!

6


• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

7


• Mass spectrometry has been around since the turn of the century...
  • ...why is MS based Proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein sequence databases
  • A reference for comparison

8

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

9

Single Stage MS

[Figure: MS spectrum; x-axis: m/z]

10

Tandem Mass Spectrometry (MS/MS)

Precursor selection

[Figure: two spectra panels; x-axis: m/z]

11

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation (CID)

[Figure: two spectra panels, one labeled MS/MS; x-axis: m/z]

12

The big picture...

• MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.

• Key concepts:
  • Spectrum acquisition is unbiased
  • Direct observation of amino-acid sequence
  • Sensitive to minor sequence variation
  • Observed peptides represent folded proteins

13

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures
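
The three steps above can be sketched in a few lines. This is a minimal illustration, not the scoring used by any real search engine: it assumes singly charged b- and y-ions only, a fixed m/z tolerance, and a score equal to the number of matched fragments. The function names (`fragment_mz`, `count_matches`, `rank_candidates`) and the `tol` parameter are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions, y_ions


def count_matches(peptide, peaks_mz, tol=0.5):
    """Step 2: number of theoretical fragments within tol of some observed peak."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks_mz) for f in b_ions + y_ions)


def rank_candidates(candidates, peaks_mz, tol=0.5):
    """Step 3: candidates sorted by match count, best first."""
    scored = [(count_matches(p, peaks_mz, tol), p) for p in candidates]
    return sorted(scored, reverse=True)
```

Calling `rank_candidates(["GA", "AG"], [58.03, 90.05])` places `"GA"` first, because both of its fragments fall within the tolerance of an observed peak.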

14

Peptide Identification, but...

• What about novel peptides?
  • Search compressed ESTs (C3, PepSeqDB)

• What about peak intensity?
  • Spectral matching using HMMs (HMMatch)

• Which identifications are correct?
  • Unsupervised, model-free, result combiner with false discovery rate estimation

15

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

16

What goes missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

17

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins

• Proteins have clinical implications
  • Biomarker discovery

• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • Little hard evidence for translation start site

18

Novel Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.

• LIME1 gene:
  • LCK interacting transmembrane adaptor 1

• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.

• Multiple significant peptide identifications

19


http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

20


http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr20:61839670-61839828&hgsid=68112320&est=pack

21

Novel Mutation

• HUPO Plasma Proteome Project
  • Pooled samples from 10 male & 10 female healthy Chinese subjects
  • Plasma/EDTA sample protocol
  • Li, et al. Proteomics 2005. (Lab 29)

• TTR gene
  • Transthyretin (pre-albumin)
  • Defects in TTR are a cause of amyloidosis.
  • Familial amyloidotic polyneuropathy
    • late-onset, dominant inheritance

22

Novel Mutation

Ala2→Pro associated with familial amyloid polyneuropathy

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300002887.18.xml&uid=202568&label=AAAKEPZA&homolog=AAAKEPZA&id=1838.1.1&proex=-1

23

Novel Mutation

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr18:27426944-27426971&hgsid=68063647


24

Searching ESTs

• Proposed long ago:
  • Yates, Eng, and McCormack; Anal Chem, ’95.

• Now:
  • Protein sequences are sufficient for protein identification
  • Computationally expensive/infeasible
  • Difficult to interpret

• Make EST searching feasible for routine searching to discover novel peptides.

25

Searching Expressed Sequence Tags (ESTs)

Pros
• No introns!
• Primary splicing evidence for annotation pipelines
• Evidence for dbSNP
• Often derived from clinical cancer samples

Cons
• No frame
• Large (8Gb)
• “Untrusted” by annotation pipelines
• Highly redundant
• Nucleotide error rate ~ 1%

26

Compressed EST Peptide Sequence Database

• For all ESTs mapped to a UniGene gene:
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database

• Complete, Correct for amino-acid 30-mers

• Gene-centric peptide sequence database:
  • Size: < 3% of naïve enumeration, 20774 FASTA entries
  • Running time: ~ 1% of naïve enumeration search
  • E-values: ~ 2% of naïve enumeration search results
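
The first three construction steps can be sketched as follows. This is only an illustration under stated assumptions: an "ORF" is taken to be any stop-to-stop stretch of a translated frame, the standard genetic code is used, and unknown codons become `X`. The final compression step is not shown, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=30):
    """Stop-to-stop segments of at least min_len residues, over all six frames."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = translate(strand[frame:])
            orfs.extend(seg for seg in protein.split("*") if len(seg) >= min_len)
    return orfs


def repeated_kmers(orfs, k=30):
    """Amino-acid k-mers seen at least twice across the given ORFs."""
    counts = Counter(orf[i:i + k] for orf in orfs
                     for i in range(len(orf) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= 2}
```

In this sketch the ORFs from every EST mapped to one gene would be pooled before calling `repeated_kmers`, so that a 30-mer supported by only a single EST is dropped.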

27

PepSeq FASTA Databases

• Organisms:
  • HUMAN, MOUSE, RAT, ZEBRAFISH

• Peptide Evidence:
  • GenBank mRNA, EST, HTC
  • RefSeq mRNA, Proteins
  • Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI Proteins
  • Swiss-Prot variants
  • Swiss-Prot signal peptide & init. Met removal

• Single FASTA entry per Gene

28

Spectral Matching for Peptide Identification

• Detection vs. identification
  • Increased sensitivity & specificity
  • No novel peptides!

• NIST GC/MS Spectral Library
  • Identifies small molecules
  • 100,000’s of (consensus) spectra
  • Bundled/Sold with many instruments
  • “Dot-product” spectral comparison
  • Current project: Peptide MS/MS
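
A "dot-product" comparison can be illustrated with a plain normalized dot product (cosine similarity) over binned peaks. This is a sketch only; it is not the weighting scheme used by the NIST software, and the bin width and function names are assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks falling into the same m/z bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity of two binned spectra: 1.0 is identical, 0.0 is disjoint."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Because the score is normalized, scaling every intensity in one spectrum by a constant leaves it unchanged, which is why library matching of this kind is insensitive to overall signal strength.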

29

NIST MS Search: Peptides

30

Peptide DLATVYVDVLK

31

Protein Families

32

Protein Families

33

Peptide DLATVYVDVLK

34

Hidden Markov Models for Spectral Matching

• Capture statistical variation and consensus in peak intensity
  • Only need 10 spectra to build a model

• Capture semantics of peaks
  • Extrapolate model to other peptides

• Good specificity with superior sensitivity for peptide detection
  • Assign 1000’s of additional spectra (p-value < 10^-5)

35

Hidden Markov Model

[Diagram: HMM with Ion, Delete, and Insert states]

(m/z, int) pair emitted by ion & insert states

36

The devil in the details

• Intensity normalization

• Discretize (m/z,int) pairs

• Viterbi distance as score

• Compute p-value using “random” spectra
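
Three of these details can be sketched briefly. The slides do not say how normalization, discretization, or the p-value is computed, so everything here is an assumption: intensities are scaled to the base peak, each peak becomes an (m/z bin, intensity level) symbol, and the p-value is the empirical fraction of "random" spectra scoring at least as well. A smaller Viterbi distance is assumed to be better; all names are hypothetical.

```python
def normalize_and_discretize(peaks, mz_bin=1.0, levels=4):
    """Scale intensities to the base peak, then map peaks to (m/z bin, level)."""
    base = max(intensity for _, intensity in peaks)
    symbols = []
    for mz, intensity in peaks:
        relative = intensity / base
        level = min(int(relative * levels), levels - 1)  # base peak gets top level
        symbols.append((int(mz / mz_bin), level))
    return symbols


def empirical_p_value(observed, random_scores, smaller_is_better=True):
    """Fraction of random-spectrum scores at least as good as the observed one.

    The +1 terms keep the estimate strictly positive.
    """
    if smaller_is_better:
        extreme = sum(1 for s in random_scores if s <= observed)
    else:
        extreme = sum(1 for s in random_scores if s >= observed)
    return (extreme + 1) / (len(random_scores) + 1)
```

With this estimator the smallest attainable p-value is 1/(N+1) for N random spectra, so very small p-values need either a very large random sample or a fitted score distribution.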

37

Random Spectra

• Uniform sample of (m/z, int)

• Permutation (m/z) of true spectra peaks

• M/z distribution between true spectra and uniform sample (parameter)

[Figure: histogram of # of spectra vs. Viterbi score for Random, True, and False spectra]

38

HMM Peptide Identification Results – DLATV

[Figure: DLAT (viterbi); histogram of # of spectra (0 to 240) vs. Viterbi distance (bins 0-10 through 280-290); series: True_test(0.0001), True_test(other), False_test(0.0001), False_test(other)]

[Figure: DLAT (-logP); histogram of # of spectra (0 to 100) vs. -log(p-value) (bins 0-1 through 18-19, inf); series: True_test(0.0001), True_test(other), False_test(0.0001), False_test(other)]

39

Spectral Matching of Peptide Variants

DFLAGGVAAAISK

DFLAGGIAAAISK

40

HMM model extrapolation

41

Mascot Search Results

42

Peptide Identification Results

• Search engines always provide an answer

• Current search engines:
  • Hard to determine “good” scores
  • Significance estimates are unreliable

• Need better methods!

43

Common Algorithmic Framework

• Pre-process experimental spectra

• Filter peptide candidates

• Score match between peptides and spectra

• Rank peptides and assign

44

Comparison of search engines

• No single score is comprehensive

• Search engines disagree

• Many spectra lack confident peptide assignment

[Figure: Venn diagram of spectra identified by OMSSA, X!Tandem, and Mascot; region labels: 4%, 10%, 2%, 5%, 9%, 69%, 2%]

45

Lots of published solutions!

• Treat search engines as black-boxes

• Apply supervised machine learning to results
  • Use multiple match metrics

• Combine/refine using multiple search engines
  • Agreement suggests correctness

• Use empirical significance estimates
  • “Decoy” databases (FDR)
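
A decoy-based FDR estimate can be sketched as below. The assumptions are: separate target and decoy searches against databases of equal size, higher scores are better, and the number of decoy hits above a threshold approximates the number of false target hits above it. The function names are hypothetical.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimated FDR among target hits scoring at or above the threshold."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    return min(1.0, decoys / targets) if targets else 0.0


def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.05):
    """Lowest observed target score whose estimated FDR is within max_fdr."""
    for threshold in sorted(set(target_scores)):
        if fdr_at_threshold(target_scores, decoy_scores, threshold) <= max_fdr:
            return threshold
    return None  # no threshold meets the requested FDR
```

For target scores `[10, 9, 8, 7, 3, 2]` and decoy scores `[4, 3, 1]`, a threshold of 3 gives an estimated FDR of 2/5, while a threshold of 7 admits no decoys.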

46

PepArML

• Peptide identification arbiter by machine learning

• Unifies these ideas within a model-free, combining machine learning framework

• Unsupervised training procedure

47

PepArML Overview

• Unify Tandem, Mascot, and OMSSA results

[Diagram: X!Tandem, Mascot, OMSSA, and Other results feed into PepArML; outputs: Identified, Unidentified]

48

Voting Heuristic Combiner

• Choose peptide ID with the most votes
  • Use best FDR as confidence

• Break ties (single votes) using FDR

• Strawman for comparison
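
The voting heuristic is simple enough to sketch directly. The input layout (one `(peptide, fdr)` pair per search engine for a single spectrum) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def vote(engine_ids):
    """Pick one peptide ID for a spectrum from per-engine (peptide, fdr) pairs.

    Returns (peptide, votes, best_fdr), or None if no engine reported an ID.
    """
    votes = defaultdict(int)
    best_fdr = {}
    for peptide, fdr in engine_ids.values():
        votes[peptide] += 1
        best_fdr[peptide] = min(fdr, best_fdr.get(peptide, 1.0))
    if not votes:
        return None
    # Most votes wins; ties are broken by the lower (better) FDR.
    winner = min(votes, key=lambda p: (-votes[p], best_fdr[p]))
    return winner, votes[winner], best_fdr[winner]
```

If two engines report `PEPTIDEK` and a third reports a different peptide with a lower FDR, `PEPTIDEK` still wins on votes; the FDR only decides between peptides with equal vote counts.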

49

Dataset construction

[Diagram: Search Tools → Extract Features → Machine Learning]

• Search Tools: X!Tandem, Mascot, OMSSA, Other

• Extracted features: Spectra compare, Matched Ions, Peak intensity, Mass delta, # of missed cleavages, Peptide length, Tandem Score, Mascot Score, OMSSA Score

50


• Build feature vectors

Pair        Tandem   Mascot   OMSSA    Label
(S1, P1)    …        …        …        T
(S1, P2)    …        …        …        F
(S2, P1)    …        …        …        T
…
(Sn, Pm)    …        …        …        T

51


• Synthetic protein mixtures provide ground truth

• C8
  • 8 standard proteins (Calibrant Biosystems)
  • 4594 MS/MS spectra (LTQ)
  • 618 (11.2%) true positives

• S17
  • 17 standard proteins (Sashimi Repository)
  • 1389 MS/MS spectra (Q-TOF)
  • 354 (25.4%) true positives

• AURUM
  • 364 standard proteins (AURUM 1.0)
  • 7508 MS/MS spectra (MALDI-TOF-TOF)
  • 3775 (50.3%) true positives

52

Machine learning improves single search engines (S17)

53

Multiple search engines are better than single search engines (S17)

54

Feature Evaluation

55

Application to Real Data

• How well do these models generalize?

• Different instruments
  • Spectral characteristics change scores

• Search parameters
  • Different parameters change score values

• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters × search engine sets × instruments

56

Model Generalization

57

Rescuing Machine Learning

• Train a new machine-learning model for every dataset!
  • Generalization not required
  • No predetermined search engines, parameters, instruments, features

• Perhaps we can “guess” the true proteins
  • Most proteins not in doubt
  • Machine learning can tolerate imperfect labels

58

Unsupervised Learning

• Heuristic selection of “true” proteins
  • Train classifier, predict true peptide IDs

• Update “true” proteins
  • Heuristic selection of “true” proteins from classifier predictions

• Iterate until convergence
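
The iteration can be sketched with toy stand-ins for each piece. The assumptions are substantial and should not be read as the PepArML implementation: each identification is a `(peptide, protein, score)` tuple with a single score, a protein counts as "true" when it has at least two distinct accepted peptides, and the "classifier" is a one-feature threshold (a decision stump). All names and the `min_peptides` rule are hypothetical.

```python
def select_proteins(psms, accepted, min_peptides=2):
    """Heuristic: proteins with at least min_peptides distinct accepted peptides."""
    peptides = {}
    for (peptide, protein, _), ok in zip(psms, accepted):
        if ok:
            peptides.setdefault(protein, set()).add(peptide)
    return {prot for prot, peps in peptides.items() if len(peps) >= min_peptides}


def train_stump(scores, labels):
    """Toy classifier: the score threshold that makes the fewest label errors."""
    best_threshold, best_errors = None, None
    for threshold in sorted(set(scores)):
        errors = sum((s >= threshold) != y for s, y in zip(scores, labels))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def iterate(psms, initial_threshold, max_rounds=20):
    """Alternate between guessing true proteins and retraining until stable."""
    scores = [score for _, _, score in psms]
    accepted = [s >= initial_threshold for s in scores]
    proteins = select_proteins(psms, accepted)
    for _ in range(max_rounds):
        # Label every identification by whether its protein is currently "true".
        labels = [protein in proteins for _, protein, _ in psms]
        threshold = train_stump(scores, labels)
        accepted = [s >= threshold for s in scores]
        updated = select_proteins(psms, accepted)
        if updated == proteins:  # converged: the protein set stopped changing
            break
        proteins = updated
    return proteins, accepted
```

In this toy setting, a low-scoring peptide from a protein already supported by two strong peptides gets a positive label, which can pull the learned threshold down far enough to accept it. That is the sense in which imperfect, protein-level labels can still train a useful peptide-level model.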

59

Unsupervised Learning Performance

60

Unsupervised Learning Convergence

61

Conclusions

• Proteomics can inform genome annotation
  • Eukaryotic and prokaryotic
  • Functional vs silencing variants

• Peptides identify more than just proteins• Untapped source of disease biomarkers

• Computational inference can make a substantial impact in proteomics

62

Conclusions

• Compressed peptide sequence databases make routine EST searching feasible

• HMMatch spectral matching improves identification performance for familiar peptides

• Unsupervised, model-free, combining PepArML framework solves peptide identification interpretation problem

63

Acknowledgements

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Catherine Fenselau
  • UMCP Biochemistry

• Cheng Lee
  • Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: NIH/NCI, USDA/ARS
