Upload
pierce-rose
View
233
Download
1
Tags:
Embed Size (px)
Citation preview
Improving the Reliability of
Peptide Identification by
Tandem Mass Spectrometry
Improving the Reliability of
Peptide Identification by
Tandem Mass SpectrometryNathan EdwardsDepartment of Biochemistry and Molecular & Cellular BiologyGeorgetown University Medical Center
2
Mass Spectrometry for Proteomics
• Measure mass of many (bio)molecules simultaneously• High bandwidth
• Mass is an intrinsic property of all (bio)molecules• No prior knowledge required
3
Mass Spectrometer
Ionizer
Sample
+_
Mass Analyzer Detector
• MALDI• Electro-Spray
Ionization (ESI)
• Time-Of-Flight (TOF)• Quadrapole• Ion-Trap
• ElectronMultiplier(EM)
4
High Bandwidth
100
0250 500 750 1000
m/z
% I
nte
nsit
y
5
Mass is fundamental!
6
Mass Spectrometry for Proteomics
• Measure mass of many molecules simultaneously• ...but not too many, abundance bias
• Mass is an intrinsic property of all (bio)molecules• ...but need a reference to compare to
7
Mass Spectrometry for Proteomics
• Mass spectrometry has been around since the turn of the century...• ...why is MS based Proteomics so new?
• Ionization methods• MALDI, Electrospray
• Protein chemistry & automation• Chromatography, Gels, Computers
• Protein sequence databases• A reference for comparison
8
Sample Preparation for Peptide Identification
Enzymatic Digestand
Fractionation
9
Single Stage MS
MS
m/z
10
Tandem Mass Spectrometry(MS/MS)
Precursor selection
m/z
m/z
11
Tandem Mass Spectrometry(MS/MS)
Precursor selection + collision induced dissociation
(CID)
MS/MS
m/z
m/z
12
The big picture...
• MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
• Key concepts:• Spectrum acquisition is unbiased• Direct observation of amino-acid sequence• Sensitive to minor sequence variation• Observed peptides represent folded proteins
13
Peptide Identification
• For each (likely) peptide sequence1. Compute fragment masses2. Compare with spectrum3. Retain those that match well
• Peptide sequences from protein sequence databases• Swiss-Prot, IPI, NCBI’s nr, ...
• Automated, high-throughput peptide identification in complex mixtures
14
Peptide Identification, but...
• What about novel peptides?• Search compressed ESTs (C3, PepSeqDB)
• What about peak intensity?• Spectral matching using HMMs (HMMatch)
• Which identifications are correct?• Unsupervised, model-free, result combiner
with false discovery rate estimation
15
Why don’t we see more novel peptides?
• Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
16
What goes missing?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames
17
Why should we care?
• Alternative splicing is the norm!• Only 20-25K human genes• Each gene makes many proteins
• Proteins have clinical implications• Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription• Genomic assays, ESTs, mRNA sequence.• Little hard evidence for translation start site
18
Novel Splice Isoform
• Human Jurkat leukemia cell-line• Lipid-raft extraction protocol, targeting T cells• von Haller, et al. MCP 2003.
• LIME1 gene:• LCK interacting transmembrane adaptor 1
• LCK gene:• Leukocyte-specific protein tyrosine kinase• Proto-oncogene• Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
19
Novel Splice Isoform
20
Novel Splice Isoform
21
Novel Mutation
• HUPO Plasma Proteome Project• Pooled samples from 10 male & 10 female
healthy Chinese subjects• Plasma/EDTA sample protocol• Li, et al. Proteomics 2005. (Lab 29)
• TTR gene• Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis.• Familial amyloidotic polyneuropathy
• late-onset, dominant inheritance
22
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
23
Novel Mutation
24
Searching ESTs
• Proposed long ago:• Yates, Eng, and McCormack; Anal Chem, ’95.
• Now:• Protein sequences are sufficient for protein identification• Computationally expensive/infeasible• Difficult to interpret
• Make EST searching feasible for routine searching to discover novel peptides.
25
Searching Expressed Sequence Tags (ESTs)
Pros• No introns!• Primary splicing
evidence for annotation pipelines
• Evidence for dbSNP• Often derived from
clinical cancer samples
Cons• No frame• Large (8Gb)• “Untrusted” by
annotation pipelines• Highly redundant• Nucleotide error
rate ~ 1%
26
Compressed EST Peptide Sequence Database
• For all ESTs mapped to a UniGene gene:• Six-frame translation• Eliminate ORFs < 30 amino-acids• Eliminate amino-acid 30-mers observed once• Compress to C2 FASTA database
• Complete, Correct for amino-acid 30-mers
• Gene-centric peptide sequence database:• Size: < 3% of naïve enumeration, 20774 FASTA entries• Running time: ~ 1% of naïve enumeration search• E-values: ~ 2% of naïve enumeration search results
27
PepSeq FASTA Databases
• Organisms:• HUMAN, MOUSE, RAT, ZEBRA FISH
• Peptide Evidence:• Genbank mRNA, EST, HTC• RefSeq mRNA, Proteins• Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI
Proteins• Swiss-Prot variants• Swiss-Prot signal peptide & init. Met removal
• Singe FASTA entry per Gene
28
Spectral Matching for Peptide Identification
• Detection vs. identification• Increased sensitivity & specificity• No novel peptides!
• NIST GC/MS Spectral Library• Identifies small molecules, • 100,000’s of (consensus) spectra• Bundled/Sold with many instruments• “Dot-product” spectral comparison• Current project: Peptide MS/MS
29
NIST MS Search: Peptides
30
Peptide DLATVYVDVLK
31
Protein Families
32
Protein Families
33
Peptide DLATVYVDVLK
34
Hidden Markov Models for Spectral Matching
• Capture statistical variation and consensus in peak intensity• Only need 10 spectra to build a model
• Capture semantics of peaks• Extrapolate model to other peptides
• Good specificity with superior sensitivity for peptide detection• Assign 1000’s of additional spectra (p-value < 10-5)
35
Hidden Markov Model
Ion
Delete
Insert
(m/z,int) pair emitted by ion & insert states
36
The devil in the details
• Intensity normalization
• Discretize (m/z,int) pairs
• Viterbi distance as score
• Compute p-value using “random” spectra
37
Random Spectra
• Uniform sample of (m/z,int)• Permutation (m/z) of true spectra peaks• M/z distribution between true spectra and
uniform sample (parameter)
RandomTrue False
Viterbi Score
# of
spe
ctra
38
HMM Peptide Identification Results – DLATV
DLAT (viterbi)
0
20
40
60
80
100
120
140
160
180
200
220
240
0-10
20-3
0
40-5
0
60-7
0
80-9
0
100-
110
120-
130
140-
150
160-
170
180-
190
200-
210
220-
230
240-
250
260-
270
280-
290
Viterbi Distance
# o
f s
pe
ctr
a
True_test(0.0001) True_test(other) False_test(0.0001) False_test(other)
DLAT (-logP)
0
10
20
30
40
50
60
70
80
90
100
0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 9-10
10-11
11-12
12-13
13-14
14-15
15-16
16-17
17-18
18-19
inf
-log(p-value)
# o
f s
pe
ctr
a
True_test(0.0001) True_test(other) False_test(0.0001) False_test(other)
39
Spectral Matching of Peptide Variants
DFLAGGVAAAISK
DFLAGGIAAAISK
40
HMM model extrapolation
41
Mascot Search Results
42
Peptide Identification Results
• Search engines always provide an answer
• Current search engines:• Hard to determine “good” scores• Significance estimates are unreliable
• Need better methods!
43
Common Algorithmic Framework
• Pre-process experimental spectra
• Filter peptide candidates
• Score match between peptides and spectra
• Rank peptides and assign
44
Comparison of search engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
4%
OMSSA10%
2%
5%9%
69%
2%
X!Tandem
Mascot
45
Lots of published solutions!
• Treat search engines as black-boxes
• Apply supervised machine learning to results• Use multiple match metrics
• Combine/refine using multiple search engines• Agreement suggests correctness
• Use empirical significance estimates• “Decoy” databases (FDR)
46
PepArML
• Peptide identification arbiter by machine learning
• Unifies these ideas within a model-free, combining machine learning framework
• Unsupervised training procedure
47
PepArML Overview
• Unify Tandem, Mascot, and OMSSA results
X!Tandem
Mascot
OMSSA
Other
PepArML
Identified
Unidentified
48
Voting Heuristic Combiner
• Choose peptide ID with the most votes• Use best FDR as confidence
• Break ties (single votes) using FDR
• Strawman for comparison
49
Dataset construction
Machine Learningx
Spectra compare
Matched Ions
Peak_intensity
Mass delta
# of missed cleavages
Peptide length
Tandem Score
Mascot Score
OMSSA Score
Extract Features
X!Tandem
Mascot
OMSSA
Other
Search Tools
50
Dataset construction
• Build feature vectors
T),( 11 PS
F),( 21 PS
T),( 12 PS
Tandem Mascot OMSSA
T),( mn PS
……
51
Dataset construction
• Synthetic protein mixtures provide ground truth
• C8 • 8 standard proteins (Calibrant Biosystems)• 4594 MS/MS spectra (LTQ)• 618 (11.2%) true positives
• S17• 17 standard proteins (Sashimi Repository)• 1389 MS/MS spectra (Q-TOF)• 354 (25.4%) true positives
• AURUM• 364 standard proteins (AURUM 1.0)• 7508 MS/MS spectra (MALDI-TOF-TOF)• 3775 (50.3%) true positives
52
Machine learning improves single search engines (S17)
53
Multiple search engines are better than single search engines (S17)
54
Feature Evaluation
55
Application to Real Data
• How well do these models generalize?
• Different instruments• Spectral characteristics change scores
• Search parameters• Different parameters change score values
• Supervised learning requires• (Synthetic) experimental data from every instrument• Search results from available search engines• Training/models for all
parameters x search engine sets x instruments
56
Model Generalization
57
Rescuing Machine Learning
• Train a new machine-learning model for every dataset!• Generalization not required• No predetermined search engines, parameters,
instruments, features
• Perhaps we can “guess” the true proteins• Most proteins not in doubt• Machine learning can tolerate imperfect labels
58
Unsupervised Learning
• Heuristic selection of “true” proteins• Train classifier, predict true peptide IDs
• Update “true” proteins• Heuristic selection of “true” proteins from
classifier predictions
• Iterate until convergence
59
Unsupervised Learning Performance
60
Unsupervised Learning Convergence
61
Conclusions
• Proteomics can inform genome annotation• Eukaryotic and prokaryotic • Functional vs silencing variants
• Peptides identify more than just proteins• Untapped source of disease biomarkers
• Computational inference can make a substantial impact in proteomics
62
Conclusions
• Compressed peptide sequence databases make routine EST searching feasible
• HMMatch spectral matching improves identification performance for familiar peptides
• Unsupervised, model-free, combining PepArML framework solves peptide identification interpretation problem
63
Acknowledgements
• Chau-Wen Tseng, Xue Wu• UMCP Computer Science
• Catherine Fenselau• UMCP Biochemistry
• Cheng Lee• Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: NIH/NCI, USDA/ARS