Tandem Mass Spectrometry - Computer Science · 2012. 2. 7. · Tandem Mass Spectrometry: Peptide identification via database search, de novo sequencing Some slides adapted by Sangtae

Tandem Mass Spectrometry:

Peptide identification via database search, de novo sequencing

Some slides adapted by Sangtae Kim and from Jones & Pevzner 2004

Peptide and protein sequencesPeptide ≈ string over a weighted alphabet:

…AFSRLEMILGF…

AFSRLSRLEMILGF

EMILG

Peptides=

substrings

Amino acid Mass

A 71.0

F 147.1

S 87.0

D 115.0

L 113.1

E 129.1

M 131.1

I 113.1

G 57.0

Protein sequence:

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

AA residuei-1 AA residuei AA residuei+1

N-terminus C-terminus

Mass spectrometry (MS and MS/MS)

Additional Fragmentation

E

L

G

R

A

L

Prefix masses for peptide LARGE

Mass

L A

RL A

GRL A

Rel

. int

ensi

ty

Adapted from slides by Vineet Bafna, UCSD

Quantification

Peptide Fragmentation

� Peptides tend to fragment along the backbone.� Mass spectrometer is a sophisticated scale to

measure the masses of these fragments

H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

H+

Prefix Fragment Suffix Fragment

Collision Induced Dissociation

N- and C-terminal Fragments

Masses of Terminal Fragments

Peptide

Mass (D) 57 + 97+ 147+114 = 415

N- and C-terminal Fragments

415

486

301

154

57

71

185

332

429

Theoretical Spectrum

415

486

301

154

57

71

185

332

429

Reconstruct peptide from the set of masses of fragm ent ions

(mass-spectrum )

57 71 154 185 301 332 415 429 486

Reconstructing Peptides


(mass-spectrum )

57 71 154 185 301 332 415 429 486

Reconstructing Peptides


(mass-spectrum ) 57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486154

Peptide Fragmentation

y3

b2

y2 y1

b3a2 a3

HO NH3+

| |R1 O R2 O R3 O R4

| || | || | || |H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH

| | | | | | |H H H H H H H

b2-H2O

y3 -H2O

b3- NH3

y2 - NH3

Mass Spectra

G V D L K

mass0

57 Da = ‘G’ 99 Da = ‘V’LK D V G

� The peaks in the mass spectrum:� Prefix� Fragments with neutral losses (-H2O, -NH3)� Noise and missing peaks.

and Suffix Fragments.

D

H2O

Peptide Identification Problem

G V D L K

mass0

Inte

nsity

mass0

MS/MSPeptide Identification

Example of a real MS/MS spectrum

Symmetric

b10

y12

MS/MS Fundamental Challenge

Given a spectrum, find the best peptide annotation

Peptide Sequencing Problem

Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

Input:� S: experimental spectrum� ∆: set of possible ion types� m: parent mass

Output: � Peptide P with mass m, whose theoretical

spectrum matches the experimental Sspectrum the best

Sequence from spectrum

Enzymatic digestionTandem

Mass SpectrometryProteins

Peptides

…Large set of

MS/MS spectra …

Database ofknown peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,

ALKIIMNVRT, SEQUENCE,HEWAILF, GHNLWAMNAC,

GVFGSVLRA, EKLNKAATYIN..

Database ofknown peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,

ALKIIMNVRT, SEQUENCE,HEWAILF, GHNLWAMNAC,

GVFGSVLRA, EKLNKAATYIN..

s

s

s

f

ee

e

e

e

e

e

q

q

qu

u

u

n

n

n

ec

c

c

se

e

e

q

un

c

Peptide SEQUENCE

Database search De novo sequencing

DB search: Genomics � Proteomics

Translation

DNA

Protein

mRNA

Transcription

Ribosomal protein translation

Match between Spectra and the Shared

Peak Count

� The match between a spectrum and a peptide can be

defined as the number of shared masses (peaks)

between the two (Shared Peak Count or SPC)

� In practice, several tools use the weighted SPC that

reflects intensities of the peaks

� We will later see better ways to score peptide-spectrum

matches

Peptide Identification Problem

Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:� S: experimental spectrum� D: database of peptides� ∆: set of possible ion types� m: parent mass

Output: � A peptide of mass m from the database whose

theoretical spectrum matches the experimental S spectrum the best

s(P)

MS/MS spectrum identification

Set ofMS/MSspectra

… …Peptidesfrom DB P

Database search: (Yates et al.’94 – SEQUEST)⇒ Works very well when the peptide sequence is in the database⇒ Approach of choice for everyday protein and peptide identification

Determining reliability of identifications

EVERY spectrum has some best match to the database – how can we tell whether it’s a significant match?

From Elias’07

Decoy databases

Elias’07

Decoy databases are the most common approach to determine the reliability of identifications – estimate the False Discovery Rate.

Null hypothesis of the target-decoy strategy:� Each spectrum is generated by a random (peptide-like) amino

acid sequence

0 5 10 15 20 250

0.05

0.1

0.15

0.2

Match scores

Rel

ativ

e fr

eque

ncy

of m

atch

sco

res

Matches to Decoy

Matches to Target

FDR =

� Standard definitions� True if a sample peptide generated the spectrum, False otherwise� Positive if we accept the identification, Negative otherwise

� False Discovery Rate = FP/(TP+FP), estimated using percentage of Decoy matches above the selected score threshold.

False Discovery Rate

Approaches to interpretation

De-novo interpretation of an MS/MS spectrum attempts to generate the most likely peptide from direct interpretation of the MS/MS spectrum

� Searches over the space of all possible peptides� As opposed to only those present in some specific database

� Does not require any previous knowledge of the appropriate protein sequence

� Paradoxically, a de-novo interpretation can be computed much faster than a regular database search!

Issues on de-novo interpretation

Computational issues: How to efficiently search the space of possible solutions?� Without eliminating any eligible candidates� Without double-counting peaks in the spectrum

Scoring issues: How to best score the matches between peptides and MS/MS spectra?� Most intensity explained� Most ion types explained� Maximum likelihood models

De-novo sequencing: how?

How can we find the amino acid sequence that best explains the spectrum?

a) Explains the largest number of peaksb) Explains the most intensity

� Exhaustive enumeration?1. Generate every possible sequence with the same peptide

mass2. Match each sequence to the spectrum3. Choose the sequence that explains the most intensity in the

spectrum

F S N A M S D I

V SGQ L I D

Takes too long!

Initial attempts

Computer programs for de-novo interpretation of MS/MS spectra date as far back as 1966 when Biemman et al. proposed a prefix extension algorithm:� Tries every possible prefix extension� Eliminates solutions with missing peaks� Outputs every peptide with a matching parent mass� ALL prefix peaks must be present in the spectrum

In an attempt to include interpretations with missing peaks, Sakurai et al. (1984) proposed:� Exhaustive search over the space of all possible permutations of amino

acid multisets where the total mass equals the parent mass of the MS/MS spectrum

� Score peptide/spectrum matches by counting prefix and suffix mass matches

� Naturally more sensitive but also very slow

Prefix extension revisited

The second half of the eighties saw a few better designed approaches to this problem, based on the same type of algorithm:� Prefix extension, one amino acid at a time.� Tolerate missing peaks.� Include both prefix and suffix peaks in the score.� User-specified maximum number of candidates in memory at any

point of the execution. (sub-optimal)

Ishikawa and Niwa’86, Siegel and Baumann’88, Johnson and Biemann’89, Zidarov et al.’90

In 1990 Bartels introduced a graph representation of an MS/MS spectrum� Every peak in the spectrum defines a vertex� Vertices connected by an edge if peak mass difference is an amino acid

mass

� Best peptide is defined as the best path between the two endpoint vertices: v0 to vM(S)

� No detailed algorithm was given for finding the best peptide; interactive exploration tool was made available.

Spectrum graphs

v0 vM(S)

DP de-novo sequencing

What is the DP recursion?Score(i) = intensity(i) + max( Score(j) ),

for all j with mass(i)-mass(j) ∈ Amino acid masses

Recovered de-novo sequence? ESESE

DP de-novo sequencing: EDTES

DP recursion:Score(i) = intensity(i) + max( Score(j) ),

for all j with mass(i)-mass(j) ∈ Amino acid masses

Recovered de-novo sequence? ESESE

Why was the correct peptide missed?

Forbidden pairs

The exclusion of symmetric peaks in a maximal scoring path through a spectrum graph was first proposed by Dančik et al. in 1999:

� Peaks in the spectrum are called forbidden pairs if their mass adds up to the parent mass – either vertex can be used in the output path but not both.

� A path is anti-symmetric if it uses at most one vertex from every forbidden pair.� Objective function becomes: find maximal scoring anti-symmetric path.� NP-Hard in the general case

Solution:1. Extend the sequence from either the

prefix or from the suffix2. Avoid reusing the pairing mass

(highlighted in red)

Note that this ordering is always possible (Chen’01, Bafna and Edwards’03)

What is the recursion?

Maximal scoring anti-symmetric path

Chen et al’01 provided a dynamic programming recursion to find a maximal scoring anti-symmetric path:� A peak si precedes a peak sj if it is closer to one of the ends of the

spectrum: min(si,m(S)-si)

Optimality and extensions

The anti-symmetric dynamic programming solutions are� Correct: no symmetric peak can be reused and every anti-

symmetric path is considered� Optimal: a maximal scoring anti-symmetric path is constructed� “Efficient”: runtime efficiency is O(n2), n=# peaks

Other algorithms have also been proposed:� Ma’03, Frank’05, same algorithm, different scoring� Bafna and Edwards’03, same principle, extended the concept

of forbidden pairs to avoid peak reusage between any pair of ion-types

How does de-novo perform?

� More involved scoring schemes have been proposed� Sherenga (Likelihood model)� NovoHMM (Hidden Markov Models)� Pepnovo (Bayesian network)� Peaks (commercial, scoring model unknown)

� Best algorithms predict 1 incorrect amino acid out of every 4 predictions (NovoHMM, Pepnovo)

� Main problem: MS/MS spectra are noisy!

� Missing peaks due to chemical conditions � e.g. Proline (P) ‘grabs’ the amino acid to its right

� Additional representative peaks (ion types)

� Noise (or unexplainable peaks)

Scoring MS/MS spectrum masses

Ion type ∆m p(ion)

b +1 0.5

b (iso) +2 0.15

b-NH3 -16 0.3

b-H2O -17 0.3

a (-CO) -27 0.17

b-NH3-H2O -34 0.16

…

F S N A M S D I

V SGQ L I D

Ion Types

� Some masses correspond to fragment

ions, others are just random noise

� Knowing ion types ∆={δ1, δ2,…, δk} lets us

distinguish fragment ions from noise

� We can learn ion types δi and their

probabilities qi by analyzing a large test

sample of annotated spectra.

Example of Ion Type

� ∆={δ1, δ2,…, δk}

� Ion types

{ b, b-NH3, b-H2O}

correspond to

∆={0, 17, 18}

*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity

Vertices of Spectrum Graph

� Masses of potential N-terminal peptides

� Vertices are generated by reverse shifts corresponding to ion types

∆={δ1, δ2,…, δk}

� Every N-terminal fragment can generate up to k ions

m-δ1, m-δ2, …, m-δk

� Every mass s in an MS/MS spectrum generates k vertices

V(s) = {s+δ1, s+δ2, …, s+δk}

corresponding to potential N-terminal peptides

� Vertices of the spectrum graph:{ initial vertex} ∪V(s1) ∪V(s2) ∪... ∪V(sm) ∪{ terminal vertex}

Reverse Shifts

Shift in H2O+NH3

Shift in H2O

Edges of Spectrum Graph

� Two vertices with mass difference corresponding to an amino acid A:� Connect with an edge labeled by A

� Gap edges for di- and tri-peptides

� Paths in the labeled graph spell out amino acid sequences – how to find the correct one?

� We need scoring to evaluate paths

Path Score

� p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq}

� p(P, s) = the probability that peptide Pgenerates a peak s

� Scoring = computing probabilities

� p(P,S) = πsєS p(P, s)

� For a position t that represents ion type δj :

qj, if peak is generated at t

p(P,st) =

1-qj , otherwise

Peak Score

Peak Score (cont’d)

� For a position t that is not associated with an ion type:

qR , if peak is generated at tpR(P,st) =

1-qR , otherwise� qR = the probability of a noisy peak that does

not correspond to any ion type

Finding Optimal Paths in the Spectrum Graph

� For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

� Peptides = paths in the spectrum graph

� P’ = the optimal path in the spectrum graph

p(P,S)p(P',S) Pmax=

Ions and Probabilities

� Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

� δi-ions of a partial peptide are produced independently with probabilities qi


� A fragment has all k peaks with probability

� and no peaks with probability

� A peptide also produces a ``random noise'' with uniform probability qR at any position.

∏=

k

iiq

1

∏=

−k

iiq

1

)1(

Ratio Test Scoring for Partial Peptides

� Incorporates premiums for observed ions and penalties for missing ions.

� Example: for k=4, assume that for a partial peptide P’ we only see ions δ1,δ2,δ4.

The score is calculated as:RRRR q

q

q

q

q

q

q

q 4321)1(

)1( ⋅−−⋅⋅

Scoring Peptides

� T- set of all positions.

� Ti={t δ1,, t δ2,..., ,t δk,}- set of positions that represent ions of fragments Pi.

� A peak at position tδj is generated with probability qj.

� R=T- U Ti - set of positions that are not associated with any partial peptides (noise).

Probabilistic Model

� For a position t δj ∈ Ti the probability p(t, P,S) that peptide P produces a peak at position t.

� Similarly, for t∈R, the probability that P produces a random noise peak at t is:

−=

otherwise1

position tat generated ispeak a if),,( j

j

j

q

qSPtP

δ

−=

otherwise1

position tat generated ispeak a if)(

R

RR q

qtP

Probabilistic Score

� For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

∏∏= =

=n

i

k

j iR

i

R j

j

tp

SPtp

Sp

SPp

1 1 )(

),,(

)(

),(

δ

δ

Resulting sequencing accuracy

Algorithm Average Accuracy

SequenceLength

Tag 3 Tag 4 Tag 5 Tag 6

Sherenga 0.690 8.65 0.821 0.711 0.564 0.364

Peaks 0.673 10.32 0.889 0.814 0.689 0.575

Lutefisk 0.566 8.79 0.661 0.521 0.425 0.339

Benchmarking reported for 280 spectra.

Frank and Pevzner’05

Enhancing Sherenga (Dancik et al.’99)

Pepnovo’s scoring model:� Determines different intensity values.� Considers dependencies between fragment

ions.� Incorporates additional chemical

knowledge (e.g., preferred cleavage sites).� Uses positional influence of the cleavage

site.� Improves the Random Model.

Pepnovo slides by Ari Frank

pos(m)(region in peptide)

yby2

a

b2

a-NH3

a-H2O

b-NH3

b-H2O

y-NH3

y-H2O

b-H2O-NH3 b-H2O-H2O

y-H2O-NH3

y-H2O-H2O

N-aa(N-terminal amino acid)

C-aa(C-terminal amino acid)

HCID - Fragmentation Network

Amino acid influence

Ion combinations

Positional influence

pos y P(y|pos)

0 0 0.10 1 0.22

2 3 0.52

4 3 0.08

Discrete Intensity Values

� Peak intensity normalized according to grass level (average of weakest 33% of peaks in spectrum).

� Normalized intensities Discretized into 4 intensity levels:� zero : I < 0.05

� low : 0.05 ≤ I < 2 (62% of peaks)� medium : 2 ≤ I < 10 (26% of peaks)� high : I ≥ 10 (12% of peaks)

Combinations of Fragments

� The topology takes into account dependencies between fragments.

� The values of the probability tables are learned from the training data, so they reflect the true “fragmentation rules”.

� “Logical” combinations get higher probabilities:P(b=high | y=high ) = 0.36, vs. P(b=high | y= low ) = 0.03.

yby2

ab2

a-NH3a-H2O

b-NH3b-H2O

y-NH3

y-H2O

b-H2O-NH3 b-H2O-H2Oy-H2O-NH3

y-H2O-H2O

Additional Chemical Knowledge

� The identity of the flanking amino acids influences the peak intensities:� Increased intensities N-terminal to Proline and Glycine� Increased intensities C-terminal to Aspartic Acid.

� 400 amino acid combinations reduced to 15 equivalence sets (X-P,X-G, etc.).



yb

Positional Influence

� Creates separate models for different locations of the cleavage site in the peptide.

� Models phenomena such as:� weak b/y ions near terminal ends.� prevalence of a-ions in the first half of the peptides.� prevalence of b2 towards the peptide’s C-terminal and y2

near the N-terminal.


yby2

a

b2

HRandom – Regional Density

Bin

0

1

2

3

Intensity levels

1

2

2

2

2

3

3

Window

m/zw

2ε

Computing the Random Probability

� α=1-(2ε)/w , is the probability of a single peak missing the bin.

� Let ni , 1≤i≤d, be counts of peaks with intensity i in window w:

∑=

==

∑==

∑−==

=

+=

d

idRandom

n

dRandom

nn

dRandom

nniIP

nnIP

nntIPd

ii

d

tii

t

01

1

1

1),...,|(.3

),...,|0(.2

)1(),...,|(.1

1

1

α

αα

(normalization term)

(prob. of no peak)

(prob. of peak with intensity t)

Random Model cont.

� Employing this random model increases the contribution of peaks in sparse regions of the spectrum.

� Decreases score for spurious matches in dense regions.

� Increases contribution of high intensity peaks compared to low intensity.

Probability under HCID

From the decomposition properties of probabilistic networks, each node is independent from the rest of the nodes given the value of its parents so:

where I are the ion intensities for mass m and π(f) are the intensities of the parents of node f.

))(|(),(,...},{

fIPmIPbyf

fHH CIDCIDπ∏

∈

=r

Probability under HRandom

� Peak occurrences are treated as random independent events:

� The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.

),...,|(),( 1...},,{ 2

dOHybyf

fHH nnIPmIP RandomRandom ∏−∈

=r

The Likelihood Ratio Score

� A putative cleavage site is scored according to the log ratio test:

� Can be used to score a peptide by summing the score for the prefix masses:

),...,|(

))(|(

log),(

),(log),(

1,...},{

,...},{

dbyf

fH

byffH

mH

mHm nnIP

fIP

mIP

mIPmIScore

Random

CID

Random

CID

∏∏

∈

∈==π

r

rr

∑=

==n

iimn mIScorepppPScore i

121 ),()..(

r

PepNovo’s De Novo Sequencing

� A spectrum graph is created from the experimental MS/MS spectrum.

� The nodes are scored using the Bayesian network.

� Highest scoring anti-symmetric path is found using dynamic programming.

Data and Software

� 1252 spectra of doubly charged tryptic peptides (from ISB and OPD), measured on ion trap mass spectrometer:� 972 spectra in the training set.� 280 spectra in the test set (peptides up to 1400 Da.,

assignments independently verified.)

� Compared PepNovo with 3 de novo programs: Sherenga (Spectrum Mill 3.0), Lutefisk XP, and Peaks v2.3.

Results


SequenceLength


PepNovo 0.727 10.30 0.946 0.871 0.800 0.654

Sherenga 0.690 8.65 0.821 0.711 0.564 0.364

Peaks 0.673 10.32 0.889 0.814 0.689 0.575

Lutefisk 0.566 8.79 0.661 0.521 0.425 0.339



� Missing peaks due to chemical conditions � e.g. Proline (P) ‘grabs’ the amino acid to its right

� Additional representative peaks (ion types)

� Noise (or unexplainable peaks)

Scoring MS/MS spectrum masses

Ion type ∆m p(ion)

b +1 0.5

b (iso) +2 0.15

b-NH3 -16 0.3

b-H2O -17 0.3

a (-CO) -27 0.17

b-NH3-H2O -34 0.16

…

F S N A M S D I

V SGQ L I D

Ion Types

� Some masses correspond to fragment

ions, others are just random noise

� Knowing ion types ∆={δ1, δ2,…, δk} lets us

distinguish fragment ions from noise

� We can learn ion types δi and their

probabilities qi by analyzing a large test

sample of annotated spectra.

Example of Ion Type

� ∆={δ1, δ2,…, δk}

� Ion types

{ b, b-NH3, b-H2O}

correspond to

∆={0, 17, 18}

*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity

Vertices of Spectrum Graph

� Masses of potential N-terminal peptides

� Vertices are generated by reverse shifts corresponding to ion types

∆={δ1, δ2,…, δk}

� Every N-terminal fragment can generate up to k ions

m-δ1, m-δ2, …, m-δk

� Every mass s in an MS/MS spectrum generates k vertices

V(s) = {s+δ1, s+δ2, …, s+δk}

corresponding to potential N-terminal peptides

� Vertices of the spectrum graph:{ initial vertex} ∪V(s1) ∪V(s2) ∪... ∪V(sm) ∪{ terminal vertex}

Reverse Shifts

Shift in H2O+NH3

Shift in H2O

Edges of Spectrum Graph

� Two vertices with mass difference corresponding to an amino acid A:� Connect with an edge labeled by A

� Gap edges for di- and tri-peptides

� Paths in the labeled graph spell out amino acid sequences – how to find the correct one?

� We need scoring to evaluate paths

Path Score

� p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq}

� p(P, s) = the probability that peptide Pgenerates a peak s

� Scoring = computing probabilities

� p(P,S) = πsєS p(P, s)

� For a position t that represents ion type δj :

qj, if peak is generated at t

p(P,st) =

1-qj , otherwise

Peak Score

Peak Score (cont’d)

� For a position t that is not associated with an ion type:

qR , if peak is generated at tpR(P,st) =

1-qR , otherwise� qR = the probability of a noisy peak that does

not correspond to any ion type

Finding Optimal Paths in the Spectrum Graph

� For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

� Peptides = paths in the spectrum graph

� P’ = the optimal path in the spectrum graph

p(P,S)p(P',S) Pmax=


� Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

� δi-ions of a partial peptide are produced independently with probabilities qi


� A fragment has all k peaks with probability

� and no peaks with probability

� A peptide also produces a ``random noise'' with uniform probability qR at any position.

∏=

k

iiq

1

∏=

−k

iiq

1

)1(

Ratio Test Scoring for Partial Peptides

� Incorporates premiums for observed ions and penalties for missing ions.

� Example: for k=4, assume that for a partial peptide P’ we only see ions δ1,δ2,δ4.

The score is calculated as:RRRR q

q

q

q

q

q

q

q 4321)1(

)1( ⋅−−⋅⋅

Scoring Peptides

� T- set of all positions.

� Ti={t δ1,, t δ2,..., ,t δk,}- set of positions that represent ions of fragments Pi.

� A peak at position tδj is generated with probability qj.

� R=T- U Ti - set of positions that are not associated with any partial peptides (noise).

Probabilistic Model

� For a position t δj ∈ Ti the probability p(t, P,S) that peptide P produces a peak at position t.

� Similarly, for t∈R, the probability that P produces a random noise peak at t is:

−=

otherwise1

position tat generated ispeak a if),,( j

j

j

q

qSPtP

δ

−=

otherwise1

position tat generated ispeak a if)(

R

RR q

qtP

Probabilistic Score

� For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

∏∏= =

=n

i

k

j iR

i

R j

j

tp

SPtp

Sp

SPp

1 1 )(

),,(

)(

),(

δ

δ

Resulting sequencing accuracy


SequenceLength


Sherenga 0.690 8.65 0.821 0.711 0.564 0.364

Peaks 0.673 10.32 0.889 0.814 0.689 0.575

Lutefisk 0.566 8.79 0.661 0.521 0.425 0.339



Enhancing Sherenga (Dancik et al.’99)

Pepnovo’s scoring model:� Determines different intensity values.� Considers dependencies between fragment

ions.� Incorporates additional chemical

knowledge (e.g., preferred cleavage sites).� Uses positional influence of the cleavage

site.� Improves the Random Model.

Pepnovo slides by Ari Frank


yby2

a

b2

a-NH3

a-H2O

b-NH3

b-H2O

y-NH3

y-H2O

b-H2O-NH3 b-H2O-H2O

y-H2O-NH3

y-H2O-H2O



HCID - Fragmentation Network

Amino acid influence

Ion combinations

Positional influence

pos y P(y|pos)

0 0 0.10 1 0.22

2 3 0.52

4 3 0.08