104
Tandem Mass Spectrometry: Peptide identification via database search, de novo sequencing Some slides adapted by Sangtae Kim and from Jones & Pevzner 2004

Tandem Mass Spectrometry - Computer Science · 2012. 2. 7. · Tandem Mass Spectrometry: Peptide identification via database search, de novo sequencing Some slides adapted by Sangtae

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

  • Tandem Mass Spectrometry:

    Peptide identification via database search, de novo sequencing

    Some slides adapted by Sangtae Kim and from Jones & Pevzner 2004

  • Peptide and protein sequencesPeptide ≈ string over a weighted alphabet:

    …AFSRLEMILGF…

    AFSRLSRLEMILGF

    EMILG

    Peptides=

    substrings

    Amino acid Mass

    A 71.0

    F 147.1

    S 87.0

    D 115.0

    L 113.1

    E 129.1

    M 131.1

    I 113.1

    G 57.0

    Protein sequence:

    H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

    Ri-1 Ri Ri+1

    AA residuei-1 AA residuei AA residuei+1

    N-terminus C-terminus

  • Mass spectrometry (MS and MS/MS)

    Additional Fragmentation

    E

    L

    G

    R

    A

    L

    Prefix masses for peptide LARGE

    Mass

    L A

    RL A

    GRL A

    Rel

    . int

    ensi

    ty

    Adapted from slides by Vineet Bafna, UCSD

    Quantification

  • Peptide Fragmentation

    � Peptides tend to fragment along the backbone.� Mass spectrometer is a sophisticated scale to

    measure the masses of these fragments

    H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH

    Ri-1 Ri Ri+1

    H+

    Prefix Fragment Suffix Fragment

    Collision Induced Dissociation

  • N- and C-terminal Fragments

  • Masses of Terminal Fragments

    Peptide

    Mass (D) 57 + 97+ 147+114 = 415

  • N- and C-terminal Fragments

    415

    486

    301

    154

    57

    71

    185

    332

    429

  • N- and C-terminal Fragments

    415

    486

    301

    154

    57

    71

    185

    332

    429

  • Theoretical Spectrum

    415

    486

    301

    154

    57

    71

    185

    332

    429

    Reconstruct peptide from the set of masses of fragm ent ions

    (mass-spectrum )

    57 71 154 185 301 332 415 429 486

  • Reconstructing Peptides

    Reconstruct peptide from the set of masses of fragm ent ions

    (mass-spectrum )

    57 71 154 185 301 332 415 429 486

  • Reconstructing Peptides

    Reconstruct peptide from the set of masses of fragm ent ions

    (mass-spectrum ) 57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486154

  • Peptide Fragmentation

    y3

    b2

    y2 y1

    b3a2 a3

    HO NH3+

    | |R1 O R2 O R3 O R4

    | || | || | || |H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH

    | | | | | | |H H H H H H H

    b2-H2O

    y3 -H2O

    b3- NH3

    y2 - NH3

  • Mass Spectra

    G V D L K

    mass0

    57 Da = ‘G’ 99 Da = ‘V’LK D V G

    � The peaks in the mass spectrum:� Prefix� Fragments with neutral losses (-H2O, -NH3)� Noise and missing peaks.

    and Suffix Fragments.

    D

    H2O

  • Peptide Identification Problem

    G V D L K

    mass0

    Inte

    nsity

    mass0

    MS/MSPeptide Identification

  • Example of a real MS/MS spectrum

    Symmetric

    b10

    y12

  • MS/MS Fundamental Challenge

    Given a spectrum, find the best peptide annotation

  • Peptide Sequencing Problem

    Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

    Input:� S: experimental spectrum� ∆: set of possible ion types� m: parent mass

    Output: � Peptide P with mass m, whose theoretical

    spectrum matches the experimental Sspectrum the best

  • Sequence from spectrum

    Enzymatic digestionTandem

    Mass SpectrometryProteins

    Peptides

    …Large set of

    MS/MS spectra …

    Database ofknown peptides

    MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,

    ALKIIMNVRT, SEQUENCE,HEWAILF, GHNLWAMNAC,

    GVFGSVLRA, EKLNKAATYIN..

    Database ofknown peptides

    MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,

    ALKIIMNVRT, SEQUENCE,HEWAILF, GHNLWAMNAC,

    GVFGSVLRA, EKLNKAATYIN..

    s

    s

    s

    f

    ee

    e

    e

    e

    e

    e

    q

    q

    qu

    u

    u

    n

    n

    n

    ec

    c

    c

    se

    e

    e

    q

    un

    c

    Peptide SEQUENCE

    Database search De novo sequencing

  • DB search: Genomics � Proteomics

    Translation

    DNA

    Protein

    mRNA

    Transcription

  • Ribosomal protein translation

  • Match between Spectra and the Shared

    Peak Count

    � The match between a spectrum and a peptide can be

    defined as the number of shared masses (peaks)

    between the two (Shared Peak Count or SPC)

    � In practice, several tools use the weighted SPC that

    reflects intensities of the peaks

    � We will later see better ways to score peptide-spectrum

    matches

  • Peptide Identification Problem

    Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

    Input:� S: experimental spectrum� D: database of peptides� ∆: set of possible ion types� m: parent mass

    Output: � A peptide of mass m from the database whose

    theoretical spectrum matches the experimental S spectrum the best

  • s(P)

    MS/MS spectrum identification

    Set ofMS/MSspectra

    … …Peptidesfrom DB P

    Database search: (Yates et al.’94 – SEQUEST)⇒ Works very well when the peptide sequence is in the database⇒ Approach of choice for everyday protein and peptide identification

  • Determining reliability of identifications

    EVERY spectrum has some best match to the database – how can we tell whether it’s a significant match?

    From Elias’07

  • Decoy databases

    Elias’07

    Decoy databases are the most common approach to determine the reliability of identifications – estimate the False Discovery Rate.

    Null hypothesis of the target-decoy strategy:� Each spectrum is generated by a random (peptide-like) amino

    acid sequence

  • 0 5 10 15 20 250

    0.05

    0.1

    0.15

    0.2

    Match scores

    Rel

    ativ

    e fr

    eque

    ncy

    of m

    atch

    sco

    res

    Matches to Decoy

    Matches to Target

    FDR =

    � Standard definitions� True if a sample peptide generated the spectrum, False otherwise� Positive if we accept the identification, Negative otherwise

    � False Discovery Rate = FP/(TP+FP), estimated using percentage of Decoy matches above the selected score threshold.

    False Discovery Rate

  • Approaches to interpretation

    De-novo interpretation of an MS/MS spectrum attempts to generate the most likely peptide from direct interpretation of the MS/MS spectrum

    � Searches over the space of all possible peptides� As opposed to only those present in some specific database

    � Does not require any previous knowledge of the appropriate protein sequence

    � Paradoxically, a de-novo interpretation can be computed much faster than a regular database search!

  • Issues on de-novo interpretation

    Computational issues: How to efficiently search the space of possible solutions?� Without eliminating any eligible candidates� Without double-counting peaks in the spectrum

    Scoring issues: How to best score the matches between peptides and MS/MS spectra?� Most intensity explained� Most ion types explained� Maximum likelihood models

  • De-novo sequencing: how?

    How can we find the amino acid sequence that best explains the spectrum?

    a) Explains the largest number of peaksb) Explains the most intensity

    � Exhaustive enumeration?1. Generate every possible sequence with the same peptide

    mass2. Match each sequence to the spectrum3. Choose the sequence that explains the most intensity in the

    spectrum

    F S N A M S D I

    V SGQ L I D

    Takes too long!

  • Initial attempts

    Computer programs for de-novo interpretation of MS/MS spectra date as far back as 1966 when Biemman et al. proposed a prefix extension algorithm:� Tries every possible prefix extension� Eliminates solutions with missing peaks� Outputs every peptide with a matching parent mass� ALL prefix peaks must be present in the spectrum

    In an attempt to include interpretations with missing peaks, Sakurai et al. (1984) proposed:� Exhaustive search over the space of all possible permutations of amino

    acid multisets where the total mass equals the parent mass of the MS/MS spectrum

    � Score peptide/spectrum matches by counting prefix and suffix mass matches

    � Naturally more sensitive but also very slow

  • Prefix extension revisited

    The second half of the eighties saw a few better designed approaches to this problem, based on the same type of algorithm:� Prefix extension, one amino acid at a time.� Tolerate missing peaks.� Include both prefix and suffix peaks in the score.� User-specified maximum number of candidates in memory at any

    point of the execution. (sub-optimal)

    Ishikawa and Niwa’86, Siegel and Baumann’88, Johnson and Biemann’89, Zidarov et al.’90

  • In 1990 Bartels introduced a graph representation of an MS/MS spectrum� Every peak in the spectrum defines a vertex� Vertices connected by an edge if peak mass difference is an amino acid

    mass

    � Best peptide is defined as the best path between the two endpoint vertices: v0 to vM(S)

    � No detailed algorithm was given for finding the best peptide; interactive exploration tool was made available.

    Spectrum graphs

    v0 vM(S)

  • DP de-novo sequencing

    What is the DP recursion?Score(i) = intensity(i) + max( Score(j) ),

    for all j with mass(i)-mass(j) ∈ Amino acid masses

    Recovered de-novo sequence? ESESE

  • DP de-novo sequencing: EDTES

    DP recursion:Score(i) = intensity(i) + max( Score(j) ),

    for all j with mass(i)-mass(j) ∈ Amino acid masses

    Recovered de-novo sequence? ESESE

    Why was the correct peptide missed?

  • Forbidden pairs

    The exclusion of symmetric peaks in a maximal scoring path through a spectrum graph was first proposed by Dančik et al. in 1999:

    � Peaks in the spectrum are called forbidden pairs if their mass adds up to the parent mass – either vertex can be used in the output path but not both.

    � A path is anti-symmetric if it uses at most one vertex from every forbidden pair.� Objective function becomes: find maximal scoring anti-symmetric path.� NP-Hard in the general case

    Solution:1. Extend the sequence from either the

    prefix or from the suffix2. Avoid reusing the pairing mass

    (highlighted in red)

    Note that this ordering is always possible (Chen’01, Bafna and Edwards’03)

    What is the recursion?

  • Maximal scoring anti-symmetric path

    Chen et al’01 provided a dynamic programming recursion to find a maximal scoring anti-symmetric path:� A peak si precedes a peak sj if it is closer to one of the ends of the

    spectrum: min(si,m(S)-si)

  • Optimality and extensions

    The anti-symmetric dynamic programming solutions are� Correct: no symmetric peak can be reused and every anti-

    symmetric path is considered� Optimal: a maximal scoring anti-symmetric path is constructed� “Efficient”: runtime efficiency is O(n2), n=# peaks

    Other algorithms have also been proposed:� Ma’03, Frank’05, same algorithm, different scoring� Bafna and Edwards’03, same principle, extended the concept

    of forbidden pairs to avoid peak reusage between any pair of ion-types

  • How does de-novo perform?

    � More involved scoring schemes have been proposed� Sherenga (Likelihood model)� NovoHMM (Hidden Markov Models)� Pepnovo (Bayesian network)� Peaks (commercial, scoring model unknown)

    � Best algorithms predict 1 incorrect amino acid out of every 4 predictions (NovoHMM, Pepnovo)

    � Main problem: MS/MS spectra are noisy!

  • Peptide Sequencing Problem

    Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

    Input:� S: experimental spectrum� ∆: set of possible ion types� m: parent mass

    Output: � Peptide P with mass m, whose theoretical

    spectrum matches the experimental Sspectrum the best

  • � Missing peaks due to chemical conditions � e.g. Proline (P) ‘grabs’ the amino acid to its right

    � Additional representative peaks (ion types)

    � Noise (or unexplainable peaks)

    Scoring MS/MS spectrum masses

    Ion type ∆m p(ion)

    b +1 0.5

    b (iso) +2 0.15

    b-NH3 -16 0.3

    b-H2O -17 0.3

    a (-CO) -27 0.17

    b-NH3-H2O -34 0.16

    F S N A M S D I

    V SGQ L I D

  • Ion Types

    � Some masses correspond to fragment

    ions, others are just random noise

    � Knowing ion types ∆={δ1, δ2,…, δk} lets us

    distinguish fragment ions from noise

    � We can learn ion types δi and their

    probabilities qi by analyzing a large test

    sample of annotated spectra.

  • Example of Ion Type

    � ∆={δ1, δ2,…, δk}

    � Ion types

    { b, b-NH3, b-H2O}

    correspond to

    ∆={0, 17, 18}

    *Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity

  • Vertices of Spectrum Graph

    � Masses of potential N-terminal peptides

    � Vertices are generated by reverse shifts corresponding to ion types

    ∆={δ1, δ2,…, δk}

    � Every N-terminal fragment can generate up to k ions

    m-δ1, m-δ2, …, m-δk

    � Every mass s in an MS/MS spectrum generates k vertices

    V(s) = {s+δ1, s+δ2, …, s+δk}

    corresponding to potential N-terminal peptides

    � Vertices of the spectrum graph:{ initial vertex} ∪V(s1) ∪V(s2) ∪... ∪V(sm) ∪{ terminal vertex}

  • Reverse Shifts

    Shift in H2O+NH3

    Shift in H2O

  • Edges of Spectrum Graph

    � Two vertices with mass difference corresponding to an amino acid A:� Connect with an edge labeled by A

    � Gap edges for di- and tri-peptides

    � Paths in the labeled graph spell out amino acid sequences – how to find the correct one?

    � We need scoring to evaluate paths

  • Path Score

    � p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq}

    � p(P, s) = the probability that peptide Pgenerates a peak s

    � Scoring = computing probabilities

    � p(P,S) = πsєS p(P, s)

  • � For a position t that represents ion type δj :

    qj, if peak is generated at t

    p(P,st) =

    1-qj , otherwise

    Peak Score

  • Peak Score (cont’d)

    � For a position t that is not associated with an ion type:

    qR , if peak is generated at tpR(P,st) =

    1-qR , otherwise� qR = the probability of a noisy peak that does

    not correspond to any ion type

  • Finding Optimal Paths in the Spectrum Graph

    � For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

    � Peptides = paths in the spectrum graph

    � P’ = the optimal path in the spectrum graph

    p(P,S)p(P',S) Pmax=

  • Ions and Probabilities

    � Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

    � δi-ions of a partial peptide are produced independently with probabilities qi

  • Ions and Probabilities

    � A fragment has all k peaks with probability

    � and no peaks with probability

    � A peptide also produces a ``random noise'' with uniform probability qR at any position.

    ∏=

    k

    iiq

    1

    ∏=

    −k

    iiq

    1

    )1(

  • Ratio Test Scoring for Partial Peptides

    � Incorporates premiums for observed ions and penalties for missing ions.

    � Example: for k=4, assume that for a partial peptide P’ we only see ions δ1,δ2,δ4.

    The score is calculated as:RRRR q

    q

    q

    q

    q

    q

    q

    q 4321)1(

    )1( ⋅−−⋅⋅

  • Scoring Peptides

    � T- set of all positions.

    � Ti={t δ1,, t δ2,..., ,t δk,}- set of positions that represent ions of fragments Pi.

    � A peak at position tδj is generated with probability qj.

    � R=T- U Ti - set of positions that are not associated with any partial peptides (noise).

  • Probabilistic Model

    � For a position t δj ∈ Ti the probability p(t, P,S) that peptide P produces a peak at position t.

    � Similarly, for t∈R, the probability that P produces a random noise peak at t is:

    −=

    otherwise1

    position tat generated ispeak a if),,( j

    j

    j

    q

    qSPtP

    δ

    −=

    otherwise1

    position tat generated ispeak a if)(

    R

    RR q

    qtP

  • Probabilistic Score

    � For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

    ∏∏= =

    =n

    i

    k

    j iR

    i

    R j

    j

    tp

    SPtp

    Sp

    SPp

    1 1 )(

    ),,(

    )(

    ),(

    δ

    δ

  • Resulting sequencing accuracy

    Algorithm Average Accuracy

    SequenceLength

    Tag 3 Tag 4 Tag 5 Tag 6

    Sherenga 0.690 8.65 0.821 0.711 0.564 0.364

    Peaks 0.673 10.32 0.889 0.814 0.689 0.575

    Lutefisk 0.566 8.79 0.661 0.521 0.425 0.339

    Benchmarking reported for 280 spectra.

    Frank and Pevzner’05

  • Enhancing Sherenga (Dancik et al.’99)

    Pepnovo’s scoring model:� Determines different intensity values.� Considers dependencies between fragment

    ions.� Incorporates additional chemical

    knowledge (e.g., preferred cleavage sites).� Uses positional influence of the cleavage

    site.� Improves the Random Model.

    Pepnovo slides by Ari Frank

  • pos(m)(region in peptide)

    yby2

    a

    b2

    a-NH3

    a-H2O

    b-NH3

    b-H2O

    y-NH3

    y-H2O

    b-H2O-NH3 b-H2O-H2O

    y-H2O-NH3

    y-H2O-H2O

    N-aa(N-terminal amino acid)

    C-aa(C-terminal amino acid)

    HCID - Fragmentation Network

    Amino acid influence

    Ion combinations

    Positional influence

    pos y P(y|pos)

    0 0 0.10 1 0.22

    2 3 0.52

    4 3 0.08

  • Discrete Intensity Values

    � Peak intensity normalized according to grass level (average of weakest 33% of peaks in spectrum).

    � Normalized intensities Discretized into 4 intensity levels:� zero : I < 0.05

    � low : 0.05 ≤ I < 2 (62% of peaks)� medium : 2 ≤ I < 10 (26% of peaks)� high : I ≥ 10 (12% of peaks)

  • Combinations of Fragments

    � The topology takes into account dependencies between fragments.

    � The values of the probability tables are learned from the training data, so they reflect the true “fragmentation rules”.

    � “Logical” combinations get higher probabilities:P(b=high | y=high ) = 0.36, vs. P(b=high | y= low ) = 0.03.

    yby2

    ab2

    a-NH3a-H2O

    b-NH3b-H2O

    y-NH3

    y-H2O

    b-H2O-NH3 b-H2O-H2Oy-H2O-NH3

    y-H2O-H2O

  • Additional Chemical Knowledge

    � The identity of the flanking amino acids influences the peak intensities:� Increased intensities N-terminal to Proline and Glycine� Increased intensities C-terminal to Aspartic Acid.

    � 400 amino acid combinations reduced to 15 equivalence sets (X-P,X-G, etc.).

    N-aa(N-terminal amino acid)

    C-aa(C-terminal amino acid)

    yb

  • Positional Influence

    � Creates separate models for different locations of the cleavage site in the peptide.

    � Models phenomena such as:� weak b/y ions near terminal ends.� prevalence of a-ions in the first half of the peptides.� prevalence of b2 towards the peptide’s C-terminal and y2

    near the N-terminal.

    pos(m)(region in peptide)

    yby2

    a

    b2

  • HRandom – Regional Density

    Bin

    0

    1

    2

    3

    Intensity levels

    1

    2

    2

    2

    2

    3

    3

    Window

    m/zw

  • Computing the Random Probability

    � α=1-(2ε)/w , is the probability of a single peak missing the bin.

    � Let ni , 1≤i≤d, be counts of peaks with intensity i in window w:

    ∑=

    ==

    ∑==

    ∑−==

    =

    +=

    d

    idRandom

    n

    dRandom

    nn

    dRandom

    nniIP

    nnIP

    nntIPd

    ii

    d

    tii

    t

    01

    1

    1

    1),...,|(.3

    ),...,|0(.2

    )1(),...,|(.1

    1

    1

    α

    αα

    (normalization term)

    (prob. of no peak)

    (prob. of peak with intensity t)

  • Random Model cont.

    � Employing this random model increases the contribution of peaks in sparse regions of the spectrum.

    � Decreases score for spurious matches in dense regions.

    � Increases contribution of high intensity peaks compared to low intensity.

  • Probability under HCID

    From the decomposition properties of probabilistic networks, each node is independent from the rest of the nodes given the value of its parents so:

    where I are the ion intensities for mass m and π(f) are the intensities of the parents of node f.

    ))(|(),(,...},{

    fIPmIPbyf

    fHH CIDCIDπ∏

    =r

  • Probability under HRandom

    � Peak occurrences are treated as random independent events:

    � The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.

    ),...,|(),( 1...},,{ 2

    dOHybyf

    fHH nnIPmIP RandomRandom ∏−∈

    =r

  • The Likelihood Ratio Score

    � A putative cleavage site is scored according to the log ratio test:

    � Can be used to score a peptide by summing the score for the prefix masses:

    ),...,|(

    ))(|(

    log),(

    ),(log),(

    1,...},{

    ,...},{

    dbyf

    fH

    byffH

    mH

    mHm nnIP

    fIP

    mIP

    mIPmIScore

    Random

    CID

    Random

    CID

    ∏∏

    ∈==π

    r

    rr

    ∑=

    ==n

    iimn mIScorepppPScore i

    121 ),()..(

    r

  • PepNovo’s De Novo Sequencing

    � A spectrum graph is created from the experimental MS/MS spectrum.

    � The nodes are scored using the Bayesian network.

    � Highest scoring anti-symmetric path is found using dynamic programming.

  • Data and Software

    � 1252 spectra of doubly charged tryptic peptides (from ISB and OPD), measured on ion trap mass spectrometer:� 972 spectra in the training set.� 280 spectra in the test set (peptides up to 1400 Da.,

    assignments independently verified.)

    � Compared PepNovo with 3 de novo programs: Sherenga (Spectrum Mill 3.0), Lutefisk XP, and Peaks v2.3.

  • Results

    Algorithm Average Accuracy

    SequenceLength

    Tag 3 Tag 4 Tag 5 Tag 6

    PepNovo 0.727 10.30 0.946 0.871 0.800 0.654

    Sherenga 0.690 8.65 0.821 0.711 0.564 0.364

    Peaks 0.673 10.32 0.889 0.814 0.689 0.575

    Lutefisk 0.566 8.79 0.661 0.521 0.425 0.339

    Benchmarking reported for 280 spectra.

    Frank and Pevzner’05

  • Peptide Sequencing Problem

    Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

    Input:� S: experimental spectrum� ∆: set of possible ion types� m: parent mass

    Output: � Peptide P with mass m, whose theoretical

    spectrum matches the experimental Sspectrum the best

  • � Missing peaks due to chemical conditions � e.g. Proline (P) ‘grabs’ the amino acid to its right

    � Additional representative peaks (ion types)

    � Noise (or unexplainable peaks)

    Scoring MS/MS spectrum masses

    Ion type ∆m p(ion)

    b +1 0.5

    b (iso) +2 0.15

    b-NH3 -16 0.3

    b-H2O -17 0.3

    a (-CO) -27 0.17

    b-NH3-H2O -34 0.16

    F S N A M S D I

    V SGQ L I D

  • Ion Types

    � Some masses correspond to fragment

    ions, others are just random noise

    � Knowing ion types ∆={δ1, δ2,…, δk} lets us

    distinguish fragment ions from noise

    � We can learn ion types δi and their

    probabilities qi by analyzing a large test

    sample of annotated spectra.

  • Example of Ion Type

    � ∆={δ1, δ2,…, δk}

    � Ion types

    { b, b-NH3, b-H2O}

    correspond to

    ∆={0, 17, 18}

    *Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity

  • Vertices of Spectrum Graph

    � Masses of potential N-terminal peptides

    � Vertices are generated by reverse shifts corresponding to ion types

    ∆={δ1, δ2,…, δk}

    � Every N-terminal fragment can generate up to k ions

    m-δ1, m-δ2, …, m-δk

    � Every mass s in an MS/MS spectrum generates k vertices

    V(s) = {s+δ1, s+δ2, …, s+δk}

    corresponding to potential N-terminal peptides

    � Vertices of the spectrum graph:{ initial vertex} ∪V(s1) ∪V(s2) ∪... ∪V(sm) ∪{ terminal vertex}

  • Reverse Shifts

    Shift in H2O+NH3

    Shift in H2O

  • Edges of Spectrum Graph

    � Two vertices with mass difference corresponding to an amino acid A:� Connect with an edge labeled by A

    � Gap edges for di- and tri-peptides

    � Paths in the labeled graph spell out amino acid sequences – how to find the correct one?

    � We need scoring to evaluate paths

  • Path Score

    � p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq}

    � p(P, s) = the probability that peptide Pgenerates a peak s

    � Scoring = computing probabilities

    � p(P,S) = πsєS p(P, s)

  • � For a position t that represents ion type δj :

    qj, if peak is generated at t

    p(P,st) =

    1-qj , otherwise

    Peak Score

  • Peak Score (cont’d)

    � For a position t that is not associated with an ion type:

    qR , if peak is generated at tpR(P,st) =

    1-qR , otherwise� qR = the probability of a noisy peak that does

    not correspond to any ion type

  • Finding Optimal Paths in the Spectrum Graph

    � For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

    � Peptides = paths in the spectrum graph

    � P’ = the optimal path in the spectrum graph

    p(P,S)p(P',S) Pmax=

  • Ions and Probabilities

    � Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

    � δi-ions of a partial peptide are produced independently with probabilities qi

  • Ions and Probabilities

    � A fragment has all k peaks with probability

    � and no peaks with probability

    � A peptide also produces a ``random noise'' with uniform probability qR at any position.

    ∏=

    k

    iiq

    1

    ∏=

    −k

    iiq

    1

    )1(

  • Ratio Test Scoring for Partial Peptides

    � Incorporates premiums for observed ions and penalties for missing ions.

    � Example: for k=4, assume that for a partial peptide P’ we only see ions δ1,δ2,δ4.

    The score is calculated as:RRRR q

    q

    q

    q

    q

    q

    q

    q 4321)1(

    )1( ⋅−−⋅⋅

  • Scoring Peptides

    � T- set of all positions.

    � Ti={t δ1,, t δ2,..., ,t δk,}- set of positions that represent ions of fragments Pi.

    � A peak at position tδj is generated with probability qj.

    � R=T- U Ti - set of positions that are not associated with any partial peptides (noise).

  • Probabilistic Model

    � For a position t δj ∈ Ti the probability p(t, P,S) that peptide P produces a peak at position t.

    � Similarly, for t∈R, the probability that P produces a random noise peak at t is:

    −=

    otherwise1

    position tat generated ispeak a if),,( j

    j

    j

    q

    qSPtP

    δ

    −=

    otherwise1

    position tat generated ispeak a if)(

    R

    RR q

    qtP

  • Probabilistic Score

    � For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

    ∏∏= =

    =n

    i

    k

    j iR

    i

    R j

    j

    tp

    SPtp

    Sp

    SPp

    1 1 )(

    ),,(

    )(

    ),(

    δ

    δ

  • Resulting sequencing accuracy

    Algorithm Average Accuracy

    SequenceLength

    Tag 3 Tag 4 Tag 5 Tag 6

    Sherenga 0.690 8.65 0.821 0.711 0.564 0.364

    Peaks 0.673 10.32 0.889 0.814 0.689 0.575

    Lutefisk 0.566 8.79 0.661 0.521 0.425 0.339

    Benchmarking reported for 280 spectra.

    Frank and Pevzner’05

  • Enhancing Sherenga (Dancik et al.’99)

    Pepnovo’s scoring model:� Determines different intensity values.� Considers dependencies between fragment

    ions.� Incorporates additional chemical

    knowledge (e.g., preferred cleavage sites).� Uses positional influence of the cleavage

    site.� Improves the Random Model.

    Pepnovo slides by Ari Frank

  • pos(m)(region in peptide)

    yby2

    a

    b2

    a-NH3

    a-H2O

    b-NH3

    b-H2O

    y-NH3

    y-H2O

    b-H2O-NH3 b-H2O-H2O

    y-H2O-NH3

    y-H2O-H2O

    N-aa(N-terminal amino acid)

    C-aa(C-terminal amino acid)

    HCID - Fragmentation Network

    Amino acid influence

    Ion combinations

    Positional influence

    pos y P(y|pos)

    0 0 0.10 1 0.22

    2 3 0.52

    4 3 0.08

  • Discrete Intensity Values

    � Peak intensity normalized according to grass level (average of weakest 33% of peaks in spectrum).

    � Normalized intensities Discretized into 4 intensity levels:� zero : I < 0.05

    � low : 0.05 ≤ I < 2 (62% of peaks)� medium : 2 ≤ I < 10 (26% of peaks)� high : I ≥ 10 (12% of peaks)

  • Combinations of Fragments

    � The topology takes into account dependencies between fragments.

    � The values of the probability tables are learned from the training data, so they reflect the true “fragmentation rules”.

    � “Logical” combinations get higher probabilities:P(b=high | y=high ) = 0.36, vs. P(b=high | y= low ) = 0.03.

    yby2

    ab2

    a-NH3a-H2O

    b-NH3b-H2O

    y-NH3

    y-H2O

    b-H2O-NH3 b-H2O-H2Oy-H2O-NH3

    y-H2O-H2O

  • Additional Chemical Knowledge

    � The identity of the flanking amino acids influences the peak intensities:� Increased intensities N-terminal to Proline and Glycine� Increased intensities C-terminal to Aspartic Acid.

    � 400 amino acid combinations reduced to 15 equivalence sets (X-P,X-G, etc.).

    N-aa(N-terminal amino acid)

    C-aa(C-terminal amino acid)

    yb

  • Positional Influence

    � Creates separate models for different locations of the cleavage site in the peptide.

    � Models phenomena such as:� weak b/y ions near terminal ends.� prevalence of a-ions in the first half of the peptides.� prevalence of b2 towards the peptide’s C-terminal and y2

    near the N-terminal.

    pos(m)(region in peptide)

    yby2

    a

    b2

  • HRandom – Regional Density

    Bin

    0

    1

    2

    3

    Intensity levels

    1

    2

    2

    2

    2

    3

    3

    Window

    m/zw

  • Computing the Random Probability

    � α=1-(2ε)/w , is the probability of a single peak missing the bin.

    � Let ni , 1≤i≤d, be counts of peaks with intensity i in window w:

    ∑=

    ==

    ∑==

    ∑−==

    =

    +=

    d

    idRandom

    n

    dRandom

    nn

    dRandom

    nniIP

    nnIP

    nntIPd

    ii

    d

    tii

    t

    01

    1

    1

    1),...,|(.3

    ),...,|0(.2

    )1(),...,|(.1

    1

    1

    α

    αα

    (normalization term)

    (prob. of no peak)

    (prob. of peak with intensity t)

  • Random Model cont.

    � Employing this random model increases the contribution of peaks in sparse regions of the spectrum.

    � Decreases score for spurious matches in dense regions.

    � Increases contribution of high intensity peaks compared to low intensity.

  • Probability under HCID

    From the decomposition properties of probabilistic networks, each node is independent from the rest of the nodes given the value of its parents so:

    where I are the ion intensities for mass m and π(f) are the intensities of the parents of node f.

    ))(|(),(,...},{

    fIPmIPbyf

    fHH CIDCIDπ∏

    =r

  • Probability under HRandom

    � Peak occurrences are treated as random independent events:

    � The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.

    ),...,|(),( 1...},,{ 2

    dOHybyf

    fHH nnIPmIP RandomRandom ∏−∈

    =r

  • The Likelihood Ratio Score

    � A putative cleavage site is scored according to the log ratio test:

    � Can be used to score a peptide by summing the score for the prefix masses:

    ),...,|(

    ))(|(

    log),(

    ),(log),(

    1,...},{

    ,...},{

    dbyf

    fH

    byffH

    mH

    mHm nnIP

    fIP

    mIP

    mIPmIScore

    Random

    CID

    Random

    CID

    ∏∏

    ∈==π

    r

    rr

    ∑=

    ==n

    iimn mIScorepppPScore i

    121 ),()..(

    r

  • PepNovo’s De Novo Sequencing

    � A spectrum graph is created from the experimental MS/MS spectrum.

    � The nodes are scored using the Bayesian network.

    � Highest scoring anti-symmetric path is found using dynamic programming.

  • Data and Software

    � 1252 spectra of doubly charged tryptic peptides (from ISB and OPD), measured on ion trap mass spectrometer:� 972 spectra in the training set.� 280 spectra in the test set (peptides up to 1400 Da.,

    assignments independently verified.)

    � Compared PepNovo with 3 de novo programs: Sherenga (Spectrum Mill 3.0), Lutefisk XP, and Peaks v2.3.

  • Results

    Algorithm Average Accuracy

    SequenceLength

    Tag 3 Tag 4 Tag 5 Tag 6

    PepNovo 0.727 10.30 0.946 0.871 0.800 0.654

    Sherenga 0.690 8.65 0.821 0.711 0.564 0.364

    Peaks 0.673 10.32 0.889 0.814 0.689 0.575

    Lutefisk 0.566 8.79 0.661 0.521 0.425 0.339

    Benchmarking reported for 280 spectra.

    Frank and Pevzner’05