36
MaRMot, a tool for experimental MRM transitions design from Mascot repository Master of Bioinformatics and Proteomics Samuel Croset Project supervisor: Willy Bienvenut October 2009 – May 2010

MaRMot, a tool for experimental MRM transitions design ... · PDF fileb- MaRMot bank and related ... of MRM transitions as they have been generated locally in regards to instrumentation

  • Upload
    ledang

  • View
    219

  • Download
    3

Embed Size (px)

Citation preview

  ‐ 1 ‐ 

 

MaRMot, a tool for experimental  MRM transitions design  from Mascot repository 

  

Master of Bioinformatics and Proteomics          

Samuel Croset  

Project supervisor: Willy Bienvenut  

October 2009 – May 2010 

  ‐ 3 ‐ 

Summary  

Summary ....................................................................................................................................................... 3

Abbreviations............................................................................................................................................- 5 -

Introduction................................................................................................................................................... 7

Software implementation and workflow..................................................................................................... 11

a- Mascot archives bank.................................................................................................................... 111

b- MaRMot bank and related index................................................................................................... 111

c- Protein query and peptides filtering ................................................................................................ 12

d- Peptide clustering............................................................................................................................ 12

e- Peptides ranking.............................................................................................................................. 13

f- Peptides consensus spectrum and fragments identification. ........................................................... 13

g- MRM transition selection................................................................................................................ 15

h- Graphical interface: Input, output and export forms....................................................................... 15

Discussion ................................................................................................................................................... 19

a- Mascot peptide validation and MaRMot bank creation .................................................................. 19

b- Clusterization of homologous MS/MS spectra ............................................................................... 20

c- Precursor peptide scoring................................................................................................................ 22

d- Transition selection......................................................................................................................... 29

e- Export of the transition list and the Consensus Spectrum............................................................... 30

f- Further configuration ...................................................................................................................... 31

Conclusion .................................................................................................................................................. 33

References................................................................................................................................................... 34 

 

  ‐ ‐ 5 ‐ ‐ 

Abbreviations  

AC: Protein Accession Number

CID: Collision Induced Dissociation cell

CS: Consensus Spectrum

DB: Database

FPF: Fragmented Peptide Fingerprint

GAPP: Genome Annotating Proteomic Pipeline

QqQ: Triple quadrupole

Q1: First Quadrupole

Q3: Third Quadrupole

QS: Q STARTM

QT: Q TRAPTM

MGF: Mascot Generic Format

MRM: Multiple Reaction Monitoring

SRM: Single Reaction Monitoring

TPP: Trans-Proteomic-Project

 

  ‐ 7 ‐ 

Introduction 

Nowadays, biologists need techniques to quantify several proteins simultaneously particularly

for biomarkers validation. The current trend is to combine multiple biomarkers for increasing the

specificity and accuracy of the diagnostic 1. Biomarkers are traditionally validated through

immunoassays e.g., ELISA or dipstick 2-4. However, production of antibodies is time and cost

consuming and may not be suitable for multiple screening. This observation has leaded to use

mass-spectrometry for biomarkers validation. Indeed, mass-spectrometry is one of the main

platform for biomarkers discovery 1. This technique has proven to be suitable for relative protein

quantification, with stable isotope, where samples with different labelling can be compared. The

inclusion of an internal standard allows also an absolute quantification 5, 6.

Single Reaction Monitoring (SRM) or as an extension Multiple Reaction Monitoring (MRM) are

mass-spectrometry approaches for relative quantification and this technique has been employed

for over 30 years for small molecules 7, 8. More recently, it has been successfully applied to

protein/peptide quantification 9, 10. Triple quadrupole (QqQ) mass spectrometers are now more

sensitive and have a mass-range suitable for peptide analysis such as the Q TRAP™ 4000 (AB

Sciex Concorde, ON) 11 or the Xevo TQ MS™ (Waters, Milford, Ma) can undertake MRM with

peptides.

In a SRM and more generally MRM, the first quadrupole (Q1) is set to transmit only a narrow

mass window around the precursor peptide mass, excluding other analytes. The transmitted

peptide is then fragmented in the second quadrupole (Q2) acting as a Collision Induced

Dissociation cell (CID). The third quadrupole (Q3) is set to transmit only a narrow mass window

around the mass of a fragment (Figure 1). The pairing of the masses monitored in Q1 and in Q3

is called a SRM transition. Each transition needs be defined accurately and must be specific for

the analyte of interest. A MRM is the observation of multiple SRM transitions consecutively

giving more confidence to the identification of the peptides checked. Although relative

quantification of the peptides can be inferred by comparing the obtained intensities in different

conditions 12, the absolute quantification is possible with the use of an internal standard peptide

curve 13.

  ‐ 8 ‐ 

 Figure 1: Schematic representation adapted from Kitteringham et al. 1 of a single reaction monitoring (SRM). The precursor peptide is selected by the first quadrupole (Q1). This peptide is then fragmented by the collision induced dissociation cell (Q2). One of the fragments is selected by the third quadrupole (Q3). The relative quantification is

based n the number of counts recorded by the detector for the monitored transition.

One of the critical stages in MRM experiments design is the choice of the transitions monitored.

A precursor peptide needs first to be carefully selected, with one or more associated fragments

i.e., the precursor peptide selected should not have any miscleavages and should have an

observable size within mass range of the instrument. To overcome false positive results, the

selected peptide should also uniquely match the sequence of the protein of interest. This

characteristic is referred as proteotypicity14, 15. Furthermore, the precursor should not carry PTM

or chemical modifications e.g., methionine oxydation, phosphorylation 16, 17 or acetylation 18,

unless it is the aim of the experiment. Selection of these transitions is the bottleneck of the MRM

experiments. Indeed, each of them needs to be carefully chosen for a maximal sensitivity.

Several tools have been developed to help picking up reliable transitions. Some of them, such as

MRMaid 19 or TIQUAM 14 are based on the observation of real data coming from public

repositories. This empirical approach is combined with general knowledge on MRM to predict

putative transitions and the precursors are selected according to their sequences helped with a

scoring algorithm. Fragments are deduced from experimental MS/MS spectra. MRMaid uses The

Genome Annotating Proteomic Pipeline 20 (GAPP) whereas TIQUAM mines PeptideAtlas 21.

However, the material available in these freely available databases is not always pertinent. The

type of instrument and experimental conditions can differ from the various laboratories they have

been acquired impairing an easy prediction of the transitions required for the task. Commercially

available software’s such as MRM PilotTM (AB Sciex), QuanOptimiseTM (Waters), SRM

WorkflowTM (Thermo-Fisher), MassHunter OptimiserTM (Agilent) and MIDASTM 22 predict in

silico peptides and fragments. A supplementary experimental validation stage is then mandatory.

  ‐ 9 ‐ 

Indeed, theoretical peptides must be assessed through a succession of MS scans on the in-house

instrument. The final transition list should fit observable peptides, but a verification stage is then

time consuming. The observed peptides have not been extensively characterized and are thus not

necessarily the best to monitor.

Local data accumulated throughout the years are a more reliable starting material for the design

of MRM transitions as they have been generated locally in regards to instrumentation and sample

processing protocols. This idea is used by MaRiMba 23, a software which has been developed for

the Trans-Proteomic-Project (TPP) that can exploits and ranks personal spectral libraries (*.msp

file) to predict MRM transitions. This strategy bypasses a laboratory validation stage and gives

confident transitions. Since our laboratory is not using the TPP but only the Mascot software 24 25

for protein identification the basis of this project is to develop a tool able to use Mascot

repository data for the identification of relevant and experimental MRM transition. The Mascot

repository contains all the searches made by the user throughout the years. Proteins are

potentially present several times in the archive, associated with different set of peptides used to

infer their identification. Some of these peptides are identical from one experiment to another

and the expected redundancy is use as a valuable factor to define precursors for MRM

transitions. The best ion fragments to monitor are deduced from MS/MS spectra recovered from

this archive. Then, MaRMot is a new tool dedicated to data-mine Mascot repository to identify

and to rank MRM transitions for a protein.

  ‐ 11 ‐ 

Software implementation and workflow 

a­ Mascot archives bank  

Data used as source material for the determination of experimental MRM transitions is based on

the archives stored in our on-site Mascot server. These data are related to analyses performed by

the Proteomics facility at the Beatson Institute for Cancer Research (Glasgow UK) for the last 4

years. It contains more than 17000 searches worth 168 Go of data storage. Samples were mainly

related to H. sapiens, E. coli, Rodents (M. musculus and R. norvegicus), G. gallus, D.

discoideum, A. thaliana, C. reinhardtii. Most of the data were acquired with ESI-MS/MS

instruments (Q-Trap 4000 and Q-Star XL instruments; ABi, Concorde ON) despite few

sample/data could have been related to external sources and then alternative instruments and

preparation. Commonly available databases (DB) such UniProtKB/Swiss-Prot, NCBInr have

been used for searches. Some home-made DB containing redesigned protein sequences

developed internally have also been queried. These specific sequences have an internally defined

protein accession number (AC) based on the UniProtKB/Swiss-Prot identification system.

b­ MaRMot bank and related index 

The previously described Mascot archives are parsed to first create a more restricted bank

(MaRMot bank) used later to determine the MRM transitions. To populate the MaRMot bank,

we developed a series of dedicated Perl and Linux Shell scripts involving Mascot Parser (Ver.

2.2.6, Matrix Science UK) to retrieve the information from the Mascot archives (*.dat files).

Only the proteins and peptides validated above a user defined threshold i.e., Mascot peptide

score higher than 20 and a p-value below 0.05 by default, are integrated in the MaRMot bank.

For each *.dat file, all proteins identified are saved with their associated list of matched peptides.

Each of these peptides are retrieved with their associated sequence, chemical modifications,

charge, Mascot titles, experimental and theoretical masses, Mascot peptide score and name of the

DB where they have been identified. Their MS/MS spectra are also stored using the Mascot

Generic Format (MGF) i.e., “a list of pairs of mass and intensity values” 26. Finally, these

  ‐ 12 ‐ 

recovered data are treated with a Perl script able to create an index of protein ACs listed in the

MaRMot bank. Both the MaRMot bank and its index are updated weekly to include new results

in it. At the present time, our index includes over 96000 protein AC.

c­ Protein query and peptides filtering 

The AC of the protein of interest is entered by the user to determine MRM transition couples.

Data related to this protein are retrieved from the MaRMot bank. Multiple copies of the same

peptide related to the dame raw source can be present among the data retrieved. This artefact is

mainly due to multiple searches of the same original experimental data (*.mgf file). To remove

them, a scrip checks peptides title, sequence and chemical modifications. In case of complete

similarity only one entry is retained over the few identical ones. Furthermore, some entries are

also homologous mainly due to recalibration procedure. They are only different by their peptides

title, assigned before Mascot search step. These cases are not detected by the previous filter. In

consequence, a second one has been defined to discards entries having the same amino-acid

sequence, same number of peaks in their MS/MS spectrum and the same Mascot score rounded

at two decimals. Spectra fitting these criteria are considered as a redundant entry and are thus

removed to retain a final single one. Since few instruments are used in the facility i.e., Q

STARTM (QS) and Q TRAPTM (QT), and this information is included in the peptide’s title, we

decided to add an instrument-specific filter based on the title. An empty field will considered all

data indistinctively of the instrument type.

d­ Peptide clustering 

Peptides of selected protein having successfully passed through these filters and showing the

same sequence, chemical modifications and charge state are grouped together in classes defined

as “Fragmented Peptide Fingerprint” (FPF). Each FPF includes at least one to several thousand

MS/MS spectra harvested all over Mascot archives e.g., 4069 MS/MS spectra for the peptide

SLDLDSIIAEVK (charge state of 2+ and no chemical modification) related to the keratin

K2C1_HUMAN (UniProtKB/Swiss-Prot AC: P04264).

  ‐ 13 ‐ 

e­ Peptides ranking 

To determine the proteotypicity15, 27 of the listed peptide, the taxonomy of the protein of interest

is required and harvested from UniProtKB/Swiss-Prot. If the protein AC is not found in this DB

e.g. internally defined protein sequence, the taxonomy field is left empty and all species will be

investigated for that task.

All FPF classes are ranked to display their quality to sort out the best MRM precursor candidate

for the defined protein. The score is determined by the product (Equation 2) of six coefficients

representing:

- (A) The amino-acid’s sequence composition and modifications present on the FPF (Table 1 and

2, Equation 3),

- (B) The Mascot score of peptides present in the FPF (Equation 4),

- (C) The presence of a miscleavage in the FPF sequence (Table 3),

- (D) The charge of the FPF (Table 4),

- (E) The FPF frequency among all experiments

- (F) The proteotypicity of the FPF sequence for the taxonomy studied (Table 5).

f­ Peptides consensus spectrum and fragments identification. 

It is commonly accepted that peptides with same sequence, charge and modification (FPF) give

very similar MS/MS. Thus, all spectra from a single FPF are merged together. This “spectral

alignment” is treated with a sliding window approach over the m/z axis leading to a final

consensus spectrum (CS, Figure 2). The width of the sliding window is defined depending of the

instrument e.g., 0.1 Da for Q-Star data and 0.2 Da for Q-Trap data. The window is moving with a

  ‐ 14 ‐ 

step equal to one-fifth of its width. For each step the intensities of all peaks inside the window

are summed and the corresponding intensity is attributed to the midpoint. This process is

repeated until the entire spectrum is covered. The resulting spectrum is smoothed using a

derivative function to retrieve only the highest points of the intensity distribution (Figure 2b).

Masses with intensity higher than 1% of the most intense peak are compared to the in-silico

generated masses of theoretical ion series. Masses within 0.1 Da error range are labelled with

their corresponding ion type limited to yn+, bn

+, yn++ and bn

++ series.

 

Figure 2: Scheme of the process to generate the Consensus spectrum: a) The individual spectrum of 3 peptides are merged together. The peak A corresponds to the signal of a fragment. The other peaks are related to noise signal. b)

the sliding window is applied over the m/z axis of the spectrum A. c) Consensus spectrum: spectrum (b) is treated with a derivative function to retrieve only the highest point of each distribution to clusterize the spectra. The peak A

has an improved signal over noise ratio compared to the original spectra.

  ‐ 15 ‐ 

g­ MRM transition selection 

Relevant fragment to define a MRM couple are determined from the consensus spectrum. No

fragment masses below the m/z of b2 or y2 ions are considered and three possible MRM

transition per FPF are reported in the final result. Thus, the mass of the most intense labelled

fragment peak of the spectrum is selected by default. If its m/z value is below the parent ion

mass and its most common artefacts e.g., dehydrated by-product, the two highest peaks with

masses larger than the precursor ion will also be included. In the other case, the three most

intense fragment peaks larger than the precursor ion are considered.

h­ Graphical interface: Input, output and export forms 

 

Figure 3: Result table generated by MaRMot for the protein Q96C19 (Swiprosin-1 ). The columns contain: A) A check box for the export of the consensus spectrum in MGF. B) The sequence of the peptide, chemical modifications are underlined. The characters of the sequence are a link to the consensus spectrum for the peptide (Figures 4 and

5). C) The name of the modifications found in the peptide D) The proteotypicity of the peptide E) The theoretical and the experimental mean mass of the peptide F) The number of miscleavages in the sequence G) the charge state of the

peptide H) the number of identical peptides found in all previous experiments I) The final score for the peptide J) The list of selected transitions by MaRMot

  ‐ 16 ‐ 

 Figure 4: Part of the Consensus Spectrum for the peptide LSEIDVSSEGVK (AC: Q96C19). Peaks are labeled with

their corresponding ion (y+, y2+, b+ and y2+).

The MaRMot input form requesting a protein AC to investigate is written in Perl. After data

processing, the suggested transitions are outputted with peptide sequences, and the proposed

transitions (Figure 3). The first line of the output table (light blue) is related to the top scored

FPF. The following white rows show the FPFs sharing the same peptide sequence as the first

FPF but subject to different chemical modifications or ion charges. The consensus peptide

MS/MS spectrum is accessible through an interactive link on the peptide sequence (Figures 4 and

5). This scheme is repeated up to the last non-redundant peptide sequence. On the extreme left

side of the result table, an interactive box can be checked. Then, the peaks list of the selected

peptide(s) is (are) exported as a MGF file. On the main table, the checked transitions can be

exported as a *.csv file and directly used by the software controlling the MRM acquisition.

  ‐ 17 ‐ 

 Figure 5: Table containing the theoretical ion masses. Fragments matched on the spectrum are in bold red, with the

relative intensity of the peak in parentheses. This relative intensity corresponds to the percentage of the maximum intensity of the spectrum.

  ‐ 19 ‐ 

Discussion 

MRM is a MS technique for measuring the relative quantity of one or more proteins at the same

time. To determine the abundance of a protein, a triple quadrupole is set to monitor only a list of

pre-defined and specific transitions. The determination of this list is the most important and

crucial step of a MRM experiment. These transitions could be predicted in silico by calculating

theoretical peptides and fragments or, empirically by exploiting previous experimental data. The

second approach is the closest to the laboratory reality and can avoid to the user an experimental

validation stage of the selected transitions. We have used this strategy with MaRMot by mining

the data repository of Mascot, a well-known tool for protein identification. This information

present in Mascot archives comes from previous local experiments done by the user. These data

are complemented with empirical and advance knowledge on the selection of interesting

precursors and associated fragments.

a­ Mascot peptide validation and MaRMot bank creation 

To design MRM transitions, MaRMot first parses and retrieves the information from the Mascot

repository. A MaRMot bank is created this way and is used later to identify suitable MRM

transitions. The Mascot archives used to populate the MaRMot bank is composed of *.dat files.

However, the reported list of peptides coming from a saved *.dat file is not a directly usable

material. Indeed, Mascot uses a statistical approach to identify peptides and proteins in a given

DB. Some of these identified peptides and/or proteins have a low probabilistic score are most

likely false positive. The significance threshold and the peptide score are the two parameters

which can be adjusted to skim the most likely false positive match out of this list. The remaining

peptides and proteins matches are then included in the MaRMot bank.

Mascot score related to peptide MS/MS is based on a probability calculation. This value i.e., P, is

calculated considering the match of the peptide in the DB is a random event. P depends on the

  ‐ 20 ‐ 

significance threshold and on the size of the DB searched 24, 28. The Mascot score for each

peptide is derived from this probability value (Equation 1).

Equation 1: Equation used for the calculation of the Mascot score from the absolute probability that consider the

match of the peptide with the DB as a random event

Mascot Score = -10 x log (P)

So the higher the score is, the more confident the peptide has been characterised. This score is

used as a threshold while retrieving the information from Mascot to populate the MaRMot bank.

Our experience shows that a threshold of twenty removes most of the potent false positive

matches and prevents the loss of too many true positive. Applying this value to extract data from

the *.dat files should gives an acceptable confidence level of the data imported in the MaRMot

bank. A higher cut-off could be used to improve the quality of the data available in this bank,

used later for the identification of the most relevant MRM transition. However, a too strict

threshold can excludes some proteins and peptides with low scores that could still have an

interest in MRM transitions determination e.g., poorly characterized proteins in previous

experiments, or proteins with chemical modifications such as phosphorylation. The user as the

possibility to change in the configuration file the peptide score cut-off and the significance value

and then define the level of confident he wants for the MaRMot bank.

b­ Clusterization of homologous MS/MS spectra 

As previously described, MaRMot bank contains a large number of MS/MS spectra. Although

the theoretical number of endoproteotytic peptide sequence is limited for the selected protein,

this number could expends tremendously due to co- or post-translational modification or various

charge states for the same peptide sequence. Some of these experimental peptides could be

present one to several times in the MaRMot bank. Then, they could be related to few MS/MS

spectra to thousands of occurrences. The selection of the best MS/MS spectra could be an option

but this approach would loose most of the experimental data available. The idea of the

  ‐ 21 ‐ 

 

Figure 6: Impact of sliding window on a low intensity signal: 1) Merged MS/MS spectra of 2604 peptides from P60709 (Actin cytoplasmic) 2) Consensus spectrum obtained after treatment of the spectrum (1) with the sliding

window and the derivative function. The intensity of the peaks A and C strongly decreases after treatment, as these peaks do not correspond to a fragment and have been seen only in few experiments. Peaks below B are noise and

are almost removed after the treatment. The peak D corresponds to a fragment and is consolidated and clearly rises from the spectra.

clusterization approach is to valorise all of this experimental information and use it later for the

identification of the best possible transition. Indeed, the spectra accumulation approach is well

known to improve signal over noise ratio and to decrease spectrum to spectrum peak variability.

Peptides with similar charge state, sequence and chemical modifications have the same m/z and

produce very similar MS/MS spectrum. Thus, these spectra are grouped together in a sub-set so

called “Fragmented Peptide Fingerprint” (FPF). The MS/MS spectra of the peptides in a FPF

will be use to produce one unique consensus MS/MS spectrum which is analyzed in fine to

determine MRM transitions. The consensus spectrum is obtained by merging all the MS/MS

spectra of one FPF followed by data treatment to remove data redundancy. First, summed

MS/MS spectra are treated with a sliding window approach to help to remove the noise and

emphasizes the real fragments. Indeed, the noise on a MS/MS spectrum is distributed randomly,

whereas peaks corresponding to real fragments are supposed to be at the same m/z. In practice,

due to the error of the instrument, the experimental m/z is distributed around the theoretical

mass. The sliding window enhances fragment peak intensities and averages the noise while

  ‐ 22 ‐ 

covering the merged spectra. Since MRM transitions are strongly related to the precursor and

fragment intensity, this approach has also the advantage to identify more robust transitions

couples (see Figure 6).

c­ Precursor peptide scoring 

Depending of the protein, few to hundreds of peptides could be related to a single protein entry.

All of these peptides are not equivalent and some of them are more suitable to satisfy MRM

requirements. A few parameters are used to score and rank peptides from the most effective one

to the less for the identification of MRM transitions. These parameters are:

- The frequency that a peptide is identified all over the Mascot repository. Indeed, the more

frequently a peptide is identified in the repository, the most likely it would be visible in the

future analysed.

- Proteotypicity of the peptide; to avoid bias in the quantification, the peptide sequence shall

be proteotypic to the protein selected.

- The quality of the MS/MS spectrum obtained from the peptide; Peptides with a

fragmentation pattern close to the theoretical one, with high intensity signals and a low

noise level are better candidates for MRM transitions.

- Presence of miscleavages; as miscleaved peptides are less reproducible, it is better to use

sequences correctly cut by the Trypsin.

These parameters are used to score and sort out the most relevant peptides for the identification

of the MRM precursor mass of a selected protein. In consequence, each FPF is characterized by a

score (Equation 2), composed by the product of 6 coefficients representing the properties

introduced before and balanced to fit datasets of any size.

  ‐ 23 ‐ 

Equation 2: Calculation of the score for a FPF. The calculation of coefficients is detailed below.

Score (FPF) = A x B x C x D x E x F

Coefficient A - Sequence composition and chemical modification:

The detection and reliability of a peptide can be influenced by it sequence and the chemical

modifications present on it, as they change it physical properties. The peptide sequence has a

direct influence on their signal in ESI-MS analysis. Recently, a few studies highlight these

influences where some residues and/or associated physico-chemical properties are much more in

favour of the ionisation of the peptide whereas some others have detrimental effects 29 30 31. In

MRM experiments, a peptide precursor selected for the identification of a transition needs to be

clearly detected. Then, some sequences with favoured residues are therefore preferred. To

represent the influence of the peptide sequence in each FPF sub-set, each residue is balanced

with a positive number which can be either “1” related to a bad influence, “2” for no significant

influence or “3” for a positive influence. These values have been determine using recent

publication 30 and our own experimental knowledge (Table 1).

Table 1: Residues with their corresponding scores. The values have been defined from expert knowledge and bibliographical descriptions 30

Residue Score Residue

A, D, E, I, N, Q, T, V 3

G, K, L, , P, R, S 2

C, F, H, M, W, Y 1

Peptide modification, endogenous or artifactual, could also have an influence on the physic-

chemical properties of peptides e.g., phosphorylation or acetylation. Then, depending of the

modification, the score previously attributed to a residue is weighted with a “chemical

  ‐ 24 ‐ 

modification score”. These scores are defined according the search and the characterisation of

each modification (Table 2).

Table 2: Chemical modifications on residues and their influence on the residue score. These values have been balanced, to represent their impact on the transition design.

Chemical modification Score Modification

No modification 1

Phosphorylation (S, T or Y) 0.5

Oxydation(M) 0.2

Acetylation Protein N-term(T, G, A, S or M) 2

N-term Pyroglutamination (N or E or D) 0.5

N-term Pyroglutamination Peptide N-term(Q) 0.1

Other modification 0.1

It is generally recommended to avoid most of the co-, post or chemically induced modifications

in the design of MRM transitions. Indeed, these modifications do not necessarily always happen

at the same level from sample to sample e.g., Methionine oxidation, that could be tightly related

to sample preparation and prior treatments. They can therefore prejudice and bias the

quantification of the targeted protein.

Glutamic acid residue is known to loose easily one molecule of water when they are located at

the N-terminus side of the protein/peptide triggering the cyclisation of the lateral chain. This

modification commonly called pyro-glutamination could also be found for glutamine residue and

on a lower extend aspartic acid and asparagin residues. Thus these four residues are negatively

weighted. However, it can be sometimes important to consider PTM, for example in the case of

the quantification or residue validation of a phospho-protein 13. Then, phosphorylation is

weighted less negatively in this sense.

  ‐ 25 ‐ 

One modification i.e., N-terminal protein acetylation, is the only one to be positively weighted.

Indeed, from our experience, the acetylation of the N-terminus of a protein favours a clean

fragmentation and clearly reveals the yn and some of the bn ion series. Furthermore, protein

sequence at the N-term is especially discriminating for homologous protein 32, 33 that could be

highly valuable peptide to select for MRM experiments.

Finally, each residue score weighted by the associated modification are summed. To circumvent

the influence of the peptide length, the result is divided by the number of residues. The

calculation of the final score associated to the peptide sequence including its modification is

resumed in Equation 3. Coefficient A represents the ability of a peptide to be observed and

therefore suitable for the MRM transition. It has been defined to have a value superior than zero

but lower or equal to three.

Equation 3: Calculation of the coefficient to score the suitability of a peptide through its sequence (Coef. A); n is

the number of residue related to the scored peptide. The product of the residue score (RSi) and the modification

score (ModSi) is calculated for each residue (i). Finally, the average of these score gives the A value.

A = ∑=

n

in 1

1 Residue Score (RSi) x Modification Score (ModSi)

Coefficient B - Mascot’s peptide score:

The score attributed by Mascot reflects the probability (P) that the peptide has been correctly

identified. The higher the Mascot score is, the better chances the data from the MS/MS matches

correctly a defined sequence. It can be assesses as a MS/MS spectrum quality coefficient: spectra

with a good signal to noise ratio and clear fragments should have a higher Mascot score. These

two observations are crucial while choosing the precursors for MRM transitions. Then,

coefficient B reflects the Mascot score of all of the peptides present in a single FPF sub-set.

Since Mascot peptide score has a wide dynamic range starting i.e. 20 (selected threshold) up to

few hundreds, its contribution needs to be weighted in regards to other coefficients. First, the

final average Mascot score is divided by 10. Thus, a logarithm in base 2 is applied to condense

even more the range from 1 (neutral effect from threshold score equal to 20) to few units e.g.,

3.32 for a score of 100 (Equation 4).

  ‐ 26 ‐ 

Equation 4: Calculation of the Mascot scoring factor (Coef. B); “i" is the number of the peptide and MaSi the score

related to this peptide;

B = )10

1(log1

2 ∑=⋅

n

inMascot Score(MaSi)

Coefficient C – cleavage and missed cleavage:

Proteins are generally digested with the Trypsin before their bottom-up analysis by ESI-MS/MS 34, 35. Some peptides are irregularly cleaved due to change in parameters from in one experiment

to another e.g., the temperature or the ratio enzyme/sample. This phenomenon is not

reproducible and is affected by several factors e.g., the temperature. Therefore peptides with

missed cleavages may introduce a bias in the quantification due to the unspecific amount of

material related to peptide sharing part of the same sequence. Such peptides should be avoided in

the selection of the MRM precursors. The coefficient C takes in account the presence and

number of miscleavages in a FPF sequence. The presence of one miscleavage is strongly

penalized (Table 3). If more miscleavages are present, the FPF is not considered anymore for a

MRM transition and are set to zero. The final score of a FPF with a sequence fully cleaved by

trypsin is not influenced and the coefficient is set to one.

Table 3: Value of the Coefficient C according the number of miscleavages present in the FPF sequence.

Number of miscleavage(s) Value of coefficient C

0 1

1 0.2

> 1 0

  ‐ 27 ‐ 

Coefficient D - Charge of the precursor:

Tryptic peptides in electrospray analyses are usually multi-charged cations whereas singly

charged ions are mostly related to contaminant such as polyethylene glycols. Basically, one

charge is coming from the N-terminus part of the peptide and a second is related to the lateral

chain of the arginine or lysine residue present at the C-terminus. Due to various factors e.g., large

peptides or His containing peptides, higher charge state can be observed on tryptic peptides. An

average charge distribution would be 70-90% of doubly charged ions, 10-30% of triply charged

ions and a maximum of 1% for higher charged ions (data from different large data set i.e., more

than 10000 identified peptides on 3 different data set related to H. sapiens, D. discoideum and A.

thaliana samples).

Since the vast majority of the characterised MS/MS spectra are related to doubly charged

material and in contrast, triply charged precursors tend to produce more complex fragmentation

patterns increasing the risk of fale fragment assigment, doubly charged precursors are favoured.

Coefficient D which ponders the charge of the FPF is listed in Table 4: FPF with more charges

than three are rare and ignored while designing a MRM transitions. Since no proteic material is

expected on singly charged ions for ESI-MS/MS analyses, this precursor type is set to zero.

Table 4: Value of the Coefficient D according the charge state of the FPF

Charge state of the FPF Value of coefficient D

1 0

2 1

3 0.8

> 3 0.1

Coefficient E - Number of time the peptide has been observed:

While choosing MRM transitions, it is important to focus on precursors which are easily

observable. The number of time a peptide has been observed in previous experiments is a clear

  ‐ 28 ‐ 

indication of the “visibility level” compared to the other potential FPF related to a target protein.

The most frequently observed/matched peptides for a selected protein have better chances to be

observable during MRM analysis if this protein is present in the analysed mixture. The

coefficient E gives importance to peptides well characterized from previous experimentation.

The number of identification can vary from 1 to several thousands, depending on the protein

selected and the size of the repository. To decrease the dynamic range of this value and make it

comparable to the other coefficient, we applied the logarithm in base two to the peptide

frequency in each FPF. It still favoured FPF with huge number of identifications, but does not

hide FPF with a smaller number of peptides.

Coefficient F - Proteotypicity of the sequence:

The sequence of a peptide can be found in multiple proteins. For a MRM assay, unique precursor

sequences would be preferred. Due to protein homology, it is not always possible to distinguish

different proteins with a single peptide. A bias would be therefore introduced in the

quantification. This characteristic is known as peptide “proteotypicity”. Unfortunately, for some

highly homologous proteins e.g., actins, the access to proteotypic peptides could be challenging.

Thus, non proteotypic peptides are down-scored but, still listed in the final result table, giving the

user the option to select them otherwise.

To determine the proteotypicity for the related FPF sequence, MaRMot uses the

UniProtKB/Swiss-Prot bank as a reference. Beside the good quality of data in it, this bank

presents the advantage of a low redundancy for protein sequences 36: The false positive peptides

matched against identical protein sequences with different AC are prevented this way. The

proteotypicity is consequently more accurate. MaRMot determines the proteotypicity of a FPF

sequence restricted to the taxonomy of the protein studied to reduce the calculation time

required. However, it might sometimes introduce a bias if protein orthologs are considered. If the

taxonomy of the protein has not been determined, all species are considered for the search of

proteotypicity.

The coefficient E is linked to the proteotypicity of the FPF sequence. If this sequence is found

only in the protein selected, it is equal to one with no impact on the final score. Otherwise, if the

  ‐ 29 ‐ 

sequence is found in more than one protein, the FPF is strongly penalized (Table 5). Such FPF

are usually less suitable to be monitored but, listed in the final result table.

Table 5: Value of the coefficient F according the proteotypicity of the FPF sequence.

Match in Swiss Prot (taxonomy selective) Value of coefficient F

Only one (protein of interest) – Peptide proteotypic 1

More than one 0.1

d­ Transition selection  

MRM transitions are defined by at least two m/z values i.e., the precursor ion mass and, one or

few fragment ion masses. The first one is determined once the selected precursor peptide is

chosen. In MaRMot, this value is related to the FPF classes which represent a strictly non-

redundant peptide sequence including charge and modification. The previous scoring description

should have revealed the most suitable precursor i.e., FPF class with higher calculated score, to

define the first value required in MRM transition couples.

The second m/z value is determined from the MS/MS spectrum of the selected precursor peptide.

MaRMot considers FPF classes rather than individual peptides. Indeed, each FPF class has one

to several thousand peptides in it and therefore the CS is much more representative than a single

MS/MS spectrum peaked for one of the peptide. The second mass of the transition couple is then

selected from CS.

MaRMot suggests up to three fragments to monitor for each FPF. Only peaks corresponding to

an identified ion i.e., yn+, yn

2+, bn+, bn

2+ are considered. These peaks are most likely to be specific

for the precursor peptide and give more confidence in the transition than the other non-matched

peaks. Two observations are critical in the choice of the fragment to monitor:

- First, fragments with high intensities are always preferred as they are easier to detect

and lead to a more accurate quantification.

  ‐ 30 ‐ 

- Second, the observation of m/z value of the fragment.

Since a MS/MS spectrum is noisier in the m/z range lower than the mass of the precursor, this

part of the spectrum is thus avoided if possible to identify relevant MRM transition. Moreover,

the largest mass between b2+ and/or y2

+ ion is used as the limit value. No fragments are selected

below these values to prevent false positives, coming from the internal peptide fragmentation.

According these considerations, MaRMot selects first the most intense fragment of the CS. The

m/z of this fragment could however be lower than the mass of the precursor e.g., fragments with

Proline give a very intense and useful peak, but can have a low m/z depending on the position of

the Proline in the peptide sequence. The two other m/z selected are the next most intense

fragments with m/z necessarily above the precursor value.

e­ Export of the transition list and the Consensus Spectrum  

MaRMot displays the ranked precursor peptides with their transition in a table. For each protein,

the user is able to select one to a few precursor peptides (FPF) and up to three transitions for

each FPF. These MRM transitions values can be export as a *.csv file and used later to fill the

instrument method related to the acquisition of MRM experiments. The transitions have been

suggested directly from the experimental archive of the user. No more laboratory validation is

then necessary to assess that these transitions are observable.

MaRMot also features an extra tool for exporting the CS of one or more FPF. The CS is

outputted in the MGF format i.e., a list m/z with the corresponding intensity. It can be saved to

further analysis or verification, using Mascot or another peak identification software.

  ‐ 31 ‐ 

f­ Further configuration  

As we have developed MaRMot with our local mass-spectrometers, the parameters for the

creation of the MaRMot bank and the CS have been optimized to our settings. They can be easily

changed in a configuration file, to fit the special requirements of any instrument or experiment

.

  ‐ 33 ‐ 

Conclusion 

 

Determination of MRM transition is a challenging task and despite a few tools are available, it is

still difficult to use them successfully. Since Mascot identification is one of the most used search

engine, the idea was to use the experimental data available in our on-site Mascot repository to

mine the data available and to use them to identify experimentally base MRM transitions. Then,

we propose MaRMot as a tool dedicated to any user possessing a Mascot repository providing

probable MRM transitions from local data. The experimental data used as a starting material

gives a most reliable results, as they are specific to the basic laboratory practices e.g., sample

preparation protocol or instrumentation.

This approach uses the new concept of FPF combined with the CS to first, rank the precursors

and second, identify the best fragments to create the most successful MRM transition possible

from experimental data. Other tools available involve a lot of filter to set-up before the search for

MRM transition e.g., size of the precursor, charge states allowed, etc… MaRMot is more

straightforward and flexible: The discrimination is based on the scoring system and all peptides

are considered as potential candidate for a MRM transition. The score retrieves the best peptides

and allows the user to save time and choose the suitable transitions. Peptides with modifications

are also maintained in the final result, to suit the current trend in proteomics to study PTM.

Laboratory validation stage can be avoided as MaRMot identifies, more than predicts, suitable

peptides by comparison of tools such as Midas that predicts MRM transition in-silico.

Nevertheless, using local data imposes to have a consistent data repository, containing proteins

well-characterized in different experiments. It is also not possible to determine transitions for

proteins not previously identified in Mascot. Indeed, the bigger this repository is, the more

reliable are the results generated by MaRMot. The regular update of the MaRMot bank improves

the scope of our software over the years and experiments conducted. MaRMot is the ideal tool

for a long-time user of Mascot, who wants to quickly and easily implement robust MRM assays,

for biomarker validation or proteomics experiments for example.

  ‐ 35 ‐ 

References 

1. Kitteringham, N.R., Jenkins, R.E., Lane, C.S., Elliott, V.L. & Park, B.K. Multiple reaction monitoring for quantitative biomarker analysis in proteomics and metabolomics. J Chromatogr B Analyt Technol Biomed Life Sci 877, 1229-1239 (2009).

2. Merrick, B.A. The plasma proteome, adductome and idiosyncratic toxicity in toxicoproteomics research. Brief Funct Genomic Proteomic 7, 35-49 (2008).

3. Aebersold, R. et al. Perspective: a program to improve protein biomarker discovery for cancer. J Proteome Res 4, 1104-1109 (2005).

4. Cho, W.C. OncomiRs: the discovery and progress of microRNAs in cancers. Mol Cancer 6, 60 (2007).

5. Jenkins, R.E. et al. Relative and absolute quantitative expression profiling of cytochromes P450 using isotope-coded affinity tags. Proteomics 6, 1934-1947 (2006).

6. Ong, S.E. et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 1, 376-386 (2002).

7. Yao, M., Ma, L., Humphreys, W.G. & Zhu, M. Rapid screening and characterization of drug metabolites using a multiple ion monitoring-dependent MS/MS acquisition method on a hybrid triple quadrupole-linear ion trap mass spectrometer. J Mass Spectrom 43, 1364-1375 (2008).

8. Baty, J.D. & Robinson, P.R. Single and multiple ion recording techniques for the analysis of diphenylhydantoin and its major metabolite in plasma. Biomed Mass Spectrom 4, 36-41 (1977).

9. Stahl-Zeng, J. et al. High sensitivity detection of plasma proteins by multiple reaction monitoring of N-glycosites. Mol Cell Proteomics 6, 1809-1817 (2007).

10. Keshishian, H., Addona, T., Burgess, M., Kuhn, E. & Carr, S.A. Quantitative, multiplexed assays for low abundance proteins in plasma by targeted mass spectrometry and stable isotope dilution. Mol Cell Proteomics 6, 2212-2229 (2007).

11. Le Blanc, J.C. et al. Unique scanning capabilities of a new hybrid linear ion trap mass spectrometer (Q TRAP) used for high sensitivity proteomics applications. Proteomics 3, 859-869 (2003).

12. Kay, R.G., Gregory, B., Grace, P.B. & Pleasance, S. The application of ultra-performance liquid chromatography/tandem mass spectrometry to the detection and quantitation of apolipoproteins in human serum. Rapid Commun Mass Spectrom 21, 2585-2593 (2007).

13. Gerber, S.A., Rush, J., Stemman, O., Kirschner, M.W. & Gygi, S.P. Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proc Natl Acad Sci U S A 100, 6940-6945 (2003).

14. Lange, V., Picotti, P., Domon, B. & Aebersold, R. Selected reaction monitoring for quantitative proteomics: a tutorial. Mol Syst Biol 4, 222 (2008).

15. Bertsch, A. et al. Optimal de novo Design of MRM Experiments for Rapid Assay Development in Targeted Proteomics. J Proteome Res.

16. Cox, D.M. et al. Multiple reaction monitoring as a method for identifying protein posttranslational modifications. J Biomol Tech 16, 83-90 (2005).

17. Ciccimaro, E., Hevko, J. & Blair, I.A. Analysis of phosphorylation sites on focal adhesion kinase using nanospray liquid chromatography/multiple reaction monitoring mass spectrometry. Rapid Commun Mass Spectrom 20, 3681-3692 (2006).

18. Griffiths, J.R. et al. The application of a hypothesis-driven strategy to the sensitive detection and location of acetylated lysine residues. J Am Soc Mass Spectrom 18, 1423-1428 (2007).

19. Mead, J.A. et al. MRMaid, the web-based tool for designing multiple reaction monitoring (MRM) transitions. Mol Cell Proteomics 8, 696-705 (2009).

  ‐ 36 ‐ 

20. Chem Mead, J.A., Bianco, L. & Bessant, C. Mining proteomic MS/MS data for MRM transitions. Methods Mol Biol 604, 187-199.

21. Desiere, F. et al. The PeptideAtlas project. Nucleic Acids Res 34, D655-658 (2006). 22. Unwin, R.D. et al. Multiple reaction monitoring to identify sites of protein phosphorylation with

high sensitivity. Mol Cell Proteomics 4, 1134-1144 (2005). 23. Sherwood, C.A. et al. MaRiMba: a software application for spectral library-based MRM

transition list assembly. J Proteome Res 8, 4396-4405 (2009). 24. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein identification

by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551-3567 (1999).

25. Xu, C. & Ma, B. Software for computational peptide identification from MS-MS data. Drug Discov Today 11, 595-600 (2006).

26. http://www.matrixscience.com. 27. Lange, V. et al. Targeted quantitative analysis of Streptococcus pyogenes virulence factors by

multiple reaction monitoring. Mol Cell Proteomics 7, 1489-1500 (2008). 28. http://www.matrixscience.com/help/interpretation_help.html. 29. Sanders, W.S., Bridges, S.M., McCarthy, F.M., Nanduri, B. & Burgess, S.C. Prediction of

peptides observable by mass spectrometry applied at the experimental set level. BMC Bioinformatics 8 Suppl 7, S23 (2007).

30. Scherl, A. et al. Genome-specific gas-phase fractionation strategy for improved shotgun proteomic profiling of proteotypic peptides. Anal Chem 80, 1182-1191 (2008).

31. Fang, J., Dong, Y., Williams, T.D. & Lushington, G.H. Feature selection in validating mass spectrometry database search results. J Bioinform Comput Biol 6, 223-240 (2008).

32. Wilkins, M.R., Gasteiger, E., Sanchez, J.C., Appel, R.D. & Hochstrasser, D.F. Protein identification with sequence tags. Curr Biol 6, 1543-1544 (1996).

33. Wilkins, M.R. et al. Protein identification with N and C-terminal sequence tags in proteome projects. J Mol Biol 278, 599-608 (1998).

34. Bogdanov, B. & Smith, R.D. Proteomics by FTICR mass spectrometry: top down and bottom up. Mass Spectrom Rev 24, 168-200 (2005).

35. Wysocki, V.H., Resing, K.A., Zhang, Q. & Cheng, G. Mass spectrometry of peptides and proteins. Methods 35, 211-222 (2005).

36. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-148.