iPRG2012: A Study on Detecting Modified Peptides in a ... · PDF fileDec. 2011) • 1 template (Excel) ... generally room for improvement ... • Most common errors in site assignment

465.19 2+

464.59 3+

iPRG2012: A Study on Detecting Modified Peptides in a Complex Mixture A B R F

Proteome Informatics Research Group

John Cottrell1 , Karl R. Clauser2, Robert J Chalkley3, Ruixiang Sun4, Eugene Kapp5, Matt Chambers6, W. Hayes McDonald6, Henry H. Lam7, Nuno Bandeira8, Eric Deutsch9 and Thomas Neubert10

1Matrix Science, London, UK; 2The Broad Institute of MIT and Harvard, Cambridge, MA; 3University of California, San Francisco, CA; 4Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; 5Ludwig Institute for Cancer Research, Parkville, Victoria, Australia; 6Vanderbilt University, Nashville, TN; 7Hong Kong University of Science and Technology, Hong Kong, China; 8University of California, San Diego, CA; 9Institute for Systems Biology, Seattle, WA; 10New York University School of Medicine, New York, NY

Study Goals

1. Evaluate ability of participants to identify modified peptides in a complex mixture 2. Find out why result sets might differ between participants 3. Produce a benchmark dataset, along with an analysis resource

Study Materials

• 5600 TripleTOF dataset (AB-SCIEX) – WIFF, mzML, dta, MGF (de-isotoped);–

conversions by MS Data Converter 1.1.0 – MGF (not de-isotoped – conversion by

Mascot Distiller 2.4)

• 1 FASTA file (SwissProt S. cerevisiae, human, + 1 bovine protein + trypsin from Dec. 2011)

• 1 template (Excel) • 1 on-line survey (Survey Monkey)

Study Instructions

1. Analyze the dataset 2. Report the peptide spectrum matches in the

provided template 3. Report measures of reliability for PTM site

assignments (optional) 4. Complete an on-line survey 5. Attach a 1-2 page description of your

methodology

How Much Do the Identif ications Overlap?

• Reasonable number of participants from around the globe, mainly experienced users but a few first-timers

• Large spread in number of spectra identified • False negatives (NS) are generally much higher than false positives, so there is

generally room for improvement • Peak list was a significant factor on performance • Varied performance in detecting PTMs

• Most participants struggled with sulfation • Multiply phosphorylated harder to find than singly

• Most common errors in site assignment were: • Reporting sulfo(Y) as phospho(ST) • Mis-assignment of site/s in multiply phosphorylated peptides

• The iPRG2012 are in the process of preparing the data for publication. If you participated and would like to help out, contact the iPRG through [email protected].

For more information on the iPRG and for copies of this poster and the talk please visit: http://www.abrf.org/iPRG after the meeting

Acknowledgements: The iPRG are grateful to all of the participants. We would also like to thank Jeremy Carver (UCSD) for serving as the “Anonymizer”.

Prel iminary Conclusions

Study Design • Use a common, rich dataset • Use a common sequence database • Allow participants to use the bioinformatic tools and methods of their choosing • Use a common reporting template • Report results at an estimated 1% FDR (at the spectrum level) • Ignore protein inference

A Proteome Informatics Challenge

Nature uses a wide variety of protein post-translational modifications to regulate protein structure and activity and tandem mass spectrometry has emerged as the most powerful analytical approach to detect these moieties. However, modified peptides present special challenges for characterization. First, they are generally present at sub-stoichiometric levels, meaning that without enrichment strategies samples are dominated by unmodified peptides, so finding the modified peptides may be a challenge. Secondly, the modifications may have unique fragmentation behaviors in collision-induced dissociation (CID), which may need to be considered by database search engines. Finally, if there are multiple residues within a given peptide that could bear a particular modification type, then it is necessary to identify fragment ions that frame either side of the modification site in order to be able to localize the exact site of modification within the peptide. The Proteome Informatics Research Group (iPRG) created a collaborative data analysis study to enable proteomics laboratories to evaluate their ability to find a variety of post-translationally modified peptides within a complex peptide mixture background. The dataset consists of nearly twenty thousand high resolution and high mass accuracy tandem mass spectra. Within the sample there are peptides with a range of different natural and chemical modifications. This study enabled participants to evaluate their data analysis capabilities and approaches relative to others in analyzing a common data set, with a particular emphasis on their ability to detect and characterize peptides with modifications of potential biological significance.

Total Spectra vs. Interesting Mods

There is a very wide range in the total number of spectra with identified peptides. Once one focuses only on the spectra containing modifications for which the ability to localize the modification to a particular residue, the range is much narrower. The 5 rightmost participants went so far as to report only spectra of modified peptides.

Room for Improvement in ID Certa inty Thresholds

I d e n t i f i c a t i o n s r e p o r t e d a s : Ye s t h a t m a t c h e d t h e c o n s e n s u s ; N o , b u t s t i l l m a t c h i n g t h e c o n s e n s u s ; Ye s , b u t a d i f f e r e n t a n s w e r t h a n t h e c o n s e n s u s ; Ye s , < 3 c o n s e n s u s ; N o , t h a t d i s a g r e e d w i t h c o n s e n s u s

Table Key:

71755v 58288v 33564 93128i 11211 58409 87133i 94158i 97053i 42424i 77777i 23068 40104i 87048i 92653 34284i 23117 74564 14151 52781 47603 14152 45511 11821 Acetyl (K)

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Dimethyl (K) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Dimethyl (R) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Methyl (K) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Methyl (R) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Nitro (Y) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Phospho (STY)

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1

Sulfo (Y) 1

1 1

1 1 1 1 1

Trimethyl (K) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

An = Andromeda/MaxQuant MG = MS-GFDB pF = pFind Sc = Scaffold AS = A-Score MM = MyriMatch Pk = PEAKS SM = Spectrum Mill By = Byonics MO = MODa PkDB = PEAKSDB Sq = Sequest IH = In-house software O = OMSSA

Ppi =Protein Pilot ST = SpectraST IP = IDPicker Ot = Other PPr = Protein Prospector TPP = TransProteomic Pipeline M = Mascot P/PP = Pep/Prot Prophet PR = PhosphoRS XL = Excel Mde = Mascot Delta Score PD = ProteomeDiscoverer PW = ProteoWizard XT = X!Tandem Mdi = Mascot Distiller

71755v 58288v 33564 93128i 11211 58409 87133i 94158i 97053i 42424i 77777i 23068 40104i 87048i 92653 34284i 23117 74564 14151 52781 47603 14152 45511 11821

Peaklist Mgf mzML mzML mgf mgf mgf mzML mgf_nd mgf mgf_nd mgf mzML mzML mgf_nd mgf mgf mzML mgf mgf_nd mgf mgf mgf_nd mgf mgf_nd

mzML mzML mgf_nd WIFF

Spectral Pre-Processing Pk Pk PPi PPi MDi Ot SM pF Pk MDi PD PkDB PW PkDB Sq

Peptide Identification

By Pk Pk M PPi M O PPr pF M MG M P/PP SM pF M Pk M PPi M O M M PD PkDB PkDB O ST MM O TPP PkDB Sq

ST XT IH XT XT XT Ot

Discover Unexpected Mods By Pk Pk M PPi ST PPr pF IH MO SM pF M Pk PPi O M PkDB PkDB

Site Localization Pk Pk MDe PPi M PPr pF M IH M SM pF M Pk M AS IH An MDe PD

PkDB O MM O PkDB Sc Ot Sq ST IH

Results Filtering By Pk Pk IH PPi P/PP P/PP PPr pF IP IH XL IH XL pF M Pk M XL Sc M PR PkDB XL Ot TPP TPP XL PkDB

NTT 2 1 1 2 1 ? 1 1 1 2 1 ? 1 2 2 2 2 ? 1 2 2 2 ? 2

Experience 5-10 years 5-10 years 5-10 years >10 years < 1 year 3-4 years 5-10 years >10 years 5-10 years 5-10 years 5-10 years 3-4 years 5-10 years >10 years 1-2 years >10 years 1-2 years >10 years >10 years 1-2 years >10 years >10 years 1-2 years >10 years

1.48%

4.94%

6.1%

4.37%

n/a 13.68%

8.56%

2%

1.85%

16.67%

2.52%

n/a n/a

n/a

7.14% 2.92%

8.11%

3.45%

7.95%

2.52%

8%

3.03%

5.72% n/a

0

50

100

150

200

250

300

350

400

450

500

550

71755v 58288v 33564 93128i 11211 58409 87133i 94158i 97053i 42424i 77777i 23068 40104i 87048i 92653 34284i 23117 74564 14151 52781 47603 14152 45511 11821

# S

pike

d P

eptid

e PS

Ms

(FLR

%)

# Spiked PSMs Mod Loc Certainty N # Spiked PSMs Mod Loc Certainty Y, Ignored

# Spiked PSMs Wrong Mod Loc; Mod Loc Certainty Y # Spiked PSMs Correct Mod Loc; Mod Loc Certainty Y

Synthetic Peptide ID by Participant

Red corresponds to the presence of at least 1 PSM for a spiked synthetic peptide modified with the correct localization reported and the correct modification name. The localization certainty may have been reported as either Y or N. PSMs containing modification of residues other than s,t,y,k,r were excluded.

0

2000

4000

6000

8000

10000

12000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Cum

mul

ativ

e #

Spec

tra

# Participants Agreeing

Phospho vs Sulfo

Observe modified fragment ions.

Observe ‘unmodified’ fragment ions. Spectrum looks essentially identical to unmodified peptide spectrum

DISLSDY(Phospho)K

DISLSDY(Sulfo)K

Peak Processing • Two types of peak lists were supplied o Deisotoped (AB SCIEX MS Data Converter)

• Cannot infer fragment charge state • Possibly lower chance of false fragment matches

o Non-deisotoped (Mascot Distiller) • Can infer fragment charge state • Possibly higher chance of false fragment matches

• For 238 consensus spectra the peak lists had different specified charge state • 193 results only possible with deisotoped peak list • 45 results only possible with non-deisotoped peak list

0

1000

2000

3000

4000

5000

6000

7000

71755v 58288v 33564 93128i 11211 58409 87133i 94158i 97053i 42424i 77777i 23068 40104i 87048i 92653 34284i 23117 74564 14151 52781 47603 14152 45511 11821

# S

pect

ra

# No Mods # Common Mods (^q,^c,m,n,q) # Nterm Mods # AA Mutation Mods # Interesting Mods

0

100

200

300

400

500

600

700

# S

pect

ra

# Interesting Mod Loc Certainty N # Interesting Mod Loc Certainty Y

0 500

1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500

# Sp

ectr

a

#ND No Id, Diff from Consensus #Y<3 P Id Yes #YD Yes Id, Diff from Consensus #NS No Id, Same as Consensus #YS Yes Id, Same as Consensus

24 submissions from 23 participants. 9 were iPRG members. Participation was international and covered a wide range of experience level. Who Participated:

16

8

ABRF Member Nonmember

15 4

2 2 1

Bioinformatician/Software Developer Mass Spectrometrist Lab Scientist Director/Manager

13 4

4 2 1

North America Europe Asia Australia/NZ Africa 17

2

1 1 3 Academic

Manufacturer/Vendor Biotech/Pharma/Industry Government Other

5

6 11

2

Core Only

Software development only (no research facility) Conduct both core functions and non-core lab research Non-core research lab

Mixed Spectra Exposed by Peak Processing

464.59 3+ Non-deisotoped peaklist

465.19 2+ Deisotoped peaklist

mailto:[email protected]�

http://www.abrf.org/iPRG�

Documents

iPRG2012: A Study on Detecting Modified Peptides in a ... · PDF fileDec. 2011) • 1 template (Excel) ... generally room for improvement ... • Most common errors in site assignment