Principles of Shotgun Proteomics and Proteogenomics

Boris Maček, Proteome Center Tuebingen

InnoMol Proteomics Workshop, April 8, 2014

Aebersold R and Mann M. 2003. Nature 422: 198-207

General MS-based proteomics workflow

Principle of protein database search

[Figure: measured fragment ion spectrum and theoretical spectra (axes: m/z vs. intensity). Database: translated genomic sequence → theoretical spectra for proteins]

Only theoretical spectra that fall into the defined mass range are considered.

Each of them is compared to our fragment ion spectra.
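The two steps above (mass-range filter, then spectrum comparison) can be sketched in a few lines. This is a toy illustration, not the scoring used by any particular search engine: the function names, the tolerances, and the simple "count matching peaks" score are assumptions made for the example, and only singly charged b- and y-ions are considered.

```python
# Monoisotopic residue masses (Da) for the amino acids used in this toy example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_ions(peptide):
    """Theoretical m/z values of singly charged b- and y-ions."""
    ions = []
    for i in range(1, len(peptide)):
        b_ion = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y_ion = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        ions.extend([b_ion, y_ion])
    return ions


def count_matches(theoretical, observed, tol=0.02):
    """Number of theoretical ions found among the observed peaks (hypothetical score)."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def search(precursor_mass, observed_peaks, candidates, mass_tol=0.01):
    """Step 1: keep candidates in the mass range. Step 2: score each against the spectrum."""
    in_range = [p for p in candidates
                if abs(peptide_mass(p) - precursor_mass) <= mass_tol]
    scored = [(count_matches(fragment_ions(p), observed_peaks), p) for p in in_range]
    return max(scored) if scored else None


# Simulated "measured" spectrum of the peptide LAEK (made-up example data).
observed = fragment_ions("LAEK")
best = search(peptide_mass("LAEK"), observed, ["LAEK", "ALEK", "GASK"])
print(best)
```

In this example LAEK and ALEK have the same precursor mass and both pass the mass filter, so only the fragment ions distinguish them; GASK is already excluded by mass.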


>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|P62258-2|1433E_HUMAN Isoform SV of 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE
MVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
>tr|F2Z3E5|F2Z3E5_HUMAN Hydroxyacid-oxoacid transhydrogenase, mitochondrial OS=Homo sapiens GN=ADHFE1 PE=4 SV=1
MAAAARARVAYLLRQLQRAACQCPTHSHTYSQDGCFKY
>tr|Q5SS58|Q5SS58_HUMAN MHC class I polypeptide-related sequence A OS=Homo sapiens GN=MICA PE=4 SV=2
MGQRDQGLDRERKGPQDDPGSYQGPERRNFLKEDAMKTKTHYHAMHADCLQELRRYLESGVVLRRTVPPMVNVTRSEASEGNITVTCRASSFYPRNIILTWRQDGVSLSHDTQQWGDVLPDGNGTYQTWVATRICRGEEQRFTCYMEHSGNHSTHPVPSGKVLVLQSHWQTFHVSAVAAGCCYFCYYYFLCPLL
>tr|Q5T409|Q5T409_HUMAN Disrupted in schizophrenia 1 OS=Homo sapiens GN=DISC1 PE=2 SV=1
MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFLSPAVGTLFRFPGGVSGEESHHSESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGTSAHFGIQLRGGTRLPDRLSWPCGPGSAGWQQEFAAMDSSETLDASWEAACSDGARRVRAAGSLPSAELSSNSCSPGCGPEVPPTPPGSHSAFTSSFSFIRLSLGSAGERGEAEGCPPSREAESHCQSPQEMGAKAASLDGPHEDPRCLSRPFSLLATRVSADLAQAARNSSRPERDMHSLPDMDPGSSSSLDPSLAGCGGDGSSGSGDAHSWDTLLRKWEPVLRDCLLRNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSRQPALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRDWLLQEKQQLQKEIEALQARMFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSLGQLQEVSKALQDTLASAGQIPFHAEPPETIRSLQERIKSLNLSLKEITTKVCMSEKFCSTLRKKVNDIETQLPALLEAKMHAISGNHFWTAKDLTEEIRSLTSEREGLEGLLSKLLVLSSRNVKKLGSVKEDYNRLRREVEHQETAYETSVKENTMKYMETLKNKLCSCKCPLLGKVWEADLEACRLLIQSLQLQEARGSLSVEDERQMDDLEGAAPPIPPRLHSEDKRKTPLKESYILSAELGEKCEDIGKKLLYLEDQLHTAIHSHDEDLIHSLRRELQMVKETLQAMILQLQPAKEAGEREAAASCMTAGVHEAQA

Database: Homo sapiens Reference Proteome, 71,434 entries (20,246 reviewed proteins; 51,188 unreviewed)

Software: MaxQuant

Translated genomic sequence → theoretical spectra for proteins
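Before theoretical spectra can be generated, each database protein is cut into peptides in silico. The sketch below assumes trypsin (the protease mentioned elsewhere in these slides) and the common rule of cleavage C-terminal to K or R unless the next residue is P; missed cleavages and length limits, which real search engines handle, are left out, and the function name is made up for the example.

```python
import re


def tryptic_digest(protein, min_length=1):
    """Split a protein sequence after K or R, except when followed by P."""
    # Zero-width split points: preceded by K/R and not followed by P.
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    # Drop the empty string produced when the sequence ends in K or R.
    return [p for p in peptides if len(p) >= min_length]


# First 20 residues of the 14-3-3 protein beta/alpha entry (P31946) shown above.
print(tryptic_digest("MTMDKSELVQKAKLAEQAER"))
```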

MS instrumentation in proteomics

Aebersold R and Mann M. 2003. Nature 422: 198-207

Nanoflow LC/MS interface set-up:

• LC system: Proxeon Easy nLC nanoflow LC system
• Mass spectrometer: LTQ-Orbitrap
• Column (75 µm) / spray tip (8 µm), length 12-15 cm
• Packing: reverse-phase C18 beads, 3 µm
• Sample loading: ~700 nl/min
• Gradient elution: ~200 nl/min
• Platinum wire, 2.0 kV
• No precolumn or split!

Coupling LC to MS for complex mixture analysis

BSA tryptic in-solution digest, 50 fmol on column

LTQ-Orbitrap (2005)

[Schematic: source, linear ion trap (LTQ), C-trap, octopole collision cell, Orbitrap]

LTQ-FT MS/MS optimized scan cycle:

[Figure: scan cycle timeline, time axis 0-1800 msec]

• Orbitrap-MS: MS full scan → peptide mass measurement
• LTQ-MS: MS2 scans → peptide sequencing

Data processing workflow: MaxQuant

Acquisition speed

[Figure: number of MS/MS scans, LTQ Orbitrap XL vs. LTQ Orbitrap Velos. Legend: □ CID identified, + CID not identified]

Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC)

• Resting cells: "normal AA" (Lys-12C6)
• Treated cells (drug, GF): "heavy AA" (Lys-13C6)
• Combine and lyse, protein purification or fractionation
• Proteolysis (trypsin, Lys-C, etc.)
• Quantitation and identification by MS (nanoscale LC-MS/MS)
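The arithmetic behind SILAC quantitation can be sketched as follows. Replacing six 12C atoms in lysine by 13C shifts each lysine-containing peptide by about 6.02 Da, so light and heavy forms appear as a pair in the MS spectrum, and the ratio of their intensities reports the relative abundance between the two cell populations. The function names and the intensity values are made up for illustration.

```python
import math

C13_MINUS_C12 = 1.0033548  # mass difference between 13C and 12C in Da


def heavy_lys_mz_shift(n_lysines, charge, labeled_carbons=6):
    """m/z distance between the light and the Lys-13C6 labeled peptide peak."""
    return n_lysines * labeled_carbons * C13_MINUS_C12 / charge


def silac_log2_ratio(heavy_intensity, light_intensity):
    """log2 of the heavy/light intensity ratio (treated vs. resting cells)."""
    return math.log2(heavy_intensity / light_intensity)


# A doubly charged peptide with one lysine: peaks are ~3.01 m/z apart.
shift = heavy_lys_mz_shift(n_lysines=1, charge=2)

# Hypothetical intensities: heavy peak twice as intense as the light peak.
ratio = silac_log2_ratio(heavy_intensity=2.0e6, light_intensity=1.0e6)
print(round(shift, 4), ratio)
```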

Current research at the PCT

• Proteogenomics
  • B. subtilis, E. coli (Krug et al, 2011, Mol Biosystems; 2013, MCP)
  • Pristionchus pacificus (Borchert et al, 2010, Genome Res)
  • cancer cell lines/tissues

• Proteomics for systems biology
  • In-depth sequencing and quantitation of model organisms (B. subtilis, E. coli, S. pombe, A. thaliana) (Soufi et al, 2010, J Prot Res; Schütz et al, 2011, Plant Cell; Soufi et al, 2012, Curr Opinion Microbiol; Soares et al, 2013, JPR)

• Phosphoproteomics
  • targets of Aurora kinase in S. pombe (Koch et al, 2011, Science Signaling)
  • targets of protein kinase D in human cells (Franz-Wachtel et al, 2012, MCP)
  • targets of S/T/Y kinases and phosphatases in B. subtilis and E. coli

• Protein modifications
  • ubiquitylation (Ikeda et al, 2011, Nature)
  • lysine acetylation (Carpy et al, in preparation)

• Clinical proteomics
  • genetic rescue of Fragile X phenotype in FMR1 KO mice

Super-SILAC in Bacteria


E. coli: Replicate 1 and 2

Parameter | Number
Total MS/MS | 757,835
Total Peptides Identified | 18,273
Total Proteins Identified | 2,292
Single Peptide Hits | 6.5%
Total Proteins Quantified* | 1,923

*in all phases of growth

Soufi et al. in preparation

Biological reproducibility


Proteome dynamics during growth


Dynamics of stress proteins during growth


Estimation of absolute copy numbers

[Figure: growth curve, OD vs. time (min), with sampling time points T1-T7; absolute quantification via UPS standard (iBAQ)]
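One common way to turn iBAQ values into copy numbers is to spike in a protein standard of known amounts, fit a line between log10(iBAQ) and log10(amount), and apply that line to all other proteins. The sketch below illustrates this idea only; the slide does not spell out the exact procedure, so the function names, the spike-in amounts, the iBAQ values, and the cell count are all assumptions for the example.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mol


def ibaq(peptide_intensities, n_theoretical_peptides):
    """iBAQ: summed peptide intensity divided by the number of theoretically observable peptides."""
    return sum(peptide_intensities) / n_theoretical_peptides


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def copies_per_cell(ibaq_value, slope, intercept, n_cells):
    """Convert an iBAQ value to copies per cell using the standard curve."""
    log_mol = slope * math.log10(ibaq_value) + intercept
    return 10 ** log_mol * AVOGADRO / n_cells


# Hypothetical standard proteins: known amounts (mol) and measured iBAQ values.
standard_mol = [5e-16, 5e-15, 5e-14, 5e-13]
standard_ibaq = [1e5, 1e6, 1e7, 1e8]

slope, intercept = fit_line([math.log10(v) for v in standard_ibaq],
                            [math.log10(m) for m in standard_mol])

# Hypothetical sample protein measured in a lysate of 1e9 cells.
copies = copies_per_cell(1e7, slope, intercept, n_cells=1e9)
print(round(copies, 1))
```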


Summary of absolutely quantified proteins

Category | During Growth | Membrane Proteins
Identified | 2,292 | 684
Quantified (All Phases) | 1,923 | 588
Absolutely Quantified | 2,096 | 494


Most abundant Proteins (ES)

Protein | Copies per cell (ES)
Elongation factor Tu 1;P-43 | 341,047.56
Outer membrane protein A | 313,464.22
Braun lipoprotein | 216,037.00
Cysteine synthase A;O | 187,791.26
Enolase | 164,914.38
DNA-binding protein HU-alpha | 136,208.45
Scavengase P20;Thiol peroxidase | 131,599.61
Glyceraldehyde-3-phosphate dehydrogenase A | 127,416.09
Malate dehydrogenase | 123,943.77
IDP;Isocitrate dehydrogenase [NADP] | 117,787.02
High-affinity zinc uptake system protein znuA | 111,748.80
Cadmium-induced protein yodA | 107,098.12
Outer membrane protein C | 106,108.02
50S ribosomal protein L6 | 98,724.11
Universal stress protein A | 94,784.63


Dynamic range of protein abundance


[Figure: histogram, count vs. log2 protein copy number. Blue: all proteins; red: membrane proteins]

Proteogenomics

• Application of tandem mass spectrometry to genome re-annotation
• Search MS/MS spectra against a database containing the complete genome translated in 6 reading frames
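A six-frame translation reads the DNA in three offsets on the forward strand and three on the reverse complement. The minimal sketch below uses the standard genetic code and assumes an input of unambiguous A/C/G/T bases; the frame names follow the Frame1 to Frame6 labels used in these slides, while the function names are made up for the example. Splitting the translations into ORFs at stop codons (`*`) is left out.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of the string; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Frames 1-3: forward strand; frames 4-6: reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand_index, strand in enumerate((dna, reverse_complement(dna))):
        for offset in range(3):
            name = f"Frame{strand_index * 3 + offset + 1}"
            frames[name] = translate(strand[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```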

Problem: database size and structure

• Incompatibility with some data processing programs
• Long search times
• Decreased sensitivity of database search
• Unequal target and decoy search spaces
• Most translated frames are in fact decoy sequences
• Overestimation of the FDR

Database composition (REV_ entries are the reversed decoy sequences):

• "Usual" proteomics applications: Predicted ORFs; REV_Predicted ORFs
• Proteogenomics applications: Predicted ORFs, Frame1, Frame2, Frame3, Frame4, Frame5, Frame6; REV_Predicted ORFs, REV_Frame1, REV_Frame2, REV_Frame3, REV_Frame4, REV_Frame5, REV_Frame6
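The FDR referred to above is usually estimated with the target-decoy approach: hits to reversed (REV_) entries above a score threshold are counted and divided by the number of target hits. The sketch below shows only this basic estimate, which assumes target and decoy search spaces of equal size; the protein identifiers, scores, and function names are made up for the example.

```python
def is_decoy(protein_id, decoy_prefix="REV_"):
    """Decoy entries are recognized by their identifier prefix."""
    return protein_id.startswith(decoy_prefix)


def estimate_fdr(psms, score_threshold):
    """Estimate FDR as (decoy hits) / (target hits) among PSMs at or above the threshold.

    psms: list of (score, protein_id) tuples, higher score = better match.
    """
    accepted = [protein_id for score, protein_id in psms if score >= score_threshold]
    n_decoy = sum(is_decoy(protein_id) for protein_id in accepted)
    n_target = len(accepted) - n_decoy
    return n_decoy / n_target if n_target else 0.0


# Hypothetical peptide-spectrum matches.
example_psms = [
    (90.0, "Predicted_ORF_1"),
    (85.0, "Frame3_ORF_17"),
    (80.0, "REV_Frame2_ORF_5"),
    (70.0, "Predicted_ORF_2"),
    (60.0, "REV_Predicted_ORF_9"),
    (50.0, "Frame5_ORF_3"),
]

for threshold in (85.0, 65.0, 0.0):
    print(threshold, estimate_fdr(example_psms, threshold))
```

If most six-frame entries behave like random sequences, the equal-size assumption no longer holds, which is the concern raised in the list above.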

Proteogenomics of E. coli

• Model Gram-negative bacterium
• Small (4.6 Mb) and well characterized genome
• ~4,300 protein coding genes (manually annotated and reviewed)
• Comprehensive high accuracy MS dataset comprising >42,000 unique peptide sequences from >2,600 proteins
• Hypothesis: genome annotation approaches completeness
• Assessment of general properties of a simple proteogenomic experiment

Results I

Workflow | MS/MS spectra acquired | MS/MS spectra identified | MS/MS spectra identified (%) | Peptide sequences | Novel peptides | Decoy peptides | Lab contaminant peptides | E. coli proteins
MQ | 1,941,724 | 370,231 | 19.1 | 33,964 | 263 | 336 | 306 | 2,653
TPP | 1,941,724 | 162,028 | 8.3 | 25,724 | 59 | 0 | 209 | 2,524

1.9M peptide mass spectra

Results I


[Figure, panels A-D: genomic regions (position in Mb) with tracks for annotated genes, detected peptides and six-frame ORFs]

• Example 1: annotated gene MFEVTFWWRDPQGSEEY...; detected peptide VGSESWWQSK; six-frame ORF TWGYGVTALKVGSESWWQSKHGPEWQRLNDEMFEVTFWWRDPQGSEEY...
• Example 2: annotated gene MLNQKIQNPNPDELMIEVDLCYELDPYELKLDEMIEAEP...; detected peptide KPPQIRISL; six-frame ORFs ...NAVFKPPQIRISL and LATNFGGWILMLNQKIQNPNPDELMIEVDLCYELDPYELKLDEMIEAEP...
• Peptide scores shown: PEP = 0.027976, PP = 0.9504; PEP = 4.02E-08, PP = 0.9999
• Gene labels shown: yhja, tref, yhjb; fepa, fes, ybdz
Krug et al. Mol Cell Proteomics, 2013

Majority of Novel Peptides are False Positives

Krug et al. Mol Cell Proteomics, 2013

Assessment of Processing Workflows


Deep Proteome Coverage of Escherichia coli

20-fold base coverage of 27.5% genome sequence

[Figure: distribution of MS/MS scans (axis 0-150). Mean: 20 scans; median: 7 scans]


Conclusions

• proteomics reaches the analytical capacity to identify and quantify all gene products in microorganisms grown in culture

• several regulatory protein modifications (e.g. S/T/Y-phosphorylation, lysine acetylation) can routinely be analyzed on a global scale

• many challenges ahead:
  • analysis of H/D-phosphorylation
  • analysis of environmental samples
  • coverage of genome/protein sequence by detected peptides

• future developments:
  • faster MS/MS acquisition
  • smarter acquisition software
  • large-scale targeted proteomics
  • metaproteomics and individual proteomics

Acknowledgements

Proteome Center Tuebingen: Boumediene Soufi, Nelson C. Soares, Philipp Spät, Karsten Krug, Alejandro Carpy, Sasa Popic, Silke Wahl

Funding
