Principles of Shotgun Proteomics and Proteogenomics
Boris Maček, Proteome Center Tuebingen
InnoMol Proteomics Workshop, April 8, 2014
Aebersold R and Mann M. 2003. Nature 422: 198-207
General MS-based proteomics workflow
Principle of protein database search
[Figure: fragment ion spectra (intensity vs. m/z) are matched against a database of theoretical spectra for proteins, derived from the translated genomic sequence]
Only theoretical spectra that fall within the defined mass range are considered.
Each of them is compared with our fragment ion spectra.
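The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scoring used by MaxQuant or any other search engine: the function names, the tolerance values, and the simple shared-peak count used as a score are all chosen for the example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def fragment_mz(seq):
    """Theoretical spectrum: singly charged b- and y-ion m/z values."""
    ions = []
    for i in range(1, len(seq)):
        b_ion = sum(RESIDUE_MASS[aa] for aa in seq[:i]) + PROTON
        y_ion = sum(RESIDUE_MASS[aa] for aa in seq[i:]) + WATER + PROTON
        ions.extend([b_ion, y_ion])
    return sorted(ions)


def count_matches(theoretical, observed, tol):
    """Number of theoretical peaks with an observed peak within tol."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def search(precursor_mass, observed_mz, candidates,
           precursor_tol=0.02, fragment_tol=0.5):
    """Return (score, peptide) for the best candidate, or None.

    Step 1: keep candidates whose mass falls into the defined mass range.
    Step 2: compare each theoretical spectrum to the observed spectrum.
    """
    in_range = [p for p in candidates
                if abs(peptide_mass(p) - precursor_mass) <= precursor_tol]
    scored = [(count_matches(fragment_mz(p), observed_mz, fragment_tol), p)
              for p in in_range]
    return max(scored) if scored else None
```

For example, `search(peptide_mass("LAEQAER"), fragment_mz("LAEQAER"), ["LAEQAER", "ALEQAER", "SAMPLER"])` keeps the two candidates with identical mass and prefers the one whose fragments all match.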
>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|P62258-2|1433E_HUMAN Isoform SV of 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE
MVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
>tr|F2Z3E5|F2Z3E5_HUMAN Hydroxyacid-oxoacid transhydrogenase, mitochondrial OS=Homo sapiens GN=ADHFE1 PE=4 SV=1
MAAAARARVAYLLRQLQRAACQCPTHSHTYSQDGCFKY
>tr|Q5SS58|Q5SS58_HUMAN MHC class I polypeptide-related sequence A OS=Homo sapiens GN=MICA PE=4 SV=2
MGQRDQGLDRERKGPQDDPGSYQGPERRNFLKEDAMKTKTHYHAMHADCLQELRRYLESGVVLRRTVPPMVNVTRSEASEGNITVTCRASSFYPRNIILTWRQDGVSLSHDTQQWGDVLPDGNGTYQTWVATRICRGEEQRFTCYMEHSGNHSTHPVPSGKVLVLQSHWQTFHVSAVAAGCCYFCYYYFLCPLL
>tr|Q5T409|Q5T409_HUMAN Disrupted in schizophrenia 1 OS=Homo sapiens GN=DISC1 PE=2 SV=1
MPGGGPQGAPAAAGGGGVSHRAGSRDCLPPAACFRRRRLARRPGYMRSSTGPGIGFLSPAVGTLFRFPGGVSGEESHHSESRARQCGLDSRGLLVRSPVSKSAAAPTVTSVRGTSAHFGIQLRGGTRLPDRLSWPCGPGSAGWQQEFAAMDSSETLDASWEAACSDGARRVRAAGSLPSAELSSNSCSPGCGPEVPPTPPGSHSAFTSSFSFIRLSLGSAGERGEAEGCPPSREAESHCQSPQEMGAKAASLDGPHEDPRCLSRPFSLLATRVSADLAQAARNSSRPERDMHSLPDMDPGSSSSLDPSLAGCGGDGSSGSGDAHSWDTLLRKWEPVLRDCLLRNRRQMEVISLRLKLQKLQEDAVENDDYDKAETLQQRLEDLEQEKISLHFQLPSRQPALSSFLGHLAAQVQAALRRGATQQASGDDTHTPLRMEPRLLEPTAQDSLHVSITRRDWLLQEKQQLQKEIEALQARMFVLEAKDQQLRREIEEQEQQLQWQGCDLTPLVGQLSLGQLQEVSKALQDTLASAGQIPFHAEPPETIRSLQERIKSLNLSLKEITTKVCMSEKFCSTLRKKVNDIETQLPALLEAKMHAISGNHFWTAKDLTEEIRSLTSEREGLEGLLSKLLVLSSRNVKKLGSVKEDYNRLRREVEHQETAYETSVKENTMKYMETLKNKLCSCKCPLLGKVWEADLEACRLLIQSLQLQEARGSLSVEDERQMDDLEGAAPPIPPRLHSEDKRKTPLKESYILSAELGEKCEDIGKKLLYLEDQLHTAIHSHDEDLIHSLRRELQMVKETLQAMILQLQPAKEAGEREAAASCMTAGVHEAQA
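A protein database in this format is straightforward to read programmatically. The following Python sketch parses FASTA text and classifies entries by the UniProt header prefix (`sp` for reviewed Swiss-Prot entries, `tr` for unreviewed TrEMBL entries). The function names are hypothetical, and the first example sequence is truncated for brevity.

```python
from collections import Counter


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def review_status(header):
    """Map the UniProt database prefix to a review status."""
    db = header.split("|", 1)[0]
    return {"sp": "reviewed", "tr": "unreviewed"}.get(db, "other")


# Two entries from the slide; the first sequence is truncated here.
example = """>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
MTMDKSELVQKAKLAEQAERYDDMAAAMKAV
>tr|F2Z3E5|F2Z3E5_HUMAN Hydroxyacid-oxoacid transhydrogenase, mitochondrial OS=Homo sapiens GN=ADHFE1 PE=4 SV=1
MAAAARARVAYLLRQLQRAACQCPTHSHTYSQDGCFKY
"""

counts = Counter(review_status(h) for h, _ in parse_fasta(example))
print(counts)
```

Applied to a full reference proteome file, the same counter would give the split between reviewed and unreviewed entries.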
Database: Homo sapiens Reference Proteome, 71,434 entries (20,246 reviewed proteins, 51,188 unreviewed)
Software: MaxQuant
MS instrumentation in proteomics
Aebersold R and Mann M. 2003. Nature 422: 198-207
Nanoflow LC/MS interface set-up:
LC: Proxeon Easy nLC nanoflow LC System
MS: LTQ-Orbitrap
Column (75 µm) / spray tip (8 µm), 12-15 cm
Reverse-phase C18 beads, 3 µm
Sample loading: ~700 nl/min
Gradient elution: ~200 nl/min
Platinum wire: 2.0 kV
No precolumn or split!
Coupling LC to MS for complex mixture analysis
BSA tryptic in-solution digest 50 fmol on column
LTQ-Orbitrap (2005)
[Figure: instrument schematic showing source, linear ion trap (LTQ), C-trap, octopole collision cell, and Orbitrap]
LTQ-FT MS/MS optimized scan cycle:
[Figure: scan cycle on a time axis from 0 to 1800 msec, with an MS full scan followed by MS2 scans]
Orbitrap-MS → peptide mass measurement
LTQ-MS → peptide sequencing
Data processing workflow: MaxQuant
□ CID identified    + CID not identified
Acquisition speed
[Figure: number of MS/MS scans, LTQ Orbitrap XL vs. LTQ Orbitrap Velos]
Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC)
Resting cells: "normal AA" (Lys-12C6)
Treated cells (drug, GF): "heavy AA" (Lys-13C6)
Combine and lyse, protein purification or fractionation
Proteolysis (trypsin, Lys-C, etc.)
Quantitation and identification by MS (nanoscale LC-MS/MS)
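The mass shift introduced by the heavy lysine label follows from the mass difference between 13C and 12C. The Python sketch below computes the expected m/z offset of the heavy partner of a SILAC pair and a simple heavy/light intensity ratio. The function names are hypothetical, and the plain intensity ratio is an assumption for illustration; real quantitation software combines many peptide-level measurements per protein.

```python
# Mass difference between one 13C and one 12C atom (Da).
C13_MINUS_C12 = 13.0033548 - 12.0
# Lys-13C6 carries six 13C atoms instead of six 12C atoms.
LYS6_SHIFT = 6 * C13_MINUS_C12


def heavy_mz(light_mz, charge, n_lys=1):
    """Expected m/z of the heavy peptide given the light peptide m/z."""
    return light_mz + n_lys * LYS6_SHIFT / charge


def silac_ratio(heavy_intensity, light_intensity):
    """Heavy/light intensity ratio of a SILAC pair (treated vs. resting)."""
    return heavy_intensity / light_intensity


print(f"Lys-13C6 mass shift: {LYS6_SHIFT:.4f} Da")
print(f"Heavy partner of m/z 500.0 (2+): {heavy_mz(500.0, 2):.4f}")
```

A doubly charged peptide with one lysine therefore appears as a pair of peaks separated by about 3.01 m/z units.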
Current research at the PCT
• Proteogenomics
  • B. subtilis, E. coli (Krug et al, 2011, Mol Biosystems; 2013 MCP)
  • Pristionchus pacificus (Borchert et al, 2010, Genome Res)
  • cancer cell lines/tissues
• Proteomics for systems biology
  • In-depth sequencing and quantitation of model organisms (B. subtilis, E. coli, S. pombe, A. thaliana) (Soufi et al, 2010, J Prot Res; Schütz et al, 2011, Plant Cell; Soufi et al, 2012, Curr Opinion Microbiol; Soares et al, 2013, JPR)
• Phosphoproteomics
  • targets of Aurora kinase in S. pombe (Koch et al, 2011, Science Signaling)
  • targets of protein kinase D in human cells (Franz-Wachtel et al., 2012, MCP)
  • targets of S/T/Y kinases and phosphatases in B. subtilis and E. coli
• Protein modifications
  • ubiquitylation (Ikeda et al, 2011, Nature)
  • lysine acetylation (Carpy et al., in preparation)
• Clinical proteomics
  • genetic rescue of Fragile X phenotype in FMR1 KO mice
Super-SILAC in Bacteria
E. coli: Replicate 1 and 2
Parameter                      Number
Total MS/MS                    757,835
Total Peptides Identified      18,273
Total Proteins Identified      2,292
Single Peptide Hits            6.5%
Total Proteins Quantified*     1,923
*in all phases of growth
Soufi et al. in preparation
Biological reproducibility
Soufi et al. in preparation
Proteome dynamics during growth
Soufi et al. in preparation
Dynamics of stress proteins during growth
Soufi et al. in preparation
Estimation of absolute copy numbers
[Figure: growth curve, OD600 vs. time (min), with sampling time points T1 to T7; time axis labels 1800 and 5760]
UPS standard (iBAQ)
Soufi et al. in preparation
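As a rough illustration of how a spiked-in standard can turn iBAQ values into copy numbers, the Python sketch below uses the general iBAQ definition (summed peptide intensity divided by the number of theoretically observable peptides) and a log-log linear fit against standards of known amount. The fitting approach, the function names, and all numbers are assumptions for the example and are not taken from the study shown on the slides.

```python
import math

AVOGADRO = 6.02214076e23


def ibaq(peptide_intensities, n_theoretical_peptides):
    """iBAQ: summed peptide intensity per theoretically observable peptide."""
    return sum(peptide_intensities) / n_theoretical_peptides


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def copies_per_cell(ibaq_value, slope, intercept, n_cells):
    """Convert iBAQ to copies per cell via the calibration line."""
    log_mol = slope * math.log10(ibaq_value) + intercept
    return 10 ** log_mol * AVOGADRO / n_cells


# Hypothetical standard proteins: iBAQ values and known amounts (mol).
std_ibaq = [1e6, 1e7, 1e8]
std_mol = [1e-15, 1e-14, 1e-13]
slope, intercept = fit_line([math.log10(v) for v in std_ibaq],
                            [math.log10(m) for m in std_mol])
print(copies_per_cell(1e7, slope, intercept, n_cells=1e6))
```

The number of cells in the analyzed sample has to be determined separately, for example from the optical density of the culture.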
Summary of absolutely quantified proteins
                           During Growth   Membrane Proteins
Identified                 2,292           684
Quantified (All Phases)    1,923           588
Absolutely Quantified      2,096           494
Soufi et al. in preparation
Most abundant Proteins (ES)
Protein Copies per cell (ES)
Elongation factor Tu 1;P-43 341,047.56
Outer membrane protein A 313,464.22
Braun lipoprotein 216,037.00
Cysteine synthase A;O 187,791.26
Enolase 164,914.38
DNA-binding protein HU-alpha 136,208.45
Scavengase P20;Thiol peroxidase 131,599.61
Glyceraldehyde-3-phosphate dehydrogenase A 127,416.09
Malate dehydrogenase 123,943.77
IDP;Isocitrate dehydrogenase [NADP] 117,787.02
High-affinity zinc uptake system protein znuA 111,748.80
Cadmium-induced protein yodA 107,098.12
Outer membrane protein C 106,108.02
50S ribosomal protein L6 98,724.11
Universal stress protein A 94,784.63
Soufi et al. in preparation
Dynamic range of protein abundance
Soufi et al. in preparation
[Figure: histogram of Count vs. Log2 Protein Copy Number. Blue: all proteins; red: membrane proteins]
Proteogenomics
• Application of tandem mass spectrometry to genome re-annotation
• Search MS/MS spectra against a database containing the complete genome translated in 6 reading frames
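A six-frame translation can be sketched in Python as follows. This is a minimal example assuming the standard genetic code (the bacterial code, NCBI table 11, assigns the same amino acids and differs only in permitted start codons); the function names and the minimum ORF length are hypothetical choices for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def orfs(protein, min_len=2):
    """Split a translated frame at stop codons ('*')."""
    return [p for p in protein.split("*") if len(p) >= min_len]


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Splitting each frame at stop codons yields the stop-to-stop ORFs that make up the search database.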
Problem: database size and structure
• Incompatibility with some data processing programs
• Long search times
• Decreased sensitivity of database search
• Unequal target and decoy search spaces
• Most translated frames are in fact decoy sequences
• Overestimation of the FDR
"Usual" proteomics applications: Predicted ORFs + REV_Predicted ORFs
Proteogenomics applications: Predicted ORFs, Frame1 to Frame6 + REV_Predicted ORFs, REV_Frame1 to REV_Frame6
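The target-decoy idea behind these database layouts can be sketched in Python. The example builds reversed decoy entries with a `REV_` prefix and estimates the FDR as the ratio of decoy to target hits above a score threshold. This is a simplified illustration: the function names, the score scale, and the example hits are hypothetical, and actual search software applies more elaborate estimates.

```python
def make_decoys(targets, prefix="REV_"):
    """Create one reversed decoy sequence per target entry."""
    return {prefix + name: seq[::-1] for name, seq in targets.items()}


def estimate_fdr(psms, threshold, prefix="REV_"):
    """Estimate FDR as decoy hits / target hits at or above threshold.

    psms: iterable of (score, protein_name) pairs.
    """
    accepted = [name for score, name in psms if score >= threshold]
    decoys = sum(name.startswith(prefix) for name in accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


database = {"ORF1": "MKTAYIAK"}
database.update(make_decoys(database))

# Hypothetical peptide-spectrum matches: (score, matched entry).
psms = [(90, "ORF1"), (80, "ORF2"), (70, "REV_ORF1"),
        (60, "ORF3"), (20, "REV_ORF2")]
print(estimate_fdr(psms, threshold=50))
```

The estimate relies on random matches hitting targets and decoys equally often. If most translated frames behave like decoy sequences, as noted above, this assumption no longer holds.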
• Model Gram-negative bacterium
• Small (4.6 Mb) and well characterized genome
• ~4,300 protein coding genes (manually annotated and reviewed)
• Comprehensive high accuracy MS dataset comprising >42,000 unique peptide sequences from >2,600 proteins
• Hypothesis: genome annotation approaches completeness
• Assessment of general properties of a simple proteogenomic experiment
Results I
Proteogenomics of E. coli
                                MQ           TPP
MS/MS spectra acquired          1,941,724    1,941,724
MS/MS spectra identified        370,231      162,028
MS/MS spectra identified (%)    19.1         8.3
Peptide sequences               33,964       25,724
Novel peptides                  263          59
Decoy peptides                  336          0
Lab contaminant peptides        306          209
E. coli proteins                2,653        2,524
1.9M peptide mass spectra
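The percentages in the results table can be reproduced from the raw counts with a short Python calculation. The dictionary layout and function names are hypothetical. The decoy share computed here is a plain ratio of the table values, shown only for orientation, and is not an FDR reported on the slide.

```python
# Counts taken from the results table (MQ and TPP workflows).
results = {
    "MQ": {"acquired": 1_941_724, "identified": 370_231,
           "peptides": 33_964, "novel": 263, "decoy": 336},
    "TPP": {"acquired": 1_941_724, "identified": 162_028,
            "peptides": 25_724, "novel": 59, "decoy": 0},
}


def identification_rate(row):
    """Percentage of acquired MS/MS spectra that were identified."""
    return 100 * row["identified"] / row["acquired"]


def decoy_share(row):
    """Decoy peptides as a percentage of all peptide sequences."""
    return 100 * row["decoy"] / row["peptides"]


for workflow, row in results.items():
    print(workflow,
          f"identified: {identification_rate(row):.1f}%",
          f"decoy share: {decoy_share(row):.2f}%")
```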
[Figure, panels A to D: genome views along Position (Mb) with tracks for annotated genes, detected peptides, and six-frame ORFs. Loci shown: yhja / tref / yhjb and fepa / fes / ybdz. Peptide sequences shown include VGSESWWQSK and KPPQIRISL. Reported scores: PEP = 0.027976, PP = 0.9504; PEP = 4.02E-08, PP = 0.9999]
Krug et al. Mol Cell Proteomics, 2013
Majority of Novel Peptides are False Positives
Results I (Krug et al. Mol Cell Proteomics, 2013)
Assessment of Processing Workflows
Results I (Krug et al. Mol Cell Proteomics, 2013)
Deep Proteome Coverage of Escherichia coli
20-fold base coverage of 27.5% of the genome sequence
[Figure: histogram of MS/MS scans (axis 0 to 150). Mean: 20 scans; median: 7 scans]
Results I (Krug et al. Mol Cell Proteomics, 2013)
Conclusions
• proteomics reaches the analytical capacity to identify and quantify all gene products in microorganisms grown in culture
• several regulatory protein modifications (e.g. S/T/Y-phosphorylation, lysine acetylation) can routinely be analyzed on a global scale
• many challenges ahead:
  • analysis of H/D-phosphorylation
  • analysis of environmental samples
  • coverage of genome/protein sequence by detected peptides
• future developments:
  • faster MS/MS acquisition
  • smarter acquisition software
  • large-scale targeted proteomics
  • metaproteomics and individual proteomics
Acknowledgements
Proteome Center Tuebingen
Boumediene Soufi
Nelson C. Soares
Philipp Spät
Karsten Krug
Alejandro Carpy
Sasa Popic
Silke Wahl
Funding