2019, 2nd Semester
Kang, Lin-Woo, Ph.D., Professor
Department of Biological Sciences, Konkuk University
Seoul, Korea
Protein Sequence Analysis
Transport
Hemoglobin
carries O2
Defense
Antibody
fights Viruses
Enzyme
Protease
degrades Protein
Support
Keratin
forms Hair and Nails
Motion
Actin
contracts Muscles
Regulation
Insulin
controls Blood Glucose
Proteins do most of the work in the cell
Proteomics Research
• Basic research:
  • To understand the molecular mechanisms underlying life.
  • Protein sequence analysis
• Applied research (Clinical proteomics):
  • Clinical testing for proteins associated with pathological states (e.g. cancer).
Wikipedia, http://en.wikipedia.org
...the large-scale study of proteins…while it is often viewed as the “next step”, proteomics is much more complicated than genomics.
…while the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment.
One organism will have radically different protein expression in different parts of its body, in different stages of its life cycle and in different environmental conditions.
Proteomics is the analysis of the protein complement to the genome
Genomics Proteomics
Gene Transcript Protein
Clinical Proteomics
From Petricoin et al., Nature Reviews Drug Discovery (2002) 1, 683-695
Mass spectrometry
Mass spectrometry is based on slightly different principles from the other spectroscopic methods. The physics behind mass spectrometry is that a charged particle passing through a magnetic field is deflected along a circular path with a radius that depends on its mass-to-charge ratio, m/z.
In an electron impact mass spectrometer, a high-energy beam of electrons is used to displace an electron from the organic molecule to form a radical cation known as the molecular ion. If the molecular ion is too unstable, it can fragment to give other, smaller ions.
The collection of ions is then focused into a beam, accelerated into the magnetic field, and deflected along circular paths according to the masses of the ions. By adjusting the magnetic field, the ions can be focused on the detector and recorded.
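The dependence of the path radius on m/z can be made concrete with a small calculation. This is a minimal sketch, assuming an idealized magnetic sector: the ion starts at rest, is accelerated through a fixed voltage, and then enters a uniform field perpendicular to its velocity. The function name, the voltage and the field strength are illustrative assumptions, not values from the lecture. Under these assumptions the radius grows with the square root of m/z.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per unit charge
DALTON_KG = 1.66053906660e-27          # kilograms per dalton


def deflection_radius_m(mass_da, charge, accel_voltage_v, field_t):
    """Radius (metres) of the circular path of an ion in a uniform magnetic field.

    Assumes the ion starts at rest, is accelerated through accel_voltage_v
    volts, and then moves perpendicular to a field of field_t tesla.
    """
    m = mass_da * DALTON_KG
    q = charge * ELEMENTARY_CHARGE_C
    # Energy balance q*V = m*v^2/2 gives the speed on entering the field.
    speed = math.sqrt(2.0 * q * accel_voltage_v / m)
    # The magnetic force supplies the centripetal force: q*v*B = m*v^2/r.
    return m * speed / (q * field_t)


# Hypothetical instrument settings, chosen only for illustration.
for mz in (100, 400):
    r = deflection_radius_m(mz, 1, accel_voltage_v=3000.0, field_t=0.5)
    print(f"m/z {mz}: r = {r * 100:.1f} cm")
```

Quadrupling m/z doubles the radius in this model, which is why scanning the magnetic field brings ions of different m/z onto the detector one after another.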
Peptide Sequencing
In the last 15 years or so, ESI-CID-MS/MS analysis has seen increasingly wide application in proteomics, especially in peptide sequencing and protein identification. CID fragmentation of a singly or doubly protonated peptide ion ([M+nH]n+, where n = 1 or 2) yields a series of fragment ions, mainly from cleavage of the amide (peptide) bonds between amino-acid residues.
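To see what such a fragment series looks like, the sketch below lists the singly charged b ions (N-terminal fragments) and y ions (C-terminal fragments) of a peptide. It is an illustration only: it uses standard monoisotopic residue masses, ignores side-chain losses and other ion types, and the function name `by_ions` is a hypothetical helper, not part of any library.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def by_ions(peptide):
    """Return (b, y): lists of singly charged b- and y-ion m/z values.

    b[i-1] is the fragment holding the first i residues,
    y[i-1] is the fragment holding the last i residues.
    """
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


b, y = by_ions("ASTEER")
for i, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
    print(f"b{i} = {b_mz:9.4f}   y{i} = {y_mz:9.4f}")
```

Differences between neighbouring b ions (or neighbouring y ions) equal single residue masses, which is what allows the sequence to be read from the spectrum.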
Single protein and shotgun analysis
Adapted from: McDonald et al. 2002. Disease Markers 18 99-105
Protein Bioinformatics
Mixture of proteins
Single protein analysis: Gel-based separation → Spot excision and digestion → Peptides from a single protein → MS analysis
Shotgun analysis: Digestion of protein mixture → LC or LC/LC separation → Peptides from many proteins → MS/MS analysis
Protein Bioinformatics: Protein sequence analysis
• Helps characterize protein sequences in silico and allows prediction of protein structure and function
• Statistically significant BLAST hits usually signify sequence homology
• Homologous sequences may or may not have the same function but almost always (with very few exceptions) have the same structural fold
• Protein sequence analysis allows protein classification
Development of protein sequence databases
• Atlas of Protein Sequence and Structure – Dayhoff (1965), the first sequence database (pre-bioinformatics). Currently known as the Protein Information Resource (PIR)
• Protein Data Bank (PDB) – structural database (1971); remains the most widely used database of structures
• UniProt – The United Protein Databases (UniProt, 2003) is a central database of protein sequence and function created by joining the forces of the SWISS-PROT, TrEMBL and PIR protein database activities
Comparative protein sequence analysis and evolution
• Patterns of conservation in sequences allow us to determine which residues are under selective constraints (and are therefore important for protein function)
• Comparative analysis of proteins is more sensitive than comparing DNA
• Homologous proteins have a common ancestor
• Protein classification systems based on evolution: PIRSF and COG
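One simple way to look for conserved positions is to score each column of a multiple alignment by how often its most common residue occurs. The sketch below is a deliberately crude measure (real tools use substitution matrices or entropy-based scores); the function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(aligned, gap="-"):
    """Fraction of sequences sharing the most common residue in each column.

    `aligned` is a list of equal-length, already aligned sequences.
    Gaps never count as the conserved residue.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != gap]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment, invented for illustration.
alignment = ["ACDE", "ACDE", "AC-E", "GCDE"]
for position, score in enumerate(column_conservation(alignment), start=1):
    flag = "conserved" if score == 1.0 else ""
    print(position, f"{score:.2f}", flag)
```

Columns that score 1.0 across many distantly related homologs are candidates for residues under selective constraint.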
Comparing proteins
• Amino acid sequence of protein generated from proteomics experiment
• e.g. protein fragment DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT
• The amino acids of two sequences can be aligned, and we can then count the number of identical residues to find the % identity (or use an index of similarity to find the % similarity).
• Protein structures can be compared by superimposition
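Counting identical residues in an alignment takes only a few lines. This is a minimal sketch, assuming the two sequences are already aligned and use "-" for gaps; the choice of denominator (all alignment columns that are not gap-only) is one convention among several, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # a column of gaps only carries no information
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


print(percent_identity("ABACD", "AB-CD"))
```

Reporting which denominator was used matters, because the same alignment gives different percentages when normalized by the shorter sequence, the longer sequence, or the alignment length.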
Protein sequence alignment
• Pairwise alignment
  • a b a c d
  • a b _ c d
• Multiple sequence alignment usually provides more information
  • a b a c d
  • a b _ c d
  • x b a c e
• Multiple alignment is difficult to do for distantly related proteins
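The pairwise example above can be reproduced with a global alignment by dynamic programming (the Needleman-Wunsch algorithm). The sketch below uses a simple scoring scheme chosen for illustration (match +1, mismatch -1, gap -1); real protein alignments use substitution matrices such as BLOSUM and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("_")
            i -= 1
        else:
            out_a.append("_")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


top, bottom, total = needleman_wunsch("abacd", "abcd")
print(" ".join(top))
print(" ".join(bottom))
```

For the two toy sequences this recovers the gapped alignment shown on the slide, with the gap written as "_".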
Protein sequence analysis overview
• Protein databases
  • UniProt, NCBI
• Searching databases
  • Peptide search, BLAST search, Text search
• Information retrieval and analysis
  • Protein records at UniProt, PIR and NCBI
  • Multiple sequence alignment
  • Secondary structure prediction
  • Homology modeling
Universal Protein Knowledgebase (UniProt)
Source data: Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, EnsEMBL, PDB, Patent Data, Other Data
UniProt Archive
UniProt NREF: Clustering at 100, 90, 50%
UniProt Knowledgebase: Literature-Based Annotation, Automated Annotation, Classification
http://www.uniprot.org/
Peptide Search
Protein structure prediction
• Programs can predict secondary structure information with about 70% accuracy
• Homology modeling - prediction of a ‘target’ structure from a closely related ‘template’ structure
Secondary structure prediction
http://bioinf.cs.ucl.ac.uk/psipred/
Secondary structure prediction results
Homology modeling
http://www.expasy.org/swissmod/SWISS-MODEL.html
Multiple sequence alignment
Identifying remote homologs
Structure-guided sequence alignment
Applications of Proteomics
Proteomics
• Structural Proteomics: Organelle Composition, Subproteome Isolation, Protein Complexes
• Proteome Mining: Drug Discovery, Target ID, Differential Display
• Post-translational Modifications: Glycosylation, Phosphorylation, Proteolysis
• Protein Expression Profiling: Medical Microbiology, Signal Transduction, Disease Mechanisms
• Functional Proteomics: Yeast Genomics, Affinity Purified Protein Complexes, Mouse Knockouts
• Protein-protein Interactions: Yeast two-hybrid, Co-precipitation, Phage Display
Proteomic tools and methods
Proteomic tools to study proteins
• Protein isolation
• Protein separation
• Protein identification
How are proteins isolated?
• Mechanical Methods
  • grinding – breaks open cells
  • centrifugation – removes insoluble debris
• Chemical Methods
  • detergent – breaks open cell compartments
  • reducing agent – breaks disulfide bonds
  • heat – denatures (unfolds) the protein to “linearize” it
Protein isolation procedure
• Find a sample
• Pick it
• Grind sample in buffer
• Transfer to tube
• Heat the sample
• Centrifuge to remove insoluble material
• Recover supernatant (“pure” protein solution)
• Keep solution for gel analysis
Protein X: “pure” protein solution → Isolated Protein X
Why separate proteins?
“PURE” Protein Solution
Tube 1: Increased Complexity, Decreased Protein ID
Tube 2: Decreased Complexity, Increased Protein ID
How to separate proteins?
Intact proteins are separated by taking advantage of their diversity in physical properties, especially isoelectric point and molecular weight
Methods of Protein Separation
• Sodium Dodecyl Sulfate – Polyacrylamide Gel Electrophoresis (SDS-PAGE)
• Isoelectric Focusing (IEF)
Sodium dodecyl sulfate - SDS
The anionic detergent SDS unfolds or denatures proteins
• Uniform linear shape
• Uniform charge/mass ratio
Two-dimensional gel electrophoresis (2-DGE)
Most widely used protein separation technique in proteomics
Capable of resolving thousands of proteins from a complex sample (e.g. blood, organs, tissue…)
1st dimension - isoelectric focusing
2nd dimension - SDS-PAGE
Isoelectric focusing (IEF) is the separation of proteins according to their native charge.
Isoelectric point (pI) - the pH at which the net charge is zero
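The definition of the isoelectric point translates directly into a calculation: estimate the net charge at a given pH from the ionizable groups, then search for the pH at which it is zero. The sketch below assumes one set of approximate pKa values (different tools use different sets, so computed pI values vary slightly) and treats every group as independent; the function names are hypothetical.

```python
# Approximate pKa values; an assumption of this sketch, not values from the lecture.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))   # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))  # deprotonated C-terminus
    for aa in sequence:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


fragment = "DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT"
print(f"estimated pI: {isoelectric_point(fragment):.2f}")
```

Bisection works here because the net charge only decreases as the pH rises, so there is exactly one zero crossing. In IEF, a protein stops migrating when it reaches the point in the gradient where the pH equals this value.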
1st Dimension - Isoelectric Focusing
[Figure: 2-DGE workflow. Protein samples are separated by IEF (1st dimension, pH gradient 3 to 10) and then by SDS-PAGE (2nd dimension, markers at 150, 100, 75, 50, 37, 25, 20 and 11 kDa). Label in figure: “Neutral at pH 3”.]
[Figure: 2-DGE of Arabidopsis developing leaf. Axes: pI (3 to 10) and mass (25 to 100 kDa).]
[Figure: 2-DGE, SDS-PAGE 2nd dimension, pH 3 to 10. Protein X: 25 kDa, pI 5.]
1-DGE vs. 2-DGE
1-DGE (SDS-PAGE)
• High reproducibility
• Quick/Easy
• Separates solely based on size
• Modest resolution, dependent on complexity of sample
2-DGE
• Modest reproducibility
• Slow/Demanding
• Separates based on pI and size
• High resolution, not dependent on complexity of sample
Peptide mass fingerprinting
Intact protein X → protein digestion → mass spectrometry → Protein ID
Mass spectrum (intensity vs. m/z), peptide masses: 952.0984, 1895.9057, 1345.6342, 899.8743, 2794.9761
• Make proteolytic peptide fragments - Digest the protein into peptides (using trypsin)
• Measure peptide masses - “Weigh” the peptides in a mass spectrometer
• Match peptide masses to protein or nucleotide sequence database - Compare the data to known proteins and look for a match
Protein digestion
We use the enzyme TRYPSIN to digest (cut) proteins into peptides – trypsin cuts after Lysine (K) and Arginine (R)
Protein X
????????K?????R????????????????K?????R????????????????K?????R????????
Peptides after digestion:
????????K
?????R
????????
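The cutting rule can be simulated on a sequence string. This is a minimal sketch of an in silico tryptic digest: it cuts after every K and R, as stated above. Many tools additionally skip the cut when the next residue is proline; that refinement is not part of the slide, so it is offered here only as an optional flag. The function name is hypothetical, and the test sequence is simply the three peptides from the later slides joined together.

```python
def trypsin_digest(sequence, skip_before_proline=False):
    """Split a protein sequence after every K or R.

    With skip_before_proline=True, a K or R followed by P is not cleaved
    (a commonly applied refinement; off by default to match the slide).
    """
    peptides = []
    start = 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            next_is_proline = i + 1 < len(sequence) and sequence[i + 1] == "P"
            if skip_before_proline and next_is_proline:
                continue
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


print(trypsin_digest("WEGETMILKASTEERMANYCQWS"))
```

Running the same function over every sequence in a database gives the list of theoretical peptides that measured masses are later compared against.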
We then “weigh” these peptides with a Mass Spectrometer
692.31 Da
1106.55 Da
1002.37 Da
Mass of peptides should be compared to theoretical masses of known peptides
?????R = 692.31 Da
????????K = 1106.55 Da
???????? = 1002.37 Da
Computation of theoretical masses of known peptides
Computer Peptides
• WEGETMILK 1106.55
• ADEMTYEK 1105.23
• PLMEHGAK 1089.50
• LMEHHH 782.25
• ASTEER 692.31
• DMGEYIILES 1056.92
• EGEDMPAFY 1002.35
• CYHGMEI 984.36
• EFPKLYSEK 900.56
• YSEPYSSIIR 1102.34
• IESPLMIA 864.35
• AEFLYSR 600.21
• DLMILIYR 864.97
• METHIPEEK 795.36
• KISSMER 513.21
• PEPTIDEK 456.23
• MANYCQWS 1002.37
• TYSMEDGHK 678.46
• YMEPSATFGHR 995.46
• GHLMEDFSAC 896.35
• HHFAASTR 564.88
• ALPMESS 469.12
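A theoretical peptide mass is the sum of its residue masses plus one water (the free N- and C-termini). The sketch below uses standard monoisotopic residue masses and returns the neutral mass; it ignores modifications such as oxidation or alkylation. The function name is hypothetical, and the masses in the list above appear to be illustrative, so they need not agree with computed values.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide, in Da."""
    return sum(MONO[aa] for aa in peptide) + WATER


for peptide in ("ASTEER", "WEGETMILK", "MANYCQWS"):
    neutral = peptide_mass(peptide)
    # A singly protonated ion is observed one proton mass higher.
    print(f"{peptide:10s} M = {neutral:.4f}   [M+H]+ = {neutral + PROTON:.4f}")
```

Whether a search compares neutral masses or [M+H]+ values has to be stated, since the two differ by about 1 Da.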
Proteome = all protein sequences
Digest Proteome with simulated Trypsin
Mass of peptides compared to theoretical masses of all peptides known, using a computer program.
Mass of peptides matched to theoretical masses of known peptides, using a computer program.
?????R = 692.31 Da
????????K = 1106.55 Da
???????? = 1002.37 Da
Computer Peptides
• WEGETMILK 1106.55
• ADEMTYEK 1105.23
• PLMEHGAK 1089.50
• LMEHHH 782.25
• ASTEER 692.31
• DMGEYIILES 1056.92
• EGEDMPAFY 1002.35
• CYHGMEI 984.36
• EFPKLYSEK 900.56
• YSEPYSSIIR 1102.34
• IESPLMIA 864.35
• AEFLYSR 600.21
• DLMILIYR 864.97
• METHIPEEK 795.36
• KISSMER 513.21
• PEPTIDEK 456.23
• MANYCQWS 1002.37
• TYSMEDGHK 678.46
• YMEPSATFGHR 995.46
• GHLMEDFSAC 896.35
• HHFAASTR 564.88
• ALPMESS 469.12
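The matching step done by the computer program can be sketched as a lookup with a mass tolerance. The database below holds a few entries copied from the list above; the function name and the tolerance values are assumptions made for illustration.

```python
# A few theoretical peptide masses (Da) taken from the list above.
DATABASE = {
    "WEGETMILK": 1106.55,
    "ADEMTYEK": 1105.23,
    "PLMEHGAK": 1089.50,
    "ASTEER": 692.31,
    "EGEDMPAFY": 1002.35,
    "MANYCQWS": 1002.37,
    "YMEPSATFGHR": 995.46,
}


def match_masses(observed, database, tolerance_da=0.01):
    """Map each observed mass to the peptides within tolerance_da of it."""
    hits = {}
    for mass in observed:
        hits[mass] = sorted(
            peptide
            for peptide, theoretical in database.items()
            if abs(theoretical - mass) <= tolerance_da
        )
    return hits


observed = [692.31, 1106.55, 1002.37]
for mass, peptides in match_masses(observed, DATABASE).items():
    print(mass, peptides)
```

With a looser tolerance (for example 0.05 Da) the 1002.37 Da peak would also match EGEDMPAFY, which shows why mass accuracy and several matching peptides per protein are needed for a confident identification.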
The unknown peptides have been identified
?????R = 692.31 Da
????????K = 1106.55 Da
???????? = 1002.37 Da
WEGETMILK
ASTEER
MANYCQWS
Protein X has been identified
????????K?????R????????????????K?????R????????????????K?????R????????
WEGETMILK ASTEER MANYCQWS
Concluding points about Proteomics
- Proteomics is the analysis of all proteins
- Interdisciplinary research
- Essential to both basic and clinical research
- Proteins are the workhorses of the cell
- Discovery research – drugs and diseases
- Proteomics tools allow identification of proteins