
2019. 2nd Semester

Kang, Lin-Woo, Ph.D., Professor

Department of Biological Sciences, Konkuk University

Seoul, Korea

Protein Sequence Analysis

Transport

Hemoglobin

carries O2

Defense

Antibody

fights Viruses

Enzyme

Protease

degrades Protein

Support

Keratin

forms Hair and Nails

Motion

Actin

contracts Muscles

Regulation

Insulin

controls Blood Glucose

Proteins do most of the work in the cell

Proteomics Research

• Basic research: to understand the molecular mechanisms underlying life.

• Protein sequence analysis

• Applied research (clinical proteomics): clinical testing for proteins associated with pathological states (e.g. cancer).

Wikipedia, http://en.wikipedia.org

...the large-scale study of proteins… while it is often viewed as the “next step”, proteomics is much more complicated than genomics.

…while the genome is a rather constant entity, the proteome differs from cell to cell and is constantly changing through its biochemical interactions with the genome and the environment.

One organism will have radically different protein expression in different parts of its body, in different stages of its life cycle and in different environmental conditions.

Proteomics is the analysis of the protein complement to the genome

Genomics Proteomics

Gene Transcript Protein

Clinical Proteomics

From Petricoin et al., Nature Reviews Drug Discovery (2002) 1, 683-695


Mass spectrometry

Mass spectrometry is based on slightly different principles from the other spectroscopic methods. The physics behind it is that a charged particle passing through a magnetic field is deflected along a circular path whose radius depends on the mass-to-charge ratio, m/z (also written m/e).

In an electron impact mass spectrometer, a high-energy beam of electrons is used to displace an electron from the organic molecule to form a radical cation known as the molecular ion. If the molecular ion is too unstable, it can fragment to give other, smaller ions.

The collection of ions is then focused into a beam, accelerated into the magnetic field and deflected along circular paths according to the masses of the ions. By adjusting the magnetic field, the ions can be focused on the detector and recorded.
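The dependence of the path radius on mass and charge can be sketched with a short derivation. The symbols here are assumptions introduced for illustration, not taken from the slides: an ion of mass m and charge ze is accelerated through a potential V to speed v, then enters a uniform magnetic field B and follows a circle of radius r.

```latex
\begin{aligned}
zeV &= \tfrac{1}{2} m v^{2} && \text{(kinetic energy gained during acceleration)}\\
zevB &= \frac{m v^{2}}{r} && \text{(magnetic force supplies the centripetal force)}\\
r &= \frac{m v}{zeB} = \frac{1}{B}\sqrt{\frac{2V}{e}\cdot\frac{m}{z}}
\end{aligned}
```

Under these assumptions, ions with the same speed follow radii proportional to m/z, while ions accelerated through the same potential follow radii proportional to the square root of m/z. Either way, scanning B brings ions of successive m/z onto the detector.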


Peptide Sequencing

In the last 15 years or so, ESI-CID-MS/MS analysis has seen increasingly wide application in proteomics, especially in peptide sequencing and protein identification. CID fragmentation of a singly or doubly protonated peptide ion ([M+nH]^(n+), where n = 1 or 2) yields a series of fragment ions resulting from cleavage of the peptide backbone, predominantly at the amide (peptide) bonds between amino-acid residues.
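As a rough illustration of such a fragment series, the sketch below computes singly charged b-ion (N-terminal) and y-ion (C-terminal) m/z values for a peptide. It assumes cleavage at every backbone amide bond and standard monoisotopic residue masses. The function name `fragment_ions` and the example peptide are illustrative choices, not part of the lecture.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, added to the C-terminal (y) fragment
PROTON = 1.00728   # charge carrier for a 1+ ion


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values, one pair per amide bond."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b ion: the first i residues plus a proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ion: the remaining residues plus water plus a proton.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("ASTEER")
    for i, (bi, yi) in enumerate(zip(b, y), start=1):
        print(f"b{i} = {bi:.3f}   y{len(b) - i + 1} = {yi:.3f}")
```

Differences between consecutive b ions (or consecutive y ions) equal single residue masses, which is what lets a sequence be read off an MS/MS spectrum.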


Single protein and shotgun analysis

Adapted from: McDonald et al. 2002. Disease Markers 18 99-105

Protein Bioinformatics

[Figure: two workflows starting from a mixture of proteins]

• Single protein analysis: gel-based separation → spot excision and digestion → peptides from a single protein → MS analysis

• Shotgun analysis: digestion of protein mixture → peptides from many proteins → LC or LC/LC separation → MS/MS analysis


https://www.creative-proteomics.com


Protein Bioinformatics: Protein sequence analysis

• Helps characterize protein sequences in silico and allows prediction of protein structure and function

• Statistically significant BLAST hits usually signify sequence homology

• Homologous sequences may or may not have the same function but, with very few exceptions, have the same structural fold

• Protein sequence analysis allows protein classification

Development of protein sequence databases

• Atlas of Protein Sequence and Structure – Dayhoff (1966), the first sequence database (pre-bioinformatics). Currently known as the Protein Information Resource (PIR)

• Protein Data Bank (PDB) – structural database (1971); remains the most widely used database of structures

• UniProt – The United Protein Databases (UniProt, 2003) is a central database of protein sequence and function created by joining the forces of the SWISS-PROT, TrEMBL and PIR protein database activities

Comparative protein sequence analysis and evolution

• Patterns of conservation in sequences allow us to determine which residues are under selective constraints (are important for protein function)

• Comparative analysis of proteins is more sensitive than comparing DNA

• Homologous proteins have a common ancestor

• Protein classification systems based on evolution: PIRSF and COG

http://pir.georgetown.edu/pirsf/

http://www.ncbi.nlm.nih.gov/COG/

Comparing proteins

• Amino acid sequence of protein generated from proteomics experiment

• e.g. protein fragment DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT

• Amino acids of two sequences can be aligned, and the number of identical residues can then be counted (or an index of similarity used) to find the % identity or % similarity.

• Protein structures can be compared by superimposition
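Counting identical residues in an alignment can be sketched in a few lines. This is a minimal example that assumes the two sequences are already aligned to equal length, that gaps are written as "_", and that percent identity is taken over the full alignment length (other denominators are also in use). The function name `percent_identity` is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="_"):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if the residues match and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    print(percent_identity("abacd", "ab_cd"))
```

A similarity score would replace the strict equality test with a lookup in a substitution matrix such as BLOSUM62, which is beyond this sketch.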

Protein sequence alignment

• Pairwise alignment

• a b a c d

• a b _ c d

• Multiple sequence alignment usually provides more information

• a b a c d

• a b _ c d

• x b a c e

• Multiple alignment is difficult to do for distantly related proteins
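The pairwise case above can be reproduced with a small global alignment routine. The sketch below uses Needleman-Wunsch dynamic programming with a simple scoring scheme (match +1, mismatch -1, gap -1). Both the scoring values and the function name `needleman_wunsch` are assumptions chosen for illustration; real protein alignments normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="_"):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(gap_char); i -= 1
        else:
            out_a.append(gap_char); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, s = needleman_wunsch("abacd", "abcd")
    print(top)
    print(bottom)
    print("score:", s)
```

With this scoring, aligning "abacd" against "abcd" places the single gap opposite the second "a", as in the slide example.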

Protein sequence analysis overview

• Protein databases: UniProt, NCBI

• Searching databases: peptide search, BLAST search, text search

• Information retrieval and analysis: protein records at UniProt, PIR and NCBI

• Multiple sequence alignment

• Secondary structure prediction

• Homology modeling

Universal Protein Knowledgebase (UniProt)

[Figure: UniProt data flow. Sources: Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, EnsEMBL, PDB, patent data, other data. Components: UniProt Archive; UniProt Knowledgebase (literature-based annotation, automated annotation, classification); UniProt NREF (clustering at 100, 90, 50%).]


http://www.uniprot.org/

Peptide Search

Protein structure prediction

• Programs can predict secondary structure with about 70% accuracy

• Homology modeling: prediction of a ‘target’ structure from a closely related ‘template’ structure

http://nar.oupjournals.org/cgi/content/full/30/23/5229/GKF645F2

Secondary structure prediction: http://bioinf.cs.ucl.ac.uk/psipred/

Secondary structure prediction results

Homology modeling: http://www.expasy.org/swissmod/SWISS-MODEL.html

Multiple sequence alignment

Identifying remote homologs

Structure-guided sequence alignment

Applications of Proteomics

[Figure: application areas of proteomics and examples]

• Areas: Structural Proteomics, Proteome Mining, Post-translational Modifications, Protein Expression Profiling, Functional Proteomics, Protein-protein Interactions

• Post-translational modifications: Glycosylation, Phosphorylation, Proteolysis

• Protein-protein interactions: Yeast two-hybrid, Co-precipitation, Phage Display

• Drug Discovery, Target ID, Differential Display

• Yeast Genomics, Affinity Purified Protein Complexes, Mouse Knockouts

• Medical Microbiology, Signal Transduction, Disease Mechanisms

• Organelle Composition, Subproteome Isolation, Protein Complexes

Proteomic tools and methods

Proteomic tools to study proteins

• Protein isolation

• Protein separation

• Protein identification

How are proteins isolated?

• Mechanical methods

• grinding – breaks open cells

• centrifugation – removes insoluble debris

• Chemical methods

• detergent – breaks open cell compartments

• reducing agent – breaks disulfide bonds

• heat – denatures the protein, disrupting non-covalent interactions to “linearize” it

Protein isolation procedure

• Find a sample and pick it

• Grind sample in buffer

• Transfer to tube

• Heat the sample

• Centrifuge to remove insoluble material

• Recover supernatant (the “pure” protein solution)

• Keep solution for gel analysis

[Figure: Protein X within the “pure” protein solution, and isolated Protein X]

Why separate proteins?

Starting material: “pure” protein solution

• Tube 1: increased complexity, decreased protein ID

• Tube 2: decreased complexity, increased protein ID

How to separate proteins?

Intact proteins are separated by taking advantage of their diversity in physical properties, especially isoelectric point and molecular weight

Methods of Protein Separation

• Sodium Dodecyl Sulfate – Polyacrylamide Gel Electrophoresis (SDS-PAGE)

• Isoelectric Focusing (IEF)

Sodium dodecyl sulfate - SDS

The anionic detergent SDS unfolds or denatures proteins

• Uniform linear shape

• Uniform charge/mass ratio

Two-dimensional gel electrophoresis (2-DGE)

Most widely used protein separation technique in proteomics

Capable of resolving thousands of proteins from a complex sample (e.g. blood, organs, tissue…)

1st dimension - isoelectric focusing

2nd dimension - SDS-PAGE

Isoelectric focusing (IEF) is the separation of proteins according to native charge: each protein migrates until it reaches the pH equal to its isoelectric point.

Isoelectric point (pI): the pH at which the net charge is zero
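The definition of the isoelectric point can be turned into a small calculation. The sketch below estimates the net charge of a sequence at a given pH with the Henderson-Hasselbalch equation and then finds the pH of zero net charge by bisection. The pKa values are approximate, commonly quoted figures and are an assumption here; published scales differ, and real proteins deviate because of their folded environment. The function names are illustrative.

```python
# Approximate side-chain and terminal pKa values (assumed; published scales vary).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(seq, ph):
    """Estimated net charge of an unfolded sequence at the given pH."""
    # Basic groups carry +1 when protonated.
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # Acidic groups carry -1 when deprotonated.
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for aa in seq:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Find the pH in [0, 14] where the net charge crosses zero, by bisection."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Net charge falls monotonically as pH rises.
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    fragment = "DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT"
    print(f"estimated pI: {isoelectric_point(fragment):.2f}")
```

In IEF terms, a protein sitting at a pH below its estimated pI is positively charged and keeps migrating until the surrounding pH equals the pI.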

1st Dimension-Isoelectric Focusing

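[Figure: 2-DGE workflow. Protein samples are separated by IEF (1st dimension) along a pH gradient from 3 to 10, then by SDS-PAGE (2nd dimension) by mass; markers at 150, 100, 75, 50, 37, 25, 20 and 11 kDa. One annotation reads “Neutral at pH 3”.]

[Figure: 2-DG of an Arabidopsis developing leaf. Horizontal axis: pI, 3 to 10; vertical axis: mass, 25 to 100 kDa.]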

2-DGE

[Figure: 2-D gel. SDS-PAGE (2nd dimension) markers at 150, 100, 75, 50, 37, 25, 20 and 11 kDa; pH 3 to 10. Protein X: 25 kDa, pI 5.]

1-DGE vs. 2-DGE

1-DGE (SDS-PAGE)

• High reproducibility

• Quick/Easy

• Separates solely based on size

• Modest resolution, dependent on complexity of sample

2-DGE

• Modest reproducibility

• Slow/Demanding

• Separates based on pI and size

• High resolution, not dependent on complexity of sample

Peptide mass fingerprinting

[Figure: intact protein X → protein digestion → mass spectrometry → Protein ID. The spectrum plots intensity against m/z, with example peaks at 899.8743, 952.0984, 1345.6342, 1895.9057 and 2794.9761.]

• Make proteolytic peptide fragments: digest the protein into peptides (using trypsin)

• Measure peptide masses: “weigh” the peptides in a mass spectrometer

• Match peptide masses to a protein or nucleotide sequence database: compare the data to known proteins and look for a match

Protein digestion

We use the enzyme trypsin to digest (cut) proteins into peptides. Trypsin cuts after lysine (K) and arginine (R).
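The cleavage rule can be simulated directly. The following is a minimal in silico digestion that cuts after every K and R, exactly as stated above; it ignores missed cleavages and the commonly applied exception for a following proline, which is a simplification. The function name `trypsin_digest` and the example sequence are illustrative.

```python
def trypsin_digest(protein):
    """Split a protein sequence after every lysine (K) or arginine (R)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            # The cut falls after the K/R, so it stays on the left-hand peptide.
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Whatever follows the last cleavage site is the C-terminal peptide.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    for peptide in trypsin_digest("WEGETMILKASTEERMANYCQWS"):
        print(peptide)
```

Every peptide except possibly the last one ends in K or R, which is why tryptic peptides are easy to recognise in a database search.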

Protein X: ????????K?????R????????????????K?????R????????????????K?????R????????

Digestion yields peptides such as:

????????K

?????R

????????

We then “weigh” these peptides with a mass spectrometer.

Measured peptide masses: 692.31 Da, 1106.55 Da, 1002.37 Da

The peptide masses should be compared with the theoretical masses of known peptides

?????R = 692.31 Da

????????K = 1106.55 Da

???????? = 1002.37 Da
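A theoretical peptide mass is the sum of its residue masses plus one water. The sketch below computes the singly protonated monoisotopic value, [M+H]+, on the assumption that this is the quantity a peptide mass fingerprint reports; the slides do not say which mass type they quote. The function name `peptide_mh` is illustrative.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one H2O per peptide (free N- and C-termini)
PROTON = 1.00728   # added for the singly charged ion


def peptide_mh(peptide):
    """Monoisotopic [M+H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    for pep in ("WEGETMILK", "ASTEER", "MANYCQWS"):
        print(f"{pep}: {peptide_mh(pep):.2f}")
```

Under these assumptions WEGETMILK comes out within about 0.01 of the 1106.55 quoted above. Modifications such as oxidised methionine or alkylated cysteine would shift the values and are not handled here.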

Computation of theoretical masses of known peptides

Computed peptides

• WEGETMILK 1106.55

• ADEMTYEK 1105.23

• PLMEHGAK 1089.50

• LMEHHH 782.25

• ASTEER 692.31

• DMGEYIILES 1056.92

• EGEDMPAFY 1002.35

• CYHGMEI 984.36

• EFPKLYSEK 900.56

• YSEPYSSIIR 1102.34

• IESPLMIA 864.35

• AEFLYSR 600.21

• DLMILIYR 864.97

• METHIPEEK 795.36

• KISSMER 513.21

• PEPTIDEK 456.23

• MANYCQWS 1002.37

• TYSMEDGHK 678.46

• YMEPSATFGHR 995.46

• GHLMEDFSAC 896.35

• HHFAASTR 564.88

• ALPMESS 469.12

Proteome = all protein sequences

Digest proteome with simulated trypsin

The measured peptide masses are compared with the theoretical masses of all known peptides, using a computer program.


The measured peptide masses are matched to the theoretical masses of known peptides, using a computer program.


The unknown peptides have been identified

?????R = 692.31 Da: ASTEER

????????K = 1106.55 Da: WEGETMILK

???????? = 1002.37 Da: MANYCQWS

Protein X has been identified

????????K?????R????????????????K?????R????????????????K?????R????????

Matched peptides: WEGETMILK, ASTEER, MANYCQWS
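The matching step can be sketched as a lookup with a mass tolerance. The candidate masses below are copied from the slide lists (MANYCQWS uses the 1002.37 value from the matching slide). The tolerance of 0.01 Da and the function name `match_masses` are assumptions for illustration; real search engines use instrument-specific tolerances and score whole sets of peptides per protein.

```python
# A few candidate peptides and theoretical masses (Da), as listed on the slides.
CANDIDATES = {
    "WEGETMILK": 1106.55,
    "ADEMTYEK": 1105.23,
    "ASTEER": 692.31,
    "EGEDMPAFY": 1002.35,
    "MANYCQWS": 1002.37,
    "PEPTIDEK": 456.23,
}


def match_masses(observed, candidates, tol=0.01):
    """Map each observed mass to the candidate peptides within +/- tol Da."""
    hits = {}
    for mass in observed:
        hits[mass] = [pep for pep, theo in candidates.items()
                      if abs(theo - mass) <= tol]
    return hits


if __name__ == "__main__":
    measured = [692.31, 1106.55, 1002.37]
    for mass, peptides in match_masses(measured, CANDIDATES).items():
        print(mass, "->", peptides)
```

Widening the tolerance to 0.05 Da would make 1002.37 match both EGEDMPAFY and MANYCQWS, which shows why mass accuracy matters for unambiguous identification.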

Concluding points about Proteomics

- Proteomics is the analysis of all proteins

- Interdisciplinary research

- Essential to both basic and clinical research

- Proteins are the workhorses of the cell

- Discovery research: drugs and diseases

- Proteomics tools allow identification of proteins