Burkhard Rost (Columbia New York)
Evolution teaches to predict protein structure and function
Burkhard Rost
CUBIC Columbia University
http://www.columbia.edu/~rost
http://cubic.bioc.columbia.edu/
Evolution teaches prediction
• Is Bioinformatics up to the data deluge?
• Sequence comparison: do we know what we do?
– conservation of structure and function
• Structure prediction: where are we today?
• How to learn from the evolutionary odyssey?
– secondary structure
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions
http://cubic.bioc.columbia.edu/
• Volker Eyrich
• Rajesh Nair
• Jinfeng Liu
• Dariusz Przybylski
• Yanay Ofran
• Henry Bigelow
• Kazimierz Wrzeszczynski
• Sven Mika
• Chien Peter Chen
• Burkhard Rost
• http://cubic.bioc.columbia.edu/
• Miguel Andrade, EMBL
• Sean O’Donoghue, LION
• Andrej Sali, Marc Marti-Renom, Rockefeller
• Alfonso Valencia, Florencio Pazos, Madrid
• Michal Linial, Jerusalem
• Claus Andersen, Copenhagen
• Bastian Bruning, Nijmegen
• Hepan Tan, Columbia
• Trevor Siggers, Columbia
CUBIC http://cubic.bioc.columbia.edu
Dariusz Przybylski
Jinfeng Liu
Trevor Siggers
Murat Cokol
Hepan Tan
Volker Eyrich
The Data Deluge
Conclusion: Bioinformatics will have a hell of a problem
[Figure: number of sequences in data base (log scale, 10^2 to 10^7) versus year (3-1982 to 6-1999) for PDB, SWISS-PROT and EMBL, with a further curve labelled "Computer". Date: 3-2001]
Data Deluge: what do we want?
[Diagram: DNA, ORF, Protein, Active protein, 3D structure, Function]
Expressed?
Domains = smallest functional / structural subunits
• cellular function
• physiological function
• substrate binding sites
• protein-protein interfaces
• activity
• specificity
• docking
• localisation
Data Deluge: numbers
• 30 entire organisms
• 600.000 genes (GenBank)
• 1.500 domains (DALI)
• 10.000 structures (PDB), 600 'unique' (FSSP)
• 30.000 annotations (SWISS-PROT)
• 350.000 proteins (TrEMBL)
Second set of numbers on the slide (labels not recoverable): 50; 1.200.000; 500.000; 2000; 17.000; 800; 35.000
Data Deluge: what CAN we do?
• Introns: 100% mycoplasma, 30-50% eukaryotes
• ORF: 10% error in bacteria
• Signal peptides: sometimes (ProSite / SignalP)
• Domains: sometimes (Pfam / ProDom)
• 3D structure: sometimes
• Function: ? (motifs: ProSite / PRAM; alignment)
Data Deluge: what CAN we do?
Not much … yet
Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Sequence comparison: do we know what we do?
– conservation of structure and function
• Structure prediction: where are we today?
• How to learn from the evolutionary odyssey?
– secondary structure
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions
Dynamic programming: optimal alignment
Pair of protein sequences
U GGQLAKEEAL
T EGQPVEVL
Dynamic programming matrix (columns: U, rows: T)
  GGQLAKEEAL
E 0000001100
G 1100000110
Q 0120000011
P 0012000001
V 0001200000
E 0000121100
V 0000012110
L 0001001212
Optimal alignment (with gaps)
U GGQLAKEEAL
T EGQP.VE.VL
Optimal alignment (no gaps)
U  GGQLAKEEAL
T1 EVL
T2 EGQPVEVL
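The gapped alignment of U and T can be reproduced with a small global dynamic-programming routine. This is a minimal sketch, not the scheme used on the slide: the function name `global_align` and the scoring (match = 1, mismatch = 0, gap = 0, so the score simply counts identical aligned residues) are assumptions made for illustration.

```python
def global_align(a, b, match=1, mismatch=0, gap=0):
    """Global alignment by dynamic programming (illustrative scoring, assumed)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("."); i -= 1
        else:
            out_a.append("."); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, aligned_u, aligned_t = global_align("GGQLAKEEAL", "EGQPVEVL")
print(best)
print(aligned_u)
print(aligned_t)
```

With zero gap cost several alignments share the best score, so the printed alignment need not be character-for-character the one on the slide; only the number of identical positions is fixed.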
BLAST: fast matching of single ‘words’
T T Y K L I L N G K T L K G E T T T E A V D A A T A E K V F K Q Y A N D N G V D G E W T Y D D A T K T F T V T E K
T T Y K L I L L L L L L L L L L L L L L L L A W T V E K A F K T F A A A A A A A A A W T V E K A F K T F A A A A A
T T Y K L I L
T T Y K L I L
W T Y D D A T K T F
W T V E K A F K T F
A A T A E K V F K Q Y A
A W T V E K A F K T F A? ?
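The word-matching idea can be sketched as an exact k-mer lookup. This is a simplified illustration only: real BLAST also scores similar (non-identical) words and extends each hit, neither of which is shown. The function name `word_hits` and the word length `k=7` are assumptions chosen to match the TTYKLIL example.

```python
from collections import defaultdict


def word_hits(query, subject, k=7):
    """Return (query_pos, subject_pos) pairs where identical words of length k occur."""
    # Index every word of the subject sequence by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    # Look up every word of the query in that index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


# Shortened versions of the two sequences shown above (assumed for brevity).
print(word_hits("TTYKLILNGKTLKGETTTE", "TTYKLILLLLLL"))
```

Building the index once is what makes the search fast: each query word costs one dictionary lookup instead of a scan of the subject.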
Profile-based comparison
           1                                                  50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF
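A family alignment like this one can be condensed into a position-specific profile by counting residues per column. The sketch below is one simple way to do that, assuming unweighted frequencies and treating "." and "-" as gaps; the function name `column_profile` is hypothetical, and actual profile methods typically add sequence weighting and pseudo-counts.

```python
from collections import Counter


def column_profile(alignment, gap_chars=".-"):
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count residues in this column, ignoring gaps; lower-case letters
        # (insertion markers in the alignment) are folded to upper case.
        counts = Counter(row[col].upper() for row in alignment
                         if row[col] not in gap_chars)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# First ten columns of fyn_human, src_chick and hck_human from the alignment.
rows = ["VTLFVALYDY", "VTTFVALYDY", "..IVVALYDY"]
for position, column in enumerate(column_profile(rows), start=1):
    print(position, column)
```

Conserved columns (for example A-L-Y-D-Y) collapse to a single residue with frequency 1, while variable columns spread their weight over several residues, which is the signal a profile search exploits.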
Zones
Sequence -> Structure
• Sequence folds into unique structure: S -> T
• Similar sequences fold into similar structures: S + S’ -> T
• Most sequences don’t fold at all: S -> no T
[Diagram: sequence space mapped onto structure space]
[Figure: number of protein pairs (log scale, 10^1 to 10^6) versus distance from HSSP threshold (-15 to 10), with the corresponding percentage sequence identity (10 to 35); annotated levels 50%, 10% and 90%. Twilight zone = false positives explode.]
B Rost 1999 Prot. Engin.:12, 85-94
Significant sequence identity
B Rost 1999 Prot. Engin.:12, 85-94
$$\mathrm{HSSP\_PIDE}(\vartheta) = \vartheta + 480 \cdot L^{-0.32 \cdot \{1 + e^{-L/1000}\}}$$
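The threshold curve can be evaluated numerically. The sketch below assumes that L is the number of residues aligned and that ϑ is a shift of the curve in percentage points; the function names are hypothetical, and only the expression shown on the slide is implemented (no special handling of very short or very long alignments).

```python
import math


def hssp_threshold(n_aligned, distance=0.0):
    """Percentage identity on the HSSP curve, shifted by `distance` points."""
    exponent = -0.32 * (1.0 + math.exp(-n_aligned / 1000.0))
    return distance + 480.0 * n_aligned ** exponent


def distance_from_threshold(pide, n_aligned):
    """How many percentage points an alignment lies above (+) or below (-) the curve."""
    return pide - hssp_threshold(n_aligned)


for length in (50, 100, 150, 250):
    print(length, round(hssp_threshold(length), 1))
```

For 100 aligned residues the curve sits just under 30% identity, and it falls as alignments get longer: shorter alignments need a higher percentage identity to count as significant.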
Evolution did it!
[Figure: threshold curve over number of residues aligned (0 to 250), vertical axis 0 to 100. Regions labelled "Sequence identity implies structural similarity!" and "Don't know region".]
B Rost 1999 Prot. Engin.:12, 85-94
Similar sequence -> similar structure?
[Figure: curves for identity and similarity over number of residues aligned (0 to 250), vertical axis 0 to 100]
B Rost 1999 Prot. Engin.:12, 85-94
Detecting true hits in Twilight zone
[Figure: vertical axis 0 to 100; horizontal axis distance from threshold (-10 to 10), with a second scale from 15 to 35. Curves: old HSSP, ide, sim. Annotations: 10%; similarity-larger-than-identity; they-dont-know-what-they-do; only sequence identity.]
B Rost 1999 Prot. Engin.:12, 85-94
Finding similar structures in Twilight zone
[Figure: vertical axis 4·10^3 to 5·10^4; horizontal axis distance from threshold (-10 to 10), with a second scale from 15 to 35. Curves: old HSSP, ide, sim. Annotations: 5%; similarity-larger-than-identity.]
B Rost 1999 Prot. Engin.:12, 85-94
‘Secure’ thresholds for BLAST
[Figure: coverage and accuracy (0 to 100) and numbers of true and false hits (log scale, 10^1 to 10^6) versus probability score of PSI-BLAST (0.1 to 10^5)]
B Rost 1999 Prot. Engin.:12, 85-94
Accuracy vs. coverage
• Coverage: how many of the correct proteins were found?
• Accuracy: how many of the proteins found are correct?
[Figure: axes Accuracy and Coverage, both 0 to 100]
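The two questions translate directly into two ratios. A minimal sketch, assuming hits and true relatives are given as collections of identifiers (the function name `accuracy_coverage` and the example identifiers are hypothetical):

```python
def accuracy_coverage(found, correct):
    """Return (accuracy, coverage) in percent for a set of reported hits."""
    found, correct = set(found), set(correct)
    true_hits = found & correct
    # Accuracy: how many of the proteins found are correct?
    accuracy = 100.0 * len(true_hits) / len(found) if found else 0.0
    # Coverage: how many of the correct proteins were found?
    coverage = 100.0 * len(true_hits) / len(correct) if correct else 0.0
    return accuracy, coverage


acc, cov = accuracy_coverage(found={"a", "b", "c", "d"}, correct={"a", "b", "e"})
print(acc, cov)
```

Loosening a search threshold usually adds both true and false hits, so coverage rises while accuracy falls; plotting one against the other shows that trade-off for each method.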
BLAST is not enough ...
[Figure: accuracy curves (axes 0 to 100, with an inset) for ∆similarity, ∆identity, HSSP-curve, % identity, alignment score, blast2 and psi-blast]
B Rost 1999 Prot. Engin.:12, 85-94
Sequence Space Hopping
[Diagram: protein A, protein B and protein C connected through intermediate sequences labelled sel_x, anl_z, unk_y, anb_x, unk_x, cal_y, cal_x, seq_x and seq_y]
B Rost 1999 Prot. Engin.:12, 85-94
Success through sequence space hopping
[Figure: vertical axis 50 to 100; horizontal axis distance from threshold (-10 to 5), corresponding to 15 to 30 percentage sequence identity. Curves: old, ide. Inset scale 0 to 200.]
B Rost 1999 Prot. Engin.:12, 85-94
Zones
Profile-based database search
[Diagram, built up over several slides: a query U inside its family; a safe zone that is safe for pairwise comparison; a wider zone reached through the position-specific family profile; regions lost after iteration; safe zones of close homologues]
B Rost 2001 Structural Bioinformatics: in press
Zones
Hypothetical distribution of similar structures
[Figure: horizontal axis percentage of identical residues (0 to 100); vertical axis 0 to 60]
Midnight zone: real - random
[Figure: vertical axis 0 to 60; horizontal axis percentage identical residues (0 to 25)]
B Rost 1997 Folding & Design:2, S19-S24
AS Yang and B Honig 2000 J. Mol. Biol.:301, 679-689
Evolution into the Midnight zone
[Figure: number of structure pairs (0 to 1600) versus percentage pairwise sequence identity (0 to 25), with an inset scale from 0 to 100]
B Rost and S O'Donoghue 1998 EMBL preprint
Protein structures evolved at random - almost
• average sequence identity between pairs of similar structures < 10%
– -> most pairs have ‘random’ identity levels
• 3 - 4% anchor residues
• 4 billion years of evolution reached equilibrium
– rate of creating new structures slower than drift towards mean
• averages for convergent and divergent evolution similar
• convergent evolution may have been a major event
Structure space
B Rost 1998 Structure:6, 259-263
Gold-mine out of reach!
[Figure: percentage of pairs (0 to 10) versus percentage of identical residues (30 to 100); the fraction of protein pairs is shown for curves labelled MJ, MG, YE, HI and PDB]
Conservation of function
Devos & Valencia 2000, Proteins, 41, pp. 98
Conservation of EC number
[Figure: percentage of proteins (0 to 100) and number of proteins (log scale, 10^1 to 10^5) versus percentage pairwise sequence identity (0 to 100). Curves: first EC digit accuracy; first EC digit coverage; all EC digits accuracy; all EC digits coverage; number of proteins.]
Conservation of EC number 2
[Figure, two panels: percentage of proteins (0 to 100) and number of proteins (log scale, 10^1 to 10^5). One panel is plotted against distance from threshold (identity/length, -40 to 40) with the corresponding percentage sequence identity (0 to 80); the other against percentage pairwise sequence identity (20 to 100). Curves: first EC digit accuracy; first EC digit coverage; all EC digits accuracy; all EC digits coverage; number of proteins.]
Conservation of EC number: BLAST
[Figure, two panels: percentage of proteins (0 to 100) and number of proteins (log scale) versus log(BLAST E), one panel from -4 to 4 and one from -200 to 0. Curves: first EC digit accuracy; first EC digit coverage; all EC digits accuracy; all EC digits coverage; number of proteins.]
Conservation in detail
[Figure, two panels (Pairwise Blast and PSI-Blast) for the FULL EC number: percentage of protein pairs (0 to 100) versus percentage pairwise sequence identity (20 to 100). Curves A and C for each enzyme class: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases, Ligases.]
Accuracy vs. coverage: EC number
[Figure, two panels (Pairwise and PSI-Blast): accuracy versus coverage, both 0 to 100, each with an inset covering 85 to 95. Curves: ONE pide, ALL pide, ONE dist, ALL dist, ONE prob, ALL prob.]
Conservation of EC numbers
[Figure, grid of panels for Pair and PSI searches: percentage of pairs (0 to 100) and number of pairs or proteins (log scale) versus distance from HSSP-threshold (-40 to 40, with the corresponding sequence identity 0 to 80) and versus log(BLAST E) (-4 to 4 and -200 to 0). Curves: accuracy and coverage for the first EC digit; accuracy and coverage for the full EC; number of pairs.]
Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Structure prediction: where are we today?
• How to learn from the evolutionary odyssey?
– secondary structure
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions
Notation: protein structure 1D, 2D, 3D
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVL
[Figure: three panels labelled 1D, 2D and 3D. The 1D panel shows the sequence with aligned residues and strand (E) assignments; the 2D panel is a residue-by-residue map (residues 1 to 90) with a scale from 0 to -5 kcal/mol.]
Christine Orengo (Structure, 1997, 5, 1093-1108)
Goal of structure prediction
• Epstein & Anfinsen, 1961: sequence uniquely determines structure
• INPUT: sequence
• OUTPUT: 3D structure and function
Protein structure prediction in reality
FoRc
HoMo
3D
1D
          EEEE B B B B EEEEEE EEEEEE EEEEEEEEHHHEEE
1shf 100% VTLFVALYDYEARTEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNYVAPVD
1srm  78% VTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLTTGQTGYIPSNYVAPSD
1sem  39% ....VAEHDFQAGSPDELSFKRGNTLKVLNKDEDPHWYKAEL.DGNEGFIPSNYIRMTE
WHAT IF
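The percentages next to each identifier are pairwise sequence identities to the first sequence. A minimal sketch of that calculation, assuming identity is counted over positions where neither sequence has a gap (other conventions for the denominator exist, and the function name `percent_identity` is hypothetical):

```python
def percent_identity(seq_a, seq_b, gap_chars=".-"):
    """Return (percentage identity, number of aligned positions) for two aligned rows."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = 0
    for a, b in zip(seq_a, seq_b):
        if a in gap_chars or b in gap_chars:
            continue  # positions with a gap are not counted as aligned
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    pide = 100.0 * identical / aligned if aligned else 0.0
    return pide, aligned


# First ten alignment columns of 1shf and 1srm from the slide.
print(percent_identity("VTLFVALYDY", "VTTFVALYDY"))
```

Returning the number of aligned positions alongside the percentage matters because the same percentage is far more meaningful over a long alignment than over a short one.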
Homology modelling / comparative modelling
• assumption: H and U have homologous 3D structures
• strategy: modelling of U based on H
[Diagram: U (sequence) is compared against PDB; H is a protein of known structure with significant sequence identity to U]
Protein structure prediction in reality
FoRc
HoMo
1D
… the art of being humble
SWISS-PROT view / Genome view
Structure prediction for protein universe
[Figure, two panels: percentage of proteins in the proteome (0 to 50) and percentage of residues in the proteome (0 to 40), one bar per organism]
Archae: A pernix, A fulgidus, M jannaschii, M thermoautotrophicu, P abyssi, P horikoshii
Prokaryotes: A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, U urealyticum
Euka: S cerevisiae, C elegans, D melanogaster, H sapiens (SP/TrEmbl), H sapiens (chr 22)
Improving prediction by waiting it out …
1991
1995
1999
Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• How to learn from the evolutionary odyssey?
– secondary structure
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions
Evolution did it!
[Figure: threshold curve over number of residues aligned (0 to 250), vertical axis 0 to 100. Regions labelled "Sequence identity implies structural similarity!" and "Don't know region".]
B Rost 1999 Prot. Engin.:12, 85-94
[Figure: neural network for secondary structure prediction]
Protein and alignments (example rows): GYIY DPAVGDPDNGVEP GTEF; GYIY DPEVGDPTQNIPP GTKF; GYEY DPAEGDPDNGVKP GTSF; GYEY DPAEGDPDNGVKP GTAF; GYIY DPEDGDPDDGVNP GTDF
Profile table: counts per alignment position over the residue columns GSAPD NTEKQ CVHIR LMYFW
Input layer: corresponds to the 21*3 bits coding for the profile of one residue
First or hidden layer (J1); second or output layer (J2) with units s0, s1, s2 for H, E and L
Pick maximal unit => current prediction
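The input coding and the final decision step can be sketched without a trained network. The sketch below rests on assumptions that the extracted slide does not confirm: 21 input values per residue (20 amino-acid frequencies plus one spacer flag for window positions beyond the chain ends) and a window of three residues, which is one reading of "21*3". The names `encode_window` and `pick_maximal_unit` are hypothetical.

```python
AMINO_ACIDS = "GSAPDNTEKQCVHIRLMYFW"  # column order of the profile table
STATES = "HEL"  # helix, strand, other: one output unit each


def encode_window(profile, centre, half_width=1):
    """Flatten the profile rows around `centre` into one network input vector."""
    vector = []
    for pos in range(centre - half_width, centre + half_width + 1):
        if 0 <= pos < len(profile):
            vector.extend(profile[pos].get(aa, 0.0) for aa in AMINO_ACIDS)
            vector.append(0.0)  # spacer flag off: real residue
        else:
            vector.extend([0.0] * len(AMINO_ACIDS))
            vector.append(1.0)  # spacer flag on: window runs past the terminus
    return vector


def pick_maximal_unit(outputs):
    """Predict the state whose output unit has the highest value."""
    best = max(range(len(STATES)), key=lambda k: outputs[k])
    return STATES[best]


profile = [{"G": 1.0}, {"Y": 1.0}, {"I": 0.6, "E": 0.4}, {"Y": 1.0}]
print(len(encode_window(profile, centre=0)))
print(pick_maximal_unit([0.1, 0.7, 0.2]))
```

The weights between the layers are deliberately left out; the point is only how profile rows become an input vector and how three output values become a single H, E or L call.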
Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions
Membrane prediction
HTM prediction waiting for database growth ...
1993
1996
1999
Topology for membrane helical proteins
[Diagram: three proteins (A, B and C) spanning the membrane between the extra-cytoplasmic and intra-cytoplasmic sides, each with its C-term marked and labelled "in" or "out"]
PHDsec success on Poly-Valine
HEADER LIPOPROTEIN(SURFACE FILM)
COMPND PULMONARY SURFACTANT-ASSOCIATED POLYPEPTIDE C(SP-C)
SOURCE PIG (SUS SCROFA)
AUTHOR J.JOHANSSON,T.SZYPERSKI,T.CURSTEDT,K.WUTHRICH
AA      LRIPCCPVNLKRLLVVVVVVVLVVVVTVGALLMGL
OBS sec HHHHHHHHHHHHHHHHHHHHHHHHH
PHD sec EEEEEEEEEEEEEEEEEEEEEEE
PHDhtm
[Figure: two-level neural network, each level with input layer, hidden layer and output layer; output units HTM and nonHTM]
First level, sequence-to-structure: 21+3 input units per residue, plus 20 + 4 + 4 + 4 global units
Second level, structure-to-structure: 3+1 input units per residue, plus 20 + 4 + 4 + 4 global units
Input local in sequence: local alignment of 13 adjacent residues
Input global in sequence (global statistics for the whole protein):
• % AA: percentage of each amino acid in protein
• Length: length of protein (≤60, ≤120, ≤240, >240)
• ∆ N-term: distance: centre, N-term (≤40, ≤30, ≤20, ≤10)
• ∆ C-term: distance: centre, C-term (≤40, ≤30, ≤20, ≤10)
Example profile for seven alignment positions:
align    A    C    L    I    G    S    V  ins  del  cons
AAA    100    0    0    0    0    0    0    0    0  1.17
AA.    100    0    0    0    0    0    0   33    0  0.42
LLL      0    0  100    0    0    0    0    0   33  0.92
LII      0    0   33   66    0    0    0    0    0  0.74
AAG     66    0    0    0   33    0    0    0    0  1.17
CCS      0   66    0    0    0   33    0    0    0  0.74
GVV      0    0    0   33    0    0   66    0    0  0.48
Refine by dynamic programming on NN ‘energy’
[Figure: two panels of network output (0 to 1) versus residue number, labelled T and N]
PHDhtm: refine topology prediction
[Figure: the eight best HTM's with scores 0.92, 0.95, 0.93, 0.91, 0.90, 0.92, 0.87, 0.89; candidate models µ=0: 0 HTM, µ=1: 1 HTM, µ=2: 2 HTM, µ=3: 3 HTM; lipid membrane bilayer between the extra-cytoplasmic and intra-cytoplasmic sides]
Loop lengths (N-term to C-term): 5, 30, 6, 5
Charge: number of R+K in loops 1-4: Σ=2, Σ=5, Σ=3, Σ=1
Final prediction: ∆ = (5+1) - (2+3) > 0 => first loop out
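The charge rule in the final prediction can be written out as a short calculation. This sketch assumes loops are numbered from the N-terminus and that ∆ is the R+K count in even-numbered loops minus that in odd-numbered loops, as in the slide's example; treating ∆ ≤ 0 as "first loop in" is an assumption, since the slide shows only the positive case. The function name `first_loop_location` is hypothetical.

```python
def first_loop_location(rk_per_loop):
    """Place the first loop using the difference in positive charges (R+K)."""
    odd_loops = sum(rk_per_loop[0::2])   # loops 1, 3, ...
    even_loops = sum(rk_per_loop[1::2])  # loops 2, 4, ...
    delta = even_loops - odd_loops
    # More positive charge on the even loops puts them inside,
    # which leaves the first loop outside.
    location = "out" if delta > 0 else "in"
    return delta, location


# R+K counts for loops 1-4 from the slide: 2, 5, 3, 1
print(first_loop_location([2, 5, 3, 1]))
```

For the slide's counts this gives ∆ = (5+1) - (2+3) = 1, so the first loop is placed outside.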
PHDhtm on Poly-Valine
HEADER LIPOPROTEIN(SURFACE FILM)
COMPND PULMONARY SURFACTANT-ASSOCIATED POLYPEPTIDE C(SP-C)
SOURCE PIG (SUS SCROFA)
AUTHOR J.JOHANSSON,T.SZYPERSKI,T.CURSTEDT,K.WUTHRICH
AA      LRIPCCPVNLKRLLVVVVVVVLVVVVTVGALLMGL
OBS htm TTTTTTTTTTTTTTTTTTTTTTTTT
PHD htm TTTTTTTTTTTTTTTTTTTTTTTT
Example IS representative
Method/Subset        Nprot   Q      % correct segments   % correct topology
PHDhtm_fil           131     94.4   88.5 ± 3.1           82.4 ± 3.8
PHDhtm_ref           131     93.8   89.3 ± 3.1           86.3 ± 3.1
PHDhtm_ref            83     93.6   88.0 ± 3.6           85.5 ± 4.8
Jones et al., 1994    83            79.5 ± 3.7           77.1 ± 3.8
Eukaryotes            99     95.8   93.5 ± 3.2           90.3 ± 3.2
Prokaryotes           33     85.6   75.8 ± 9.1           72.7 ± 9.1
all HTM correct: 89.3 ± 3.1
topology correct: 86.3 ± 3.1
To be or not to be (HTM)
[Figure: network output H (0 to 1) versus residue number]
ϑ strict = 0.8, and ϑ loose = 0.7

False positives: globular proteins
Method | Nglob | Eglob
PHDhtm, ϑstrict = 0.8 | 435 | 1.6% ± 0.7%
PHDhtm, ϑloose = 0.7 | 435 | 3.7% ± 0.9%
PHDhtm_fil | 435 | 5.7% ± 1.1%
PHDhtm_fil (a) | 278 | 4.3% ± 1.4%
Jones et al., 1994 (b) | 155 | 3.2% ± 1.9%
Edelman, 1993 (c) | 14 | 21.4% ± 14.3%
Highlighted: ϑ=0.8: false 1.6 ± 0.7; ϑ=0.7: false 3.7 ± 0.9
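A minimal sketch of how the two thresholds could be applied, assuming a protein is labelled as a membrane protein when its strongest predicted helix scores at least ϑ. The decision rule, the use of ">=" and the function name are assumptions for illustration; the slides only give the threshold values and the resulting false-positive rates.

```python
THETA_STRICT = 0.8  # threshold values taken from the slide
THETA_LOOSE = 0.7


def is_membrane_protein(helix_scores, theta=THETA_STRICT):
    """Return True if the best helix score reaches the threshold.

    helix_scores holds one network score (0 to 1) per predicted helix;
    an empty list means no helix was predicted at all.
    """
    return bool(helix_scores) and max(helix_scores) >= theta


# A weak single helix passes the loose threshold only.
print(is_membrane_protein([0.75], THETA_STRICT))
print(is_membrane_protein([0.75], THETA_LOOSE))
```

The stricter threshold trades coverage for fewer globular proteins mislabelled as membrane proteins, which is the direction of the numbers in the table.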

Details PHDsec: Wrong alignment
• single sequences => accuracy clearly lower
• sufficient information in multiple alignment
– many sequences
– diversity
• wrong alignment -> wrong prediction
ID | %IDE | %WSIM | IFIR | ILAS | JFIR | JLAS | LALI | NGAP | LGAP | LSEQ
ftsh_ecoli | 1.00 | 1.00 | 1 | 644 | 1 | 644 | 644 | 0 | 0 | 644
ftsh_haein | 0.76 | 0.84 | 256 | 635 | 1 | 380 | 380 | 0 | 0 | 381
ftsh_bacsu | 0.50 | 0.62 | 3 | 630 | 6 | 637 | 623 | 6 | 14 | 637
ftsh_porpu | 0.48 | 0.59 | 5 | 604 | 9 | 623 | 598 | 5 | 19 | 628
ftsh_lacla | 0.46 | 0.57 | 1 | 638 | 12 | 695 | 635 | 7 | 52 | 695
ftsh_odosi | 0.45 | 0.56 | 2 | 611 | 5 | 644 | 609 | 5 | 32 | 644

Details PHDhtm: wrong for ‘safe’ alignment
        ....,....1....,....2....,....
AA      |MAKNLILWLVIAVVLMSVFQSFGPSESNG|
OBS htm | HHHHHHHHHHHHHHHHHHHH |
PHD htm | |
Rel htm |99999999999888889999999999999|

Details PHDhtm: correct for accurate alignment
        ....,....1....,....2....,....
AA      |MAKNLILWLVIAVVLMSVFQSFGPSESNG|
OBS htm | HHHHHHHHHHHHHHHHHHHH |
PHD htm | HHHHHHHHHHH |
Rel htm |88877651000000000001357899999|
PHDRhtm | HHHHHHHHHHHHHHHHHH |
PHDThtm |iiiiTTTTTTTTTTTTTTTTTTooooooo|

Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility
• Are 1D predictions useful?
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions

Defining residue solvent accessibility

PHDacc (first level only)
[Figure: neural network with input layer, hidden layer and output layer; output units labelled 0, 1, 2, 9, 16, 25, 36, 49, 64, 81; unit counts shown at the input: 20 4 4 4 and 21+3.]
Input global in sequence (global statistics, whole protein):
• percentage of each amino acid in protein (% AA)
• length of protein (≤60, ≤120, ≤240, >240)
• distance: centre, N-term (≤40, ≤30, ≤20, ≤10) (∆ N-term)
• distance: centre, C-term (≤40, ≤30, ≤20, ≤10) (∆ C-term)
Input local in sequence (local alignment, 13 adjacent residues):
Alignment columns shown: AAA, AA., LLL, LII, AAG, CCS, GVV
A | C | L | I | G | S | V | ins | del | cons
100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.17
100 | 0 | 0 | 0 | 0 | 0 | 0 | 33 | 0 | 0.42
0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 33 | 0.92
0 | 0 | 33 | 66 | 0 | 0 | 0 | 0 | 0 | 0.74
66 | 0 | 0 | 0 | 33 | 0 | 0 | 0 | 0 | 1.17
0 | 66 | 0 | 0 | 0 | 33 | 0 | 0 | 0 | 0.74
0 | 0 | 0 | 33 | 0 | 0 | 66 | 0 | 0 | 0.48
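The residue part of the local input can be illustrated with a short Python sketch that turns one alignment column into percentages. It assumes that gaps are ignored when counting and that percentages are truncated to integers, which reproduces values such as 33 and 66 in the table; the ins, del and cons columns are not computed here, and the function name is hypothetical.

```python
from collections import Counter


def profile_column(column, gap_chars=".-"):
    """Return {residue: integer percentage} for one alignment column.

    Gap characters are skipped, so the percentages refer to the
    sequences that actually have a residue at this position.
    """
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return {}
    counts = Counter(residues)
    return {aa: 100 * n // len(residues) for aa, n in counts.items()}


for col in ["AAA", "AA.", "LLL", "LII", "AAG", "CCS"]:
    print(col, profile_column(col))
```

Feeding the network such a profile for each of the 13 window positions, rather than a single residue, is what brings the evolutionary information into the prediction.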

Evolution for accessibility prediction
• Detailed prediction problematic
• Significant gain by evolutionary information: in/out with > 75% accuracy!

PHDacc: the un-g(l)ory details
• accuracy > 75% (two states: buried, exposed)
• distribution with ≈ 10%
• stronger predictions more accurate
• WARNING: reliability index almost factor 2 too large for single sequences
• accuracy below average for intermediate state
• VERY dependent on alignment accuracy

Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility + 5% -> 75%
• Are 1D predictions useful? Of course to experts
– sub-cellular localisation
– whole genomes
– 3D structure: threading
– floppy regions

Simplistic perspective of sub-cellular location
[Figure: schematic of a cell with three compartments: EXTRACELLULAR, NUCLEAR, CYTOPLASMIC.]

Residue composition projected onto first two principal components
[Figure: three scatter plots (surface residues, core residues, all residues); each point is one protein, labelled c, e or n. Axis loadings as printed:
- 0.73 C + 0.26 I + 0.55 L
- 0.6 G - 0.32 T - 0.21 N + 0.23 E + 0.4 K + 0.44 R
- 0.62 G - 0.26 V + 0.25 K + 0.32 E + 0.54 R]

Surface composition projected onto first two principal components
[Figure: two scatter plots on the same axes (-0.2 to 0.2); axis loading as printed: - 0.4 G - 0.35 N - 0.25 T - 0.21 S + 0.21 E + 0.47 K + 0.52 R. One panel labels its points g, i and o; the other labels its points c, e and n.]

Average surface composition
[Figure: average surface composition (axis 5 to 15) per amino acid (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) for extracellular, cytoplasmic and nuclear proteins; residues are annotated with pairwise comparisons such as N<C, E<C, E<N, C<E, N<E and C<N.]

Electrostatic properties
Compartment | positive | negative | polar | apolar
extracellular | 7% | 9% | 50% | 34%
cytoplasmic | 19% | 19% | 29% | 33%
nuclear | 26% | 15% | 27% | 32%

Shuttle into the nucleus
[Figure: transport between CYTOPLASM and NUCLEUS; labels: NLS, M9, Transportin, Importin.]

How many NLS motifs in databases?
• ONE in PROSITE: bi-partite motif
Set (A) | N NLS (B) | Nprot nuc (C) | Nfam nuc (D) | Accuracy (E) | Coverage (F)
PROSITE | 1 | 96 | 31 | 90% | 3%
SWISS-PROT | 322 | 290 | | n.a. | 9%
NLS-lit cleaned | 91 | 309 | 35 | 100% | 10%
NLS-lit consensus | 91 | 537 | 35 | 100% | 17%
PredictNLS_DB | 214 | 1354 | 186 | 100% | 43%

Experimental NLS: positive charges
NLS | Protein | Reference
RKRKK | YstDNApolalpha | Hsieh et al., 1998
RKRRR | Amida | Irie et al., 2000
KKKKRKREK | LEF-1 | Prieve et al., 1998
KKKRRSREK | TCF-1 | Prieve et al., 1998
RQARRNRRRRWR | HIV-1 Rev | Truant et al., 1999
RRMKWKK | PDX-1 | Moede et al., 1999
PKKKRKV | SV40 LrgT | Kalderon et al., 1984
PRRRK | SRY | Sudbeck and Scherer, 1997
GKKRSKA | H2B | Moreland et al., 1987
KAKRQR | v-Rel | Gilmore and Temin, 1988
RGRRRRQR | Amida | Irie et al., 2000
PPVKRERTS | RanBP3 | Welch et al., 1999
PYLNKRKGKP | Pho4p | Welch et al., 1999
KRx{7,9}PQPKKKP | p53-NLS1 | Liang and Clarke, 1999
KVTKRKHDNEGSGSKRPK | Hum-Ku70 | Koike et al., 1999
RLKKLKCSKx{19}KTKR | GAL4 | Chan et al., 1998
RKRIREDRKx{18}RKRKR | TCPTP | Chan et al., 1998
RRERx{4}RPRKIPR | BDV-P | Schwemmle et al., 1999
KKKKKEEEGEGKKK | act/inh betaA | Blauer et al., 1999
PRPRKIPR | BDV-P | Shoya et al., 1998
PPRIYPQLPSAPT | BDV-P | Shoya et al., 1998
KDCVINKHHRNRCQYCRLQR | TR2 | Yu et al., 1998
APKRKSGVSKC | PolyomaVP1 | Chang et al., 1992
RKKRRQRRR | HIV-1 Tat | Truant et al., 1999
MPKTRRRPRRSQRKRPPT | Rex | Palmeri and Malim, 1999
KRPMNAFIVWSRDQRRK | SRY | Sudbeck and Scherer, 1997
KRPMNAFMVWAQAARRK | SOX9 | Sudbeck and Scherer, 1997
PPRKKRTVV | NS5A | Ide et al., 1996
YKRPCKRSFIRFI | DNAse EBV | Liu et al., 1998
LKDVRKRKLGPGH | DNAse EBV | Lyons et al., 1987
KRPRP | AdenovE1a | Bouvier and Baldacci, 1995
RRSMKRK | hVDR | Vihinen-Ranta et al., 1997
PAKRARRGYK | CPV capsid | Kaneko et al., 1997
RKCLQAGMNLEARKTKK | hGlu.cort. | Kaneko et al., 1997
RRERNKMAAAKCRNRRR | CFOS | Kaneko et al., 1997
KRMRNRIAASKCRKRKL | CJUN | Kaneko et al., 1997

Experimental NLS: more complicated
NLS | Protein | Reference
CYGSKNTGAKKRKIDDA | DNAhelicaseQ1 | Miyamoto et al., 1997
[AKR]TPIQKHWRPTVLTEGPPV KIRIETGEWE[KA] | ASVintegrase | Kukolj G. 1998
GGGx{3}KNRRx{6}RGGRN | Nab2 | Truant et al., 1998
KRxxxxxxxxxKTKK | THOV NP | Weber et al., 1998
EYLSRKGKLEL | VirD2-Nterm | Tinland et al., 1992
KRPACTL KPECVQQLLVCSQEA KK | HCDA | Somasekaram et al., 1999
RVHPYQR | QKI-5 | Wu et al., 1999
 | HARNT | Eguchi et al., 1997
YNNQSSNFGPMKGGN | M9 | Bonifaci et al., 1997
SxGTKRSYxxM | InfluenzaNP | Wang et al., 1997
TKRSxxxM | InfluenzaNP | Wang et al., 1997
VNEAFETLKRC | MyoD | Vandromme et al., 1995
MNKIPIKDLLNPG | Mat-alpha | Hall et al., 1984
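The motif notation in these tables maps almost directly onto regular expressions, which makes scanning a sequence for known NLS motifs straightforward. The sketch below assumes that a lower-case x is a wildcard for any residue and that repeat counts such as x{7,9} and classes such as [AKR] mean what they mean in regex syntax. The function names and the toy sequence are hypothetical and are not the PredictNLS implementation.

```python
import re


def nls_to_regex(motif):
    """Translate the slide notation into a compiled regular expression.

    Residues are upper case, so replacing every lower-case 'x' with '.'
    is enough; 'x{7,9}' becomes '.{7,9}' and '[AKR]' is already valid.
    """
    return re.compile(motif.replace("x", "."))


def find_nls(sequence, motifs):
    """Return (motif, start index, matched text) for every hit."""
    hits = []
    for motif in motifs:
        for match in nls_to_regex(motif).finditer(sequence):
            hits.append((motif, match.start(), match.group()))
    return hits


MOTIFS = ["PKKKRKV", "KRx{7,9}PQPKKKP", "KRxxxxxxxxxKTKK"]

# Toy sequence (made up): an SV40-like motif plus a bipartite motif.
toy = "MAPKKKRKVED" + "AA" + "KR" + "A" * 9 + "KTKK" + "AA"
for hit in find_nls(toy, MOTIFS):
    print(hit)
```

Motifs that are printed with a space inside, such as the ASV integrase entry, would need the space resolved before they can be compiled this way.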

In silico mutagenesis

Increasing accuracy and coverage
Set (A) | N NLS (B) | Nprot nuc (C) | Nfam nuc (D) | Accuracy (E) | Coverage (F)
PROSITE | 1 | 96 | 31 | 90% | 3%
SWISS-PROT | 322 | 290 | | n.a. | 9%
NLS-lit cleaned | 91 | 309 | 35 | 100% | 10%
NLS-lit consensus | 91 | 537 | 35 | 100% | 17%
PredictNLS_DB | 214 | 1354 | 186 | 100% | 43%

Nuclear protein in proteomes
Genome | No ORFs | No prot with NLS | Estimated % nuclear
Human | 13933 | 1311 | > 22% (F)
Drosophila | 14219 | 1256 | > 21%
C. elegans | 16232 | 1141 | > 17%
Yeast | 6307 | 479 | > 18%
E. coli | 4286 | 54 | 0%
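One plausible reading of the "estimated % nuclear" column is that the fraction of ORFs with an NLS hit is scaled up by the coverage of the motif set (43% for PredictNLS_DB). This is an assumption, not a formula stated on the slides: it reproduces the human, Drosophila and yeast figures to the nearest percent but gives roughly 16% rather than 17% for C. elegans, and it does not explain the 0% entered for E. coli. The Python sketch below, with a hypothetical function name, only illustrates that reading.

```python
def estimated_nuclear_fraction(n_with_nls, n_orfs, coverage):
    """Scale the observed NLS fraction by the coverage of the motif set.

    coverage is the fraction of known nuclear proteins that the motifs
    detect (0.43 is taken from the PredictNLS_DB row); the scaling
    itself is an assumption of this sketch.
    """
    return (n_with_nls / n_orfs) / coverage


PROTEOMES = {  # (number of ORFs, proteins with NLS), from the table
    "Human": (13933, 1311),
    "Drosophila": (14219, 1256),
    "C. elegans": (16232, 1141),
    "Yeast": (6307, 479),
}

for name, (n_orfs, n_nls) in PROTEOMES.items():
    estimate = estimated_nuclear_fraction(n_nls, n_orfs, 0.43)
    print(f"{name}: about {100 * estimate:.0f}% nuclear")
```

Because the motifs detect only part of the nuclear proteins, any such estimate is a lower bound, which fits the ">" signs in the table.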

Un-annotated nuclear proteins with NLS
• ATAXIN-1 GERGHGGG
• Breast Cancer type2 (Brc2) RIKKKQR
• Fibroblast Growth factor (fgf) KKRRRRR
• Brg1 ERKRRQ

Using NLS to bind DNA

DNA-binding predictions in proteomes
Genome | Nprot | Nprot bind-DNA predicted | Nprot bind-DNA known
Human | 13933 | 419 | 141
Drosophila | 14219 | 300 | 37
C. elegans | 16232 | 251 | 10
Yeast | 6307 | 67 | 10
E. coli | 4286 | 13 | 3

Rotation @ CUBIC.bioc.columbia.edu
• want all cell-cycle proteins
• search in SWISS-PROT, PROSITE
• search literature
• build ‘expert’ set of known

Significant motifs
AFWKLMDDSEQ
GFWKLMDESNQ

AFWKLMDDSEQ
GFWRISAEPNN

Rotation @ CUBIC.bioc.columbia.edu
• want all cell-cycle proteins
• search in SWISS-PROT, PROSITE
• search literature
• build ‘expert’ set of known
• choose unique subset

Finding unique subsets of proteins
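One common way to choose a unique subset is a greedy filter: walk through the proteins and keep one only if it is not too similar to anything already kept. The Python sketch below assumes pre-aligned sequences of equal length, a plain percentage-identity measure and a 30% cut-off; all three are illustrative assumptions, and the slides do not say which procedure was actually used.

```python
def percent_identity(a, b):
    """Percentage of identical residues over aligned, non-gap positions.

    Assumes a and b are already aligned ('-' marks a gap).
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def unique_subset(proteins, max_identity=30.0):
    """Greedily keep proteins that stay below max_identity to all kept ones.

    proteins is a list of (name, aligned_sequence); earlier entries win.
    """
    kept = []
    for name, seq in proteins:
        if all(percent_identity(seq, other) <= max_identity for _, other in kept):
            kept.append((name, seq))
    return kept


# Toy input built from the three motif strings shown on the slides.
toy = [
    ("p1", "AFWKLMDDSEQ"),
    ("p2", "GFWKLMDESNQ"),
    ("p3", "GFWRISAEPNN"),
]
print([name for name, _ in unique_subset(toy)])
```

The result of a greedy filter depends on the input order, so sorting the candidates first (for example by annotation quality) is a design choice worth making explicit.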

Similar sequence -> similar structure?
[Figure: curves for identity and similarity (axis 0 to 100) against number of residues aligned (0 to 250). Source: B Rost 1999 Prot. Engin. 12, 85-94.]

Rotation @ CUBIC.bioc.columbia.edu
• want all cell-cycle proteins
• search in SWISS-PROT, PROSITE
• search literature
• build ‘expert’ set of known
• choose unique subset
• find motifs
… sorry, time ran out here!

Retention signals in ER and Golgi
Sequence motif (1) | Total N | Eukaryotes N | % | Non-Eukaryotes N | % | ER/Golgi N | % | Non-ER/Non-Golgi N | %
Endoplasmic reticulum (ER) motifs: (2)
KDEL-C-term | 61 | 55 | 90 | 6 | 10 | 56 | 92 | 5 | 8
KDEL | 775 | 455 | 59 | 320 | 41 | 61 | 7 | 714 | 92
HDEL-C-term | 49 | 49 | 100 | 0 | 0 | 45 | 92 | 4 | 8
HDEL | 315 | 185 | 59 | 130 | 41 | 46 | 15 | 269 | 2
HDEF-C-term | 4 | 3 | 75 | 1 | 25 | 2 | 50 | 2 | 50
HDEF | 91 | 50 | 55 | 41 | 45 | 2 | 2 | 89 | 98
KKXX-C-term | 907 | 492 | 52 | 415 | 48 | 53 | 6 | 854 | 94
KKXX | 57848 | 32493 | 56 | 25355 | 44 | 810 | 1 | 57038 | 99
XXRR | 51849 | 28043 | 56 | 23806 | 46 | 688 | 1 | 51161 | 99
KKFF-C-term | 4 | 3 | 75 | 1 | 25 | 1 | 25 | 3 | 75
KKFF | 261 | 168 | 64 | 93 | 36 | 5 | 2 | 256 | 98
KKAA-C-term | 22 | 7 | 22 | 15 | 68 | 5 | 23 | 17 | 77
KKAA | 995 | 600 | 60 | 395 | 40 | 24 | 3 | 964 | 97
Golgi apparatus motifs: (3)
YQRL | 273 | 137 | 50 | 136 | 50 | 3 | 1 | 270 | 99
YKGL | 447 | 237 | 54 | 210 | 46 | 5 | 1 | 442 | 99
YHPL | 80 | 40 | 50 | 40 | 50 | 4 | 5 | 76 | 95
YXXZ | 83589 | 44335 | 53 | 39234 | 47 | 477 | 1 | 83112 | 99
NPFKD | 14 | 12 | 86 | 2 | 14 | 0 | 0 | 14 | 100
FXFXD | 3200 | 1762 | 55 | 1438 | 45 | 31 | 1 | 3169 | 99
FQFND | 4 | 1 | 25 | 3 | 75 | 1 | 25 | 3 | 75
PXPXP | 8542 | 6043 | 71 | 2499 | 29 | 65 | 1 | 8477 | 99
[DE]X[DE] | 80940 | 42436 | 53 | 38504 | 47 | 479 | 1 | 80461 | 99
GRIP-motif (5) | 2 | 2 | 100 | 0 | 0 | 1 | 50 | 1 | 50
GRIP-motif (shortened) (6) | 29 | 17 | 59 | 12 | 41 | 1 | 3 | 28 | 97
C-term variations: (4)
PROSITE Pattern (7) | 173 | 151 | 88 | 22 | 12 | 134 | 77 | 39 | 23
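The table separates motifs found anywhere in a sequence from motifs found at the C-terminus, and the C-terminal variants are far more specific for ER/Golgi proteins. A minimal Python sketch of that distinction follows; it assumes that an upper-case X stands for any residue and leaves placeholders such as Z in YXXZ unhandled. The function name and the toy sequences are hypothetical.

```python
import re


def has_motif(sequence, motif, c_terminal=False):
    """Test for a retention motif, optionally anchored at the C-terminus.

    'X' in the motif is treated as a wildcard for any residue; classes
    such as [DE] are passed through unchanged as regex syntax.
    """
    pattern = motif.replace("X", ".")
    if c_terminal:
        pattern += "$"  # the motif must be the last residues of the chain
    return re.search(pattern, sequence) is not None


# Toy sequences (made up) showing the difference between the two tests.
print(has_motif("MSTAAKDEL", "KDEL", c_terminal=True))
print(has_motif("MKDELSTAA", "KDEL", c_terminal=True))
print(has_motif("MKDELSTAA", "KDEL"))
```

Short or wildcard-rich motifs such as KKXX match tens of thousands of sequences when unanchored, which is why the anchored test matters in practice.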

Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility + 5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes
– 3D structure: threading
– floppy regions

Distribution of ORF lengths
[Figure: percentage of ORFs in entire genome against length of ORF (0 to 600), in three panels:
Eukaryotes: caeel, human, yeast (with a second plot covering lengths 1000 to 2500)
Archaea: aerpe, arcfu, metja, mettm, pyrab, pyrho, thema
Prokaryotes: aquae, bacsu, borbu, chlpn, chltr, deira, ecoli, haein, helpy, mycge, mycpn, myctu, ricpr, syny3, trepa]

Family size
[Figure: cumulative percentage of proteins against number of proteins in family (0 to 100), in three panels:
Archaea: aerpe, arcfu, metja, mettm, pyrab, pyrho (Aeropyrum pernix K1 highlighted)
Prokaryotes: aquae, bacsu, borbu, camje, chlpn, chltr, deira, ecoli, haein, helpy, mycge, mycpn, myctu, neime, ricpr, syny3, thema, trepa, ureur
Eukaryotes: yeast, caeel, drome, human, hs22]

Structure prediction for protein universe
[Figure: bar charts per proteome: percentage of proteins in the proteome (0 to 50) and percentage of residues in the proteome (0 to 40).
Archaea: A pernix, A fulgidus, M jannaschii, M thermoautotrophicum, P abyssi, P horikoshii
Prokaryotes: A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, U urealyticum
Eukaryotes: S cerevisiae, C elegans, D melanogaster, H sapiens (SP/TrEmbl), H sapiens (chr 22)]

Do we aim at getting one structure per fold?
• Structural proteomics = hunt for new folds? Tough task for theory!
-> Practice: Shrink complexes: 14747 technicians!
• Can we avoid non-globular proteins?
• Can we prioritise aspects of function?

Similar amino acid composition
[Figure: amino acid composition (axis 5% to 20%) in six panels: Aeropyrum pernix K1, Archaeoglobus fulgidus, Escherichia coli, Bacillus subtilis, Yeast, Caenorhabditis elegans.]
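The quantity plotted here, the percentage of each amino acid, is simple to compute for a single sequence or for a whole proteome concatenated into one string. A minimal Python sketch follows; the function name is hypothetical, and the handling of non-standard letters (they are ignored) is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return {amino acid: percentage} over the 20 standard residues.

    Letters outside the standard alphabet (for example X) are ignored,
    so the percentages always sum to 100 for a non-empty sequence.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


comp = composition("LRIPCCPVNLKRLLVVVVVVVLVVVVTVGALLMGL")
for aa in sorted(comp, key=comp.get, reverse=True)[:3]:
    print(aa, round(comp[aa], 1))
```

The same 20 percentages are what the global input of PHDacc uses and what the principal-component plots of composition are built from.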

Inventory of life: membrane proteins
[Figure: bar chart of %mem (0 to 30) per proteome.
Archaea: A pernix, A fulgidus, M jannaschii, M thermoautotrophicum, P abyssi, P horikoshii
Prokaryotes: A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, U urealyticum
Eukaryotes: S cerevisiae, C elegans, D melanogaster, H sapiens (SP/TrEmbl), H sapiens (chr 22)]

Number of membrane helices -> complexity?
[Figure: cumulative percentage of membrane proteins (0 to 100) against number of transmembrane helices (0 to 20) for Archaea, Prokaryote and Eukaryote.]

Membrane proteins: kingdoms invented different tricks
[Figure: twelve panels (axis 0 to 40 against 1 to 17) with two series, in and out: aerpe, arcfu, metja, pyrho; bacsu, camje, ecoli, haein; yeast, caeel, drome, human.]

The membrane LEGO

Length of globular regions in membrane proteins
[Figure: percentage of globular regions (0 to 50) against length of globular regions in membrane proteins (100 to 700), intracellular versus extracellular, in six panels: Aeropyrum pernix K1, Archaeoglobus fulgidus, Bacillus subtilis, Escherichia coli, Caenorhabditis elegans, Drosophila melanogaster.]

Inventory of life: coiled-coil proteins
[Figure: bar charts of %mem (0 to 30) and %coils (0 to 12) per proteome.
Archaea: A pernix, A fulgidus, M jannaschii, M thermoautotrophicum, P abyssi, P horikoshii
Prokaryotes: A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, U urealyticum
Eukaryotes: S cerevisiae, C elegans, D melanogaster, H sapiens (SP/TrEmbl), H sapiens (chr 22)]

Coiled-coil proteins: details
[Figure, first set of panels: percentage of coiled-coil proteins (0 to 80) against number of coiled-coil regions (1 to 7) for aerpe, arcfu, bacsu, ecoli, caeel, human.]
[Figure, second set of panels: percentage of coiled-coil regions (0 to 80) against length of coiled-coil regions (28 to 252) for aerpe, arcfu, bacsu, ecoli, caeel, human.]

Inventory of life: compartments
[Figure: bar charts per proteome of % extra-cellular (5 to 25), % membrane (0 to 30) and % nuclear (5 to 20).
Archaea: A pernix, A fulgidus, M jannaschii, M thermoautotrophicum, P abyssi, P horikoshii
Prokaryotes: A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, U urealyticum
Eukaryotes: S cerevisiae, C elegans, D melanogaster, H sapiens (SP/TrEmbl), H sapiens (chr 22)]

Protein structure universe
Systematic discovery of target structures

Distribution of protein length
[Figure: cumulative distribution of ORF lengths for Eukaryotes (caeel, human, yeast); cumulative percentage (0 to 100) against length of ORF (0 to 1000).]

Bottleneck 5: money ...
• Goal: 500 in 5 years
• money: total of $ 25 M in 5 years
50,000,000,000 Lire

What will we get?
• many new structures
• the machinery for structural genomics
• some weird structures ...

Recipe to determine targets
• Is it a known structure?
• Is it similar to a known structure?
• Is it a membrane protein?
• Does it look like a known fold?
• Does it look like a globular protein?
• Is it a big family?
• Is it short (NMR)? Does it contain Met (MAD)?
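The recipe reads like a chain of filters, and a small Python sketch can make that explicit. Everything specific in it is an assumption: the slide asks the questions but does not say which answers exclude a protein, so the sketch drops known or similar structures, membrane proteins, known folds, non-globular proteins and small families, and uses made-up cut-offs (family size 10, NMR length limit 150). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    length: int
    n_met: int
    family_size: int
    known_structure: bool = False
    similar_to_known_structure: bool = False
    membrane: bool = False
    looks_like_known_fold: bool = False
    looks_globular: bool = True


def triage(c, min_family_size=10, nmr_max_length=150):
    """Return (keep, note): keep is False with the first failed check."""
    if c.known_structure:
        return False, "known structure"
    if c.similar_to_known_structure:
        return False, "similar to known structure"
    if c.membrane:
        return False, "membrane protein"
    if c.looks_like_known_fold:
        return False, "looks like known fold"
    if not c.looks_globular:
        return False, "not globular"
    if c.family_size < min_family_size:
        return False, "small family"
    methods = []
    if c.length <= nmr_max_length:
        methods.append("NMR")   # short enough for NMR (assumed limit)
    if c.n_met > 0:
        methods.append("MAD")   # methionine present, so MAD phasing is an option
    return True, "+".join(methods) or "no method hint"


print(triage(Candidate("orf1", length=120, n_met=3, family_size=25)))
print(triage(Candidate("orf2", length=400, n_met=0, family_size=25, membrane=True)))
```

Ordering the cheap, decisive checks first keeps the filter fast on whole proteomes; the order used here simply follows the slide.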

Alternative recipe to determine targets
• Do we have a crystal?
• Is it a known structure?
• Is it similar to a known structure?

Reality check: the invaluable contribution of bioinformatics to target selection
[Flowchart: Protein expressed? (YES/NO) -> Protein purified/well behaved? (YES/NO) -> Crystal? (YES/NO) -> Known structure? YES: do structure; NO: do structure, anyways.]

Target selection
[Figure: three axes: experimental feasibility, function space, structure space.]

Priority classes
• Experimental feasibility
• Biophysical properties
– length
– presence of Methionine
• Bioinformatics criteria
– similarity to known structure
– family size
– functional annotation
• Functional genomics

Target selection machinery

Conclusions: Structural Genomics
• we get:
– most major functional elements
– most structural scaffolds
– evolutionary links
– structure-based comparison
– high-throughput techniques
• we won’t get:
– complexes
– interaction between them
– particular structures
• when?
– 70% of the human genome by 2010 2015
– remainder = HTMs?

Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility + 5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes: kingdoms differ in some respects!
– 3D structure: threading
– floppy regions

Midnight zone STRONGLY populated
[Figure: number of structure pairs (0 to 1600) against percentage pairwise sequence identity; main plot 0 to 25, second scale 25 to 100.]

What we are threading for
[Figure: sequence identity (0 to 100) against number of residues aligned; upper region: "Sequence identity implies structural similarity!"; lower region: "Don't know region".]

Goals of fold recognition, threading, remote homology modelling
• Recognising similar fold(s) (entire proteins)
• Detecting remote homologies for fragments (part of protein)
• Align target and fold
• Remote homology modelling (prediction in 3D)

Two paths to fold recognition
[Figure: one path runs from the sequence of unknown structure, Seq (U), through PHD to predicted 1D strings (PHD 1, PHD 2, PHD 3, ..., PHD n); the other runs from known 3D structures in PDB (Str 1, Str 2, Str 3, ..., Str n; examples 1aap, 1tcp, 1btr) through a 1D projection (sec, acc) to 1D strings and a profile.]

TOPITS
Predict 1D structure from sequence:
• input: sequence
LWQRPLVTIKIGGQLKEALLDTGAD
• generate sequence alignment
LWQRPLVTIKIGGQLKEALLDTGAD
LWRRPVVTAHIEGQLVEVLLDTGAD
DRPLVRVILTNTGstALLDSGAD
LEKRPTTIVLINDTPLNVLLDTGAD
:
• predict 1D structure
Project known 3D structure onto 1D.
1D strings shown (secondary structure, then accessibility; one pair predicted, one pair from the known structure):
-----EEEEE----EEEEEE-----
oooo•o•o•o•ooooo•ooooo•oo
-----EEEEE-----EEHHHH----
o•oo•••••o•ooo•oo•••oo••o
note: exposed = o, buried = •
• align predicted and known structure(s)
• good match to one of the known structures? =>
– predict fold of matching structure
– model 3D coordinates by homology
[Figure: three plots against percentage of pairwise sequence identity (0 to 30); two with a vertical axis from 55 to 85, one with a vertical axis from 0 to 200.]
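The core idea, comparing predicted 1D strings of the query with 1D strings projected from a known structure, can be sketched with a position-wise score. The Python sketch below is a deliberate simplification under stated assumptions: it scores an ungapped, equal-length comparison, uses plain identity instead of a substitution matrix, and mixes sequence and structure with a weight mu given in percent (0 = sequence only, 100 = structure only). TOPITS itself aligns the strings with gaps; the function name and the equal weighting of secondary structure and accessibility are hypothetical.

```python
def one_d_match_score(seq_a, sec_a, acc_a, seq_b, sec_b, acc_b, mu=50):
    """Average per-position score between two proteins in 1D.

    seq_*: amino acid strings; sec_*: secondary structure strings;
    acc_*: accessibility strings. All six must have the same length.
    mu is the percentage weight given to the structure terms.
    """
    n = len(seq_a)
    if n == 0 or any(len(s) != n for s in (sec_a, acc_a, seq_b, sec_b, acc_b)):
        raise ValueError("all strings must share one non-zero length")
    w = mu / 100.0
    total = 0.0
    for i in range(n):
        structure = 0.5 * (sec_a[i] == sec_b[i]) + 0.5 * (acc_a[i] == acc_b[i])
        sequence = 1.0 if seq_a[i] == seq_b[i] else 0.0
        total += w * structure + (1.0 - w) * sequence
    return total / n


# The two 1D string pairs shown on the slide, compared on structure only.
sec1, acc1 = "-----EEEEE----EEEEEE-----", "oooo•o•o•o•ooooo•ooooo•oo"
sec2, acc2 = "-----EEEEE-----EEHHHH----", "o•oo•••••o•ooo•oo•••oo••o"
query = "LWQRPLVTIKIGGQLKEALLDTGAD"
print(one_d_match_score(query, sec1, acc1, query, sec2, acc2, mu=100))
```

A structure-only score ignores the residues entirely, so two unrelated sequences with the same 1D strings score perfectly; mixing in the sequence term is one way to temper that.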

Prediction-based threading
[Flow diagram with the components: SWISS-PROT, BLAST, PHDsec, PHDacc, DSSP, MaxHom.]
Example of remote sequence identity
1tcp-3aapA identity = 16% ; AS = 68% ; ali% = 51%
Protease inhibitor domain of Alzheimer's Amyloid (1aap)
Blood coagulation inhibitor (1tcp)
Alignment (column spacing lost in extraction; row labels: 1aap, 1tcp):
EEEEEEE EEEEEEE HHHHHHHH
SEQ....AETGPCRAMISRWYFDVTEGKCAPFFYGGCGG.NRNNFDTEEYCMAVC
////////////////// ||||||||||||||///// ||||||||||
...SEQAETGPCRAMISRWYFDVT.EGKCAPFFYGGCGGNRNNF.DTEEYCMAVC
...RDWIDECDSNEGGERAYFRNG.KGGCDSFWICPEDHTGADYYSSYRDCFNAC
HHHH EEEEE EEEEE HHHHHHH
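A figure such as "identity = 16%" is a percentage of identical residues over aligned positions. A minimal sketch of that calculation, assuming gaps are written as "." or "-" and that only columns without a gap count as aligned (the slide does not state its exact definition, and `percent_identity` is a hypothetical name):

```python
GAP_CHARS = {".", "-"}


def percent_identity(row_a, row_b):
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # skip columns that contain a gap
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0
```

Other conventions divide by the length of the shorter sequence or of the full alignment, so values from different tools are not always directly comparable.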
30% correct first, better if stronger
[Figure: y-axis 0 to 100; x-axis: percentage of pairs predicted at given z-score (coverage), ticks 10, 14, 22, 29, 68, 92, 100. Labels: all, z > 2, z > 2.5, z > 3, z > 3.5, z > 4, z > 4.5.]
[Figure: y-axis 10 to 70; x-axis: rank R of first correctly detected remote homologue (2 to 10). Curves: µ=50, sequence (Blosum62) + 1D structure; µ=50, sequence (McLachlan) + 1D structure; µ=100, structure only; µ=0, sequence only (McLachlan).]
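The µ values on this slide read as a percentage weight on the 1D structure term: µ = 0 is sequence only, µ = 100 is structure only. A minimal sketch of such a mixed score and of a z-score cut-off on the top hit, assuming a linear mix and a z-score taken over all candidate scores for one query (the slides define neither). The identity-based sequence term is a stand-in for the McLachlan or Blosum62 matrices named on the slide, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def mixed_score(res_a, res_b, state_a, state_b, mu=50):
    """Per-position score: (100 - mu)% sequence term plus mu% structure term."""
    seq_term = 1.0 if res_a == res_b else 0.0      # stand-in for a substitution matrix
    str_term = 1.0 if state_a == state_b else 0.0  # 1D structure states agree?
    w = mu / 100.0
    return (1.0 - w) * seq_term + w * str_term


def z_scores(scores):
    """Z-score of each alignment score relative to all scores for one query."""
    avg = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        return [0.0] * len(scores)
    return [(s - avg) / sd for s in scores]


def first_hit(scores_by_fold, z_min=3.0):
    """Return the top-ranked fold if its z-score exceeds z_min, else None."""
    names = list(scores_by_fold)
    zs = z_scores([scores_by_fold[name] for name in names])
    best = max(range(len(names)), key=lambda i: zs[i])
    return names[best] if zs[best] > z_min else None
```

Raising `z_min` accepts fewer first hits but with higher confidence, which is the accuracy-versus-coverage trade-off behind the z > 2 to z > 4.5 thresholds on the slide.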
Other threading methods
• TOPITS is not the best!
• CASP: PredictionCenter.llnl.gov/content.html
• CAFASP: www.cs.bgu.ac.il/~dfischer/CAFASP2/
• EVA: cubic.bioc.columbia.edu/eva/
• CUBIC links: cubic.bioc.columbia.edu/doc/links_index.html
Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility +5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes: kingdoms differ in some respects!
– threading: better than sequence alignment!
– floppy regions (NORS: no regular secondary structure)
Long floppy regions
• less than 5% helix or strand over > 70 residues
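The criterion above can be turned into a simple scan over a secondary structure string. This is a minimal sketch, assuming a three-state string in which "H" and "E" mark helix and strand, and a sliding window of 71 residues as the shortest stretch longer than 70; `find_nors` is a hypothetical name, not the implementation behind the slides.

```python
def find_nors(sec, window=71, max_structured=0.05):
    """Return half-open (start, end) regions covered by low-structure windows.

    A window qualifies if fewer than max_structured of its residues are
    helix (H) or strand (E). Overlapping qualifying windows are merged.
    """
    n = len(sec)
    if n < window:
        return []
    # prefix[i] = number of H/E residues in sec[:i]
    prefix = [0]
    for s in sec:
        prefix.append(prefix[-1] + (1 if s in ("H", "E") else 0))
    flagged = [False] * n
    for start in range(n - window + 1):
        structured = prefix[start + window] - prefix[start]
        if structured / window < max_structured:
            for k in range(start, start + window):
                flagged[k] = True
    # Merge consecutive flagged residues into regions.
    regions = []
    region_start = None
    for i, is_flagged in enumerate(flagged):
        if is_flagged and region_start is None:
            region_start = i
        elif not is_flagged and region_start is not None:
            regions.append((region_start, i))
            region_start = None
    if region_start is not None:
        regions.append((region_start, n))
    return regions
```

Because a window may contain a few structured residues and still stay under 5%, the reported regions can extend slightly into flanking helices or strands.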
Floppy loops between domains
• Formate Dehydrogenase H (1aa6.pdb)
• phiX174 virion (1al0F.pdb)
• Isoamylase (1bf2.pdb)
• DNA-containing capsid of CPV (4dpv.pdb)
Floppy ends
• Pyruvate:ferredoxin oxidoreductase (1b0pA.pdb)
• Capsid protein of CPV (1b35C.pdb)
• Hexon from adenovirus type 2 (1dhx.pdb)
• Myeloperoxidase (1mhlA.pdb)
• Aspartate aminotransferase (2aat.pdb)
• Prothrombin fragment 2 (2hppP.pdb)
• SH3 domain of PLC-gamma (1hsq.pdb)
• Hydroxylase component of MMOH (1mtyB.pdb)
Floppy-wrap
• SH3 and adjacent ligand site (1awj.pdb)
• Erythrocyte catalase (7cat.pdb)
• GmDNV capsid protein (1dnx.pdb)
• Cellulase (1tf4A.pdb)
• Phosphoglycerate mutase (3pgm.pdb)
• Carboxypeptidase T (1obr.pdb)
Weirdoes
• Extracellular domain of T beta RI (1tbi.pdb)
• HIVZ2 Tat protein (1tac.pdb)
• Plasminogen Kringle 4 (1krn.pdb)
• Gene 5 DNA binding protein (2gn5.pdb)
• Recombinant Kringle 5 domain (5hpg.pdb)
• Aspartate Transcarbamoylase (9atc.pdb)
Weirdoes are not alone!
[Figure: bar chart of the percentage of proteins with non-structured regions (axis 0 to 35) per genome: A pernix, A fulgidus, M jannaschii, M thermoautotrophicu, P abyssi, P horikoshii, A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, N urealyticum, C elegans, D melanogaster, S cerevisiae, H sapiens, H sapiens chr.22.]
10% of biomass weird!
[Figure: bar chart of the percentage of residues in the non-structured region (axis 0 to 15) per genome: A pernix, A fulgidus, M jannaschii, M thermoautotrophicu, P abyssi, P horikoshii, A aeolicus, B subtilis, B burgdorferi, C jejuni, C pneumoniae, C trachomatis, D radiodurans, E coli, H influenzae, H pylori, M genitalium, M pneumoniae, M tuberculosis, N meningitidis, R prowazekii, S PCC6803, T maritima, T pallidum, N urealyticum, C elegans, D melanogaster, S cerevisiae, H sapiens, H sapiens chr.22.]
Length distribution of floppy regions
[Figure: length distribution of non-structured regions, three panels (A. pernix, E. coli, C. elegans). X-axis: length of non-structured regions (70 to 190); y-axis: percentage of non-structured regions.]
Weirdoes functional!
[Figure: three panels (A. pernix, E. coli, C. elegans), each with series "left" and "right". Axis labels: percentage of non-structured regions (0 to 80); difference in percentage of aligned proteins (-100 to 200).]
Yeast-2-hybrid interactions
[Figure: accumulative percentage of proteins (0 to 35) against number of interacting partners (0 to 10); series: non-NSR, NSR.]
Evolution teaches prediction
• Bioinformatics up to the data deluge? NO, but work in progress!
• Know what we do? Some do, 30% over 100 residues!
• Where are we today? NO 3D prediction from sequence!
• Evolutionary odyssey applied:
– secondary structure +15% -> 76% ± 10%
– transmembrane proteins +10% -> 65% topo ok
– solvent accessibility +5% -> 75%
• High-throughput success of predictions:
– localisation: accessibility useful, but not enough!
– whole genomes: kingdoms differ in some respects!
– threading: better than sequence alignment!
– NORS: weirdoes not alone AND functional!
Conclusions
• no prediction of 3D structure
• no prediction of function
• but: quantum leap through using ‘frozen knowledge’ from evolution and protein structures
• the data deluge floods bioinformatics
• the unsolved urgent problems are legion
• but: it is still time to get it done: running BLAST is NOT all there is … the key is intelligent use of biological knowledge ...
Thanksgiving
• Volker Eyrich: Schrödinger, New York
• Chris Sander: Whitehead, Boston
• Reinhard Schneider: LION, Boston
• Alfonso Valencia: CNB, Madrid
• Miguel Andrade: EMBL, Heidelberg
• Séan O’Donoghue: LION, Heidelberg
• Amos Bairoch: SIB, Genève
• Michael Braxenthaler: La Roche, New York
• Søren Brunak: CBS, København
• Rita Casadio: Univ. Bologna
• Antoine De Daruvar: LION, Bordeaux
• David Eisenberg: UCLA, Los Angeles
• Piero Fariselli: Univ. Bologna
• Barry Honig: Columbia, New York
• Tim Hubbard: Sanger, Hinxton
• Michael Levitt: Univ. Stanford
• Marc Marti-Renom: Rockefeller, New York
• Andrej Sali: Rockefeller, New York
• Michael Scharf: Take 5, Heidelberg
• Gerrit Vriend: Univ. Nijmegen
• Manfred Sippl: Univ. Salzburg
localisation
.. in general
• Jinfeng Liu: genomes, floppy, domains
• Rajesh Nair: NLS, localisation
• Yanay Ofran: protein interactions
• Dariusz Przybylski: PSI-Blast, EVA, threading
• Henry Bigelow: predict porins
• Claus Andersen: continuous DSSP
• Bastiaan Bruning: transcription factors
• Sven Mika: nuclear matrix proteins
• Chien Peter Chen: membrane proteins
• Kazimierz Wrzeszczynski: cell-cycle/ER-Golgi
• Hepan Tan: floppy regions
Availability of methods
• email: [email protected]
– subject: HELP
– file:
    Email address
    options
    # protein name
    SEQWENCE
• WWW: http://cubic.bioc.columbia.edu/predictprotein/
• META: http://cubic.bioc.columbia.edu/predictprotein/submit_meta.html
• EVA: http://cubic.bioc.columbia.edu/eva
• CUBIC: http://cubic.bioc.columbia.edu/