Consensus Prediction of Protein Secondary
Structures
by
Zheng Wang
Bachelor of Management Information System, Shandong Economic University, Jinan, P. R. China, 2004
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Computer Science
In the Graduate Academic Unit of Computer Science
Supervisor(s): Dr. Patricia Evans, Ph.D., Faculty of Computer Science
Dr. Virendrakumar C. Bhavsar, Ph.D., Faculty of Computer Science
Examining Board: Dr. Eric Aubanel, Ph.D., Faculty of Computer Science
Dr. Bradford Nickerson, Ph.D., Faculty of Computer Science
Dr. Julian Meng, Ph.D., Department of Electrical and Computer Engineering
This thesis is accepted by the
Dean of Graduate Studies
THE UNIVERSITY OF NEW BRUNSWICK
July, 2007
© Zheng Wang, 2007
Dedication
I dedicate this thesis to
my parents.
They raised me up,
provide me happiness,
cultivate my personality,
and provide me the best education.
I dedicate this thesis to
my grandmother,
a diligent and kind Chinese woman,
who made great contributions to the whole family,
and left us in 2004 when I was pursuing this degree in Canada.
I wish her peace forever.
Abstract
Protein structure prediction is one of the most significant problems in bioinformat-
ics. Currently, there are some tools which can predict protein secondary structure,
or find protein structural motifs and some specific structure segments. However,
their results are sometimes different or contradictory.
CISPred is a consensus protein structure prediction system which integrates
results in order to provide overall consensus predictions of protein secondary struc-
tures. The average accuracy of CISPred predictions is 82.6% on a dataset con-
taining 109 CASP sequences, and 89.3% on a dataset containing 1758 sequences.
Acknowledgements
I sincerely thank my supervisors, Dr. Patricia Evans and Dr. Virendra
Bhavsar. They imparted their knowledge, directed my research, and provided
financial support. Dr. Virendra Bhavsar and Dr. Patricia Evans are professors with
profound knowledge and experience, and have been respected mentors. The two
years of study and research with them have been one of the best periods in my
life.
Sili Huang and Lu Yang, system administrators of the Advanced Computa-
tional Research Laboratory at the University of New Brunswick, provided a lot of
technical support to the development and testing of CISPred. Particularly, special
thanks to Sili Huang, who provided many helpful suggestions on the concurrent
implementation of CISPred.
I thank my colleagues: Rachita Sharma, a Ph.D. candidate in Computer Science;
En Zhang, a Master of Computer Science candidate; Aijazuddin Syed,
Master of Computer Science; and Marc Cooper, Master of Computer Science.
They began their research in our bioinformatics laboratory earlier than I, and
provided much help and many suggestions for my research.
The members of my entire family greatly supported my study in Canada. I
appreciate them all.
Table of Contents
Dedication
Abstract
Acknowledgments
Table of Contents
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Organization
2 Background
  2.1 Protein
  2.2 Protein Secondary Structure
    2.2.1 Secondary Structure Definitions
    2.2.2 Secondary Structure Assignments
  2.3 Protein Secondary Structure Prediction
    2.3.1 Overview
    2.3.2 PHD, PSIPRED and SSPRO
    2.3.3 The Threading Method and THREADER
    2.3.4 Comparison of Protein Structure Prediction Tools
    2.3.5 Benchmarked Non-redundant Dataset
  2.4 Protein Motif and Motif Databases
    2.4.1 Protein Motif
    2.4.2 PATMATMOTIFS and the PROSITE database
3 CISPred: Consensus Integrated Protein Structure Prediction
  3.1 Overview
  3.2 System Architecture
    3.2.1 Selection of Integrated Tools
    3.2.2 THREADER
      3.2.2.1 Sorting THREADER Reports
      3.2.2.2 Clustering THREADER Alignments
    3.2.3 Finding Motif Secondary Structures
    3.2.4 PSIPRED and SSPRO
    3.2.5 Generating Consensus Structure Prediction
4 System Implementation
  4.1 Overview
  4.2 System Infrastructure
  4.3 Concurrent Implementation
    4.3.1 Overview
    4.3.2 THREADER
    4.3.3 Finding Protein Motif Secondary Structures
    4.3.4 SSPRO and PSIPRED
  4.4 Execution Time
5 Experimental Results
  5.1 Overview
  5.2 CISPred Testing Results on CASP Sequences
  5.3 CISPred Testing Results on 1758 Sequences
  5.4 Selection of CISPred Default Threshold
  5.5 Comparison of CISPred and Integrated Tools
    5.5.1 Overview
    5.5.2 Comparison on CASP Sequences
    5.5.3 Comparison on 1758 Sequences
    5.5.4 Conclusions
6 Conclusion
  6.1 Thesis Contributions
  6.2 Future Work
References
A Submission Templates on Cluster
  A.1 Submission Template for THREADER
  A.2 Submission Template for SSPRO
  A.3 Submission Template for PSIPRED
  A.4 Submission Template for Finding Motif Structures
Vita
List of Figures
2.1 The general formula of an amino acid
2.2 The amino acid sequence of protein 1AD5.
2.3 An α helix in protein 1R7G [26].
2.4 An ideal β strand [43].
2.5 A parallel β sheet in protein 1DIN [26].
2.6 An anti-parallel β sheet in protein 1IC9 [26].
2.7 A β barrel in protein 1BY3 [26].
2.8 The secondary structure of protein 1AD5.
2.9 Part of the PDB file of protein 1WCK.
2.10 The entire amino acid sequence of protein 1WCK.
2.11 Part of the DSSP file of protein 1WCK.
2.12 Part of the PDBFINDER entry of protein 1WCK.
2.13 Part of a GARNIER [16] prediction report.
2.14 Part of a PREDATOR [14] prediction report.
2.15 Part of a PSIPRED [24] horizontal prediction report.
2.16 Part of a SSPRO [37] prediction report.
3.1 CISPred system architecture.
3.2 Example of THREADER score report.
3.3 Alignment results of THREADER.
3.4 Structure segments for a protein motif.
3.5 Example of a structure formula result.
3.6 Example of a PATMATMOTIFS result.
3.7 Example of a PROSITE entry.
3.8 The generation of motif structure formulae.
3.9 Example of a PSIPRED vertical result.
3.10 Example of a SSPRO result.
3.11 Example of the information available in one amino acid position.
3.12 Average 3-state accuracy of THREADER predictions on 80 random sequences.
3.13 Example of a CISPred vertical result in an amino acid position.
3.14 Example of a CISPred vertical result.
3.15 Example of a CISPred horizontal result.
4.1 The system infrastructure of CISPred.
4.2 Example of a CISPred queried sequence.
4.3 Web page for submitting query sequences to CISPred.
4.4 A CISPred web page displaying user jobs.
4.5 Overview of the concurrent implementation of CISPred.
4.6 Concurrent implementation of THREADER.
4.7 The sorting of THREADER reports.
4.8 Concurrent finding of motif structures.
4.9 The execution time of CISPred.
4.10 The speedup of CISPred.
5.1 Two chains of protein 1ZD6.
5.2 Average 3-state accuracy of CISPred on the 109 CASP sequences.
5.3 Standard deviation of CISPred on the 109 CASP sequences.
5.4 Coefficient of variation of CISPred on the 109 CASP sequences.
5.5 Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 109 sequences dataset.
5.6 Distribution of the 109 CASP sequences predicted by CISPred with 1% 3-state accuracy as interval.
5.7 Distribution of the 109 CASP sequences predicted by CISPred with 3% 3-state accuracy as interval.
5.8 Distribution of the 109 CASP sequences predicted by CISPred with 5% 3-state accuracy as interval.
5.9 Average 3-state accuracy of CISPred predictions on 1758 sequences.
5.10 Standard deviation of CISPred predictions on the 1758 sequences.
5.11 Coefficient of variation of CISPred predictions on the 1758 sequences.
5.12 Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 1758 sequences dataset.
5.13 3-state accuracies of PSIPRED predictions on the 109 CASP sequences with average Q3 score 0.778, standard deviation 0.084, and coefficient of variation 15.6%.
5.14 Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on 109 CASP sequences.
5.15 3-state accuracies of SSPRO predictions on 109 CASP sequences with an average Q3 score 0.821, standard deviation 0.095, and coefficient of variation 11.6%.
5.16 Bar graph showing the distribution of 3-state accuracies of SSPRO predictions on the 109 CASP sequences.
5.17 3-state accuracies of CISPred predictions on the 109 CASP sequences when the threshold at which to stop clustering equals 0.42.
5.18 Bar graph showing the distribution of the 3-state accuracies of CISPred predictions when the threshold equals 0.42 on the 109 CASP sequences.
5.19 Bar graph showing the distribution of the 3-state accuracies of predictions of CISPred, PSIPRED, and SSPRO.
5.20 Prediction results of CISPred and integrated tools on protein 1WCK.
5.21 Amino acid sequences and secondary structure sequences predicted by SSPRO with 100% 3-state accuracy.
5.22 3-state accuracy of PSIPRED on 1758 sequences with average of 3-state accuracy 0.789, standard deviation 0.089, and coefficient of variation 11.2%.
5.23 Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on the 1757 sequences dataset.
5.24 The amino acid sequences for which PSIPRED fails to predict the secondary structures.
5.25 3-state accuracy of SSPRO predictions on 1758 sequences with average of 3-state accuracy 0.911, standard deviation 0.101, and coefficient of variation 11.1%.
5.26 Bar graph showing the distribution of the 3-state accuracies of SSPRO predictions on the 1758 sequences dataset.
5.27 The 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold at which to stop clustering equals 0.42. The average 3-state accuracy, standard deviation, and coefficient of variation of these predictions are 0.893, 0.095, and 10.7% respectively.
5.28 The amino acid sequence, 8-state DSSP secondary structure, and 3-state secondary structure of protein 1XR0.
5.29 Prediction result of SSPRO on protein 1XR0.
5.30 Prediction result of PSIPRED on protein 1XR0.
5.31 Alignment result of THREADER on protein 1XR0.
5.32 Bar graph showing the distribution of the 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold is 0.42.
5.33 Bar graph showing the distribution of the 3-state accuracies of the predictions of PSIPRED, SSPRO, and CISPred on the 1758 sequences dataset when the threshold is 0.42.
5.34 Summary of experimental results.
Chapter 1
Introduction
1.1 Motivation
Researchers have established that a protein's function is determined by its
specific three-dimensional structure. Experimental techniques such as X-ray
crystallography and NMR analysis cannot keep pace with the rate at which new
sequences are determined, so the gap between the number of known tertiary and
primary structures is widening; it is therefore necessary to develop approaches
that deduce protein structures from their amino acid sequences. Predicting
protein secondary structures computationally is one of the efforts needed to
narrow this gap.
There are many tools and algorithms for protein secondary structure prediction.
Each tool is based on a specific prediction method, so their results are not
always identical and are even contradictory for some proteins. Researchers
therefore need a method that integrates different prediction tools and makes
consensus predictions.
1.2 Objective
The main objective of the thesis is to integrate results of selected protein structure
prediction tools and make a consensus protein secondary structure prediction in
a position-specific way.
1.3 Organization
Chapter 2 gives background on proteins, protein structures, and protein motifs.
This chapter also briefly introduces several protein structure prediction tools
and methods.
Chapter 3 presents the architecture and methodology of CISPred, a consensus
integrated protein structure prediction system.
Chapter 4 presents the concurrent implementation of CISPred.
Chapter 5 presents the testing strategy and testing results of CISPred.
Chapter 6 presents the contributions of CISPred, and offers some suggestions
for future work.
Chapter 2
Background
2.1 Protein
Proteins are large molecules made up of 20 types of amino acids. Each protein
molecule is a long and unique chain of these amino acid residues¹. These long
chains tend to fold into large and complicated structures because of the
strength of the bonds between atoms. Once folded into these structures, the
chains are stable.
[Diagram: H2N-CH(R)-COOH, labelling the amino group (H2N), the α-carbon atom (C), the side chain group (R), and the carboxyl group (COOH).]
Figure 2.1: The general formula of an amino acid
The sequence of atoms along the core of a chain is called the backbone of the
¹ “In biochemistry and molecular biology, a residue refers to a specific monomer within the polymeric chain of a polysaccharide, protein or nucleic acid” [45].
protein. The portions of the amino acids that are not part of this backbone
are called side chains. Figure 2.1 illustrates the general formula of an amino acid,
in which R represents one of the 20 different amino acid side chains. Every protein
backbone has two ends, a C-terminus and an N-terminus.
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVD SLETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVK HYKIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAW EIPRESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQH DKLVKLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIE QRNYIHRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWS FGILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEER PTFEYIQSVLDDFYTATESQYQQQP
Figure 2.2: The amino acid sequence of protein 1AD5.
Each of the 20 amino acids can be represented by a 1-letter code or 3-letter
code; for example, amino acid “Alanine” is represented by the letter “A” or “Ala,”
and amino acid “Cysteine” is represented by the letter “C” or “Cys.” A protein
molecule is then represented as a string of 1-letter codes, each of which represents
an amino acid. This string of letters is called an “amino acid sequence.” Figure 2.2
illustrates the amino acid sequence of protein 1AD5².
2.2 Protein Secondary Structure
2.2.1 Secondary Structure Definitions
Comparing the 3D structures of many different proteins reveals some regular
folding patterns, such as the α helix, β strand, and β sheet. The
Dictionary of Protein Secondary Structure (DSSP) [49] defines these patterns and
² Protein Data Bank [2] I.D.
uses a single letter code to describe each of them. This single letter code is called
“DSSP code,” and is frequently used to describe protein secondary structures.
An α helix is a structure formed when the backbone chain of a protein twists
around itself, and in which the backbone N-H group in each amino acid forms
a hydrogen bond with the C=O group of the amino acid four residues earlier.
Figure 2.3 [26] illustrates an α helix in protein 1R7G. An α helix is also called a
“4-turn helix,” and it is represented by “H” in DSSP code. If a hydrogen bond
is formed between two amino acids that are three residues apart, this is called
a “3-turn helix,” represented by “G”; if a hydrogen bond is formed between two
amino acids that are five residues apart, this is called a “5-turn helix,” represented
by “I”; and if a hydrogen bond is formed between two adjacent amino acids, this
is called a “hydrogen bonded turn,” represented by “T.”
Figure 2.3: An α helix in protein 1R7G [26].
A β strand is illustrated in Figure 2.4 [43], in which the backbone of the protein
is folded with successive 120 degree angles.
Figure 2.4: An ideal β strand [43].
A β sheet consists of two or more β strands connected by hydrogen bonds,
and its minimum length is two amino acid residues. If its length is less than two
residues, it is called a “residue in isolated beta-bridge” and is represented by
“B.” If two neighbouring β strands are aligned in the same direction from one
terminus (N or C) to the other, the structure is called a parallel β sheet, as
shown in Figure 2.5 [26].
Figure 2.5: A parallel β sheet in protein 1DIN [26].
If the two neighbouring β strands are aligned in opposite directions, the
structure is called an anti-parallel β sheet, as shown in Figure 2.6 [26].
Figure 2.6: An anti-parallel β sheet in protein 1IC9 [26].
A closed β sheet is called a β barrel, which is illustrated in Figure 2.7 [26]. All
β strands, β sheets and β barrels are represented by “E” in DSSP code.
In DSSP code, the structures formed by non-hydrogen bonds are called “bend,”
and are represented by “S.” Random turns and structures that are not in
any of the above conformations are designated as “ ” (space), which is sometimes
also written as “C.” Usually, the eight secondary structure types defined in
DSSP are reduced to three types based on a 3-state scheme [40]: “G” and “H”
are taken to be helix (“H”), “E” and “B” are taken to be strand (“E”), and all of
the other structure types are treated as random turns or coil (“C”).
Based on DSSP code, the secondary structure of a protein molecule can be
represented by a string of letters, which is called a “protein secondary structure
sequence.” Figure 2.8 illustrates the secondary structure sequence of the pro-
tein 1AD5, in which each of the letters, such as “H”, “B” and “C”, represents a
particular protein folding pattern.
Figure 2.7: A β barrel in protein 1BY3 [26].
CCCEEEESSCBCCCSSSBCCBCTTCEEEEEECCTTEEEEEETTTCCEEEEEGGGEEETT SGGGSTTEETTCCHHHHHHHHTSTTCCTTCEEEEECSSSTTSEEEEEEEECTTSCEEEE EEECEECSSSCEESSTTSCBSCHHHHHHHHTTCCSSSSSCCCSBCCCCCCCCCCCTTCS EECGGGEEEEEEEECCSSEEEEEEEETTTEEEEEEEECTTSSCHHHHHHHHHHHTTCCC TTBCCEEEEECSSSEEEEEECCTTCBHHHHHTSHHHHTCCHHHHHHHHHHHHHHHHHHH HTTCCCSCCSTTSEEECTTSCEEECCCCCCCCCCCCCGGGCCHHHHHHCCCCHHHHHHH HHHHHHHHHTTTCCSSSSCCTHHHHHHHHTTCCCCCCTTSCHHHHHHHHHHTCSSGGGS CCHHHHHHHHHTTTSCGGGSSCCCC
Figure 2.8: The secondary structure of protein 1AD5.
2.2.2 Secondary Structure Assignments
Protein secondary structures are assigned to amino acid sequences based on the
three-dimensional orthogonal coordinates of the atoms in proteins. These
coordinates are stored in the RCSB (Research Collaboratory for Structural
Bioinformatics) Protein Data Bank (PDB) [2], a database that stores information
about known proteins, such as their amino acid sequences, the methods used to
determine them, their atoms, and the three-dimensional orthogonal coordinates
of each atom. As of April 2007, the RCSB Protein Data Bank stored information on
about 39,261 known proteins. The information for each protein is stored in a PDB
file in plain text format. Figure 2.9
illustrates part of the PDB file for protein 1WCK. The lines starting with “ATOM”
list the three dimensional orthogonal coordinates of the atoms in protein 1WCK.
Considering the first line starting with “ATOM” as an example: the “N” in the
third column indicates that the atom is Nitrogen; the “GLY” in the fourth column
indicates that this atom is one of the atoms of the amino acid type “GLY,” which
is “Glycine” with 1-letter code G; the “80” in the sixth column indicates that the
amino acid “Glycine” is the 80th amino acid of the protein 1WCK; the “-0.522”
indicates the orthogonal coordinate X in angstroms; the “84.984” indicates the
orthogonal coordinate Y in angstroms; and the “-3.507” indicates the orthogonal
coordinate Z in angstroms. Protein 1WCK contains 220 amino acids in total,
as shown in Figure 2.10. The atoms of the underlined amino acids in that figure
(residues 80 to 215, the range covered by the ATOM records in Figure 2.9) are
included in the PDB file; the atoms of the non-underlined amino acids are not.
The PDB file does not provide three-dimensional orthogonal coordinates for the
non-underlined amino acids because they are partially or completely unstructured
and do not fold into a stable state; structural biologists label such segments
“disordered regions.”
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.014586 0.008421 0.000000 0.00000
SCALE2 0.000000 0.016843 0.000000 0.00000
SCALE3 0.000000 0.000000 0.006122 0.00000
ATOM 1 N GLY A 80 -0.522 84.984 -3.507 0.70 21.33 N
ATOM 2 CA GLY A 80 -0.637 83.613 -2.936 0.70 21.02 C
ATOM 3 C GLY A 80 -2.019 83.003 -3.100 0.70 20.72 C
ATOM 4 O GLY A 80 -3.033 83.679 -2.950 0.70 20.99 O
ATOM 5 N LEU A 81 -2.051 81.712 -3.403 1.00 20.46 N
ATOM 6 CA LEU A 81 -3.300 80.999 -3.613 1.00 20.44 C
ATOM 7 C LEU A 81 -3.792 80.270 -2.371 1.00 19.00 C
ATOM 8 O LEU A 81 -4.944 79.840 -2.325 1.00 20.49 O
ATOM 9 CB LEU A 81 -3.212 80.043 -4.818 1.00 20.90 C
ATOM 10 CG ALEU A 81 -3.101 80.656 -6.221 0.50 21.33 C
ATOM 11 CD1ALEU A 81 -2.821 79.575 -7.249 0.50 21.54 C
ATOM 12 CD2ALEU A 81 -4.357 81.434 -6.598 0.50 21.68 C
ATOM 13 CG BLEU A 81 -1.876 79.810 -5.538 0.50 21.08 C
ATOM 14 CD1BLEU A 81 -0.895 78.992 -4.707 0.40 20.94 C
ATOM 15 CD2BLEU A 81 -2.113 79.145 -6.882 0.50 20.90 C
ATOM 16 N GLY A 82 -2.938 80.146 -1.354 1.00 16.73 N
ATOM 17 CA GLY A 82 -3.351 79.483 -0.121 1.00 14.57 C
ATOM 18 C GLY A 82 -2.701 78.133 0.087 1.00 12.96 C
ATOM 19 O GLY A 82 -1.730 77.784 -0.588 1.00 13.51 O
ATOM 20 N LEU A 83 -3.245 77.380 1.037 1.00 11.79 N
ATOM 21 CA LEU A 83 -2.718 76.063 1.386 1.00 10.98 C
ATOM 22 C LEU A 83 -3.533 74.969 0.715 1.00 11.39 C
ATOM 23 O LEU A 83 -4.726 75.153 0.491 1.00 11.82 O
ATOM 24 CB LEU A 83 -2.758 75.859 2.902 1.00 10.83 C
ATOM 25 CG LEU A 83 -2.057 76.908 3.770 1.00 10.83 C
ATOM 26 CD1 LEU A 83 -2.218 76.506 5.221 1.00 11.61 C
ATOM 27 CD2 LEU A 83 -0.575 77.039 3.399 1.00 12.20 C
ATOM 28 N PRO A 84 -2.892 73.831 0.390 1.00 11.49 N
ATOM 29 CA PRO A 84 -3.635 72.714 -0.218 1.00 11.69 C
ATOM 30 C PRO A 84 -4.651 72.020 0.688 1.00 11.03 C
ATOM 31 O PRO A 84 -5.626 71.448 0.191 1.00 12.31 O
ATOM 32 CB PRO A 84 -2.535 71.726 -0.636 1.00 12.29 C
ATOM 33 CG PRO A 84 -1.316 72.159 0.021 1.00 12.72 C
. . .
ATOM 966 N HIS A 215 4.176 70.860 -0.778 1.00 15.76 N
ATOM 967 CA HIS A 215 4.125 69.550 -1.440 1.00 17.82 C
ATOM 968 C HIS A 215 5.505 68.909 -1.429 1.00 18.68 C
ATOM 969 O HIS A 215 6.524 69.610 -1.419 1.00 20.85 O
ATOM 970 CB HIS A 215 3.631 69.675 -2.888 1.00 18.07 C
ATOM 971 CG HIS A 215 2.216 70.148 -3.020 1.00 19.06 C
ATOM 972 ND1 HIS A 215 1.155 69.520 -2.405 1.00 18.86 N
ATOM 973 CD2 HIS A 215 1.681 71.170 -3.732 1.00 19.26 C
ATOM 974 CE1 HIS A 215 0.032 70.141 -2.716 1.00 18.89 C
ATOM 975 NE2 HIS A 215 0.324 71.145 -3.523 1.00 21.03 N
TER 976 HIS A 215
HETATM 977 AS CAC A1216 -0.013 79.161 8.880 0.11 15.18 AS
HETATM 978 O1 CAC A1216 0.838 80.589 8.363 0.16 15.89 O
HETATM 979 O2 CAC A1216 -0.094 79.116 10.616 0.11 14.61 O
HETATM 980 C1 CAC A1216 -1.826 79.189 8.138 0.22 15.58 C
HETATM 981 C2 CAC A1216 0.927 77.576 8.217 0.22 15.65 C
Figure 2.9: Part of the PDB file of protein 1WCK.
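The reading of an “ATOM” line described above can be illustrated with a short Python sketch. PDB files are fixed-width, so the sketch slices by character position rather than splitting on whitespace; entries such as “CG ALEU” in Figure 2.9, where the alternate-location letter touches the residue name, show why splitting would be fragile. The listing in Figure 2.9 does not preserve the original column alignment, so the example line is rebuilt from fixed-width fields. The function name `parse_atom_record` and the dictionary keys are hypothetical choices made for this illustration.

```python
def parse_atom_record(line: str) -> dict:
    """Parse one fixed-width PDB ATOM record into a dictionary."""
    if not line.startswith("ATOM"):
        raise ValueError("expected a line starting with 'ATOM'")
    return {
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(),  # columns 13-16: atom name, e.g. "N" or "CA"
        "alt_loc": line[16].strip(),       # column 17: alternate location, e.g. "A" in "CG ALEU"
        "res_name": line[17:20].strip(),   # columns 18-20: residue name, e.g. "GLY"
        "chain_id": line[21],              # column 22: chain identifier
        "res_seq": int(line[22:26]),       # columns 23-26: residue sequence number
        "x": float(line[30:38]),           # columns 31-38: X coordinate in angstroms
        "y": float(line[38:46]),           # columns 39-46: Y coordinate in angstroms
        "z": float(line[46:54]),           # columns 47-54: Z coordinate in angstroms
    }


# The first ATOM line of Figure 2.9, rebuilt with fixed-width fields
# (only the columns up to the Z coordinate are reproduced here).
EXAMPLE = (
    "ATOM  " + f"{1:>5}" + " " + " N  " + " " + "GLY" + " " + "A"
    + f"{80:>4}" + "    " + f"{-0.522:8.3f}{84.984:8.3f}{-3.507:8.3f}"
)

if __name__ == "__main__":
    print(parse_atom_record(EXAMPLE))
```

For the example line, the parsed values correspond to the walkthrough in the text: a nitrogen atom of glycine, residue 80 of chain A, at coordinates (-0.522, 84.984, -3.507).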
The DSSP program [49] assigns secondary structures to amino acid sequences
based on the three-dimensional coordinates of the atoms in proteins. It reads
PDB files such as the one shown in Figure 2.9, assigns a secondary structure
type to each amino acid position, and saves the result in a DSSP file.
Figure 2.11³ illustrates part of the DSSP file of protein 1WCK. The fourth column
from the left lists the amino acids of the protein, and the fifth column from the
left lists the secondary structure at each amino acid position. PDBFINDER [49]
is a database that stores the secondary structures of all protein entries in the
Protein Data Bank. Figure 2.12 shows the entry for 1WCK in the PDBFINDER
database, in which the line starting with “Sequence” lists the amino acid
sequence of protein 1WCK (without disordered segments), and the line starting
with “DSSP” lists its secondary structure. The secondary structures in the
PDBFINDER database are assigned by the DSSP program.
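Extracting the amino acid and secondary structure columns from a DSSP residue line can be sketched as follows. The character offsets assume the classic fixed-width DSSP output layout (residue number in columns 6 to 10, chain in column 12, amino acid in column 14, secondary structure in column 17) and should be checked against the files actually used. The function name `parse_dssp_residue` and the example line are illustrative only.

```python
def parse_dssp_residue(line: str) -> dict:
    """Extract residue number, chain, amino acid, and structure from a DSSP line."""
    line = line.ljust(17)  # pad short lines so every slice below is valid
    return {
        "res_num": int(line[5:10]),  # PDB residue number (second column)
        "chain": line[11],           # chain identifier (third column)
        "amino_acid": line[13],      # 1-letter amino acid code (fourth column)
        "structure": line[16],       # 8-state DSSP code; a space means no assignment
    }


# Row 6 of Figure 2.11 (residue 85, alanine, strand), rebuilt with fixed-width
# fields; the remaining columns of the row are omitted.
EXAMPLE = f"{6:>5}{85:>5} A A  E"

if __name__ == "__main__":
    print(parse_dssp_residue(EXAMPLE))
```

Joining the `structure` values of consecutive residue lines gives the kind of string that appears on the “DSSP” line of the PDBFINDER entry.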
2.3 Protein Secondary Structure Prediction
2.3.1 Overview
Protein secondary structure prediction methods usually do not distinguish all of
the secondary structure types defined in the “Dictionary of Protein Secondary
Structure,” but consider only three structural states. Generally, the α helix (“H”)
and the 3-turn helix (“G”) are both treated as Helix, represented by “H”; the β strand (“E”)
³ In order to fit on the paper, some unrelated segments or columns in the examples shown in this thesis may be deleted or omitted as indicated by “...”.
>1WCK:A|PDBID|CHAIN|SEQUENCE
MAFDPNLVGPTLPPIPPFTLPTGPTGPTGPTGPTGPTGPTGPTGDTGTTGPTGPTGPTGPTGPTGATGL
TGPTGPTGPS
GLGLPAGLYAFNSGGISLDLGINDPVPFNTVGSQFGTAISQLDADTFVISETGFYKITV
IANTATASVLGGLTIQVNGVPVPGTGSSLISLGAPIVIQAITQITTTPSLVEVIVTGLGLSLALGTSAS
IIIEKVAH
HHHHH
Figure 2.10: The entire amino acid sequence of protein 1WCK.
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI
1 80 A G 0 0 131 0, 0.0 3,-0.1 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 140.4
2 81 A L - 0 0 193 1,-0.3 2,-0.2 0, 0.0 0, 0.0 0.585 360.0-120.3 -95.7 -12.5
3 82 A G - 0 0 67 131,-0.1 -1,-0.3 3,-0.0 3,-0.0 -0.659 42.4 -36.2 109.3-166.7
4 83 A L - 0 0 71 -2,-0.2 130,-0.1 1,-0.1 3,-0.1 -0.708 41.3-132.4 -99.1 149.3
5 84 A P S S- 0 0 50 0, 0.0 2,-0.3 0, 0.0 33,-0.2 0.771 83.6 -1.9 -67.9 -27.1
6 85 A A E +A 133 0A 1 127,-0.6 127,-2.4 30,-0.1 2,-0.3 -0.988 58.4 171.9-164.6 150.3
7 86 A G E -AB 132 35A 3 28,-2.2 28,-2.4 -2,-0.3 2,-0.4 -0.962 9.3-163.8-163.8 151.3
8 87 A L E -AB 131 34A 3 123,-2.5 123,-2.5 -2,-0.3 2,-0.5 -0.991 0.8-167.8-143.5 131.8
9 88 A Y E +AB 130 33A 92 24,-2.8 23,-2.2 -2,-0.4 24,-1.6 -0.986 18.4 166.5-117.7 125.4
10 89 A A E +AB 129 31A 0 119,-2.5 119,-2.6 -2,-0.5 2,-0.3 -0.951 7.4 179.4-135.8 157.5
11 90 A F E -AB 128 30A 41 19,-2.8 19,-2.3 -2,-0.3 2,-0.5 -0.966 26.5-120.0-151.7 165.4
12 91 A N E -A 127 0A 13 115,-2.5 114,-1.6 -2,-0.3 115,-0.8 -0.962 28.4-175.4-114.6 127.3
13 92 A S E +A 125 0A 57 -2,-0.5 2,-0.3 112,-0.2 112,-0.2 -0.978 15.0 143.6-126.2 132.5
14 93 A G E -A 124 0A 34 110,-1.8 110,-3.0 -2,-0.4 4,-0.1 -0.988 50.0-124.8-162.8 162.7
15 94 A G S S+ 0 0 67 -2,-0.3 2,-0.3 108,-0.2 108,-0.2 0.504 91.5 49.1 -92.2 -6.5
16 95 A I S S- 0 0 141 106,-0.1 108,-0.1 108,-0.1 2,-0.1 -0.899 92.4 -91.6-130.0 158.7
17 96 A S - 0 0 60 -2,-0.3 2,-0.5 106,-0.1 105,-0.2 -0.360 40.9-131.0 -67.5 152.1
18 97 A L E -E 121 0B 47 103,-2.3 103,-2.5 -4,-0.1 2,-0.5 -0.925 12.5-155.1-116.8 124.3
19 98 A D E -E 120 0B 117 -2,-0.5 2,-0.3 101,-0.2 101,-0.2 -0.873 15.5-176.8-104.0 127.6
20 99 A L E -E 119 0B 22 99,-2.8 99,-2.1 -2,-0.5 2,-0.2 -0.919 10.5-149.9-125.0 146.5
21 100 A G > - 0 0 26 -2,-0.3 3,-2.1 97,-0.2 93,-0.3 -0.574 45.4 -57.9-106.4 174.4
.
.
.
125 204 A T E S+A 13 0A 69 -2,-0.4 -112,-0.2 -112,-0.2 -63,-0.2 -0.758 70.0 178.5 -85.7 101.1
126 205 A S E - 0 0 4 -114,-1.6 -64,-2.4 -2,-1.1 2,-0.3 0.802 62.5 -10.4 -74.8 -29.0
127 206 A A E -AD 12 61A 0 -115,-0.8 -115,-2.5 -66,-0.3 2,-0.3 -0.983 55.0-161.1-165.4 154.4
128 207 A S E -AD 11 60A 22 -68,-2.6 -68,-2.3 -2,-0.3 2,-0.3 -0.974 5.6-167.0-139.1 157.2
129 208 A I E -AD 10 59A 0 -119,-2.6 -119,-2.5 -2,-0.3 2,-0.4 -0.997 5.3-167.3-142.7 138.5
130 209 A I E -AD 9 58A 63 -72,-2.2 -72,-2.3 -2,-0.3 2,-0.4 -1.000 6.0-174.1-121.4 129.8
131 210 A I E +AD 8 57A 0 -123,-2.5 -123,-2.5 -2,-0.4 2,-0.4 -0.992 9.4 179.9-118.8 129.1
132 211 A E E -AD 7 56A 51 -76,-2.6 -76,-2.5 -2,-0.4 2,-0.7 -0.993 29.4-131.8-131.0 136.9
133 212 A K E +AD 6 55A 6 -127,-2.4 -127,-0.6 -2,-0.4 -78,-0.2 -0.798 30.7 171.9 -83.8 117.0
134 213 A V E + 0 0 60 -80,-2.3 2,-0.3 -2,-0.7 -79,-0.2 0.622 60.3 14.4-104.6 -17.5
135 214 A A E D 0 54A 42 -81,-1.6 -81,-2.5 -131,-0.0 -1,-0.3 -0.969 360.0 360.0-156.3 145.3
136 215 A H 0 0 157 -2,-0.3 -83,-0.3 -83,-0.2 -85,-0.0 -0.506 360.0 360.0 -78.4 360.0
Figure 2.11: Part of the DSSP file of protein 1WCK.
ID : 1WCK
Header : STRUCTURAL PROTEIN
Date : 2005-10-25
Compound : bcla protein
Source : (bacillus anthracis)
Author : S.Rety
Author : S.Salamitou
Author : L.A.Augusto
Author : R.Chaby
Author : F.Lehegarat
Author : A.Lewit-bentley
Exp-Method : X
Resolution : 1.36
R-Factor : 0.179
Free-R : 0.190
Ref-Prog : REFMAC
HSSP-N-Align : 23
T-Frac-Beta : 0.60
T-Nres-Prot : 136
T-Water-Mols : 190
HET-Groups : 1
Het-Id : 1216
Natom : 5
Name : CACODYLATE ION
Chain : A
Sec-Struc : 136
Beta : 82
B-Bridge : 2
Anti-Hb : 108
Amino-Acids : 136
Substrate : 5
Sequence : GLGLPAGLYAFNSGGISLDLGI ... ALGTSASIIIEKVAH
DSSP : CCCCSEEEEEEEEESSCEEECT ... CSEEEEEEEEEEEEC
Nalign : 4455555555555555555555 ... 444444444444442| 10.9706
Nindel : 0000000000000000001100 ... 000000000000000| 0.1324
Entropy : 0202444334323414334305 ... 020232202000429| 0.2487
Cons-Weight : 9294244115135384211191 ... 939466292999553| 0.5021
Chain : Z
Water-Mols : 190
Figure 2.12: Part of the PDBFINDER entry of protein 1WCK.
and “residue in isolated beta-bridge” (B) are all treated as Strand, represented by
“E,” and all of the others are treated as Coil, represented by “C” or “ ” (space).
Correspondingly, the “3-state accuracy” score (also called the “Q3 score”) is used to
evaluate prediction accuracy; it is the percentage of residues whose predicted
state matches the experimentally determined structure.
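The Q3 score can be illustrated with a short Python sketch. It assumes that the predicted and observed structures have already been reduced to the three states and are given as equal-length strings; the function name q3_score and the example strings are hypothetical and are not taken from CISPred.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Percentage of positions whose predicted 3-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    # Coil may be written as "C" or as a space, so normalise spaces to "C" first.
    pred = predicted.replace(" ", "C")
    obs = observed.replace(" ", "C")
    matches = sum(p == o for p, o in zip(pred, obs))
    return 100.0 * matches / len(obs)


if __name__ == "__main__":
    observed = "CCHHHHEEEC"   # hypothetical reference structure
    predicted = "CHHHHHEE C"  # hypothetical prediction; the space means coil
    print(f"Q3 = {q3_score(predicted, observed):.1f}%")
```

In this toy example eight of the ten positions agree, so the Q3 score is 80%.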
Many methods and algorithms have been used to predict protein secondary
structures. The early methods usually relied only on linear statistics
[16, 5, 15, 17, 34] and stereochemical principles [31].
Subsequently, machine learning algorithms proved to be a successful way to predict
protein secondary structures. The successful machine learning algorithms
include decision trees [35], neural networks [38, 40, 19, 28, 39, 3, 4, 21], and
K-way nearest neighbours [6, 7, 13, 12, 30]. Currently, most of the leading
prediction tools with prediction accuracy higher than 75%, such as PHD [40],
PSIPRED [24] and SSPRO [37], use artificial neural network (ANN) algorithms
to make their predictions.
It has also been shown that considering evolutionary information, in the form of
multiple aligned sequences, can improve the accuracy of protein structure
prediction [8]. This is because a multiple sequence alignment reflects the
core structure or a consensus structure of a whole protein family, which can then
be used to predict the structure of proteins which belong to or are related to that
protein family. Currently, multiple sequence alignment is used quite often in
protein secondary structure prediction [33, 36, 14, 9], and is considered a successful
method [18, 32].
A recent trend is to combine several techniques rather than to rely on only one
when predicting protein secondary structures; for example, to combine ANNs
and multiple sequence alignment [47, 9, 25], and to combine statistical methods,
homology methods, information theory methods, and artificial neural network
algorithms [46]. Besides combining various techniques in one tool, some tools
combine other prediction tools to make a consensus prediction. One typical and
successful example is JPRED [10], which combines six different prediction tools:
DSC [27], PHD [40], NNSSP [6], PREDATOR [14], ZPRED [48], and MULPRED (see footnote 4).
Each of these tools combines multiple sequence alignment with a specific method;
for example, PHD uses jury decision neural networks, NNSSP is based on nearest
neighbours, and DSC uses linear discrimination.
CISPred integrates two existing prediction tools, SSPRO [37] and PSIPRED [24],
both of which achieve relatively high 3-state accuracy, can be downloaded freely,
and are easy to integrate. Moreover, CISPred also integrates a protein
motif structure database and the threading method, neither of which has been widely
used by existing protein secondary structure prediction tools.
2.3.2 PHD, PSIPRED and SSPRO
Currently, PHD [40], PSIPRED [24] and SSPRO [37] are three of the most suc-
cessful protein secondary structure prediction tools.
As mentioned above, multiple sequence alignments can improve the accuracy
of protein secondary structure prediction, and are widely used in protein
secondary structure prediction tools. However, the generation of sequence profiles
by multiple sequence alignment is time-consuming. For example, a very successful
method, PHD [40], uses a multi-processor computer to generate multiple sequence
alignments; therefore, the PHD server [42] cannot be moved to a new site. In 1999,
PSIPRED [24], a protein secondary structure prediction system that could be
easily ported to any workstation, was created. The approach of PSIPRED is
to use the position-specific scoring matrix of PSI-BLAST, instead of multiple
sequence alignments, as the input for a two-stage neural network. According to
the experiments conducted by its author, PSIPRED can achieve an average
3-state accuracy between 76.5% and 78.3% on the CASP3 (Critical Assessment of
Techniques for Protein Structure Prediction experiment) [29] dataset. The output
of PSIPRED [24] gives a confidence score for each of the three secondary
structure types, “C,” “H,” and “E,” at each amino acid position. The details of
PSIPRED output reports are presented in Chapter 3.

Footnote 4: Barton, 1988, unpublished.
In 2004, SSPRO [37], a protein secondary structure prediction tool, was created
based on an ensemble of 100 1D-RNNs (one-dimensional recurrent neural
networks), PSI-BLAST-derived profiles (position-specific scoring matrices), and a large
non-redundant training set. According to the experiments conducted by its
authors, SSPRO can achieve a 3-state accuracy of 77%. The details of the prediction
results from SSPRO [37] are presented in Chapter 3.
2.3.3 The Threading Method and THREADER
The threading method is an algorithm which can be used to predict protein
structures. First, a library of protein folds is constructed to serve as
structural templates. Then a score function is chosen to evaluate any alignment
of a queried amino acid sequence with a structural template. The score function
usually computes the free energy of the queried sequence in a structural
template. The lower the free energy of the queried sequence, the more stable the queried
sequence is in this structural template, which also indicates a higher likelihood
that this template is the final structure of the queried sequence. Based on the
score function, the best alignment of a query sequence with each of the structural
templates can be found. Then the most appropriate structural templates with
optimal alignments are selected as the predicted structures.
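The selection logic of the threading method can be sketched in Python under strong simplifying assumptions: the energies of the candidate alignments are taken as already computed, and the template names, energy values, and function names below are all hypothetical. The sketch only shows the two selection steps (best alignment per template, then best templates); it does not compute energies or alignments.

```python
from typing import Dict, List, Tuple


def best_energy_per_template(energies: Dict[str, List[float]]) -> Dict[str, float]:
    """For each template, keep the lowest (most favourable) alignment energy."""
    return {name: min(values) for name, values in energies.items() if values}


def rank_templates(energies: Dict[str, List[float]], top_n: int = 1) -> List[Tuple[str, float]]:
    """Return the top_n templates ordered from lowest to highest best energy."""
    best = best_energy_per_template(energies)
    return sorted(best.items(), key=lambda item: item[1])[:top_n]


if __name__ == "__main__":
    # Hypothetical energies of candidate alignments of one query with three templates.
    candidate_energies = {
        "foldA": [-120.5, -310.2, -95.0],
        "foldB": [-450.7, -401.3],
        "foldC": [-15.0],
    }
    for name, energy in rank_templates(candidate_energies, top_n=2):
        print(name, energy)
```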
THREADER [23] is a tool which implements the threading method. Its output
includes a score report showing the score of the alignment of a queried sequence
with each structural template, and an alignment report showing the alignments of
the query sequence with each of the structural templates. The details of the score
report and alignment report are presented in Chapter 3.
2.3.4 Comparison of Protein Structure Prediction Tools
Because each method and tool has a different approach to prediction, results may
differ, and sometimes parts of the results are contradictory. Figures 2.13,
2.14, 2.15, and 2.16 show part of the prediction results for the first 50 amino
acids of protein 1AP9, from GARNIER [16], PREDATOR [14], PSIPRED [24],
and SSPRO [37].
. 10 . 20 . 30 . 40 . 50 QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP helix HHH H HHHHH sheet EEE EE EEEEE EEEEEE EEEEEEEEE turns TT TT coil CCCCC CCC CCCC
Figure 2.13: Part of a GARNIER [16] prediction report.
 1 QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP 50
   _______HHHHHHHHHHHHHHHHHHHHHHHHHH___HHHHHHHHHHHHHH
Figure 2.14: Part of a PREDATOR [14] prediction report.
These four reports illustrate the differences between the four prediction tools.
For example, PREDATOR [14] predicts more α helices (“H”) than GARNIER [16],
and at some positions PREDATOR [14] predicts an α helix (“H”) where
GARNIER [16] predicts a contradictory β sheet (“E”). PSIPRED [24] and
SSPRO [37] have similar prediction results; both of them predict two runs of α
helices (“H”) separated by some coils (“C”). However, the lengths of the two
α helix (“H”) runs predicted by PSIPRED [24] and SSPRO [37] are slightly
different.
2.3.5 Benchmarked Non-redundant Dataset
As mentioned above, most of the successful methods to predict protein secondary
structure use machine learning algorithms. Accordingly, datasets that contain
non-redundant protein amino acid sequences are needed for cross-validation. In
1994, Burkhard Rost and Chris Sander provided a dataset, often referred to as
dataset “RS126” [41], that contains 126 non-redundant protein sequences. The
non-redundancy of “RS126” means that any two proteins in the dataset share no
more than 25% sequence identity over a length of more than 80 residues. The
“RS126” dataset has been used as a benchmark dataset by many machine learning
algorithms that predict protein secondary structure.
In 1999, James Cuff and Geoffrey Barton pointed out that the standard used
to determine the non-redundancy of the “RS126” dataset, percentage identity, is
a poor measure of sequence similarity [8], and they provided a more sophisticated
Conf: 97124684478999989899899999995268999873356467887788
Pred: CCCCCCCCHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCHHHHHHHHHHHH
  AA: QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP
              10        20        30        40        50
Figure 2.15: Part of a PSIPRED [24] horizontal prediction report.
QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP
CCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHH
Figure 2.16: Part of a SSPRO [37] prediction report.
method. To compute the similarity of two amino acid sequences A and B, their
method aligns A and B using a standard dynamic programming algorithm, and
then obtains the score V of the alignment. The order of the amino acids in both
sequence A and sequence B is then randomized, and the two randomized sequences are
aligned by the same dynamic programming algorithm. This process is usually repeated
more than 100 times, and then the mean x̄ and the standard deviation σ of the
alignment scores of the randomized sequences are computed. An SD score, or Z-score,
that measures the similarity of the original sequences A and B is computed using
Equation 2.1. Sequences with an SD score higher than 5 are considered similar.
A dataset that contains 396 sequences, usually named dataset “CB396” [8], is
provided using this similarity method. Each sequence in “CB396” is not similar
to any other sequence in “CB396” and also not similar to any sequence in
“RS126.” “CB396” is another of the benchmark non-redundant datasets currently used.

\[ SD = \frac{V - \bar{x}}{\sigma} \tag{2.1} \]
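The SD score procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the alignment uses a plain global (Needleman-Wunsch) score with match, mismatch, and gap values of +1, -1, and -1, whereas the original method would use a proper substitution matrix and gap penalties; the sample standard deviation is used; and the function names nw_score and sd_score are hypothetical.

```python
import random
import statistics


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score by dynamic programming (toy scoring scheme, assumed)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sd_score(a: str, b: str, shuffles: int = 100, seed: int = 0) -> float:
    """SD = (V - mean) / sigma, where mean and sigma come from shuffled sequences."""
    rng = random.Random(seed)
    v = nw_score(a, b)
    scores = []
    for _ in range(shuffles):
        # Randomize the residue order of both sequences, then align them again.
        shuffled_a = "".join(rng.sample(a, len(a)))
        shuffled_b = "".join(rng.sample(b, len(b)))
        scores.append(nw_score(shuffled_a, shuffled_b))
    mean = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    if sigma == 0:
        raise ValueError("randomized scores have zero spread; SD score is undefined")
    return (v - mean) / sigma


if __name__ == "__main__":
    seq = "QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP"
    print(f"SD score of a sequence against itself: {sd_score(seq, seq):.1f}")
```

Fixing the random seed makes the sketch reproducible; with a different scoring scheme the absolute SD values would change, so the threshold of 5 should only be read in the context of the original method.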
The similarity of each pair of sequences in dataset “RS126” was also measured
using the new method mentioned above, and 9 sequences were removed from
“RS126” in order to make the remaining 117 sequences non-redundant. These 117
sequences from “RS126” are combined with “CB396” to form a dataset named
“CB513,” which is one of the benchmarked non-redundant datasets currently used.
2.4 Protein Motif and Motif Databases
2.4.1 Protein Motif
Motifs are biologically significant sites and patterns existing in proteins. They can
be used to characterize protein families.
2.4.2 PATMATMOTIFS and the PROSITE database
PATMATMOTIFS [1], a protein motif finding tool, compares a given protein
sequence to the PROSITE [22] database, which stores information about known
protein motifs. In some cases, an unknown protein sequence is only distantly related
to known proteins, so it is difficult to determine the features of the
unknown protein by overall sequence alignment. By comparing an unknown protein
to the PROSITE [22] database, some biologically important patterns, motifs or
fingerprints can be found, which can help determine to which family it belongs.
Examples of PATMATMOTIFS results and PROSITE entries are presented in
Chapter 3.
Chapter 3
CISPred: Consensus Integrated
Protein Structure Prediction
3.1 Overview
CISPred, a Consensus Integrated Structure Prediction tool, predicts protein sec-
ondary structures by integrating several prediction tools and databases. The tools
and databases integrated in CISPred include two protein secondary structure pre-
diction tools, SSPRO [37] and PSIPRED [24]; a protein motif searching tool,
PATMATMOTIFS [1]; a motif database, PROSITE; a protein secondary structure
database, PDBFINDER [20]; and a threading method tool, THREADER [23].
3.2 System Architecture
3.2.1 Selection of Integrated Tools
The tools selected for integration in CISPred meet several requirements. The
integrated tools have a relatively high prediction accuracy. The techniques of
protein secondary structure prediction have improved greatly in the last
20 years. Some early algorithms, such as GARNIER [16], have no more than
65% accuracy. Because the consensus results of CISPred are computed from the
results of existing tools, the prediction accuracy of these tools directly influences the
accuracy of CISPred.
The tools integrated by CISPred can be downloaded and installed on local
machines. Currently, there are many successful protein structure prediction servers,
but some of them cannot be downloaded and can only be accessed through their web pages.
To improve the stability and reduce the execution time of CISPred, CISPred is
designed so that the integrated tools are executed on local machines. Therefore, the
protein secondary structure prediction tools PSIPRED [24] and SSPRO [37] are
selected to be integrated into CISPred, because both of them can be downloaded
and installed on local machines, and both have relatively high accuracy: PSIPRED
has 80.6% accuracy and SSPRO has 80% or better.
PSIPRED and SSPRO mainly use ANN techniques. In order to provide
consensus predictions, additional tools which use different prediction approaches are
integrated into CISPred. THREADER [23], which implements the threading method,
an approach completely different from neural networks, is integrated into CISPred.
Moreover, CISPred also integrates protein motif structural information by finding the
“structure formulae” of protein motifs. In total, CISPred integrates two protein
secondary structure prediction tools using neural networks and PSI-BLAST
profiles, a tool using the threading method, and protein motif “structure formulae.”
Figure 3.1 illustrates the system architecture of CISPred. A program
running on the CISPred web server submits queried sequences to the integrated
tools, which are executed on a cluster. Motif structure formulae are found by
integrating the tool PATMATMOTIFS [1] and two databases: PDBFINDER [20]
[Figure 3.1 diagram. Components shown: Web Server (Submission Program, Email Program); Cluster running SSPRO, PSIPRED, THREADER (followed by Clustering of protein folds), and Finding the Motif Structure Formulas (PATMATMOTIFS, PDBFinder Database, PROSITE Database); Consensus Prediction Program. Data flows labelled: Query sequences, Prediction results, Protein folds, Structure Formulas, Consensus Results.]
Figure 3.1: CISPred system architecture.
and PROSITE [22]. The results generated by THREADER are clustered before
being integrated. When the execution of the integrated tools and the search for
motif structure formulae are finished, a program integrates the structure formulae
and the results of each integrated tool, and then generates a consensus prediction
for each queried sequence. The consensus predictions are then sent to a program
running on the CISPred web server, which sends the results to CISPred users by
email.
3.2.2 THREADER
3.2.2.1 Sorting THREADER Reports
THREADER [23] is a tool that implements the threading method. THREADER
has a library which contains the structures of 6251 protein folds. A queried amino
acid sequence is “threaded” through each of the protein folds. For each protein
fold, an alignment between the amino acid sequence and the protein fold with
minimum free energy is selected as the optimum alignment. THREADER provides
information about the optimum alignment for each fold in the library. Figure 3.2
shows part of the output report from THREADER, in which column 8 lists the
filtered combined energy Z-scores [23] and the rightmost column lists the PDB ID
codes of the protein folds. According to the THREADER manual, protein structure
predictions should be based on the filtered combined energy Z-scores, because
the higher the filtered combined energy Z-score is, the better the
amino acid sequence fits the protein fold, and the higher the probability that the
protein fold is the correct predicted structure. Usually, a protein fold with a
filtered combined energy Z-score above 3.5 is considered to have a significantly high
probability of being the correct predicted structure of a queried sequence.
The alignments between a queried amino acid sequence and the protein fold
-545.43 -1145.34 4.20 -11.34 3.14 -780.91 4.73 4.73 73.6 350 90.9 79.9 0 1b3mA0
-523.75 -1070.86 3.98 2.05 0.34 -481.13 2.79 2.79 -46.0 309 91.2 70.8 0 1b7gO0
-323.79 -980.73 1.97 -7.26 2.29 -474.51 2.74 2.74 79.2 350 93.1 79.9 0 1bhe00
-430.87 -1046.17 3.05 -1.68 1.12 -465.82 2.69 2.69 -47.9 305 94.1 69.6 0 1a5t00
-511.70 -841.22 3.86 3.99 -0.06 -428.74 2.45 2.45 50.4 326 99.4 74.7 0 1ak500
-233.74 -767.98 1.07 -9.17 2.69 -424.16 2.42 2.42 91.7 305 93.6 69.6 0 1b6cB0
-326.61 -844.66 2.00 -4.30 1.67 -415.82 2.36 2.36 44.3 288 99.0 65.8 0 1bf6A0
-249.10 -786.89 1.22 -8.04 2.45 -416.13 2.36 2.36 -4.8 369 80.4 84.5 0 1a4yA0
-471.73 -925.15 3.46 3.83 -0.03 -392.12 2.21 2.21 -3.4 351 90.3 80.4 0 1a4gA0
-367.95 -994.76 2.41 -0.62 0.90 -380.87 2.14 2.14 9.6 336 94.1 76.9 0 1a7kA0
-289.06 -610.82 1.62 -4.40 1.69 -380.48 2.13 2.13 20.1 316 96.9 72.4 0 1a6o00
-444.86 -775.12 3.19 3.60 0.02 -370.00 2.07 2.07 -27.4 297 96.7 67.8 0 1b3oA0
-270.28 -1119.54 1.43 -4.77 1.77 -369.42 2.06 2.06 9.7 376 87.3 86.1 0 1bif00
-480.57 -869.64 3.55 5.52 -0.38 -366.00 2.04 2.04 12.6 306 94.2 70.1 0 1b4kA0
-275.31 -916.65 1.48 -4.39 1.69 -366.47 2.04 2.04 -4.1 359 89.5 82.0 0 1aye00
-367.21 -1060.40 2.41 0.41 0.69 -358.79 1.99 1.99 43.2 348 91.3 79.5 0 1bd0A0
-296.23 -729.19 1.69 -2.92 1.38 -356.96 1.98 1.98 5.0 245 89.1 55.9 0 1b5tA0
-465.23 -1115.40 3.39 5.70 -0.42 -346.81 1.92 1.92 74.5 358 89.3 81.7 0 1a12A0
-239.84 -620.05 1.13 -4.71 1.75 -337.66 1.86 1.86 104.7 245 88.2 56.2 0 1a0600
-270.70 -770.96 1.44 -2.75 1.35 -327.85 1.79 1.79 25.3 258 94.2 58.9 0 1af700
-502.09 -1146.20 3.90 -9.59 2.65 -688.63 4.16 4.16 30.9 359 89.5 82.0 0 1d7yA0
-393.29 -760.31 2.80 -5.73 1.89 -504.60 2.98 2.98 -67.8 275 94.5 63.0 0 1dhpA0
-323.57 -625.21 2.09 -7.62 2.26 -471.71 2.77 2.77 21.1 249 94.3 57.1 0 1c8zA0
-318.76 -945.51 2.05 -6.38 2.02 -442.82 2.58 2.58 31.9 299 94.6 68.5 0 1bsvA0
-351.78 -727.15 2.38 -3.65 1.48 -422.71 2.45 2.45 16.4 322 88.7 73.5 0 1c0kA0
-456.34 -1183.65 3.44 5.05 -0.23 -358.07 2.04 2.04 158.9 340 89.7 77.6 0 1c3oB0
-299.23 -1064.07 1.85 -2.38 1.23 -345.45 1.96 1.96 78.1 377 93.6 86.3 0 1c0nA0
-348.78 -1026.27 2.35 0.22 0.72 -344.50 1.95 1.95 -74.4 335 98.2 76.7 0 1bx4A0
-291.71 -1436.03 1.77 -2.68 1.29 -343.77 1.95 1.95 -18.5 394 69.8 90.2 0 1dceA0
-203.92 -1546.30 0.89 -7.04 2.15 -340.81 1.93 1.93 -2.3 380 90.3 87.0 0 1bk5A0
-322.13 -1001.99 2.08 0.67 0.63 -309.14 1.72 1.72 120.5 349 89.3 79.7 0 1cl2A0
-300.23 -1152.30 1.86 -0.44 0.85 -308.80 1.72 1.72 79.5 335 88.2 76.7 0 1bjwA0
Figure 3.2: Example of THREADER score report.
are also provided by THREADER, as shown in Figure 3.3. At positions with
structure types “H” and “E,” the alignments contain confidence scores, which are
integers in the range 0 to 9 inclusive. Each of these confidence scores indicates the
likelihood that a structure type is the correct prediction at an amino acid
position. However, in order to integrate the structures with a confidence score of 0,
CISPred raises all of the confidence scores by 1, so that the range of
these confidence scores becomes 1 to 10 inclusive.
For a queried sequence, THREADER generates a filtered combined energy
Z-score and an alignment for each of the 6251 protein folds in its library. By sorting
the column listing the filtered combined energy Z-scores, CISPred finds the 20
protein folds with the highest filtered combined energy Z-scores. Usually, the filtered
combined energy Z-scores of these 20 folds are all above 3.5; nevertheless, CISPred
checks the filtered combined energy Z-scores of these 20 folds and eliminates the
folds with filtered combined energy Z-scores lower than 3.5. The remaining folds are
considered to be highly appropriate predicted structures for the queried sequence.
To integrate the most appropriate protein folds into CISPred, the structural
segments of these protein folds are clustered, and only the cluster of folds with
the highest average confidence score is integrated into CISPred.
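The selection of folds from a score report can be sketched in Python as follows. The sketch assumes whitespace-separated report lines laid out as in Figure 3.2, with the filtered combined energy Z-score in column 8 and the fold code in the last column; the function name select_folds and the constants are hypothetical and are not CISPred's actual code.

```python
from typing import List, Tuple

Z_SCORE_COLUMN = 7   # column 8 of the score report, counted from zero
TOP_N = 20           # number of best-scoring folds examined
Z_THRESHOLD = 3.5    # folds scoring below this value are eliminated


def select_folds(report_lines: List[str],
                 top_n: int = TOP_N,
                 threshold: float = Z_THRESHOLD) -> List[Tuple[str, float]]:
    """Sort folds by Z-score, keep the top_n, then drop those below the threshold."""
    folds = []
    for line in report_lines:
        fields = line.split()
        if len(fields) <= Z_SCORE_COLUMN:
            continue  # skip blank or malformed lines
        folds.append((fields[-1], float(fields[Z_SCORE_COLUMN])))
    folds.sort(key=lambda fold: fold[1], reverse=True)
    return [(name, z) for name, z in folds[:top_n] if z >= threshold]


if __name__ == "__main__":
    # Three lines copied from the score report in Figure 3.2.
    report = [
        "-545.43 -1145.34 4.20 -11.34 3.14 -780.91 4.73 4.73 73.6 350 90.9 79.9 0 1b3mA0",
        "-523.75 -1070.86 3.98 2.05 0.34 -481.13 2.79 2.79 -46.0 309 91.2 70.8 0 1b7gO0",
        "-502.09 -1146.20 3.90 -9.59 2.65 -688.63 4.16 4.16 30.9 359 89.5 82.0 0 1d7yA0",
    ]
    print(select_folds(report))
```

For these three lines only 1b3mA0 (4.73) and 1d7yA0 (4.16) pass the threshold.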
3.2.2.2 Clustering THREADER Alignments
The structural segments of the protein folds with the highest filtered combined
energy Z-scores are clustered by a hierarchical clustering algorithm [11]. Initially,
each cluster contains the secondary structures of one of these fold segments. The
distance between each pair of clusters is computed, and the two clusters with the
smallest distance are merged into one cluster. This process continues until the
smallest distance between any pair of clusters reaches a threshold. The distance
THREADER 3.5 - Protein Sequence Threading Program Build date : Sep 4 2004 Copyright (C) 2002 University College London Portions Copyright (C) 1990 D.T.Jones
Registered user: [email protected]
Reading mean force potential tables... Alignment with 1sb8A0: 10 20 30 40 ----------00000000000----44444-----999999999999----------999 -------CCCHHHHHHHHHHHC-CCEEEEEC-CCCHHHHHHHHHHHHC-------CCEEE -------MMSRYEELRKELPAQ-PKVWLITG-VAGFIGSNLLETLLKL-------DQKVV | | | | EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGE---WWKARSLATRKEGYIPSNY-VA 10 20 30 40 50
50 60 70 80 90 999--------------5555555555--0000--22222-----00000000------- EEECCCC--------CCHHHHHHHHHHCCHHHHCCEEEEECCCCCHHHHHHHHCC----- GLDNFAT--------GHQRNLDEVRSLVSEKQWSNFKFIQGDIRNLDDCNNACAG----- | | | RVDSLETEEWFFKGISRKDAERQLLAPGN--MLGS-FMIRDSETTKGSYSLSVRDYDPRQ 60 70 80 90 100 110
.
.
.
220 230 240 250 -22222222222---------666-----00---33233333333333------------ CHHHHHHHHHHHC----CCC-EEECCCCCEECC-EEHHHHHHHHHHHHCC-------CCC AVIPKWTSSMIQG----DDV-YINGDGETSRDF-CYIENTVQANLLAATA-------GLD | || | PKLIDFSAQIAEGMAFIEQRNYIHRDLRA-ANILVSASLVCKIADFGLARVGAKFPIKWT 280 290 300 310 320 330
260 270 280 290 300 ---55555--------0042222222222222------------222------------- CCCEEEEEC---CCCCEEHHHHHHHHHHHHHHCCCCCC---CCCEEEC-CCCCCCCCCCC ARNQVYNIA---VGGRTSLNQLFFALRDGLAENGVSYH---REPVYRD-FREGDVRHSLA | | | | | | | APE-AINFGSFTIKS--DVWSFGILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPEN 340 350 360 370 380 390
320 330 340 --000000-----------------999999999999999----- CCHHHHHHC--CC--CC-CC----CHHHHHHHHHHHHHHHCC--- DISKAAKLL--GY--AP-KY----DVSAGVALAMPWYIMFLK--- | CPEELYNIMMRCWKNRPEERPTFEYIQSVLDDFYTATESQYQQQP 400 410 420 430
Percentage Identity = 7.6.
Figure 3.3: Alignment results of THREADER.
between two clusters is computed according to Equation 3.1 [11], in which C
and C∗ are two clusters, |C| and |C∗| are the number of fold segments in the
two clusters respectively, and d(x, y) is the distance between two fold segments
located in two clusters. The shorter the distance between two clusters, the greater
the similarities between them.
\[ d_{avg}(C, C^*) = \frac{\sum_{x \in C,\, y \in C^*} d(x, y)}{|C|\,|C^*|} \tag{3.1} \]
The distance between two fold segments is computed according to
Equation 3.2, in which N_identical represents the number of positions with identical
secondary structures, and N_total represents the total number of positions in a fold
segment.

\[ d(x, y) = \frac{N_{identical}}{N_{total}} \tag{3.2} \]
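The clustering procedure can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements about CISPred. First, because a shorter distance is meant to indicate greater similarity, the sketch measures the distance between two segments as the fraction of positions whose structures differ, that is, one minus the ratio written in Equation 3.2. Second, merging stops as soon as the smallest average distance is greater than or equal to the threshold. All function names are hypothetical.

```python
from itertools import combinations
from typing import List


def segment_distance(x: str, y: str) -> float:
    """Fraction of positions whose secondary structure differs (assumed reading)."""
    if len(x) != len(y) or not x:
        raise ValueError("segments must be non-empty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def average_distance(c1: List[str], c2: List[str]) -> float:
    """Average pairwise distance between two clusters, as in Equation 3.1."""
    total = sum(segment_distance(x, y) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def cluster_segments(segments: List[str], threshold: float) -> List[List[str]]:
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[s] for s in segments]  # initially one segment per cluster
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), average_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist >= threshold:
            break  # the closest pair is already too far apart
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    toy_segments = ["HHHHCCEE", "HHHHCCEC", "EEEECCHH"]  # hypothetical segments
    print(cluster_segments(toy_segments, threshold=0.5))
```

In the toy example the first two segments differ at one of eight positions and are merged, while the third segment stays in its own cluster.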
After the clustering stops, the cluster with the highest average confidence score
is selected, and the fold segments in that cluster are integrated into CISPred. The
average confidence score is computed by dividing the sum of confidence scores of
each structure type (“H,” “E,” and “C”) by the total number of positions in that
cluster, as shown in Equation 3.3. THREADER only provides confidence scores
for the positions with structure types “H” and “E,” so the confidence score for the
positions with structure “C” is set to 5, one of the middle integers in the range 1
to 10.
\[ C_{avg} = \frac{\sum C_H + \sum C_E + \sum C_C}{N_H + N_E + N_C} \tag{3.3} \]
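Equation 3.3 can be illustrated with a short Python sketch. It assumes that each fold segment is given as a list of (structure type, raw THREADER confidence) pairs, with no raw score for “C” positions; the raw scores for “H” and “E” are raised by 1 and “C” positions receive the fixed score of 5, as described in the text. The data layout and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

COIL_CONFIDENCE = 5  # fixed confidence assigned to "C" positions

# (structure type, raw THREADER confidence 0-9, or None for "C" positions)
Position = Tuple[str, Optional[int]]


def shifted_confidence(structure: str, raw: Optional[int]) -> int:
    """Confidence on the 1-10 scale used after raising THREADER scores by 1."""
    if structure == "C":
        return COIL_CONFIDENCE
    if raw is None or not 0 <= raw <= 9:
        raise ValueError("H and E positions need a raw confidence between 0 and 9")
    return raw + 1


def average_confidence(cluster: List[List[Position]]) -> float:
    """Sum of confidences over all positions in the cluster divided by their count."""
    scores = [shifted_confidence(s, r) for segment in cluster for s, r in segment]
    if not scores:
        raise ValueError("cluster contains no positions")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical cluster of two short fold segments.
    toy_cluster = [
        [("H", 9), ("H", 9), ("C", None), ("E", 0)],
        [("H", 4), ("C", None), ("C", None), ("E", 2)],
    ]
    print(average_confidence(toy_cluster))
```

For the toy cluster the shifted scores sum to 44 over 8 positions, giving an average confidence of 5.5.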
The threshold of the clustering algorithm influences the number of fold
segments integrated into CISPred, and therefore influences the final consensus
predictions of CISPred. To illustrate the influence this threshold has on the accuracy
of CISPred, several experiments were conducted, and are presented in Chapter 5:
Experimental Results.
3.2.3 Finding Motif Secondary Structures
Proteins in the same family, or with similar functions, are found to contain some
common or similar amino acid segments. These segments can distinguish families
of proteins, and are called motifs. A motif can exist in many individual
proteins; however, the structures of the motif are conserved and fit a particular
structure template. For example, Figure 3.4 illustrates the structures of a motif
named “BACTERIAL_OPSIN_1” in different individual proteins. The structures
of the first five positions and the last three positions are usually “H,” and for the
five positions in the middle the structures can be either “H” or “T.” By
statistically analyzing all of the structures of an existing motif in different proteins, the
structure template or “structure formula” of the motif can be determined, which
provides the proportion of each secondary structure type at each of the amino
acid positions of a motif. Figure 3.5 shows the structure formula of a motif named
“PROTEIN_KINASE_ATP.”
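The statistical analysis behind a structure formula can be sketched in Python as follows. The sketch assumes that the secondary structure strings of all occurrences of a motif have the same length and have already been extracted; the function name structure_formula is hypothetical, and structure types that never occur at a position are simply omitted rather than listed with a proportion of 0.00 as in Figure 3.5.

```python
from collections import Counter
from typing import Dict, List


def structure_formula(occurrences: List[str]) -> List[Dict[str, float]]:
    """Per-position proportion of each secondary structure type over all occurrences."""
    if not occurrences:
        raise ValueError("at least one motif occurrence is required")
    length = len(occurrences[0])
    if any(len(o) != length for o in occurrences):
        raise ValueError("all motif occurrences must have the same length")
    formula = []
    for position in range(length):
        counts = Counter(o[position] for o in occurrences)
        formula.append({state: n / len(occurrences) for state, n in counts.items()})
    return formula


if __name__ == "__main__":
    # Four of the structure strings listed in Figure 3.4 (2BRD, 2AT9, 1UCQ, 1AP9).
    structures = ["HHHHHHHHTHHHH", "HHHHHHHHTTTHH", "HHHHHTTHHHHHH", "THHHHTTTHHHHT"]
    for index, proportions in enumerate(structure_formula(structures), start=1):
        print(index, proportions)
```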
The structure formula is generated by integrating two databases, PROSITE [22]
and PDBFINDER [20], and the motif finding tool, PATMATMOTIFS [1].
PATMATMOTIFS finds motifs in queried amino acid sequences and provides the
name, length, and start and end positions of these motifs. Figure 3.6 is an example
of a PATMATMOTIFS result, which shows that PATMATMOTIFS finds a
motif named “PROTEIN_KINASE_ATP” which is 23 amino acids long, starting
at position 190 and ending at position 212. After executing PATMATMOTIFS
on a queried sequence, CISPred then searches the PROSITE database for entries
2nd Structure PDB ID AA Sequence HHHHH HHHTH HHH 2BRD RYADW LFTTP LLL HHHHH HHHTT THH 2AT9 RYADW LFTTP LLL HHHHH HHHTH HHH 1XJI RYADW LFTTP LLL HHHHH HHHTH HHH 1VJM RYADW LFTTP LLL HHHHH TTHHH HHH 1UCQ RYADW LFTTP LLL HHHHH HHHTH HHH 1UAZ RYADW LFTTP LLL HHHHH TTHHH HHH 1TN5 RYADW LFTTP LLL HHHHH TTHHH HHH 1TN0 RYADW LFTTP LLL HHHHH TTHHH HHH 1S54 RYADW LFTTP LLL HHHHH TTHHH HHH 1S53 RYADW LFTTP LLL HHHHH TTHHH HHH 1S52 RYADW LFTTP LLL HHHHH TTHHH HHH 1S51 RYADW LFTTP LLL HHHHH HHHTH HHH 1R84 RYADW LFTTP LLL HHHHH HHHTH HHH 1R2N RYADW LFTTP LLL HHHHH TTHHH HHH 1QM8 RYADW LFTTP LLL HHHHH HHHTH HHH 1QKP RYADW LFTTP LLL HHHHH HHHTH HHH 1QKO RYADW LFTTP LLL HHHHH HHHTH HHH 1QHJ RYADW LFTTP LLL HHHHH TTHHH HHH 1Q5I RYADW LFTTP LLL HHHHH TTHHH HHH 1PY6 RYADW LFTTP LLL HHHHH TTHHH HHH 1PXS RYADW LFTTP LLL HHHHH TTHHH HHH 1PXR RYADW LFTTP LLL HHHHH HHHTH HHH 1P8U RYADW LFTTP LLL HHHHH HHHTH HHH 1P8I RYADW LFTTP LLL HHHHH HHHTH HHH 1P8H RYADW LFTTP LLL HHHHH TTHHH HHH 1O0A RYADW LFTTP LLL HHHHH HHHTH HHH 1M0M RYADW LFTTP LLL HHHHH HHHTH HHH 1M0L RYADW LFTTP LLL HHHHH HHHTH HHH 1M0K RYADW LFTTP LLL HHHHH TTHHH HHH 1KME RYADW LFTTP LLL HHHHH HHHTH HHH 1KGB RYADW LFTTP LLL HHHHH HHHTH HHH 1KG9 RYADW LFTTP LLL HHHHH HHHTH HHH 1KG8 RYADW LFTTP LLL HHHHH HHHTH HHH 1JGJ RYIDW ILTTP LIV HHHHH HHHTH HHH 1IXF RYADW LFTTP LLL HHHHH HHHTH HHH 1IW9 RYADW LFTTP LLL HHHHH HHHTH HHH 1IW6 RYADW LFTTP LLL HHHHH HHHTH HHH 1BRR RYADW LFTTP LLL HHHHH HHHHH HHH 1BRD RYADW LFTTP LLL HHHHH HHHTH HHH 1BM1 RYADW LFTTP LLL HHHHH HHHHH HHH 1AT9 RYADW LFTTP LLL THHHH TTTHH HHT 1AP9 RYADW LFTTP LLL
Figure 3.4: Structure segments for a protein motif.
MOTIF PROTEIN_KINASE_ATP
LENGTH 23
START 190
END 212
STR_FORMULA [C:0.11, H:0.00, T:0.00, S:0.01, E:0.87, G:0.01, I:0.00, B:0.00]
STR_FORMULA [C:0.09, H:0.00, T:0.01, S:0.01, E:0.87, G:0.02, I:0.00, B:0.00]
STR_FORMULA [C:0.23, H:0.00, T:0.01, S:0.01, E:0.73, G:0.02, I:0.00, B:0.00]
STR_FORMULA [C:0.63, H:0.00, T:0.01, S:0.04, E:0.28, G:0.03, I:0.00, B:0.01]
STR_FORMULA [C:0.01, H:0.00, T:0.35, S:0.63, E:0.00, G:0.01, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.35, S:0.64, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.20, H:0.00, T:0.01, S:0.19, E:0.59, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.10, H:0.00, T:0.00, S:0.03, E:0.87, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.01, H:0.00, T:0.00, S:0.00, E:0.98, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:1.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.01, H:0.00, T:0.00, S:0.00, E:0.99, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.02, H:0.00, T:0.00, S:0.00, E:0.98, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.00, H:0.00, T:0.00, S:0.00, E:0.00, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.04, H:0.00, T:0.05, S:0.02, E:0.88, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.04, H:0.00, T:0.01, S:0.01, E:0.93, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.03, H:0.00, T:0.00, S:0.02, E:0.95, G:0.00, I:0.00, B:0.00]
STR_FORMULA [C:0.01, H:0.00, T:0.00, S:0.00, E:0.98, G:0.00, I:0.00, B:0.00]
Figure 3.5: Example of a structure formula result.
about the motifs PATMATMOTIFS found.
Figure 3.7 shows an entry in the PROSITE database which contains information
about the motif “PROTEIN_KINASE_ATP.” The line starting with “ID” gives the
name of the motif. The lines starting with “PA” give the consensus pattern of
the motif, which is the amino acid template of that motif. In a consensus
pattern, [ALT] indicates that any one of Alanine (A), Leucine (L) or
Threonine (T) may occur at this position; {AM} indicates that any amino acid
except Alanine (A) and Methionine (M) may occur at this position; x indicates
that any amino acid may occur at this position; x(3) corresponds to x-x-x,
which indicates that any three amino acids may occur; and x(2,4) corresponds
to x-x, x-x-x or x-x-x-x, which indicates that any two, three or four amino
acids may occur. The lines starting with “3D” contain the PDB [2] IDs of the
proteins in which this motif exists. Based on the PDB IDs of these proteins,
CISPred searches the PDBFINDER [20] database and retrieves the secondary
structures and the amino acid sequences of these proteins. PATMATMOTIFS is
then executed on each of these amino acid sequences in order to locate the
position of the motif “PROTEIN_KINASE_ATP” in each of these proteins.
According to the position of the motif provided by PATMATMOTIFS, CISPred
finds the secondary structures of the motif in each of these proteins. A
statistical analysis is then performed on these secondary structures; it
computes the proportion of the occurrence of each secondary structure type in
each position and generates the structure formula of the motif. The process
of finding the structure formulae of protein motifs is illustrated in
Figure 3.8. Figure 3.5 illustrates the structure formula of the motif
“PROTEIN_KINASE_ATP,” and each line therein starting with “STR FORMULA”
provides the proportion of the occurrence of each structure type in that
amino acid position. Since the motif “PROTEIN_KINASE_ATP” is 23 amino acids
long, the structure formula of “PROTEIN_KINASE_ATP” contains 23 lines, each
of which contains the proportion of each structure type.
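The pattern syntax described above maps naturally onto regular expressions. The following is a minimal sketch of that mapping, not the way PATMATMOTIFS itself is implemented; the function names `prosite_to_regex` and `find_motif` are hypothetical, and the sketch assumes a pattern without the PROSITE anchors “<” and “>”.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE consensus pattern into a Python regular expression."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element into its core and an optional repeat such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            piece = "."                      # any amino acid
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"  # any amino acid except those listed
        else:
            piece = core                     # "[ALT]" or a single residue such as "G"
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


def find_motif(pattern: str, sequence: str):
    """Return (start, end) of the first match as 1-based inclusive positions, or None."""
    hit = re.search(prosite_to_regex(pattern), sequence)
    return None if hit is None else (hit.start() + 1, hit.end())


# Consensus pattern of PROTEIN_KINASE_ATP, joined from the two "PA" lines of Figure 3.7.
KINASE_ATP = (
    "[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]-"
    "x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K."
)
# The 33-residue fragment printed for this motif in Figure 3.6.
fragment = "KLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPG"
print(find_motif(KINASE_ATP, fragment))
```

On this fragment the sketch reports positions 6 to 28, a span of 23 residues. Since the fragment starts five residues before the motif, this corresponds to positions 190 to 212 of the full sequence, as reported in Figure 3.6.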
########################################
# Program: patmatmotifs
# Rundate: Sun Sep 17 16:34:39 2006
# Report_format: dbmotif
# Report_file: 1158521678131202243110-1.Pat
########################################

#=======================================
#
# Sequence: SEQUENCE from: 1 to: 438
# HitCount: 2
#
# Full: No
# Prune: Yes
# Data_file: /usr/local/share/EMBOSS/data/PROSITE/prosite.lines
#
#=======================================

Length = 23
Start = position 190 of sequence
End = position 212 of sequence

Motif = PROTEIN_KINASE_ATP

KLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPG
     |                     |
   190                     212

Length = 13
Start = position 299 of sequence
End = position 311 of sequence

Motif = PROTEIN_KINASE_TYR

IEQRNYIHRDLRAANILVSASLV
     |           |
   299           311

#---------------------------------------
#---------------------------------------
Figure 3.6: Example of a PATMATMOTIFS result.
The lengths of some motifs are variable; for example, the consensus pattern of
ID   PROTEIN_KINASE_ATP; PATTERN.
AC   PS00107;
DT   APR-1990 (CREATED); NOV-1995 (DATA UPDATE); SEP-2006 (INFO UPDATE).
DE   Protein kinases ATP-binding region signature.
PA   [LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]-
PA   x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K.
NR   /RELEASE=50.8,234112;
NR   /TOTAL=1989(1969); /POSITIVE=1890(1870); /UNKNOWN=1(1);
NR   /FALSE_POS=98(98); /FALSE_NEG=373; /PARTIAL=30;
CC   /TAXO-RANGE=??EPV; /MAX-REPEAT=2;
CC   /VERSION=1;
DR   P13368, 7LESS_DROME, T; P20806, 7LESS_DROVI, T; P45894, AAPK1_CAEEL, T;
DR   Q13131, AAPK1_HUMAN, T; Q5EG47, AAPK1_MOUSE, T; Q5RDH5, AAPK1_PONPY, T;
DR   P54645, AAPK1_RAT , T; Q95ZQ4, AAPK2_CAEEL, T; P54646, AAPK2_HUMAN, T;
DR   Q28948, AAPK2_PIG , T; Q5RD00, AAPK2_PONPY, T; Q09137, AAPK2_RAT , T;
DR   Q6ZMQ8, AATK_HUMAN , T; Q80YE4, AATK_MOUSE , T; P03949, ABL1_CAEEL , T;
DR   P00519, ABL1_HUMAN , T; P00520, ABL1_MOUSE , T; P42684, ABL2_HUMAN , T;
DR   P00522, ABL_DROME , T; P10447, ABL_FSVHY , T; P00521, ABL_MLVAB , T;
. . .
DR   Q9GMA3, VSX1_BOVIN , F; P29944, YCB2_PSEDE , F; P33222, YJFC_ECOLI , F;
DR   Q12291, YL063_YEAST, F; Q09371, YS42_CAEEL , F; O32095, YUEF_BACSU , F;
DR   P47917, ZRP4_MAIZE , F; Q9PIN2, ZUPT_CAMJE , F;
3D   1A9U; 1AD5; 1AGW; 1APM; 1ATP; 1B6C; 1BKX; 1BL6; 1BL7; 1BLX; 1BMK; 1BX6;
3D   1BYG; 1CKI; 1CKJ; 1CSN; 1CTP; 1DAW; 1DAY; 1DI9; 1DS5; 1E9H; 1EH4; 1ERK;
3D   1F0Q; 1FGI; 1FGK; 1FIN; 1FMK; 1FMO; 1FOT; 1FPU; 1FQ1; 1FVV; 1G3N; 1GAG;
3D   1GJO; 1GNG; 1GOL; 1GY3; 1H1P; 1H1Q; 1H1R; 1H1S; 1H1W; 1H24; 1H25; 1H26;
3D   1H27; 1H28; 1H8F; 1HOW; 1I09; 1I44; 1IAN; 1IAS; 1IEP; 1IG1; 1IR3; 1IRK;
3D   1J1B; 1J1C; 1J3H; 1J91; 1JAM; 1JKK; 1JKL; 1JKS; 1JKT; 1JLU; 1JPA; 1JQH;
3D   1JST; 1JWH; 1K2P; 1K3A; 1K9A; 1KMU; 1KMW; 1KSW; 1KV1; 1KV2; 1KWP; 1L3R;
3D   1LC9; 1LCH; 1LD2; 1LEW; 1LEZ; 1LFN; 1LFR; 1LG3; 1LHX; 1LP4; 1LPU; 1LR4;
3D   1LUF; 1LWP; 1M14; 1M17; 1M2P; 1M2Q; 1M2R; 1M52; 1M7N; 1M7Q; 1MP8; 1MQ4;
3D   1MQB; 1MRU; 1MUO; 1NA7; 1NXK; 1NY3; 1O6K; 1O6L; 1O6Y; 1O9U; 1OB3; 1OEC;
3D   1OGU; 1OI9; 1OIU; 1OIY; 1OKV; 1OKW; 1OKY; 1OKZ; 1OL1; 1OL2; 1OL5; 1OL6;
3D   1OL7; 1OM1; 1OMW; 1OPJ; 1OPK; 1OPL; 1OUK; 1OUY; 1OVE; 1OZ1; 1P14; 1P38;
3D   1P4F; 1P4O; 1P5E; 1PF6; 1PF8; 1PHK; 1PJK; 1PKD; 1PKG; 1PME; 1PVK; 1PY5;
3D   1PYX; 1Q24; 1Q3D; 1Q3W; 1Q41; 1Q4L; 1Q5K; 1Q61; 1Q62; 1Q8T; 1Q8U; 1Q8W;
3D   1Q8Y; 1Q8Z; 1Q97; 1Q99; 1QCF; 1QL6; 1QMZ; 1QPC; 1QPE; 1QPJ; 1R0E; 1R0P;
3D   1R1W; 1R39; 1R3C; 1RDQ; 1RE8; 1REJ; 1REK; 1RJB; 1RQQ; 1RW8; 1S9I; 1S9J;
3D   1SM2; 1SMH; 1SNU; 1SNX; 1STC; 1SYK; 1SZM; 1T45; 1T46; 1TVO; 1U59; 1U5Q;
3D   1U5R; 1U7E; 1UNL; 1URC; 1UU3; 1UU7; 1UU8; 1UU9; 1UV5; 1UVR; 1UWH; 1UWJ;
3D   1V0B; 1V0P; 1VJY; 1VR2; 1VYW; 1VZO; 1W7H; 1W82; 1W83; 1W84; 1W98; 1WBN;
3D   1WBO; 1WBP; 1WBS; 1WBT; 1WBV; 1WBW; 1WFC; 1WMK; 1WZY; 1X8B; 1XH4; 1XH5;
3D   1XH6; 1XH7; 1XH8; 1XH9; 1XHA; 1XJD; 1XKK; 1XO2; 1XQZ; 1XR1; 1XWS; 1Y57;
3D   1Y6B; 1Y8G; 1YDR; 1YDS; 1YDT; 1YHS; 1YI3; 1YI4; 1YKR; 1YM7; 1YMI; 1YOL;
3D   1YOM; 1YVJ; 1YW2; 1YWR; 1Z57; 1Z5M; 1ZMU; 1ZMW; 1ZOE; 1ZOG; 1ZOH; 1ZRZ;
3D   1ZXE; 1ZY4; 1ZY5; 1ZYC; 1ZYD; 1ZZ2; 1ZZL; 2A19; 2A1A; 2AC3; 2AC5; 2AUH;
3D   2B4S; 2B54; 2B7A; 2B9F; 2B9H; 2B9I; 2B9J; 2BAJ; 2BAK; 2BAL; 2BAQ; 2BCJ;
3D   2BIK; 2BIL; 2BIY; 2BKZ; 2BMC; 2BPM; 2BZH; 2BZI; 2BZJ; 2C1A; 2C1B; 2C30;
3D   2C3I; 2C4G; 2C5N; 2C5O; 2C5P; 2C5T; 2C5X; 2C6D; 2C6E; 2C6T; 2CDZ; 2CHL;
3D   2CPK; 2CSN; 2ERK; 2ERZ; 2ESM; 2ETK; 2ETO; 2ETR; 2EU9; 2EXM; 2F49; 2F4J;
3D   2F57; 2FA2; 2FGI; 2FO0; 2G15; 2HCK; 2PHK; 2SRC; 3ERK; 3LCK; 4ERK;
DO   PDOC00100;
//
Figure 3.7: Example of a PROSITE entry.
the motif “PROTEIN_KINASE_ATP” shown in Figure 3.7 contains an x(5,18), which
indicates that between five and eighteen amino acids may occur in that
position. These variable-length positions do not affect the illustration of
the structural template of the motif; therefore, they are ignored by CISPred.
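The statistical analysis behind a structure formula can be sketched as a per-position count over the secondary structure segments of a motif. This is an illustrative sketch rather than the CISPred program: the function name `structure_formula` and the sample segments are hypothetical, and the segments are assumed to have equal length, that is, with the variable-length positions already removed.

```python
from collections import Counter

# The eight structure types that appear in the structure formula of Figure 3.11.
STRUCTURE_TYPES = "CHTSEGIB"


def structure_formula(segments):
    """Return, for each motif position, the proportion of each structure type."""
    if not segments:
        raise ValueError("at least one segment is required")
    length = len(segments[0])
    if any(len(segment) != length for segment in segments):
        raise ValueError("all segments must have the same length")
    formula = []
    for position in range(length):
        # Count the structure types observed at this position across all proteins.
        counts = Counter(segment[position] for segment in segments)
        formula.append({s: counts[s] / len(segments) for s in STRUCTURE_TYPES})
    return formula


# Hypothetical secondary structure segments of one motif in four proteins.
segments = ["CHHE", "CHHE", "CGHC", "THHE"]
for row in structure_formula(segments):
    print(", ".join(f"{s}:{p:.2f}" for s, p in row.items()))
```

Each printed line plays the role of one line of a structure formula: one proportion per structure type, with the proportions of a position summing to 1.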
3.2.4 PSIPRED and SSPRO
PSIPRED [24] and SSPRO [37] are the two existing protein secondary structure
prediction tools integrated into CISPred. Figure 3.9 illustrates an example of
a PSIPRED result: the first column provides the index of each amino acid
residue in the queried sequence; the second column provides the amino acid;
the third column provides the predicted structure type, which is the one with
the highest confidence score in that amino acid position; and the fourth,
fifth and sixth columns provide the confidence scores of the structure types
“C,” “H,” and “E,” respectively. Figure 3.10 shows a sample result from SSPRO,
which contains the ID and description of the queried amino acid sequence,
followed by the queried amino acid sequence and the predicted sequence of
secondary structures. SSPRO does not provide any confidence scores for its
predictions.
PSIPRED and SSPRO are independent and successful prediction tools. They
provide complete predictions for all of the amino acid residues of the queried
sequence. Therefore, their results, and the confidence scores of PSIPRED, are
directly integrated into CISPred.
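The column layout of the PSIPRED vertical format described above can be read with a few lines of code. The following is a minimal sketch under the assumption that every data row has exactly the six whitespace-separated columns listed in the text; the names `PsipredRow` and `parse_psipred_vertical` are hypothetical and are not part of PSIPRED or CISPred.

```python
from typing import NamedTuple


class PsipredRow(NamedTuple):
    index: int       # position in the queried sequence
    residue: str     # amino acid at this position
    predicted: str   # structure type with the highest confidence score
    conf_c: float    # confidence score of structure type "C"
    conf_h: float    # confidence score of structure type "H"
    conf_e: float    # confidence score of structure type "E"


def parse_psipred_vertical(text: str):
    """Parse PSIPRED vertical-format output into a list of PsipredRow."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, blank lines and anything that is not a six-column row.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        rows.append(PsipredRow(int(fields[0]), fields[1], fields[2],
                               float(fields[3]), float(fields[4]), float(fields[5])))
    return rows


# The first three rows of Figure 3.9.
sample = """# PSIPRED VFORMAT
   1 E C   0.998  0.001  0.006
   2 D C   0.843  0.009  0.186
   3 I E   0.347  0.002  0.655
"""
rows = parse_psipred_vertical(sample)
print("".join(row.predicted for row in rows))
```

Joining the third column of all rows gives the predicted secondary structure sequence in the same horizontal form that SSPRO reports.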
3.2.5 Generating Consensus Structure Prediction
The consensus predictions of CISPred are determined by integrating the fold
structures provided by THREADER [23], the predicted structures provided by
[Flowchart: Query amino acid sequence → PATMATMOTIFS → Motif name → PROSITE
Database → Protein PDB IDs of all proteins having this motif → PDBFINDER
Database → Protein secondary structure sequences of the proteins having this
motif → PATMATMOTIFS → Sequence segments for this motif → Program Generating
Structure Formula → Motif Structure Formula → Consensus Prediction Program]
Figure 3.8: The generation of motif structure formulae.
# PSIPRED VFORMAT
1 E C 0.998 0.001 0.006
2 D C 0.843 0.009 0.186
3 I E 0.347 0.002 0.655
4 I E 0.046 0.002 0.970
5 V E 0.017 0.002 0.988
6 V E 0.018 0.003 0.984
7 A E 0.063 0.009 0.949
8 L E 0.453 0.018 0.549
9 Y C 0.627 0.013 0.419
10 D C 0.843 0.020 0.154
11 Y C 0.883 0.025 0.113
12 E C 0.926 0.016 0.037
13 A C 0.964 0.021 0.033
14 I C 0.885 0.019 0.017
15 H C 0.840 0.014 0.023
16 H C 0.931 0.063 0.024
17 E C 0.799 0.142 0.045
.
.
.
420 Q H 0.014 0.977 0.001
421 S H 0.018 0.979 0.001
422 V H 0.025 0.975 0.001
423 L H 0.023 0.979 0.001
424 D H 0.024 0.977 0.002
425 D H 0.035 0.963 0.003
426 F H 0.055 0.933 0.005
427 Y H 0.163 0.802 0.007
428 T H 0.386 0.562 0.008
429 A C 0.520 0.430 0.010
430 T C 0.794 0.215 0.016
431 E C 0.828 0.169 0.023
432 S C 0.611 0.412 0.021
433 Q C 0.599 0.466 0.021
434 Y C 0.629 0.307 0.049
435 Q C 0.785 0.140 0.082
436 Q C 0.852 0.033 0.012
437 Q C 0.819 0.021 0.012
438 P C 0.733 0.000 0.001
Figure 3.9: Example of a PSIPRED vertical result.
PSIPRED [24] and SSPRO [37], and motif structure formulae.
CISPred parses the results of SSPRO, PSIPRED, the THREADER alignments, and
the motif structure formulae, and then generates consensus predictions from
the first amino acid position to the last, one amino acid position at a time.
Figure 3.11 shows an example of the available structures and confidence
scores for one amino acid position. The line starting with “AA” shows the
type of the queried amino acid in this position, a “K” (Lysine) in this
example. The line starting with “SSPRO” shows the structure type predicted by
SSPRO in this position, an “H” in this example. The line starting with
“PSIPRED” shows the PSIPRED result in this position, and provides the index
of this position in the queried amino acid sequence (“10”), the queried amino
acid type in this position (“K”), the structure type predicted by PSIPRED in
this position (“H”), the confidence score of the structure type “C” in this
position (“0.013”), the confidence score of the structure type “H” in this
position (“0.627”), and the confidence score of the structure type “E” in
this position (“0.419”). The lines starting with “THREADER” show the
THREADER alignments in this position. In this example, five alignments are
1AD5:B|PDBID|CHAIN|SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDSLETEEWFFKGI
SRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHYKIRTLDNGGFYISPRSTFSTLQ
ELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIPRESLKLEKKLGAGQFGEVWMATYNKHTKVAVKT
MKPGSMSVEAFLAEANVMKTLQHDKLVKLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFS
AQIAEGMAFIEQRNYIHRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSF
GILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQSVLDDF
YTATESQYQQQP
CCCEEEECCCECCCCCCECCECCCCEEEEEECCCCEEEEEECCCCCEEEEEHHHEEECCCHHHCCCEECCC
CHHHHHHHHHCCCCCCCCEEEEECCCCCCCEEEEEEEEECCCEEEEEEEEEEECCCCCEECCCCCCECCHH
HHHHHHCCCCCCCCCCCCCECCCCCCCCCCCCCCCECCHHHEEEEEEEECCCCEEEEEEEECCCEEEEEEE
ECCCCECHHHHHHHHHHHCCCCCCCECCEEEEECCCCCEEEECCCCCCEHHHHHHCHHHHCCCHHHHHHHH
HHHHHHHHHHHHCCCCCCCCCHHHEEECCCCCEEECCCCHHHHCCCCCHHHCCHHHHHHCCCCHHHHHHHH
HHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCHHHCCCHHHHHHHHHCC
CCCCCCCCECCC
Figure 3.10: Example of a SSPRO result.
chosen for integration into CISPred, and their structures in this position are
“C,” “H,” “E,” “H” and “H,” with confidence scores of “-,” “6,” “3,” “0” and
“5,” respectively¹. However, as previously presented, all of the confidence
scores of the structure types “E” and “H” are increased by 1 from the original
numbers provided by the THREADER alignments, and the confidence scores of the
structure type “C” are arbitrarily set to 5. Therefore, the confidence scores
actually considered by CISPred are “5,” “7,” “4,” “1” and “6,” respectively.
The line starting with “Str_F” shows the structure formula in this position,
and indicates the proportion of the occurrence of each structure type in the
position. In the “Str_F” line, “C:0.11” means that 11% of the known
occurrences of the motif have the structure type “C” in this position.
For each of the amino acid positions in a queried sequence, CISPred computes
a total confidence score C_total for each of the three structure types: “H,”
“E,” and “C.” The structure type with the highest total confidence score
C_total is taken as the consensus prediction in that amino acid position.
Equation 3.4 shows the computation of the total confidence score C_total for
the structure type “H” in an amino acid position.

\[
C_{total\_H} = W_{psi} C_{psi\_H} + W_{ssp} C_{ssp\_H} + W_{sf} P_{sf\_H} + W_{thr} C_{thr\_avg\_H} \tag{3.4}
\]
In Equation 3.4, C_psi_H is the confidence score of the structure type “H”
provided by PSIPRED. The confidence scores provided by PSIPRED are converted
to a scale from 0 to 10; therefore, the original confidence scores provided by
PSIPRED are multiplied by 10. In the example presented in Figure 3.11, the
integrated
¹ THREADER does not provide confidence scores for the structure type “C,” and it uses “-” to represent the confidence score of the structure type “C.”
AA K
SSPRO H
PSIPRED 10 K H 0.013 0.627 0.419
THREADER - K C
THREADER 6 K H
THREADER 3 K E
THREADER 0 K H
THREADER 5 K H
Str_F [C:0.11, H:0.87, T:0.00, S:0.01, E:0.00, G:0.01, I:0.00, B:0.00]
Figure 3.11: Example of the information available in one amino acid position.
confidence score of the structure type “H” (C_psi_H) is 0.627 × 10 = 6.27.
C_ssp_H is the confidence score of the structure type “H” provided by SSPRO.
As previously mentioned, SSPRO does not provide any confidence scores for its
predictions. To integrate the predictions of SSPRO, each of them is assigned a
confidence score of 5, one of the two integers halfway between 1 and 10. In
the example illustrated in Figure 3.11, SSPRO predicts an “H” in that
position; therefore, C_ssp_H in that example is set to 5. If the structure
type predicted by SSPRO in that position is not an “H,” C_ssp_H is set to 0.
P_sf_H is the proportion of the structure type “H” provided by the structure
formula for an amino acid position. CISPred uses a 3-state scheme [40], in
which “G” and “H” are taken to be helix (“H”), “E” and “B” are taken to be
strand (“E”), and all of the other structure types are considered as coil
(“C”). The proportion of the structure type “H” therefore includes the
proportion of the structure type “G.” Like the confidence scores provided by
PSIPRED, the proportions provided by the structure formula are multiplied by
10 in order to change their scale to between 0 and 10. In the example
illustrated in Figure 3.11, P_sf_H = (0.87 + 0.01) × 10 = 8.8. C_thr_avg_H is
the average confidence score of the structure type “H” in an amino acid
position of the selected THREADER alignments. As previously mentioned, the
fold structures from the THREADER alignments are clustered, and one cluster of
fold structures is chosen for integration in CISPred. In each position,
CISPred computes an average of the confidence scores of these selected fold
structures. Equation 3.5 illustrates the computation of C_thr_avg_H:
C_thr_H_i is the confidence score of the i-th selected THREADER alignment
whose structure type in the amino acid position is “H,” and N_H is the number
of selected alignments with the structure type “H” in that position. For the
example illustrated in Figure 3.11, C_thr_H_1, C_thr_H_2, and C_thr_H_3 are
equal to 7, 1, and 6, respectively; the sum of C_thr_H_i for i = 1, ..., N_H
equals 7 + 1 + 6 = 14, and N_H equals 3.

\[
C_{thr\_avg\_H} = \frac{\sum_{i=1}^{N_H} C_{thr\_H\_i}}{N_H} \tag{3.5}
\]
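Equation 3.5, together with the score adjustment described earlier, can be worked through in code for the position of Figure 3.11. This is a sketch and not the CISPred source: the function names are hypothetical, and returning 0 for a structure type that no selected alignment supports is an assumption, since the text does not state that case for the average.

```python
def adjusted_score(structure, raw_score):
    """Apply the adjustment described in the text to one THREADER alignment."""
    if structure == "C":
        return 5.0            # coil has no THREADER score ("-"); it is set to 5
    return raw_score + 1.0    # "H" and "E" scores are increased by 1


def average_confidence(alignments, structure):
    """Equation 3.5 for one structure type at one amino acid position."""
    scores = [adjusted_score(s, raw) for s, raw in alignments if s == structure]
    # Assumption: a structure type that no selected alignment supports scores 0.
    return sum(scores) / len(scores) if scores else 0.0


# The five alignments of Figure 3.11 as (structure type, raw score);
# None stands for the "-" that THREADER reports for coil.
alignments = [("C", None), ("H", 6), ("E", 3), ("H", 0), ("H", 5)]
for structure in "HEC":
    print(structure, round(average_confidence(alignments, structure), 3))
```

For “H” the three adjusted scores are 7, 1 and 6, so the average is 14/3; “E” and “C” each have a single supporting alignment, giving 4 and 5.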
The consensus predictions of CISPred are determined not only by considering
the confidence scores provided by each tool, but also the overall prediction
accuracies of the integrated tools. In Equation 3.4, W_psi, W_ssp, and W_thr
are the weights of PSIPRED, SSPRO, and THREADER, and W_sf is the weight for
the structure formulae. A weight is a real number from 0 to 1 inclusive which
indicates the accuracy rate of the information provided by a tool. The weight
of the structure formulae is set to 1 because the proportions provided by
structure formulae are determined by statistical analysis of the real
structural data of existing motifs, and are not predicted by algorithms. The
weights of PSIPRED, SSPRO, and THREADER are equal to their average 3-state
accuracies on a training dataset containing 80 randomly selected amino acid
sequences. 3-state accuracy, also called a Q3 score, is used to evaluate the
prediction accuracy of secondary structure prediction tools. 3-state accuracy
only considers the prediction accuracy of the following three states: helix
(“H”), strand (“E”), and coil (“C”). 3-state accuracy is defined as the
fraction of the amino acid residues that are correctly predicted. Equation 3.6
defines 3-state accuracy, where N_correct is the number of residues that are
correctly predicted, and N_total is the total number of amino acid residues in
the sequence.

\[
Q_3 = \frac{N_{correct}}{N_{total}} \tag{3.6}
\]
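Equation 3.6 amounts to a position-by-position comparison of two structure strings. A minimal sketch follows, assuming both strings already use the three states “H,” “E,” and “C”; the function names and the sample strings are hypothetical.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted 3-state structure matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed structures must have the same length")
    if not observed:
        raise ValueError("structures must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


def average_q3(pairs) -> float:
    """Average Q3 over (predicted, observed) pairs, one pair per sequence."""
    scores = [q3_score(predicted, observed) for predicted, observed in pairs]
    return sum(scores) / len(scores)


# Hypothetical predictions compared with hypothetical observed structures.
print(q3_score("HHEC", "HHCC"))
print(average_q3([("HHEC", "HHCC"), ("CCEE", "CCEE")]))
```

The second function mirrors how a tool weight is described in the text: the per-sequence accuracies are averaged over the training sequences.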
SSPRO and PSIPRED are independent existing protein secondary structure
prediction tools, and their outputs are in the format of protein secondary
structure sequences. The 3-state accuracy scores of their predictions can be
computed by comparing the predicted structure type in each amino acid position
with the real structure type retrieved from the PDBFINDER database [20]. The
average 3-state accuracy of SSPRO predictions on the 80 training sequences is
0.937, and the average 3-state accuracy of PSIPRED predictions on the 80
training sequences is 0.798. THREADER does not provide predicted secondary
structure sequences, but rather the structures of the most appropriate folds,
and the number of the most appropriate folds is influenced by the threshold at
which the hierarchical clustering stops. The structure type that has the
highest average confidence score is considered to be the structure type
predicted by THREADER in an amino acid position. Equation 3.5 illustrates the
computation of the average confidence score of the structure type “H.”
Figure 3.12 illustrates the average 3-state accuracy of THREADER predictions
on 80 random sequences. The average 3-state accuracy declines as the threshold
of the hierarchical clustering rises, which indicates that the more similar
the selected folds are, the higher the prediction accuracy of THREADER is.
As shown in Figure 3.12, the minimum average 3-state accuracy of THREADER is
0.684, and the maximum average 3-state accuracy of THREADER is 0.770.
Therefore, the 3-state accuracy of THREADER is considered to be 0.727, which
is halfway between the maximum accuracy and the minimum accuracy.
In total, based on Equation 3.4, the total confidence score for the structure
type “H” in the amino acid position presented in Figure 3.11 is computed as
C_total_H = 0.798 × 10 × 0.627 + 0.937 × 5 + 1 × 10 × (0.87 + 0.01) + 0.727
× (7 + 1 + 6)/3 ≈ 21.88. Similarly, C_total_E and C_total_C are computed as
C_total_E = 0.798 × 10 × 0.419 + 0.937 × 0 + 1 × 10 × (0.00 + 0.00) + 0.727
× 4/1 ≈ 6.25, and C_total_C = 0.798 × 10 × 0.013 + 0.937 × 0 + 1 × 10 ×
(0.11 + 0.00 + 0.01 + 0.00) + 0.727 × 5/1 ≈ 4.94, respectively. In the amino
acid position presented in Figure 3.11, the maximum among C_total_H,
C_total_E and C_total_C is C_total_H, which approximately equals 21.88;
therefore, the structure type “H” is considered to be the consensus structure
type predicted by CISPred in this amino acid position. In the same way,
CISPred provides a consensus prediction in each of the amino acid positions of
a queried sequence; together, these predictions compose a protein secondary
structure sequence, which is the consensus prediction for the queried
sequence.

[Plot: average 3-state accuracy (Q3 score), vertical axis from 0.67 to 0.78,
against the clustering threshold, horizontal axis from 0 to 1.]
Figure 3.12: Average 3-state accuracy of THREADER predictions on 80 random
sequences.

C_total reaches its maximum limit when the confidence score provided by
PSIPRED equals 1.00, the confidence score provided by SSPRO equals 5, the
proportion provided by the structure formula equals 1.00, and the average
confidence score of the THREADER alignments equals 10. Therefore, the maximum
limit of C_total is computed as L_max = 0.798 × 10 × 1.000 + 0.937 × 5 + 1 ×
10 × 1.00 + 0.727 × 10 = 29.935.
C_total reaches its minimum limit when the confidence score provided by
PSIPRED equals 0.00, the confidence score provided by SSPRO equals 0, the
proportion provided by the structure formula equals 0.00, and the average
confidence score of the THREADER alignments equals 0, which occurs when a fold
has a gap “-” aligned in an amino acid position. Therefore, the minimum limit
of C_total is computed as L_min = 0.798 × 10 × 0.000 + 0.937 × 0 + 1 × 10 ×
0.00 + 0.727 × 0 = 0. In order to clearly present total confidence scores, the
C_total of each structure type is converted into a real number from 0 to 1
inclusive. Equation 3.7 illustrates the conversion of the total confidence
score of the structure type “H.”

\[
C'_{total\_H} = 1 \times \frac{C_{total\_H} - L_{min}}{L_{max} - L_{min}} \tag{3.7}
\]

In the example illustrated in Figure 3.11, the converted total confidence
score of the structure type “H” is computed as C'_total_H = 21.88/29.935 ≈
0.731, the converted total confidence score of the structure type “E” is
computed as C'_total_E = 6.25/29.935 ≈ 0.209, and the converted total
confidence score of the structure type “C” is computed as C'_total_C =
4.94/29.935 ≈ 0.165. Figure 3.13 illustrates the consensus prediction of
CISPred in the amino acid position illustrated in Figure 3.11, in which “10”
is the index of the amino acid position, “K” is the amino acid type, “H” is
the structure type predicted by CISPred, “0.165” is the converted total
confidence score of structure type “C,” “0.731” is the converted total
confidence score of structure type “H,” and “0.209” is the converted total
confidence score of structure type “E.” Figure 3.14 illustrates an example of
a CISPred result in a vertical format, in which each line is a CISPred
prediction result for an amino acid position. Figure 3.15 illustrates an
example of a CISPred result in a horizontal format, in which the queried
sequence is shown in FASTA format followed by the CISPred prediction results.
10 K H 0.165 0.731 0.209
Figure 3.13: Example of a CISPred vertical result in an amino acid position.
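The worked example for the position of Figure 3.11 can be reproduced with a short script that applies Equation 3.4 to each of the three states and then Equation 3.7. It is a sketch built only from the weights and inputs quoted in the text; the function names and the `THREE_STATE` table are hypothetical and do not come from the CISPred source.

```python
# Weights quoted in the text: PSIPRED, SSPRO, structure formulae, THREADER.
WEIGHTS = {"psi": 0.798, "ssp": 0.937, "sf": 1.0, "thr": 0.727}

# Mapping from the eight structure types of a structure formula to three states.
THREE_STATE = {"H": "H", "G": "H", "E": "E", "B": "E",
               "C": "C", "T": "C", "S": "C", "I": "C"}


def total_scores(psipred, sspro, formula, threader_avg):
    """Equation 3.4 for each of the three states at one amino acid position."""
    totals = {}
    for state in "HEC":
        p_sf = sum(p for s, p in formula.items() if THREE_STATE[s] == state)
        totals[state] = (WEIGHTS["psi"] * 10 * psipred[state]
                         + WEIGHTS["ssp"] * (5 if sspro == state else 0)
                         + WEIGHTS["sf"] * 10 * p_sf
                         + WEIGHTS["thr"] * threader_avg[state])
    return totals


def normalize(totals):
    """Equation 3.7, with L_min = 0 and L_max as derived in the text."""
    l_max = (WEIGHTS["psi"] * 10 + WEIGHTS["ssp"] * 5
             + WEIGHTS["sf"] * 10 + WEIGHTS["thr"] * 10)
    return {state: value / l_max for state, value in totals.items()}


# Inputs for the position shown in Figure 3.11.
psipred = {"C": 0.013, "H": 0.627, "E": 0.419}
formula = {"C": 0.11, "H": 0.87, "T": 0.00, "S": 0.01,
           "E": 0.00, "G": 0.01, "I": 0.00, "B": 0.00}
threader_avg = {"H": (7 + 1 + 6) / 3, "E": 4.0, "C": 5.0}

totals = total_scores(psipred, "H", formula, threader_avg)
consensus = max(totals, key=totals.get)
print(consensus, {s: round(v, 3) for s, v in normalize(totals).items()})
```

The structure type with the largest total is the consensus prediction; for this position it is “H,” whose total is approximately 21.88 before conversion and 0.731 after it.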
# CISPred vertical result
1 E C 0.330 0.000 0.000
2 D C 0.325 0.000 0.005
3 I C 0.312 0.000 0.017
4 I E 0.001 0.000 0.426
5 V E 0.000 0.000 0.427
6 V E 0.000 0.000 0.427
7 A E 0.148 0.000 0.182
8 L C 0.315 0.000 0.015
9 Y C 0.320 0.000 0.011
10 D C 0.179 0.025 0.004
11 Y E 0.024 0.025 0.160
12 E C 0.181 0.025 0.001
13 A C 0.182 0.025 0.001
14 I C 0.180 0.025 0.000
15 H C 0.179 0.025 0.001
16 H C 0.181 0.026 0.001
17 E C 0.178 0.028 0.001
.
.
.
420 Q H 0.000 0.183 0.000
421 S H 0.000 0.183 0.000
422 V H 0.001 0.183 0.000
423 L H 0.001 0.183 0.000
424 D H 0.001 0.183 0.000
425 D C 0.157 0.026 0.000
426 F C 0.158 0.025 0.000
427 Y C 0.161 0.021 0.000
428 T C 0.167 0.015 0.000
429 A C 0.170 0.011 0.000
430 T C 0.178 0.006 0.000
431 E C 0.179 0.005 0.001
432 S C 0.173 0.011 0.001
433 Q C 0.172 0.012 0.001
434 Y C 0.173 0.008 0.001
435 Q E 0.021 0.004 0.159
436 Q C 0.179 0.001 0.000
437 Q C 0.178 0.001 0.000
438 P C 0.176 0.000 0.000
Figure 3.14: Example of a CISPred vertical result.
>1AD5:B|PDBID|CHAIN|SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDS
LETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHY
KIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIP
RESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQHDKLV
KLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIEQRNYI
HRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSFGILLM
EIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQ
SVLDDFYTATESQYQQQP

CCCEEEECCCECCCCCCECCECCCCEEEEEECCCCEEEEEECCCCCEEEEEHCCEEECCC
CCHCCCEECCCCHHHHHHHHHCCCCCCCCEEEEECCCCCCCEEEEEEEEECCCCEEEEEE
EEEECCCCCEECCCCCCHHHHHHHHHHHCCCCCCCCCCCCCECCCCCCCCCCCCCHHHHH
HHHHHHEEEEECCCCEEEEEEEECCCEEEEEEEECCCCCCHHHHHHHHHHHCCCCCCCEE
EEEEEECCCCCEEEECCCCCCEHHHHHHCHCCHCCCHHHHHHHHHHHHHHHHHHHHCCCC
CCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCCCHCCHHHHHHCCCCHHHHEEEEEEEEE
HHHCCCCCCCCCCCHHHHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCHHHCCCHHHHH
HHHHCCCCCCCCCCECCC
Figure 3.15: Example of a CISPred horizontal result.
Chapter 4
System Implementation
4.1 Overview
From a software engineering perspective, CISPred is an Internet web application,
an application that is accessed through a web browser over the Internet. The
inputs of CISPred, amino acid sequences, are submitted on a web page of the
CISPred website. After CISPred predictions are finished, the consensus prediction
results are sent to the email address of a user.
The tools integrated in CISPred are concurrently executed on a 164-processor
high performance SUN cluster. The computations for one queried sequence are
simultaneously performed on at least 12 processors, which greatly reduces the
execution time of CISPred.
4.2 System Infrastructure
Figure 4.1 illustrates the system infrastructure of CISPred. A user of CISPred
accesses the CISPred website at the HTTP address http://acrl.cs.unb.ca/conpred/
through a web browser. The CISPred website is constructed in HTML and CGI
(Common Gateway Interface), and runs on a web server which is a 4-processor
SUN computer with the Fully Qualified Domain Name (FQDN) quartet.cs.unb.ca.
The user email address and the queried amino acid sequences in FASTA format,
as shown in Figure 4.2, are submitted to the CISPred web server through the
submission web page of CISPred shown in Figure 4.3. A PERL CGI program running
on the CISPred web server parses the queried sequences and submits concurrent
jobs to the 164-processor high performance SUN cluster with FQDN
chorus.cs.unb.ca. A MySQL database that runs on the 4-processor SUN computer
stores the information about queried sequences and their concurrent job IDs. A
program that runs on the web server retrieves the IDs of unfinished concurrent
jobs and checks the status of these jobs. After the concurrent jobs are
finished, a program on the cluster is executed which integrates the results of
the concurrent jobs and generates consensus prediction results. The consensus
prediction results are sent to the CISPred web server via FTP and then sent to
the email address of the user.
Moreover, the CISPred website contains web pages for CISPred administrators
to view the usage information of CISPred. The administration web pages are
implemented in PERL CGI and list the submission and completion times of each
queried task, as well as the IP address and the geographical location of the
computer from which a CISPred user submitted the queried sequences. This
information is stored in a table in the MySQL database that runs on the
4-processor SUN machine. Figure 4.4 shows one of the administration web pages
of CISPred.
[Diagram components: Users (HTTP, Email); CISPred Web Page (queried sequences,
consensus predictions); Web Server & Database Server, containing the Program
Submitting Execution Jobs, the MySQL Database (task information, queried
sequences, job IDs, task ID, job status) and the Program Checking Job Status &
Sending Emails; SUN Cluster with 160 Processors, containing the Execution of
Integrated Tools (results of integrated tools) and the Program Generating
Consensus Predictions.]
Figure 4.1: The system infrastructure of CISPred.
>1AD5:B|PDBID|CHAIN|SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDS
LETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHY
KIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIP
RESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQHDKLV
KLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIEQRNYI
HRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSFGILLM
EIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQ
SVLDDFYTATESQYQQQP
Figure 4.2: Example of a CISPred queried sequence.
Figure 4.3: Web page for submitting query sequences to CISPred.
Figure 4.4: A CISPred web page displaying user jobs.
4.3 Concurrent Implementation
4.3.1 Overview
The tools integrated in CISPred, THREADER [23], PSIPRED [24], and SSPRO [37],
and the finding of motif structure formulae are concurrently executed on a
high performance SUN cluster containing 164 processors. Figure 4.5 illustrates
an overview of the concurrent implementation of CISPred. A PERL CGI program
that runs on the CISPred web server parses the submission file, which may
contain more than one queried amino acid sequence in FASTA format. For each of
the queried sequences, the PERL CGI program submits several concurrent jobs to
the high performance SUN cluster: 10 concurrent jobs of THREADER [23], 1
concurrent job of PSIPRED [24], 1 concurrent job of SSPRO [37], and n
concurrent jobs of finding motif structural formulae, where n equals the
number of motifs found in the queried amino acid sequence. After the
executions of the concurrent jobs are finished, a program integrates the
results of each concurrent job and generates consensus predictions.
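The bookkeeping performed by the submission step can be sketched as follows. This is written in Python purely for illustration (the thesis program is a PERL CGI script), and the function names, the job labels and the shortened sample sequences are hypothetical; actual submission to the cluster is not modeled.

```python
def parse_fasta(text: str):
    """Split a submission in FASTA format into (header, sequence) records."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def plan_jobs(sequence_id: str, motif_names):
    """List the concurrent jobs for one queried sequence: 10 + 1 + 1 + n."""
    jobs = [f"{sequence_id}:THREADER:{part}" for part in range(1, 11)]
    jobs.append(f"{sequence_id}:PSIPRED")
    jobs.append(f"{sequence_id}:SSPRO")
    jobs.extend(f"{sequence_id}:MOTIF:{name}" for name in motif_names)
    return jobs


submission = ">seq1\nEDIIVVALYD\nYEAIHHEDLS\n>seq2\nKLEKKLGAGQ\n"
for header, sequence in parse_fasta(submission):
    # In CISPred the motif names would come from PATMATMOTIFS; here they are made up.
    jobs = plan_jobs(header, ["PROTEIN_KINASE_ATP", "PROTEIN_KINASE_TYR"])
    print(header, len(sequence), len(jobs))
```

With two motifs, each sequence yields 14 jobs, which is consistent with the statement that one queried sequence occupies at least 12 processors.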
4.3.2 THREADER
The default fold library of THREADER contains 6251 protein folds. A queried
sequence is “threaded” through each of the protein folds and free energy is com-
puted during this process. Figure 4.6 illustrates the concurrent implementation
of THREADER. CISPred divides the library into 10 sub-libraries, each of which
contains approximately 625 protein folds. A queried sequence is threaded through
the folds in the 10 sub-libraries simultaneously on 10 processors.
In total, a PERL CGI program that runs at the CISPred web server submits
10 concurrent jobs of THREADER according to the submission template shown in
[Diagram: a query sequence enters the Concurrent Job Submission Program, which
dispatches THREADER Task 1 to THREADER Task 10, Motif Structure Finding Task 1
to Motif Structure Finding Task n, the SSPRO Task and the PSIPRED Task; their
results go to the Result Integration Program.]
Figure 4.5: Overview of the concurrent implementation of CISPred.
[Diagram: the Concurrent Job Submission Program passes a queried sequence to
THREADER 1 through THREADER 10, each using its own sub-library (Fold Library 1
through Fold Library 10); Result 1 through Result 10 go to the Result
Integration Program.]
Figure 4.6: Concurrent implementation of THREADER.
Appendix A.1. After the executions of these concurrent jobs are finished, 10
alignment reports and 10 score reports are generated by THREADER. An
integration program that runs on the high performance SUN cluster sorts each
of the score reports and, in each of them, selects the 20 folds with the
highest filtered combined energy Z-scores [23]. The integration program
gathers these 20 folds from each of the sub-libraries, sorts the 20 × 10 folds
based on the filtered combined energy Z-scores [23], and selects the top 20
folds with the highest filtered combined energy Z-scores [23], as shown in
Figure 4.7. These 20 folds are the most appropriate folds among the 6251
protein folds of the THREADER default library. CISPred then checks the
filtered combined energy Z-score of each of these 20 folds and eliminates the
folds whose Z-scores are lower than 3.5. The remaining folds are then
clustered, and only the folds in one cluster are integrated in CISPred.
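The two-stage selection of folds can be sketched as below. The sketch assumes each score report has already been parsed into (fold name, filtered combined energy Z-score) pairs; the function name `select_folds` and the fold names are hypothetical, and the clustering step that follows is not shown.

```python
import heapq


def select_folds(score_reports, per_report=20, overall=20, min_z=3.5):
    """Merge the per-report top folds, keep the overall best, drop low Z-scores."""
    candidates = []
    for report in score_reports:
        # Top folds of one sub-library, ranked by Z-score.
        candidates.extend(heapq.nlargest(per_report, report, key=lambda fold: fold[1]))
    best = heapq.nlargest(overall, candidates, key=lambda fold: fold[1])
    # Folds with a Z-score lower than the cut-off are eliminated.
    return [fold for fold in best if fold[1] >= min_z]


# Two hypothetical score reports with made-up fold names and Z-scores.
reports = [
    [("foldA", 5.0), ("foldB", 2.0)],
    [("foldC", 4.0), ("foldD", 3.5)],
]
print(select_folds(reports))
```

Taking the top 20 of each sub-library first loses nothing: any fold that belongs to the overall top 20 is necessarily among the top 20 of its own sub-library, so the merged result equals what a single sort over the whole library would give.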
[Diagram: a sort function is applied to each of Score report 1 to Score
report 10 to give the top 20 folds in each report; a further sort function
over these gives the top 20 folds in all reports.]
Figure 4.7: The sorting of THREADER reports.
4.3.3 Finding Protein Motif Secondary Structures
A queried amino acid sequence may contain several motifs. The submission
program that runs on the CISPred web server executes PATMATMOTIFS [1] on each
of the queried sequences in order to find the motifs present in it. The
structural formulae of each motif are found on one processor. By parsing the
results of PATMATMOTIFS, the submission program retrieves the number of motifs
found in a queried sequence and submits the corresponding number of concurrent
jobs to the high performance SUN cluster, each of which performs an
independent process of finding the structural formulae for one motif.
Appendix A.4 illustrates the template for submitting concurrent jobs of
finding motif structural formulae. Figure 4.8 illustrates the concurrent
implementation of finding motif structural formulae.
After the concurrent jobs of finding motif structural formulae are finished,
the structural formulae are integrated by a program that runs on the high
performance SUN cluster.
4.3.4 SSPRO and PSIPRED
PSIPRED [24] and SSPRO [37] are independent existing protein structure
prediction tools, and each of them finishes within two minutes, much shorter
than the execution time of THREADER, which may be up to several hours
depending on the length of a queried sequence. PSIPRED and SSPRO are therefore
not further divided, and each of them runs on a single processor. The
submission PERL CGI program that runs on the CISPred web server submits one
job for each of SSPRO and PSIPRED. Appendices A.2 and A.3 illustrate the
templates for submitting jobs of SSPRO and PSIPRED. After the executions of
the jobs are finished, the integration program that runs on the high
performance SUN cluster directly integrates the results of SSPRO and PSIPRED.
[Diagram: PATMATMOTIFS finds the names of the motifs in the query amino acid
sequence, and the Concurrent Tasks Submission Program assigns Motif 1 to
Motif n to Processor 1 to Processor n. On each processor, the PROSITE Database
gives the PDB IDs of all proteins that contain the motif, the PDBFINDER
Database gives the whole amino acid sequences and secondary structures of
those proteins, PATMATMOTIFS gives the segments of the secondary structure
sequences of the motif, and the Structure Formula Generation Program produces
the structure formula of the motif. The structure formulae of Motif 1 to
Motif n go to the Consensus Prediction Program of CISPred.]
Figure 4.8: Concurrent finding of motif structures.
57
4.4 Execution Time
The integrated tools and the finding of motif structural formulae are executed
concurrently on a high performance 160-processor SUN cluster, which greatly
reduces the execution time of CISPred.
The PERL CGI program that runs at the CISPred web server submits the
concurrent jobs to the high performance SUN cluster one after another, and the
cluster executes them in the order in which they were submitted. This
submission order is called the “job schedule.” Sometimes the number of
available processors in the cluster is less than the number of concurrent jobs
submitted, so some of the jobs cannot be executed immediately and have to
wait. Moreover, the execution times of the concurrent jobs differ; for
example, SSPRO usually finishes within 2 minutes on one queried sequence,
while a THREADER concurrent job usually takes 30-40 minutes. The job schedule
therefore influences the total execution time of the concurrent jobs.
Figures 4.9 and 4.10 illustrate the execution time and speedup of CISPred
on protein 1AD5 shown in Figure 4.2. “Optimized Schedule” indicates that the
concurrent jobs with longer execution times are submitted ahead of the concurrent
jobs that take less time. The order of jobs submitted is as follows:
1. Job(s) for finding motif structural formulae (see footnote 1);
2. 10 THREADER concurrent jobs;
3. The SSPRO job;
4. The PSIPRED job.
Footnote 1: If a queried amino acid sequence contains more than one motif, the
jobs for finding the structural formulae of these motifs are submitted in the
same order as the order in which these motifs were found in the queried amino
acid sequence.
“Non-optimized Schedule” indicates that the concurrent jobs are submitted in
a random order. As shown in Figures 4.9 and 4.10, submitting the concurrent jobs
that have longer execution times ahead of the concurrent jobs that have shorter
execution times reduces the total execution time and improves the speedup of
CISPred.
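The effect of the job schedule can be illustrated with a small simulation. The sketch below assumes a simple model in which each job, taken in submission order, starts on whichever processor becomes free first; the durations (in minutes) are hypothetical round numbers chosen for illustration and are not measurements from the thesis. A fixed unfavorable order is used in place of a random one so that the result is reproducible.

```python
import heapq


def makespan(durations, processors):
    """Total elapsed time when jobs run in submission order and each job
    starts on the processor that becomes free first."""
    free_at = [0.0] * processors  # time at which each processor is free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical durations: SSPRO and PSIPRED (2 each), ten THREADER jobs
# (35 each), and one motif job (60) submitted last.
arbitrary_order = [2, 2] + [35] * 10 + [60]
# Optimized schedule: longer jobs are submitted first.
long_first = sorted(arbitrary_order, reverse=True)

t_arbitrary = makespan(arbitrary_order, processors=4)
t_long_first = makespan(long_first, processors=4)
# Speedup of the optimized schedule relative to a single processor.
speedup = makespan(long_first, processors=1) / t_long_first
```

Under these assumed durations, submitting the long jobs first shortens the total time on four processors, because the longest job no longer starts late and extends the tail of the schedule.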
CISPred now generates the structure formulae for the motifs found in a queried
sequence concurrently; initially, it generated them sequentially. In Figures
4.9 and 4.10, the line with the legend “Optimized Schedule” indicates the
execution time and speedup of CISPred when the CISPred job schedule is
optimized but the finding of structural formulae of existing motifs in a
queried sequence is not concurrently implemented. The line with the legend
“Non-optimized Schedule” indicates the execution time and speedup of CISPred
when the CISPred job schedule is not optimized and the finding of structural
formulae of existing motifs in a queried sequence is not concurrently
implemented. The line with the legend “Optimized schedule and concurrent
finding of motif structures” indicates the execution time and speedup of
CISPred when the CISPred job schedule is optimized and the finding of
structural formulae of existing motifs in a queried sequence is concurrently
implemented.
As shown in Figure 4.9, when the CISPred job schedule is optimized and the
finding of structural formulae of existing motifs in a queried sequence is
concurrently implemented, the execution time of CISPred decreases sharply as
the number of processors increases from 1 to 6. As Figure 4.10 shows, under
the same configuration the execution time of CISPred decreases almost 8 times
(a speedup of almost 8) when the number of processors increases from 1 to 11.
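The quoted figures can be related through the usual definitions of speedup and parallel efficiency. The block below is a worked sketch that assumes Figure 4.10 uses the standard definition of speedup; the symbols T, S, and E are introduced here for illustration and do not appear in the original text.

```latex
\[
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
\]
\[
S(11) \approx 8 \;\Longrightarrow\; T(11) \approx \frac{T(1)}{8},
\qquad E(11) \approx \frac{8}{11} \approx 0.73
\]
```

Under that assumption, a speedup of almost 8 on 11 processors corresponds to a parallel efficiency of roughly 73%.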
[Figure 4.9 plots three lines with the legends “Optimized schedule and
concurrent finding of motif structure formulas”, “Optimized schedule”, and
“Non-optimized schedule”.]
Figure 4.9: The execution time of CISPred.
[Figure 4.10 plots the same three configurations as Figure 4.9.]
Figure 4.10: The speedup of CISPred.
Chapter 5
Experimental Results
5.1 Overview
The purpose of the experiments is to test the prediction accuracy of CISPred,
to determine a default threshold to be used in the clustering of protein folds
generated from THREADER alignments, and to compare CISPred with other
existing protein structure prediction tools.
Two test datasets are used in the experiments. One of the test datasets
consists of 109 “Critical Assessment of Techniques for Protein Structure
Prediction Experiment” (CASP) [29] target amino acid sequences. CASP [29] is a
community-wide experiment that evaluates protein structure prediction methods.
Prediction methods provide blind predictions before the structures of the
target sequences are observed by experimental methods. The CASP target
sequences have a variety of lengths and are newly discovered proteins;
therefore, they have been widely used by current prediction tools as a
standard test dataset. The 109 CASP sequences used in the experiments are
randomly selected from CASP3 (1998), CASP4 (2000), CASP5 (2002), and CASP6
(2004).
The other test dataset contains 1758 amino acid sequences selected from the
PDBFINDER database [20] by the following procedure. The PDBFINDER database
contains information, such as the amino acid sequences and the secondary
structures, about currently known proteins. It is in a plain text format, and
its entries are listed in alphabetical order of the protein PDB IDs. We select
5000 proteins, in alphabetical order, from the first protein listed in the
PDBFINDER database, with PDB ID 100D, to the protein with PDB ID 4HTC. Out of
these 5000 protein sequences, the following sequences are deleted:
1. The sequences included in the training dataset of 80 sequences presented
in Chapter 3.
2. The sequences that contain illegal characters.
3. All but the first chain of any protein that has several chains with the
same PDB ID. For example, “1ZD6:A” and “1ZD6:B” are the two chains of protein
1ZD6. The two chains “1ZD6:A” and “1ZD6:B” have very high sequence similarity,
as shown in Figure 5.1, and are listed in PDBFINDER as two separate entries.
In this situation, only the first chain of the protein is included; all of the
other chains of the same protein are eliminated.
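The three deletion rules above can be sketched as a single filtering pass. The code below is an illustration under stated assumptions rather than the procedure actually used: the entry format (a chain identifier such as "1ZD6:A" paired with a sequence), the function name, and the definition of a legal character (one of the 20 standard amino acid letters) are all assumptions, and the identifiers "2ABC" and "3DEF" are made up for the example.

```python
# Assumption: "legal" means one of the 20 standard amino acid letters.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def select_test_sequences(entries, training_ids):
    """Filter (chain_id, sequence) pairs given in database order.

    chain_id is assumed to look like '1ZD6:A'.
    """
    seen = set()
    kept = []
    for chain_id, sequence in entries:
        pdb_id = chain_id.split(":")[0]
        is_first_chain = pdb_id not in seen
        seen.add(pdb_id)
        if not is_first_chain:
            continue  # rule 3: keep only the first chain of a protein
        if pdb_id in training_ids:
            continue  # rule 1: drop sequences used for training
        if not set(sequence) <= STANDARD_AMINO_ACIDS:
            continue  # rule 2: drop sequences with illegal characters
        kept.append((chain_id, sequence))
    return kept


entries = [
    ("1ZD6:A", "MGPT"),
    ("1ZD6:B", "MGPT"),
    ("2ABC:A", "MGXT"),  # made-up entry containing a non-standard letter
    ("3DEF:A", "ACDE"),  # made-up entry that is in the training set
]
selected = select_test_sequences(entries, training_ids={"3DEF"})
```

One design choice in this sketch is that a protein whose first chain is rejected contributes no sequence at all, since only the first listed chain is ever considered.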
The test dataset (which contains 1758 sequences) is much larger than the CASP
dataset (which has 109 sequences) and contains selected amino acid sequences of
regular known proteins.
As is the case with most of the other existing protein structure prediction
tools, CISPred only predicts three structural states: helix (“H”), strand
(“E”), and coil (“C”). The eight secondary structure types defined by DSSP are
reduced to these three states based on a 3-state scheme [40]: “G” and “H” are
taken to be helix (“H”), “E” and “B” are both taken to be strand (“E”), and
all of the other structure types are considered to be coil (“C”).
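The reduction from eight DSSP types to three states can be written as a short mapping. The function name below is an assumption introduced for illustration; the mapping itself follows the scheme stated in the text.

```python
def reduce_to_three_states(dssp_string):
    """Map a string of DSSP structure codes onto H (helix), E (strand), C (coil)."""
    helix = {"G", "H"}
    strand = {"E", "B"}
    reduced = []
    for code in dssp_string:
        if code in helix:
            reduced.append("H")
        elif code in strand:
            reduced.append("E")
        else:
            # Every other code, including a blank, is treated as coil.
            reduced.append("C")
    return "".join(reduced)


three_state = reduce_to_three_states("HGIEBTS ")
```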
3-state accuracy is used to evaluate prediction accuracy in the experiments
conducted. 3-state accuracy is defined as the percentage of the amino acid residues
that are correctly predicted, as shown in Equation 3.6.
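As a concrete reading of this definition, the sketch below computes the fraction of residues whose predicted state equals the observed state. Equation 3.6 is not reproduced here, so the function is written directly from the verbal definition; the function name and the example strings are assumptions.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted 3-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Eight residues, seven of which are predicted correctly.
score = q3("HHHCCEEE", "HHHCCEEC")
```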
5.2 CISPred Testing Results on CASP Sequences
As presented in Chapter 3, the secondary structure sequences of the protein folds
generated from THREADER alignments are clustered, and only the folds in one
cluster are integrated by CISPred. The clustering process stops when the distance
between the nearest two clusters reaches a threshold. In order to test the predic-
tion accuracy of CISPred on each of the clustering thresholds from 1% to 100%,
CISPred is executed 100 times on the two datasets, each time with a different
clustering threshold. Figure 5.2 illustrates the average Q3 scores of the 109 CASP
sequences on each threshold from 1% to 100%, with 1% as the interval.
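The stopping rule can be sketched with a small agglomerative clustering routine. The details of the CISPred clustering are given in Chapter 3 and are not reproduced here, so the distance measure (fraction of differing positions between equal-length structure strings) and the single-linkage cluster distance used below are assumptions made only for illustration, as are the function names and the example folds.

```python
def distance(a, b):
    """Assumed metric: fraction of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2):
    """Assumed linkage: distance between the closest pair of members."""
    return min(distance(a, b) for a in c1 for b in c2)


def cluster_folds(folds, threshold):
    """Merge the two closest clusters until their distance exceeds the threshold."""
    clusters = [[fold] for fold in folds]
    while len(clusters) > 1:
        pairs = [
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        closest, i, j = min(pairs)
        if closest > threshold:
            break  # the nearest two clusters are too far apart
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


folds = ["HHHHCCCC", "HHHHCCCE", "EEEECCCC"]
clusters_at_042 = cluster_folds(folds, threshold=0.42)
clusters_at_1 = cluster_folds(folds, threshold=1.0)
```

With a threshold of 1, every fold ends up in a single cluster, which corresponds to integrating all folds without clustering them.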
As shown in Figure 5.2, the average 3-state accuracy of CISPred predictions
stays above 0.826 when the threshold increases from 0 to almost 0.20. It
declines to 0.823 when the threshold is 0.25, reaches its peak, 0.828, when
the threshold is 0.40, and declines from its peak to the lowest point, 0.815,
when the threshold is raised to 0.66. It increases slightly from the lowest
point to 0.819 when the threshold increases from 0.66 to 1.00. Overall, the
average 3-state accuracy declines markedly when the threshold is raised from
0.40 to 0.66. Moreover, it stays relatively high when the threshold is between
0 and 0.40, and relatively low when the threshold is between 0.66 and 1, which
illustrates that the clustering of the protein folds generated from THREADER
alignments improves the prediction accuracy of CISPred. This is because when
the threshold reaches 1, all of the protein folds are integrated, which is the
equivalent of integrating all of the protein folds without clustering them.
>1ZD6:A|PDBID|CHAIN|SEQUENCE
MGPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLT
TEEEFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTT
AVVTNPKE
>1ZD6:B|PDBID|CHAIN|SEQUENCE
MGPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLT
TEEEFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTT
AVVTNPKE
Figure 5.1: Two chains of protein 1ZD6.
Figure 5.3 illustrates the standard deviation of the 3-state accuracies of the
CISPred predictions with different thresholds on the 109 CASP sequences.
Standard deviation measures how widely the values in a distribution are
spread. It is computed according to Equation 5.1.
[Figure 5.2 plots the average 3-state accuracy (Q3 score), on an axis from
0.814 to 0.83, against the threshold, from 0 to 1.]
Figure 5.2: Average 3-state accuracy of CISPred on the 109 CASP sequences.
\[
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \tag{5.1}
\]
N stands for the number of samples taken, $x_i$ is the value of each sample,
and $\bar{x}$ is the average of the sample values.
[Figure 5.3 plots the standard deviation, on an axis from 0.094 to 0.103,
against the threshold, from 0 to 1.]
Figure 5.3: Standard deviation of CISPred on the 109 CASP sequences.
In statistics, “coefficient of variation” [44] also measures the dispersion of a
probability distribution. The coefficient of variation is defined using Equation 5.2,
in which µ stands for the arithmetic mean or average, and σ stands for standard
deviation:
\[
C_v = \frac{\sigma}{\mu} \tag{5.2}
\]
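Equations 5.1 and 5.2 can be computed together in a few lines. The sketch below uses the population form of the standard deviation (division by N), as in Equation 5.1; the function names and the three sample accuracies are assumptions chosen for illustration.

```python
import math


def mean(values):
    """Arithmetic mean of the sample values."""
    return sum(values) / len(values)


def population_std(values):
    """Standard deviation as in Equation 5.1 (division by N, not N - 1)."""
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, as in Equation 5.2."""
    return population_std(values) / mean(values)


# Three made-up 3-state accuracies.
sample = [0.7, 0.8, 0.9]
sigma = population_std(sample)
cv = coefficient_of_variation(sample)
```

Because the coefficient of variation is a ratio, it allows the dispersion of datasets with different average accuracies to be compared on a common scale.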
Figure 5.4 illustrates the “coefficient of variation” of the 3-state
accuracies of the CISPred predictions with different thresholds on the 109
CASP sequences. Both Figures 5.3 and 5.4 show that the 3-state accuracies of
the 109 CASP sequences have the highest dispersion when the threshold is 0.60,
and the lowest dispersion when the threshold is 0.72. Overall, the standard
deviation is roughly 0.1 when the threshold is between 0 and 0.62, and
declines to 0.095 as the threshold is increased from 0.60 to 1. The
coefficient of variation is roughly 12.1% when the threshold is between 0 and
0.53, and after a rise, it declines to 11.6% as the threshold is increased
from 0.6 to 1. Generally speaking, both the standard deviation and the
coefficient of variation are lower at relatively high thresholds. The standard
deviation ranges from 0.095 to 0.102, and the coefficient of variation ranges
from 11.6% to 12.4%; neither changes greatly, and both remain at a low level.
[Figure 5.4 plots the coefficient of variation, on an axis from 0.115 to
0.125, against the threshold, from 0 to 1.]
Figure 5.4: Coefficient of variation of CISPred on the 109 CASP sequences.
Figure 5.5 illustrates the number of sequences that have 3-state accuracies in
some specific ranges. It shows that CISPred always predicts at least 20 sequences
out of 109 CASP sequences with 3-state accuracy higher than 90%, which is about
18.3% of the 109 CASP sequences; CISPred always predicts at least 35 sequences
out of the 109 CASP sequences with 3-state accuracy higher than 85%, which is
about 32% of the 109 CASP sequences; and CISPred always predicts at least 60
sequences out of the 109 CASP sequences with 3-state accuracy higher than 80%,
which is about 55% of the 109 CASP sequences. The number of sequences with
higher 3-state accuracies, such as those above 90%, 85%, 80%, and 75%, declines
when the threshold is increased from 0.48 to 0.64, while the number of sequences
with lower 3-state accuracies, such as 65%, 60%, and 55%, increases when the
threshold is increased from 0.48 to 0.64.
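The counts plotted in Figure 5.5 are cumulative: a sequence with 92% accuracy is counted as above 90%, above 85%, and so on. A minimal sketch of this tally, for a single threshold, is given below; the function name and the five sample accuracies are assumptions made for illustration.

```python
def count_above(accuracies, cutoffs):
    """For each cutoff, count the sequences whose accuracy is higher than it."""
    return {cutoff: sum(a > cutoff for a in accuracies) for cutoff in cutoffs}


# Made-up per-sequence accuracies for one clustering threshold.
accuracies = [0.95, 0.88, 0.81, 0.79, 0.62]
counts = count_above(accuracies, cutoffs=(0.90, 0.85, 0.80))

# Share of a dataset, as quoted in the text: 20 of 109 sequences.
share_percent = round(100 * 20 / 109, 1)
```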
[Figure 5.5 plots the number of sequences, on an axis from 0 to 110, against
the threshold, from 0 to 1, with one line for each accuracy level: above 90%,
85%, 80%, 75%, 70%, 65%, 60%, 55%, and 50%.]
Figure 5.5: Number of sequences CISPred predicts with 3-state accuracy in
several specific ranges on the 109 sequences dataset.
Figures 5.6, 5.7, and 5.8 illustrate the distributions of the CISPred
predictions on the 109 CASP sequences with 1%, 3%, and 5% 3-state accuracy as
intervals, respectively. In these three gray-scale graphs, the more sequences
located in an area, the darker this area is. In Figure 5.6, the area with
3-state accuracy of 80% and a threshold from 0% to 58% is the darkest area,
which indicates that this area contains the densest distribution of sequences.
Figure 5.7 shows that besides the area with 80% 3-state accuracy and a
threshold from 0% to 58%, the area with 90% 3-state accuracy and a threshold
from 25% to 45% also contains a relatively high density of sequences. Figure
5.8 shows that the area with 3-state accuracy from 75% to 85% and a threshold
from 40% to 44%, 48% to 58%, or 60% to 100% contains a higher density of
sequences. Overall, the sequences with higher than 90% 3-state accuracy are
mostly generated when the threshold is from 25% to 50%, the sequences with 80%
to 85% 3-state accuracy are mostly generated when the threshold is from 70% to
100%, and the sequences with 78% to 80% 3-state accuracy are mostly generated
when the threshold is from 0% to 60%.
5.3 CISPred Testing Results on 1758 Sequences
The experiments conducted on the 109 CASP sequences are also performed on a
dataset containing 1758 amino acid sequences. The 1758 sequences are selected
from the PDBFINDER database [20], which contains information about known
proteins, such as amino acid sequences and secondary structure sequences.
Figure 5.9 illustrates the average 3-state accuracies of CISPred on the 1758
sequences. It shows that the average 3-state accuracy rises from 0.8887 to
0.8929 when the threshold is increased from 0 to 0.45; it declines to 0.8913
when the threshold is increased to 0.62, and then rises and reaches its peak
at 0.8938. Overall, the average 3-state accuracy rises when the threshold is
increased from 0 to 1, which is the inverse of the trend of CISPred
predictions on the 109 CASP sequences shown in Figure 5.2, in which the
average 3-state accuracy declines when the threshold is increased from 0 to 1.
When the threshold that stops the clustering of protein folds generated from
THREADER alignments is equal to 1, the protein folds in all of the clusters
are integrated by CISPred. The increasing trend of the 3-state accuracy shown
in Figure 5.9 indicates that the folds generated from THREADER are
overwhelming the prediction results from other integrated tools. Furthermore,
as more folds generated from THREADER are integrated, the 3-state accuracy of
CISPred predictions increases. The peak of the 3-state accuracy on the 1758
sequences shown in Figure 5.9 is 0.8938, which is higher than the peak of the
3-state accuracy on the 109 CASP sequences, 0.8278, as shown in Figure 5.2.
This indicates that CISPred has a higher average prediction accuracy on the
1758 sequences than on the 109 CASP sequences.
[Figures 5.6, 5.7, and 5.8 are gray-scale graphs of 3-state accuracy (in
percentage) against threshold (in percentage).]
Figure 5.6: Distribution of the 109 CASP sequences predicted by CISPred with
1% 3-state accuracy as interval.
Figure 5.7: Distribution of the 109 CASP sequences predicted by CISPred with
3% 3-state accuracy as interval.
Figure 5.8: Distribution of the 109 CASP sequences predicted by CISPred with
5% 3-state accuracy as interval.
[Figure 5.9 plots the average 3-state accuracy (Q3 score), on an axis from
0.888 to 0.895, against the threshold, from 0 to 1.]
Figure 5.9: Average 3-state accuracy of CISPred predictions on 1758 sequences.
Figure 5.10 illustrates the standard deviation of CISPred predictions on the
1758 sequences. It shows that the standard deviation increases from its lowest
point, 0.0944, to its peak, 0.0967, when the threshold is increased from 0 to
0.37. The standard deviation then declines to 0.0952 and hovers around that
value as the threshold is increased from 0.37 to 1. The range of the standard
deviation of CISPred predictions on the 1758 sequences is 0.0023 (the highest
point 0.0967 minus the lowest point 0.0944), which is very small and indicates
that the dispersion of CISPred predictions on the 1758 sequences is barely
influenced by the clustering threshold. Compared with the standard deviation
of CISPred predictions on the 109 CASP sequences shown in Figure 5.3, the
predictions on the 1758 sequences have slightly less dispersion than the
predictions on the 109 CASP sequences.
[Figure 5.10 plots the standard deviation, on an axis from 0.094 to 0.097,
against the threshold, from 0 to 1.]
Figure 5.10: Standard deviation of CISPred predictions on the 1758 sequences.
Figure 5.11 illustrates the coefficient of variation of the CISPred
predictions on the 1758 sequences. The lowest point of the coefficient of
variation is 10.62%, when the threshold equals 0, and the highest point is
10.84%, when the threshold equals 0.37. When the threshold is increased from
0.37 to 1, the coefficient of variation declines and then hovers around
10.69%. When the threshold is above 0.8, the coefficient of variation stays at
10.69%. Generally speaking, the coefficient of variation of CISPred
predictions on the 1758 sequences is between about 1 and 2 percentage points
lower than the coefficient of variation of CISPred predictions on the 109 CASP
sequences shown in Figure 5.4, which indicates that CISPred predictions on the
1758 sequences have less dispersion than on the 109 CASP sequences.
[Figure 5.11 plots the coefficient of variation, on an axis from 0.106 to
0.1085, against the threshold, from 0 to 1.]
Figure 5.11: Coefficient of variation of CISPred predictions on the 1758
sequences.
Figure 5.12 illustrates the number of sequences that have 3-state accuracies
in some specific ranges. It shows that, at every threshold, CISPred predicts
at least 600 sequences with 3-state accuracy higher than 95%, which is 34% of
the 1758 sequences; at least 1000 sequences with 3-state accuracy higher than
90%, which is 56.9% of the 1758 sequences; at least 1200 sequences with
3-state accuracy higher than 85%, which is 68.3% of the 1758 sequences; and at
least 1400 sequences with 3-state accuracy higher than 80%, which is 79.6% of
the 1758 sequences. Compared to Figure 5.5, CISPred has better performance on
the 1758 sequences dataset regarding the number of sequences with high 3-state
accuracy.
[Figure 5.12 plots the number of sequences, on an axis from 0 to 1600, against
the threshold, from 0 to 1, with one line for each accuracy level: above 95%,
90%, 85%, 80%, 75%, 70%, 65%, 60%, 55%, 50%, and 45%.]
Figure 5.12: Number of sequences CISPred predicts with 3-state accuracy in
several specific ranges on the 1758 sequences dataset.
5.4 Selection of CISPred Default Threshold
As presented in Chapter 3, the protein folds generated from THREADER
alignments are clustered, and only the folds in one cluster are integrated in
CISPred. The process of clustering stops when the distance between the two
closest clusters is larger than a threshold. The experiments conducted on the
two datasets indicate that the average 3-state accuracy and the dispersion of
CISPred predictions are influenced by the clustering threshold. A default
threshold is determined based on the experiments conducted, and is used as the
clustering threshold for any queried sequence submitted by CISPred users.
The default threshold is determined by considering the average 3-state
accuracy, standard deviation, and coefficient of variation of CISPred
predictions on the two datasets. The default threshold is a compromise that
balances the average 3-state accuracy against the dispersion: CISPred should
have a relatively high average 3-state accuracy and a relatively low standard
deviation and coefficient of variation. Based on these requirements, 0.42 is
selected as the default threshold in CISPred. At this threshold, the average
3-state accuracy of CISPred predictions on the 109 CASP sequences equals
0.826, which is very close to the peak, 0.828, as shown in Figure 5.2, and the
average 3-state accuracy of CISPred predictions on the 1758 sequences equals
0.8926, which is very close to its peak, 0.8938, as shown in Figure 5.9.
Moreover, CISPred predictions have less dispersion when the threshold equals
0.42: the standard deviation of CISPred predictions on the 109 CASP sequences
equals 0.0996, as shown in Figure 5.3; the coefficient of variation of CISPred
predictions on the 109 CASP sequences equals 0.1206, as shown in Figure 5.4;
the standard deviation of CISPred predictions on the 1758 sequences equals
0.0955, as shown in Figure 5.10; and the coefficient of variation of CISPred
predictions on the 1758 sequences equals 0.1069, as shown in Figure 5.11.
Figure 5.5 shows that CISPred predicts more sequences with 3-state accuracy
higher than 90% and higher than 85% when the threshold equals 0.42 on the 109
CASP sequences. Figure 5.12 illustrates that CISPred predicts more sequences
with 3-state accuracy higher than 90%, higher than 85%, and higher than 80%,
when the threshold equals 0.42 on the 1758 sequences.
For all the above reasons, 0.42 is determined as the default threshold to be used
in the hierarchical clustering of the folds generated by THREADER alignments.
5.5 Comparison of CISPred and Integrated Tools
5.5.1 Overview
PSIPRED and SSPRO are two existing protein secondary structure prediction
tools that are integrated by CISPred. The manuals of PSIPRED and SSPRO provide
their average 3-state accuracies based on the experiments conducted by their
authors. The datasets used in those experiments may differ from the two
datasets used to test CISPred, in which case the published results of PSIPRED
and SSPRO cannot be compared directly to the experimental results of CISPred.
PSIPRED and SSPRO are therefore tested independently on the same test datasets
used to test CISPred, and their results are compared to the results of CISPred
with the default threshold of 0.42.
5.5.2 Comparison on CASP Sequences
Figure 5.13 illustrates the 3-state accuracies of PSIPRED predictions on the
109 CASP sequences. Figure 5.14 is a bar graph showing the distribution of the
3-state accuracies of PSIPRED predictions on the 109 CASP sequences. The
average 3-state accuracy, standard deviation, and coefficient of variation of
PSIPRED predictions on the 109 CASP sequences are 0.778, 0.084, and 15.6%,
respectively. Figure 5.13 shows that PSIPRED predicts proteins number 21 and 9
with 3-state accuracies of 0.43 and 0.46, which are the two lowest 3-state
accuracies of the 109 predictions. Figure 5.14 shows that PSIPRED does not
predict any sequences with 3-state accuracy between 0.95 and 1; predicts 33
sequences with 3-state accuracy between 0.80 and 0.85, which is 30.3% of the
109 CASP sequences; and predicts 28 sequences with 3-state accuracy between
0.75 and 0.80, which is 25.7% of the 109 CASP sequences.
[Figure 5.13 plots the 3-state accuracy (Q3 score), from 0 to 1, against the
sequence I.D., from 0 to 100.]
Figure 5.13: 3-state accuracies of PSIPRED predictions on the 109 CASP
sequences with average Q3 score 0.778, standard deviation 0.084, and
coefficient of variation 15.6%.
[Figure 5.14 is a bar graph of the number of sequences (y-axis) in each
3-state accuracy range (x-axis), from 0.95-1.00 down to 0.00-0.05.]
Figure 5.14: Bar graph showing the distribution of the 3-state accuracies of
PSIPRED predictions on 109 CASP sequences.
Figure 5.15 shows the 3-state accuracies of the predictions of SSPRO on the
109 CASP sequences. The average 3-state accuracy, standard deviation, and
coefficient of variation of SSPRO predictions on the 109 CASP sequences are
0.821, 0.095, and 11.6%, respectively. Figure 5.15 shows that SSPRO predicts 7
CASP sequences with 100% 3-state accuracy, and protein number 9 with 3-state
accuracy 0.5, which is the lowest 3-state accuracy in the 109 predictions.
Figure 5.16 is a bar graph showing the distribution of 3-state accuracies of
SSPRO predictions on the 109 CASP sequences. Figure 5.16 shows that SSPRO
predicts the same number of sequences as PSIPRED with 3-state accuracy between
0.75 and 0.85. However, SSPRO predicts 15 sequences with 3-state accuracy
between 0.95 and 1.00, and 8 sequences with 3-state accuracy between 0.90 and
0.95. Compared to PSIPRED, SSPRO predicts more sequences with high 3-state
accuracy.
[Figure 5.15 plots the 3-state accuracy (Q3 score), from 0 to 1, against the
sequence I.D., from 0 to 100.]
Figure 5.15: 3-state accuracies of SSPRO predictions on 109 CASP sequences
with an average Q3 score 0.821, standard deviation 0.095, and coefficient of
variation 11.6%.
[Figure 5.16 is a bar graph of the number of sequences (y-axis) in each
3-state accuracy range (x-axis), from 0.95-1.00 down to 0.00-0.05.]
Figure 5.16: Bar graph showing the distribution of 3-state accuracies of SSPRO
predictions on the 109 CASP sequences.
Figure 5.17 illustrates the 3-state accuracies of CISPred predictions on the
109 CASP sequences when the threshold equals 0.42. The average 3-state
accuracy, standard deviation, and coefficient of variation of CISPred
predictions on the 109 CASP sequences at this threshold are 0.826, 0.0997, and
12.06%, respectively. The corresponding values for PSIPRED predictions on the
109 CASP sequences are 0.778, 0.084, and 15.6%. The average 3-state accuracy
of CISPred, 0.826, is higher than that of PSIPRED, 0.778. The standard
deviation of CISPred, 0.0997, is higher than that of PSIPRED, 0.084, which
indicates that the predictions of CISPred are slightly more spread out than
the predictions of PSIPRED. The coefficient of variation of CISPred, 12.06%,
is lower than that of PSIPRED, 15.6%, which indicates that CISPred predictions
have less dispersion relative to the average 3-state accuracy than PSIPRED
predictions. In total, the predictions of CISPred have a 0.048 (0.826-0.778)
higher average 3-state accuracy than PSIPRED, a 0.0157 (0.0997-0.0840) higher
standard deviation than PSIPRED, and a 3.54 percentage point (15.60%-12.06%)
lower coefficient of variation than PSIPRED.
The average 3-state accuracy, standard deviation, and coefficient of variation
of SSPRO predictions on the 109 CASP sequences are 0.821, 0.0950, and 11.6%,
respectively. The average 3-state accuracy of CISPred, 0.826, is slightly
higher than that of SSPRO, 0.821. The standard deviation of CISPred, 0.0997,
is slightly higher than that of SSPRO, 0.0950, which indicates that the
predictions of CISPred are slightly more spread out than the predictions of
SSPRO. The coefficient of variation of CISPred, 12.06%, is slightly higher
than that of SSPRO, 11.6%, which indicates that CISPred predictions have
slightly more dispersion relative to the average 3-state accuracy than SSPRO
predictions. The predictions of SSPRO and CISPred are very close regarding
average 3-state accuracy, standard deviation, and coefficient of variation.
CISPred has a slightly higher average 3-state accuracy than SSPRO, but the
dispersion of CISPred predictions is slightly higher than the SSPRO
dispersion.
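Since Equation 5.2 defines the coefficient of variation as the standard deviation divided by the mean, the figures quoted in this comparison can be cross-checked with a few lines of arithmetic. The sketch below recomputes the coefficient of variation from the rounded means and standard deviations reported for the three tools; the variable names are assumptions, and small differences are expected because the inputs are rounded. The results for CISPred and SSPRO agree with the reported values to within rounding, while the result for PSIPRED (about 10.8%) does not match the reported 15.6%, so at least one of the PSIPRED figures would need to be checked against the raw per-sequence accuracies.

```python
# (mean Q3, standard deviation, reported coefficient of variation in percent)
reported = {
    "CISPred": (0.826, 0.0997, 12.06),
    "PSIPRED": (0.778, 0.084, 15.6),
    "SSPRO": (0.821, 0.095, 11.6),
}

recomputed = {}
for tool, (mean_q3, std, cv_reported) in reported.items():
    # Equation 5.2, expressed as a percentage.
    recomputed[tool] = 100 * std / mean_q3

# Gap between the reported and the recomputed value, in percentage points.
gaps = {tool: reported[tool][2] - recomputed[tool] for tool in reported}
```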
Figure 5.18 depicts a bar graph that shows the distribution of CISPred
predictions on the 109 CASP sequences when the threshold equals 0.42. It shows
that CISPred predicts 13 sequences with 3-state accuracy between 0.95 and 1,
which is 11.9% of the 109 CASP sequences; 20 sequences with 3-state accuracy
between 0.90 and 0.95, which is 18.3% of the 109 CASP sequences; and 28
sequences with 3-state accuracy between 0.80 and 0.85, which is 25.7% of the
109 CASP sequences.
[Figure 5.17 plots the 3-state accuracy (Q3 score), from 0 to 1, against the
sequence I.D., from 0 to 100.]
Figure 5.17: 3-state accuracies of CISPred predictions on the 109 CASP
sequences when the threshold at which to stop clustering equals 0.42.
[Figure 5.18 is a bar graph of the number of sequences (y-axis) in each
3-state accuracy range (x-axis), from 0.95-1.00 down to 0.00-0.05.]
Figure 5.18: Bar graph showing the distribution of the 3-state accuracies of
CISPred predictions when the threshold equals 0.42 on the 109 CASP sequences.
Figure 5.19 depicts a bar graph showing the distributions of the 3-state
accuracies of the predictions of CISPred, PSIPRED, and SSPRO, which is made by
combining Figures 5.14, 5.16, and 5.18. Figure 5.19 shows that CISPred
predicts a slightly smaller number of sequences than SSPRO with 3-state
accuracy between 0.95 and 1.00, between 0.85 and 0.90, and between 0.80 and
0.85. CISPred predicts 11 (19-8=11) more sequences than SSPRO with 3-state
accuracy between 0.90 and 0.95. In total, CISPred predicts 74 (14+19+12+29=74)
sequences with 3-state accuracy higher than 0.80, and SSPRO predicts 66
(15+8+13+30=66) sequences with 3-state accuracy higher than 0.80, which
indicates that CISPred predicts 8 (74-66=8) more sequences with 3-state
accuracy higher than 0.80, which is 7.3% of the 109 CASP sequences. PSIPRED
predicts 52 (0+4+15+33=52) sequences with 3-state accuracy higher than 0.80,
which is 22 (74-52=22) sequences fewer than CISPred, or 20.2% of the 109 CASP
sequences. CISPred has better prediction performance than PSIPRED and SSPRO
regarding the number of predicted sequences with 3-state accuracy higher than
0.80.
[Figure 5.19 is a bar graph of the number of sequences (y-axis) in each
3-state accuracy range (x-axis), from 0.95-1.00 down to 0.00-0.05, for
CISPred, PSIPRED, and SSPRO.]
Figure 5.19: Bar graph showing the distribution of the 3-state accuracies of
predictions of CISPred, PSIPRED, and SSPRO.
PSIPRED, SSPRO, and CISPred all predict some of the 109 CASP sequences with
very low 3-state accuracy. The sequence with sequence I.D. “9” is one of them:
Figure 5.17 shows that CISPred predicts sequence “9” with 3-state accuracy 0.5,
Figure 5.15 shows that SSPRO predicts it with 3-state accuracy 0.5, and
Figure 5.13 shows that PSIPRED predicts it with 3-state accuracy 0.46.
Sequence “9” is a segment of protein 1WCK, which was discovered in
the year 2005. Figure 5.20 shows the predictions of PSIPRED, SSPRO, and CIS-
Pred for the protein 1WCK, the amino acid sequence of protein 1WCK, and the
real secondary structures of protein 1WCK. The lines starting with “PSIPRED”
are the predicted structures of protein 1WCK provided by PSIPRED, the lines
starting with “SSPRO” are the predicted structures of protein 1WCK provided by
SSPRO, the lines starting with “CISPred” are the predicted structures of protein
1WCK provided by CISPred, the lines starting with “AA” indicate the amino
acid sequence of protein 1WCK, and the lines starting with “REAL” are the real
structures of protein 1WCK, which were determined by experimental methods.
Figure 5.20 shows that CISPred predictions are the same as SSPRO predictions,
which indicates that SSPRO predictions are always selected as the consensus pre-
dictions when predicting the structures of protein 1WCK. As presented in Chapter
3, the folds generated by THREADER are sorted based on the filtered combined
energy Z-scores, and usually only 20 of the folds with filtered combined energy Z-
scores above 3.5 are clustered, parts of which are integrated in CISPred. None of
the folds generated by THREADER has filtered combined energy Z-scores above
3.5 when the sequence of protein 1WCK is queried into THREADER, which in-
dicates that none of the 6251 protein folds in the THREADER library has a
structure that perfectly fits the protein 1WCK. CISPred predicts the structure of
protein 1WCK without integrating any protein folds generated by THREADER.
Furthermore, Figure 5.20 shows that CISPred predicts several segments as helices
(“H”) where the real structures are sheets (“E”). In summary, neither PSIPRED
nor SSPRO predicts the structures of protein 1WCK with high 3-state accuracy, and
the THREADER library does not have any folds that fit well with protein 1WCK.
Thus CISPred does not predict the structures of protein 1WCK with high 3-state
accuracy. The 3-state accuracy of PSIPRED predictions on protein 1WCK is 0.46,
and the 3-state accuracy of SSPRO predictions on protein 1WCK is 0.5. CISPred
selects the predictions provided by SSPRO, which have the higher 3-state
accuracy, as consensus predictions.
SSPRO:   CCCCCCCEEEECCCCEEECCCCCCCCCCCCCCCCCHHHHHHCCCCEEEEECCCCEEEEEE
PSIPRED: CCCCCCEEEEEECCCEEEEECCCCCCCHHHHHHHHHHHHHHCCCCEEEEECCCCEEEEEE
CISPRED: CCCCCCCEEEECCCCEEECCCCCCCCCCCCCCCCCHHHHHHCCCCEEEEECCCCEEEEEE
REAL:    CCCCCEEEEEEEEECCCEEECCCCECCCCEEEEEECCCEEEEECCEEEECCCEEEEEEEE
AA:      GLGLPAGLYAFNSGGISLDLGINDPVPFNTVGSQFGTAISQLDADTFVISETGFYKITVI
                 10        20        30        40        50        60

SSPRO:   EEEHHHHHHCCCEEEECCCCCCCCCHHHHHCCCCEEEEHEHECCCCCCHHHHHHHCCHHH
PSIPRED: ECCCHHHHHCCEEEEECCEECCCCCCHHHHHCHHHHHHHHHHHCCCHHHHHHHHHCCCCE
CISPRED: EEEHHHHHHCCCEEEECCCCCCCCCHHHHHCCCCEEEEHEHECCCCCCHHHHHHHCCHHH
REAL:    EEECCCCCCCEEEEEECCEECCCCCEECCCCCCEEEEEEEEEECCCCEEEEEEEECCCEE
AA:      ANTATASVLGGLTIQVNGVPVPGTGSSLISLGAPIVIQAITQITTTPSLVEVIVTGLGLS
                 70        80        90       100       110       120

SSPRO:   HHHCCCHHHEEEECCC
PSIPRED: EEECCCHHHHHHHHCC
CISPRED: HHHCCCHHHEEEECCC
REAL:    ECCEEEEEEEEEEEEC
AA:      LALGTSASIIIEKVAH
                130
Figure 5.20: Prediction results of CISPred and integrated tools on protein 1WCK.
As presented above, none of the folds generated by THREADER has a filtered
combined energy Z-score above 3.5 when the sequence of protein 1WCK is queried
into THREADER. THREADER does provide several folds with filtered combined
energy Z-scores lower than 3.5. An experiment was conducted in which CISPred
provides a consensus prediction on protein 1WCK by integrating the THREADER
folds with a filtered combined energy Z-score above 2.0. In this experiment,
the 3-state accuracy of the CISPred prediction on protein 1WCK becomes 0.37,
which is lower than 0.5, the 3-state accuracy of the CISPred prediction on
protein 1WCK obtained when only THREADER folds with a filtered combined
energy Z-score above 3.5 are integrated. This experiment indicates that
integrating folds with a low filtered combined energy Z-score does not improve
the 3-state accuracy of the CISPred consensus prediction on protein 1WCK.
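The fold selection that this experiment varies can be sketched as below. It is a simplified illustration rather than the CISPred implementation: the `Fold` type, the function name, and the fold names and Z-scores are all hypothetical, and only the two rules stated in the text are modelled (a Z-score threshold, and a cap of 20 folds taken in descending Z-score order).

```python
from typing import NamedTuple


class Fold(NamedTuple):
    name: str       # fold identifier (hypothetical values below)
    z_score: float  # filtered combined energy Z-score


def select_folds(folds, z_threshold=3.5, max_folds=20):
    """Keep folds above the Z-score threshold, best first, at most max_folds."""
    kept = [f for f in folds if f.z_score > z_threshold]
    kept.sort(key=lambda f: f.z_score, reverse=True)
    return kept[:max_folds]


# Made-up folds in which none passes the default threshold, as for 1WCK.
folds = [Fold("foldB", 2.4), Fold("foldA", 2.9), Fold("foldC", 1.1)]

default_selection = select_folds(folds)       # threshold 3.5
relaxed_selection = select_folds(folds, 2.0)  # threshold lowered to 2.0

print([f.name for f in default_selection])
print([f.name for f in relaxed_selection])
```

With the default threshold the selection is empty, so a consensus built on it would rely on the secondary structure predictors alone; lowering the threshold admits weaker folds, which is the situation the experiment above evaluates.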
Figure 5.15 illustrates that SSPRO predicts 7 of the 109 CASP sequences
with 100% 3-state accuracy. Figure 5.21 illustrates the amino acid sequences and
secondary structure sequences of the 7 sequences. These 7 secondary structure
sequences do not have distinct structural patterns or structural features, and the
occurrence of each structure type appears random. No particular structure patterns or
structure type compositions are found to be predicted with high 3-state accuracy
by SSPRO.
5.5.3 Comparison on 1758 Sequences
Figure 5.22 illustrates the 3-state accuracy of PSIPRED on 1758 sequences; Fig-
ure 5.23 is a bar graph showing the distribution of the 3-state accuracies of
PSIPRED predictions on the 1758 sequences. The average 3-state accuracy, stan-
dard deviation, and the coefficient of variation of PSIPRED predictions on 1758
sequences are 0.789, 0.089, and 11.2%. Figures 5.22 and 5.23 illustrate that
PSIPRED predicts 21 sequences with 3-state accuracy between 0.95 and 1.00, and
predicts 528 sequences with 3-state accuracy between 0.80 and 0.85, which is 30%
of the 1758 sequences. Figure 5.22 shows that PSIPRED predicts two proteins
with a 3-state accuracy of 0. Figure 5.24 shows the amino acid sequences of these
two proteins. The reason PSIPRED fails to predict these two proteins is uncertain,
but it is probably a problem in the PSIPRED program; the two sequences
do show considerable similarity.
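The three summary statistics reported for each tool can be computed as in the sketch below. It assumes that the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage, and it uses the population standard deviation; the thesis does not state which form of the standard deviation was used, so that choice is an assumption, and the sample scores are hypothetical.

```python
import statistics


def summarize(scores):
    """Return (mean, standard deviation, coefficient of variation in %)."""
    mean = statistics.mean(scores)
    # Population standard deviation is an assumption; the sample form
    # (statistics.stdev) is equally plausible for the reported figures.
    std = statistics.pstdev(scores)
    return mean, std, 100 * std / mean


# Hypothetical Q3 scores, for illustration only.
mean, std, cv = summarize([0.8, 0.9, 1.0])
print(f"mean={mean:.3f} std={std:.3f} cv={cv:.1f}%")
```

Because the coefficient of variation normalizes the standard deviation by the mean, it allows the dispersion of tools with different average accuracies to be compared on the same scale.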
Figure 5.25 illustrates the 3-state accuracy of SSPRO predictions on 1758
sequences, and Figure 5.26 shows the distribution of the 3-state accuracies of
>1O8V MEAFLGTWKMEKSEGFDKIMERLGVDFVTRKMGNLVKPNLIVTDLGGGKYKMRSESTFKTTEXSFKLGEKFKEVT PDSREVASLITVENGVMKHEQDDKTKVTYIERVVEGNELKATVKVDEVVCVRTYSKVA
CHHHCEEEEEEEEECHHHHHHHHCCCHHHHHHHHHCCCEEEEEEEECCEEEEEEECCCCEEEEEEECCCCEEEEC CCCCEEEEEEEEECCEEEEEEECCCCEEEEEEEEECCEEEEEEEECCEEEEEEEEECC
>1M20 AVKYYTLEEIQKHNNSKSTWLILHYKVYDLTKYLEEHPGGEEVLREQAGGDATENFEDVGHSTDARELSKTFIIG ELHPDDR
CCCEECHHHHCCCECCCCEEEEECCEEEECCCCCCCCCCCCHHHHHHCCCECHHHHHHCCCCHHHHHHHHHHEEE EECHHHC
>1NIJ NPIAVTLLTGFLGAGKTTLLRHILNEQHGYKIAVIENEFGEVSVDDQLIGDRATQIKTLTNGCICCSRSNELEDA LLDLLDNLDKGNIQFDRLVIECTGMADPGPIIQTFFSHEVLCQRYLLDGVIALVDAVHADEQMNQFTIAQSQVGY ADRILLTKTDVAGEAEKLHERLARINARAPVYTVTHGDIDLGLLFNTNGFMLEENVVSTKPRFHFIADKQNDISS IVVELDYPVDISEVSRVMENLLLESADKLLRYKGMLWIDGEPNRLLFQGVQRLYSADWDRPWGDEKPHSTMVFIG IQLPEEEIRAAFAGLRK
CCEEEEEEEECCCCCCHHHHHHHHHCCCCCCEEEECCCCCCCCEEEEEECCCCCEEEEECCCCEEECCCCCHHHH HHHHHHHHHHCCCCCCEEEEEEECCCCHHHHHHHHHHCHHHHHHEEEEEEEEEEECCCHHHHHHHCHHHHHHHHC CCEEEEECCCCCCCCHHHHHHHHHHCCCCCEEECCCCCCCHHHHCCCCCCCCCCCCCCCCCCCCCCHHHHCCEEE EEEEECCCECHHHHHHHHHHHHHHCCCCEEEEEEEECECCCCEEEEEEEECCEEEEEEEEECCCCCCCEEEEEEE ECCCHHHHHHHHHCCCC
>1H7H SVDFAFELRKAQDTGKIVMGARKSIQYAKMGGAKLIIVARNARPDIKEDIEYYARLSGIPVYEFEGTSVELGTLL GRPHTVSALAVVDPGESRILAL
CCCHHHHHHHHHHHCEEEECHHHHHHHHHCCCCCEEEEECCCCHHHHHHHHHHHHHHCCCEEEECCCHHHHHHHC CCCCCCCEEEEEECCCCCHHHC
>1MQ7 STTLAIVRLDPGLPLPSRAHDGDAGVDLYSAEDVELAPGRRALVRTGVAVAVPFGMVGLVHPRSGLATRVGLSIV NSPGTIDAGYRGEIKVALINLDPAAPIVVHRGDRIAQLLVQRVELVELVEVSSFDEAGL
CCCCEEEECCCCCCCCCCCCCCCCEEEEECCCCEEECCCCEEEEEEEEEEECCCCEEEEEECCCCHHHHCCEEEC CCCEEECCCCCCEEEEEEEECCCCCCEEECCCCEEEEEEEEECCCCCCCCCCCCCCCCC
>1NXJ AISFRPTADLVDDIGPDVRSCDLQFRQFGGRSQFAGPISTVRCFQDNALLKSVLSQPSAGGVLVIDGAGSLHTAL VGDVIAELARSTGWTGLIVHGAVRDAAALRGIDIGIKALGTNPRKSTKTGAGERDVEITLGGVTFVPGDIAYSDD DGIIVV
CCCCCCHHHHHHHHCCCCEECCCCCEECCCECCEEEEEEEEECCCECHHHHHHHHCCCCCCEEEEECCCCCCCEE ECHHHHHHHHHHCCCEEEEEEEECCHHHHCCCCCEEEEEEECCCECECCCCCEECCCEEECCEEECCCCEEEECC CCEEEC
>1IZN RVSDEEKVRIAAKFITHAPPGEFNEVFNDVRLLLNNDNLLREGAAHAFAQYNMDQFTPVKIEGYDDQVLITEHGD LGNGRFLDPRNKISFKFDHLRKEASDPQPEDTESALKQWRDACDSALRAYVKDHYPNGFCTVYGKSIDGQQTIIA CIESHQFQPKNFWNGRWRSEWKFTITPPTAQVAAVLKIQVHYYEDGNVQLVSHKDIQDSVQVSSDVQTAKEFIKI IENAENEYQTAISENYQTMSDTTFKALRRQLPVTRTKIDWNKILSYKIGK
CCCHHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHCCHHHHHHHHHHHHHHHHHHCCEEECCCCCCCCEEECCCCE CCCCEEECCCCCEEEECCCCCCCCCCCEECCCCCCCHHHHHHHHHHHHHHHHHHCCCEEEEEEEEEECCEEEEEE EEEEEEEEHHHCEEEEEEEEEEEECCCCEEEEEEEEEEEEEECCCEEEEEEEEEEEEEEEECCCCHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHCHHHHHCCCCCCCCCCCCHHHHCCCCCCC
Figure 5.21: Amino acid sequences and secondary structure sequences predicted by SSPRO with 100% 3-state accuracy.
[Plot. Axes: Sequence I.D. (0 to 1600) and 3-state accuracy (Q3 score, 0 to 1).]
Figure 5.22: 3-state accuracy of PSIPRED on 1758 sequences with average of 3-state accuracy 0.789, standard deviation 0.089, and coefficient of variation 11.2%.
[Bar graph. Horizontal axis: 3-state accuracy ranges (0.95-1.00, 0.90-0.95, ..., 0.40-0.45, 0.05-0.40, 0.00-0.05). Vertical axis: Number of sequences.]
Figure 5.23: Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on the 1758 sequences dataset.
>1BT0:_|PDBID|CASP|TEST|SEQUENCE
MLIKVKTLTGKEIEIDIEPTDTIDRIKERVEEKEGIPPVQQRLIYAGKQLADDKTAKDYN
IEGGSVLHLVLAL
>1V80:_|PDBID|CASP|TEST|SEQUENCE
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
IQKESTLHLVLRLRGG
Figure 5.24: The amino acid sequences for which PSIPRED fails to predict the secondary structures.
SSPRO on 1758 sequences. The average 3-state accuracy, standard deviation,
and the coefficient of variation of SSPRO predictions on the 1758 sequences are
0.911, 0.101, and 11.1%. Figure 5.25 shows that SSPRO predicts 949 sequences
with 3-state accuracy between 0.95 and 1.00, which is 54.0% of the 1758 sequences.
SSPRO has much better prediction performance compared to PSIPRED regarding
3-state accuracy and the number of sequences with high 3-state accuracies. The
standard deviations of SSPRO and PSIPRED predictions are almost the same.
Figure 5.27 shows the 3-state accuracies of CISPred predictions on the 1758
sequences when the threshold at which to stop clustering equals 0.42. The
average 3-state accuracy of CISPred predictions on the 1758 sequences is 0.893.
Compared to the average 3-state accuracies of PSIPRED and SSPRO on the 1758
sequences, 0.789 and 0.911, the average 3-state accuracy of CISPred is 0.104
(0.893-0.789=0.104) higher than that of PSIPRED, but 0.018
(0.911-0.893=0.018) lower than that of SSPRO. The standard deviation of
CISPred predictions on the 1758 sequences is 0.095. Compared to the standard
deviations of PSIPRED and SSPRO on the 1758 sequences, 0.089 and 0.101, the
standard deviation of CISPred is 0.006 (0.095-0.089=0.006) higher than that of
PSIPRED, but 0.006 (0.101-0.095=0.006) lower than that of SSPRO. The
coefficient of variation of CISPred predictions on the 1758 sequences is 10.7%.
Compared to the coefficients of variation of PSIPRED and SSPRO on the 1758
sequences, 11.2% and 11.1%, the coefficient of variation of CISPred is 0.5
percentage points (11.2%-10.7%=0.5%) lower than that of PSIPRED, and 0.4
percentage points (11.1%-10.7%=0.4%) lower than that of SSPRO. In total, the
average 3-state accuracy of CISPred is slightly lower (by 0.018, or 1.8
percentage points) than that of SSPRO, and significantly higher (by 0.104, or
10.4 percentage points) than that of PSIPRED; however, the dispersions of the
[Plot. Axes: Sequence I.D. (0 to 1600) and 3-state accuracy (Q3 score, 0 to 1).]
Figure 5.25: 3-state accuracy of SSPRO predictions on 1758 sequences with average of 3-state accuracy 0.911, standard deviation 0.101, and coefficient of variation 11.1%.
[Bar graph. Horizontal axis: 3-state accuracy ranges (0.95-1.00, 0.90-0.95, ..., 0.40-0.45, 0.05-0.40, 0.00-0.05). Vertical axis: Number of sequences.]
Figure 5.26: Bar graph showing the distribution of the 3-state accuracies of SSPRO predictions on the 1758 sequences dataset.
predictions of PSIPRED, SSPRO, and CISPred do not have large differences.
As shown in Figure 5.27, CISPred provides prediction with very low 3-state
accuracy, 0.14, on a protein sequence with sequence I.D. “856.” Sequence “856”
is protein 1XR0. Figure 5.28 shows the amino acid sequence, 8-state DSSP sec-
ondary structure, and 3-state secondary structure of protein 1XR0; Figure 5.29
shows the prediction results from SSPRO for protein 1XR0; Figure 5.30 shows the
prediction results from PSIPRED for protein 1XR0; and Figure 5.31 shows the
alignment results from THREADER for protein 1XR0. The 3-state accuracy of
SSPRO and PSIPRED predictions for protein 1XR0 are 0.59 and 0.45 respectively,
which are relatively low 3-state accuracies. The 3-state accuracy of CISPred pre-
diction for protein 1XR0 is lower than that of both SSPRO and PSIPRED, and
is 0.14, a very low 3-state accuracy. THREADER provides only one fold with a
filtered combined energy Z-score higher than 3.5, and this fold has very high
confidence scores, as shown in Figure 5.31. However, the fold structure aligned
with protein 1XR0 is “HHHHHHHHHHHHHCCCHHHHHH,” which has little
correlation with the actual structure of 1XR0 shown in Figure 5.28,
“CCCCCCCCCCCCCCCCCCEECC.” Because this fold structure correlates poorly with
the actual structure but has high confidence scores, it lowers the 3-state
accuracy of CISPred. However, predictions like the one for protein 1XR0 are
rare in both the 109 CASP sequences and the 1758 sequences.
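The 3-state accuracies quoted for protein 1XR0 can be checked with a short sketch. It assumes that the 3-state accuracy (Q3) is the fraction of residues whose predicted state equals the actual state, and it uses a common 8-state to 3-state reduction (H, G, I to H; E, B to E; everything else to C), which is an assumption here and may differ from the mapping defined in Chapter 2. The strings are copied from Figures 5.28 to 5.31, and the function and variable names are hypothetical.

```python
# Assumed 8-to-3 state reduction; any DSSP code not listed maps to coil "C".
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Reduce an 8-state DSSP string to the 3 states H, E, and C."""
    return "".join(DSSP_TO_3.get(state, "C") for state in dssp)


def q3(predicted, actual):
    """Fraction of positions at which the predicted state is correct."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual structures differ in length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


dssp_1xr0 = "CCSCSCCCCCSCCCCSCCEECC"  # Figure 5.28
actual = reduce_dssp(dssp_1xr0)

sspro = "CCHHHHHHHHCCCCCCCEEECC"  # Figure 5.29
# Third column of Figure 5.30, written as runs: 2 C, 9 H, 4 C, 5 E, 2 C.
psipred = "CC" + "H" * 9 + "CCCC" + "E" * 5 + "CC"
fold = "HHHHHHHHHHHHHCCCHHHHHH"  # THREADER fold segment aligned to 1XR0

for name, prediction in [("SSPRO", sspro), ("PSIPRED", psipred), ("fold", fold)]:
    print(name, round(q3(prediction, actual), 2))
```

Under these assumptions SSPRO matches 13 of 22 residues and PSIPRED 10 of 22, which agrees with the 0.59 and 0.45 reported above. The aligned fold segment matches only 3 of 22 residues, about 0.14, the same value reported for the CISPred prediction.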
Figure 5.32 depicts a bar graph showing the distribution of the 3-state
accuracies of CISPred predictions on the 1758 sequences dataset when the threshold
is 0.42. Figure 5.33 depicts a bar graph showing the distribution of the 3-state
accuracies of the predictions of PSIPRED, SSPRO, and CISPred on the 1758
sequences dataset when the threshold is 0.42. Figure 5.33 is made by combining
Figures 5.23, 5.26, and 5.32. Figure 5.33 shows that compared to CIS-
[Plot. Axes: Sequence I.D. (0 to 1600) and 3-state accuracy (Q3 score, 0 to 1).]
Figure 5.27: The 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold at which to stop clustering equals 0.42. The average 3-state accuracy, standard deviation, and coefficient of variation of these predictions are 0.893, 0.095, and 10.7% respectively.
ID:      1XR0
AA:      HSQMAVHKLAKSIPLRRQVTVS
DSSP:    CCSCSCCCCCSCCCCSCCEECC
3-state: CCCCCCCCCCCCCCCCCCEECC
Figure 5.28: The amino acid sequence, 8-state DSSP secondary structure, and 3-state secondary structure of protein 1XR0.
1XR0:_|PDBID|CASP|TEST|SEQUENCE
HSQMAVHKLAKSIPLRRQVTVS
CCHHHHHHHHCCCCCCCEEECC
Figure 5.29: Prediction result of SSPRO on protein 1XR0.
# PSIPRED VFORMAT
   1 H C   0.988  0.005  0.004
   2 S C   0.640  0.351  0.037
   3 Q H   0.305  0.506  0.155
   4 M H   0.153  0.641  0.236
   5 A H   0.102  0.757  0.122
   6 V H   0.038  0.915  0.036
   7 H H   0.029  0.908  0.043
   8 K H   0.053  0.898  0.019
   9 L H   0.069  0.928  0.009
  10 A H   0.200  0.800  0.015
  11 K H   0.358  0.608  0.016
  12 S C   0.577  0.412  0.027
  13 I C   0.916  0.081  0.038
  14 P C   0.789  0.125  0.106
  15 L C   0.523  0.254  0.257
  16 R E   0.369  0.367  0.386
  17 R E   0.258  0.344  0.421
  18 Q E   0.263  0.145  0.486
  19 V E   0.230  0.064  0.606
  20 T E   0.293  0.037  0.639
  21 V C   0.606  0.033  0.391
  22 S C   0.994  0.000  0.005
Figure 5.30: Prediction result of PSIPRED on protein 1XR0.
Pred, SSPRO predicts 301 (949-648=301) more sequences with 3-state accuracy
between 0.95 and 1.00, and compared to PSIPRED, SSPRO predicts 928
(949-21=928) more sequences with 3-state accuracy between 0.95 and 1.00. This
indicates that SSPRO has significantly better prediction performance regarding
the number of sequences with 3-state accuracy between 0.95 and 1.00. CISPred
predicts 1357 (648+454+255=1357) sequences with 3-state accuracy higher than 0.85,
and SSPRO predicts 1330 (949+217+164=1330) sequences with 3-state accuracy
higher than 0.85, which indicates that CISPred has slightly better performance (27
more sequences) regarding the number of sequences with 3-state accuracy higher
than 0.85. In total, SSPRO has significantly better performance than CISPred
regarding the number of sequences with 3-state accuracy between 0.95 and 1.00;
CISPred has slightly better performance than SSPRO regarding the number of
sequences with 3-state accuracy higher than 0.85, and significantly better
performance than PSIPRED regarding the number of sequences with 3-state accuracy
between 0.95 and 1.00, and between 0.90 and 0.95; PSIPRED predicts more sequences than
both SSPRO and CISPred with 3-state accuracy between 0.55 and 0.90, but also
THREADER 3.5 - Protein Sequence Threading Program
Build date : Sep 4 2004
Copyright (C) 2002 University College London
Portions Copyright (C) 1990 D.T.Jones

Registered user: [email protected]

Reading mean force potential tables...
Alignment with 2cpgA0:
        10        20        30        40
-----------9999999999999---99999999999999--
CEEEEEEEEEHHHHHHHHHHHHHHCCCHHHHHHHHHHHHHHCC
MKKRLTITLSESVLENLEKMAREMGLSKSAMISVALENYKKGQ
                  | |    |      |
-----------HSQMAVHKLAKSIPLRRQVTVS----------
                   10        20

Percentage Identity = 18.2.
Figure 5.31: Alignment result of THREADER on protein 1XR0.
more sequences with low 3-state accuracy, 0.00-0.40.
[Bar graph. Horizontal axis: 3-state accuracy ranges (0.95-1.00, 0.90-0.95, ..., 0.40-0.45, 0.05-0.40, 0.00-0.05). Vertical axis: Number of sequences.]
Figure 5.32: Bar graph showing the distribution of the 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold is 0.42.
5.5.4 Conclusions
CISPred has an average 3-state accuracy of 0.826 (82.6%) on the 109 CASP
sequences, and of 0.893 (89.3%) on the 1758 sequences. CISPred predicts 74
of the 109 CASP sequences with 3-state accuracy higher than 0.80, which is
67.9% of the 109 CASP sequences. CISPred predicts 1357 of the 1758 sequences
with 3-state accuracy higher than 0.85, which is 77.2% of the 1758 sequences.
Figure 5.34 illustrates the average 3-state accuracy, standard deviation, and
coefficient of variation of PSIPRED, SSPRO and CISPred on the 109 CASP se-
[Bar graph. Horizontal axis: 3-state accuracy ranges (0.95-1.00, 0.90-0.95, ..., 0.40-0.45, 0.05-0.40, 0.00-0.05). Vertical axis: Number of sequences.]
Figure 5.33: Bar graph showing the distribution of the 3-state accuracies of the predictions of PSIPRED, SSPRO, and CISPred on the 1758 sequences dataset when the threshold is 0.42.
quences and the 1758 sequences. Figure 5.34 shows that CISPred has higher
average 3-state accuracy than SSPRO and PSIPRED on the 109 CASP sequences,
and CISPred has higher average 3-state accuracy than PSIPRED on the 1758
sequences; however, CISPred has slightly lower average 3-state accuracy than
SSPRO on the 1758 sequences. The standard deviation and coefficient of
variation of CISPred usually lie between those of SSPRO and PSIPRED, which
indicates that the dispersion of CISPred predictions is in a similar range to
those of SSPRO and PSIPRED.
Figure 5.34: Summary of experimental results.
SSPRO predicts more sequences with high 3-state accuracy (≥0.95) than
PSIPRED and CISPred, particularly in the 1758 sequences. CISPred predicts
more sequences with 3-state accuracy higher than 0.85 than both PSIPRED
and SSPRO on both the 109 CASP sequences and the 1758 sequences.
PSIPRED fails to predict the structures of two sequences in the 1758 sequences,
while SSPRO and CISPred do not have any failures, which indicates that CISPred
can provide results when some of the integrated tools fail to predict structures.
Chapter 6
Conclusion
6.1 Thesis Contributions
Currently, protein secondary structure prediction tools have combined evolution-
ary information and many machine learning algorithms to make predictions, and
have already achieved a 3-state accuracy around 70%-80%. Because of the va-
riety of methods used by existing tools, the results from the existing tools have
discrepancies. CISPred provides consensus predictions that are determined by in-
tegrating several existing popular tools. The methods integrated in CISPred are
various: the two popular secondary structure prediction tools integrated in CIS-
Pred, SSPRO and PSIPRED, make predictions by combining machine learning
algorithms, neural networks, and PSI-BLAST, which provides evolutionary infor-
mation; the fold recognition tool, THREADER, integrated in CISPred implements
the threading method, which combines both comparative modelling methods (the
alignment of target sequences and fold sequences), and ab initio prediction meth-
ods (determine the fitness of a template fold by computing the free energy of a
target sequence); and the motif structure formulae integrated in CISPred use the
structures of protein motifs to predict the structure of proteins. The experimental
results shown in Chapter 5 illustrate that CISPred has a higher 3-state accuracy
than both PSIPRED and SSPRO (0.6% higher than SSPRO and 4.8% higher
than PSIPRED) on the 109 CASP sequences, and has a higher 3-state accuracy
than PSIPRED (10.4% higher), but a slightly lower (1.8% lower) 3-state accu-
racy than SSPRO on the 1758 sequences. CISPred predicts more sequences with
3-state accuracy higher than 85% than both SSPRO and PSIPRED on both the
dataset containing the 109 CASP sequences and the dataset containing the 1758
sequences.
The threading method is a unique and representative method in protein fold
recognition. The predicted structure of a target sequence is found by threading
the target sequence through each fold in a fold library; the fold in which the
target sequence has minimum free energy is considered the best-fitting
structure for the target sequence. The threading method requires a great deal
of computation time to thread the target sequence through each template fold
and calculate free energy. CISPred implements the threading method concurrently
by dividing the fold library into 10 sub-libraries, and threading a target
sequence through all 10 sub-libraries simultaneously. Moreover, the other two
existing tools integrated in CISPred, and the finding of motif structure
formulae, are also run concurrently, which reduces the execution time by up to
a factor of 8, according to Figure 4.9 in Chapter 4. CISPred allows a user to
query multiple sequences each time CISPred is executed, and the executions of
the queried sequences are performed concurrently on a high-performance SUN
Cluster.
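The idea of splitting the fold library and threading through the parts at the same time can be sketched as follows. This is only an illustration of the partitioning scheme, not the CISPred implementation, which runs THREADER on a cluster: the function names are hypothetical, the folds are plain strings, and `energy` is a stand-in for the free-energy calculation that THREADER performs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_library(folds, parts=10):
    """Split the fold library into `parts` sub-libraries of near-equal size."""
    return [folds[i::parts] for i in range(parts)]


def thread_sublibrary(target, sublibrary, energy):
    """Score every fold in one sub-library; return (energy, fold) pairs."""
    return [(energy(target, fold), fold) for fold in sublibrary]


def best_fold(target, folds, energy, parts=10):
    """Thread the target through all sub-libraries concurrently and
    return the fold with the minimum energy."""
    sublibraries = split_library(folds, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = list(
            pool.map(lambda sub: thread_sublibrary(target, sub, energy), sublibraries)
        )
    scored = [pair for chunk in chunks for pair in chunk]
    return min(scored, key=lambda pair: pair[0])[1]


def toy_energy(target, fold):
    """Placeholder score: the length difference, NOT a real free energy."""
    return abs(len(target) - len(fold))


library = ["A" * n for n in range(1, 31)]  # hypothetical fold library
print(best_fold("A" * 7, library, toy_energy))
```

Threads are used here only to keep the sketch self-contained. For a CPU-bound energy calculation in Python, separate processes or separate cluster jobs, as in the setup described above, would be the more natural choice.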
The user interface of CISPred is a website, which makes it easily accessible:
any user with Internet access is able to execute CISPred, and the
prediction results are sent to the user's email address.
The experiments mentioned in Chapter 5 show that PSIPRED fails to predict
the structures of two sequences in the test dataset containing the 1758 sequences.
CISPred is still able to predict the structures of these two sequences, because
CISPred provides prediction results by integrating other tools, which shows that
CISPred is relatively more stable and more reliable than a single prediction tool.
6.2 Future Work
CISPred provides predictions of 3 secondary structural types: Coil (“C”), Helix
(“H”), and Strand (“E”). The DSSP (Dictionary of Protein Secondary Structure)
defines 8 secondary structure types, as mentioned in Chapter 2. Currently, some
existing tools, such as SSPRO8 [37], predict all 8 secondary structure types in
DSSP, although most of them are in the experimental stage. CISPred could be
modified to predict 8 secondary structure types by integrating existing tools that
predict all 8 DSSP structure types, or improve the CISPred predictions of 3 struc-
tural types by integrating existing tools that predict all 8 DSSP structure types.
Both SSPRO and PSIPRED use PSI-BLAST profiles as inputs for their neural
network architectures. The PSI-BLAST profiles used by SSPRO were generated in
2004, and the PSI-BLAST profiles used by PSIPRED were generated in 1999. The
109 CASP sequences used in the experiments mentioned in Chapter 5 are from
CASP3 (1998), CASP4 (2000), CASP5 (2002), and CASP6 (2004), and the 1758
sequences may contain sequences discovered before 1999. Therefore, the sequences
in the 109 CASP sequences dataset and the 1758 sequences dataset may already
have been included in the PSI-BLAST profiles used by SSPRO, and the sequences
in CASP3 (1998) and the 1758 sequences may already have been included in
the PSI-BLAST profiles used by PSIPRED, which means that the experimental
results mentioned in Chapter 5 may not be objective estimates of the performance
of either SSPRO or PSIPRED. The solution to this problem is to test PSIPRED
and SSPRO on recently discovered proteins, and then to predict the structures
of these proteins without considering their homologous information. Experiments
may be conducted to see whether the prediction performances of both SSPRO
and PSIPRED could be improved by using the most recent PSI-BLAST profiles,
such as the PSI-BLAST profiles generated in 2006 or 2007. The source code
of both PSIPRED and SSPRO is available for download, which makes it more
feasible to conduct these experiments. However, the most recent PSI-BLAST
profiles may include homologous information about the newly discovered proteins,
which makes it difficult to find test sequences that do not have homologous
information included in their prediction.
SSPRO combines both neural networks and homologue information to make
predictions. For a queried sequence, SSPRO finds homologues of the queried se-
quence, and uses the structure of the most significant homologue as its predicted
structure of the sequence. For the queried sequences that do not have significant
homologues, SSPRO uses neural networks and PSI-BLAST profiles to make pre-
dictions. In the experiments mentioned in Chapter 5, SSPRO uses the structures
of the most significant homologues as its predictions on the sequences that have
been found for several years and have had plenty of homologue information avail-
able. This makes SSPRO predictions on these sequences have very high 3-state
accuracy. To have an objective estimate of the performance of SSPRO,
experiments may be conducted in which homologue information is not considered, and
only neural networks and PSI-BLAST profiles are used to make predictions.
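The two-stage behaviour described above, and the proposed experiment that switches the first stage off, can be outlined as below. This is a schematic of the control flow only and does not reflect SSPRO's actual code or interface: `find_homologue` and `neural_network` are hypothetical callables supplied by the caller, and the lookup table and predictor in the example are toy stand-ins.

```python
def predict_structure(sequence, find_homologue, neural_network, use_homologues=True):
    """Homologue-first prediction with a neural-network fallback.

    Returns the predicted structure and the name of the stage that produced it.
    """
    if use_homologues:
        structure = find_homologue(sequence)  # None if no significant homologue
        if structure is not None:
            return structure, "homologue"
    return neural_network(sequence), "neural network"


# Toy stand-ins, for illustration only.
known_structures = {"HSQ": "CCH"}  # sequence -> structure of a homologue


def toy_network(sequence):
    """Placeholder predictor that labels every residue as coil."""
    return "C" * len(sequence)


print(predict_structure("HSQ", known_structures.get, toy_network))
print(predict_structure("HSQ", known_structures.get, toy_network, use_homologues=False))
```

Running every test sequence with `use_homologues=False` corresponds to the experiment suggested above, in which only the neural networks and PSI-BLAST profiles contribute to the prediction.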
References
[1] A. Bairoch, P. Bucher, and K. Hofmann. The PROSITE database, its status
in 1997. Nucleic Acids Research, 25:217–221, 1997.
[2] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig,
I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids
Research, 28:235–242, 2000.
[3] J. M. Chandonia and M. Karplus. The importance of larger data sets for
protein secondary structure prediction with neural networks. Protein Science,
5:768–774, 1996.
[4] J. M. Chandonia and M. Karplus. New methods for accurate prediction of
protein secondary structure. Proteins, 35:293–306, 1999.
[5] P. Y. Chou and G. D. Fasman. Conformational parameters for amino acids
in helical, β-sheet, and random coil regions calculated from proteins.
Biochemistry, 28:211–222, 1974.
[6] J. A. Cuff and G. J. Barton. Prediction of protein secondary structure by com-
bining nearest-neighbor algorithms and multiple sequence alignments. Jour-
nal of Molecular Biology, 247:11–15, 1995.
[7] J. A. Cuff and G. J. Barton. Secondary structure prediction using segment
similarity. Protein Engineering, 10:1143–1153, 1997.
[8] J. A. Cuff and G. J. Barton. Evaluation and improvement of multiple sequence
methods for protein secondary structure prediction. Proteins, 34:508–519,
1999.
[9] J. A. Cuff and G. J. Barton. Application of multiple sequence alignment
profiles to improve protein secondary structure prediction. Proteins, 40:502–
511, 2000.
[10] J. A. Cuff, M. Clamp, A. S. Siddiqui, M. Finlay, and G. J. Barton. JPred: A
consensus secondary structure prediction server. Bioinformatics, 14:892–893,
1998.
[11] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis
and display of genome-wide expression patterns. Proceedings of the National
Academy of Sciences of the United States of America, 95:14863–14868, 1998.
[12] D. Frishman and P. Argos. Protein secondary structure prediction using
nearest-neighbor methods. Journal of Molecular Biology, 232:1117–1129,
1993.
[13] D. Frishman and P. Argos. Incorporation of non-local interactions in pro-
tein secondary structure prediction from the amino acid sequence. Protein
Engineering, 9:133–142, 1996.
[14] D. Frishman and P. Argos. Seventy-five percent accuracy in protein secondary
structure prediction. Proteins, 27:329–335, 1997.
[15] J. Garnier, J. Gibrat, and B. Robson. GOR method for predicting protein
secondary structure from amino acid sequence. Methods Enzymol, 266:540–553,
1996.
[16] J. Garnier, D. J. Osguthorpe, and B. Robson. Analysis of the accuracy
and implications of simple methods for predicting the secondary structure of
globular proteins. Journal of Molecular Biology, 120:97–120, 1978.
[17] J. F. Gibrat, J. Garnier, and B. Robson. Further developments of protein
secondary structure prediction using information theory. New parameters and
consideration of residue pairs. Journal of Molecular Biology, 198:425–443,
1987.
[18] N. Goldman, J. L. Thorne, and D. T. Jones. Using evolutionary trees in pro-
tein secondary structure prediction and other comparative sequence analysis.
Journal of Molecular Biology, 263:196–208, 1996.
[19] L. H. Holley and M. Karplus. Protein Secondary Structure Prediction with
a Neural Network. Proceedings of the National Academy of Sciences of the
United States of America, 86:152–156, 1989.
[20] R. W. W. Hooft, M. Scharf, C. Sander, and G. Vriend. The PDBFINDER
database: A summary of PDB, DSSP and HSSP information with added
value. Computer Application in the Bioscience (CABIOS), 12:525–529, 1996.
[21] X. Huang, D. S. Huang, G. Z. Zhang, Y. P. Zhu, and Y. X. Li. Predic-
tion of protein secondary structure using improved two-level neural network
architecture. Protein and Peptide Letters, 12:805–811, 2005.
[22] N. Hulo, C. J. Sigrist, V. L. Saux, P. S. Langendijk-Genevaux, L. Bordoli,
A. Gattiker, E. D. Castro, P. Bucher, and A. Bairoch. Recent improvements
to the PROSITE database. Nucleic Acids Research, 32:D134–D137, 2004.
[23] D. T. Jones. THREADER: Protein sequence threading by double dynamic
programming. In Steven Salzberg, David Searls, and Simon Kasif, editors,
Computational Methods in Molecular Biology, volume 32, chapter 13. Elsevier
Science, Amsterdam, The Netherlands, 1998.
[24] D. T. Jones. Protein secondary structure prediction based on position-specific
scoring matrices. Journal of Molecular Biology, 292:195–202, 1999.
[25] H. Kaur and G. P. Raghava. Prediction of beta-turns in proteins from multiple
alignment using neural network. Protein Science, 12:627–634, 2003.
[26] Kinemage. “Kinemage, Next Generation”. Online. Accessed
September 26th 2006, Available HTTP:
http://kinemage.biochem.duke.edu/software/king.php.
[27] R. D. King and M. J. E. Sternberg. Identification and application of the
concepts important for accurate and reliable protein secondary structure pre-
diction. Protein Science, 5:2298–2310, 1996.
[28] D. G. Kneller, F. E. Cohen, and R. Langridge. Improvements in protein
secondary structure prediction by an enhanced neural network. Journal of
Molecular Biology, 214:171–182, 1990.
[29] A. Kryshtafovych, C. Venclovas, K. Fidelis, and J. Moult. Progress over
the first decade of CASP experiments. Proteins: Structure, Function, and
Bioinformatics, 61:225–236, 2005.
[30] J. M. Levin. Exploring the limits of nearest neighbour secondary structure
prediction. Protein Engineering, 10:771–776, 1997.
[31] V. I. Lim. Algorithms for prediction of alpha helices and structural regions
in globular proteins. Journal of Molecular Biology, 88:873–894, 1974.
[32] P. Lio, N. Goldman, J. L. Thorne, and D. T. Jones. PASSML: combining
evolutionary inference and protein secondary structure prediction. Bioinfor-
matics, 14:726–733, 1998.
[33] P. K. Mehta, J. Heringa, and P. Argos. A simple and fast approach to pre-
diction of protein secondary structure from multiply aligned sequences with
accuracy above 70%. Protein Science, 4:2517–2525, 1995.
[34] K. Nishikawa and T. Noguchi. Predicting protein secondary structure based
on amino acid sequence. Methods Enzymol, 202:31–34, 1995.
[35] D. Petrey, Z. Xiang, C. L. Tang, and L. Xie. Decision tree-based formation of
consensus protein secondary structure prediction. Bioinformatics, 15:1039–
1046, 1999.
[36] D. Petrey, Z. Xiang, C. L. Tang, and L. Xie. Using multiple structure align-
ments, fast model building, and energetic analysis in fold recognition and
homology modeling. Proteins, 53:430–435, 2003.
[37] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of
protein secondary structure in three and eight classes using recurrent neural
networks and profiles. Proteins, 47:228–235, 2002.
[38] N. Qian and T. J. Sejnowski. Predicting the secondary structure of globular
proteins using neural network models. Journal of Molecular Biology, 202:865–
884, 1988.
[39] S. K. Riis and A. Krogh. Improving prediction of protein secondary structure
using structured neural networks and multiple sequence alignments. Journal
of Computational Biology, 3:163–183, 1996.
[40] B. Rost and C. Sander. Prediction of protein secondary structure at better
than 70% accuracy. Journal of Molecular Biology, 232:584–599, 1993.
[41] B. Rost and C. Sander. Combining Evolutionary Information and Neural Net-
works to Predict Protein Secondary Structure. Proteins: Structure, Function,
and Genetics, 19:52–72, 1994.
[42] B. Rost, G. Yachdav, and J. Liu. The PredictProtein server. Nucleic Acids
Research, 32:W321–W326, 2004.
[43] Web-Books.Com. Molecular biology web book. Online, September
2005. Accessed August 26th 2006, Available HTTP:
http://www.web-books.com/MoBio/Free/Ch2C6.htm.
[44] Wikipedia.Com. Coefficient of variation. Online, May 2007. Accessed July
5th 2007, Available HTTP: www.wikipedia.org.
[45] Wikipedia.Com. Residue (chemistry). Online, May 2007. Accessed July 5th
2007, Available HTTP: www.wikipedia.org.
[46] K. Zimmermann and J. F. Gibrat. New joint prediction algorithm (Q7-
JASEP) improves the prediction of protein secondary structure. Biochem-
istry, 30:11164–11172, 1991.
[47] K. Zimmermann and J. F. Gibrat. In unison: regularization of protein sec-
ondary structure predictions that makes use of multiple sequence alignments.
Protein Engineering, 11:861–865, 1998.
[48] M. Zvelebil, G. Barton, W. Taylor, and M. Sternberg. Prediction of pro-
tein secondary structure and active sites using the alignment of homologous
sequences. Journal of Molecular Biology, 195:957–961, 1987.
[49] M. Zvelebil, G. Barton, W. Taylor, and M. Sternberg. Dictionary of protein
secondary structure: pattern recognition of hydrogen-bonded and geometrical
features. Biopolymers, 22:2577–2637, 2004.
Appendix A
Submission Templates on Cluster
A.1 Submission Template for THREADER
The following template is used to submit concurrent THREADER [23] jobs to the
cluster.
#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N TSeq-AAAAAA-Str-BBBBBB
# Direct output
#$ -j y
#$ -o logSeqAAAAAAStrBBBBBB.screen
./threader-linux -p -j AAAAAA.seq AAAAAA-BBBBBB.out BBBBBB.lst
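In the template above, AAAAAA and BBBBBB are placeholders for a sequence identifier and a structure-list identifier. How they were filled in is not shown here; the following is a minimal sketch, assuming a plain sed substitution is sufficient. The function name render_threader_job and the example identifiers seq001 and lst01 are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original templates): fill the
# AAAAAA / BBBBBB placeholders of a template read from standard input
# and write the resulting job script to standard output.
# Assumes the identifiers contain no characters special to sed (e.g. "/").
render_threader_job() {
    seq_id=$1   # replaces AAAAAA (sequence identifier)
    str_id=$2   # replaces BBBBBB (structure-list identifier)
    sed -e "s/AAAAAA/${seq_id}/g" -e "s/BBBBBB/${str_id}/g"
}

# Example: render the command line of the template for one pair of identifiers.
printf '%s\n' './threader-linux -p -j AAAAAA.seq AAAAAA-BBBBBB.out BBBBBB.lst' |
    render_threader_job seq001 lst01
```

A complete template saved to a file could be rendered the same way and the result handed to the Grid Engine qsub command, one job per sequence and structure-list pair. Whether the jobs were actually generated this way is an assumption.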
A.2 Submission Template for SSPRO
The following template is used to submit concurrent SSPRO [37] jobs to the
cluster.
#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N SSpro-AAAAAA
# Direct output
#$ -j y
#$ -o logSeqAAAAAA.screen
/apps/cs/sspro4/bin/predict_ssa.sh AAAAAA.seq AAAAAA.out
A.3 Submission Template for PSIPRED
The following template is used to submit concurrent PSIPRED [24] jobs to the
cluster.
#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N Psi-AAAAAA
# Direct output
#$ -j y
#$ -o logSeqAAAAAA.screen
./runpsipred AAAAAA.seq
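Because the template contains a single placeholder, one job script per sequence file can be generated in a loop. The following is a minimal sketch of that idea, assuming the template is stored in a file and the sequence files are named with a .seq extension in one directory. The function name make_psipred_jobs and the psi-*.sh output naming are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original templates): write one job
# script per *.seq file found in a directory.
# $1 = directory holding the sequence files, $2 = path of the template file.
make_psipred_jobs() {
    seq_dir=$1
    template=$2
    for seq_file in "$seq_dir"/*.seq; do
        [ -e "$seq_file" ] || continue          # no .seq files: glob stayed literal
        seq_id=$(basename "$seq_file" .seq)     # e.g. seq001.seq -> seq001
        # Replace every AAAAAA placeholder with the sequence identifier.
        sed "s/AAAAAA/${seq_id}/g" "$template" > "$seq_dir/psi-${seq_id}.sh"
        echo "$seq_dir/psi-${seq_id}.sh"        # report the generated script
    done
}
```

Each path printed by the function could then be passed to qsub, so that the scheduler runs the predictions concurrently. This describes one possible workflow under the stated assumptions, not necessarily the one used for the experiments.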
A.4 Submission Template for Finding Motif Structures
The following template is used to submit concurrent jobs for finding motif
structures to the cluster.
#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N Pat-SeqID-MotifNo
# Direct output
#$ -j y
#$ -o logSeqID-MotifNo.screen
perl MotifStrFinder.pl MotifName MotifNo SeqID Start End Length
Vita
Candidate’s full name:
Zheng Wang
University attended:
Shandong Economic University, Jinan, Shandong, P. R. China.
Bachelor of Management Information System, 2004.
Poster:
Zheng Wang, Patricia Evans and Virendra Bhavsar, CISPred: Consensus
Integrated Protein Structure Prediction. April 4, 2007, University of New
Brunswick.