
Consensus Prediction of Protein Secondary Structures

by

Zheng Wang

Bachelor of Management Information System, Shandong Economic University, Jinan, P. R. China, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Computer Science

In the Graduate Academic Unit of Computer Science

Supervisor(s): Dr. Patricia Evans, Ph.D., Faculty of Computer Science
               Dr. Virendrakumar C. Bhavsar, Ph.D., Faculty of Computer Science

Examining Board: Dr. Eric Aubanel, Ph.D., Faculty of Computer Science
                 Dr. Bradford Nickerson, Ph.D., Faculty of Computer Science
                 Dr. Julian Meng, Ph.D., Department of Electrical and Computer Engineering

This thesis is accepted by the

Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK

July, 2007

© Zheng Wang, 2007

Dedication

I dedicate this thesis to

my parents.

They raised me,

gave me happiness,

cultivated my personality,

and provided me with the best education.

I dedicate this thesis to

my grandmother,

a diligent and kind Chinese woman,

who made great contributions to the whole family,

and left us in 2004 when I was pursuing this degree in Canada.

I wish her peace forever.


Abstract

Protein structure prediction is one of the most significant problems in bioinformatics. Several existing tools can predict protein secondary structure or find protein structural motifs and specific structure segments. However, their results sometimes differ or even contradict one another.

CISPred is a consensus protein structure prediction system that integrates the results of such tools in order to provide overall consensus predictions of protein secondary structures. The average accuracy of CISPred predictions is 82.6% on a dataset containing 109 CASP sequences, and 89.3% on a dataset containing 1758 sequences.


Acknowledgements

I sincerely thank my supervisors, Dr. Patricia Evans and Dr. Virendra Bhavsar. They imparted their knowledge, directed my research, and provided financial support. Both are professors with profound knowledge and experience, and have been respected mentors. The two years of study and research with them have been one of the best periods in my life.

Sili Huang and Lu Yang, system administrators of the Advanced Computational Research Laboratory at the University of New Brunswick, provided a great deal of technical support for the development and testing of CISPred. Special thanks go to Sili Huang, who provided many helpful suggestions on the concurrent implementation of CISPred.

I thank my colleagues: Rachita Sharma, a Ph.D. candidate in Computer Science; En Zhang, a Master of Computer Science candidate; Aijazuddin Syed, Master of Computer Science; and Marc Cooper, Master of Computer Science. They began their research in our bioinformatics laboratory earlier than I did, and provided much help and many suggestions for my research.

The members of my entire family greatly supported my study in Canada. I appreciate them all.


Table of Contents

Dedication
Abstract
Acknowledgements
Table of Contents
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Organization

2 Background
  2.1 Protein
  2.2 Protein Secondary Structure
    2.2.1 Secondary Structure Definitions
    2.2.2 Secondary Structure Assignments
  2.3 Protein Secondary Structure Prediction
    2.3.1 Overview
    2.3.2 PHD, PSIPRED and SSPRO
    2.3.3 The Threading Method and THREADER
    2.3.4 Comparison of Protein Structure Prediction Tools
    2.3.5 Benchmarked Non-redundant Dataset
  2.4 Protein Motif and Motif Databases
    2.4.1 Protein Motif
    2.4.2 PATMATMOTIFS and the PROSITE database

3 CISPred: Consensus Integrated Protein Structure Prediction
  3.1 Overview
  3.2 System Architecture
    3.2.1 Selection of Integrated Tools
    3.2.2 THREADER
      3.2.2.1 Sorting THREADER Reports
      3.2.2.2 Clustering THREADER Alignments
    3.2.3 Finding Motif Secondary Structures
    3.2.4 PSIPRED and SSPRO
    3.2.5 Generating Consensus Structure Prediction

4 System Implementation
  4.1 Overview
  4.2 System Infrastructure
  4.3 Concurrent Implementation
    4.3.1 Overview
    4.3.2 THREADER
    4.3.3 Finding Protein Motif Secondary Structures
    4.3.4 SSPRO and PSIPRED
  4.4 Execution Time

5 Experimental Results
  5.1 Overview
  5.2 CISPred Testing Results on CASP Sequences
  5.3 CISPred Testing Results on 1758 Sequences
  5.4 Selection of CISPred Default Threshold
  5.5 Comparison of CISPred and Integrated Tools
    5.5.1 Overview
    5.5.2 Comparison on CASP Sequences
    5.5.3 Comparison on 1758 Sequences
    5.5.4 Conclusions

6 Conclusion
  6.1 Thesis Contributions
  6.2 Future Work

References

A Submission Templates on Cluster
  A.1 Submission Template for THREADER
  A.2 Submission Template for SSPRO
  A.3 Submission Template for PSIPRED
  A.4 Submission Template for Finding Motif Structures

Vita


List of Figures

2.1 The general formula of an amino acid
2.2 The amino acid sequence of protein 1AD5.
2.3 An α helix in protein 1R7G [26].
2.4 An ideal β strand [43].
2.5 A parallel β sheet in protein 1DIN [26].
2.6 An anti-parallel β sheet in protein 1IC9 [26].
2.7 A β barrel in protein 1BY3 [26].
2.8 The secondary structure of protein 1AD5.
2.9 Part of the PDB file of protein 1WCK.
2.10 The entire amino acid sequence of protein 1WCK.
2.11 Part of the DSSP file of protein 1WCK.
2.12 Part of the PDBFINDER entry of protein 1WCK.
2.13 Part of a GARNIER [16] prediction report.
2.14 Part of a PREDATOR [14] prediction report.
2.15 Part of a PSIPRED [24] horizontal prediction report.
2.16 Part of a SSPRO [37] prediction report.

3.1 CISPred system architecture.
3.2 Example of THREADER score report.
3.3 Alignment results of THREADER.
3.4 Structure segments for a protein motif.
3.5 Example of a structure formula result.
3.6 Example of a PATMATMOTIFS result.
3.7 Example of a PROSITE entry.
3.8 The generation of motif structure formulae.
3.9 Example of a PSIPRED vertical result.
3.10 Example of a SSPRO result.
3.11 Example of the information available in one amino acid position.
3.12 Average 3-state accuracy of THREADER predictions on 80 random sequences.
3.13 Example of a CISPred vertical result in an amino acid position.
3.14 Example of a CISPred vertical result.
3.15 Example of a CISPred horizontal result.

4.1 The system infrastructure of CISPred.
4.2 Example of a CISPred queried sequence.
4.3 Web page for submitting query sequences to CISPred.
4.4 A CISPred web page displaying user jobs.
4.5 Overview of the concurrent implementation of CISPred.
4.6 Concurrent implementation of THREADER.
4.7 The sorting of THREADER reports.
4.8 Concurrent finding of motif structures.
4.9 The execution time of CISPred.
4.10 The speedup of CISPred.

5.1 Two chains of protein 1ZD6.
5.2 Average 3-state accuracy of CISPred on the 109 CASP sequences.
5.3 Standard deviation of CISPred on the 109 CASP sequences.
5.4 Coefficient of variation of CISPred on the 109 CASP sequences.
5.5 Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 109 sequences dataset.
5.6 Distribution of the 109 CASP sequences predicted by CISPred with 1% 3-state accuracy as interval.
5.7 Distribution of the 109 CASP sequences predicted by CISPred with 3% 3-state accuracy as interval.
5.8 Distribution of the 109 CASP sequences predicted by CISPred with 5% 3-state accuracy as interval.
5.9 Average 3-state accuracy of CISPred predictions on 1758 sequences.
5.10 Standard deviation of CISPred predictions on the 1758 sequences.
5.11 Coefficient of variation of CISPred predictions on the 1758 sequences.
5.12 Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 1758 sequences dataset.
5.13 3-state accuracies of PSIPRED predictions on the 109 CASP sequences with average Q3 score 0.778, standard deviation 0.084, and coefficient of variation 15.6%.
5.14 Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on 109 CASP sequences.
5.15 3-state accuracies of SSPRO predictions on 109 CASP sequences with an average Q3 score 0.821, standard deviation 0.095, and coefficient of variation 11.6%.
5.16 Bar graph showing the distribution of 3-state accuracies of SSPRO predictions on the 109 CASP sequences.
5.17 3-state accuracies of CISPred predictions on the 109 CASP sequences when the threshold at which to stop clustering equals 0.42.
5.18 Bar graph showing the distribution of the 3-state accuracies of CISPred predictions when the threshold equals 0.42 on the 109 CASP sequences.
5.19 Bar graph showing the distribution of the 3-state accuracies of predictions of CISPred, PSIPRED, and SSPRO.
5.20 Prediction results of CISPred and integrated tools on protein 1WCK.
5.21 Amino acid sequences and secondary structure sequences predicted by SSPRO with 100% 3-state accuracy.
5.22 3-state accuracy of PSIPRED on 1758 sequences with average of 3-state accuracy 0.789, standard deviation 0.089, and coefficient of variation 11.2%.
5.23 Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on the 1757 sequences dataset.
5.24 The amino acid sequences for which PSIPRED fails to predict the secondary structures.
5.25 3-state accuracy of SSPRO predictions on 1758 sequences with average of 3-state accuracy 0.911, standard deviation 0.101, and coefficient of variation 11.1%.
5.26 Bar graph showing the distribution of the 3-state accuracies of SSPRO predictions on the 1758 sequences dataset.
5.27 The 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold at which to stop clustering equals 0.42. The average 3-state accuracy, standard deviation, and coefficient of variation of these predictions are 0.893, 0.095, and 10.7% respectively.
5.28 The amino acid sequence, 8-state DSSP secondary structure, and 3-state secondary structure of protein 1XR0.
5.29 Prediction result of SSPRO on protein 1XR0.
5.30 Prediction result of PSIPRED on protein 1XR0.
5.31 Alignment result of THREADER on protein 1XR0.
5.32 Bar graph showing the distribution of the 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold is 0.42.
5.33 Bar graph showing the distribution of the 3-state accuracies of the predictions of PSIPRED, SSPRO, and CISPred on the 1758 sequences dataset when the threshold is 0.42.
5.34 Summary of experimental results.


Chapter 1

Introduction

1.1 Motivation

Researchers have established that the function of a protein is determined by its specific three-dimensional structure. Experimental techniques such as X-ray crystallography and NMR analysis cannot determine structures as quickly as new sequences become available, so the gap between the number of known tertiary and primary structures is widening; it is therefore necessary to develop approaches that deduce protein structures from their amino acid sequences. The prediction of protein secondary structures using computational methods is one of the efforts needed to narrow this gap.

There are many tools and algorithms for protein secondary structure prediction. Each tool is based on a specific prediction method, so their results are not always identical and are even contradictory for some proteins. Researchers therefore need a method that can integrate different prediction tools and make consensus predictions.


1.2 Objective

The main objective of the thesis is to integrate results of selected protein structure

prediction tools and make a consensus protein secondary structure prediction in

a position-specific way.

1.3 Organization

Chapter 2 gives background on proteins, protein structures, and protein motifs. It also briefly introduces several protein structure prediction tools and methods.

Chapter 3 presents the architecture and methodology of CISPred, a consensus integrated protein structure prediction system.

Chapter 4 presents the concurrent implementation of CISPred.

Chapter 5 presents the testing strategy and testing results of CISPred.

Chapter 6 presents the contributions of CISPred, and offers some suggestions for future work.


Chapter 2

Background

2.1 Protein

Proteins are large molecules made up of 20 types of amino acids. Each protein molecule is a long and unique chain of these amino acid residues¹. These long chains tend to fold into large and complicated structures because of the bonding forces between atoms. After they fold into their structures, these long chains are stable.

[Figure: structural formula H2N-CH(R)-COOH, labelling the amino group (H2N), the α-carbon atom, the side chain group (R), and the carboxyl group (COOH).]

Figure 2.1: The general formula of an amino acid

The sequence of atoms along the core of a chain is called the backbone of the protein. The portions of the amino acids that are not involved in this backbone are called side chains. Figure 2.1 illustrates the general formula of an amino acid, in which R represents one of 20 different side chains of amino acids. Every protein backbone has a C-terminus and an N-terminus, which represent the two ends of the backbone.

¹ “In biochemistry and molecular biology, a residue refers to a specific monomer within the polymeric chain of a polysaccharide, protein or nucleic acid” [45].

EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVD SLETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVK HYKIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAW EIPRESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQH DKLVKLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIE QRNYIHRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWS FGILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEER PTFEYIQSVLDDFYTATESQYQQQP

Figure 2.2: The amino acid sequence of protein 1AD5.

Each of the 20 amino acids can be represented by a 1-letter or a 3-letter code; for example, the amino acid “Alanine” is represented by “A” or “Ala,” and the amino acid “Cysteine” is represented by “C” or “Cys.” A protein molecule is then represented as a string of 1-letter codes, each of which represents an amino acid. This string of letters is called an “amino acid sequence.” Figure 2.2 illustrates the amino acid sequence of protein 1AD5².
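The correspondence between the two encodings can be sketched as a simple lookup table. The Python snippet below is only an illustration and is not part of CISPred; the names `THREE_TO_ONE` and `to_one_letter` are hypothetical, while the codes themselves are the standard ones for the 20 amino acids.

```python
# Standard 3-letter to 1-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(residues):
    """Join an iterable of 3-letter codes into an amino acid sequence string.

    Hypothetical helper: raises KeyError for any code outside the 20
    standard amino acids.
    """
    return "".join(THREE_TO_ONE[name.upper()] for name in residues)


# "Ala" and "Cys" are the two examples used in the text.
print(to_one_letter(["Ala", "Cys"]))
```

A sequence such as the one in Figure 2.2 is simply the result of applying this mapping to every residue of the chain in order.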

2.2 Protein Secondary Structure

2.2.1 Secondary Structure Definitions

Comparing the 3D structures of many different proteins reveals some regular folding patterns, such as the α helix, β strand, and β sheet. The Dictionary of Protein Secondary Structure (DSSP) [49] defines these patterns and uses a single-letter code to describe each of them. This single-letter code is called the “DSSP code,” and is frequently used to describe protein secondary structures.

² Protein Data Bank [2] ID.

An α helix is a structure formed when the backbone chain of a protein twists

around itself, and in which the backbone N-H group in each amino acid forms

a hydrogen bond with the C=O group of the amino acid four residues earlier.

Figure 2.3 [26] illustrates an α helix in protein 1R7G. An α helix is also called a

“4-turn helix,” and it is represented by “H” in DSSP code. If a hydrogen bond

is formed between two amino acids that are three residues apart, this is called

a “3-turn helix,” represented by “G”; if a hydrogen bond is formed between two

amino acids that are five residues apart, this is called a “5-turn helix,” represented

by “I”; and if a hydrogen bond is formed between two adjacent amino acids, this

is called a “hydrogen bonded turn,” represented by “T.”

Figure 2.3: An α helix in protein 1R7G [26].

A β strand is illustrated in Figure 2.4 [43], in which the backbone of the protein

is folded with successive 120 degree angles.

Figure 2.4: An ideal β strand [43].


A β sheet consists of two or more β strands connected by hydrogen bonds, and its minimum length is two amino acid residues. If its length is less than two residues, then it is called a “residue in isolated beta-bridge,” and represented by “B.” Two neighbouring β strands are parallel if they are aligned in the same direction from one terminus (N or C) to the other; the resulting structure is called a parallel β sheet, as shown in Figure 2.5 [26].

Figure 2.5: A parallel β sheet in protein 1DIN [26].

If the two neighbouring β strands are aligned in opposite directions, the structure is called an anti-parallel β sheet, as shown in Figure 2.6 [26].

Figure 2.6: An anti-parallel β sheet in protein 1IC9 [26].

A closed β sheet is called a β barrel, which is illustrated in Figure 2.7 [26]. All

β strands, β sheets and β barrels are represented by “E” in DSSP code.

In DSSP code, the structures formed by non-hydrogen bonds are called “bend,” and are represented by “S.” The random turns and the structures which are not in any of the above conformations are designated as “ ” (space), which is sometimes also written as “C.” Usually, the eight secondary structure types defined in the DSSP are reduced to three types based on a 3-state scheme [40]: “G” and “H” are taken to be helix (“H”), “E” and “B” are taken to be strand (“E”), and all of the other structure types are treated as random turns or a coil (“C”).

Based on DSSP code, the secondary structure of a protein molecule can be

represented by a string of letters, which is called a “protein secondary structure

sequence.” Figure 2.8 illustrates the secondary structure sequence of the pro-

tein 1AD5, in which each of the letters, such as “H”, “B” and “C”, represents a

particular protein folding pattern.

Figure 2.7: A β barrel in protein 1BY3 [26].

CCCEEEESSCBCCCSSSBCCBCTTCEEEEEECCTTEEEEEETTTCCEEEEEGGGEEETT SGGGSTTEETTCCHHHHHHHHTSTTCCTTCEEEEECSSSTTSEEEEEEEECTTSCEEEE EEECEECSSSCEESSTTSCBSCHHHHHHHHTTCCSSSSSCCCSBCCCCCCCCCCCTTCS EECGGGEEEEEEEECCSSEEEEEEEETTTEEEEEEEECTTSSCHHHHHHHHHHHTTCCC TTBCCEEEEECSSSEEEEEECCTTCBHHHHHTSHHHHTCCHHHHHHHHHHHHHHHHHHH HTTCCCSCCSTTSEEECTTSCEEECCCCCCCCCCCCCGGGCCHHHHHHCCCCHHHHHHH HHHHHHHHHTTTCCSSSSCCTHHHHHHHHTTCCCCCCTTSCHHHHHHHHHHTCSSGGGS CCHHHHHHHHHTTTSCGGGSSCCCC

Figure 2.8: The secondary structure of protein 1AD5.


2.2.2 Secondary Structure Assignments

Protein secondary structures are assigned to amino acid sequences based on the three-dimensional orthogonal coordinates of the atoms in proteins. These coordinates are stored in the RCSB (Research Collaboratory for Structural Bioinformatics) Protein Data Bank (PDB) [2], a database that stores information about known proteins such as their amino acid sequences, the methods used to find them, their atoms, and the three-dimensional orthogonal coordinates of each atom in a protein. By April 2007, the RCSB Protein Data Bank had stored information on about 39,261 known proteins. The information on a protein is stored in a PDB file in plain text format.

Figure 2.9 illustrates part of the PDB file for protein 1WCK. The lines starting with “ATOM” list the three-dimensional orthogonal coordinates of the atoms in protein 1WCK. Considering the first line starting with “ATOM” as an example: the “N” in the third column indicates that the atom is nitrogen; the “GLY” in the fourth column indicates that this atom belongs to an amino acid of type “GLY,” which is “Glycine” with 1-letter code G; the “80” in the sixth column indicates that this “Glycine” is the 80th amino acid of the protein 1WCK; and the “-0.522,” “84.984,” and “-3.507” are the orthogonal X, Y, and Z coordinates in angstroms.

The protein 1WCK contains 220 amino acids in total, as shown in Figure 2.10, in which the atoms in the underlined amino acids are those included in the PDB file shown in Figure 2.9 and the atoms in the non-underlined amino acids are not. The PDB file (see Figure 2.9) does not provide the three-dimensional orthogonal coordinates of the atoms in the non-underlined amino acids because these amino acids are partially or completely unstructured and do not fold into a stable state; structural biologists label such segments “disordered regions.”
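As an illustration of how the fields described above can be read programmatically, the sketch below parses a single “ATOM” record in Python. It assumes the standard PDB layout, in which each field occupies fixed character columns rather than being separated by whitespace; this matters for records such as the “ALEU” lines in Figure 2.9, where an alternate-location letter is fused to the residue name. The names `Atom` and `parse_atom_line` are hypothetical, and the sample line is rebuilt from the values of the first ATOM record in Figure 2.9.

```python
from collections import namedtuple

# Hypothetical container for the fields discussed in the text.
Atom = namedtuple("Atom", "serial name res_name chain res_seq x y z")


def parse_atom_line(line):
    """Parse one fixed-column PDB ATOM record into an Atom tuple."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11: atom serial number
        name=line[12:16].strip(),      # columns 13-16: atom name
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain=line[21],                # column 22: chain identifier
        res_seq=int(line[22:26]),      # columns 23-26: residue number
        x=float(line[30:38]),          # columns 31-38: X in angstroms
        y=float(line[38:46]),          # columns 39-46: Y in angstroms
        z=float(line[46:54]),          # columns 47-54: Z in angstroms
    )


# First ATOM record of Figure 2.9, rebuilt in fixed-column layout.
SAMPLE = (
    "ATOM  {:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}"
).format(1, " N", "", "GLY", "A", 80, "", -0.522, 84.984, -3.507, 0.70, 21.33, "N")

atom = parse_atom_line(SAMPLE)
print(atom.res_name, atom.res_seq, atom.x, atom.y, atom.z)
```

Slicing by column is slightly more verbose than splitting on whitespace, but it remains correct when adjacent fields touch each other.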

ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.014586  0.008421  0.000000        0.00000
SCALE2      0.000000  0.016843  0.000000        0.00000
SCALE3      0.000000  0.000000  0.006122        0.00000
ATOM      1  N   GLY A  80      -0.522  84.984  -3.507  0.70 21.33           N
ATOM      2  CA  GLY A  80      -0.637  83.613  -2.936  0.70 21.02           C
ATOM      3  C   GLY A  80      -2.019  83.003  -3.100  0.70 20.72           C
ATOM      4  O   GLY A  80      -3.033  83.679  -2.950  0.70 20.99           O
ATOM      5  N   LEU A  81      -2.051  81.712  -3.403  1.00 20.46           N
ATOM      6  CA  LEU A  81      -3.300  80.999  -3.613  1.00 20.44           C
ATOM      7  C   LEU A  81      -3.792  80.270  -2.371  1.00 19.00           C
ATOM      8  O   LEU A  81      -4.944  79.840  -2.325  1.00 20.49           O
ATOM      9  CB  LEU A  81      -3.212  80.043  -4.818  1.00 20.90           C
ATOM     10  CG ALEU A  81      -3.101  80.656  -6.221  0.50 21.33           C
ATOM     11  CD1ALEU A  81      -2.821  79.575  -7.249  0.50 21.54           C
ATOM     12  CD2ALEU A  81      -4.357  81.434  -6.598  0.50 21.68           C
ATOM     13  CG BLEU A  81      -1.876  79.810  -5.538  0.50 21.08           C
ATOM     14  CD1BLEU A  81      -0.895  78.992  -4.707  0.40 20.94           C
ATOM     15  CD2BLEU A  81      -2.113  79.145  -6.882  0.50 20.90           C
ATOM     16  N   GLY A  82      -2.938  80.146  -1.354  1.00 16.73           N
ATOM     17  CA  GLY A  82      -3.351  79.483  -0.121  1.00 14.57           C
ATOM     18  C   GLY A  82      -2.701  78.133   0.087  1.00 12.96           C
ATOM     19  O   GLY A  82      -1.730  77.784  -0.588  1.00 13.51           O
ATOM     20  N   LEU A  83      -3.245  77.380   1.037  1.00 11.79           N
ATOM     21  CA  LEU A  83      -2.718  76.063   1.386  1.00 10.98           C
ATOM     22  C   LEU A  83      -3.533  74.969   0.715  1.00 11.39           C
ATOM     23  O   LEU A  83      -4.726  75.153   0.491  1.00 11.82           O
ATOM     24  CB  LEU A  83      -2.758  75.859   2.902  1.00 10.83           C
ATOM     25  CG  LEU A  83      -2.057  76.908   3.770  1.00 10.83           C
ATOM     26  CD1 LEU A  83      -2.218  76.506   5.221  1.00 11.61           C
ATOM     27  CD2 LEU A  83      -0.575  77.039   3.399  1.00 12.20           C
ATOM     28  N   PRO A  84      -2.892  73.831   0.390  1.00 11.49           N
ATOM     29  CA  PRO A  84      -3.635  72.714  -0.218  1.00 11.69           C
ATOM     30  C   PRO A  84      -4.651  72.020   0.688  1.00 11.03           C
ATOM     31  O   PRO A  84      -5.626  71.448   0.191  1.00 12.31           O
ATOM     32  CB  PRO A  84      -2.535  71.726  -0.636  1.00 12.29           C
ATOM     33  CG  PRO A  84      -1.316  72.159   0.021  1.00 12.72           C
...
ATOM    966  N   HIS A 215       4.176  70.860  -0.778  1.00 15.76           N
ATOM    967  CA  HIS A 215       4.125  69.550  -1.440  1.00 17.82           C
ATOM    968  C   HIS A 215       5.505  68.909  -1.429  1.00 18.68           C
ATOM    969  O   HIS A 215       6.524  69.610  -1.419  1.00 20.85           O
ATOM    970  CB  HIS A 215       3.631  69.675  -2.888  1.00 18.07           C
ATOM    971  CG  HIS A 215       2.216  70.148  -3.020  1.00 19.06           C
ATOM    972  ND1 HIS A 215       1.155  69.520  -2.405  1.00 18.86           N
ATOM    973  CD2 HIS A 215       1.681  71.170  -3.732  1.00 19.26           C
ATOM    974  CE1 HIS A 215       0.032  70.141  -2.716  1.00 18.89           C
ATOM    975  NE2 HIS A 215       0.324  71.145  -3.523  1.00 21.03           N
TER     976      HIS A 215
HETATM  977 AS   CAC A1216      -0.013  79.161   8.880  0.11 15.18          AS
HETATM  978  O1  CAC A1216       0.838  80.589   8.363  0.16 15.89           O
HETATM  979  O2  CAC A1216      -0.094  79.116  10.616  0.11 14.61           O
HETATM  980  C1  CAC A1216      -1.826  79.189   8.138  0.22 15.58           C
HETATM  981  C2  CAC A1216       0.927  77.576   8.217  0.22 15.65           C

Figure 2.9: Part of the PDB file of protein 1WCK.

The DSSP program [49] assigns secondary structures to amino acid sequences based on the three-dimensional coordinates of the atoms in proteins. It reads PDB files such as the one shown in Figure 2.9, assigns a secondary structure type to each of the amino acid positions, and saves the secondary structures in a DSSP file. Figure 2.11³ illustrates part of the DSSP file of protein 1WCK. The fourth column from the left lists the amino acids of the protein, and the fifth column from the left lists the secondary structure in each amino acid position.

PDBFINDER [49] is a database that stores the secondary structures of all protein entries in the Protein Data Bank. Figure 2.12 shows the entry of 1WCK in the PDBFINDER database, in which the line starting with “Sequence” lists the amino acid sequence of protein 1WCK (without disordered segments), and the line starting with “DSSP” lists the secondary structure of protein 1WCK. The secondary structures in the PDBFINDER database are assigned by the DSSP program.

2.3 Protein Secondary Structure Prediction

2.3.1 Overview

Protein secondary structure prediction methods usually do not distinguish all of the secondary structure types defined in the “Dictionary of Protein Secondary Structure,” but only consider three structural states. Generally, α helix (“H”) and 3-turn helix (“G”) are both treated as Helix, represented by “H”; β strand (“E”)

³ In order to fit on the page, some unrelated segments or columns in the examples shown in this thesis may be deleted or omitted, as indicated by “...”.

>1WCK:A|PDBID|CHAIN|SEQUENCE MAFDPNLVGPTLPPIPPFTLPTGPTGPTGPTGPTGPTGPTGPTGDTGTTGPTGPTGPTGPTGPTGATGL TGPTGPTGPS GLGLPAGLYAFNSGGISLDLGINDPVPFNTVGSQFGTAISQLDADTFVISETGFYKITV IANTATASVLGGLTIQVNGVPVPGTGSSLISLGAPIVIQAITQITTTPSLVEVIVTGLGLSLALGTSAS IIIEKVAH HHHHH

Figure 2.10: The entire amino acid sequence of protein 1WCK.


# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI

1 80 A G 0 0 131 0, 0.0 3,-0.1 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 140.4

2 81 A L - 0 0 193 1,-0.3 2,-0.2 0, 0.0 0, 0.0 0.585 360.0-120.3 -95.7 -12.5

3 82 A G - 0 0 67 131,-0.1 -1,-0.3 3,-0.0 3,-0.0 -0.659 42.4 -36.2 109.3-166.7

4 83 A L - 0 0 71 -2,-0.2 130,-0.1 1,-0.1 3,-0.1 -0.708 41.3-132.4 -99.1 149.3

5 84 A P S S- 0 0 50 0, 0.0 2,-0.3 0, 0.0 33,-0.2 0.771 83.6 -1.9 -67.9 -27.1

6 85 A A E +A 133 0A 1 127,-0.6 127,-2.4 30,-0.1 2,-0.3 -0.988 58.4 171.9-164.6 150.3

7 86 A G E -AB 132 35A 3 28,-2.2 28,-2.4 -2,-0.3 2,-0.4 -0.962 9.3-163.8-163.8 151.3

8 87 A L E -AB 131 34A 3 123,-2.5 123,-2.5 -2,-0.3 2,-0.5 -0.991 0.8-167.8-143.5 131.8

9 88 A Y E +AB 130 33A 92 24,-2.8 23,-2.2 -2,-0.4 24,-1.6 -0.986 18.4 166.5-117.7 125.4

10 89 A A E +AB 129 31A 0 119,-2.5 119,-2.6 -2,-0.5 2,-0.3 -0.951 7.4 179.4-135.8 157.5

11 90 A F E -AB 128 30A 41 19,-2.8 19,-2.3 -2,-0.3 2,-0.5 -0.966 26.5-120.0-151.7 165.4

12 91 A N E -A 127 0A 13 115,-2.5 114,-1.6 -2,-0.3 115,-0.8 -0.962 28.4-175.4-114.6 127.3

13 92 A S E +A 125 0A 57 -2,-0.5 2,-0.3 112,-0.2 112,-0.2 -0.978 15.0 143.6-126.2 132.5

14 93 A G E -A 124 0A 34 110,-1.8 110,-3.0 -2,-0.4 4,-0.1 -0.988 50.0-124.8-162.8 162.7

15 94 A G S S+ 0 0 67 -2,-0.3 2,-0.3 108,-0.2 108,-0.2 0.504 91.5 49.1 -92.2 -6.5

16 95 A I S S- 0 0 141 106,-0.1 108,-0.1 108,-0.1 2,-0.1 -0.899 92.4 -91.6-130.0 158.7

17 96 A S - 0 0 60 -2,-0.3 2,-0.5 106,-0.1 105,-0.2 -0.360 40.9-131.0 -67.5 152.1

18 97 A L E -E 121 0B 47 103,-2.3 103,-2.5 -4,-0.1 2,-0.5 -0.925 12.5-155.1-116.8 124.3

19 98 A D E -E 120 0B 117 -2,-0.5 2,-0.3 101,-0.2 101,-0.2 -0.873 15.5-176.8-104.0 127.6

20 99 A L E -E 119 0B 22 99,-2.8 99,-2.1 -2,-0.5 2,-0.2 -0.919 10.5-149.9-125.0 146.5

21 100 A G > - 0 0 26 -2,-0.3 3,-2.1 97,-0.2 93,-0.3 -0.574 45.4 -57.9-106.4 174.4

.

.

.

125 204 A T E S+A 13 0A 69 -2,-0.4 -112,-0.2 -112,-0.2 -63,-0.2 -0.758 70.0 178.5 -85.7 101.1

126 205 A S E - 0 0 4 -114,-1.6 -64,-2.4 -2,-1.1 2,-0.3 0.802 62.5 -10.4 -74.8 -29.0

127 206 A A E -AD 12 61A 0 -115,-0.8 -115,-2.5 -66,-0.3 2,-0.3 -0.983 55.0-161.1-165.4 154.4

128 207 A S E -AD 11 60A 22 -68,-2.6 -68,-2.3 -2,-0.3 2,-0.3 -0.974 5.6-167.0-139.1 157.2

129 208 A I E -AD 10 59A 0 -119,-2.6 -119,-2.5 -2,-0.3 2,-0.4 -0.997 5.3-167.3-142.7 138.5

130 209 A I E -AD 9 58A 63 -72,-2.2 -72,-2.3 -2,-0.3 2,-0.4 -1.000 6.0-174.1-121.4 129.8

131 210 A I E +AD 8 57A 0 -123,-2.5 -123,-2.5 -2,-0.4 2,-0.4 -0.992 9.4 179.9-118.8 129.1

132 211 A E E -AD 7 56A 51 -76,-2.6 -76,-2.5 -2,-0.4 2,-0.7 -0.993 29.4-131.8-131.0 136.9

133 212 A K E +AD 6 55A 6 -127,-2.4 -127,-0.6 -2,-0.4 -78,-0.2 -0.798 30.7 171.9 -83.8 117.0

134 213 A V E + 0 0 60 -80,-2.3 2,-0.3 -2,-0.7 -79,-0.2 0.622 60.3 14.4-104.6 -17.5

135 214 A A E D 0 54A 42 -81,-1.6 -81,-2.5 -131,-0.0 -1,-0.3 -0.969 360.0 360.0-156.3 145.3

136 215 A H 0 0 157 -2,-0.3 -83,-0.3 -83,-0.2 -85,-0.0 -0.506 360.0 360.0 -78.4 360.0

Figure 2.11: Part of the DSSP file of protein 1WCK.

ID : 1WCK
Header : STRUCTURAL PROTEIN
Date : 2005-10-25
Compound : bcla protein
Source : (bacillus anthracis)
Author : S.Rety
Author : S.Salamitou
Author : L.A.Augusto
Author : R.Chaby
Author : F.Lehegarat
Author : A.Lewit-bentley
Exp-Method : X
Resolution : 1.36
R-Factor : 0.179
Free-R : 0.190
Ref-Prog : REFMAC
HSSP-N-Align : 23
T-Frac-Beta : 0.60
T-Nres-Prot : 136
T-Water-Mols : 190
HET-Groups : 1
Het-Id : 1216
Natom : 5
Name : CACODYLATE ION
Chain : A
Sec-Struc : 136
Beta : 82
B-Bridge : 2
Anti-Hb : 108
Amino-Acids : 136
Substrate : 5
Sequence : GLGLPAGLYAFNSGGISLDLGI ... ALGTSASIIIEKVAH
DSSP : CCCCSEEEEEEEEESSCEEECT ... CSEEEEEEEEEEEEC
Nalign : 4455555555555555555555 ... 444444444444442| 10.9706
Nindel : 0000000000000000001100 ... 000000000000000| 0.1324
Entropy : 0202444334323414334305 ... 020232202000429| 0.2487
Cons-Weight : 9294244115135384211191 ... 939466292999553| 0.5021
Chain : Z
Water-Mols : 190

Figure 2.12: Part of the PDBFINDER entry of protein 1WCK.


and “residue in isolated beta-bridge”(B) are all treated as Strand, represented by

“E,” and all of the others are treated as Coil, represented by “C” or “ ”(space).

Correspondingly, the “3-state accuracy” score (also called “Q3 score”) is used to

evaluate prediction accuracy, which is the percentage of the residues which have

predictions matching the real structures.
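The three-state reduction and the Q3 score described above can be illustrated with a minimal Python sketch. The function names are hypothetical. The strand mapping (E and B) follows the text; the helix set {H, G, I} is an assumption of this sketch (a common convention), and every other symbol, including the blank, is mapped to coil.

```python
# Assumed helix classes; the passage above only spells out the strand and coil cases.
HELIX_STATES = {"H", "G", "I"}
# Strand classes as stated in the text: extended strand (E) and isolated beta-bridge (B).
STRAND_STATES = {"E", "B"}


def to_three_state(dssp: str) -> str:
    """Reduce a DSSP string to the three states H, E and C."""
    reduced = []
    for state in dssp:
        if state in HELIX_STATES:
            reduced.append("H")
        elif state in STRAND_STATES:
            reduced.append("E")
        else:
            reduced.append("C")  # everything else, including a blank, is coil
    return "".join(reduced)


def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

For instance, a four-residue prediction that agrees with the observed structure at three positions has a Q3 score of 75%.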

Many methods and algorithms have been used to predict protein secondary

structures. The early methods used in protein structure prediction usually only

contained linear statistics [16, 5, 15, 17, 34] and stereochemical principles [31].

Subsequently, machine learning algorithms proved to be a successful way to predict

protein secondary structures. The successful machine learning algorithms used

include decision trees [35], neural networks [38, 40, 19, 28, 39, 3, 4, 21], and K-way

nearest neighbours [6, 7, 13, 12, 30]. Currently, most of the top

prediction tools with prediction accuracy higher than 75%, such as PHD [40],

PSIPRED [24] and SSPRO [37], use artificial neural network (ANN) algorithms

to make their predictions.

It has also been shown that considering evolutionary information, or multiple

aligned sequences, in protein structure prediction can improve prediction accu-

racy [8]. This is because multiple sequence alignment can be obtained from the

core structure or a consensus structure of a whole protein family which can then

be used to predict the structure of proteins which belong to or are related to that

protein family. Currently, multiple sequence alignment is used quite often in pro-

tein secondary structure prediction [33, 36, 14, 9], and is considered a successful

method [18, 32].

A recent trend is not to use only one technique to predict protein secondary

structures, but to combine several techniques; for example, to combine ANNs

and multiple sequence alignment [47, 9, 25], and to combine statistical methods,


homology methods, information theory methods, and artificial neural network

algorithms [46]. Besides combining various techniques in one tool, some tools

combine other prediction tools to make a consensus prediction. One typical and

successful example is JPRED [10] which combines 6 different prediction tools:

DSC [27], PHD [40], NNSSP [6], PREDATOR [14], ZPRED [48], and MULPRED (footnote 4).

Each of these tools combines multiple sequence alignment with a specific method;

for example, PHD uses jury decision neural networks, NNSSP is based on nearest

neighbours, and DSC uses linear discrimination.

CISPred integrates two existing prediction tools, SSPRO [37] and PSIPRED [24],

which have relatively high 3-state accuracy and can be freely

downloaded and easily integrated. Moreover, CISPred also integrates the protein

motif structures database and the threading method, which have not been widely

used by existing protein secondary structure prediction tools.

2.3.2 PHD, PSIPRED and SSPRO

Currently, PHD [40], PSIPRED [24] and SSPRO [37] are three of the most suc-

cessful protein secondary structure prediction tools.

As mentioned above, multiple sequence alignments can improve the accuracy

of protein secondary structure prediction, and are widely used in protein sec-

ondary structure prediction tools. The generation of sequence profiles by multiple

sequence alignment is time-consuming. For example, a very successful method,

PHD [40], uses a multi-processor computer to generate multiple sequence align-

ment; therefore, the PHD server [42] cannot be moved to a new site. In 1999,

PSIPRED [24], a protein secondary structure prediction system that could be

easily ported to any workstation, was created. The approach of PSIPRED is

(Footnote 4: Barton, 1988, unpublished.)

to use the position-based scoring matrix of PSI-BLAST, instead of multiple se-

quence alignments, as the inputs for a two-stage neural network. According to

the experiments conducted by its author, PSIPRED can achieve an average 3-

state accuracy between 76.5% and 78.3% on the CASP3 (Critical Assessment of

Techniques for Protein Structure Prediction experiment) [29] dataset. The output

of PSIPRED [24] gives a confidence score for each of the three secondary structures

“C,” “H,” and “E” at each amino acid position. The details of

PSIPRED output reports are presented in Chapter 3.

In 2004, SSPRO [37], a protein secondary structure prediction tool, was cre-

ated based on an ensemble of 100 1D-RNNs (one dimensional recurrent neural net-

works), PSI-BLAST-derived profiles (position-based scoring matrix), and a large

non-redundant training set. According to the experiments conducted by its au-

thor, SSPRO can achieve a 3-state accuracy of 77%. The details of the prediction

results from SSPRO [37] are presented in Chapter 3.

2.3.3 The Threading Method and THREADER

The threading method is an algorithm which can be used to predict protein struc-

tures. A protein fold library usually is constructed which contains protein folds as

structural templates. Then a score function is chosen to evaluate any alignments

of a queried amino acid sequence with a structural template. The score function

usually computes the free energy of this queried sequence in a structural template.

The lower the free energy, the more stable the queried

sequence is in this structural template, which also indicates a higher likelihood

that this template is the final structure of this queried sequence. Based on the

score function, the best alignment of a query sequence with each of the structural

templates can be found. Then the most appropriate structural templates with


optimal alignments are selected as the predicted structures.

THREADER [23] is a tool which implements the threading method. Its output

includes a score report showing the score of the alignment of a queried sequence

with each structural template, and an alignment report showing the alignments of

the query sequence with each of the structural templates. The details of the score

report and alignment report are presented in Chapter 3.

2.3.4 Comparison of Protein Structure Prediction Tools

Because each method and tool has a different approach to prediction, results may

be different, and sometimes parts of the results are contradictory. Figures 2.13,

2.14, 2.15, and 2.16 show parts of the prediction results for the first 50 amino

acids of protein 1AP9, from GARNIER [16], PREDATOR [14], PSIPRED [24],

and SSPRO [37].

. 10 . 20 . 30 . 40 . 50 QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP helix HHH H HHHHH sheet EEE EE EEEEE EEEEEE EEEEEEEEE turns TT TT coil CCCCC CCC CCCC

Figure 2.13: Part of a GARNIER [16] prediction report.

1 QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP 50
  _______HHHHHHHHHHHHHHHHHHHHHHHHHH___HHHHHHHHHHHHHH

Figure 2.14: Part of a PREDATOR [14] prediction report.

These four reports illustrate the differences between the four prediction tools.

For example, PREDATOR [14] predicts more α helices (“H”) than GARNIER [16],

and at some positions PREDATOR [14] predicts an α helix (“H”) where GAR-

NIER [16] predicts contradictory results, a β sheet (“E”). PSIPRED [24] and


SSPRO [37] have similar prediction results; both of them predict two series of α

helices (“H”) with some coils (“C”) in the middle. But the lengths of the two

α helices (“H”) series predicted by PSIPRED [24] and SSPRO [37] are slightly

different.

2.3.5 Benchmarked Non-redundant Dataset

As mentioned above, most of the successful methods to predict protein secondary

structure use machine learning algorithms. Accordingly, datasets that contain

non-redundant protein amino acid sequences are needed for cross-validation. In

1994, Burkhard Rost and Chris Sander provided a dataset, often referred to as

dataset “RS126” [41], that contains 126 non-redundant protein sequences. The

non-redundancy of “RS126” means that any two proteins in the dataset share no

more than 25% sequence identity over a length of more than 80 residues. The

“RS126” dataset was used as a benchmark dataset by many machine learning

algorithms that predict protein secondary structure.

In 1999, James Cuff and Geoffrey Barton pointed out that the standard used

to determine the non-redundancy of the “RS126” dataset, percentage identity, is

a poor measure of sequence similarity [8], and they provided a more sophisticated

Conf: 97124684478999989899899999995268999873356467887788
Pred: CCCCCCCCHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCHHHHHHHHHHHH
  AA: QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP
              10        20        30        40        50

Figure 2.15: Part of a PSIPRED [24] horizontal prediction report.

QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP
CCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHH

Figure 2.16: Part of a SSPRO [37] prediction report.


method. To compute the similarity of two amino acid sequences A and B, their

method aligns A and B using a standard dynamic programming algorithm, and

then obtains the alignment score V. The order of the amino acids in both

sequence A and sequence B is then randomized, and the two randomized sequences are

aligned by the same dynamic programming algorithm. This process is usually repeated

more than 100 times, and then the mean x̄ and the standard deviation σ of the

alignment scores of the randomized sequences are computed.

An SD score, or Z-score, that measures the similarity

of the original sequences A and B is computed using Equation 2.1. Sequence pairs

with an SD score higher than 5 are considered similar. A dataset that contains

396 sequences, usually named dataset “CB396” [8], is provided using this similar-

ity method. Each sequence in “CB396” is not similar to any other sequence in

“CB396” and also not similar to any sequence in “RS126.” “CB396” is another

of the benchmark non-redundant datasets currently used.

\[ SD = \frac{V - \bar{x}}{\sigma} \qquad (2.1) \]
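A minimal Python sketch of the SD score in Equation 2.1 follows. The alignment scorer (a global alignment with match +1, mismatch −1, gap −1), the function names, the fixed random seed, and the use of the population standard deviation are all assumptions of this sketch; the scoring scheme and standard-deviation convention of the original method are not given in the text.

```python
import random
import statistics


def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Score of the best global alignment (dynamic programming, assumed scoring)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sd_from_scores(v: float, shuffled_scores) -> float:
    """Equation 2.1: (V - mean) / sigma over the randomized alignment scores."""
    mean = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)  # population SD (assumption)
    if sigma == 0:
        raise ValueError("standard deviation of shuffled scores is zero")
    return (v - mean) / sigma


def sd_score(a: str, b: str, n_shuffles: int = 100, seed: int = 0) -> float:
    """Align A and B, then compare V with scores of shuffled versions of both."""
    rng = random.Random(seed)
    v = global_alignment_score(a, b)
    scores = []
    for _ in range(n_shuffles):
        shuffled_a, shuffled_b = list(a), list(b)
        rng.shuffle(shuffled_a)
        rng.shuffle(shuffled_b)
        scores.append(global_alignment_score("".join(shuffled_a), "".join(shuffled_b)))
    return sd_from_scores(v, scores)
```

Under this sketch, a pair of sequences would be treated as similar when `sd_score` returns a value above 5, the cutoff quoted in the text.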

The similarity of each pair of sequences in dataset “RS126” was also measured

using the new method mentioned above, and 9 sequences were removed from

“RS126” in order to make the remaining 117 sequences non-redundant. These 117

sequences from “RS126” are combined with “CB396” to form a dataset named

“CB513,” which is one of the benchmarked non-redundant datasets currently used.


2.4 Protein Motif and Motif Databases

2.4.1 Protein Motif

Motifs are biologically significant sites and patterns existing in proteins. They can

be used to characterize protein families.

2.4.2 PATMATMOTIFS and the PROSITE database

PATMATMOTIFS [1], a protein motif finding tool, compares a given protein sequence

to the PROSITE [22] database, which stores the information about known

protein motifs. In some cases, an unknown protein sequence is only distantly related

to known proteins; therefore, it is difficult to determine the features of the unknown

protein by overall sequence alignment. By comparing an unknown protein

to the PROSITE [22] database, some biologically important patterns, motifs or

fingerprints can be found, which can help determine to which family it belongs.

Examples of PATMATMOTIFS results and PROSITE entries are presented in

Chapter 3.


Chapter 3

CISPred: Consensus Integrated

Protein Structure Prediction

3.1 Overview

CISPred, a Consensus Integrated Structure Prediction tool, predicts protein sec-

ondary structures by integrating several prediction tools and databases. The tools

and databases integrated in CISPred include two protein secondary structure pre-

diction tools, SSPRO [37] and PSIPRED [24]; a protein motif searching tool,

PATMATMOTIFS [1]; a motif database, PROSITE; a protein secondary structure

database, PDBFINDER [20]; and a threading method tool, THREADER [23].

3.2 System Architecture

3.2.1 Selection of Integrated Tools

The tools selected for integration in CISPred meet several requirements. The

integrated tools have a relatively high prediction accuracy. The techniques of


protein secondary structure prediction have improved greatly in the last

20 years. Some early algorithms, such as GARNIER [16], have no more than

65% accuracy. Because the consensus results of CISPred are computed based

on existing tools, the prediction accuracy of these tools directly influences the

accuracy of CISPred.

The tools integrated by CISPred can be downloaded and installed on local ma-

chines. Currently, there are many successful protein structure prediction servers,

yet some of them cannot be downloaded and can only be accessed through their web pages.

To improve the stability and reduce the execution time of CISPred, the integrated

tools are executed on local machines. Therefore, the

protein secondary structure prediction tools PSIPRED [24] and SSPRO [37] are

selected to be integrated into CISPred, because both of them can be downloaded

and installed on local machines, and both have relatively high accuracy: PSIPRED

has 80.6% accuracy and SSPRO has 80% or better.

PSIPRED and SSPRO mainly use ANN techniques. In order to provide consensus

predictions, tools that use other prediction approaches are also integrated

into CISPred. One is THREADER [23], which implements the threading method,

a completely different approach from neural networks. Moreover,

CISPred also integrates protein motif structural information by finding the

“structure formulae” of protein motifs. In total, CISPred integrates two protein

secondary structure prediction tools using neural networks and PSI-BLAST pro-

files, a tool using the threading method, and protein motif “structure formulae.”

Figure 3.1 illustrates the system architecture of CISPred. A program running

at the web server of CISPred submits queried sequences to the integrated

tools, which are executed on a cluster. Motif structure formulae are found by integrating

the tool PATMATMOTIFS [1] and two databases: PDBFINDER [20]


[Figure 3.1 diagram labels: Web Server; Submission Program; Query sequences; Cluster; SSPRO; PSIPRED; THREADER; Clustering; PATMATMOTIFS; PROSITE Database; PDBFinder Database; Finding the Motif Structure Formulas; Structure Formulas; Prediction results; Protein folds; Consensus Prediction Program; Consensus Results; Email Program.]

Figure 3.1: CISPred system architecture.


and PROSITE [22]. The results generated from THREADER are clustered before

being integrated. When the execution of the integrated tools and the finding of

motif structure formulae are finished, a program integrates the structure formulae

and the results of each integrated tool, and then generates a consensus prediction

for each queried sequence. The consensus predictions are then sent to a program

running at the web server of CISPred, which sends the results to CISPred users by

email.

3.2.2 THREADER

3.2.2.1 Sorting THREADER Reports

THREADER [23] is a tool that implements the threading method. THREADER

has a library which contains the structures of 6251 protein folds. A queried amino

acid sequence is “threaded” through each of the protein folds. For each protein

fold, an alignment between the amino acid sequence and the protein fold with

minimum free energy is selected as the optimum alignment. THREADER provides

information about the optimum alignment for each fold in the library. Figure 3.2

shows part of the output report from THREADER, in which column 8 lists the

filtered combined energy Z-scores [23] and the rightmost column lists the PDB ID

codes of the protein folds. According to the THREADER manual, protein structure

predictions should be based on the filtered combined energy Z-scores, because

the higher the filtered combined energy Z-scores are, the more appropriately the

amino acid sequence fits the protein fold, and the higher the probability that the

protein fold is the correct predicted structure. Usually, a protein fold with a

filtered combined energy Z-score above 3.5 is considered to have a significantly high

probability of being the correct predicted structure of a queried sequence.

The alignments between a queried amino acid sequence and the protein fold


-545.43 -1145.34 4.20 -11.34 3.14 -780.91 4.73 4.73 73.6 350 90.9 79.9 0 1b3mA0

-523.75 -1070.86 3.98 2.05 0.34 -481.13 2.79 2.79 -46.0 309 91.2 70.8 0 1b7gO0

-323.79 -980.73 1.97 -7.26 2.29 -474.51 2.74 2.74 79.2 350 93.1 79.9 0 1bhe00

-430.87 -1046.17 3.05 -1.68 1.12 -465.82 2.69 2.69 -47.9 305 94.1 69.6 0 1a5t00

-511.70 -841.22 3.86 3.99 -0.06 -428.74 2.45 2.45 50.4 326 99.4 74.7 0 1ak500

-233.74 -767.98 1.07 -9.17 2.69 -424.16 2.42 2.42 91.7 305 93.6 69.6 0 1b6cB0

-326.61 -844.66 2.00 -4.30 1.67 -415.82 2.36 2.36 44.3 288 99.0 65.8 0 1bf6A0

-249.10 -786.89 1.22 -8.04 2.45 -416.13 2.36 2.36 -4.8 369 80.4 84.5 0 1a4yA0

-471.73 -925.15 3.46 3.83 -0.03 -392.12 2.21 2.21 -3.4 351 90.3 80.4 0 1a4gA0

-367.95 -994.76 2.41 -0.62 0.90 -380.87 2.14 2.14 9.6 336 94.1 76.9 0 1a7kA0

-289.06 -610.82 1.62 -4.40 1.69 -380.48 2.13 2.13 20.1 316 96.9 72.4 0 1a6o00

-444.86 -775.12 3.19 3.60 0.02 -370.00 2.07 2.07 -27.4 297 96.7 67.8 0 1b3oA0

-270.28 -1119.54 1.43 -4.77 1.77 -369.42 2.06 2.06 9.7 376 87.3 86.1 0 1bif00

-480.57 -869.64 3.55 5.52 -0.38 -366.00 2.04 2.04 12.6 306 94.2 70.1 0 1b4kA0

-275.31 -916.65 1.48 -4.39 1.69 -366.47 2.04 2.04 -4.1 359 89.5 82.0 0 1aye00

-367.21 -1060.40 2.41 0.41 0.69 -358.79 1.99 1.99 43.2 348 91.3 79.5 0 1bd0A0

-296.23 -729.19 1.69 -2.92 1.38 -356.96 1.98 1.98 5.0 245 89.1 55.9 0 1b5tA0

-465.23 -1115.40 3.39 5.70 -0.42 -346.81 1.92 1.92 74.5 358 89.3 81.7 0 1a12A0

-239.84 -620.05 1.13 -4.71 1.75 -337.66 1.86 1.86 104.7 245 88.2 56.2 0 1a0600

-270.70 -770.96 1.44 -2.75 1.35 -327.85 1.79 1.79 25.3 258 94.2 58.9 0 1af700

-502.09 -1146.20 3.90 -9.59 2.65 -688.63 4.16 4.16 30.9 359 89.5 82.0 0 1d7yA0

-393.29 -760.31 2.80 -5.73 1.89 -504.60 2.98 2.98 -67.8 275 94.5 63.0 0 1dhpA0

-323.57 -625.21 2.09 -7.62 2.26 -471.71 2.77 2.77 21.1 249 94.3 57.1 0 1c8zA0

-318.76 -945.51 2.05 -6.38 2.02 -442.82 2.58 2.58 31.9 299 94.6 68.5 0 1bsvA0

-351.78 -727.15 2.38 -3.65 1.48 -422.71 2.45 2.45 16.4 322 88.7 73.5 0 1c0kA0

-456.34 -1183.65 3.44 5.05 -0.23 -358.07 2.04 2.04 158.9 340 89.7 77.6 0 1c3oB0

-299.23 -1064.07 1.85 -2.38 1.23 -345.45 1.96 1.96 78.1 377 93.6 86.3 0 1c0nA0

-348.78 -1026.27 2.35 0.22 0.72 -344.50 1.95 1.95 -74.4 335 98.2 76.7 0 1bx4A0

-291.71 -1436.03 1.77 -2.68 1.29 -343.77 1.95 1.95 -18.5 394 69.8 90.2 0 1dceA0

-203.92 -1546.30 0.89 -7.04 2.15 -340.81 1.93 1.93 -2.3 380 90.3 87.0 0 1bk5A0

-322.13 -1001.99 2.08 0.67 0.63 -309.14 1.72 1.72 120.5 349 89.3 79.7 0 1cl2A0

-300.23 -1152.30 1.86 -0.44 0.85 -308.80 1.72 1.72 79.5 335 88.2 76.7 0 1bjwA0

Figure 3.2: Example of THREADER score report.

are also provided by THREADER, as shown in Figure 3.3. The alignments contain

confidence scores which fall in an integer range from 0 to 9 inclusive at positions

with structure types “H” and “E.” Each of these confidence scores indicates the

likelihood that a structure type is the correct prediction at an amino acid position.

In order to integrate the structures with a confidence score of 0,

CISPred raises all of the confidence scores by 1, so that the range of

these confidence scores becomes 1 to 10 inclusive.

For a queried sequence, THREADER generates a filtered combined energy Z-

score and an alignment for each of the 6251 protein folds in its library. By sorting

the column listing the filtered combined energy Z-scores, CISPred finds the 20 pro-

tein folds with the highest filtered combined energy Z-scores. Usually, the filtered

combined energy Z-scores of these 20 folds are all above 3.5; however, CISPred

checks the filtered combined energy Z-scores of these 20 folds and eliminates the

folds with filtered combined energy Z-scores lower than 3.5. The folds left are con-

sidered to be highly appropriate prediction structures for the queried sequence.

To integrate the most appropriate protein folds into CISPred, the structural seg-

ments of these protein folds are clustered, and only the cluster of folds containing

the highest average confidence score is integrated into CISPred.
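The fold selection step described above (sort by the filtered combined energy Z-score, keep the top 20, drop folds below 3.5) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes whitespace-separated report lines like those in Figure 3.2, with the Z-score in column 8 and the fold code in the last column.

```python
def parse_score_report(lines):
    """Return (fold_id, z_score) pairs from THREADER score report lines."""
    folds = []
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        try:
            z_score = float(fields[7])  # column 8: filtered combined energy Z-score
        except ValueError:
            continue
        folds.append((fields[-1], z_score))  # rightmost column: fold code
    return folds


def select_folds(folds, top_n=20, min_z=3.5):
    """Keep the top_n folds by Z-score, then drop those with a score below min_z."""
    ranked = sorted(folds, key=lambda fold: fold[1], reverse=True)
    return [fold for fold in ranked[:top_n] if fold[1] >= min_z]
```

Applied to the rows of Figure 3.2, only folds such as 1b3mA0 (4.73) and 1d7yA0 (4.16) would pass the 3.5 cutoff.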

3.2.2.2 Clustering THREADER Alignments

The structural segments of the protein folds with the highest filtered combined

energy Z-scores are clustered by a hierarchical clustering algorithm [11]. Initially,

each cluster contains the secondary structures of one of these fold segments. The

distance between each pair of clusters is computed, and the two clusters with the

smallest distance are merged into one cluster. This process continues until the

smallest distance between any pair of clusters reaches a threshold. The distance


THREADER 3.5 - Protein Sequence Threading Program
Build date : Sep 4 2004
Copyright (C) 2002 University College London
Portions Copyright (C) 1990 D.T.Jones

Registered user: [email protected]

Reading mean force potential tables...
Alignment with 1sb8A0:
10 20 30 40
----------00000000000----44444-----999999999999----------999
-------CCCHHHHHHHHHHHC-CCEEEEEC-CCCHHHHHHHHHHHHC-------CCEEE
-------MMSRYEELRKELPAQ-PKVWLITG-VAGFIGSNLLETLLKL-------DQKVV
| | | |
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGE---WWKARSLATRKEGYIPSNY-VA
10 20 30 40 50

50 60 70 80 90
999--------------5555555555--0000--22222-----00000000-------
EEECCCC--------CCHHHHHHHHHHCCHHHHCCEEEEECCCCCHHHHHHHHCC-----
GLDNFAT--------GHQRNLDEVRSLVSEKQWSNFKFIQGDIRNLDDCNNACAG-----
| | |
RVDSLETEEWFFKGISRKDAERQLLAPGN--MLGS-FMIRDSETTKGSYSLSVRDYDPRQ
60 70 80 90 100 110

.
.
.

220 230 240 250
-22222222222---------666-----00---33233333333333------------
CHHHHHHHHHHHC----CCC-EEECCCCCEECC-EEHHHHHHHHHHHHCC-------CCC
AVIPKWTSSMIQG----DDV-YINGDGETSRDF-CYIENTVQANLLAATA-------GLD
| || |
PKLIDFSAQIAEGMAFIEQRNYIHRDLRA-ANILVSASLVCKIADFGLARVGAKFPIKWT
280 290 300 310 320 330

260 270 280 290 300
---55555--------0042222222222222------------222-------------
CCCEEEEEC---CCCCEEHHHHHHHHHHHHHHCCCCCC---CCCEEEC-CCCCCCCCCCC
ARNQVYNIA---VGGRTSLNQLFFALRDGLAENGVSYH---REPVYRD-FREGDVRHSLA
| | | | | | |
APE-AINFGSFTIKS--DVWSFGILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPEN
340 350 360 370 380 390

320 330 340
--000000-----------------999999999999999-----
CCHHHHHHC--CC--CC-CC----CHHHHHHHHHHHHHHHCC---
DISKAAKLL--GY--AP-KY----DVSAGVALAMPWYIMFLK---
|
CPEELYNIMMRCWKNRPEERPTFEYIQSVLDDFYTATESQYQQQP
400 410 420 430

Percentage Identity = 7.6.

Figure 3.3: Alignment results of THREADER.


between two clusters is computed according to Equation 3.1 [11], in which C

and C∗ are two clusters, |C| and |C∗| are the number of fold segments in the

two clusters respectively, and d(x, y) is the distance between a fold segment x in C

and a fold segment y in C∗. The shorter the distance between two clusters, the greater

the similarities between them.

\[ d_{avg}(C, C^{*}) = \frac{\sum_{x \in C,\; y \in C^{*}} d(x, y)}{|C|\,|C^{*}|} \qquad (3.1) \]

The distance between two fold segments is computed according to Equa-

tion 3.2, in which Nidentical represents the number of positions with identical sec-

ondary structures, and Ntotal represents the total number of positions in a fold

segment.

\[ d(x, y) = \frac{N_{identical}}{N_{total}} \qquad (3.2) \]
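The clustering procedure can be sketched in Python as below. This is a hedged illustration, not the CISPred implementation: all function names are hypothetical, and the stopping rule (stop once the smallest average-linkage distance is at or above the threshold) is one reading of the text. Note also that Equation 3.2 as printed is the fraction of identical positions, which grows with similarity; the sketch assumes that one minus this fraction is used when merging, so that the smallest value joins the most similar segments. That choice is an assumption of this sketch and is not stated in the text.

```python
from itertools import combinations


def identity_fraction(x: str, y: str) -> float:
    """Equation 3.2 as printed: identical positions over total positions."""
    if len(x) != len(y) or not x:
        raise ValueError("segments must be non-empty and of equal length")
    return sum(a == b for a, b in zip(x, y)) / len(x)


def dissimilarity(x: str, y: str) -> float:
    """Assumed pairwise distance: one minus the identity fraction."""
    return 1.0 - identity_fraction(x, y)


def average_linkage(c1, c2, d=dissimilarity) -> float:
    """Equation 3.1: mean pairwise distance between two clusters."""
    return sum(d(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def cluster_segments(segments, threshold, d=dissimilarity):
    """Agglomerative clustering; merging stops when the closest pair reaches the threshold."""
    clusters = [[segment] for segment in segments]
    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), average_linkage(clusters[i], clusters[j], d))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters
```

With three toy segments `"HHHHCC"`, `"HHHHCE"` and `"EEEECC"` and a threshold of 0.5, the first two are merged and the third stays in its own cluster.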

After the clustering stops, the cluster with the highest average confidence score

is selected, and the fold segments in that cluster are integrated into CISPred. The

average confidence score is computed by dividing the sum of confidence scores of

each structure type (“H,” “E,” and “C”) by the total number of positions in that

cluster, as shown in Equation 3.3. THREADER only provides confidence scores

for the positions with structure types “H” and “E,” so the confidence score for the

positions with structure “C” is set to 5, one of the middle integers in the range 1

to 10.

\[ C_{avg} = \frac{\sum C_{H} + \sum C_{E} + \sum C_{C}}{N_{H} + N_{E} + N_{C}} \qquad (3.3) \]
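Equation 3.3 can be illustrated with a short Python sketch. It assumes, hypothetically, that each fold segment is held as a pair of equal-length strings taken from a THREADER alignment such as Figure 3.3: the secondary structure row and the confidence row, with "-" marking alignment gaps. The function names are illustrative, and skipping gap positions is an assumption of this sketch.

```python
COIL_CONFIDENCE = 5  # value the text assigns to "C" positions


def position_confidences(structure: str, confidence: str):
    """Per-position scores: raw digit + 1 for H/E, fixed value for C, gaps skipped."""
    scores = []
    for state, digit in zip(structure, confidence):
        if state in ("H", "E"):
            scores.append(int(digit) + 1)  # shift the 0..9 range to 1..10
        elif state == "C":
            scores.append(COIL_CONFIDENCE)
        # any other symbol (e.g. "-") is treated as a gap and ignored
    return scores


def average_confidence(cluster) -> float:
    """Equation 3.3 over all (structure, confidence) segments in a cluster."""
    scores = [
        score
        for structure, confidence in cluster
        for score in position_confidences(structure, confidence)
    ]
    if not scores:
        raise ValueError("cluster contains no scored positions")
    return sum(scores) / len(scores)
```

For a single segment with structure `CHHE-` and confidence `-094-`, the per-position scores are 5, 1, 10 and 5, giving an average of 5.25.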

The threshold of the clustering algorithm influences the number of fold seg-

ments integrated into CISPred, and therefore influences the final consensus pre-


dictions of CISPred. To illustrate the influence this threshold has on the accuracy

of CISPred, several experiments were conducted, and are presented in Chapter 5:

Experimental Results.

3.2.3 Finding Motif Secondary Structures

The proteins in one family, or having similar functions, are found to contain some

common or similar amino acid segments. These segments can distinguish families

of proteins, and are called motifs. A motif can exist in many individual proteins;

however, the structures of the motif are conserved and fit a particular

structure template. For example, Figure 3.4 illustrates the structures of a motif

named “BACTERIAL_OPSIN_1” in different individual proteins. The structures

of the first five positions and the last three positions are usually “H,” and for the

five positions in the middle the structures can be either “H” or “T.” By statisti-

cally analyzing all of the structures of an existing motif in different proteins, the

structure template or “structure formula” of the motif can be determined, which

provides the proportion of each secondary structure type at each of the amino

acid positions of a motif. Figure 3.5 shows the structure formula of a motif named

“PROTEIN_KINASE_ATP.”
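The statistical step that yields a structure formula can be sketched as a per-position frequency count. The following Python sketch is illustrative only; the function names are hypothetical, and it assumes that the motif's secondary structure segments have already been extracted as equal-length strings over the eight symbols shown in Figure 3.5.

```python
from collections import Counter

STRUCTURE_TYPES = "CHTSEGIB"  # order of the types in a STR_FORMULA line of Figure 3.5


def structure_formula(segments):
    """Proportion of each structure type at each motif position."""
    if not segments:
        raise ValueError("at least one segment is required")
    length = len(segments[0])
    if any(len(segment) != length for segment in segments):
        raise ValueError("all segments must have the same length")
    formula = []
    for column in zip(*segments):  # one column per motif position
        counts = Counter(column)
        formula.append({t: counts[t] / len(segments) for t in STRUCTURE_TYPES})
    return formula


def format_formula_line(proportions) -> str:
    """Render one position in the style of a STR_FORMULA line."""
    body = ", ".join(f"{t}:{proportions[t]:.2f}" for t in STRUCTURE_TYPES)
    return f"STR_FORMULA [{body}]"
```

For the toy segments `HHT`, `HTT`, `HHH` and `HHT`, the second position would be reported as 75% "H" and 25% "T".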

The structure formula is generated by integrating two databases, PROSITE [22]

and PDBFINDER [20], and the motif finding tool, PATMATMOTIFS [1]. PAT-

MATMOTIFS finds motifs from queried amino acid sequences and provides the

name, length, and start and end positions of these motifs. Figure 3.6 is an example

of a PATMATMOTIFS result, which illustrates that PATMATMOTIFS finds a

motif named “PROTEIN_KINASE_ATP” which is 23 amino acids long, starting

from position 190 and ending at position 212. After executing PATMATMOTIFS

on a queried sequence, CISPred then searches the PROSITE database for entries


2nd Structure PDB ID AA Sequence HHHHH HHHTH HHH 2BRD RYADW LFTTP LLL HHHHH HHHTT THH 2AT9 RYADW LFTTP LLL HHHHH HHHTH HHH 1XJI RYADW LFTTP LLL HHHHH HHHTH HHH 1VJM RYADW LFTTP LLL HHHHH TTHHH HHH 1UCQ RYADW LFTTP LLL HHHHH HHHTH HHH 1UAZ RYADW LFTTP LLL HHHHH TTHHH HHH 1TN5 RYADW LFTTP LLL HHHHH TTHHH HHH 1TN0 RYADW LFTTP LLL HHHHH TTHHH HHH 1S54 RYADW LFTTP LLL HHHHH TTHHH HHH 1S53 RYADW LFTTP LLL HHHHH TTHHH HHH 1S52 RYADW LFTTP LLL HHHHH TTHHH HHH 1S51 RYADW LFTTP LLL HHHHH HHHTH HHH 1R84 RYADW LFTTP LLL HHHHH HHHTH HHH 1R2N RYADW LFTTP LLL HHHHH TTHHH HHH 1QM8 RYADW LFTTP LLL HHHHH HHHTH HHH 1QKP RYADW LFTTP LLL HHHHH HHHTH HHH 1QKO RYADW LFTTP LLL HHHHH HHHTH HHH 1QHJ RYADW LFTTP LLL HHHHH TTHHH HHH 1Q5I RYADW LFTTP LLL HHHHH TTHHH HHH 1PY6 RYADW LFTTP LLL HHHHH TTHHH HHH 1PXS RYADW LFTTP LLL HHHHH TTHHH HHH 1PXR RYADW LFTTP LLL HHHHH HHHTH HHH 1P8U RYADW LFTTP LLL HHHHH HHHTH HHH 1P8I RYADW LFTTP LLL HHHHH HHHTH HHH 1P8H RYADW LFTTP LLL HHHHH TTHHH HHH 1O0A RYADW LFTTP LLL HHHHH HHHTH HHH 1M0M RYADW LFTTP LLL HHHHH HHHTH HHH 1M0L RYADW LFTTP LLL HHHHH HHHTH HHH 1M0K RYADW LFTTP LLL HHHHH TTHHH HHH 1KME RYADW LFTTP LLL HHHHH HHHTH HHH 1KGB RYADW LFTTP LLL HHHHH HHHTH HHH 1KG9 RYADW LFTTP LLL HHHHH HHHTH HHH 1KG8 RYADW LFTTP LLL HHHHH HHHTH HHH 1JGJ RYIDW ILTTP LIV HHHHH HHHTH HHH 1IXF RYADW LFTTP LLL HHHHH HHHTH HHH 1IW9 RYADW LFTTP LLL HHHHH HHHTH HHH 1IW6 RYADW LFTTP LLL HHHHH HHHTH HHH 1BRR RYADW LFTTP LLL HHHHH HHHHH HHH 1BRD RYADW LFTTP LLL HHHHH HHHTH HHH 1BM1 RYADW LFTTP LLL HHHHH HHHHH HHH 1AT9 RYADW LFTTP LLL THHHH TTTHH HHT 1AP9 RYADW LFTTP LLL

Figure 3.4: Structure segments for a protein motif.


MOTIF PROTEIN_KINASE_ATP

LENGTH 23

START 190

END 212

STR_FORMULA [C:0.11, H:0.00, T:0.00, S:0.01, E:0.87, G:0.01, I:0.00, B:0.00]























Figure 3.5: Example of a structure formula result.

about the motifs PATMATMOTIFS found.
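The fields that PATMATMOTIFS reports for each hit (motif name, length, start and end positions) can be read from a report with a small parser. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes the "Length = ... Start = position ... End = position ... Motif = ..." layout seen in Figure 3.6.

```python
import re

# One hit in a report laid out as in Figure 3.6 (assumed layout).
HIT_PATTERN = re.compile(
    r"Length\s*=\s*(\d+)\s+"
    r"Start\s*=\s*position\s+(\d+)\s+of\s+sequence\s+"
    r"End\s*=\s*position\s+(\d+)\s+of\s+sequence\s+"
    r"Motif\s*=\s*(\S+)"
)


def parse_motif_hits(report: str):
    """Return one dictionary per motif hit found in the report text."""
    hits = []
    for match in HIT_PATTERN.finditer(report):
        length, start, end, name = match.groups()
        hits.append(
            {"motif": name, "length": int(length), "start": int(start), "end": int(end)}
        )
    return hits
```

Because the pattern treats any whitespace, including line breaks, as a separator, it works whether the hit fields are on one line or on several.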

Figure 3.7 shows an entry in the PROSITE database which contains information

about the motif “PROTEIN_KINASE_ATP.” The line starting with “ID” is

the name of the motif. The lines starting with “PA” are the consensus pattern

of the motif, which is the amino acid template of that motif. For example, in

a consensus pattern, [ALT] indicates that any one of Alanine (A), Leucine (L) or

Threonine (T) may occur at this position; {AM} indicates that any amino acid except

Alanine (A) and Methionine (M) may occur in this position; x indicates that

any amino acid may be in this position; x(3) corresponds to x-x-x, which indicates

that any three amino acids may occur in this position; and x(2,4) corresponds

to x-x, x-x-x or x-x-x-x, which indicates that any two, three or four amino acids

may occur in this position. The lines starting with “3D” contain the PDB [2]

IDs of the proteins in which this motif exists. Based on the PDB IDs of these

proteins, CISPred searches the PDBFINDER [20] database and retrieves the sec-

ondary structures and the amino acid sequences of these proteins. PATMATMO-

TIFS is then executed on each of these amino acid sequences in order to locate

the position of the motif “PROTEIN_KINASE_ATP” in each of these proteins.

According to the position of the motif provided by PATMATMOTIFS, CISPred

finds the secondary structures of the motif in each of these proteins. A statisti-

cal analysis which computes the proportion of the occurrence of each secondary

structure type and generates the structure formula of this motif is then performed

on these secondary structures. The process of finding the structure formulae of

protein motifs is illustrated in Figure 3.8. Figure 3.5 illustrates the structure formula

of the motif “PROTEIN_KINASE_ATP,” and each line therein starting with

“STR_FORMULA” provides the proportion of the occurrence of each structure

type in that amino acid position. Since the motif “PROTEIN_KINASE_ATP” is


23 amino acids long, the structure formula of “PROTEIN_KINASE_ATP” contains

23 lines, each of which contains the proportion of each structure type.
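The consensus-pattern notation explained above maps directly onto regular expressions, which gives a compact way to check a pattern against a sequence. The Python sketch below is an illustration, not part of CISPred or PATMATMOTIFS; the function name is hypothetical, and it handles only the elements described in the text ([...], {...}, x, and the repeat forms (n) and (n,m)).

```python
import re

# One pattern element: [ABC], {ABC}, a single residue letter or x, with an optional repeat.
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE consensus pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element.strip())
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                       # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any amino acid except those listed
        else:
            regex = core                      # [ALT] or a single residue letter
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


# The two "PA" lines of the PROTEIN_KINASE_ATP entry in Figure 3.7, joined.
KINASE_ATP = (
    "[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]-"
    "x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K."
)
```

Searching the sequence fragment shown in Figure 3.6 with the translated pattern locates the same 23-residue segment that PATMATMOTIFS reports.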

########################################
# Program: patmatmotifs
# Rundate: Sun Sep 17 16:34:39 2006
# Report_format: dbmotif
# Report_file: 1158521678131202243110-1.Pat
########################################

#=======================================
#
# Sequence: SEQUENCE from: 1 to: 438
# HitCount: 2
#
# Full: No
# Prune: Yes
# Data_file: /usr/local/share/EMBOSS/data/PROSITE/ prosite.lines
#
#=======================================

Length = 23
Start = position 190 of sequence
End = position 212 of sequence

Motif = PROTEIN_KINASE_ATP

KLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPG
     |                     |
   190                   212

Length = 13
Start = position 299 of sequence
End = position 311 of sequence

Motif = PROTEIN_KINASE_TYR

IEQRNYIHRDLRAANILVSASLV
     |           |
   299         311

#---------------------------------------
#---------------------------------------

Figure 3.6: Example of a PATMATMOTIFS result.

The length of some motifs is variable; for example, the consensus pattern of

ID PROTEIN_KINASE_ATP; PATTERN.
AC PS00107;
DT APR-1990 (CREATED); NOV-1995 (DATA UPDATE); SEP-2006 (INFO UPDATE).
DE Protein kinases ATP-binding region signature.
PA [LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-[GSTACLIVMFY]-
PA x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K.
NR /RELEASE=50.8,234112;
NR /TOTAL=1989(1969); /POSITIVE=1890(1870); /UNKNOWN=1(1);
NR /FALSE_POS=98(98); /FALSE_NEG=373; /PARTIAL=30;
CC /TAXO-RANGE=??EPV; /MAX-REPEAT=2;
CC /VERSION=1;
DR P13368, 7LESS_DROME, T; P20806, 7LESS_DROVI, T; P45894, AAPK1_CAEEL, T;
DR Q13131, AAPK1_HUMAN, T; Q5EG47, AAPK1_MOUSE, T; Q5RDH5, AAPK1_PONPY, T;
DR P54645, AAPK1_RAT , T; Q95ZQ4, AAPK2_CAEEL, T; P54646, AAPK2_HUMAN, T;
DR Q28948, AAPK2_PIG , T; Q5RD00, AAPK2_PONPY, T; Q09137, AAPK2_RAT , T;
DR Q6ZMQ8, AATK_HUMAN , T; Q80YE4, AATK_MOUSE , T; P03949, ABL1_CAEEL , T;
DR P00519, ABL1_HUMAN , T; P00520, ABL1_MOUSE , T; P42684, ABL2_HUMAN , T;
DR P00522, ABL_DROME , T; P10447, ABL_FSVHY , T; P00521, ABL_MLVAB , T;
. . .

DR Q9GMA3, VSX1_BOVIN , F; P29944, YCB2_PSEDE , F; P33222, YJFC_ECOLI , F;
DR Q12291, YL063_YEAST, F; Q09371, YS42_CAEEL , F; O32095, YUEF_BACSU , F;
DR P47917, ZRP4_MAIZE , F; Q9PIN2, ZUPT_CAMJE , F;
3D 1A9U; 1AD5; 1AGW; 1APM; 1ATP; 1B6C; 1BKX; 1BL6; 1BL7; 1BLX; 1BMK; 1BX6;
3D 1BYG; 1CKI; 1CKJ; 1CSN; 1CTP; 1DAW; 1DAY; 1DI9; 1DS5; 1E9H; 1EH4; 1ERK;
3D 1F0Q; 1FGI; 1FGK; 1FIN; 1FMK; 1FMO; 1FOT; 1FPU; 1FQ1; 1FVV; 1G3N; 1GAG;
3D 1GJO; 1GNG; 1GOL; 1GY3; 1H1P; 1H1Q; 1H1R; 1H1S; 1H1W; 1H24; 1H25; 1H26;
3D 1H27; 1H28; 1H8F; 1HOW; 1I09; 1I44; 1IAN; 1IAS; 1IEP; 1IG1; 1IR3; 1IRK;
3D 1J1B; 1J1C; 1J3H; 1J91; 1JAM; 1JKK; 1JKL; 1JKS; 1JKT; 1JLU; 1JPA; 1JQH;
3D 1JST; 1JWH; 1K2P; 1K3A; 1K9A; 1KMU; 1KMW; 1KSW; 1KV1; 1KV2; 1KWP; 1L3R;
3D 1LC9; 1LCH; 1LD2; 1LEW; 1LEZ; 1LFN; 1LFR; 1LG3; 1LHX; 1LP4; 1LPU; 1LR4;
3D 1LUF; 1LWP; 1M14; 1M17; 1M2P; 1M2Q; 1M2R; 1M52; 1M7N; 1M7Q; 1MP8; 1MQ4;
3D 1MQB; 1MRU; 1MUO; 1NA7; 1NXK; 1NY3; 1O6K; 1O6L; 1O6Y; 1O9U; 1OB3; 1OEC;
3D 1OGU; 1OI9; 1OIU; 1OIY; 1OKV; 1OKW; 1OKY; 1OKZ; 1OL1; 1OL2; 1OL5; 1OL6;
3D 1OL7; 1OM1; 1OMW; 1OPJ; 1OPK; 1OPL; 1OUK; 1OUY; 1OVE; 1OZ1; 1P14; 1P38;
3D 1P4F; 1P4O; 1P5E; 1PF6; 1PF8; 1PHK; 1PJK; 1PKD; 1PKG; 1PME; 1PVK; 1PY5;
3D 1PYX; 1Q24; 1Q3D; 1Q3W; 1Q41; 1Q4L; 1Q5K; 1Q61; 1Q62; 1Q8T; 1Q8U; 1Q8W;
3D 1Q8Y; 1Q8Z; 1Q97; 1Q99; 1QCF; 1QL6; 1QMZ; 1QPC; 1QPE; 1QPJ; 1R0E; 1R0P;
3D 1R1W; 1R39; 1R3C; 1RDQ; 1RE8; 1REJ; 1REK; 1RJB; 1RQQ; 1RW8; 1S9I; 1S9J;
3D 1SM2; 1SMH; 1SNU; 1SNX; 1STC; 1SYK; 1SZM; 1T45; 1T46; 1TVO; 1U59; 1U5Q;
3D 1U5R; 1U7E; 1UNL; 1URC; 1UU3; 1UU7; 1UU8; 1UU9; 1UV5; 1UVR; 1UWH; 1UWJ;
3D 1V0B; 1V0P; 1VJY; 1VR2; 1VYW; 1VZO; 1W7H; 1W82; 1W83; 1W84; 1W98; 1WBN;
3D 1WBO; 1WBP; 1WBS; 1WBT; 1WBV; 1WBW; 1WFC; 1WMK; 1WZY; 1X8B; 1XH4; 1XH5;
3D 1XH6; 1XH7; 1XH8; 1XH9; 1XHA; 1XJD; 1XKK; 1XO2; 1XQZ; 1XR1; 1XWS; 1Y57;
3D 1Y6B; 1Y8G; 1YDR; 1YDS; 1YDT; 1YHS; 1YI3; 1YI4; 1YKR; 1YM7; 1YMI; 1YOL;
3D 1YOM; 1YVJ; 1YW2; 1YWR; 1Z57; 1Z5M; 1ZMU; 1ZMW; 1ZOE; 1ZOG; 1ZOH; 1ZRZ;
3D 1ZXE; 1ZY4; 1ZY5; 1ZYC; 1ZYD; 1ZZ2; 1ZZL; 2A19; 2A1A; 2AC3; 2AC5; 2AUH;
3D 2B4S; 2B54; 2B7A; 2B9F; 2B9H; 2B9I; 2B9J; 2BAJ; 2BAK; 2BAL; 2BAQ; 2BCJ;
3D 2BIK; 2BIL; 2BIY; 2BKZ; 2BMC; 2BPM; 2BZH; 2BZI; 2BZJ; 2C1A; 2C1B; 2C30;
3D 2C3I; 2C4G; 2C5N; 2C5O; 2C5P; 2C5T; 2C5X; 2C6D; 2C6E; 2C6T; 2CDZ; 2CHL;
3D 2CPK; 2CSN; 2ERK; 2ERZ; 2ESM; 2ETK; 2ETO; 2ETR; 2EU9; 2EXM; 2F49; 2F4J;
3D 2F57; 2FA2; 2FGI; 2FO0; 2G15; 2HCK; 2PHK; 2SRC; 3ERK; 3LCK; 4ERK;
DO PDOC00100;
//

Figure 3.7: Example of a PROSITE entry.

the motif “PROTEIN KINASE ATP” shown in Figure 3.7 contains an x(5,18), which

indicates that any five to eighteen amino acids may occur in that position. These

variable length positions do not affect the illustration of the structural template

of the motif; therefore, these variable length positions are ignored by CISPred.
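To make the pattern syntax concrete, the sketch below translates a PROSITE pattern into a regular expression and applies it to the sequence fragment shown in Figure 3.6. It is only an illustration under stated assumptions: `prosite_to_regex` is a hypothetical helper (not part of PATMATMOTIFS or CISPred), and it handles only the syntax elements that occur in Figure 3.7, namely single residues, `[...]` sets, `{...}` exclusions, `x`, and `(n)` or `(n,m)` repeats.

```python
import re

# One PROSITE element: a core (x, a residue, [set] or {excluded set})
# optionally followed by a repeat count such as (5) or (5,18).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression (sketch)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                        # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any amino acid except these
        else:
            regex = core                       # single residue or [set]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


# Pattern from the two PA lines of Figure 3.7, joined into one string.
pattern = ("[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-"
           "[GSTACLIVMFY]-x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K.")
fragment = "KLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPG"  # residues around the hit in Figure 3.6

hit = re.search(prosite_to_regex(pattern), fragment)
print(hit.group(), len(hit.group()))  # LGAGQFGEVWMATYNKHTKVAVK 23
```

In this fragment the variable-length element matches seven residues, which is consistent with the 23-residue hit reported by PATMATMOTIFS.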

3.2.4 PSIPRED and SSPRO

PSIPRED [24] and SSPRO [37] are the two existing protein secondary structure

prediction tools integrated into CISPred. Figure 3.9 illustrates an example of a

PSIPRED result: the first column provides an index of the amino acid residues

in the queried sequence; the second column provides the amino acid sequence; the

third column provides the predicted structures which have the highest confidence

scores in each amino acid position; the fourth column provides the confidence

scores of the structure type “C”; the fifth column provides the confidence scores of

the structure type “H”; and the sixth column provides the confidence scores of the

structure type “E.” Figure 3.10 shows a sample result from SSPRO, and contains

the ID and description of the queried amino acid sequence, followed by the queried

amino acid sequence and the predicted sequence of secondary structures. SSPRO

does not provide any confidence scores about its predictions.
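The six-column layout described above can be read with a few lines of code. The following is a minimal sketch, assuming the whitespace-separated format of Figure 3.9; the names `PsipredRow` and `parse_psipred_vertical` are hypothetical and do not come from PSIPRED or CISPred.

```python
from collections import namedtuple

# Columns in the order described in the text: index, residue, predicted
# structure, then the confidence scores of "C", "H" and "E".
PsipredRow = namedtuple("PsipredRow", "index residue predicted coil helix strand")


def parse_psipred_vertical(text):
    """Parse PSIPRED vertical-format text into a list of PsipredRow tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the "# PSIPRED VFORMAT" header, blank lines and ellipsis lines.
        if len(fields) != 6 or fields[0].startswith("#"):
            continue
        rows.append(PsipredRow(int(fields[0]), fields[1], fields[2],
                               float(fields[3]), float(fields[4]), float(fields[5])))
    return rows


sample = """# PSIPRED VFORMAT

1 E C 0.998 0.001 0.006
2 D C 0.843 0.009 0.186
3 I E 0.347 0.002 0.655
"""
rows = parse_psipred_vertical(sample)
print(rows[2].predicted, rows[2].strand)  # E 0.655
```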

PSIPRED and SSPRO are independent and successful prediction tools. They

provide complete predictions for all of the amino acid residues of the queried

sequence. Therefore, their results and the confidence scores are directly integrated

into CISPred.

3.2.5 Generating Consensus Structure Prediction

The consensus predictions of CISPred are determined by integrating the fold

structures provided by THREADER [23], the predicted structures provided by

[Flowchart: query amino acid sequence → PATMATMOTIFS → motif name → PROSITE database → PDB IDs of all proteins having this motif → PDBFINDER database → secondary structure sequences of the proteins having this motif → PATMATMOTIFS → sequence segments for this motif → program generating structure formula → motif structure formula → consensus prediction program.]

Figure 3.8: The generation of motif structure formulae.

# PSIPRED VFORMAT

1 E C 0.998 0.001 0.006

2 D C 0.843 0.009 0.186

3 I E 0.347 0.002 0.655

4 I E 0.046 0.002 0.970

5 V E 0.017 0.002 0.988

6 V E 0.018 0.003 0.984

7 A E 0.063 0.009 0.949

8 L E 0.453 0.018 0.549

9 Y C 0.627 0.013 0.419

10 D C 0.843 0.020 0.154

11 Y C 0.883 0.025 0.113

12 E C 0.926 0.016 0.037

13 A C 0.964 0.021 0.033

14 I C 0.885 0.019 0.017

15 H C 0.840 0.014 0.023

16 H C 0.931 0.063 0.024

17 E C 0.799 0.142 0.045

.

.

.

420 Q H 0.014 0.977 0.001

421 S H 0.018 0.979 0.001

422 V H 0.025 0.975 0.001

423 L H 0.023 0.979 0.001

424 D H 0.024 0.977 0.002

425 D H 0.035 0.963 0.003

426 F H 0.055 0.933 0.005

427 Y H 0.163 0.802 0.007

428 T H 0.386 0.562 0.008

429 A C 0.520 0.430 0.010

430 T C 0.794 0.215 0.016

431 E C 0.828 0.169 0.023

432 S C 0.611 0.412 0.021

433 Q C 0.599 0.466 0.021

434 Y C 0.629 0.307 0.049

435 Q C 0.785 0.140 0.082

436 Q C 0.852 0.033 0.012

437 Q C 0.819 0.021 0.012

438 P C 0.733 0.000 0.001

Figure 3.9: Example of a PSIPRED vertical result.

PSIPRED [24] and SSPRO [37], and motif structure formulae.

CISPred parses the results of SSPRO, PSIPRED, THREADER alignments,

and motif structure formulae, and then generates consensus predictions from the

first amino acid position to the last, one amino acid position at a time. Figure 3.11

shows an example of the available structures and confidence scores for one amino

acid position. The line starting with “AA” shows the type of queried amino acid

in this position, a “K”(Lysine) in this example. The line starting with “SSPRO”

shows the structure type predicted by SSPRO in this position, an “H” in this

example. The line starting with “PSIPRED” shows the PSIPRED result in this

position, and provides an index of this position in the queried amino acid sequence

(“10”), the queried amino acid type in this position (“K”), the structure type

predicted by PSIPRED in this position (“H”), the confidence score of the structure

type “C” in this position (“0.013”), the confidence score of the structure type

“H” in this position (“0.627”), and the confidence score of the structure type

“E” in this position (“0.419”). The lines starting with “THREADER” show the

THREADER alignments in this position. In this example, five alignments are

1AD5:B|PDBID|CHAIN|SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDSLETEEWFFKGI
SRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHYKIRTLDNGGFYISPRSTFSTLQ
ELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIPRESLKLEKKLGAGQFGEVWMATYNKHTKVAVKT
MKPGSMSVEAFLAEANVMKTLQHDKLVKLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFS
AQIAEGMAFIEQRNYIHRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSF
GILLMEIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQSVLDDF
YTATESQYQQQP
CCCEEEECCCECCCCCCECCECCCCEEEEEECCCCEEEEEECCCCCEEEEEHHHEEECCCHHHCCCEECCC
CHHHHHHHHHCCCCCCCCEEEEECCCCCCCEEEEEEEEECCCEEEEEEEEEEECCCCCEECCCCCCECCHH
HHHHHHCCCCCCCCCCCCCECCCCCCCCCCCCCCCECCHHHEEEEEEEECCCCEEEEEEEECCCEEEEEEE
ECCCCECHHHHHHHHHHHCCCCCCCECCEEEEECCCCCEEEECCCCCCEHHHHHHCHHHHCCCHHHHHHHH
HHHHHHHHHHHHCCCCCCCCCHHHEEECCCCCEEECCCCHHHHCCCCCHHHCCHHHHHHCCCCHHHHHHHH
HHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCHHHCCCHHHHHHHHHCC
CCCCCCCCECCC

Figure 3.10: Example of a SSPRO result.

chosen for integration into CISPred, and their structures in this position are “C,”

“H,” “E,” “H” and “H,” with confidence scores of “-,” “6,” “3,” “0” and “5,”

respectively.¹ However, as described previously, all of the confidence scores of

the structure types “E” and “H” are increased by 1 from their original numbers

provided by THREADER alignments, and the confidence scores of the structure

type “C” are arbitrarily set to be 5. Therefore, the confidence scores actually

considered by CISPred are “5,” “7,” “4,” “1” and “6,” respectively. The line

starting with “Str F” shows the structure formula in this position, and indicates

the proportion of the occurrence of each structure type in the position. In the

“Str F” line, “C:0.11” means that 11% of the protein motifs have a structure type

“C” in this position.
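The adjustment of THREADER confidence scores described above is small enough to write out. This is a sketch under the rules stated in the text (add 1 to the scores of “H” and “E”, use a fixed score of 5 for “C”); the function name and the tuple representation of an alignment are hypothetical.

```python
COIL_SCORE = 5  # fixed score used for "C", since THREADER reports "-" for coil


def adjusted_threader_score(structure, raw_score):
    """Return the confidence score that is actually used for one alignment."""
    if structure == "C":
        return COIL_SCORE
    return int(raw_score) + 1  # "H" and "E" scores are increased by 1


# The five THREADER lines of Figure 3.11 as (structure, raw score) pairs.
alignments = [("C", "-"), ("H", "6"), ("E", "3"), ("H", "0"), ("H", "5")]
adjusted = [adjusted_threader_score(s, c) for s, c in alignments]
print(adjusted)  # [5, 7, 4, 1, 6]
```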

For each of the amino acid positions in a queried sequence, CISPred computes

a total confidence score Ctotal for each of the three structure types: “H,” “E,” and

“C.” The structure type with the highest total confidence score Ctotal is considered

as the consensus prediction in an amino acid position. Equation 3.4 shows the

computation of the total confidence score Ctotal for the structure type “H” in an

amino acid position.

$$C_{total\_H} = W_{psi} C_{psi\_H} + W_{ssp} C_{ssp\_H} + W_{sf} P_{sf\_H} + W_{thr} C_{thr\_avg\_H} \qquad (3.4)$$

In Equation 3.4, Cpsi H is the confidence score of the structure type “H” pro-

vided by PSIPRED. The confidence scores provided by PSIPRED are converted to

a scale from 0 to 10; therefore, original confidence scores provided by PSIPRED

are multiplied by 10. In the example presented in Figure 3.11, the integrated

¹ THREADER does not provide confidence scores for the structure type “C,” and it uses “-” to represent the confidence score of the structure type “C.”

AA K

SSPRO H

PSIPRED 10 K H 0.013 0.627 0.419

THREADER - K C

THREADER 6 K H

THREADER 3 K E

THREADER 0 K H

THREADER 5 K H

Str_F [C:0.11, H:0.87, T:0.00, S:0.01, E:0.00, G:0.01, I:0.00, B:0.00]

Figure 3.11: Example of the information available in one amino acid position.

confidence score of the structure type “H” (Cpsi H) is 0.627×10=6.27. Cssp H is

the confidence score of the structure type “H” provided by SSPRO. As previously

mentioned, SSPRO does not provide any confidence scores for its predictions. To

integrate the predictions of SSPRO and assign its predictions appropriate confi-

dence scores, each of the predictions of SSPRO is considered to have a confidence

score of 5, one of the two integers halfway between 1 and 10. In the example

illustrated in Figure 3.11, SSPRO predicts an “H” in that position; therefore, the

Cssp H in that example is set to be “5.” If the structure type predicted by SSPRO

in that position is not an “H,” Cssp H is set to “0.” Psf H is the proportion

of the structure type “H” provided by the structure formula of an amino acid

position. CISPred uses a 3-state scheme [40], in which “G” and “H” are taken to

be helix (“H”), “E” and “B” are taken to be strand (“E”), and all of the other

structure types are considered as coil (“C”). The proportion of the structure type

“H” includes the proportion of the structure type “G”. Like the confidence scores

provided by PSIPRED, the proportions provided by structure formula are multi-

plied by 10 in order to change their scale to between 0 and 10. In the example

illustrated in Figure 3.11, Psf H = (0.87+0.01) × 10 = 8.8. Cthr avg H is the av-

erage confidence score of the structure type “H” in an amino acid position of the

selected THREADER alignments. As previously mentioned, the fold structures

from the THREADER alignments are clustered, and one cluster of fold struc-

tures is chosen for integration in CISPred. In each position, CISPred computes

an average of the confidence scores of these selected fold structures. Equation 3.5

illustrates the computation of Cthr avg H: Cthr H i is the confidence score of the

i-th selected THREADER alignment whose structure type in the amino acid position is “H,” and

NH is the number of such alignments. For the

example illustrated in Figure 3.11, Cthr H 1, Cthr H 2, and Cthr H 3 are equal to “7,”

“1,” and “6,” respectively; their sum equals 7 + 1 + 6 = 14, and NH equals 3.

$$C_{thr\_avg\_H} = \frac{\sum_{i=1}^{N_H} C_{thr\_H\_i}}{N_H} \qquad (3.5)$$
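Equation 3.5 can be evaluated for all three structure types at once. The sketch below is illustrative only: `average_threader_scores` is a hypothetical name, and the input is assumed to be the already adjusted scores of the selected alignments in Figure 3.11.

```python
from collections import defaultdict


def average_threader_scores(adjusted_alignments):
    """Average the adjusted confidence scores per structure type (Equation 3.5)."""
    grouped = defaultdict(list)
    for structure, score in adjusted_alignments:
        grouped[structure].append(score)
    return {structure: sum(scores) / len(scores)
            for structure, scores in grouped.items()}


# Adjusted scores of the five selected alignments in Figure 3.11.
averages = average_threader_scores(
    [("C", 5), ("H", 7), ("E", 4), ("H", 1), ("H", 6)]
)
print(round(averages["H"], 3))  # 4.667, i.e. (7 + 1 + 6) / 3
```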

The consensus predictions of CISPred are determined not only by considering

the confidence scores provided by each tool, but also the overall prediction accura-

cies of the integrated tools. In Equation 3.4, Wpsi, Wssp, and Wthr are the weights

of PSIPRED, SSPRO, and THREADER, and Wsf is the weight for the structure

formulae. A weight is a real number from 0 to 1 inclusive which indicates the accu-

racy rate of the information provided by a tool. The weight of structure formulae is

set to 1 because the proportions provided by structure formulae are determined by

statistical analysis of the real structural data of existing motifs, and not predicted

by algorithms. The weights of PSIPRED, SSPRO, and THREADER are equal

to their average 3-state accuracies on a training dataset containing 80 randomly

selected amino acid sequences. 3-state accuracy, also called a Q3 score, is used to

evaluate the prediction accuracy of secondary structure prediction tools. 3-state

accuracy only considers the prediction accuracy of the following three states: helix

(“H”), strand (“E”), and coil (“C”). 3-state accuracy is defined as the fraction of

the amino acid residues that are correctly predicted. Equation 3.6 defines 3-state

accuracy, where Ncorrect is the number of residues that are correctly predicted, and

Ntotal is the total number of amino acid residues in the sequence.

$$Q_3 = \frac{N_{correct}}{N_{total}} \qquad (3.6)$$
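As a small worked instance of Equation 3.6, the following sketch compares a predicted 3-state string with an observed one. The function name `q3_score` and the two eight-residue strings are hypothetical examples, not data from the thesis.

```python
def q3_score(predicted, observed):
    """Fraction of residues whose 3-state structure is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Hypothetical eight-residue example with one mismatch (position 6).
score = q3_score("CCHHHEEC", "CCHHHHEC")
print(score)  # 0.875
```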

SSPRO and PSIPRED are independent existing protein secondary structure

prediction tools, and their outputs are in the format of protein secondary struc-

ture sequences. The 3-state accuracy scores of their predictions can be computed

by comparing the predicted structure type in each amino acid position with the

real structure type retrieved from the PDBFINDER database [20]. The average

3-state accuracy of SSPRO predictions on the 80 training sequences is 0.937, and

the average 3-state accuracy of PSIPRED predictions on the 80 training sequences

is 0.798. THREADER does not provide predicted secondary structure sequences,

but rather the structures of the most appropriate folds, and the number of the most

appropriate folds is influenced by the threshold at which the hierarchical clustering

stops. The structure type that has the highest average confidence score is

considered to be the structure type predicted by THREADER in an amino acid

position. Equation 3.5 illustrates the computation of the average confidence score

of the structure type “H.” Figure 3.12 illustrates the average 3-state accuracy of

THREADER predictions on 80 random sequences. The average 3-state accuracy

declines as the threshold of the hierarchical clustering rises, which indicates that

the more similarities the selected folds have, the higher the prediction accuracy of

THREADER is.

As shown in Figure 3.12, the minimum average 3-state accuracy of THREADER

is 0.684, and the maximum average 3-state accuracy of THREADER is 0.770.

Therefore, the 3-state accuracy of THREADER is considered to be 0.727, which is

halfway between the maximum accuracy and the minimum accuracy.

Based on Equation 3.4, the total confidence score for the structure

type “H” in the amino acid position presented in Figure 3.11 is computed as

Ctotal H = 0.798 × 10 × 0.627 + 0.937 × 5 + 1 × 10 × (0.87 + 0.01) + 0.727

× (7 + 1 + 6)/3 ≈ 21.88. Similarly, Ctotal E and Ctotal C are computed as Ctotal E

= 0.798 × 10 × 0.419 + 0.937 × 0 + 1 × 10 × (0.00 + 0.00) + 0.727 × 4/1≈6.07, and Ctotal C = 0.798 × 10 × 0.013 + 0.937 × 0 + 1 × 10 × (0.11 + 0.00

+ 0.01 + 0.00) + 0.727 × 5/1 ≈ 4.94, respectively. In the amino acid position

[Plot: average 3-state accuracy (Q3 score), vertical axis from 0.67 to 0.78, against the clustering threshold, horizontal axis from 0 to 1.]

Figure 3.12: Average 3-state accuracy of THREADER predictions on 80 random sequences.

presented in Figure 3.11, the maximum among Ctotal H , Ctotal E and Ctotal C is

Ctotal H , which approximately equals 21.88; therefore, the structure type “H” is

considered to be the consensus structure type predicted by CISPred in this amino

acid position. Similar to the prediction performed in the position illustrated in

Figure 3.11, CISPred provides a consensus prediction in each of the amino acid

positions of a queried sequence, which composes a protein secondary structure

sequence as the consensus prediction on the queried sequence.
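The per-position decision can be sketched as follows. This is a minimal illustration of Equation 3.4 followed by a maximum over the three structure types, using the weights and the Figure 3.11 inputs quoted in the text; the function names and the dictionary layout are assumptions made for this sketch rather than the CISPred code.

```python
WEIGHTS = {"psipred": 0.798, "sspro": 0.937, "formula": 1.0, "threader": 0.727}
SSPRO_SCORE = 5  # confidence assigned to every SSPRO prediction


def total_confidence(structure, psipred, sspro_prediction, formula, threader_avg):
    """Equation 3.4 for one structure type in one amino acid position."""
    return (WEIGHTS["psipred"] * 10 * psipred[structure]
            + WEIGHTS["sspro"] * (SSPRO_SCORE if sspro_prediction == structure else 0)
            + WEIGHTS["formula"] * 10 * formula[structure]
            + WEIGHTS["threader"] * threader_avg[structure])


def consensus(psipred, sspro_prediction, formula, threader_avg):
    """Return the structure type with the highest total score, and all totals."""
    totals = {s: total_confidence(s, psipred, sspro_prediction, formula, threader_avg)
              for s in "HEC"}
    return max(totals, key=totals.get), totals


# Inputs for the position shown in Figure 3.11, reduced to three states.
psipred = {"C": 0.013, "H": 0.627, "E": 0.419}
formula = {"C": 0.12, "H": 0.88, "E": 0.00}          # C+T+S+I, H+G, E+B
threader_avg = {"C": 5.0, "H": (7 + 1 + 6) / 3, "E": 4.0}

predicted, totals = consensus(psipred, "H", formula, threader_avg)
print(predicted, round(totals["H"], 2))  # H 21.88
```

Repeating this call for every position of a queried sequence and concatenating the returned letters gives the consensus structure string.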

Ctotal reaches its maximum limit when the confidence score provided by PSIPRED

equals 1.00, the confidence score provided by SSPRO equals 5, the proportion

provided by structure formula equals 1.00, and the average confidence score of

THREADER alignments equals 10. Therefore, the maximum limit of Ctotal is

computed as Lmax = 0.798 × 10 × 1.000 + 0.937 × 5 + 1 × 10 × 1.00 + 0.727

× 10 = 29.935.

Ctotal reaches its minimum limit when the confidence score provided by PSIPRED

equals 0.00, the confidence score provided by SSPRO equals 0, the proportion

provided by structure formula equals 0.00, and the average confidence score of

THREADER alignments equals 0 when a fold has a gap “-” aligned in an amino

acid position. Therefore, the minimum limit of Ctotal is computed as Lmin = 0.798

× 10 × 0.000 + 0.937 × 0 + 1 × 10 × 0.00 + 0.727 × 0 = 0. In order to clearly

present total confidence scores, the Ctotal of each structure type is converted into

a real number from 0 to 1 inclusive. Equation 3.7 illustrates the conversion of the

total confidence score of the structure type “H.”

$$C'_{total\_H} = 1 \times \frac{C_{total\_H} - L_{min}}{L_{max} - L_{min}} \qquad (3.7)$$
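The limits and the conversion of Equation 3.7 can be checked numerically. The sketch below assumes the weights given in the text; the constant and function names are hypothetical.

```python
W_PSI, W_SSP, W_SF, W_THR = 0.798, 0.937, 1.0, 0.727

# Upper limit: PSIPRED confidence 1.00, SSPRO score 5, structure-formula
# proportion 1.00 and an average THREADER score of 10.
L_MAX = W_PSI * 10 * 1.0 + W_SSP * 5 + W_SF * 10 * 1.0 + W_THR * 10
# Lower limit: every contribution is zero.
L_MIN = 0.0


def normalize(total):
    """Convert a total confidence score to the range 0 to 1 (Equation 3.7)."""
    return (total - L_MIN) / (L_MAX - L_MIN)


print(round(L_MAX, 3))             # 29.935
print(round(normalize(21.88), 3))  # 0.731
```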

In the example illustrated in Figure 3.11, the converted total confidence score

of the structure type “H” is computed as C ′total H = 21.88/29.935 ≈ 0.731, the

converted total confidence score of the structure type “E” is computed as C ′total E

= 6.07/29.935 ≈ 0.203, and the converted total confidence score of the structure

type “C” is computed as C ′total C = 4.94/29.935 ≈ 0.144. Figure 3.13 illustrates

the consensus prediction of CISPred in the amino acid position illustrated in Fig-

ure 3.11, in which “10” is the index of the amino acid position, “K” is the amino

acid type, “H” is the structure type predicted by CISPred, “0.144” is the con-

verted total confidence score of structure type “C,” “0.731” is the converted total

confidence score of structure type “H,” and “0.203” is the converted total confidence

score of structure type “E.” Figure 3.14 illustrates an example of a CISPred

result in a vertical format, in which each line is a CISPred prediction result for

an amino acid position. Figure 3.15 illustrates an example of a CISPred result

in a horizontal format, in which the queried sequence is shown in FASTA format

followed by CISPred prediction results.

10 K H 0.144 0.731 0.203

Figure 3.13: Example of a CISPred vertical result in an amino acid position.

# CISPred vertical result

1 E C 0.330 0.000 0.000
2 D C 0.325 0.000 0.005
3 I C 0.312 0.000 0.017
4 I E 0.001 0.000 0.426
5 V E 0.000 0.000 0.427
6 V E 0.000 0.000 0.427
7 A E 0.148 0.000 0.182
8 L C 0.315 0.000 0.015
9 Y C 0.320 0.000 0.011
10 D C 0.179 0.025 0.004
11 Y E 0.024 0.025 0.160
12 E C 0.181 0.025 0.001
13 A C 0.182 0.025 0.001
14 I C 0.180 0.025 0.000
15 H C 0.179 0.025 0.001
16 H C 0.181 0.026 0.001
17 E C 0.178 0.028 0.001
. . .
420 Q H 0.000 0.183 0.000
421 S H 0.000 0.183 0.000
422 V H 0.001 0.183 0.000
423 L H 0.001 0.183 0.000
424 D H 0.001 0.183 0.000
425 D C 0.157 0.026 0.000
426 F C 0.158 0.025 0.000
427 Y C 0.161 0.021 0.000
428 T C 0.167 0.015 0.000
429 A C 0.170 0.011 0.000
430 T C 0.178 0.006 0.000
431 E C 0.179 0.005 0.001
432 S C 0.173 0.011 0.001
433 Q C 0.172 0.012 0.001
434 Y C 0.173 0.008 0.001
435 Q E 0.021 0.004 0.159
436 Q C 0.179 0.001 0.000
437 Q C 0.178 0.001 0.000
438 P C 0.176 0.000 0.000

Figure 3.14: Example of a CISPred vertical result.

>1AD5:B|PDBID|CHAIN|SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDS
LETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHY
KIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIP
RESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQHDKLV
KLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIEQRNYI
HRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSFGILLM
EIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQ
SVLDDFYTATESQYQQQP

CCCEEEECCCECCCCCCECCECCCCEEEEEECCCCEEEEEECCCCCEEEEEHCCEEECCC
CCHCCCEECCCCHHHHHHHHHCCCCCCCCEEEEECCCCCCCEEEEEEEEECCCCEEEEEE
EEEECCCCCEECCCCCCHHHHHHHHHHHCCCCCCCCCCCCCECCCCCCCCCCCCCHHHHH
HHHHHHEEEEECCCCEEEEEEEECCCEEEEEEEECCCCCCHHHHHHHHHHHCCCCCCCEE
EEEEEECCCCCEEEECCCCCCEHHHHHHCHCCHCCCHHHHHHHHHHHHHHHHHHHHCCCC
CCCCCHHHHHHHHHHHHHHHHHHHHHHCCCCCCCHCCHHHHHHCCCCHHHHEEEEEEEEE
HHHCCCCCCCCCCCHHHHHHHHHHCCCCCCCCCCCHHHHHHHHHHCCCCHHHCCCHHHHH
HHHHCCCCCCCCCCECCC

Figure 3.15: Example of a CISPred horizontal result.

Chapter 4

System Implementation

4.1 Overview

From a software engineering perspective, CISPred is an Internet web application,

an application that is accessed through a web browser over the Internet. The

inputs of CISPred, amino acid sequences, are submitted on a web page of the

CISPred website. After CISPred predictions are finished, the consensus prediction

results are sent to the user's email address.

The tools integrated in CISPred are concurrently executed on a 164-processor

high performance SUN cluster. The computations for one queried sequence are

simultaneously performed on at least 12 processors, which greatly reduces the

execution time of CISPred.

4.2 System Infrastructure

Figure 4.1 illustrates the system infrastructure of CISPred. A user of CISPred ac-

cesses the CISPred website at the HTTP address http://acrl.cs.unb.ca/conpred/

through a web browser. The CISPred website is constructed in HTML and

CGI (Common Gateway Interface), and runs on a web server, which is a 4-processor

SUN computer with the Fully Qualified Domain Name (FQDN) quartet.cs.unb.ca.

The user email address and queried amino acid sequences in FASTA format shown

in Figure 4.2 are submitted to the CISPred web server by the submission web page

of CISPred shown in Figure 4.3. A PERL CGI program running on the CISPred

web server parses the queried sequences and submits concurrent jobs to the 164-

processor high performance SUN cluster with FQDN chorus.cs.unb.ca. A MySQL

database that runs at the 4-processor SUN computer stores the information about

queried sequences and their concurrent job IDs. A program that runs on the web

server retrieves the IDs of unfinished concurrent jobs and checks the status of these

jobs. After the concurrent jobs are finished, a program on the cluster is executed

which integrates the results of concurrent jobs and generates consensus prediction

results. The consensus prediction results are sent to the CISPred web server via

FTP and then sent to the email address of a user.

Moreover, the CISPred website contains web pages for CISPred administrators

to view the usage information of CISPred. The administration web pages are

implemented in PERL CGI and are able to list the submitted and finished time

of each queried task, and the IP address and the geographical location of the

computer a CISPred user uses to submit queried sequences. This information

is stored in a table in the MySQL database that runs at the 4-processor SUN

machine. Figure 4.4 shows one of the administration web pages of CISPred.

[Diagram: users send queried sequences over HTTP to the CISPred web page and receive consensus predictions by email. The web server and database server hosts the program submitting execution jobs, the MySQL database (task information, queried sequences, job IDs, task ID, job status), and the program checking job status and sending emails. The box labelled “SUN Cluster with 160 Processors” hosts the execution of integrated tools, the results of integrated tools, and the program generating consensus predictions.]

Figure 4.1: The system infrastructure of CISPred.

>1AD5:B|PDBID|CHAIN|SEQUENCE
EDIIVVALYDYEAIHHEDLSFQKGDQMVVLEESGEWWKARSLATRKEGYIPSNYVARVDS
LETEEWFFKGISRKDAERQLLAPGNMLGSFMIRDSETTKGSYSLSVRDYDPRQGDTVKHY
KIRTLDNGGFYISPRSTFSTLQELVDHYKKGNDGLCQKLSVPCMSSKPQKPWEKDAWEIP
RESLKLEKKLGAGQFGEVWMATYNKHTKVAVKTMKPGSMSVEAFLAEANVMKTLQHDKLV
KLHAVVTKEPIYIITEFMAKGSLLDFLKSDEGSKQPLPKLIDFSAQIAEGMAFIEQRNYI
HRDLRAANILVSASLVCKIADFGLARVGAKFPIKWTAPEAINFGSFTIKSDVWSFGILLM
EIVTYGRIPYPGMSNPEVIRALERGYRMPRPENCPEELYNIMMRCWKNRPEERPTFEYIQ
SVLDDFYTATESQYQQQP

Figure 4.2: Example of a CISPred queried sequence.

Figure 4.3: Web page for submitting query sequences to CISPred.

Figure 4.4: A CISPred web page displaying user jobs.

4.3 Concurrent Implementation

4.3.1 Overview

The tools integrated in CISPred, THREADER [23], PSIPRED [24], and SSPRO [37],

and the finding of motif structure formulae are concurrently executed in a high per-

formance SUN cluster containing 164 processors. Figure 4.5 illustrates an overview

of the concurrent implementation of CISPred. A PERL CGI program that runs at

the CISPred web server parses the submission file, which may contain more than one

queried amino acid sequence in FASTA format. For each of the queried sequences,

the PERL CGI program submits several concurrent jobs to the high performance

SUN cluster, comprising 10 concurrent jobs of THREADER [23], 1 concurrent

job of PSIPRED [24], 1 concurrent job of SSPRO [37], and n concurrent jobs of

finding motif structural formulae, where n equals the number of existing motifs

in a queried amino acid sequence. After the executions of the concurrent jobs are

finished, a program integrates the results of each concurrent job and generates

consensus predictions.
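The submission step can be sketched in a few lines. The thesis implements it as a PERL CGI program; the Python below is only an illustration of the same logic, and the function names and job labels are hypothetical. The actual cluster submission templates are those in the appendices and are not reproduced here.

```python
def parse_fasta(text):
    """Split a multi-sequence FASTA submission into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def plan_jobs(motif_count, threader_parts=10):
    """List the concurrent jobs for one queried sequence (labels are illustrative)."""
    jobs = [f"threader-{i}" for i in range(1, threader_parts + 1)]
    jobs += ["psipred", "sspro"]
    jobs += [f"motif-{i}" for i in range(1, motif_count + 1)]
    return jobs


submission = ">seq1\nEDIIVVALYD\nYEAIHHEDLS\n>seq2\nKLEKKLGAGQ\n"
records = parse_fasta(submission)
jobs = plan_jobs(motif_count=2)
print(len(records), len(jobs))  # 2 14
```

With two motifs found in a sequence, the plan contains 10 + 1 + 1 + 2 = 14 jobs, which is consistent with the statement that at least 12 processors are used per queried sequence.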

4.3.2 THREADER

The default fold library of THREADER contains 6251 protein folds. A queried

sequence is “threaded” through each of the protein folds and free energy is com-

puted during this process. Figure 4.6 illustrates the concurrent implementation

of THREADER. CISPred divides the library into 10 sub-libraries, each of which

contains approximately 625 protein folds. A queried sequence is threaded through

the folds in the 10 sub-libraries simultaneously on 10 processors.
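Dividing the fold library can be sketched as an even split of a list. The function name `split_library` and the placeholder fold names are hypothetical; how CISPred actually partitions the THREADER library files is not specified beyond the sub-library sizes.

```python
def split_library(folds, parts=10):
    """Split a list of folds into `parts` contiguous sub-libraries of near-equal size."""
    base, extra = divmod(len(folds), parts)
    sublibraries, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first parts
        sublibraries.append(folds[start:start + size])
        start += size
    return sublibraries


folds = [f"fold{i:04d}" for i in range(6251)]  # placeholder fold identifiers
sizes = [len(sub) for sub in split_library(folds)]
print(sizes)  # [626, 625, 625, 625, 625, 625, 625, 625, 625, 625]
```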

In total, a PERL CGI program that runs at the CISPred web server submits

10 concurrent jobs of THREADER according to the submission template shown in

[Diagram: a query sequence enters the concurrent job submission program, which starts THREADER tasks 1 to 10, motif structure finding tasks 1 to n, an SSPRO task, and a PSIPRED task; their results are passed to the result integration program.]

Figure 4.5: Overview of the concurrent implementation of CISPred.

[Diagram: the concurrent job submission program passes a queried sequence to THREADER 1 through THREADER 10, which use fold libraries 1 through 10; results 1 through 10 are passed to the result integration program.]

Figure 4.6: Concurrent implementation of THREADER.

Appendix A.1. After the executions of these concurrent jobs are finished, 10 align-

ment reports and 10 score reports are generated by THREADER. An integration

program that runs on the high performance SUN cluster sorts each of the score

reports. In each of the score reports, 20 folds with the highest filtered combined

energy Z-scores [23] are selected. The integration program gathers 20 folds from

each of the sub-libraries, sorts the 20 × 10 folds based on the filtered combined

energy Z-scores [23], and selects the top 20 folds with the highest filtered combined

energy Z-scores [23] as shown in Figure 4.7. These 20 folds are the most appro-

priate folds in the 6251 protein folds of the THREADER default library. From

these 20 folds, CISPred eliminates the protein folds whose filtered combined energy

Z-scores are lower than 3.5. The remaining folds are then clustered, and only the

folds in one cluster are integrated in CISPred.
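The selection of folds from the ten score reports can be sketched as a two-stage top-k merge followed by a threshold filter. This is an illustration only: `top_folds` is a hypothetical name, each report is assumed to be a list of (fold name, Z-score) pairs, and the clustering step that follows is not shown.

```python
import heapq


def top_folds(score_reports, per_report=20, overall=20, min_z=3.5):
    """Pick the best folds across all reports and drop those below `min_z`."""
    candidates = []
    for report in score_reports:
        # Best `per_report` folds of each sub-library, by Z-score.
        candidates.extend(heapq.nlargest(per_report, report, key=lambda r: r[1]))
    best = heapq.nlargest(overall, candidates, key=lambda r: r[1])
    # Folds with Z-scores lower than the threshold are eliminated.
    return [fold for fold in best if fold[1] >= min_z]


# Hypothetical miniature reports: two sub-libraries with two folds each.
reports = [[("a", 5.0), ("b", 2.0)], [("c", 4.0), ("d", 3.5)]]
selected = top_folds(reports, per_report=2, overall=4)
print(selected)  # [('a', 5.0), ('c', 4.0), ('d', 3.5)]
```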

[Diagram: score reports 1 to 10 each pass through a sort function to give the top 20 folds in each report; a final sort function yields the top 20 folds in all reports.]

Figure 4.7: The sorting of THREADER reports.

4.3.3 Finding Protein Motif Secondary Structures

A queried amino acid sequence may contain several motifs. The submission pro-

gram that runs at the CISPred web server executes PATMATMOTIFS [1] on each

of the queried sequences in order to find the existing motifs in a queried sequence.

The finding of the structural formulae of a motif is performed on one processor.

By parsing the results of PATMATMOTIFS, the submission program that runs at

the CISPred web server retrieves the number of motifs found in a queried sequence

and submits the corresponding number of concurrent jobs to the high performance

SUN cluster, each of which performs an independent process of finding structural

formulae for one motif. Appendix A.4 illustrates the template for submitting

concurrent jobs of finding motif structural formulae. Figure 4.8 illustrates the

concurrent implementation of finding motif structural formulae.

After the concurrent jobs of finding motif structural formulae are finished, the

structural formulae are integrated in a program that runs on the high performance

SUN cluster.

4.3.4 SSPRO and PSIPRED

PSIPRED [24] and SSPRO [37] are independent existing protein structure pre-

diction tools, and their execution time is within two minutes, much shorter than

the execution time of THREADER, which may be up to several hours based on

the length of a queried sequence. PSIPRED and SSPRO are not further divided

and each of them is performed by a processor. The submission PERL CGI pro-

gram that runs at the CISPred web server submits jobs for each of SSPRO and

PSIPRED. Appendices A.2 and A.3 illustrate the templates for submitting jobs of

SSPRO and PSIPRED. After the executions of the jobs are finished, the integra-

tion program that runs on the high performance SUN cluster directly integrates

the results of SSPRO and PSIPRED.

[Diagram: PATMATMOTIFS is run on the query amino acid sequence and the names of the motifs found are passed to the concurrent tasks submission program. For each motif 1 to n, a separate processor uses the PROSITE database to obtain the PDB IDs of all proteins that contain the motif, the PDBFINDER database to obtain the whole amino acid sequences and secondary structures of those proteins, PATMATMOTIFS to obtain the segments of the secondary structure sequences of the motif, and the structure formula generation program to produce the structure formula of the motif, which is passed to the consensus prediction program of CISPred.]

Figure 4.8: Concurrent finding of motif structures.

4.4 Execution Time

The executions of integrated tools and the finding of motif structural formulae of

CISPred are concurrently implemented on a high performance 160-processor SUN

cluster, which greatly reduces its execution time.

The PERL CGI program that runs at the CISPred web server submits the concurrent jobs to the high performance SUN cluster one after another, and the jobs submitted first are executed first. The submission order of the concurrent jobs is therefore the same as their execution order; this order is called the “job schedule.” Sometimes the number of available processors in the cluster is less than the number of concurrent jobs submitted, so some jobs cannot be executed immediately and have to wait. Moreover, the execution times of the concurrent jobs differ; for example, SSPRO usually finishes one queried sequence within 2 minutes, while a THREADER concurrent job usually takes 30-40 minutes. The job schedule therefore influences the total execution time of the concurrent jobs.

Figures 4.9 and 4.10 illustrate the execution time and speedup of CISPred

on protein 1AD5 shown in Figure 4.2. “Optimized Schedule” indicates that the

concurrent jobs with longer execution times are submitted ahead of the concurrent

jobs that take less time. The order of jobs submitted is as follows:

1. Job(s) for finding motif structural formulae¹;

2. 10 THREADER concurrent jobs;

3. The SSPRO job;

4. The PSIPRED job.

¹ If a queried amino acid sequence contains more than one motif, the order of jobs submitted for finding motif structural formulae of these motifs is the same as the order in which these motifs were found in the queried amino acid sequence.

“Non-optimized Schedule” indicates that the concurrent jobs are submitted in

a random order. As shown in Figures 4.9 and 4.10, submitting the concurrent jobs

that have longer execution times ahead of the concurrent jobs that have shorter

execution times reduces the total execution time and improves the speedup of

CISPred.
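The effect of the job schedule can be illustrated with a small simulation, sketched here under simplifying assumptions: each job starts, in submission order, on the processor that becomes free first, and the durations are illustrative values rather than measurements from CISPred. The function names `makespan` and `speedup` are introduced only for this sketch.

```python
import heapq


def makespan(durations, processors):
    """Total time when jobs start in submission order on the first free processor."""
    free_at = [0.0] * processors  # time at which each processor becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available processor
        end = start + d
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


def speedup(durations, processors):
    """Sequential time divided by the concurrent time for the same submission order."""
    return sum(durations) / makespan(durations, processors)


# Illustrative durations in minutes (assumed, not measured).
jobs = [2, 2, 2, 2, 40]
shortest_first = makespan(jobs, 2)
longest_first = makespan(sorted(jobs, reverse=True), 2)
```

With two processors, submitting the 40-minute job last delays it behind the short jobs, whereas submitting it first lets the short jobs run alongside it on the other processor. This is the same reasoning behind submitting the longer jobs ahead of the SSPRO and PSIPRED jobs.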

CISPred now generates the structure formulae for the motifs found in a queried sequence concurrently; in its initial implementation, it generated them sequentially. In Figures 4.9 and 4.10, the line with the legend “Optimized Schedule” indicates the execution time and speedup of CISPred when the job schedule is optimized but the finding of structural formulae of the motifs in a queried sequence is not concurrently implemented. The line with the legend “Non-optimized Schedule” indicates the execution time and speedup of CISPred when the job schedule is not optimized and the finding of structural formulae is not concurrently implemented. The line with the legend “Optimized schedule and concurrent finding of motif structures” indicates the execution time and speedup of CISPred when the job schedule is optimized and the finding of structural formulae is concurrently implemented.

As shown in Figure 4.9, when the CISPred job schedule is optimized and the finding of structural formulae of the motifs in a queried sequence is concurrently implemented, the execution time of CISPred decreases sharply as the number of processors increases from 1 to 6. As Figure 4.10 shows, under the same configuration the execution time of CISPred decreases almost eightfold, that is, the speedup is almost 8, when the number of processors increases from 1 to 11.

Figure 4.9: The execution time of CISPred. (Legend: optimized schedule and concurrent finding of motif structure formulas; optimized schedule; non-optimized schedule.)

Figure 4.10: The speedup of CISPred. (Legend: optimized schedule and concurrent finding of motif structure formulas; optimized schedule; non-optimized schedule.)

Chapter 5

Experimental Results

5.1 Overview

The purpose of the experiments is to test the prediction accuracy of CISPred,

to determine a default threshold to be used in the clustering of protein folds

generated from THREADER alignments, and to compare CISPred with other

existing protein structure prediction tools.

Two test datasets are used in the experiments. One of the test datasets consists of 109 “Critical Assessment of Techniques for Protein Structure Prediction Experiment” (CASP) [29] target amino acid sequences. CASP [29] is an organization which evaluates protein structure prediction methods. Participating methods submit blind predictions before the structures of the target sequences are determined by experimental methods. The CASP target sequences have a variety of lengths and are newly discovered proteins; therefore, they have been widely used by current prediction tools as a standard test dataset. The 109 CASP sequences used in the experiments conducted are randomly selected from CASP3 (1998), CASP4 (2000), CASP5 (2002), and CASP6 (2004).

The experiments are also performed on a second dataset containing 1758 amino acid sequences selected from the PDBFINDER database [20] by the following procedure. The PDBFINDER database contains information, such as the amino acid sequences and the secondary structures, about currently known proteins. It is in a plain text format, and its entries are listed in alphabetical order of the protein PDB IDs. We select 5000 proteins, starting from the first protein listed in the PDBFINDER database (PDB ID 100D) and continuing in alphabetical order to the protein with PDB ID 4HTC. Out of these 5000 protein sequences, the following sequences are deleted:

1. The sequences included in the training dataset which contains 80 sequences

presented in Chapter 3.

2. The sequences that contain illegal characters.

3. Additional protein chains with the same PDB ID. Some proteins contain several chains. For example, “1ZD6:A” and “1ZD6:B” are the two chains of protein 1ZD6. They have very high sequence similarity, as shown in Figure 5.1, and are listed in PDBFINDER as two separate entries. In this situation, only the first chain of the protein is included; all other chains of the same protein are eliminated.

The test dataset (which contains 1758 sequences) is much larger than the CASP

dataset (which has 109 sequences) and contains selected amino acid sequences of

regular known proteins.

As is the case with most of the other existing protein structure prediction

tools, CISPred only predicts three structural states: helix (“H”), strand (“E”),

and coil (“C”). The eight secondary structure types defined by DSSP are reduced

into these three states based on a 3-state scheme [40]: “G” and “H” are taken to


be helix (“H”), “E” and “B” are both taken to be strand (“E”), and all of the

other structure types are considered to be coil (“C”).
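The 3-state reduction described above can be written as a small mapping. The following is a minimal sketch; the function name `reduce_to_three_states` and the example string are illustrative, while the mapping itself is the one stated in the text.

```python
# Mapping stated in the text: G and H -> H; E and B -> E; everything else -> C.
HELIX = {"G", "H"}
STRAND = {"E", "B"}


def reduce_to_three_states(dssp):
    """Reduce a DSSP 8-state string to the 3-state alphabet H/E/C."""
    out = []
    for state in dssp:
        if state in HELIX:
            out.append("H")
        elif state in STRAND:
            out.append("E")
        else:
            out.append("C")  # all remaining DSSP types are treated as coil
    return "".join(out)


reduced = reduce_to_three_states("HGIEBTS-")
```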

3-state accuracy is used to evaluate prediction accuracy in the experiments

conducted. 3-state accuracy is defined as the percentage of the amino acid residues

that are correctly predicted, as shown in Equation 3.6.
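As a minimal sketch of this measure, assuming the predicted and observed structures are given as equal-length strings over H, E, and C, the 3-state accuracy is the fraction of positions at which the two strings agree. Equation 3.6 itself is not reproduced here; the function name `q3` and the example strings are illustrative.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed structures must have equal length")
    if not observed:
        raise ValueError("empty structure")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Five of the six residues agree in this illustrative pair.
score = q3("HHHECC", "HHHCCC")
```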

5.2 CISPred Testing Results on CASP Sequences

As presented in Chapter 3, the secondary structure sequences of the protein folds

generated from THREADER alignments are clustered, and only the folds in one

cluster are integrated by CISPred. The clustering process stops when the distance

between the nearest two clusters reaches a threshold. In order to test the predic-

tion accuracy of CISPred on each of the clustering thresholds from 1% to 100%,

CISPred is executed 100 times on the two datasets, each time with a different

clustering threshold. Figure 5.2 illustrates the average Q3 scores of the 109 CASP

sequences on each threshold from 1% to 100%, with 1% as the interval.
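The role of the threshold can be illustrated with a small agglomerative clustering sketch. The details of the clustering are presented in Chapter 3 and are not reproduced here, so two choices below are assumptions made only for illustration: the distance between two folds is taken to be the fraction of positions at which their equal-length secondary structure strings differ, and the distance between two clusters is taken to be the smallest such distance (single linkage). The function names and example folds are hypothetical.

```python
def mismatch_fraction(a, b):
    """Fraction of aligned positions that differ (assumed distance measure)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2):
    """Single-linkage distance between two clusters (assumed linkage)."""
    return min(mismatch_fraction(a, b) for a in c1 for b in c2)


def cluster_folds(folds, threshold):
    """Merge the two closest clusters until their distance exceeds the threshold."""
    clusters = [[f] for f in folds]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # the nearest clusters are too far apart: stop clustering
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Illustrative secondary structure strings of equal length.
folds = ["HHHHCCEE", "HHHHCCEC", "EEEECCHH"]
tight = cluster_folds(folds, 0.42)
loose = cluster_folds(folds, 1.0)
```

Under these assumptions, a threshold of 1 merges every fold into a single cluster, which corresponds to integrating all folds without clustering, while a lower threshold keeps dissimilar folds apart.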

As shown in Figure 5.2, the average 3-state accuracy of CISPred predictions stays above 0.826 when the threshold increases from 0 to almost 0.20. It declines to 0.823 when the threshold is 0.25. It reaches its peak, 0.828, when the threshold is 0.40, and declines from its peak to its lowest point, 0.815, when the threshold is raised to 0.66. It then increases slightly, from the lowest point to 0.819, as the threshold increases from 0.66 to 1.00. Overall, the average 3-state accuracy declines substantially when the threshold is raised from 0.40 to 0.66. Moreover, it stays relatively high when the threshold is between 0 and 0.40, and relatively low when the threshold is between 0.66 and 1. Because a threshold of 1 causes all of the protein folds to be integrated, which is equivalent to integrating them without clustering, this illustrates that clustering the protein folds generated from THREADER alignments improves the prediction accuracy of CISPred.

>1ZD6:A|PDBID|CHAIN|SEQUENCE
MGPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLT
TEEEFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTT
AVVTNPKE

>1ZD6:B|PDBID|CHAIN|SEQUENCE
MGPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLT
TEEEFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTT
AVVTNPKE

Figure 5.1: Two chains of protein 1ZD6.

Figure 5.3 illustrates the standard deviation of the 3-state accuracies of the CISPred predictions with different thresholds on the 109 CASP sequences. Standard deviation measures how widely the values in a distribution are spread, and is computed according to Equation 5.1.

Figure 5.2: Average 3-state accuracy of CISPred on the 109 CASP sequences. (Horizontal axis: threshold, 0 to 1; vertical axis: average 3-state accuracy (Q3 score), 0.814 to 0.83.)

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2} \tag{5.1}$$

$N$ stands for the number of samples taken, $x_i$ is the value of each sample, and $\bar{x}$ is the average of the sample values.

Figure 5.3: Standard deviation of CISPred on the 109 CASP sequences. (Horizontal axis: threshold, 0 to 1; vertical axis: standard deviation, 0.094 to 0.103.)

In statistics, the “coefficient of variation” [44] also measures the dispersion of a probability distribution. The coefficient of variation is defined using Equation 5.2, in which $\mu$ stands for the arithmetic mean or average, and $\sigma$ stands for standard deviation:

$$C_v = \frac{\sigma}{\mu} \tag{5.2}$$
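Both dispersion measures can be computed directly from a list of 3-state accuracies. The following is a minimal sketch that follows Equations 5.1 and 5.2 (population standard deviation divided by the mean); the function names and the example scores are illustrative and are not values from the experiments.

```python
import math


def population_std(values):
    """Standard deviation with a 1/N factor, as in Equation 5.1."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


def coefficient_of_variation(values):
    """Standard deviation divided by the arithmetic mean, as in Equation 5.2."""
    mean = sum(values) / len(values)
    return population_std(values) / mean


# Illustrative Q3 scores, not experimental data.
scores = [0.70, 0.80, 0.90]
sigma = population_std(scores)
cv = coefficient_of_variation(scores)
```

Because the coefficient of variation is normalized by the mean, it allows the dispersion of two sets of predictions with different average accuracies to be compared.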

Figure 5.4 illustrates the “coefficient of variation” of the 3-state accuracies of


the CISPred predictions with different thresholds on the 109 CASP sequences.

Both Figures 5.3 and 5.4 show that the 3-state accuracies of the 109 CASP sequences have the highest dispersion when the threshold is 0.60, and the lowest dispersion when the threshold is 0.72. Overall, the standard deviation is roughly 0.1 while the threshold is increased from 0 to 0.62, and declines to 0.095 as the threshold is increased from 0.60 to 1. The coefficient of variation is roughly 12.1% while the threshold is increased from 0 to 0.53 and, after a rise, declines to 11.6% as the threshold is increased from 0.6 to 1. Generally speaking, both the standard deviation and the coefficient of variation are lower at higher thresholds. The standard deviation ranges from 0.095 to 0.102, and the coefficient of variation ranges from 11.6% to 12.4%; neither changes greatly, and both remain at a low level.

Figure 5.4: Coefficient of variation of CISPred on the 109 CASP sequences. (Horizontal axis: threshold, 0 to 1; vertical axis: coefficient of variation, 0.115 to 0.125.)

Figure 5.5 illustrates the number of sequences that have 3-state accuracies in

some specific ranges. It shows that CISPred always predicts at least 20 sequences

out of 109 CASP sequences with 3-state accuracy higher than 90%, which is about

18.3% of the 109 CASP sequences; CISPred always predicts at least 35 sequences

out of the 109 CASP sequences with 3-state accuracy higher than 85%, which is

about 32% of the 109 CASP sequences; and CISPred always predicts at least 60

sequences out of the 109 CASP sequences with 3-state accuracy higher than 80%,

which is about 55% of the 109 CASP sequences. The number of sequences with

higher 3-state accuracies, such as those above 90%, 85%, 80%, and 75%, declines

when the threshold is increased from 0.48 to 0.64, while the number of sequences

with lower 3-state accuracies, such as 65%, 60%, and 55%, increases when the

threshold is increased from 0.48 to 0.64.
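The counts plotted in Figure 5.5 are cumulative: a sequence with 3-state accuracy above 90% is also counted as above 85% and above 80%. A minimal sketch of this tally for a single threshold, using illustrative scores rather than the experimental data, is given below; the function name `count_above` is introduced only for this example.

```python
def count_above(scores, cutoffs):
    """Map each cutoff to the number of scores strictly greater than it."""
    return {c: sum(s > c for s in scores) for c in cutoffs}


# Illustrative Q3 scores, not experimental data.
scores = [0.95, 0.91, 0.88, 0.82, 0.79, 0.60]
counts = count_above(scores, [0.90, 0.85, 0.80])

# Share of the dataset above each cutoff.
share = {c: n / len(scores) for c, n in counts.items()}
```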

Figure 5.5: Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 109 sequences dataset. (Horizontal axis: threshold, 0 to 1; vertical axis: number of sequences, 0 to 110; one curve for each cutoff from “Above 90%” down to “Above 50%” in steps of 5%.)

Figures 5.6, 5.7, and 5.8 illustrate the distributions of the CISPred predictions

on the 109 CASP sequences with 1%, 3%, and 5% 3-state accuracy as intervals,

respectively. In these three gray-scale graphs, the more sequences located in an

area, the darker this area is. In Figure 5.6, the area with 3-state accuracy of 80%

and a threshold from 0% to 58% is the darkest area, which indicates that this area

contains the densest distribution of sequences. Figure 5.7 shows that besides the

area with 80% 3-state accuracy and a threshold from 0% to 58%, the area with

90% 3-state accuracy and a threshold from 25% to 45% also contains a relatively

high density of sequences. Figure 5.8 shows that the areas with 3-state accuracy from 75% to 85% and a threshold from 40% to 44%, 48% to 58%, and 60% to 100% contain a higher density of sequences. Overall, the sequences with higher than 90% 3-state accuracy are mostly generated when the threshold is from 25% to 50%, the sequences with 80% to 85% 3-state accuracy are mostly generated when the threshold is from 70% to 100%, and the sequences with 78% to 80% 3-state accuracy are mostly generated when the threshold is from 0% to 60%.

5.3 CISPred Testing Results on 1758 Sequences

The experiments conducted on the 109 CASP sequences are also performed on a dataset containing 1758 amino acid sequences. The 1758 sequences are selected from the PDBFINDER database [20], which contains information about known proteins, such as amino acid sequences and secondary structure sequences.

Figure 5.9 illustrates the average 3-state accuracies of CISPred on the 1758 sequences. It shows that the average 3-state accuracy rises from 0.8887 to 0.8929 when the threshold is increased from 0 to 0.45; it declines to 0.8913 when the threshold is increased to 0.62, and then rises and reaches its peak at 0.8938. Overall, the average 3-state accuracy rises when the threshold is increased from 0 to 1, which is the opposite of the trend on the 109 CASP sequences shown in Figure 5.2, in which the average 3-state accuracy declines when the threshold is increased from 0 to 1. When the threshold that stops the clustering of protein folds generated from THREADER alignments is equal to 1, the protein folds in all of the clusters are integrated by CISPred. The increasing trend of the 3-state accuracy shown in Figure 5.9 indicates that the folds generated from THREADER outweigh the prediction results from the other integrated tools. Furthermore, as more folds generated from THREADER are integrated, the 3-state accuracy of CISPred predictions increases. The peak of the 3-state accuracy on the 1758 sequences shown in Figure 5.9 is 0.8938, which is higher than the peak on the 109 CASP sequences, 0.8278, shown in Figure 5.2. This indicates that CISPred has a higher average prediction accuracy on the 1758 sequences than on the 109 CASP sequences.

Figure 5.6: Distribution of the 109 CASP sequences predicted by CISPred with 1% 3-state accuracy as interval. (Horizontal axis: threshold, in percentage; vertical axis: 3-state accuracy, in percentage.)

Figures 5.7 and 5.8. (Vertical axis: 3-state accuracy, in percentage.)

Figure 5.9: Average 3-state accuracy of CISPred predictions on 1758 sequences. (Horizontal axis: threshold, 0 to 1; vertical axis: average 3-state accuracy (Q3 score), 0.888 to 0.895.)

Figure 5.10 illustrates the standard deviation of CISPred predictions on the 1758 sequences. It shows that the standard deviation increases from its lowest point, 0.0944, to its peak, 0.0967, when the threshold is increased from 0 to 0.37. The standard deviation then declines to 0.0952 and hovers around that value as the threshold is increased from 0.37 to 1. The range of the standard deviation of CISPred predictions on the 1758 sequences is 0.0023 (the highest point 0.0967 minus the lowest point 0.0944), which is very small and indicates that the dispersion of CISPred predictions on the 1758 sequences is barely influenced by the clustering threshold. Compared with the standard deviation of CISPred predictions on the 109 CASP sequences shown in Figure 5.3, the predictions on the 1758 sequences have slightly less dispersion than the predictions on the 109 CASP sequences.

Figure 5.10: Standard deviation of CISPred predictions on the 1758 sequences. (Horizontal axis: threshold, 0 to 1; vertical axis: standard deviation, 0.094 to 0.097.)

Figure 5.11 illustrates the coefficient of variation of the CISPred predictions on the 1758 sequences. The lowest point of the coefficient of variation is 10.62%, when the threshold equals 0, and the highest point is 10.84%, when the threshold equals 0.37. When the threshold is increased from 0.37 to 1, the coefficient of variation declines and then hovers around 10.69%. When the threshold is above 0.8, the coefficient of variation stays at 10.69%. Generally speaking, the coefficient of variation of CISPred predictions on the 1758 sequences is roughly 1 to 2 percentage points lower than the coefficient of variation of CISPred predictions on the 109 CASP sequences shown in Figure 5.4, which indicates that CISPred predictions on the 1758 sequences have less dispersion than on the 109 CASP sequences.

Figure 5.11: Coefficient of variation of CISPred predictions on the 1758 sequences. (Horizontal axis: threshold, 0 to 1; vertical axis: coefficient of variation, 0.106 to 0.1085.)

Figure 5.12 illustrates the number of sequences which have 3-state accuracies in some specific ranges. It shows that, at every threshold, CISPred predicts at least 600 sequences with 3-state accuracy higher than 95%, which is 34% of the 1758 sequences; at least 1000 sequences with 3-state accuracy higher than 90%, which is 56.9% of the 1758 sequences; at least 1200 sequences with 3-state accuracy higher than 85%, which is 68.3% of the 1758 sequences; and at least 1400 sequences with 3-state accuracy higher than 80%, which is 79.6% of the 1758 sequences. Compared to Figure 5.5, CISPred has better performance on the 1758 sequences dataset regarding the number of sequences with high 3-state accuracy.

Figure 5.12: Number of sequences CISPred predicts with 3-state accuracy in several specific ranges on the 1758 sequences dataset. (Horizontal axis: threshold, 0 to 1; vertical axis: number of sequences, 0 to 1600; one curve for each cutoff from “Above 95%” down to “Above 45%” in steps of 5%.)

5.4 Selection of CISPred Default Threshold

As presented in Chapter 3, the protein folds generated from THREADER align-

ments are clustered, and only the folds in one cluster are integrated in CISPred.

The process of clustering stops when the distance between the two closest clusters

is larger than a threshold. The experiments conducted on two datasets indicate


that the average 3-state accuracy and the dispersions of CISPred predictions are

influenced by the threshold in clustering. A default threshold is determined based

on the experiments conducted, and is used as the clustering threshold on any

queried sequences submitted by CISPred users.

The default threshold is determined by considering the average 3-state accuracy, standard deviation, and coefficient of variation of CISPred predictions on the two datasets. At the default threshold, a compromise is made which balances the average 3-state accuracy and the dispersion: CISPred should have a relatively high average 3-state accuracy and a relatively low standard deviation and coefficient of variation. Based on these requirements, 0.42 is selected as the default threshold in CISPred, because at this threshold the average 3-state accuracy of CISPred predictions on the 109 CASP sequences equals 0.826, which is very close to the peak, 0.828, as shown in Figure 5.2, and the average 3-state accuracy of CISPred predictions on the 1758 sequences equals 0.8926, which is very close to its peak, 0.8930, as shown in Figure 5.9. Moreover, CISPred predictions have less dispersion when the threshold equals 0.42: the standard deviation of CISPred predictions on the 109 CASP sequences equals 0.0996, as shown in Figure 5.3; the coefficient of variation of CISPred predictions on the 109 CASP sequences equals 0.1206, as shown in Figure 5.4; the standard deviation of CISPred predictions on the 1758 sequences equals 0.0955, as shown in Figure 5.10; and the coefficient of variation of CISPred predictions on the 1758 sequences equals 0.1069, as shown in Figure 5.11.

Figure 5.5 shows that CISPred predicts more sequences with 3-state accuracy

higher than 90% and higher than 85% when the threshold equals 0.42 on the 109

CASP sequences. Figure 5.12 illustrates that CISPred predicts more sequences

with 3-state accuracy higher than 90%, higher than 85%, and higher than 80%,

when the threshold equals 0.42 on the 1758 sequences.


For all the above reasons, 0.42 is determined as the default threshold to be used

in the hierarchical clustering of the folds generated by THREADER alignments.

5.5 Comparison of CISPred and Integrated Tools

5.5.1 Overview

PSIPRED and SSPRO are two existing protein secondary structure prediction

tools that are integrated by CISPred. The manuals of PSIPRED and SSPRO

provide their average 3-state accuracies based on the experiments conducted by

their authors. The datasets used to test the average 3-state accuracies of PSIPRED

and SSPRO may be different than the two datasets used to test CISPred, which

may mean that the experimental results of PSIPRED and SSPRO cannot be

compared to the experimental results of CISPred. PSIPRED and SSPRO are

independently tested using the same test datasets used to test CISPred. The

experimental results of PSIPRED and SSPRO are compared to the experimental

results of CISPred with a default threshold of 0.42.

5.5.2 Comparison on CASP Sequences

Figure 5.13 illustrates the 3-state accuracies of PSIPRED predictions on the 109 CASP sequences. Figure 5.14 is a bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on the 109 CASP sequences. The average 3-state accuracy, standard deviation, and coefficient of variation of PSIPRED predictions on the 109 CASP sequences are 0.778, 0.084, and 15.6%, respectively. Figure 5.13 shows that PSIPRED predicts proteins number 21 and 9 with 3-state accuracies of 0.43 and 0.46, respectively, which are the two lowest 3-state accuracies of the 109 predictions. Figure 5.14 shows that PSIPRED does not predict any sequences with 3-state accuracy between 0.95 and 1; predicts 33 sequences with 3-state accuracy between 0.80 and 0.85, which is 30.3% of the 109 CASP sequences; and predicts 28 sequences with 3-state accuracy between 0.75 and 0.80, which is 25.7% of the 109 CASP sequences.

Figure 5.13: 3-state accuracies of PSIPRED predictions on the 109 CASP sequences with average Q3 score 0.778, standard deviation 0.084, and coefficient of variation 15.6%. (Horizontal axis: sequence I.D., 0 to 100; vertical axis: 3-state accuracy (Q3 score), 0 to 1.)

Figure 5.14: Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on 109 CASP sequences. (Horizontal axis: 3-state accuracy ranges, from 0.95-1.00 down to 0.40-0.45 in steps of 0.05, followed by 0.05-0.40 and 0.00-0.05; vertical axis: number of sequences.)

Figure 5.15 shows the 3-state accuracies of the predictions of SSPRO on the 109 CASP sequences. The average 3-state accuracy, standard deviation, and coefficient of variation of SSPRO predictions on the 109 CASP sequences are 0.821, 0.095, and 11.6%, respectively. Figure 5.15 shows that SSPRO predicts 7 CASP sequences with 100% 3-state accuracy, and protein number 9 with 3-state accuracy 0.5, which is the lowest 3-state accuracy in the 109 predictions.

Figure 5.16 is a bar graph showing the distribution of 3-state accuracies of SSPRO predictions on the 109 CASP sequences. Figure 5.16 shows that SSPRO predicts the same number of sequences as PSIPRED with 3-state accuracy between 0.75 and 0.85. However, SSPRO predicts 15 sequences with 3-state accuracy between 0.95 and 1.00, and 8 sequences with 3-state accuracy between 0.90 and 0.95. Compared to PSIPRED, SSPRO predicts more sequences with high 3-state accuracy.

Figure 5.15: 3-state accuracies of SSPRO predictions on 109 CASP sequences with an average Q3 score 0.821, standard deviation 0.095, and coefficient of variation 11.6%. (Horizontal axis: sequence I.D., 0 to 100; vertical axis: 3-state accuracy (Q3 score), 0 to 1.)

Figure 5.16: Bar graph showing the distribution of 3-state accuracies of SSPRO predictions on the 109 CASP sequences. (Horizontal axis: 3-state accuracy ranges, from 0.95-1.00 down to 0.40-0.45 in steps of 0.05, followed by 0.05-0.40 and 0.00-0.05; vertical axis: number of sequences.)

Figure 5.17 illustrates the 3-state accuracies of CISPred predictions on the 109 CASP sequences when the threshold equals 0.42. The average 3-state accuracy, standard deviation, and coefficient of variation of CISPred predictions on the 109 CASP sequences at this threshold are 0.826, 0.0997, and 12.06%, respectively. The corresponding values for PSIPRED predictions on the 109 CASP sequences are 0.778, 0.084, and 15.6%. The average 3-state accuracy of CISPred, 0.826, is higher than that of PSIPRED, 0.778. The standard deviation of CISPred, 0.0997, is higher than that of PSIPRED, 0.084, which indicates that the predictions of CISPred are slightly more spread out than the predictions of PSIPRED. The coefficient of variation of CISPred, 12.06%, is lower than that of PSIPRED, 15.6%, which indicates that CISPred predictions have less dispersion relative to the average 3-state accuracy than PSIPRED predictions. In total, compared with PSIPRED, the predictions of CISPred have an average 3-state accuracy that is 0.048 (0.826-0.778) higher, a standard deviation that is 0.0157 (0.0997-0.0840) higher, and a coefficient of variation that is 3.54 percentage points (15.60%-12.06%) lower.

The average 3-state accuracy, standard deviation, and coefficient of variation of SSPRO predictions on the 109 CASP sequences are 0.821, 0.0950, and 11.6%, respectively. The average 3-state accuracy of CISPred, 0.826, is slightly higher than that of SSPRO, 0.821. The standard deviation of CISPred, 0.0997, is slightly higher than that of SSPRO, 0.0950, which indicates that the predictions of CISPred are slightly more spread out than the predictions of SSPRO. The coefficient of variation of CISPred, 12.06%, is slightly higher than that of SSPRO, 11.6%, which indicates that CISPred predictions have slightly more dispersion relative to the average 3-state accuracy than SSPRO predictions. The predictions of SSPRO and CISPred are therefore very close regarding average 3-state accuracy, standard deviation, and coefficient of variation: CISPred has a slightly higher average 3-state accuracy than SSPRO, but also a slightly higher dispersion.

Figure 5.18 depicts a bar graph that shows the distribution of CISPred predictions on the 109 CASP sequences when the threshold equals 0.42. It shows that CISPred predicts 13 sequences with 3-state accuracy between 0.95 and 1, which is 11.9% of the 109 CASP sequences; 20 sequences with 3-state accuracy between 0.90 and 0.95, which is 18.3% of the 109 CASP sequences; and 28 sequences with 3-state accuracy between 0.80 and 0.85, which is 25.7% of the 109 CASP sequences.

Figure 5.17: 3-state accuracies of CISPred predictions on the 109 CASP sequences when the threshold at which to stop clustering equals 0.42. (Horizontal axis: sequence I.D., 0 to 100; vertical axis: 3-state accuracy (Q3 score), 0 to 1.)

Figure 5.18: Bar graph showing the distribution of the 3-state accuracies of CISPred predictions when the threshold equals 0.42 on the 109 CASP sequences. (Horizontal axis: 3-state accuracy ranges, from 0.95-1.00 down to 0.40-0.45 in steps of 0.05, followed by 0.05-0.40 and 0.00-0.05; vertical axis: number of sequences.)

Figure 5.19 depicts a bar graph showing the distributions of the 3-state accuracies of the predictions of CISPred, PSIPRED, and SSPRO, made by combining Figures 5.14, 5.16, and 5.18. Figure 5.19 shows that CISPred predicts slightly fewer sequences than SSPRO with 3-state accuracy between 0.95 and 1.00, between 0.85 and 0.90, and between 0.80 and 0.85. CISPred predicts 11 (19-8=11) more sequences with 3-state accuracy between 0.90 and 0.95 than does SSPRO. In total, CISPred predicts 74 (14+19+12+29=74) sequences with 3-state accuracy higher than 0.80, and SSPRO predicts 66 (15+8+13+30=66), so CISPred predicts 8 (74-66=8) more sequences with 3-state accuracy higher than 0.80, which is 7.3% of the 109 CASP sequences. PSIPRED predicts 52 (0+4+15+33=52) sequences with 3-state accuracy higher than 0.80, which is 22 (74-52=22) fewer than CISPred, or 20.2% of the 109 CASP sequences. CISPred therefore performs better than PSIPRED and SSPRO regarding the number of sequences predicted with 3-state accuracy higher than 0.80.

[Bar graph omitted. Horizontal axis: 3-state accuracy ranges (0.95-1.00 down to 0.40-0.45 in steps of 0.05, then 0.05-0.40 and 0.00-0.05); vertical axis: Number of sequences.]

Figure 5.19: Bar graph showing the distribution of the 3-state accuracies of predictions of CISPred, PSIPRED, and SSPRO.

PSIPRED, SSPRO, and CISPred all predict some of the 109 CASP sequences with very low 3-state accuracy. The sequence with sequence I.D. “9” is one of them: Figure 5.17 shows that CISPred predicts sequence “9” with a 3-state accuracy of 0.5, Figure 5.15 shows that SSPRO predicts it with a 3-state accuracy of 0.5, and Figure 5.13 shows that PSIPRED predicts it with a 3-state accuracy of 0.46. Sequence “9” is a segment of protein 1WCK, which was discovered in the year 2005. Figure 5.20 shows the predictions of PSIPRED, SSPRO, and CISPred for protein 1WCK, together with its amino acid sequence and its real secondary structures. The lines starting with “PSIPRED,” “SSPRO,” and “CISPRED” are the structures predicted by PSIPRED, SSPRO, and CISPred respectively, the lines starting with “AA” give the amino acid sequence, and the lines starting with “REAL” give the real structures, which were determined by experimental methods. Figure 5.20 shows that the CISPred predictions are the same as the SSPRO predictions, which indicates that the SSPRO predictions are always selected as the consensus predictions for protein 1WCK. As presented in Chapter 3, the folds generated by THREADER are sorted by their filtered combined energy Z-scores, and usually only 20 of the folds with filtered combined energy Z-scores above 3.5 are clustered, parts of which are integrated in CISPred. None of the folds generated by THREADER has a filtered combined energy Z-score above 3.5 when the sequence of protein 1WCK is submitted to THREADER, which indicates that none of the 6251 protein folds in the THREADER library has a structure that perfectly fits protein 1WCK. CISPred therefore predicts the structure of protein 1WCK without integrating any protein folds generated by THREADER. Furthermore, Figure 5.20 shows that CISPred predicts helices (“H”) for several segments whose real structures are sheets (“E”). In total, neither PSIPRED nor SSPRO predicts the structures of protein 1WCK with high 3-state accuracy, and the THREADER library does not contain any folds that fit protein 1WCK well, so CISPred does not predict the structures of protein 1WCK with high 3-state accuracy either. The 3-state accuracy of the PSIPRED predictions on protein 1WCK is 0.46, and that of the SSPRO predictions is 0.5. CISPred selects the predictions provided by SSPRO, which have the higher 3-state accuracy, as its consensus predictions.

SSPRO:   CCCCCCCEEEECCCCEEECCCCCCCCCCCCCCCCCHHHHHHCCCCEEEEECCCCEEEEEE
PSIPRED: CCCCCCEEEEEECCCEEEEECCCCCCCHHHHHHHHHHHHHHCCCCEEEEECCCCEEEEEE
CISPRED: CCCCCCCEEEECCCCEEECCCCCCCCCCCCCCCCCHHHHHHCCCCEEEEECCCCEEEEEE
REAL:    CCCCCEEEEEEEEECCCEEECCCCECCCCEEEEEECCCEEEEECCEEEECCCEEEEEEEE
AA:      GLGLPAGLYAFNSGGISLDLGINDPVPFNTVGSQFGTAISQLDADTFVISETGFYKITVI
                 10        20        30        40        50        60

SSPRO:   EEEHHHHHHCCCEEEECCCCCCCCCHHHHHCCCCEEEEHEHECCCCCCHHHHHHHCCHHH
PSIPRED: ECCCHHHHHCCEEEEECCEECCCCCCHHHHHCHHHHHHHHHHHCCCHHHHHHHHHCCCCE
CISPRED: EEEHHHHHHCCCEEEECCCCCCCCCHHHHHCCCCEEEEHEHECCCCCCHHHHHHHCCHHH
REAL:    EEECCCCCCCEEEEEECCEECCCCCEECCCCCCEEEEEEEEEECCCCEEEEEEEECCCEE
AA:      ANTATASVLGGLTIQVNGVPVPGTGSSLISLGAPIVIQAITQITTTPSLVEVIVTGLGLS
                 70        80        90       100       110       120

SSPRO:   HHHCCCHHHEEEECCC
PSIPRED: EEECCCHHHHHHHHCC
CISPRED: HHHCCCHHHEEEECCC
REAL:    ECCEEEEEEEEEEEEC
AA:      LALGTSASIIIEKVAH
                130

Figure 5.20: Prediction results of CISPred and integrated tools on protein 1WCK.

As presented above, none of the folds generated by THREADER has a filtered combined energy Z-score above 3.5 when the sequence of protein 1WCK is submitted to THREADER. THREADER does, however, provide several folds with filtered combined energy Z-scores lower than 3.5. An experiment was conducted in which CISPred provides a consensus prediction on protein 1WCK by integrating the THREADER folds with a filtered combined energy Z-score above 2.0. In this experiment the 3-state accuracy of the CISPred prediction on protein 1WCK becomes 0.37, which is lower than 0.5, the 3-state accuracy of the CISPred prediction on protein 1WCK obtained when only THREADER folds with a filtered combined energy Z-score above 3.5 are integrated. This experiment indicates that integrating folds with a low filtered combined energy Z-score does not improve the 3-state accuracy of the CISPred consensus prediction on protein 1WCK.
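The Z-score cutoff discussed above can be sketched as a small filter: keep only the folds whose filtered combined energy Z-score exceeds a cutoff, sort them from highest to lowest score, and pass on at most 20 of them. The `Fold` record, its field names, and the example scores below are hypothetical; THREADER's real output format and CISPred's actual selection code are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Fold:
    name: str       # fold identifier (hypothetical field)
    z_score: float  # filtered combined energy Z-score


def select_folds(folds, cutoff=3.5, limit=20):
    """Return up to `limit` folds with z_score above `cutoff`, best first."""
    kept = [f for f in folds if f.z_score > cutoff]
    kept.sort(key=lambda f: f.z_score, reverse=True)
    return kept[:limit]


# Made-up scores: none exceeds 3.5, as the text reports for protein 1WCK.
folds = [Fold("foldB", 2.1), Fold("foldA", 2.9), Fold("foldC", 1.4)]
print(len(select_folds(folds)))              # default cutoff of 3.5
print(len(select_folds(folds, cutoff=2.0)))  # relaxed cutoff from the experiment
```

With the default cutoff nothing is selected, so a consensus built on these folds would rely on the other tools alone; lowering the cutoff admits weaker folds, which is the situation the experiment above examines.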

Figure 5.15 illustrates that SSPRO predicts 7 of the 109 CASP sequences

with 100% 3-state accuracy. Figure 5.21 illustrates the amino acid sequences and

secondary structure sequences of the 7 sequences. These 7 secondary structure

sequences do not have distinct structural patterns or structural features, and the

occurrence of each structure type is random. No particular structure patterns or

structure type compositions are found to be predicted with high 3-state accuracy

by SSPRO.

5.5.3 Comparison on 1758 Sequences

Figure 5.22 illustrates the 3-state accuracy of PSIPRED on 1758 sequences; Fig-

ure 5.23 is a bar graph showing the distribution of the 3-state accuracies of

PSIPRED predictions on the 1758 sequences. The average 3-state accuracy, stan-

dard deviation, and the coefficient of variation of PSIPRED predictions on 1758

sequences are 0.789, 0.089, and 11.2%. Figures 5.22 and 5.23 illustrate that

PSIPRED predicts 21 sequences with 3-state accuracy between 0.95 and 1.00, and

predicts 528 sequences with 3-state accuracy between 0.80 and 0.85, which is 30%

of the 1758 sequences. Figure 5.22 shows that PSIPRED predicts two proteins with a 3-state accuracy of 0; Figure 5.24 shows the amino acid sequences of these two proteins. Why PSIPRED fails on these two proteins is uncertain; it is probably due to a problem in the PSIPRED program, and the two sequences do show considerable similarity to each other.

Figure 5.25 illustrates the 3-state accuracy of SSPRO predictions on 1758

sequences, and Figure 5.26 shows the distribution of the 3-state accuracies of

>1O8V MEAFLGTWKMEKSEGFDKIMERLGVDFVTRKMGNLVKPNLIVTDLGGGKYKMRSESTFKTTEXSFKLGEKFKEVT PDSREVASLITVENGVMKHEQDDKTKVTYIERVVEGNELKATVKVDEVVCVRTYSKVA

CHHHCEEEEEEEEECHHHHHHHHCCCHHHHHHHHHCCCEEEEEEEECCEEEEEEECCCCEEEEEEECCCCEEEEC CCCCEEEEEEEEECCEEEEEEECCCCEEEEEEEEECCEEEEEEEECCEEEEEEEEECC

>1M20 AVKYYTLEEIQKHNNSKSTWLILHYKVYDLTKYLEEHPGGEEVLREQAGGDATENFEDVGHSTDARELSKTFIIG ELHPDDR

CCCEECHHHHCCCECCCCEEEEECCEEEECCCCCCCCCCCCHHHHHHCCCECHHHHHHCCCCHHHHHHHHHHEEE EECHHHC

>1NIJ NPIAVTLLTGFLGAGKTTLLRHILNEQHGYKIAVIENEFGEVSVDDQLIGDRATQIKTLTNGCICCSRSNELEDA LLDLLDNLDKGNIQFDRLVIECTGMADPGPIIQTFFSHEVLCQRYLLDGVIALVDAVHADEQMNQFTIAQSQVGY ADRILLTKTDVAGEAEKLHERLARINARAPVYTVTHGDIDLGLLFNTNGFMLEENVVSTKPRFHFIADKQNDISS IVVELDYPVDISEVSRVMENLLLESADKLLRYKGMLWIDGEPNRLLFQGVQRLYSADWDRPWGDEKPHSTMVFIG IQLPEEEIRAAFAGLRK

CCEEEEEEEECCCCCCHHHHHHHHHCCCCCCEEEECCCCCCCCEEEEEECCCCCEEEEECCCCEEECCCCCHHHH HHHHHHHHHHCCCCCCEEEEEEECCCCHHHHHHHHHHCHHHHHHEEEEEEEEEEECCCHHHHHHHCHHHHHHHHC CCEEEEECCCCCCCCHHHHHHHHHHCCCCCEEECCCCCCCHHHHCCCCCCCCCCCCCCCCCCCCCCHHHHCCEEE EEEEECCCECHHHHHHHHHHHHHHCCCCEEEEEEEECECCCCEEEEEEEECCEEEEEEEEECCCCCCCEEEEEEE ECCCHHHHHHHHHCCCC

>1H7H SVDFAFELRKAQDTGKIVMGARKSIQYAKMGGAKLIIVARNARPDIKEDIEYYARLSGIPVYEFEGTSVELGTLL GRPHTVSALAVVDPGESRILAL

CCCHHHHHHHHHHHCEEEECHHHHHHHHHCCCCCEEEEECCCCHHHHHHHHHHHHHHCCCEEEECCCHHHHHHHC CCCCCCCEEEEEECCCCCHHHC

>1MQ7 STTLAIVRLDPGLPLPSRAHDGDAGVDLYSAEDVELAPGRRALVRTGVAVAVPFGMVGLVHPRSGLATRVGLSIV NSPGTIDAGYRGEIKVALINLDPAAPIVVHRGDRIAQLLVQRVELVELVEVSSFDEAGL

CCCCEEEECCCCCCCCCCCCCCCCEEEEECCCCEEECCCCEEEEEEEEEEECCCCEEEEEECCCCHHHHCCEEEC CCCEEECCCCCCEEEEEEEECCCCCCEEECCCCEEEEEEEEECCCCCCCCCCCCCCCCC

>1NXJ AISFRPTADLVDDIGPDVRSCDLQFRQFGGRSQFAGPISTVRCFQDNALLKSVLSQPSAGGVLVIDGAGSLHTAL VGDVIAELARSTGWTGLIVHGAVRDAAALRGIDIGIKALGTNPRKSTKTGAGERDVEITLGGVTFVPGDIAYSDD DGIIVV

CCCCCCHHHHHHHHCCCCEECCCCCEECCCECCEEEEEEEEECCCECHHHHHHHHCCCCCCEEEEECCCCCCCEE ECHHHHHHHHHHCCCEEEEEEEECCHHHHCCCCCEEEEEEECCCECECCCCCEECCCEEECCEEECCCCEEEECC CCEEEC

>1IZN RVSDEEKVRIAAKFITHAPPGEFNEVFNDVRLLLNNDNLLREGAAHAFAQYNMDQFTPVKIEGYDDQVLITEHGD LGNGRFLDPRNKISFKFDHLRKEASDPQPEDTESALKQWRDACDSALRAYVKDHYPNGFCTVYGKSIDGQQTIIA CIESHQFQPKNFWNGRWRSEWKFTITPPTAQVAAVLKIQVHYYEDGNVQLVSHKDIQDSVQVSSDVQTAKEFIKI IENAENEYQTAISENYQTMSDTTFKALRRQLPVTRTKIDWNKILSYKIGK

CCCHHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHCCHHHHHHHHHHHHHHHHHHCCEEECCCCCCCCEEECCCCE CCCCEEECCCCCEEEECCCCCCCCCCCEECCCCCCCHHHHHHHHHHHHHHHHHHCCCEEEEEEEEEECCEEEEEE EEEEEEEEHHHCEEEEEEEEEEEECCCCEEEEEEEEEEEEEECCCEEEEEEEEEEEEEEEECCCCHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHCHHHHHCCCCCCCCCCCCHHHHCCCCCCC

Figure 5.21: Amino acid sequences and secondary structure sequences predicted by SSPRO with 100% 3-state accuracy.

[Plot omitted. Axes: Sequence I.D. (0 to 1600) and 3-state accuracy (Q3 score, 0 to 1).]

Figure 5.22: 3-state accuracy of PSIPRED on 1758 sequences with average of 3-state accuracy 0.789, standard deviation 0.089, and coefficient of variation 11.2%.

[Bar graph omitted. Horizontal axis: 3-state accuracy ranges (0.95-1.00 down to 0.40-0.45 in steps of 0.05, then 0.05-0.40 and 0.00-0.05); vertical axis: Number of sequences.]

Figure 5.23: Bar graph showing the distribution of the 3-state accuracies of PSIPRED predictions on the 1758 sequences dataset.

>1BT0:_|PDBID|CASP|TEST|SEQUENCE
MLIKVKTLTGKEIEIDIEPTDTIDRIKERVEEKEGIPPVQQRLIYAGKQLADDKTAKDYN
IEGGSVLHLVLAL

>1V80:_|PDBID|CASP|TEST|SEQUENCE
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
IQKESTLHLVLRLRGG

Figure 5.24: The amino acid sequences for which PSIPRED fails to predict the secondary structures.

SSPRO on 1758 sequences. The average 3-state accuracy, standard deviation,

and the coefficient of variation of SSPRO predictions on the 1758 sequences are

0.911, 0.101, and 11.1%. Figure 5.25 shows that SSPRO predicts 949 sequences

with 3-state accuracy between 0.95 and 1.00, which is 54.0% of the 1758 sequences.

SSPRO has much better prediction performance compared to PSIPRED regarding

3-state accuracy and the number of sequences with high 3-state accuracies. The

standard deviations of SSPRO and PSIPRED predictions are almost the same.

Figure 5.27 shows the 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold at which to stop clustering equals 0.42. The average 3-state accuracy of CISPred predictions on the 1758 sequences is 0.893. Compared to the average 3-state accuracies of PSIPRED and SSPRO on the 1758 sequences, 0.789 and 0.911, the average 3-state accuracy of CISPred is 0.104 (0.893-0.789=0.104) higher than that of PSIPRED, but 0.018 (0.911-0.893=0.018) lower than that of SSPRO. The standard deviation of CISPred predictions on the 1758 sequences is 0.095. Compared to the standard deviations of PSIPRED and SSPRO on the 1758 sequences, 0.089 and 0.101, the standard deviation of CISPred is 0.006 (0.095-0.089=0.006) higher than that of PSIPRED, but 0.006 (0.101-0.095=0.006) lower than that of SSPRO. The coefficient of variation of CISPred predictions on the 1758 sequences is 10.7%. Compared to the coefficients of variation of PSIPRED and SSPRO on the 1758 sequences, 11.2% and 11.1%, the coefficient of variation of CISPred is 0.5 percentage points (11.2%-10.7%=0.5%) lower than that of PSIPRED, and 0.4 percentage points (11.1%-10.7%=0.4%) lower than that of SSPRO. In total, the average 3-state accuracy of CISPred is slightly lower (by 0.018, or 1.8 percentage points) than that of SSPRO, while significantly higher (by 0.104, or 10.4 percentage points) than that of PSIPRED; however, the dispersions of the

[Plot omitted. Axes: Sequence I.D. (0 to 1600) and 3-state accuracy (0 to 1).]

Figure 5.25: 3-state accuracy of SSPRO predictions on 1758 sequences with average of 3-state accuracy 0.911, standard deviation 0.101, and coefficient of variation 11.1%.

[Bar graph omitted. Horizontal axis: 3-state accuracy ranges (0.95-1.00 down to 0.40-0.45 in steps of 0.05, then 0.05-0.40 and 0.00-0.05); vertical axis: Number of sequences.]

Figure 5.26: Bar graph showing the distribution of the 3-state accuracies of SSPRO predictions on the 1758 sequences dataset.

predictions of PSIPRED, SSPRO, and CISPred do not have large differences.
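The three summary statistics compared above are related by a simple formula: the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. The sketch below computes all three for a list of per-sequence 3-state accuracies. The function name `summarize` and the sample scores are illustrative, and the use of the population standard deviation is an assumption, since the text does not say whether the population or the sample form was used.

```python
import statistics


def summarize(scores):
    """Return (mean, standard deviation, coefficient of variation in %)."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # assumed: population standard deviation
    return mean, sd, 100.0 * sd / mean


# Made-up per-sequence accuracies, only to show the call.
mean, sd, cv = summarize([0.90, 0.80, 0.70, 0.80])
print(f"mean={mean:.3f} sd={sd:.3f} cv={cv:.1f}%")

# Recomputing the ratio from the rounded SSPRO figures quoted in the text.
print(f"{100.0 * 0.101 / 0.911:.1f}%")
```

Because the published means and standard deviations are themselves rounded, ratios recomputed from them can differ from the published coefficients of variation in the last digit.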

As shown in Figure 5.27, CISPred provides a prediction with very low 3-state accuracy, 0.14, on the protein sequence with sequence I.D. “856.” Sequence “856” is protein 1XR0. Figure 5.28 shows the amino acid sequence, 8-state DSSP secondary structure, and 3-state secondary structure of protein 1XR0; Figure 5.29 shows the prediction results from SSPRO for protein 1XR0; Figure 5.30 shows the prediction results from PSIPRED for protein 1XR0; and Figure 5.31 shows the alignment results from THREADER for protein 1XR0. The 3-state accuracies of the SSPRO and PSIPRED predictions for protein 1XR0 are 0.59 and 0.45 respectively, which are relatively low. The 3-state accuracy of the CISPred prediction for protein 1XR0, 0.14, is lower than that of both SSPRO and PSIPRED. THREADER provides only one fold with a filtered combined energy Z-score higher than 3.5, and this fold has very high confidence scores, as shown in Figure 5.31. However, the fold structures aligned with protein 1XR0 are “HHHHHHHHHHHHHCCCHHHHHH,” which have little correlation to the actual structures of 1XR0 shown in Figure 5.28, “CCCCCCCCCCCCCCCCCCEECC.” Because these fold structures have high confidence scores but little correlation to the actual structures, they lower the 3-state accuracy of CISPred. However, predictions like the one for protein 1XR0 are rare in both the 109 CASP sequences and the 1758 sequences.
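The accuracies quoted for protein 1XR0 can be checked directly from the strings printed in Figures 5.28 to 5.30, taking the 3-state accuracy (Q3 score) to be the fraction of residues whose predicted state matches the real state. The following sketch assumes that definition; the function name `q3` is illustrative, and the PSIPRED string is read off the third column of Figure 5.30.

```python
def q3(predicted, real):
    """Fraction of positions at which the predicted state equals the real one."""
    if len(predicted) != len(real) or not real:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, real))
    return matches / len(real)


REAL    = "CCCCCCCCCCCCCCCCCCEECC"  # 3-state structure, Figure 5.28
SSPRO   = "CCHHHHHHHHCCCCCCCEEECC"  # Figure 5.29
PSIPRED = "CCHHHHHHHHHCCCCEEEEECC"  # Figure 5.30, third column

print(f"SSPRO:   {q3(SSPRO, REAL):.2f}")
print(f"PSIPRED: {q3(PSIPRED, REAL):.2f}")
```

Under this definition SSPRO matches 13 of the 22 residues and PSIPRED matches 10 of 22, which round to the 0.59 and 0.45 reported above.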

Figure 5.32 depicts a bar graph showing the distribution of the 3-state accuracies of CISPred predictions on the 1758 sequences dataset when the threshold is 0.42. Figure 5.33 depicts a bar graph showing the distribution of the 3-state accuracies of the predictions of PSIPRED, SSPRO, and CISPred on the 1758 sequences dataset when the threshold is 0.42. Figure 5.33 is made by combining Figures 5.23, 5.26, and 5.32. Figure 5.33 shows that compared to CIS-

[Plot omitted. Axes: Sequence I.D. (0 to 1600) and 3-state accuracy (0 to 1).]

Figure 5.27: The 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold at which to stop clustering equals 0.42. The average 3-state accuracy, standard deviation, and coefficient of variation of these predictions are 0.893, 0.095, and 10.7% respectively.

ID:      1XR0
AA:      HSQMAVHKLAKSIPLRRQVTVS
DSSP:    CCSCSCCCCCSCCCCSCCEECC
3-state: CCCCCCCCCCCCCCCCCCEECC

Figure 5.28: The amino acid sequence, 8-state DSSP secondary structure, and 3-state secondary structure of protein 1XR0.

1XR0:_|PDBID|CASP|TEST|SEQUENCE
HSQMAVHKLAKSIPLRRQVTVS
CCHHHHHHHHCCCCCCCEEECC

Figure 5.29: Prediction result of SSPRO on protein 1XR0.

# PSIPRED VFORMAT

 1 H C 0.988 0.005 0.004
 2 S C 0.640 0.351 0.037
 3 Q H 0.305 0.506 0.155
 4 M H 0.153 0.641 0.236
 5 A H 0.102 0.757 0.122
 6 V H 0.038 0.915 0.036
 7 H H 0.029 0.908 0.043
 8 K H 0.053 0.898 0.019
 9 L H 0.069 0.928 0.009
10 A H 0.200 0.800 0.015
11 K H 0.358 0.608 0.016
12 S C 0.577 0.412 0.027
13 I C 0.916 0.081 0.038
14 P C 0.789 0.125 0.106
15 L C 0.523 0.254 0.257
16 R E 0.369 0.367 0.386
17 R E 0.258 0.344 0.421
18 Q E 0.263 0.145 0.486
19 V E 0.230 0.064 0.606
20 T E 0.293 0.037 0.639
21 V C 0.606 0.033 0.391
22 S C 0.994 0.000 0.005

Figure 5.30: Prediction result of PSIPRED on protein 1XR0.
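Figure 5.30 lists one residue per record: position, amino acid, predicted state, and three scores. A small parser such as the following could turn that listing into a prediction string for comparison with Figure 5.28. The function name `parse_vformat` is hypothetical, only the first five residues are included to keep the example short, and the reading of the three numeric columns as coil, helix, and strand scores is an inference from the figure (the largest score in each row lines up with the predicted state in that order).

```python
def parse_vformat(text):
    """Parse PSIPRED vertical-format text into (sequence, states, scores)."""
    sequence, states, scores = [], [], []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the '# PSIPRED VFORMAT' header.
        if len(fields) != 6 or fields[0].startswith("#"):
            continue
        sequence.append(fields[1])
        states.append(fields[2])
        scores.append(tuple(float(x) for x in fields[3:]))
    return "".join(sequence), "".join(states), scores


SAMPLE = """# PSIPRED VFORMAT

1 H C 0.988 0.005 0.004
2 S C 0.640 0.351 0.037
3 Q H 0.305 0.506 0.155
4 M H 0.153 0.641 0.236
5 A H 0.102 0.757 0.122
"""

sequence, states, scores = parse_vformat(SAMPLE)
print(sequence, states)  # HSQMA CCHHH
```

The returned `states` string has one character per residue, so it can be compared position by position with a real 3-state structure.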

Pred, SSPRO predicts 301 (949-648=301) more sequences with 3-state accuracy between 0.95 and 1.00, and compared to PSIPRED, SSPRO predicts 928 (949-21=928) more sequences with 3-state accuracy between 0.95 and 1.00. This indicates that SSPRO has significantly better prediction performance regarding the number of sequences with 3-state accuracy between 0.95 and 1.00. CISPred predicts 1357 (648+454+255=1357) sequences with 3-state accuracy higher than 0.85, and SSPRO predicts 1330 (949+217+164=1330) sequences with 3-state accuracy higher than 0.85, which indicates that CISPred has slightly better performance (27 more sequences) regarding the number of sequences with 3-state accuracy higher than 0.85. In total, SSPRO has significantly better performance than CISPred regarding the number of sequences with 3-state accuracy between 0.95 and 1.00; CISPred has slightly better performance than SSPRO regarding the number of sequences with 3-state accuracy higher than 0.85, and significantly better performance than PSIPRED regarding the number of sequences with 3-state accuracy between 0.95 and 1.00 and between 0.90 and 0.95; PSIPRED predicts more sequences than both SSPRO and CISPred with 3-state accuracy between 0.55 and 0.90, but also

THREADER 3.5 - Protein Sequence Threading Program
Build date : Sep 4 2004
Copyright (C) 2002 University College London
Portions Copyright (C) 1990 D.T.Jones

Registered user: [email protected]

Reading mean force potential tables...
Alignment with 2cpgA0:
        10        20        30        40
-----------9999999999999---99999999999999--
CEEEEEEEEEHHHHHHHHHHHHHHCCCHHHHHHHHHHHHHHCC
MKKRLTITLSESVLENLEKMAREMGLSKSAMISVALENYKKGQ
                  | |    |      |
-----------HSQMAVHKLAKSIPLRRQVTVS----------
                   10        20

Percentage Identity = 18.2.

Figure 5.31: Alignment result of THREADER on protein 1XR0.

more sequences with low 3-state accuracy, 0.00-0.40.

[Bar graph omitted. Horizontal axis: 3-state accuracy ranges (0.95-1.00 down to 0.40-0.45 in steps of 0.05, then 0.05-0.40 and 0.00-0.05); vertical axis: Number of sequences.]

Figure 5.32: Bar graph showing the distribution of the 3-state accuracies of CISPred predictions on the 1758 sequences when the threshold is 0.42.

5.5.4 Conclusions

CISPred has an average 3-state accuracy of 0.826 (82.6%) on the 109 CASP sequences, and of 0.893 (89.3%) on the 1758 sequences. CISPred predicts 74 of the 109 CASP sequences with 3-state accuracy higher than 0.80, which is 67.9% of the 109 CASP sequences. CISPred predicts 1357 of the 1758 sequences with 3-state accuracy higher than 0.85, which is 77.2% of the 1758 sequences.

Figure 5.34 illustrates the average 3-state accuracy, standard deviation, and

coefficient of variation of PSIPRED, SSPRO and CISPred on the 109 CASP se-

[Bar graph omitted. Horizontal axis: 3-state accuracy ranges (0.95-1.00 down to 0.40-0.45 in steps of 0.05, then 0.05-0.40 and 0.00-0.05); vertical axis: Number of sequences.]

Figure 5.33: Bar graph showing the distribution of the 3-state accuracies of the predictions of PSIPRED, SSPRO, and CISPred on the 1758 sequences dataset when the threshold is 0.42.

quences and the 1758 sequences. Figure 5.34 shows that CISPred has higher

average 3-state accuracy than SSPRO and PSIPRED on the 109 CASP sequences,

and CISPred has higher average 3-state accuracy than PSIPRED on the 1758

sequences; however, CISPred has slightly lower average 3-state accuracy than

SSPRO on the 1758 sequences. The standard deviation and coefficient of variation of CISPred usually lie between those of SSPRO and PSIPRED, which indicates that the dispersion of CISPred predictions is in a similar range to those of SSPRO and PSIPRED.

Figure 5.34: Summary of experimental results.

SSPRO predicts more sequences with high 3-state accuracy (≥0.95) than PSIPRED and CISPred, particularly in the 1758 sequences. CISPred predicts more sequences with 3-state accuracy higher than 0.85 than both PSIPRED and SSPRO on both the 109 CASP sequences and the 1758 sequences.

PSIPRED fails to predict the structures of two sequences in the 1758 sequences,

while SSPRO and CISPred do not have any failures, which indicates that CISPred

can provide results when some of the integrated tools fail to predict structures.

Chapter 6

Conclusion

6.1 Thesis Contributions

Currently, protein secondary structure prediction tools have combined evolution-

ary information and many machine learning algorithms to make predictions, and

have already achieved a 3-state accuracy around 70%-80%. Because of the va-

riety of methods used by existing tools, the results from the existing tools have

discrepancies. CISPred provides consensus predictions that are determined by integrating several popular existing tools. The methods integrated in CISPred are varied: the two popular secondary structure prediction tools integrated in CISPred, SSPRO and PSIPRED, make predictions by combining machine learning algorithms, neural networks, and PSI-BLAST, which provides evolutionary information; the fold recognition tool integrated in CISPred, THREADER, implements the threading method, which combines both comparative modelling methods (the alignment of target sequences and fold sequences) and ab initio prediction methods (determining the fitness of a template fold by computing the free energy of a target sequence); and the motif structure formulae integrated in CISPred use the structures of protein motifs to predict the structure of proteins. The experimental

results shown in Chapter 5 illustrate that CISPred has a higher 3-state accuracy

than both PSIPRED and SSPRO (0.6% higher than SSPRO and 4.8% higher

than PSIPRED) on the 109 CASP sequences, and has a higher 3-state accuracy

than PSIPRED (10.4% higher), but a slightly lower (1.8% lower) 3-state accu-

racy than SSPRO on the 1758 sequences. CISPred predicts more sequences with

3-state accuracy higher than 85% than both SSPRO and PSIPRED on both the

dataset containing the 109 CASP sequences and the dataset containing the 1758

sequences.

The threading method is a unique and representative method in protein fold recognition. The predicted structure of a target sequence is found by threading the target sequence through each fold in a fold library; the fold in which the target sequence has the minimum free energy is considered the best-fitting structure for the target sequence. The threading method requires a great deal of computation time to thread the target sequence through each template fold and to calculate the free energy. CISPred implements the threading method concurrently by dividing the fold library into 10 sub-libraries and threading a target sequence through all 10 sub-libraries simultaneously. Moreover, the other two existing tools integrated in CISPred and the search for motif structure formulae are also run concurrently, which reduces the execution time by a factor of up to 8, according to Figure 4.9 in Chapter 4. CISPred allows a user to query multiple sequences each time CISPred is executed, and the queried sequences are processed concurrently on a high-performance SUN Cluster.
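The division of the fold library into sub-libraries can be illustrated with a short sketch. It splits a list of folds into 10 parts, scores each part in its own worker, and keeps the fold with the lowest energy. Everything named here is hypothetical: `thread_sequence` is a stand-in scoring function rather than THREADER's energy calculation, the folds are plain strings, and a local thread pool stands in for the SUN Cluster on which CISPred runs.

```python
from concurrent.futures import ThreadPoolExecutor


def thread_sequence(target, fold):
    """Stand-in for threading `target` through `fold`; returns a mock energy."""
    return abs(len(target) - len(fold))  # placeholder, not a real free energy


def best_in_sublibrary(target, sublibrary):
    """Return (energy, fold) for the lowest-energy fold in one sub-library."""
    return min((thread_sequence(target, fold), fold) for fold in sublibrary)


def split(library, parts=10):
    """Divide the library into `parts` interleaved sub-libraries."""
    return [library[i::parts] for i in range(parts)]


def best_fold(target, library, parts=10):
    """Search all non-empty sub-libraries concurrently and merge the results."""
    sublibraries = [s for s in split(library, parts) if s]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = pool.map(lambda s: best_in_sublibrary(target, s), sublibraries)
        return min(results)


library = ["A" * n for n in range(1, 41)]  # 40 mock folds of lengths 1..40
print(best_fold("HSQMAVHKLAKSIPLRRQVTVS", library))
```

The merge step is cheap because each worker returns only its best candidate. For CPU-bound scoring in Python, a process pool or separate cluster jobs would be the more realistic choice; the thread pool is used here only to keep the example self-contained.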

The user interface of CISPred is a website, which makes it easily accessible: any user with Internet access can execute CISPred, and the prediction results are sent to the user's email address.

The experiments mentioned in Chapter 5 show that PSIPRED fails to predict

the structures of two sequences in the test dataset containing the 1758 sequences.

CISPred is still able to predict the structures of these two sequences, because

CISPred provides prediction results by integrating other tools, which shows that

CISPred is relatively more stable and more reliable than a single prediction tool.

6.2 Future Work

CISPred provides predictions of 3 secondary structural types: Coil (“C”), Helix

(“H”), and Strand (“E”). The DSSP (Dictionary of Protein Secondary Structure)

defines 8 secondary structure types, as mentioned in Chapter 2. Currently, some

existing tools, such as SSPRO8 [37], predict all 8 secondary structure types in

DSSP, although most of them are in the experimental stage. CISPred could be modified to predict 8 secondary structure types by integrating existing tools that predict all 8 DSSP structure types; such tools could also be integrated to improve the CISPred predictions of the 3 structural types.
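The relationship between the 8 DSSP structure types and the 3 types predicted by CISPred can be written as a lookup table. The mapping below (H, G, and I to helix; E and B to strand; everything else to coil) is one widely used reduction and is an assumption here; the reduction used for the experiments in this thesis may differ. It does reproduce the 3-state string of protein 1XR0 in Figure 5.28 from its DSSP string.

```python
# Assumed reduction: helices (H, G, I) -> H, strands (E, B) -> E, rest -> C.
DSSP_TO_3STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_3_state(dssp):
    """Map a DSSP string to the 3-state alphabet H/E/C."""
    return "".join(DSSP_TO_3STATE.get(state, "C") for state in dssp)


print(reduce_to_3_state("CCSCSCCCCCSCCCCSCCEECC"))  # DSSP string of 1XR0
```

A tool that predicts all 8 types could be compared with a 3-state predictor by passing its output through such a reduction first.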

Both SSPRO and PSIPRED use PSI-BLAST profiles as inputs for their neural

network architectures. The PSI-BLAST profiles used by SSPRO were generated in

2004, and the PSI-BLAST profiles used by PSIPRED were generated in 1999. The

109 CASP sequences used in the experiments mentioned in Chapter 5 are from

CASP3 (1998), CASP4 (2000), CASP5 (2002), and CASP6 (2004), and the 1758

sequences may contain sequences discovered before 1999. Therefore, the sequences

in the 109 CASP sequences dataset and the 1758 sequences dataset may already

have been included in the PSI-BLAST profiles used by SSPRO, and the sequences

in CASP3 (1998) and the 1758 sequences may already have been included in

the PSI-BLAST profiles used by PSIPRED, which means that the experimental

results mentioned in Chapter 5 may not be objective estimates of the performance

of both SSPRO and PSIPRED. The solution to this problem is to test PSIPRED

and SSPRO on recently discovered proteins, and then to predict the structures

of these proteins without considering their homologous information. Experiments

may be conducted to see whether the prediction performances of both SSPRO

and PSIPRED could be improved by using the most recent PSI-BLAST profiles,

such as the PSI-BLAST profiles generated in 2006 or 2007. The source code

of both PSIPRED and SSPRO is available for download, which makes it more

feasible to conduct these experiments. Using the most recent PSI-BLAST profiles

may include homologous information about the newly discovered proteins, which

makes it difficult to find test sequences that do not have homologous information

included in their prediction.

SSPRO combines both neural networks and homologue information to make

predictions. For a queried sequence, SSPRO finds homologues of the queried se-

quence, and uses the structure of the most significant homologue as its predicted

structure of the sequence. For queried sequences that do not have significant homologues, SSPRO uses neural networks and PSI-BLAST profiles to make predictions. In the experiments mentioned in Chapter 5, SSPRO uses the structures of the most significant homologues as its predictions for sequences that have been known for several years and for which plenty of homologue information is available. This gives SSPRO predictions on these sequences a very high 3-state accuracy. To obtain an objective estimate of the performance of SSPRO, experiments may be conducted in which homologue information is not considered and only neural networks and PSI-BLAST profiles are used to make predictions.

References

[1] A. Bairoch, P. Bucher, and K. Hofmann. The PROSITE database, its status

in 1997. Nucleic Acids Research, 25:217–221, 1997.

[2] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig,

I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids

Research, 28:235–242, 2000.

[3] J. M. Chandonia and M. Karplus. The importance of larger data sets for

protein secondary structure prediction with neural networks. Protein Science,

5:768–744, 1996.

[4] J. M. Chandonia and M. Karplus. New methods for accurate prediction of

protein secondary structure. Proteins, 35:293–306, 1999.

[5] P. Y. Chou and G. D. Fasman. Conformational parameters for amino acids

in helical, β-sheet, and random coil regions calculated from proteins.

Biochemistry, 28:211–222, 1974.

[6] J. A. Cuff and G. J. Barton. Prediction of protein secondary structure by com-

bining nearest-neighbor algorithms and multiple sequence alignments. Jour-

nal of Molecular Biology, 247:11–15, 1995.

[7] J. A. Cuff and G. J. Barton. Secondary structure prediction using segment

similarity. Protein Engineering, 10:1143–1153, 1997.

[8] J. A. Cuff and G. J. Barton. Evaluation and improvement of multiple sequence

methods for protein secondary structure prediction. Proteins, 34:508–519,

1999.

[9] J. A. Cuff and G. J. Barton. Application of multiple sequence alignment

profiles to improve protein secondary structure prediction. Proteins, 40:502–

511, 2000.

[10] J. A. Cuff, M. Clamp, A. S. Siddiqui, M. Finlay, and G. J. Barton. JPred: A

consensus secondary structure prediction server. Bioinformatics, 14:892–893,

1998.

[11] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis

and display of genome-wide expression patterns. Proceedings of the National

Academy of Sciences of the United States of America, 95:14863–14868, 1998.

[12] D. Frishman and P. Argos. Protein secondary structure prediction using

nearest-neighbor methods. Journal of Molecular Biology, 232:1117–1129,

1993.

[13] D. Frishman and P. Argos. Incorporation of non-local interactions in pro-

tein secondary structure prediction from the amino acid sequence. Protein

Engineering, 9:133–142, 1996.

[14] D. Frishman and P. Argos. Seventy-five percent accuracy in protein secondary

structure prediction. Proteins, 27:329–335, 1997.

[15] J. Garnier, J. Gibrat, and B. Robson. GOR method for predicting protein

secondary structure from amino acid sequence. Methods Enzymol,

266:540–553, 1996.

[16] J. Garnier, D. J. Osguthorpe, and B. Robson. Analysis of the accuracy

and implications of simple methods for predicting the secondary structure of

globular proteins. Journal of Molecular Biology, 120:97–120, 1978.

[17] J. F. Gibrat, J. Garnier, and B. Robson. Further developments of protein

secondary structure prediction using information theory. New parameters and

consideration of residue pairs. Journal of Molecular Biology, 198:425–443,

1987.

[18] N. Goldman, J. L. Thorne, and D. T. Jones. Using evolutionary trees in pro-

tein secondary structure prediction and other comparative sequence analysis.

Journal of Molecular Biology, 263:196–208, 1996.

[19] L. H. Holley and M. Karplus. Protein Secondary Structure Prediction with

a Neural Network. Proceedings of the National Academy of Sciences of the

United States of America, 86:152–156, 1989.

[20] R. W. W. Hooft, M. Scharf, C. Sander, and G. Vriend. The PDBFINDER database: A summary of PDB, DSSP and HSSP information with added value. Computer Applications in the Biosciences (CABIOS), 12:525–529, 1996.

[21] X. Huang, D. S. Huang, G. Z. Zhang, Y. P. Zhu, and Y. X. Li. Prediction of protein secondary structure using improved two-level neural network architecture. Protein and Peptide Letters, 12:805–811, 2005.

[22] N. Hulo, C. J. Sigrist, V. L. Saux, P. S. Langendijk-Genevaux, L. Bordoli,

A. Gattiker, E. D. Castro, P. Bucher, and A. Bairoch. Recent improvements

to the PROSITE database. Nucleic Acids Research, 32:D134–D137, 2004.

[23] D. T. Jones. THREADER: Protein sequence threading by double dynamic programming. In Steven Salzberg, David Searls, and Simon Kasif, editors, Computational Methods in Molecular Biology, volume 32, chapter 13. Elsevier Science, Amsterdam, Netherlands, 1998.

[24] D. T. Jones. Protein secondary structure prediction based on position-specific

scoring matrices. Journal of Molecular Biology, 292:195–202, 1999.

[25] H. Kaur and G. P. Raghava. Prediction of beta-turns in proteins from multiple alignment using neural network. Protein Science, 12:627–634, 2003.

[26] Kinemage. “Kinemage, Next Generation”. Online. Accessed September 26th 2006, Available HTTP: http://kinemage.biochem.duke.edu/software/king.php.

[27] R. D. King and M. J. E. Sternberg. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Science, 5:2298–2310, 1996.

[28] D. G. Kneller, F. E. Cohen, and R. Langridge. Improvements in protein

secondary structure prediction by an enhanced neural network. Journal of

Molecular Biology, 214:171–182, 1990.

[29] A. Kryshtafovych, C. Venclovas, K. Fidelis, and J. Moult. Progress over

the first decade of CASP experiments. Proteins: Structure, Function, and

Bioinformatics, 61:225–236, 2005.

[30] J. M. Levin. Exploring the limits of nearest neighbour secondary structure

prediction. Protein Engineering, 10:771–776, 1997.

[31] V. I. Lim. Algorithms for prediction of alpha helices and structural regions

in globular proteins. Journal of Molecular Biology, 88:873–894, 1974.

[32] P. Lio, N. Goldman, J. L. Thorne, and D. T. Jones. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics, 14:726–733, 1998.

[33] P. K. Mehta, J. Heringa, and P. Argos. A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%. Protein Science, 4:2517–2525, 1995.

[34] K. Nishikawa and T. Noguchi. Predicting protein secondary structure based on amino acid sequence. Methods Enzymol, 202:31–34, 1995.

[35] J. Selbig, T. Mewes, and T. Lengauer. Decision tree-based formation of consensus protein secondary structure prediction. Bioinformatics, 15:1039–1046, 1999.

[36] D. Petrey, Z. Xiang, C. L. Tang, and L. Xie. Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins, 53:430–435, 2003.

[37] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235, 2002.

[38] N. Qian and T. J. Sejnowski. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202:865–884, 1988.

[39] S. K. Riis and A. Krogh. Improving prediction of protein secondary structure

using structured neural networks and multiple sequence alignments. Journal

of Computational Biology, 3:163–183, 1996.

[40] B. Rost and C. Sander. Prediction of protein secondary structure at better

than 70% accuracy. Journal of Molecular Biology, 232:584–599, 1993.

[41] B. Rost and C. Sander. Combining Evolutionary Information and Neural Net-

works to Predict Protein Secondary Structure. Proteins: Structure, Function,

and Genetics, 19:52–72, 1994.

[42] B. Rost, G. Yachdav, and J. Liu. The PredictProtein server. Nucleic Acids Research, 32:W321–W326, 2004.

[43] Web-Books.Com. Molecular biology web book. Online, September 2005. Accessed August 26th 2006, Available HTTP: http://www.web-books.com/MoBio/Free/Ch2C6.htm.

[44] Wikipedia.Com. Coefficient of variation. Online, May 2007. Accessed July

5th 2007, Available HTTP: www.wikipedia.org.

[45] Wikipedia.Com. Residue (chemistry). Online, May 2007. Accessed July 5th

2007, Available HTTP: www.wikipedia.org.

[46] V. N. Viswanadhan, B. Denckla, and J. N. Weinstein. New joint prediction algorithm (Q7-JASEP) improves the prediction of protein secondary structure. Biochemistry, 30:11164–11172, 1991.

[47] K. Zimmermann and J. F. Gibrat. In unison: regularization of protein secondary structure predictions that makes use of multiple sequence alignments. Protein Engineering, 11:861–865, 1998.

[48] M. Zvelebil, G. Barton, W. Taylor, and M. Sternberg. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. Journal of Molecular Biology, 195:957–961, 1987.

[49] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637, 1983.

Appendix A

Submission Templates on Cluster

A.1 Submission Template for THREADER

The following template is used to submit concurrent THREADER [23] jobs on the cluster.

#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N TSeq-AAAAAA-Str-BBBBBB
# Direct output
#$ -j y
#$ -o logSeqAAAAAAStrBBBBBB.screen
./threader-linux -p -j AAAAAA.seq AAAAAA-BBBBBB.out BBBBBB.lst
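
As a minimal sketch of how a template like this might be instantiated, the shell function below fills in the AAAAAA and BBBBBB placeholders with sed. The function name render_template and the file name threader.tmpl are hypothetical, and reading AAAAAA as a sequence identifier and BBBBBB as a structure-list identifier is an assumption inferred from the job name TSeq-AAAAAA-Str-BBBBBB, not something taken from the CISPred implementation. The `#$` lines look like Sun Grid Engine directives, so the usage comment assumes submission with qsub.

```shell
#!/bin/sh
# Hypothetical helper (not part of CISPred): print a copy of a job template
# with the AAAAAA and BBBBBB placeholders replaced by concrete identifiers.
# Usage: render_template TEMPLATE_FILE SEQ_ID [STR_ID]
# Assumption: identifiers contain only letters, digits, '_' or '-', so they
# are safe to use inside a sed replacement string.
render_template() {
    template=$1
    seq_id=$2
    str_id=${3:-}   # optional; only the THREADER template uses BBBBBB
    sed -e "s/AAAAAA/${seq_id}/g" \
        -e "s/BBBBBB/${str_id}/g" \
        "$template"
}

# Example use (assumes a Sun Grid Engine style qsub on the cluster):
#   render_template threader.tmpl 1abc fold07 > job-1abc-fold07.sh
#   qsub job-1abc-fold07.sh
```

Because the second identifier is optional, the same function could also be applied to templates that only contain AAAAAA, such as those in A.2 and A.3.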

A.2 Submission Template for SSPRO

The following template is used to submit concurrent SSPRO [37] jobs on the cluster.

#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N SSpro-AAAAAA
# Direct output
#$ -j y
#$ -o logSeqAAAAAA.screen
/apps/cs/sspro4/bin/predict_ssa.sh AAAAAA.seq AAAAAA.out

A.3 Submission Template for PSIPRED

The following template is used to submit concurrent PSIPRED [24] jobs on the cluster.

#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N Psi-AAAAAA
# Direct output
#$ -j y
#$ -o logSeqAAAAAA.screen
./runpsipred AAAAAA.seq

A.4 Submission Template for Finding Motif Structures

The following template is used to submit concurrent jobs for finding motif structures on the cluster.

#!/bin/bash
# Use /bin/sh shell
#$ -S /bin/sh
# Run in submit directory
#$ -cwd
# Job name
#$ -N Pat-SeqID-MotifNo
# Direct output
#$ -j y
#$ -o logSeqID-MotifNo.screen
perl MotifStrFinder.pl MotifName MotifNo SeqID Start End Length
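
Since this template uses named placeholders rather than fixed marker strings, one possible way to instantiate it is to generate the script text directly from arguments. The sketch below does this with a here-document. The function name render_motif_job is hypothetical, the six arguments are simply passed through in the order shown in the template, and nothing here is taken from the actual CISPred scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of CISPred): print a motif-search job script
# for one motif occurrence, following the layout of the template above.
# Usage: render_motif_job MotifName MotifNo SeqID Start End Length
render_motif_job() {
    if [ "$#" -ne 6 ]; then
        echo "usage: render_motif_job MotifName MotifNo SeqID Start End Length" >&2
        return 1
    fi
    motif_name=$1
    motif_no=$2
    seq_id=$3
    start=$4
    end=$5
    length=$6
    # "\$" keeps the directive prefix "#$" literal inside the here-document.
    cat <<EOF
#!/bin/bash
# Use /bin/sh shell
#\$ -S /bin/sh
# Run in submit directory
#\$ -cwd
# Job name
#\$ -N Pat-${seq_id}-${motif_no}
# Direct output
#\$ -j y
#\$ -o log${seq_id}-${motif_no}.screen
perl MotifStrFinder.pl ${motif_name} ${motif_no} ${seq_id} ${start} ${end} ${length}
EOF
}

# Example use (assumes a Sun Grid Engine style qsub on the cluster):
#   render_motif_job MotifName 3 1abc 10 25 16 > pat-1abc-3.sh
#   qsub pat-1abc-3.sh
```

The argument-count check is there so that a malformed call fails before any script is written or submitted.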

Vita

Candidate’s full name:

Zheng Wang

University attended:

Shandong Economic University, Jinan, Shandong, P. R. China.

Bachelor of Management Information System, 2004.

Poster:

Zheng Wang, Patricia Evans and Virendra Bhavsar, CISPred: Consensus Integrated Protein Structure Prediction. April 4, 2007, University of New Brunswick.
