Quorum-Sensing Control Repressor - Pseudomonas
aeruginosa Secondary Structure Prediction using Particle
Swarm Optimization (PSO) tuned Artificial
Neural Network
1Saravanan K,
2Sivakumar S
1Department of Physics, AVS Engineering College, Salem, Tamilnadu, INDIA
2Department of Physics, Government Arts College(Autonomous), Salem, Tamilnadu, INDIA
Corresponding E.Mail:[email protected].
ABSTRACT
Quorum sensing controls gene expression in hundreds of Proteobacteria including a
number of plant and animal pathogens. Generally, the AHL (acyl-homoserine lactone) receptors
are members of a family of related transcription factors, and although they have been targets
for the development of antivirulence therapeutics, there is very little structural information
about this class of bacterial receptors. Hence, secondary structure prediction becomes one
of the most important and challenging problems. Machine learning techniques have been
applied to solve this problem and have gained substantial success in this research area.
Although neural network-based prediction has become more popular, the training
methodology involves more processing. To overcome this drawback, this
work proposes a new topology called PSO trained Neural Fields, which can tune the NN
automatically and is designed for protein SS prediction. The results are compared with
other prediction mechanisms and are found to be more accurate than those of the
corresponding mechanisms.
Keywords: protein structure prediction / secondary structure / neural network / back-propagation
/ PSO.
INTRODUCTION
Proteins perform many biological functions and represent the building blocks of
organisms. They are complex organic compounds of which the basic forming unit is the amino
acid. Proteins are initially linear chains of amino acids which can vary in length from a few up to
thousands of amino acids. Proteins fold, under the influence of several chemical and physical
factors, into their unique 3D structures that determine their biological functions and properties.
Misfolding occurs when the protein folds into a 3D structure that does not represent its correct
native structure, which can lead to many diseases, such as Alzheimer's, several types of cancer,
etc. Due to the importance of this issue to human life, scientists have developed laboratory
techniques such as X-ray crystallography and nuclear magnetic resonance (NMR) to determine
ISSN NO: 1524-2560
http://jscglobal.org/
Journal of Scientific Computing
Volume 8 Issue 12 2019
the native structures of proteins. Although these methods are reliable, they are not always
feasible. Hence, predicting the native structure of a protein, given its primary sequence, is an
important and challenging task in computational biology. The primary protein structure is a
linear sequence of amino acids connected together via peptide bonds. Proteins fold due to the
hydrophobic effect, van der Waals interactions, electrostatic forces, hydrogen bonding, etc. The
secondary structures are three-dimensional structures characterized by a repeating bonding
pattern. The most common structures are helices and strands. The proteins that include these
secondary structures can further fold into the tertiary structure forming a bundle of secondary
structures, turns and loops. Furthermore, the aggregation of tertiary structure regions of some
separate protein sequences forms the so-called quaternary structures [1]. Computational
approaches to protein structure prediction are heuristic and can be classified as homology
modeling, threading, and ab initio methods [2].
Comparative-modeling methods can successfully exploit this property. Most of these
methods use neural networks and achieve good prediction accuracies. Evolutionary information
in the form of multiple sequence alignment profiles is used [3]. The best methods achieve
accuracies of up to 79.01% with bidirectional neural networks [4–6].
The secondary-structure prediction approaches in use today can be categorized into three
groups: neighbor-based, model-based, and meta predictor-based [7]. The neighbor-based
approaches predict the secondary structure by identifying a set of similar sequence fragments
with known secondary structure; the model-based approaches employ sophisticated machine
learning techniques to learn a predictive model trained on sequences of known structure, whereas
the meta predictor-based approaches predict based on a combination of the results of various
neighbor and/or model-based techniques.
Historically, the most successful model-based approaches, such as PSIPRED [8], were
based on neural network (NN) learning techniques [9]. Protein secondary structures are
traditionally characterized as three general states: helix (H), strand (E), and coil (C). From these
three general states, the DSSP program [10] proposed a finer characterization of the secondary
structures by extending the three states into eight states: 3₁₀ helix (G), α-helix (H), π-helix (I),
β-strand (E), bridge (B), turn (T), bend (S), and others (C). Prediction of the three states from
protein sequences (i.e., the Q3 prediction problem) has been intensively investigated for decades
using many machine learning methods, including probabilistic graphical models [11,12], support
vector machines [13, 14], hidden Markov models [15, 16], artificial neural networks [19-21], and
bidirectional recurrent neural networks (BRNN) [17].
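The relation between the eight DSSP states and the three-state Q3 problem can be illustrated with a short Python sketch. The reduction used here (G, H, I to helix; E, B to strand; everything else to coil) is one commonly used convention and is an assumption, since the text does not state which reduction was applied; the function names are hypothetical.

```python
# Reduce DSSP eight-state labels to three states and score a prediction (Q3).
# Assumed reduction: G, H, I -> H; E, B -> E; every other symbol -> C.
DSSP_TO_Q3 = {"G": "H", "H": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(dssp):
    """Map each DSSP label to H, E, or C (unlisted labels become C)."""
    return "".join(DSSP_TO_Q3.get(label, "C") for label in dssp)


def q3_accuracy(observed, predicted):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


# Toy example (not data from the paper).
observed = reduce_to_three_states("SHHHTT*EEB")
score = q3_accuracy(observed, "CHHHCCCEEC")
print(observed, score)
```

A different reduction (for example, treating B as coil) would change the score, so the mapping should always be reported alongside a Q3 value.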
However, Artificial Neural Network (ANN) design is a complex task because its
performance depends on the architecture, the selected transfer function, and the learning
algorithm used to train the set of synaptic weights [18]. To overcome this drawback, a new
methodology that automatically tunes the ANN using the particle swarm optimization (PSO)
algorithm is implemented here.
Quorum sensing, a cell-cell communication system, is broadly
distributed among bacteria and is commonly used to regulate the production of shared products.
An important consequence of quorum sensing is a delay in the production of certain products
until the population density is high. The bacterium Pseudomonas aeruginosa has a particularly
complicated quorum sensing system involving multiple signals and receptors. Hence it is
necessary to predict the structure of its quorum-sensing control repressor, QscR. In this work,
a new hybrid topology called PSO trained Neural Network is introduced to predict the
secondary structure of QscR.
METHODOLOGY AND MATERIALS
PARTICLE SWARM OPTIMIZATION (PSO)
Particle Swarm Optimization was developed by James Kennedy and Russell Eberhart in
1995. It is a population-based technique inspired by biological behaviors such as flocking
and swarming. The idea first arose from the behavior observed in swarms of bees, schools of
fish, and flocks of birds. The key advantages of PSO are its fast convergence, simple
implementation, and the fact that it needs no gradient information. PSO is initialized with a
randomly generated population and conducts its search within this population of particles.
Each particle in the population represents a candidate solution to the given problem [22,24].
The particles travel through the search space and update their positions using (i) the distance
between Pbest and the particle's current position and (ii) the distance between Gbest and the
particle's current position. Each particle remembers the best solution it has achieved so far,
together with its position, known as Pbest, the personal best. The best value found by any
particle in the group, together with its position, is known as Gbest, the global best. PSO
accelerates each particle toward its Pbest and the Gbest locations. Figure 1 shows particle
position alteration in particle swarm optimization.
Figure 1. Particle position alteration in PSO
where $V_i^{Gbest}$ is the velocity based on Gbest and $V_i^{Pbest}$ is the velocity based on Pbest.
Figure 2. Particle Swarm group behavior
Every particle in the group has the ability to produce a solution to the given problem.
The position of the ith particle is represented as $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and the
corresponding velocity as $V_i = (v_{i1}, v_{i2}, \ldots, v_{in})$, as shown in Figure 2.
Figure 3. The behavior of Pbest and Gbest
The $P_i^{best}$ and $G^{best}$ of the ith particle are specified as
$P_i^{best} = (x_{i1}^{Pbest}, x_{i2}^{Pbest}, \ldots, x_{in}^{Pbest})$ and
$G^{best} = (x_1^{Gbest}, x_2^{Gbest}, \ldots, x_n^{Gbest})$, as illustrated in Figure 3. The velocity
of particle i is given by equation 1.
$$V_i^{(t+1)} = \omega V_i^{(t)} + c_1 r_1 \left(P_i^{best} - X_i^{(t)}\right) + c_2 r_2 \left(G^{best} - X_i^{(t)}\right) \qquad (1)$$
where
$X_i^{(t)}$ - the present position of particle i at iteration t
t - iteration pointer
$P_i^{best}$ - the best position of particle i until iteration t
$G^{best}$ - the global best position of the entire swarm until iteration t
$c_1, c_2$ - the acceleration coefficients, which vary between 0 and 4
$V_i^{(t)}$ - the velocity of particle i at iteration t (the step size between successive positions)
ω - the inertia weight/damping factor, which decreases from 0.9 to 0.4 and is used to control
the impact of the previous velocity on the new velocity
$r_1, r_2$ - random variables with a range of [0, 2]
The inertia weight ω is calculated by equation 2.
$$\omega = \omega_{max} - \frac{\omega_{max} - \omega_{min}}{iter_{max}} \times iter \qquad (2)$$
where
$\omega_{max}$ - initial weight
$\omega_{min}$ - final weight
$iter_{max}$ - maximum iteration number
$iter$ - current iteration number
A new velocity is calculated in the direction of $P_i^{best}$ and $G^{best}$ to execute a change in the
current search point (Swagatam Das et al. 2005). Every particle attempts to migrate from its
current position to the new position by using the modified velocity, as given in equation 3.
$$X_i^{(t+1)} = X_i^{(t)} + V_i^{(t+1)} \qquad (3)$$
In PSO, all the particles attempt to migrate to improved positions. Through the
mutual effort of all the particles, the best position (optimal solution) is obtained. The iteration
ends when a stopping condition is met. Common stopping conditions are a maximum
number of iterations of the algorithm, a number of iterations since the last update of the
global best solution, and a predefined fitness value (Miller 2002). The PSO algorithm
terminates after a certain number of iterations or once the fitness value is close enough to the
desired output. Numerous variants of PSO are available for discrete optimization, constrained
optimization, and multi-objective optimization [25].
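The update rules in equations 1 to 3 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pso_minimize`, the sphere test function, the search bounds, the swarm size, the iteration count, and the choice of drawing r1 and r2 uniformly from [0, 1] are all assumptions, while c1 = 1 and c2 = 2 follow Table 1 and the inertia range 0.9 to 0.4 follows the text.

```python
import numpy as np


def pso_minimize(fitness, dim, n_particles=30, max_iter=200,
                 c1=1.0, c2=2.0, w_max=0.9, w_min=0.4,
                 bounds=(-5.0, 5.0), seed=0):
    """Minimize `fitness` with a basic global-best PSO (equations 1 to 3)."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    x = rng.uniform(low, high, size=(n_particles, dim))  # positions X_i
    v = np.zeros((n_particles, dim))                     # velocities V_i
    pbest = x.copy()
    pbest_val = np.array([fitness(p) for p in x])
    g = int(np.argmin(pbest_val))
    gbest, gbest_val = pbest[g].copy(), float(pbest_val[g])

    for it in range(max_iter):
        # Equation 2: inertia weight decreases linearly from w_max toward w_min.
        w = w_max - (w_max - w_min) * it / max_iter
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Equation 1: inertia term + pull toward Pbest + pull toward Gbest.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        # Equation 3: move each particle (clipping to the box is an assumption).
        x = np.clip(x + v, low, high)

        vals = np.array([fitness(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmin(pbest_val))
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), float(pbest_val[g])
    return gbest, gbest_val


def sphere(p):
    """Toy objective with its minimum at the origin."""
    return float(np.sum(p ** 2))


best_x, best_f = pso_minimize(sphere, dim=3)
print(best_x, best_f)
```

Only a fixed iteration budget is used as the stopping condition here; a stagnation counter or a target fitness value could be added in the same loop.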
ARTIFICIAL NEURAL NETWORK (ANN)
An ANN is a function approximator that maps inputs to outputs. A typical network, depicted
in Figure 4, consists of three layers of neurons (input, hidden, and output), in which each
neuron acts as an independent computational element.
Figure 4. An Artificial Neural Network
The input layer is defined as a layer of neurons receiving inputs directly from outside the
network. A layer that is not connected to the network output is called a hidden layer,
and the layer whose output is passed to the world outside the network is the output layer. Weight
functions apply weights to an input to get weighted inputs, as specified by a particular function.
Different algorithms can be applied to minimize the network error during ANN training. This
is usually done by tuning the network, which depends on the weights, biases, number of
neurons in the hidden layer, and number of iterations. Among these, the training algorithm is
important because it determines how the optimum weights and biases are reached. Generally, the
performance function of ANN models during the training process is assessed using the sum
of squares of the errors as follows:
Where
T is the total number of training samples,
m is the number of output layer neurons,
W represents the vector containing all the weights in the network,
yp is the actual network output, and
dp is the desired output.
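A sum-of-squares error consistent with the symbols listed above can be written as shown here. The symbol E and the output-neuron index k are assumptions introduced for illustration; the notation of the original formula may differ.

```latex
E(W) = \sum_{p=1}^{T} \sum_{k=1}^{m} \left( d_{p,k} - y_{p,k}(W) \right)^{2}
```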
The optimum number of neurons in the hidden layer is often found through a
trial and error procedure. Aside from the abovementioned parameters, the correct selection of the
input variables that affect the target variable is considered one of the most important stages
in dealing with ANN models.
ANN TRAINING WITH PSO ALGORITHM
To train the ANN with the PSO algorithm, the following procedure is used. The algorithm
is used to find the optimum weights and biases of the ANN model. The values of the weights
and biases form the search space of the algorithm, which has n dimensions [23], where n is
the total number of weights and biases that need to be optimized.
Each particle has a position vector and a velocity vector of n dimensions. Here both weights and
biases are denoted by W. The optimal set of weights is obtained by flying the particles around the
search space. At each iteration, the algorithm produces a set of weights whose fitness is
assessed by assigning these weights to the nodes and predicting the target value.
The accuracy of the prediction is then evaluated as the difference between actual and predicted
values, which is to be minimized through the optimization process. The best fitness a particle
has achieved so far is taken as its personal best.
Similarly, the best fitness of the swarm is used as the global best. This process is repeated
for a specified number of iterations until the optimized weights for the ANN are obtained. The
steps for a PSO-optimized ANN are given below. For a three-layered perceptron, W[1] and W[2]
represent the connection weight matrices between the input layer and the hidden layer, and
between the hidden layer and the output layer, respectively. When a PSO algorithm is applied
to train the multilayer perceptron, the ith particle is denoted by:
(4)
(5)
(6)
where j = 1, 2; m = 1,. . . ,Mj ; n = 1,. . . , Nj ;
Mj and Nj are the row and column sizes of the matrices
W, P, and V; r and s are positive constants;
a and b are random numbers in the range from 0 to 1;
t is the time step between observations and is often taken as unity;
V'' and W'' represent the new values.
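One formulation consistent with the symbols defined above is sketched here, in the order particle definition, velocity update, position update. It is a reconstruction under assumptions rather than the original equations 4 to 6: the symbol $P_g$ for the best position found by the whole swarm is introduced here for illustration, and the exact form used by the authors may differ.

```latex
W_i = \{\, W_i^{[1]},\; W_i^{[2]} \,\}

V_i^{[j]''}(m,n) = V_i^{[j]}(m,n)
    + r\,a\,\bigl[ P_i^{[j]}(m,n) - W_i^{[j]}(m,n) \bigr]
    + s\,b\,\bigl[ P_g^{[j]}(m,n) - W_i^{[j]}(m,n) \bigr]

W_i^{[j]''}(m,n) = W_i^{[j]}(m,n) + V_i^{[j]''}(m,n)\, t
```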
Applying the velocity-update equation,
the new velocity of the particle is computed from its previous velocity and the distances of its
current position from the best positions found by the particle itself and by the group. The second
term on the right-hand side of this equation represents the private thinking of the particle itself,
whilst the third term, the social part, denotes the collaboration
among the particles as a group. The new position is then determined from the new velocity
by the position-update equation.
The fitness function F is the mean squared error and is defined as:
Where
F is the fitness value,
n is the number of data points.
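The training scheme described in this section can be sketched in Python as follows: each particle is a flat vector holding all weights and biases of a three-layer perceptron, and its fitness is the mean squared error between desired and predicted outputs. Everything specific here is an assumption made for illustration, including the tanh and sigmoid transfer functions, the hidden-layer size, the weight range, the position clipping, and the XOR toy data standing in for encoded protein sequences; none of these are stated in the text.

```python
import numpy as np


def unpack(w, n_in, n_hidden, n_out):
    """Split a flat particle vector into W1, b1, W2, b2."""
    i = 0
    W1 = w[i:i + n_in * n_hidden].reshape(n_in, n_hidden)
    i += n_in * n_hidden
    b1 = w[i:i + n_hidden]
    i += n_hidden
    W2 = w[i:i + n_hidden * n_out].reshape(n_hidden, n_out)
    i += n_hidden * n_out
    b2 = w[i:i + n_out]
    return W1, b1, W2, b2


def forward(w, X, n_in, n_hidden, n_out):
    """Three-layer perceptron: tanh hidden layer, sigmoid output layer."""
    W1, b1, W2, b2 = unpack(w, n_in, n_hidden, n_out)
    hidden = np.tanh(X @ W1 + b1)
    z = np.clip(hidden @ W2 + b2, -50.0, 50.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))


def mse(w, X, d, shape):
    """Fitness: mean squared error between desired and predicted outputs."""
    return float(np.mean((d - forward(w, X, *shape)) ** 2))


def train_with_pso(X, d, n_hidden=4, n_particles=30, max_iter=300,
                   c1=1.0, c2=2.0, w_max=0.9, w_min=0.4, seed=0):
    rng = np.random.default_rng(seed)
    shape = (X.shape[1], n_hidden, d.shape[1])
    n = shape[0] * shape[1] + shape[1] + shape[1] * shape[2] + shape[2]
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, n))  # one weight vector per particle
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([mse(p, X, d, shape) for p in pos])
    g = int(np.argmin(pbest_f))
    gbest, gbest_f = pbest[g].copy(), float(pbest_f[g])
    initial_f = gbest_f
    for it in range(max_iter):
        inertia = w_max - (w_max - w_min) * it / max_iter
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, -10.0, 10.0)  # assumed weight bounds
        f = np.array([mse(p, X, d, shape) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        g = int(np.argmin(pbest_f))
        if pbest_f[g] < gbest_f:
            gbest, gbest_f = pbest[g].copy(), float(pbest_f[g])
    return gbest, gbest_f, initial_f, shape


# Toy data (XOR), a stand-in for encoded sequence windows and class labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
d = np.array([[0.0], [1.0], [1.0], [0.0]])
weights, final_mse, initial_mse, shape = train_with_pso(X, d)
print(initial_mse, final_mse)
```

Because the global best is only replaced when a particle improves on it, the final fitness can never be worse than the best fitness in the initial random swarm.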
STUDY AREA AND DATA
In this work, a set of 100 proteins was used for training and a set of 5 proteins for testing.
All these sets have a representative mix of the three secondary structure classes, α-helix,
β-strand and coil. Each set was used as the validating set and as the testing set.
Table 1: The parameter configuration used in PSO
S.No   Parameters       Values
1      Particles        100
2      C1               1
3      C2               2
4      Max generation   1000
RESULTS AND DISCUSSION
HYDROPHILICITY PLOT
A hydrophilicity plot is a quantitative analysis of the degree of hydrophobicity or
hydrophilicity of the amino acids of a protein. It is used to characterize or identify possible
structures or domains of a protein.
Figure 5. Hydrophilicity plot of QscR
From Figure 5, it is concluded that the amino acids with positive hydrophobicity values
may be part of alpha-helix-spanning regions.
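A profile of this kind can be computed with a sliding-window average over a per-residue hydropathy scale, as sketched below. The text does not name the scale or window size used for Figure 5, so the Kyte-Doolittle index and the window of 9 residues are assumptions, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy index (an assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [sum(scores[i:i + window]) / window
            for i in range(len(sequence) - window + 1)]


# First 50 residues of QscR as listed in Table 2.
qscr_1_50 = "MHDEREGYLEILSRITTEEEFFSLVLEICGNYGFEFFSFGARAPFPLTAP"
profile = hydropathy_profile(qscr_1_50)
print(len(profile), max(profile), min(profile))
```

Windows whose mean is positive on this scale correspond to predominantly hydrophobic stretches.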
Table 2: Predicted secondary structure of QscR under different topologies

Residues 1-50
Sequence         MHDEREGYLE ILSRITTEEE FFSLVLEICG NYGFEFFSFG ARAPFPLTAP
DSSP             ******SHHH HHHH**SHHH HHHHHHHHHH HTT*SEEEEE EE***STTS*
MLNN             CHHHHHHHHH HHHHCCCHHH HHHHHHHHHH HHCCCEEEEE EECCCCCCCC
Proposed PSONN   CHHHHHHHHH HHHHCHCHHH HHHHHHHHHH HHCHCEEEEE EECCCCCCCH

Residues 51-100
Sequence         KYHFLSNYPG EWKSRYISED YTSIDPIVRH GLLEYTPLIW NGEDFQENRF
DSSP             *EEEEE***H HHHHHHHHTT GGGT*HHHHH HHHS*S*EEE ETTT*SS*HH
MLNN             CEEEECCCCH HHHHHHHHHC CHHHCHHHHH HHHCCCCEEE CCCCCHHHHH
Proposed PSONN   HEEEECCCCH HHHHHHHHHC CHHHHHHHHH HHHCCCHEEE CCCCHHHHHH

Residues 101-150
Sequence         FWEEALHHGI RHGWSIPVRG KYGLISMLSL VRSSESIAAT EILEKESFLL
DSSP             HHHHHHHTT* *EEEEEEEE* GGG*EEEEEE EESSS***HH HHHHHHHHHH
MLNN             HHHHHHHHCC CCEEEEEEEC CCCCEEEEEE ECCCCCCCHH HHHHHHHHHH
Proposed PSONN   HHHHHHHHCH CCEEEEEEEH CCCCEEEEEE ECCCCCCHHH HHHHHHHHHH

Residues 151-200
Sequence         WITSMLQATF GDLLAPRIVP ESNVRLTARE TEMLKWTAVG KTYGEIGLIL
DSSP             HHHHHHHHHH HHHHHHHHSG GGG****HHH HHHHHHHHTT **HHHHHHHH
MLNN             HHHHHHHHHH HHHHCCCCCC CCCCCCCHHH HHHHHHHHHC CCHHHHHHHH
Proposed PSONN   HHHHHHHHHH HHHHCCCCCH CCCCCCHHHH HHHHHHHHHC HCHHHHHHHH

Residues 201-237
Sequence         SIDQRTVKFH IVNAMRKLNS SNKAEATMKA YAIGLLN
DSSP             TS*HHHHHHH HHHHHHHTT* SSHHHHHHHH HHTT***
MLNN             CCCHHHHHHH HHHHHHHHCC CCHHHHHHHH HHHCCCC
Proposed PSONN   CCHHHHHHHH HHHHHHHHCH CCHHHHHHHH HHHCCHH
ASSESSMENT OF PREDICTION ACCURACY
Four routinely used assessment criteria were adopted here, that is, sensitivity (SN),
specificity (SP), accuracy (ACC), and AUC (area under the Receiver Operating Characteristic
curve):
$$SN = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN are the abbreviations of true positives, true negatives, false
positives, and false negatives. The experimental results are given in Table 3.
Table 3: The prediction performance of the algorithms
S.No   Methodology   SN (%)   SP (%)   ACC (%)
1      MLNN          92.55    97.17    94.28
2      PSO-NN        93.73    97.59    95.04
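The standard definitions of sensitivity, specificity, and accuracy in terms of TP, TN, FP, and FN can be computed from a pair of label strings as sketched below. How the three-class prediction was reduced to positives and negatives for Table 3 is not stated, so the one-class-versus-rest counting used here, the function names, and the toy strings are assumptions.

```python
def confusion_counts(observed, predicted, positive="H"):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have equal length")
    tp = tn = fp = fn = 0
    for o, p in zip(observed, predicted):
        if o == positive and p == positive:
            tp += 1
        elif o == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """SN = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SP = TN / (TN + FP)."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Toy strings (not data from the paper).
tp, tn, fp, fn = confusion_counts("HHHHCCEECC", "HHHCCCEEHC")
print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, tn, fp, fn))
```

Repeating the count with `positive="E"` or `positive="C"` gives the per-class values for strand and coil.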
The ROC (Receiver Operating Characteristic) curve plots the true positive rate
against the false positive rate, and the AUC is a reliable measure for evaluating performance.
Overall, PSONN performed better of the two algorithms, as can also be inferred
from Figure 6. Figure 7 depicts the 3D structure of QscR.
Figure 6. ROC Curve
Figure 7. 3D structure of QscR
COMPARISON WITH OTHER METHODS
Finally, the results obtained using the proposed methodology are compared with the performance
of other methods that determine the secondary structure of QscR. The comparison is displayed
in Table 4.
Table 4. Comparative analysis of the performance of the other methods in secondary
structure prediction
S.No   Method   Alpha (%)   Beta sheet (%)   Coil
1      DSSP     54          13               -
2      STRIDE   54          11               -
3      MLNN     68          15               4
4      PSO-NN   72          16               3
From Table 4, it is concluded that the proposed PSO-NN gives a better prediction of
secondary structure than the other topologies.
CONCLUSION
In this work, a novel PSO-based ANN method is implemented to identify the secondary
structure of a protein. The proposed predictor achieved promising results and outperformed
the MLNN predictor used for comparison. This scheme automatically tunes the neural network
using the PSO optimization technique. It achieved an accuracy of about 95% on the independent
dataset. The experimental performance indicates that the proposed method could be useful in
assisting the discovery of important protein modifications and in protein
structure prediction research.
REFERENCES
1. Rylance, G., Applications of genetic algorithms in protein folding studies, 2004, First-year
report, School of Chemistry, England.
2. Sikder, A.R. and Zomaya, A.Y., An overview of protein-folding techniques: issues and
perspectives, 2005, International Journal of Bioinformatics Research and Applications,
V-1, P-121–143.
3. B. Rost and C. Sander, Prediction of protein secondary structure at better than 70% accuracy,
1993, Journal of Molecular Biology, V-232, P-584.
4. P. Baldi, S. Brunak, P. Frasconi, Exploiting the past and the future in protein secondary
structure prediction, 1999, Journal of Bioinformatics, V-15(11), P-937.
5. G. Pollastri and A. McLysaght, Porter: a new, accurate server for protein secondary
structure prediction, 2005, Journal of Bioinformatics, V-21, P-1719.
6. P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, 2001, MIT Press,
Cambridge.
7. Hae-Jin Hu and Robert W. Harrison, Current Methods for Protein Secondary-Structure
Prediction Based on Support Vector Machine, 2007, Knowledge Discovery in
Bioinformatics: Techniques, Methods, and Applications.
8. David T. Jones, Protein Secondary Structure Prediction Based on Position-specific
Scoring Matrices, 1999, Journal of Molecular Biology, V-1, P-1-5.
9. Wang S and Peng J, Protein secondary structure prediction using deep convolutional
neural fields, 2016, Scientific Reports, V-6, P-18962.
10. Kabsch, Wolfgang and Sander, Christian, Dictionary of protein secondary structure:
pattern recognition of hydrogen-bonded and geometrical features, 1983, Journal of
Biopolymers, V-22(12), P-2577–2637.
11. Schmidler SC, Liu JS, Brutlag LD, Bayesian segmentation of protein secondary structure,
2000, Journal of Computational Biology, V-7(1-2), P-233–48.
12. Chu W, Ghahramani Z, A graphical model for protein secondary structure prediction,
2004, Proceedings of the 21st Annual International Conference on Machine Learning (ICML),
New York: ACM, P-161–168.
13. Hua S, Sun Z, A novel method of protein secondary structure prediction with high
segment overlap measure: support vector machine approach, 2001, Journal of Molecular
Biology, V-308(2), P-397–407.
14. Guo J, Chen H, Sun Z, A novel method for protein secondary structure prediction using
dual-layer SVM and profiles, 2004, Proteins: Structure, Function, and Bioinformatics,
V-54(4), P-738–743.
15. Asai K and Hayamizu S, Prediction of protein secondary structure by the hidden
Markov model, 1993, Journal of Bioinformatics, V-9(2), P-141.
16. Aydin Z, Altunbasak Y, Protein secondary structure prediction for a single-sequence
using hidden semi-Markov models, 2006, Journal of Bioinformatics, V-7(1), P-178.
17. Qian N and Sejnowski TJ, Predicting the secondary structure of globular proteins using
neural network models, 1988, Journal of Molecular Biology, V-202(4), P-865–884.
18. Jones DT, Protein secondary structure prediction based on position-specific scoring
matrices, 1999, Journal of Molecular Biology, V-292(2), P-195.
19. Buchan DW, Minneci F, Nugent TC, Scalable web services for the PSIPRED protein
analysis workbench, 2013, Journal of Nucleic Acids Research, V-41(W1), P-W349–W357.
20. Faraggi E, et al., SPINE X: improving protein secondary structure prediction by multistep
learning coupled with prediction of solvent accessible surface area and backbone torsion
angles, 2012, Journal of Computational Chemistry, V-33(3), P-259–67.
21. Baldi P, Brunak S, Frasconi P, Exploiting the past and the future in protein secondary
structure prediction, 1999, Journal of Bioinformatics, V-15(11), P-937–946.
22. R. Mendes, J. Kennedy, The fully informed particle swarm: simpler, maybe better, 2004,
IEEE Transactions on Evolutionary Computation, V-8(3), P-204–210.
23. C.H. Yang, Y.S. Lin, A particle swarm optimization-based approach with local search for
predicting protein folding, 2017, Journal of Computational Biology, V-24(10), P-981–994.
24. Wilke, D.N., Analysis of the Particle Swarm Optimization Algorithm, Master dissertation,
University of Pretoria, 2005.
25. M. Geis and M. Middendorf, Particle swarm optimization for finding RNA secondary
structures, 2011, International Journal of Intelligent Computing and Cybernetics, V-4(2),
P-160–186.