2009/2010
FEED-FORWARD NEURAL NETWORK FOR
PROTEIN STRUCTURE PREDICTION
Genetic paradigm: DNA Watson-Crick model and genetic code
The answer comes from three concepts: the DNA Watson-Crick model (DNA-WC), the genetic paradigm (GP), and the genetic code (GC).
Living systems cannot be imagined without the biological molecules DNA and protein.
DNA resides in the nucleus and consists of two strands linked by complementary bases.
Proteins reside in the cytoplasm and are formed from codons, each made of three DNA bases.
According to the genetic code, one codon specifies one amino acid (AA), and these AAs form a peptide chain.
Compound : alpha-amylase
Sequence : TPTTFVHLFEWNWQDVAQECEQYLGPKGYAAVQVSPPNEHITGSQWWTRYQPVSYELQSRGGNRAQFIDMVNRCSAAGVDIYVDTLINHMAAGSGTGTAGNSFGNKSFPIYSPQDFHESCTINNSDYGNDRYRVQNCELVGLADLDTASNYVQNTIAAYINDLQAIGVKGFRFDASKHVAASDIQSLMAKVNGSPVVFQEVIDQGGEAVGASEYLSTGLVTEFKYSTELGNTFRNGSLAWLSNFGEGWGFMPSSSAVVFVDNHDNQRGHGGAGNVITFEDGRLYDLANVFMLAYPYGYPKVMSSYDFHGDTDAGGPNVPVHNNGNLECFASNWKCEHRWSYIAGGVDFRNNTADNWAVTNWWDNTNNQISFGRGSSGHMAINKEDSTLTATVQTDMASGQYCNVLKGELSADAKSCSGEVITVNSDGTINLNIGAWDAMAIHKNAKLN
Amino Acids 1-letter symbol Code Binary representation
Alanine A 01 10000000000000000000
Cysteine C 02 01000000000000000000
Aspartate D 03 00100000000000000000
Glutamate E 04 00010000000000000000
Phenylalanine F 05 00001000000000000000
Glycine G 06 00000100000000000000
Histidine H 07 00000010000000000000
Isoleucine I 08 00000001000000000000
Lysine K 09 00000000100000000000
Leucine L 10 00000000010000000000
Methionine M 11 00000000001000000000
Asparagine N 12 00000000000100000000
Proline P 13 00000000000010000000
Glutamine Q 14 00000000000001000000
Arginine R 15 00000000000000100000
Serine S 16 00000000000000010000
Threonine T 17 00000000000000001000
Valine V 18 00000000000000000100
Tryptophan W 19 00000000000000000010
Tyrosine Y 20 00000000000000000001
In the peptide chain each AA is encoded by its 1-letter symbol, and together they give one long PRIMARY PROTEIN SEQUENCE.
Thr Pro Thr
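As an illustration, the 1-letter-symbol to 20-bit mapping of the table above can be expressed in MATLAB (a minimal sketch; aa2onehot is our own illustrative helper, not part of the original code):

% Minimal sketch (assumed helper): map a 1-letter amino acid symbol to its
% 20-bit binary representation from the table above.
symbols = 'ACDEFGHIKLMNPQRSTVWY';          % order matches number codes 01-20
aa2onehot = @(ch) double(symbols == ch);   % 1 at the symbol's position
v = aa2onehot('T');                        % Threonine -> 00000000000000001000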
3D SPATIAL ARRANGEMENT
Compound : alpha-amylase
Sequence : TPTTFVHLFEWNWQDVAQECEQYLKVNGSPVVFQEVIDQGGEAVGASEYLSTGLVTEFKYSTELGNTFRNGSLAWLSNFGEGWGFMPSSSAVVFVDNHDNQRGHGGAGNVITFEDGRLYDLANVFMLAYPYGYPKVMSSYDFHGDTDAGGPNVPVHNNGNLECFASNWKCEHRWSYIAGGVDFRNNTADNWAVTNWWDNTNNQISFGRGSSGHMAINKEDSTLTATVQTDMASGQYCNVLKGELSADAKSCSGEVITVNSDGTINLNIGAWDA
At the ribosome the entire primary protein chain folds into a unique 3D spatial arrangement, and so we
get a tertiary protein structure.
Protein Structure Prediction Problem
1. 3D protein structures?
2. Relationship between primary and tertiary protein structure?
3. Experimental methods?
4. Automated methods?
5. Bridge between primary and 3D structure?
• It is known that the 3D structure of a protein determines its function and behaviour. The unique 3D structure is determined by the primary sequence of the protein, and this sequence is determined by the structure of DNA. In this way DNA controls the development of our functions and hereditary characteristics.
• Determining the relationship between the primary protein structure and its 3D structure is the central problem of contemporary molecular biology.
• Primary protein sequences are becoming available at a rapid rate, but determining a 3D protein structure with experimental methods is very expensive, takes a long time, and requires experts from different fields. The number of solved 3D protein structures is only a tiny fraction of the total number of known proteins.
• It is clear that we need automated methods for predicting 3D structures from primary structures.
• Since the rules of protein folding are largely unknown and the general problem of predicting 3D structures is unsolved, AI offers a bridge over the gap between primary structure and 3D structure, satisfying at least part of our need.
What is the bridge which AI offers?
Secondary structures (DSSP classes) | SIGN
Alpha helix | H
3-helix | G
5-helix (pi helix) | I
Beta sheet | E
Isolated beta-bridge | B
Hydrogen bonded turn | T
Bend | S
Other types | C
UNKNOWN SECONDARY PROTEIN STRUCTURE?
Known primary and secondary protein structures
That bridge is the set of DSSP classes: AI offers a neural network, trained on known primary and secondary structures, for predicting the unknown secondary
protein structure from a known primary structure.
KNOWN PRIMARY STRUCTURE: TPVHFE QWDAQEC AVQW
FROM PRIMARY TO QUATERNARY STRUCTURE
primary structure
secondary structure
tertiary structure
quaternary structure
So we have a chance to bridge primary and tertiary structure with the secondary structure on our way
toward the quaternary structure.
Protein Secondary Structure Prediction and Corresponding Software Environments
• Artificial neural network?
For this purpose we have developed an algorithm based on the following steps: EPE (using API-EPE), DID & DOD (using MATLAB), NN (using NN-TOOLBOX).
API-EPE: Extraction & Preparing & Encoding of Data Examples
Block diagram of the EPE part:
1. Searching for a protein sequence database -> PDBFIND2.txt
2. Extraction of amino acid sequences and corresponding secondary structures -> SeparatedSekStruk.txt
3. Elimination of unimportant parts of amino acid sequences and related secondary structures -> Cleared.txt
4. Encoding of extracted data into numeric patterns -> EnCoded.txt
5. Separation of input data and output data -> InputTrenSet.txt, OutputTrenSet.txt
6. Determination of representative training data files -> InputData.txt, OutputData.txt
Determination of NN Training Sets - Matrices (MATLAB)
To make the pattern matrix from InputData.txt:
pattern_temp = prepare_data('InputData.txt')
pattern = prepare_pattern(pattern_temp, wsize)
To make the target matrix from OutputData.txt:
target_temp = prepare_data('OutputData.txt')
target = prepare_target(target_temp, wsize)
Design and Training of Neural Network (MATLAB & NN-TOOLBOX)
net = create_nn(no_hidden_node, algorithm, no_epochs)
net = train(net, pattern, target)
pattern = known primary sequences; target = known secondary structures.
1. In the training process an improved backpropagation algorithm (momentum and adaptive learning rate) was used, for different window sizes, different numbers of neurons in the hidden layer, different training sets, and different numbers of epochs.
2. The network output is compared to the desired output, and the weights are changed in the direction that minimizes the difference between the actual and the desired output.
3. The main goal of the network is to minimize the total error E (SSE) over each output node j and all training examples p:
E = Σ_p Σ_j (T_j − O_j)²
where T_j is the desired and O_j the actual output of node j for training example p.
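The slides do not show the body of create_nn. The following is a minimal sketch assuming the classic NN-TOOLBOX interface (newff), with 'traingdx' standing in for the momentum + adaptive learning rate backpropagation named in step 1; the window size 17 inside is our assumption:

% Sketch of create_nn (assumed implementation, not the original code).
% Inputs: 20-bit code for each of the wsize window positions;
% outputs: 3 nodes for helix, strand, other structure.
function net = create_nn(no_hidden_node, algorithm, no_epochs)
    wsize = 17;                               % assumed window size
    R = 20 * wsize;                           % number of binary inputs
    net = newff([zeros(R,1) ones(R,1)], ...  % input ranges: binary [0,1]
                [no_hidden_node 3], ...       % hidden layer + 3 output nodes
                {'logsig', 'logsig'}, ...     % sigmoid activations
                algorithm);                   % e.g. 'traingdx'
    net.trainParam.epochs = no_epochs;        % e.g. 250
    net.performFcn = 'sse';                   % sum squared error E above
end

Training then proceeds with net = train(net, pattern, target), as shown above.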
Determination of SSP and Performance Evaluation of Neural Network
accuracy(net, pattern, target)   (inputs: trained net, pattern, target)
In the previous step we obtained a trained NN with the wanted (set) SSE.
The second step is performance evaluation on the same training set; the percentage of correctly classified patterns is the Q3 statistic.
The third step is performance evaluation on a test set (different from the training set); the percentage of correctly classified patterns is again the Q3 statistic.
After all that, the network is given a sequence of AA windows (vectors); the goal is to correctly predict the SSP for the middle AA of each input window.
Example input window (encoded): 12 8 5 4 11 10 15 0 1 1 9 16 0 6 18 8 17 -> predicted secondary structure of the central AA: H
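The body of accuracy is likewise not shown in the slides; a minimal sketch of the Q3 computation, assuming winner-take-all classification over the three output nodes:

% Sketch of accuracy (assumed implementation): Q3 = percentage of windows
% whose predicted class matches the known secondary structure class.
function q3 = accuracy(net, pattern, target)
    out = sim(net, pattern);                % 3 x N network outputs
    [~, predicted] = max(out, [], 1);       % winner-take-all per column
    [~, known]     = max(target, [], 1);    % known class per column
    q3 = 100 * sum(predicted == known) / numel(known);
end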
TEST RESULTS FOR DIFFERENT SIZES OF WINDOWS
No. | Training/test set names | Window size | Sum square error | Q3% (training set) | Q3% (test set)
1 | trainsetw11/testsetw11 | 11 | 0.172551 | 60.7346 | 59.1385
2 | trainsetw13/testsetw13 | 13 | 0.160897 | 63.8356 | 60.4760
3 | trainsetw15/testsetw15 | 15 | 0.165837 | 63.0606 | 60.7517
4 | trainsetw17/testsetw17 | 17 | 0.159978 | 65.2184 | 60.9295
5 | trainsetw19/testsetw19 | 19 | 0.162065 | 64.3420 | 60.5499
6 | trainsetw21/testsetw21 | 21 | 0.160577 | 64.3860 | 60.0121
7 | trainsetw23/testsetw23 | 23 | 0.155637 | 65.3273 | 60.5967
Average value Q3 | | | | 63.8436 | 60.3506
Algorithm parameters | Parameter values
Window size | 11, 13, 15, 17, 19, 21, 23
Number of neurons in hidden layer | 5
Size of training set | 50
Number of training epochs | 250
Window sizes from 13 to 23 have little influence on the accuracy of the neural network prediction; with window sizes below 13 the prediction is worse.
TEST RESULTS FOR DIFFERENT NUMBERS OF NEURONS IN HIDDEN LAYER
No. | Neurons in hidden layer | Sum square error | Q3% (training set) | Q3% (test set)
1 | 1 | 0.263599 | 43.3307 | 39.5450
2 | 2 | 0.197537 | 52.9875 | 51.8153
3 | 3 | 0.181233 | 58.8066 | 54.6927
4 | 4 | 0.170486 | 61.5835 | 59.1820
5 | 5 | 0.159978 | 65.2184 | 60.9295
6 | 6 | 0.157229 | 65.3120 | 60.5227
7 | 7 | 0.161042 | 65.3120 | 60.5227
8 | 8 | 0.159854 | 64.9532 | 61.0199
9 | 9 | 0.159005 | 64.9766 | 60.9220
10 | 10 | 0.159066 | 64.2980 | 60.3194
11 | 11 | 0.155174 | 65.9906 | 60.6282
12 | 12 | 0.153532 | 65.9438 | 60.7487
13 | 13 | 0.158285 | 65.2418 | 60.8843
Average values for Q3 | | | 61.8427 | 57.8256
Algorithm parameters | Parameter values
Window size | 17
Neurons in hidden layer | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
Size of training set | 50
Number of training epochs | 250
Between 5 and 13 neurons, the number of hidden-layer neurons has little influence on the accuracy of the neural network prediction.
Results are much worse with fewer than 4 neurons in the hidden layer.
TEST RESULTS FOR DIFFERENT SIZES OF TRAINING SETS
No. | Training set | Number of proteins | Number of patterns | Sum square error | Q3% (training set) | Q3% (test set)
1 | trainset20 | 20 | 5393 | 0.143151 | 69.4233 | 57.5324
2 | trainset30 | 30 | 8089 | 0.148417 | 67.7710 | 59.3100
3 | trainset40 | 40 | 10786 | 0.154081 | 66.2155 | 60.0633
4 | trainsetw17 | 50 | 12820 | 0.159854 | 64.9532 | 61.0199
5 | trainset60 | 60 | 16178 | 0.157511 | 64.5444 | 60.9295
6 | trainset80 | 80 | 21572 | 0.159735 | 64.2390 | 61.2685
7 | trainset100 | 100 | 26964 | 0.163200 | 63.1546 | 61.7053
8 | trainset125 | 125 | 33705 | 0.163603 | 63.1835 | 62.0142
9 | trainset150 | 150 | 40446 | 0.163908 | 63.0989 | 62.0066
10 | trainset200 | 200 | 53928 | 0.164072 | 62.9098 | 62.2627
Average value for Q3 | | | | | 64.9493 | 60.8112
Algorithm parameters | Parameter values
Window size | 17
Neurons in hidden layer | 8
Size of training set | 20, 30, 40, 50, 60, 80, 100, 125, 150, 200
Number of training epochs | 250
The training sets differ in the number of protein sequences (number of proteins). Adding input patterns to the training set improves the test-set prediction by more than 1%.
TEST RESULTS FOR DIFFERENT NUMBERS OF EPOCHS
No. | Number of epochs | Sum square error | Q3% (training set) | Q3% (test set)
1 | 100 | 0.247866 | 41.0659 | 39.4622
2 | 150 | 0.199955 | 51.5317 | 50.3842
3 | 200 | 0.164233 | 62.9691 | 62.1874
4 | 250 | 0.164072 | 62.9098 | 62.2627
5 | 300 | 0.163579 | 63.0619 | 62.2175
6 | 500 | 0.161417 | 63.6590 | 62.8277
7 | 1000 | 0.156456 | 65.1053 | 63.0838
8 | 2000 | 0.151204 | 66.6314 | 63.6261
9 | 2500 | 0.151015 | 66.7909 | 63.6035
Average value for Q3 | | | 60.4139 | 58.8506
Algorithm parameters | Parameter values
Window size | 17
Neurons in hidden layer | 8
Size of training set | 200
Number of training epochs | 100, 150, ..., 1000, 2000, 2500
Increasing the number of epochs improves prediction accuracy by more than 1%. We can notice that results get worse beyond 2000 epochs because of overtraining of the NN.
Parameter values for which we have the best prediction accuracy:
Algorithm parameters | Parameter values
Window size | 17
Number of neurons in hidden layer | 8
Training set | 200
Epoch number | 2000
Q3 = 63.6261 %
Comparison with other methods
Method | Q3%
Chou & Fasman (1978) | 50
Lim (1974) | 50
Robson (1978) | 53
Levin (1986) | 59.7
Sejnowski (1988), net1 | 62.7
Qian & Sejnowski (1988), net2 | 64.3
Chandonia & Karplus (1995) | 73.9
Chandonia & Karplus (1996) | 80.2
Avdagic & Purisevic (2005) | 62
Avdagic & Purisevic (this work) | 63.6261
Discussion: Have we finally succeeded?
The bad news is: we still cannot predict the structure for an arbitrary sequence. The good news is: we have come closer to our goal, and growing databases facilitate the task. Because of that we propose our strategy in this work.
Our strategy: an orchestra of different scientists (biochemists, structural biologists, biophysicists, computer scientists) and a fusion of different methods.
We would then be able to better understand the mechanisms of life, biodiversity, and evolution in general. Because of that we need knowledge about protein structure, functions, and the relationship to the genes.
Steps 1 to 7 are repeated, first on radial basis networks and then on recurrent networks.
// On my watch it's Sat Feb 14 02:21:23 2004.
//
// When using this database, please cite:
// PDBFinderII - a database for protein structure analysis and prediction
// Krieger,E., Hooft,R.W.W., Nabuurs,S., Vriend,G. (2004) Submitted
//
ID : 101M
Header : OXYGEN TRANSPORT
Date : 1998-04-08
Compound : myoglobin
Compound : Mutant
Source : (physeter catodon)
Source : sperm whale
Water-Mols : 138
Sequence : MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
DSSP : CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
Nalign : 4 5 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
From the PDBFIND2.txt database we must extract the AA sequences and the corresponding secondary structures.
Sequence : MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
DSSP : CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
Sequence : MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGH
DSSP : CCHHHHHHHHHCCEEEEEECTTSCEEEETTE
Sequence : MSNTLFDDIFQVSEVDPGRYNKVCRIEAAST
DSSP : ???????????????????????????????
Sequence : E-XX
DSSP : C-CC
SeparatedSekStruk.txt
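The extraction itself is done by API-EPE; purely as an illustration, a MATLAB sketch of the same step (field labels assumed from the record shown above):

% Illustrative sketch only (the original step uses API-EPE): copy every
% 'Sequence :' and 'DSSP :' field from PDBFIND2.txt into SeparatedSekStruk.txt.
fin  = fopen('PDBFIND2.txt', 'r');
fout = fopen('SeparatedSekStruk.txt', 'w');
line = fgetl(fin);
while ischar(line)
    t = strtrim(line);
    if strncmp(t, 'Sequence :', 10) || strncmp(t, 'DSSP :', 6)
        fprintf(fout, '%s\n', t);
    end
    line = fgetl(fin);
end
fclose(fin);
fclose(fout);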
We need to eliminate the unimportant parts of the amino acid sequences and the related secondary structures:
MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGH
CCHHHHHHHHHCCEEEEEECTTSCEEEETTE
Cleared.txt
Then we need to encode the extracted data into numeric patterns.
Amino acids | 1-letter code | Number codes
Alanine | A | 01
Cysteine | C | 02
Aspartate | D | 03
Glutamate | E | 04
Phenylalanine | F | 05
Glycine | G | 06
Histidine | H | 07
Isoleucine | I | 08
Lysine | K | 09
Leucine | L | 10
Methionine | M | 11
Asparagine | N | 12
Proline | P | 13
Glutamine | Q | 14
Arginine | R | 15
Serine | S | 16
Threonine | T | 17
Valine | V | 18
Tryptophan | W | 19
Tyrosine | Y | 20
Secondary structure codes
Codes | Structure used in our algorithm | DSSP class
01 | α-helix | H, G
02 | β-strand | E
03 | other structure | B, I, S, T, C, L
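A minimal sketch of this 8-class-to-3-code reduction (dssp2code is our assumed helper name):

% Sketch (assumed helper): collapse a DSSP class letter into the 3-code
% scheme of the table above.
function code = dssp2code(c)
    if c == 'H' || c == 'G'      % alpha helix, 3-helix
        code = 1;
    elseif c == 'E'              % beta strand
        code = 2;
    else                         % B, I, S, T, C, ...
        code = 3;
    end
end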
11 18 10 16 04 06 06 19 14 10 07 18
03 03 03 03 01 01 01 01 01 01 01 01
11 12 08 05 04 11 10 15 08 03 04 06
03 03 01 01 01 01 01 01 01 01 01 03
EnCoded.txt
Extraction & Preparing & Encoding of Data Examples
Cleared.txt:
MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGH
CCHHHHHHHHHCCEEEEEECTTSCEEEETTE
InputTrenSet.txt:
11 18 10 16 04 06 06 19 14 10 07 18 19 01 09 18 04 01 03 18 01
11 12 08 05 04 11 10 15 08 03 04 06 10 15 10 09 08 20 09 03 17
OutputTrenSet.txt:
03 03 03 03 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
03 03 01 01 01 01 01 01 01 01 01 03 03 02 02 02 02 02 02 03 03
InputData.txt (example of three protein chains):
11 12 08 05 04 11 10 15
01 01 09 16
06 18 18 17 09 03 04 01 04 09 10 05
The function prepare_data produces this string (pattern_temp):
11 12 8 5 4 11 10 15 0 1 1 9 16 0 6 18 18 17 9 3 4 1 4 9 10 5
It returns a vector which contains the sequences of AAs separated by the number 0.
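The slides show only this behaviour of prepare_data; a sketch of a consistent implementation (assumed, not the original code):

% Sketch of prepare_data (assumed implementation): read one encoded chain
% per line and return a single row vector with the chains separated by 0.
function v = prepare_data(filename)
    fid = fopen(filename, 'r');
    v = [];
    line = fgetl(fid);
    while ischar(line)
        v = [v sscanf(line, '%d')' 0];   %#ok<AGROW> chain codes + separator
        line = fgetl(fid);
    end
    fclose(fid);
    v(end) = [];                         % drop the trailing separator
end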
Design of the input pattern matrix taken from the string of encoded protein sequences: transforming the 1-number code into the 20-number code.
A window is a short segment of a complete protein string; in the middle of it is the amino acid for which we want to predict the secondary structure. This window moves through the protein one amino acid at a time.
Our prediction is made for the central AA; if there is a 0 at that position of the window, our function prepare_pattern does not permit that window to be placed into the pattern matrix.
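Putting this together, a minimal sketch of prepare_pattern consistent with the behaviour just described (assumed implementation; the original body is not in the slides):

% Sketch of prepare_pattern (assumed): slide a window of wsize codes over the
% vector from prepare_data, skip windows whose central position is the chain
% separator 0, and expand every 1-number code into its 20-bit representation.
function pattern = prepare_pattern(pattern_temp, wsize)
    half = floor(wsize / 2);
    pattern = [];
    for c = half+1 : numel(pattern_temp)-half
        if pattern_temp(c) == 0          % no prediction for the separator
            continue
        end
        win = pattern_temp(c-half : c+half);
        col = zeros(20 * wsize, 1);      % one column of the pattern matrix
        for k = 1:wsize
            if win(k) > 0                % a 0 inside the window stays all-zero
                col(20*(k-1) + win(k)) = 1;
            end
        end
        pattern = [pattern col];         %#ok<AGROW> fine for a sketch
    end
end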
DESIGN AND TRAINING OF NEURAL NETWORK
OutputData.txt
From the file OutputData.txt we have the secondary structure sequences; these SSS correspond to the three AA sequences in the file InputData.txt.
The function prepare_data produces from them one string (target_temp): a vector which contains the SS sequences separated by the number 0.
The function prepare_target eliminates from this output string those SS codes which correspond to AAs for which we make no prediction (the first and last AAs in the input vector).
In addition, prepare_target transforms the 1-number code of the remaining SSs into a 3-number code; the result is the target matrix, whose column vectors indicate helix, strand, or other structure.
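Under the same assumptions, a sketch of prepare_target (it mirrors the loop in prepare_pattern so the columns of target and pattern stay aligned):

% Sketch of prepare_target (assumed): keep only the SS codes whose window
% survived in prepare_pattern, then expand 1/2/3 into a 3-bit column
% (helix, strand, other).
function target = prepare_target(target_temp, wsize)
    half = floor(wsize / 2);
    target = [];
    for c = half+1 : numel(target_temp)-half
        if target_temp(c) == 0           % chain separator: no prediction
            continue
        end
        col = zeros(3, 1);
        col(target_temp(c)) = 1;         % 1 -> helix, 2 -> strand, 3 -> other
        target = [target col];           %#ok<AGROW>
    end
end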
DESIGN AND TRAINING OF NEURAL NETWORK
MATRIX-VECTOR NOTATION OF A THREE-LAYER NEURAL NETWORK
[Figure: network diagram with Input, Layer 1, Layer 2, Layer 3]