2009/2010
FEED-FORWARD NEURAL NETWORK FOR
PROTEIN STRUCTURE PREDICTION
Genetic paradigm: DNA Watson-Crick model and genetic code
The answer comes from three concepts: the DNA Watson-Crick model (DNA-WC), the genetic paradigm (GP), and the genetic code (GC).
Living systems cannot be imagined without the biological molecules DNA and protein.
DNA resides in the nucleus and consists of two strands linked by complementary bases.
Proteins reside in the cytoplasm and are formed from codons, each made of three DNA bases.
According to the genetic code, one codon specifies one amino acid (AA), and these AAs form a peptide chain.
Compound : alpha-amylase
Sequence : TPTTFVHLFEWNWQDVAQECEQYLGPKGYAAVQVSPPNEHITGSQWWTRYQPVSYELQSRGGNRAQFIDMVNRCSAAGVDIYVDTLINHMAAGSGTGTAGNSFGNKSFPIYSPQDFHESCTINNSDYGNDRYRVQNCELVGLADLDTASNYVQNTIAAYINDLQAIGVKGFRFDASKHVAASDIQSLMAKVNGSPVVFQEVIDQGGEAVGASEYLSTGLVTEFKYSTELGNTFRNGSLAWLSNFGEGWGFMPSSSAVVFVDNHDNQRGHGGAGNVITFEDGRLYDLANVFMLAYPYGYPKVMSSYDFHGDTDAGGPNVPVHNNGNLECFASNWKCEHRWSYIAGGVDFRNNTADNWAVTNWWDNTNNQISFGRGSSGHMAINKEDSTLTATVQTDMASGQYCNVLKGELSADAKSCSGEVITVNSDGTINLNIGAWDAMAIHKNAKLN
Amino Acids 1-letter symbol Code Binary representation
Alanine A 01 10000000000000000000
Cysteine C 02 01000000000000000000
Aspartate D 03 00100000000000000000
Glutamate E 04 00010000000000000000
Phenylalanine F 05 00001000000000000000
Glycine G 06 00000100000000000000
Histidine H 07 00000010000000000000
Isoleucine I 08 00000001000000000000
Lysine K 09 00000000100000000000
Leucine L 10 00000000010000000000
Methionine M 11 00000000001000000000
Asparagine N 12 00000000000100000000
Proline P 13 00000000000010000000
Glutamine Q 14 00000000000001000000
Arginine R 15 00000000000000100000
Serine S 16 00000000000000010000
Threonine T 17 00000000000000001000
Valine V 18 00000000000000000100
Tryptophan W 19 00000000000000000010
Tyrosine Y 20 00000000000000000001
In the peptide chain each AA is encoded by its 1-letter symbol, and together they give one long PRIMARY PROTEIN SEQUENCE.
Thr Pro Thr
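As an illustration, the 1-letter-symbol to 20-bit mapping of the table above can be expressed in MATLAB (a minimal sketch; aa2onehot is our own illustrative helper, not part of the original code):

% Minimal sketch (assumed helper): map a 1-letter amino acid symbol to its
% 20-bit binary representation from the table above.
symbols = 'ACDEFGHIKLMNPQRSTVWY';          % order matches number codes 01-20
aa2onehot = @(ch) double(symbols == ch);   % 1 at the symbol's position
v = aa2onehot('T');                        % Threonine -> 00000000000000001000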
3D SPATIAL ARRANGEMENT
Compound : alpha-amylase
Sequence : TPTTFVHLFEWNWQDVAQECEQYLKVNGSPVVFQEVIDQGGEAVGASEYLSTGLVTEFKYSTELGNTFRNGSLAWLSNFGEGWGFMPSSSAVVFVDNHDNQRGHGGAGNVITFEDGRLYDLANVFMLAYPYGYPKVMSSYDFHGDTDAGGPNVPVHNNGNLECFASNWKCEHRWSYIAGGVDFRNNTADNWAVTNWWDNTNNQISFGRGSSGHMAINKEDSTLTATVQTDMASGQYCNVLKGELSADAKSCSGEVITVNSDGTINLNIGAWDA
At the ribosome the entire primary protein chain folds into a unique 3D spatial arrangement, and so we
get a tertiary protein structure.
Protein Structure Prediction Problem
1. 3D protein structures?
2. Relationship between primary and tertiary protein structure?
3. Experimental methods?
4. Automated methods?
5. Bridge between primary and 3D structure?
• It is known that the 3D structure of a protein determines its function and behaviour. The unique 3D structure is determined by the primary sequence of the protein, and this sequence is determined by the structure of DNA. In this way DNA controls the development of our functions and hereditary characteristics.
• Determining the relationship between the primary protein structure and its 3D structure is the central problem of contemporary molecular biology.
• Primary protein sequences are becoming available at a rapid rate, but determining a 3D protein structure with experimental methods is very expensive, takes a long time, and requires experts from different fields. The number of solved 3D protein structures is only a tiny fraction of the total number of known proteins.
• It is clear that we need automated methods for predicting 3D structures from primary structures.
• Since the rules of protein folding are largely unknown and the general problem of predicting 3D structures is unsolved, AI offers a bridge over the gap between primary structure and 3D structure, satisfying at least part of our need.
What is the bridge which AI offers?
Secondary structures (DSSP classes) | SIGN
Alpha helix | H
3-helix | G
5-helix (pi helix) | I
Beta sheet | E
Isolated beta-bridge | B
Hydrogen bonded turn | T
Bend | S
Other types | C
UNKNOWN SECONDARY PROTEIN STRUCTURE?
Known primary and secondary protein structures
That bridge is the set of DSSP classes: AI offers a neural network, trained on known primary and secondary structures, for predicting the unknown secondary
protein structure from a known primary structure.
KNOWN PRIMARY STRUCTURE: TPVHFE QWDAQEC AVQW
FROM PRIMARY TO QUATERNARY STRUCTURE
primary structure
secondary structure
tertiary structure
quaternary structure
So we have a chance to bridge primary and tertiary structure with the secondary structure on our way
toward the quaternary structure.
Protein Secondary Structure Prediction and Corresponding Software Environments
• Artificial neural network?
For this purpose we have developed an algorithm based on the following steps: EPE (using API-EPE), DID & DOD (using MATLAB), NN (using NN-TOOLBOX).
API-EPE: Extraction & Preparing & Encoding of Data Examples
Block diagram of the EPE part:
1. Searching for a protein sequence database -> PDBFIND2.txt
2. Extraction of amino acid sequences and corresponding secondary structures -> SeparatedSekStruk.txt
3. Elimination of unimportant parts of amino acid sequences and related secondary structures -> Cleared.txt
4. Encoding of extracted data into numeric patterns -> EnCoded.txt
5. Separation of input data and output data -> InputTrenSet.txt, OutputTrenSet.txt
6. Determination of representative training data files -> InputData.txt, OutputData.txt
Determination of NN Training Sets - Matrices (MATLAB)
To make the pattern matrix from InputData.txt:
pattern_temp = prepare_data('InputData.txt')
pattern = prepare_pattern(pattern_temp, wsize)
To make the target matrix from OutputData.txt:
target_temp = prepare_data('OutputData.txt')
target = prepare_target(target_temp, wsize)
Design and Training of Neural Network (MATLAB & NN-TOOLBOX)
net = create_nn(no_hidden_node, algorithm, no_epochs)
net = train(net, pattern, target)
pattern = known primary sequences; target = known secondary structures.
1. In the training process an improved backpropagation algorithm (momentum and adaptive learning rate) was used, for different window sizes, different numbers of neurons in the hidden layer, different training sets, and different numbers of epochs.
2. The network output is compared to the desired output, and the weights are changed in the direction that minimizes the difference between the actual and the desired output.
3. The main goal of the network is to minimize the total error E (SSE) over each output node j and all training examples p:
E = Σ_p Σ_j (T_j − O_j)²
where T_j is the desired and O_j the actual output of node j for training example p.
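The slides do not show the body of create_nn. The following is a minimal sketch assuming the classic NN-TOOLBOX interface (newff), with 'traingdx' standing in for the momentum + adaptive learning rate backpropagation named in step 1; the window size 17 inside is our assumption:

% Sketch of create_nn (assumed implementation, not the original code).
% Inputs: 20-bit code for each of the wsize window positions;
% outputs: 3 nodes for helix, strand, other structure.
function net = create_nn(no_hidden_node, algorithm, no_epochs)
    wsize = 17;                               % assumed window size
    R = 20 * wsize;                           % number of binary inputs
    net = newff([zeros(R,1) ones(R,1)], ...  % input ranges: binary [0,1]
                [no_hidden_node 3], ...       % hidden layer + 3 output nodes
                {'logsig', 'logsig'}, ...     % sigmoid activations
                algorithm);                   % e.g. 'traingdx'
    net.trainParam.epochs = no_epochs;        % e.g. 250
    net.performFcn = 'sse';                   % sum squared error E above
end

Training then proceeds with net = train(net, pattern, target), as shown above.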
Determination of SSP and Performance Evaluation of Neural Network
accuracy(net, pattern, target)   (inputs: trained net, pattern, target)
In the previous step we obtained a trained NN with the wanted (set) SSE.
The second step is performance evaluation on the same training set; the percentage of correctly classified patterns is the Q3 statistic.
The third step is performance evaluation on a test set (different from the training set); the percentage of correctly classified patterns is again the Q3 statistic.
After all that, the network is given a sequence of AA windows (vectors); the goal is to correctly predict the SSP for the middle AA of each input window.
Example input window (encoded): 12 8 5 4 11 10 15 0 1 1 9 16 0 6 18 8 17 -> predicted secondary structure of the central AA: H
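The body of accuracy is likewise not shown in the slides; a minimal sketch of the Q3 computation, assuming winner-take-all classification over the three output nodes:

% Sketch of accuracy (assumed implementation): Q3 = percentage of windows
% whose predicted class matches the known secondary structure class.
function q3 = accuracy(net, pattern, target)
    out = sim(net, pattern);                % 3 x N network outputs
    [~, predicted] = max(out, [], 1);       % winner-take-all per column
    [~, known]     = max(target, [], 1);    % known class per column
    q3 = 100 * sum(predicted == known) / numel(known);
end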
TEST RESULTS FOR DIFFERENT SIZES OF WINDOWS
No. | Training/test set names | Window size | Sum square error | Q3% (training set) | Q3% (test set)
1 | trainsetw11/testsetw11 | 11 | 0.172551 | 60.7346 | 59.1385
2 | trainsetw13/testsetw13 | 13 | 0.160897 | 63.8356 | 60.4760
3 | trainsetw15/testsetw15 | 15 | 0.165837 | 63.0606 | 60.7517
4 | trainsetw17/testsetw17 | 17 | 0.159978 | 65.2184 | 60.9295
5 | trainsetw19/testsetw19 | 19 | 0.162065 | 64.3420 | 60.5499
6 | trainsetw21/testsetw21 | 21 | 0.160577 | 64.3860 | 60.0121
7 | trainsetw23/testsetw23 | 23 | 0.155637 | 65.3273 | 60.5967
Average value Q3 | | | | 63.8436 | 60.3506
Algorithm parameters | Parameter values
Window size | 11, 13, 15, 17, 19, 21, 23
Number of neurons in hidden layer | 5
Size of training set | 50
Number of training epochs | 250
Window sizes from 13 to 23 have little influence on the accuracy of the neural network prediction; with window sizes below 13 the prediction is worse.
TEST RESULTS FOR DIFFERENT NUMBERS OF NEURONS IN HIDDEN LAYER
No. | Neurons in hidden layer | Sum square error | Q3% (training set) | Q3% (test set)
1 | 1 | 0.263599 | 43.3307 | 39.5450
2 | 2 | 0.197537 | 52.9875 | 51.8153
3 | 3 | 0.181233 | 58.8066 | 54.6927
4 | 4 | 0.170486 | 61.5835 | 59.1820
5 | 5 | 0.159978 | 65.2184 | 60.9295
6 | 6 | 0.157229 | 65.3120 | 60.5227
7 | 7 | 0.161042 | 65.3120 | 60.5227
8 | 8 | 0.159854 | 64.9532 | 61.0199
9 | 9 | 0.159005 | 64.9766 | 60.9220
10 | 10 | 0.159066 | 64.2980 | 60.3194
11 | 11 | 0.155174 | 65.9906 | 60.6282
12 | 12 | 0.153532 | 65.9438 | 60.7487
13 | 13 | 0.158285 | 65.2418 | 60.8843
Average values for Q3 | | | 61.8427 | 57.8256
Algorithm parameters | Parameter values
Window size | 17
Neurons in hidden layer | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
Size of training set | 50
Number of training epochs | 250
Between 5 and 13 neurons, the number of hidden-layer neurons has little influence on the accuracy of the neural network prediction.
Results are much worse with fewer than 4 neurons in the hidden layer.
TEST RESULTS FOR DIFFERENT SIZES OF TRAINING SETS
No. | Training set | Number of proteins | Number of patterns | Sum square error | Q3% (training set) | Q3% (test set)
1 | trainset20 | 20 | 5393 | 0.143151 | 69.4233 | 57.5324
2 | trainset30 | 30 | 8089 | 0.148417 | 67.7710 | 59.3100
3 | trainset40 | 40 | 10786 | 0.154081 | 66.2155 | 60.0633
4 | trainsetw17 | 50 | 12820 | 0.159854 | 64.9532 | 61.0199
5 | trainset60 | 60 | 16178 | 0.157511 | 64.5444 | 60.9295
6 | trainset80 | 80 | 21572 | 0.159735 | 64.2390 | 61.2685
7 | trainset100 | 100 | 26964 | 0.163200 | 63.1546 | 61.7053
8 | trainset125 | 125 | 33705 | 0.163603 | 63.1835 | 62.0142
9 | trainset150 | 150 | 40446 | 0.163908 | 63.0989 | 62.0066
10 | trainset200 | 200 | 53928 | 0.164072 | 62.9098 | 62.2627
Average value for Q3 | | | | | 64.9493 | 60.8112
Algorithm parameters | Parameter values
Window size | 17
Neurons in hidden layer | 8
Size of training set | 20, 30, 40, 50, 60, 80, 100, 125, 150, 200
Number of training epochs | 250
The training sets differ in the number of protein sequences (number of proteins). Adding input patterns to the training set improves the test-set prediction by more than 1%.
TEST RESULTS FOR DIFFERENT NUMBERS OF EPOCHS
No. | Number of epochs | Sum square error | Q3% (training set) | Q3% (test set)
1 | 100 | 0.247866 | 41.0659 | 39.4622
2 | 150 | 0.199955 | 51.5317 | 50.3842
3 | 200 | 0.164233 | 62.9691 | 62.1874
4 | 250 | 0.164072 | 62.9098 | 62.2627
5 | 300 | 0.163579 | 63.0619 | 62.2175
6 | 500 | 0.161417 | 63.6590 | 62.8277
7 | 1000 | 0.156456 | 65.1053 | 63.0838
8 | 2000 | 0.151204 | 66.6314 | 63.6261
9 | 2500 | 0.151015 | 66.7909 | 63.6035
Average value for Q3 | | | 60.4139 | 58.8506
Algorithm parameters | Parameter values
Window size | 17
Neurons in hidden layer | 8
Size of training set | 200
Number of training epochs | 100, 150, ..., 1000, 2000, 2500
Increasing the number of epochs improves prediction accuracy by more than 1%. We can notice that results get worse beyond 2000 epochs because of overtraining of the NN.
Parameter values for which we have the best prediction accuracy:
Algorithm parameters | Parameter values
Window size | 17
Number of neurons in hidden layer | 8
Training set | 200
Epoch number | 2000
Q3 = 63.6261 %
Comparison with other methods
Method | Q3%
Chou & Fasman (1978) | 50
Lim (1974) | 50
Robson (1978) | 53
Levin (1986) | 59.7
Sejnowski (1988), net1 | 62.7
Qian & Sejnowski (1988), net2 | 64.3
Chandonia & Karplus (1995) | 73.9
Chandonia & Karplus (1996) | 80.2
Avdagic & Purisevic (2005) | 62
Avdagic & Purisevic (this work) | 63.6261
Discussion: Have we finally succeeded?
The bad news is: we still cannot predict the structure for an arbitrary sequence. The good news is: we have come closer to our goal, and growing databases facilitate the task. Because of that we propose our strategy in this work.
Our strategy: an orchestra of different scientists (biochemists, structural biologists, biophysicists, computer scientists) and a fusion of different methods.
We would then be able to better understand the mechanisms of life, biodiversity, and evolution in general. Because of that we need knowledge about protein structure, functions, and the relationship to the genes.
Steps 1 to 7 are repeated, first on radial basis networks and then on recurrent networks.
// On my watch it's Sat Feb 14 02:21:23 2004.
//
// When using this database, please cite:
// PDBFinderII - a database for protein structure analysis and prediction
// Krieger,E., Hooft,R.W.W., Nabuurs,S., Vriend,G. (2004) Submitted
//
ID : 101M
Header : OXYGEN TRANSPORT
Date : 1998-04-08
Compound : myoglobin
Compound : Mutant
Source : (physeter catodon)
Source : sperm whale
Water-Mols : 138
Sequence : MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
DSSP : CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
Nalign : 4 5 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
From the PDBFIND2.txt database we must extract the AA sequences and the corresponding secondary structures.
Sequence : MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
DSSP : CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
Sequence : MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGH
DSSP : CCHHHHHHHHHCCEEEEEECTTSCEEEETTE
Sequence : MSNTLFDDIFQVSEVDPGRYNKVCRIEAAST
DSSP : ???????????????????????????????
Sequence : E-XX
DSSP : C-CC
SeparatedSekStruk.txt
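The extraction itself is done by API-EPE; purely as an illustration, a MATLAB sketch of the same step (field labels assumed from the record shown above):

% Illustrative sketch only (the original step uses API-EPE): copy every
% 'Sequence :' and 'DSSP :' field from PDBFIND2.txt into SeparatedSekStruk.txt.
fin  = fopen('PDBFIND2.txt', 'r');
fout = fopen('SeparatedSekStruk.txt', 'w');
line = fgetl(fin);
while ischar(line)
    t = strtrim(line);
    if strncmp(t, 'Sequence :', 10) || strncmp(t, 'DSSP :', 6)
        fprintf(fout, '%s\n', t);
    end
    line = fgetl(fin);
end
fclose(fin);
fclose(fout);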
We need to eliminate the unimportant parts of the amino acid sequences and the related secondary structures:
MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGH
CCHHHHHHHHHCCEEEEEECTTSCEEEETTE
Cleared.txt
Then we need to encode the extracted data into numeric patterns.
Amino acids | 1-letter code | Number codes
Alanine | A | 01
Cysteine | C | 02
Aspartate | D | 03
Glutamate | E | 04
Phenylalanine | F | 05
Glycine | G | 06
Histidine | H | 07
Isoleucine | I | 08
Lysine | K | 09
Leucine | L | 10
Methionine | M | 11
Asparagine | N | 12
Proline | P | 13
Glutamine | Q | 14
Arginine | R | 15
Serine | S | 16
Threonine | T | 17
Valine | V | 18
Tryptophan | W | 19
Tyrosine | Y | 20
Secondary structure codes
Codes | Structure used in our algorithm | DSSP class
01 | α-helix | H, G
02 | β-strand | E
03 | other structure | B, I, S, T, C, L
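A minimal sketch of this 8-class-to-3-code reduction (dssp2code is our assumed helper name):

% Sketch (assumed helper): collapse a DSSP class letter into the 3-code
% scheme of the table above.
function code = dssp2code(c)
    if c == 'H' || c == 'G'      % alpha helix, 3-helix
        code = 1;
    elseif c == 'E'              % beta strand
        code = 2;
    else                         % B, I, S, T, C, ...
        code = 3;
    end
end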
11 18 10 16 04 06 06 19 14 10 07 18
03 03 03 03 01 01 01 01 01 01 01 01
11 12 08 05 04 11 10 15 08 03 04 06
03 03 01 01 01 01 01 01 01 01 01 03
EnCoded.txt
Extraction & Preparing & Encoding of Data Examples
Cleared.txt:
MVLSEGEWQLVLHVWAKVEADVAGHGQDILI
CCCCHHHHHHHHHHHHHHGGGHHHHHHHHHH
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGH
CCHHHHHHHHHCCEEEEEECTTSCEEEETTE
InputTrenSet.txt:
11 18 10 16 04 06 06 19 14 10 07 18 19 01 09 18 04 01 03 18 01
11 12 08 05 04 11 10 15 08 03 04 06 10 15 10 09 08 20 09 03 17
OutputTrenSet.txt:
03 03 03 03 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
03 03 01 01 01 01 01 01 01 01 01 03 03 02 02 02 02 02 02 03 03
InputData.txt (example of three protein chains):
11 12 08 05 04 11 10 15
01 01 09 16
06 18 18 17 09 03 04 01 04 09 10 05
The function prepare_data produces this string (pattern_temp):
11 12 8 5 4 11 10 15 0 1 1 9 16 0 6 18 18 17 9 3 4 1 4 9 10 5
It returns a vector which contains the sequences of AAs separated by the number 0.
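The slides show only this behaviour of prepare_data; a sketch of a consistent implementation (assumed, not the original code):

% Sketch of prepare_data (assumed implementation): read one encoded chain
% per line and return a single row vector with the chains separated by 0.
function v = prepare_data(filename)
    fid = fopen(filename, 'r');
    v = [];
    line = fgetl(fid);
    while ischar(line)
        v = [v sscanf(line, '%d')' 0];   %#ok<AGROW> chain codes + separator
        line = fgetl(fid);
    end
    fclose(fid);
    v(end) = [];                         % drop the trailing separator
end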
Design of the input pattern matrix taken from the string of encoded protein sequences: transforming the 1-number code into the 20-number code.
A window is a short segment of a complete protein string; in the middle of it is the amino acid for which we want to predict the secondary structure. This window moves through the protein one amino acid at a time.
Our prediction is made for the central AA; if there is a 0 at that position of the window, our function prepare_pattern does not permit that window to be placed into the pattern matrix.
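Putting this together, a minimal sketch of prepare_pattern consistent with the behaviour just described (assumed implementation; the original body is not in the slides):

% Sketch of prepare_pattern (assumed): slide a window of wsize codes over the
% vector from prepare_data, skip windows whose central position is the chain
% separator 0, and expand every 1-number code into its 20-bit representation.
function pattern = prepare_pattern(pattern_temp, wsize)
    half = floor(wsize / 2);
    pattern = [];
    for c = half+1 : numel(pattern_temp)-half
        if pattern_temp(c) == 0          % no prediction for the separator
            continue
        end
        win = pattern_temp(c-half : c+half);
        col = zeros(20 * wsize, 1);      % one column of the pattern matrix
        for k = 1:wsize
            if win(k) > 0                % a 0 inside the window stays all-zero
                col(20*(k-1) + win(k)) = 1;
            end
        end
        pattern = [pattern col];         %#ok<AGROW> fine for a sketch
    end
end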
DESIGN AND TRAINING OF NEURAL NETWORK
OutputData.txt
From the file OutputData.txt we have the secondary structure sequences; these SSS correspond to the three AA sequences in the file InputData.txt.
The function prepare_data produces from them one string (target_temp): a vector which contains the SS sequences separated by the number 0.
The function prepare_target eliminates from this output string those SS codes which correspond to AAs for which we make no prediction (the first and last AAs in the input vector).
In addition, prepare_target transforms the 1-number code of the remaining SSs into a 3-number code; the result is the target matrix, whose column vectors indicate helix, strand, or other structure.
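Under the same assumptions, a sketch of prepare_target (it mirrors the loop in prepare_pattern so the columns of target and pattern stay aligned):

% Sketch of prepare_target (assumed): keep only the SS codes whose window
% survived in prepare_pattern, then expand 1/2/3 into a 3-bit column
% (helix, strand, other).
function target = prepare_target(target_temp, wsize)
    half = floor(wsize / 2);
    target = [];
    for c = half+1 : numel(target_temp)-half
        if target_temp(c) == 0           % chain separator: no prediction
            continue
        end
        col = zeros(3, 1);
        col(target_temp(c)) = 1;         % 1 -> helix, 2 -> strand, 3 -> other
        target = [target col];           %#ok<AGROW>
    end
end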
DESIGN AND TRAINING OF NEURAL NETWORK
MATRIX-VECTOR NOTATION OF A THREE-LAYER NEURAL NETWORK
[Figure: network diagram with Input, Layer 1, Layer 2, Layer 3]