24
Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural network Output Neural network: • is a black box that no one can understand • over predicts performance

Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Biological sequence analysis andinformation processing by artificial

neural networks

Morten NielsenBioSys, DTU

Objectives

Input Neural network Output

• Neural network:

• is a black box that noone can understand

• over predictsperformance

Page 2: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Pairvise alignment

• >carp Cyprinus carpio growth hormone 210 aa vs.

• >chicken Gallus gallus growth hormone 216 aa

• scoring matrix: BLOSUM50, gap penalties: -12/-2

• 40.6% identity; Global alignment score: 487

• 10 20 30 40 50 60 70

• carp MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD

• :: . : ...:.: . : :. . :: :::.:.:::: :::. ..:: . .::..: .: .:: :.

• chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE

• 10 20 30 40 50 60 70 80

• 80 90 100 110 120 130 140 150

• carp YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN

• : ::.:::..:..: ..:::.:. ::.:: : : ::. .:.:. :. ... ::: ::. ::..:.. : .: .

• chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G

• 90 100 110 120 130 140 150 160

• 170 180 190 200 210

• carp DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL

• .: : .. : . . .:. : ... ::.:::::.:::::::.: .::: .::::.

• chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI

• 170 180 190 200 210

Page 3: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

HUNKAT

Page 4: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Biological Neural network

Page 5: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Biological neuron structure

Diversity of interactions in anetwork enables complex calculations

• Similar in biological and artificial systems

• Excitatory (+) and inhibitory (-) relations between compute units

Page 6: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Transfer of biological principles toartificial neural network algorithms

• Non-linear relation between input and output

• Massively parallel information processing

• Data-driven construction of algorithms

• Ability to generalize to new data items

Page 7: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural
Page 8: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural
Page 9: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

How to predict

• The effect on the binding affinity ofhaving a given amino acid at oneposition can be influenced by theamino acids at other positions in thepeptide (sequence correlations).– Two adjacent amino acids may for

example compete for the space in apocket in the MHC molecule.

• Artificial neural networks (ANN) areideally suited to take suchcorrelations into account

SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV

LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL

HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI

ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL

LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV

PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV

ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM

KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV

KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV

SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV

ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV

TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA

GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA

KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV

AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV

GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV

VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC

ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA

YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI

NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV

VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ

GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL

EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV

YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL

FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL

AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI

AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

MHC peptide binding

Page 10: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

313 binding peptides 313 random peptides

Mutual information

Higher order sequence correlations

• Neural networks can learn higher order correlations!– What does this mean?

S S => 0

L S => 1

S L => 1

L L => 0

Say that the peptide needs one and onlyone large amino acid in the positions P3and P4 to fill the binding cleft

How would you formulate this to test ifa peptide can bind?

=> XOR function

Page 11: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Neural networks

•Neural networks can learnhigher order correlations

XOR function:

0 0 => 0

1 0 => 1

0 1 => 1

1 1 => 0

(1,1)

(1,0)(0,0)

(0,1)

No linear function can separate the points

Error estimates

XOR

0 0 => 0

1 0 => 1

0 1 => 1

1 1 => 0

(1,1)

(1,0)(0,0)

(0,1)Predict

0

1

1

1

Error

0

0

0

1

Mean error: 1/4

Page 12: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Neural networks

v1

v2

Linear function

!

y = x1" v1+ x

2" v

2

Neural networks

Page 13: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Neural networks. How does itwork?

Hand out

Neural networks. How does itwork?

w12

v1

w21

w22

v2

wt2

wt1w

11

vt

Input 1 (Bias){

!

O =1

1+ exp("o)

!

o = xi" # w

i

Page 14: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Neural networks (0 0)

0 06

-9

46

9

1

-2

-64

1

-4.5

Input 1 (Bias){

o1=-6O1=0

o2=-2O2=0

y1=-4.5Y1=0

!

O =1

1+ exp("o)

!

o = xi" # w

i

Neural networks (1 0 && 0 1)

1 06

-9

46

9

1

-2

-64

1

-4.5

Input 1 (Bias){

o1=-2O1=0

o2=4O2=1

y1=4.5Y1=1

!

O =1

1+ exp("o)

!

o = xi" # w

i

Page 15: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Neural networks (1 1)

1 16

-9

46

9

1

-2

-64

1

-4.5

Input 1 (Bias){!

O =1

1+ exp("o)

!

o = xi" # w

i

o1=2O1=1

o2=10O2=1

y1=-4.5Y1=0

What is going on?

XOR function:

0 0 => 0

1 0 => 1

0 1 => 1

1 1 => 0

!

fXOR (x1,x2) = "2 # x1# x

2+ (x

1+ x

2) = y

1" y

2

6

-9

46

9

-2

-64

-4.5

Input 1 (Bias){

y2 y1

Page 16: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

What is going on?

(1,1)

(1,0)(0,0)

(0,1)

!

y1

= x1+ x

2

y2

= 2 " x1" x

2

x2

x1 y1

y2

(1,0)

(2,2)

(0,0)

Page 17: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural
Page 18: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Training and error reduction

DEMO

Page 19: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Transfer of biologicalprinciples to neural network algorithms

• Non-linear relation between input and output

• Massively parallel information processing

• Data-driven construction of algorithms

• A Network contains a very largeset of parameters

– A network with 5 hiddenneurons predicting binding for9meric peptides has9x20x5=900 weights

• Over fitting is a problem

• Stop training when testperformance is optimal

Neural network training

years

Tem

pera

ture

Page 20: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Neural network training. Cross validation

Cross validation

Train on 4/5 of dataTest on 1/5=>Produce 5 differentneural networks eachwith a differentprediction focus

Neural network training curve

Maximum test set performanceMost cable of generalizing

Page 21: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Network training

• Encoding of sequence data• Sparse encoding

• Blosum encoding

• Sequence profile encoding

Sparse encoding of amino acidsequence windows

Page 22: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Sparse encoding

Inp Neuron 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

AAcid

A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

R 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

N 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

D 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

C 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Q 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

E 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

BLOSUM encoding (Blosum50 matrix)

A R N D C Q E G H I L K M F P S T W Y V

A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0

R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3

N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3

D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3

C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1

Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2

E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2

G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3

H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

Page 23: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

Sequence encoding (continued)

• Sparse encodingV:0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1L:0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

V.L=0 (unrelated)

• Blosum encodingV: 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

L:-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1

V.L = 0.88 (highly related)V.R = -0.08 (close to unrelated)

Applications of artificial neuralnetworks

• Talk recognition

• Prediction of protein secondary structure

• Prediction of Signal peptides

• Post translation modifications• Glycosylation

• Phosphorylation

• Proteasomal cleavage

• MHC:peptide binding

Page 24: Biological sequence analysis and Objectives...Biological sequence analysis and information processing by artificial neural networks Morten Nielsen BioSys, DTU Objectives Input Neural

What have we learned

• Neural networks are not so bad as theirreputation

• Neural networks can deal with higherorder correlations

• Be careful when training a neural network– Always use cross validated training