Morihiro Hayashida, Nobuhisa Ueda, Tatsuya Akutsu Bioinformatics Center, Kyoto University

Inferring strengths of protein-protein interactions from experimental data using linear programming


Overview

Background
Probabilistic model
Related work
Biological experimental data
Proposed methods
  For binary data
  For numerical data
Results of computational experiments
Conclusion

Background (1/3)

Understanding protein-protein interactions is useful for understanding protein functions.
  Transcription factors
    Proteins interact with a factor and regulate the gene.
  Receptors, etc.

Background (2/3)

Various methods were developed for inference of protein-protein interactions.
  Gene fusion/Rosetta stone (Enright et al. and Marcotte et al., 1999)
    The number of genes to which it can be applied is limited.
  Molecular dynamics
    Long CPU time
    Difficult to predict precisely

Background (3/3)

A model based on domain-domain interactions has been proposed.
  Use domains defined by databases like InterPro or Pfam.

[Figure: schematic with two regions labeled "Domain"]




Probabilistic model of interaction (1/2)

Model (Deng et al., 2002)
  Two proteins interact if and only if at least one pair of their domains interacts.
  Interactions between domains are independent events.

[Figure: two proteins, P1 and P2, drawn with their domains (labels D1, D2, D3 and D2, D4)]

Pij = 1: Proteins Pi and Pj interact
Dmn = 1: Domains Dm and Dn interact
(Dm, Dn) ∈ Pi × Pj: Domain pair (Dm, Dn) is included in protein pair Pi × Pj

Probabilistic model of interaction (2/2)
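
The formula for this slide did not survive in the text. Under the two assumptions of the model (proteins interact exactly when at least one domain pair interacts, and domain-pair interactions are independent events), the interaction probability takes the form Pr(Pij=1) = 1 - product over (Dm, Dn) in Pi × Pj of (1 - Pr(Dmn=1)). A minimal Python sketch of that computation follows. The function name, the dictionary layout, and the treatment of domain pairs as unordered are illustrative assumptions, not part of the original slides.

```python
from itertools import product


def interaction_probability(domains_i, domains_j, pr_domain):
    """Pr(Pij=1) = 1 - prod(1 - Pr(Dmn=1)) over domain pairs in Pi x Pj.

    pr_domain maps an unordered domain pair (a frozenset) to Pr(Dmn=1).
    Pairs missing from the mapping are treated as non-interacting (0.0).
    """
    # Assumption: (Dm, Dn) and (Dn, Dm) are the same pair, counted once.
    pairs = {frozenset((m, n)) for m, n in product(domains_i, domains_j)}
    no_interaction = 1.0
    for pair in pairs:
        no_interaction *= 1.0 - pr_domain.get(pair, 0.0)
    return 1.0 - no_interaction


# Hypothetical values for illustration only.
pr_domain = {frozenset(("D1", "D2")): 0.5, frozenset(("D3", "D4")): 0.2}
p = interaction_probability(["D1", "D3"], ["D2", "D4"], pr_domain)
```

With these toy values only two of the four domain pairs have a nonzero probability, so the result is 1 - (0.5 × 0.8).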

Overview

Background
Probabilistic model
Related work
  Association method (Sprinzak et al., 2001)
  EM method (Deng et al., 2002)
Biological experimental data
Proposed methods
Results of computational experiments
Conclusion

Related work

INPUT:
  interacting protein pairs (positive examples)
  non-interacting protein pairs (negative examples)
OUTPUT:
  Pr(Dmn=1) for all domain pairs

Association method (Sprinzak et al., 2001)

Inference of probabilities of domain-domain interactions using ratios of frequencies

Pr(Dmn=1) = Imn / Nmn

Imn: Number of interacting protein pairs that include (Dm, Dn)
Nmn: Number of protein pairs that include (Dm, Dn)
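
The ratio of frequencies described above can be sketched in Python: for each domain pair, divide the number of interacting protein pairs that include it by the number of all protein pairs that include it. The data layout and the names `association_scores`, `proteins`, and `observed` are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def association_scores(proteins, observed):
    """Estimate Pr(Dmn=1) as (interacting pairs with Dmn) / (all pairs with Dmn).

    proteins: dict mapping protein name -> set of domain names
    observed: dict mapping (protein_i, protein_j) -> 1 (interacting) or 0 (not)
    """
    interacting = Counter()  # numerator of the ratio
    total = Counter()        # denominator of the ratio
    for (pi, pj), label in observed.items():
        # Assumption: domain pairs are unordered and counted once per protein pair.
        pairs = {frozenset((m, n)) for m, n in product(proteins[pi], proteins[pj])}
        for pair in pairs:
            total[pair] += 1
            interacting[pair] += label
    return {pair: interacting[pair] / total[pair] for pair in total}


# Hypothetical toy data.
proteins = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D2", "D3"}}
observed = {("P1", "P2"): 1, ("P1", "P3"): 0}
scores = association_scores(proteins, observed)
```

Here the pair (D1, D2) occurs in two protein pairs, one of which interacts, so its score is 0.5; the pair (D1, D3) occurs only in a non-interacting pair and scores 0.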

EM method (Deng et al., 2002)

Probability (likelihood L) that experimental data {Oij ∈ {0,1}} are observed.
Use the EM algorithm to (locally) maximize L.
Estimate Pr(Dmn=1).
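
The likelihood expression itself is missing from the text. For binary observations, a likelihood of this kind is normally written as a product of Bernoulli terms. The form below is a sketch on that assumption and may differ from the slide. How Pr(Oij=1) is tied to the model probability Pr(Pij=1), which in Deng et al. involves experimental false positive and false negative rates, is not reproduced here.

```latex
\[
L \;=\; \prod_{i,j} \Pr(O_{ij}=1)^{\,O_{ij}} \,\bigl(1 - \Pr(O_{ij}=1)\bigr)^{\,1 - O_{ij}}
\]
```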




Biological experimental data

Related methods (Association and EM) use only binary data (interact or not).
Experimental data from yeast two-hybrid experiments
  Ito et al. (2000, 2001)
  Uetz et al. (2001)
For many protein pairs, different results (Oij ∈ {0,1}) were observed.
We developed new methods using raw numerical data.

Numerical data

Ito et al. (2000, 2001)
  For each protein pair, experiments were performed multiple times.
  IST (Interaction Sequence Tag)
    Number of observed interactions
  By using a threshold, we obtain binary data.




Proposed methods

It seems difficult to modify the EM method for numerical data.
  Linear Programming
For binary data
  LPBN
  Combined methods
    LPEM
    EMLP
  SVM-based method
For numerical data
  ASNM
  LPNM




LPBN (LP-based method) (1/2)

Transformation into linear inequalities
  Pi and Pj interact
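
The inequalities for this slide are missing from the text. One way such a transformation can work is sketched here under an assumption: "Pi and Pj interact" is read as Pr(Pij=1) >= theta for some threshold theta (a hypothetical name). Under the model's independence assumption, Pr(Pij=1) = 1 - prod(1 - Pr(Dmn=1)), so the condition is equivalent to sum of ln(1 - Pr(Dmn=1)) <= ln(1 - theta), which is linear in the variables x = ln(1 - Pr(Dmn=1)). The Python snippet checks this equivalence numerically on made-up values.

```python
import math


def interacts_by_probability(domain_probs, theta):
    """Direct test: Pr(Pij=1) = 1 - prod(1 - p) >= theta."""
    return 1.0 - math.prod(1.0 - p for p in domain_probs) >= theta


def interacts_by_linear_form(domain_probs, theta):
    """Same test, linear in x = ln(1 - p): sum(x) <= ln(1 - theta)."""
    return sum(math.log(1.0 - p) for p in domain_probs) <= math.log(1.0 - theta)


# Hypothetical probabilities, all strictly below 1 so the logarithm is defined.
cases = [
    ([0.3, 0.4], 0.5),
    ([0.1, 0.1], 0.5),
    ([0.9], 0.2),
    ([0.05, 0.2, 0.1], 0.4),
]
agree = [
    interacts_by_probability(p, t) == interacts_by_linear_form(p, t)
    for p, t in cases
]
```

Both tests give the same answer on every case, which is the point of the substitution: once the unknowns are the log terms, each protein pair contributes one linear inequality.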

LPBN (LP-based method) (2/2)

Linear programming for inference of protein-protein interactions
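
The linear program shown on this slide is not preserved in the text. The following is one plausible formulation, not necessarily the authors' own. It assumes variables x_mn = ln(1 - Pr(Dmn=1)), a constant b = ln(1 - theta) for a hypothetical threshold theta, and one nonnegative slack variable alpha_ij per protein pair that measures how far the pair violates its inequality.

```latex
\[
\begin{aligned}
\text{minimize}\quad & \sum_{i,j} \alpha_{ij} \\
\text{subject to}\quad
& \sum_{(D_m, D_n) \in P_i \times P_j} x_{mn} \;\le\; b + \alpha_{ij}
  && \text{for interacting pairs } (P_i, P_j), \\
& \sum_{(D_m, D_n) \in P_i \times P_j} x_{mn} \;\ge\; b - \alpha_{ij}
  && \text{for non-interacting pairs } (P_i, P_j), \\
& x_{mn} \le 0, \qquad \alpha_{ij} \ge 0 .
\end{aligned}
\]
```

After solving, domain-pair probabilities would be recovered as Pr(Dmn=1) = 1 - exp(x_mn).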

Combination of EM and LPBN

LPEM method
  Use the results of LPBN as initial parameter values for EM.
EMLP method
  Constrain LPBN with additional inequalities so that LP solutions stay close to EM solutions.

Simple SVM-based method

Feature vector
Simple linear kernel
  Interacting pairs = Positive examples
  Non-interacting pairs = Negative examples
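
The definition of the feature vector is missing from the text. A natural choice in a domain-based setting, assumed here rather than taken from the slides, is a 0/1 vector with one coordinate per domain pair, set to 1 when that pair occurs in Pi × Pj. The Python sketch below builds such a vector; `feature_vector` and `pair_index` are hypothetical names.

```python
from itertools import product


def feature_vector(domains_i, domains_j, pair_index):
    """Return a 0/1 list with one entry per indexed domain pair."""
    vec = [0] * len(pair_index)
    for m, n in product(domains_i, domains_j):
        key = frozenset((m, n))  # assumption: domain pairs are unordered
        if key in pair_index:
            vec[pair_index[key]] = 1
    return vec


# Hypothetical index over three domain pairs.
pair_index = {
    frozenset(("D1", "D2")): 0,
    frozenset(("D1", "D3")): 1,
    frozenset(("D2", "D4")): 2,
}
x = feature_vector({"D1"}, {"D2", "D3"}, pair_index)
```

Vectors of this kind, labeled positive for interacting pairs and negative for non-interacting pairs, could then be passed to any linear-kernel SVM implementation.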




Strength of protein-protein interaction

For each protein pair, experiments were performed multiple times.
The ratio Kij / Mij can be considered as the strength.

Kij: Number of observed interactions for a protein pair (Pi, Pj)
Mij: Number of experiments for (Pi, Pj)
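
As a small illustration, the strength of a protein pair is the number of observed interactions divided by the number of experiments, and binary data can be derived from the same counts with a threshold. The function names and the choice of "at least the threshold" as the cutoff rule are assumptions made for this Python sketch.

```python
def interaction_strength(k_ij, m_ij):
    """Observed strength Kij / Mij for one protein pair."""
    if m_ij <= 0:
        raise ValueError("Mij must be positive")
    if not 0 <= k_ij <= m_ij:
        raise ValueError("Kij must lie between 0 and Mij")
    return k_ij / m_ij


def to_binary(k_ij, threshold):
    """Binarize a count of observed interactions.

    Assumption: a pair counts as interacting when k_ij >= threshold.
    """
    return 1 if k_ij >= threshold else 0


# Hypothetical counts: 3 observed interactions in 4 experiments.
rho = interaction_strength(3, 4)
label = to_binary(3, threshold=3)
```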

LPNM method (1/2)

Minimize the gap between Pr(Pij=1) and Kij / Mij using LP.

LPNM method (2/2)

Linear programming for inference of strengths of protein-protein interactions
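
The linear program for this slide is not preserved in the text. One way to express "minimize the gap" as an LP is sketched below, under assumptions that are not confirmed by the source. Let rho_ij denote the observed ratio Kij / Mij, and let x_mn = ln(1 - Pr(Dmn=1)). Requiring Pr(Pij=1) to be close to rho_ij then corresponds to requiring the sum of x_mn over the domain pairs of (Pi, Pj) to be close to ln(1 - rho_ij), and the absolute gap is bounded by a slack variable alpha_ij. In this form the gap is measured in log space, and pairs with rho_ij = 1 would need separate handling because ln(0) is undefined.

```latex
\[
\begin{aligned}
\text{minimize}\quad & \sum_{i,j} \alpha_{ij} \\
\text{subject to}\quad
& -\alpha_{ij} \;\le\; \sum_{(D_m, D_n) \in P_i \times P_j} x_{mn} \;-\; \ln(1 - \rho_{ij}) \;\le\; \alpha_{ij}
  && \text{for every pair } (P_i, P_j), \\
& x_{mn} \le 0, \qquad \alpha_{ij} \ge 0 .
\end{aligned}
\]
```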

ASNM

Modified Association method for numerical data

For binary data (Sprinzak et al., 2001)
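
The formulas for this slide are missing from the text. One natural modification, assumed here and not confirmed by the source, replaces the count of interacting protein pairs in the Association ratio with the sum of observed strengths (observed interactions divided by experiments) over the protein pairs that contain the domain pair. The Python sketch below implements that assumed variant; all names are illustrative.

```python
from collections import defaultdict
from itertools import product


def asnm_scores(proteins, counts):
    """Average observed strength over protein pairs containing (Dm, Dn).

    proteins: dict mapping protein name -> set of domain names
    counts: dict mapping (protein_i, protein_j) -> (Kij, Mij), with Mij > 0
    """
    strength_sum = defaultdict(float)
    n_pairs = defaultdict(int)
    for (pi, pj), (k, m) in counts.items():
        # Assumption: domain pairs are unordered and counted once per protein pair.
        pairs = {frozenset((a, b)) for a, b in product(proteins[pi], proteins[pj])}
        for pair in pairs:
            strength_sum[pair] += k / m
            n_pairs[pair] += 1
    return {pair: strength_sum[pair] / n_pairs[pair] for pair in n_pairs}


# Hypothetical toy data: (observed interactions, experiments) per protein pair.
proteins = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D2", "D3"}}
counts = {("P1", "P2"): (3, 4), ("P1", "P3"): (1, 4)}
scores = asnm_scores(proteins, counts)
```

When every strength is 0 or 1, this reduces to the binary ratio of interacting pairs to all pairs.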




Computational experiments for binary data

DIP database (Xenarios et al., 2002)
  1767 protein pairs as positive examples
  2/3 of the pairs for training, 1/3 for test
Computational environment
  Xeon processor 2.8 GHz
  LP solver: loqo

Results on training data (binary data)

[Figure: results for SVM, EM, LPBN, and Association]

Results on test data (binary data)

[Figure: results for SVM, EM, EMLP, Association, and LPEM]

Computational experiments for numerical data

YIP database (Ito et al., 2001, 2002)
  IST (Interaction Sequence Tag)
  1586 protein pairs
  4/5 for training, 1/5 for test
Computational environment
  Xeon processor 2.8 GHz
  LP solver: lp_solve

Results on test data (numerical data)

[Figure: results for ASNM, EM, LPNM, and Association]

Results on test data (numerical data)

LPNM is the best.
EM and Association methods classify Pr(Pij=1) into either 0 or 1.

             LPNM     ASNM     EM      ASSOC
Ave. Error   0.0308   0.0405   0.295   0.277
CPU (sec.)   1.20     0.0077   1.62    0.0088

Conclusion

We have defined a new problem to infer strengths of protein-protein interactions.
We have proposed LP-based methods.
  For binary data
    LPBN, LPEM, EMLP
    SVM-based method
  For numerical data
    ASNM
    LPNM
LPNM outperformed the other methods.

Future work

Improve the methods to avoid overfitting.
Improve the probabilistic model to understand protein-protein interactions more accurately.
