Bioinformatics techniques and methodologies Università della Calabria Facoltà di Ingegneria BIOINFORMATICS TECHNIQUES AND METHODOLOGIES Research group coordinated by Prof. Luigi Palopoli Lecturer: Simona Rombo



OUTLINE

1. Introduction to Bioinformatics

2. Pattern discovery
– Strings
– Images

3. Biological Networks Analysis
– Network alignment
– Network clustering


Donald Knuth, 1993:

“…It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at people’s fingertips, that it won’t be pretty much working on refinement of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can’t predict an unending growth. I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on…”

Introduction to Bioinformatics


There are several facts about biology that are important to keep in mind:

– In biology there are no rules without exceptions

– In reasoning about biological structures, looking for generalizations may often be misleading

– It is often impossible to study a biological phenomenon in isolation: it may occur only while other related phenomena occur as well, and these must be taken into account too

– Reasoning with incomplete information is the rule rather than the exception

– In reasoning about biological structures and functions it is important to bear in mind the pervasive role of evolution


A definition: “Bioinformatics is the combination of biology and information technology. It is the branch of science that deals with computer-based analysis of large biological data sets. Bioinformatics incorporates the development of databases to store and search data, and statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.” (R.M. Twyman)

In most cases, the computer-based tools developed in bioinformatics still require expert human intervention to solve the problems they address


Generally speaking, the aim of bioinformatics is to help biologists in gathering and processing biological data and to aid in studying protein structures and interactions in order to allow optimal drug design.


Here is a summary of CS methods and techniques relevant to bioinformatics:

– String algorithms, grammars and automata
– Indexing methods and query optimization
– Integration techniques
– Optimization techniques
– Dynamic programming and heuristics
– Data mining and machine learning techniques
– Probability- and statistics-based methods
– Computational geometry methods
– Text mining
– …


Two main points of view:

1. Cellular components (e.g., DNA, RNA, proteins)

2. Interaction of cellular components (e.g., metabolic pathways, protein-protein interactions)


Introduction to Bioinformatics – Cellular Components


DNA


AMINO ACIDS

Proteins are the core structures determining the cell life cycle;

they are made up of elementary units called amino acids, or residues (a few exceptions exist);

There are 20 standard amino acids in nature


• Another perspective is the analysis of mutual interactions among proteins

• Proteins are involved in complexes performing specific biological functions

Saccharomyces cerevisiae

Introduction to Bioinformatics – Interactions of components


Pattern Discovery


Efficient data structures

Trie
• A tree data structure used to store strings
• Each edge has a label representing a symbol
• Two edges out of the same node have distinct labels
• Each node, except the root, is associated with a string
• Concatenating all the symbols on the path from the root to a node n yields the string corresponding to n
• All the descendants of a node n are associated with strings sharing a common prefix, i.e., the string corresponding to n


Example
A trie storing the words {to, te, tea, ten, hi, he, her}:

root
 |- t
 |   |- o   (to)
 |   |- e   (te)
 |       |- a   (tea)
 |       |- n   (ten)
 |- h
     |- i   (hi)
     |- e   (he)
         |- r   (her)
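The trie properties listed above can be illustrated with a minimal Python sketch. The class and method names (`TrieNode`, `Trie`, `contains`, `words_with_prefix`) are illustrative choices, not part of the slides; the sketch stores the seven words of the example and shows that all words below a node share the prefix spelled by the path to that node.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # symbol -> TrieNode; edges out of a node have distinct labels
        self.is_word = False  # True if the path from the root spells a stored word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for symbol in word:
            node = node.children.setdefault(symbol, TrieNode())
        node.is_word = True

    def _find(self, prefix):
        # Follow the path labeled by prefix; return None if it leaves the trie.
        node = self.root
        for symbol in prefix:
            node = node.children.get(symbol)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def words_with_prefix(self, prefix):
        # Every stored word in the subtree of the node for prefix starts with prefix.
        start = self._find(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for symbol, child in node.children.items():
                stack.append((child, text + symbol))
        return sorted(found)


trie = Trie(["to", "te", "tea", "ten", "hi", "he", "her"])
print(trie.words_with_prefix("te"))
```

Lookup cost depends only on the length of the query word, not on the number of stored words, which is the main reason tries are used as a building block for pattern discovery on strings.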


Efficient data structures

Suffix Tree
Given a string s of n characters over the alphabet Σ, a suffix tree T associated with s can be defined as a trie containing all the n suffixes of s.
• For each leaf i of T, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix s_i of s

• For any pair of suffixes of s, the path associated with their longest common prefix is shared in T

(Example on the string abbababbab)
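As a rough illustration of the definition above, the following Python sketch builds an uncompressed suffix trie for the example string by naive insertion of every suffix. This is an assumption-laden simplification: real suffix trees compress unary paths and are built in linear time, while this version takes quadratic time and space. The terminator symbol `$` and the function names are hypothetical choices made here so that every suffix ends at its own leaf.

```python
def build_suffix_trie(s, terminator="$"):
    """Naive O(n^2) construction: insert each suffix of s followed by a terminator."""
    text = s + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node[None] = start  # leaf marker: stores the starting position of the suffix
    return root


def occurrences(trie, pattern):
    """Return the sorted 0-based start positions of pattern in the indexed string."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return []
        node = node[symbol]
    # Every leaf below this node is a suffix that has pattern as a prefix.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


suffix_trie = build_suffix_trie("abbababbab")
print(occurrences(suffix_trie, "bab"))
```

Two suffixes that share a prefix share the corresponding path from the root, so all occurrences of a pattern are found by following a single path and collecting the leaves below it.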


Pattern Discovery

Problem: often the size of the output is exponential in the input size


Pattern Discovery – 2D Array


Definition of maximal motif

[Figure: three example motifs, labeled “not maximal in composition”, “not maximal in length” and “maximal”]


BASIS

• A basis of an image I is a set of irredundant motifs able to generate all the other motifs of I

• It is possible to prove that each image has only one basis, i.e., the basis is unique

• The size of the basis is linear in the size of the image: if I has size N, the number of motifs in the basis is O(N)

• In general, the number of motifs with don’t cares in I is exponential in N

An important problem is the extraction of the basis from I


A key concept: autocorrelation

Autocorrelations: the meet between I and all its bites

[Figure: an example image over {a, b}, two of its bites P and Q, and the meet between P and Q, where • denotes a don’t care]


Consensus, Meet, Autocorrelation

Projection at (i_1, j_1) and (i_2, j_2)
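The slides name consensus and meet without spelling out the definitions in text, so the following Python sketch rests on an assumption: that the consensus of two equally sized arrays keeps a symbol where the arrays agree and puts a don’t care elsewhere, and that the meet is the consensus with border rows and columns made only of don’t cares removed. The function names and the tiny arrays are hypothetical.

```python
DONT_CARE = "•"


def consensus(p, q):
    """Position-wise agreement of two arrays given as equal-length lists of strings."""
    if len(p) != len(q) or any(len(a) != len(b) for a, b in zip(p, q)):
        raise ValueError("arrays must have the same shape")
    return ["".join(x if x == y else DONT_CARE for x, y in zip(a, b))
            for a, b in zip(p, q)]


def meet(p, q):
    """Consensus with all-don't-care border rows and columns trimmed (assumed definition)."""
    rows = consensus(p, q)
    while rows and set(rows[0]) <= {DONT_CARE}:
        rows = rows[1:]
    while rows and set(rows[-1]) <= {DONT_CARE}:
        rows = rows[:-1]
    while rows and rows[0] and all(r[0] == DONT_CARE for r in rows):
        rows = [r[1:] for r in rows]
    while rows and rows[0] and all(r[-1] == DONT_CARE for r in rows):
        rows = [r[:-1] for r in rows]
    return rows


P = ["aab", "abb", "bab"]
Q = ["bba", "abb", "aab"]
for row in meet(P, Q):
    print(row)
```

Under this reading, an autocorrelation would be obtained by applying `meet` to the image and one of its shifted copies restricted to their overlap; that step is left out here because the slides do not give enough detail to reproduce it faithfully.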


Basic Approach

Theorem: the basis is a subset of the set of autocorrelations

Three steps:

1. Generate all the autocorrelations of the input image I: O(N^2)

2. Compute the lists of occurrences of the autocorrelations: ?

3. Discard redundant motifs: O(N^2)


Second step

1) Fischer & Paterson: O(N^2 log n log log n)

2) Incremental building of the set B of irredundant motifs: O(N^3)

3) Exploit some properties of don’t cares: O(N^2), but only for binary alphabets

[Figure: an example image over {a, b}, with labels i, j, R_ij, B_ij and B_ij+1]


Optimal Approach

Exploit some properties holding for |Σ|=2 (e.g., Σ ={a,b})


Optimal Approach - Example

Is (2, 2) an occurrence of A_{3,4}?

Is (2, 4) an occurrence of A_{3,4}?

[Figure labels: d_1 = 2; d_2 = 0, d_3 = 2; d_2 = 1, d_3 = 1]


Optimal Approach

Three steps:

1. Generate all the autocorrelations of the input image I: O(N^2)

2. Compute the lists of occurrences of the autocorrelations: O(N^2)

3. Discard redundant motifs: O(N^2)

Overall cost: O(N^2)

Only black-and-white images


Image Compression

Main Idea: Exploit motif basis as 2D patches


References:

– A. Amelio, A. Apostolico and S. E. Rombo, Image Compression by 2D Motif Basis, in: Proceedings of the IEEE Data Compression Conference (DCC 2011), IEEE CS Press, Snowbird, UT, USA, 2011 (forthcoming).

– A. Apostolico, L. Parida and S. E. Rombo, Motif Patterns in 2D, Theoretical Computer Science, 2008.

– S. E. Rombo, Optimal extraction of motif patterns in 2D, Inf. Process. Lett. 109:17 (2009) 1015-1020.

– A. Apostolico and L. Parida, Incremental Paradigms of Motif Discovery, J. of Comp. Biol. 11:1 (2004) 15-25.

– A. Amir and M. Farach, Two-dimensional dictionary matching, Inf. Process. Lett. 44:5 (1992) 233-239.

– M.J. Fischer and M.S. Paterson, String Matching and Other Products, in: R.M. Karp (Ed.), Complexity of Computation (SIAM-AMS Proceedings, v.7), 1974, pp. 113-125.


Further topics (from 2009 onwards):

• Image compression

• Analysis of biological images

• Pattern discovery/matching on images with rotations, scaling and other variants

• Techniques applied to similarity search between images

• Pattern discovery (motif extraction) on biological strings


PPI networks similarity search

• Evolution influences protein-protein interactions

• Proteins cannot be analyzed independently

• Both high-throughput and computational methods contribute to discovering and predicting protein-protein interactions

Biological Networks Analysis


The Interaction Network of an organism:

nodes=proteins

edges=interactions


Why search for similarity between proteins belonging to different PPI networks?

To identify functional conservation across species


Our basic idea

Two proteins p1 and p2 in two different PPI networks may be considered similar if:

– p1 and p2 have similar sequences

– the proteins that p1 and p2 are connected with, i.e., their neighborhoods, have similar sequences


Refining protein similarities

S=sequence similarity


S’=refined similarity


The Graph Network

P = a set of nodes labeled by protein ids

I = a set of undirected edges, each labeled by a pair:

– <w,c> | w,c ∈ [0,1]

– w = weakness

– c = confidence

Graph Network: GN = <P,I>


Interaction Path_i (I-Path_i)

A path such that:

– F(i-1) ≤ Σ_u w_u ≤ F(i), i ≥ 1, F(0) = 0

Example (F(x) = x^2, i = 1):

[Figure: a graph network on the proteins p1, ..., p9, with edge labels <0.8,0.4>, <0.2,0.7>, <0.1,0.6>, <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]

– <p2, p1, p4>: satisfied

– <p3, p4, p5, p6>: satisfied

– <p4, p5, p9>: not satisfied
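The I-Path condition can be checked mechanically. The sketch below is a minimal Python illustration; the function names are hypothetical, and the weaknesses 0.8 and 0.2 for the path <p2, p1, p4> are inferred from the edge labels whose confidences (0.4 and 0.7) appear in the cumulative-confidence example, since the figure layout itself is not recoverable. The second list of weaknesses is invented for illustration.

```python
def square(x):
    # F(x) = x^2, as in the slide example; note F(0) = 0 as required.
    return x * x


def is_i_path(weaknesses, i, F):
    """Check F(i-1) <= sum of edge weaknesses <= F(i) for a path, with i >= 1."""
    if i < 1:
        raise ValueError("i must be >= 1")
    total = sum(weaknesses)
    eps = 1e-12  # tolerance for floating-point rounding at the interval ends
    return F(i - 1) - eps <= total <= F(i) + eps


# Path <p2, p1, p4>: total weakness 1.0, inside [F(0), F(1)] = [0, 1].
print(is_i_path([0.8, 0.2], 1, square))
# Hypothetical path with total weakness 1.6: outside [0, 1], inside [1, 4].
print(is_i_path([0.9, 0.7], 1, square), is_i_path([0.9, 0.7], 2, square))
```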


Cumulative Confidence

Given an I-Path_i:

– C = Π_u c_u

Example (same graph network as in the I-Path_i example, F(x) = x^2, i = 1):

For the path <p2, p1, p4>:

C = 0.4 * 0.7 = 0.28
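The product above is a one-liner in code. This small Python sketch (the function name is a hypothetical choice) reproduces the value for the path <p2, p1, p4>, rounding only for display because 0.4 * 0.7 is not exactly representable in binary floating point.

```python
import math


def cumulative_confidence(confidences):
    """Product of the edge confidences c_u along a path."""
    return math.prod(confidences)


c = cumulative_confidence([0.4, 0.7])
print(round(c, 2))  # 0.28
```

Because every confidence lies in [0, 1], the cumulative confidence can only decrease as a path gets longer, so long paths are automatically trusted less.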


i-th Neighborhood

Given a node p in GN = <P,I>:

– N(p,i) = {q | q ∈ P, q ≠ p, <p,q> is an I-Path_i in GN with minimum Σ_u w_u}

Example (F(x) = x^2, i = 1):

[Figure: a graph network on the proteins p1, ..., p6, with edge labels <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]

N(p3, 1) = {p1, p2, p4, p6}
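One plausible reading of the definition is that q belongs to N(p, i) when the minimum total weakness over all paths from p to q falls in [F(i-1), F(i)]. Under that assumption, the neighborhood can be computed with Dijkstra's algorithm, as in the Python sketch below. The graph used here is hypothetical (the edges of the slide figure cannot be recovered), as are the function names.

```python
import heapq


def undirected(edges):
    """Build an adjacency map from (u, v, weakness) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def neighborhood(graph, p, i, F):
    """Nodes q != p whose minimum total weakness from p lies in [F(i-1), F(i)]."""
    dist = {p: 0.0}
    heap = [(0.0, p)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    low, high = F(i - 1), F(i)
    return {q for q, d in dist.items() if q != p and low <= d <= high}


# Hypothetical network; weaknesses chosen away from the interval boundaries.
g = undirected([("a", "b", 0.3), ("a", "c", 0.6), ("b", "d", 0.9),
                ("c", "d", 0.5), ("d", "e", 0.7)])
first = neighborhood(g, "a", 1, lambda x: x * x)
second = neighborhood(g, "a", 2, lambda x: x * x)
print(sorted(first), sorted(second))
```

With F(x) = x^2 the intervals [0, 1], [1, 4], [4, 9], ... widen as i grows, so successive neighborhoods reach progressively farther into the network.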


The Bi-GRAPPIN Algorithm

Let GN_1 and GN_2 be the graph networks of two different organisms, with n_1 and n_2 nodes, respectively

Align each pair of proteins (p’, p’’) | p’ ∈ GN_1 and p’’ ∈ GN_2 (e.g., by the BLAST 2 Sequences algorithm)


The Bi-GRAPPIN Algorithm

INPUT: a sequence similarity dictionary SSD storing all the triplets:

– <p’, p’’, f_0> | p’ ∈ GN_1, p’’ ∈ GN_2, f_0 ∈ [0,1]

– f_0: obtained from sequence alignment parameters

OUTPUT: a dictionary FSD storing:

– <p’, p’’, f_p> | p’ ∈ GN_1, p’’ ∈ GN_2, f_p ∈ [0,1]

– f_p: functional similarity


The Bi-GRAPPIN Algorithm

FSD = SSD
for each <p’, p’’, f_0> ∈ SSD
– if (f_0 > f_cut-off)
   ▪ set i = 1
   ▪ while i < i_MAX
      – generate N(p’, i) and N(p’’, i)
      – compute a bipartite graph maximum weight matching between N(p’, i) and N(p’’, i)
      – refine f_0 obtaining a new value f_p, according to the objective function of the max. weight matching
      – i = i + 1
return FSD

f_cut-off: a fixed threshold value

i_MAX: corresponds to the maximum network percentage to be analyzed
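The control flow of the pseudocode above can be sketched in Python as follows. This is only a skeleton under stated assumptions, not the authors' implementation: the neighborhood generators, the matching score and the refinement rule are passed in as hypothetical callables, and the sketch assumes that each iteration refines the value currently stored in FSD (which equals f_0 at i = 1), a detail the slides do not settle.

```python
def bi_grappin(ssd, neighborhood_1, neighborhood_2, match_score, refine,
               f_cutoff, i_max):
    """Skeleton of the main loop; ssd maps (p1, p2) -> f_0 and the result maps to f_p."""
    fsd = dict(ssd)                                  # FSD = SSD
    for (p1, p2), f0 in ssd.items():
        if f0 > f_cutoff:
            i = 1
            while i < i_max:
                n1 = neighborhood_1(p1, i)           # N(p', i)
                n2 = neighborhood_2(p2, i)           # N(p'', i)
                mu = match_score(n1, n2, fsd)        # from the max. weight matching
                # Assumption: refine the current value, not always the original f_0.
                fsd[(p1, p2)] = refine(i, mu, fsd[(p1, p2)])
                i += 1
    return fsd


# Toy run with stub components, only to exercise the control flow.
toy_ssd = {("a", "x"): 0.8, ("b", "y"): 0.1}
toy_fsd = bi_grappin(
    toy_ssd,
    neighborhood_1=lambda p, i: set(),
    neighborhood_2=lambda p, i: set(),
    match_score=lambda n1, n2, fsd: 1.0,
    refine=lambda i, mu, current: 0.5 * mu + 0.5 * current,
    f_cutoff=0.5,
    i_max=3,
)
print(toy_fsd)
```

Pairs whose sequence similarity does not exceed the cut-off keep their original score, which is what limits the number of expensive matching computations.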


Example (1/3)

[Figure: the yeast and fly networks around the proteins p’ and p’’. Labels: Target; i_MAX = 4; f_0(p’, p’’) > f_CUT-OFF; F(x) = identity; <w,c> = <1,1>; N(-, 1)]


Example (2/3)

Bipartite graph maximum weight matching between N(p’, 1) and N(p’’, 1)

[Figure: bipartite graph between the yeast and fly neighborhoods, with edge weights 0.75, 0.83, 0.89, 0.82, 0.65, 0.22, 0.73, 0.85, 0.34, 0.33]


Example (2/3, continued)

Bipartite graph maximum weight matching between N(p’, 1) and N(p’’, 1)

[Figure: the same bipartite graph between the yeast and fly neighborhoods]

f_p(1) = δ(1) * µ(N(p’, 1), N(p’’, 1), FSD, α) + [1 - δ(1)] * f_0(p’, p’’)
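The matching and refinement step can be sketched in Python on a toy instance. Several things here are assumptions made for illustration: the arrangement of weights on edges (only the weight values come from the slide), the normalization of µ as the matching weight divided by the size of the smaller neighborhood, the value of δ, and the use of exhaustive search, which is workable only for very small neighborhoods.

```python
from itertools import permutations


def max_weight_matching(left, right, weight):
    """Exhaustive maximum weight bipartite matching; weight(l, r) must be >= 0."""
    if len(left) > len(right):
        total, pairs = max_weight_matching(right, left, lambda r, l: weight(l, r))
        return total, [(l, r) for r, l in pairs]
    best_total, best_pairs = 0.0, []
    for perm in permutations(right, len(left)):
        pairs = list(zip(left, perm))
        total = sum(weight(l, r) for l, r in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


def refine(f0, mu, delta):
    """Convex combination: delta * mu + (1 - delta) * f0."""
    return delta * mu + (1 - delta) * f0


# Hypothetical similarities between yeast neighbors (y*) and fly neighbors (f*).
sims = {("y1", "f1"): 0.75, ("y1", "f2"): 0.83, ("y2", "f2"): 0.89,
        ("y2", "f3"): 0.22, ("y3", "f1"): 0.34, ("y3", "f3"): 0.85}
yeast, fly = ["y1", "y2", "y3"], ["f1", "f2", "f3"]

total, pairs = max_weight_matching(yeast, fly, lambda l, r: sims.get((l, r), 0.0))
mu = total / min(len(yeast), len(fly))       # assumed normalization of the objective
f_p = refine(f0=0.6, mu=mu, delta=0.5)       # f0 and delta are illustrative values
print(pairs, round(mu, 2), round(f_p, 3))
```

For realistic neighborhood sizes the exhaustive search would be replaced by a polynomial-time assignment algorithm such as the Hungarian method.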


Example (3/3)

[Figure, shown in three steps: the yeast and fly networks around the proteins p’ and p’’, with the neighborhoods N(-, 1), N(-, 2) and N(-, 3) considered in turn. Labels: Target; i_MAX = 4; f_0(p’, p’’) > f_CUT-OFF; F(x) = identity; <w,c> = <1,1>]

At the end, the triplet <p’, p’’, f_p(3)> is stored in FSD


Synthetic data (1/3)

Very similar neighborhoods: final f_p greater than f_0


Synthetic data (2/3)

High f_0 but very dissimilar neighborhoods: final f_p lower than f_0


Synthetic data (3/3)

High f_0, not very similar N(-, 1) but very similar N(-, 2): final f_p greater than f_0


Functional Orthologs

S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006.

R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB 2007. LNB, 2007.


Further experiments

Query the D. melanogaster PPI network with Abp1, for which no evident homolog has been detected

– The most similar protein based on sequence homology: CG10083 (a drebrin-like protein)

Abp1: an actin-binding protein regulating actin nucleation

Is it possible to find other proteins involved in actin reorganization by comparing the subnetwork comprising Abp1 and its first two neighborhoods against the entire Drosophila network?


Further experiments

Best match according to our refined similarity: CG10083 (confirms the pairwise sequence similarity)

Abp1 and CG10083 are both actin-binding proteins

Other proteins of unknown function, showing low sequence similarity with Abp1, may share a similar function

CG6873-PA: a cofilin-like protein possibly involved in cytoskeleton shaping

SSD: <Abp1, CG6873-PA, 0.287>

FSD: <Abp1, CG6873-PA, 0.442 >


Asymmetric Alignment

• Master Network
– Guides the alignment process

• Slave Network
– It is aligned to the master

• Some organisms are well characterized:
– E.g., Saccharomyces cerevisiae

• This is not the case for many other organisms

• Advantage:
– Results retain the structural characteristics of the master network (so they are sound)


Asymmetric Alignment

• Linearization of the slave network:
– Translation of the network into a sequence of symbols
• Given a linearization of the slave, find the portion of the master that can be associated with it
• Motivations:
– Only the slave network is linearized; all the structural information about the master network is kept
– The approximation allows us to find similar groups of proteins, not just isomorphic structures
– The resulting algorithm has polynomial time complexity
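
To make the idea of linearization concrete, here is one simple way to turn a network into a sequence of symbols: visit the nodes in breadth-first order and emit each node name once. This is a sketch under stated assumptions, not necessarily the linearization used in the method presented here; `linearize_bfs` and the toy slave network are hypothetical.

```python
from collections import deque

def linearize_bfs(adjacency, start):
    """Flatten a network into a list of node symbols in breadth-first order."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Sort the neighbors so that the output is deterministic
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

# Toy slave network with placeholder protein names
slave = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s4"],
    "s3": ["s1"],
    "s4": ["s2"],
}
sequence = linearize_bfs(slave, "s1")
```

Note that this sketch only covers the connected component containing the start node, and that any flattening of this kind discards part of the slave's topology, which is why only the slave is linearized while the master keeps its full structure.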


Asymmetric Alignment

• Master network alignment model
– Weighted finite-state automaton
– States of the model correspond to proteins
• Find the maximum-scoring path (among the states of the master) for the linearization of the slave network: Viterbi algorithm

(p1, 0), (p2, 1), ... , (p3, 0)   score 1

(p1, 0), (*, 1), ... , (*, 0)   score 2
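
The maximum-scoring path search can be illustrated with a generic Viterbi routine over additive scores. This is a minimal sketch: the score tables (`start_score`, `transition_score`, `emission_score`) and the toy values are assumptions for illustration only, and the weights actually used by the master network model are not given on these slides.

```python
def viterbi(observations, states, start_score, transition_score, emission_score):
    """Return (best_score, best_path) for the path maximizing the summed scores."""
    # best[s]: score of the best path that ends in state s after the current symbol
    best = {s: start_score[s] + emission_score[s][observations[0]] for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Best predecessor of s, taking the transition score into account
            prev = max(states, key=lambda p: best[p] + transition_score[p][s])
            new_best[s] = best[prev] + transition_score[prev][s] + emission_score[s][symbol]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path

# Toy example: two master states (m1, m2) and a slave linearization (s1, s2)
states = ["m1", "m2"]
start_score = {"m1": 0.0, "m2": 0.0}
transition_score = {
    "m1": {"m1": -2.0, "m2": 0.0},
    "m2": {"m1": 0.0, "m2": -2.0},
}
emission_score = {
    "m1": {"s1": 1.0, "s2": -1.0},  # hypothetical similarity of m1 to each slave symbol
    "m2": {"s1": -1.0, "s2": 1.0},
}
score, path = viterbi(["s1", "s2"], states, start_score, transition_score, emission_score)
```

With n symbols in the linearization and m master states, this routine takes O(n·m²) time, which is consistent with the polynomial complexity claimed for the approach.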


Asymmetric Alignment

• Global alignment of Yeast (master) and Fly (slave)


Asymmetric Alignment

• Yeast (as the master) vs. Fly:
– 945 protein pairings
• Fly (as the master) vs. Yeast:
– 707 protein pairings
• Possible explanations:
– The Yeast network is better characterized than the Fly network; with Yeast as the slave, much structural information gets lost
– More regions of the Yeast network have been conserved in the Fly than vice versa, since the Fly is more complex


PPI networks clustering

• Aim: cluster the dense regions of a given PPI network, since biologists have observed that groups of highly interacting proteins may be involved in common biological processes


Search for functional modules in PPI networks

• The network is modeled by a matrix representing the interactions.
• The algorithm introduces the concept of quality of a sub-matrix and applies a greedy technique to discover compact regions of the network.
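
The greedy search over sub-matrices can be sketched as follows. The quality function used here is plain density (the fraction of interacting pairs among the selected proteins), which is an assumption chosen for illustration and not the quality measure defined by the algorithm; `density`, `greedy_dense_region`, and the `min_density` threshold are likewise hypothetical.

```python
def density(matrix, members):
    """Fraction of interacting pairs among the selected proteins."""
    members = sorted(members)
    pairs = [(i, j) for a, i in enumerate(members) for j in members[a + 1:]]
    if not pairs:
        return 1.0  # a single protein is trivially compact
    return sum(matrix[i][j] for i, j in pairs) / len(pairs)

def greedy_dense_region(matrix, seed, min_density=0.8):
    """Grow a cluster from seed, adding the best node while density stays high."""
    cluster = {seed}
    candidates = set(range(len(matrix))) - cluster
    while candidates:
        # Greedy step: pick the candidate that keeps the region most compact
        best = max(sorted(candidates), key=lambda n: density(matrix, cluster | {n}))
        if density(matrix, cluster | {best}) < min_density:
            break
        cluster.add(best)
        candidates.remove(best)
    return cluster

# Toy symmetric 0/1 interaction matrix: proteins 0-2 form a triangle,
# protein 3 interacts only with protein 2, protein 4 is isolated
toy = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
region = greedy_dense_region(toy, seed=0)
```

Starting from protein 0, the sketch collects the triangle and stops, because adding protein 3 would drop the density to 4/6, below the threshold.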


Validation


References

1. N. Ferraro, L. Palopoli, S. Panni and S. E. Rombo. “Master-Slave” Biological Network Alignment. In Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA 2010), 215–229, Connecticut, USA, 2010.

2. F. Bruno, L. Palopoli and S. E. Rombo. New trends in graph mining: Structural and Node-colored network motifs. International Journal of Knowledge Discovery in Bioinformatics, 1(1), 81–99, 2010.

3. C. Pizzuti and S. E. Rombo. Multi-functional Protein Clustering in PPI Networks. BIRD 2008.

4. V. Fionda, S. Panni, L. Palopoli and S. E. Rombo. Bi-GRAPPIN: Bipartite graph based protein-protein interaction networks similarity search. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07). Silicon Valley, USA, 2007.

5. C. Pizzuti and S. E. Rombo. PINCoC: a Co-Clustering based Method to Analyze Protein-Protein Interaction Networks. In Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'07). Birmingham, UK, 16th-19th December, 2007.

6. S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006.


Further topics (from 2009 onward):

• Alignment of biological networks
• Integration and cleaning of biological networks
• Querying of biological databases/networks
• Biological networks clustering
• RNA structure prediction
• RNA sequence/structure alignment
