Facoltà di Ingegneria - unipa.itmath.unipa.it/rombo/files/teaching/seminario2010.pdf · Facoltà di Ingegneria ... t h o e to te a n tea ten i e hi he r her. 16 Bioinformatics techniques

Università della Calabria

Facoltà di Ingegneria

BIOINFORMATICS TECHNIQUES AND METHODOLOGIES

Research group coordinated by Prof. Luigi Palopoli

Lecturer: Simona Rombo

OUTLINE

1. Introduction to Bioinformatics

2. Pattern discovery
– Strings
– Images

3. Biological Networks Analysis
– Network alignment
– Network clustering


Donald Knuth, 1993:

“…It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at people’s fingertips, that it won’t be pretty much working on refinement of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can’t predict an unending growth. I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on…”

Introduction to Bioinformatics



There are several facts about biology that are important to keep in mind:

– In biology there are no rules without exceptions

– In reasoning about biological structures, looking for generalizations may often be misleading

– It is often impossible to study a biological phenomenon in isolation: it may occur only when other related phenomena occur as well, and these must be taken into account too

– Reasoning with incomplete information is the rule rather than the exception

– In reasoning about biological structures and functions it is important to bear in mind the pervasive role of evolution


A definition: “Bioinformatics is the combination of biology and information technology. It is the branch of science that deals with computer-based analysis of large biological data sets. Bioinformatics incorporates the development of databases to store and search data, and statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.” (R.M. Twyman)

In most cases, computer-based tools developed in bioinformatics still require expert human intervention to solve the problems they address


Generally speaking, the aim of bioinformatics is to help biologists in gathering and processing biological data and to aid in studying protein structures and interactions in order to allow optimal drug design.




Here is a summary of CS methods and techniques relevant to bioinformatics:

– String algorithms, grammars and automata
– Indexing methods and query optimization
– Integration techniques
– Optimization techniques
– Dynamic programming and heuristics
– Data mining and machine learning techniques
– Probability- and statistics-based methods
– Computational geometry methods
– Text mining
– …


Two main points of view:

1. Cellular components (e.g., DNA, RNA, proteins)

2. Interaction of cellular components (e.g., metabolic pathways, protein-protein interactions)




Introduction to Bioinformatics – Cellular Components

DNA

AMINO ACIDS

Proteins are the core structures determining the cell life cycle.

They are made up of elementary units called amino acids or residues (a few exceptions exist).

There are 20 standard amino acids in nature.


• Another perspective is the analysis of mutual interactions among proteins

• Proteins are involved in complexes performing specific biological functions

[Figure: Saccharomyces cerevisiae]

Introduction to Bioinformatics – Interactions of components


Pattern Discovery



Efficient data structures

Trie
• A tree data structure used to store strings
• Each edge has a label representing a symbol
• Two edges out of the same node have distinct labels
• Each node, except the root, is associated with a string
• Concatenating all the symbols on the path from the root to a node n yields the string corresponding to n
• All the descendants of a node n are associated with strings sharing a common prefix, i.e., the string corresponding to n


Example

A trie storing the words {to, te, tea, ten, hi, he, her}:

root
├─ t
│  ├─ o  (to)
│  └─ e  (te)
│     ├─ a  (tea)
│     └─ n  (ten)
└─ h
   ├─ i  (hi)
   └─ e  (he)
      └─ r  (her)
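The trie properties listed above can be illustrated with a minimal Python sketch. The class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `with_prefix`) are illustrative choices, not taken from the slides; the stored words are those of the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # edge label (one symbol) -> child node
        self.is_word = False   # True if the path from the root spells a stored word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for symbol in word:
            # Edges leaving the same node have distinct labels by construction.
            node = node.children.setdefault(symbol, TrieNode())
        node.is_word = True

    def _walk(self, text):
        """Follow the path spelling `text`; return the node reached, or None."""
        node = self.root
        for symbol in text:
            node = node.children.get(symbol)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """Return, sorted, all stored words that start with `prefix`."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for symbol, child in node.children.items():
                stack.append((child, text + symbol))
        return sorted(found)


trie = Trie(["to", "te", "tea", "ten", "hi", "he", "her"])
print(trie.contains("tea"))    # a stored word
print(trie.with_prefix("te"))  # all words below the node for "te"
```

Every word below the node reached by "te" shares that prefix, which is why prefix queries cost time proportional to the prefix length plus the size of the subtree visited.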


Efficient data structures

Suffix Tree

Given a string s of n characters over the alphabet Σ, a suffix tree T associated with s can be defined as a trie containing all the n suffixes of s.

• For each leaf i of T, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix s_i of s

• For any pair of suffixes of s, the path spelling their longest common prefix is shared in T

(Example on the string abbababbab)
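As a rough illustration of the definition above, the following Python sketch builds the (uncompacted) trie of all suffixes of the example string and uses it to count pattern occurrences. It is a naive quadratic construction, not a linear-time suffix tree algorithm; the function names and the terminator symbol `$` are assumptions made for this sketch.

```python
def build_suffix_trie(s, terminator="$"):
    """Insert every suffix of s (followed by a terminator) into a trie of nested dicts.

    The terminator, assumed not to occur in s, guarantees that each suffix
    ends at its own leaf. Time and space are O(n^2) in the worst case.
    """
    text = s + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
    return root


def count_leaves(node):
    """Number of leaves below a node (a leaf is an empty dict)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_occurrences(root, pattern):
    """Occurrences of pattern in s = number of suffixes that start with pattern."""
    node = root
    for symbol in pattern:
        if symbol not in node:
            return 0
        node = node[symbol]
    return count_leaves(node)


suffix_trie = build_suffix_trie("abbababbab")
print(count_occurrences(suffix_trie, "ab"))
print(count_occurrences(suffix_trie, "abba"))
```

A suffix tree in the usual sense additionally merges chains of single-child nodes into one edge, which brings the number of nodes down to O(n).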




Pattern Discovery

Problem: often the size of the output is exponential in the input size



Pattern Discovery – 2D Array



Definition of maximal motif

[Figure: three example motifs, labeled "not maximal in composition", "not maximal in length" and "MAXIMAL"]


BASIS

• A basis of an image I is a set of irredundant motifs able to generate all the other motifs of I

• It is possible to prove that each image has exactly one basis, i.e., the basis is unique

• The size of the basis is linear in the size of the image: if I has size N, the number of motifs in the basis is O(N)

• In general, the number of motifs with don’t cares in I is exponential in N

An important problem is the extraction of the basis from I


A key concept: autocorrelation

Autocorrelations: the meets between I and all its bites

[Figure: an image I over the alphabet {a, b}, two portions P and Q, and the meet between P and Q, in which mismatching positions become don’t cares (•)]


Consensus, Meet, Autocorrelation

Projection at $(i_1, j_1)$ and $(i_2, j_2)$
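To make consensus and meet concrete, here is a small Python sketch on two equally sized 2D arrays over {a, b}. It assumes the usual position-wise definition: equal symbols are kept and mismatches become don't cares. Trimming the border rows and columns made only of don't cares mirrors the 1D definition of meet and is an assumption of this sketch, as are the function names and the sample arrays.

```python
DONT_CARE = "•"


def consensus(p, q):
    """Position-wise comparison of two 2D arrays given as lists of strings."""
    if len(p) != len(q) or any(len(a) != len(b) for a, b in zip(p, q)):
        raise ValueError("P and Q must have the same shape")
    return ["".join(x if x == y else DONT_CARE for x, y in zip(a, b))
            for a, b in zip(p, q)]


def meet(p, q):
    """Consensus with border rows/columns made only of don't cares removed.

    ASSUMPTION: the trimming rule is borrowed from the 1D definition.
    Returns [] when P and Q agree nowhere.
    """
    def blank(cells):
        return all(ch == DONT_CARE for ch in cells)

    rows = consensus(p, q)
    while rows and blank(rows[0]):
        rows = rows[1:]
    while rows and blank(rows[-1]):
        rows = rows[:-1]
    while rows and rows[0] and blank(r[0] for r in rows):
        rows = [r[1:] for r in rows]
    while rows and rows[0] and blank(r[-1] for r in rows):
        rows = [r[:-1] for r in rows]
    return rows


P = ["aab",
     "bab",
     "abb"]
Q = ["bab",
     "aab",
     "bba"]
for row in meet(P, Q):
    print(row)
```

In this sample the first column of the consensus contains only don't cares, so it is dropped from the meet.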


Basic Approach

Theorem: the basis is a subset of the set of autocorrelations

Three steps:

1. Generate all the autocorrelations of the input image I: $O(N^2)$

2. Compute the lists of occurrences of the autocorrelations: ?

3. Discard redundant motifs: $O(N^2)$


Second step

[Figure: example image over the alphabet {a, b}]

1) Fischer & Paterson: $O(N^2 \log n \log\log n)$

2) Incremental building of the set B of irredundant motifs: $O(N^3)$

3) Exploit some properties of don’t cares: $O(N^2)$, but only for binary alphabets

[Figure labels: $R_{ij}$, $B_{ij}$ and $B_{ij+1}$, at row i and column j]
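For reference, the task of the second step (listing the occurrences of a motif with don't cares in an image) can be sketched naively in Python as follows. This brute-force scan is not one of the three strategies listed above; it only fixes what an occurrence is. The function name, the 1-based (row, column) convention and the sample data are assumptions of the sketch.

```python
DONT_CARE = "•"


def occurrences(image, motif):
    """Return the 1-based (row, col) positions where motif occurs in image.

    A don't care in the motif matches any symbol of the image.
    Cost: O(N * M) for an image of N cells and a motif of M cells.
    """
    n_rows, n_cols = len(image), len(image[0])
    m_rows, m_cols = len(motif), len(motif[0])
    found = []
    for r in range(n_rows - m_rows + 1):
        for c in range(n_cols - m_cols + 1):
            if all(motif[i][j] == DONT_CARE or motif[i][j] == image[r + i][c + j]
                   for i in range(m_rows) for j in range(m_cols)):
                found.append((r + 1, c + 1))
    return found


image = ["abab",
         "babb",
         "abab"]
motif = ["a•",
         "ba"]
print(occurrences(image, motif))
```

Repeating such a scan for every autocorrelation is what makes the second step the bottleneck of the basic approach.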


Optimal Approach

Exploit some properties holding for |Σ|=2 (e.g., Σ ={a,b})



Optimal Approach - Example

Is (2, 2) an occurrence of $A_{3,4}$?

Is (2, 4) an occurrence of $A_{3,4}$?

$d_1 = 2$

$d_2 = 0$, $d_3 = 2$

$d_2 = 1$, $d_3 = 1$


Optimal Approach

Three steps:

1. Generate all the autocorrelations of the input image I: $O(N^2)$

2. Compute the lists of occurrences of the autocorrelations: $O(N^2)$

3. Discard redundant motifs: $O(N^2)$

Overall cost: $O(N^2)$

Only for black-and-white images


Image Compression

Main Idea: Exploit motif basis as 2D patches



References:

– A. Amelio, A. Apostolico and S. E. Rombo. Image Compression by 2D Motif Basis. In Proceedings of the IEEE Data Compression Conference (DCC 2011), IEEE CS Press, Snowbird, UT, USA, 2011 (forthcoming).

– A. Apostolico, L. Parida and S. E. Rombo. Motif Patterns in 2D. Theoretical Computer Science, 2008.

– S. E. Rombo. Optimal extraction of motif patterns in 2D. Inf. Process. Lett. 109(17): 1015-1020 (2009).

– A. Apostolico and L. Parida. Incremental Paradigms of Motif Discovery. J. of Comp. Biol. 11:1 (2004) 15-25.

– A. Amir and M. Farach. Two-dimensional dictionary matching. Inf. Process. Lett. 44:5 (1992) 233-239.

– M.J. Fischer and M.S. Paterson. String Matching and Other Products. In: R.M. Karp (Ed.), Complexity of Computation (SIAM-AMS Proceedings, v.7), 1974, pp. 113-125.


Further topics (from 2009 onward):

• Image compression

• Analysis of biological images

• Pattern discovery/matching on images with rotations, scaling and other variants

• Techniques applied to image similarity search

• Pattern discovery (motif extraction) on biological strings


PPI networks similarity search

• Evolution influences protein-protein interactions

• Proteins cannot be analyzed independently

• Both high-throughput and computational methods contribute to discovering and predicting protein-protein interactions

Biological Networks Analysis


The Interaction Network of an organism:

nodes=proteins

edges=interactions




Why search for similarity between proteins belonging to different PPI networks?

To identify functional conservation across species


Our basic idea

Two proteins p1 and p2 in two different PPI networks may be considered similar if:

– p1 and p2 have similar sequences

– the proteins that p1 and p2 are connected with, i.e., their neighborhoods, have similar sequences


Refining protein similarities

S = sequence similarity

S’ = refined similarity


The Graph Network

P = a set of nodes labeled by protein ids

I = a set of undirected labeled edges

– each edge is labeled by a pair <w,c>, with w,c ∈ [0,1]

– w = weakness

– c = confidence

Graph Network: GN = <P,I>


Interaction Path_i (I-Path_i)

A path such that:

– $F(i-1) \le \sum_u w_u \le F(i)$, with $i \ge 1$ and $F(0) = 0$, where $w_u$ is the weakness of the u-th edge of the path

Example, with $F(x) = x^2$ and $i = 1$:

[Figure: graph on the nodes p1, ..., p9 with edge labels <0.8,0.4>, <0.2,0.7>, <0.1,0.6>, <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]

<p2, p1, p4>: satisfied

<p3, p4, p5, p6>: satisfied

<p4, p5, p9>: not satisfied


Cumulative Confidence

Given an I-Path_i:

– $C = \prod_u c_u$, where $c_u$ is the confidence of the u-th edge of the path

Example (same graph as before, $F(x) = x^2$, $i = 1$). For the path <p2, p1, p4>:

C = 0.4 * 0.7 = 0.28
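The I-Path condition and the cumulative confidence can be checked with a few lines of Python. In this sketch a path is given directly as the list of its edge labels <w, c>. The confidences 0.4 and 0.7 are those of the example path <p2, p1, p4>; the weakness values paired with them are assumed for illustration, since the edge layout of the figure is not recoverable. Function names are also illustrative.

```python
import math


def F(x):
    """Bounding function used in the example: F(x) = x^2, so F(0) = 0."""
    return x ** 2


def is_i_path(edge_labels, i, bound=F):
    """True if F(i-1) <= sum of weaknesses <= F(i), for i >= 1."""
    if i < 1:
        raise ValueError("i must be at least 1")
    total_weakness = sum(w for w, _ in edge_labels)
    return bound(i - 1) <= total_weakness <= bound(i)


def cumulative_confidence(edge_labels):
    """Product of the confidences along the path."""
    return math.prod(c for _, c in edge_labels)


# Labels <w, c> of a two-edge path; the weaknesses 0.3 and 0.2 are ASSUMED.
path = [(0.3, 0.4), (0.2, 0.7)]
print(is_i_path(path, 1))           # total weakness 0.5 lies in [F(0), F(1)] = [0, 1]
print(cumulative_confidence(path))  # 0.4 * 0.7, up to floating-point rounding

# A path that is too "weak" for i = 1 but falls in the band [F(1), F(2)] = [1, 4].
long_path = [(0.6, 0.2), (0.9, 0.4)]
print(is_i_path(long_path, 1), is_i_path(long_path, 2))
```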


i-th Neighborhood

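Given a node p in GN = <P,I>:

– $N(p,i) = \{\, q \mid q \in P,\ q \neq p,\ \langle p,q \rangle \text{ is an I-Path}_i \text{ in } GN \text{ with minimum } \sum_u w_u \,\}$

Example, with $F(x) = x^2$ and $i = 1$:

[Figure: graph on the nodes p1, ..., p6 with edge labels <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]

N(p3, 1) = {p1, p2, p4, p6}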
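One way to read the definition of the i-th neighborhood is: compute, for every node q, the minimum total weakness of a path from p to q, and keep q if that value lies between F(i-1) and F(i). The Python sketch below follows this reading using Dijkstra's algorithm. The interpretation, the function names and the assignment of weakness values to edges are all assumptions; the edge layout was chosen only so that the result matches the example set.

```python
import heapq


def undirected(edges):
    """Build an adjacency map {node: {neighbor: weakness}} from (u, v, w) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def min_weakness(graph, source):
    """Dijkstra: minimum sum of weakness values from source to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, w in graph.get(node, {}).items():
            candidate = d + w
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def neighborhood(graph, p, i, bound=lambda x: x ** 2):
    """N(p, i): nodes whose minimum-weakness path from p is an I-Path_i."""
    dist = min_weakness(graph, p)
    return {q for q, d in dist.items()
            if q != p and bound(i - 1) <= d <= bound(i)}


# HYPOTHETICAL edge layout using the five weakness values of the example.
gn = undirected([
    ("p3", "p1", 0.9),
    ("p3", "p2", 0.6),
    ("p3", "p4", 0.5),
    ("p4", "p6", 0.3),
    ("p6", "p5", 0.7),
])
print(sorted(neighborhood(gn, "p3", 1)))
print(sorted(neighborhood(gn, "p3", 2)))
```

With F(x) = x^2, the neighborhoods for growing i form bands of increasing width around p, so each further iteration looks at a larger portion of the network.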


The Bi-GRAPPIN Algorithm

Let GN1 and GN2 be the graph networks of two different organisms, with n1 and n2 nodes, respectively.

Align each pair of proteins (p’,p’’) such that p’ ∈ GN1 and p’’ ∈ GN2 (e.g., with the BLAST 2 Sequences algorithm)



INPUT: a sequence similarity dictionary SSD storing all the triplets:

– <p’, p’’, f0> | p’ ∈ GN1, p’’ ∈ GN2, f0 ∈ [0,1]

– f0: obtained from sequence alignment parameters

OUTPUT: a dictionary FSD storing:

– <p’, p’’, fp> | p’ ∈ GN1, p’’ ∈ GN2, fp ∈ [0,1]

– fp: functional similarity



FSD = SSD
for each <p’, p’’, f0> ∈ SSD
    – if (f0 > fcut-off)
        ▪ set i = 1
        ▪ while i < iMAX
            – generate N(p’,i) and N(p’’,i)
            – compute a bipartite graph maximum weight matching between N(p’,i) and N(p’’,i)
            – refine f0, obtaining a new value fp, according to the objective function of the maximum weight matching
            – i = i + 1
return FSD

fcut-off: a fixed threshold value

iMAX: corresponds to the maximum network percentage to be analyzed


Example (1/3)

Setting: p’ in yeast and p’’ in fly, with f0(p’,p’’) > fcut-off

iMAX = 4, F(x) = identity, <w,c> = <1,1>

[Figure: the neighborhoods N(·, 1) of p’ and p’’]


Example (2/3)

Bipartite graph maximum weight matching between N(p’,1) and N(p’’,1)

[Figure: bipartite graph between the yeast and fly neighbors, with edge weights 0.75, 0.83, 0.89, 0.82, 0.65, 0.22, 0.73, 0.85, 0.34, 0.33]


The maximum weight matching between N(p’,1) and N(p’’,1) is then used to refine the similarity:

$f_p(1) = \delta(1)\,\mu(N(p',1), N(p'',1), FSD, \alpha) + [1 - \delta(1)]\, f_0(p',p'')$
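The matching and refinement step can be sketched in Python as below. Several points are assumptions of this sketch and not statements about the Bi-GRAPPIN implementation: the refinement is read as a convex combination δ·µ + (1 − δ)·f0; µ is taken to be the matching weight divided by the size of the larger neighborhood; the parameter α is omitted; the matching is found by brute force (fine for tiny neighborhoods only); and the weight matrix is a hypothetical arrangement of some of the values shown in the example figure.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Brute-force maximum weight bipartite matching.

    weights[a][b] is the similarity between left node a and right node b.
    Returns (total weight, list of matched (a, b) index pairs).
    """
    rows, cols = len(weights), len(weights[0])
    if rows > cols:
        # Transpose so that the smaller side is enumerated, then swap pairs back.
        transposed = [list(col) for col in zip(*weights)]
        value, pairs = max_weight_matching(transposed)
        return value, [(a, b) for b, a in pairs]
    best_value, best_pairs = -1.0, []
    for perm in permutations(range(cols), rows):
        value = sum(weights[a][b] for a, b in enumerate(perm))
        if value > best_value:
            best_value, best_pairs = value, list(enumerate(perm))
    return best_value, best_pairs


def refine(f0, weights, delta):
    """ASSUMED reading of the refinement: delta * mu + (1 - delta) * f0."""
    value, _ = max_weight_matching(weights)
    mu = value / max(len(weights), len(weights[0]))  # ASSUMED normalization
    return delta * mu + (1 - delta) * f0


# HYPOTHETICAL similarities between 3 yeast neighbors and 3 fly neighbors.
weights = [[0.75, 0.22, 0.34],
           [0.83, 0.89, 0.33],
           [0.65, 0.82, 0.85]]
value, pairs = max_weight_matching(weights)
print(pairs, round(value, 2))
print(round(refine(f0=0.6, weights=weights, delta=0.5), 3))
```

For neighborhoods of realistic size the brute-force search should be replaced by a polynomial assignment algorithm such as the Hungarian method.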


Example (3/3)

[Figure sequence: the neighborhoods N(·, 1), N(·, 2) and N(·, 3) of p’ (yeast) and p’’ (fly) are processed in turn]

At the end, the triplet <p’, p’’, fp(3)> is stored in FSD

Synthetic data

– Very similar neighborhoods: final fp greater than f0

– High f0 but very dissimilar neighborhoods: final fp lower than f0

– High f0, not very similar N(·, 1) but very similar N(·, 2): final fp greater than f0


Functional Orthologs

S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006.

R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB 2007. LNBI, 2007.


Further experiments

Query the D. melanogaster PPI network with Abp1, for which no evident homolog has been detected
– The most similar protein based on sequence homology: CG10083 (a drebrin-like protein)

Abp1: an actin-binding protein regulating actin nucleation

Is it possible to find other proteins involved in actin reorganization by comparing the sub-network composed of Abp1 together with its first two neighborhoods against the entire Drosophila network?


Further experiments

Best match according to our refined similarity: CG10083 (confirming the pairwise sequence similarity)

Abp1 and CG10083 are both actin-binding proteins

Other proteins of unknown function that show low sequence similarity with Abp1 may share a similar function

CG6873-PA: a cofilin-like protein possibly involved in cytoskeleton shaping

SSD: <Abp1, CG6873-PA, 0.287>

FSD: <Abp1, CG6873-PA, 0.442>


Asymmetric Alignment

• Master network
– Guides the alignment process

• Slave network
– It is aligned to the master

• Some organisms are well characterized
– E.g., Saccharomyces cerevisiae

• This is not the case for many other organisms

• Advantage:
– Results retain the structural characteristics of the master network (so they are sound)



• Linearization of the slave network:
– Translation of the network into a sequence of symbols

• Given a linearization of the slave, find the portion of the master that can be associated with it

• Motivations:
– Only the slave network is linearized; all the structural information about the master network is kept
– The approximation allows us to find similar groups of proteins, not just isomorphic structures
– The resulting algorithm has polynomial time complexity




• Master network alignment model
– Weighted finite-state automaton
– States of the model correspond to proteins

• Find the maximum scoring path (among the states of the master) for the linearization of the slave network: Viterbi algorithm

(p1, 0), (p2, 1), ... , (p3, 0) score 1

(p1, 0), (*, 1), ... , (*, 0) score 2




•Global Alignment of Yeast (Master) and Fly (Slave)





• Yeast (as the master) vs. Fly:
– 945 protein pairings

• Fly (as the master) vs. Yeast:
– 707 protein pairings

• Possible explanation:
– The Yeast network is better characterized than the Fly network; with Yeast as the slave, much structural information gets lost
– More regions of the Yeast network have been conserved in the Fly than vice versa, since the Fly is more complex


PPI networks clustering

• Aim: clustering dense regions of a given PPI network, since it has been observed by biologists that groups of highly interacting proteins could be involved in common biological processes




Search of functional modules in PPI networks

•The network is modeled by a matrix representing the interactions.

• The algorithm introduces the concept of quality of a sub-matrix and applies a greedy technique to discover compact regions of the network.






Validation



References

1. N. Ferraro, L. Palopoli, S. Panni and S. E. Rombo. “Master-Slave” Biological Network Alignment. In Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA 2010), 215–229, Connecticut, USA, 2010.

2. F. Bruno, L. Palopoli and S. E. Rombo. New trends in graph mining: Structural and Node-colored network motifs. International Journal of Knowledge Discovery in Bioinformatics, 1(1), 81–99, 2010.

3. C. Pizzuti and S. E. Rombo. Multi-functional Protein Clustering in PPI Networks. BIRD 2008.

4. V. Fionda, S. Panni, L. Palopoli and S. E. Rombo. Bi-GRAPPIN: Bipartite graph based protein-protein interaction networks similarity search. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07). Silicon Valley, USA, 2007.

5. C. Pizzuti and S. E. Rombo. PINCoC: a Co-Clustering based Method to Analyze Protein-Protein Interaction Networks. In Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'07). Birmingham, UK, 16th-19th December, 2007.

6. S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006.


Further topics (from 2009 onward):

• Alignment of biological networks
• Integration and cleaning of biological networks
• Querying of biological databases/networks
• Biological networks clustering
• RNA structure prediction
• RNA sequence/structure alignment

