Università della Calabria
Facoltà di Ingegneria
BIOINFORMATICS TECHNIQUES AND METHODOLOGIES
Research group coordinated by Prof. Luigi Palopoli
Lecturer: Simona Rombo
OUTLINE
1. Introduction to Bioinformatics
2. Pattern discovery
– Strings
– Images
3. Biological Networks Analysis
– Network alignment
– Network clustering
Introduction to Bioinformatics
Donald Knuth, 1993:
“…It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at people’s fingertips, that it won’t be pretty much working on refinement of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can’t predict an unending growth. I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on…”
There are several facts about biology that are important to keep in mind:
– In biology there are no rules without exceptions
– In reasoning with biological structures, looking for generalizations may often be misleading
– It is often impossible to look at a biological phenomenon in isolation, because it may take place only as long as other related phenomena take place as well, and these need to be taken into account too
– To reason with incomplete information is quite the rule rather than the exception
– In reasoning about biological structures and functions it is important to bear in mind the pervasive role of evolution
A definition: “Bioinformatics is the combination of biology and information technology. It is the branch of science that deals with computer-based analysis of large biological data sets. Bioinformatics incorporates the development of databases to store and search data, and statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.” (R.M. Twyman)
In most cases, the computer-based tools developed in bioinformatics still require expert human intervention to solve the problems they address
Generally speaking, the aim of bioinformatics is to help biologists in gathering and processing biological data and to aid in studying protein structures and interactions in order to allow optimal drug design.
Here is a summary of CS methods and techniques relevant to bioinformatics:
– String algorithms, grammars and automata
– Indexing methods and query optimization
– Integration techniques
– Optimization techniques
– Dynamic programming and heuristics
– Data mining and machine learning techniques
– Probability and statistics-based methods
– Computational geometry methods
– Text mining
– …
Two main points of view:
1. Cellular components (e.g., DNA, RNA, proteins)
2. Interaction of cellular components (e.g., metabolic pathways, protein-protein interactions)
Introduction to Bioinformatics – Cellular Components
DNA
AMINO ACIDS
Proteins are the core structures determining the cell life cycle.
They are made up of elementary units called amino acids, or residues.
There are 20 standard amino acids (a few exceptions exist)
Introduction to Bioinformatics – Interactions of components
• Another perspective is the analysis of the mutual interactions among proteins
• Proteins are involved in complexes performing specific biological functions
[Figure: Saccharomyces cerevisiae]
Pattern Discovery
Efficient data structures
Trie
• A tree data structure used to store strings
• Each edge has a label representing a symbol
• Two edges out of the same node have distinct labels
• Each node, except the root, is associated with a string
• Concatenating all the symbols on the path from the root to a node n yields the string corresponding to n
• All the descendants of a node n are associated with strings having a common prefix, i.e., the string corresponding to n
Example: a trie storing the words {to, te, tea, ten, hi, he, her}:
root
– t
  – o (to)
  – e (te)
    – a (tea)
    – n (ten)
– h
  – i (hi)
  – e (he)
    – r (her)
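The trie properties listed above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the class and method names (`TrieNode`, `Trie`, `contains`, `with_prefix`) are assumptions chosen for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge label (one symbol) -> child node
        self.is_word = False  # True if the path from the root spells a stored word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for symbol in word:
            # Two edges out of the same node never share a label.
            node = node.children.setdefault(symbol, TrieNode())
        node.is_word = True

    def _find(self, prefix):
        """Return the node reached by spelling prefix, or None."""
        node = self.root
        for symbol in prefix:
            node = node.children.get(symbol)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """All stored words below the node for prefix (they share that prefix)."""
        start = self._find(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for symbol, child in node.children.items():
                stack.append((child, text + symbol))
        return sorted(found)


trie = Trie(["to", "te", "tea", "ten", "hi", "he", "her"])
print(trie.with_prefix("te"))
```

The `with_prefix` method relies on the last property of the list: every descendant of the node for "te" corresponds to a string starting with "te".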
Efficient data structures
Suffix Tree
Given a string s of n characters over the alphabet Σ, a suffix tree T associated with s can be defined as a trie containing all the n suffixes of s.
• For each leaf i of T, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix s_i of s
• For any pair of suffixes of s, the path associated with their longest common prefix is shared in T
(Example on the string abbababbab)
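As a rough illustration of the definition, the sketch below builds an uncompacted suffix trie for the example string by inserting every suffix, which takes quadratic time and space. Practical suffix trees compact unary paths and can be built in linear time, which this sketch does not attempt. The function names and the `$` terminator are assumptions made for this example.

```python
def build_suffix_trie(s, terminator="$"):
    """Insert all n suffixes of s (each followed by a terminator) into a nested-dict trie."""
    text = s + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
    return root


def count_leaves(node):
    # A node without children is a leaf, i.e., the end of exactly one suffix.
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_occurrences(trie, pattern):
    """Occurrences of pattern in s = number of suffixes having pattern as a prefix."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return 0
        node = node[symbol]
    return count_leaves(node)


suffix_trie = build_suffix_trie("abbababbab")
print(count_occurrences(suffix_trie, "ab"))
```

Because suffixes sharing a prefix share a path, counting the leaves below the node reached by a pattern gives its number of occurrences in s.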
Problem: often the size of the output is exponential in the input size
Pattern Discovery – 2D Array
Definition of maximal motif
[Figure: example motifs labeled “not maximal in composition”, “not maximal in length” and “MAXIMAL”]
BASIS
• A basis of an image I is a set of irredundant motifs able to generate all the other motifs of I
• It can be proved that each image has ONLY ONE basis, i.e., the basis is unique
• The size of the basis is linear in the size of the image
– If I has size N, the number of motifs in the basis is O(N)
• In general, the number of motifs with don’t cares in I is exponential in N
An important problem is the extraction of the basis from I
A key concept: autocorrelation
Autocorrelations: the meet between I and all its bites
[Figure: an image A, two of its bites P and Q, and the meet between P and Q; • marks a don’t care]
Consensus, Meet, Autocorrelation
Projection at (i_1, j_1) and (i_2, j_2)
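The slides name consensus and meet without spelling out their definitions. The sketch below assumes the usual ones from the motif literature cited in the references: the consensus of two equally sized arrays keeps a symbol where the arrays agree and puts a don't care elsewhere, and the meet is the consensus with its outer all-don't-care rows and columns removed. The function names and the toy arrays are assumptions made for this example.

```python
DONT_CARE = "•"


def consensus(p, q):
    """Cell-wise agreement of two equally shaped arrays (given as lists of strings)."""
    if len(p) != len(q) or any(len(a) != len(b) for a, b in zip(p, q)):
        raise ValueError("arrays must have the same shape")
    return ["".join(x if x == y else DONT_CARE for x, y in zip(a, b))
            for a, b in zip(p, q)]


def meet(p, q):
    """Consensus trimmed of leading/trailing rows and columns made only of don't cares."""
    rows = consensus(p, q)
    solid_rows = [r for r, row in enumerate(rows)
                  if any(ch != DONT_CARE for ch in row)]
    if not solid_rows:
        return []  # the two arrays agree nowhere
    rows = rows[solid_rows[0]:solid_rows[-1] + 1]
    solid_cols = [c for c in range(len(rows[0]))
                  if any(row[c] != DONT_CARE for row in rows)]
    return [row[solid_cols[0]:solid_cols[-1] + 1] for row in rows]


# Hypothetical 3x4 arrays over the alphabet {a, b}.
P = ["abab", "bbab", "abba"]
Q = ["bbab", "bbaa", "abbb"]
print(meet(P, Q))
```

Under these assumptions, an autocorrelation of an image I would be obtained by calling `meet` on I and one of its bites, after aligning the two on their overlapping region.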
Basic Approach
Theorem: the basis is a subset of the set of autocorrelations
Three steps:
1. Generate all the autocorrelations of the input image I: O(N^2)
2. Compute the lists of occurrences of the autocorrelations: ?
3. Discard redundant motifs: O(N^2)
Second step
1) Fischer & Paterson: O(N^2 log n log log n)
2) Incremental building of the set B of irredundant motifs: O(N^3)
3) Exploit some properties of don’t cares: O(N^2), but only for binary alphabets
[Figure: incremental step at position (i, j), from B_ij to B_ij+1 through R_ij]
Optimal Approach
Exploit some properties holding for |Σ|=2 (e.g., Σ ={a,b})
Optimal Approach - Example
Is (2, 2) an occurrence of A_{3,4}?
Is (2, 4) an occurrence of A_{3,4}?
[Figure values: d_1 = 2; d_2 = 0, d_3 = 2; d_2 = 1, d_3 = 1]
Optimal Approach
Three steps:
1. Generate all the autocorrelations of the input image I: O(N^2)
2. Compute the lists of occurrences of the autocorrelations: O(N^2)
3. Discard redundant motifs: O(N^2)
Overall cost: O(N^2)
Only black-and-white images
Image Compression
Main Idea: Exploit motif basis as 2D patches
References:
– A. Amelio, A. Apostolico and S. E. Rombo. Image Compression by 2D Motif Basis. In Proceedings of the IEEE Data Compression Conference (DCC 2011), IEEE CS Press, Snowbird, UT, USA, 2011 (forthcoming).
– A. Apostolico, L. Parida and S. E. Rombo. Motif Patterns in 2D. Theoretical Computer Science, 2008.
– S. E. Rombo. Optimal extraction of motif patterns in 2D. Inf. Process. Lett. 109(17): 1015-1020 (2009).
– A. Apostolico and L. Parida. Incremental Paradigms of Motif Discovery. J. of Comp. Biol. 11:1 (2004) 15-25.
– A. Amir and M. Farach. Two-dimensional dictionary matching. Inf. Process. Lett. 44:5 (1992) 233-239.
– M.J. Fischer and M.S. Paterson. String Matching and Other Products. In: R.M. Karp (Ed.), Complexity of Computation (SIAM-AMS Proceedings, v.7), 1974, pp. 113-125.
Further topics (from 2009 onwards):
• Image compression
• Analysis of biological images
• Pattern discovery/matching on images with rotations, scaling and other variants
• Techniques applied to similarity search between images
• Pattern discovery (motif extraction) on biological strings
Biological Networks Analysis
PPI networks similarity search
• Evolution influences protein-protein interactions
• Proteins cannot be analyzed independently
• Both high-throughput and computational methods contribute to discovering and predicting protein-protein interactions
The Interaction Network of an organism:
nodes=proteins
edges=interactions
Why search for similarity between proteins belonging to different PPI networks?
To identify functional conservation across species
Our basic idea
Two proteins p1 and p2 in two different PPI networks may be considered similar if:
– p1 and p2 have similar sequences
– the proteins that p1 and p2 are connected with, i.e., their neighborhoods, have similar sequences
Refining protein similarities
S=sequence similarity
S’ = refined similarity
The Graph Network
Graph Network: GN = <P, I>
P = a set of nodes labeled with protein IDs
I = a set of undirected edges, each labeled with a pair <w, c>, where w, c ∈ [0,1]
– w = weakness
– c = confidence
Interaction Path_i (I-Path_i)
A path such that:
– F(i-1) ≤ Σ_u w_u ≤ F(i), i ≥ 1, F(0) = 0
where the sum ranges over the edges u of the path.
Example, with F(x) = x^2 and i = 1:
[Figure: a graph network on the nodes p1, ..., p9, with edge labels <0.8,0.4>, <0.2,0.7>, <0.1,0.6>, <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]
<p2, p1, p4>: satisfied
<p3, p4, p5, p6>: satisfied
<p4, p5, p9>: not satisfied
Cumulative Confidence
Given an I-Path_i:
– C = Π_u c_u
where the product ranges over the edges u of the path.
Example (same graph network as in the previous example, F(x) = x^2, i = 1):
For the path <p2, p1, p4>:
C = 0.4 * 0.7 = 0.28
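The I-Path_i condition and the cumulative confidence are both simple aggregates over the edge labels of a path, as the sketch below illustrates. The function names are assumptions made for this example. The weakness values are hypothetical, because the assignment of labels to edges in the figure cannot be recovered from the text, while the confidences 0.4 and 0.7 are those of the worked example.

```python
import math


def is_i_path(weaknesses, i, F=lambda x: x ** 2):
    """Check F(i-1) <= sum of edge weaknesses <= F(i), with i >= 1."""
    if i < 1:
        raise ValueError("i must be >= 1")
    total = sum(weaknesses)
    return F(i - 1) <= total <= F(i)


def cumulative_confidence(confidences):
    """C = product of the edge confidences along the path."""
    return math.prod(confidences)


# Hypothetical weaknesses of the two edges of a path (not read off the figure).
print(is_i_path([0.2, 0.3], i=1))          # sum 0.5 lies in [F(0), F(1)] = [0, 1]
print(is_i_path([0.6, 0.9], i=1))          # sum 1.5 exceeds F(1) = 1
# Confidences of the path <p2, p1, p4> from the example above.
print(cumulative_confidence([0.4, 0.7]))   # approximately 0.28
```

With F(x) = x^2, the second hypothetical path fails for i = 1 but qualifies for i = 2, since 1.5 lies in [F(1), F(2)] = [1, 4].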
i-th Neighborhood
Given a node p in GN = <P,I>:
– N(p,i) = {q | q ∈ P, q ≠ p, <p,q> is an I-Path_i in GN with minimum Σ_u w_u}
Example, with F(x) = x^2 and i = 1:
[Figure: a graph network on the nodes p1, ..., p6, with edge labels <0.3,0.4>, <0.6,0.2>, <0.9,0.4>, <0.7,0.1>, <0.5,0.3>]
N(p3,1) = {p1, p2, p4, p6}
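One way to read the definition is that q belongs to N(p,i) when the minimum-weakness path from p to q is an I-Path_i. Under that reading (an interpretation, not something the slides state explicitly), the neighborhood can be computed with Dijkstra's algorithm. The sketch below uses a hypothetical graph, since the edges of the figure cannot be recovered from the text, and the function names are assumptions made for this example.

```python
import heapq


def undirected(edges):
    """Build an adjacency map from (node, node, weakness) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def min_weakness(graph, source):
    """Dijkstra: minimum total weakness from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def neighborhood(graph, p, i, F=lambda x: x ** 2):
    """Nodes whose minimum-weakness path from p has total weakness in [F(i-1), F(i)]."""
    dist = min_weakness(graph, p)
    return {q for q, d in dist.items() if q != p and F(i - 1) <= d <= F(i)}


# Hypothetical path-shaped network p1 - p2 - p3 - p4 - p5.
graph = undirected([("p1", "p2", 0.3), ("p2", "p3", 0.5),
                    ("p3", "p4", 0.9), ("p4", "p5", 0.7)])
print(sorted(neighborhood(graph, "p3", 1)))
print(sorted(neighborhood(graph, "p3", 2)))
```

In this toy network p5 is at total weakness 1.6 from p3, so it falls outside the first neighborhood and inside the second.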
The Bi-GRAPPIN Algorithm
Let GN_1 and GN_2 be the graph networks of two different organisms, with n_1 and n_2 nodes, respectively.
Align each pair of proteins (p’,p’’) | p’ ∈ GN_1 and p’’ ∈ GN_2 (e.g., with the BLAST 2 Sequences algorithm)
INPUT: a sequence similarity dictionary SSD storing all the triplets:
– <p’, p’’, f_0> | p’ ∈ GN_1, p’’ ∈ GN_2, f_0 ∈ [0,1]
– f_0: obtained from sequence alignment parameters
OUTPUT: a dictionary FSD storing:
– <p’, p’’, f_p> | p’ ∈ GN_1, p’’ ∈ GN_2, f_p ∈ [0,1]
– f_p: functional similarity
FSD = SSD
for each <p’, p’’, f_0> ∈ SSD
– if (f_0 > f_cut-off)
  ▪ set i = 1
  ▪ while i < i_MAX
    – generate N(p’,i) and N(p’’,i)
    – compute a bipartite graph maximum weight matching between N(p’,i) and N(p’’,i)
    – refine f_0, obtaining a new value f_p, according to the objective function of the maximum weight matching
    – i = i+1
return FSD
where f_cut-off is a fixed threshold value and i_MAX corresponds to the maximum network percentage to be analyzed
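The loop above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the matching is found by brute force (adequate for tiny neighborhoods only), and the refinement rule, a fixed weight `delta` blending the normalized matching weight with the current similarity, is a hypothetical stand-in for the actual objective function. All function and parameter names are invented for this example.

```python
from itertools import permutations


def max_weight_matching(left, right, weight):
    """Brute-force maximum weight bipartite matching; returns the total weight."""
    left, right = sorted(left), sorted(right)
    if len(left) > len(right):
        return max_weight_matching(right, left, lambda a, b: weight(b, a))
    best = 0.0
    for chosen in permutations(right, len(left)):
        best = max(best, sum(weight(a, b) for a, b in zip(left, chosen)))
    return best


def bi_grappin(ssd, neigh1, neigh2, f_cutoff, i_max, delta=0.5):
    """ssd maps (p1, p2) -> f_0; neigh1/neigh2 map (protein, i) -> i-th neighborhood."""
    fsd = dict(ssd)  # FSD = SSD

    def similarity(a, b):
        return fsd.get((a, b), 0.0)

    for (p1, p2), f0 in ssd.items():
        if f0 <= f_cutoff:
            continue
        fp, i = f0, 1
        while i < i_max:
            n1, n2 = neigh1(p1, i), neigh2(p2, i)
            size = max(len(n1), len(n2))
            if size:
                # Hypothetical refinement: blend normalized matching weight with fp.
                mu = max_weight_matching(n1, n2, similarity) / size
                fp = delta * mu + (1 - delta) * fp
            i += 1
        fsd[(p1, p2)] = fp
    return fsd


# Toy data: protein "a" has neighbors {b, c}, protein "x" has neighbors {y, z}.
ssd = {("a", "x"): 0.6, ("b", "y"): 0.9, ("c", "z"): 0.8,
       ("b", "z"): 0.1, ("c", "y"): 0.2}
fsd = bi_grappin(
    ssd,
    neigh1=lambda p, i: {"b", "c"} if (p, i) == ("a", 1) else set(),
    neigh2=lambda p, i: {"y", "z"} if (p, i) == ("x", 1) else set(),
    f_cutoff=0.5,
    i_max=2,
)
print(fsd[("a", "x")])
```

In the toy run the neighbors of "a" and "x" match well (b with y, c with z), so the similarity of the pair rises from 0.6 to about 0.725. For realistic neighborhood sizes a polynomial-time assignment algorithm would replace the brute-force matching.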
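Example (1/3)
Networks: yeast (P’) and fly (P’’)
Settings: i_MAX = 4, f_0(p’,p’’) > f_CUTOFF, F(x) = identity, <w,c> = <1,1>
[Figure: the first neighborhoods N(p’,1) and N(p’’,1) of the target proteins]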
Example (2/3)
Bipartite graph maximum weight matching between N(p’,1) and N(p’’,1)
[Figure: bipartite graph between yeast and fly proteins, with edge weights 0.75, 0.83, 0.89, 0.82, 0.65, 0.22, 0.73, 0.85, 0.34, 0.33]
f_p(1) = δ(1) · µ(N(p’,1), N(p’’,1), FSD, α) + [1 − δ(1)] · f_0(p’,p’’)
Example (3/3)
Networks: yeast (P’) and fly (P’’)
Settings: i_MAX = 4, f_0(p’,p’’) > f_CUTOFF, F(x) = identity, <w,c> = <1,1>
[Figure sequence: the neighborhoods N(·, 1), N(·, 2) and N(·, 3) of the two target proteins are compared in turn]
Result: <p’, p’’, f_p(3)> is stored in FSD
Synthetic data (1/3)
Very similar neighborhoods: final f_p greater than f_0
Synthetic data (2/3)
High f_0 but very dissimilar neighborhoods: final f_p lower than f_0
Synthetic data (3/3)
High f_0, not very similar N(·, 1) but very similar N(·, 2): final f_p greater than f_0
Functional Orthologs
S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006.
R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB 2007. LNB, 2007.
Further experiments
Query the D. melanogaster PPI network with Abp1, for which no evident homolog has been detected
– The most similar protein based on sequence homology: CG10083 (a drebrin-like protein)
Abp1: an actin-binding protein regulating actin nucleation
Is it possible to find other proteins involved in actin reorganization by comparing the sub-network composed of Abp1 and its first two neighborhoods against the entire Drosophila network?
Best match according to our refined similarity: CG10083 (confirming the pairwise sequence similarity)
Abp1 and CG10083 are both actin-binding proteins
Other proteins of unknown function that show low sequence similarity with Abp1 may share a similar function
CG6873-PA: a cofilin-like protein possibly involved in cytoskeleton shaping
SSD: <Abp1, CG6873-PA, 0.287>
FSD: <Abp1, CG6873-PA, 0.442>
Asymmetric Alignment
• Master network
– Guides the alignment process
• Slave network
– It is aligned to the master
• Some organisms are well characterized
– E.g., Saccharomyces cerevisiae
• This is not the case for many other organisms
• Advantage:
– Results retain the structural characteristics of the master network (so they are sound)
• Linearization of the slave network:
– Translation of the network into a sequence of symbols
• Given a linearization of the slave, find the portion of the master that can be associated with it
• Motivations:
– Only the slave network is linearized; all the structural information about the master network is kept
– The approximation allows us to find similar groups of proteins, not just isomorphic structures
– The resulting algorithm has polynomial time complexity
• Master network alignment model
– Weighted finite-state automaton
– States of the model correspond to proteins
• Find the maximum scoring path (among the states of the master) for the linearization of the slave network: Viterbi algorithm
(p1, 0), (p2, 1), ... , (p3, 0) score 1
(p1, 0), (*, 1), ... , (*, 0) score 2
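The slides do not give the scoring model of the automaton, so the following is only a generic max-score Viterbi sketch over hypothetical scores: each state is a master protein, each symbol of the sequence is a slave protein, `emission` scores a master/slave pairing and `transition` penalizes moves between master proteins that do not interact. All names and numbers are assumptions made for this example.

```python
def viterbi(states, sequence, emission, transition):
    """Return (best score, best state path) for the given sequence of symbols."""
    if not states or not sequence:
        return 0.0, []
    score = {s: emission(s, sequence[0]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in sequence[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + transition(r, s))
            new_score[s] = score[prev] + transition(prev, s) + emission(s, symbol)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Hypothetical master network m1 - m2 - m3 and linearized slave q1, q2, q3.
master_edges = {frozenset(("m1", "m2")), frozenset(("m2", "m3"))}
pair_similarity = {("m1", "q1"): 0.9, ("m2", "q2"): 0.8,
                   ("m3", "q3"): 0.7, ("m3", "q1"): 0.2}

best_score, best_path = viterbi(
    states=["m1", "m2", "m3"],
    sequence=["q1", "q2", "q3"],
    emission=lambda m, q: pair_similarity.get((m, q), 0.0),
    transition=lambda r, s: 0.0 if frozenset((r, s)) in master_edges else -1.0,
)
print(best_path, best_score)
```

The dynamic program visits every pair of states once per symbol, so its cost is polynomial in the size of the master network and the length of the linearization, in line with the complexity claim made for the approach.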
•Global Alignment of Yeast (Master) and Fly (Slave)
• Yeast (as the master) vs. Fly:
– 945 protein pairings
• Fly (as the master) vs. Yeast:
– 707 protein pairings
• Possible explanations:
– The Yeast network is better characterized than the Fly network; with Yeast as the slave, much structural information gets lost
– More regions of the Yeast network have been conserved in the Fly than vice versa, since the Fly is more complex
PPI networks clustering
• Aim: clustering dense regions of a given PPI network, since it has been observed by biologists that groups of highly interacting proteins could be involved in common biological processes
Search for functional modules in PPI networks
• The network is modeled by a matrix representing the interactions.
• The algorithm introduces the concept of quality of a sub-matrix and applies a greedy technique to discover compact regions of the network.
Validation
References
1. N. Ferraro, L. Palopoli, S. Panni and S. E. Rombo. “Master-Slave” Biological Network Alignment. In Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA 2010), 215–229, Connecticut, USA, 2010.
2. F. Bruno, L. Palopoli and S. E. Rombo. New trends in graph mining: Structural and Node-colored network motifs. International Journal of Knowledge Discovery in Bioinformatics, 1(1), 81–99, 2010.
3. C. Pizzuti and S. E. Rombo. Multi-functional Protein Clustering in PPI Networks. BIRD 2008.
4. V. Fionda, S. Panni, L. Palopoli and S. E. Rombo. Bi-GRAPPIN: Bipartite graph based protein-protein interaction networks similarity search. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM'07). Silicon Valley, USA, 2007.
5. C. Pizzuti and S. E. Rombo. PINCoC: a Co-Clustering based Method to Analyze Protein-Protein Interaction Networks. In Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'07). Birmingham, UK, 16th-19th December, 2007.
6. S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Research, 16(3):428–435, 2006.
Further topics (from 2009 onwards):
• Alignment of biological networks
• Integration and cleaning of biological networks
• Querying of biological databases/networks
• Biological networks clustering
• RNA structure prediction
• RNA sequence/structure alignment