04/19/23
Non-breaking Similarity of Genomes with Gene Repetitions
Binhai Zhu
Computer Science Department, Montana State University
Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao
Background
Computing the genomic distance between genomes is important in evolutionary molecular biology; the problem was first studied by Sturtevant and Dobzhansky in 1936.
Since 1990, a lot of research has been done on computing genomic distances under the assumption that each gene appears exactly once in a genome, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.
Background (cont.)
On the other hand, gene repetition is very common in genomes. So computing genomic distances with gene repetition is a more realistic problem.
Since this is a typical optimization problem, it makes sense to study its approximability.
Definitions
Given a set F of n gene families (an alphabet), a genome G’ is a sequence of elements of F such that each element has a (+/-) sign.
Example. F={a,b,c,d},
G’=-bd-cab-d-c
We will focus on unsigned sequences in this work.
A genome G is said to be exemplar if every gene appears exactly once in G.
Definitions (cont.)
Given exemplar genomes G and H over the same set of gene families, if the gene pair ab is a substring of G but not of H, then ab constitutes a breakpoint in G.
Example. G=abcdefg
H=efgdcab
There are 3 breakpoints in G (bc, cd and de), and symmetrically 3 in H (gd, dc and ca).
The number of breakpoints between G and H is called the breakpoint distance between G and H.
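The breakpoint definition above can be checked mechanically. The following is a minimal sketch, assuming unsigned exemplar genomes written as plain strings with one character per gene; the function names `adjacencies` and `breakpoints` are illustrative and do not come from the slides.

```python
def adjacencies(genome):
    """Return the set of adjacent gene pairs (substrings of length 2)."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def breakpoints(g, h):
    """List the adjacent pairs of g that do not occur as substrings of h."""
    h_pairs = adjacencies(h)
    return [g[i:i + 2] for i in range(len(g) - 1) if g[i:i + 2] not in h_pairs]


G = "abcdefg"
H = "efgdcab"
print(breakpoints(G, H))       # ['bc', 'cd', 'de']
print(len(breakpoints(H, G)))  # 3
```

Note that the sketch follows the substring definition literally, so the reversed pair dc in H does not match cd in G.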
Exemplar Breakpoint Distance Problem
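• Given two genomes G’ and H’ over n gene families, compute two exemplar genomes G and H, obtained from G’ and H’ respectively by deleting redundant copies of genes, such that the breakpoint distance between G and H is minimized.
• We call this the exemplar breakpoint distance problem (between G’ and H’). Denote the minimum distance by eb(G’,H’)=b(G,H), where b(G,H) is the breakpoint distance between the optimal G and H.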
Approximation Algorithms
• Given a minimization (maximization) problem Π, let OPT be the value of an optimal solution of Π. An approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α × OPT (at least OPT/α).
• Usually we say that A is a factor-α approximation for Π.
Prior Results (1)
• We showed that the exemplar breakpoint distance problem does not admit any approximation unless P=NP (in fact, deciding whether eb(G’,H’)=0 is NP-complete) [Chen, Fu and Zhu, 2006].
• This result holds for any genomic distance d(·,·) satisfying: G=H implies d(G,H)=0.
• Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any WEAK approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].
Prior Results (2)
• On the other hand, for the exemplar breakpoint distance problem, Sankoff [1999] has used branch-and-bound, and Nguyen, Tay and Zhang [2005] have used divide-and-conquer, to obtain good empirical results on practical datasets.
• In a related but slightly different effort, Chauve, et al. [2006] studied exemplar genomic similarity problems whose measures do not satisfy the condition that G=H implies d(G,H)=0, e.g., the exemplar common interval measure problem.
Background for this work
• We try to look at the complement of the breakpoint distance under the gene duplication model.
• As the problem is still hard to approximate, we follow Nguyen, et al. by considering genomes satisfying some practical conditions.
Definitions
• Given exemplar genomes G and H drawn from the same alphabet, ab is a non-breaking point if ab appears in both G and H.
Example. G = abcdefg
H = fegcdab
G and H share two non-breaking points (ab and cd). The number of non-breaking points is called the non-breaking similarity of G and H, denoted nbs(G,H); here nbs(G,H)=2.
Note that when |G|=|H|=n and G=H, nbs(G,H)=n-1.
• Given genomes G’ and H’ drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G and H such that nbs(G,H) is maximized. The resulting maximum is denoted enbs(G’,H’).
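As a small illustration of the pairwise measure defined above, here is a sketch that counts non-breaking points between two exemplar genomes. It assumes genomes are plain strings with one character per gene; the function name `nbs` mirrors the slide notation, but the implementation is only an illustrative assumption.

```python
def nbs(g, h):
    """Count adjacent gene pairs that appear in both exemplar genomes g and h."""
    g_pairs = {g[i:i + 2] for i in range(len(g) - 1)}
    h_pairs = {h[i:i + 2] for i in range(len(h) - 1)}
    # In an exemplar genome every adjacent pair is distinct, so sets suffice.
    return len(g_pairs & h_pairs)


print(nbs("abcdefg", "fegcdab"))  # 2  (the shared pairs are ab and cd)
print(nbs("abcdefg", "abcdefg"))  # 6  (n - 1 with n = 7)
```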
Example
G’ = abcadcefg
H’ = cfegcdabf
We have 4 possible exemplar genomes for G’: abcdefg, abdcefg, bcadefg, badcefg.
We have 4 possible exemplar genomes for H’: cfegdab, cegdabf, fegcdab, egcdabf.
enbs(G’,H’)=nbs(abcdefg,fegcdab)=2.
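The example above can be reproduced by exhaustive search. This is a brute-force sketch only, exponential in the number of duplicated genes, and not the algorithm from the talk; the helper names `exemplars`, `nbs` and `enbs` are illustrative assumptions.

```python
from itertools import product


def exemplars(genome):
    """Yield each exemplar genome obtained by keeping one copy of every gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    seen = set()
    for choice in product(*positions.values()):
        candidate = "".join(genome[i] for i in sorted(choice))
        if candidate not in seen:  # different choices may give the same string
            seen.add(candidate)
            yield candidate


def nbs(g, h):
    """Count adjacent gene pairs shared by exemplar genomes g and h."""
    g_pairs = {g[i:i + 2] for i in range(len(g) - 1)}
    h_pairs = {h[i:i + 2] for i in range(len(h) - 1)}
    return len(g_pairs & h_pairs)


def enbs(g_prime, h_prime):
    """Maximize nbs over all pairs of exemplar genomes (brute force)."""
    return max(nbs(g, h) for g in exemplars(g_prime) for h in exemplars(h_prime))


print(list(exemplars("abcadcefg")))    # ['abcdefg', 'abdcefg', 'bcadefg', 'badcefg']
print(enbs("abcadcefg", "cfegcdabf"))  # 2
```

The maximum of 2 is attained by more than one pair; the pair (abcdefg, fegcdab) shown on the slide is one of them.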
Inapproximability Result
Theorem 1. Given an exemplar genome G and another genome H’ such that the genes are all from the same alphabet of size n and each gene appears in H’ at most two times, the Exemplar Non-breaking Similarity Problem over G and H’ does not admit any approximation of factor $n^{1-\varepsilon}$, unless P=NP.
Proof Idea: A linear reduction from Independent Set (IS).
[Figure: input graph with N=5 vertices v1, …, v5 and M=5 edges e1, …, e5; N+M is even.]
G: v1 v’1 v2 v’2 v3 v’3 v4 v’4 v5 v’5 x1 e1 x’1 x2 e2 x’2 x3 e3 x’3 x4 e4 x’4 x5 e5 x’5
H’: $Y_{N+M-1} Y_{N+M-3} \cdots Y_1 Y_{N+M} Y_{N+M-2} \cdots Y_2$ =
x4 x’4 x2 x’2 v5 e4 e5 v’5 v3 e1 v’3 v1 e1 e2 v’1 x5 x’5 x3 x’3 x1 x’1 v4 e3 e5 v’4 v2 e2 e3 e4 v’2
where $Y_i = v_i A_i v'_i$ if $i \le N$, and $Y_{N+i} = x_i x'_i$ if $i \le M$.
H: x4 x’4 x2 x’2 v5 e5 v’5 v3 v’3 v1 e1 e2 v’1 x5 x’5 x3 x’3 x1 x’1 v4 v’4 v2 e3 e4 v’2 corresponds to the optimal independent set {v3,v4}.
The input graph has an IS of size K iff enbs(G,H’)=K.
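The construction on this slide can be replayed in code. The sketch below rebuilds G and H’ for the example and checks by brute force that enbs(G,H’) equals the maximum independent set size. Two assumptions are made: the edge list is inferred from the blocks of the slide's H’ (the figure itself is not available), and each A_i is taken to be the list of edges incident to v_i. All function names are illustrative.

```python
from itertools import product

# Edges of the example graph, read off the blocks of the slide's H'.
# The original figure is lost, so this edge list is an assumption.
EDGES = {1: (1, 3), 2: (1, 2), 3: (2, 4), 4: (2, 5), 5: (4, 5)}
N = 5  # number of vertices


def build_instance(n, edges):
    """Build (G, H') as lists of gene names; assumes n + len(edges) is even."""
    m = len(edges)
    g = []
    for i in range(1, n + 1):
        g += [f"v{i}", f"v'{i}"]
    for j in range(1, m + 1):
        g += [f"x{j}", f"e{j}", f"x'{j}"]
    blocks = {}
    for i in range(1, n + 1):
        incident = [f"e{j}" for j in sorted(edges) if i in edges[j]]
        blocks[i] = [f"v{i}", *incident, f"v'{i}"]
    for j in range(1, m + 1):
        blocks[n + j] = [f"x{j}", f"x'{j}"]
    # Odd-indexed blocks in decreasing order, then even-indexed ones.
    order = list(range(n + m - 1, 0, -2)) + list(range(n + m, 0, -2))
    return g, [gene for k in order for gene in blocks[k]]


def pairs(genome):
    """Set of adjacent gene pairs of a genome given as a list."""
    return set(zip(genome, genome[1:]))


def enbs(g, h_prime):
    """Brute-force enbs for an exemplar g and a genome h_prime with repeats."""
    positions = {}
    for i, gene in enumerate(h_prime):
        positions.setdefault(gene, []).append(i)
    g_pairs = pairs(g)
    return max(
        len(g_pairs & pairs([h_prime[i] for i in sorted(choice)]))
        for choice in product(*positions.values())
    )


def max_independent_set(n, edges):
    """Brute-force maximum independent set size over all vertex subsets."""
    best = 0
    for mask in product([0, 1], repeat=n):
        chosen = {i + 1 for i, bit in enumerate(mask) if bit}
        if all(not (u in chosen and v in chosen) for u, v in edges.values()):
            best = max(best, len(chosen))
    return best


G, H_PRIME = build_instance(N, EDGES)
print(enbs(G, H_PRIME), max_independent_set(N, EDGES))  # 2 2
```

In this instance the only pairs of G that can survive in an exemplar of H’ are v_i v’_i, which appear exactly when every edge copy is deleted from the block of v_i, so the emptied blocks form an independent set.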
Positive Results
Our motivation comes from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g.,
…xyx…aba…
Positive Results
Definition:
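occ(g,G’) is the number of occurrences of g in G’.
span(g,G’) is the maximum distance between two copies of g in G’.
$\mathrm{totalocc}(c,G') = \sum_{g \in G',\ \mathrm{span}(g,G') \ge c} \mathrm{occ}(g,G')$, i.e., the total number of occurrences of the genes g in G’ with span(g,G’) ≥ c.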
Example. G’=abcdaebd
span(a,G’)=4, span(b,G’)=5, span(d,G’)=4,
totalocc(4,G’)=6
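The three quantities can be computed directly from their definitions. The following is a minimal sketch, assuming a genome is a plain string with one character per gene and that a gene occurring only once has span 0 (the slides do not state this case explicitly).

```python
def occ(g, genome):
    """Number of occurrences of gene g in the genome."""
    return genome.count(g)


def span(g, genome):
    """Largest distance between two copies of g (0 if g occurs at most once)."""
    positions = [i for i, gene in enumerate(genome) if gene == g]
    return positions[-1] - positions[0] if positions else 0


def totalocc(c, genome):
    """Total occurrences of the genes whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


G_PRIME = "abcdaebd"
print(span("a", G_PRIME), span("b", G_PRIME), span("d", G_PRIME))  # 4 5 4
print(totalocc(4, G_PRIME))  # 6
```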
Positive Results
Theorem 2. Let G’ and H’ be two genomes with t=totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in $O(3^{\lfloor t/3 \rfloor} n^{c+2+\varepsilon})$ time.
Idea 1: Given an exemplar genome G and another genome H” satisfying span(g,H”)≤c for every g in H”, we can use divide and conquer to compute enbs(G,H”) in $O(n^{c+2+\varepsilon})$ time.
Roughly speaking, write H”=H1H2H3 with |H2|=c, then enumerate all solutions on H2 and recurse.
$T(n) \le 2^{c+1}[2T(n/2+c)] + O(n) \le O(n^{c+2+\varepsilon})$
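For readers who want to see why the recurrence gives the stated exponent, here is one way to solve it. It is a sketch that reads the recurrence as $T(n) \le 2^{c+1}[2T(n/2+c)] + O(n)$, treats c as a constant and ignores floors and ceilings; the auxiliary function S is introduced only for this derivation.

```latex
\begin{align*}
T(n) &\le 2^{c+1}\bigl[\,2\,T(n/2 + c)\,\bigr] + O(n)
      = 2^{c+2}\,T(n/2 + c) + O(n). \\
\intertext{Substituting $S(m) = T(m + 2c)$ removes the additive shift, since $(m+2c)/2 + c = m/2 + 2c$:}
S(m) &\le 2^{c+2}\,S(m/2) + O(m). \\
\intertext{Because $\log_2 2^{c+2} = c + 2 > 1$, the leaves of the recursion tree dominate:}
S(m) &= O\bigl(m^{c+2}\bigr),
\qquad\text{so}\qquad
T(n) = O\bigl(n^{c+2}\bigr) \subseteq O\bigl(n^{c+2+\varepsilon}\bigr).
\end{align*}
```

Under these assumptions the bound on the slide holds with room to spare; the extra ε is slack in the stated bound.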
Idea 2: As t is considered a constant, we enumerate all possibilities for deleting duplicated genes in G’ (to obtain G) and for deleting genes with span greater than c in H’ (to obtain H”). By Lemma 6, there are at most $4 \cdot 3^{\lfloor t/3 \rfloor}$ such combinations. Therefore, the total running time is $4 \cdot 3^{\lfloor t/3 \rfloor} \cdot O(n^{c+2+\varepsilon}) = O(3^{\lfloor t/3 \rfloor} n^{c+2+\varepsilon})$.
Positive Results
Theorem 3. Let G’ and H’ be two genomes with a total of t genes g satisfying shift(g,G’,H’)>c, for some constant c. Then enbs(G’,H’) can be computed in $O(3^{\lfloor t/3 \rfloor} n^{2c+1+\varepsilon})$ time.
Example. G’=abcadef
H’=bcedefad
shift(a,G’,H’) = 6