Search for a Star: Approximate Gene Cluster Discovery Problem (AGCDP) as Minimization Problem on Graph
Jeffrey A. Aborot, Henry Adorna, Jhoirene B. Clemente, Brian Kenneth de Jesus and Geoffrey Solano
Algorithms and Complexity Laboratory, Department of Computer Science
College of Engineering, University of the Philippines Diliman
[email protected], [email protected], [email protected],[email protected], [email protected]
ABSTRACT
Finding gene clusters in genomes is an essential step in establishing relationships among organisms. Gene clusters may express functional dependencies among genes and may give insight into the expression of specific traits. The problem of finding gene clusters among several genomes is referred to as Gene Cluster Discovery, and several models have already been formulated to define it. One formulation of this problem is the Approximate Gene Cluster Discovery Problem (AGCDP), which is modelled as a combinatorial optimization problem in some works. In this paper we propose a transformation of AGCDP into a minimum-weight star finding problem on a graph. Detailed examples are presented to clarify the transformation. A proof of equivalence is also presented, showing that the input parameters of AGCDP correspond to the graph constructed from those parameters.
Keywords
Gene, genome, gene cluster, gene content, linear interval, minimization, minimum-weight star, combinatorial optimization
1. INTRODUCTION
Gene clusters are sets of genes that are closely related to each other. Genes belonging to a cluster may share functional dependencies and may be involved in the expression of a specific trait. Identifying gene clusters is also an essential step in establishing relationships between organisms, as well as in the discovery of drugs and treatments for diseases. The problem of identifying such sets of genes is called Gene Cluster Discovery. This problem has been modelled several times, for example in [3], [4], [5], where genes are modelled as integers and genomes are either permutations or sequences defined over the set of all genes. The models in [3] also take into account gene clusters with gaps (max-gap clusters) and without gaps (exact clusters).
The focus of this work is on the model presented in [5], which defines the Approximate Gene Cluster Discovery Problem (AGCDP) as a combinatorial problem that identifies the set of genes kept "more or less" together across genome sequences. An Integer Linear Programming (ILP) formulation is also presented in [5]. Several modifications of the model, specifically of the objective function, are also presented there to take into account characteristics of real biological data. These include the absence of gene cluster occurrences in some of the input genomes, the identification of valid gene clusters, and the use of a reference genome.
In this paper we represent AGCDP as a graph problem. We define how the set of inputs is transformed into a specific graph called $G_{AGCDP}$. We then discuss how the problem is reduced to finding a minimum-weight star in that graph. Two cases are modelled in this paper: scenarios with and without a given reference genome.
This paper is organized as follows. Section 2 presents a brief discussion of AGCDP as well as the naive and ILP formulations of the problem. Section 3 contains the detailed discussion of how AGCDP is represented as a graph problem. The proof of equivalence of the two representations is discussed in Section 4. Finally, Section 5 concludes the paper.
2. APPROXIMATE GENE CLUSTER DISCOVERY PROBLEM
Necessary for our understanding of the problem are the following definitions.
1. Gene A gene is represented by a non-negative integer $g \in \mathbb{Z}_{\ge 0}$. The special gene represented by the integer 0 stands for genes with no existing homologs, which are not of interest in this problem.
2. Gene Universe The set of all unique genes is called the gene universe and is denoted by $U = \{0, 1, 2, \ldots, N\}$.
3. Genome For simplicity, the genome of a certain individual can be represented as a single sequence of genes from a chromosome. For instance, $g^i = (g^i_1, g^i_2, \ldots, g^i_{n_i})$, where each $g^i_j$ is the $j$th gene in the $i$th genome. In general, a set of genomes is represented by $G = \{g^1, g^2, \ldots, g^t\}$ for some $t \in \mathbb{Z}^+$, where each $g^i$ has length $n_i \in \mathbb{Z}^+$.
4. Linear Interval A linear interval $J^i$ in a genome $g^i = (g^i_1, g^i_2, \ldots, g^i_{n_i})$ is an index set which can either be empty, $J^i = \emptyset$, or $J^i = \{j, j+1, \ldots, k\}$, which can also be denoted as $J^i_{j_i,k_i}$ where $1 \le j_i \le k_i \le n_i$. A set of linear intervals from all genomes is denoted by $J$.
5. Gene Content The gene content $G(J^i_{j,k})$ of a linear interval $J^i_{j,k}$ in genome $g^i$ is the set of unique genes contained in that interval.
6. Set Difference The set difference between two distinct gene contents $G(J^i_{j_i,k_i})$ and $G(J^p_{j_p,k_p})$, $1 \le i \le t$, $1 \le p \le t$ and $i \ne p$, is defined as
$$G(J^i_{j_i,k_i}) \setminus G(J^p_{j_p,k_p}) = \{\, g \mid g \in G(J^i_{j_i,k_i}) \text{ and } g \notin G(J^p_{j_p,k_p}) \,\}.$$
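The definitions above translate almost directly into set operations. The following is a minimal sketch, assuming a genome is stored as a Python tuple and intervals are 1-based and inclusive; the function names `linear_intervals` and `gene_content` and the sample genome are illustrative choices, not part of the original formulation.

```python
def linear_intervals(n):
    """Yield every non-empty linear interval (j, k) with 1 <= j <= k <= n."""
    for j in range(1, n + 1):
        for k in range(j, n + 1):
            yield (j, k)


def gene_content(genome, j, k):
    """Return the set of unique genes in positions j..k (1-based, inclusive)."""
    return frozenset(genome[j - 1:k])


# Illustrative genome; the values are an assumption chosen for this sketch.
genome = (1, 1, 3, 2, 4)

content_a = gene_content(genome, 1, 4)  # positions 1..4 hold genes 1, 1, 3, 2
content_b = gene_content(genome, 3, 5)  # positions 3..5 hold genes 3, 2, 4

# Set difference: genes in content_a that are not in content_b.
missing_from_b = content_a - content_b

# A genome of length n has n(n+1)/2 non-empty linear intervals.
interval_count = len(list(linear_intervals(len(genome))))
```

Using `frozenset` keeps gene contents hashable, so they can later serve as dictionary keys or set members.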
Given the definitions listed above, AGCDP aims to find a gene set $X \subset U$, where $0 \notin X$, and a set of linear intervals $J = \{J^i_{j_i,k_i}\}$ where the gene content $G(J^i_{j_i,k_i})$ for each genome $g^i$ is roughly equal to $X$. To define formally how close $X$ is to the gene contents $G(J^i_{j_i,k_i})$, the number of missing genes $|X \setminus G(J^i_{j_i,k_i})|$ and the number of additional genes $|G(J^i_{j_i,k_i}) \setminus X|$ are computed.
Let us formally define the Basic Approximate Gene Cluster Discovery Problem (AGCDP).
Definition 1 (AGCDP [5]). Given the gene universe $U = \{0, 1, \ldots, N\}$, a set of genomes $G = \{g^1, g^2, \ldots, g^t\}$, a size range $[D^-, D^+]$ or positive constant $D$, and integer weights $w^-$ and $w^+$ corresponding to the cost of missing and additional genes in an interval, identify $X \subset U$, $0 \notin X$, with $D^- \le |X| \le D^+$ or $|X| = D$, and a set of linear intervals $J = \{J^i_{j_i,k_i}\}$, $\forall i$, such that the cost function
$$cost(X, J) = \sum_{i=1}^{t} \left[ \left( w^- \cdot |X \setminus G(J^i_{j_i,k_i})| \right) + \left( w^+ \cdot |G(J^i_{j_i,k_i}) \setminus X| \right) \right]$$
is minimum.
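The cost function of Definition 1 can be evaluated directly once one gene content per genome has been fixed. The sketch below assumes gene contents are given as Python sets; the function name `cost` mirrors the definition, while the sample sets and weights are made up for illustration.

```python
def cost(X, contents, w_minus=1, w_plus=1):
    """Sum, over the chosen gene contents, of weighted missing and additional genes."""
    total = 0
    for G in contents:
        missing = len(X - G)      # genes of X absent from this gene content
        additional = len(G - X)   # genes of this gene content absent from X
        total += w_minus * missing + w_plus * additional
    return total


# Hypothetical candidate gene set and one chosen gene content per genome.
X = frozenset({1, 2, 3})
contents = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({1, 3})]

unit_cost = cost(X, contents)                          # w- = w+ = 1
weighted_cost = cost(X, contents, w_minus=2, w_plus=1)  # missing genes cost double
```

With unit weights each term is the size of the symmetric difference between $X$ and the gene content; unequal weights let missing genes be penalized more (or less) than additional ones.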
AGCDP is a double minimization problem. In order to identify the cost of $X$, we identify the set of linear intervals $J$ which minimizes the value of the objective function $cost(X, J)$. The naive way of identifying the set $X$ is to check the cost of all possible $X \subset U$. The number of candidates is
$$\binom{N}{D}$$
if $|X| = D$, otherwise
$$\sum_{d = D^-}^{D^+} \binom{N}{d}$$
if we have $D^- \le |X| \le D^+$. Also note that we have to identify the set of linear intervals for each genome which minimizes the cost given $X$. Naively this can be done by checking all possible linear intervals in each genome. The total running time of the naive AGCDP solver is the number of $X \subset U$ satisfying the constraint $|X| = D$, times the running time of identifying the best linear interval given $X$. Thus, the naive AGCDP solver takes $O(N!\,(n^2 t))$ time, where $n$ is the length of the longest genome.
Example 1. To illustrate Definition 1, suppose we have the set of genomes $G = \{g^1, g^2, g^3, g^4\}$ defined over the gene universe $U = \{0, 1, 2, 3, 4, 5, 6, 7\}$. The aim of AGCDP is to discover a set of genes $X \subset U$ with cardinality $|X| = D = 3$ such that $cost(X, J)$ is minimum. Let $G$ be equal to the following set. In this example, gene 0 does not occur in any of the genomes.

$g^1$ : 1 1 3 2 4
$g^2$ : 3 2 1 4
$g^3$ : 5 6 1 4 2
$g^4$ : 1 3 7
The naive approach of finding a minimum $X$ is to evaluate each $X$ that satisfies the constraints $X \subset U$ and $|X| = D$. In this case we have to evaluate

$$X \in \{\{1, 3, 2\}, \{2, 3, 4\}, \{3, 4, 5\}, \ldots\},$$

where the total number of gene sets to be evaluated is $\binom{7}{3}$.

Note that to evaluate the score of a certain gene set, another minimization procedure is needed, i.e. identification of the best linear interval for each genome.
After checking all possible linear intervals and all possible gene clusters $X$, we see that $cost(X^*, J^*)$ is minimum where $X^* = \{1, 2, 3\}$ and $J^* = \{J^1_{1,4}, J^2_{1,3}, J^3_{3,5}, J^4_{1,2}\}$, as shown in the figure below.

$g^1$ : [1 1 3 2] 4
$g^2$ : [3 2 1] 4
$g^3$ : 5 6 [1 4 2]
$g^4$ : [1 3] 7

Figure 1: The set of gene contents $\{\{1, 3, 2\}, \{3, 2, 1\}, \{1, 4, 2\}, \{1, 3\}\}$ corresponding to $J^*$. Brackets mark the linear interval selected in each genome.
Given that the penalties for missing and additional genes are the same, i.e. $w^+ = w^- = 1$, the computation for $cost(X^*, J^*)$ is as follows.

$$cost(X^*, J^*) = \sum_{i=1}^{4} \left( |X^* \setminus G^i_{J^*_i}| + |G^i_{J^*_i} \setminus X^*| \right)$$
$$cost(X^*, J^*) = (0 + 0) + (0 + 0) + (1 + 1) + (1 + 0)$$
$$cost(X^*, J^*) = 3$$
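The naive double minimization described above can be sketched in a few lines and run on the genomes of Example 1. This is an illustrative brute-force search, not an algorithm taken from [5]; the helper names `contents_of`, `best_interval_cost` and `naive_agcdp` are assumptions made for this sketch, and the empty interval is included because the definition of a linear interval allows it.

```python
from itertools import combinations


def contents_of(genome):
    """All gene contents of a genome, including that of the empty interval."""
    n = len(genome)
    found = {frozenset()}
    for j in range(n):
        for k in range(j + 1, n + 1):
            found.add(frozenset(genome[j:k]))
    return found


def best_interval_cost(X, genome, w_minus=1, w_plus=1):
    """Inner minimization: cheapest gene content of one genome for a fixed X."""
    return min(w_minus * len(X - G) + w_plus * len(G - X)
               for G in contents_of(genome))


def naive_agcdp(genomes, universe, D, w_minus=1, w_plus=1):
    """Outer minimization: try every X of size D drawn from the universe minus gene 0."""
    candidates = [frozenset(c) for c in combinations(sorted(universe - {0}), D)]
    scored = [(sum(best_interval_cost(X, g, w_minus, w_plus) for g in genomes), X)
              for X in candidates]
    best = min(score for score, _ in scored)
    return best, [X for score, X in scored if score == best], len(candidates)


genomes = [(1, 1, 3, 2, 4), (3, 2, 1, 4), (5, 6, 1, 4, 2), (1, 3, 7)]
best_cost, best_sets, n_candidates = naive_agcdp(genomes, set(range(8)), D=3)
```

On this input the search evaluates $\binom{7}{3} = 35$ candidate sets and returns the minimum cost 3. Besides $\{1, 2, 3\}$, the set $\{1, 2, 4\}$ reaches the same cost, which agrees with the second solution gene cluster reported in Example 3.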
The integer linear programming (ILP) formulation of AGCDP is presented in [5]. There, the numbers of missing and additional genes are expressed as in Equations 1 and 2 respectively.

$$|X \setminus G^i_{J_i}| = \sum_{q=0}^{N} (x_q - l^i_q) \qquad (1)$$

$$|G^i_{J_i} \setminus X| = \sum_{q=0}^{N} (X^i_q - l^i_q) \qquad (2)$$

where $x_q$, $l^i_q$, and $X^i_q$ are entries of binary indicator vectors which pertain to the reference gene set $X$, the intersection $X \cap G^i_{J_i}$ of the reference gene set and the gene content, and the gene content $G^i_{J_i}$ respectively. Further details of the integer linear programming formulation are presented in [5].
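Equations 1 and 2 can be checked numerically on a small instance. The sketch below builds the three indicator vectors for one hypothetical pair of sets and confirms that the sums equal the set-difference sizes. Computing the intersection indicator as the product $x_q \cdot X^i_q$ is an assumption made here for illustration; an ILP would instead enforce it through linear constraints, and the exact constraints used in [5] are not reproduced.

```python
def indicator(gene_set, N):
    """Binary vector of length N + 1 with a 1 at index q when gene q is in gene_set."""
    return [1 if q in gene_set else 0 for q in range(N + 1)]


N = 7
X = {1, 2, 3}   # hypothetical reference gene set
G = {1, 2, 4}   # hypothetical gene content of one interval

x = indicator(X, N)                        # x_q
X_i = indicator(G, N)                      # X^i_q
l_i = [a * b for a, b in zip(x, X_i)]      # l^i_q, indicator of the intersection

missing = sum(xq - lq for xq, lq in zip(x, l_i))        # Equation (1)
additional = sum(Xq - lq for Xq, lq in zip(X_i, l_i))   # Equation (2)
```

Because $l^i_q \le x_q$ and $l^i_q \le X^i_q$ for every $q$, each summand is 0 or 1, so both sums count genes rather than cancel.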
A simpler approach to solving AGCDP is to make use of a reference genome. Note that in the objective function, we are comparing the gene set of each linear interval to a reference gene set $X$. With the use of a reference genome sequence, instead of evaluating all possible $X \subset U$ which we can generate from the universal set, we only evaluate the sets $X$ which are present in the reference genome. This reduces the solution space of AGCDP. However, it is important to note that a global minimum $X$ identified using the naive solution (considering all possible clusters in all the input genomes) may not be the $X$ identified when we use a reference genome, i.e. the $X$ found by the naive solution may not occur in at least one of the genomes. Since in some cases a reference genome is available, one may opt to use other specialized algorithms simpler than the ILP formulation presented in [5]. Reference genome sequences are available in some instances through comparison and alignment of genomes of several species, and these reference sequences are used in the process of identifying new genome sequences.
3. AGCDP AS MINIMIZATION PROBLEM ON GRAPH
Given the definition of AGCDP, we look into two different cases of the problem, which we discuss in the succeeding subsections. The first case is when we consider one of the input genomes as a reference genome, from which we generate gene contents that we treat as reference gene sets, without loss of generality. The second case is when we do not assume a reference genome from the set of input genomes. Instead, we consider each gene content of a linear interval in any of the parts of the graph satisfying the constraint on $|X|$ as a possible solution gene set $X$. This lessens the number of possible reference gene sets, i.e. subsets of $U$, which we need to evaluate.
3.0.1 Construction of Graph
Given the gene universe $U = \{0, 1, \ldots, N\}$, a set of genomes $G = \{g^1, g^2, \ldots, g^t\}$, a size range $|X| \in [D^-, D^+]$ or $|X| = D$, and integer penalty weights $w^-$ and $w^+$ for missing and additional genes respectively in a linear interval, we propose the following transformation of AGCDP input parameters into a t-partite edge-weighted undirected graph representation.
Definition 2. Define a t-partite edge-weighted undirected graph $G_{AGCDP} = (V, E)$, where
$$V = \bigcup_{i=1}^{t} V_i.$$
1. A part $V_i \subset V$ represents a genome $g^i \in G$ where $1 \le i \le t$.
2. A function $v(\cdot)$ is a bijection $v : G^i \longrightarrow V_i$ from the set of all gene contents $G^i = \{G(J^i_{j_i,k_i})\}$ in genome $g^i$ to the set of all vertices in $V_i$, $\forall i$, $1 \le i \le t$. An evaluation of $v(G(J^i_{j_i,k_i}))$ is a vertex in graph $G_{AGCDP}$. By abuse of notation, for a vertex $x_i \in V_i$ an evaluation of $v(x_i)$ is the gene content of the corresponding linear interval in genome $g^i$.
3. An edge $e \in E$ is incident to vertices $x \in V_a$ and $y \in V_b$ if and only if $a \ne b$, i.e. edges join only vertices that belong to different parts.
4. The weight $w_{x,y}$ assigned to an edge incident to vertices $x$ and $y$ is equal to
$$w_{x,y} = w^- \cdot |v(x) \setminus v(y)| + w^+ \cdot |v(y) \setminus v(x)| \qquad (3)$$
where "$\setminus$" is the set difference operator.
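The construction in Definition 2 can be sketched as follows, using the genomes of Example 1 as data. The sketch makes two assumptions that go beyond the definition: each linear interval gets its own vertex, keyed `(i, j, k)`, even when two intervals share a gene content (this matches how the worked examples list vertices), and weights are stored per ordered pair `(x, y)`, since Equation (3) is symmetric only when $w^- = w^+$. The names `build_parts`, `edge_weight` and `build_edges` are hypothetical.

```python
def build_parts(genomes):
    """One dict per genome, mapping vertex key (i, j, k) to its gene content."""
    parts = []
    for i, genome in enumerate(genomes, start=1):
        part = {}
        for j in range(1, len(genome) + 1):
            for k in range(j, len(genome) + 1):
                part[(i, j, k)] = frozenset(genome[j - 1:k])
        parts.append(part)
    return parts


def edge_weight(content_x, content_y, w_minus=1, w_plus=1):
    """Equation (3): w- * |v(x) minus v(y)| + w+ * |v(y) minus v(x)|."""
    return (w_minus * len(content_x - content_y)
            + w_plus * len(content_y - content_x))


def build_edges(parts, w_minus=1, w_plus=1):
    """Weights for every ordered pair of vertices taken from two different parts."""
    edges = {}
    for a, part_a in enumerate(parts):
        for b, part_b in enumerate(parts):
            if a == b:
                continue  # t-partite: no edges inside a part
            for x, cx in part_a.items():
                for y, cy in part_b.items():
                    edges[(x, y)] = edge_weight(cx, cy, w_minus, w_plus)
    return edges


genomes = [(1, 1, 3, 2, 4), (3, 2, 1, 4), (5, 6, 1, 4, 2), (1, 3, 7)]
parts = build_parts(genomes)
edges = build_edges(parts)
part_sizes = [len(p) for p in parts]
```

For these four genomes the parts hold 15, 10, 15 and 6 vertices, so materializing all edges is cheap here; for long genomes the quadratic number of intervals per genome makes computing weights on demand a more practical choice.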
Given $G_{AGCDP}$ as the graph representation of the input parameters of AGCDP, let us now define a collection of vertices star(x) rooted at vertex $x \in V_j$, as defined in [10]. Basically, star(x) is a set containing one vertex from each other part, where the weight between the root and each chosen vertex is minimum among all vertices of that part adjacent to the root. The formal definition of star(x) is given below.
Definition 3. Given a vertex $x \in V_u$ in $G_{AGCDP}$, let star(x) be the set of $t - 1$ vertices $\{x_i \mid 1 \le i \le t,\ i \ne u\}$ rooted at $x$, where
$$x_i = \arg\min_{x_i \in V_i} w_{x_i,x} \quad \forall\, i \ne u.$$

We can evaluate a star(x) by computing its weight as
$$weight(star(x)) = \sum_{i=1,\ i \ne u}^{t} w_{x_i,x} \qquad (4)$$
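A star can be computed by taking, for each part other than the root's, the vertices of minimum edge weight. The sketch below assumes the root's gene content is the first argument of Equation (3), the order used in the worked examples and in Section 4; with $w^- = w^+$ the order does not matter. Ties are all reported rather than broken, and the names `interval_contents` and `star` are illustrative.

```python
def interval_contents(genome):
    """Map each interval (j, k), 1-based inclusive, to its gene content."""
    return {(j, k): frozenset(genome[j - 1:k])
            for j in range(1, len(genome) + 1)
            for k in range(j, len(genome) + 1)}


def star(root, root_part, genomes, w_minus=1, w_plus=1):
    """Return ({part: (min weight, tied intervals)}, total weight) for a root content."""
    chosen = {}
    for i, genome in enumerate(genomes, start=1):
        if i == root_part:
            continue  # the root's own part contributes no vertex
        weights = {jk: w_minus * len(root - c) + w_plus * len(c - root)
                   for jk, c in interval_contents(genome).items()}
        best = min(weights.values())
        chosen[i] = (best, sorted(jk for jk, w in weights.items() if w == best))
    total = sum(best for best, _ in chosen.values())
    return chosen, total


genomes = [(1, 1, 3, 2, 4), (3, 2, 1, 4), (5, 6, 1, 4, 2), (1, 3, 7)]
root = frozenset(genomes[0][0:4])          # gene content of interval (1, 4) in g1
chosen, total = star(root, root_part=1, genomes=genomes)
```

For the root $\{1, 2, 3\}$ taken from the first genome, the third genome offers three tied intervals at weight 2, so the star is not unique even though its weight is.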
3.0.2 Minimization on Graph
Given the proposed transformation of the input parameters of AGCDP into its graph representation, we define AGCDP as a minimization problem on a graph as follows.
Definition 4. Given a graph $G_{AGCDP} = (V, E)$ and a size range $[D^-, D^+]$ or positive constant $D$, find a vertex $x \in V_u$ and star(x) such that $|v(x)| = D$ or $|v(x)|$ is in the range $[D^-, D^+]$ and
$$weight(star(x)) = \sum_{i=1,\ i \ne u}^{t} w_{x_i,x}$$
is minimum.
As discussed in Section 2, AGCDP is a double minimization problem. This is also true for the graph transformation of AGCDP: for a vertex $x$ mapped to a gene content (a possible solution gene set $X$), we search the graph for the set of vertices $x_i$, star(x), which minimizes the weight of each edge incident to a pair $x, x_i$. From among the stars found, the minimum-weight star is identified. From the resulting minimum-weight star(x) we identify the gene content $G^u_{j_u,k_u}$ to which the root vertex $x$ maps. $G^u_{j_u,k_u}$ is the solution gene cluster. Note that there may be more than one solution gene cluster, and so we may find more than one root $x$. Also, we identify the set $\{J^u_{j_u,k_u}\} \cup \{J^i_{j_i,k_i}\}$ as the set of linear intervals $J$, where $\{J^i_{j_i,k_i}\}$ is the set of linear intervals to which star(x) maps and $J^u_{j_u,k_u}$ is the linear interval associated with the gene content $G^u_{j_u,k_u}$.
To give a more intuitive notion of the transformation discussed, we present examples of the two aforementioned cases with input parameters as specified in Example 1.
3.0.3 AGCDP with Reference Genome
Without loss of generality we assume genome $g^1$ to be the reference genome. All gene contents of linear intervals in $g^1$ which satisfy the size constraint $|X| = D$ will be considered as possible solution gene sets $X$ to the problem. These gene contents are mapped to vertices in part $V_1$ of the graph. These vertices are the roots of the stars star(x) which will be evaluated for weight. Assume that the penalty weights are $w^- = w^+ = 1$ and the size constraint is $|X| = 3$, as specified in Example 1.
Example 2. Given the set of genomes $G$, we construct the graph $G_{AGCDP}$.

1. Identify the linear intervals $J^i_{j_i,k_i}$ for each genome $g^i$ in $G$.

2. Identify the gene content $G(J^i_{j_i,k_i})$ for each linear interval $J^i_{j_i,k_i}$.

3. Map each identified gene content $G(J^i_{j_i,k_i})$ to a vertex $v(G(J^i_{j_i,k_i}))$ in $G_{AGCDP}$.

4. For all pairs $x_1, x_i$ with $x_1 \in V_1$, $x_i \in V_i$ and $i \ne 1$, define an edge $e_{x_1,x_i} \in E$. Define the weight assigned to each edge $e_{x_1,x_i}$ as in (3), where $x = x_1$ and $y = x_i$.

5. For each $x \in V_1$ such that $|v(x)| = |G(J^1_{j_1,k_1})| = 3$, determine star(x).

$$star(v(G(J^1_{1,4}))) = \{\, v(G(J^2_{1,3})),\ v(G(J^3_{3,3})) \mid v(G(J^3_{5,5})) \mid v(G(J^3_{3,5})),\ v(G(J^4_{1,2})) \,\}$$
$$star(v(G(J^1_{2,4}))) = star(v(G(J^1_{1,4})))$$
$$star(v(G(J^1_{3,5}))) = \{\, v(G(J^2_{1,2})) \mid v(G(J^2_{1,4})),\ v(G(J^3_{4,5})),\ v(G(J^4_{2,2})) \,\}$$

where the "$\mid$" symbol means "or".
Table 1: Identification of edge weights with respect to the gene contents (candidate solution gene sets) {1, 2, 3} and {2, 3, 4}. These are the gene contents of linear intervals in genome $g^1$ which satisfy the size constraint $|X| = 3$. For each reference gene content, the minimum weight in relation to the gene contents of each other genome is marked with an asterisk.

genome  interval  genes in interval  gene content   weight to {1,2,3}  weight to {2,3,4}
g2      J_{1,1}   (3)                {3}            2                  2
g2      J_{2,2}   (2)                {2}            2                  2
g2      J_{3,3}   (1)                {1}            2                  4
g2      J_{4,4}   (4)                {4}            4                  2
g2      J_{1,2}   (3, 2)             {2, 3}         1                  1*
g2      J_{1,3}   (3, 2, 1)          {1, 2, 3}      0*                 2
g2      J_{1,4}   (3, 2, 1, 4)       {1, 2, 3, 4}   1                  1*
g2      J_{2,3}   (2, 1)             {1, 2}         1                  3
g2      J_{2,4}   (2, 1, 4)          {1, 2, 4}      2                  2
g2      J_{3,4}   (1, 4)             {1, 4}         3                  3
g3      J_{1,1}   (5)                {5}            4                  4
g3      J_{2,2}   (6)                {6}            4                  4
g3      J_{3,3}   (1)                {1}            2*                 4
g3      J_{4,4}   (4)                {4}            4                  2
g3      J_{5,5}   (2)                {2}            2*                 2
g3      J_{1,2}   (5, 6)             {5, 6}         5                  5
g3      J_{1,3}   (5, 6, 1)          {1, 5, 6}      4                  6
g3      J_{1,4}   (5, 6, 1, 4)       {1, 4, 5, 6}   5                  5
g3      J_{1,5}   (5, 6, 1, 4, 2)    {1, 2, 4, 5, 6}  4                4
g3      J_{2,3}   (6, 1)             {1, 6}         3                  5
g3      J_{2,4}   (6, 1, 4)          {1, 4, 6}      4                  4
g3      J_{2,5}   (6, 1, 4, 2)       {1, 2, 4, 6}   3                  3
g3      J_{3,4}   (1, 4)             {1, 4}         3                  3
g3      J_{3,5}   (1, 4, 2)          {1, 2, 4}      2*                 2
g3      J_{4,5}   (4, 2)             {2, 4}         3                  1*
g4      J_{1,1}   (1)                {1}            2                  4
g4      J_{2,2}   (3)                {3}            2                  2*
g4      J_{3,3}   (7)                {7}            4                  4
g4      J_{1,2}   (1, 3)             {1, 3}         1*                 3
g4      J_{1,3}   (1, 3, 7)          {1, 3, 7}      2                  4
g4      J_{2,3}   (3, 7)             {3, 7}         3                  3

Figure 2: The constructed graph $G_{AGCDP}$. The vertices on top are elements of $V_1$, on the left are elements of $V_2$, on the right are elements of $V_3$ and on the bottom are elements of $V_4$. The vertices which are colored in yellow are the only vertices which satisfy the constraint $|X| = 3$ and therefore are the only valid roots for star(x). The weight of each edge is indicated on the side of each vertex, with color corresponding to the color of the edge.

6. Determine the minimum-weight star(x) from among those identified in the previous step.
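$$\begin{aligned}
weight(star(v(G(J^1_{1,4})))) &= weight((v(G(J^1_{1,4})), v(G(J^2_{1,3})))) \\
&\quad + weight((v(G(J^1_{1,4})), v(G(J^3_{3,3})))) \\
&\quad + weight((v(G(J^1_{1,4})), v(G(J^4_{1,2})))) \\
&= 0 + 2 + 1 \\
&= 3
\end{aligned}$$

$$weight(star(v(G(J^1_{2,4})))) = weight(star(v(G(J^1_{1,4})))) = 3$$

$$\begin{aligned}
weight(star(v(G(J^1_{3,5})))) &= weight((v(G(J^1_{3,5})), v(G(J^2_{1,2})))) \\
&\quad + weight((v(G(J^1_{3,5})), v(G(J^3_{4,5})))) \\
&\quad + weight((v(G(J^1_{3,5})), v(G(J^4_{2,2})))) \\
&= 1 + 1 + 2 \\
&= 4
\end{aligned}$$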
In this example, we see that the minimum-weight stars, with weight 3, in graph $G_{AGCDP}$ are the following:

$$\begin{aligned}
star(v(G(J^1_{1,4}))) &= \{ v(G(J^2_{1,3})),\ v(G(J^3_{3,3})),\ v(G(J^4_{1,2})) \} \\
&= \{ v(G(J^2_{1,3})),\ v(G(J^3_{5,5})),\ v(G(J^4_{1,2})) \} \\
&= \{ v(G(J^2_{1,3})),\ v(G(J^3_{3,5})),\ v(G(J^4_{1,2})) \} \\
star(v(G(J^1_{2,4}))) &= \{ v(G(J^2_{1,3})),\ v(G(J^3_{3,3})),\ v(G(J^4_{1,2})) \} \\
&= \{ v(G(J^2_{1,3})),\ v(G(J^3_{5,5})),\ v(G(J^4_{1,2})) \} \\
&= \{ v(G(J^2_{1,3})),\ v(G(J^3_{3,5})),\ v(G(J^4_{1,2})) \}
\end{aligned}$$
The vertices $v(G(J^1_{1,4}))$ and $v(G(J^1_{2,4}))$ are then the roots of the stars in $G_{AGCDP}$ which have the minimum weight. The solution gene set $X$ we are searching for must then be either of the gene contents to which the vertices $v(G(J^1_{1,4}))$ and $v(G(J^1_{2,4}))$ are mapped. Note that the gene content to which the vertex $v(G(J^1_{1,4}))$ is mapped is {1, 2, 3}, and the gene content to which the vertex $v(G(J^1_{2,4}))$ is mapped is also {1, 2, 3}. Given the minimum-weight $star(v(G(J^1_{1,4})))$ and $star(v(G(J^1_{2,4})))$, we see that the reference gene set

$$X = \{1, 2, 3\}$$

together with any of the sets of linear intervals

$$J_1 = \{J^1_{1,4}, J^2_{1,3}, J^3_{3,3}, J^4_{1,2}\}$$
$$J_2 = \{J^1_{1,4}, J^2_{1,3}, J^3_{5,5}, J^4_{1,2}\}$$
$$J_3 = \{J^1_{1,4}, J^2_{1,3}, J^3_{3,5}, J^4_{1,2}\}$$
$$J_4 = \{J^1_{2,4}, J^2_{1,3}, J^3_{3,3}, J^4_{1,2}\}$$
$$J_5 = \{J^1_{2,4}, J^2_{1,3}, J^3_{5,5}, J^4_{1,2}\}$$
$$J_6 = \{J^1_{2,4}, J^2_{1,3}, J^3_{3,5}, J^4_{1,2}\}$$

forms a solution to our example problem, as concluded in Example 1.
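The steps of Example 2 can be replayed programmatically. The following is a minimal sketch under the same assumptions as the example ($w^- = w^+ = 1$, $|X| = 3$, first genome as reference); the function names and the representation of a solution as a tuple of intervals, root first and then the other genomes in input order, are illustrative choices rather than part of the paper.

```python
from itertools import product


def interval_contents(genome):
    """Map each interval (j, k), 1-based inclusive, to its gene content."""
    return {(j, k): frozenset(genome[j - 1:k])
            for j in range(1, len(genome) + 1)
            for k in range(j, len(genome) + 1)}


def best_matches(root, genome, w_minus=1, w_plus=1):
    """Minimum edge weight from a root content into one genome, with all tied intervals."""
    weights = {jk: w_minus * len(root - c) + w_plus * len(c - root)
               for jk, c in interval_contents(genome).items()}
    best = min(weights.values())
    return best, sorted(jk for jk, w in weights.items() if w == best)


def stars_for_reference(genomes, ref, D, w_minus=1, w_plus=1):
    """For each root of size D in the reference genome: (star weight, ties per other genome)."""
    others = [g for i, g in enumerate(genomes) if i != ref]
    results = {}
    for jk, content in interval_contents(genomes[ref]).items():
        if len(content) != D:
            continue  # only roots meeting the size constraint are evaluated
        matches = [best_matches(content, g, w_minus, w_plus) for g in others]
        results[jk] = (sum(w for w, _ in matches), [ties for _, ties in matches])
    return results


genomes = [(1, 1, 3, 2, 4), (3, 2, 1, 4), (5, 6, 1, 4, 2), (1, 3, 7)]
results = stars_for_reference(genomes, ref=0, D=3)
min_weight = min(weight for weight, _ in results.values())

# Expand ties into explicit interval sets: root interval first, then one per other genome.
solutions = [(root,) + combo
             for root, (weight, ties) in sorted(results.items())
             if weight == min_weight
             for combo in product(*ties)]
```

Run on the example input, the sketch finds three candidate roots in the reference genome, a minimum star weight of 3, and six interval sets that attain it, matching the sets listed above.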
3.0.4 AGCDP with no Reference Genome
In this case we do not choose a specific genome from the set of input genomes and instead consider all vertices mapped to gene contents which satisfy the input constraint $|X| = D$. As a result, all vertices from each part of $G_{AGCDP}$ are adjacent to all vertices from the other parts. This greatly increases the number of edges in graph $G_{AGCDP}$ which need to be evaluated for weight. All gene contents of linear intervals in all genomes $g^i$ which satisfy the size constraint $|X| = D$ or $|X|$ in the range $[D^-, D^+]$ will be considered as possible solution gene clusters $X$ to the problem. The corresponding vertices are the roots of the stars which will be evaluated to find the solution to the problem. We also assume that the penalty weights are $w^- = w^+ = 1$ and the size constraint is $|X| = 3$, as defined in Example 1.
Example 3. Given the set of genomes $G$, we construct the graph $G_{AGCDP}$.

1. Identify the linear intervals $J^i_{j_i,k_i}$ for each genome $g^i$ in $G$.

2. Identify the gene content $G(J^i_{j_i,k_i})$ for each linear interval $J^i_{j_i,k_i}$.

3. Map each identified gene content $G(J^i_{j_i,k_i})$ to a vertex $v(G(J^i_{j_i,k_i}))$ in $G_{AGCDP}$.

4. For all pairs $x, x_i$ with $x \in V_u$, $x_i \in V_i$ and $i \ne u$, define an edge $e_{x,x_i} \in E$. Define the weight assigned to each edge $e_{x,x_i}$ as in (3), where $y = x_i$.
Figure 3: The constructed graph $G_{AGCDP}$ for case 2. The set of vertices on the top are elements of $V_1$, on the left are elements of $V_2$, on the right are elements of $V_3$ and on the bottom are elements of $V_4$. All vertices are connected to all other vertices in the other parts of the graph. The weights of edges are omitted for clarity of the graph. The weights are identified in Table 2.

5. For each vertex $x \in V_u$, $1 \le u \le t$, such that $|v(x)| = 3$, determine star(x). As was done for Example 2, we determine the star(x) for the identified gene contents $v(x)$ of linear intervals from all genomes satisfying the size constraint $|X| = 3$.
$$star(v(G(J^1_{1,4}))) = \{\, v(G(J^2_{1,3})),\ v(G(J^3_{3,3})) \mid v(G(J^3_{5,5})) \mid v(G(J^3_{3,5})),\ v(G(J^4_{1,2})) \,\}$$
$$star(v(G(J^1_{2,4}))) = star(v(G(J^1_{1,4})))$$
$$star(v(G(J^1_{3,5}))) = \{\, v(G(J^2_{1,2})) \mid v(G(J^2_{1,4})),\ v(G(J^3_{4,5})),\ v(G(J^4_{2,2})) \,\}$$
$$star(v(G(J^2_{1,3}))) = \{\, v(G(J^1_{1,4})) \mid v(G(J^1_{2,4})),\ v(G(J^3_{3,3})) \mid v(G(J^3_{5,5})) \mid v(G(J^3_{3,5})),\ v(G(J^4_{1,2})) \,\}$$
$$star(v(G(J^2_{2,4}))) = \{\, v(G(J^1_{1,5})) \mid v(G(J^1_{2,5})) \mid v(G(J^1_{4,5})),\ v(G(J^3_{3,5})),\ v(G(J^4_{1,1})) \,\}$$
$$star(v(G(J^3_{1,3}))) = \{\, v(G(J^1_{1,1})) \mid v(G(J^1_{2,2})) \mid v(G(J^1_{1,2})),\ v(G(J^2_{3,3})),\ v(G(J^4_{1,1})) \,\}$$
$$star(v(G(J^3_{2,4}))) = \{\, v(G(J^1_{1,1})) \mid v(G(J^1_{2,2})) \mid v(G(J^1_{5,5})) \mid v(G(J^1_{1,2})),\ v(G(J^2_{3,4})),\ v(G(J^4_{1,1})) \,\}$$
$$star(v(G(J^3_{3,5}))) = \{\, v(G(J^1_{1,5})) \mid v(G(J^1_{2,5})) \mid v(G(J^1_{4,5})),\ v(G(J^2_{2,4})),\ v(G(J^4_{1,1})) \,\}$$
$$star(v(G(J^4_{1,3}))) = \{\, v(G(J^1_{1,3})) \mid v(G(J^1_{2,3})),\ v(G(J^2_{1,1})) \mid v(G(J^2_{3,3})) \mid v(G(J^2_{1,3})),\ v(G(J^3_{3,3})) \,\}$$

where the "$\mid$" symbol means "or".
6. Determine the minimum-weight star(x) from among those identified in the previous step.

$$weight(star(v(G(J^1_{1,4})))) = 0 + 2 + 1 = 3$$
$$weight(star(v(G(J^1_{2,4})))) = 0 + 2 + 1 = 3$$
$$weight(star(v(G(J^1_{3,5})))) = 1 + 1 + 2 = 4$$
$$weight(star(v(G(J^2_{1,3})))) = 0 + 2 + 1 = 3$$
$$weight(star(v(G(J^2_{2,4})))) = 1 + 0 + 2 = 3$$
$$weight(star(v(G(J^3_{1,3})))) = 2 + 2 + 2 = 6$$
$$weight(star(v(G(J^3_{2,4})))) = 2 + 1 + 2 = 5$$
$$weight(star(v(G(J^3_{3,5})))) = 1 + 0 + 2 = 3$$
$$weight(star(v(G(J^4_{1,3})))) = 1 + 2 + 2 = 5$$
From the evaluation of the weights of the identified stars, the ones with minimum weight are $star(v(G(J^1_{1,4})))$, $star(v(G(J^1_{2,4})))$, $star(v(G(J^2_{1,3})))$, $star(v(G(J^2_{2,4})))$ and $star(v(G(J^3_{3,5})))$. From these stars we identify the solution gene cluster {1, 2, 3} from $v(G(J^1_{1,4}))$, $v(G(J^1_{2,4}))$ and $v(G(J^2_{1,3}))$, and the gene cluster {1, 2, 4} from $v(G(J^2_{2,4}))$ and $v(G(J^3_{3,5}))$. Notice that the solution gene cluster {1, 2, 4} was not identified in the case wherein a reference genome was assumed. This is a consequence of assuming a reference genome, since the possible solution gene clusters which can be found are limited to the available gene contents in the chosen reference genome.
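The star weights of Example 3 can be recomputed with a short script. This is a sketch under the example's assumptions ($w^- = w^+ = 1$, $|X| = 3$); roots are keyed as `(u, j, k)` for interval $J^u_{j,k}$, and the function names are hypothetical.

```python
def interval_contents(genome):
    """Map each interval (j, k), 1-based inclusive, to its gene content."""
    return {(j, k): frozenset(genome[j - 1:k])
            for j in range(1, len(genome) + 1)
            for k in range(j, len(genome) + 1)}


def star_weight(root, root_part, genomes, w_minus=1, w_plus=1):
    """Sum over the other genomes of the minimum edge weight from the root content."""
    total = 0
    for i, genome in enumerate(genomes, start=1):
        if i == root_part:
            continue
        total += min(w_minus * len(root - c) + w_plus * len(c - root)
                     for c in interval_contents(genome).values())
    return total


def all_star_weights(genomes, D):
    """Star weight for every root of size D, taken from every genome."""
    weights = {}
    for u, genome in enumerate(genomes, start=1):
        for (j, k), content in interval_contents(genome).items():
            if len(content) == D:
                weights[(u, j, k)] = star_weight(content, u, genomes)
    return weights


genomes = [(1, 1, 3, 2, 4), (3, 2, 1, 4), (5, 6, 1, 4, 2), (1, 3, 7)]
weights = all_star_weights(genomes, D=3)
best = min(weights.values())
best_roots = sorted(root for root, w in weights.items() if w == best)
clusters = {frozenset(genomes[u - 1][j - 1:k]) for u, j, k in best_roots}
```

The script evaluates nine roots, finds five of them at the minimum weight 3, and recovers the two solution gene clusters {1, 2, 3} and {1, 2, 4}.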
4. PROOF OF EQUIVALENCE
We show the equivalence of AGCDP with the proposed transformation into the graph problem of finding a minimum-weight star(x) in an edge-weighted undirected t-partite graph. We show the equivalence for the two presented cases of AGCDP: the first case, in which we identify a reference genome from among the input genomes, and the second case, in which no reference genome is identified. We refer to Definitions 1, 2, 3, and 4 as the basis for the following proofs.
With Reference Genome
Without loss of generality we identify genome $g^1$ as the reference genome. Recall that for any vertex $x \in V_1$, $x = v(G(J^1_{j_1,k_1}))$ such that $1 \le j_1 \le k_1 \le n_1$. Equation 3 can then be written as

$$w_{x,y} = w^- \cdot |G(J^1_{j_1,k_1}) \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus G(J^1_{j_1,k_1})|$$

and the total weight of a star(x) defined in (4) can then be written as

$$\sum_{i \ne 1} w^- \cdot |G(J^1_{j_1,k_1}) \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus G(J^1_{j_1,k_1})|$$

If we let $X$ represent any possible solution gene cluster $G(J^1_{j_1,k_1})$, the previous expression is equal to

$$\sum_{i \ne 1} w^- \cdot |X \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus X|$$
Note that this is just the objective function $cost(X, J)$ to be minimized for AGCDP with the first genome, $g^1$, defined as reference genome: the term for $i = 1$ in $cost(X, J)$ vanishes because $X = G(J^1_{j_1,k_1})$. It is easy to see that a unique pair of gene cluster $X$ and set of linear intervals $J$ which minimizes the objective function $cost(X, J)$ in AGCDP with reference genome, $(X^*, J^*)$, is mapped to a unique pair of vertex $x$ and set of vertices star(x), $(x, star(x))$, which also minimizes the objective function $weight(star(x))$ for the minimum-weight star(x) finding problem in an edge-weighted undirected t-partite graph $G_{AGCDP}$.
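Table 2: Evaluation of edge weights with respect to the possible solution gene contents identified from all the input genomes. These are the gene contents of size 3 in the gene content column, and each heads one weight column, labelled with its genome. A dash marks a pair of gene contents from the same genome, for which no edge exists. For each candidate, the minimum weight within each other genome is marked with an asterisk.

genome  interval  genes in interval  gene content     g1:{1,2,3}  g1:{2,3,4}  g2:{1,2,3}  g2:{1,2,4}  g3:{1,5,6}  g3:{1,4,6}  g3:{1,2,4}  g4:{1,3,7}
g1      J_{1,1}   (1)                {1}              -           -           2           2           2*          2*          2           2
g1      J_{2,2}   (1)                {1}              -           -           2           2           2*          2*          2           2
g1      J_{3,3}   (3)                {3}              -           -           2           4           4           4           4           2
g1      J_{4,4}   (2)                {2}              -           -           2           2           4           4           2           4
g1      J_{5,5}   (4)                {4}              -           -           4           2           4           2*          2           4
g1      J_{1,2}   (1, 1)             {1}              -           -           2           2           2*          2*          2           2
g1      J_{1,3}   (1, 1, 3)          {1, 3}           -           -           1           3           3           3           3           1*
g1      J_{1,4}   (1, 1, 3, 2)       {1, 2, 3}        -           -           0*          2           4           4           2           2
g1      J_{1,5}   (1, 1, 3, 2, 4)    {1, 2, 3, 4}     -           -           1           1*          5           3           1*          3
g1      J_{2,3}   (1, 3)             {1, 3}           -           -           1           3           3           3           3           1*
g1      J_{2,4}   (1, 3, 2)          {1, 2, 3}        -           -           0*          2           4           4           2           2
g1      J_{2,5}   (1, 3, 2, 4)       {1, 2, 3, 4}     -           -           1           1*          5           3           1*          3
g1      J_{3,4}   (3, 2)             {2, 3}           -           -           1           3           5           5           3           3
g1      J_{3,5}   (3, 2, 4)          {2, 3, 4}        -           -           2           2           6           4           2           4
g1      J_{4,5}   (2, 4)             {2, 4}           -           -           3           1*          5           3           1*          5
g2      J_{1,1}   (3)                {3}              2           2           -           -           4           4           4           2*
g2      J_{2,2}   (2)                {2}              2           2           -           -           4           4           2           4
g2      J_{3,3}   (1)                {1}              2           4           -           -           2*          2           2           2*
g2      J_{4,4}   (4)                {4}              4           2           -           -           4           2           2           4
g2      J_{1,2}   (3, 2)             {2, 3}           1           1*          -           -           5           5           3           3
g2      J_{1,3}   (3, 2, 1)          {1, 2, 3}        0*          2           -           -           4           4           2           2*
g2      J_{1,4}   (3, 2, 1, 4)       {1, 2, 3, 4}     1           1*          -           -           5           3           1           3
g2      J_{2,3}   (2, 1)             {1, 2}           1           3           -           -           3           3           1           3
g2      J_{2,4}   (2, 1, 4)          {1, 2, 4}        2           2           -           -           4           2           0*          4
g2      J_{3,4}   (1, 4)             {1, 4}           3           3           -           -           3           1*          1           3
g3      J_{1,1}   (5)                {5}              4           4           4           4           -           -           -           4
g3      J_{2,2}   (6)                {6}              4           4           4           4           -           -           -           4
g3      J_{3,3}   (1)                {1}              2*          4           2*          2           -           -           -           2*
g3      J_{4,4}   (4)                {4}              4           2           4           2           -           -           -           4
g3      J_{5,5}   (2)                {2}              2*          2           2*          2           -           -           -           4
g3      J_{1,2}   (5, 6)             {5, 6}           5           5           5           5           -           -           -           5
g3      J_{1,3}   (5, 6, 1)          {1, 5, 6}        4           6           4           4           -           -           -           4
g3      J_{1,4}   (5, 6, 1, 4)       {1, 4, 5, 6}     5           5           5           3           -           -           -           5
g3      J_{1,5}   (5, 6, 1, 4, 2)    {1, 2, 4, 5, 6}  4           4           4           2           -           -           -           6
g3      J_{2,3}   (6, 1)             {1, 6}           3           5           3           3           -           -           -           3
g3      J_{2,4}   (6, 1, 4)          {1, 4, 6}        4           4           4           2           -           -           -           4
g3      J_{2,5}   (6, 1, 4, 2)       {1, 2, 4, 6}     3           3           3           1           -           -           -           5
g3      J_{3,4}   (1, 4)             {1, 4}           3           3           3           1           -           -           -           3
g3      J_{3,5}   (1, 4, 2)          {1, 2, 4}        2*          2           2*          0*          -           -           -           4
g3      J_{4,5}   (4, 2)             {2, 4}           3           1*          3           1           -           -           -           5
g4      J_{1,1}   (1)                {1}              2           4           2           2*          2*          2*          2*          -
g4      J_{2,2}   (3)                {3}              2           2*          2           4           4           4           4           -
g4      J_{3,3}   (7)                {7}              4           4           4           4           4           4           4           -
g4      J_{1,2}   (1, 3)             {1, 3}           1*          3           1*          3           3           3           3           -
g4      J_{1,3}   (1, 3, 7)          {1, 3, 7}        2           4           2           4           4           4           4           -
g4      J_{2,3}   (3, 7)             {3, 7}           3           3           3           5           5           5           5           -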
We have shown that the objective function $weight(star(x))$ to be minimized for the minimum-weight star(x) finding problem in graph $G_{AGCDP}$ is equal to the objective function $cost(X, J)$ to be minimized for AGCDP with genome $g^1$ set as the reference genome. The equivalence between the objective function of AGCDP and the objective function of the minimum-weight star(x) finding problem can be shown by constructing graph $G_{AGCDP}$ from the input parameters of AGCDP. □
With No Reference Genome
For the case wherein a reference genome is not chosen in AGCDP, we can extend the proof for AGCDP with a reference genome to look also into the inter-genome similarity of gene contents of linear intervals. We consider each gene content in each genome satisfying the size constraint as a possible solution gene cluster for AGCDP, instead of enumerating all the gene sets $X \subset U$ such that $|X| = D$ or $|X|$ is in $[D^-, D^+]$.

Since we are looking into the inter-genome relationship of gene contents, an edge in $G_{AGCDP} = (V, E)$ is defined such that for all pairs $x, y$ with $x \in V_u$, $y \in V_i$, and $i \ne u$, $(x, y) \in E$.
Recall that for any vertex $x \in V$, $x = v(G(J^i_{j_i,k_i}))$ for some $1 \le i \le t$. For a root $x \in V_u$ and $y \in V_i$, Equation 3 can then be written as

$$w_{x,y} = w^- \cdot |G(J^u_{j_u,k_u}) \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus G(J^u_{j_u,k_u})|$$

and the total weight of a star(x) defined in (4) can then be written as

$$\sum_{i \ne u} w^- \cdot |G(J^u_{j_u,k_u}) \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus G(J^u_{j_u,k_u})|$$

Suppose that vertex $x = v(G(J^u_{j_u,k_u}))$ in the definition of star(x) in Definition 3. $G(J^u_{j_u,k_u})$ then is a possible solution gene cluster. If we denote $G(J^u_{j_u,k_u})$ as $X$ in the above expression, the total weight of a star(x) can then be written as

$$\sum_{i \ne u} w^- \cdot |X \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus X|$$
Notice that this is also just the objective function $cost(X, J)$ to be minimized for AGCDP. A unique pair of gene cluster $X$ and set of linear intervals $J$ which minimizes the objective function $cost(X, J)$ in AGCDP, $(X^*, J^*)$, is mapped to a unique pair of vertex $x$ and set of vertices star(x), $(x, star(x))$, which also minimizes the objective function $weight(star(x))$ for the minimum-weight star(x) finding problem in an edge-weighted undirected t-partite graph $G_{AGCDP}$. The equivalence of the transformation from AGCDP to the minimum-weight star finding problem in graph $G_{AGCDP}$ can be shown by constructing $G_{AGCDP}$ from the input parameters of AGCDP or by reversing the presented proof. □
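The step from $cost(X, J)$ to $weight(star(x))$ can be written out as a single chain of equalities. The derivation below is a sketch under two assumptions: $X$ is the gene content of the root's interval $J^u_{j_u,k_u}$, and each remaining interval $J^i_{j_i,k_i}$ is the one whose vertex $x_i$ belongs to star(x). The edge weight is expanded with the root's gene content first, as in the proof.

```latex
\begin{align*}
cost(X, J)
  &= \sum_{i=1}^{t} \Big[ w^- \,|X \setminus G(J^i_{j_i,k_i})|
                        + w^+ \,|G(J^i_{j_i,k_i}) \setminus X| \Big] \\
  &= \underbrace{w^- \,|X \setminus G(J^u_{j_u,k_u})|
                 + w^+ \,|G(J^u_{j_u,k_u}) \setminus X|}_{=\,0
                 \text{ because } X = G(J^u_{j_u,k_u})}
     + \sum_{i \neq u} \Big[ w^- \,|X \setminus G(J^i_{j_i,k_i})|
                           + w^+ \,|G(J^i_{j_i,k_i}) \setminus X| \Big] \\
  &= \sum_{i \neq u} \Big[ w^- \,|v(x) \setminus v(x_i)|
                         + w^+ \,|v(x_i) \setminus v(x)| \Big] \\
  &= weight(star(x)).
\end{align*}
```

The reference-genome case is the special case $u = 1$.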
5. CONCLUSION
In this paper the authors presented the Approximate Gene Cluster Discovery Problem (AGCDP) as a combinatorial optimization problem. A proposed transformation of AGCDP into a minimum-weight star(x) finding problem in an edge-weighted undirected t-partite graph was presented, both for the case wherein a reference genome is assumed from the set of input genomes and for the case wherein no reference genome is identified. Also, proofs of equivalence of the transformation from AGCDP to the graph problem were presented for both cases, and it was shown that the objective function in AGCDP subject to minimization is equivalent to the objective function in the graph problem. It was observed that some solution gene clusters within the input genomes may not be found when a reference genome from the input genomes is assumed. That is the case for the solution gene cluster {1, 2, 4} in Example 3. It is worthwhile to further investigate other approaches to identifying solution gene clusters given gene contents from all genomes, i.e. a consensus gene cluster. The authors hope that the proposed transformation of AGCDP into a graph problem will lead to further results on AGCDP.
6. ADDITIONAL NOTES
Finding Nearby Stars
We could widen the solution space of finding a minimum-weight star(x) in a graph to include stars which are within the neighborhood of the best solution. As with the idea of neighborhood in AGCDP discussed in [5], a threshold $\gamma$ can be identified such that, given the weight $weight(star(x))$ of the star(x) with the minimum weight,

$$weight(star(x_i)) \le (1 + \gamma) \cdot weight(star(x))$$

such that $\gamma > 0$.
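The threshold rule above amounts to a simple filter over star weights. The sketch below applies it to the nine star weights computed in Example 3, keyed as `(u, j, k)` for the root interval $J^u_{j,k}$; the function name `nearby_stars` and the choice $\gamma = 0.5$ are assumptions made purely for illustration.

```python
def nearby_stars(star_weights, gamma):
    """Keep every star whose weight is within a factor (1 + gamma) of the minimum."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    best = min(star_weights.values())
    threshold = (1 + gamma) * best
    return {root: w for root, w in star_weights.items() if w <= threshold}


# Star weights from Example 3, keyed by (genome, j, k) of the root interval.
star_weights = {
    (1, 1, 4): 3, (1, 2, 4): 3, (1, 3, 5): 4,
    (2, 1, 3): 3, (2, 2, 4): 3,
    (3, 1, 3): 6, (3, 2, 4): 5, (3, 3, 5): 3,
    (4, 1, 3): 5,
}

near = nearby_stars(star_weights, gamma=0.5)   # threshold is 1.5 * 3 = 4.5
```

With this threshold the star rooted at $J^1_{3,5}$ (weight 4) joins the five minimum-weight stars, while the stars of weight 5 and 6 remain excluded.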
7. FUTURE WORKS
For future works on this topic, the authors aim to achieve the following:

• Classification of the complexity of the graph problem

• Enhancements on the proposed transformation from AGCDP to the minimum-weight star(x) finding problem

• Proposal of an algorithm and optimization for finding a minimum-weight star in a graph, both for cases with a reference genome and without a reference genome

• Evaluation of the proposed transformation based on variations of AGCDP

• Investigation of the idea of a Consensus Gene Cluster in identifying solution gene clusters given gene contents from all input genomes
8. REFERENCES
[1] R. Karp, Reducibility Among Combinatorial Problems, Complexity of Computer Computations, 1972
[2] P. Pevzner, S. Sze, Combinatorial Approaches to Finding Subtle Signals in DNA Sequences, American Association for Artificial Intelligence, 2000
[3] A. Bergeron, S. Corteel, M. Raffinot, The Algorithmic of Gene Teams, Workshop on Algorithms in Bioinformatics (WABI), Vol. 2452 of LNCS, pp. 464-476, 2002
[4] R. Hoberman, D. Durand, The Incompatible Desiderata of Gene Cluster Properties, In: A. McLysaght, D. H. Huson, editors, Comparative Genomics: RECOMB 2005 International Workshop, Vol. 3678 of LNCS, pp. 73-87, 2005
[5] S. Rahmann, G. Klau, Integer Linear Programming Techniques for Discovering Approximate Gene Clusters, Bioinformatics Algorithms, Techniques and Applications, Wiley-Interscience, 2008
[6] N. Deo, Graph Theory with Applications to Engineering and Computer Science, Prentice Hall, 1974
[7] T. Cormen, C. Leiserson, R. Rivest, C. Stein, Introduction to Algorithms, 3rd Edition, The MIT Press, 2009
[8] M. Garey, D. Johnson, Computers and Intractability, Bell Telephone Laboratories, Inc., 1979
[9] A. Gibbons, Algorithmic Graph Theory, Cambridge University Press, 1985
[10] E. Zaslavsky, M. Singh, A Combinatorial Optimization Approach for Diverse Motif Finding Applications, Algorithms for Molecular Biology, 2006