
Search for a Star: Approximate Gene Cluster Discovery Problem (AGCDP) as Minimization Problem on Graph

Jeffrey A. Aborot, Henry Adorna, Jhoirene B. Clemente, Brian Kenneth de Jesus and Geoffrey Solano

Algorithms and Complexity Laboratory
Department of Computer Science
College of Engineering
University of the Philippines Diliman

[email protected], [email protected], [email protected],[email protected], [email protected]

ABSTRACT
Finding gene clusters in genomes is an essential process in establishing relationships among organisms. Gene clusters may express functional dependencies among genes and may give insight into the expression of specific traits. The problem of finding gene clusters among several genomes is referred to as Gene Cluster Discovery, and several models have already been formulated for its definition. One formulation of this problem is the Approximate Gene Cluster Discovery Problem (AGCDP), which is modelled as a combinatorial optimization problem in some works. In this paper we propose an approach which transforms AGCDP into the problem of finding a minimum-weight star in a graph. Detailed examples are presented to clarify the notion of the transformation. A proof of equivalence is also presented to show the equivalence of the input parameters of AGCDP and the construction of the graph representing those input parameters.

Keywords
Gene, genome, gene cluster, gene content, linear interval, minimization, minimum-weight star, combinatorial optimization

1. INTRODUCTION
Gene clusters are sets of genes that are closely related to each other. Genes belonging to a cluster may share functional dependencies and may be involved in the expression of a specific trait. Identifying gene clusters is also an essential step in establishing relationships between organisms, as well as in the discovery of drugs and treatments for diseases. The problem of identifying such sets of genes is called Gene Cluster Discovery. This problem has been modelled several times, examples of which are presented in [3], [4], [5], where genes are modelled as integers and genomes are either permutations or sequences defined over the set of all genes. The models in [3] also take into account gene clusters with gaps (max-gap clusters) and without gaps (exact clusters).

The focus of this work is on the model presented in [5], which defines the Approximate Gene Cluster Discovery Problem (AGCDP) as a combinatorial problem that identifies the set of genes that are kept "more or less" together across genome sequences. An Integer Linear Programming (ILP) formulation is also presented in [5]. Several modifications of the model, specifically of the objective function, are also presented there to take into account characteristics of real biological data. These include the absence of gene cluster occurrences in some of the input genomes, the identification of valid gene clusters, and the use of a certain reference genome.

In this paper we represent AGCDP as a graph problem. We define how the set of inputs is transformed into a specific graph called $G_{AGCDP}$. We then discuss how the problem is reduced to finding a minimum-weight star(u) in a graph. Two cases are modelled in this paper: scenarios with and without a given reference genome.

This paper is organized as follows. Section 2 presents a brief discussion of AGCDP as well as the naive and ILP formulations of the problem. Section 3 contains the detailed discussion of how AGCDP is represented as a graph problem. The proof of equivalence of the two representations is discussed in Section 4. Finally, Section 5 concludes the paper.

2. APPROXIMATE GENE CLUSTER DISCOVERY PROBLEM

The following definitions are necessary for our understanding of the problem.

1. Gene. A gene is represented by a non-negative integer $g \in \mathbb{Z}_{\ge 0}$. Special genes represented by the integer 0 are genes with no existing homologs, which are not of interest in this problem.

2. Gene Universe. The set of all unique genes is called the gene universe and is denoted by $U = \{0, 1, 2, \ldots, N\}$.

3. Genome. For simplicity, the genome of a certain individual can be represented as a single sequence of genes from a chromosome. For instance, $g^i = (g^i_1, g^i_2, \ldots, g^i_{n_i})$, where each $g^i_j$ is the $j$th gene in the $i$th genome. In general, a set of genomes is represented by $G = \{g^1, g^2, \ldots, g^t\}$ for some $t \in \mathbb{Z}^+$, where each $g^i$ has length $n_i \in \mathbb{Z}^+$.

4. Linear Interval. A linear interval $J^i$ in a genome $g^i = (g^i_1, g^i_2, \ldots, g^i_{n_i})$ is an index set which is either empty, $J^i = \emptyset$, or of the form $J^i = \{j, j+1, \ldots, k\}$, which can also be denoted as $J^i_{j_i,k_i}$ where $1 \le j_i \le k_i \le n_i$. A set of linear intervals from all genomes is denoted by $J$.

5. Gene Content. The gene content $G(J^i_{j,k})$ of a linear interval $J^i_{j,k}$ in genome $g^i$ is the set of unique genes contained in that interval.

6. Set Difference. The set difference between two distinct gene contents $G(J^i_{j_i,k_i})$ and $G(J^p_{j_p,k_p})$, $1 \le i \le t$, $1 \le p \le t$ and $i \ne p$, is defined as

$$G(J^i_{j_i,k_i}) \setminus G(J^p_{j_p,k_p}) = \{\, g \mid g \in G(J^i_{j_i,k_i}) \text{ and } g \notin G(J^p_{j_p,k_p}) \,\}.$$
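The definitions of gene content and set difference can be illustrated with a minimal Python sketch. The function name `gene_content` and the use of 1-based inclusive indices are assumptions made for illustration; the two genomes are borrowed from Example 1 of this paper.

```python
from typing import List, Set


def gene_content(genome: List[int], j: int, k: int) -> Set[int]:
    """Gene content G(J_{j,k}): the unique genes at 1-based positions j..k."""
    if not (1 <= j <= k <= len(genome)):
        raise ValueError("a non-empty interval needs 1 <= j <= k <= n")
    return set(genome[j - 1:k])


# Two genomes taken from Example 1 (illustrative input only).
g1 = [1, 1, 3, 2, 4]
g3 = [5, 6, 1, 4, 2]

a = gene_content(g1, 1, 4)  # repeated gene 1 is counted once
b = gene_content(g3, 3, 5)

missing = a - b      # genes of a that do not occur in b
additional = b - a   # genes of b that do not occur in a
print(a, b, missing, additional)
```

Python's built-in set difference operator `-` corresponds directly to the "\" operator of Definition 6.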

Given the definitions listed above, AGCDP aims to find a gene set $X \subset U$, where $0 \notin X$, and a set of linear intervals $J = \{J^i_{j_i,k_i}\}$ where the gene content $G(J^i_{j_i,k_i})$ for each genome $g^i$ is roughly equal to $X$. To define formally how close $X$ is to the gene contents $G(J^i_{j_i,k_i})$, the number of missing genes $|X \setminus G(J^i_{j_i,k_i})|$ and the number of additional genes $|G(J^i_{j_i,k_i}) \setminus X|$ are computed.

Let us formally define the Basic Approximate Gene Cluster Discovery Problem (AGCDP).

Definition 1 (AGCDP [5]). Given the gene universe $U = \{0, 1, \ldots, N\}$, a set of genomes $G = \{g^1, g^2, \ldots, g^t\}$, a size range $[D^-, D^+]$ or positive constant $D$, and integer weights $w^-$ and $w^+$ corresponding to the cost of missing and additional genes in an interval, identify $X \subset U$ with $0 \notin X$ and $D^- \le |X| \le D^+$ or $|X| = D$, and a set of linear intervals $J = \{J^i_{j_i,k_i}\}$, $\forall\, i$, such that the cost function

$$cost(X, J) = \sum_{i=1}^{t} \left[ \left( w^- \cdot |X \setminus G(J^i_{j_i,k_i})| \right) + \left( w^+ \cdot |G(J^i_{j_i,k_i}) \setminus X| \right) \right]$$

is minimum.

AGCDP is a double minimization problem. In order to identify the cost of $X$, we identify the set of linear intervals $J$ which minimizes the value of the objective function $cost(X, J)$. The naive way of identifying the set $X$ is to check the cost of all possible $X \subset U$, of which there are

$$\binom{N}{D} \text{ if } |X| = D, \quad \text{otherwise} \quad \sum_{d=D^-}^{D^+} \binom{N}{d}, \text{ where } D^- \le d \le D^+,$$

if we have $D^- \le |X| \le D^+$. Also note that we have to identify the set of linear intervals for each genome which minimizes the cost given $X$. Naively this can be done by checking all possible linear intervals in each genome. The total running time of the naive AGCDP solver is the number of $X \subset U$ satisfying the constraint $|X| = D$, times the running time of identifying the best linear intervals given $X$. Thus, the naive AGCDP solver takes $O(N!\,(n^2 t))$ time.
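The naive procedure described above can be sketched as follows. This is a minimal illustration rather than the solver of [5]: the function names are hypothetical, only the fixed-size case $|X| = D$ is handled, and only non-empty intervals are enumerated (an assumption made to match the tables later in the paper). The input is the data of Example 1.

```python
from itertools import combinations
from typing import List, Set, Tuple


def best_interval_cost(x: Set[int], genome: List[int],
                       w_minus: int = 1, w_plus: int = 1) -> int:
    """Inner minimization: cheapest non-empty linear interval of one genome."""
    best = None
    n = len(genome)
    for j in range(n):
        content: Set[int] = set()
        for k in range(j, n):
            content.add(genome[k])  # gene content of positions j..k
            c = w_minus * len(x - content) + w_plus * len(content - x)
            if best is None or c < best:
                best = c
    return best


def naive_agcdp(genomes: List[List[int]], n_genes: int, size: int,
                w_minus: int = 1, w_plus: int = 1) -> Tuple[int, List[Set[int]]]:
    """Outer minimization: try every X of the given size drawn from {1..N}."""
    best_cost, best_sets = None, []
    for combo in combinations(range(1, n_genes + 1), size):  # gene 0 excluded
        x = set(combo)
        c = sum(best_interval_cost(x, g, w_minus, w_plus) for g in genomes)
        if best_cost is None or c < best_cost:
            best_cost, best_sets = c, [x]
        elif c == best_cost:
            best_sets.append(x)
    return best_cost, best_sets


genomes = [[1, 1, 3, 2, 4], [3, 2, 1, 4], [5, 6, 1, 4, 2], [1, 3, 7]]
print(naive_agcdp(genomes, n_genes=7, size=3))
```

On this input the sketch reports every gene set that attains the minimum cost, so ties between optimal gene sets are visible rather than silently dropped.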

Example 1. To illustrate Definition 1, suppose we have the set of genomes $G = \{g^1, g^2, g^3, g^4\}$ defined over the gene universe $U = \{0, 1, 2, 3, 4, 5, 6, 7\}$. The aim of AGCDP is to discover a set of genes $X \subset U$ with cardinality $|X| = D = 3$ such that $cost(X, J)$ is minimum. Let $G$ be equal to the following set. In this example, gene 0 does not occur in any of the genomes.

$g^1$: 1 1 3 2 4
$g^2$: 3 2 1 4
$g^3$: 5 6 1 4 2
$g^4$: 1 3 7

The naive approach to finding a minimum $X$ is to evaluate each $X$ that satisfies the constraints $X \subset U$ and $|X| = D$. In this case we have to evaluate

$$X \in \{\{1, 3, 2\}, \{2, 3, 4\}, \{3, 4, 5\}, \ldots\},$$

where the total number of gene sets to be evaluated is $\binom{7}{3}$.

Note that to evaluate the score of a certain gene set, another minimization procedure is needed, i.e. identification of the best linear interval for each genome.

After checking all possible linear intervals and all possible gene clusters $X$, we see that $cost(X^*, J^*)$ is minimum where $X^* = \{1, 2, 3\}$ and $J^* = \{J^1_{1,4}, J^2_{1,3}, J^3_{3,5}, J^4_{1,2}\}$, as shown in the figure below.

$g^1$: [1 1 3 2] 4
$g^2$: [3 2 1] 4
$g^3$: 5 6 [1 4 2]
$g^4$: [1 3] 7

Figure 1: The set of gene contents $\{\{1, 3, 2\}, \{3, 2, 1\}, \{1, 4, 2\}, \{1, 3\}\}$ corresponding to $J^*$. The intervals of $J^*$ are shown in brackets.

Given that the penalties for missing and additional genes are the same, i.e. $w^+ = w^- = 1$, the computation of $cost(X^*, J^*)$ is as follows.

$$cost(X^*, J^*) = \sum_{i=1}^{4} |X^* \setminus G^i_{J^*_i}| + |G^i_{J^*_i} \setminus X^*|$$
$$cost(X^*, J^*) = (0 + 0) + (0 + 0) + (1 + 1) + (1 + 0)$$
$$cost(X^*, J^*) = 3$$
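The arithmetic above can be reproduced with a short Python sketch of the cost function of Definition 1. The function name `cost` and the representation of each interval as a 1-based inclusive pair `(j, k)` are assumptions for illustration.

```python
from typing import List, Set, Tuple


def cost(x: Set[int], genomes: List[List[int]],
         intervals: List[Tuple[int, int]],
         w_minus: int = 1, w_plus: int = 1) -> int:
    """cost(X, J) of Definition 1 for one chosen interval (j, k) per genome."""
    total = 0
    for genome, (j, k) in zip(genomes, intervals):
        content = set(genome[j - 1:k])            # gene content of J_{j,k}
        total += w_minus * len(x - content)       # missing genes
        total += w_plus * len(content - x)        # additional genes
    return total


genomes = [[1, 1, 3, 2, 4], [3, 2, 1, 4], [5, 6, 1, 4, 2], [1, 3, 7]]
j_star = [(1, 4), (1, 3), (3, 5), (1, 2)]         # J* of Example 1
print(cost({1, 2, 3}, genomes, j_star))
```

With unequal weights, for example `w_minus=2`, the same call shows how the balance between missing and additional genes shifts the cost.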

The integer linear programming (ILP) formulation of AGCDP is presented in [5]. The numbers of missing and additional genes are defined there as in Equations 1 and 2 respectively.

$$|X \setminus G^i_{J_i}| = \sum_{q=0}^{N} (x_q - l^i_q) \qquad (1)$$

$$|G^i_{J_i} \setminus X| = \sum_{q=0}^{N} (X^i_q - l^i_q) \qquad (2)$$

where $x_q$, $l^i_q$, and $X^i_q$ are binary indicator vectors which pertain to the reference gene set $X$, the intersection of the reference gene set and the gene content $X \cap G^i_{J_i}$, and the gene content $G^i_{J_i}$ respectively. Further details of the integer linear programming formulation are presented in [5].
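Equations 1 and 2 can be checked numerically with binary indicator vectors. The sketch below is only an illustration of the two sums, not the ILP of [5]; the helper `indicator` and the variable names are hypothetical, and the example sets are chosen for illustration.

```python
from typing import List, Set


def indicator(genes: Set[int], n: int) -> List[int]:
    """Binary vector of length n + 1 whose entry q is 1 iff gene q is in the set."""
    return [1 if q in genes else 0 for q in range(n + 1)]


N = 7
X = {1, 2, 3}          # reference gene set
content = {1, 2, 4}    # gene content of some linear interval

x = indicator(X, N)
chi = indicator(content, N)
l = [a * b for a, b in zip(x, chi)]        # indicator of the intersection

missing = sum(xq - lq for xq, lq in zip(x, l))          # Equation 1
additional = sum(cq - lq for cq, lq in zip(chi, l))     # Equation 2
print(missing, additional)
```

The product `a * b` of two binary entries plays the role of the intersection indicator, which is why each summand in both equations is either 0 or 1.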

A simpler approach to solving AGCDP is to make use of a reference genome. Note that in the objective function, we are comparing the gene set of each linear interval to a reference gene set $X$. With the use of a reference genome sequence, instead of evaluating all possible $X \subset U$ which we can generate from the universal set, we only evaluate those $X$ which are present in the reference genome. This reduces the solution space of AGCDP. However, it is important to note that a global minimum $X$ identified using the naive solution (considering all possible clusters in all the input genomes) may not be the $X$ identified when we use a reference genome, i.e. $X$ may not occur in at least one of the genomes in the naive solution. Since in some cases a reference genome is available, one may opt to use other specialized algorithms simpler than the ILP formulation presented in [5]. For instances in which reference genome sequences are available through comparison and alignment of genomes of several species, these reference genome sequences are used in the process of identifying new genome sequences.

3. AGCDP AS MINIMIZATION PROBLEM ON GRAPH

Given the definition of AGCDP, we look into two different cases of the problem, which we discuss in the succeeding subsections. The first case is when we consider one of the input genomes as a reference genome from which we generate gene contents which we treat as reference gene sets, without loss of generality. The second case is when we do not assume a reference genome from the set of input genomes. Instead, we consider each gene content of a linear interval in any of the parts of the graph satisfying the constraint on $|X|$ as a possible solution gene set $X$. This lessens the number of possible reference gene set subsets of $U$ which we need to evaluate.

3.0.1 Construction of Graph
Given the gene universe $U = \{0, 1, \ldots, N\}$, a set of genomes $G = \{g^1, g^2, \ldots, g^t\}$, the size range $|X| \in [D^-, D^+]$ or $|X| = D$, and integer penalty weights $w^-$ and $w^+$ for missing and additional genes respectively in a linear interval, we propose the following transformation of AGCDP input parameters into a t-partite edge-weighted undirected graph representation.

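Definition 2. Define a t-partite edge-weighted undirected graph $G_{AGCDP} = (V, E)$, where

$$V = \bigcup_{i=1}^{t} V_i.$$

1. A part $V_i \subset V$ represents a genome $g^i \in G$ where $1 \le i \le t$.

2. A function $v(\cdot)$ is a bijection $v : G_i \longrightarrow V_i$ from the set of all gene contents $G_i = \{G(J^i_{j_i,k_i})\}$ in genome $g^i$ to the set of all vertices in $V_i$, $\forall\, i$, $1 \le i \le t$. An evaluation of $v(G(J^i_{j_i,k_i}))$ is a vertex in graph $G_{AGCDP}$, and an evaluation of $v(x_i)$ for a vertex $x_i \in V_i$ is a gene content of a linear interval in genome $g^i$. Gene contents are indexed by their linear intervals, so two intervals with equal gene content yield two distinct vertices.

3. An edge $e \in E$ is incident to vertices $x \in V_p$ and $y \in V_q$ if and only if $p \ne q$, i.e. the two vertices lie in different parts.

4. The weight $w_{x,y}$ assigned to an edge incident to vertices $x$ and $y$ is equal to

$$w_{x,y} = w^- \cdot |v(x) \setminus v(y)| + w^+ \cdot |v(y) \setminus v(x)| \qquad (3)$$

where "\" is the set difference operator.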
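The construction of Definition 2 can be sketched in Python as follows. This is one possible encoding under stated assumptions: a vertex is represented by the hypothetical triple `(i, j, k)` (genome index and 1-based interval bounds), only non-empty intervals are enumerated, and weights are stored per ordered pair because Equation 3 is not symmetric when the two penalty weights differ. The genomes are those of Example 1.

```python
from typing import Dict, List, Set, Tuple

Vertex = Tuple[int, int, int]  # (genome index i, start j, end k), all 1-based


def build_vertices(genomes: List[List[int]]) -> Dict[Vertex, Set[int]]:
    """One vertex per non-empty linear interval, mapped to its gene content."""
    vertices: Dict[Vertex, Set[int]] = {}
    for i, genome in enumerate(genomes, start=1):
        n = len(genome)
        for j in range(1, n + 1):
            for k in range(j, n + 1):
                vertices[(i, j, k)] = set(genome[j - 1:k])
    return vertices


def edge_weight(x: Set[int], y: Set[int], w_minus: int = 1, w_plus: int = 1) -> int:
    """Equation 3 evaluated on the gene contents of two vertices."""
    return w_minus * len(x - y) + w_plus * len(y - x)


def build_edges(vertices: Dict[Vertex, Set[int]],
                w_minus: int = 1, w_plus: int = 1) -> Dict[Tuple[Vertex, Vertex], int]:
    """Weight for every ordered pair of vertices lying in different parts."""
    return {(x, y): edge_weight(cx, cy, w_minus, w_plus)
            for x, cx in vertices.items()
            for y, cy in vertices.items()
            if x[0] != y[0]}


genomes = [[1, 1, 3, 2, 4], [3, 2, 1, 4], [5, 6, 1, 4, 2], [1, 3, 7]]
V = build_vertices(genomes)
E = build_edges(V)
print(len(V), len(E), E[((1, 1, 4), (3, 3, 5))])
```

Storing the full edge dictionary is convenient for small inputs; for long genomes, computing `edge_weight` on demand avoids materializing a number of pairs that grows with the fourth power of the genome length.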

Given $G_{AGCDP}$ as the graph representation of the input parameters of AGCDP, let us now define a collection of vertices star(x) rooted at vertex $x \in V_j$, as defined in [10]. Basically, star(x) is a set of vertices, one from each other part, where the weight between the root and each vertex is minimum among all vertices in that part incident to the root. The formal definition of star(x) is given below.

Definition 3. Given a vertex $x$ in $G_{AGCDP}$, let star(x) be a set of vertices $x_i$ of size $(t - 1)$ rooted at $x \in V_u$, where

$$x_i = \arg\min_{x_i} w_{x_i,x} \quad \forall\, x_i \in V_i,\ i \ne u.$$

We can evaluate a star(x) by computing its weight as

$$weight(star(x)) = \sum_{i=1,\ i \ne u}^{t} w_{x_i,x} \qquad (4)$$
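A star and its weight can be computed with the following Python sketch. The vertex encoding `(i, j, k)` and the function names are assumptions for illustration. Following the worked examples and the proof in Section 4, the root's gene content is passed as the first argument of Equation 3, so that it plays the role of $X$; with equal penalty weights the order does not matter.

```python
from typing import Dict, List, Set, Tuple

Vertex = Tuple[int, int, int]  # (genome index i, start j, end k), all 1-based


def build_vertices(genomes: List[List[int]]) -> Dict[Vertex, Set[int]]:
    """One vertex per non-empty linear interval, mapped to its gene content."""
    return {(i, j, k): set(g[j - 1:k])
            for i, g in enumerate(genomes, start=1)
            for j in range(1, len(g) + 1)
            for k in range(j, len(g) + 1)}


def star(root: Vertex, vertices: Dict[Vertex, Set[int]],
         w_minus: int = 1, w_plus: int = 1):
    """Return ({part: (min weight, [argmin vertices])}, total weight)."""
    x = vertices[root]
    best: Dict[int, Tuple[int, List[Vertex]]] = {}
    for v, content in vertices.items():
        part = v[0]
        if part == root[0]:
            continue  # no edges inside a part
        w = w_minus * len(x - content) + w_plus * len(content - x)
        if part not in best or w < best[part][0]:
            best[part] = (w, [v])
        elif w == best[part][0]:
            best[part][1].append(v)  # keep ties, written "|" in the examples
    total = sum(w for w, _ in best.values())
    return best, total


genomes = [[1, 1, 3, 2, 4], [3, 2, 1, 4], [5, 6, 1, 4, 2], [1, 3, 7]]
V = build_vertices(genomes)
best, total = star((1, 1, 4), V)
print(best, total)
```

Keeping every tied vertex per part mirrors the alternatives listed with the "|" symbol in Examples 2 and 3; a caller that needs a single star can pick any one vertex from each list.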

3.0.2 Minimization on Graph
Given the proposed transformation of the input parameters of AGCDP into its graph representation, we define AGCDP as a minimization problem on a graph as follows.

Definition 4. Given a graph $G_{AGCDP} = (V, E)$ and a size range $[D^-, D^+]$ or positive constant $D$, find a vertex $x \in V_u$ and star(x) such that $|v(x)| = D$ or $|v(x)|$ is in the range $[D^-, D^+]$ and

$$weight(star(x)) = \sum_{i=1,\ i \ne u}^{t} w_{x_i,x}$$

is minimum.

As discussed in Section 2, AGCDP is a double minimization problem. This is also true for the graph transformation of AGCDP: for a vertex $x$ mapped to a gene content (a possible solution gene set $X$), we search the graph for the set of vertices $x_i$, star(x), which minimizes the weight of each edge incident to the pairs $x, x_i$. From among the stars found, the minimum-weight star is identified. From the resulting minimum-weight star(x) we identify the gene content $G(J^u_{j_u,k_u})$ to which the root vertex $x$ maps. $G(J^u_{j_u,k_u})$ is the solution gene cluster. Note that there may be more than one solution gene cluster, and so we may find more than one root $x$. Also, we identify the set $J^u_{j_u,k_u} \cup \{J^i_{j_i,k_i}\}$ as the set of linear intervals $J$, where $\{J^i_{j_i,k_i}\}$ is the set of linear intervals to which star(x) maps and $J^u_{j_u,k_u}$ is the linear interval associated with the gene content $G(J^u_{j_u,k_u})$.

To give a more intuitive notion of the transformation discussed, we present examples of the two aforementioned cases with input parameters as specified in Example 1.

3.0.3 AGCDP with Reference Genome
Without loss of generality we assume genome $g^1$ to be the reference genome. All gene contents of linear intervals in $g^1$ which satisfy the size constraint $|X| = D$ will be considered as possible solution gene sets $X$ for the problem. These gene contents are mapped to vertices in part $V_1$ of the graph. These vertices are the roots of the stars star(x) which will be evaluated for weight. Assume that the penalty weights are $w^- = w^+ = 1$ and the size constraint is $|X| = 3$, as specified in Example 1.

Example 2. Given the set of genomes $G$, we construct the graph $G_{AGCDP}$.

1. Identify the linear intervals $J^i_{j_i,k_i}$ for each genome $g^i$ in $G$.

2. Identify the gene content $G(J^i_{j_i,k_i})$ for each linear interval $J^i_{j_i,k_i}$.

3. Map each identified gene content $G(J^i_{j_i,k_i})$ to a vertex $v(G(J^i_{j_i,k_i}))$ in $G_{AGCDP}$.

4. For all pairs $x_1, x_i$ with $x_1 \in V_1$, $x_i \in V_i$ and $i \ne 1$, define an edge $e_{x_1,x_i} \in E$. Define the weight assigned to each edge $e_{x_1,x_i}$ as in (3), where $x = x_1$ and $y = x_i$.

5. For each $x \in V_1$ such that $|v(x)| = |G(J^1_{j_1,k_1})| = 3$, determine star(x).

$star(v(G(J^1_{1,4}))) = \{v(G(J^2_{1,3})),\ v(G(J^3_{3,3})) \mid v(G(J^3_{5,5})) \mid v(G(J^3_{3,5})),\ v(G(J^4_{1,2}))\}$

$star(v(G(J^1_{2,4}))) = star(v(G(J^1_{1,4})))$

$star(v(G(J^1_{3,5}))) = \{v(G(J^2_{1,2})) \mid v(G(J^2_{1,4})),\ v(G(J^3_{4,5})),\ v(G(J^4_{2,2}))\}$

where the "|" symbol means "or".

Genome | Interval $J_{j,k}$ | Sequence | $G(J_{j,k})$ | {1, 2, 3} | {2, 3, 4}
--- | --- | --- | --- | --- | ---
$g^2$ | $J_{1,1}$ | (3) | {3} | 2 | 2
$g^2$ | $J_{2,2}$ | (2) | {2} | 2 | 2
$g^2$ | $J_{3,3}$ | (1) | {1} | 2 | 4
$g^2$ | $J_{4,4}$ | (4) | {4} | 4 | 2
$g^2$ | $J_{1,2}$ | (3, 2) | {2, 3} | 1 | 1*
$g^2$ | $J_{1,3}$ | (3, 2, 1) | {1, 2, 3} | 0* | 2
$g^2$ | $J_{1,4}$ | (3, 2, 1, 4) | {1, 2, 3, 4} | 1 | 1*
$g^2$ | $J_{2,3}$ | (2, 1) | {1, 2} | 1 | 3
$g^2$ | $J_{2,4}$ | (2, 1, 4) | {1, 2, 4} | 2 | 2
$g^2$ | $J_{3,4}$ | (1, 4) | {1, 4} | 3 | 3
$g^3$ | $J_{1,1}$ | (5) | {5} | 4 | 4
$g^3$ | $J_{2,2}$ | (6) | {6} | 4 | 4
$g^3$ | $J_{3,3}$ | (1) | {1} | 2* | 4
$g^3$ | $J_{4,4}$ | (4) | {4} | 4 | 2
$g^3$ | $J_{5,5}$ | (2) | {2} | 2* | 2
$g^3$ | $J_{1,2}$ | (5, 6) | {5, 6} | 5 | 5
$g^3$ | $J_{1,3}$ | (5, 6, 1) | {1, 5, 6} | 4 | 6
$g^3$ | $J_{1,4}$ | (5, 6, 1, 4) | {1, 4, 5, 6} | 5 | 5
$g^3$ | $J_{1,5}$ | (5, 6, 1, 4, 2) | {1, 2, 4, 5, 6} | 4 | 4
$g^3$ | $J_{2,3}$ | (6, 1) | {1, 6} | 3 | 5
$g^3$ | $J_{2,4}$ | (6, 1, 4) | {1, 4, 6} | 4 | 4
$g^3$ | $J_{2,5}$ | (6, 1, 4, 2) | {1, 2, 4, 6} | 3 | 3
$g^3$ | $J_{3,4}$ | (1, 4) | {1, 4} | 3 | 3
$g^3$ | $J_{3,5}$ | (1, 4, 2) | {1, 2, 4} | 2* | 2
$g^3$ | $J_{4,5}$ | (4, 2) | {2, 4} | 3 | 1*
$g^4$ | $J_{1,1}$ | (1) | {1} | 2 | 4
$g^4$ | $J_{2,2}$ | (3) | {3} | 2 | 2*
$g^4$ | $J_{3,3}$ | (7) | {7} | 4 | 4
$g^4$ | $J_{1,2}$ | (1, 3) | {1, 3} | 1* | 3
$g^4$ | $J_{1,3}$ | (1, 3, 7) | {1, 3, 7} | 2 | 4
$g^4$ | $J_{2,3}$ | (3, 7) | {3, 7} | 3 | 3

Table 1: Identification of edge weights with respect to the gene contents (candidate solution gene sets) {1, 2, 3} and {2, 3, 4}. These are the gene contents of linear intervals in genome $g^1$ which satisfy the size constraint $|X| = 3$. The minimum weights for each reference gene content in relation to the gene contents of each other genome are marked with *.

Figure 2: The constructed graph $G_{AGCDP}$. The vertices on top are elements of $V_1$, on the left are elements of $V_2$, on the right are elements of $V_3$ and on the bottom are elements of $V_4$. The vertices which are colored in yellow are the only vertices which satisfy the constraint $|X| = 3$ and therefore are the only valid roots for star(x). The weight of each edge is indicated on the side of each vertex with a color corresponding to the color of the edge.

6. Determine the minimum-weight star(x) from among those identified in the previous step.

$weight(star(v(G(J^1_{1,4})))) = weight((v(G(J^1_{1,4})), v(G(J^2_{1,3})))) + weight((v(G(J^1_{1,4})), v(G(J^3_{3,3})))) + weight((v(G(J^1_{1,4})), v(G(J^4_{1,2}))))$
$= 0 + 2 + 1$
$= 3$

$weight(star(v(G(J^1_{2,4})))) = weight(star(v(G(J^1_{1,4})))) = 3$

$weight(star(v(G(J^1_{3,5})))) = weight((v(G(J^1_{3,5})), v(G(J^2_{1,2})))) + weight((v(G(J^1_{3,5})), v(G(J^3_{4,5})))) + weight((v(G(J^1_{3,5})), v(G(J^4_{2,2}))))$
$= 1 + 1 + 2$
$= 4$

In this example, we see that the minimum-weight stars, with a weight of 3, in graph $G_{AGCDP}$ are the following:

$star(v(G(J^1_{1,4}))) = \{v(G(J^2_{1,3})), v(G(J^3_{3,3})), v(G(J^4_{1,2}))\}$
or $\{v(G(J^2_{1,3})), v(G(J^3_{5,5})), v(G(J^4_{1,2}))\}$
or $\{v(G(J^2_{1,3})), v(G(J^3_{3,5})), v(G(J^4_{1,2}))\}$

$star(v(G(J^1_{2,4}))) = \{v(G(J^2_{1,3})), v(G(J^3_{3,3})), v(G(J^4_{1,2}))\}$
or $\{v(G(J^2_{1,3})), v(G(J^3_{5,5})), v(G(J^4_{1,2}))\}$
or $\{v(G(J^2_{1,3})), v(G(J^3_{3,5})), v(G(J^4_{1,2}))\}$

The vertices $v(G(J^1_{1,4}))$ and $v(G(J^1_{2,4}))$ are then the roots of the stars in $G_{AGCDP}$ which have the minimum weight. The solution gene set $X$ we are searching for must then be either of the gene contents to which the vertices $v(G(J^1_{1,4}))$ and $v(G(J^1_{2,4}))$ are mapped. Note that the gene content to which the vertex $v(G(J^1_{1,4}))$ is mapped is {1, 2, 3}, and the gene content to which the vertex $v(G(J^1_{2,4}))$ is mapped is also {1, 2, 3}. Given the minimum-weight $star(v(G(J^1_{1,4})))$ and $star(v(G(J^1_{2,4})))$, we see that the reference gene set

$$X = \{1, 2, 3\}$$

together with any of the sets of linear intervals

$J_1 = \{J^1_{1,4}, J^2_{1,3}, J^3_{3,3}, J^4_{1,2}\}$
$J_2 = \{J^1_{1,4}, J^2_{1,3}, J^3_{5,5}, J^4_{1,2}\}$
$J_3 = \{J^1_{1,4}, J^2_{1,3}, J^3_{3,5}, J^4_{1,2}\}$
$J_4 = \{J^1_{2,4}, J^2_{1,3}, J^3_{3,3}, J^4_{1,2}\}$
$J_5 = \{J^1_{2,4}, J^2_{1,3}, J^3_{5,5}, J^4_{1,2}\}$
$J_6 = \{J^1_{2,4}, J^2_{1,3}, J^3_{3,5}, J^4_{1,2}\}$

forms a solution to our example problem, in agreement with what was concluded in Example 1.
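The search of Example 2 can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' implementation: the function name `min_weight_roots`, the vertex encoding `(i, j, k)` and the `reference` parameter (1-based index of the reference genome) are hypothetical, and only the fixed-size constraint $|X| = D$ is handled.

```python
from typing import Dict, List, Set, Tuple

Vertex = Tuple[int, int, int]  # (genome index i, start j, end k), all 1-based


def min_weight_roots(genomes: List[List[int]], size: int, reference: int,
                     w_minus: int = 1, w_plus: int = 1) -> Tuple[int, List[Vertex]]:
    """Minimum star weight and the roots in the reference part attaining it."""
    contents: Dict[Vertex, Set[int]] = {
        (i, j, k): set(g[j - 1:k])
        for i, g in enumerate(genomes, start=1)
        for j in range(1, len(g) + 1)
        for k in range(j, len(g) + 1)}

    def star_weight(root: Vertex) -> int:
        x = contents[root]
        total = 0
        for part in range(1, len(genomes) + 1):
            if part == root[0]:
                continue
            # cheapest edge from the root into this part
            total += min(w_minus * len(x - c) + w_plus * len(c - x)
                         for v, c in contents.items() if v[0] == part)
        return total

    # candidate roots: reference-genome vertices meeting the size constraint
    roots = [v for v, c in contents.items()
             if v[0] == reference and len(c) == size]
    if not roots:
        raise ValueError("no interval of the reference genome has the requested size")
    weights = {r: star_weight(r) for r in roots}
    best = min(weights.values())
    return best, sorted(r for r, w in weights.items() if w == best)


genomes = [[1, 1, 3, 2, 4], [3, 2, 1, 4], [5, 6, 1, 4, 2], [1, 3, 7]]
print(min_weight_roots(genomes, size=3, reference=1))
```

Changing `reference` to another genome index shows how the set of reachable solution gene clusters depends on the choice of reference genome.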

3.0.4 AGCDP with no Reference Genome
In this case we do not choose a specific genome from the set of input genomes and instead consider all vertices mapped to gene contents which satisfy the input constraint $|X| = D$. As a result, all vertices from each part of $G_{AGCDP}$ are adjacent to all vertices from the other parts. This greatly increases the number of edges in graph $G_{AGCDP}$ which need to be evaluated for weight. All gene contents of linear intervals in all genomes $g^i$ which satisfy the size constraint $|X| = D$, or $|X|$ in the range $[D^-, D^+]$, will be considered as possible solution gene clusters $X$ for the problem. The corresponding vertices are the roots of the stars which will be evaluated to find the solution to the problem. We again assume that the penalty weights are $w^- = w^+ = 1$ and the size constraint is $|X| = 3$, as defined in Example 1.

Example 3. Given the set of genomes $G$, we construct the graph $G_{AGCDP}$.

1. Identify the linear intervals $J^i_{j_i,k_i}$ for each genome $g^i$ in $G$.

2. Identify the gene content $G(J^i_{j_i,k_i})$ for each linear interval $J^i_{j_i,k_i}$.

3. Map each identified gene content $G(J^i_{j_i,k_i})$ to a vertex $v(G(J^i_{j_i,k_i}))$ in $G_{AGCDP}$.

4. For all pairs $x, x_i$ with $x \in V_u$, $x_i \in V_i$ and $i \ne u$, define an edge $e_{x,x_i} \in E$. Define the weight assigned to each edge $e_{x,x_i}$ as in (3), where $y = x_i$.

Figure 3: The constructed graph $G_{AGCDP}$ for case 2. The set of vertices on the top are elements of $V_1$, on the left are elements of $V_2$, on the right are elements of $V_3$ and on the bottom are elements of $V_4$. All vertices are connected to all other vertices in the other parts of the graph. The weights of edges are omitted for clarity of the graph. The weights are identified in Table 2.

5. For each vertex $x \in V_u$, $1 \le u \le t$, such that $|v(x)| = 3$, determine star(x). As was done for Example 2, we determine the star(x) for the identified gene contents $v(x)$ of linear intervals from all genomes satisfying the size constraint $|X| = 3$.

$star(v(G(J^1_{1,4}))) = \{v(G(J^2_{1,3})),\ v(G(J^3_{3,3})) \mid v(G(J^3_{5,5})) \mid v(G(J^3_{3,5})),\ v(G(J^4_{1,2}))\}$

$star(v(G(J^1_{2,4}))) = star(v(G(J^1_{1,4})))$

$star(v(G(J^1_{3,5}))) = \{v(G(J^2_{1,2})) \mid v(G(J^2_{1,4})),\ v(G(J^3_{4,5})),\ v(G(J^4_{2,2}))\}$

$star(v(G(J^2_{1,3}))) = \{v(G(J^1_{1,4})) \mid v(G(J^1_{2,4})),\ v(G(J^3_{3,3})) \mid v(G(J^3_{5,5})) \mid v(G(J^3_{3,5})),\ v(G(J^4_{1,2}))\}$

$star(v(G(J^2_{2,4}))) = \{v(G(J^1_{1,5})) \mid v(G(J^1_{2,5})) \mid v(G(J^1_{4,5})),\ v(G(J^3_{3,5})),\ v(G(J^4_{1,1}))\}$

$star(v(G(J^3_{1,3}))) = \{v(G(J^1_{1,1})) \mid v(G(J^1_{2,2})) \mid v(G(J^1_{1,2})),\ v(G(J^2_{3,3})),\ v(G(J^4_{1,1}))\}$

$star(v(G(J^3_{2,4}))) = \{v(G(J^1_{1,1})) \mid v(G(J^1_{2,2})) \mid v(G(J^1_{1,2})) \mid v(G(J^1_{5,5})),\ v(G(J^2_{3,4})),\ v(G(J^4_{1,1}))\}$

$star(v(G(J^3_{3,5}))) = \{v(G(J^1_{1,5})) \mid v(G(J^1_{2,5})) \mid v(G(J^1_{4,5})),\ v(G(J^2_{2,4})),\ v(G(J^4_{1,1}))\}$

$star(v(G(J^4_{1,3}))) = \{v(G(J^1_{1,3})) \mid v(G(J^1_{2,3})),\ v(G(J^2_{1,1})) \mid v(G(J^2_{3,3})) \mid v(G(J^2_{1,3})),\ v(G(J^3_{3,3}))\}$

where the "|" symbol means "or".

6. Determine the minimum-weight star(x) from among those identified in the previous step.

$weight(star(v(G(J^1_{1,4})))) = 0 + 2 + 1 = 3$
$weight(star(v(G(J^1_{2,4})))) = 0 + 2 + 1 = 3$
$weight(star(v(G(J^1_{3,5})))) = 1 + 1 + 2 = 4$
$weight(star(v(G(J^2_{1,3})))) = 0 + 2 + 1 = 3$
$weight(star(v(G(J^2_{2,4})))) = 1 + 0 + 2 = 3$
$weight(star(v(G(J^3_{1,3})))) = 2 + 2 + 2 = 6$
$weight(star(v(G(J^3_{2,4})))) = 2 + 1 + 2 = 5$
$weight(star(v(G(J^3_{3,5})))) = 1 + 0 + 2 = 3$
$weight(star(v(G(J^4_{1,3})))) = 1 + 2 + 2 = 5$

From the evaluation of the weights of the identified stars, the ones with minimum weight are $star(v(G(J^1_{1,4})))$, $star(v(G(J^1_{2,4})))$, $star(v(G(J^2_{1,3})))$, $star(v(G(J^2_{2,4})))$ and $star(v(G(J^3_{3,5})))$. From these stars we identify the solution gene cluster {1, 2, 3} from $v(G(J^1_{1,4}))$, $v(G(J^1_{2,4}))$ and $v(G(J^2_{1,3}))$, and the gene cluster {1, 2, 4} from $v(G(J^2_{2,4}))$ and $v(G(J^3_{3,5}))$. Notice that the solution gene cluster {1, 2, 4} was not identified in the case wherein a reference genome was assumed. This is a consequence of assuming a reference genome, since the possible solution gene clusters which can be found are limited to the gene contents available in the chosen reference genome.
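The case without a reference genome can be sketched in the same way, with every vertex of the right size acting as a candidate root. The function name `solution_clusters` and the grouping of roots by gene cluster are assumptions for illustration; the input is the data of Example 1 with $w^- = w^+ = 1$ by default.

```python
from typing import Dict, List, Set, Tuple

Vertex = Tuple[int, int, int]  # (genome index i, start j, end k), all 1-based


def solution_clusters(genomes: List[List[int]], size: int,
                      w_minus: int = 1, w_plus: int = 1):
    """Minimum star weight over all parts, with roots grouped by gene cluster."""
    contents: Dict[Vertex, Set[int]] = {
        (i, j, k): set(g[j - 1:k])
        for i, g in enumerate(genomes, start=1)
        for j in range(1, len(g) + 1)
        for k in range(j, len(g) + 1)}

    def star_weight(root: Vertex) -> int:
        x = contents[root]
        return sum(
            min(w_minus * len(x - c) + w_plus * len(c - x)
                for v, c in contents.items() if v[0] == part)
            for part in range(1, len(genomes) + 1) if part != root[0])

    # every vertex of the right size, in any part, is a candidate root
    weights = {v: star_weight(v) for v, c in contents.items() if len(c) == size}
    best = min(weights.values())
    clusters: Dict[Tuple[int, ...], List[Vertex]] = {}
    for root in sorted(weights):
        if weights[root] == best:
            clusters.setdefault(tuple(sorted(contents[root])), []).append(root)
    return best, clusters


genomes = [[1, 1, 3, 2, 4], [3, 2, 1, 4], [5, 6, 1, 4, 2], [1, 3, 7]]
print(solution_clusters(genomes, size=3))
```

Grouping the roots by their gene content makes it easy to see which solution gene clusters arise only from genomes other than a would-be reference genome.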

4. PROOF OF EQUIVALENCE
We show the equivalence of AGCDP with the proposed transformation into the graph problem of finding a minimum-weight star(x) in an edge-weighted undirected t-partite graph. We show the equivalence for the two presented cases of AGCDP: the first case, in which we identify a reference genome from among the input genomes, and the second case, in which no reference genome is identified. We refer to Definitions 1, 2, 3, and 4 as the basis for the following proofs.

With Reference Genome
Without loss of generality we identify genome $g^1$ as the reference genome. Recall that for any vertex $x \in V_1$, $x = v(G(J^1_{j_1,k_1}))$ such that $1 \le j_1 \le k_1 \le n_1$. Equation 3 can then be written as

$$w_{x,y} = w^- \cdot |G(J^1_{j_1,k_1}) \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus G(J^1_{j_1,k_1})|$$

and the total weight of a star(x) defined in (4) can then be written as

$$\sum_{i \ne 1} w^- \cdot |G(J^1_{j_1,k_1}) \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus G(J^1_{j_1,k_1})|$$

If we let $X$ represent any possible solution gene cluster $G(J^1_{j_1,k_1})$, the previous expression is equal to

$$\sum_{i \ne 1} w^- \cdot |X \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus X|$$

Note that this is just the objective function $cost(X, J)$ to be minimized for AGCDP with the first genome, $g^1$, defined as the reference genome; the term for $i = 1$ vanishes because $X = G(J^1_{j_1,k_1})$. It is easy to see that a unique pair of gene cluster $X$ and set of linear intervals $J$ which minimizes the objective function $cost(X, J)$ in AGCDP with a reference genome, $(X^*, J^*)$, is mapped to a unique pair of vertex $x$ and set of vertices star(x), $(x, star(x))$, which also minimizes the objective function $weight(star(x))$ for the minimum-weight star(x) finding problem in an edge-weighted undirected t-partite graph $G_{AGCDP}$.

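Genome | Interval $J_{j,k}$ | Sequence | $G(J_{j,k})$ | $g^1$: {1,2,3} | $g^1$: {2,3,4} | $g^2$: {1,2,3} | $g^2$: {1,2,4} | $g^3$: {1,5,6} | $g^3$: {1,4,6} | $g^3$: {1,2,4} | $g^4$: {1,3,7}
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
$g^1$ | $J_{1,1}$ | (1) | {1} | - | - | 2 | 2 | 2* | 2* | 2 | 2
$g^1$ | $J_{2,2}$ | (1) | {1} | - | - | 2 | 2 | 2* | 2* | 2 | 2
$g^1$ | $J_{3,3}$ | (3) | {3} | - | - | 2 | 4 | 4 | 4 | 4 | 2
$g^1$ | $J_{4,4}$ | (2) | {2} | - | - | 2 | 2 | 4 | 4 | 2 | 4
$g^1$ | $J_{5,5}$ | (4) | {4} | - | - | 4 | 2 | 4 | 2* | 2 | 4
$g^1$ | $J_{1,2}$ | (1, 1) | {1} | - | - | 2 | 2 | 2* | 2* | 2 | 2
$g^1$ | $J_{1,3}$ | (1, 1, 3) | {1, 3} | - | - | 1 | 3 | 3 | 3 | 3 | 1*
$g^1$ | $J_{1,4}$ | (1, 1, 3, 2) | {1, 2, 3} | - | - | 0* | 2 | 4 | 4 | 2 | 2
$g^1$ | $J_{1,5}$ | (1, 1, 3, 2, 4) | {1, 2, 3, 4} | - | - | 1 | 1* | 5 | 3 | 1* | 3
$g^1$ | $J_{2,3}$ | (1, 3) | {1, 3} | - | - | 1 | 3 | 3 | 3 | 3 | 1*
$g^1$ | $J_{2,4}$ | (1, 3, 2) | {1, 2, 3} | - | - | 0* | 2 | 4 | 4 | 2 | 2
$g^1$ | $J_{2,5}$ | (1, 3, 2, 4) | {1, 2, 3, 4} | - | - | 1 | 1* | 5 | 3 | 1* | 3
$g^1$ | $J_{3,4}$ | (3, 2) | {2, 3} | - | - | 1 | 3 | 5 | 5 | 3 | 3
$g^1$ | $J_{3,5}$ | (3, 2, 4) | {2, 3, 4} | - | - | 2 | 2 | 6 | 4 | 2 | 4
$g^1$ | $J_{4,5}$ | (2, 4) | {2, 4} | - | - | 3 | 1* | 5 | 3 | 1* | 5
$g^2$ | $J_{1,1}$ | (3) | {3} | 2 | 2 | - | - | 4 | 4 | 4 | 2*
$g^2$ | $J_{2,2}$ | (2) | {2} | 2 | 2 | - | - | 4 | 4 | 2 | 4
$g^2$ | $J_{3,3}$ | (1) | {1} | 2 | 4 | - | - | 2* | 2 | 2 | 2*
$g^2$ | $J_{4,4}$ | (4) | {4} | 4 | 2 | - | - | 4 | 2 | 2 | 4
$g^2$ | $J_{1,2}$ | (3, 2) | {2, 3} | 1 | 1* | - | - | 5 | 5 | 3 | 3
$g^2$ | $J_{1,3}$ | (3, 2, 1) | {1, 2, 3} | 0* | 2 | - | - | 4 | 4 | 2 | 2*
$g^2$ | $J_{1,4}$ | (3, 2, 1, 4) | {1, 2, 3, 4} | 1 | 1* | - | - | 5 | 3 | 1 | 3
$g^2$ | $J_{2,3}$ | (2, 1) | {1, 2} | 1 | 3 | - | - | 3 | 3 | 1 | 3
$g^2$ | $J_{2,4}$ | (2, 1, 4) | {1, 2, 4} | 2 | 2 | - | - | 4 | 2 | 0* | 4
$g^2$ | $J_{3,4}$ | (1, 4) | {1, 4} | 3 | 3 | - | - | 3 | 1* | 1 | 3
$g^3$ | $J_{1,1}$ | (5) | {5} | 4 | 4 | 4 | 4 | - | - | - | 4
$g^3$ | $J_{2,2}$ | (6) | {6} | 4 | 4 | 4 | 4 | - | - | - | 4
$g^3$ | $J_{3,3}$ | (1) | {1} | 2* | 4 | 2* | 2 | - | - | - | 2*
$g^3$ | $J_{4,4}$ | (4) | {4} | 4 | 2 | 4 | 2 | - | - | - | 4
$g^3$ | $J_{5,5}$ | (2) | {2} | 2* | 2 | 2* | 2 | - | - | - | 4
$g^3$ | $J_{1,2}$ | (5, 6) | {5, 6} | 5 | 5 | 5 | 5 | - | - | - | 5
$g^3$ | $J_{1,3}$ | (5, 6, 1) | {1, 5, 6} | 4 | 6 | 4 | 4 | - | - | - | 4
$g^3$ | $J_{1,4}$ | (5, 6, 1, 4) | {1, 4, 5, 6} | 5 | 5 | 5 | 3 | - | - | - | 5
$g^3$ | $J_{1,5}$ | (5, 6, 1, 4, 2) | {1, 2, 4, 5, 6} | 4 | 4 | 4 | 2 | - | - | - | 6
$g^3$ | $J_{2,3}$ | (6, 1) | {1, 6} | 3 | 5 | 3 | 3 | - | - | - | 3
$g^3$ | $J_{2,4}$ | (6, 1, 4) | {1, 4, 6} | 4 | 4 | 4 | 2 | - | - | - | 4
$g^3$ | $J_{2,5}$ | (6, 1, 4, 2) | {1, 2, 4, 6} | 3 | 3 | 3 | 1 | - | - | - | 5
$g^3$ | $J_{3,4}$ | (1, 4) | {1, 4} | 3 | 3 | 3 | 1 | - | - | - | 3
$g^3$ | $J_{3,5}$ | (1, 4, 2) | {1, 2, 4} | 2* | 2 | 2* | 0* | - | - | - | 4
$g^3$ | $J_{4,5}$ | (4, 2) | {2, 4} | 3 | 1* | 3 | 1 | - | - | - | 5
$g^4$ | $J_{1,1}$ | (1) | {1} | 2 | 4 | 2 | 2* | 2* | 2* | 2* | -
$g^4$ | $J_{2,2}$ | (3) | {3} | 2 | 2* | 2 | 4 | 4 | 4 | 4 | -
$g^4$ | $J_{3,3}$ | (7) | {7} | 4 | 4 | 4 | 4 | 4 | 4 | 4 | -
$g^4$ | $J_{1,2}$ | (1, 3) | {1, 3} | 1* | 3 | 1* | 3 | 3 | 3 | 3 | -
$g^4$ | $J_{1,3}$ | (1, 3, 7) | {1, 3, 7} | 2 | 4 | 2 | 4 | 4 | 4 | 4 | -
$g^4$ | $J_{2,3}$ | (3, 7) | {3, 7} | 3 | 3 | 3 | 5 | 5 | 5 | 5 | -

Table 2: Evaluation of edge weights with respect to the possible solution gene contents identified from all the input genomes. These gene contents, labelled by their genome, head the eight weight columns; a dash marks a pair within the same genome, for which no edge exists. The minimum weights for each candidate gene content within each other genome are marked with *.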

We have shown that the objective function $weight(star(x))$ to be minimized for the minimum-weight star(x) finding problem in graph $G_{AGCDP}$ is just equal to the objective function $cost(X, J)$ to be minimized for AGCDP with genome $g^1$ set as the reference genome. The equivalence of the objective function of AGCDP and the objective function of the minimum-weight star(x) finding problem can be shown by constructing the graph $G_{AGCDP}$ from the input parameters of AGCDP. □

With No Reference GenomeFor the case wherein a reference genome is not chosen inAGCDP we can extend the proof for AGCDP with a ref-erence genome to look also into the inter-genome similarityof gene contents of linear intervals. We consider each genecontent in each genome satisfying the size constraint as apossible solution gene cluster for AGCDP instead of enu-merating all the gene set X ⊂ U such that |X| = D or |X|is in [D−, D+].

Since we are looking into the inter-genome relationship ofgene contents, an edge in GAGCDP = (V,E) is defined suchthat for all pairs x, y, x ∈ Vu, y ∈ Vi, and i 6= u, (x, y) ∈ E.

Recall that for any vertex x ∈ V, x = v(G(J iji,ki

)) for 1 ≤i ≤ t. Equation 3 can then be written as

wx,y = w−·|G(Juju,ku

)\G(J iji,ki

)|+ w+·|G(J iji,ki

)\G(Juju,ku

)|

and the total weight of a star(x) defined in 4 can then bewritten as

∑i 6=u

w− · |G(Juju,ku

)\G(J iji,ki

)| + w+ · |G(J iji,ki

)\G(Juju,ku

)|

Suppose that $x = v(G(J^u_{j_u,k_u}))$ is the vertex x in the definition of star(x) in Definition 3. Then $G(J^u_{j_u,k_u})$ is a possible solution gene cluster. If we denote $G(J^u_{j_u,k_u})$ as X in the above equation, the total weight of a star(x) can then be written as

$$\sum_{i \neq u} \left( w^- \cdot |X \setminus G(J^i_{j_i,k_i})| + w^+ \cdot |G(J^i_{j_i,k_i}) \setminus X| \right)$$

Notice that this is just the objective function cost(X, J) to be minimized for AGCDP. A unique pair (X∗, J∗) of a gene cluster X and a set of linear intervals J which minimizes the objective function cost(X, J) in AGCDP is mapped to a unique pair (x, star(x)) of a vertex x and a set of vertices star(x), which also minimizes the objective function weight(star(x)) for the minimum-weight star(x) finding problem in the edge-weighted undirected t-partite graph $G_{AGCDP}$. The equivalence of the transformation from AGCDP to the minimum-weight star finding problem in graph $G_{AGCDP}$ can be shown by constructing $G_{AGCDP}$ from the input parameters of AGCDP or by reversing the presented proof. □
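To make the mapping concrete, the search for a minimum-weight star without a reference genome can be sketched as a brute-force loop over all candidate centers. This is a minimal sketch and not an algorithm proposed in the paper: the names `edge_weight` and `min_weight_star` are hypothetical, each partition is assumed to be a non-empty collection of gene contents (sets), and star(x) is assumed to contain exactly one vertex from every partition $V_i$ with i ≠ u, as the sum over i ≠ u suggests.

```python
def edge_weight(x, y, w_minus=1, w_plus=1):
    """w^- * |x \\ y| + w^+ * |y \\ x| for two gene contents x and y."""
    return w_minus * len(x - y) + w_plus * len(y - x)

def min_weight_star(partitions, w_minus=1, w_plus=1):
    """Return (weight, u, x, star) for a minimum-weight star.

    partitions[i] holds the candidate gene contents of genome i.
    For a fixed center x in V_u, the cheapest star takes, from every
    other partition, the vertex with the smallest edge weight to x.
    """
    best = None
    for u, v_u in enumerate(partitions):
        for x in v_u:
            star, total = [], 0
            for i, v_i in enumerate(partitions):
                if i == u:
                    continue  # no edges inside the center's own partition
                y = min(v_i, key=lambda c: edge_weight(x, c, w_minus, w_plus))
                star.append((i, y))
                total += edge_weight(x, y, w_minus, w_plus)
            if best is None or total < best[0]:
                best = (total, u, x, star)
    return best
```

The returned center x plays the role of X∗ and the selected vertices in `star` correspond to the intervals in J∗. The loop examines every pair of vertices from different partitions and makes no claim about the complexity class of the problem.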

5. CONCLUSION
In this paper the authors presented the Approximate Gene Cluster Discovery Problem (AGCDP) as a combinatorial optimization problem. A transformation of AGCDP into a minimum-weight star(x) finding problem in an edge-weighted undirected t-partite graph was proposed, both for the case wherein a reference genome is assumed from the set of input genomes and for the case wherein no reference genome is identified. Proofs of equivalence of the transformation from AGCDP to the graph problem were also presented for both cases, and it was shown that the objective function to be minimized in AGCDP is equivalent to the objective function in the graph problem. It was observed that some solution gene clusters within the input genomes may not be found when a reference genome from the input genomes is assumed. That is the case for the solution gene cluster {1, 2, 4} in Example 3. It is worth further investigating other approaches to identifying solution gene clusters given gene contents from all genomes, i.e. a consensus gene cluster. The authors hope that the proposed transformation of AGCDP into a graph problem will lead to further results on AGCDP.

6. ADDITIONAL NOTES
Finding Nearby Stars
We could widen the solution space of the minimum-weight star(x) finding problem to include every star(x) within the neighborhood of the best solution. As with the idea of neighborhood in AGCDP discussed in [5], a threshold γ > 0 can be chosen such that, given the weight weight(star(x)) of the star(x) with the minimum weight, every star(x_i) satisfying

weight(star(x_i)) ≤ (1 + γ) · weight(star(x))

is also accepted as a solution.
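The threshold test can be illustrated with a short Python sketch. The function name `nearby_stars` and the dictionary representation (center label mapped to the weight of its cheapest star) are assumptions made for this example only.

```python
def nearby_stars(star_weights, gamma):
    """Keep every star whose weight is within (1 + gamma) of the minimum.

    star_weights maps a center vertex label to weight(star(x_i)).
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    best = min(star_weights.values())
    bound = (1 + gamma) * best
    return {x: w for x, w in star_weights.items() if w <= bound}
```

Note that when the minimum weight is 0 the bound is also 0, so only stars of weight 0 are kept regardless of γ.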

7. FUTURE WORKS
For future work on this topic, the authors aim to achieve the following:

• Classification of the complexity of the graph problem

• Enhancements to the proposed transformation from AGCDP to the minimum-weight star(x) finding problem

• Proposal of an algorithm and optimizations for finding a minimum-weight star in a graph, both for the case with a reference genome and the case without a reference genome

• Evaluation of the proposed transformation based on variations of AGCDP

• Investigation of the idea of a Consensus Gene Cluster in identifying solution gene clusters given gene contents from all input genomes

8. REFERENCES
[1] R. Karp, Reducibility Among Combinatorial Problems, Complexity of Computer Computations, 1972

[2] P. Pevzner, S. Sze, Combinatorial Approaches to Finding Subtle Signals in DNA Sequences, American Association for Artificial Intelligence, 2000

[3] A. Bergeron, S. Corteel, M. Raffinot, The Algorithmic of Gene Teams, Workshop on Algorithms in Bioinformatics (WABI), Vol. 2452 of LNCS, pp. 464-476, 2002

[4] R. Hoberman, D. Durand, The Incompatible Desiderata of Gene Cluster Properties, In: A. McLysaght, D. H. Huson, editors, Comparative Genomics: RECOMB 2005 International Workshop, Vol. 3678 of LNCS, pp. 73-87, 2005

[5] S. Rahmann, G. Klau, Integer Linear Programming Techniques for Discovering Approximate Gene Clusters, Bioinformatics Algorithms, Techniques and Applications, Wiley-Interscience, 2008

[6] N. Deo, Graph Theory with Applications to Engineering and Computer Science, Prentice Hall, 1974

[7] T. Cormen, C. Leiserson, R. Rivest, C. Stein, Introduction to Algorithms, 3rd Edition, The MIT Press, 2009

[8] M. Garey, D. Johnson, Computers and Intractability, Bell Telephone Laboratories, Inc., 1979

[9] A. Gibbons, Algorithmic Graph Theory, Cambridge University Press, 1985

[10] E. Zaslavsky, M. Singh, A Combinatorial Optimization Approach for Diverse Motif Finding Applications, Algorithms for Molecular Biology, 2006