
Approximate Gene Cluster Discovery Problem (AGCDP) is NP-hard

Gerard S. Cabunducan, Jhoirene B. Clemente, Raissa T. Relator, and Henry N. Adorna

Algorithms & Complexity Lab
Department of Computer Science
University of the Philippines Diliman
Diliman 1101 Quezon City, Philippines

E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract. We prove that the approximate gene cluster discovery problem (AGCDP) is NP-hard by giving a reduction from the median string problem (MSP), which is NP-complete. We show that the transformation runs in polynomial time. We also show a correspondence between the Total Distance function used by MSP and the cost function used by AGCDP.

Keywords: AGCDP, MSP, Reduction, NP-hard

1 Introduction

The goal of computational complexity theory is to determine the practical limits of computers. This is done by classifying computational problems into different complexity classes. Reduction is one way to determine the hardness of a problem: a problem A is transformed into another problem B to show that B is at least as hard as A.

The approximate gene cluster discovery problem (AGCDP) aims to find a set of genes that are kept more or less together in a set of sequences. Genes belonging to a cluster are believed to perform similar functions or to be involved in the same cellular process. AGCDP shows a strong similarity to an NP-complete problem called the median string problem (MSP) [5]. MSP aims to find a median string from a set of sequences; this problem is equivalent to finding patterns in a set of sequences. These patterns are called motifs, which regulate gene transcription. By showing a polynomial-time reduction from MSP, we can prove that AGCDP is NP-hard.

The formal definitions of AGCDP and MSP are given in Section 2. They are followed by our proof in Section 3, which includes a discussion of the proof idea, the polynomial-time transformation, and the correspondence of the score functions used by both problems.


2 Basic Definitions

2.1 Median String Problem (MSP)

Given a set of sequences, a median string is a pattern that minimizes its distance to all the strings in the set. The distance between two strings can be measured in different ways; examples are the Levenshtein distance and the Hamming distance. Definitions of the two distance measures are given below.

Definition 1. Levenshtein Distance
Given two strings $w$ and $w'$ in $\Sigma^*$, the Levenshtein distance between the two, denoted by $d_L(w, w')$, is equal to the smallest $k$ such that $w \xrightarrow{k} w'$. The transformation of $w$ to $w'$ uses the following operations.

1. Single symbol deletion: $w = uxv$, $w' = uv$
2. Single symbol insertion: $w = uv$, $w' = uxv$
3. Single symbol substitution: $w = uxv$, $w' = uyv$,

where $u, v \in \Sigma^*$ and $x, y \in \Sigma$.

For example, $d_L(1234, 135) = 2$ because two operations are needed to transform ′1234′ into ′135′. Since every operation can be reversed, we can equally count the steps from $w'$ = ′135′ back to $w$ = ′1234′: first insert ′2′ between ′1′ and ′3′ in $w'$ to get ′1235′, then substitute ′5′ with ′4′ to get $w$.
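The three operations of Definition 1 lead to the standard dynamic-programming computation of the Levenshtein distance. The following is a minimal Python sketch; the function name `levenshtein` is ours and is not taken from the paper.

```python
def levenshtein(w: str, w2: str) -> int:
    """Smallest number of single-symbol deletions, insertions, and
    substitutions needed to turn w into w2 (row-by-row dynamic programming)."""
    prev = list(range(len(w2) + 1))  # distances from "" to each prefix of w2
    for i, x in enumerate(w, start=1):
        curr = [i]  # distance from w[:i] to ""
        for j, y in enumerate(w2, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("1234", "135"))  # 2
```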

Definition 2. Hamming Distance
Given two strings $w$ and $w'$ in $\Sigma^*$, where $|w| = |w'|$, the Hamming distance between the two, denoted by $d_H(w, w')$, is equal to the total number of mismatched characters.

The Hamming distance between the two l-mers ′0113200′ and ′0133210′ is equal to 2; the mismatched positions are marked with x.

0 1 1 3 2 0 0
    x     x
0 1 3 3 2 1 0
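Definition 2 translates directly into a position-by-position comparison. A minimal Python sketch follows; the name `hamming` is ours.

```python
def hamming(w: str, w2: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(w) != len(w2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(w, w2))


print(hamming("0113200", "0133210"))  # 2
```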

Let S = {s1, s2, s3, . . . , st} be a set of t sequences, where each si is a sequence of length n defined over some alphabet Σ. Let si(x) denote the character at position x of the ith sequence. Let r = {r1, r2, r3, . . . , rt} be a vector containing t starting positions in S, and let r̄i = si(ri) . . . si(ri + l − 1) be the l-mer corresponding to the starting position ri in the ith sequence. From r, we can define a (t × l) alignment A(r) from the l-mers corresponding to the ri's in r. For instance, let S = {s1, s2, s3} be equal to

s1 : 1 2 3 1 2
s2 : 2 3 1 1 3
s3 : 1 2 1 3 2

From the S defined above, s1(2) = ′2′, that is, the second symbol in sequence 1. If we set r = {2, 3, 1} and l = 3, then the corresponding l-mers are ′231′, ′113′, and ′121′, respectively, forming an alignment A(r) equal to

2 3 1
1 1 3
1 2 1

Given a string v, we can compute the distance of v from a set of strings or an alignment A(r). We call it the Total Hamming Distance¹, which is equal to

$$\text{Total Hamming Distance}(v, r) = \sum_{i=1}^{t} d_H(v, \bar{r}_i),$$

where $d_H(v, \bar{r}_i)$ is the Hamming distance of v to the string $\bar{r}_i$. Moreover, we define the Total Distance function to be the minimum Total Hamming Distance over all r in S:

$$\text{Total Distance}(v, S) = \min_{r} \text{Total Hamming Distance}(v, r).$$
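The two score functions can be sketched in Python as follows. This is an illustration under our own naming (`total_hamming_distance`, `total_distance`); start positions are 1-indexed as in the text, and the minimum over all r is computed per sequence because the starting positions are chosen independently.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def total_hamming_distance(v, S, r):
    """Sum of d_H(v, l-mer of s_i starting at r_i); r is 1-indexed."""
    l = len(v)
    return sum(hamming(v, s[ri - 1:ri - 1 + l]) for s, ri in zip(S, r))


def total_distance(v, S):
    """Minimum Total Hamming Distance over all vectors r of start positions."""
    l = len(v)
    return sum(
        min(hamming(v, s[q:q + l]) for q in range(len(s) - l + 1))
        for s in S
    )


S = ["12312", "23113", "12132"]
print(total_hamming_distance("121", S, [2, 3, 1]))  # 4
print(total_distance("231", S))                     # 2
```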

The MSP uses the Total Distance function to get the median string of S.

Definition 3. Median String Problem
Given a set of sequences S, find a median string.
INPUT: S and l
OUTPUT: A string v∗ of length l that minimizes Total Distance(v, S) over all strings v of that length over Σ

The naive way of solving the MSP is given in the algorithm below.

Input: Sequences S, l
Output: Median String

procedure NAIVEMSP(DNA, t, n, l)
    bestWord = AAA..A
    bestDistance = infinity
    for each l-mer word from AAA..A to TTT..T
        if TOTALDISTANCE(word, DNA) < bestDistance
            bestDistance = TOTALDISTANCE(word, DNA)
            bestWord = word
    return bestWord
end procedure NAIVEMSP

Algorithm 1: Naive MSP

¹ or the Total Levenshtein Distance, if we use dL(v, w) instead of dH(v, w)


First, given an l-mer v, we minimize Total Hamming Distance(v, r) over all possible vectors of starting positions r in S. The next task is to find a string v that minimizes the Total Distance over all possible v in Σ^l. Note that this computation requires traversing all possible l-mers, of which there are |Σ|^l. The computation of the Total Distance does not need to check every vector of starting positions; a single pass over S in O(n · t) steps suffices [3]. Therefore, the running time of the naive MSP is O(|Σ|^l · n · t).
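The exhaustive search can be sketched in Python as below. This is an illustration, not the authors' implementation: the names `naive_msp` and `total_distance` are ours, the alphabet is passed explicitly instead of being fixed to DNA, and this direct version spends O(l) time on every window comparison.

```python
from itertools import product


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def total_distance(v, S):
    l = len(v)
    return sum(
        min(hamming(v, s[q:q + l]) for q in range(len(s) - l + 1))
        for s in S
    )


def naive_msp(S, l, alphabet):
    """Try every l-mer over the alphabet in lexicographic order and keep
    the first one with the smallest Total Distance."""
    best_word, best_distance = None, float("inf")
    for letters in product(sorted(alphabet), repeat=l):
        word = "".join(letters)
        d = total_distance(word, S)
        if d < best_distance:
            best_word, best_distance = word, d
    return best_word, best_distance


print(naive_msp(["12312", "23113", "12132"], 3, "123"))  # ('113', 2)
```

Several l-mers tie at Total Distance 2 on this input (for example ′113′, ′123′, and ′231′); the strict comparison keeps the lexicographically first one.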

2.2 Approximate Gene Cluster Discovery Problem (AGCDP)

A gene cluster is a set of genes that, for biological reasons, has been kept “more or less together”, or in other words in the same segment, in different genomes. Some of the biological reasons are functional pressure, being part of a biochemical network, evolutionary proximity, not enough time for sufficient speciation, or co-expression. The identification of gene clusters is useful in phylogenomics, and their detection is a challenge for both the biological and the mathematical communities. There are two methods for detecting gene clusters. The first is based on the detection of short conserved segments, which are easy to detect and are then processed using a heuristic to obtain the cluster. The second approach relies on formal models of a gene cluster, and on algorithms that search and compare genomes to detect all gene segments.

The formal models of gene clusters are discussed in [1]. It presents four combinatorial models, obtained by combining common intervals (exact gene clusters) versus max-gap (gene clusters that allow gaps) with permutations versus sequences (which allow gene duplications and deletions). Before we continue the discussion of formal models, let us first define an occurrence.

Definition 4. Occurrence
Given a set of genes X and a genome g represented by a string $a_1 a_2 \ldots a_k$, an occurrence of the set X is a substring $a_i \ldots a_j$ such that:

1. Both $a_i$ and $a_j$ belong to the set of genes X.
2. The set of genes X is contained in the multiset $\{a_i \ldots a_j\}$.
3. If a substring of $a_i \ldots a_j$ contains no gene in X, then its length must be less than or equal to δ, a fixed integer that represents the maximal allowed gap size.
4. The flanking substrings $a_{i-1-\delta} \ldots a_{i-1}$ and $a_{j+1} \ldots a_{j+1+\delta}$ contain no gene in X.

To illustrate the definition of an occurrence, let X = {n, g, e} and let g be the sentence ‘rearrangements of genomes involve genes and chromosomes’. The occurrences of the set X on g with δ = 0 are “nge”, “gene”, and “gen”. If we let δ = 1, the occurrences are “ngemen”, “gene”, and “gen”. Here are brief descriptions of the gene cluster models described in [1].

1. Common interval in permutations
The simplest model of a gene cluster assumes that genomes are permutations of each other. In this model, the gene clusters are the common intervals of the given set of genomes.

Definition 5. Common Interval
Let P be a set of permutations on the set of genes U. A subset of U is a common interval if it has an occurrence in each permutation of P without gaps.

Consider a set of two permutations P = {g1, g2} of the set U = {1, 2, . . . , 11}:

g1: 1 2 3 4 5 6 7 8 9 10 11
g2: 4 2 1 3 7 8 6 5 11 9 10

If the cardinality of the common interval X is equal to 3, then the common intervals of g1 and g2 are {2, 1, 3}, {7, 8, 6}, and {11, 9, 10}.

2. Max-gap in permutations
In this model, the genomes are also permutations of the set U. However, this model is more flexible, and it gives a formal definition of being “more or less together” by introducing a new variable δ, which specifies the allowable gaps in the common interval. This model was first introduced by [6], who coined the term gene teams for gene clusters with allowable gaps.

Definition 6. Gene Team
Let P∗ be a set of sequences defined over U ∪ {∗} such that each sequence is a permutation from P of U when the symbols {∗} are removed. Let δ ≥ 0 be a fixed integer. A subset of U is a gene team if it has an occurrence with maximum gap size δ in each permutation of P and has no extension.

Consider the following example.

g1 = 1 * 2 3 4 * * 5 6 7 * * 8
g2 = 8 * 4 * 2 1 5 3 6 * 7 * *

The gene teams of P∗ = {g1, g2} with δ = 1 are {1, 2, 3, 4}, {5, 6, 7}, and {8}.

3. Common interval in sequences
The two previous models assume that genes appear only once, which is unrealistic because we expect repetition and deletion of genes, especially if we compare a set of genomes from different species. Instead of finding an exact common interval on permutations, this model extends the representation of genomes to strings. We use the same notion of common interval as in Definition 5, but instead of looking at permutations, we now consider strings.

4. Max gap in sequences
This model is similar to model 2: it allows a gap size, denoted by δ. But instead of searching for gene teams on permutations, we now consider strings, which allow gene duplications and repetitions.
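For the simplest of the four models (common intervals in permutations), the example with g1 and g2 under model 1 can be checked with a short Python sketch. The function name `common_intervals` and the restriction to two permutations and a fixed size k are our simplifications; this is a brute-force check, not one of the algorithms from [1].

```python
def common_intervals(g1, g2, k):
    """Gene sets of size k that occupy a contiguous window in both permutations."""
    windows2 = {frozenset(g2[i:i + k]) for i in range(len(g2) - k + 1)}
    found = []
    for i in range(len(g1) - k + 1):
        genes = frozenset(g1[i:i + k])
        if genes in windows2:
            found.append(sorted(genes))
    return found


g1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
g2 = [4, 2, 1, 3, 7, 8, 6, 5, 11, 9, 10]
print(common_intervals(g1, g2, 3))  # [[1, 2, 3], [6, 7, 8], [9, 10, 11]]
```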

Definition 7. Approximate Gene Cluster Discovery Problem
Given a set of genomes, we find a set of genes, called a gene cluster, whose members are “more or less” together in the set of genomes.
INPUT: The gene pool U = {0, 1, . . . , N}; a set of genomes G = {g1, g2, g3, . . . , gt}, where gi is a sequence of length ni defined over the alphabet U; a size range [D−, D+] or a constant D; integer weights w−, w+ ≥ 0, the respective costs of each missed and each additional gene in an interval.
OUTPUT: X ⊂ U with 0 ∉ X and D− ≤ |X| ≤ D+, or |X| = D, and a linear interval Ji for each genome, in order to minimize

$$\mathrm{cost}^*(X, G) = \min_{(J_i)} \mathrm{cost}(X, (J_i)), \qquad \mathrm{cost}(X, (J_i)) = \sum_{i=1}^{m} \left[ w^- \times |X \setminus G^i_{J_i}| + w^+ \times |G^i_{J_i} \setminus X| \right],$$

where m is the number of genomes (m = t here) and $G^i_{J_i}$ is the gene content of $J_i$ in $g_i$.
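The cost function of Definition 7 can be evaluated by brute force over all intervals, as in the following Python sketch. The names `interval_cost` and `best_cost` are ours, and the sketch assumes that every symbol inside an interval counts toward its gene content; any special treatment of the gene 0 is not modelled here.

```python
def interval_cost(X, content, w_minus=1, w_plus=1):
    """w- times the genes of X missing from the interval, plus
    w+ times the genes of the interval that are not in X."""
    return w_minus * len(X - content) + w_plus * len(content - X)


def best_cost(X, genomes, w_minus=1, w_plus=1):
    """cost*(X, G): pick the cheapest non-empty interval in each genome and sum."""
    total = 0
    for g in genomes:
        total += min(
            interval_cost(X, set(g[i:j]), w_minus, w_plus)
            for i in range(len(g))
            for j in range(i + 1, len(g) + 1)
        )
    return total


X = {1, 2, 3}
genomes = [[1, 2, 3, 4], [4, 2, 1, 5, 3], [3, 1, 6]]
print(best_cost(X, genomes))  # 2
```

In this toy input the first genome contains X exactly (cost 0), while the second and third each need one missed or one additional gene (cost 1 each).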

Input: Genomes G, gene set cardinality range [D−, D+], weights w− and w+
Output: Gene set X

1. Tentatively set X to the gene set of each interval in each genome.
2. For each genome gi, except the one from which X is taken:
   – Compare X to the character set of each interval J in gi,
   – Compute cost(X, G) according to the number of missing and additional genes,
   – Pick the interval $J^*_{g_i}$ in gi with minimum cost $c^*_{g_i}$.
3. Return X such that the cost is minimum over all X in G.

Algorithm 2: ILP solution for Basic AGCDP

There are two possible output cases in AGCDP: either report the best set X, or report, for each genome gi, every X for which cost(X, G) remains below a given threshold. The ILP solver is discussed in detail in [5].

3 Proof

3.1 Proof Idea

To prove that AGCDP is NP-hard, we must show that all problems in NP are polynomial-time reducible to it, regardless of whether AGCDP is in NP. Showing a polynomial-time reduction of MSP (which is known to be NP-complete) to AGCDP is sufficient for this, because all problems in NP are polynomial-time reducible to every NP-complete problem, and polynomial-time reductions compose.

In the reduction of MSP to AGCDP, we transform the inputs of MSP into inputs for AGCDP in such a way that the median string of MSP is derivable from the gene cluster output by AGCDP. We also need to show that the transformation is done in polynomial time. A further subsection discusses the correspondence between the score functions of the two problems. The correspondence will show that, for all input instances, the transformation and the AGCDP solution always yield the median string.

3.2 Reduction of MSP to AGCDP

The reduction consists of a transformation of the input parameters, together with a proof that the transformation is done in polynomial time. The transformation of the input sequences S into the set of genomes G is discussed below.

MSP Input Transformation


The input S of the Median String Problem consists of t sequences, each of which is of the form $s^k = (s^k_p)_{p=1 \ldots n} = (s^k_1, \ldots, s^k_n)$, where $s^k$ is the kth sequence, $k = 1 \ldots t$, and $s^k_p \in \Sigma$.

Let $\lambda^k$ be another representation of sequence $s^k$ such that
$\lambda^k = (\lambda^k_q)_{q=1 \ldots n-l+1} = (\lambda^k_1, \lambda^k_2, \ldots, \lambda^k_{n-l+1})$, where $\lambda^k_q = (s^k_r)_{r=q \ldots q+l-1} = (s^k_q, \ldots, s^k_{q+l-1})$, $s^k_r \in \Sigma$, and l is the length of the pattern to find, as specified in the MSP input.

Let Φ be the set of all possible l-mers over Σ such that
$\varphi_i \in \Phi$, $\varphi_i = (\alpha_j)_{j=1 \ldots l} = (\alpha_1, \ldots, \alpha_l)$, $\alpha_j \in \Sigma$, and $\Phi = \{\varphi_i\}_{i=1 \ldots |\Sigma|^l} = \{\varphi_1, \varphi_2, \ldots, \varphi_{|\Sigma|^l}\}$.
Let $\gamma(\lambda^k_q)$ be the mapping of $\lambda^k_q$ to the $\varphi_i \in \Phi$ of which $\lambda^k_q$ is an exact pattern match, so that $\varphi_i = \gamma(\lambda^k_q)$.

Let $\delta(\varphi_i)$ be the mapping of $\varphi_i$ to the set of elements of Φ with Hamming distance less than or equal to 1 with respect to $\varphi_i$, such that $\sigma_i = \delta(\varphi_i)$, where $\sigma_i$ is a container symbol that represents the set. In this case, $\sigma_i \in \bar{\Sigma}$, where $\bar{\Sigma}$ is the set of symbols used to represent the MSP sequences in AGCDP.

Let $\bar{s}^k$ be the transformed form of sequence $s^k$ in the MSP DNA input. Transforming the sequence $s^k$ gives
$\bar{s}^k = (\sigma_i)_{\sigma_i = \delta(\gamma(\lambda^k_q)),\ \forall \lambda^k_q,\ q=1 \ldots n-l+1} = (\delta(\gamma(\lambda^k_q)))_{q=1 \ldots n-l+1}$,
which is the transformed sequence in AGCDP.

In order to use AGCDP to solve MSP, the MSP input will be transformed to the AGCDP input as described below.
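The mappings λ, γ, and δ can be made concrete with a small Python sketch. All names here (`transform`, `delta`, the list `phi`) are ours; the sketch enumerates Φ explicitly in lexicographic order for readability and identifies each σ_z by its index z.

```python
from itertools import product


def transform(S, l, alphabet):
    """Return (phi, transformed). phi[z - 1] is the l-mer called phi_z, and
    transformed[k] lists the indices z of the symbols sigma_z that form the
    transformed version of the k-th sequence (one symbol per window)."""
    symbols = sorted(alphabet)
    phi = ["".join(p) for p in product(symbols, repeat=l)]
    gamma = {word: z for z, word in enumerate(phi, start=1)}
    transformed = [
        [gamma[s[q:q + l]] for q in range(len(s) - l + 1)] for s in S
    ]
    return phi, transformed


def delta(z, phi):
    """Indices of all l-mers within Hamming distance 1 of phi_z (z included)."""
    center = phi[z - 1]
    return {
        y for y, word in enumerate(phi, start=1)
        if sum(a != b for a, b in zip(center, word)) <= 1
    }


phi, transformed = transform(["12312", "23113", "12132"], 3, "123")
print(transformed)             # [[6, 16, 20], [16, 19, 3], [4, 12, 8]]
print(sorted(delta(6, phi)))   # [3, 4, 5, 6, 9, 15, 24]
```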

Transformation of MSP into Approximate Gene Cluster Discovery Problem (AGCDP)
Recall that the Median String Problem has a t × n matrix DNA and l, the length of the pattern to find, as components of its input, and a string v of l nucleotides that minimizes TotalDistance(v, DNA) over all strings of that length as its output. The AGCDP input and output will be modified as follows:

MSP      AGCDP
$s^k$    $\bar{s}^k = (\delta(\gamma(\lambda^k_q)))_{q=1 \ldots n-l+1}$, where $\lambda^k_q = (s^k_q, \ldots, s^k_{q+l-1})$, $s^k_r \in \Sigma$
n        $\bar{n} = n - l + 1$
l        $\bar{l} = 1$
t        $\bar{t} = t + 1$
Σ        $\bar{\Sigma} = \{\sigma_i\}$, where $\sigma_i = \delta(\varphi_i)$, $1 \le i \le |\Sigma|^l$, $\varphi_i = (\alpha_1, \ldots, \alpha_l)$, $\alpha_j \in \Sigma$, $\varphi_i \in \Phi$
|Σ|      $|\bar{\Sigma}| = |\Sigma|^l$

Reduction of MSP to AGCDP

INPUT:
The gene pool $U = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|^l}\}$ is the set $\bar{\Sigma}$ derived from Φ, which is in turn derived from the alphabet Σ of MSP. There are t + 1 genomes: one for each sequence in the DNA matrix of MSP, plus one additional sequence. The additional sequence serves as the first genome, so that $g_1 = (\sigma_i)_{\sigma_i = \delta(\varphi_i),\ \forall \varphi_i \in \Phi}$ and $g_i = \bar{s}^k$, $i = 2 \ldots t+1$, $k = 1 \ldots t$, where $\bar{s}^k$ is the transformed form of $s^k$ based on the input l, the length of the pattern to find. (Note: the size of the genomes in AGCDP can vary.) The size range for the reference gene set X is [1, 1], so that |X| is always equal to 1 (D = 1), and the integer weights w− and w+, the respective costs of each missed and each additional gene in an interval, are both set to 1.

OUTPUT:
Find X ⊂ U with 0 ∉ X and |X| = 1, and a linear interval Ji for each genome, in order to minimize

$$c^* = \min_{(J_i)} c(X, (J_i)), \qquad c(X, (J_i)) = \sum_{i=1}^{m} \left[ w^- \times |X \setminus G^i_{J_i}| + w^+ \times |G^i_{J_i} \setminus X| \right],$$

where m is the number of genomes (m = t + 1 here) and $G^i_{J_i}$ is the gene content of $J_i$ in $g_i$. The reference gene set X corresponds to the string v of l nucleotides. In the ILP approach to AGCDP, the tentative X is derived from an interval in each of the given genomes, which contain symbols from $\bar{\Sigma}$; in this reduction, however, X is derived from intervals of size 1 only, since |X| is constrained to be equal to 1. X still needs to be transformed into its corresponding element of Φ and then into its $\lambda^k_q$ counterpart, which is itself the pattern v of l nucleotides. In MSP, the linear interval Ji is not part of the output; thus, it can be ignored.

Example 1. MSP DNA input (Σ = {1, 2, 3}):

s^1 = 1 2 3 1 2
s^2 = 2 3 1 1 3
s^3 = 1 2 1 3 2

λ^k, l = 3:

λ^1 = 123 231 312
λ^2 = 231 311 113
λ^3 = 121 213 132

Φ = set of possible l-mers = {φ1, φ2, ..., φ_{|Σ|^l}} (|Σ|^l = 3^3 = 27):

φ1 = 111   φ2 = 112   φ3 = 113   φ4 = 121   φ5 = 122   φ6 = 123   φ7 = 131   φ8 = 132   φ9 = 133
φ10 = 211  φ11 = 212  φ12 = 213  φ13 = 221  φ14 = 222  φ15 = 223  φ16 = 231  φ17 = 232  φ18 = 233
φ19 = 311  φ20 = 312  φ21 = 313  φ22 = 321  φ23 = 322  φ24 = 323  φ25 = 331  φ26 = 332  φ27 = 333

(γ(λ^k_q))_{q=1...n−l+1}:

(γ(λ^1_q))_{q=1...3} = φ6 φ16 φ20
(γ(λ^2_q))_{q=1...3} = φ16 φ19 φ3
(γ(λ^3_q))_{q=1...3} = φ4 φ12 φ8

(δ(γ(λ^k_q)))_{q=1...n−l+1}, one row per sequence:

k = 1: {φ6, φ3, φ9, φ4, φ5, φ15, φ24} {φ16, φ25, φ7, φ10, φ13, φ17, φ18} {φ20, φ11, φ2, φ23, φ26, φ19, φ21}
k = 2: {φ16, φ25, φ7, φ10, φ13, φ17, φ18} {φ19, φ10, φ1, φ22, φ25, φ20, φ21} {φ3, φ12, φ21, φ6, φ9, φ1, φ2}
k = 3: {φ4, φ13, φ22, φ1, φ7, φ5, φ6} {φ12, φ3, φ21, φ15, φ18, φ10, φ11} {φ8, φ17, φ26, φ2, φ5, φ7, φ9}

Transformed form of the DNA input under Σ̄ from (δ(γ(λ^k_q)))_{q=1...n−l+1}:

s̄^1 = σ6 σ16 σ20
s̄^2 = σ16 σ19 σ3
s̄^3 = σ4 σ12 σ8

AGCDP genome input:

g1 = σ1 σ2 σ3 σ4 σ5 σ6 σ7 σ8 σ9 σ10 σ11 σ12 σ13 σ14 σ15 σ16 σ17 σ18 σ19 σ20 σ21 σ22 σ23 σ24 σ25 σ26 σ27
g2 = σ6 σ16 σ20
g3 = σ16 σ19 σ3
g4 = σ4 σ12 σ8

Given the AGCDP genome input and the gene pool U, X is tentatively set to an interval of size 1 in each of the given genomes during the search process of the ILP approach, starting with the first genome g1. The value of X is essentially a character set, given the nature of σi as derived from the δ mapping described in the previous section. This character set is compared against the character set contained in each of the σi in the succeeding genomes, and the AGCDP cost function is used to compute the cost. Since the objective is to minimize the cost, the interval with minimum cost is chosen in every genome, and the cost for each tentative X after going through all of the genomes is the sum of the minimum costs. At the end of the AGCDP run, the best X is reported, that is, the reference gene set that obtained the lowest cost. It is expected from this reduction that the best X is the one that corresponds to v, which is the solution to MSP.

To illustrate the cost computation, the comparison of the first tentative X taken from g1 with the σi's in g2 is shown below.

X = σ1 = {φ1, φ10, φ19, φ4, φ7, φ2, φ3}
X vs. g2 at |J| = 1:

Interval                                               missing  additional  cost
J[1:1] = σ6 = {φ6, φ3, φ9, φ4, φ5, φ15, φ24}           5        5           10
J[2:2] = σ16 = {φ16, φ25, φ7, φ10, φ13, φ17, φ18}      5        5           10
J[3:3] = σ20 = {φ20, φ11, φ2, φ23, φ26, φ19, φ21}      5        5           10

When translated to the MSP computation of the Hamming distance, X corresponds to 111, based on δ(γ(111)) = σ1, and g2 corresponds to 123 231 312 (λ^1). The distances between 111 and 123, 111 and 231, and 111 and 312 are all equal to 2. The following examples further illustrate the correspondence between the MSP Hamming distance and the AGCDP cost. If two identical character sets are compared, the cost is 0. Since identical strings yield the same character set under the δ mapping, the corresponding Hamming distance is also 0.

AGCDP: σ3 = {φ3, φ12, φ21, φ6, φ9, φ1, φ2} and σ6 = {φ6, φ3, φ9, φ4, φ5, φ15, φ24}: missing = 4; additional = 4; cost = 8.
MSP: The Hamming distance between 113 and 123 is 1.

AGCDP: σ16 = {φ16, φ25, φ7, φ10, φ13, φ17, φ18} and σ6 = {φ6, φ3, φ9, φ4, φ5, φ15, φ24}: missing = 7; additional = 7; cost = 14.
MSP: The Hamming distance between 231 and 123 is 3.
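The comparison between the two scores can be reproduced with a short Python sketch. The names `neighborhood` and `set_cost` are ours; the sketch uses unit weights and treats each σ symbol as the set of l-mers within Hamming distance 1, as in the examples above.

```python
from itertools import product


def neighborhood(word, alphabet):
    """delta mapping: all strings over the alphabet within Hamming distance 1."""
    return {
        "".join(p) for p in product(alphabet, repeat=len(word))
        if sum(a != b for a, b in zip(word, p)) <= 1
    }


def set_cost(a, b, alphabet="123"):
    """Unit-weight AGCDP cost between the character sets of two l-mers."""
    A, B = neighborhood(a, alphabet), neighborhood(b, alphabet)
    return len(A - B) + len(B - A)


for a, b in [("123", "123"), ("111", "123"), ("231", "123")]:
    d = sum(x != y for x, y in zip(a, b))
    print(a, b, "Hamming:", d, "cost:", set_cost(a, b))
```

On these pairs the Hamming distances 0, 2, and 3 give costs 0, 10, and 14, so the cost grows with the Hamming distance.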

3.3 Polynomial time Transformation

Claim:
Given the function $\gamma(\lambda^k_q)$ that maps $\lambda^k_q$ to $\varphi_z$ such that $\gamma(\lambda^k_q) = \varphi_z$, $z = 1, \ldots, |\Sigma|^l$, the function γ can be implemented using an algorithm that runs in polynomial time.

Proof:
Let $\iota(s^k_q)$ be the index of the symbol $s^k_q$ in Σ, given $\Sigma = \{\beta_1, \beta_2, \ldots, \beta_{|\Sigma|}\}$, such that if the symbols are arranged in alphabetical order $\tilde{\Sigma} = (\beta_1, \beta_2, \ldots, \beta_{|\Sigma|})$, then the indices of the symbols in $\tilde{\Sigma}$ correspond to $\tilde{\Sigma}_{INDEX} = (0, 1, 2, \ldots, |\Sigma| - 1)$.

The mapping to the indices can be done in polynomial time [justification].

The function $\gamma(\lambda^k_q)$ is therefore $\gamma(\lambda^k_q) = \varphi_z$, where z is the decimal representation of the numeric value $(\iota(s^k_r))_{r=q,\ldots,q+l-1} + 1$, which is in base |Σ|. Thus,

$$z = \left( \left( (\iota(s^k_r))_{r=q,\ldots,q+l-1} \right)_{|\Sigma|} + 1_{|\Sigma|} \right)_{10}.$$

It is mentioned in [2] that subquadratic algorithms exist for the computation of z, in fast conversion routines that use the “divide and conquer” strategy. The function $\gamma(\lambda^k_q)$ can therefore be implemented in polynomial time.
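As a worked instance of this base conversion, the index z can be written in positional form. The closed form below is our restatement of the definition of z, assuming the leftmost symbol of the l-mer is the most significant digit (which is what the ordering of Φ in Example 1 suggests); it is applied to the l-mer 123 over Σ = {1, 2, 3}, whose symbol indices are 0, 1, 2.

```latex
z = 1 + \sum_{j=0}^{l-1} \iota\!\left(s^k_{q+j}\right) |\Sigma|^{\,l-1-j},
\qquad
\gamma(123):\quad z = 1 + \left(0 \cdot 3^2 + 1 \cdot 3^1 + 2 \cdot 3^0\right) = 6 .
```

This agrees with φ6 = 123 in Example 1.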

Claim:
Given the function $\delta(\gamma(\lambda^k_q)) = \delta(\varphi_z)$ that maps $\varphi_z$ to the set of φ's with Hamming distance less than or equal to 1 with respect to $\varphi_z$, represented by $\sigma_z \in \bar{\Sigma}$, the set can be generated in polynomial time.

Proof:
Since $\varphi_z$ is a sequence of symbols found in Σ, a $\varphi_y$ with Hamming distance equal to 1 with respect to $\varphi_z$ can be generated by altering one of the symbols of $\varphi_z$. Given the correspondence of $\tilde{\Sigma} = (\beta_1, \beta_2, \ldots, \beta_{|\Sigma|})$ to the indices in $\tilde{\Sigma}_{INDEX} = (0, 1, 2, \ldots, |\Sigma| - 1)$, each symbol in $\varphi_z$ can be altered to generate a $\varphi_y$ by: (i) getting the corresponding index i of the symbol, then (ii) adding a value n, $0 < n \le |\Sigma| - 1$, to the index such that the new index i′ is equal to $(i + n) \bmod |\Sigma|$, and finally (iii) getting the symbol in $\tilde{\Sigma}$ that corresponds to the new index. There are |Σ| − 1 possible ways to alter each symbol; thus, there are $l(|\Sigma| - 1)$ possible l-mers with Hamming distance 1, where l is the length of $\varphi_z$. Together with $\varphi_z$ itself, the set has $1 + l(|\Sigma| - 1)$ elements. The generation of the set can therefore be done in polynomial time.
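Steps (i) to (iii) of the proof can be sketched in Python as follows. The function name `neighbors_at_distance_one` is ours; the alphabet is sorted to obtain the alphabetical order assumed in the proof.

```python
def neighbors_at_distance_one(word, alphabet):
    """All strings at Hamming distance exactly 1 from word, built by shifting
    one symbol's index by n (0 < n <= |alphabet| - 1) modulo |alphabet|."""
    ordered = sorted(alphabet)
    index = {sym: i for i, sym in enumerate(ordered)}
    size = len(ordered)
    result = []
    for pos, sym in enumerate(word):
        for n in range(1, size):
            new_sym = ordered[(index[sym] + n) % size]  # steps (ii) and (iii)
            result.append(word[:pos] + new_sym + word[pos + 1:])
    return result


print(neighbors_at_distance_one("123", "123"))
# ['223', '323', '133', '113', '121', '122']
```

For l = 3 and |Σ| = 3 this yields 3 × 2 = 6 strings, which together with 123 itself form the seven elements of σ6 in Example 1.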

3.4 Correspondence of Total Distance and Cost* Function

Recall that in MSP, we minimize the Total Distance function over all possible strings v of length l to get the median string. Likewise, in AGCDP, we minimize the cost function over all possible X in G to get the gene cluster. Section 3.2 shows that we can use the solution from AGCDP to get the median string of a given set of sequences. To show that this will always be the case, we need to show the correspondence of the two scoring functions used, that is, that the identified gene cluster X∗ from AGCDP yields the median string v∗ in MSP.


Both MSP and AGCDP minimize a scoring function; however, one compares a string to a sequence and the other a set to a sequence. The order of the elements is important in computing the Hamming distance. For instance, comparing the two sequences 1 2 3 4 5 and 5 4 3 2 1 gives the scores dH(12345, 54321) = 4 (only the middle position matches) and cost²({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}) = 0. Nevertheless, the two functions are monotonically related: one function increases as the other increases, given the same input. Now, in what way can we transform one input to the other? Let us define SetPermutation_l(A) of a set A to be another set containing all possible l-permutations of A. The computation of dH will now be the sum of the distances between the string v and all elements of SetPermutation_l(A). This is already familiar, because the summation of all these distances is defined as the Total Hamming Distance function in Section 2.1.

² Instead of the interval Ji, the second parameter is the set G_{Ji} of symbols in the interval Ji.

Claim:
Let v∗ be the median string of S and X∗ be the gene cluster of G, where G is derived from the polynomial transformation of S. Then X∗ → v∗, that is, the string that minimizes Total Distance is derived from the gene set that minimizes the cost∗ function.

Proof:
To show that X∗ → v∗, we need to prove that if cost(X∗, G) is minimum, then Total Distance(v∗, S) is minimum, where v∗ is the median string of S, X∗ is the gene cluster of G, and f(S) = G.
First, let us assume the contrary: cost(X∗, G) is minimum, but there exists v̂ such that Total Distance(v̂, S) < Total Distance(v∗, S).

4 Conclusion

References

1. Anne Bergeron, Cedric Chauve, Yannick Gingras, “Formal Models of Gene Clusters”, Bioinformatics Algorithms: Techniques and Applications, (2008)

2. Richard P. Brent and Paul Zimmermann, “Modern Computer Arithmetic”, Cambridge University Press, Chapter 1, pp 37-39 (2010)

3. Neil C. Jones and Pavel A. Pevzner, “An Introduction to Bioinformatics Algorithms”, Massachusetts Institute of Technology, Chapter 4, pp 97-100, (2004)

4. Michael Sipser, “Introduction to the Theory of Computation Second Edition”, Massachusetts Institute of Technology, Chapter 5, (2007)

5. Sven Rahmann and Gunnar Klau, “Integer Linear Programming Techniques for Discovering Approximate Gene Clusters”, Bioinformatics Algorithms: Techniques and Applications, (2008)

6. Anne Bergeron, Cedric Chauve, Yannick Gingras, “Formal Models of Gene Clusters”, Bioinformatics Algorithms: Techniques and Applications, (2008)

7. Francois Nicolas, Eric Rivals, “Complexities of the Centre and Median String Problems”, In proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, June 25-27 2003

8.
