
Parallel and sequential approximation of shortest superstrings


Sequential and Parallel Approximation of Shortest Superstrings*

Artur Czumaj†
Heinz Nixdorf Institut, Universität-GH-Paderborn,
Warburger Str. 100, 33098 Paderborn, Germany,
email: [email protected]

Leszek Gąsieniec
Instytut Informatyki, Uniwersytet Warszawski,
Banacha 2, 02-097 Warszawa, Poland,
email: [email protected]

Marek Piotrów§
Heinz Nixdorf Institut, Universität-GH-Paderborn,
Warburger Str. 100, 33098 Paderborn, Germany,
email: [email protected]

Wojciech Rytter‡
Instytut Informatyki, Uniwersytet Warszawski,
Banacha 2, 02-097 Warszawa, Poland,
email: [email protected]

26, 1994

Abstract

Superstrings have many applications in data compression and genetics. However, the decision version of the shortest superstring problem is NP-complete. In this paper we examine the complexity of approximating shortest superstrings. There are two basic measures of the approximation: the approximation ratio and the compression ratio. The well-known and practical approximation algorithm is the sequential algorithm Greedy. It approximates the shortest superstring with a compression ratio of 1/2 and an approximation ratio of 4. Our main results are:

(1) An improved sequential algorithm: the approximation ratio is reduced to 2.83. Previously it had been reduced by Teng and Yao from 3 to 2.89.
(2) A proof that the algorithm Greedy is not parallelizable: the computation of its output is P-complete.
(3) An NC algorithm which achieves a compression ratio of 1/(4+ε).
(4) The design of an RNC algorithm with a constant approximation ratio and an NC algorithm with a logarithmic approximation ratio.

*A preliminary version of this paper appeared in Proceedings of the 4th Scandinavian Workshop on Algorithm Theory (SWAT'94), Aarhus, Denmark, July 1994, Springer-Verlag, LNCS 824, pp. 95-106.
†Supported by DFG-Graduiertenkolleg "Parallele Rechnernetzwerke in der Produktionstechnik", ME 872/4-1, and by the ESPRIT Basic Research Action No. 7141 (ALCOM II).
‡Supported in part by the EC Cooperative Action IC 1000 Algorithms for Future Technologies "ALTEC".
§Supported in part by Alexander von Humboldt-Stiftung, by Volkswagen-Stiftung, and by the ESPRIT Basic Research Action No. 7141 (ALCOM II).

1 Introduction

Let S = {s_1, ..., s_n} be a set of n strings over some alphabet Σ. A superstring of S is a string sp over Σ such that each string s_i ∈ S appears as a substring of sp. The shortest superstring problem is to find, for a given set S, the shortest superstring ss(S). We use opt(S) to denote the length of ss(S). Assume further, without loss of generality, that no string s_i ∈ S is a substring of any other s_j ∈ S.

It is known that the shortest superstring problem is NP-hard [5, 6]. Because of its important applications in data compression [17] and DNA sequencing [10, 12, 16], it is of interest to find approximation algorithms with good performance guarantees. For example, a DNA molecule can be represented as a character string over the set of nucleotides {A, C, G, T}. Although a DNA string can have up to 3·10^9 characters (for a human being), with current laboratory methods only small fragments of at most 500 characters can be determined at a time. From a huge number of these fragments, a biochemist then has to reconstruct the superstring representing the whole molecule. As many authors have pointed out [10, 12, 16], efficient superstring approximation algorithms can be used to cope with this job.

To evaluate how good the obtained approximation is, two kinds of measure are used. The first, most important in practice, is to find a superstring sp of S such that the ratio |sp|/opt(S) is minimized. We call this ratio the approximation factor of a superstring. The second approach is to find a superstring sp of S such that the ratio of the total compression obtained by sp to that obtained by ss(S) is maximized. That is, we want to maximize (|S| − |sp|)/(|S| − opt(S)), where |S| = Σ_{1≤i≤n} |s_i|. We call this ratio the compression factor of a superstring.

The algorithm Greedy is a simple sequential approximation of a shortest superstring and appears to do quite well. It can be presented in the following way.
Given a non-empty set of strings S = {s_1, ..., s_n}, repeat the following steps until S contains just one string, which is a superstring of S:

• Select a pair of strings s', s'' ∈ S that maximizes the overlap between s' and s''.
• Remove s' and s'' from S, replacing them with the merge of s' and s''.

Tarhio and Ukkonen [18] and, independently, Turner [20] showed that Greedy achieves a compression factor of at least 1/2. Other heuristics have also been considered by these authors, but 1/2 is still the best compression factor obtained. The approximation factor of Greedy was unknown for a long time. The first breakthrough was made by Blum et al. [2], who proved that Greedy achieves an approximation factor of 4. Furthermore, they gave a modified greedy algorithm that has an approximation factor of 3, and proved that the superstring problem is MAX-SNP-hard [15]. The recent result that MAX-SNP-hard problems do not have a polynomial-time approximation scheme unless P = NP [1] implies that a polynomial-time approximation scheme (that is, a family of polynomial-time algorithms with approximation factor 1 + ε for every fixed ε > 0) for this problem is unlikely. The main idea of the algorithm of Blum et al. was first to find an optimal cycle cover in a certain graph and then to combine the cycles to get a Hamiltonian path, which defines a superstring. The main difficulty in improving the approximation factor was caused by short cycles in the cycle cover. Analysing cycles of length two separately, Teng and Yao [19] were able to bound the approximation factor of the obtained superstring by 2.89. We describe an algorithm that analyses cycles of length two and three, combines their nodes in a new way to get a Hamiltonian path and, finally, outputs a superstring whose approximation factor is at most 2.83.

In this paper we also study the parallel complexity of approximating the shortest superstring (for a more detailed discussion of parallel algorithms and models of parallel computation see e.g.
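To make the merging loop above concrete, here is a minimal Python sketch (our own illustration, not code from the paper); `ov` returns the length of the longest proper overlap between two strings:

```python
def ov(s, t):
    """Length of the longest proper overlap v with s = uv and t = vw (u, w non-empty)."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

def greedy_superstring(strings):
    """Sketch of Greedy: repeatedly merge the pair of strings with maximum overlap."""
    S = list(strings)
    while len(S) > 1:
        # pick the pair (i, j), i != j, maximizing ov(S[i], S[j])
        k, i, j = max((ov(a, b), i, j)
                      for i, a in enumerate(S) for j, b in enumerate(S) if i != j)
        merged = S[i] + S[j][k:]
        S = [s for idx, s in enumerate(S) if idx not in (i, j)] + [merged]
    return S[0]
```

This naive version recomputes all overlaps in every round and assumes, as the paper does, that no input string is a substring of another; it is meant only to pin down the merging rule, not to be efficient.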
[8]). Given the importance of the applications, it seems reasonable to ask where parallelization can help. We prove that the algorithm Greedy, which is commonly used by biologists in their applications, is not parallelizable, that is, the computation of its output is P-complete. Next we use an approach similar to the one given by Li [12] and reduce a certain approximation of the shortest superstring to the weighted set cover problem. This immediately yields an NC algorithm with logarithmic approximation. We also show how one can easily implement our sequential 2.83-approximation algorithm to get an RNC algorithm with the same approximation factor. Finally, we give an NC algorithm that achieves a compression ratio of 1/(4+ε), for any constant ε > 0. The idea behind this algorithm is a parallel approximation of the algorithm CC-GREEDY, which is a sequential approximation algorithm for the maximum

weight cycle cover problem. We leave as an open problem whether an NC algorithm with a constant approximation factor can be achieved.

This paper is organized as follows. Section 2 contains some preliminary definitions used in the paper. In Section 3 we describe a sequential algorithm that achieves an approximation factor of 2.83. Then we focus on parallel complexity and show in Section 4 that the computation of the output of algorithm Greedy is P-complete. In Section 5 we present two NC algorithms: one that achieves a compression ratio of 1/(4+ε), the other with a logarithmic approximation factor.

2 Preliminaries

For two strings s and t, let v be the longest string such that s = uv and t = vw for some non-empty strings u and w. The overlap between the two strings s and t is the length of v; we denote it by ov(s, t). The prefix of a string s with respect to a string t is the length of u; we denote it by pref(s, t). It will cause no confusion if sometimes we also call the string v itself the overlap (ov(s, t)) and the string u the prefix (pref(s, t)) of s and t. Define s ∘ t to be the string uvw, that is, s ∘ t = pref(s, t) t.

For a given set of strings S = {s_1, ..., s_n} we consider two complete directed graphs PG(S) = (V, E, pref) and OG(S) = (V, E, ov). Both graphs have n vertices V = {1, ..., n} and n² edges E = {(i, j) : 1 ≤ i, j ≤ n}. We take as weight functions the prefix pref(·,·) in PG(S) and the overlap ov(·,·) in OG(S): edge (i, j) has weight pref(i, j) = pref(s_i, s_j) in PG(S), and ov(i, j) = ov(s_i, s_j) in OG(S). The set of vertices corresponds to the strings in S, and it should cause no confusion when we say that a string s_i is a vertex of PG(S) or OG(S).

Note that we may assume that the weights of these two graphs are given in unary. Indeed, all weights are non-negative integers, pref(i, j) + ov(i, j) = |s_i|, and hence Σ_{i,j} (pref(i, j) + ov(i, j)) = |S| · n.

Observe also that a Hamiltonian path in PG(S) or OG(S) defines a superstring of S.
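As a quick sanity check of these definitions (our own illustration, not from the paper): pref(s, t) and ov(s, t) partition s, and s ∘ t is the prefix of s followed by t:

```python
def ov(s, t):
    # longest proper overlap: longest v with s = u+v, t = v+w, u and w non-empty
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

def pref(s, t):
    # |u| where s = u+v and v is the overlap of s and t
    return len(s) - ov(s, t)

s, t = "abcab", "cabba"
assert ov(s, t) == 3                      # v = "cab"
assert pref(s, t) + ov(s, t) == len(s)    # pref(i, j) + ov(i, j) = |s_i|
assert s[:pref(s, t)] + t == "abcabba"    # s ∘ t = prefix of s followed by t
```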
If i_1, i_2, ..., i_{n−1}, i_n is a Hamiltonian path, then s_{i_1} ∘ s_{i_2} ∘ ... ∘ s_{i_{n−1}} ∘ s_{i_n} is the corresponding superstring.

Algorithm Greedy can be restated in terms of the overlap graph OG(S). Repeat until the selected edges form a Hamiltonian path in OG(S): scan the edges of OG(S) in non-increasing order of weight and select an edge (i, j) if no edge of the form (i, p) or (q, j) has been previously selected and if the collection of paths constructed so far does not include a path from j to i. The resulting Hamiltonian path defines the superstring obtained by algorithm Greedy.

Define a cycle cover of a graph G to be a maximal collection of cycles in G such that each vertex is in at most one cycle. Let also a path-cycle cover (also called a 2-factor) be a collection of paths and cycles in G such that each vertex is in at most one path or cycle. We call a path-cycle cover maximal if it cannot be extended by any other edge of G.

A weighted digraph G is an overlap graph if there exists a set S of strings such that the graph obtained from OG(S) by removing all zero-weight edges is isomorphic to G.

3 A Sequential Algorithm with Approximation Factor 2.83

In this section we present a new sequential algorithm for the superstring problem that has an approximation factor of 2 5/6 ≈ 2.83 and thus supersedes the algorithm of Teng and Yao [19]. The latter algorithm has a factor of 2 8/9 ≈ 2.89 and is an improvement on the algorithm TGREEDY of Blum et al. [2], which has a factor of 3.

In our presentation we consider the graphs PG(S) and OG(S), assuming that both contain no self-loops. For a cycle c in a cycle cover C in the graph PG(S), let d(c) denote the total pref(·,·) weight of the edges in c. We refer to d(c) as the weight of c.

One of the basic observations concerning the shortest superstring problem is that the length of a shortest superstring is not smaller than the weight of a minimum-weight Hamiltonian cycle in PG(S) [2]. This motivates working on the graph PG(S) and trying to approximate such a cycle.
Since any Hamiltonian cycle is also a cycle cover, the length of a shortest superstring is bounded from below by the weight of a minimum-weight cycle cover in PG(S). Such a cycle cover in a weighted digraph can be found by a polynomial-time algorithm [14]. Therefore, almost all efficient approximation algorithms for the shortest superstring problem first find a minimum-weight cycle cover in PG(S) and then try to combine the cycles to get a

Hamiltonian path of small weight. The main problem, however, is that it is difficult to approximate the length of a shortest Hamiltonian path, while it seems much easier to approximate a path or cycle of maximum length. The observation used by many authors (see e.g. [2]) is that the weight of a cycle in PG(S) is related to the weight of the corresponding cycle in OG(S). Indeed, one can easily show that a minimum-weight cycle cover in PG(S) is also a maximum-weight cycle cover in OG(S), and vice versa. We will call such a cycle cover an optimal cycle cover. The main difficulty in combining the cycles into a Hamiltonian path of small weight is caused by short cycles. One can observe that if we find a minimum-weight cycle cover without cycles of length 1, 2, ..., k, then the algorithm of Blum et al. [2] produces a superstring with approximation factor bounded by 2 + 2/(k+1). Possible room for improvement also lies in better combining the elements of short cycles into a superstring. Based on this idea, Teng and Yao [19] treated cycles of length two separately and designed an algorithm with an approximation factor of 2.89. In this paper we follow this line and analyse how to combine cycles of length two and three effectively. Based on the 2-cycle and 3-cycle Lemmas, we choose a representative for each such cycle and then build an optimal cycle cover on the representatives. Extending this cycle cover efficiently, we obtain a superstring whose approximation factor is at most 2.83.

For the sake of completeness we recall two crucial lemmas from [2] and [19]. Recall that a minimum-weight cycle cover in PG(S) is called canonical if each string s is assigned to a cycle whose weight is the smallest among all cycles that s fits [19]. The first lemma shows a relation between overlaps and prefixes [2].
Based on this bound on the overlap between any two strings from two different cycles, one can combine these cycles into a Hamiltonian path in PG(S) of relatively small weight.

Lemma 3.1 Let c1 and c2 be two cycles in an optimal cycle cover C with s1 ∈ c1 and s2 ∈ c2. Then the overlap between s1 and s2 is less than d(c1) + d(c2).

The second lemma, due to Teng and Yao [19], bounds the sum of overlaps in 2-cycles.

Lemma 3.2 (2-cycle Lemma) Let c1 and c2 be two cycles in a canonical minimum-weight cycle cover C in PG(S) with r1 ∈ c1 and r2 ∈ c2. Then ov(r1, r2) + ov(r2, r1) < max(|r1|, |r2|) + min(d(c1), d(c2)).

The algorithm TGREEDY, given by Blum et al. [2], finds an optimal cycle cover, then opens each cycle by deleting the edge with the shortest overlap, and joins the obtained paths by Greedy. One can obtain the same approximation factor of 3 by the following procedure:

1. Find an optimal cycle cover C of S.
2. Select a set R of cycle representatives (one node s_i from each cycle c_i) and find an optimal cycle cover CC of R.
3. Open each cycle in CC by deleting its shortest-overlap edge, and concatenate the obtained strings to form σ.
4. Split each cycle c_i at the selected node s_i (the result begins and ends with s_i) and replace s_i in σ by the result.

These steps form the basis of the algorithm presented by Teng and Yao [19]. Their improvement in the approximation factor is obtained by making the cycle cover in Step 1 canonical and by treating 2-cycles separately in Step 3. Our further improvement is achieved by selecting "good" representatives not only of 2-cycles but also of 3-cycles in Step 3, and by finding an optimal cycle cover on them. This allows us to create a better superstring for these representatives.

Below is an analogue of Lemma 3.2 for 3-cycles.

Lemma 3.3 (3-cycle Lemma) Let c1, c2 and c3 be cycles in a minimum-weight cycle cover C with r1 ∈ c1, r2 ∈ c2 and r3 ∈ c3.
Then ov(r1, r2) + ov(r2, r3) + ov(r3, r1) ≤ 2 · min(|r1|, |r2|, |r3|) + d(c1) + d(c2) + d(c3).

Proof: The inequality follows from the following observations. Assume without loss of generality that |r1| = min(|r1|, |r2|, |r3|). Then ov(r1, r2) ≤ |r1| and ov(r3, r1) ≤ |r1|. By Lemma 3.1, ov(r2, r3) ≤ d(c2) + d(c3). □

Figure 1: Splitting CCC cycles together with 2- and 3-cycles from CC into two superstrings. Edges labeled 1 (respectively 2) are in the superstring q1 (respectively q2).

The Algorithm

1. Find an optimal cycle cover C of S, and make C canonical.
2. Take an arbitrary string from each cycle of C to form a set of representatives R, and find an optimal cycle cover CC for R.
3. Select a representative set RR containing one element for each 2-cycle in CC and one element for each 3-cycle in CC. From each 2-cycle take the longer string, and from each 3-cycle take a string that is not in the pair with the longest overlap (i.e. when s1, s2 and s3 are the strings of a 3-cycle, ov(s1, s2) ≥ ov(s2, s3) and ov(s1, s2) ≥ ov(s3, s1), then select s3). Let RR = {g1, g2, ..., gm}. If g_i is from a 2-cycle, then let f_i be the second element of the cycle. If g_i is from a 3-cycle, then let f_i be the superstring of the two other strings in the cycle, ordered as on the cycle. In this way we join two nodes of a 3-cycle and transform it into a 2-cycle (g_i, f_i, g_i). Let RR' = {f1, f2, ..., fm}.
4. Find an optimal cycle cover CCC on RR. From each cycle in CCC with an odd number of nodes remove the edge with the smallest overlap. Add the 2-cycles from CC (including the 3-cycles after their transformation to 2-cycles) and call the resulting graph GC. Note that RR ∪ RR' forms the vertex set of GC.
5. Split the edges of GC into two disjoint sets of directed paths (see Fig. 1) such that the paths in each set are vertex-disjoint and cover all nodes in RR ∪ RR'. Create two superstrings q1 and q2 for RR ∪ RR', each corresponding to one set of paths. Let q be the shorter of the two.
6. Open each i-cycle, i ≥ 4, in CC by deleting the edge with the smallest overlap and concatenate the resulting strings together with q to obtain σ. Note that σ is a superstring for R. Split each cycle of C at its node from R to obtain a superstring that begins and ends with the string from R.
Let σ̄ be the extension of σ obtained by replacing each string of R with the superstring representing its cycle in C. Then σ̄ is the resulting superstring.

Analysis

Observe first that the algorithm runs in polynomial time, because an optimal cycle cover can be constructed in time O(n³) (see e.g. [14]), and a given minimum-weight cycle cover can be transformed into a canonical minimum one in O(n|S|) time [19].

Let d2, d3, d4 be, respectively, the total weight of the cycles in C that have representatives in 2-cycles, 3-cycles, and all i-cycles for i ≥ 4, in the cycle cover CC. Thus d2 + d3 + d4 = d(C). Let ov2, ov3 and ov4 be, respectively, the total overlap present in the 2-cycles, 3-cycles, and all i-cycles for i ≥ 4, in the cycle cover CC. Then we have

ov(q1) + ov(q2) ≥ ov2 + ov3 + ov3/3 + (2/3)·ov(CCC),

since in each 3-cycle of CC we take the edge with the longest overlap twice, and in CCC-cycles with an odd number of edges only the edge with the smallest overlap is deleted. Since q is the shorter of q1 and q2 (so ov(q) is at least half of ov(q1) + ov(q2)) and ov(CCC) ≥ ovopt(RR) = |RR| − opt(RR), we get

ov(q) ≥ (1/2)·ov2 + (2/3)·ov3 + (|RR| − opt(RR))/3.

Let RR_i, i = 2, 3, denote the set {r ∈ RR : r is in an i-cycle in CC}. Then

ov(σ) ≥ ov(q) + (3/4)·ov4 ≥ (1/2)·ov2 + (2/3)·ov3 + (3/4)·ov4 + (|RR2| + |RR3| − opt(RR))/3.

Lemma 3.4
|RR2| ≥ ov2 − d2/2,
|RR3| ≥ ov3/2 − d3/2.

Proof: The first inequality results from summing the inequality of Lemma 3.2 over all 2-cycles. Recall that RR2 contains the longer representative of each 2-cycle in CC. The second is obtained by using Lemma 3.3 and summing over all 3-cycles. □

Combining all the observations above, we get

|σ| = |R| − ov(σ) = opt(R) + ovopt(R) − ov(σ) ≤ opt(R) + ov(CC) − ov(σ)
  ≤ opt(R) + opt(RR)/3 + (1/2)·ov2 + (1/3)·ov3 + (1/4)·ov4 − (|RR2| + |RR3|)/3
  ≤ opt(R) + opt(RR)/3 + (1/6)·ov2 + (1/6)·ov3 + (1/4)·ov4 + (1/6)·(d2 + d3)
  ≤ opt(S) + opt(S)/3 + (1/3)·d2 + (1/3)·d3 + (1/2)·d4 + (1/6)·(d2 + d3)
  = (4/3)·opt(S) + (1/2)·(d2 + d3 + d4) ≤ opt(S)·(4/3 + 1/2) = (11/6)·opt(S).

Since |σ̄| = |σ| + d(C), we obtain |σ̄| ≤ (17/6)·opt(S) = 2 5/6 · opt(S).

4 Algorithm Greedy Is Not Parallelizable

Algorithm Greedy appears to be very sequential in nature, since to select the current pair of strings with the largest overlap we need to know the results of the previous merges. To formalize this observation we prove that Greedy applied to the superstring problem is P-complete¹, which is commonly taken to mean: a hardly parallelizable algorithm.

We start by proving that the problem of finding the Hamiltonian path chosen by algorithm Greedy is P-complete. For a given Boolean circuit, a certain complete weighted digraph is introduced in which the Hamiltonian path selected by Greedy simulates the computation of the circuit's value. Then we argue that the digraph is an overlap graph, i.e., a set of strings can be constructed whose overlap graph is isomorphic to the digraph.

Lemma 4.1 The problem of finding the Hamiltonian path chosen by algorithm Greedy is P-complete.

Proof: We reduce the circuit value problem to the problem of finding the Hamiltonian path chosen by algorithm Greedy.
The circuit value problem is the following: define a circuit α to be a vector (x1, x2, ..., xm), where each x_i is either an input value true or false, or a Boolean gate over some basis. The circuit value problem is to determine the output of α.

This problem is P-complete even for some restricted circuits. We consider circuits α in which the gates are numbered topologically, so that if x_i receives an input from x_j then i < j. We also restrict attention to circuits with not and or gates, where not gates have one input and one output and or gates have two inputs and at most two outputs. For such restricted circuits the circuit value problem remains P-complete [7].

We show how to simulate the computation of a circuit α by algorithm Greedy. The circuit α is transformed, in polylogarithmic time with a polynomial number of processors, into a weighted digraph. A certain edge is chosen by Greedy if and only if the output of the circuit is true. The paths chosen by algorithm Greedy simulate the circuit. For each gate there is a collection of nodes: nodes representing the inputs of the gate, the outputs of the gate, and some additional nodes which direct algorithm Greedy.

¹To be more precise: the search problem of finding the superstring obtained by Greedy is considered and proved to be P-complete.

Figure 2: k-th input gate with value true or false
Figure 3: k-th or gate with two inputs and two outputs

Subgraphs corresponding to the gates are illustrated in Figures 2, 3 and 4. The gates (their corresponding subgraphs) are drawn so that their inputs are at the top and their outputs at the bottom. Each input and output is represented by a subgraph called a switch, which contains three nodes (two squares and one circle) and edges that are used to indicate the value of a wire.

In Figures 2, 3 and 4 only edges with non-zero weight are drawn. Edges with weight zero can be chosen by algorithm Greedy only at the end, and are therefore of no importance; they only ensure that Greedy finds a Hamiltonian path. Therefore, in what follows we say that an edge is (already) selected by Greedy if it has positive weight and has (already) been chosen.

To obtain the whole graph corresponding to the circuit α, the subgraphs for all the gates are connected so that if a wire is shared by two gates, the corresponding nodes of the switch are identified.

The crucial observation is that algorithm Greedy selects an edge between a square node and the circle node of a switch if and only if the value of the corresponding wire is true. The topological numbering of the gates ensures that algorithm Greedy can choose the edges of the k-th gate only when no more positive-weight edges in the gates with numbers greater than k can be selected. This also means that before Greedy begins examining the edges of the k-th gate, all positive-weight edges in its input switches have already been considered.

Therefore, what we have to show is the following: if we know the edges selected by Greedy in the input switches of a gate (that is, if we know whether the edges adjacent to the circle nodes have been selected), then the edges chosen in its output switches correspond to the Boolean function of the gate. This is obviously true for the input gates, and one can easily verify that the not and or gates are also designed properly. □
Remark 1 For the further considerations it is useful to insert copy gates between consecutive or gates (precisely: a copy gate should connect an output switch of an or gate x_i with an input switch of another or gate x_j; its edge weights should be set as if it had the number of x_i). One can observe that this will

Figure 4: k-th not gate and k-th copy gate, each with one input and one output

not change the simulation of a circuit. We call the resulting digraph the circuit-simulating graph.

For a digraph G, define its skeleton to be the undirected graph with the same vertex set as G and with the edge set obtained from the edge set of G by removing directions.

In the following lemmas we derive some sufficient conditions for a digraph to be an overlap graph. The first observation is that a positive-weight edge in an overlap graph relates the beginning of one string to the end of another. Therefore, when we want to collect the related strings in a structure, we have to consider adjacent edges in alternating directions. This leads to the following definitions.

An alternating path is a sequence of nodes and edges v1 e1 v2 e2 ... v_{k−1} e_{k−1} v_k such that either e1 = (v1, v2), e2 = (v3, v2), e3 = (v3, v4), e4 = (v5, v4), ..., or e1 = (v2, v1), e2 = (v2, v3), e3 = (v4, v3), e4 = (v4, v5), .... An alternating tree is a digraph AT = (V_T, E_T) whose skeleton is connected, containing t nodes and t − 1 edges, such that every path v1 e1 v2 e2 ... v_{l−1} e_{l−1} v_l in AT (with v_i ∈ V_T and e_i ∈ E_T) is an alternating path. Assume that all edges of AT have weights defined by a function w. A weighted alternating tree AT is monotone if there exists a vertex r, called the root of AT, such that for each alternating path r e1 v1 e2 v2 ... e_k v_k in AT, w(e1) < w(e2) < ... < w(e_k). An alternating cycle in a digraph G is a cycle in G that can be transformed into an alternating path by splitting it at a node.

Lemma 4.2 Every monotone alternating tree with positive integer weights is an overlap graph.

Proof: Let AT = (V, E, w) denote a monotone alternating tree with weights given by the function w. We want to assign strings to the nodes of AT in such a way that the overlap between two strings equals the weight of the edge between the respective nodes.
To this end we introduce the following definitions. For each v ∈ V define w(v) = max{w(e) : e is adjacent to v}. Let tofather(v) denote the edge adjacent to v on the path from v to the root r of AT. Let w0(v) = w(v) − w(tofather(v)) for v ≠ r, and w0(r) = w(r). Let Σ_AT and Σ'_AT be disjoint alphabets with |Σ_AT| = |Σ'_AT| = |V|, and let a : V → Σ_AT and $ : V → Σ'_AT be one-to-one mappings.

For each v ∈ V we will define a string str(v) ∈ Σ*_AT Σ'_AT ∪ Σ'_AT Σ*_AT such that OG(S), where

Figure 5: A string assignment to the nodes of a monotone alternating tree

S = {str(v) : v ∈ V}, is an overlap graph isomorphic to AT. Let r denote the root of AT. Then

str(r) = $_r a_r^{w0(r)}  if outdeg(r) ≠ 0,  and  str(r) = a_r^{w0(r)} $_r  if indeg(r) ≠ 0.

For each v ∈ V, v ≠ r, let v1 e1 v2 e2 ... e_{k−1} v_k, with v1 = r and v_k = v, be the alternating path from the root r to v. Let w1 = w(e1), w2 = w(e2) − w(e1), w3 = w(e3) − w(e2), ..., w_{k−1} = w(e_{k−1}) − w(e_{k−2}), and a1 = a(v1), a2 = a(v2), ..., a_k = a(v_k). Without loss of generality assume that e1 = (v1, v2). Then

str(v) = $_v a_k^{w0(v)} ... a_5^{w5} a_3^{w3} a_1^{w1} a_2^{w2} a_4^{w4} ... a_{k−1}^{w_{k−1}}  if k is odd,
str(v) = a_{k−1}^{w_{k−1}} ... a_5^{w5} a_3^{w3} a_1^{w1} a_2^{w2} a_4^{w4} ... a_k^{w0(v)} $_v  if k is even.

A tree with the corresponding string assignment is shown in Figure 5. It suffices to argue that the assignment str(·) has the desired property. Notice that

str(v_{k−1}) = a_{k−2}^{w_{k−2}} ... a_5^{w5} a_3^{w3} a_1^{w1} a_2^{w2} a_4^{w4} ... a_{k−1}^{w0(v_{k−1})} $_{v_{k−1}}  if k is odd,
str(v_{k−1}) = $_{v_{k−1}} a_{k−1}^{w0(v_{k−1})} ... a_5^{w5} a_3^{w3} a_1^{w1} a_2^{w2} a_4^{w4} ... a_{k−2}^{w_{k−2}}  if k is even.

Therefore, when k is odd,

ov(str(v), str(v_{k−1})) = w_{k−2} + w_{k−4} + ... + w3 + w1 + w2 + w4 + ... + w_{k−1}
  = w(e1) + Σ_{i=2}^{k−1} (w(e_i) − w(e_{i−1})) = w(e_{k−1}),

and similarly when k is even. It follows that the overlap graph OG(S) covers every edge e = (v, v') ∈ E by an edge with ov(str(v), str(v')) = w(e). It remains to prove that if (v, v') ∉ E then ov(str(v), str(v')) = 0.

Let height(v) denote the number of edges on the path from r to v. If height(v) = height(v'), then str(v) and str(v') both begin or both end with different letters from Σ'_AT, so their overlap is 0. If height(v) > height(v'), then str(v) ends with a letter that does not appear in str(v'), and if height(v) < height(v'), then str(v') begins with a letter that does not appear in str(v). □
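As a small, concrete instance of this construction (our own hypothetical example, worked out by hand from the proof's recipe): take the monotone alternating tree with edges e1 = (r, v) of weight 2 and e2 = (u, v) of weight 5, write 'a' for a(r), 'b' for a(v), and 'R', 'V', 'U' for the $-letters of r, v, u. The recipe yields str(r) = "Raa", str(v) = "aabbbV", str(u) = "Uaabbb", and the pairwise overlaps match the edge weights exactly:

```python
def ov(s, t):
    # longest v such that s ends with v and t begins with v (proper overlap)
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

# strings built by hand following the proof's recipe for the two-edge tree
str_r, str_v, str_u = "Raa", "aabbbV", "Uaabbb"

assert ov(str_r, str_v) == 2   # edge (r, v) has weight 2
assert ov(str_u, str_v) == 5   # edge (u, v) has weight 5
# non-edges get overlap 0
assert ov(str_r, str_u) == 0 and ov(str_v, str_u) == 0 and ov(str_v, str_r) == 0
```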
Lemma 4.3 If a weighted digraph G with positive integer weights can be edge-covered by a disjoint union of monotone alternating trees in such a way that for each vertex of G all its incoming edges are in one tree and all its outgoing edges are in another tree, then G is an overlap graph.

Proof: By Lemma 4.2, each alternating tree AT in the cover C of G is an overlap graph. That is, strings can be assigned to the nodes of AT so as to obtain the corresponding overlap graph. Let Σ_AT denote the alphabet of these strings. Without loss of generality we can assume that the alphabets Σ_AT, AT ∈ C, are pairwise disjoint. Let the incoming edges of a vertex v be in AT and its outgoing edges in AT'. Thus we have two strings in_v ∈ Σ*_AT and out_v ∈ Σ*_{AT'}. The resulting string for v is obtained by concatenating in_v with out_v. It is easily checked that the overlap between two such strings is non-zero if and only if the corresponding nodes are joined by an edge. □

Lemma 4.4 If a digraph G contains no alternating cycles, then it can be (uniquely) edge-covered by edge-disjoint alternating trees such that each vertex of G has all its incoming edges in one tree and all its outgoing edges in another tree.

Proof: An easy induction on the number of nodes of the digraph G. □

Theorem 4.5 The problem of finding the superstring chosen by algorithm Greedy is P-complete.

Proof: In view of Lemma 4.1 we only have to show that a circuit-simulating graph G is an overlap graph. The gates of G have been designed in such a way that G contains no alternating cycles. By Lemma 4.4, G can be uniquely edge-covered by disjoint alternating trees. Moreover, the trees are of size bounded by a constant (independent of the size of the circuit) and they are monotone. Hence, by Lemma 4.3, G is an overlap graph. The string assigned to a node of G can be computed in constant time because of the bounded tree sizes. □

Remark 2 One can slightly modify the construction above to show that the problem of finding the superstring chosen by algorithm Greedy remains P-complete even when the input alphabet is binary.

5 NC Approximations of Shortest Superstrings

All known superstring approximation algorithms with a constant approximation factor are based on the greedy heuristic or on the construction of a minimum-weight cycle cover.
The former is not parallelizable, as proved in Section 4; the latter can be parallelized in RNC. Indeed, considering the algorithm presented in Section 3, one can easily see that the only non-trivial part is how to find an optimal cycle cover of S, of R, or of RR. All other steps reduce to simple operations on lists or cycles. It is well known that the problem of finding a minimum-weight cycle cover is equivalent to the problem of finding a minimum-weight matching in a bipartite graph [14]. In general it is not known whether this can be done in NC or in RNC. However, when the weights of the graph are given in unary, a minimum-weight matching can be found in RNC [21]. Since in our case the weights of OG(S) are given in unary, the construction can be parallelized to get an RNC algorithm. We also mention that the process of finding a canonical optimal cycle cover [19], required in Step 1, can easily be done in parallel.

Theorem 5.1 There exists an RNC algorithm that finds a superstring of length at most 2.83 · opt(S).

In the following subsections we present two NC algorithms for the shortest superstring problem: one that achieves a constant compression factor, the other with a logarithmic approximation factor. The former is based on a certain approximation of a greedy heuristic for the maximum-weight cycle cover; the latter uses a reduction of the superstring approximation problem to the weighted set cover problem and then an approximation algorithm for that problem.

In Sections 5.1 and 5.2 we show that there is an NC algorithm that finds a 1/(2+ε)-approximation of the maximum-weight cycle cover (and also of the maximum-weight matching problem in bipartite graphs). Then, in Section 5.3, we apply this algorithm to get an NC approximation algorithm with compression factor 1/(4+ε), for any constant ε > 0.

5.1 Sequential Approximation of a Maximum Weight Cycle Cover

We begin with a sequential 1/2-approximation. Let G = (V, E, w) be a complete weighted digraph without self-loops, where V = {1, …, n} is the set of vertices, E = {(i, j) : i ≠ j ∈ V} is the set of edges, and w : E → ℝ⁺ is the (non-negative) weight function.

The following simple greedy algorithm finds a cycle cover.

Algorithm CC-GREEDY:
Repeat until the selected edges form a cycle cover of G:
  Scan the edges of G in non-increasing order of weight and select an edge (i, j) if no edge of the form (i, p) or (q, j) has been previously selected.

Lemma 5.2 Algorithm CC-GREEDY finds a cycle cover whose weight is at least half the weight of a maximum weight cycle cover.

Proof: The proof follows the ideas of Turner's estimation of the superstring compression factor achieved by Greedy [20]. Algorithm CC-GREEDY selects n edges in non-increasing order of weight; let e_i be the i-th chosen edge. Let P_i be a maximum weight cycle cover that includes the edges {e_1, …, e_i} and let C_i = P_i − {e_1, …, e_i}. We show that for 1 ≤ i ≤ n, w(C_{i−1}) ≤ w(C_i) + 2w(e_i). Since w(C_0) is the weight of a maximum weight cycle cover and w(C_n) = 0, this implies the lemma.

Let e_i = (p, q). Since e_i is the i-th edge chosen by CC-GREEDY, w(e_i) ≥ max{w(e) : e ∈ C_{i−1}}. There are at most two edges e′ and e″ in C_{i−1} − C_i which share the head or the tail with e_i. Let e′ = (p, s), e″ = (t, q), and when s ≠ t, let e* = (t, s). Then we can obtain a cycle cover (P_{i−1} − {e′, e″}) ∪ {e_i, e*} when s ≠ t, or a cycle cover (P_{i−1} − {e′, e″}) ∪ {e_i} otherwise. In both cases we have:

  w(C_i) + w(e_i) ≥ w((C_{i−1} − {e′, e″}) ∪ {e_i})
                 = w(C_{i−1}) − w(e′) − w(e″) + w(e_i)
                 ≥ w(C_{i−1}) − w(e_i) − w(e_i) + w(e_i)
                 = w(C_{i−1}) − w(e_i),

which implies w(C_{i−1}) ≤ w(C_i) + 2w(e_i). □

Remark 3 The lemma above still holds for graphs with self-loops.

Remark 4 The proof of Lemma 5.2 can easily be modified to obtain Theorem 10 of Blum et al.
[2], which states that in an overlap graph with self-loops (this is an important assumption) algorithm CC-GREEDY always finds a maximum weight cycle cover.

In the proof of Lemma 5.2 we now always consider the cycle cover (C_{i−1} − {e′, e″}) ∪ {e_i, e*}, independently of whether s ≠ t or not. It can be shown (see e.g., [2]) that in an overlap graph, if w(e_i) ≥ max{w(e′), w(e″)}, then w(e_i) + w(e*) ≥ w(e′) + w(e″). Thus we obtain

  w(C_i) + w(e_i) ≥ w((C_{i−1} − {e′, e″}) ∪ {e_i, e*})
                 = w(C_{i−1}) + (w(e_i) + w(e*) − w(e′) − w(e″))
                 ≥ w(C_{i−1}),

which implies w(C_{i−1}) ≤ w(C_i) + w(e_i), and finally w(C_0) ≤ Σ_{i=1}^n w(e_i). Since algorithm CC-GREEDY chooses a cycle cover of weight Σ_{i=1}^n w(e_i), the cycle cover it constructs must be a maximum weight cycle cover.

Remark 5 The greedy matching algorithm in a weighted (undirected) graph G runs as follows:

  Scan the (positive-weight) edges of G in non-increasing order of weight and add an edge {i, j} to the matching if no edge of the form {i, p} or {q, j} has been previously selected.

Using well-known transformations (see e.g., [14]), Lemma 5.2 can be restated as the following corollary.

Corollary 5.3 In a weighted bipartite graph the greedy matching algorithm finds a matching whose weight is at least half the weight of a maximum weight matching.
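A minimal Python sketch of CC-GREEDY follows; a single pass over the edges sorted by weight replaces the repeated scan above. The function name and the dictionary representation of weights are illustrative; following Remark 3, we allow self-loops of weight 0 so that the single pass always closes into a complete cycle cover.

```python
def cc_greedy(n, w):
    """CC-GREEDY sketch: scan edges by non-increasing weight and select
    (i, j) whenever i has no outgoing and j has no incoming edge selected
    yet.  Missing edges, including self-loops (cf. Remark 3), default to
    weight 0, which guarantees the selection is a full cycle cover."""
    edges = {(i, j): w.get((i, j), 0) for i in range(n) for j in range(n)}
    out_used, in_used, chosen = set(), set(), []
    for (i, j), wt in sorted(edges.items(), key=lambda e: -e[1]):
        if i not in out_used and j not in in_used:
            out_used.add(i)
            in_used.add(j)
            chosen.append((i, j))
    return chosen  # n edges; in- and out-degree exactly 1 at every vertex
```

For example, with n = 3 and weights w(0,1) = 5, w(1,0) = 4 (all other weights 0), the sketch selects the 2-cycle (0,1), (1,0) and closes vertex 2 with a zero-weight self-loop.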

5.2 Parallel Approximation of a Maximum Weight Cycle Cover

In this section we describe an approximation of algorithm CC-GREEDY. Intuitively, CC-GREEDY becomes only slightly worse if, instead of a maximum weight edge, it chooses one of similar weight. We partition the edges of G into levels such that the weights of all edges in a level are within a factor of c. Then we start with the level of maximum weights and go down to the level of minimum weight. On each level we behave like CC-GREEDY, assuming that all the edges have the same weight.

Let c > 1 and G = (V, E, w). For each edge e, if w(e) ∈ (c^{k−1}, c^k] then define the c-level of e, level_c(e), to be k (i.e., level_c(e) = ⌈log_c w(e)⌉); additionally, if w(e) ≤ 1 then level_c(e) = 0. For a given path v_1, v_2, …, v_k its contraction is defined as follows. We remove the vertices v_2, …, v_k together with the edges incident to them, and change the edges (with their weights) outgoing from v_1 to be the edges previously outgoing from v_k. Vertex v_1 is called the contraction of the path v_1, v_2, …, v_k. The recontraction of a cycle or a path is obtained by recursively replacing each contraction by its path.

Algorithm ACC-GREEDY
1. Let s be the maximum c-level of the edges in G (i.e., s = max_{e∈E} level_c(e)), and let G̃ = G and C = ∅.
2. Repeat for t = s downto t = 0:
   (a) Let G_t be the (non-weighted) subgraph of G̃ induced by the edges from the t-th c-level.
   (b) Find a maximal path-cycle cover of G_t.
   (c) Remove all cycles in G_t from G̃ and, after recontracting, add them to C; contract all obtained paths.
3. The set C contains a cycle cover.

Lemma 5.4 For any c > 1, algorithm ACC-GREEDY finds a cycle cover of weight at least 1/(2c) times the weight of a maximum weight cycle cover.

Proof: Let Ḡ = (V, E, w̄) be the graph with weights

  w̄(e) = c^{level_c(e)}  if ACC-GREEDY chooses e,
  w̄(e) = w(e)            otherwise.

Note that w(e) ≤ w̄(e) < c·w(e) for every e ∈ E. Let MCC(G) be the weight of a maximum weight cycle cover in G, and let w(AM(G)) be the sum of the weights of the edges chosen by algorithm ACC-GREEDY on G.
Observe that algorithm ACC-GREEDY on the graph Ḡ is equivalent to CC-GREEDY on Ḡ, assuming that both algorithms select the same edges in the case of equal weights. Thus w̄(AM(G)) ≥ (1/2)·MCC(Ḡ) ≥ (1/2)·MCC(G). On the other hand, w̄(AM(G)) < c·w(AM(G)). Thus finally we get c·w(AM(G)) ≥ (1/2)·MCC(G). □

Now we want to show that this algorithm can be implemented to run in polylogarithmic time with a polynomial number of processors. The main loop is executed ⌈log_c(max_{e∈E} w(e))⌉ times. Since the weights of the graph are given in unary, we only have to show that there is an NC algorithm that finds a maximal path-cycle cover.

Lemma 5.5 There is an NC algorithm that finds a maximal path-cycle cover of a digraph G.

Proof: Define an undirected graph G̃ = (Ṽ, Ẽ) as follows. The set of vertices Ṽ corresponds to the set of edges of G, and there is an edge in G̃ between two nodes corresponding to edges e_1, e_2 of G if and only if they have the same head or the same tail (i.e., either e_1 = (i, j), e_2 = (i, k) or e_1 = (j, i), e_2 = (k, i) for some nodes i, j, k of G). Let MIS(G̃) be a maximal independent set in G̃ and let cov(G) be the set of edges of G corresponding to the vertices in MIS(G̃). We shall show that cov(G) is a maximal path-cycle cover of G. First observe that no two edges in cov(G) share a head or a tail; otherwise the corresponding vertices in MIS(G̃) would be adjacent, which contradicts the definition of an independent set. Thus cov(G) is a path-cycle cover of G. Now we only have to show that the set of edges cov(G) is maximal. Let e_1 be an edge of G which is not in cov(G) and let v_{e_1} be the corresponding vertex of G̃. Since v_{e_1} ∉ MIS(G̃), there is a vertex v_{e_2} ∈ MIS(G̃) which is adjacent to v_{e_1}. Thus the edge e_2 in G corresponding to v_{e_2} shares either the tail or the head with e_1. And since e_2 ∈ cov(G), adding the edge e_1 to cov(G) would destroy the path-cycle cover property.

To finish the proof it remains to show that a maximal independent set can be found in NC. This problem can be solved either in O(log² n) time with n⁴ processors by the algorithm of Luby [11] or in O(log³ n) time with n² processors on a CREW PRAM using the algorithm of Israeli and Shiloach [9]. □

Lemma 5.4 and Lemma 5.5 immediately imply the following theorem.

Theorem 5.6 There is an NC algorithm that finds a cycle cover of a weighted digraph G of weight at least 1/(2+ε) times the weight of a maximum weight cycle cover. Here ε > 0 is an arbitrary but fixed constant, and we assume that the weights of G are given in unary. The running time of this algorithm is either O(log² n · log_{1+ε}(max_{e∈E} w(e))) with n⁴ processors or O(log³ n · log_{1+ε}(max_{e∈E} w(e))) with n² processors.

5.3 Parallel Approximation of Shortest Superstrings

In this section we use the techniques developed in the previous section to design an NC algorithm that finds a superstring whose overlap is at least 1/(4+ε) times that of a shortest superstring.

First we build the overlap graph OG(S) for the set of strings S. We assume in the construction that there is no self-loop in OG(S).
Then we find a cycle cover in OG(S) using algorithm ACC-GREEDY. Next we remove from every cycle an edge of minimum weight and join (by arbitrary edges) the obtained paths to get a Hamiltonian path.

Lemma 5.7 The obtained Hamiltonian path has weight at least 1/(4+ε) times the weight of a maximum weight Hamiltonian path, for any ε > 0.

Proof: Any maximum weight cycle cover MCC has weight not smaller than the weight of a maximum weight Hamiltonian path MHP. Let C be the cycle cover obtained by algorithm ACC-GREEDY and HP the Hamiltonian path obtained by the algorithm described above. From Theorem 5.6 we get w(C) ≥ (1/(2 + ε/2))·w(MCC) for any ε > 0. Since we remove the least weighted edge from every cycle of C, we get w(HP) ≥ w(C)/2. Thus

  w(HP) ≥ w(C)/2 ≥ (1/(4+ε))·w(MCC) ≥ (1/(4+ε))·w(MHP). □

Theorem 5.8 There is an NC algorithm that achieves a compression factor of 1/(4+ε). It runs either in O(log² n · log_{1+ε} |S|) time with |S| + n⁴ processors or in O(log³ n · log_{1+ε} |S|) time with |S| + n² processors.

Proof: The proof is a straightforward application of previously known results (see e.g., [2], [20]). The output string s_p is the one corresponding to the Hamiltonian path obtained by the algorithm above. The compression factor is defined to be (|S| − |s_p|) / (|S| − opt(S)). One can easily observe that the weight of the Hamiltonian path corresponding to a superstring w is always equal to |S| − |w|. Hence the compression factor equals the approximation factor for the maximum weight Hamiltonian path in OG(S). □

5.4 NC Algorithm with Logarithmic Approximation Factor

Let X = {1, 2, …, n} and let Y = {Y_1, Y_2, …, Y_m} ⊆ 2^X be a family of its subsets. A cover of X is a subcollection Y′ ⊆ Y such that ∪_{Y_i ∈ Y′} Y_i = X. For each set Y_i ∈ Y let w(Y_i) denote its (positive) weight, and for a subcollection Y′ ⊆ Y define its weight by w(Y′) = Σ_{Y_i ∈ Y′} w(Y_i). The weighted set cover problem is to find a cover Y* of X of minimum weight.
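For intuition, the classical sequential greedy heuristic for weighted set cover (Chvátal [4]) repeatedly picks a set minimizing weight per newly covered element and achieves a logarithmic approximation ratio. A minimal sketch, with illustrative names and sets represented as Python sets, assuming the family does cover X:

```python
def greedy_set_cover(X, sets, w):
    """Chvatal's greedy heuristic [4]: repeatedly pick the set Y_i that
    minimizes weight per newly covered element, w(Y_i) / |Y_i - covered|.
    Returns the indices of the chosen sets."""
    uncovered, cover = set(X), []
    while uncovered:
        best = min((i for i in range(len(sets)) if sets[i] & uncovered),
                   key=lambda i: w[i] / len(sets[i] & uncovered))
        cover.append(best)
        uncovered -= sets[best]
    return cover
```

The NC algorithm of Berger et al. [3] used below parallelizes this greedy scheme while losing only a (1+ε) factor in the approximation ratio.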

The weighted set cover problem is known to be NP-hard [6]. A recent result of Lund and Yannakakis [13] shows that this problem cannot be approximated in polynomial time with ratio c·log₂ n for any c < 1/4 unless NP ⊆ DTIME(n^{O(log log n)}). However, a polynomial-time algorithm is known that finds a logarithmic-factor approximation. Recently Berger et al. [3] proved the following lemma.

Lemma 5.9 For any ε > 0, there is an NC algorithm for the weighted set cover problem that runs in O(log⁴ n · log m · log²(nm) / ε⁶) time, uses O(n + Σ_{i=1}^m |Y_i|) processors, and produces a cover of weight at most (1 + ε)·log n times the weight of a minimum cover.

For any s_i, s_j ∈ S and d, 0 ≤ d < min{|s_i|, |s_j|}, let u and v be strings of length d such that s_i = xu and s_j = vy for some non-empty strings x and y. If u = v then we define con(i, j, d) = xuy; otherwise con(i, j, d) is undefined. If con(i, j, d) is defined, then let CON(i, j, d) = {s_i, s_j} ∪ {s_k : s_k is a substring of con(i, j, d)}. In this way we obtain a family FAM-CON of subsets of S. For each CON(i, j, d) ∈ FAM-CON define its weight w(i, j, d) = |con(i, j, d)| = |s_i| + |s_j| − d.

Let C = {CON(i_1, j_1, d_1), CON(i_2, j_2, d_2), …, CON(i_c, j_c, d_c)} be a cover of S defined by a collection of sets from FAM-CON. We can form the corresponding superstring S_C = con(i_1, j_1, d_1) · con(i_2, j_2, d_2) ⋯ con(i_c, j_c, d_c). Since C is a cover of S, every string from S is a substring of S_C. Note also that w(C) = Σ_{1≤k≤c} w(i_k, j_k, d_k) = Σ_{1≤k≤c} |con(i_k, j_k, d_k)| ≥ |S_C|. The following fact can easily be derived (see e.g., [12]).

Fact 5.10 Let C* be a minimum weight set cover of S. Then |S_{C*}| ≤ w(C*) ≤ 2·opt(S).

Now suppose that we have found a set cover C such that w(C) ≤ t·w(C*) for some t. Then clearly |S_C| ≤ t·w(C*). Thus the superstring S_C has length at most 2·t·opt(S).
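The reduction above can be sketched directly from the definitions (a naive transcription: the function names are illustrative, the quadratic substring tests are for clarity only, and sets are represented by index sets over S):

```python
def con(si, sj, d):
    """con(i, j, d): merge s_i and s_j on an overlap of length d, where
    0 <= d < min(|s_i|, |s_j|); returns None (undefined) when the length-d
    suffix of s_i differs from the length-d prefix of s_j."""
    if d and si[-d:] != sj[:d]:
        return None
    return si + sj[d:]  # length |s_i| + |s_j| - d

def fam_con(S):
    """The family FAM-CON with weights w(i, j, d) = |s_i| + |s_j| - d:
    each CON(i, j, d) collects the strings of S occurring in con(i, j, d),
    recorded here as a (member-index-set, weight) pair."""
    fam = []
    for i, si in enumerate(S):
        for j, sj in enumerate(S):
            for d in range(min(len(si), len(sj))):
                c = con(si, sj, d)
                if c is not None:
                    members = frozenset(k for k, sk in enumerate(S) if sk in c)
                    fam.append((members, len(si) + len(sj) - d))
    return fam
```

For instance, on S = {"ab", "ba"} the merge con("ab", "ba", 1) = "aba" yields a set covering both strings at weight 3; a set cover over FAM-CON then concatenates the chosen con-strings into the superstring S_C.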
Hence, using Lemma 5.9, we obtain the following theorem.

Theorem 5.11 There is an NC algorithm that, for any ε > 0, finds a superstring whose length is at most (2 + ε)·log n times the length of a shortest superstring.

A similar construction was used implicitly by Li [12] for a sequential algorithm.

References

[1] S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy, Proof verification and hardness of approximation problems, in Proceedings of the 33rd IEEE Symposium on Foundations of Computer Science, pp. 14–23, 1992.

[2] A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis, Linear approximation of shortest superstrings, in Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, pp. 328–336, 1991.

[3] B. Berger, J. Rompel, and P. W. Shor, Efficient NC algorithms for set cover with applications to learning and geometry, in Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, pp. 54–59, 1989. The full version is to appear in J. Algorithms.

[4] V. Chvátal, A greedy heuristic for the set-covering problem, Math. Oper. Res. 4 (1979), 233–235.

[5] J. Gallant, D. Maier, and J. Storer, On finding minimal length superstrings, J. Comput. System Sci. 20 (1980), 50–58.

[6] M. R. Garey and D. S. Johnson, "Computers and Intractability: A Guide to the Theory of NP-Completeness", Freeman, New York, 1979.

[7] R. Greenlaw, H. J. Hoover, and W. L. Ruzzo, A compendium of problems complete for P, Technical Report, University of Washington, 1991.

[8] J. JáJá, "An Introduction to Parallel Algorithms", Addison-Wesley, 1992.

[9] A. Israeli and Y. Shiloach, An improved parallel algorithm for maximal matching, Inform. Process. Lett. 22 (1986), 57–60.

[10] A. Lesk (Ed.), "Computational Molecular Biology: Sources and Methods for Sequence Analysis", Oxford University Press, 1988.

[11] M. Luby, A simple parallel algorithm for the maximal independent set problem, SIAM J. Comput. 15 (1986), 1036–1053.

[12] M. Li, Towards a DNA sequencing theory (learning a string), in Proceedings of the 31st IEEE Symposium on Foundations of Computer Science, pp. 125–134, 1990.

[13] C. Lund and M. Yannakakis, On the hardness of approximating minimization problems, in Proceedings of the 25th Annual ACM Symposium on Theory of Computing, pp. 59–68, 1993.

[14] C. Papadimitriou and K. Steiglitz, "Combinatorial Optimization: Algorithms and Complexity", Prentice-Hall, 1982.

[15] C. Papadimitriou and M. Yannakakis, Optimization, approximation, and complexity classes, in Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pp. 229–234, 1988.

[16] H. Peltola, H. Söderlund, J. Tarhio, and E. Ukkonen, Algorithms for some string matching problems arising in molecular genetics, in Proceedings of the Information Processing Congress, IFIP, pp. 53–64, 1983.

[17] J. Storer, "Data Compression: Methods and Theory", Computer Science Press, 1988.

[18] J. Tarhio and E. Ukkonen, A greedy approximation algorithm for constructing shortest common superstrings, Theoret. Comput. Sci. 57 (1988), 131–145.

[19] S.-H. Teng and F. Yao, Approximating shortest superstrings, in Proceedings of the 34th IEEE Symposium on Foundations of Computer Science, pp. 158–165, 1993.

[20] J.-S. Turner, Approximation algorithms for the shortest common superstring problem, Inform. Comput. 83 (1989), 1–20.

[21] V. V. Vazirani, Parallel graph matching, in "Synthesis of Parallel Algorithms" (J. H. Reif, Ed.), pp. 783–811, Morgan Kaufmann, 1993.