[IEEE 1994 IEEE International Symposium on Information Theory - Trondheim, Norway (27 June-1 July 1994)] Proceedings of 1994 IEEE International Symposium on Information Theory - Codes

Codes Capable of Correcting Bursts of Insertions and Deletions

Patrick A.H. Bours

Eindhoven University of Technology, Department of Mathematics and Computing Science, email: wsdwpb@win.tue.nl, P.O. Box 513, 5600 MB Eindhoven, the Netherlands

Abstract - An array code D is constructed, capable of correcting bursts of insertions, deletions, and substitutions. The rows of the codewords of D are used for synchronisation purposes and the columns are used to correct substitution errors and to retrieve information that was lost during transmission.

I. INTRODUCTION

The code D that we will construct is an array code capable of correcting bursts of insertions and deletions. An array is transmitted row by row. The code C2 that is used to encode the rows of the array is a comma free code (CFC) and is needed to synchronise the received rows. Rows that are corrupted during transmission are considered lost and are replaced by rows of erasures at the receiver's end. The code C1 that is used to encode the columns of the array is an ordinary error and erasure correcting code and is needed to retrieve the rows that are lost during transmission. A unique ID-sequence is attached to each row of an array, so the receiver knows which rows are received correctly and which are lost during transmission.

II. ENCODING AND DECODING

A codeword of D is constructed in two steps. In the first step an n1 x k2 matrix is made as follows. The first t := ⌈log2(n1)⌉ bits of each row form the ID-sequence of that row, and the last k2 - t columns are codewords of code C1, which has length n1 and dimension k1. In the second encoding step the k2 bits of each row are replaced, in a unique way, by a codeword of length n2 of code C2. It is obvious that the rate R of D equals

$$R = \frac{k_1 \, (k_2 - t)}{n_1 \, n_2}.$$

Note that by this encoding the ID-sequences are also protected by the CFC. The choice of the CFC C2 is discussed in the next section.
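The bookkeeping of the first encoding step can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, k1 is assumed to be the dimension of C1, and the parameter values in the usage note are hypothetical rather than taken from the paper.

```python
from fractions import Fraction


def id_length(n1):
    """Number of ID bits t = ceil(log2(n1)) needed to label n1 rows."""
    # (n1 - 1).bit_length() equals ceil(log2(n1)) for every n1 >= 1
    # and avoids floating-point rounding.
    return (n1 - 1).bit_length()


def row_ids(n1):
    """Binary ID-sequence of each of the n1 rows, as t-bit strings."""
    t = id_length(n1)
    return [format(i, "0{}b".format(t)) if t else "" for i in range(n1)]


def array_rate(n1, k1, n2, k2):
    """Rate k1 * (k2 - t) / (n1 * n2) of the array code, as an exact fraction."""
    t = id_length(n1)
    if k2 <= t:
        raise ValueError("k2 must exceed the ID length t")
    return Fraction(k1 * (k2 - t), n1 * n2)
```

For example, with the hypothetical parameters n1 = 15, k1 = 11, n2 = 31 and k2 = 20, each row carries t = 4 ID bits and `array_rate(15, 11, 31, 20)` evaluates to 176/465.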

Decoding is done using a decoding window of size n2. If the content of this window is a codeword of C2, this word is decoded and the window is shifted n2 positions. If the content of the decoding window is not a codeword of the CFC, the window is shifted in steps of one position until a codeword is detected. If, due to insertion and/or deletion errors, the content of the decoding window is a codeword, but this codeword is not a transmitted row, we say that we have detected an "incorrect" codeword. The probability of detecting an "incorrect" codeword due to a single insertion or deletion error depends very much on the choice of the CFC C2, as we will see in the next section.
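The window procedure can be sketched as follows. This is a minimal illustration, not the author's implementation: the received stream is modelled as a string of bits, membership in C2 is tested by a set lookup, and the toy code {001, 011} used in the usage note is a hypothetical stand-in for the CFC.

```python
def window_decode(stream, codebook, n2):
    """Slide a window of n2 bits over the stream.

    Returns a list of (position, word) pairs, one for every window whose
    content is a codeword. After a hit the window jumps n2 positions;
    after a miss it advances by a single position.
    """
    found = []
    pos = 0
    while pos + n2 <= len(stream):
        window = stream[pos:pos + n2]
        if window in codebook:
            found.append((pos, window))
            pos += n2   # codeword detected: move to the next row
        else:
            pos += 1    # synchronisation lost: search bit by bit
    return found
```

With the toy code `{"001", "011"}`, the error-free stream `"001011001"` is split at positions 0, 3 and 6. If one bit of the middle row is deleted and a further row follows (`"00101001011"`), the damaged row is skipped and the decoder resynchronises on the next rows at positions 5 and 8.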

It can be shown that, if synchronisation is lost due to errors, the second received uncorrupted row after the last corrupted row can always be recovered. In most cases the first received uncorrupted row can already be recovered, but it is possible that, due to the detection of an "incorrect" codeword, this first uncorrupted row is not detected by the decoding procedure described above.

III. COMMA FREE CODE

The Comma Free Code that will be used is a modified version of the F-codes described by Clague in [1]. The modified version will be denoted by CF-code. All codewords of the length n F-code have zeroes on positions 1, 2, ..., s and ones on positions s + 1, 2s + 1, ..., ts + 1, where s and t are such that 2st ≥ n - 1. The remaining positions can be filled with information bits, hence the cardinality of an F-code is 2^(n-s-t). To construct a CF-code, we first add two extra fixed positions, i.e. position (t + 1)s + 1 will be 0 and position n will be 1, hence the cardinality of a CF-code is 2^(n-(s+1)-(t+1)). Next, the coordinates of the CF-code will be permuted as follows. Coordinate i will become coordinate (im + r) mod n, where 0 ≤ r < n and m is such that gcd(m, n) = 1. It is easy to show that the resulting codes are still comma free and that the probability of detecting an "incorrect" codeword due to the occurrence of a single insertion or deletion error can be reduced tremendously by a good choice of m and r. It turns out that (m, r) = (n - 1, 2) is often a good choice if n is not too large.
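For small lengths the construction can be enumerated and checked by brute force. The sketch below is written under stated assumptions: the function names are invented here, positions are 1-indexed as in the text, the two extra fixed positions are required to be distinct (so (t + 1)s + 1 < n), and a residue of 0 in the permutation is read as position n, a convention the text does not spell out.

```python
from itertools import product
from math import gcd


def cf_fixed_positions(n, s, t):
    """Map each fixed 1-indexed position of the unpermuted code to its bit."""
    if 2 * s * t < n - 1 or (t + 1) * s + 1 >= n:
        raise ValueError("need 2st >= n - 1 and (t + 1)s + 1 < n")
    fixed = {p: "0" for p in range(1, s + 1)}                # zeroes on 1..s
    fixed.update({j * s + 1: "1" for j in range(1, t + 1)})  # ones on s+1, ..., ts+1
    fixed[(t + 1) * s + 1] = "0"                             # first extra position
    fixed[n] = "1"                                           # second extra position
    return fixed


def cf_code(n, s, t):
    """All codewords before the coordinate permutation, as bit strings."""
    fixed = cf_fixed_positions(n, s, t)
    free = [p for p in range(1, n + 1) if p not in fixed]
    words = []
    for bits in product("01", repeat=len(free)):
        word = dict(fixed)
        word.update(zip(free, bits))
        words.append("".join(word[p] for p in range(1, n + 1)))
    return words


def permute(word, m, r):
    """Move coordinate i to (i*m + r) mod n; residue 0 is read as position n."""
    n = len(word)
    if gcd(m, n) != 1:
        raise ValueError("m must be coprime to n")
    out = [None] * n
    for p in range(1, n + 1):
        q = (p * m + r) % n or n
        out[q - 1] = word[p - 1]
    return "".join(out)


def is_comma_free(words):
    """True if no proper overlap of two codewords is itself a codeword."""
    code = set(words)
    n = len(next(iter(code)))
    return not any(
        (x + y)[d:d + n] in code
        for x in code for y in code for d in range(1, n)
    )
```

For n = 9 and s = t = 2, `cf_code(9, 2, 2)` yields 2^(9-3-3) = 8 words and `is_comma_free` confirms that this unpermuted code is comma free. Applying `permute(w, 8, 2)` to every word gives the (m, r) = (n - 1, 2) variant, which can be passed to the same checker to experiment with other choices of m and r.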

The advantage of using a CF-code is that it is easy to encode and decode. The disadvantage is a reduction in the rate of the code compared to the maximal rate of a CFC. For odd lengths n it has been shown that Comma Free Codes can be constructed for which the rate tends to 1 - log2(n)/n. For CF-codes, the rate tends to 1 - √(2/n) when n becomes large.
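Since a CF-code fixes s + t + 2 of its n positions subject to 2st ≥ n - 1, the best rate for a given length can be found by a small search. The sketch below is illustrative only: the function name is invented here, and it additionally assumes (t + 1)s + 1 < n so that the two extra fixed positions are distinct.

```python
from fractions import Fraction


def best_cf_rate(n):
    """Return (rate, s, t) maximising the number of free positions.

    The number of information bits is n - (s + 1) - (t + 1), subject to
    2st >= n - 1 and (t + 1)s + 1 < n. Ties keep the first pair found.
    """
    best = None
    for s in range(1, n):
        for t in range(1, n):
            if 2 * s * t >= n - 1 and (t + 1) * s + 1 < n:
                k = n - (s + 1) - (t + 1)
                if best is None or k > best[0]:
                    best = (k, s, t)
    if best is None:
        raise ValueError("no admissible (s, t) for this n")
    k, s, t = best
    return Fraction(k, n), s, t
```

For n = 31 the search returns rate 21/31 with (s, t) = (3, 5). The redundancy s + t + 2 grows roughly like √(2n), because s + t is smallest when s and t are both close to √((n - 1)/2).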

IV. ERROR CORRECTING CAPABILITIES

Assume a burst of insertions or deletions has size at most n2 + 1, hence a burst corrupts at most two consecutive rows of an array. Furthermore, assume that at most one burst occurs inside each transmitted array. Suppose a burst of deletions occurred during transmission. Then there are two possibilities. First, suppose no "incorrect" codeword is detected. Then at most 2 rows cannot be retrieved due to this burst of deletions and the first uncorrupted row can be recovered by the receiver, so this burst results in two erased rows in the received array. Next, suppose that an "incorrect" codeword is detected. In the worst case, this "incorrect" codeword contains the beginning of the first uncorrupted row, which means that instead of 3 correct rows, the receiver detects only 1 incorrect row, hence the burst results in two erased rows and one incorrect row in the received array. We see that, in order to retrieve the erased rows and correct the corrupted rows, the minimal distance of code C1 has to be at least equal to 5. The case that a single burst of insertions of maximal size n2 + 1 occurs can be treated in a similar way.
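The distance requirement follows from the standard condition for simultaneous error and erasure correction, applied to each column of the array. The symbols e (number of errors), f (number of erasures) and d (minimum distance of C1) are notation introduced here, not taken from the paper.

```latex
% A code with minimum distance d corrects e errors and f erasures whenever
\[
  d \;\ge\; 2e + f + 1 .
\]
% Worst case for a single burst: two erased rows (f = 2) and one
% incorrect row (e = 1), each touching every column in one position.
\[
  d \;\ge\; 2 \cdot 1 + 2 + 1 \;=\; 5 .
\]
```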

REFERENCES

[1] Clague, D.J., "New Classes of Synchronous Codes", IEEE Trans. on Elect. Comp., vol. EC-16, pp. 290-298, 1967.

0-7803-2015-8/94/$4.00 ©1994 IEEE
