
Algorithms on Strings, Trees and Sequences: Dan Gusfield


ELSEVIER Computers and Chemistry 24 (2000) 135-137

Computers & Chemistry

www.elsevier.com/locate/compchem

Book review

Algorithms on Strings, Trees and Sequences, Dan Gusfield

Understanding the purpose of Dan Gusfield's book is made easy by reading the author's preface. Borrowing his own terms, the content is defined as a general-purpose rigorous treatment of the entire field of deterministic algorithms that operate on strings and sequences. The concept of sequence broadly covers applications in molecular biology.

The book contains four parts. The first two consist of a review of exact string matching methods and of suffix tree construction and application: algorithms are given in detail, and a variety of examples, figures and exercises illustrate each topic. Such algorithms are readily usable to build new software or to speed up existing software. Real molecular biology problems are introduced in the last chapter of the second part (suffix trees) as a transition to the last two parts of the book, where issues in computational molecular biology are extensively described and explained.

In recent years, research in computational biology/bioinformatics has evolved quickly. A short historical background will confirm how timely the release of this book is.

First, the nature of the data has changed. The availability of specialized databases and whole genome sequences has imposed a more systematic approach. Raw data tend to be more consistent when a complete bacterial chromosome is considered than when a few similar bacterial sequences are extracted from general databanks. This plain fact is full of consequences. One is the caution required in gathering similar sequences and in stating related issues. It is somewhat disappointing to find some of the key questions raised only in the Epilogue of the book: what are the good problems in computational molecular biology...? ... what is the definition of a motif? ..., what is a family? ... (the latter is discreetly addressed in a footnote in the third part). If, as the author claims, good problems are biology-driven, one must bear in mind that biology is data-driven, so it is not surprising to find that the worth of an algorithm in biology journals is often measured only by the rate of success in coming up with the expected results. Moreover, quite a few computer methods were designed and/or used with 'home made' data sets. The situation is sometimes paradoxical, as it may come down to comparing comparison methods with data sets which are not necessarily comparable or compatible in the first place.

Second, the definition of optimality has changed. Whatever the method used to solve a problem, it relies at some stage on setting optimal criteria. The variety of criteria by which methods published in biology journals are evaluated has kept growing. How explicit are these criteria, and when are they supposed to be set? The introduction of an increasing number of formal approaches in biology has brought these questions to the foreground.

Most of the theoretical computer science work presented in the book is covered in other referenced books, where it is classically shown that the worth of an algorithm is estimated by means of complexity and running time calculations. An algorithm is optimal if these values are minimal. Then, the conditions for applying an algorithm need to be optimised. For instance, there is no dynamic programming without a clear statement of a cost function. The essence of this approach is to minimize such a cost. This is an a priori criterion. Even though there is no consensus among biologists on the definition of cost, results can be discussed in the light of the explicit assumptions set initially. But, in many cases, optimal criteria are in fact set a posteriori, as an interpretation of results. The definition of optimality is therefore usually much looser in biology and more diverse:

1. An algorithm is optimal if all results match the available knowledge.
2. An algorithm is optimal if results generated with real data are clearly distinct from those with random data.
3. An algorithm is optimal if results can be experimentally tested.

These definitions are not mutually exclusive. Needless to say, formal definitions are far more precise than these, but biology is definitely empirical.
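The a priori criterion discussed above, dynamic programming driven by an explicitly stated cost function, can be illustrated with a minimal sketch. It is not taken from the book: the function names, the two cost functions and the uniform gap cost are assumptions chosen for illustration only. The point is that the same recurrence yields different "optimal" answers as soon as the stated cost changes.

```python
def alignment_cost(a, b, substitution_cost, gap_cost):
    """Minimum total cost of aligning strings a and b (dynamic programming)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        table[i][0] = i * gap_cost
    for j in range(1, cols):
        table[0][j] = j * gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
                table[i - 1][j] + gap_cost,   # character of a against a gap
                table[i][j - 1] + gap_cost,   # character of b against a gap
            )
    return table[-1][-1]


def unit_cost(x, y):
    """Every mismatch costs 1 (plain edit distance)."""
    return 0 if x == y else 1


def transition_friendly_cost(x, y):
    """Hypothetical DNA cost: purine/purine or pyrimidine/pyrimidine
    substitutions cost 1, all other substitutions cost 2."""
    if x == y:
        return 0
    purines = {"A", "G"}
    return 1 if (x in purines) == (y in purines) else 2


if __name__ == "__main__":
    print(alignment_cost("kitten", "sitting", unit_cost, 1))      # 3
    print(alignment_cost("A", "C", unit_cost, 1))                 # 1
    print(alignment_cost("A", "C", transition_friendly_cost, 1))  # 2
```

Whichever cost is chosen, the result can be discussed in the light of that explicit assumption, which is exactly what the looser a posteriori criteria do not allow.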

0097-8485/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0097-8485(99)00054-6

Consider a set or a family of sequences, whether DNA, RNA or protein. Features like sequence composition, signals, specific folding (for RNA or protein), etc. characterize such a collection. To characterize it without ambiguity, that is, to set an optimal definition, the corresponding features must constitute a set of necessary and sufficient conditions. String matching methods are often useful as tools reflecting necessary conditions. The presence of a signal, of specific repeats, of palindromic structures, etc. is considered necessary to identify a set of sequences similar to the initial set. Fast motif searches can be implemented. But, in the vast majority of cases, these characteristics yield a large number of false positives. The selection can only be refined by setting sufficient conditions. This task is hard enough to push back any concern about how much time and memory a program consumes and, more often than not, it remains an unreachable goal. Consequently, optimizing a criterion means finding the best compromise between formal and loose definitions. D. Gusfield suggests loosening the strict criteria set earlier by "relying more heavily on correctness proofs, worst-case analysis, lower bound arguments, randomized algorithm analysis, and bounded approximation results...". But no biologist will ever be impressed by a perfectly well running algorithm saving maximum time and space unless it solves the biology problem at hand. So, solving the problem comes first, which is only acknowledged in the last two paragraphs of the book.
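The gap between necessary and sufficient conditions can be made concrete with a toy motif search. The sketch below is purely illustrative: the motif, the sequences and the names `has_motif`, `family` and `others` are all invented for the example and do not come from the book. Every member of the hypothetical family carries the signal, yet the search also returns a sequence that does not belong to it.

```python
import re

# Illustrative signal only; any fixed pattern would make the same point.
MOTIF = re.compile(r"TATAAT")


def has_motif(sequence):
    """Necessary condition: the sequence contains the signal somewhere."""
    return MOTIF.search(sequence) is not None


# Hypothetical data: two family members and two unrelated sequences.
family = ["GGCTATAATGCC", "ATTATAATCGGA"]
others = ["CCCTATAATTTT", "GGGGCCCCGGGG"]

candidates = [s for s in family + others if has_motif(s)]
false_positives = [s for s in candidates if s not in family]

if __name__ == "__main__":
    print("candidates:", candidates)
    print("false positives:", false_positives)
```

The search is fast and never misses a family member, but it cannot by itself reject the unrelated hit; that would require the sufficient conditions the text describes as so hard to state.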

Saving time and space nevertheless became an issue for biologists when sequence databanks reached an unmanageable size, about 10 years ago. The response came with the introduction of heuristic methods like FASTA and BLAST. These programs are an early compromise among the various definitions of optimality. They are efficient in computer terms and perform useful searches for biologists. This usefulness is justified in the book by two plain statements which speak for themselves: "The day may well come when sequence database search will just involve the repeated application of precise two-strings alignments. That day is not here yet".

In reality, the worth of computer tools set as standards in biology is practical. The information to be found in sequences is so loosely defined that optimal criteria are even more difficult to establish. These tools are necessary, as field-glasses were for observing the moon a few hundred years ago. But only telescopes told us the landscape was pitted by craters. Could men have done without field-glasses? Probably, but they would have designed another intermediary device to look up at the sky, which would have become the ancestor of a better one, which would have given us the chance of seeing the craters anyway. The important point is to remain aware of how limited the interpretation of results is. Cross-checking the outputs of heuristic methods with one another or with those obtained by running more formal methods is a basic recommendation echoed by D. Gusfield.

The introduction of dynamic programming in molecular biology is even older than the initiative to use fast search programs. The diversity of scoring matrices and the ongoing issue of gap penalties have prevented an agreement on the optimality of the approach for sequence alignment. Sections of the book are devoted to these matters. Moreover, the formal approach to string comparison could cope neither with large data sets (see the preceding paragraph) nor with the diversity of sequences to be compared. These aspects are developed in the third part of the book. The careful presentation of string comparison in the context of sequence alignment covers all issues involving optimal criteria. Differences between local and global alignment are quite clearly explained, and the question of multiple alignment is also debated. Besides, the difficulty of setting optimal criteria to assess computer methods is further confirmed by D. Gusfield's choice not to discuss probabilistic models. Whether these involve hidden Markov models or the Gibbs sampling method, the interpretation of the various parameters does not correspond to explicit biological features. Optimizing in this case is a formal exercise which sheds little light on real mechanisms.
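The difference between global and local alignment mentioned above can be sketched in a few lines. This is a simplified illustration, not the book's presentation: the scoring values (match +1, mismatch -1, linear gap -1) and the function name `alignment_score` are assumptions, and only the score is computed, not the alignment itself. The global variant follows the Needleman-Wunsch recurrence; the local variant follows Smith-Waterman, where scores are clamped at zero and the best cell anywhere in the table is reported.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b under a simple linear gap model.

    local=False: global alignment (whole strings must be aligned).
    local=True:  local alignment (best-scoring pair of substrings).
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    a, b = "AAAACGT", "CGTTTTT"
    print("global:", alignment_score(a, b))
    print("local: ", alignment_score(a, b, local=True))
```

With these two strings the shared segment CGT dominates the local score, while the global score is dragged down by the unrelated flanks. Changing the scoring values changes both numbers, which is the lack of agreement on optimality that the review points to.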

The success of applying string matching algorithms to analyze genetic sequences lies in the possible interpretation of results in biological terms. Conversely, the challenge of improving the subtlety of such interpretations motivates the development of new algorithms. In fact, the reciprocal contribution of string matching methods and biology is ever-present in the last two parts of the book. The fourth part is devoted to further questions dealing with physical mapping, sequence assembly, phylogeny and genome rearrangements, with a short introduction to molecular computation and gene prediction. The latter two are oddly placed in the same section, although their scopes are completely different in nature. Various interesting efforts have been invested in improving gene prediction, whereas molecular computation seems a gratuitous exercise which does not really belong in the picture.

More generally, the third part of the book is more consistent than this last part, probably because alignment problems are simpler to set and caught the attention of researchers quite a while ago. The formulation of other biology problems is not as straightforward, particularly because the string aspect appears insufficient. For instance, understanding regulation cannot be reduced to optimizing the detection of regulatory signals. It involves other known and unknown mechanisms. Such further complications make even the slightest definition more difficult to set and the topic much messier. This is implied in the Epilogue again.

D. Gusfield explains his motivation for writing the book by following the evolution of his thoughts on applying string matching techniques in molecular biology. In fact, the current state of both the quality of string and pattern matching methods and the knowledge in molecular biology has brought together diverse viewpoints that were previously apart. This convergence is manifest in the book, which makes it a useful, if not a necessary, effort.

Algorithms on Strings, Trees and Sequences meets its goal of being a clear, rigorous exposé on algorithms, their use and their limitations. A substantial background in computer science seems to be a basic requirement for grasping the merits of this essay. The fact that none of the exercises given in the book can be solved without an in-depth understanding of the text does not make this essay casual reading for biologists or chemists wanting to know more about computer applications in their fields. However, computer scientists curious about biology, as well as those biologists and non-biologists who teach computational biology/bioinformatics, are likely to consider this essay an excellent reference book. Almost 500 bibliographic references are given, with only a small bias towards pure computer science (checking the relevance of references in detail was beyond this reviewer's intentions). A glossary of usual biology terms is also conveniently provided for biology learners.

Frederique Lisacek

Laboratoire Genome et Informatique,

Universite de Versailles, 45 avenue des Etats-Unis,

78035 Versailles Cedex, France