
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 41, NO. 6, NOVEMBER 1995 2101

Book Reviews

Managing Gigabytes: Compressing and Indexing Documents and Images, by Ian H. Witten, Alistair Moffat, and Timothy C. Bell (New York: Van Nostrand Reinhold, 1994, xiv + 429 pp., $54.95)

Reviewer: Joy A. Thomas

Information theory is often confused with the management of information systems, even though there appears to be little in common between the two fields. One deals with the fundamental limits in the representation and communication of information, and the other deals with the efficient organization and searching of data. It is therefore very interesting to find an excellent book that lies at the interface between the two areas. As the title indicates, this book deals with the problem of organizing, compressing, indexing, and querying a large database of text and images. The book is written in a clear and witty style, and is a readable and comprehensive introduction to the field.

Many of the ideas in the book have been implemented in a public domain document management system called mg, which is described in an appendix and is available on the Internet. The authors have tackled a number of practical issues in building such a system, and the book reflects their work. This is very much an “engineering” book; there are no theorems or proofs, and the mathematics is limited to simple back-of-the-envelope calculations analyzing the efficiency of some of the algorithms and heuristics proposed here. The authors describe some of the basic methods of compression and indexing, and provide practical examples of their use in real systems. Many of the methods are illustrated with numbers from sample databases, such as the TREC database, a large (2-GB) collection with three-quarters of a million documents and half a million terms in the index.

The book is essential reading for anyone interested in building or maintaining a large database for text or images. It is also a good introduction to some practical applications of information theory. During a course on information theory, I used material from the book to illustrate how ideas like Huffman coding or arithmetic coding are actually implemented in standard compression algorithms. As good applications inspire good theory, the book could also provide a source of interesting research problems, e.g., formalizing or improving some of the heuristics using information-theoretic ideas.

The first chapter introduces the problem of indexing and compressing text. Examples of the manual construction of concordances and indexes are used to motivate the need for efficient computer algorithms to do the same. While single books or a few megabytes of documents could be searched using simple, intuitive algorithms, these methods would be infeasible for the large databases envisioned in this book. The current explosion in the availability of information necessitates efficient techniques for storing and searching this information. As an example, the authors discuss the notion of a “memex,” a kind of generalized personal digital assistant envisaged by Vannevar Bush in 1945, which would store all the books, records, and communications of an individual. Such a device is quite conceivable with current technology, but its usefulness would depend on efficient storage and search mechanisms, which are the main subject of this book.

Manuscript received August 10, 1995. The author is with the IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA. IEEE Log Number 9415105.

The second chapter introduces the need for compression with a Parkinson’s law for space: despite increases in computer storage capacity and bandwidth, there is always more data to fill the space available. The chapter is a brief survey of the fundamental techniques for lossless source coding, including Huffman coding, arithmetic coding, and Lempel-Ziv coding, and modeling methods such as Dynamic Markov Compression (DMC) and Prediction by Partial Match (PPM). The presentation is necessarily brief, and much of the material is covered in greater detail in [1]. Some comparisons of compression performance and speed are given for files in the Calgary corpus.

The next three chapters deal with indexing. The first of these introduces the basic form of index, the inverted file, which is essentially a table that, for every word or term in the lexicon, lists the numbers of the documents in which the term appears. The authors discuss problems such as handling upper- and lower-case letters and stemming, which reduces similar words to a common root, e.g., “compressed” and “compression” both reduce to the root “compress.” They also consider other forms of indexing, such as signature files and bitmaps. The main focus of this chapter is the efficient storage of the index, and for this purpose there is a discussion of coding schemes for the integers, such as the Golomb code. These index compression methods allow the index for the TREC database to be a little more than 4% of the size of the data it indexes. Thus, when combined with compression of the original data, the complete database is only a fraction of the size of the original data, yet can be queried efficiently without scanning the entire database.
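
As a concrete illustration of this kind of index compression, the following sketch (my own Python, not code from the book or from mg) Golomb-codes the gaps between successive document numbers in a single inverted list. The parameter b is assumed to be chosen elsewhere; the book suggests roughly 0.69 N/f_t for a term occurring in f_t of the N documents, so that b is matched to the average gap and the codewords stay short.

    import math

    def golomb_encode(gap, b):
        """Golomb-code one positive d-gap with parameter b, as a bit string."""
        q, r = divmod(gap - 1, b)
        bits = "1" * q + "0"                      # quotient in unary
        c = math.ceil(math.log2(b))
        threshold = (1 << c) - b                  # truncated binary for the remainder
        if r < threshold:
            bits += format(r, "b").zfill(c - 1) if c > 1 else ""
        else:
            bits += format(r + threshold, "b").zfill(c)
        return bits

    def encode_postings(doc_ids, b):
        """Encode a sorted list of document numbers as Golomb-coded gaps."""
        out, prev = "", 0
        for d in doc_ids:
            out += golomb_encode(d - prev, b)
            prev = d
        return out

    # A short hypothetical inverted list, encoded with b = 4.
    print(encode_postings([3, 7, 8, 15], b=4))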

The fourth chapter deals with querying. The initial portion deals with Boolean queries, which look for documents that match all the search terms. The later part of the chapter deals with ranked queries, where the objective is to find documents that match as large a fraction of the search terms as possible. In this process, it is advisable to give higher weights to infrequent query terms, and various heuristics for finding a ranked set of matching documents are described.

The next chapter discusses the construction of indexes. Conceptually, the problem is very simple, since all one has to do is count the occurrences of each term in each document. However, a practical implementation for large databases requires careful algorithm design to avoid using too much memory. For example, for the TREC database, a sophisticated approach using a modified mergesort reduces the time needed to construct the index from more than a hundred years to about 5 hours on a typical workstation.
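
To convey the flavor of this sort-based approach, here is a toy sketch (my own Python, not the book’s algorithm or mg’s code) that collects (term, document, frequency) triples into bounded “runs,” sorts each run, and merges the sorted runs. In the real system the runs are written to disk, and the k-way merge is what makes inversion of a multi-gigabyte collection feasible.

    import heapq
    from itertools import groupby

    def invert(documents, run_limit=8):
        """Sort-based inversion in miniature: emit (term, doc_id, tf) triples into
        bounded runs, sort each run, then merge the runs into posting lists."""
        runs, current = [], []
        for doc_id, text in sorted(documents.items()):
            counts = {}
            for term in text.lower().split():
                counts[term] = counts.get(term, 0) + 1
            current.extend((term, doc_id, tf) for term, tf in counts.items())
            if len(current) >= run_limit:         # pretend memory is full: flush a run
                runs.append(sorted(current))
                current = []
        if current:
            runs.append(sorted(current))
        merged = heapq.merge(*runs)               # the runs would live on disk in practice
        return {term: [(doc, tf) for _, doc, tf in group]
                for term, group in groupby(merged, key=lambda triple: triple[0])}

    # Two hypothetical "documents"; the result maps each term to its posting list.
    docs = {1: "managing gigabytes", 2: "compressing and indexing gigabytes gigabytes"}
    print(invert(docs))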

The sixth chapter is a survey of techniques for image compression, with the initial part focused on bilevel images and the latter part on gray-scale images. It includes a very good description of the Group 3/4 FAX compression standards and the JBIG standard, which provide interesting examples of the application of Huffman coding and arithmetic coding, respectively. The authors also discuss resolution reduction and progressive transmission for images. The latter half of the chapter deals with the JPEG standard and the FELICS (Fast Efficient Lossless Image Coding System) [2] algorithm. The chapter concentrates on algorithms that are already standards, and does not discuss current research in wavelet or fractal image compression. There is also no discussion of other kinds of data that might be found in a multimedia database, e.g., audio or video data.


Since most document images are images of text, the next chapter is devoted to the compression of textual images. The authors describe in detail a method involving the extraction of marks (contiguous regions of black pixels, which are presumed to be characters of the text), forming a library of marks, and representing new marks by finding the closest match in the library. If a good match is not found for a mark, it is assumed to be a new character and added to the library. Thus the page is represented by a sequence of pointers into the library and the corresponding horizontal and vertical offsets of each mark on the page. Although this process is similar to Optical Character Recognition (OCR), the objective is to compress the text, not to decode the letters, and it avoids some of the imperfections of current OCR technology. The algorithm aptly illustrates the interplay between good models, good compression, and good recognition. While the basic scheme is lossy (since the marks on the page will not match the marks in the library exactly), it is possible to make it lossless by also storing the residue image (the difference between the original and the reconstructed image). Even though the residue image has far fewer black pixels, it is less compressible than the original image, and applying a standard compression algorithm to the residue results in no savings relative to the original image. This is because the process of extracting marks has removed most of the structure from the original image, leaving mostly noise. However, if the residue is compressed relative to the reconstructed image (which is already available to the decoder) using arithmetic coding, it is possible to compress the residue efficiently. This combination of a lossy and a lossless scheme provides a natural two-level coding scheme, where for most applications the lossy reconstruction would suffice, but the original image is available for archival purposes.
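
The mark-matching step can be sketched in a few lines. The code below is an illustrative simplification (my own, not the book’s algorithm): it assumes the marks have already been extracted and, for brevity, that their bitmaps are all the same size, whereas a real matcher must align and compare marks of differing extents.

    def mismatch(a, b):
        """Pixel-wise mismatch count between two equal-sized binary bitmaps."""
        return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

    def encode_marks(marks, threshold):
        """Represent each extracted mark as a pointer into a growing library.
        marks: list of (bitmap, x, y) tuples; returns the library and the list of
        (library_index, x, y) pointers that stand in for the page."""
        library, pointers = [], []
        for bitmap, x, y in marks:
            best, best_dist = None, None
            for i, template in enumerate(library):
                d = mismatch(bitmap, template)
                if best_dist is None or d < best_dist:
                    best, best_dist = i, d
            if best is None or best_dist > threshold:
                library.append(bitmap)            # unfamiliar shape: new library entry
                best = len(library) - 1
            pointers.append((best, x, y))
        return library, pointers

    # Two identical 2x2 "marks" and one different one map to two library entries.
    m1 = [[1, 0], [1, 1]]
    m2 = [[0, 1], [0, 0]]
    library, ptrs = encode_marks([(m1, 10, 5), (m1, 40, 5), (m2, 70, 5)], threshold=0)
    print(len(library), ptrs)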

Chapter 8 deals with document images that combine text with graphics or half-tone images. Using different compression schemes for different parts of the page is more effective than using a common scheme, so the authors discuss algorithms to segment and classify different regions of the page. They introduce methods such as the Hough transform, which allows one to find collinear marks on the page, and the docstrum, or document spectrum, a two-dimensional plot of the distances and angles between neighboring marks. These techniques are combined to orient the page and to segment it into text and image regions.
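
As a rough illustration of how the Hough transform can orient a page, the sketch below (my own simplification, not the algorithm given in the book) lets each mark centroid vote for every line passing through it; the most heavily voted line is typically a text baseline, and its angle reveals the skew.

    import math
    from collections import Counter

    def dominant_line_angle(points, angle_step_deg=1.0, rho_step=2.0):
        """Crude Hough transform over mark centroids: each point votes for all
        lines rho = x*cos(theta) + y*sin(theta) through it, and the most-voted
        (theta, rho) cell is the line through the most nearly collinear marks.
        For text, theta - 90 degrees then estimates the page skew."""
        votes = Counter()
        n_angles = int(round(180.0 / angle_step_deg))
        for x, y in points:
            for i in range(n_angles):
                theta = math.radians(i * angle_step_deg)
                rho = x * math.cos(theta) + y * math.sin(theta)
                votes[(i, round(rho / rho_step))] += 1
        (best_i, _), _ = votes.most_common(1)[0]
        return best_i * angle_step_deg

    # Centroids lying on a horizontal baseline: the peak angle is 90 degrees.
    print(dominant_line_angle([(10, 50), (30, 50), (55, 50), (80, 50), (100, 50)]))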

Chapter 9 considers implementation issues, including the choices among the different text compression, indexing, and image compression algorithms described in the book. A canonical Huffman coding algorithm is described which allows for fast decompression. Memory and time requirements for example databases are also given.
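
To show what “canonical” means here, the following sketch (my own Python, not mg’s code) assigns canonical Huffman codewords given a set of code lengths: within each length the codewords are consecutive binary integers, which is the property that permits fast decoding with only small tables.

    def canonical_codes(lengths):
        """Assign canonical Huffman codewords from a symbol -> code-length map.
        The lengths themselves must come from a proper Huffman computation."""
        symbols = sorted(lengths, key=lambda s: (lengths[s], s))
        codes, code, prev_len = {}, 0, 0
        for s in symbols:
            code <<= lengths[s] - prev_len        # shift left whenever the length grows
            codes[s] = format(code, "b").zfill(lengths[s])
            code += 1
            prev_len = lengths[s]
        return codes

    # Lengths 1, 2, 3, 3 yield the prefix-free codewords 0, 10, 110, 111.
    print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))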

The last chapter provides a philosophical overview of the entire field of managing large quantities of information. The current growth of information on the Internet reminds one of the notion of a world brain, suggested by H. G. Wells more than 50 years ago as a means of providing universal access to all human knowledge. Although there are problems such as quality control and copyright for information on the Internet, one big issue is access: how does one find the information one needs? Many tools such as Archie, Gopher, and the World Wide Web have made access easier, but the authors suggest that the techniques described in the book would allow one to index and compress textual information automatically, so that the compressed text and its index together occupy less space than the uncompressed data, while remaining easy to access and search. The techniques could ultimately be extended to images as well. Although, as the authors admit, such systems are still in their infancy, we can already envision a day when the “memex” of Vannevar Bush becomes a reality and we have instant and comprehensive access to all the information we need!

It is not often that one comes across a book that succeeds so well in introducing a timely subject to a broad audience. Managing Gigabytes is an essential reference for anyone working with large text or image databases, CD-ROM’s, digital libraries, etc. But it is also an excellent introduction for anyone who has ever grappled with the information explosion and wondered about automated means to tackle the problem.

REFERENCES

[1] T. C. Bell, J. G. Cleary, and I. H. Witten, Text Compression. Englewood Cliffs, NJ: Prentice-Hall, 1990.
[2] P. G. Howard and J. S. Vitter, “Fast and efficient lossless image compression,” in Proceedings of the IEEE Data Compression Conference, J. Storer and M. Cohn, Eds. IEEE Computer Society Press, 1993, pp. 351-360.

Joy A. Thomas received the B.Tech. degree from the Indian Institute of Technology, Madras, India, and the M.S. and Ph.D. degrees from Stanford University, Stanford, CA. He is currently a Research Staff Member at the IBM T. J. Watson Research Center, Yorktown Heights, NY, working on data compression and the relationships between information theory and queueing theory. He is a co-author (with Prof. Thomas Cover) of the book Elements of Information Theory (Wiley, 1991).