Inverted Text Index

Embed Size (px)

Citation preview

  • 8/14/2019 Inverted Text Index

    1/16

    Using a Relational Database for an Inverted Text Index

    Steve Putz

    Xerox Palo Alto Research Center

    Copyright January 1991 Xerox Corporation. All rights reserved.

    System Sciences Laboratory

    Palo Alto Research Center

    3333 Coyote Hill Road

    Palo Alto, California 94304

    [P91-00158]SSL-91-20

  • 8/14/2019 Inverted Text Index

    2/16

  • 8/14/2019 Inverted Text Index

    3/16

    1

    Using a Relational Database for an Inverted Text Index

    Steve Putz

    Xerox Palo Alto Research Center

    Abstract

    Inverted indices for free-text search are commonly stored and updated using B-Trees. This paper

    shows how to efficiently maintain a dynamic inverted index for rapidly evolving text document sets

    using a suitable SQL-based relational database management system. Using a relational system pro-

    vides performance and reliability features such as efficient index access and maintenance, caching,

    multi-user transactions, access control, back-up and error recovery.

    Table of Contents

    1.0 Introduction..................................................................................................2

    2.0 Inverted Index Storage Optimizations .........................................................2

    3.0 Relational System Requirements.................................................................4

    4.0 Structure of the Relational Database Tables................................................4

    4.1 Structure of the Postings Table ............................................................................. 5

    4.2 Specifying the SQL Table and Index.................................................................... 7

    4.3 Using Separate Word and Postings Tables............................................................ 7

    5.0 Using the Inverted Index..............................................................................8

    5.1 Retrieval of Postings.............................................................................................9

    5.2 Retrieval from Separate Word and Postings Tables.............................................. 9

    5.3 Retrieval with a Limited Document Identifier Range......................................... 10

    6.0 Insertion of Postings ..................................................................................11

    6.1 Update Optimizations ......................................................................................... 11

    6.2 Appending to Existing Postings Lists................................................................. 11

    6.3 Removal of Postings ........................................................................................... 11

    7.0 Sybase Implementation Statistics ..............................................................12

    7.1 Space Used..........................................................................................................12

    7.2 Indexing Speed ................................................................................................... 13

    7.3 Search Speed....................................................................................................... 13

    8.0 Conclusion .................................................................................................14

    9.0 References..................................................................................................14

  • 8/14/2019 Inverted Text Index

    4/16

    2 Using a Relational Database for an Inverted Text Index

    1.0 Introduction

    Efficient implementations of free-text search algorithms often require an inverted index. An inverted index is a data

    structure that maps a search key (e.g. a word) to apostingslist enumerating documents containing the key. It is often

    useful for the postings list to also include the locations of each word occurrence within each document. An index of

    this type allows efficient implementation of boolean, extended boolean, proximity and relevance search algorithms

    [5].

    Because it allows efficient insertion, lookup and deletion, the file-based B-Tree data structure is useful for implement-

    ing a large dynamically updated inverted index [1]. Cutting and Pedersen have developed several space and time op-

    timizations for efficiently maintaining an inverted index using a B-Tree and a heap file [2]. Unfortunately good B-

    Tree packages are not generally available for most computer languages and operating systems, and implementation of

    efficient and reliable B-Tree software is difficult. The algorithms for B-Tree insertion and deletion can be complex

    and difficult to debug, especially for B-Trees with variable length records. Adding the capability for multi-user trans-

    actions and error recovery is even more difficult.

    A relational database management systemis a kind of general purpose database system that represents data as the

    contents of one or more tables. Relational systems typically use B-Trees to maintain indices on the contents of the da-

    tabase tables. The tables can be searched in a wide variety of ways using a general purpose query language such as

    SQL [6]. Several relational systems are commercially available on a wide variety of computer hardware and operat-ing systems. Many relational systems have excellent performance and reliability, and they provide valuable features

    such as distributed database access, multi-user transactions, access control, back-up, and error recovery. Many have

    application programming interfaces that allow convenient integration of database access into computer programs.

    Although relational systems can maintain indices on the contents of database tables, they are not directly suitable for

    free-text search of large document sets. Many relational systems cannot store large (or even small) text documents,

    and the relational systems that can store large documents do not provide for efficient searching of them. However a

    full-text information retrieval system can be built using the database tables in a suitable relational system as a B-Tree-

    like database structure, eliminating the need for separate B-Tree software and gaining the performance, robustness

    and additional features provided by the relational system.

    The methods described in this paper will not turn a relational system by itself into a text retrieval system. Rather the

    relational system serves as an efficient and robust data storage and retrieval system upon which text indexing and re-

    trieval application programs can be built. In situations where a good B-Tree package is available, its use (with the op-

    timizations described in section 2.0) may be more appropriate than using a relational system. Most of the text

    indexing and search algorithms described in this paper can be applied to implementations using either a B-Tree pack-

    age or relational system.

    2.0 Inverted Index Storage Optimizations

    Cutting and Pedersen have developed several space and time optimizations for maintaining an inverted text index us-

    ing a B-Tree and a heap file [2]. These optimizations can be adapted for implementation with a suitable relational sys-

    tem. Cutting and Pedersen use a B-Tree to store a short postings list for each indexed word as shown in Figure 1.

    When a postings list becomes too large for the B-Tree (e.g. more than 16 postings), portions of it arepulsedto a sep-arate heap file. The heap is a binary memory file with contiguous chunks allocated as necessary for the overflow post-

    ings lists. For very long postings lists (several hundred postings), heap chunks are linked together with pointers. Heap

    management software handles allocation and deallocation and keeps track of deallocated chunks for reuse.

  • 8/14/2019 Inverted Text Index

    5/16

    2.0 Inverted Index Storage Optimizations 3

    Figure 1. An Inverted Index using a B-Tree and Heap

    Cutting and Pedersen use several techniques to minimize the space required for storing postings lists. A words post-

    ings list is encoded as a list of integer document identifier and word frequency pairs, followed by a list of integer

    word positions for each document as shown in Figure 2.

    Figure 2. Postings List Encoding Example

    Document identifiers (after the first) within a document list are delta encodedas a difference from the previous iden-

    tifier. For example, the sequence of document identifiers (515, 676, 786, 881) becomes (515, 161, 110, 95) when delta

    encoded. Word positions within a document are also delta encoded.

    The integer values are each byte encoded using a variable-length format such that small values require less storage

    space than large values as shown in Table 1. The high bit of each encoded byte indicates whether additional bytes fol-

    low. The low seven bits of each byte encode a portion of the integer with least significant bytes first.

    Table 1. Space Required to Store Integer Values

    min value max value bits encoded bytes required

    0 127 7 1

    128 16383 14 2

    16384 2097151 21 3

    2097152 268435455 28 4

    Table 2 shows a sequence of document identifiers along with their delta encodings and variable length byte encod-

    ings. The byte values are shown in hexadecimal notation.

    Heap fileB-Tree

    word postings list

    word postings list postings list

    postings list

    with delta encoding:

    with variable length

    integer values:

    byte encoding (hex):

    document list positions list

    12515 1 161 3 2 17 10

    0C83, 04 01 A1, 01 03 02 11 0A

    doc A

    12

    doc Afreq

    515 1

    doc Bfreq

    676 3

    doc B

    2

    doc B

    19

    doc B

    29

    doc A doc Bid id pos pos pos pos

  • 8/14/2019 Inverted Text Index

    6/16

    4 Using a Relational Database for an Inverted Text Index

    Table 2. Byte and Delta Encoded Document Identifiers

    identifier encoding (hex) id encoding (hex)

    doc A 515 83, 04 515 83, 04

    doc B 676 A4, 05 161 A1, 01

    doc C 786 92, 06 110 6E

    doc D 881 F1, 06 95 5F

    doc E 1150 FE, 08 269 8D, 02

    doc F 1182 9E, 09 32 20

    3.0 Relational System Requirements

    Cutting and Pedersens encoding for postings lists can be adapted for use with relational systems that have an applica-

    tions programming interface and provide an efficient variable length binary data type capable of storing values up to

    approximately 250 bytes long.1The performance and space efficiency of the resulting text retrieval system depend on

    various properties of the relational system. For example, space efficiency will be very poor using a relational system

    that consumes a fixed amount of memory for every table value regardless of the value lengths. The methods describedin this paper have been implemented and tested using the SQL Server database management system available from

    Sybase Incorporated using application programs written in the C programming language.2

    In order to use a relational system to implement an efficient inverted text index, several algorithms must be imple-

    mented in application programs. The application programs are necessary to encode, decode and manipulate the post-

    ings lists stored in the relational system. The Sybase DB Library is a set of programming language routines and

    macros that allow an application program to interact with the SQL Server [4].

    The Sybase SQL Server supports Transact-SQL, an enhanced version of the SQL relational database language [3].

    Transact-SQL provides a variable length character data type called varchar and a variable length binary data type

    called varbinary. These data types can store values from 1 to 255 bytes long. For a large index it is important that

    the storage space used by these data types is the actual lengths of the stored values, not the maximum length.

    The amount of storage space required to store an inverted text index in a relational system using these methods can

    vary widely depending on the relational system used and other factors such as the choice of non indexed drop

    words. In experiments with the Sybase SQL Server and a list of 72 drop words, the index size tends to be between

    30% and 50% of the size of the original documents for document sets between 10 and 64 million bytes.

    4.0 Structure of the Relational Database Tables

    It is possible to store the postings lists of an inverted text index as rows in a relational database table rather than using

    a B-Tree and heap file as described in section 2.0. When an appropriate SQL index is created for the table, the SQL

    Server provides the same search and update capabilities as the B-Tree implementation (in fact the SQL Server creates

    a B-Tree to accomplish this).

    1. A character data type can be used if character values are allowed to contain any of the 8-bit character codes from 0 through 255. Although manyrelational systems provide separate data types for storing long character or binary data strings, the methods described in this paper do not requirethem. Note that special long data types provided by some relational systems may be inefficient when used to store small or medium size datastrings.

    2. SYBASE, SQL Server, Transact-SQL, and DB-Library are trademarks of Sybase Inc.

  • 8/14/2019 Inverted Text Index

    7/16

    4.0 Structure of the Relational Database Tables 5

    Rather than allowing postings lists of arbitrary length or linking postings lists together with pointers, long postings

    lists are split into pieces no longer than 255 bytes and stored in adjacent database table rows. Since most words in a

    full text inverted index occur relatively infrequently, most postings lists are short and a single table row per word is

    sufficient to contain the postings for most words. Depending on the document set size, about 5% of the words will re-

    quire two or more table rows to store all the postings. In a large document set, some words may require a hundred or

    more table rows for their postings.

    4.1 Structure of the Postings Table

    An inverted text index can be efficiently stored using a single relational database table with the record structure

    shown in Table 3. The tables wordcolumn contains an indexed word or other key. When a words postings do not fit

    in a single postings block, multiple rows will contain the same word, each with a blockvalue containing a portion of

    the complete postings.

    Table 3. Columns for Combined Word and Postings Table

    column name data type bytes description

    word varchar 100 contains an indexed word

    firstdoc integer 4 contains the lowest document id referenced in the block flags tinyint 1 indicates the block type, length of doc list and/or sequence number

    block varbinary 255 contains encoded document and/or position postings for the word

    The firstdoccolumn contains the lowest document identifier for which postings are contained in the corresponding

    blockvalue.

    There are three types of block encodings which are distinguished by the value stored in the flagscolumn:

    When 0 < flags< 128, the blockvalue contains a the complete encoded postings list as described in section 2.0. The

    flags value contains the number of document/frequency pairs in the document list. With this block type the positions

    list is never split; it is always contained entirely in the block. Figure 3 shows the column values for the example from

    section 2.0.

    Figure 3. Postings Example Using a Single Table Row (0 < flags < 128)

    When flags= 0, the blockcontains only a document list, encoded as described in section 2.0. The number of identifi-

    er/frequency pairs in the document list can be inferred from the length of the block. Any row with flags= 0 must be

    immediately followed by a positions list row with the same wordand firstdocvalues and have a flagsvalue = 128.

    block

    apple

    document list (hex) positions list (hex)

    word firstdoc flags

    515 2 doc A

    0C

    doc A

    83, 04 01

    doc B

    A1, 01 03

    doc B

    02

    doc B

    11

    doc B

    0A

    doc A doc Bposid freq id freq pos pos pos

    (515 encodes as 83, 04)

  • 8/14/2019 Inverted Text Index

    8/16

    6 Using a Relational Database for an Inverted Text Index

    Figure 4. Postings Example Using Two Table Rows (flags = 0 and flags = 128)

    When flags128, the block contains only a positions list, encoded as described in section 2.0. A row with flags128

    is always associated with the document list from a previous row with flags= 0. Figure 4 shows the relationship be-

    tween the document list encoded in one row and the positions list encoded in the following row.

    When a document list becomes too long to encode within a single block, it is split and additional database rows areused. The corresponding positions list is also split. If they are small enough, the remaining portions of the document

    list and positions list can be combined into a new third row (with 0 < flags< 128). Otherwise two new rows are creat-

    ed, one with flags= 0 for the left over documents list and the other with flags= 128 for the corresponding positions

    list.

    For example, assuming the maximum block size is 12 bytes (rather than 255), adding new postings to the example of

    Figure 4 would require the creation of a new row as shown in Figure 5. The rows shown in Figure 4 would remain un-

    changed.

    Figure 5. Added Table Row after a Document List Overflow

    When a positions list overflows, its block is split, but the corresponding document list block is notsplit. A document

    list may refer to postings in several consecutive table rows. These continuation rows are given a firstdocvalue equal

    to the document identifier in effect where the split occurred in the positions list.

    The flagsvalue of a continuation row is set to 128 unless the firstdocvalue is the same as that of the previous row, in

    which case the flagsvalue is set to the previous rows flagsvalue plus one. This is necessary to guarantee unambigu-ous ordering of the continuation blocks. An example of continuation rows is given in Figure 6.

    block

    box doc C

    6E

    doc A

    83, 04 01

    doc B

    A1, 01 02

    doc C

    02

    doc D

    5F

    doc D

    03

    doc A doc B

    word firstdoc flags

    515 0

    document list (hex):

    box doc C

    07

    doc A

    1F B1, 02

    doc B

    6B 42

    doc D

    91, 01

    doc D

    32

    doc D

    0E

    doc B doc C 515 128

    positions list (hex):

    pospos pos pos pos pos pos pos

    idid freq id freq freq id freq

    (515 encodes as 83, 04)

    block

    box

    word firstdoc flags

    1150 2

    (1150 encodes as FE, 08)

    document list (hex) positions list (hex)

    doc E

    55

    doc E

    FE, 08 02

    doc F

    20 01

    doc E

    62

    doc F

    11

    doc E doc F posid freq id freq pos pos

  • 8/14/2019 Inverted Text Index

    9/16

    4.0 Structure of the Relational Database Tables 7

    Figure 6. Postings Example Using Continuation Rows

    4.2 Specifying the SQL Table and Index

    In order to get efficient access to information in a large relational database table, the relational system must be in-

    structed to create an index. The following SQL commands create an empty postings table called postings and build

    an appropriate index for it.

    CREATE TABLE postings

    (word varchar(100) NOT NULL,

    firstdoc int NOT NULL,

    flags tinyint NOT NULL,

    block varchar(255) NOT NULL)

    CREATE CLUSTERED INDEX clust_index ON postings (word, firstdoc, flags)

    It is essential that the index be on the word, firstdoc, and flagsfields in that order. This causes most relational sys-

    tems to create a B-Tree index using these three fields as the key. When the table rows for a given word are retrieved

    using the index, they will be enumerated in ascending order of the firstdocfield. Rows with the same firstdocfield

    will be enumerated in ascending order of the flagsfield.

    Making the index CLUSTERED tells the relational system to place the indexed fields directly into the B-Tree, sav-ing a level of indirection. Retrievals will be faster if the index is clustered.

    4.3 Using Separate Word and Postings Tables

    As a variation, the words may be held in a separate table which isjoinedto the postings table for queries via an inte-

    ger wordnumkey. The separate words table has the advantage of providing a place to store information about each

    word such as number of occurrences of each word as shown in Table 4.

    block

    crate

    word firstdoc flags

    515 0

    document list (hex):

    crate doc C

    D7, 02

    doc A

    82, 01 72

    doc B

    9E, 01 BC, 01

    doc C

    93, 01

    doc B doc C 515 128

    positions list (hex):

    pospos pos pos pos pos

    (515 encodes as 83, 04)

    crate 786 128

    positions list (hex):

    crate 786 129

    positions list (hex):

    (doc Cs id is 786)

    doc D

    17

    doc C

    DD, 01 5E

    doc C

    6B 42

    doc D

    2F

    doc D

    36

    doc C doc Dpospos pos pos pos pos pos

    doc C

    89, 02

    doc C

    E3, 02 86, 01

    doc C

    F3, 03 4D

    doc C

    8A, 02

    doc C doc C pospos pos pos pos pos

    doc C

    6E

    doc A

    83, 04 01

    doc B

    A1, 01 02

    doc C

    0C

    doc D

    5F

    doc D

    04

    doc A doc Bidid freq id freq freq id freq

  • 8/14/2019 Inverted Text Index

    10/16

    8 Using a Relational Database for an Inverted Text Index

    Table 4. Columns for Separate Word Table

    column name type bytes description

    word varchar 100 contains an indexed word

    wordnum integer 4 contains a unique word number

    doc_count integer 4 optional field for number of documents containing the word

    word_count integer 4 optional field for total number of occurrences of the word

    The postings table is the same as described in section 4.1 except it has a wordnumcolumn instead of a wordcolumn

    as shown in Table 5.

    Table 5. Columns for Separate Postings Table

    column name type bytes description

    wordnum integer 4 contains the word number for an indexed word

    firstdoc integer 4 contain the lowest document id referenced in the block

    flags tinyint 1 indicates the block type, length of doc list and/or sequence number

    block varbinary 255 contains encoded document and/or position postings for the word

    The following SQL commands creates an empty words table called wordlist and postings table called numpost-

    ings and builds an appropriate index for each.

    CREATE TABLE wordlist

    (word varchar(100) NOT NULL,

    wordnum integer NOT NULL,

    doc_count integer NOT NULL,

    word_count integer NOT NULL)

    CREATE CLUSTERED INDEX clust_index ON RH_word list (word, wordnum, doc_count, word_count)

    CREATE TABLE numpostings

    (wordnum integer NOT NULL,

    firstdoc int NOT NULL,

    flags tinyint NOT NULL,

    block varchar(255) NOT NULL)

    CREATE CLUSTERED INDEX clust_index ON numpostings (wordnum, firstdoc, flags)

    In practice, additional database tables may be used to store other useful information such as document lengths, docu-

    ment creation dates, and access control information.

    5.0 Using the Inverted Index

    In order to perform a text search operation, an application program must perform several steps. It must retrieve the ta-

    ble rows containing postings for one or more words and decode the posting blocks. The postings lists from each word

    are compared or combined as appropriate for the type of text search being performed.

    For boolean searches, the document lists for each word are combined using set intersection and union operations.

    Proximity search algorithms use the word positions information to select just the documents in which the queried

  • 8/14/2019 Inverted Text Index

    11/16

    5.0 Using the Inverted Index 9

    words occur in a particular relationship. In both kinds of searches, the resulting documents can be priority ranked ac-

    cording to the number of times the queried words occurred, possibly weighted by parameters such as the document

    lengths and importance of each query word.

    5.1 Retrieval of Postings

    The postings for a given word are retrieved from the relational system using an appropriate SQL query expression.For example, the SQL query below retrieves the rows for the word box from the postings table named postings:

    SELECT word, firstdoc, flags, block FROM postings

    WHERE word = box

    ORDER BY firstdoc, flags

    In practice, the ORDER BY clause is not required because the rows are already properly ordered in the clustered in-

    dex. With some relational systems, using ORDER BY may impose a performance penalty. The resulting rows are

    shown in Table 6. The first row contains a document list (flags= 0), the second contains the corresponding word po-

    sitions (flags128), and the third row contains combined document and positions list for the last two documents

    (flags= 2).

    Table 6. Results of Search for Postings of the Word box

    word firstdoc flags block

    box 515 0 830401A101026E025F03

    box 515 128 1FB1026B42079101320E

    box 1150 2 FE08022001556211

    Many text search queries do not require word position information (e.g. simple boolean queries). For these queries,

    the SQL expression can be modified to retrieve only those rows that contain document lists (i.e. with flags< 128).

    Also it is not necessary to retrieve the wordfield, since it is known in advance. The rows retrieved by the following

    SQL query are shown in Table 7:

    SELECT firstdoc, flags, block FROM postings

    WHERE word = box

    AND flags < 128

    Table 7. Results of Search for Document Lists Only

    firstdoc flags block

    515 0 830401A101026E025F03

    1150 2 FE08022001556211

    5.2 Retrieval from Separate Word and Postings Tables

    When separate word and postings tables are used as described in section 4.3, the two tables are joined as in the fol-

    lowing SQL query, which produces the same results as the previous example:

    SELECT firstdoc, flags, block FROM wordlist, numpostings

    WHERE word = box

    AND wordlist.wordnum = numpostings.wordnum

    AND flags < 128

  • 8/14/2019 Inverted Text Index

    12/16

    10 Using a Relational Database for an Inverted Text Index

    5.3 Retrieval with a Limited Document Identifier Range

    By specifying a constraint on the firstdoccolumn, it is possible to construct an SQL query to retrieve word postings

    that occur within a specified range of document identifiers. This is useful if document identifiers are assigned in a

    meaningful order, such as by creation or entry date. For example, the following SQL query retrieves only rows con-

    taining document identifiers between 1160 and 1170:

    SELECT firstdoc, flags, block FROM postings

    WHERE word = box

    AND firstdoc >=

    (SELECT max (firstdoc) FROM postings

    WHERE word = box AND flags < 128 AND firstdoc =

    (SELECT max (firstdoc) FROM postings

    WHERE word = box AND flags < 128 AND firstdoc

  • 8/14/2019 Inverted Text Index

    13/16

    6.0 Insertion of Postings 11

    6.0 Insertion of Postings

    The inverted index implementation described here is designed so postings from new documents can be efficiently

    added to the index at any time. A limitation of this implementation is that document identifiers must be added to the

    index in ascending order.

    6.1 Update Optimizations

    Cutting and Pedersen have developed efficient algorithms for creating and updating an inverted index by buffering a

    large number of postings lists in main memory and periodically merging them into the externally stored index [2].

    This merge updateoptimization dramatically reduces the number of secondary storage accesses required to index

    new documents. This optimization is equally important when storing the inverted index in a relational system.

    At any given time, the final row(s) containing the document list and positions list for a large number of words should

    be kept in main memory. Whenever a rows blockbecomes filled (to its maximum of 255 bytes), it is stored into the

    database and a new empty row begun.

    If the relational systems application programming interface provides efficient bulk data copy primitives, they should

    be used for inserting new rows into the database tables.

    6.2 Appending to Existing Postings Lists

    In order to append new postings to those already stored for a word, the table rows containing the last document list

    and positions list for the word must first be retrieved into main memory. The new postings are appended to the re-

    trieved lists and the updated table rows eventually stored back to the database.

    The following SQL expression retrieves just the rows containing the last document list and positions list for the word

    box. The resulting row from the postings table is shown in Table 10.

    SELECT firstdoc, flags, block FROM postings

    WHERE word = boxAND firstdoc >=

    (SELECT max (firstdoc) FROM postings

    WHERE word = box AND flags < 128)

    The result may either be a single row containing document and positions lists (with 0 flags< 128) or multiple rows,

    the first containing the document list (with flags= 0) and the rest containing the positions lists (with flags128).

    Only the last row of positions needs to be kept in main memory.

    Table 10. Results of SQL Query to Retrieve Last Document and Positions Lists for the Word box

    firstdoc flags block

    1150 2 FE08022001556211

    6.3 Removal of Postings

    The fact that the index is inverted according to individual words makes the removal of a single document inefficient.

    The technique described in section 5.3 can be used to reduce the number of rows which must be examined looking for

    the documents identifier, but single document removal is still inefficient. Note that removing a range of document

    identifiers from the index can be done for about the same cost as removing a single document.

  • 8/14/2019 Inverted Text Index

    14/16

    12 Using a Relational Database for an Inverted Text Index

    If document removal is infrequent, an auxiliary table can be used to hold the removed document identifiers. Often

    there is already a table containing an entry for each document. If so, this feature can be added for very little cost.

    When it is known in advance what identifier ranges will be removed, the insertion algorithm can be modified to ar-

    range the database so the range boundaries always occur between rows. This will cause less efficient space utilization,

    but removing all documents between the range boundaries can then be done with a single SQL expression rather than

    an exhaustive search of individual word postings.

    For example, if the indexing software started new table rows for 1991s documents (starting with document identifier

    1302), the following SQL expression will remove all document postings from the year 1990 or earlier:

    DELETE postings

    WHERE firstdoc = 1 Jan 1991)

    7.0 Sybase Implementation Statistics

    The methods described in this paper have been implemented on a Sun Unix workstation using a Sybase SQL Server.

    Text indexing and search programs were implemented in the C programming language using the Sybase DB-Library

    package. The indexing program is 2000 lines long and the boolean text search program is 1200 lines long.

    Two electronic encyclopedia databases were used to test the software: The Alphapedia section of the 1990 Random

    House Encyclopedia (18,000 articles, 126,000 words) and Groliers Academic Encyclopedia (31,000 articles,

    780,000 words). An inverted index was created for the words (except for a list of 72 drop words) occurring in each

    encyclopedia. A separate index was created for the article titles in each encyclopedia, as well as an index for the Ran-

    dom House article categories.

    7.1 Space Used

    The amount of storage space used by a relational system to store an inverted index depends on the efficiency of its in-

    ternal data and index storage. For each of the two encyclopedias, Table 11 shows the original text size, the number of

    indexed words, the number of distinct words, and the number of table rows and space required to encode the inverted

    index.

    Table 11. Size of Full Text Encoded Inverted Indices

    table name text size words vocabulary rows table size (%text) Sybase-1 Sybase-2 (%text)

    RH_word 7856 KB 844753 59141 65334 3632 KB (46%) 6374 KB 4398 KB (56%)

    GR_word 63256 KB 5035874 139870 191148 17816 KB (28%) 31142 KB 20908 KB (33%)

    Three index sizes are given. The column labeled table size is the amount of storage space required using the encod-ings described in this paper, assuming no additional storage or indexing overhead. The column labeled Sybase-1 is

    the actual database table size reported by the Sybase SQL Server after the table rows were added to the relational da-

    tabase by the document indexing program.

    It was discovered that for the Sybase SQL Server, 30% less space is required for the same database tables when they

    are copied using the Sybase bulk database copy program, apparently because the data is then stored more compactly.

    The column labeled Sybase-2 shows the reduced amount of space resulting from this technique. It may be possible

  • 8/14/2019 Inverted Text Index

    15/16

    7.0 Sybase Implementation Statistics 13

    to achieve similar space savings in the incremental indexing program by using the available DB-Library bulk copy

    routines.

    The encoding techniques presented here are most efficient for indexing the words of text documents, however the

    same database structure can also be used for a smaller inverted index of document titles, keywords, or category codes.

    Table 12 shows the space required to encode inverted indices for the encyclopedia article titles and category codes.

    Complete article titles were indexed separately from individual title words.

    Table 12. Size of Title and Category Encoded Inverted Indices

    table name text size words vocabulary rows index size (%text) Sybase-1 Sybase-2 (%text)

    RH_category 105 KB 17938 55 317 58 KB (55%) 112 KB 80 KB (76%)

    RH_titleword 290 KB 34762 17006 17050 392 KB (135%) 766 KB 528 KB (182%)

    RH_title 290 KB 17938 17632 17632 456 KB (157%) 892 KB 620 KB (214%)

    GR_titleword 590 KB 73899 26170 26396 672 KB (114%) 1292 KB 908 KB (154%)

    GR_title 590 KB 30762 30388 30388 880 KB (149%) 1658 KB 1164 KB (197%)

    7.2 Indexing Speed

    Indexing speed depends almost entirely on the speed with which rows can be retrieved and stored into the relational

    system. Our indexing program was able to create an inverted index for the Random House Encyclopedia in 37 min-

    utes1. This is the equivalent of 380 words per second or about 14 megabytes of text per hour. Due to overhead associ-

    ated with writing out and later retrieving buffered postings when main memory becomes full, the larger Groliers

    Encyclopedia was indexed at a rate of 140 words per second (6.4 megabytes of text per hour). Better buffering and

    the use of bulk copy routines would improve the indexing speed.

    7.3 Search Speed

    Text search speed using the inverted index depends on the speed with which rows can be retrieved from the relational

    system and the postings lists decoded and combined by the search program. Our boolean text search program per-forms separate SQL queries for each word in the search query, decodes the results and combines the document identi-

    fier lists using union and intersection operations2. The resulting document list is sorted so the documents with the

    most word matches occur at the head of the list.

    Table 13 lists several boolean text search queries performed using the Groliers Encyclopedia index, and the amount

    of time taken to compute the matching document identifiers. The % at the end of a search query term is the SQL

    wildcard character used to match multiple words with the same prefix (e.g. newt% matches newt, newton,

    newtonian, etc.). The & character indicates boolean AND operator and the | character indicates the OR operator.

    For each search, Table 13 lists the number of matching documents, the number of SQL queries required, the number

    of table rows retrieved, and the total number of document identifiers processed. The column labeled 1st time indi-

    cates the number of seconds required for a query that has not been performed recently. Due to caching in the SQL

    server, immediately repeating the query results in a faster response as shown in the column labeled 2nd time.

    The timings shown include the time to construct and perform the SQL queries, decode the results, and combine the

    document identifier lists from multiple rows, but do not include the time required to initialize the SQL server connec-

    tion and rank the document lists.

    1. Running on a Sun 4/490 SPARC processor with 32 megabytes of main memory.

    2. Multiple term queries could be made faster by retrieving the postings for all the terms using a single SQL query.

  • 8/14/2019 Inverted Text Index

    16/16

    14 Using a Relational Database for an Inverted Text Index

    Table 13. Boolean Text Search Speed

    text search query matches queries rows docs 1st time 2nd time

    xxxxx 0 1 0 0 0.09 0.02

    newt 7 1 1 7 0.13 0.02

    newt% 205 1 11 205 0.36 0.07

    fight 197 1 2 197 0.18 0.06

    battle 758 1 7 758 0.11 0.02

    california 785 1 7 785 0.11 0.03

    general 2235 1 18 2235 0.29 0.07

    war 4170 1 34 4170 0.37 0.11

    century 5471 1 44 5471 0.37 0.12

    war & century 1257 2 78 9641 0.59 0.36

    california & war & century 79 3 85 10426 0.89 0.38

    fight & battle & california & general & war & century 4 6 112 13616 1.30 0.54

    (fight | battle) & california & general & war & century 14 6 112 13616 1.14 0.56

    8.0 Conclusion

    Information retrieval systems often provide full text search using B-Trees and heap files, but reliable high perfor-

    mance B-Tree software is not available for many computing environments and it is difficult to implement. It is possi-

    ble to implement an efficient inverted index using database tables in a relational database management system as a

    substitute for B-Trees and heap files. Using a relational systems application programming interface, it is possible to

    create text indexing and search programs with performance and space efficiency similar to that obtained using a B-

    Tree package. Available relational systems provide high performance along with valuable features not normally

    found in a B-Tree package, such as multi-user transactions, access control, reliable back-up, and automatic error re-

    covery.

    9.0 References

    [1] R. Bayer and E. McCreight. Organization and Maintenance of Large Ordered Indexes. Acta Informatica,

    1:173189, 1972.

    [2] D. Cutting and J. Pedersen. Optimizations for Dynamic Inverted Index Maintenance. Proceedings of the 13th

    International Conference on Research and Development in Information Retrieval, pp. 405411, September 1990.

    [3] M. Darnovsky and J. Bowman. Transact-SQL Users Guide, SYBASE Release 4.0, Sybase Inc., May 1989.

    [4] S. Goodman and A. Cohen. Open Client DB-Library Reference Manual,SYBASE Release 4.0, Sybase Inc., May

    1989

    [5] G. Salton and M. McGill.Introduction to Modern Information Retrieval.McGraw Hill, 1983.

    [6] J. H. Trimble, Jr. and D. Chappell.A Visual Introduction to SQL. John Wiley & Sons, 1989.