
Information Retrieval (IR) - TUCpetrakis/courses/multimedia/retrieval.pdf · E.G.M Petrakis Multimedia Information Retrieval 1 ... they can be “similar” ... The whole collection


E.G.M Petrakis Multimedia Information Retrieval


Information Retrieval (IR)

Deals with the representation, storage and retrieval of unstructured data

Topics of interest: systems, languages, retrieval, user interfaces, data visualization, distributed data sets

Classical IR deals mainly with text

The evolution of multimedia databases and of the web has given new interest to IR


Multimedia IR

A multimedia information system that can store and retrieve

attributes

text

2D grey-scale and color images

1D time series

digitized voice or music

video


Applications

Financial, marketing: stock prices, sales etc.

e.g., find companies whose stock prices move similarly

Scientific databases: sensor data, e.g., weather, geological, environmental data

Office automation, electronic encyclopedias, electronic books

Medical databases: X-rays, CT, MRI scans

Criminal investigation: suspects, fingerprints

Personal archives: text and color images


Queries

Formulation of the user’s information need

Free-text query

By example document (e.g., text, image)

e.g., in a collection of color photos find those showing a tree close to a house

Retrieval is based on the understanding of the content of documents and of their components


Goals of Retrieval

Accuracy: retrieve the documents that the user expects in the answer

with as few incorrect answers as possible

all relevant answers are retrieved

Speed: retrieval has to be fast

the system responds in real time


Accuracy of Retrieval

Depends on query criteria, query complexity and specificity

attributes / content criteria?

Depends on what is matched with the query

how document content is represented

typically feature vectors

Depends also on the matching function

e.g., Euclidean distance
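As a minimal sketch of the matching function named above, the following computes the Euclidean distance between two feature vectors. The function name and the sample vectors are illustrative assumptions, not part of the slides.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical feature vectors for a query and a stored document.
query_vector = [0.0, 0.0]
doc_vector = [3.0, 4.0]
print(euclidean_distance(query_vector, doc_vector))  # 5.0
```

The lower the value, the closer the stored document is to the query in feature space.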


Speed of Retrieval

The query is compared (matched) with all stored documents

The definition of similarity criteria is an important issue

similarity or distance function between documents

Matching has to be fast: document matching has to be computationally efficient and sequential searching must be avoided

Indexing: search documents that are likely to match the query


Main Idea

Descriptions are extracted and stored

same or separate storage with documents

Queries address the stored descriptions rather than the documents themselves

Images & video: the majority of stored data

Solution: retrieval by text

text retrieval is a well-researched area

most systems work with text, attributes

[Figure: a query, an index, and a database holding descriptions (Doc1-FeatureVector, Doc2-FeatureVector, …, DocN-FeatureVector) and documents (Doc1, Doc2, …, DocN)]


Problem

Text descriptions for images and video contained in documents are not available

e.g., the text in a web site is not always descriptive of every particular image contained in the web site

Two approaches

human annotations

feature extraction


Human Annotations

Humans insert attributes, captions for images and videos

inconsistent or subjective over time and across users (different users do not give the same descriptions)

retrievals will fail if queries are formulated using different keywords or descriptions

time consuming process, expensive or impossible for large databases


Feature Extraction

Features are extracted from audio, image and video

cheaper and replicable approach

consistent descriptions

but inexact

low-level features (patterns, colors etc.)

difficult to extract meaning

different techniques for different data


Furht et al. 96

Architecture of an IR System for Images


Multimedia IR in Practice

Relies on text descriptions

Non-text IR

there is nothing similar to text-based retrieval systems

automatic IR is sometimes impossible

requires human intervention at some level

significant progress has been made

domain-specific interpretations


Ranking of IR Methods

Ranking in terms of complexity and accuracy: attributes, text, audio, image, video

Combined IR based on two or more data types for more accurate retrievals

Complex data types (e.g., video) are rich sources of information

[Figure: the ranking is annotated with the labels accuracy and complexity]


Similarity Searching

IR is not an exact process

two documents are never identical!

they can be “similar”

searching must be “approximate”

The effectiveness of IR depends on

the types and correctness of descriptions used

types of queries allowed

user uncertainty as to what he is looking for

efficiency of search techniques


Approximate IR

A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity

Two common types of similarity queries:

range queries: retrieve all documents up to a distance “threshold” T

nearest-neighbor queries: retrieve the k best matches
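The two query types can be sketched over a small in-memory collection of feature vectors. This is an illustrative sequential-scan version under the assumption of Euclidean distance; the names `range_query`, `knn_query` and the sample collection are hypothetical.

```python
import heapq
import math


def distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def range_query(query, docs, threshold):
    """All documents within distance `threshold` of the query, nearest first."""
    hits = [(distance(query, vec), doc_id) for doc_id, vec in docs.items()]
    return sorted(hit for hit in hits if hit[0] <= threshold)


def knn_query(query, docs, k):
    """The k best matches (smallest distances), nearest first."""
    scored = ((distance(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nsmallest(k, scored)


# Hypothetical collection: document id -> feature vector.
docs = {"doc1": (0.0, 0.0), "doc2": (1.0, 0.0), "doc3": (3.0, 4.0)}
print(range_query((0.0, 0.0), docs, threshold=1.0))  # doc1 and doc2
print(knn_query((0.0, 0.0), docs, k=2))              # doc1 and doc2
```

Both functions scan the whole collection, which is exactly the cost that indexing is meant to avoid.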


Distance - Similarity

Decide whether two documents are similar

distance: the lower the distance, the more similar the documents are

similarity: the higher the similarity, the more similar the documents are

Key issue for successful retrieval: the more accurate the descriptions are, the more accurate the retrieval is


Retrieval Quality Criteria

Retrieval with as few errors as possible

Two types of errors:

false dismissals or misses: qualifying but non-retrieved documents

false positives or false drops: retrieved but non-qualifying documents

A good method minimizes both

Ranking quality: retrieve qualifying documents before non-qualifying ones


Evaluation of Retrieval

“Given a collection of documents, a set of queries and human expert’s responses to the above queries, the ideal system will retrieve exactly what the human dictated”

The deviations from the above are measured

$$\text{recall} = \frac{\text{number of relevant retrieved in answer}}{\text{total number of relevant in the collection}}$$

$$\text{precision} = \frac{\text{number of relevant retrieved in answer}}{\text{number of retrieved}}$$
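A small sketch of the two measures, assuming the answer and the expert's relevant set are given as collections of document ids. The function name and the sample ids are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids the human expert marked as relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical answer of 4 documents; 5 documents are relevant in the collection.
print(precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10]))  # (0.5, 0.4)
```

Returning 0.0 for an empty answer or an empty relevant set is a convention chosen here to avoid division by zero.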


Precision-Recall Diagram

Measure precision and recall for 1, 2, …, N answers

High precision means few false alarms

High recall means few false dismissals

[Figure: precision-recall diagram; axes precision and recall, each marked at 0.5 and 1; the ideal method is marked]


Harmonic Mean

A single measure combining precision and recall

$$F = \frac{2}{\frac{1}{r} + \frac{1}{p}}$$

r: recall, p: precision

F takes values in [0,1]

F → 1 as more retrieved documents are relevant

F → 0 as few retrieved documents are relevant

F is high when both precision and recall are high

F expresses a compromise between precision and recall
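The harmonic mean can be computed directly from the definition. This sketch assumes F = 0 when either precision or recall is 0, a boundary convention that the slide does not state.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision p and recall r: F = 2 / (1/r + 1/p)."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention; the formula is undefined here
    return 2 / (1 / recall + 1 / precision)


print(f_measure(1.0, 1.0))   # 1.0
print(f_measure(0.5, 0.5))   # 0.5
print(f_measure(1.0, 0.25))  # 0.4
```

The last call shows the compromise: one high value cannot compensate for a low one as it would with an arithmetic mean.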


Ranking Quality

The higher the Rnorm, the better the ability of a method to retrieve correct answers before incorrect ones

A human expert evaluates the answer set

Then take answers in pairs: (relevant, irrelevant)

S+: pairs ranked correctly (the relevant entry has been retrieved before the irrelevant one)

S-: pairs ranked incorrectly

S+max: total ranked pairs

$$R_{norm} = \begin{cases} \frac{1}{2}\left(1 + \frac{S^{+} - S^{-}}{S^{+}_{max}}\right) & \text{if } S^{+}_{max} > 0 \\ 1 & \text{otherwise} \end{cases}$$
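A sketch of the ranking-quality measure for a strictly ordered answer list. It assumes there are no ties in the ranking, so the total number of ranked (relevant, irrelevant) pairs is taken to be S+ plus S-. The function name and the input format (one Boolean judgment per answer, in rank order) are illustrative.

```python
def r_norm(judgments):
    """Rnorm for a ranked answer list.

    judgments: expert relevance judgments (True/False) in rank order,
               best-ranked answer first. Ties are not modeled.
    """
    s_plus = 0    # pairs with the relevant answer ranked first
    s_minus = 0   # pairs with the irrelevant answer ranked first
    relevant_seen = 0
    irrelevant_seen = 0
    for is_relevant in judgments:
        if is_relevant:
            s_minus += irrelevant_seen  # irrelevant answers ranked above it
            relevant_seen += 1
        else:
            s_plus += relevant_seen     # relevant answers ranked above it
            irrelevant_seen += 1
    s_max = s_plus + s_minus            # assumption: no ties
    if s_max == 0:
        return 1.0
    return 0.5 * (1 + (s_plus - s_minus) / s_max)


print(r_norm([True, True, False, False]))   # 1.0
print(r_norm([True, False, True, False]))   # 0.75
print(r_norm([False, False, True, True]))   # 0.0
```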


Searching Method

Sequential scanning: the query is matched with all stored documents

slow if the database is large or if matching is slow

The documents must be indexed

hashing, inverted files, B-trees, R-trees

different data types are indexed and searched separately


Goals of IR

Summarizing, there are two general goals common to all IR systems

effectiveness: IR must be accurate (retrieves what the user expects to see in the answer)

efficiency: IR must be fast (faster than sequential scanning)


Query Formulation

Must be flexible and convenient

SQL queries

constraints on all attributes and data types

may become very complex

Queries by example

e.g., by providing an example document or image

Browsing: display headers, summaries, miniatures etc., also for refining the retrieved results


Access Methods for Text

Access methods for text are interesting for at least 3 reasons

multimedia documents contain text, e.g., images often have captions

text retrieval has several applications in itself (library automation, web search etc.)

text retrieval research has led to useful ideas like the vector space model, information filtering, relevance feedback etc.


Text Queries

Single or multiple keyword queries

context queries: phrases, word proximity

boolean: keywords with and, or, not

Natural language: free text queries

Structured search also takes text structure into account and can be:

flat or hierarchical, for searching in titles, paragraphs, sections, chapters

hypertext: combines content-connectivity


Keyword Matching

The whole collection is searched

no preprocessing

no space overhead

updates are easy

The user specifies a string (regular expression) and the text is parsed using a finite state automaton

KMP, BMH algorithms
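As an illustration of one of the named algorithms, here is a compact sketch of KMP (Knuth-Morris-Pratt) search for a plain string. It needs no preprocessing of the text itself, only of the pattern, which is the point made above. The function names are illustrative, and regular expressions are not handled.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Start positions of all (possibly overlapping) occurrences of pattern."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches


print(kmp_search("abababa", "aba"))                # [0, 2, 4]
print(kmp_search("information retrieval", "ret"))  # [12]
```

Each text character is read once, so the scan is linear in the size of the collection.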


Error Tolerant Methods

Methods that can tolerate errors

Scan text one character at a time by keeping track of matched characters and retrieve strings within a desired editing distance from the query

Wu, Manber 92; Baeza-Yates, Gonnet 92

Extension: regular expressions built up from strings and operators: “pro (blem | tein) (s | ε) | (0 | 1 | 2)”


Error Counting Methods

Retrieve similar words or phrases

e.g., misspelled words or words with different pronunciation

editing distance: a numerical estimate of the similarity between 2 strings

phonetic coding: search words with similar pronunciation (Soundex, Phonix)

N-grams: count common N-length substrings
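The N-gram idea can be sketched by counting the N-length substrings that two words share. The choice N = 2 and the use of multiset counting are assumptions made for this illustration.

```python
from collections import Counter


def ngrams(word, n=2):
    """Multiset of all n-length substrings of word."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))


def common_ngrams(a, b, n=2):
    """Number of n-grams shared by a and b (counted with multiplicity)."""
    return sum((ngrams(a, n) & ngrams(b, n)).values())


# A misspelling still shares most of its bigrams with the correct word.
print(common_ngrams("retrieval", "retreival"))  # 5
print(common_ngrams("cordis", "codis"))         # 3
```

A higher count suggests more similar words; in practice the count is usually normalized by the number of n-grams in each word.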


Editing Distance

Minimum number of edit operations that are needed to transform the first string into the second

edit operations: insert, delete, substitute and (sometimes) transposition of characters

d(si,ti): distance between characters

usually d(si,ti) = 1 if si ≠ ti and 0 otherwise

edit(cordis, codis) = 1: r is deleted

edit(cordis, codris) = 1: transposition of rd


Damerau-Levenshtein Algorithm

Basic recurrence relation (DP algorithm)

$$edit(0,0) = 0, \quad edit(i,0) = i, \quad edit(0,j) = j$$

$$edit(i,j) = \min \begin{cases} edit(i-1,j) + 1 \\ edit(i,j-1) + 1 \\ edit(i-1,j-1) + d(s_i,t_j) \\ edit(i-2,j-2) + d(s_i,t_{j-1}) + d(s_{i-1},t_j) + 1 \end{cases}$$
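The recurrence can be turned into a short dynamic-programming routine. This is a sketch under the unit-cost assumption d = 1 for different characters and 0 otherwise; the function name `edit_distance` is illustrative, and the transposition term is only applied when both indices are at least 2.

```python
def edit_distance(s, t):
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    n, m = len(s), len(t)
    edit = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        edit[i][0] = i  # delete the first i characters of s
    for j in range(m + 1):
        edit[0][j] = j  # insert the first j characters of t

    def d(a, b):
        return 0 if a == b else 1

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                edit[i - 1][j] + 1,                           # deletion
                edit[i][j - 1] + 1,                           # insertion
                edit[i - 1][j - 1] + d(s[i - 1], t[j - 1]),   # substitution or match
            )
            if i >= 2 and j >= 2:
                # transposition of the last two characters
                best = min(
                    best,
                    edit[i - 2][j - 2]
                    + d(s[i - 1], t[j - 2])
                    + d(s[i - 2], t[j - 1])
                    + 1,
                )
            edit[i][j] = best
    return edit[n][m]


print(edit_distance("cordis", "codis"))   # 1 (r is deleted)
print(edit_distance("cordis", "codris"))  # 1 (transposition of rd)
```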


Matching Algorithm

Compute D(A,B)

#A, #B: lengths of A, B

0: null symbol

R: cost of an edit operation

D(0,0) = 0

for i = 1 to #A: D(i,0) = D(i-1,0) + R(A[i]→0);

for j = 1 to #B: D(0,j) = D(0,j-1) + R(0→B[j]);

for i = 1 to #A

for j = 1 to #B {

1. m1 = D(i,j-1) + R(0→B[j]);

2. m2 = D(i-1,j) + R(A[i]→0);

3. m3 = D(i-1,j-1) + R(A[i]→B[j]);

4. D(i,j) = min{m1, m2, m3};

}


i         0   1   2   3   4   5   6
    j         a   b   b   c   d   d
0         0   1   2   3   4   5   6
1   a     1   0   1   2   3   4   5
2   a     2   1   1   2   3   4   5
3   a     3   2   2   2   3   4   5
4   b     4   3   2   2   3   4   5
5   c     5   4   3   3   2   3   4
6   c     6   5   4   4   3   3   4
7   c     7   6   5   5   4   4   4
8   d     8   7   6   6   5   4   4

First row and first column: initialization cost

Bottom-right cell, D(8,6) = 4: total cost
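The matching algorithm and the cost matrix above can be reproduced with a short sketch. It assumes unit costs (R = 1 for every insertion, deletion and substitution, 0 for a match) and represents the null symbol as `None`. The strings "aaabcccd" and "abbcdd" are read off the row and column labels of the matrix; the function names are illustrative.

```python
def unit_cost(x, y):
    """Assumed cost function R: 0 for a match, 1 for any edit.
    None stands for the null symbol."""
    return 0 if x == y else 1


def matching_matrix(a, b, cost=unit_cost):
    """Fill the matrix D, where D[i][j] is the cost of matching a[:i] with b[:j]."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                        # initialization: deletions
        D[i][0] = D[i - 1][0] + cost(a[i - 1], None)
    for j in range(1, m + 1):                        # initialization: insertions
        D[0][j] = D[0][j - 1] + cost(None, b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            m1 = D[i][j - 1] + cost(None, b[j - 1])          # insert b[j]
            m2 = D[i - 1][j] + cost(a[i - 1], None)          # delete a[i]
            m3 = D[i - 1][j - 1] + cost(a[i - 1], b[j - 1])  # substitute or match
            D[i][j] = min(m1, m2, m3)
    return D


D = matching_matrix("aaabcccd", "abbcdd")
for row in D:
    print(row)
print("total cost:", D[-1][-1])  # total cost: 4
```

Passing a different `cost` function gives a weighted variant, for example one that charges less for substituting characters that are often confused.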


Classical IR Models

Boolean model

simple, based on set theory

queries as Boolean expressions

Vector space model

queries and documents as vectors in term space

Probabilistic model

a probabilistic approach


Text Indexing

A document is represented by a set of index terms that summarize document contents

adjectives, adverbs, connectives are less useful

mainly nouns (lexicon look-up)

requires text pre-processing (off-line)


Indexing

[Figure: an index over the data repository]


Text Preprocessing

Extract index terms from text

Processing stages

word separation, sentence splitting

change terms to a standard form (e.g., lowercase)

eliminate stop-words (e.g., and, is, the, …)

reduce terms to their base form (e.g., eliminate prefixes, suffixes)

construct mapping between terms and documents (indexing)
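The processing stages can be sketched as a small pipeline. The stop-word list and the suffix-stripping rule below are deliberately naive stand-ins chosen for illustration (a real system would use a full stop list and a proper stemmer), and sentence splitting is omitted.

```python
import re
from collections import defaultdict

STOP_WORDS = {"and", "is", "the", "a", "of", "in", "to"}  # illustrative list
SUFFIXES = ("ing", "es", "s")  # naive stand-in for a real stemmer


def stem(term):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def index_terms(text):
    """Lowercase, separate words, drop stop-words, reduce to a base form."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(word) for word in words if word not in STOP_WORDS]


def build_index(docs):
    """Mapping between terms and the ids of the documents containing them."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


print(index_terms("The index is searching documents"))
# ['index', 'search', 'document']
```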


Text Preprocessing Chart

from Baeza-Yates & Ribeiro-Neto, 1999


Text Indexing Methods

Main indexing methods:

inverted files

signature files

bitmaps

Size of index may exceed size of actual collection

compress index to reduce storage

typically 40% of size of collection


Inverted Files

Dictionary: terms stored alphabetically

Posting list: pointers to documents containing the term

Posting info: document id, frequency of occurrence, location in document, etc.

dictionary indexed by B-trees, tries or binary searched

Pros: fast and easy to implement

Cons: large space overhead (up to 300%)


Inverted Index

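[Figure: an index of alphabetically sorted Greek terms (άγαλμα "statue", αγάπη "love", …, δουλειά "work", …, πρωί "morning", …, ωκεανός "ocean") pointing to posting lists such as (1,2)(3,4), (4,3)(7,5) and (10,3), which in turn point to the documents numbered 1, 2, …, 11, …]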


Query Processing

1. Parse query and extract query terms

2. Dictionary lookup

binary search

3. Get postings from inverted file

one term at a time

4. Accumulate postings

record matching document ids, frequencies, etc.

compute weights

5. Rank answers by weights (e.g., frequencies)
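The five steps above can be sketched against a tiny in-memory inverted file. The posting format (document id, frequency), the use of raw frequency sums as weights, and all names are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter, defaultdict


def build_inverted_file(docs):
    """Return (dictionary, postings): a sorted term list and, for each term,
    a posting list of (document id, frequency) pairs."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        term_counts = Counter(docs[doc_id].lower().split())
        for term, freq in term_counts.items():
            postings[term].append((doc_id, freq))
    return sorted(postings), postings


def in_dictionary(dictionary, term):
    """Step 2: binary search in the alphabetically sorted dictionary."""
    pos = bisect_left(dictionary, term)
    return pos < len(dictionary) and dictionary[pos] == term


def process_query(query, dictionary, postings):
    scores = Counter()
    for term in query.lower().split():            # 1. extract query terms
        if in_dictionary(dictionary, term):       # 2. dictionary lookup
            for doc_id, freq in postings[term]:   # 3. postings, one term at a time
                scores[doc_id] += freq            # 4. accumulate weights
    # 5. rank answers by weight, highest first (ties broken by document id)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    1: "image retrieval by image content",
    2: "text retrieval",
    3: "video indexing",
}
dictionary, postings = build_inverted_file(docs)
print(process_query("image retrieval", dictionary, postings))  # [(1, 3), (2, 1)]
```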


Signature Files

A filter for eliminating most of the non-qualifying documents

signature: short hash-coded representation of document or query

the signatures are stored and searched sequentially

search returns all qualifying documents plus some false alarms

the answer is searched to eliminate false alarms


Superimposed Coding

Each word yields a bit pattern of size F with m bits set to 1 and the rest left as 0

size of F affects the false drop rate

bit patterns are OR-ed to form the doc signature

find documents having 1 in the same bits as the query

word                 signature

data                 001 000 110 010

base                 000 010 101 001

document signature   001 010 111 011
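A sketch of superimposed coding with F = 12 and m = 4, matching the sizes in the table above. The way bit positions are derived from a hash (SHA-256 of the word plus a counter) is an assumption made for this illustration, so the generated word signatures will differ from the ones in the table.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bits set to 1 per word (must not exceed F)


def word_signature(word, size=F, bits=M):
    """Bit pattern of `size` bits with exactly `bits` bits set, derived by hashing."""
    signature = 0
    counter = 0
    while bin(signature).count("1") < bits:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % size)
        counter += 1
    return signature


def document_signature(words):
    """OR the word signatures together to form the document signature."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def may_contain(doc_signature, word):
    """True if the document has 1 in every bit position set in the query word.
    A True answer may be a false drop; a False answer is always correct."""
    query = word_signature(word)
    return doc_signature & query == query


doc = document_signature(["data", "base"])
print(may_contain(doc, "data"))  # True

# The OR step on the signatures from the table:
print(format(0b001000110010 | 0b000010101001, "012b"))  # 001010111011
```

Documents that pass the filter still have to be checked against the actual text to eliminate false drops.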


Bitmaps

For every term, a bit vector is stored

each bit corresponds to a document

set to 1 if term appears in document, 0 otherwise

e.g., “word” 10100: the “word” appears in documents 1 and 3 in a collection of 5 documents

Very efficient for Boolean queries

retrieve bit-vectors for each query term

combine bit vectors with Boolean operators
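Bit vectors map naturally onto integers, so Boolean queries become bitwise operations. In this sketch the leftmost bit stands for document 1, as in the "word" 10100 example; the second term and the helper names are hypothetical.

```python
N_DOCS = 5  # size of the collection in the example


def bitmap(doc_ids, n_docs=N_DOCS):
    """Bit vector with a 1 for every document containing the term.
    The leftmost bit corresponds to document 1."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (n_docs - doc_id)
    return bits


def documents(bits, n_docs=N_DOCS):
    """Document ids whose bit is set."""
    return [d for d in range(1, n_docs + 1) if bits & (1 << (n_docs - d))]


word = bitmap([1, 3])    # "word" appears in documents 1 and 3
other = bitmap([3, 4])   # hypothetical second term
all_docs = (1 << N_DOCS) - 1

print(format(word, "05b"))           # 10100
print(documents(word & other))       # AND: [3]
print(documents(word | other))       # OR:  [1, 3, 4]
print(documents(all_docs & ~word))   # NOT: [2, 4, 5]
```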


Comparison of Text Indexing Methods

Bitmaps

pros: intuitive, easy to implement and to process

cons: space overhead

Signature files

pros: easy to implement, compact index, no lexicon

cons: many unnecessary accesses to documents

Inverted files

pros: used by most search engines

cons: large space overhead (the index can be compressed)


References

“Searching Multimedia Databases by Content”, C. Faloutsos, Kluwer Academic Publishers, 1996

“Modern Information Retrieval”, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999

“Automatic Text Processing”, G. Salton, Addison Wesley, 1989

Information Retrieval Links: http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html