9/6/2001 Information Organization and Retrieval Introduction to Information Retrieval (cont.): Boolean Model University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Lecture authors: Marti Hearst & Ray Larson


9/6/2001 Information Organization and Retrieval

Introduction to Information Retrieval (cont.): Boolean Model

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Lecture authors: Marti Hearst & Ray Larson


The Standard Retrieval Interaction Model


IR is an Iterative Process

[Figure labels: Repositories, Workspace, Goals]


A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

[Figure: a search path through successive queries Q0, Q1, Q2, Q3, Q4, Q5]


Restricted Form of the IR Problem

• The system has available only pre-existing, “canned” text passages.

• Its response is limited to selecting from these passages and presenting them to the user.

• It must select, say, 10 or 20 passages out of millions or billions!


Information Retrieval

• Revised Task Statement:

Build a system that retrieves documents that users are likely to find relevant to their queries.

• This set of assumptions underlies the field of Information Retrieval.


Some IR History

– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from mid 1950’s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns) (1960)
  • Boolean system development at Lockheed (‘60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical advances (‘70s)
  • Refinements and Advances in application (‘80s)
  • User Interfaces, Large-scale testing and application (‘90s)


Structure of an IR System

[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19]

• Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests
• Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations
• Comparison/Matching; Potentially Relevant Documents
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)


Relevance (introduction)

• In what ways can a document be relevant to a query?
  – Answer precise question precisely.
    Who is buried in Grant’s tomb? Grant.
  – Partially answer question.
    Where is Danville? Near Walnut Creek.
  – Suggest a source for more information.
    What is lymphedema? Look in this Medical Dictionary.
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...

• Ideally, IR systems should retrieve ALL and ONLY the RELEVANT documents for a user…


Query Languages

• A way to express the question (information need)

• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)


Simple query language: Boolean

– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
– connectors
  • AND
  • OR
  • NOT


Boolean Queries

• Cat

• Cat OR Dog

• Cat AND Dog

• (Cat AND Dog)

• (Cat AND Dog) OR Collar

• (Cat AND Dog) OR (Collar AND Leash)

• (Cat OR Dog) AND (Collar OR Leash)
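A minimal sketch of how queries like these could be evaluated, assuming a toy inverted index that maps each term to the set of IDs of the documents containing it. The postings and the `docs_with` helper are hypothetical, invented for illustration; OR becomes set union and AND becomes set intersection.

```python
# Toy inverted index: term -> set of IDs of documents containing it.
# The postings below are made up for illustration.
INDEX = {
    "cat": {1, 2, 5},
    "dog": {2, 3, 5},
    "collar": {3, 4, 5},
    "leash": {1, 4},
}

def docs_with(term):
    """Return the set of documents containing `term` (empty if unseen)."""
    return INDEX.get(term.lower(), set())

# OR is set union, AND is set intersection.
cat_or_dog = docs_with("Cat") | docs_with("Dog")
cat_and_dog = docs_with("Cat") & docs_with("Dog")
either_pair = cat_and_dog | (docs_with("Collar") & docs_with("Leash"))
pet_and_gear = cat_or_dog & (docs_with("Collar") | docs_with("Leash"))

print(sorted(cat_or_dog))    # [1, 2, 3, 5]
print(sorted(cat_and_dog))   # [2, 5]
print(sorted(either_pair))   # [2, 4, 5]
print(sorted(pet_and_gear))  # [1, 3, 5]
```

Note how moving the parentheses changes the answer: the last two queries use the same four terms but return different documents.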


Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:

• Cat x x x x

• Dog x x x x x

• Collar x x x x

• Leash x x x x


Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:

• Cat x x

• Dog x x

• Collar x x

• Leash x x
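The full set of working and non-working combinations for (Cat OR Dog) AND (Collar OR Leash) can be enumerated mechanically. The sketch below walks all 16 presence/absence patterns of the four terms; the function and variable names are illustrative only.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")

def matches(cat, dog, collar, leash):
    """(Cat OR Dog) AND (Collar OR Leash) over term-presence flags."""
    return bool((cat or dog) and (collar or leash))

working, failing = [], []
for flags in product([False, True], repeat=4):
    # Record which terms are present in this combination.
    present = tuple(t for t, f in zip(TERMS, flags) if f)
    (working if matches(*flags) else failing).append(present)

print(len(working), len(failing))  # 9 7
```

Nine patterns satisfy the query (at least one animal and at least one accessory). Seven do not, including the document that contains none of the terms.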


Boolean Logic

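[Figure: Venn diagrams of sets A and B, with result regions C formed from A alone and from A and B together]

DeMorgan's Law:

$$\overline{A \cap B} = \overline{A} \cup \overline{B}$$

$$\overline{A \cup B} = \overline{A} \cap \overline{B}$$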


Boolean Queries

– Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  • (a AND b AND c AND d)
– Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
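The three rules can be checked exhaustively, since a and b each take only two values. The sketch below does that, then repeats the check on sets of document IDs, where NOT is taken relative to the whole collection. The collection `ALL_DOCS` and the sets `A` and `B` are made-up examples.

```python
from itertools import product

# Truth-value check over all four (a, b) assignments.
rules_hold = all(
    ((not a) and (not b)) == (not (a or b))
    and ((not a) or (not b)) == (not (a and b))
    and (not (not a)) == a
    for a, b in product([False, True], repeat=2)
)

# The same rules on result sets: NOT means "every document not in the set".
ALL_DOCS = set(range(1, 9))   # hypothetical collection of 8 documents
A = {1, 2, 3, 4}              # documents matching term a
B = {3, 4, 5, 6}              # documents matching term b

def NOT(docs):
    """Complement relative to the whole collection."""
    return ALL_DOCS - docs

sets_hold = (
    (NOT(A) & NOT(B)) == NOT(A | B)
    and (NOT(A) | NOT(B)) == NOT(A & B)
    and NOT(NOT(A)) == A
)

print(rules_hold, sets_hold)  # True True
```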


Boolean Logic

[Figure: Venn diagram of three terms t1, t2, t3 containing documents D1 through D11 and divided into regions m1 through m8. Each region corresponds to one of the eight conjunctions of t1, t2, and t3 in which each term is either present or negated.]
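The eight regions of a three-term diagram can be generated rather than listed by hand. This sketch enumerates every conjunction in which each of t1, t2, t3 appears either plain or negated, and finds the region a given document falls into. The ordering produced here is simply the order `itertools.product` yields; it is not claimed to match the m1 to m8 numbering in the figure.

```python
from itertools import product

TERMS = ("t1", "t2", "t3")

def minterm_label(flags):
    """Spell out one region, e.g. (True, False, True) -> 't1 AND NOT t2 AND t3'."""
    return " AND ".join(t if f else f"NOT {t}" for t, f in zip(TERMS, flags))

# All 2**3 = 8 regions.
minterms = [minterm_label(flags) for flags in product([True, False], repeat=3)]

def region_of(doc_terms):
    """Return the single region that a document with these terms belongs to."""
    return minterm_label(tuple(t in doc_terms for t in TERMS))

print(len(minterms))             # 8
print(region_of({"t1", "t3"}))   # t1 AND NOT t2 AND t3
```

Every document lands in exactly one region, so any Boolean query over the three terms is a union of some of these regions.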


Boolean Searching

“Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Figure labels: Cracks, Beams, Width measurement, Prestressed concrete]

Relaxed Query:
(C AND B AND P) OR
(C AND B AND W) OR
(C AND W AND P) OR
(B AND W AND P)
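The relaxed query is "any three of the four terms". A sketch of both forms follows, assuming each document is represented as a set of index terms; the function names and the sample document are hypothetical.

```python
from itertools import combinations

FACETS = ["cracks", "beams", "width_measurement", "prestressed_concrete"]

def formal_match(doc_terms):
    """Formal query: all four terms must be present."""
    return all(t in doc_terms for t in FACETS)

def relaxed_match(doc_terms, k=3):
    """Relaxed query: at least one k-term subset is fully present."""
    return any(
        all(t in doc_terms for t in combo)
        for combo in combinations(FACETS, k)
    )

doc = {"cracks", "beams", "prestressed_concrete"}
print(formal_match(doc), relaxed_match(doc))  # False True
```

With k=3 there are four subsets, which are exactly the four parenthesized clauses of the relaxed query.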


Pseudo-Boolean Queries

• A new notation, from web search
  – +cat dog +collar leash

• Does not mean the same thing!

• Need a way to group combinations.

• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” + “frayed collar”
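A sketch of one common reading of the "+" notation, stated here as an assumption rather than as any particular engine's behavior: a term marked "+" must be present, while unmarked terms are optional and only affect ordering. The `parse` and `score` helpers are hypothetical, and quoted phrases are not handled.

```python
def parse(query):
    """Split a '+term term' query into (required, optional) term lists."""
    required, optional = [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            optional.append(token)
    return required, optional

def score(doc_terms, query):
    """None if a required term is missing, else the number of optional terms present."""
    required, optional = parse(query)
    if not all(t in doc_terms for t in required):
        return None
    return sum(t in doc_terms for t in optional)

q = "+cat dog +collar leash"
print(parse(q))                      # (['cat', 'collar'], ['dog', 'leash'])
print(score({"dog", "leash"}, q))    # None
```

Under this reading a document containing only "dog" and "leash" is rejected, although it would satisfy (Cat OR Dog) AND (Collar OR Leash), which is one way the two notations differ.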


[Figure: retrieval process diagram. Labels: Information need, text input, Parse, Query, Pre-process, Collections, Index, Rank]


Result Sets

• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set
• Example: Dialog query
  • (Redford AND Newman)
  • -> S1 1450 documents
  • (S1 AND Sundance)
  • -> S2 898 documents
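A sketch of the second choice, running a reformulated query against a saved result set. It assumes a toy index of document-ID sets and a hypothetical `run` helper that names each result set, loosely in the style of the S1/S2 example; the counts come from the toy data, not from Dialog.

```python
# Toy postings, invented for illustration.
INDEX = {
    "redford": {1, 2, 3, 4, 5},
    "newman": {2, 3, 4, 5, 6},
    "sundance": {3, 5, 7},
}

result_sets = {}

def run(name, query, within=None):
    """Evaluate `query` (a function of the index) and save the hits under `name`.

    If `within` names an earlier result set, only documents already in that
    set are kept, so the new query narrows the old one.
    """
    hits = query(INDEX)
    if within is not None:
        hits = hits & result_sets[within]
    result_sets[name] = hits
    return len(hits)

n1 = run("S1", lambda ix: ix["redford"] & ix["newman"])
n2 = run("S2", lambda ix: ix["sundance"], within="S1")
print(n1, n2)  # 4 2
```

Narrowing a saved set can only shrink it, which is why S2 is never larger than S1.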


[Figure: retrieval process diagram with feedback. Labels: Information need, text input, Parse, Query, Pre-process, Collections, Index, Rank, Reformulated Query, Re-Rank]


Ordering of Retrieved Documents

• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?
• Fancier methods have been investigated
  – p-norm is most famous
    • usually impractical to implement
    • usually hard for user to understand
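The "one of each term or many of one term" question can be made concrete with two simple orderings. The sketch below, using two made-up documents, ranks once by total hits and once by the number of distinct query terms matched; neither is presented as the method any real system uses.

```python
from collections import Counter

def total_hits(doc_tokens, query_terms):
    """Total occurrences of any query term in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

def distinct_terms(doc_tokens, query_terms):
    """How many different query terms appear at least once."""
    return len(set(doc_tokens) & set(query_terms))

docs = {
    "d1": "cat cat cat cat".split(),   # many hits on one term
    "d2": "cat dog collar".split(),    # one hit on each term
}
query = ["cat", "dog", "collar"]

by_hits = sorted(docs, key=lambda d: total_hits(docs[d], query), reverse=True)
by_coverage = sorted(docs, key=lambda d: distinct_terms(docs[d], query), reverse=True)

print(by_hits)      # ['d1', 'd2']
print(by_coverage)  # ['d2', 'd1']
```

The two orderings disagree, which is the difficulty the slide points at.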


Boolean

• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
• Dominant language in commercial systems until the WWW


Faceted Boolean Query

• Strategy: break query into facets (polysemous with earlier meaning of facets)
  – conjunction of disjunctions
      (a1 OR a2 OR a3)
      AND (b1 OR b2)
      AND (c1 OR c2 OR c3 OR c4)
  – each facet expresses a topic
      (“rain forest” OR jungle OR amazon)
      AND (medicine OR remedy OR cure)
      AND (Smith OR Zhou)
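A faceted query is a conjunction of disjunctions, so it can be evaluated as "every facet has at least one of its terms present". A minimal sketch, assuming documents are sets of lowercased terms and phrases; `FACETS` and `matches` are illustrative names.

```python
# Each facet is a set of alternative terms (an OR group).
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

def matches(doc_terms, facets=FACETS):
    """True if every facet shares at least one term with the document."""
    return all(not facet.isdisjoint(doc_terms) for facet in facets)

print(matches({"jungle", "cure", "zhou"}))  # True
print(matches({"jungle", "cure"}))          # False
```

The second document covers two topics well but is rejected outright because the author facet is empty.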


Faceted Boolean Query

• Query still fails if one facet missing

• Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied
  – Also called Quorum ranking, Overlap ranking, and Best Match

• Problem: Facets still undifferentiated

• Alternative: assign weights to facets
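Coordination level ranking and its weighted variant can be sketched with one function: count the facets a document satisfies, optionally multiplying each by a weight. The facets, weights, and documents below are made up for illustration.

```python
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

def coordination_level(doc_terms, facets, weights=None):
    """Sum the weights of the facets that share a term with the document.

    With no weights given, every facet counts 1, so the score is simply
    the number of facets satisfied.
    """
    if weights is None:
        weights = [1.0] * len(facets)
    return sum(w for facet, w in zip(facets, weights)
               if not facet.isdisjoint(doc_terms))

docs = {
    "d1": {"jungle"},
    "d2": {"cure", "zhou"},
    "d3": {"amazon", "remedy", "smith"},
}

unweighted = sorted(docs, key=lambda d: coordination_level(docs[d], FACETS), reverse=True)
weighted = sorted(docs, key=lambda d: coordination_level(docs[d], FACETS, [3.0, 1.0, 1.0]),
                  reverse=True)
print(unweighted)  # ['d3', 'd2', 'd1']
print(weighted)    # ['d3', 'd1', 'd2']
```

Unlike the strict faceted query, no document is dropped for missing a facet; it just ranks lower. Weighting the first facet lifts d1 above d2.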


Proximity Searches

• Proximity: terms occur within K positions of one another
  – pen w/5 paper
• A “Near” function can be more vague
  – near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”
• Phrase Variants
  – “retrieval of information” “information retrieval”
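A sketch of a proximity test over token positions. It assumes "within K positions" means the absolute difference of token offsets is at most K, which is one plausible reading of `w/5`; real systems may count differently. The `within` function and its `ordered` flag are hypothetical.

```python
def positions(tokens, term):
    """All offsets at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(tokens, a, b, k, ordered=False):
    """True if some occurrence of a and of b lie at most k positions apart.

    With ordered=True, a must come before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            if i == j:
                continue
            if ordered and j < i:
                continue
            if abs(i - j) <= k:
                return True
    return False

tokens = "the pen lay on top of the paper".split()
print(within(tokens, "pen", "paper", 5))   # False (6 positions apart)
print(within(tokens, "pen", "paper", 6))   # True

# A two-word phrase is the ordered, k=1 special case.
title = "united nations report".split()
print(within(title, "united", "nations", 1, ordered=True))  # True
```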


Filters

• Filters: Reduce set of candidate docs
• Often specified simultaneously with query
• Usually restrictions on metadata
  – restrict by:
    • date range
    • internet domain (.edu .com .berkeley.edu)
    • author
    • size
    • limit number of documents returned
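A sketch of metadata filtering applied to a candidate list. The `Doc` record, its field names, and the `apply_filters` signature are all assumptions made for the example; the sample documents are invented.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    host: str        # internet host the document came from
    author: str
    published: date
    size: int        # size in bytes

def in_domain(host, domain):
    """True if host is the domain itself or a subdomain of it."""
    domain = domain.lstrip(".")
    return host == domain or host.endswith("." + domain)

def apply_filters(docs, domain=None, author=None, since=None,
                  until=None, max_size=None, limit=None):
    """Keep only documents passing every filter that was specified."""
    kept = []
    for doc in docs:
        if domain is not None and not in_domain(doc.host, domain):
            continue
        if author is not None and doc.author != author:
            continue
        if since is not None and doc.published < since:
            continue
        if until is not None and doc.published > until:
            continue
        if max_size is not None and doc.size > max_size:
            continue
        kept.append(doc)
    return kept if limit is None else kept[:limit]

docs = [
    Doc("sims.berkeley.edu", "Hearst", date(2001, 9, 6), 1200),
    Doc("example.com", "Larson", date(1999, 1, 1), 800),
    Doc("cornell.edu", "Salton", date(1965, 1, 1), 500),
]
print(len(apply_filters(docs, domain=".edu")))           # 2
print(len(apply_filters(docs, domain=".berkeley.edu")))  # 1
```

Filters are independent of the query terms, so they can run before or after matching.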


How are the texts handled?

• What happens if you take the words exactly as they appear in the original text?

• What about punctuation, capitalization, etc.?
• What about spelling errors?
• What about plural vs. singular forms of words?
• What about cases and declension in non-English languages?
• What about non-Roman alphabets?


Content Analysis

• Automated Transformation of raw text into a form that represents some aspect(s) of its meaning
• Including, but not limited to:

– Automated Thesaurus Generation

– Phrase Detection

– Categorization

– Clustering

– Summarization


Techniques for Content Analysis

• Statistical
  – Single Document
  – Full Collection
• Linguistic
  – Syntactic
  – Semantic
  – Pragmatic
• Knowledge-Based (Artificial Intelligence)
• Hybrid (Combinations)


Text Processing

• Standard Steps:
  – Recognize document structure
    • titles, sections, paragraphs, etc.
  – Break into tokens
    • usually space and punctuation delimited
    • special issues with Asian languages
  – Stemming/morphological analysis
  – Store in inverted index (to be discussed later)
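A sketch of the tokenizing and indexing steps, assuming space- and punctuation-delimited English text. Structure recognition and stemming are left out, and the regular expression is a deliberately crude choice that would not suit languages written without spaces between words. `tokenize` and `build_index` are illustrative names.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of IDs of the documents it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)

docs = {
    1: "The cat sat.",
    2: "A cat and a dog!",
}
index = build_index(docs)
print(tokenize("Cats, dogs & collars!"))  # ['cats', 'dogs', 'collars']
print(sorted(index["cat"]))               # [1, 2]
```

Note that "cats" and "cat" remain different index entries here, which is the gap stemming is meant to close.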


[Figure: retrieval process diagram. Labels: Information need, text input, Parse, Query, Pre-process, Collections, Index, Rank. Annotations: “How is the query constructed?” and “How is the text processed?”]


Document Processing Steps


Stemming and Morphological Analysis

• Goal: “normalize” similar words
• Morphology (“form” of words)
  – Inflectional Morphology
    • E.g., inflect verb endings and noun number
    • Never change grammatical class
      – dog, dogs
      – tengo, tienes, tiene, tenemos, tienen
  – Derivational Morphology
    • Derive one word from another
    • Often change grammatical class
      – build, building; health, healthy


Automated Methods

• Powerful multilingual tools exist for morphological analysis
  – PCKimmo, Xerox Lexical technology
  – Require a grammar and dictionary
  – Use “two-level” automata
• Stemmers:
  – Very dumb rules work well (for English)
  – Porter Stemmer: Iteratively remove suffixes
  – Improvement: pass results through a lexicon
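To show what "iteratively remove suffixes" means, here is a toy suffix stripper. It is not the Porter algorithm, which uses a much larger ordered rule set with conditions on the remaining stem; the three suffixes and the minimum stem length below are arbitrary choices for illustration.

```python
SUFFIXES = ("ing", "es", "s")  # hypothetical rule list, checked in this order

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

def stem(word):
    """Apply strip_suffix repeatedly until the word stops changing."""
    previous = None
    while word != previous:
        previous, word = word, strip_suffix(word)
    return word

print(stem("dogs"))       # dog
print(stem("buildings"))  # build
print(stem("news"))       # new
```

The last line shows the kind of over-conflation that crude rules produce, and why passing results through a lexicon is suggested as an improvement.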


Errors Generated by Porter Stemmer (Krovetz 93)

Too Aggressive            Too Timid
organization / organ      european / europe
policy / police           cylinder / cylindrical
execute / executive       create / creation
arm / army                search / searcher


Next

• Statistical Properties of Text

• Preparing information for search: Lexical analysis

• Introduction to the Vector Space model of IR.