
(C) 2003, The University of Michigan 1

Information Retrieval

Handout #1

January 6, 2003

(C) 2003, The University of Michigan 2

Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: TBA

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Mondays, 1-4 PM in 409 West Hall

(C) 2003, The University of Michigan 3

Introduction

(C) 2003, The University of Michigan 4

Demos

• Google

• Vivísimo

• AskJeeves

• NSIR

• Lemur

• MG

(C) 2003, The University of Michigan 5

Syllabus (Part I)

CLASSIC IR

Week 1 The Concept of Information Need, IR Models, Vector models, Boolean models

Week 2 Retrieval Evaluation, Precision and Recall, F-measure, Reference collection, The TREC conferences

Week 3 Queries and Documents, Query Languages, Natural language querying, Relevance feedback

Week 4 Indexing and Searching, Inverted indexes

Week 5 XML retrieval

Week 6 Language modeling approaches

(C) 2003, The University of Michigan 6

Syllabus (Part II)

WEB-BASED IR

Week 7 Crawling the Web, hyperlink analysis, measuring the Web

Week 8 Similarity and clustering, bottom-up and top-down paradigms

Week 9 Social network analysis for IR, Hubs and authorities, PageRank and HITS

Week 10 Focused crawling, Resource discovery, discovering communities

Week 11 Question answering

Week 12 Additional topics, e.g., relevance transfer

Week 13 Project presentations

(C) 2003, The University of Michigan 7

Readings

BOOKS

• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999. http://www.sims.berkeley.edu/~hearst/irbook/
• Soumen Chakrabarti, Mining the Web, Morgan Kaufmann, 2002. http://www.cse.iitb.ac.in/~soumen/

PAPERS

• Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
• Barabási and Albert, "Emergence of scaling in random networks", Science (286) 509-512, 1999
• Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
• Davison, "Topical locality on the Web", SIGIR 2000
• Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
• Jeong and Barabási, "Diameter of the world wide web", Nature (401) 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
• Haveliwala, "Topic-sensitive PageRank", WWW 2002
• Lawrence and Giles, "Accessibility of information on the Web", Nature (400) 107-109, 1999
• Lawrence and Giles, "Searching the World-Wide Web", Science (280) 98-100, 1998
• Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
• Menczer, "Growing and Navigating the Small World Web by Local Content", Proc. Natl. Acad. Sci. USA 99(22), 2002
• Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
• Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002
• Radev et al., "Content Diffusion on the Web Graph"

CASE STUDIES (IR SYSTEMS)

• Lemur, MG, Google, AskJeeves, NSIR

(C) 2003, The University of Michigan 8

Assignments

Homeworks: The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines.

Project: The final course project can be done in one of three formats:
(1) a programming project implementing a challenging and novel information retrieval application,
(2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or
(3) a SIGIR-style experimental IR paper.

(C) 2003, The University of Michigan 9

Grading

• Three HW assignments (30%)

• Project (30%)

• Final (40%)

(C) 2003, The University of Michigan 10

Topics

• IR systems

• Evaluation methods

• Indexing, search, and retrieval

(C) 2003, The University of Michigan 11

Need for IR

• Advent of the WWW: more than 3 billion documents indexed by Google

• How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/

• Search, routing, filtering

• User’s information need

(C) 2003, The University of Michigan 12

Some definitions of Information Retrieval (IR)

Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”

Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”

(C) 2003, The University of Michigan 13

Examples of IR systems

• Conventional (library catalog). Search by keyword, title, author, etc.

• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.

• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).

• Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.

(C) 2003, The University of Michigan 14

(C) 2003, The University of Michigan 15

(C) 2003, The University of Michigan 16

Types of queries (AltaVista)

Including or excluding words: To make sure that a specific word is always included in your search topic, place the plus (+) symbol before the keyword in the search box. To make sure that a specific word is always excluded from your search topic, place a minus (-) sign before the keyword in the search box.

Example: To find recipes for cookies with oatmeal but without raisins, try recipe cookie +oatmeal -raisin.

Expand your search using wildcards (*): By typing an * at the end of a keyword, you can search for the word with multiple endings.

Example: Try wish* to find wish, wishes, wishful, wishbone, and wishy-washy.
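
The following is a minimal sketch of how required (+), excluded (-), and trailing-wildcard (*) terms could be applied as document filters; it is illustrative only, not AltaVista's actual implementation, and the document texts are made up.

```python
# Sketch: filtering documents with +term (required), -term (excluded),
# and term* (prefix wildcard). Illustrative only, not AltaVista's implementation.

def matches(query_tokens, doc_tokens):
    doc = set(doc_tokens)
    for tok in query_tokens:
        if tok.startswith("+") and tok[1:] not in doc:        # required term missing
            return False
        if tok.startswith("-") and tok[1:] in doc:            # excluded term present
            return False
        if tok.endswith("*") and not any(w.startswith(tok[:-1]) for w in doc):
            return False                                      # no word with this prefix
        # plain terms are optional: they influence ranking, not filtering
    return True

docs = {
    "d1": "oatmeal cookie recipe with brown sugar".split(),
    "d2": "oatmeal raisin cookie recipe".split(),
}
query = "recipe cookie +oatmeal -raisin".split()
print([d for d, toks in docs.items() if matches(query, toks)])   # ['d1']
```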

(C) 2003, The University of Michigan 17

Types of queries

AND (&): Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.

OR (|): Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.

NOT (!): Excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone; use it with another operator, like AND.

NEAR (~): Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.
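
As a companion sketch, here is one way the NEAR (~) "within 10 words" condition could be checked against a tokenized document; the function name and window handling are illustrative assumptions, not AltaVista's code.

```python
# Sketch of NEAR (~): true if term1 and term2 occur within `window` words of each other.

def near(doc_tokens, term1, term2, window=10):
    pos1 = [i for i, t in enumerate(doc_tokens) if t == term1]
    pos2 = [i for i, t in enumerate(doc_tokens) if t == term2]
    return any(abs(i - j) <= window for i in pos1 for j in pos2)

rhyme = "mary had a little lamb its fleece was white as snow".split()
print(near(rhyme, "mary", "lamb"))   # True: the terms are 4 positions apart
```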

(C) 2003, The University of Michigan 18

Mappings and abstractions

Reality → Data

Information need → Query

From Korfhage’s book

(C) 2003, The University of Michigan 19

Documents

• Not just printed paper

• collections vs. documents

• data structures: representations

• document surrogates: keywords, summaries

• encoding: ASCII, Unicode, etc.

(C) 2003, The University of Michigan 20

Typical IR system

• (Crawling)

• Indexing

• Retrieval

• User interface

(C) 2003, The University of Michigan 21

Sample queries (from Excite)

In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running

(C) 2003, The University of Michigan 22

Size matters

• Typical document surrogate: 200 to 2000 bytes

• Book: up to 3 MB of data

• Stemming: computer, computational, computing
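
As a deliberately naive illustration of the idea (this is not the Porter stemmer or any production algorithm), a few suffixes can be stripped so that the three forms above conflate to one index term:

```python
# Naive suffix-stripping sketch: conflates "computer", "computational", "computing".
# Illustrative only; real systems use the Porter stemmer or similar.

def crude_stem(word, suffixes=("ational", "ation", "ing", "ers", "er", "s")):
    for suf in suffixes:                      # longest suffixes listed first
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

print({w: crude_stem(w) for w in ("computer", "computational", "computing")})
# {'computer': 'comput', 'computational': 'comput', 'computing': 'comput'}
```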

(C) 2003, The University of Michigan 23

Key Terms Used in IR

• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.

• DOCUMENT: an information entity that the user wants to retrieve

• COLLECTION: a set of documents

• INDEX: a representation of information that makes querying easier

• TERM: word or concept that appears in a document or a query

(C) 2003, The University of Michigan 24

Other important terms

• Classification

• Cluster

• Similarity

• Information Extraction

• Term Frequency

• Inverse Document Frequency

• Precision

• Recall

• Inverted File

• Query Expansion

• Relevance

• Relevance Feedback

• Stemming

• Stopword

• Vector Space Model

• Weighting

• TREC/TIPSTER/MUC

(C) 2003, The University of Michigan 25

Query structures

• Query viewed as a document?
  – Length
  – repetitions
  – syntactic differences

• Types of matches:
  – exact
  – range
  – approximate

(C) 2003, The University of Michigan 26

Additional references on IR

• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries

(C) 2003, The University of Michigan 27

Related courses elsewhere

• Berkeley (Marti Hearst and Ray Larson): http://www.sims.berkeley.edu/courses/is202/f00/

• Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze): http://www.stanford.edu/class/cs276a/

• Cornell (Jon Kleinberg): http://www.cs.cornell.edu/Courses/cs685/2002fa/

• CMU (Yiming Yang and Jamie Callan): http://la.lti.cs.cmu.edu/classes/11-741/

(C) 2003, The University of Michigan 28

Readings for weeks 1 – 3

• MIR (Modern Information Retrieval)
  – Week 1
    • Chapter 1 “Introduction”
    • Chapter 2 “Modeling”
    • Chapter 3 “Evaluation”
  – Week 2
    • Chapter 4 “Query languages”
    • Chapter 5 “Query operations”
  – Week 3
    • Chapter 6 “Text and multimedia languages”
    • Chapter 7 “Text operations”
    • Chapter 8 “Indexing and searching”

(C) 2003, The University of Michigan 29

IR models

(C) 2003, The University of Michigan 30

Major IR models

• Boolean

• Vector

• Probabilistic

• Language modeling

• Fuzzy

• Latent semantic indexing

(C) 2003, The University of Michigan 31

Major IR tasks

• Ad-hoc

• Filtering and routing

• Question answering

• Spoken document retrieval

• Multimedia retrieval

(C) 2003, The University of Michigan 32

Venn diagrams

[Venn diagram: two overlapping documents D1 and D2, with regions labeled x, w, y, z]

(C) 2003, The University of Michigan 33

Boolean model

[Venn diagram: two overlapping sets A and B]

(C) 2003, The University of Michigan 34

Boolean queries

restaurants AND (Mideastern OR vegetarian) AND inexpensive

• What types of documents are returned?

• Stemming

• thesaurus expansion

• inclusive vs. exclusive OR

• confusing uses of AND and OR

dinner AND sports AND symphony

4 OF (Pentium, printer, cache, PC, monitor, computer, personal)

(C) 2003, The University of Michigan 35

Boolean queries

• Weighting (Beethoven AND sonatas)

• precedence

coffee AND croissant OR muffin

raincoat AND umbrella OR sunglasses

• Use of negation: potential problems

• Conjunctive and Disjunctive normal forms

• Full CNF and DNF

(C) 2003, The University of Michigan 36

Transformations

• De Morgan’s Laws:

NOT (A AND B) = (NOT A) OR (NOT B)

NOT (A OR B) = (NOT A) AND (NOT B)

• CNF or DNF? – Reference librarians prefer CNF - why?
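
A quick way to convince yourself of the two identities above is to check every truth assignment; the snippet below is just such a brute-force check, not part of any IR system.

```python
# Truth-table check of De Morgan's laws over all four assignments of (a, b).
from itertools import product

for a, b in product((False, True), repeat=2):
    assert (not (a and b)) == ((not a) or (not b))
    assert (not (a or b)) == ((not a) and (not b))
print("De Morgan's laws hold for all truth assignments")
```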

(C) 2003, The University of Michigan 37

Boolean model

• Partition

• Partial relevance?

• Operators: AND, NOT, OR, parentheses

(C) 2003, The University of Michigan 38

Exercise

• D1 = “computer information retrieval”

• D2 = “computer retrieval”

• D3 = “information”

• D4 = “computer information”

• Q1 = “information retrieval”

• Q2 = “information ¬computer”
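
One way to work the exercise is with a tiny inverted index and set operations; the sketch below assumes Q1 means (information AND retrieval) and Q2 means (information AND NOT computer).

```python
# Sketch: Boolean retrieval over the exercise collection via term -> document sets.

docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

index = {}                                   # term -> set of documents containing it
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)
q1 = index["information"] & index["retrieval"]              # Q1: information AND retrieval
q2 = index["information"] & (all_docs - index["computer"])  # Q2: information AND NOT computer
print(sorted(q1), sorted(q2))                               # ['D1'] ['D3']
```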

(C) 2003, The University of Michigan 39

Exercise

0  (no authors)

1  Swift

2  Shakespeare

3  Shakespeare Swift

4  Milton

5  Milton Swift

6  Milton Shakespeare

7  Milton Shakespeare Swift

8  Chaucer

9  Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift

12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
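
To see which of the sixteen regions the expression selects, one can simply enumerate the combinations above and evaluate it; the bit numbering below mirrors the list (Swift = 1, Shakespeare = 2, Milton = 4, Chaucer = 8) and the code is only an illustration.

```python
# Sketch: evaluate ((chaucer OR milton) AND NOT swift) OR (NOT chaucer AND (swift OR shakespeare))
# over the 16 author combinations numbered as in the list above.

AUTHORS = ("swift", "shakespeare", "milton", "chaucer")     # bit values 1, 2, 4, 8

for region in range(16):
    present = {a for i, a in enumerate(AUTHORS) if region & (1 << i)}
    c, m = "chaucer" in present, "milton" in present
    s, sh = "swift" in present, "shakespeare" in present
    if ((c or m) and not s) or (not c and (s or sh)):
        print(region, " ".join(sorted(present)) or "(none)")   # prints regions 1-8, 10, 12, 14
```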

(C) 2003, The University of Michigan 40

Stop lists

• The 250-300 most common words in English account for 50% or more of a given text.

• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.

• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).

• Token/type ratio: 2256/859 = 2.63
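
The figures above are easy to recompute for any text; a minimal sketch (the file name is hypothetical) counts types and tokens and measures how much of the text the most frequent types cover.

```python
# Sketch: type/token ratio and coverage of the top-k most frequent types.
from collections import Counter

def coverage_stats(tokens, top_k=65):
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_k))
    return {
        "types": len(counts),
        "tokens": len(tokens),
        "token/type ratio": round(len(tokens) / len(counts), 2),
        "top-%d coverage" % top_k: round(covered / len(tokens), 2),
    }

tokens = open("moby_dick_ch1.txt").read().lower().split()   # hypothetical local file
print(coverage_stats(tokens))
```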

(C) 2003, The University of Michigan 41

Vector-based representation

[Figure: documents Doc 1, Doc 2, Doc 3 plotted as vectors in a space whose axes are Term 1, Term 2, Term 3]

(C) 2003, The University of Michigan 42

Vector queries

• Each document is represented as a vector

• non-efficient representations (bit vectors)

• dimensional compatibility

Terms:   W1  W2  W3  W4  W5  W6  W7  W8  W9  W10
Weights: C1  C2  C3  C4  C5  C6  C7  C8  C9  C10

(C) 2003, The University of Michigan 43

The matching process

• Matching is done between a document and a query - topicality

• document space

• characteristic function F(d) = {0,1}

• distance vs. similarity - mapping functions

• Euclidean distance, Manhattan distance, Word overlap

(C) 2003, The University of Michigan 44

Vector-based matching

• The Cosine measure (sums taken over all index terms i):

  cos(D,Q) = Σ (di × qi) / ( sqrt(Σ di²) × sqrt(Σ qi²) )
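
A direct transcription of the formula (a sketch for illustration, not a library implementation):

```python
# Cosine measure between a document vector d and a query vector q of term weights.
import math

def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))
    norms = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norms if norms else 0.0

print(cosine([1, 2, 0], [2, 1, 1]))   # ~0.73
```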

• Intrinsic vs. extrinsic measures

(C) 2003, The University of Michigan 45

Exercise

• Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>

• Compute the corresponding Euclidean distances.

(C) 2003, The University of Michigan 46

Matrix representations

• Term-document matrix (m x n)

• term-term matrix (m x m x n)

• document-document matrix (n x n)

• Example: 3,000,000 documents (n) with 50,000 terms (m)

• sparse matrices

• Boolean vs. integer matrices
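
Because such matrices are overwhelmingly sparse, only the non-zero entries are stored; a minimal sketch (plain dictionaries, no particular library) of a sparse term-document matrix:

```python
# Sketch: sparse term-document matrix as nested dicts (term -> {doc_id: count}).

def term_document_matrix(docs):
    matrix = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            row = matrix.setdefault(term, {})
            row[doc_id] = row.get(doc_id, 0) + 1
    return matrix

docs = {"d1": "computer information retrieval", "d2": "information information"}
print(term_document_matrix(docs)["information"])   # {'d1': 1, 'd2': 2}
```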

(C) 2003, The University of Michigan 47

Zipf’s law

Rank × Frequency ≈ Constant    (Z = rank × frequency / total number of tokens)

Rank  Term   Freq.    Z
 1    the    69,971   0.070
 2    of     36,411   0.073
 3    and    28,852   0.086
 4    to     26,149   0.104
 5    a      23,237   0.116
 6    in     21,341   0.128
 7    that   10,595   0.074
 8    is     10,099   0.081
 9    was     9,816   0.088
10    he      9,543   0.095
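
The Z column can be reproduced for any corpus as rank × frequency / total tokens; the sketch below assumes a local text file (hypothetical name) and is only illustrative.

```python
# Sketch: checking Zipf's law (rank x relative frequency roughly constant).
from collections import Counter

tokens = open("corpus.txt").read().lower().split()    # hypothetical corpus file
counts = Counter(tokens)
total = len(tokens)
for rank, (term, freq) in enumerate(counts.most_common(10), start=1):
    z = rank * freq / total                           # the Z column above
    print(f"{rank:2d}  {term:10s}  {freq:8d}  {z:.3f}")
```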

(C) 2003, The University of Michigan 48

Evaluation

(C) 2003, The University of Michigan 49

Contingency table

                   retrieved     not retrieved
  relevant            w               x            n1 = w + x
  not relevant        y               z
                   n2 = w + y                      N

(C) 2003, The University of Michigan 50

Precision and Recall

Recall = w / (w + x)      (fraction of the n1 relevant documents that are retrieved)

Precision = w / (w + y)   (fraction of the n2 retrieved documents that are relevant)
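
In code, with w, x, y named as in the contingency table (a sketch, not tied to any particular toolkit):

```python
# Precision and recall from contingency counts:
# w = relevant and retrieved, x = relevant but not retrieved, y = retrieved but not relevant.

def precision_recall(w, x, y):
    recall = w / (w + x) if (w + x) else 0.0      # retrieved share of the relevant documents
    precision = w / (w + y) if (w + y) else 0.0   # relevant share of the retrieved documents
    return precision, recall

print(precision_recall(w=30, x=20, y=10))          # (0.75, 0.6)
```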

(C) 2003, The University of Michigan 51

Exercise

Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query, e.g.: Tolkien, “JRR Tolkien”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like.

(C) 2003, The University of Michigan 52

n   Doc. no  Relevant?  Recall  Precision
 1  588      x          0.2     1.00
 2  589      x          0.4     1.00
 3  576                 0.4     0.67
 4  590      x          0.6     0.75
 5  986                 0.6     0.60
 6  592      x          0.8     0.67
 7  984                 0.8     0.57
 8  988                 0.8     0.50
 9  578                 0.8     0.44
10  985                 0.8     0.40
11  103                 0.8     0.36
12  591                 0.8     0.33
13  772      x          1.0     0.38
14  990                 1.0     0.36

[From Salton’s book]
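
The recall and precision columns can be recomputed from the ranking and the set of relevant documents (5 in this example); the sketch below reproduces the table.

```python
# Sketch: running recall/precision down a ranked list.

def ranked_pr(ranking, relevant):
    hits, rows = 0, []
    for n, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        rows.append((n, doc, hits / len(relevant), hits / n))
    return rows

ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}
for n, doc, r, p in ranked_pr(ranking, relevant):
    print(f"{n:2d}  {doc}  {r:.1f}  {p:.2f}")
```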

(C) 2003, The University of Michigan 53

P/R graph

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1) for the ranked list above]

(C) 2003, The University of Michigan 54

Issues

• Standard levels for P&R (0-100%)

• Interpolation

• Average P&R

• Average P at given “document cutoff values”

• F-measure: F = 2/(1/R+1/P)
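
The last two items can be made concrete with a short sketch: the F-measure as defined above, and 11-point interpolated precision (precision at recall level r taken as the maximum precision at any recall ≥ r). The (recall, precision) pairs below come from the ranked list a few slides back.

```python
# Sketch: F-measure and 11-point interpolated precision.

def f_measure(p, r):
    return 2 / (1 / r + 1 / p) if p and r else 0.0   # harmonic mean of P and R

def interpolated(pr_points, levels=tuple(i / 10 for i in range(11))):
    # interpolated precision at level l = max precision at any recall >= l
    return [(l, max((p for r, p in pr_points if r >= l), default=0.0)) for l in levels]

pr_points = [(0.2, 1.00), (0.4, 1.00), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]
print(round(f_measure(0.75, 0.6), 2))   # 0.67
for level, p in interpolated(pr_points):
    print(f"recall {level:.1f}: precision {p:.2f}")
```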

(C) 2003, The University of Michigan 55

Relevance collections

• TREC adhoc collections, 2-6 GB

• TREC Web collections, 2-100 GB

(C) 2003, The University of Michigan 56

Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

Relevant documents for this topic:

LA031689-0177  FT922-1008     LA090190-0126  LA101190-0218  LA082690-0158
LA112590-0109  FT944-136      LA020590-0119  FT944-5300     LA052190-0048
LA051689-0139  FT944-9371     LA032390-0172  LA042790-0172  LA021790-0136
LA092289-0167  LA111189-0013  LA120189-0179  LA020490-0021  LA122989-0063
LA091389-0119  LA072189-0048  FT944-15615    LA091589-0101  LA021289-0208

(C) 2003, The University of Michigan 57

<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>