
(C) 2003, The University of Michigan 1

Information Retrieval

Handout #1

January 6, 2003

(C) 2003, The University of Michigan 2

Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: TBA

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Mondays, 1-4 PM in 409 West Hall

(C) 2003, The University of Michigan 3

Introduction

(C) 2003, The University of Michigan 4

Demos

• Google

• Vivísimo

• AskJeeves

• NSIR

• Lemur

• MG

(C) 2003, The University of Michigan 5

Syllabus (Part I)

CLASSIC IR

Week 1 The Concept of Information Need, IR Models, Vector models, Boolean models

Week 2 Retrieval Evaluation, Precision and Recall, F-measure, Reference collection, The TREC conferences

Week 3 Queries and Documents, Query Languages, Natural language querying, Relevance feedback

Week 4 Indexing and Searching, Inverted indexes

Week 5 XML retrieval

Week 6 Language modeling approaches

(C) 2003, The University of Michigan 6

Syllabus (Part II)

WEB-BASED IR

Week 7 Crawling the Web, hyperlink analysis, measuring the Web

Week 8 Similarity and clustering, bottom-up and top-down paradigms

Week 9 Social network analysis for IR, Hubs and authorities, PageRank and HITS

Week 10 Focused crawling, Resource discovery, discovering communities

Week 11 Question answering

Week 12 Additional topics, e.g., relevance transfer

Week 13 Project presentations

(C) 2003, The University of Michigan 7

Readings

BOOKS

• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999. http://www.sims.berkeley.edu/~hearst/irbook/
• Soumen Chakrabarti, Mining the Web, Morgan Kaufmann, 2002. http://www.cse.iitb.ac.in/~soumen/

PAPERS

• Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
• Barabási and Albert, "Emergence of scaling in random networks", Science (286) 509-512, 1999
• Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
• Davison, "Topical locality on the Web", SIGIR 2000
• Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
• Jeong and Barabási, "Diameter of the world wide web", Nature (401) 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
• Haveliwala, "Topic-sensitive PageRank", WWW 2002
• Lawrence and Giles, "Accessibility of information on the Web", Nature (400) 107-109, 1999
• Lawrence and Giles, "Searching the World-Wide Web", Science (280) 98-100, 1998
• Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
• Menczer, "Growing and Navigating the Small World Web by Local Content", Proc. Natl. Acad. Sci. USA 99(22), 2002
• Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
• Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002
• Radev et al., "Content Diffusion on the Web Graph"

CASE STUDIES (IR SYSTEMS)

• Lemur, MG, Google, AskJeeves, NSIR

(C) 2003, The University of Michigan 8

Assignments

Homeworks: The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines.

Project: The final course project can be done in one of three formats:
(1) a programming project implementing a challenging and novel information retrieval application,
(2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or
(3) a SIGIR-style experimental IR paper.

(C) 2003, The University of Michigan 9

Grading

• Three HW assignments (30%)

• Project (30%)

• Final (40%)

(C) 2003, The University of Michigan 10

Topics

• IR systems

• Evaluation methods

• Indexing, search, and retrieval

(C) 2003, The University of Michigan 11

Need for IR

• Advent of the WWW: more than 3 billion documents indexed by Google

• How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/

• Search, routing, filtering

• User’s information need

(C) 2003, The University of Michigan 12

Some definitions of Information Retrieval (IR)

Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”

Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”

(C) 2003, The University of Michigan 13

Examples of IR systems

• Conventional (library catalog). Search by keyword, title, author, etc.

• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.

• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).

• Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.

(C) 2003, The University of Michigan 14

(C) 2003, The University of Michigan 15

(C) 2003, The University of Michigan 16

Types of queries (AltaVista)

Including or excluding words: To make sure that a specific word is always included in your search topic, place the plus (+) symbol before the keyword in the search box. To make sure that a specific word is always excluded from your search topic, place a minus (-) sign before the keyword in the search box.

Example: To find recipes for cookies with oatmeal but without raisins, try recipe cookie +oatmeal -raisin.

Expand your search using wildcards (*): By typing an * at the end of a keyword, you can search for the word with multiple endings.

Example: Try wish* to find wish, wishes, wishful, wishbone, and wishy-washy.
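
The following is a minimal sketch of how required (+), excluded (-), and trailing-wildcard (*) terms could be applied as document filters; it is illustrative only, not AltaVista's actual implementation, and the document texts are made up.

```python
# Sketch: filtering documents with +term (required), -term (excluded),
# and term* (prefix wildcard). Illustrative only, not AltaVista's implementation.

def matches(query_tokens, doc_tokens):
    doc = set(doc_tokens)
    for tok in query_tokens:
        if tok.startswith("+") and tok[1:] not in doc:        # required term missing
            return False
        if tok.startswith("-") and tok[1:] in doc:            # excluded term present
            return False
        if tok.endswith("*") and not any(w.startswith(tok[:-1]) for w in doc):
            return False                                      # no word with this prefix
        # plain terms are optional: they influence ranking, not filtering
    return True

docs = {
    "d1": "oatmeal cookie recipe with brown sugar".split(),
    "d2": "oatmeal raisin cookie recipe".split(),
}
query = "recipe cookie +oatmeal -raisin".split()
print([d for d, toks in docs.items() if matches(query, toks)])   # ['d1']
```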

(C) 2003, The University of Michigan 17

Types of queries

AND (&): Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.

OR (|): Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.

NOT (!): Excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone; use it with another operator, like AND.

NEAR (~): Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.
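
As a companion sketch, here is one way the NEAR (~) "within 10 words" condition could be checked against a tokenized document; the function name and window handling are illustrative assumptions, not AltaVista's code.

```python
# Sketch of NEAR (~): true if term1 and term2 occur within `window` words of each other.

def near(doc_tokens, term1, term2, window=10):
    pos1 = [i for i, t in enumerate(doc_tokens) if t == term1]
    pos2 = [i for i, t in enumerate(doc_tokens) if t == term2]
    return any(abs(i - j) <= window for i in pos1 for j in pos2)

rhyme = "mary had a little lamb its fleece was white as snow".split()
print(near(rhyme, "mary", "lamb"))   # True: the terms are 4 positions apart
```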

(C) 2003, The University of Michigan 18

Mappings and abstractions

Reality → Data

Information need → Query

From Korfhage’s book

(C) 2003, The University of Michigan 19

Documents

• Not just printed paper

• collections vs. documents

• data structures: representations

• document surrogates: keywords, summaries

• encoding: ASCII, Unicode, etc.

(C) 2003, The University of Michigan 20

Typical IR system

• (Crawling)

• Indexing

• Retrieval

• User interface

(C) 2003, The University of Michigan 21

Sample queries (from Excite)

In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running

(C) 2003, The University of Michigan 22

Size matters

• Typical document surrogate: 200 to 2000 bytes

• Book: up to 3 MB of data

• Stemming: computer, computational, computing
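
As a deliberately naive illustration of the idea (this is not the Porter stemmer or any production algorithm), a few suffixes can be stripped so that the three forms above conflate to one index term:

```python
# Naive suffix-stripping sketch: conflates "computer", "computational", "computing".
# Illustrative only; real systems use the Porter stemmer or similar.

def crude_stem(word, suffixes=("ational", "ation", "ing", "ers", "er", "s")):
    for suf in suffixes:                      # longest suffixes listed first
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

print({w: crude_stem(w) for w in ("computer", "computational", "computing")})
# {'computer': 'comput', 'computational': 'comput', 'computing': 'comput'}
```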

(C) 2003, The University of Michigan 23

Key Terms Used in IR

• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.

• DOCUMENT: an information entity that the user wants to retrieve

• COLLECTION: a set of documents

• INDEX: a representation of information that makes querying easier

• TERM: word or concept that appears in a document or a query

(C) 2003, The University of Michigan 24

Other important terms

• Classification

• Cluster

• Similarity

• Information Extraction

• Term Frequency

• Inverse Document Frequency

• Precision

• Recall

• Inverted File

• Query Expansion

• Relevance

• Relevance Feedback

• Stemming

• Stopword

• Vector Space Model

• Weighting

• TREC/TIPSTER/MUC

(C) 2003, The University of Michigan 25

Query structures

• Query viewed as a document?
  – Length
  – repetitions
  – syntactic differences

• Types of matches:
  – exact
  – range
  – approximate

(C) 2003, The University of Michigan 26

Additional references on IR

• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries

(C) 2003, The University of Michigan 27

Related courses elsewhere

• Berkeley (Marti Hearst and Ray Larson): http://www.sims.berkeley.edu/courses/is202/f00/

• Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze): http://www.stanford.edu/class/cs276a/

• Cornell (Jon Kleinberg): http://www.cs.cornell.edu/Courses/cs685/2002fa/

• CMU (Yiming Yang and Jamie Callan): http://la.lti.cs.cmu.edu/classes/11-741/

(C) 2003, The University of Michigan 28

Readings for weeks 1 – 3

• MIR (Modern Information Retrieval)
  – Week 1
    • Chapter 1 “Introduction”
    • Chapter 2 “Modeling”
    • Chapter 3 “Evaluation”
  – Week 2
    • Chapter 4 “Query languages”
    • Chapter 5 “Query operations”
  – Week 3
    • Chapter 6 “Text and multimedia languages”
    • Chapter 7 “Text operations”
    • Chapter 8 “Indexing and searching”

(C) 2003, The University of Michigan 29

IR models

(C) 2003, The University of Michigan 30

Major IR models

• Boolean

• Vector

• Probabilistic

• Language modeling

• Fuzzy

• Latent semantic indexing

(C) 2003, The University of Michigan 31

Major IR tasks

• Ad-hoc

• Filtering and routing

• Question answering

• Spoken document retrieval

• Multimedia retrieval

(C) 2003, The University of Michigan 32

Venn diagrams

[Venn diagram: two overlapping documents D1 and D2, with regions labeled x, w, y, z]

(C) 2003, The University of Michigan 33

Boolean model

[Venn diagram: two overlapping sets A and B]

(C) 2003, The University of Michigan 34

Boolean queries

restaurants AND (Mideastern OR vegetarian) AND inexpensive

• What types of documents are returned?

• Stemming

• thesaurus expansion

• inclusive vs. exclusive OR

• confusing uses of AND and OR

dinner AND sports AND symphony

4 OF (Pentium, printer, cache, PC, monitor, computer, personal)

(C) 2003, The University of Michigan 35

Boolean queries

• Weighting (Beethoven AND sonatas)

• precedence

coffee AND croissant OR muffin

raincoat AND umbrella OR sunglasses

• Use of negation: potential problems

• Conjunctive and Disjunctive normal forms

• Full CNF and DNF

(C) 2003, The University of Michigan 36

Transformations

• De Morgan’s Laws:

NOT (A AND B) = (NOT A) OR (NOT B)

NOT (A OR B) = (NOT A) AND (NOT B)

• CNF or DNF? – Reference librarians prefer CNF - why?
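
A quick way to convince yourself of the two identities above is to check every truth assignment; the snippet below is just such a brute-force check, not part of any IR system.

```python
# Truth-table check of De Morgan's laws over all four assignments of (a, b).
from itertools import product

for a, b in product((False, True), repeat=2):
    assert (not (a and b)) == ((not a) or (not b))
    assert (not (a or b)) == ((not a) and (not b))
print("De Morgan's laws hold for all truth assignments")
```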

(C) 2003, The University of Michigan 37

Boolean model

• Partition

• Partial relevance?

• Operators: AND, NOT, OR, parentheses

(C) 2003, The University of Michigan 38

Exercise

• D1 = “computer information retrieval”

• D2 = “computer retrieval”

• D3 = “information”

• D4 = “computer information”

• Q1 = “information retrieval”

• Q2 = “information ¬computer”
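
One way to work the exercise is with a tiny inverted index and set operations; the sketch below assumes Q1 means (information AND retrieval) and Q2 means (information AND NOT computer).

```python
# Sketch: Boolean retrieval over the exercise collection via term -> document sets.

docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

index = {}                                   # term -> set of documents containing it
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)
q1 = index["information"] & index["retrieval"]              # Q1: information AND retrieval
q2 = index["information"] & (all_docs - index["computer"])  # Q2: information AND NOT computer
print(sorted(q1), sorted(q2))                               # ['D1'] ['D3']
```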

(C) 2003, The University of Michigan 39

Exercise

0  (no authors)

1  Swift

2  Shakespeare

3  Shakespeare Swift

4  Milton

5  Milton Swift

6  Milton Shakespeare

7  Milton Shakespeare Swift

8  Chaucer

9  Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift

12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
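
To see which of the sixteen regions the expression selects, one can simply enumerate the combinations above and evaluate it; the bit numbering below mirrors the list (Swift = 1, Shakespeare = 2, Milton = 4, Chaucer = 8) and the code is only an illustration.

```python
# Sketch: evaluate ((chaucer OR milton) AND NOT swift) OR (NOT chaucer AND (swift OR shakespeare))
# over the 16 author combinations numbered as in the list above.

AUTHORS = ("swift", "shakespeare", "milton", "chaucer")     # bit values 1, 2, 4, 8

for region in range(16):
    present = {a for i, a in enumerate(AUTHORS) if region & (1 << i)}
    c, m = "chaucer" in present, "milton" in present
    s, sh = "swift" in present, "shakespeare" in present
    if ((c or m) and not s) or (not c and (s or sh)):
        print(region, " ".join(sorted(present)) or "(none)")   # prints regions 1-8, 10, 12, 14
```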

(C) 2003, The University of Michigan 40

Stop lists

• The 250-300 most common words in English account for 50% or more of a given text.

• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.

• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).

• Token/type ratio: 2256/859 = 2.63
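
The figures above are easy to recompute for any text; a minimal sketch (the file name is hypothetical) counts types and tokens and measures how much of the text the most frequent types cover.

```python
# Sketch: type/token ratio and coverage of the top-k most frequent types.
from collections import Counter

def coverage_stats(tokens, top_k=65):
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_k))
    return {
        "types": len(counts),
        "tokens": len(tokens),
        "token/type ratio": round(len(tokens) / len(counts), 2),
        "top-%d coverage" % top_k: round(covered / len(tokens), 2),
    }

tokens = open("moby_dick_ch1.txt").read().lower().split()   # hypothetical local file
print(coverage_stats(tokens))
```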

(C) 2003, The University of Michigan 41

Vector-based representation

[Figure: documents Doc 1, Doc 2, Doc 3 plotted as vectors in a space whose axes are Term 1, Term 2, Term 3]

(C) 2003, The University of Michigan 42

Vector queries

• Each document is represented as a vector

• non-efficient representations (bit vectors)

• dimensional compatibility

Terms:   W1  W2  W3  W4  W5  W6  W7  W8  W9  W10
Weights: C1  C2  C3  C4  C5  C6  C7  C8  C9  C10

(C) 2003, The University of Michigan 43

The matching process

• Matching is done between a document and a query - topicality

• document space

• characteristic function F(d) = {0,1}

• distance vs. similarity - mapping functions

• Euclidean distance, Manhattan distance, Word overlap

(C) 2003, The University of Michigan 44

Vector-based matching

• The Cosine measure (sums taken over all index terms i):

  cos(D,Q) = Σ (di × qi) / ( sqrt(Σ di²) × sqrt(Σ qi²) )
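
A direct transcription of the formula (a sketch for illustration, not a library implementation):

```python
# Cosine measure between a document vector d and a query vector q of term weights.
import math

def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))
    norms = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norms if norms else 0.0

print(cosine([1, 2, 0], [2, 1, 1]))   # ~0.73
```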

• Intrinsic vs. extrinsic measures

(C) 2003, The University of Michigan 45

Exercise

• Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>

• Compute the corresponding Euclidean distances.

(C) 2003, The University of Michigan 46

Matrix representations

• Term-document matrix (m x n)

• term-term matrix (m x m x n)

• document-document matrix (n x n)

• Example: 3,000,000 documents (n) with 50,000 terms (m)

• sparse matrices

• Boolean vs. integer matrices
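
Because such matrices are overwhelmingly sparse, only the non-zero entries are stored; a minimal sketch (plain dictionaries, no particular library) of a sparse term-document matrix:

```python
# Sketch: sparse term-document matrix as nested dicts (term -> {doc_id: count}).

def term_document_matrix(docs):
    matrix = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            row = matrix.setdefault(term, {})
            row[doc_id] = row.get(doc_id, 0) + 1
    return matrix

docs = {"d1": "computer information retrieval", "d2": "information information"}
print(term_document_matrix(docs)["information"])   # {'d1': 1, 'd2': 2}
```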

(C) 2003, The University of Michigan 47

Zipf’s law

Rank × Frequency ≈ Constant    (Z = rank × frequency / total number of tokens)

Rank  Term   Freq.    Z
 1    the    69,971   0.070
 2    of     36,411   0.073
 3    and    28,852   0.086
 4    to     26,149   0.104
 5    a      23,237   0.116
 6    in     21,341   0.128
 7    that   10,595   0.074
 8    is     10,099   0.081
 9    was     9,816   0.088
10    he      9,543   0.095
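
The Z column can be reproduced for any corpus as rank × frequency / total tokens; the sketch below assumes a local text file (hypothetical name) and is only illustrative.

```python
# Sketch: checking Zipf's law (rank x relative frequency roughly constant).
from collections import Counter

tokens = open("corpus.txt").read().lower().split()    # hypothetical corpus file
counts = Counter(tokens)
total = len(tokens)
for rank, (term, freq) in enumerate(counts.most_common(10), start=1):
    z = rank * freq / total                           # the Z column above
    print(f"{rank:2d}  {term:10s}  {freq:8d}  {z:.3f}")
```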

(C) 2003, The University of Michigan 48

Evaluation

(C) 2003, The University of Michigan 49

Contingency table

                   retrieved     not retrieved
  relevant            w               x            n1 = w + x
  not relevant        y               z
                   n2 = w + y                      N

(C) 2003, The University of Michigan 50

Precision and Recall

Recall = w / (w + x)      (fraction of the n1 relevant documents that are retrieved)

Precision = w / (w + y)   (fraction of the n2 retrieved documents that are relevant)
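
In code, with w, x, y named as in the contingency table (a sketch, not tied to any particular toolkit):

```python
# Precision and recall from contingency counts:
# w = relevant and retrieved, x = relevant but not retrieved, y = retrieved but not relevant.

def precision_recall(w, x, y):
    recall = w / (w + x) if (w + x) else 0.0      # retrieved share of the relevant documents
    precision = w / (w + y) if (w + y) else 0.0   # relevant share of the retrieved documents
    return precision, recall

print(precision_recall(w=30, x=20, y=10))          # (0.75, 0.6)
```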

(C) 2003, The University of Michigan 51

Exercise

Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query, e.g.: Tolkien, “JRR Tolkien”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like.

(C) 2003, The University of Michigan 52

n   Doc. no  Relevant?  Recall  Precision
 1  588      x          0.2     1.00
 2  589      x          0.4     1.00
 3  576                 0.4     0.67
 4  590      x          0.6     0.75
 5  986                 0.6     0.60
 6  592      x          0.8     0.67
 7  984                 0.8     0.57
 8  988                 0.8     0.50
 9  578                 0.8     0.44
10  985                 0.8     0.40
11  103                 0.8     0.36
12  591                 0.8     0.33
13  772      x          1.0     0.38
14  990                 1.0     0.36

[From Salton’s book]
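
The recall and precision columns can be recomputed from the ranking and the set of relevant documents (5 in this example); the sketch below reproduces the table.

```python
# Sketch: running recall/precision down a ranked list.

def ranked_pr(ranking, relevant):
    hits, rows = 0, []
    for n, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        rows.append((n, doc, hits / len(relevant), hits / n))
    return rows

ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}
for n, doc, r, p in ranked_pr(ranking, relevant):
    print(f"{n:2d}  {doc}  {r:.1f}  {p:.2f}")
```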

(C) 2003, The University of Michigan 53

P/R graph

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1) for the ranked list above]

(C) 2003, The University of Michigan 54

Issues

• Standard levels for P&R (0-100%)

• Interpolation

• Average P&R

• Average P at given “document cutoff values”

• F-measure: F = 2/(1/R+1/P)
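
The last two items can be made concrete with a short sketch: the F-measure as defined above, and 11-point interpolated precision (precision at recall level r taken as the maximum precision at any recall ≥ r). The (recall, precision) pairs below come from the ranked list a few slides back.

```python
# Sketch: F-measure and 11-point interpolated precision.

def f_measure(p, r):
    return 2 / (1 / r + 1 / p) if p and r else 0.0   # harmonic mean of P and R

def interpolated(pr_points, levels=tuple(i / 10 for i in range(11))):
    # interpolated precision at level l = max precision at any recall >= l
    return [(l, max((p for r, p in pr_points if r >= l), default=0.0)) for l in levels]

pr_points = [(0.2, 1.00), (0.4, 1.00), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]
print(round(f_measure(0.75, 0.6), 2))   # 0.67
for level, p in interpolated(pr_points):
    print(f"recall {level:.1f}: precision {p:.2f}")
```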

(C) 2003, The University of Michigan 55

Relevance collections

• TREC adhoc collections, 2-6 GB

• TREC Web collections, 2-100 GB

(C) 2003, The University of Michigan 56

Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

Relevant documents for this topic:

LA031689-0177  FT922-1008     LA090190-0126  LA101190-0218  LA082690-0158
LA112590-0109  FT944-136      LA020590-0119  FT944-5300     LA052190-0048
LA051689-0139  FT944-9371     LA032390-0172  LA042790-0172  LA021790-0136
LA092289-0167  LA111189-0013  LA120189-0179  LA020490-0021  LA122989-0063
LA091389-0119  LA072189-0048  FT944-15615    LA091589-0101  LA021289-0208

(C) 2003, The University of Michigan 57

<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>