(C) 2003, The University of Michigan 2
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: TBA
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall
Syllabus (Part I)
CLASSIC IR
Week 1 The Concept of Information Need, IR Models, Vector models, Boolean models
Week 2 Retrieval Evaluation, Precision and Recall, F-measure, Reference collection, The TREC conferences
Week 3 Queries and Documents, Query Languages, Natural language querying, Relevance feedback
Week 4 Indexing and Searching, Inverted indexes
Week 5 XML retrieval
Week 6 Language modeling approaches
Syllabus (Part II)
WEB-BASED IR
Week 7 Crawling the Web, hyperlink analysis, measuring the Web
Week 8 Similarity and clustering, bottom-up and top-down paradigms
Week 9 Social network analysis for IR, Hubs and authorities, PageRank and HITS
Week 10 Focused crawling, Resource discovery, discovering communities
Week 11 Question answering
Week 12 Additional topics, e.g., relevance transfer
Week 13 Project presentations
Readings
BOOKS
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999. http://www.sims.berkeley.edu/~hearst/irbook/
Soumen Chakrabarti, Mining the Web, Morgan Kaufmann, 2002. http://www.cse.iitb.ac.in/~soumen/
PAPERS
Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
Barabási and Albert, "Emergence of scaling in random networks", Science 286:509-512, 1999
Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
Davison, "Topical locality on the Web", SIGIR 2000
Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
Albert, Jeong, and Barabási, "Diameter of the world wide web", Nature 401:130-131, 1999
Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
Haveliwala, "Topic-sensitive PageRank", WWW 2002
Lawrence and Giles, "Accessibility of information on the Web", Nature 400:107-109, 1999
Lawrence and Giles, "Searching the World-Wide Web", Science 280:98-100, 1998
Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
Menczer, "Growing and Navigating the Small World Web by Local Content", Proc. Natl. Acad. Sci. USA 99(22), 2002
Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002
Radev et al., "Content Diffusion on the Web Graph"
CASE STUDIES (IR SYSTEMS)
Lemur, MG, Google, AskJeeves, NSIR
Assignments
Homework: The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines.
Project: The final course project can be done in one of three formats:
(1) a programming project implementing a challenging and novel information retrieval application,
(2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or
(3) a SIGIR-style experimental IR paper.
Grading
• Three HW assignments (30%)
• Project (30%)
• Final (40%)
Topics
• IR systems
• Evaluation methods
• Indexing, search, and retrieval
Need for IR
• Advent of the WWW: more than 3 billion documents indexed on Google
• How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/
• Search, routing, filtering
• User’s information need
Some definitions of Information Retrieval (IR)
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multimedia objects.”
Examples of IR systems
• Conventional (library catalog). Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST). Search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
• Question answering systems (AskJeeves, NSIR, AnswerBus). Search in (restricted) natural language.
Types of queries (AltaVista)
Including or excluding words:
To make sure that a specific word is always included in your search topic, place the plus (+) symbol before the keyword in the search box. To make sure that a specific word is always excluded from your search topic, place a minus (-) sign before the keyword in the search box.
Example: To find recipes for cookies with oatmeal but without raisins, try recipe cookie +oatmeal -raisin.
Expand your search using wildcards (*):By typing an * at the end of a keyword, you can search for the word with multiple endings.
Example: Try wish* to find wish, wishes, wishful, wishbone, and wishy-washy.
Types of queries
AND (&): Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.
OR (|): Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.
NOT (!): Excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone; use it with another operator, like AND.
NEAR (~): Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.
Mappings and abstractions
Reality → Data
Information need → Query
From Korfhage’s book
Documents
• Not just printed paper
• collections vs. documents
• data structures: representations
• document surrogates: keywords, summaries
• encoding: ASCII, Unicode, etc.
Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface
Sample queries (from Excite)
(queries shown verbatim, including misspellings)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
Size matters
• Typical document surrogate: 200 to 2000 bytes
• Book: up to 3 MB of data
• Stemming: computer, computational, computing
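Stemming conflates these variants to a common root so they match the same index entries. A toy suffix-stripper sketches the idea (the rule list below is invented for illustration; it is not the Porter algorithm):

```python
def toy_stem(word):
    """Strip a few common suffixes; illustrative only, not a real stemmer."""
    for suffix in ("ational", "ation", "ing", "er", "ers", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

words = ["computer", "computational", "computing"]
print({w: toy_stem(w) for w in words})
```

All three variants reduce to the same stem, so a query on any one of them would retrieve documents containing the others.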
Key Terms Used in IR
• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
• DOCUMENT: an information entity that the user wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes querying easier
• TERM: word or concept that appears in a document or a query
Other important terms
• Classification
• Cluster
• Similarity
• Information Extraction
• Term Frequency
• Inverse Document Frequency
• Precision
• Recall
• Inverted File
• Query Expansion
• Relevance
• Relevance Feedback
• Stemming
• Stopword
• Vector Space Model
• Weighting
• TREC/TIPSTER/MUC
Query structures
• Query viewed as a document?
  – Length
  – repetitions
  – syntactic differences
• Types of matches:
  – exact
  – range
  – approximate
Additional references on IR
• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries
Related courses elsewhere
• Berkeley (Marti Hearst and Ray Larson)http://www.sims.berkeley.edu/courses/is202/f00/
• Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)http://www.stanford.edu/class/cs276a/
• Cornell (Jon Kleinberg)http://www.cs.cornell.edu/Courses/cs685/2002fa/
• CMU (Yiming Yang and Jamie Callan)http://la.lti.cs.cmu.edu/classes/11-741/
Readings for weeks 1 – 3
• MIR (Modern Information Retrieval)
  – Week 1
    • Chapter 1 “Introduction”
    • Chapter 2 “Modeling”
    • Chapter 3 “Evaluation”
  – Week 2
    • Chapter 4 “Query languages”
    • Chapter 5 “Query operations”
  – Week 3
    • Chapter 6 “Text and multimedia languages”
    • Chapter 7 “Text operations”
    • Chapter 8 “Indexing and searching”
Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy
• Latent semantic indexing
Major IR tasks
• Ad-hoc
• Filtering and routing
• Question answering
• Spoken document retrieval
• Multimedia retrieval
Boolean queries
• What types of documents are returned?
• Stemming
• thesaurus expansion
• inclusive vs. exclusive OR
• confusing uses of AND and OR
Examples:
restaurants AND (Mideastern OR vegetarian) AND inexpensive
dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
Boolean queries
• Weighting (Beethoven AND sonatas)
• precedence
coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
• Use of negation: potential problems
• Conjunctive and Disjunctive normal forms
• Full CNF and DNF
Transformations
• De Morgan’s Laws:
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
• CNF or DNF? – Reference librarians prefer CNF - why?
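De Morgan's laws can be checked exhaustively over document sets, treating NOT as complement with respect to the collection. A minimal sketch (the toy collection and term sets below are invented for illustration):

```python
docs = {1, 2, 3, 4, 5, 6}   # toy collection of document ids
A = {1, 2, 3}               # documents matching term A
B = {3, 4, 5}               # documents matching term B

# NOT (A AND B) == (NOT A) OR (NOT B)
assert docs - (A & B) == (docs - A) | (docs - B)
# NOT (A OR B) == (NOT A) AND (NOT B)
assert docs - (A | B) == (docs - A) & (docs - B)
print("De Morgan's laws hold on this collection")
```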
Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parentheses
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
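The exercise can be worked mechanically with a minimal Boolean matcher over the four documents, reading a multi-word query as AND and ¬ as set difference (an assumption about the intended query semantics):

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}
index = {d: set(text.split()) for d, text in docs.items()}

def match_and(*terms):
    """Documents containing every query term (Boolean AND)."""
    return {d for d, doc_terms in index.items() if set(terms) <= doc_terms}

def match_and_not(term, negated):
    """Documents containing `term` but not `negated`."""
    return {d for d, doc_terms in index.items()
            if term in doc_terms and negated not in doc_terms}

print(match_and("information", "retrieval"))     # Q1
print(match_and_not("information", "computer"))  # Q2
```

Only D1 contains both Q1 terms, and only D3 contains "information" without "computer".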
Exercise
0 (none)
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
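The 16 numbered regions above encode author subsets in binary (Swift = 1, Shakespeare = 2, Milton = 4, Chaucer = 8), so the query can be evaluated mechanically over all regions:

```python
SWIFT, SHAKESPEARE, MILTON, CHAUCER = 1, 2, 4, 8

def matches(n):
    """Evaluate the Boolean query on region n of the author lattice."""
    s = bool(n & SWIFT)
    h = bool(n & SHAKESPEARE)
    m = bool(n & MILTON)
    c = bool(n & CHAUCER)
    return ((c or m) and not s) or ((not c) and (s or h))

print(sorted(n for n in range(16) if matches(n)))
```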
Stop lists
• The 250-300 most common words in English account for 50% or more of a given text.
• Example: “the” and “of” represent 10% of tokens; “and”, “to”, “a”, and “in” account for another 10%; the next 12 words, another 10%.
• Moby Dick, Ch. 1: 859 unique words (types), 2,256 word occurrences (tokens). The top 65 types cover 1,132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
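The same type/token bookkeeping can be reproduced on any text; a sketch on a toy sentence (the Moby Dick figures above come from the slide, not from this code):

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()
types = Counter(tokens)

print(len(tokens), len(types))            # token count, type count
print(len(tokens) / len(types))           # token/type ratio
top = types.most_common(2)                # most frequent types
print(top)
```

Even in this tiny sample, the single most common type ("the") already accounts for a disproportionate share of the tokens.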
Vector-based representation
[Figure: documents Doc 1, Doc 2, and Doc 3 plotted as vectors in a space whose axes are Term 1, Term 2, and Term 3]
Vector queries
• Each document is represented as a vector
• non-efficient representations (bit vectors)
• dimensional compatibility
[Figure: a document represented as a vector of counts C1...C10 over words W1...W10]
The matching process
• Matching is done between a document and a query - topicality
• document space
• characteristic function F(d) ∈ {0, 1}
• distance vs. similarity - mapping functions
• Euclidean distance, Manhattan distance, Word overlap
Vector-based matching
• The Cosine measure
cos(D, Q) = Σᵢ (dᵢ × qᵢ) / ( √(Σᵢ dᵢ²) × √(Σᵢ qᵢ²) )
• Intrinsic vs. extrinsic measures
Exercise
• Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>
• Compute the corresponding Euclidean distances.
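The exercise can be checked numerically with small helpers for the cosine measure and Euclidean distance:

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Euclidean distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print(cosine(D1, D2))  # D2 is a scaled copy of D1, so cosine is 1 (up to rounding)
print(cosine(D1, D3))
print(euclidean(D1, D2), euclidean(D1, D3))
```

Note how D2, though far from D1 in Euclidean distance, is identical to it under the cosine measure, which ignores vector length.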
Matrix representations
• Term-document matrix (m x n)
• term-term matrix (m x m x n)
• document-document matrix (n x n)
• Example: 3,000,000 documents (n) with 50,000 terms (m)
• sparse matrices
• Boolean vs. integer matrices
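At the scale in the example (3,000,000 × 50,000 = 1.5 × 10¹¹ cells), a dense term-document matrix is infeasible, which is why sparse storage is used: only nonzero cells are kept, e.g. as postings. A toy sketch (the three documents below are invented for illustration):

```python
from collections import defaultdict

docs = {
    0: "computer information retrieval",
    1: "computer retrieval",
    2: "information",
}

# Sparse term-document matrix as postings: term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

print(dict(index["computer"]))           # postings for one term
stored = sum(len(p) for p in index.values())
print(stored, len(docs) * len(index))    # nonzero cells vs. dense cells
```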
Zipf’s law
Rank × Frequency ≈ Constant
Rank  Term  Freq.   Z
1     the   69,971  0.070
2     of    36,411  0.073
3     and   28,852  0.086
4     to    26,149  0.104
5     a     23,237  0.116
6     in    21,341  0.128
7     that  10,595  0.074
8     is    10,099  0.081
9     was    9,816  0.088
10    he     9,543  0.095
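Zipf's law predicts that rank × frequency stays roughly constant, and the table can be checked directly (frequencies copied from the slide):

```python
freqs = {1: 69_971, 2: 36_411, 3: 28_852, 4: 26_149, 5: 23_237,
         6: 21_341, 7: 10_595, 8: 10_099, 9: 9_816, 10: 9_543}

# rank * frequency for each row: all of the same order of magnitude
products = {rank: rank * f for rank, f in freqs.items()}
print(products)
```

Although frequency drops by a factor of about 7 from rank 1 to rank 10, the rank-frequency products vary much less, as the law predicts.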
Contingency table
                 retrieved    not retrieved
relevant             w              x          n1 = w + x
not relevant         y              z
                 n2 = w + y                    N
Exercise
Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned.
Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like.
n  Doc. no  Relevant?  Recall  Precision
1  588      x          0.2     1.00
2 589 x 0.4 1.00
3 576 0.4 0.67
4 590 x 0.6 0.75
5 986 0.6 0.60
6 592 x 0.8 0.67
7 984 0.8 0.57
8 988 0.8 0.50
9 578 0.8 0.44
10 985 0.8 0.40
11 103 0.8 0.36
12 591 0.8 0.33
13 772 x 1.0 0.38
14 990 1.0 0.36
[From Salton’s book]
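The recall and precision columns above can be recomputed from the ranking and the relevance judgments alone (doc numbers and the five 'x'-marked relevant documents copied from the table):

```python
relevant = {588, 589, 590, 592, 772}   # the five documents marked 'x'
ranking = [588, 589, 576, 590, 986, 592, 984,
           988, 578, 985, 103, 591, 772, 990]

hits = 0
rows = []
for n, doc in enumerate(ranking, start=1):
    hits += doc in relevant
    # recall = relevant seen so far / all relevant; precision = relevant / retrieved
    rows.append((n, round(hits / len(relevant), 2), round(hits / n, 2)))

for n, recall, precision in rows:
    print(n, recall, precision)
```

The output reproduces the table: recall climbs in steps of 0.2 as each relevant document appears, while precision decays between those steps.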
P/R graph
[Figure: precision (y-axis) vs. recall (x-axis), both from 0 to 1.2, plotting the values from the table above]
Issues
• Standard levels for P&R (0-100%)
• Interpolation
• Average P&R
• Average P at given “document cutoff values”
• F-measure: F = 2/(1/R+1/P)
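The F-measure is the harmonic mean of precision and recall, and is a one-line computation:

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall: F = 2 / (1/R + 1/P)."""
    return 2 / (1 / r + 1 / p)

# e.g., the rank-4 point of the earlier example ranking (P = 0.75, R = 0.6)
print(f_measure(0.75, 0.6))
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot achieve a high F by maximizing only precision or only recall.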
Relevance collections
• TREC adhoc collections, 2-6 GB
• TREC Web collections, 2-100 GB
Sample TREC query
<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>
Relevant documents:
LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208
<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition</P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk</P></SECTION>
<LENGTH><P>586 words</P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS</P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer</P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents.</P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws.</P>
<P>Several Fatalities</P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said.</P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai.</P>
...
</TEXT>
<GRAPHIC><P>Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said.</P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY</P></SUBJECT>
</DOC>
TREC results
• http://trec.nist.gov/presentations/presentations.html