(C) 2003, The University of Michigan 1 Information Retrieval Handout #7 March 24, 2003


Information Retrieval

Handout #7

March 24, 2003


Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M&F 11-12

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Mondays, 1-4 PM in 409 West Hall


Schedule

• Readings for 03/31:
– Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
– Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
– Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002


Schedule

• March 24
– The link-content hypothesis
– XML retrieval

• March 31
– Information extraction
– Language reuse

• April 7
– Language modeling for IR
– The Lemur system


Schedule

• HW3 assigned 03/24

• HW3 due 04/07

• Final projects due 04/11

• Final project presentations 04/14

• Final exam 04/21
– 2-3 essay questions, 2-3 problems


The link-content hypothesis


Kleinberg and Lawrence, The structure of the Web - Science 294 1849-1850

Web structure


Web structure

• 16-20 links on average

• The fraction of pages with n in-links is approximately $n^{-\gamma}$ for $\gamma \approx 2.1$

• Kleinberg/Lawrence: 100,000 coherent communities (e.g., people concerned with oil spills off the coast of Japan)
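The power-law bullet above can be made concrete with a minimal Python sketch. It uses the exponent of roughly 2.1 from the slide; the normalization over n = 1..max_n (so that the fractions sum to one) and the helper name `inlink_fraction` are assumptions for illustration, not part of the handout.

```python
def inlink_fraction(n, exponent=2.1, max_n=100_000):
    """Approximate fraction of pages with exactly n in-links.

    Models the fraction as n**(-exponent), normalized over 1..max_n.
    The normalization range is an assumption made for this sketch.
    """
    if n < 1 or n > max_n:
        raise ValueError("n must be between 1 and max_n")
    norm = sum(k ** -exponent for k in range(1, max_n + 1))
    return n ** -exponent / norm


if __name__ == "__main__":
    # Each doubling of n divides the fraction by 2**2.1 (a bit over 4).
    for n in (1, 2, 4, 8):
        print(n, f"{inlink_fraction(n):.6f}")
```

Under this model most pages have very few in-links, while a small number of pages collect very many.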


Topical locality [Davison 00]

• Most web pages are linked to others with related content - this helps users navigate the Web.

• Presence of topical locality - important for building focused crawlers.

• Traditionally search engines only indexed titles and/or the first few lines of each document. Now, they index all links.

• “More evil than Satan himself”


Experimental design

• Local crawl of 100,000 pages

• Starts from HotBot and AltaVista

• Biased towards English-language pages

• Retrieve one outgoing link from each page.


TFIDF cosine similarity

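$$\mathrm{TFIDF}(w_i, P) = \frac{\mathrm{TF}(w_i, P) \cdot \mathrm{IDF}(w_i)}{\sqrt{\sum_{\text{all } w} \left(\mathrm{TF}(w, P) \cdot \mathrm{IDF}(w)\right)^2}}$$

$$\mathrm{IDF}(w) = \log \frac{n + 1}{\mathrm{DF}(w)}$$

$$\mathrm{TF}(w, P) = \log(\mathrm{DF}(w) + 1)$$

$$\mathrm{TFIDF\_Cos}(Q, P) = \frac{\sum_{\text{all } w} \mathrm{TFIDF}(w, Q) \cdot \mathrm{TFIDF}(w, P)}{\sqrt{\sum_{\text{all } w} \mathrm{TFIDF}(w, Q)^2 \cdot \sum_{\text{all } w} \mathrm{TFIDF}(w, P)^2}}$$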
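The TFIDF cosine measure on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: TF is taken as the log of the within-page term count plus one (the slide writes its TF line in terms of DF(w)), IDF is taken as log((n + 1) / DF(w)), and the function names and toy pages are hypothetical. The log base does not matter here because it cancels in the normalization.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Length-normalized TFIDF weights for one page.

    Assumes TF = log(count + 1), with count taken within the page, and
    IDF = log((n_docs + 1) / DF(w)); both readings are assumptions.
    """
    counts = Counter(tokens)
    raw = {
        w: math.log(c + 1) * math.log((n_docs + 1) / doc_freq[w])
        for w, c in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else raw


def tfidf_cos(q_vec, p_vec):
    """Cosine of two TFIDF vectors: dot product over the product of norms."""
    dot = sum(v * p_vec.get(w, 0.0) for w, v in q_vec.items())
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))
    p_norm = math.sqrt(sum(v * v for v in p_vec.values()))
    return dot / (q_norm * p_norm) if q_norm and p_norm else 0.0


if __name__ == "__main__":
    # Toy collection; the page contents are made up for illustration.
    pages = {
        "a": "focused crawling of the web".split(),
        "b": "crawling the web graph".split(),
        "c": "cooking pasta at home".split(),
    }
    doc_freq = Counter(w for toks in pages.values() for w in set(toks))
    vecs = {k: tfidf_vector(t, doc_freq, len(pages)) for k, t in pages.items()}
    print(tfidf_cos(vecs["a"], vecs["b"]))  # pages share several terms
    print(tfidf_cos(vecs["a"], vecs["c"]))  # pages share no terms
```

Pages that share weighted terms score above zero, and pages with no terms in common score exactly zero.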


Other metrics

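$$\mathrm{Fract}(w, Q) = \frac{\#(\text{times } w \text{ occurs in } Q)}{\#(\text{terms in } Q)}$$

Query term probability

$$\mathrm{Prob}(Q, P) = \sum_{\text{all } w} \begin{cases} \mathrm{Fract}(w, P), & \text{if } w \in Q \\ 0, & \text{otherwise} \end{cases}$$

Query-document overlap

$$\mathrm{Overlap}(Q, P) = \sum_{\text{all } w} \min(\mathrm{Fract}(w, P), \mathrm{Fract}(w, Q))$$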
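The two metrics on this slide can be sketched in a few lines of Python. The sketch assumes that Fract(w, X) is the number of occurrences of w in X divided by the number of terms in X, and that the query term probability sums Fract(w, P) once per distinct query word; the function names are hypothetical.

```python
from collections import Counter


def fract(tokens):
    """Fract(w, X) for every w in X: occurrences of w over total terms."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()} if total else {}


def query_term_prob(query_tokens, page_tokens):
    """Sum of Fract(w, P) over the distinct words w that occur in Q."""
    page_fract = fract(page_tokens)
    return sum(page_fract.get(w, 0.0) for w in set(query_tokens))


def overlap(query_tokens, page_tokens):
    """Sum over all w of min(Fract(w, P), Fract(w, Q)).

    Words missing from either side contribute min(..., 0) = 0,
    so only the shared words need to be visited.
    """
    q, p = fract(query_tokens), fract(page_tokens)
    return sum(min(q[w], p[w]) for w in q.keys() & p.keys())


if __name__ == "__main__":
    query = ["web", "crawler"]
    page = ["web", "web", "graph", "crawler"]
    print(query_term_prob(query, page))
    print(overlap(query, page))
```

Both scores fall between 0 and 1; the overlap reaches 1 only when the query and the page have identical term distributions.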


Experimental results

• 100,000 URLs but only 89,891 retrievable
• An additional 111,107 URLs: two children per initial page
• www.geocities.com (561), www.webring.com (419), www.amazon.com (303), etc.
• 18% top-level pages
• 50% .com, 27% .edu


Textual similarity

• TFIDF similarity

– 0.31 same domain

– 0.23 linked pages

– 0.19 sibling

– 0.02 random


Structure and content [Menczer 01]

• Cluster hypothesis (van Rijsbergen 79)

• Link-cluster conjecture (Menczer) - preservation of semantics across links


Experimental design

• Open directory project (dmoz.org)

• 896,233 URLs from 97,614 topics

• 150,000 URLs from 47,174 topics

• 10,000 from each of the 15 top-level branches


Measures of similarity

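• Cosine

• Link similarity

$$\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$$

• Semantic similarity

$$\sigma_s(c_1, c_2) = \frac{2 \log \Pr[\mathrm{lca}(c_1, c_2)]}{\log \Pr[c_1] + \log \Pr[c_2]}$$

[Figure: topic tree with nodes c1 and c2 and their common ancestor lca]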
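The link and semantic similarity measures on this slide can be sketched in Python as below. The slide does not define what goes into the URL set U_p or how the topic probabilities are estimated, so the sketch simply takes those as inputs; the function names and the handling of the degenerate case are assumptions.

```python
import math


def link_similarity(urls_1, urls_2):
    """Jaccard coefficient of two URL sets: |U1 & U2| / |U1 | U2|."""
    u1, u2 = set(urls_1), set(urls_2)
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0


def semantic_similarity(pr_c1, pr_c2, pr_lca):
    """2 log Pr[lca] / (log Pr[c1] + log Pr[c2]) for topic probabilities.

    pr_lca is the probability of the common ancestor topic of c1 and c2.
    """
    denom = math.log(pr_c1) + math.log(pr_c2)
    if denom == 0:
        # Both topics have probability 1; returning 1.0 is a convention
        # chosen for this sketch.
        return 1.0
    return 2 * math.log(pr_lca) / denom


if __name__ == "__main__":
    print(link_similarity({"a", "b", "c"}, {"b", "c", "d"}))
    # Same topic: the ancestor is the topic itself.
    print(semantic_similarity(0.01, 0.01, 0.01))
    # Topics that only meet at the root (probability 1).
    print(semantic_similarity(0.01, 0.02, 1.0))
```

A rarer (more specific) common ancestor gives a higher semantic similarity, while topics that share only the root score zero.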


Correlations between similarities

• Over 3.84 × 10^9 pairs

• Highest for News and Home (correlation > 0.2)

• Lowest for Arts and Games (correlation < 0.05)


Fit

[Equation: exponential fit with parameter values 1.8 (subscript 1), 0.6 (subscript 2), and 0.03]


Document closures for Q&A

[Figure: a P-L-P chain labeled with the terms "capital", "spain", and "Madrid"]


Document closures for IR

[Figure: a P-L-P chain labeled with the terms "Physics", "Physics Department", "University of Michigan", and "Michigan"]


The perltree experiments

• 23.6% of the Excite log (2.5 M queries)
– 60% have both words in WordNet
– 27% have one word in WordNet
– 13% have no words in WordNet

• 200 queries from the log

• 200 random queries


Two-word queries

jimi SAT
seats david
caesar poker
cruise yellow
science Tishara
trim yankee
witnesses naked
swaybar cheats
rides Precious
drugs university
Clock engines
metal choreography
anthony swinging
psychoanalysis webdesign
pic lens

toys online
speech therapy
Malcolm McDowell
cellular accessories
migrant farmworkerswitch tv
davis instruments
Adult Games
chichen itza
freighter Cruises
used motorcycles
feng shui
revolucion mexicana
zeebrugee belgium
electronic greetings


Query analysis

• Words:
– Familiarity
– Ambiguity
– IDF

• Queries:
– GoogleSize
– SemDist
– DistribSim


Query analysis

Query set   Fam1  Fam2  Amb1  Amb2  IDF1  IDF2  Gsize    SemD  DistS
Excite (E)  1.42  1.89  1.70  2.36  4.00  4.74  670,000  0.39  0.06
Random (R)  1.54  1.61  2.06  2.29  4.40  4.55  329,000  0.29  0.02


Link-based language models

• WT2g corpus

• 247,491 pages

• 3,118,248 links

• 948,036 unique words


A - number of documents that contain the word and are in the collection
total documents in the collection = 246379
B - number of all the outgoing links
C - number of outgoing links that are in the collection
D - number of outgoing links that are not in the collection
E - number of outgoing links that are in the collection and contain the word
F - number of outgoing links that are in the collection and do not contain the word

#   familiarity  polysemy  old idf  word  new idf   A       B       C       D       E       F      p=A/total  p'=E/C    p'/p     df recovered from idf
3   n/a          n/a       0.79     the   0.78968   213800  871953  160935  711018  146913  14022  0.867769   0.912872  1.05198  212839
4   n/a          n/a       0.8      of    0.80112   211132  889967  161670  728297  144685  16985  0.85694    0.89494   1.04434  210183
5   n/a          n/a       0.81     to    0.809069  209307  867918  157538  710380  142058  15480  0.849533   0.901738  1.06145  208367
7   n/a          n/a       0.83     and   0.83064   204471  855610  157059  698551  138221  18838  0.829904   0.880058  1.06043  203552
8   3            4         0.86     a     0.862427  197638  832433  149540  682893  126968  22572  0.802171   0.849057  1.05845  196750
9   2            2         0.86     in    0.875135  194999  821104  145786  675318  123851  21935  0.791459   0.84954   1.07338  194123
11  n/a          n/a       0.91     for   0.911742  187675  804872  144882  659990  119299  25583  0.761733   0.823422  1.08098  186832
1   n/a          n/a       0.55     on    0.952471  179979  809814  141024  668790  112477  28547  0.730497   0.797573  1.09182  179170
12  n/a          n/a       1.01     is    1.009982  169847  692395  122096  570299  94133   27963  0.689373   0.770975  1.11837  169084
13  n/a          n/a       1.05     by    1.049723  163302  733707  122178  611529  93417   28761  0.662808   0.764598  1.15357  162568
14  n/a          n/a       1.12     with  1.122566  152169  687696  113973  573723  79774   34199  0.617622   0.699938  1.13328  151485
15  n/a          n/a       1.15     thi   1.151407  148043  676925  106199  570726  73402   32797  0.600875   0.691174  1.15028  147378
16  1            1         1.15     ar    1.16283   146450  636927  109257  527670  74169   35088  0.594409   0.678849  1.14206  145792
19  n/a          n/a       1.18     from  1.177686  144412  678667  114326  564341  78700   35626  0.586138   0.688382  1.17444  143763
10  1            1         0.88     be    1.185611  143340  632562  100816  531746  68522   32294  0.581787   0.679674  1.16825  142696
20  2            2         1.19     at    1.196502  141884  667875  108091  559784  73134   34957  0.575877   0.676597  1.1749   141247
21  n/a          n/a       1.2      or    1.208855  140256  659173  103814  555359  68726   35088  0.569269   0.662011  1.16291  139626
22  n/a          n/a       1.21     that  1.213056  139708  607946  95101   512845  63810   31291  0.567045   0.670971  1.18328  139080
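The derived columns p, p', and p'/p follow directly from the legend, and a short Python sketch shows the arithmetic for one row. The function name `link_word_stats` is hypothetical; the collection size and the row values for the word "the" are taken from the slide.

```python
TOTAL_DOCS = 246379  # total documents in the collection (from the legend)


def link_word_stats(a, c, e, total=TOTAL_DOCS):
    """Return (p, p_prime, ratio) for one word.

    p       = A / total: chance that a document in the collection
              contains the word.
    p_prime = E / C: chance that an in-collection link target contains
              the word.
    ratio   = p_prime / p.
    """
    p = a / total
    p_prime = e / c
    return p, p_prime, p_prime / p


if __name__ == "__main__":
    # Row for the word "the": A = 213800, C = 160935, E = 146913.
    p, p_prime, ratio = link_word_stats(213800, 160935, 146913)
    print(f"p={p:.6f} p'={p_prime:.6f} p'/p={ratio:.5f}")
```

A ratio above 1 means the word appears in linked-to pages more often than in the collection as a whole, which is the pattern every row of the table shows.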


Procedure

• Given a query q1 q2
– Get top 50 hits from AltaVista (A)
– Extract links that contain q1 or q2
– Get pages that are linked (B)
– Extract links from A ∪ B that point to A ∪ B
– Index A ∪ B using glimpse
– Compute link fertility
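The set-building steps of this procedure can be sketched on in-memory data, as a rough illustration only. The sketch assumes that "links that contain q1 or q2" means links whose anchor text contains one of the query words, and it leaves out fetching pages, indexing with glimpse, and the link fertility computation, none of which the slide spells out. The function name and the input layout are hypothetical.

```python
def expand_hits(top_hits, links, q1, q2):
    """Build A, B, and the links internal to A | B.

    top_hits: URLs returned for the query (set A).
    links: dict mapping a source URL to a list of
           (anchor_text, target_url) pairs.
    q1, q2: lowercase query words.
    """
    a = set(top_hits)
    b = set()
    for src in a:
        for anchor, target in links.get(src, []):
            words = anchor.lower().split()
            if q1 in words or q2 in words:
                b.add(target)
    b -= a  # keep only pages that are new relative to A
    pool = a | b
    internal = {
        (src, target)
        for src in pool
        for _, target in links.get(src, [])
        if target in pool
    }
    return a, b, internal


if __name__ == "__main__":
    # Toy link data, made up for illustration.
    links = {
        "a1": [("Spain capital", "b1"), ("weather", "x")],
        "a2": [("capital cities", "a1")],
        "b1": [("home", "a2")],
    }
    a, b, internal = expand_hits(["a1", "a2"], links, "capital", "spain")
    print(sorted(b))
    print(sorted(internal))
```

Pages in B are the candidates for new results, since they were reached by query-bearing links but were not among the original hits.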


Results

• New links pointing to pages that were not in the AltaVista top 50
– E = +11.7%, R = +8.9%

• Improvements higher for
– rarer words
– lower distributional similarity
– lower semantic distance


Topic distillation [Chakrabarti et al. 01]

• Topic drift

• Returning snippets rather than full documents

• Clique attacks (www.411fun.com, www.411fashion.com, www.411loans.com)