CIS 430 November 6, 2008 Emily Pitler



Named Entities

1 or 2 words

Ambiguous meaning

Ambiguous intent


Mei and Church, WSDM 2008

Beitzel et al., SIGIR 2004

America Online, week in December 2003

Popular queries: 1.7 words

Overall: 2.2 words

Lempel and Moran, WWW 2003: AltaVista, summer 2001

7,175,151 queries
2,657,410 distinct queries
1,792,104 queries occurred only once (63.7%)

Most popular query: 31,546 times
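A minimal sketch of how statistics like these can be computed from a raw query log; the file name and one-query-per-line format are assumptions, and lowercasing is just one possible normalization choice:

```python
from collections import Counter

# Hypothetical input: "queries.txt" with one query per line.
with open("queries.txt") as f:
    queries = [line.strip().lower() for line in f if line.strip()]

counts = Counter(queries)
total = len(queries)                                     # e.g. 7,175,151
distinct = len(counts)                                   # e.g. 2,657,410
singletons = sum(1 for c in counts.values() if c == 1)   # queries seen only once
top_query, top_count = counts.most_common(1)[0]
avg_len = sum(len(q.split()) for q in queries) / total   # average words per query

print(total, distinct, singletons, avg_len)
print("most popular:", top_query, top_count)
```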

Saraiva et al., SIGIR 2001

Lempel and Moran, WWW 2003

American Airlines?

or Alcoholics Anonymous?


Clarity score ~ low ambiguity (Cronen-Townsend et al., SIGIR 2002)

Compare a language model:
◦ over the relevant documents for a query
◦ over all possible documents

The more different these are, the clearer the query is

“programming perl” vs. “the”


Query Language Model:
P(w|Q) = Σ_{D ∈ R} P(w|D) P(D|Q)    (R = the set of documents retrieved for Q)

Collection Language Model (unigram):
P(w|collection) = C(w) / Σ_{w' ∈ all words} C(w')


Relative entropy between the two distributions

Cost in bits of coding using Q when true distribution is P

D_KL(P || Q) = Σ_i P(i) ( log(P(i)) - log(Q(i)) )

H(P(x)) = -Σ_i P(i) log(P(i))


D_KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )
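A small NumPy sketch of relative entropy, in bits, using the convention that terms with P(i) = 0 contribute nothing (the example distributions are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log2(P(i) / Q(i)), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with P(i) = 0 contribute 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0 (identical distributions)
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.53 bits
```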


Clarity score = Σ_{w ∈ V} P(w|Q) log2( P(w|Q) / P_coll(w) )
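A sketch of how the clarity score could be computed, assuming we already have the retrieved documents for the query as token lists and term counts for the whole collection; the helper names and the uniform P(D|Q) weighting are simplifications of my own, not the exact estimation in Cronen-Townsend et al.:

```python
import math
from collections import Counter

def unigram_lm(counts):
    """Maximum-likelihood unigram model from raw term counts."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clarity_score(retrieved_docs, collection_counts):
    """Clarity = sum over w of P(w|Q) * log2( P(w|Q) / P_coll(w) ).

    retrieved_docs    -- list of token lists for the documents retrieved for Q
    collection_counts -- Counter of term frequencies over the whole collection
    P(D|Q) is treated as uniform over the retrieved set (a simplification).
    """
    coll_lm = unigram_lm(collection_counts)
    doc_lms = [unigram_lm(Counter(doc)) for doc in retrieved_docs]
    vocab = set().union(*doc_lms)
    score = 0.0
    for w in vocab:
        p_w_q = sum(lm.get(w, 0.0) for lm in doc_lms) / len(doc_lms)
        # Retrieved documents are part of the collection, so coll_lm contains w.
        score += p_w_q * math.log2(p_w_q / coll_lm[w])
    return score

# A focused query like "programming perl" should get a higher clarity score
# than a vague one like "the".
```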


Navigational
◦ greyhound bus
◦ compaq

Informational
◦ San Francisco
◦ normocytic anemia

Transactional
◦ britney spears lyrics
◦ download adobe reader

Broder SIGIR 2002


The more webpages that point to you, the more important you are

The more important webpages point to you, the more important you are

These intuitions led to PageRank

PageRank led to…

Page et al. 1998


Example link graph: cnn.com, nytimes.com, washingtonpost.com, mtv.com, vh1.com


Assume our surfer is on a page

In the next time step she can:
◦ Choose a link on the current page uniformly at random, or
◦ Go somewhere else in the web uniformly at random

After a long time, what is the probability she is on a given page?
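A toy simulation of this random surfer on the small link graph above (the outlink structure and the 15% "go somewhere else" probability are invented for illustration):

```python
import random
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "cnn.com": ["nytimes.com", "washingtonpost.com"],
    "nytimes.com": ["cnn.com"],
    "washingtonpost.com": ["cnn.com", "nytimes.com"],
    "mtv.com": ["vh1.com"],
    "vh1.com": ["mtv.com"],
}
pages = list(links)

def surf(steps=200_000, teleport=0.15):
    """Follow the random surfer and record how often she visits each page."""
    visits = Counter()
    page = random.choice(pages)
    for _ in range(steps):
        if random.random() < teleport or not links[page]:
            page = random.choice(pages)            # jump anywhere, uniformly
        else:
            page = random.choice(links[page])      # follow a random outlink
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}

print(surf())   # long-run visit fractions: an empirical estimate of PageRank
```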


P(v) = Σ_{u ∈ B_v} P(u) / deg(u)

B_v: the pages that point to v
deg(u): number of outgoing links of u (pages spread out their probability over their outgoing links)


Could also “get bored” with probability d and jump somewhere else completely


P(v) = d/N + (1 - d) Σ_{u ∈ B_v} P(u) / deg(u)

N: total number of pages
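A minimal power-iteration sketch of this update rule; d = 0.15 is an assumed "bored" probability, and `links` is the page-to-outlinks dictionary from the random-surfer sketch above:

```python
def pagerank(links, d=0.15, iterations=50):
    """Iterate P(v) = d/N + (1 - d) * sum over u in B_v of P(u) / deg(u)."""
    pages = list(links)
    n = len(pages)
    p = {v: 1.0 / n for v in pages}                # start from a uniform distribution
    for _ in range(iterations):
        new_p = {v: d / n for v in pages}          # "bored" jump: d/N to every page
        for u in pages:
            out = links[u]
            if not out:                            # dangling page: share mass with everyone
                for v in pages:
                    new_p[v] += (1 - d) * p[u] / n
            else:
                for v in out:
                    new_p[v] += (1 - d) * p[u] / len(out)
        p = new_p
    return p

print(pagerank(links))   # on the toy graph above, cnn.com gets the largest score
```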


Google, obviously

Given objects and links between them, measures importance

Summarization (Erkan and Radev, 2004)
◦ Nodes = sentences, edges = thresholded cosine similarity

Research (Mimno and McCallum, 2007)
◦ Nodes = people, edges = citations

Facebook?


Words on the page

Title

Domain

Anchor text—what other sites say when they link to that page


Title: Ani Nenkova - Home

Domain: www.cis.upenn.edu


What OTHER webpages say about your webpage

Very good descriptions of what’s on a page

Link to: www.cis.upenn.edu/~nenkova

“Ani Nenkova” is anchor text for that page


10,000 documents
10 of them are relevant

What happens if you decide to return absolutely nothing?

99.9% accuracy


Standard metrics in Information Retrieval

Precision: Of what you return, how many are relevant?

Recall: Of what is relevant, how many do you return?

Precision = |Retrieved ∩ Relevant| / |Retrieved|

Recall = |Retrieved ∩ Relevant| / |Relevant|
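A minimal sketch with Python sets, reusing the 10,000/10 example from the earlier slide; the "return absolutely nothing" system is the hypothetical one discussed above:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|; Recall = same / |relevant|."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = set(range(10))        # the 10 relevant documents out of 10,000
nothing = set()                  # the "return absolutely nothing" system
print(precision_recall(nothing, relevant))   # (0.0, 0.0), despite 99.9% accuracy
```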


Not always clear-cut binary classification: relevant vs. not relevant

How do you measure recall over the whole web?

How many of the 2.7 billion results will get looked at? Which ones actually need to be good?


Very relevant > Somewhat relevant > Not relevant

Want most relevant documents to be ranked first

NDCG = DCG / DCG of the ideal ordering

Ranges from 0 to 1

DCG_p = rel_1 + Σ_{i=2}^{p} rel_i / log2(i)


Proposed ordering: relevance scores 4, 2, 0, 1 (ideal ordering: 4, 2, 1, 0)

DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) = 6.5

IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 6.63

NDCG = 6.5/6.63 = .98
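A short sketch that reproduces this calculation:

```python
import math

def dcg(rels):
    """DCG_p = rel_1 + sum for i = 2..p of rel_i / log2(i)."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    return dcg(rels) / dcg(sorted(rels, reverse=True))   # divide by the ideal ordering's DCG

proposed = [4, 2, 0, 1]
print(dcg(proposed))                         # 6.5
print(dcg(sorted(proposed, reverse=True)))   # ~6.63
print(ndcg(proposed))                        # ~0.98
```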


Documents: hundreds of words
Queries: 1 or 2, often ambiguous, words

It would be much easier to compare documents with other documents

How can we turn a query into a document?

Just find ONE relevant document, then use that to find more


New Query = Original Query + Terms from Relevant Docs - Terms from Irrelevant Docs

Original query = "train"

Relevant:
◦ www.dog-obedience-training-review.com

Irrelevant:
◦ http://en.wikipedia.org/wiki/Caboose

New query = train + .3*dog - .2*railroad
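A sketch of this kind of query expansion as vector arithmetic, in the spirit of Rocchio feedback; the 0.3/0.2 coefficients echo the example above, but the toy term vectors and helper names are illustrative assumptions:

```python
from collections import Counter

def expand_query(query_vec, relevant_vecs, irrelevant_vecs, beta=0.3, gamma=0.2):
    """new_query = query + beta * centroid(relevant) - gamma * centroid(irrelevant)."""
    new_q = Counter(query_vec)
    for vecs, weight in ((relevant_vecs, beta), (irrelevant_vecs, -gamma)):
        if not vecs:
            continue
        for vec in vecs:
            for term, value in vec.items():
                new_q[term] += weight * value / len(vecs)
    return {t: v for t, v in new_q.items() if v > 0}   # drop terms pushed below zero

query = {"train": 1.0}
relevant = [{"dog": 1.0, "obedience": 0.6, "train": 0.8}]
irrelevant = [{"railroad": 1.0, "caboose": 0.9, "train": 0.5}]
print(expand_query(query, relevant, irrelevant))
# roughly {'train': 1.14, 'dog': 0.3, 'obedience': 0.18}; the railroad sense is pushed down
```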


Explicit feedback
◦ Ask the user to mark relevant versus irrelevant
◦ Or, grade on a scale (like we saw for NDCG)

Implicit feedback
◦ Users see list of top 10 results, click on a few
◦ Assume clicked-on pages were relevant, the rest weren't

Pseudo-relevance feedback
◦ Do search, assume top results are relevant, repeat


Have query logs for millions of users

"hybrid car" -> "toyota prius" is more likely than "hybrid car" -> "flights to LA"

Find statistically significant pairs of queries (Jones et al., WWW 2006) using:

45

H1: P(q2 | q1) = P(q2 | ¬q1)
H2: P(q2 | q1) ≠ P(q2 | ¬q1)

LLR = -2 log( L(H1) / L(H2) )
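A sketch of this likelihood-ratio test computed from query-log counts, using binomial log-likelihoods in the style of Dunning's LLR; the count arguments and the example numbers are assumptions about how the log is summarized:

```python
import math

def log_l(k, n, p):
    """Log-likelihood of k successes in n trials under a binomial with rate p."""
    p = min(max(p, 1e-12), 1 - 1e-12)       # keep log() defined at the boundaries
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k12, n1, k2_not1, n_not1):
    """-2 log( L(H1) / L(H2) ) for "q2 follows q1" vs. "q2 follows anything else".

    k12     -- sessions where q2 directly follows q1
    n1      -- sessions containing q1
    k2_not1 -- sessions where q2 follows some other query
    n_not1  -- sessions not containing q1
    """
    p = (k12 + k2_not1) / (n1 + n_not1)      # H1: a single shared rate
    p1, p2 = k12 / n1, k2_not1 / n_not1      # H2: two separate rates
    l_h1 = log_l(k12, n1, p) + log_l(k2_not1, n_not1, p)
    l_h2 = log_l(k12, n1, p1) + log_l(k2_not1, n_not1, p2)
    return -2 * (l_h1 - l_h2)

# A strongly associated pair should score far higher than an unrelated one.
print(llr(k12=200, n1=1000, k2_not1=50, n_not1=99000))
```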

Make a bipartite graph of queries and URLs
Cluster (Beeferman and Berger, KDD 2000)


Suggest queries in the same cluster
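A much-simplified sketch of the idea (not the actual agglomerative clustering of Beeferman and Berger): queries that share clicked URLs in the bipartite graph end up grouped together and can be suggested for each other. The click log here is invented:

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked URL) pairs.
clicks = [
    ("hybrid car", "toyota.com/prius"),
    ("toyota prius", "toyota.com/prius"),
    ("toyota prius", "en.wikipedia.org/wiki/Toyota_Prius"),
    ("flights to LA", "expedia.com"),
]

# One side of the bipartite graph: query -> clicked URLs.
query_urls = defaultdict(set)
for q, url in clicks:
    query_urls[q].add(url)

def suggestions(query):
    """Suggest queries that share at least one clicked URL with `query`."""
    urls = query_urls[query]
    return [q for q, u in query_urls.items() if q != query and urls & u]

print(suggestions("hybrid car"))   # ['toyota prius']
```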


A lot of ambiguity is removed by knowing who the searcher is

Lots of Fernando Pereiras
◦ I (Emily Pitler) only know one of them

Location matters
◦ "Thai restaurants" from me means "Thai restaurants Philadelphia, PA"


Mei and Church, WSDM 2008

H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74
H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26 = 1.17
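A sketch of estimating such conditional entropies from a log, using H(URL | Q) = H(URL, Q) - H(Q) over empirical counts; the tiny log is invented for illustration:

```python
import math
from collections import Counter

def entropy(counts):
    """H = -sum p * log2(p) over the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented log entries: (query, IP address, clicked URL).
log = [
    ("aa", "1.2.3.4", "aa.com"),
    ("aa", "1.2.3.4", "aa.com"),
    ("aa", "5.6.7.8", "alcoholics-anonymous.org"),
    ("thai restaurants", "1.2.3.4", "phillythai.example"),
]

h_q = entropy(Counter(q for q, ip, url in log))
h_url_given_q = entropy(Counter((url, q) for q, ip, url in log)) - h_q

h_q_ip = entropy(Counter((q, ip) for q, ip, url in log))
h_url_given_q_ip = entropy(Counter((url, q, ip) for q, ip, url in log)) - h_q_ip

# Conditioning on the IP as well can only reduce the uncertainty about the URL.
print(h_url_given_q, h_url_given_q_ip)
```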


Powerset: trying to apply NLP to Wikipedia


Descriptive searches: "pictures of mountains"
◦ I don't want a document with the words {"picture", "of", "mountains"}

Link farms: trying to game PageRank

Spelling correction: a huge portion of queries are misspelled

Ambiguity


Text normalization, documents as vectors, document similarity, log likelihood ratio, relative entropy, precision and recall, tf-idf, machine learning…

Choosing relevant documents/content

Snippets = short summaries
