
1

Random Sampling from a Search Engine‘s Index

Ziv Bar-Yossef and Maxim Gurevich

Department of Electrical Engineering Technion

Presentation at group meeting, Oct. 24

Allen, Zhenjiang Lin

2

Outline

Introduction
  Search Engine Samplers
  Motivation
  The Bharat-Broder Sampler (WWW’98)
Infrastructure of Proposed Methods
  Search Engines as Hypergraphs
  Monte Carlo Simulation Methods – Rejection Sampling
The Pool-Based Sampler
The Random Walk Sampler
Experimental Results
Conclusions

3

Search Engine Samplers

[Diagram: the sampler interacts with the search engine only through its public interface, submitting queries and receiving the top k results; its goal is to return a random document x ∈ D, the set of documents the engine has indexed from the web.]

4

Motivation

Random samples are a useful tool for search engine evaluation:

Freshness: fraction of up-to-date pages in the index
Topical bias: identification of overrepresented/underrepresented topics
Spam: fraction of spam pages in the index
Security: fraction of pages in the index infected by viruses/worms/trojans
Relative size: number of documents indexed compared with other search engines

5

Size Wars

August 2005: “We index 20 billion documents.”

September 2005: “We index 8 billion documents, but our index is 3 times larger than our competition’s.”

So, who’s right?

6

Why Does Size Matter, Anyway?

Comprehensiveness: a good crawler covers as many documents as possible
Narrow-topic queries: e.g., get the homepage of John Doe
Prestige: a marketing advantage

7

Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorini05]

Sample pages uniformly at random from the search engine’s index.

Two alternatives:

Absolute size estimation: sample until a collision occurs. By the birthday paradox, a collision is expected after roughly k ~ N^(1/2) random samples, so return k^2 as the size estimate (sketched below).

Relative size estimation: check how many samples from search engine A are present in search engine B, and vice versa.
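A minimal sketch (not from the slides) of the absolute-size estimator, assuming a hypothetical sample_uniform_doc() helper that already returns uniform samples from the index:

```python
import random

def estimate_index_size(sample_uniform_doc):
    """Birthday-paradox size estimate: draw uniform samples until the
    first repeated document and return k^2, where k is the number of
    samples drawn. A single run only estimates the size up to a small
    constant factor; averaging several runs tightens it."""
    seen = set()
    k = 0
    while True:
        doc = sample_uniform_doc()
        k += 1
        if doc in seen:
            return k * k
        seen.add(doc)

# Toy check against a simulated "index" of 10,000 documents.
docs = list(range(10_000))
runs = [estimate_index_size(lambda: random.choice(docs)) for _ in range(100)]
print(sum(runs) / len(runs))  # on the order of 10,000
```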

8

Related Work

Random sampling from a search engine’s index [BharatBroder98, CheneyPerry05, GulliSignorini05]

Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]

Queries from user query logs [LawrenceGiles98, DobraFeinberg04]

Random sampling from the whole web [Henzinger et al. 00, Bar-Yossef et al. 00, Rusmevichientong et al. 01]

9

The Bharat-Broder Sampler: Preprocessing Step

[Diagram: a large corpus C is scanned to build a lexicon L that lists each term together with its corpus frequency: (t1, freq(t1,C)), (t2, freq(t2,C)), …]

10

The Bharat-Broder Sampler

[Diagram: the BB sampler draws two random terms t1, t2 from the lexicon L, submits the conjunctive query “t1 AND t2” to the search engine, and returns a random document from the top k results.]

Samples are uniform only if all queries return the same number of results (≤ k) and all documents have the same length.
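A minimal sketch (not the authors’ code) of the BB sampling loop, assuming a lexicon list of terms and a hypothetical search_top_k(query) wrapper around the engine’s public interface:

```python
import random

def bharat_broder_sample(lexicon, search_top_k):
    """One Bharat-Broder sample: draw two random lexicon terms, submit
    the conjunctive query, and return a uniformly chosen document from
    the (at most k) returned results."""
    while True:
        t1, t2 = random.sample(lexicon, 2)
        results = search_top_k(f"{t1} AND {t2}")
        if results:                       # retry if the conjunction matches nothing
            return random.choice(results)
```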

11

The Bharat-Broder Sampler: Drawbacks

Documents have varying lengths: bias towards long documents.

Some queries have more than k matches: bias towards documents with high static rank.

12

Two novel samplers

A pool-based sampler (the focus of this talk):
  Guaranteed to produce near-uniform samples
  Needs a lexicon / query pool

A random walk sampler:
  After sufficiently many steps, guaranteed to produce near-uniform samples
  Does not need an explicit lexicon / pool at all!

13

Search Engines as Hypergraphs

results(q) = { documents returned on query q }
queries(x) = { queries that return x as a result }
P = query pool = a set of queries

Query pool hypergraph:
  Vertices: indexed documents
  Hyperedges: { results(q) | q ∈ P }

[Diagram: an example hypergraph over documents (www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, www.mapquest.com, maps.yahoot.com, en.wikipedia.org/wiki/BBC), with one hyperedge per query: “news”, “bbc”, “google”, “maps”.]

14

Query Cardinalities and Document Degrees

Query cardinality: card(q) = |results(q)|
Document degree: deg(x) = |queries(x)|

Examples (for the hypergraph above):
  card(“news”) = 4, card(“bbc”) = 3
  deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2

[Same example hypergraph as on the previous slide.]

15

Sampling documents uniformly

Sampling documents from D uniformly: hard.
Sampling documents from D non-uniformly: easier.

Will show later: we can sample documents proportionally to their degrees,
  p(x) = deg(x) / Σx' deg(x')

16

Sampling documents by degree

p(news.bbc.co.uk) = 2/13
p(www.cnn.com) = 1/13

[Same example hypergraph; the denominator 13 is the sum of all document degrees.]

17

Monte Carlo Simulation

We need: samples from the uniform distribution.
We have: samples from the degree distribution.
Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?

Yes! Monte Carlo simulation methods:
  Rejection Sampling
  Importance Sampling
  Metropolis-Hastings
  Maximum-Degree

18

Rejection Sampling Algorithm

Goal: sample values from an arbitrary target distribution f(x) by using an instrumental distribution g(x).

The algorithm (due to John von Neumann):
  Sample x from g(x) and u from U(0,1).
  Check whether u < f(x) / (M·g(x)).
    If it holds, accept x as a realization of f(x).
    If not, reject x and repeat the sampling step.

Here M > 1 is an appropriate bound on f(x) / g(x).

Why it works:
  pRS(x) = g(x) · f(x) / (M·g(x)) = f(x) / M, so accepted samples are distributed according to f.
  The acceptance probability is well defined since f(x) / (M·g(x)) ≤ 1 ⟺ M ≥ f(x) / g(x) for all x ∈ D.
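A compact sketch of the generic procedure, with sample_g, f, and g as caller-supplied functions (assumed, not from the slides):

```python
import random

def rejection_sample(sample_g, f, g, M):
    """Return one sample distributed as f, using proposals from g and an
    envelope constant M with M >= f(x)/g(x) for all x."""
    while True:
        x = sample_g()                     # propose x ~ g
        u = random.random()                # u ~ U(0,1)
        if u < f(x) / (M * g(x)):          # accept with probability f(x)/(M g(x))
            return x
```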

19

Rejection Sampling: An Example

Sampling u.a.r. from the unit square: g(x), easy.
Sampling u.a.r. from the unit disc: f(x), hard (directly).

Since f(x) = F on the disc and g(x) = G on the square (both uniform), set M = F/G.
Generate a candidate point x from the unit square, i.e., from g(x).
If x is in the unit disc, f(x) = F ≠ 0, so f(x)/(M·g(x)) = 1: accept x.
If x is in the square but outside the disc, f(x) = 0, so f(x)/(M·g(x)) = 0: reject x.
Therefore, the accepted points are sampled u.a.r. from the unit disc.
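The square/disc example written out as a small sketch (uniform proposals from [-1,1]^2, acceptance exactly when the point falls in the unit disc, so M = 4/π):

```python
import random

def sample_unit_disc():
    """Rejection sampling of a uniform point in the unit disc, using the
    enclosing square [-1, 1]^2 as the instrumental distribution."""
    while True:
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:           # inside the disc: f(x) != 0, accept
            return (x, y)
        # otherwise f(x) = 0: reject and draw a new candidate

# Acceptance rate equals the area ratio pi/4, about 0.785.
```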

20

Monte Carlo Simulation

π: target distribution. In our case: π = uniform on D.
p: trial distribution. In our case: p = degree distribution.

Bias weight of p(x) relative to π(x): w(x) = π(x) / p(x). In our case: w(x) ∝ 1/deg(x).

[Diagram: a p-sampler produces weighted samples (x1, w(x1)), (x2, w(x2)), …, which a Monte Carlo simulator turns into a sample x from π.]

21

Bias Weights

Unnormalized forms of π and p: π(x) = π̂(x) / Zπ and p(x) = p̂(x) / Zp,
where Zπ, Zp are (unknown) normalization constants.

Examples:
  π = uniform: π̂(x) = 1
  p = degree distribution: p̂(x) = deg(x)

Bias weight: w(x) = π̂(x) / p̂(x) = 1 / deg(x)

22

Rejection Sampling [von Neumann]

C: envelope constant, C ≥ w(x) for all x.

The algorithm:
  accept := false
  while (not accept)
    generate a sample x from p
    toss a coin whose heads probability is w(x) / C
    if the coin comes up heads, accept := true
  return x

In our case: C = 1 and the acceptance probability is w(x)/C = 1/deg(x).
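The same loop in the weight formulation used here, as a sketch; sample_p and w are assumed helpers for the trial distribution and the bias weights:

```python
import random

def rejection_sample_weighted(sample_p, w, C):
    """Rejection sampling with bias weights: draw x from the trial
    distribution p and accept it with probability w(x)/C, where
    C >= w(x) for all x. For uniform sampling from the index,
    w(x) = 1/deg(x) and C = 1."""
    while True:
        x = sample_p()
        if random.random() < w(x) / C:     # coin toss with heads probability w(x)/C
            return x
```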

23

Pool-Based Sampler

[Diagram: inside the pool-based sampler, a degree-distribution sampler issues queries q1, q2, … to the search engine and receives results(q1), results(q2), …; it emits documents sampled from the degree distribution together with their weights (x1, 1/deg(x1)), (x2, 1/deg(x2)), …; a rejection-sampling stage then turns these weighted samples into a uniform sample x.]

Degree distribution: p(x) = deg(x) / Σx' deg(x')

24

Sampling documents by degree

Select a random query q, then select a random x ∈ results(q).
Documents with high degree are more likely to be sampled.
However, if we sample q uniformly, we “oversample” documents that belong to narrow queries, because queries carry different weights.
We therefore need to sample q proportionally to its cardinality.

[Same example hypergraph as before.]

25

Sampling documents by degree (2)

Select a query q proportionally to its cardinality, then select a random x ∈ results(q).

Analysis:
  p(x) = Σq ∈ queries(x) [card(q) / Σq' card(q')] · [1 / card(q)] = deg(x) / Σq' card(q'),
so x is sampled from the degree distribution.

[Same example hypergraph as before.]

26

Degree Distribution Sampler

[Diagram: a cardinality-distribution sampler produces a query q sampled from the cardinality distribution; the search engine returns results(q); sampling x uniformly from results(q) yields a document sampled from the degree distribution.]

27

Sampling queries by cardinality

Sampling queries from the pool uniformly: easy.
Sampling queries from the pool by cardinality: hard, since it requires knowing the cardinalities of all queries in the search engine.

Use Monte Carlo methods to simulate biased sampling via uniform sampling:
  Target distribution: the cardinality distribution
  Trial distribution: the uniform distribution on the query pool

28

Sampling queries by cardinality

Bias weight of the cardinality distribution relative to the uniform distribution: w(q) = card(q).
It can be computed using a single search engine query.

Use rejection sampling:
  Envelope constant: C ≥ maxq card(q) (e.g., C = k when overflowing queries are skipped).
  Queries are sampled uniformly from the pool.
  Each query q is accepted with probability card(q) / C.
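A sketch of this stage, assuming a hypothetical card(q) helper that issues q once and reads off the reported number of results:

```python
import random

def sample_query_by_cardinality(pool, card, C):
    """Rejection sampling for the cardinality distribution over the query
    pool: propose q uniformly and accept with probability card(q)/C,
    where C >= card(q) for every (non-overflowing) pool query."""
    while True:
        q = random.choice(pool)            # trial: uniform over the pool
        if random.random() < card(q) / C:  # accept proportionally to card(q)
            return q
```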

29

Complete Pool-Based Sampler

[Diagram: a uniform query sampler feeds a rejection-sampling stage, yielding queries (q, card(q)) sampled from the cardinality distribution; the search engine returns (q, results(q)); a degree-distribution sampler emits weighted documents (x, 1/deg(x)); a final rejection-sampling stage outputs a uniform document sample.]
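Putting the stages together, a rough end-to-end sketch (not the authors’ implementation): results(q) is an assumed wrapper for the engine’s top-k interface, and deg(x) is computed naively here, whereas in practice it would be derived from the pool phrases occurring in x’s content.

```python
import random

def pool_based_sample(pool, results, k):
    """One near-uniform document sample via the pool-based sampler:
    (1) sample a query from the cardinality distribution by rejection,
    (2) pick a uniform result, which is then degree-distributed,
    (3) accept the document with probability 1/deg(x) (envelope C = 1)."""
    while True:
        q = random.choice(pool)                    # uniform trial query
        docs = results(q)
        card = len(docs)
        if card == 0 or card > k:                  # skip empty / overflowing queries
            continue
        if random.random() >= card / k:            # query-level rejection (accept w.p. card(q)/k)
            continue
        x = random.choice(docs)                    # degree-distributed document
        deg = sum(1 for q2 in pool if x in results(q2))  # naive deg(x); illustrative only
        if random.random() < 1.0 / deg:            # document-level rejection
            return x
```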

30

Dealing with Overflowing Queries

Problem: some queries may overflow (card(q) > k), causing a bias towards highly ranked documents.

Solutions:
  Select a pool P in which overflowing queries are rare (e.g., phrase queries).
  Skip overflowing queries.
  Adapt rejection sampling to deal with approximate weights.

Theorem: samples of the PB sampler are at most ε-away from uniform, where ε is the overflow probability of P.

31

Creating the query pool

[Diagram: a large corpus C is scanned to extract queries q1, q2, …, which together form the query pool P.]

Example: P = all 3-word phrases that occur in C.
If “to be or not to be” occurs in C, P contains:
  “to be or”, “be or not”, “or not to”, “not to be”

Choose P so that it “covers” most documents in D.
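A small sketch of pool construction for the 3-word-phrase example (the corpus is assumed to be an iterable of document texts):

```python
def three_word_phrases(text):
    """All consecutive 3-word phrases occurring in a text."""
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

def build_query_pool(corpus):
    """Query pool P = the set of all 3-word phrases occurring in the corpus C."""
    pool = set()
    for doc in corpus:
        pool.update(three_word_phrases(doc))
    return pool

print(three_word_phrases("to be or not to be"))
# ['to be or', 'be or not', 'or not to', 'not to be']
```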

32

A random walk sampler

Define a graph G over the indexed documents:
  (x, y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅

Run a random walk on G; its limit distribution is the degree distribution.
Use MCMC methods to make the limit distribution uniform:
  Metropolis-Hastings
  Maximum-Degree

Does not need a preprocessing step, but is less efficient than the pool-based sampler.
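A rough sketch of one MCMC variant (the maximum-degree idea), not the paper’s exact algorithm: queries_of(x) (e.g., phrases extracted from x’s content) and results(q) are assumed helpers, and C is an assumed upper bound on document degrees.

```python
import random

def max_degree_walk(x0, steps, queries_of, results, C):
    """Random walk whose limit distribution is (approximately) uniform over
    the indexed documents. The base step (pick a random query containing
    the current document, then a random result of that query) has the
    degree distribution as its limit; making the walk lazy, moving only
    with probability deg(x)/C, turns the limit into the uniform one."""
    x = x0
    for _ in range(steps):
        qs = queries_of(x)                 # queries that return x; deg(x) = len(qs)
        if random.random() < len(qs) / C:  # move with probability deg(x)/C, else stay
            q = random.choice(qs)
            docs = results(q)
            if docs:
                x = random.choice(docs)
    return x
```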

33

Bias towards Long Documents

[Chart: percent of documents in the sample (0%–60%) by decile of documents ordered by size, comparing the Pool-Based, Random Walk, and Bharat-Broder samplers.]

34

Relative Sizes of Google, MSN and Yahoo!

Google = 1

Yahoo! = 1.28

MSN Search = 0.73

35

Top-Level Domains in Google, MSN and Yahoo!

[Chart: percent of documents in the sample (0%–60%) by top-level domain name, for Google, MSN, and Yahoo!.]

36

Conclusions

Two new search engine samplers: the pool-based sampler and the random walk sampler.

The samplers are guaranteed to produce near-uniform samples, under plausible assumptions.

In experiments, the samplers show little or no bias.

37

Thank You