99
Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook April 3, 2013 WebDam (INRIA) Web search April 3, 2013 1 / 85

Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Embed Size (px)

Citation preview

Page 1: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web searchWeb data management and distribution

Serge Abiteboul Ioana Manolescu Philippe RigauxMarie-Christine Rousset Pierre Senellart

Web Data Management and Distributionhttp://webdam.inria.fr/textbook

April 3, 2013

WebDam (INRIA) Web search April 3, 2013 1 / 85

Page 2: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

The World Wide Web

Outline

1 The World Wide Web

2 Web crawling

3 Web Information Retrieval

4 Web Graph Mining

5 Conclusion

WebDam (INRIA) Web search April 3, 2013 2 / 85

Page 3: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

The World Wide Web

Internet and the Web

Internet: physical network of computers (or hosts)

World Wide Web, Web, WWW: logical collection of hyperlinked documents

static and dynamicpublic Web and private Webseach document (or Web page, or resource) identified by aURL

WebDam (INRIA) Web search April 3, 2013 3 / 85

Page 4: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

The World Wide Web

Uniform Resource Locators

https︸ ︷︷ ︸scheme

://www.example.com︸ ︷︷ ︸hostname

:443︸ ︷︷ ︸port

/path/to/doc︸ ︷︷ ︸path

?name=foo&town=bar︸ ︷︷ ︸query string

#para︸ ︷︷ ︸fragment

scheme: way the resource can be accessed; generally http or httpshostname: domain name of a host (cf. DNS); hostname of a website may

start with www., but not a rule.port: TCP port; defaults: 80 for http and 443 for httpspath: logical path of the document

query string: additional parameters (dynamic documents).fragment: subpart of the document

Query strings and fragments optionalsEmpty path: root of the Web serverRelative URIs with respect to a context (e.g., the URI above):/titi https://www.example.com/tititata https://www.example.com/path/to/tataWebDam (INRIA) Web search April 3, 2013 4 / 85

Page 5: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

The World Wide Web

(X)HTML

Choice format for Web pages

Dialect of SGML (the ancestor of XML), but seldom parsed as is

HTML 4.01: most common version, W3C recommendation

XHTML 1.0: XML-ization of HTML 4.01, minor differencesValidation (cf http://validator.w3.org/). Checks the conformityof a Web page with respect to recommendations, for accessibility:

I to all graphical browsers (IE, Firefox, Safari, Opera, etc.)I to text browsers (lynx, links, w3m, etc.)I to aural browsersI to all other user agents including Web crawlers

Actual situation of the Web: tag soup

WebDam (INRIA) Web search April 3, 2013 5 / 85

Page 6: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

The World Wide Web

XHTML example

<!DOCTYPE html PUBLIC"-//W3C//DTD XHTML 1.0 Strict//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"lang="en" xml:lang="en">

<head><meta http-equiv="Content-Type"

content="text/html; charset=utf-8" /><title>Example XHTML document</title>

</head><body>

<p>This is a<a href="http://www.w3.org/">link to the<strong>W3C</strong>!</a></p>

</body></html>

WebDam (INRIA) Web search April 3, 2013 6 / 85

Page 7: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

The World Wide Web

HTTPClient-server protocol for the Web, on top of TCP/IPExample request/response

GET /myResource HTTP/1.1Host: www.example.com

HTTP/1.1 200 OKContent-Type: text/html; charset=ISO-8859-1

<html><head><title>myResource</title></head><body><p>Hello world!</p></body>

</html>

HTTPS: secure version of HTTPI encryptionI authenticationI session tracking

WebDam (INRIA) Web search April 3, 2013 7 / 85

Page 8: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

The World Wide Web

Features of HTTP/1.1

virtual hosting: different Web content for different hostnames on a singlemachine

login/password protection

content negociation: same URL identifying several resources, client indicatespreferences

cookies: chunks of information persistently stored on the client

keep-alive connections: several requests using the same TCP connection

etc.

WebDam (INRIA) Web search April 3, 2013 8 / 85

Page 9: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling

Outline

1 The World Wide Web

2 Web crawlingDiscovering new URLsIdentifying duplicatesCrawling architecture

3 Web Information Retrieval

4 Web Graph Mining

5 Conclusion

WebDam (INRIA) Web search April 3, 2013 9 / 85

Page 10: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Discovering new URLs

Web Crawlers

crawlers, (Web) spiders, (Web) robots: autonomous user agents thatretrieve pages from the WebBasics of crawling:

1 Start from a given URL or set of URLs2 Retrieve and process the corresponding page3 Discover new URLs (cf. next slide)4 Repeat on each found URL

No real termination condition (virtual unlimited number of Web pages!)

Graph-browsing problem

deep-first: not very adapted, possibility of being lost in robot trapsbreadth-first

combination of both: breadth-first with limited-depth deep-first on eachdiscovered website

WebDam (INRIA) Web search April 3, 2013 10 / 85

Page 11: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Discovering new URLs

Sources of new URLs

From HTML pages:I hyperlinks <a href="...">...</a>I media <img src="..."> <embed src="..."> <objectdata="...">

I frames <frame src="..."> <iframe src="...">I JavaScript links window.open("...")I etc.

Other hyperlinked content (e.g., PDF files)

Non-hyperlinked URLs that appear anywhere on the Web (in HTML text,text files, etc.): use regular expressions to extract them

Referrer URLs

Sitemaps [sit08]

WebDam (INRIA) Web search April 3, 2013 11 / 85

Page 12: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Discovering new URLs

Scope of a crawler

Web-scaleI The Web is infinite! Avoid robot traps by putting depth or page number limits

on each Web serverI Focus on important pages [APC03] (cf. lecture on the Web graph)

Web servers under a list of DNS domains: easy filtering of URLs

A given topic: focused crawling techniques [CvdBD99, DCL+00] based onclassifiers of Web page content and predictors of the interest of a link.

The national Web (cf. public deposit, national libraries): what is this?[ACMS02]

A given Web site: what is a Web site? [Sen05]

WebDam (INRIA) Web search April 3, 2013 12 / 85

Page 13: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Identifying duplicates

A word about hashing

DefinitionA hash function is a deterministic mathematical function transforming objects(numbers, character strings, binary. . . ) into fixed-size, seemingly random,numbers. The more random the transformation is, the better.

ExampleJava hash function for the String class:

n−1

∑i=0

si × 31n−i−1 mod 232

where si is the (Unicode) code of character i of a string s.

WebDam (INRIA) Web search April 3, 2013 13 / 85

Page 14: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Identifying duplicates

Identification of duplicate Web pages

ProblemIdentifying duplicates or near-duplicates on the Web to prevent multiple indexing

trivial duplicates: same resource at the same canonized URL:http://example.com:80/totohttp://example.com/titi/../toto

exact duplicates: identification by hashing

near-duplicates: (timestamps, tip of the day, etc.) more complex!

WebDam (INRIA) Web search April 3, 2013 14 / 85

Page 15: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Identifying duplicates

Near-duplicate detection

Edit distance. Count the minimum number of basic modifications (additions ordeletions of characters or words, etc.) to obtain a document fromanother one. Good measure of similarity, and can be computed inO(mn) where m and n are the size of the documents. But: doesnot scale to a large collection of documents (unreasonable tocompute the edit distance for every pair!).

Shingles. Idea: two documents similar if they mostly share the samesuccession of k -grams (succession of tokens of length k ).

ExampleI like to watch the sun set with my friend.My friend and I like to watch the sun set.

S ={i like, like to, my friend, set with, sun set, the sun, to watch, watch the, with my}T ={and i, friend and, i like, like to, my friend, sun set, the sun, to watch, watch the}

WebDam (INRIA) Web search April 3, 2013 15 / 85

Page 16: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Identifying duplicates

Hashing shingles to detect duplicates [BGMZ97]

Similarity: Jaccard coefficient on the set of shingles:

J(S,T ) =|S ∩ T ||S ∪ T |

Still costly to compute! But can be approximated as follows:1 Choose N different hash functions2 For each hash function hi and each set of shingles Sk = {sk1 . . .skn}, store

φik = minj hi(skj)3 Approximate J(Sk ,Sl) as the proportion of φik and φil that are equal

Possibly to repeat in a hierarchical way with super-shingles (we are onlyinterested in very similar documents)

WebDam (INRIA) Web search April 3, 2013 16 / 85

Page 17: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Crawling architecture

Crawling ethics

Standard for robot exclusion: robots.txt at the root of a Web server [Kos94].

User-agent: *Allow: /searchhistory/Disallow: /search

Per-page exclusion (de facto standard).

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">

Per-link exclusion (de facto standard).

<a href="toto.html" rel="nofollow">Toto</a>

Avoid Denial Of Service (DOS), wait 100ms/1s between two repeatedrequests to the same Web server

WebDam (INRIA) Web search April 3, 2013 17 / 85

Page 18: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Crawling architecture

Parallel processing

Network delays, waits between requests:

Per-server queue of URLs

Parallel processing of requests to different hosts:

I multi-threaded programmingI asynchronous inputs and outputs (select, classes fromjava.util.concurrent): less overhead

Use of keep-alive to reduce connexion overheads

WebDam (INRIA) Web search April 3, 2013 18 / 85

Page 19: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

General Architecture [Cha03]

Page 20: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web crawling Crawling architecture

Refreshing URLs

Content on the Web changes

Different change rates:

online newspaper main page: every hour or sopublished article: virtually no change

Continuous crawling, and identification of change rates for adaptivecrawling:

I If-Last-Modified HTTP feature (not reliable)I Identification of duplicates in successive request

WebDam (INRIA) Web search April 3, 2013 20 / 85

Page 21: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval

Outline

1 The World Wide Web

2 Web crawling

3 Web Information RetrievalText PreprocessingInverted IndexAnswering Keyword QueriesBuilding inverted filesClusteringOther Media

4 Web Graph Mining

5 Conclusion

WebDam (INRIA) Web search April 3, 2013 21 / 85

Page 22: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval

Information Retrieval, Search

ProblemHow to index Web content so as to answer (keyword-based) queries efficiently?

Context: set of text documents

d1 The jaguar is a New World mammal of the Felidae family.d2 Jaguar has designed four new engines.d3 For Jaguar, Atari was keen to use a 68K family device.d4 The Jacksonville Jaguars are a professional US football team.d5 Mac OS X Jaguar is available at a price of US $199 for Apple’s

new “family pack”.d6 One such ruling family to incorporate the jaguar into their name

is Jaguar Paw.d7 It is a big cat.

WebDam (INRIA) Web search April 3, 2013 22 / 85

Page 23: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Text Preprocessing

Initial text preprocessing steps

Number of optional steps

Highly depends on the application

Highly depends on the document language (illustrated with English)

WebDam (INRIA) Web search April 3, 2013 23 / 85

Page 24: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Language Identification

How to find the language used in a document?

Meta-information about the document: often not reliable!

Unambiguous scripts or letters: not very common!한글

カタカナ

Għarbiþorn

Respectively: Korean Hangul, Japanese Katakana, Maldivian Dhivehi,Maltese, Icelandic

Extension of this: frequent characters, or, better, frequent k -grams

Use standard machine learning techniques (classifiers)

WebDam (INRIA) Web search April 3, 2013 24 / 85

Page 25: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Language Identification

How to find the language used in a document?

Meta-information about the document: often not reliable!

Unambiguous scripts or letters: not very common!한글

カタカナ

Għarbiþorn

Respectively: Korean Hangul, Japanese Katakana, Maldivian Dhivehi,Maltese, Icelandic

Extension of this: frequent characters, or, better, frequent k -grams

Use standard machine learning techniques (classifiers)

WebDam (INRIA) Web search April 3, 2013 24 / 85

Page 26: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Tokenization

Principle

Separate text into tokens (words)

Not so easy!

In some languages (Chinese, Japanese), words not separated bywhitespace

Deal consistently with acronyms, elisions, numbers, units, URLs, emails,etc.

Compound words: hostname, host-name and host name. Break into twotokens or regroup them as one token? In any case, lexicon and linguisticanalysis needed! Even more so in other languages as German.

Usually, remove punctuation and normalize case at this point

WebDam (INRIA) Web search April 3, 2013 25 / 85

Page 27: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Tokenization: Example

d1 the1 jaguar2 is3 a4 new5 world6 mammal7 of8 the9 felidae10 family11

d2 jaguar1 has2 designed3 four4 new5 engines6

d3 for1 jaguar2 atari3 was4 keen5 to6 use7 a8 68k9 family10 device11

d4 the1 jacksonville2 jaguars3 are4 a5 professional6 us7 football8 team9

d5 mac1 os2 x3 jaguar4 is5 available6 at7 a8 price9 of10 us11 $19912

for13 apple’s14 new15 family16 pack17

d6 one1 such2 ruling3 family4 to5 incorporate6 the7 jaguar8 into9

their10 name11 is12 jaguar13 paw14

d7 it1 is2 a3 big4 cat5

WebDam (INRIA) Web search April 3, 2013 26 / 85

Page 28: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Stemming

Principle

Merge different forms of the same word, or of closely related words, into a singlestem

Not in all applications!

Useful for retrieving documents containing geese when searching for goose

Various degrees of stemming

Possibility of building different indexes, with different stemming

WebDam (INRIA) Web search April 3, 2013 27 / 85

Page 29: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Stemming schemes (1/2)

Morphological stemming.

Remove bound morphemes from words:I plural markersI gender markersI tense or mood inflectionsI etc.

Can be linguistically very complex, cf:Les poules du couvent couvent.[The hens of the monastery brood.]In English, somewhat easy:

I Remove final -s, -’s, -ed, -ing, -er, -estI Take care of semiregular forms (e.g., -y/-ies)I Take care of irregular forms (mouse/mice)

But still some ambiguities: cf stocking, rose

WebDam (INRIA) Web search April 3, 2013 28 / 85

Page 30: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Stemming schemes (2/2)

Lexical stemming.

Merge lexically related terms of various parts of speech,such as policy, politics, political or politicianFor English, Porter’s stemming [Por80]; stem university anduniversal to univers: not perfect!Possibility of coupling this with lexicons to merge(near-)synonyms

Phonetic stemming.

Merge phonetically related words: search despite spellingerrors!For English, Soundex [US 07] stems Robert and Rupert toR163. Very coarse!

WebDam (INRIA) Web search April 3, 2013 29 / 85

Page 31: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Stemming Example

d1 the1 jaguar2 be3 a4 new5 world6 mammal7 of8 the9 felidae10 family11

d2 jaguar1 have2 design3 four4 new5 engine6

d3 for1 jaguar2 atari3 be4 keen5 to6 use7 a8 68k9 family10 device11

d4 the1 jacksonville2 jaguar3 be4 a5 professional6 us7 football8 team9

d5 mac1 os2 x3 jaguar4 be5 available6 at7 a8 price9 of10 us11 $19912

for13 apple14 new15 family16 pack17

d6 one1 such2 rule3 family4 to5 incorporate6 the7 jaguar8 into9

their10 name11 be12 jaguar13 paw14

d7 it1 be2 a3 big4 cat5

WebDam (INRIA) Web search April 3, 2013 30 / 85

Page 32: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Stop Word Removal

Principle

Remove uninformative words from documents, in particular to lower the cost ofstoring the index

determiners: a, the, this, etc.

function verbs: be, have, make, etc.

conjunctions: that, and, etc.

etc.

WebDam (INRIA) Web search April 3, 2013 31 / 85

Page 33: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Text Preprocessing

Stop Word Removal Example

d1 jaguar2 new5 world6 mammal7 felidae10 family11

d2 jaguar1 design3 four4 new5 engine6

d3 jaguar2 atari3 keen5 68k9 family10 device11

d4 jacksonville2 jaguar3 professional6 us7 football8 team9

d5 mac1 os2 x3 jaguar4 available6 price9 us11 $19912 apple14

new15 family16 pack17

d6 one1 such2 rule3 family4 incorporate6 jaguar8 their10 name11

jaguar13 paw14

d7 big4 cat5

WebDam (INRIA) Web search April 3, 2013 32 / 85

Page 34: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

Structure of an inverted index

Assume D a collection of (text) documents. Create a matrix M with one row foreach document, one column for each token. Initialize the cells at 0.

Create the content of M: scan D, and extract for each document d the tokens tthat can be found in d (preprocessing); put 1 in M[d ][t ]

Invert M: one obtains the inverted index. Term appear as rows, with the list ofdocument ids or posting list.

Problem: storage of the whole matrix is not feasible.

WebDam (INRIA) Web search April 3, 2013 33 / 85

Page 35: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

Inverted Index construction

After all preprocessing, construction of an inverted index:

Index of all terms, with the list of documents where this term occurs

Small scale: disk storage, with memory mapping (cf. mmap) techniques;secondary index for offset of each term in main index

Large scale: distributed on a cluster of machines; hashing gives themachine responsible for a given term

Updating the index is costly, so only batch operations (not one-by-oneaddition of term occurrences)

WebDam (INRIA) Web search April 3, 2013 34 / 85

Page 36: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

Inverted Index Example

family d1, d3, d5, d6

football d4

jaguar d1, d2, d3, d4, d5, d6

new d1, d2, d5

rule d6

us d4, d5

world d1

. . .Note:

the length of an inverted (posting) list is highly variable – scanning shortlists first is an important optimization.

entries are homogeneous: this gives much room for compression.

WebDam (INRIA) Web search April 3, 2013 35 / 85

Page 37: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

Index size matters

We want to index a collection of 1M emails. The average size of an email is1,000 bytes and each email contains an average of 100 words. The number ofdistinct terms is 200,000.

1 size of the collection; number of words?2 how many lists in the index?3 we make the (rough) assumption that 20% of the terms in a document

appear twice; a document appears in how many lists on average ?4 how many entries in a list?5 we represent document ids as 4-bytes unsigned integers, what is the index

size ?

WebDam (INRIA) Web search April 3, 2013 36 / 85

Page 38: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

Storing positions in the index

phrase queries, NEAR operator: need to keep position information in theindex

just add it in the document list!

family d1/11, d3/10, d5/16, d6/4football d4/8jaguar d1/2, d2/1, d3/2, d4/3, d5/4, d6/8 + 13new d1/5, d2/5, d5/15rule d6/3us d4/7, d5/11world d1/6. . .⇒ so far, ok for Boolean queries: find the documents that contain a set ofkeywords; reject the other.

WebDam (INRIA) Web search April 3, 2013 37 / 85

Page 39: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

Ranked search

Boolean search does not give an accurate result because it does not takeaccount of the relevance of a document to a query.

If the search retrieves dozen or hundreds of documents, the most relevant mustbe shown in top position!

The quality of a result with respect to relevance is measured by two factors:

precision =|relevant | ∩ |retrieved |

|retrieved |

recall =|relevant | ∩ |retrieved |

|relevant |

WebDam (INRIA) Web search April 3, 2013 38 / 85

Page 40: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

Weighting terms occurrencesRelevance can be computed by giving a weight to term occurrences.

Terms occurring frequently in a given document: more relevant.The term frequency is the number of occurrences of a term t in adocument d , divided by the total number of terms in d (normalization)

tf(t,d) =nt,d

∑t ′ nt ′,d

where nt ′,d is the number of occurrences of t ′ in d .

Terms occurring rarely in the document collection as a whole: moreinformativeThe inverse document frequency (idf) is obtained from the division of thetotal number of documents by the number of documents where t occurs, asfollows:

idf(t) = log|D|

|{d ′ ∈ D |nt,d ′ > 0}| .

WebDam (INRIA) Web search April 3, 2013 39 / 85

Page 41: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

TF-IDF Weighting

The inverted is extended by adding Term Frequency—Inverse DocumentFrequency weighting

tfidf(t,d) =nt,d

∑t ′ nt ′,d· log

|D||{d ′ ∈ D |nt,d ′ > 0}|

nt,d number of occurrences of t in dD set of all documents

Documents (along with weight) are stored in decreasing weight order in theindex

WebDam (INRIA) Web search April 3, 2013 40 / 85

Page 42: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Inverted Index

TF-IDF Weighting Example

family d1/11/.13, d3/10/.13, d6/4/.08, d5/16/.07football d4/8/.47jaguar d1/2/.04, d2/1/.04, d3/2/.04, d4/3/.04, d6/8 + 13/.04,

d5/4/.02new d2/5/.24, d1/5/.20, d5/15/.10rule d6/3/.28us d4/7/.30, d5/11/.15world d1/6/.47. . .

Exercise: take an entry, and check that the tf/idf value is indeed correct (takedocuments after stop-word removal).

WebDam (INRIA) Web search April 3, 2013 41 / 85

Page 43: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Answering Keyword Queries

Answering Boolean Queries

Single keyword query: just consult the index and return the documents inindex order.

Boolean multi-keyword query

(jaguar AND new AND NOT family) OR cat

Same way! Retrieve document lists from all keywords and apply adequateset operations:

AND intersectionOR union

AND NOT difference

Global score: some function of the individual weight (e.g., addition forconjunctive queries)

Position queries: consult the index, and filter by appropriate condition

WebDam (INRIA) Web search April 3, 2013 42 / 85

Page 44: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Answering Keyword Queries

Answering Top-k Queries

t1 AND . . . AND tn

ProblemFind the top-k results (for some given k) to the query, without retrieving alldocuments matching it.

Notations:

s(t,d) weight of t in d (e.g., tfidf)

g(s1, . . . ,sn) monotonous function that computes the global score (e.g.,addition)

WebDam (INRIA) Web search April 3, 2013 43 / 85

Page 45: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Answering Keyword Queries

Exercise

Consider the following documents:1 d1 = I like to watch the sun set with my friend.2 d2 = The Best Places To Watch The Sunset.3 d3 = My friend watches the sun come up.

Construct an inverted index with tf/idf weights for terms ’Best’ and ’sun’. Whatwould be the ranked result of the query ’Best and sun’?

WebDam (INRIA) Web search April 3, 2013 44 / 85

Page 46: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Answering Keyword Queries

Basic algorithm

First version of the top-k algorithm: the inverted file contains entries sorted onthe document id. The query is

t1 AND . . . AND tn

.

1 Take the first entry of each list; one obtains a tuple T = [e1, . . .en];2 Let d1 be the minimal doc id in the entries of T : compute the global score

of d1;3 For each entry ei featuring d1: advance on the inverted list Li .

When all lists have been scanned: sort the documents on the global scores.

Not very efficient; cannot give the ranked result before a full scan on the lists.

WebDam (INRIA) Web search April 3, 2013 45 / 85

Page 47: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Answering Keyword Queries

The Threshold Algorithm

1 Let R be the empty list, and m = +∞.2 For each 1≤ i ≤ n:

1 Retrieve the document d (i) containing term ti that has the next largests(ti ,d (i)).

2 Compute its global score gd (i) = g(s(t1,d (i)), . . . ,s(tn,d (i))) by retrieving

all s(tj ,d (i)) with j 6= i .3 If R contains less than k documents, or if gd (i) is greater than the minimum

of the score of documents in R, add d (i) to R.

3 Let m = g(s(t1,d (1)),s(t2,d (2)), . . . ,s(tn,d (n))).4 If R contains more than k documents, and the minimum of the score of the

documents in R is greater than or equal to m, return R.5 Redo step 2.

WebDam (INRIA) Web search April 3, 2013 46 / 85

Page 48: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Answering Keyword Queries

The TA, by example

q = “new OR family”, and k = 3. We use inverted lists sorted on the weight.family d1/11/.13, d3/10/.13, d6/4/.08, d5/16/.07new d2/5/.24, d1/5/.20, d5/15/.10. . .

Initially, R =∅ and τ = +∞.

1 d (1) is the first entry in Lfamily, one finds s(new,d1) = .20; the globalscore for d1 is .13 + .20 = .33.

2 Next, i = 2, and one finds that the global score for d2 is .24.3 The algorithm quits the loop on i with R = 〈[d1, .33], [d2, .24]〉 and

τ = .13 + .24 = .37.4 We proceed with the loop again, taking d3 with score .13 and d5 with score

.17. [d5, .17] is added to R (at the end) and τ is now .10 + .13 = .23.A last loop concludes that the next candidate is d6, with a global score of.08. Then we are done.

WebDam (INRIA) Web search April 3, 2013 47 / 85

Page 49: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

External Sort/Merge

Building an inverted index from a document collection requires a sort/merge ofthe index entries.

first extracts triplets [d , t, f ] from the collection;

then sort the set of triplets on the term-docid pair [t,d ].

contiguous inverted lists can be created from the sorted entries.

Note: ubiquituous operation; used in RDBMS for ORDER BY, GROUP BY,DISTINCT, and non-indexed joins.

WebDam (INRIA) Web search April 3, 2013 48 / 85

Page 50: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

First phase: sort

Repeat: fill the memory with entries; sort in memory (with quicksort); flush thememory in a “run”.

...

... ... ...

...

One obtains a list of sorted runs.

Cost: documents are read once; entries are written once.

WebDam (INRIA) Web search April 3, 2013 49 / 85

Page 51: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

Second phase: merge

Group the runs and merge.

...

o

o

...

...

...

Cost: one read/write of all entries for each level of the merge tree.

WebDam (INRIA) Web search April 3, 2013 50 / 85

Page 52: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

Illustration with M = 3

Metropolis, Physchose, Shining, Twin Peaks, Underground, vertigo

Annie Hall, Brazil, Easy Rider, Greystoke, Jurassic Park, Manhattan,

Underground

Jurassic Park,

Twin Peaks, Underground, Vertigo

Metropolis,Shining

Greystoke, Psychose

Vertigo ShiningPsychose

Metropolis, Greystoke,

Manhattan,

Easy Rider

fusion

fusion

Twin Peaks, Vertigo

Annie Hall, Brazil,

fusion

fusion

Annie Hall,

Easy Rider, Greystoke, Manhattan,

Metropolis, Psychose, Shining

fusion

Annie Hall, Brazil, Jurassic Park,

Brazil,

Twin Peaks WebDam (INRIA) Web search April 3, 2013 51 / 85

Page 53: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

Main parameter: the buffer size

Consider a file with 75 000 pages, 4Kb each, thus 307 MBs.

1 M > 307Mo : one read 3072 M = 2Mo, (500 pages).

1 the sort phase yields d 3072 e = 154 runs.

2 The merge phase requires 154 pages

Total cost: 614 + 307 = 921 MB.

NB : we must allocate a lot of memory to decrease the number of merge levelsby one.

WebDam (INRIA) Web search April 3, 2013 52 / 85

Page 54: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

With few memory

M = 1Mo, 250 pages.

1 sort phase: 307 runs.2 Merge the 249 first runs; then the 58 remaining. One obtains F1 and F2.3 Second merge of F1 and F2.

Total cost: 1228 + 307 = 1535 MBs.

NB : important performance loss between 2 MBs and 1 MBs ().

WebDam (INRIA) Web search April 3, 2013 53 / 85

Page 55: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

Exercise

Assume that a page holds only two records. Explain the sort-merge algorithmon the following dataset with a 4-pages buffers.

3 Allier; 36 Indre 18 Cher 75 Paris 39 Jura9 Arige; 81 Tarn 11 Aude 12 Aveyron 25 Doubs73 Savoie 55 Meuse 15 Cantal 51 Marne 42 Loire40 Landes 14 Calvados 30 Gard 84 Vaucluse 7 Ardche

Same question with a 3-pages buffer.

WebDam (INRIA) Web search April 3, 2013 54 / 85

Page 56: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

Compression of inverted lists

Without compression, an inverted index with positions and weights may be largethan the documents collection!

Compression is essential. The gain must be higher than the time spent tocompress.

Key to compression in inverted lists: documents are ordered by id:

[87;273;365;576;810].

First step: use delta-coding:

[87;186;92;211;234].

Exercise: what is the minimal number of bytes for the first list? for the second?

WebDam (INRIA) Web search April 3, 2013 55 / 85

Page 57: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

Variable bytes encoding

Idea: encode integers on 7 bits (27 = 128); use the leading bit for termination.

Let v = 9, encoded on one byte as 10000101 (note the first bit set to 1).

Let v = 137.1 the first byte encodes v ′ = v mod 128 = 9, thus b = 10000101 just as

before;2 next we encode v/128 = 1, in a byte b′ = 00000001 (note the first bit set

to 0).

137 is therefore encoded on two bytes:

00000001 10000101.

Compression ratio: typically 1/4 to 1/2 of the fixed-length representation.

WebDam (INRIA) Web search April 3, 2013 56 / 85

Page 58: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Building inverted files

Exercise

The inverted list of a term t consists of the following document ids:

[345;476;698;703].

Apply the VByte compression technique to this sequence. What is the amountof space gained by the method?

WebDam (INRIA) Web search April 3, 2013 57 / 85

Page 59: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Clustering

Clustering Example

� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � � � �� � � � ! " � ! � � � � �# � ! � � � � $ �" � � � � � � �# % & � & � � � ' �# � ( � % � ! � & ( ) � � * + �, � ( - � & . � ! � * / �� � ) 0 � & ( . 1 � � � � � � � ! � � * � �2 � 3 � ( � 1 . � 4 5 � � ) & ( � � 6 �� � � ! 1 4 7 � 8 � � * 9 �" � � � � 1 ) � � � � ! � / �

) � � � � � ! � � � : � ; � � � � < � � = � � > ? > � � � : � < � � @ < � � � < A ? 4 B ? B 4 B B B � � < � � � C � @ � � < D � E : � � F G � � � ! H � @ � � � < � � � I H � < � � � I J � � ; D K � � : � < �LM N O P Q O R S T U V W X X Y Z [ V \ \ ] U ] ^ _ ` [ a b ] c [ V \ c Z [ d e f g b h ^ U i b V j k ] _ _ [ l O P Q O R S= D � � @ @ � ; � � < � � � � < � � � < D � ; � � � � m � � � � � < � � � m � � � ; D � : � � m � � � � < � � L� � � L � : � � L ; � � n o ; ; D � p n q � C � m r � � � s � � � ; < � � F m t � �Lu l O P Q O R = D � G � � � ! H v w x y z { | w } x ~ w I � � � � � � � � � � � � @ < D � ; < @ � � � F � < � C � < � � � �� � � � � � � @ < D � t � � � � ; � L � < � � ; � � � � � F � � � < � < � < D � � � � � m < � � � m � � � � � � � @ < D � r � � � � � m � � � < D � � � � � < � � � ; � � � � @ < D � ; < @ � � � F @ � : � � � < D � t � � � � ; � L� � L � � � � � � � L � � � � � � � � � : � n o ; ; D � p n � � � � � � � m t � � m q � C �L� l O P Q O R � j c Z � b ] ^ b c b g � _ � a� � � � � � � � � � < � � � � ! � s � � � � � � � : � L L L q � � � < � � � � ! � � : � � � < D � � � � � m � � � C � � � C � � u � m � � �� � � � � � � L L L� � � L � � ; L � � L : � n o ; ; D � p n t � � m r � � � s � � � ; < � � FL� � � ^ a ^ j � V j b a ] � \ V � N O P Q O R � [ U V k [ � � � _ ^ j� � � � � � � � � � � = D � � � < � � � � � s � � � < � � � < D � � � � � � < < � � � < � < � ; � @ < � � ; � C � � F � � � @ � � < D �� � � � � � G � � � ! � � ; : � � < � � @ � � � @ < D � � � � ; < � D C � � � � � � � � < < � � � � < D � J � : < D � � � < � � � � �� @ � � � � � � � ; � � t � � � � � < � � � � � < � : ; D ; < � � � L J � � � ; � � < � ; � � @ < D � � ; � � � � � � � = D : � � F < D �G � � � ! � � � � � � � ; � � @ � ; � @ � � < D � � C � � � � � � < � � � � � � � � � � @ � � ; � m � D � ; D � � � � � : � � � � � � F � @ < D �� � � � � � � D � � � < D � L L L D � ; � � � � � � � < � < D � � � � < � J < < � � @ � � � � � � � ; � L � @ < D � � L J L � � � � � � � �� � � � � � � � < � ; � � < � ; � � � ; � C � � F � � � @ � � < D � G � � � ! m < D � � � < � � : � ; � � � < � � � < D � � � � � � � J � ; : � � < Fs � � � < � � � < � � � : � � � � < D � @ � � ; � m � �   � � � � J : ; � � � � m � � � � ; F � � � ; < � � � @ < D � � � � < � � L L L� � � � L F D � � L ; � � � � � � � u � � ¡ � M M ¡ � � ¢ � � ¢ � ¢ ; ¢ � < ¢ � � � � : � ¢ � � ; � C � � F n o ; ; D � p n £ D � � ¤ � � � �L¥ ¦ ^ k ] j § ¨ V V � a � [ c V © [ § [ W V j �ª « ¬ � ­ � ® ¯ � L L L � � � � ; � : � < � � � � ° � � � � � � � � � � < � ; � � � ; � � � � � < � @ � � � ; � ± � C � � @ < � �F � : � : � < � ; < � D � � L � D � � = < � @ � � � � � C F � � < � � : F � � � � ! m F � : � � � � < D �� � � ; � � � @ � � � � � D � ; D � � L J � ; � � n � � � � ; � : � < � � � � � � � � � @ � < � � ; � � � � D : � � @ � � � � � � < � � � � � m L L L� � � L � F < � � � � L ; � � � L L L M � ¥ ² � ² � � � � ³ � � ´ ; µ � @ � ¡ � M � ¶ ² @ µ � ³ � � ´ ¥ � ¡ ¡ ³ � � < � � � ´ � � � � F < ³ � � ; ´ � � � n o ; ; D � pn � £ = � � � �Lµ l O P Q O R� � � � ! � F � � @ � � < � · t G � � � ! H v w x y z { | w } x ~ w I m � � � @ � � � � < � C � < � J � : < D � � � � < � � t � � � � ; 7 ! � 8 8 � ( 5 A B � � � � ! � 8 1 � 1 � � ! ¸ � 1 ! ) ! � 3 � ¹ º # º " � » � � � � ! m � � � � < � F � � ; � @ < � � � � ! " � ! � m ¼ � � < � � D : < � � � � � � � � � � � � � � � ! � � ) 1 ( m @ � � � � � ½ � � � : � r � � < � � m � � � K � ¼ : � � K ; � � � � � ! 1 � � � � ! m ¾ � � � � � ; � � � � � � � � � F t < � � � ; r J ¿ M � L u À � : � À m < D � ; � � � � � @ � � C � � � � � � M � L u � @ < D �� ; r J ¿ r � � � < � � J F � < � � � � � � ! � Á º . � ( � m � � � < � � � � ) 0 � & ( . 1 � � � � � � � ! � m � � ½ q < � � ; D � ; < � � � � < D � � � C � � � 1 8  � � ( - � � � � ! ¼ � � < � � D � � ; � � < m � : � H � � ; � � < I� � L � � � � � � � L � � � � � � � � � : � ¢ H � � � � � : < � � � I n o ; ; D � p n � � � � � � � L¶ l O P Q O R � � X © V W [� � � � ! � J t r @ @ � ; � � � � � � à � L L L� � � L � : � : � L ; � � n o ; ; D � p n t � �L¡ l O P Q O Rà � < D � � � � ; L � £ J = Ä K � r � J � t = r ½ = � Ä t � t Å r � L r @ � � < D � � � ; < � m < D � G � � � ! � � � � � � < D �� � � < � < : � � L � D � � � � � � � � � @ � � � < � � � ; � � � � @ � � � < D � � � � m � � � < � @ � D < � � � � � � � � � : < L L L� � � L � � : � � � � � L � � � � : � L D < � n o ; ; D � p n q � C � m t � �L² e ^ _ U V j b Z ] � [ h ^ § b g � W ] c Z ^ b j [ ` U V ^ U Z� � � � Æ � � � � � � L L L < � � � m < � ; � � � ; � � � @ � : ; ; � � � m � � � � m � � � < � � � � ; D < � < D � � � m � D � D ; D � ; < � � � � � < � � � < F L À J � � < D m < D � � � � � ! � � � @ � � � � C � ; � � � � � < � �� � � ; � u � � � m D D � � � � ; � � � � < � � C � � � � � < D < D � ½ � ; � � � � � ½ � � F L � � D � � � C � �� � � � � � ½ q D � L L L � � ; � � � � u � � µ � � � � < D � � u � � ¥ L q � @ < � � ; D � � J � � < D � � : � � � C � � � � ; � � C � � � � : D ; � � � < � � � ; � � � � C � � � � � � ; : � � � � F � � : � � � � � � ! � ; � ; D� ; � s � � K � � m @ � � � � � � @ � � � � C � ; � � � � � < � � m � � < D � � � � � � < � � � � � � @ < D � � @ � � � � L � � � � ! � � @ � � � � C � � � � � ; � � � : � � � � � F L L L� � � � L F D � � L ; � � � � � � � u � � ¡ � M u � � � ¢ � � ¢ � � ¢ @ � ¢ � � � @ � � ¢ @ � ; � � � ¢ ; � ; D n o ; ; D � p n £ D � � ¤ � � � �LM � ¦ ] _ _ � ^ � b e _ � ^ b e ^ b c ^ b c Z [ Ç [ c ^ � Z V � b Ȫ É ¯ Ê Ë ­ � ® ¯ � L L L � � � � @ < D � � � � < < : � � � : � � � : � � � � L t � � D � � � � � � � : < < D � @ : < : � � � @ ½ � � ° �� � � � � � � � � F � � � � F n � � � � � ¼ � � < � � D � � � � m � � � � ! � q � K � C � � m � D � ; D � � @ � � � � � m � � L � : � � � F� @ @ � � � < D � � � � � � : < � ; � � � � � � � � � < · Ì = D � F ° � � � � F < � < � � � @ @ � L L L� � � L � F < � � � � L ; � � � L L L M � ¥ ² � ² � � � � ³ � � ´ � ¥ u ¡ � � ¡ µ � ² u M � � � ; ³ � � ´ ¥ � ¡ ¡ ³ � � < � � � ´ � � � � F < ³ � � ; ´ � � � n

WebDam (INRIA) Web search April 3, 2013 58 / 85

Page 60: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Clustering

Cosine Similarity of Documents

Document Vector Space model:

terms dimensionsdocuments vectorscoordinates weights

(The projection of document d along coordinate t is the weight of t in d ,say tfidf(t,d))

Similarity between documents d and d ′: cosine of these two vectors

cos(d ,d ′) =d · d ′

‖d‖ × ‖d ′‖

d · d ′ scalar product of d and d ′

‖d‖ norm of vector d

cos(d ,d) = 1

cos(d ,d ′) = 0 if d and d ′ are orthogonal (do not share any common term)

WebDam (INRIA) Web search April 3, 2013 59 / 85

Page 61: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Clustering

Agglomerative Clustering of Documents

1 Initially, each document forms its own cluster.2 The similarity between two clusters is defined as the maximal similarity

between elements of each cluster.3 Find the two clusters whose mutual similarity is highest. If it is lower than a

given threshold, end the clustering. Otherwise, regroup these clusters.Repeat.

RemarkMany other more refined algorithms for clustering exist.

WebDam (INRIA) Web search April 3, 2013 60 / 85

Page 62: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Other Media

Indexing HTML

HTML: text + meta-information + structure

Possibly: separate index for meta-information (title, keywords)

Increase weight of structurally emphasized content in index

Tree structure can also be queried with XPath or XQuery, but not veryuseful on the Web as a whole, because of tag soup and lack of consistency.

WebDam (INRIA) Web search April 3, 2013 61 / 85

Page 63: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Information Retrieval Other Media

Indexing Multimedia Content

Basic approach: index text from context of the mediaI surrounding textI text in or around the links pointing to the contentI filenamesI associated subtitles (hearing-impaired track on TV)

Elaborate approach: index and search the media itself, with the help ofspeech recognition and sound, image, and video analysis

I TrackID, Shazam: identify a song played on the radioI Musipedia: look for music by whistling a tune,http://www.musipedia.org/, http://www.midomi.com/

I Image search from a similar image, http://images.google.com/,Google Goggles

I Voxalead, http://voxaleadnews.labs.exalead.com/

WebDam (INRIA) Web search April 3, 2013 62 / 85

Page 64: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining

Outline

1 The World Wide Web

2 Web crawling

3 Web Information Retrieval

4 Web Graph MiningPageRankHITSSpamdexing

5 Conclusion

WebDam (INRIA) Web search April 3, 2013 63 / 85

Page 65: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining

The Web Graph

The World Wide Web seen as a (directed) graph:

Vertices: Web pages

Edges: hyperlinks

Same for other interlinked environments:

dictionaries

encyclopedias

scientific publications

social networks

WebDam (INRIA) Web search April 3, 2013 64 / 85

Page 66: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

Let’s start with simple examples

The PageRank (PR) of page i is the Probability that a surfer following therandom walk has arrived on i at some distant given point in the future.

Left part: PR(A) = PR(B) + PR(C) + PR(D)

Right part?

Assume that the initial PR of each page is 0.25: what is the PR after oneiteration? Two iterations?

WebDam (INRIA) Web search April 3, 2013 65 / 85

Page 67: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

The example graph

12

3

6

7

9

4

5

10

8

WebDam (INRIA) Web search April 3, 2013 66 / 85

Page 68: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

The transition matrix

{gij = 0 if there is no link between page i and j ;

gij =1ni

otherwise, with ni the number of outgoing links of page i .

G =

0 1 0 0 0 0 0 0 0 00 0 1

4 0 0 14

14 0 1

4 00 0 0 1

212 0 0 0 0 0

0 1 0 0 0 0 0 0 0 00 0 0 0 0 1

2 0 0 0 12

13

13 0 1

3 0 0 0 0 0 00 0 0 0 0 1

3 0 13 0 1

30 1

3 0 0 0 0 0 0 13

13

0 12

12 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0 0 0

WebDam (INRIA) Web search April 3, 2013 67 / 85

Page 69: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank (Google’s Ranking [BP98])

IdeaImportant pages are pages pointed to by important pages.

PageRank simulates a random walk by iterately computing the PR of each page,represented as a vector v .

Initially, v is set using a uniform distribution (v [i ] = 1|v | ).

Definition (Tentative)Probability that the surfer following the random walk in G has arrived on page iat some distant given point in the future.

pr(i) =

(lim

k→+∞(GT )k v

)i

where v is some initial column vector.

WebDam (INRIA) Web search April 3, 2013 68 / 85

Page 70: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

Exercise

Model the simple examples with transition matrix, and apply PageRank,assuming an initial uniform distribution.

WebDam (INRIA) Web search April 3, 2013 69 / 85

Page 71: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.1000.100

0.100

0.100

0.100

0.100

0.100

0.100

0.100

0.100

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 72: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0330.317

0.075

0.108

0.025

0.058

0.083

0.150

0.117

0.033

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 73: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0360.193

0.108

0.163

0.079

0.090

0.074

0.154

0.094

0.008

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 74: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0540.212

0.093

0.152

0.048

0.051

0.108

0.149

0.106

0.026

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 75: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0510.247

0.078

0.143

0.053

0.062

0.097

0.153

0.099

0.016

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 76: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0480.232

0.093

0.156

0.062

0.067

0.087

0.138

0.099

0.018

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 77: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0520.226

0.092

0.148

0.058

0.064

0.098

0.146

0.096

0.021

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 78: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0490.238

0.088

0.149

0.057

0.063

0.095

0.141

0.099

0.019

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 79: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0500.232

0.091

0.149

0.060

0.066

0.094

0.143

0.096

0.019

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 80: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0500.233

0.091

0.150

0.058

0.064

0.095

0.142

0.098

0.020

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 81: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0500.234

0.090

0.148

0.058

0.065

0.095

0.143

0.097

0.019

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 82: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0490.233

0.091

0.149

0.058

0.065

0.095

0.142

0.098

0.019

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 83: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0500.233

0.091

0.149

0.058

0.065

0.095

0.143

0.097

0.019

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 84: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank Iterative Computation

0.0500.234

0.091

0.149

0.058

0.065

0.095

0.142

0.097

0.019

WebDam (INRIA) Web search April 3, 2013 70 / 85

Page 85: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

PageRank With Damping

May not always converge, or convergence may not be unique.To fix this, the random surfer can at each step randomly jump to any page of theWeb with some probability d (1− d : damping factor).

pr(i) =

(lim

k→+∞((1− d)GT + dU)k v

)i

where U is the matrix with all 1N values with N the number of vertices.

WebDam (INRIA) Web search April 3, 2013 71 / 85

Page 86: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining PageRank

Using PageRank to Score Query Results

PageRank: global score, independent of the query

Can be used to raise the weight of important pages:

weight(t,d) = tfidf(t,d)× pr(d),

This can be directly incorporated in the index.

WebDam (INRIA) Web search April 3, 2013 72 / 85

Page 87: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining HITS

HITS (Kleinberg, [Kle99])

IdeaTwo kinds of important pages: hubs and authorities. Hubs are pages that pointto good authorities, whereas authorities are pages that are pointed to by goodhubs.

G′ transition matrix (with 0 and 1 values) of a subgraph of the Web. We use thefollowing iterative process (starting with a and h vectors of norm 1):{

a := 1‖G′T h‖ G′T h

h := 1‖G′a‖ G′a

Converges under some technical assumptions to authority and hub scores.

WebDam (INRIA) Web search April 3, 2013 73 / 85

Page 88: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining HITS

Using HITS to Order Web Query Results

1 Retrieve the set D of Web pages matching a keyword query.2 Retrieve the set D∗ of Web pages obtained from D by adding all linked

pages, as well as all pages linking to pages of D.3 Build from D∗ the corresponding subgraph G′ of the Web graph.4 Compute iteratively hubs and authority scores.5 Sort documents from D by authority scores.

Less efficient than PageRank, because local scores.

WebDam (INRIA) Web search April 3, 2013 74 / 85

Page 89: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining Spamdexing

Spamdexing

DefinitionFraudulent techniques that are used by unscrupulous webmasters to artificiallyraise the visibility of their website to users of search engines

Purpose: attracting visitors to websites to make profit.

Unceasing war between spamdexers and search engines

WebDam (INRIA) Web search April 3, 2013 75 / 85

Page 90: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining Spamdexing

Spamdexing: Lying about the Content

Technique

Put unrelated terms in:

meta-information (<meta name="description">,<meta name="keywords">)

text content hidden to the user with JavaScript, CSS, or HTMLpresentational elements

Countertechnique

Ignore meta-information

Try and detect invisible text

WebDam (INRIA) Web search April 3, 2013 76 / 85

Page 91: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining Spamdexing

Link Farm Attacks

Technique

Huge number of hosts on the Internet used for the sole purpose of referencingeach other, without any content in themselves, to raise the importance of a givenwebsite or set of websites.

Countertechnique

Detection of websites with empty or duplicate content

Use of heuristics to discover subgraphs that look like link farms

WebDam (INRIA) Web search April 3, 2013 77 / 85

Page 92: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Web Graph Mining Spamdexing

Link Pollution

Technique

Pollute user-editable websites (blogs, wikis) or exploit security bugs to addartificial links to websites, in order to raise its importance.

Countertechnique

rel="nofollow" attribute to <a> links not validated by a page’s owner

WebDam (INRIA) Web search April 3, 2013 78 / 85

Page 93: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Conclusion

Outline

1 The World Wide Web

2 Web crawling

3 Web Information Retrieval

4 Web Graph Mining

5 Conclusion

WebDam (INRIA) Web search April 3, 2013 79 / 85

Page 94: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Conclusion

What you should remember

The inverted index model for efficient answers of keyword-based queries.

The threshold algorithm for retrieving top-k results.

PageRank and its iterative computation.

WebDam (INRIA) Web search April 3, 2013 80 / 85

Page 95: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Conclusion

References

SpecificationsI HTML 4.01, http://www.w3.org/TR/REC-html40/I HTTP/1.1, http://tools.ietf.org/html/rfc2616I Robot Exclusion Protocol,http://www.robotstxt.org/orig.html

A bookMining the Web: Discovering Knowledge from Hypertext Data, SoumenChakrabarti, Morgan Kaufmann

WebDam (INRIA) Web search April 3, 2013 81 / 85

Page 96: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Bibliography I

Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati.A first experience in archiving the French Web.In Proc. ECDL, Roma, Italie, September 2002.

Serge Abiteboul, Mihai Preda, and Grégory Cobena.Adaptive on-line page importance computation.In Proc. Intl. World Wide Web Conference (WWW), 2003.

Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and GeoffreyZweig.Syntactic clustering of the Web.Computer Networks, 29(8-13):1157–1166, 1997.

Sergey Brin and Lawrence Page.The anatomy of a large-scale hypertextual Web search engine.Computer Networks, 30(1–7):107–117, April 1998.

WebDam (INRIA) Web search April 3, 2013 82 / 85

Page 97: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Bibliography II

Soumen Chakrabarti.Mining the Web: Discovering Knowledge from Hypertext Data.Morgan Kaufmann, 2003.

Soumen Chakrabarti, Martin van den Berg, and Byron Dom.Focused crawling: A new approach to topic-specific Web resourcediscovery.Computer Networks, 31(11–16):1623–1640, 1999.

Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, andMarco Gori.Focused crawling using context graphs.In Proc. VLDB, Cairo, Egypt, September 2000.

Jon M. Kleinberg.Authoritative Sources in a Hyperlinked Environment.Journal of the ACM, 46(5):604–632, 1999.

WebDam (INRIA) Web search April 3, 2013 83 / 85

Page 98: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Bibliography III

Martijn Koster.A standard for robot exclusion.http://www.robotstxt.org/orig.html, June 1994.

Martin F. Porter.An algorithm for suffix stripping.Program, 14(3):130–137, July 1980.

Pierre Senellart.Identifying Websites with flow simulation.In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.

sitemaps.org.Sitemaps XML format.http://www.sitemaps.org/protocol.php, February 2008.

WebDam (INRIA) Web search April 3, 2013 84 / 85

Page 99: Web search - Web data management and distributionwebdam.inria.fr/Jorge/files/slwebsearch.pdf · Web search Web data management and distribution Serge Abiteboul Ioana Manolescu Philippe

Bibliography IV

US National Archives and Records Administration.The Soundex indexing system.http://www.archives.gov/genealogy/census/soundex.html,May 2007.

WebDam (INRIA) Web search April 3, 2013 85 / 85