Web search

Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart

Web Data Management and Distribution
http://webdam.inria.fr/textbook
April 3, 2013
WebDam (INRIA) Web search April 3, 2013 1 / 85
The World Wide Web
Outline
1 The World Wide Web
2 Web crawling
3 Web Information Retrieval
4 Web Graph Mining
5 Conclusion
Internet and the Web
Internet: physical network of computers (or hosts)
World Wide Web, Web, WWW: logical collection of hyperlinked documents
static and dynamic
public Web and private Web
each document (or Web page, or resource) identified by a URL
Uniform Resource Locators
https://www.example.com:443/path/to/doc?name=foo&town=bar#para

scheme (https): how the resource can be accessed; generally http or https
hostname (www.example.com): domain name of a host (cf. DNS); the hostname of a website often starts with www., but this is only a convention
port (443): TCP port; defaults: 80 for http and 443 for https
path (/path/to/doc): logical path of the document
query string (?name=foo&town=bar): additional parameters (dynamic documents)
fragment (#para): subpart of the document

Query strings and fragments are optional
Empty path: root of the Web server
Relative URIs are resolved with respect to a context (e.g., the URI above):
/titi resolves to https://www.example.com/titi
tata resolves to https://www.example.com/path/to/tata
(X)HTML
Format of choice for Web pages
Dialect of SGML (the ancestor of XML), but seldom parsed as is
HTML 4.01: most common version, W3C recommendation
XHTML 1.0: XML-ization of HTML 4.01, minor differences
Validation (cf. http://validator.w3.org/): checks the conformity of a Web page with respect to the recommendations, for accessibility:
- to all graphical browsers (IE, Firefox, Safari, Opera, etc.)
- to text browsers (lynx, links, w3m, etc.)
- to aural browsers
- to all other user agents, including Web crawlers
Actual situation of the Web: tag soup
XHTML example
<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8" />
    <title>Example XHTML document</title>
  </head>
  <body>
    <p>This is a
      <a href="http://www.w3.org/">link to the
        <strong>W3C</strong>!</a></p>
  </body>
</html>
HTTP
Client-server protocol for the Web, on top of TCP/IP
Example request/response:

GET /myResource HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1

<html>
  <head><title>myResource</title></head>
  <body><p>Hello world!</p></body>
</html>

HTTPS: secure version of HTTP
- encryption
- authentication
- session tracking
Features of HTTP/1.1
virtual hosting: different Web content for different hostnames on a single machine
login/password protection
content negotiation: same URL identifying several resources; the client indicates its preferences
cookies: chunks of information persistently stored on the client
keep-alive connections: several requests using the same TCP connection
etc.
Web crawling
Outline
1 The World Wide Web
2 Web crawling
  Discovering new URLs
  Identifying duplicates
  Crawling architecture
3 Web Information Retrieval
4 Web Graph Mining
5 Conclusion
Web crawling Discovering new URLs
Web Crawlers
crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web
Basics of crawling:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (cf. next slide)
4. Repeat on each found URL
No real termination condition (virtually unlimited number of Web pages!)

Graph-browsing problem:
depth-first: not well adapted; risk of getting lost in robot traps
breadth-first
combination of both: breadth-first, with limited-depth depth-first on each discovered website
Sources of new URLs
From HTML pages:
- hyperlinks <a href="...">...</a>
- media <img src="..."> <embed src="..."> <object data="...">
- frames <frame src="..."> <iframe src="...">
- JavaScript links window.open("...")
- etc.
Other hyperlinked content (e.g., PDF files)
Non-hyperlinked URLs that appear anywhere on the Web (in HTML text,text files, etc.): use regular expressions to extract them
Referrer URLs
Sitemaps [sit08]
Scope of a crawler
Web-scale:
- The Web is infinite! Avoid robot traps by putting depth or page-number limits on each Web server
- Focus on important pages [APC03] (cf. lecture on the Web graph)
Web servers under a list of DNS domains: easy filtering of URLs
A given topic: focused crawling techniques [CvdBD99, DCL+00] based onclassifiers of Web page content and predictors of the interest of a link.
The national Web (cf. public deposit, national libraries): what is this?[ACMS02]
A given Web site: what is a Web site? [Sen05]
Web crawling Identifying duplicates
A word about hashing
Definition
A hash function is a deterministic mathematical function transforming objects (numbers, character strings, binary data...) into fixed-size, seemingly random numbers. The more random the transformation is, the better.

Example
Java hash function for the String class:

h(s) = ( Σ_{i=0}^{n−1} s_i × 31^{n−i−1} ) mod 2^32

where s_i is the (Unicode) code of character i of a string s of length n.
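As an illustration (not part of the original slides), this Java String hash can be reproduced in a few lines of Python; the function name java_string_hash is ours:

```python
def java_string_hash(s: str) -> int:
    """Compute sum(s_i * 31^(n-i-1)) mod 2^32, i.e. Java's
    String.hashCode() truncated to an unsigned 32-bit integer."""
    h = 0
    for c in s:
        # Horner's rule: h = h * 31 + code, kept within 32 bits
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Similar strings hash to very different values:
print(java_string_hash("hello"))  # 99162322, same as Java's "hello".hashCode()
print(java_string_hash("hellp"))
```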
Identification of duplicate Web pages
Problem
Identifying duplicates or near-duplicates on the Web, to prevent multiple indexing.

trivial duplicates: same resource at the same canonized URL:
http://example.com:80/toto
http://example.com/titi/../toto
exact duplicates: identification by hashing
near-duplicates: (timestamps, tip of the day, etc.) more complex!
Near-duplicate detection
Edit distance. Count the minimum number of basic modifications (additions or deletions of characters or words, etc.) needed to obtain one document from another. A good measure of similarity, computable in O(mn) where m and n are the sizes of the documents. But: does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!).
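For illustration, the O(mn) dynamic-programming computation (Levenshtein distance, here also counting substitutions) can be sketched as follows; the helper name is ours:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions turning a into b, in O(mn) time."""
    m, n = len(a), len(b)
    # prev[j] = distance between the current prefix of a and b[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,        # delete a[i-1]
                         cur[j - 1] + 1,     # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute
        prev = cur
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```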
Shingles. Idea: two documents are similar if they mostly share the same succession of k-grams (successions of tokens of length k).

Example
I like to watch the sun set with my friend.
My friend and I like to watch the sun set.

S = {i like, like to, my friend, set with, sun set, the sun, to watch, watch the, with my}
T = {and i, friend and, i like, like to, my friend, sun set, the sun, to watch, watch the}
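The shingle sets above can be recomputed with a short sketch (the shingles helper is ours, and its tokenization is deliberately naive):

```python
def shingles(text: str, k: int = 2) -> set:
    """Set of k-grams (successions of k word tokens) of a text."""
    tokens = text.lower().replace(".", "").split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

S = shingles("I like to watch the sun set with my friend.")
T = shingles("My friend and I like to watch the sun set.")
print(sorted(S & T))  # the 7 shared shingles
```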
Hashing shingles to detect duplicates [BGMZ97]
Similarity: Jaccard coefficient on the set of shingles:
J(S,T) = |S ∩ T| / |S ∪ T|
Still costly to compute! But it can be approximated as follows:
1. Choose N different hash functions h_1, ..., h_N.
2. For each hash function h_i and each set of shingles S_k = {s_k1, ..., s_kn}, store φ_ik = min_j h_i(s_kj).
3. Approximate J(S_k, S_l) as the proportion of the φ_ik and φ_il that are equal.

This can possibly be repeated hierarchically with super-shingles (we are only interested in very similar documents).
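A sketch of this MinHash approximation, using salted MD5 digests as the N hash functions (all names here are ours, and the choice of MD5 is arbitrary):

```python
import hashlib

def h(i: int, x: str) -> int:
    """i-th hash function: salted MD5, truncated to 64 bits."""
    return int.from_bytes(hashlib.md5(f"{i}:{x}".encode()).digest()[:8], "big")

def minhash_signature(shingle_set: set, n_hashes: int = 512) -> list:
    """phi_i = min over the set of the i-th hash function."""
    return [min(h(i, s) for s in shingle_set) for i in range(n_hashes)]

def estimate_jaccard(sig1: list, sig2: list) -> float:
    """Proportion of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

S = {"i like", "like to", "my friend", "set with", "sun set",
     "the sun", "to watch", "watch the", "with my"}
T = {"and i", "friend and", "i like", "like to", "my friend",
     "sun set", "the sun", "to watch", "watch the"}
true_j = len(S & T) / len(S | T)   # 7/11, about 0.64
est_j = estimate_jaccard(minhash_signature(S), minhash_signature(T))
print(true_j, est_j)               # est_j is close to true_j
```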
Web crawling Crawling architecture
Crawling ethics
Standard for robot exclusion: robots.txt at the root of a Web server [Kos94].
User-agent: *
Allow: /searchhistory/
Disallow: /search
Per-page exclusion (de facto standard).
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
Per-link exclusion (de facto standard).
<a href="toto.html" rel="nofollow">Toto</a>
Avoid denial of service (DoS): wait 100 ms to 1 s between two successive requests to the same Web server
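Python's standard urllib.robotparser can interpret such rules; a small sketch using the robots.txt shown above (the agent name and URLs are illustrative):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

# Rules are matched as path prefixes:
print(rp.can_fetch("mybot", "http://www.example.com/search"))           # False
print(rp.can_fetch("mybot", "http://www.example.com/searchhistory/p"))  # True
```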
Parallel processing
Network delays, waits between requests:
Per-server queue of URLs
Parallel processing of requests to different hosts:
- multi-threaded programming
- asynchronous inputs and outputs (select, classes from java.util.concurrent): less overhead

Use of keep-alive to reduce connection overheads
General Architecture [Cha03]
Refreshing URLs
Content on the Web changes
Different change rates:
online newspaper front page: every hour or so
published article: virtually no change

Continuous crawling, and identification of change rates for adaptive crawling:
- If-Modified-Since / Last-Modified HTTP mechanism (not reliable)
- identification of duplicates in successive requests
Web Information Retrieval
Outline
1 The World Wide Web
2 Web crawling
3 Web Information Retrieval
  Text Preprocessing
  Inverted Index
  Answering Keyword Queries
  Building inverted files
  Clustering
  Other Media
4 Web Graph Mining
5 Conclusion
Information Retrieval, Search
Problem
How to index Web content so as to answer (keyword-based) queries efficiently?

Context: set of text documents
d1  The jaguar is a New World mammal of the Felidae family.
d2  Jaguar has designed four new engines.
d3  For Jaguar, Atari was keen to use a 68K family device.
d4  The Jacksonville Jaguars are a professional US football team.
d5  Mac OS X Jaguar is available at a price of US $199 for Apple's new "family pack".
d6  One such ruling family to incorporate the jaguar into their name is Jaguar Paw.
d7  It is a big cat.
Web Information Retrieval Text Preprocessing
Text Preprocessing
Initial text preprocessing steps
A number of optional steps
Highly depends on the application
Highly depends on the document language (illustrated with English)
Language Identification
How to find the language used in a document?
Meta-information about the document: often not reliable!
Unambiguous scripts or letters: not very common!
한글
カタカナ
Għarbi
þorn
Respectively: Korean Hangul, Japanese Katakana, Maldivian Dhivehi,Maltese, Icelandic
Extension of this: frequent characters, or, better, frequent k-grams
Use standard machine learning techniques (classifiers)
Tokenization
Principle
Separate text into tokens (words)
Not so easy!
In some languages (Chinese, Japanese), words are not separated by whitespace

Deal consistently with acronyms, elisions, numbers, units, URLs, emails, etc.

Compound words: hostname, host-name and host name. Break into two tokens or regroup them as one token? In any case, a lexicon and linguistic analysis are needed! Even more so in other languages such as German.

Usually, punctuation is removed and case is normalized at this point
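A minimal sketch of these steps (lowercasing, punctuation stripping, whitespace-level splitting); the tokenize helper is ours and far cruder than a real tokenizer:

```python
import re

def tokenize(text: str) -> list:
    """Lowercase the text and extract word-like tokens,
    dropping punctuation in the process."""
    return re.findall(r"[a-z0-9$']+", text.lower())

print(tokenize("Jaguar has designed four new engines."))
# ['jaguar', 'has', 'designed', 'four', 'new', 'engines']
```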
Tokenization: Example
d1 the1 jaguar2 is3 a4 new5 world6 mammal7 of8 the9 felidae10 family11
d2 jaguar1 has2 designed3 four4 new5 engines6
d3 for1 jaguar2 atari3 was4 keen5 to6 use7 a8 68k9 family10 device11
d4 the1 jacksonville2 jaguars3 are4 a5 professional6 us7 football8 team9
d5 mac1 os2 x3 jaguar4 is5 available6 at7 a8 price9 of10 us11 $19912
for13 apple’s14 new15 family16 pack17
d6 one1 such2 ruling3 family4 to5 incorporate6 the7 jaguar8 into9
their10 name11 is12 jaguar13 paw14
d7 it1 is2 a3 big4 cat5
Stemming
Principle
Merge different forms of the same word, or of closely related words, into a single stem
Not in all applications!
Useful for retrieving documents containing geese when searching for goose
Various degrees of stemming
Possibility of building different indexes, with different stemming
Stemming schemes (1/2)
Morphological stemming.
Remove bound morphemes from words:
- plural markers
- gender markers
- tense or mood inflections
- etc.
Can be linguistically very complex, cf.:
Les poules du couvent couvent. [The hens of the monastery brood.]
In English, somewhat easy:
- Remove final -s, -'s, -ed, -ing, -er, -est
- Take care of semiregular forms (e.g., -y/-ies)
- Take care of irregular forms (mouse/mice)
But still some ambiguities: cf stocking, rose
Stemming schemes (2/2)
Lexical stemming.
Merge lexically related terms of various parts of speech, such as policy, politics, political or politician.
For English, Porter's stemming [Por80]; it stems university and universal to univers: not perfect!
Possibility of coupling this with lexicons to merge (near-)synonyms.

Phonetic stemming.

Merge phonetically related words: search despite spelling errors!
For English, Soundex [US 07] stems Robert and Rupert to R163. Very coarse!
Stemming Example
d1 the1 jaguar2 be3 a4 new5 world6 mammal7 of8 the9 felidae10 family11
d2 jaguar1 have2 design3 four4 new5 engine6
d3 for1 jaguar2 atari3 be4 keen5 to6 use7 a8 68k9 family10 device11
d4 the1 jacksonville2 jaguar3 be4 a5 professional6 us7 football8 team9
d5 mac1 os2 x3 jaguar4 be5 available6 at7 a8 price9 of10 us11 $19912
for13 apple14 new15 family16 pack17
d6 one1 such2 rule3 family4 to5 incorporate6 the7 jaguar8 into9
their10 name11 be12 jaguar13 paw14
d7 it1 be2 a3 big4 cat5
Stop Word Removal
Principle
Remove uninformative words from documents, in particular to lower the cost of storing the index
determiners: a, the, this, etc.
function verbs: be, have, make, etc.
conjunctions: that, and, etc.
etc.
Stop Word Removal Example
d1 jaguar2 new5 world6 mammal7 felidae10 family11
d2 jaguar1 design3 four4 new5 engine6
d3 jaguar2 atari3 keen5 68k9 family10 device11
d4 jacksonville2 jaguar3 professional6 us7 football8 team9
d5 mac1 os2 x3 jaguar4 available6 price9 us11 $19912 apple14
new15 family16 pack17
d6 one1 such2 rule3 family4 incorporate6 jaguar8 their10 name11
jaguar13 paw14
d7 big4 cat5
Web Information Retrieval Inverted Index
Structure of an inverted index
Assume D is a collection of (text) documents. Create a matrix M with one row for each document and one column for each token. Initialize all cells to 0.

Fill in the content of M: scan D and extract, for each document d, the tokens t that can be found in d (preprocessing); put 1 in M[d][t].

Invert M: one obtains the inverted index. Terms appear as rows, each with its list of document ids, or posting list.
Problem: storage of the whole matrix is not feasible.
Inverted Index construction
After all preprocessing, construction of an inverted index:
Index of all terms, with the list of documents where this term occurs
Small scale: disk storage, with memory-mapping (cf. mmap) techniques; a secondary index gives the offset of each term in the main index

Large scale: distributed on a cluster of machines; hashing gives the machine responsible for a given term

Updating the index is costly, so only batch operations (not one-by-one addition of term occurrences)
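As a toy illustration of the small-scale, in-memory case, an index mapping each term to its sorted posting list (all names and the sample documents below are ours):

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """docs maps a document id to its (preprocessed) token list;
    returns term -> sorted list of document ids (posting list)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = {
    "d1": ["jaguar", "new", "world", "mammal", "felidae", "family"],
    "d2": ["jaguar", "design", "four", "new", "engine"],
    "d4": ["jacksonville", "jaguar", "professional", "us", "football", "team"],
}
index = build_inverted_index(docs)
print(index["jaguar"])  # ['d1', 'd2', 'd4']
print(index["new"])     # ['d1', 'd2']
```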
Inverted Index Example
family d1, d3, d5, d6
football d4
jaguar d1, d2, d3, d4, d5, d6
new d1, d2, d5
rule d6
us d4, d5
world d1
...
Note:
the length of an inverted (posting) list is highly variable; scanning short lists first is an important optimization.
entries are homogeneous: this gives much room for compression.
Index size matters
We want to index a collection of 1M emails. The average size of an email is1,000 bytes and each email contains an average of 100 words. The number ofdistinct terms is 200,000.
1. size of the collection; number of words?
2. how many lists in the index?
3. we make the (rough) assumption that 20% of the terms in a document appear twice; in how many lists does a document appear, on average?
4. how many entries in a list?
5. we represent document ids as 4-byte unsigned integers; what is the index size?
Storing positions in the index
phrase queries, NEAR operator: need to keep position information in theindex
just add it in the document list!
family d1/11, d3/10, d5/16, d6/4
football d4/8
jaguar d1/2, d2/1, d3/2, d4/3, d5/4, d6/8+13
new d1/5, d2/5, d5/15
rule d6/3
us d4/7, d5/11
world d1/6
...

⇒ So far, OK for Boolean queries: find the documents that contain a set of keywords; reject the others.
Ranked search
Boolean search does not give an accurate result, because it does not take into account the relevance of a document to a query.

If the search retrieves dozens or hundreds of documents, the most relevant must be shown in top position!

The quality of a result with respect to relevance is measured by two factors:

precision = |relevant ∩ retrieved| / |retrieved|

recall = |relevant ∩ retrieved| / |relevant|
Weighting term occurrences
Relevance can be computed by giving a weight to term occurrences.

Terms occurring frequently in a given document: more relevant.
The term frequency is the number of occurrences of a term t in a document d, divided by the total number of terms in d (normalization):

tf(t,d) = n_{t,d} / Σ_{t'} n_{t',d}

where n_{t',d} is the number of occurrences of t' in d.

Terms occurring rarely in the document collection as a whole: more informative.
The inverse document frequency (idf) is obtained from the division of the total number of documents by the number of documents where t occurs:

idf(t) = log ( |D| / |{d' ∈ D | n_{t,d'} > 0}| )
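These two formulas translate directly into code; a sketch over tokenized documents (the helper names tf and idf and the tiny corpus are ours):

```python
import math

def tf(term: str, doc: list) -> float:
    """Term frequency: occurrences of term in doc, normalized by doc length."""
    return doc.count(term) / len(doc)

def idf(term: str, docs: list) -> float:
    """Inverse document frequency: log(|D| / #documents containing term)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

docs = [["jaguar", "new", "family"],
        ["jaguar", "new", "engine"],
        ["big", "cat"]]
print(tf("new", docs[0]))                     # 1/3
print(idf("jaguar", docs))                    # log(3/2)
print(tf("new", docs[0]) * idf("new", docs))  # a tf-idf weight
```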
TF-IDF Weighting
The inverted index is extended by adding Term Frequency / Inverse Document Frequency weighting:

tfidf(t,d) = ( n_{t,d} / Σ_{t'} n_{t',d} ) · log ( |D| / |{d' ∈ D | n_{t,d'} > 0}| )

n_{t,d}: number of occurrences of t in d
D: set of all documents

Documents (along with their weight) are stored in decreasing weight order in the index
TF-IDF Weighting Example
family d1/11/.13, d3/10/.13, d6/4/.08, d5/16/.07
football d4/8/.47
jaguar d1/2/.04, d2/1/.04, d3/2/.04, d4/3/.04, d6/8+13/.04, d5/4/.02
new d2/5/.24, d1/5/.20, d5/15/.10
rule d6/3/.28
us d4/7/.30, d5/11/.15
world d1/6/.47
...

Exercise: take an entry, and check that the tf-idf value is indeed correct (take the documents after stop-word removal).
Web Information Retrieval Answering Keyword Queries
Answering Boolean Queries
Single keyword query: just consult the index and return the documents inindex order.
Boolean multi-keyword query
(jaguar AND new AND NOT family) OR cat
Same way! Retrieve the document lists of all keywords and apply the adequate set operations:

AND: intersection
OR: union
AND NOT: difference

Global score: some function of the individual weights (e.g., addition for conjunctive queries)
Position queries: consult the index, and filter by appropriate condition
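On the running example's posting lists, the sample query evaluates with plain set operations; a sketch, with the index literal below transcribed from the example slides:

```python
# Posting lists from the running example (after stemming / stop-word removal)
index = {
    "jaguar": {"d1", "d2", "d3", "d4", "d5", "d6"},
    "new":    {"d1", "d2", "d5"},
    "family": {"d1", "d3", "d5", "d6"},
    "cat":    {"d7"},
}

# (jaguar AND new AND NOT family) OR cat
result = ((index["jaguar"] & index["new"]) - index["family"]) | index["cat"]
print(sorted(result))  # ['d2', 'd7']
```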
Answering Top-k Queries
t_1 AND ... AND t_n

Problem
Find the top-k results (for some given k) to the query, without retrieving all documents matching it.

Notations:
s(t,d): weight of t in d (e.g., tfidf)
g(s_1, ..., s_n): monotone function that computes the global score (e.g., addition)
Exercise
Consider the following documents:
1. d1 = I like to watch the sun set with my friend.
2. d2 = The Best Places To Watch The Sunset.
3. d3 = My friend watches the sun come up.
Construct an inverted index with tf/idf weights for terms ’Best’ and ’sun’. Whatwould be the ranked result of the query ’Best and sun’?
Basic algorithm
First version of the top-k algorithm: the inverted file contains entries sorted on the document id. The query is t_1 AND ... AND t_n.

1. Take the first entry of each list; one obtains a tuple T = [e_1, ..., e_n].
2. Let d be the minimal doc id in the entries of T; compute the global score of d.
3. For each entry e_i featuring d: advance on the inverted list L_i.
When all lists have been scanned: sort the documents on the global scores.
Not very efficient; cannot give the ranked result before a full scan on the lists.
The Threshold Algorithm
1. Let R be the empty list, and τ = +∞.
2. For each 1 ≤ i ≤ n:
   (a) Retrieve the document d(i) containing term t_i that has the next largest s(t_i, d(i)).
   (b) Compute its global score g_{d(i)} = g(s(t_1, d(i)), ..., s(t_n, d(i))) by retrieving all s(t_j, d(i)) with j ≠ i.
   (c) If R contains fewer than k documents, or if g_{d(i)} is greater than the minimum score of the documents in R, add d(i) to R.
3. Let τ = g(s(t_1, d(1)), s(t_2, d(2)), ..., s(t_n, d(n))).
4. If R contains at least k documents, and the minimum score of the documents in R is greater than or equal to τ, return R.
5. Redo step 2.
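A compact sketch of this algorithm, with addition as the global score g and dictionaries simulating random access to the s(t_i, d); all names are ours, and the input lists are transcribed from the example that follows:

```python
def threshold_algorithm(lists, k):
    """Threshold algorithm sketch. lists: one posting list per query term,
    as [(doc, score), ...] sorted by decreasing score (sorted access)."""
    rand = [dict(lst) for lst in lists]   # simulated random access
    seen = {}                             # doc -> global score
    depth = 0
    while True:
        row = []                          # scores read this round
        for lst in lists:
            if depth < len(lst):
                doc, score = lst[depth]   # sorted access
                row.append(score)
                if doc not in seen:       # random access to the other lists
                    seen[doc] = sum(r.get(doc, 0.0) for r in rand)
            else:
                row.append(0.0)           # list exhausted
        depth += 1
        tau = sum(row)                    # threshold for unseen documents
        top = sorted(seen.items(), key=lambda x: -x[1])[:k]
        if len(top) == k and top[-1][1] >= tau:
            return top
        if all(depth >= len(lst) for lst in lists):
            return top

lists = [
    [("d1", .13), ("d3", .13), ("d6", .08), ("d5", .07)],   # family
    [("d2", .24), ("d1", .20), ("d5", .10)],                # new
]
print(threshold_algorithm(lists, k=3))
# top-3: d1 (0.33), d2 (0.24), d5 (0.17)
```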
The TA, by example
q = "new OR family", and k = 3. We use inverted lists sorted on the weight:
family d1/11/.13, d3/10/.13, d6/4/.08, d5/16/.07
new d2/5/.24, d1/5/.20, d5/15/.10
...

Initially, R = ∅ and τ = +∞.
1. d(1) is the first entry in L_family; one finds s(new, d1) = .20; the global score for d1 is .13 + .20 = .33.
2. Next, i = 2, and one finds that the global score for d2 is .24.
3. The algorithm exits the loop on i with R = 〈[d1, .33], [d2, .24]〉 and τ = .13 + .24 = .37.
4. We proceed with the loop again, taking d3 with score .13 and d5 with score .17. [d5, .17] is added to R (at the end) and τ is now .10 + .13 = .23.
A last loop concludes that the next candidate is d6, with a global score of .08. Then we are done.
Web Information Retrieval Building inverted files
External Sort/Merge
Building an inverted index from a document collection requires a sort/merge of the index entries:

1. first, extract triplets [d, t, f] from the collection;
2. then, sort the set of triplets on the term-docid pair [t, d];
3. contiguous inverted lists can then be created from the sorted entries.

Note: a ubiquitous operation, used in RDBMSs for ORDER BY, GROUP BY, DISTINCT, and non-indexed joins.
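In miniature, the two phases can be sketched with small in-memory runs standing in for disk files; heapq.merge performs the k-way merge (names and run size are ours):

```python
import heapq

def external_sort(entries, run_size=4):
    """Sort [term, doc] pairs that do not all fit in 'memory':
    phase 1 cuts the input into sorted runs of run_size entries,
    phase 2 merges the sorted runs (heapq.merge is a k-way merge)."""
    runs = [sorted(entries[i:i + run_size])
            for i in range(0, len(entries), run_size)]
    return list(heapq.merge(*runs))

entries = [("new", "d2"), ("jaguar", "d1"), ("family", "d3"),
           ("jaguar", "d2"), ("family", "d1"), ("new", "d1")]
merged = external_sort(entries)
print(merged[0])  # ('family', 'd1')
```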
First phase: sort
Repeat: fill the memory with entries; sort in memory (with quicksort); flush the memory into a "run".

One obtains a list of sorted runs.

Cost: documents are read once; entries are written once.
Second phase: merge
Group the runs and merge.
Cost: one read/write of all entries for each level of the merge tree.
Illustration with M = 3
[Figure: sort/merge of a list of movie titles (Annie Hall, Brazil, Easy Rider, Greystoke, Jurassic Park, Manhattan, Metropolis, Psychose, Shining, Twin Peaks, Underground, Vertigo) with M = 3: initial sorted runs of three titles are merged pairwise into longer runs, then into the final sorted list.]
Main parameter: the buffer size
Consider a file with 75,000 pages of 4 KB each, thus 307 MB.

If M > 307 MB: one read is enough.
If M = 2 MB (500 pages):
1. the sort phase yields ⌈307/2⌉ = 154 runs;
2. the merge phase requires 154 buffer pages, one per run.

Total cost: 614 + 307 = 921 MB.

NB: a lot of memory must be allocated just to decrease the number of merge levels by one.
With little memory

M = 1 MB (250 pages).
1. sort phase: 307 runs.
2. Merge the first 249 runs, then the 58 remaining ones. One obtains F1 and F2.
3. Second merge of F1 and F2.

Total cost: 1,228 + 307 = 1,535 MB.

NB: significant performance loss between M = 2 MB and M = 1 MB.
Exercise
Assume that a page holds only two records. Explain the sort-merge algorithm on the following dataset with a 4-page buffer.

3 Allier     36 Indre     18 Cher     75 Paris     39 Jura
9 Ariège     81 Tarn      11 Aude     12 Aveyron   25 Doubs
73 Savoie    55 Meuse     15 Cantal   51 Marne     42 Loire
40 Landes    14 Calvados  30 Gard     84 Vaucluse  7 Ardèche

Same question with a 3-page buffer.
Compression of inverted lists
Without compression, an inverted index with positions and weights may be larger than the document collection itself!

Compression is essential. The gain must outweigh the time spent compressing and decompressing.
Key to compression in inverted lists: documents are ordered by id:
[87;273;365;576;810].
First step: use delta-coding:
[87;186;92;211;234].
Exercise: what is the minimal number of bytes for the first list? for the second?
Variable bytes encoding
Idea: encode integers on 7 bits (2^7 = 128); use the leading bit of each byte for termination.

Let v = 9: encoded on one byte as 10001001 (note the first bit set to 1, marking the last byte).

Let v = 137:
1. the last byte encodes v' = v mod 128 = 9, thus b = 10001001, just as before;
2. next we encode v / 128 = 1 in a byte b' = 00000001 (note the first bit set to 0).

137 is therefore encoded on two bytes:

00000001 10001001.
Compression ratio: typically 1/4 to 1/2 of the fixed-length representation.
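A sketch of this scheme following the slides' convention (most significant 7-bit group first; the leading bit set to 1 marks the last byte of a value); the function names are ours:

```python
def vbyte_encode(v: int) -> bytes:
    """Encode v in 7-bit groups, most significant group first;
    the leading bit of the last byte is set to 1 (termination)."""
    out = [0x80 | (v & 0x7F)]        # terminating byte
    v >>= 7
    while v > 0:
        out.append(v & 0x7F)         # continuation byte, leading bit 0
        v >>= 7
    return bytes(reversed(out))

def vbyte_decode(data: bytes) -> list:
    """Decode a concatenation of VByte-encoded integers."""
    values, cur = [], 0
    for b in data:
        cur = (cur << 7) | (b & 0x7F)
        if b & 0x80:                 # leading bit 1: last byte of a value
            values.append(cur)
            cur = 0
    return values

print(vbyte_encode(9).hex())    # '89'   = 10001001
print(vbyte_encode(137).hex())  # '0189' = 00000001 10001001
print(vbyte_decode(vbyte_encode(9) + vbyte_encode(137)))  # [9, 137]
```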
Exercise
The inverted list of a term t consists of the following document ids:
[345;476;698;703].
Apply the VByte compression technique to this sequence. What is the amount of space gained by the method?
Web Information Retrieval Clustering
Clustering Example
[Figure: screenshot of clustered Web search results for the query "jaguar"; the text of the screenshot is not recoverable.]
WebDam (INRIA) Web search April 3, 2013 58 / 85
Web Information Retrieval Clustering
Cosine Similarity of Documents
Document Vector Space model:
terms → dimensions
documents → vectors
coordinates → weights

(The projection of document d along coordinate t is the weight of t in d, say tfidf(t, d))
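As an illustration, the weights can be computed as follows; a minimal sketch, assuming one common tf-idf variant (raw term frequency times log inverse document frequency — exact formulas vary between systems):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of terms) to a dict term -> tf-idf weight.
    Variant used here: raw term frequency * log(N / document frequency)."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Illustrative toy collection
docs = [["jaguar", "car", "fast"],
        ["jaguar", "cat", "wild"],
        ["car", "engine", "fast"]]
vecs = tfidf_vectors(docs)
# "cat" occurs in 1 of 3 documents: weight log(3);
# "jaguar" occurs in 2 of 3: lower weight log(3/2)
```

Rare terms thus get a higher weight than terms common to many documents.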
Similarity between documents d and d′: the cosine of these two vectors

cos(d, d′) = (d · d′) / (‖d‖ × ‖d′‖)
d · d′: scalar product of d and d′
‖d‖: norm of vector d

cos(d, d) = 1
cos(d, d′) = 0 if d and d′ are orthogonal (do not share any common term)
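A minimal sketch of this formula over sparse term-weight vectors (documents as dicts term → weight; the example vectors are illustrative):

```python
import math

def cosine(d1, d2):
    """Cosine of two sparse vectors given as dicts term -> weight."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm1 * norm2)

d = {"jaguar": 0.5, "car": 1.2}
assert abs(cosine(d, d) - 1.0) < 1e-9      # cos(d, d) = 1
assert cosine(d, {"cat": 2.0}) == 0.0      # no shared term: orthogonal
```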
WebDam (INRIA) Web search April 3, 2013 59 / 85
Web Information Retrieval Clustering
Agglomerative Clustering of Documents
1 Initially, each document forms its own cluster.
2 The similarity between two clusters is defined as the maximal similarity between elements of each cluster.
3 Find the two clusters whose mutual similarity is highest. If it is lower than a given threshold, end the clustering. Otherwise, regroup these clusters. Repeat.
Remark
Many other, more refined algorithms for clustering exist.
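The steps above can be sketched as follows (cosine similarity between sparse vectors; the toy documents and the threshold value are illustrative):

```python
import math

def sim(d1, d2):
    """Cosine similarity of sparse vectors (dicts term -> weight)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def agglomerative(docs, threshold):
    """Greedy agglomerative clustering as in the steps above.
    Cluster similarity = maximal similarity between their elements.
    Returns clusters as lists of document indices."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sim(docs[i], docs[j])
                        for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # most similar pair is below threshold: stop
        a, b = pair
        clusters[a].extend(clusters.pop(b))  # regroup the two clusters
    return clusters

docs = [{"jaguar": 1.0, "car": 1.0},
        {"car": 1.0, "engine": 1.0},
        {"cat": 1.0, "wild": 1.0}]
clusters = agglomerative(docs, threshold=0.4)
# documents 0 and 1 share "car" (similarity 0.5) and merge; document 2 stays alone
```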
WebDam (INRIA) Web search April 3, 2013 60 / 85
Web Information Retrieval Other Media
Indexing HTML
HTML: text + meta-information + structure
Possibly: separate index for meta-information (title, keywords)
Increase weight of structurally emphasized content in index
The tree structure can also be queried with XPath or XQuery, but this is not very useful on the Web as a whole, because of tag soup and lack of consistency.
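Meta-information can nonetheless be extracted for separate indexing with a standard HTML parser. A minimal sketch using Python's html.parser (the class name and page content are illustrative):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name="keywords"> content,
    which an indexer may store in a separate, higher-weight index."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = """<html><head><title>Jaguar facts</title>
<meta name="keywords" content="jaguar, wildcat"></head>
<body><em>Jaguars</em> live in the Americas.</body></html>"""
parser = MetaExtractor()
parser.feed(page)
# parser.title and parser.keywords now hold the meta-information
```

Being tolerant to tag soup, such event-based parsers are a common choice over strict XML parsing for Web pages.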
WebDam (INRIA) Web search April 3, 2013 61 / 85
Web Information Retrieval Other Media
Indexing Multimedia Content
Basic approach: index text from the context of the media
- surrounding text
- text in or around the links pointing to the content
- filenames
- associated subtitles (hearing-impaired track on TV)
Elaborate approach: index and search the media itself, with the help of speech recognition and sound, image, and video analysis
- TrackID, Shazam: identify a song played on the radio
- Musipedia: look for music by whistling a tune, http://www.musipedia.org/, http://www.midomi.com/
- Image search from a similar image, http://images.google.com/, Google Goggles
- Voxalead, http://voxaleadnews.labs.exalead.com/
WebDam (INRIA) Web search April 3, 2013 62 / 85
Web Graph Mining
Outline
1 The World Wide Web
2 Web crawling
3 Web Information Retrieval
4 Web Graph Mining
    PageRank
    HITS
    Spamdexing
5 Conclusion
WebDam (INRIA) Web search April 3, 2013 63 / 85
Web Graph Mining
The Web Graph
The World Wide Web seen as a (directed) graph:
Vertices: Web pages
Edges: hyperlinks
Same for other interlinked environments:
dictionaries
encyclopedias
scientific publications
social networks
WebDam (INRIA) Web search April 3, 2013 64 / 85
Web Graph Mining PageRank
Let’s start with simple examples
The PageRank (PR) of page i is the probability that a surfer following the random walk has arrived on i at some distant point in the future.
Left part: PR(A) = PR(B) + PR(C) + PR(D)
Right part?
Assume that the initial PR of each page is 0.25: what is the PR after one iteration? Two iterations?
WebDam (INRIA) Web search April 3, 2013 65 / 85
Web Graph Mining PageRank
The example graph
[Figure: the example directed graph, with ten vertices numbered 1 to 10]
WebDam (INRIA) Web search April 3, 2013 66 / 85
Web Graph Mining PageRank
The transition matrix
g_ij = 0 if there is no link from page i to page j;
g_ij = 1/n_i otherwise, with n_i the number of outgoing links of page i.
G =
⎛  0    1    0    0    0    0    0    0    0    0  ⎞
⎜  0    0   1/4   0    0   1/4  1/4   0   1/4   0  ⎟
⎜  0    0    0   1/2  1/2   0    0    0    0    0  ⎟
⎜  0    1    0    0    0    0    0    0    0    0  ⎟
⎜  0    0    0    0    0   1/2   0    0    0   1/2 ⎟
⎜ 1/3  1/3   0   1/3   0    0    0    0    0    0  ⎟
⎜  0    0    0    0    0   1/3   0   1/3   0   1/3 ⎟
⎜  0   1/3   0    0    0    0    0    0   1/3  1/3 ⎟
⎜  0   1/2  1/2   0    0    0    0    0    0    0  ⎟
⎝  0    0    0    0    1    0    0    0    0    0  ⎠
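Such a matrix can be derived mechanically from the graph's adjacency lists; a sketch, where the out-link lists below encode the example graph's link structure:

```python
# Out-links of the example graph (pages numbered from 1)
out_links = {1: [2], 2: [3, 6, 7, 9], 3: [4, 5], 4: [2], 5: [6, 10],
             6: [1, 2, 4], 7: [6, 8, 10], 8: [2, 9, 10], 9: [2, 3], 10: [5]}

n = len(out_links)
# G[i][j] = 1/n_i if page i+1 links to page j+1, 0 otherwise
G = [[0.0] * n for _ in range(n)]
for i, targets in out_links.items():
    for j in targets:
        G[i - 1][j - 1] = 1.0 / len(targets)

# Every page has at least one out-link here, so each row sums to 1
assert all(abs(sum(row) - 1.0) < 1e-9 for row in G)
```

A page with no out-link at all would produce an all-zero row; real implementations treat such dangling pages specially.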
WebDam (INRIA) Web search April 3, 2013 67 / 85
Web Graph Mining PageRank
PageRank (Google’s Ranking [BP98])
Idea
Important pages are pages pointed to by important pages.
PageRank simulates a random walk by iteratively computing the PR of each page, represented as a vector v.

Initially, v is set using a uniform distribution (v[i] = 1/|v|).
Definition (tentative)
Probability that the surfer following the random walk in G has arrived on page i at some distant point in the future.
pr(i) = ( lim_{k→+∞} (G^T)^k v )_i

where v is some initial column vector.
WebDam (INRIA) Web search April 3, 2013 68 / 85
Web Graph Mining PageRank
Exercise
Model the simple examples with transition matrix, and apply PageRank,assuming an initial uniform distribution.
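One possible way to work this in Python (plain lists, no external libraries); the out-link lists below are those of the example graph, read off its transition matrix:

```python
out_links = {1: [2], 2: [3, 6, 7, 9], 3: [4, 5], 4: [2], 5: [6, 10],
             6: [1, 2, 4], 7: [6, 8, 10], 8: [2, 9, 10], 9: [2, 3], 10: [5]}
n = len(out_links)

v = [1.0 / n] * n  # initial uniform distribution
for _ in range(100):
    new = [0.0] * n
    for i, targets in out_links.items():
        for j in targets:
            # page i passes v[i] in equal shares to each page it links to
            new[j - 1] += v[i - 1] / len(targets)
    v = new  # v := G^T v

# Page 2, pointed to by many pages, ends up with the highest PageRank (~0.233)
```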
WebDam (INRIA) Web search April 3, 2013 69 / 85
Web Graph Mining PageRank
PageRank Iterative Computation
Successive PageRank vectors on the example graph (one row per iteration, one column per page), starting from the uniform distribution:

Iter.    1     2     3     4     5     6     7     8     9     10
 0     0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100 0.100
 1     0.033 0.317 0.075 0.083 0.150 0.108 0.025 0.033 0.058 0.117
 2     0.036 0.193 0.108 0.074 0.154 0.163 0.079 0.008 0.090 0.094
 3     0.054 0.212 0.093 0.108 0.149 0.152 0.048 0.026 0.051 0.106
 4     0.051 0.247 0.078 0.097 0.153 0.143 0.053 0.016 0.062 0.099
 5     0.048 0.232 0.093 0.087 0.138 0.156 0.062 0.018 0.067 0.099
 6     0.052 0.226 0.092 0.098 0.146 0.148 0.058 0.021 0.064 0.096
 7     0.049 0.238 0.088 0.095 0.141 0.149 0.057 0.019 0.063 0.099
 8     0.050 0.232 0.091 0.094 0.143 0.149 0.060 0.019 0.066 0.096
 9     0.050 0.233 0.091 0.095 0.142 0.150 0.058 0.020 0.064 0.098
10     0.050 0.234 0.090 0.095 0.143 0.148 0.058 0.019 0.065 0.097
11     0.049 0.233 0.091 0.095 0.142 0.149 0.058 0.019 0.065 0.098
12     0.050 0.233 0.091 0.095 0.143 0.149 0.058 0.019 0.065 0.097
13     0.050 0.234 0.091 0.095 0.142 0.149 0.058 0.019 0.065 0.097
WebDam (INRIA) Web search April 3, 2013 70 / 85
Web Graph Mining PageRank
PageRank With Damping
May not always converge, or convergence may not be unique. To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1 − d: damping factor).
pr(i) = ( lim_{k→+∞} ((1 − d) G^T + d U)^k v )_i

where U is the matrix with all values equal to 1/N, N being the number of vertices.
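The damped iteration only changes one line of the basic computation: since v sums to 1, the term d U^T v contributes d/N to every page. A sketch on the example graph (d = 0.15 is a typical jump probability, assumed here, not a value from the slides):

```python
out_links = {1: [2], 2: [3, 6, 7, 9], 3: [4, 5], 4: [2], 5: [6, 10],
             6: [1, 2, 4], 7: [6, 8, 10], 8: [2, 9, 10], 9: [2, 3], 10: [5]}
n = len(out_links)
d = 0.15  # probability of a random jump at each step (assumed value)

v = [1.0 / n] * n
for _ in range(100):
    new = [d / n] * n  # random-jump contribution: d * (U^T v) = d/N per page
    for i, targets in out_links.items():
        for j in targets:
            new[j - 1] += (1 - d) * v[i - 1] / len(targets)
    v = new  # v := ((1 - d) G^T + d U) v

# v remains a probability distribution, and every page gets at least d/N
```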
WebDam (INRIA) Web search April 3, 2013 71 / 85
Web Graph Mining PageRank
Using PageRank to Score Query Results
PageRank: global score, independent of the query
Can be used to raise the weight of important pages:
weight(t, d) = tfidf(t, d) × pr(d).
This can be directly incorporated in the index.
WebDam (INRIA) Web search April 3, 2013 72 / 85
Web Graph Mining HITS
HITS (Kleinberg, [Kle99])
Idea
Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.
G′: transition matrix (with 0 and 1 values) of a subgraph of the Web. We use the following iterative process (starting with a and h vectors of norm 1):

a := G′^T h / ‖G′^T h‖
h := G′ a / ‖G′ a‖
Converges under some technical assumptions to authority and hub scores.
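A sketch of this iteration in Python (the small 4-page adjacency matrix is an illustrative example, not from the slides):

```python
import math

def hits(adj, iterations=50):
    """Hub/authority scores on a subgraph given as a 0/1 adjacency matrix.
    a := G'^T h / ||G'^T h||,  h := G' a / ||G' a||."""
    n = len(adj)
    a = [1.0 / math.sqrt(n)] * n
    h = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # authority of j: sum of hub scores of pages pointing to j
        a = [sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in a))
        a = [x / norm for x in a]
        # hub score of i: sum of authority scores of pages i points to
        h = [sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in h))
        h = [x / norm for x in h]
    return a, h

# Pages 0 and 1 both point to pages 2 and 3:
# 0 and 1 come out as hubs, 2 and 3 as authorities
adj = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
a, h = hits(adj)
```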
WebDam (INRIA) Web search April 3, 2013 73 / 85
Web Graph Mining HITS
Using HITS to Order Web Query Results
1 Retrieve the set D of Web pages matching a keyword query.
2 Retrieve the set D∗ of Web pages obtained from D by adding all linked pages, as well as all pages linking to pages of D.
3 Build from D∗ the corresponding subgraph G′ of the Web graph.
4 Compute iteratively hubs and authority scores.
5 Sort documents from D by authority scores.
Less efficient than PageRank, because the scores are local to a query and must therefore be computed at query time.
WebDam (INRIA) Web search April 3, 2013 74 / 85
Web Graph Mining Spamdexing
Spamdexing
Definition
Fraudulent techniques used by unscrupulous webmasters to artificially raise the visibility of their website to users of search engines.
Purpose: attracting visitors to websites to make profit.
Unceasing war between spamdexers and search engines
WebDam (INRIA) Web search April 3, 2013 75 / 85
Web Graph Mining Spamdexing
Spamdexing: Lying about the Content
Technique
Put unrelated terms in:
meta-information (<meta name="description">, <meta name="keywords">)

text content hidden to the user with JavaScript, CSS, or HTML presentational elements
Countertechnique
Ignore meta-information
Try and detect invisible text
WebDam (INRIA) Web search April 3, 2013 76 / 85
Web Graph Mining Spamdexing
Link Farm Attacks
Technique
Huge number of hosts on the Internet used for the sole purpose of referencing each other, without any content in themselves, to raise the importance of a given website or set of websites.
Countertechnique
Detection of websites with empty or duplicate content
Use of heuristics to discover subgraphs that look like link farms
WebDam (INRIA) Web search April 3, 2013 77 / 85
Web Graph Mining Spamdexing
Link Pollution
Technique
Pollute user-editable websites (blogs, wikis) or exploit security bugs to add artificial links to websites, in order to raise their importance.
Countertechnique
rel="nofollow" attribute on <a> links not validated by a page's owner, telling search engines to ignore them
WebDam (INRIA) Web search April 3, 2013 78 / 85
Conclusion
Outline
1 The World Wide Web
2 Web crawling
3 Web Information Retrieval
4 Web Graph Mining
5 Conclusion
WebDam (INRIA) Web search April 3, 2013 79 / 85
Conclusion
What you should remember
The inverted index model for efficiently answering keyword-based queries.
The threshold algorithm for retrieving top-k results.
PageRank and its iterative computation.
WebDam (INRIA) Web search April 3, 2013 80 / 85
Conclusion
References
Specifications
- HTML 4.01, http://www.w3.org/TR/REC-html40/
- HTTP/1.1, http://tools.ietf.org/html/rfc2616
- Robot Exclusion Protocol, http://www.robotstxt.org/orig.html
A book
Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti, Morgan Kaufmann.
WebDam (INRIA) Web search April 3, 2013 81 / 85
Bibliography I
Serge Abiteboul, Grégory Cobena, Julien Masanès, and Gerald Sedrati.
A first experience in archiving the French Web.
In Proc. ECDL, Rome, Italy, September 2002.
Serge Abiteboul, Mihai Preda, and Grégory Cobena.Adaptive on-line page importance computation.In Proc. Intl. World Wide Web Conference (WWW), 2003.
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and GeoffreyZweig.Syntactic clustering of the Web.Computer Networks, 29(8-13):1157–1166, 1997.
Sergey Brin and Lawrence Page.The anatomy of a large-scale hypertextual Web search engine.Computer Networks, 30(1–7):107–117, April 1998.
WebDam (INRIA) Web search April 3, 2013 82 / 85
Bibliography II
Soumen Chakrabarti.Mining the Web: Discovering Knowledge from Hypertext Data.Morgan Kaufmann, 2003.
Soumen Chakrabarti, Martin van den Berg, and Byron Dom.Focused crawling: A new approach to topic-specific Web resourcediscovery.Computer Networks, 31(11–16):1623–1640, 1999.
Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, andMarco Gori.Focused crawling using context graphs.In Proc. VLDB, Cairo, Egypt, September 2000.
Jon M. Kleinberg.
Authoritative sources in a hyperlinked environment.
Journal of the ACM, 46(5):604–632, 1999.
WebDam (INRIA) Web search April 3, 2013 83 / 85
Bibliography III
Martijn Koster.A standard for robot exclusion.http://www.robotstxt.org/orig.html, June 1994.
Martin F. Porter.An algorithm for suffix stripping.Program, 14(3):130–137, July 1980.
Pierre Senellart.Identifying Websites with flow simulation.In Proc. ICWE, pages 124–129, Sydney, Australia, July 2005.
sitemaps.org.Sitemaps XML format.http://www.sitemaps.org/protocol.php, February 2008.
WebDam (INRIA) Web search April 3, 2013 84 / 85
Bibliography IV
US National Archives and Records Administration.The Soundex indexing system.http://www.archives.gov/genealogy/census/soundex.html,May 2007.
WebDam (INRIA) Web search April 3, 2013 85 / 85