71
A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New York, Stony Brook Presented by Kajal Miyan Michigan State University

A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

A Survey on Web Information Retrieval Technologies

Lan HuangComputer Science Department

State University of New York, Stony Brook

Presented by Kajal Miyan

Michigan State University

Page 2: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Overview

● Web Information Retrieval Challenges● Search Engine – Overview and Architecture(Google as Case Study)● Various Algorithms● Directories – Overview and some Algos.● Estimating Size of Web● Main Challenges● Google File System● Sponsored Search● Discussion

Page 3: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Web Information Retrieval Challenges

● Bulk● Dynamic Internet● Heterogeneity● Variety of Languages● Duplication● High Linkage● Ill-formed queries● Wide Variance in Users● Specific Behavior

Page 4: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

The Goal

Page 5: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Evaluation● Precision - fraction of the documents re-

trieved that are relevant to the user's informa-tion need.Precision = (relevant documents AND retrieved

documents)/retrieved documents● Recall - fraction of the documents that are

relevant to the query that are successfully re-trieved.Recall = (relevant documents AND retrieved docu-

ments)/relevant documents

Page 6: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Current Goal

● Precision at Top 10 results &● Precision at Top 10 result pages

Page 7: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Various SEs

● SEs● Directories● News SEs● Meta Search Engines● Social Search Engines● Opinions, Forums, Usenets SEs

Page 8: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Your Browser

Architecture (Sherman 2003)

The Web

URL1 URL2

URL3 URL4

Crawler

Indexer

Query Server Eggs?

Eggs.

Eggs - 90%Eggo - 81%Ego- 40%

Huh? - 10%

All AboutEggsby

S. I. Am

Page 9: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Crawling the web

● Robust – should avoid overloading the websites and must deal with huge amounts of data.

● Decides in what order to crawl the pages.● Frequency of revisiting pages● Rule of Thumb – Important pages first

Page 10: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Crawler

Page 11: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

● Cho etc.99 (spread the workload)● – Allocation that URL’s in 500 Queues● – Allocation based on the Hash of the server

name● – Read one URL from each queue at a time

Page 12: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Inverted Indexes the IR Way

How Inverted Index Are Created???

Page 13: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Periodically rebuilt, static otherwise.

Documents are parsed to extract tokens. These are saved with the Document ID.

Now is the timefor all good men

to come to the aidof their country

Doc 1

It was a dark andstormy night in

the country manor. The time was past midnight

Doc 2

Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2

Page 14: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

After all documents have been parsed the inverted file is sorted alphabetically.

Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2

Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2

Page 15: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Multiple term entries for a single document are merged.

Within-document term frequency information is compiled.

Term Doc # Freqa 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2

Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2

Page 16: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

How Inverted Files are Created

Finally, the file can be split into – Dictionary or Lexicon file– Postings file

Page 17: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Dictionary/Lexicon PostingsTerm Doc # Freq

a 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2

Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2

Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2

Page 18: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Inverted indexes

Permit fast search for individual termsFor each term, you get a list consisting of:

document ID frequency of term in doc (optional) position of term in doc (optional)

These lists can be used to solve Boolean queries:country -> d1, d2manor -> d2country AND manor -> d2

Page 19: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Inverted Indexes for Web Search Engines

Inverted indexes are still used, even though the web is so huge.Some systems partition the indexes across different machines. Each machine handles different parts of the data.Other systems duplicate the data across many machines; queries are distributed among the machines.Most do a combination of these.

Page 20: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Google Architecture

Page 21: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Repository– Contains the full HTML text– Compressed using zlib (RFC1950)– Prefixed by docID, length, and URL• Document Index– Each entry contain● • The current doc status (crawled ?)● • A pointer into the repository (if crawled)● • A document checksum (using binary search

to find the docID )● • Various statistics• Lexicon– Keep in memory on a 256M– Current contains 14 million words

Page 22: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Google’s Indexing

● The Indexer converts each doc into a collection of “hit lists” and puts these into “barrels”, sorted by docID. It also creates a database of “links”.– Hit: <wordID, position in doc, font info, hit type>– Hit type: Plain or fancy.– Fancy hit: Occurs in URL, title, anchor text, metatag.– Optimized representation of hits (2 bytes each).

● Sorter sorts each barrel by wordID to create the inverted index. It also creates a lexicon file.– Lexicon: <wordID, offset into inverted index>– Lexicon is mostly cached in-memory

Page 23: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

wordid #docswordid #docswordid #docs

Lexicon (in-memory) Postings (“Inverted barrels”, on disk)

Each “barrel” contains postings for a range of wor-dids.

Google’s Inverted Index

Sorted by wordid

Docid #hits Hit, hit, hit, hit, hitDocid #hits Hit

Docid #hits HitDocid #hits Hit, hit, hit

Docid #hits Hit, hit

Barrel i

Barrel i+1

Sortedby Docid

Page 24: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Google crawler

– Maintain its own DNS cache– Asynchronous I/O to manage events– 4 crawler• Both URLserver & crawler are implement in

Python• Each crawler keeps 300 connections open at

once• >100 pages / s , roughly 600K/s

Page 25: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

● How is Relevance of the page decided???

Page 26: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Content Relevance

• Phrase matching.• Synonyms.• URL analysis.• Date last updated.• Spell checking.• Home page detection.

Page 27: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

HTML Weighting

A 5) Anchor

H1, H2, H3, H4, H5, H64) Header

DL, OL, UL3) List

STRONG, B, EM, I, U2) Strong

None of the above1) Plain Text

TITLE6) Title

HTML tagsClass Name

• Meta tag text is mostly ignored by search engines

Page 28: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Link-Based Metrics

• A link from A to B can be viewed as a recommendation, a vote or a citation.

• Links can be – referential, or – informational

• Links effect the ranking of web pages and thus have commercial value.

Page 29: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Citation and Linking

Page 30: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

PageRank - Motivation

• The number incoming links to a page is a measure of importance and authority of the page.

• Also take into account the quality of recommendation, so a page is more important if the sources of its incomoing links are important.

Page 31: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

The Random Surfer• Assume the web is a Markov chain.• Surfers randomly click on links, where the probability of an

outlink from page A is 1/m, where m is the number of outlinks from A.

• The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.

• Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.

Page 32: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

PageRank (PR) - Definition

• W is a web page• Wi are the web pages that have a link to W• O(Wi) is the number of outlinks from Wi• T is the teleportation probability• N is the size of the web

)()()(...

)()(

)()()1()(

2

2

1

1

n

n

WOWPR

WOWPR

WOWPRT

NTWPR +++−+=

Page 33: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

HITS – Hubs and Authorities - Hyperlink-Induced Topic Search

• A on the left is an authority• A on the right is a hub

Page 34: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

HITS Algorithm – Iterate until Convergence

→∈

→∈

=

=

qpBq

pqBq

qApH

qHpA

|

|

)()(

)()(

• B is the base set• q and p are web pages in B• A(p) is the authority score for p• H(p) is the hub score for p

Page 35: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

HITS

● Algorithm Overview● • Input: Q – a query string● SE – a text-based search engine● t – size of the root set● d – max number of “in” links● Top t pages (highest-ranked pages) from the

text-based search engine form the root set

● • Output: – focused subset

Page 36: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

( Q = “java”, SE = AltaVista, t = 3, d = 3)

Page 37: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

( Q = “java”, SE = AltaVista, t = 3, d = 3)

Page 38: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Algorithm● An Iterative Algorithm● [authority] weights vector x0 = (1, 1, 1,

…, 1)● [hub] weights vector y0 = (1, 1, 1, …, 1)● for i = 1, 2, …, k● xi = update_authorityw(yi-1)● yi = update_hubw(xi)● normalize(xi, yi)● return (xk, yk)

Page 39: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 40: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 41: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 42: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 43: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 44: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 45: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 46: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 47: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 48: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 49: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 50: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Applications of HITS

• Search engine querying (speed is an issue).• Finding web communities.• Finding related pages.• Populating categories in web directories.• Citation analysis.

Page 51: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

HITS Did not Work Well!!!

• Mutually Reinforcing Relationship Between Hosts

• Automatically Generated Links• Non-Relevant Node

Topic drift approach• K edges - 1/k authority weight• L edges - 1/l hub weight

Page 52: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Duplicate Elimination

Challenges– Defining the notation of a replicated

collectionprecisely• Slight differences between copies– Efficient algorithm to identify such collectionand exploiting this knowledge of replication• Hundreds of millions of pages– Subgraph isomorphism: NP

Page 53: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Reasons for inability to Detect Duplicates

● Update Frequency● Different Formats● Partial Crawls

Page 54: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Some Solutions

● IR for Textual similarity● Data Mining for clusteringso on...

Formal definition of similar collections follows:Similar PagesSimilar Links

Page 55: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Growing Similar Clusters

Page 56: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Directories vs. Search Engines

● Directories– Hand-selected sites– Search over the

contents of the descriptions of the pages

– Organized in advance into categories

● Search Engines– All pages in all sites – Search over the

contents of the pages themselves

– Organized in response to a query by relevance rankings or other scores

Page 57: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Directories & Categorization

● Automatic Categorization 1-Taper

– A taxonomy-and-path-enhanced-retrieval system– Given• Hypertext document corpus• A “small” set of classified documents– Goal• Construct a classifier• Apply to new documents

● Manual Categorization 2-OpenGrid and ODP

Page 58: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 59: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 60: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Good discriminating power: large interclass distance, small intraclass distance

Page 61: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Size of the Web

Typical Questions– Which search engine has the largest coverage?– How many pages are out there and how many are indexed?

• Approach- Measure search engine coverage and overlap through

random queries- Allows a third party to measure relative sizes andoverlaps of search engines- Take two search engines, E1 and E2, we can:• Compute their relative sizes• Compute the fraction of E1’s database indexed by E2

Page 62: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Size Wars

August 2005 : We index 20 billion documents.

So, who’s right?

September 2005 : We index 8 billion documents, but our index is 3 times larger than our competition’s.

Page 63: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

● As of 10/30/2008 02:33:00 PM

7/25/2008 10:12:00 AM Claimed to have found: 1 trillion (as in

1,000,000,000,000) unique URLs on the web at once!

We knew the web was big...7/25/2008 10:12:00 AM

Page 64: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 65: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New
Page 66: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Challenges

● Spam - Text Spam - Link Spam● Cloaking● Content Quality● Quality Evaluation● Web Conventions (META tag)● Duplicate Hosts● Vaguely Structured Data

Page 67: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Google File System

● Performance, Scalability, Reliability, Avalability

● Component failures are a norm● Files are huge● Most files are mutated by writing in the end

rather than overwriting data● Co-designing APIs with File System to

increase flexibility

Page 68: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

GFS

Page 69: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Sponsored Search

● More than 50% users visit a SE every few days

● Over 13% of traffic to commercial sites was generated by Ses

● Over 40% of product searches on web were initiated via Ses

● Pioneered by Goto, later Overture● Google

Page 70: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Measurement and Pricing

● Cost per mille● Cost per action● Cost per click● Click through rate = (1000*CPM) + CPC● Yahoo! - Uses CPC ● Google – Uses CTR * CPC

Page 71: A Survey on Web Information Retrieval Technologies · 2008-10-31 · A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New

Discussion

● Large field with great challenges● Google really lived upto its name 10^100!!!● Highly profitable● Internet today is a Big Brain...● Can we contribute???