The Web
Why is it important?
– “Free” ubiquitous information resource
– Broad coverage of topics and perspectives
– Becoming the dominant information collection
– Growth and jobs
Web Access Methods
– Search (e.g. Google)
– Directories (e.g. Yahoo!)
– Other …
Web Characteristics
Distributed data
– 80 million web sites (hostnames responding) in April 2006
– 40 million active web sites (don’t redirect, …)
High volatility
– Servers come and go …
Large volume
– One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)
– “Dark Web” – content not indexed, not crawlable; estimated at 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)
Web Characteristics
Unstructured data
– Lots of duplicated content (30% estimate)
– Semantic duplication much higher
Quality of data
– No required editorial process
– Many typos and misspellings (impacts IR)
Heterogeneous data
– Different media
– Different languages
These characteristics are not going to change
Search Engine Architecture
[Diagram: Users interact with the Interface; queries go to the Query Engine, which consults the Index; the Indexer builds the Index from pages fetched from the Web by the Crawler – all running on lots and lots of computers.]
Search Engine Architecture
[Same diagram annotated with textbook chapters: Chapter 10, Chapters 2, 4, & 5, Chapters 6 & 7, Chapter 8, and Chapter 3 (Evaluation).]
Hubs and Authorities
Hubs
– Have lots of links to other pages
Authorities
– Have lots of links that point to them
Can use feedback to rank hubs and authorities (see the sketch below)
– Better hubs have links to good authorities
– Better authorities have links from good hubs
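A minimal Python sketch of this mutual-reinforcement idea (the HITS-style iteration); the graph representation and iteration count are assumptions for illustration, not from the slides:

import math

def hits(graph, iterations=20):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # An authority's score sums the hub scores of pages pointing to it.
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        # A hub's score sums the authority scores of pages it points to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize so scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth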
Crawling the Web
Creating a Web Crawler (Web Spider)
Simplest technique (sketched below)
– Start with a set of URLs
– Extract the URLs pointed to by the original set
– Continue using either breadth-first or depth-first traversal
Works for one crawler but hard to coordinate for many crawlers
– Partition the Web by domain name, IP address, or some other technique
– Each crawler has its own set but shares a to-do list
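A minimal sketch of this simple technique using only the Python standard library; the seed list, page limit, and regex-based link extraction are illustrative simplifications:

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)              # to-do list of URLs
    seen = set(seeds)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()         # popleft() gives breadth-first order
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode(
                "utf-8", errors="replace")
        except OSError:
            continue                     # skip dead links and timeouts
        fetched += 1
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)    # make relative links absolute
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)    # extracted URLs join the set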
Crawling the Web
Need to recrawl
– Indexed content is always out of date
– Sites come and go, and sometimes disappear for periods of time only to reappear
Order of URLs traversed makes a difference (see the sketch below)
– Breadth-first matches the hierarchic organization of content
– Depth-first gets to deeper content faster
– Proceeding to “better” pages first can also help (e.g. good hubs and good authorities)
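A small illustration of how the to-do list's discipline sets the traversal order; the URLs and scores are placeholders:

import heapq
from collections import deque

frontier = deque(["http://a.example/", "http://b.example/deep/page"])
url = frontier.popleft()   # FIFO: breadth-first, shallow pages first
url = frontier.pop()       # LIFO: depth-first, reaches deep content sooner

# "Better pages first": a priority queue keyed on an estimated quality
# score (e.g. a hub or authority score); the scores here are invented.
best_first = [(-0.9, "http://hub.example/"), (-0.2, "http://misc.example/")]
heapq.heapify(best_first)
neg_score, url = heapq.heappop(best_first)   # highest-scoring page first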
Server and Author Control of Crawling
Avoid crawling sites that do not want to be crawled
– Legal issue
– Robot exclusion protocol (server-level control)
• a file that indicates which portions of a web site should not be visited by crawlers
• http://.../robots.txt
– Robots META tag (author-level control)
• used to indicate whether a file (page) should be indexed or analyzed for links
• few crawlers implement this
• <meta name=“robots” content=“noindex, nofollow”>
• http://www.robotstxt.org/wc/exclusion.html
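A short sketch of honoring the robot exclusion protocol with Python's standard-library urllib.robotparser; the site URL and user-agent name are placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the exclusion file

# Check a URL before crawling it, identifying our crawler's user agent.
allowed = rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html")
if not allowed:
    print("robots.txt asks crawlers to skip this URL")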
Example robots.txt Files
Google
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
…

CSDL
User-agent: *
Disallow: /FLORA/arch_priv/
Disallow: /FLORA/private/

TAMU Library
User-agent: *
Disallow: /portal/site/chinaarchive/template.REGISTER/
Disallow: /portal/site/Library/template.REGISTER/
…

New York Times
User-agent: *
Disallow: /pages/college/
…
Allow: /pages/
Allow: /2003/
…
User-agent: Mediapartners-Google*
Disallow:
Crawling Goals
Crawling technique may depend on the goal
Types of crawling goals:
– Create a large, broad index
– Create a focused topic- or domain-specific index
• Target topic-relevant sites
• Index preset terms
– Create a subset of content to model characteristics of (part of) the Web
• Need to survey appropriately
• Cannot use simple depth-first or breadth-first traversal
– Create an up-to-date index
• Use estimated change frequencies (see the sketch below)
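A minimal sketch of recrawl scheduling from estimated change frequencies; the pages, observed change counts, and 30-day cap are invented for illustration:

import heapq
import time

def change_rate(changes_observed, window_days):
    """Estimate changes per day from past crawl observations."""
    return changes_observed / window_days

def next_visit(last_crawl, rate):
    """Revisit sooner the more often a page changes (capped at 30 days)."""
    interval_days = min(1.0 / rate if rate > 0 else 30.0, 30.0)
    return last_crawl + interval_days * 86400

now = time.time()
schedule = [
    (next_visit(now, change_rate(20, 10)), "http://news.example/"),    # changes often
    (next_visit(now, change_rate(1, 100)), "http://static.example/"),  # rarely changes
]
heapq.heapify(schedule)                 # crawl the earliest-due page first
due, url = heapq.heappop(schedule)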
Crawling Challenges
Identifying and keeping track of links
– Which to visit
– Which have been visited
Issues (see the normalization sketch below)
– Relative vs. absolute link descriptions
– Alternate server names
– Dynamically generated pages
– Server-side scripting
– Links buried in scripts
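A brief sketch of URL normalization, which lets the crawler recognize that differently written links (relative vs. absolute, mixed-case hosts) refer to the same page; the URLs are illustrative:

from urllib.parse import urljoin, urlparse, urlunparse

def normalize(base, href):
    absolute = urljoin(base, href)       # resolve relative links
    parts = urlparse(absolute)
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),            # host names are case-insensitive
        parts.path or "/",
        "", parts.query, "",             # drop params and fragments
    ))

# Both forms normalize to the same URL, so the page is visited only once.
assert normalize("http://Example.COM/a/", "../b.html") == \
       normalize("http://example.com/x/", "/b.html")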
Crawling Architecture
Crawler components (sketched below)
– Worker threads – attempt to retrieve the data for a URL
– DNS resolver – resolves domain names into IP addresses
– Protocol modules – download content using the appropriate protocol
– Link extractor – finds and normalizes URLs
– URL filter – determines which URLs to add to the to-do list
– URL to-do agent – keeps the list of URLs to visit
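A structural sketch of how these components fit together in one worker thread, using only the standard library; the thread count, regex-based extraction, and filter logic are deliberately minimal assumptions:

import queue
import re
import socket
import threading
import urllib.request
from urllib.parse import urljoin, urlparse

todo = queue.Queue()                     # the URL to-do agent's shared list
seen = set()
lock = threading.Lock()

def worker():
    while True:
        url = todo.get()                 # take the next URL to visit
        try:
            socket.gethostbyname(urlparse(url).hostname)    # DNS resolver
            raw = urllib.request.urlopen(url, timeout=10)   # protocol module
            html = raw.read().decode("utf-8", errors="replace")
        except (OSError, TypeError):
            todo.task_done()
            continue
        for href in re.findall(r'href="([^"]+)"', html):    # link extractor
            link = urljoin(url, href)
            with lock:                                      # URL filter
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    todo.put(link)
        todo.task_done()

for _ in range(8):                       # worker threads fetch in parallel
    threading.Thread(target=worker, daemon=True).start()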
Crawling Issues
Avoid overloading servers
– A brute-force approach can become a denial-of-service attack
– Weak politeness guarantee: only one thread is allowed to contact a given server
– Stronger politeness guarantee: maintain a queue for each server and move URLs into the to-do list based on priority and load factors (see the sketch below)
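A condensed sketch of per-server politeness queues: each host gets its own queue, and a URL is released only after a minimum delay since that host was last contacted. The delay value and function names are illustrative assumptions:

import collections
import time
from urllib.parse import urlparse

MIN_DELAY = 2.0                          # seconds between hits on one host
host_queues = collections.defaultdict(collections.deque)
last_contact = {}

def enqueue(url):
    host_queues[urlparse(url).hostname].append(url)

def next_url():
    """Return a URL whose host has not been contacted too recently."""
    now = time.time()
    for host, q in host_queues.items():
        if q and now - last_contact.get(host, 0.0) >= MIN_DELAY:
            last_contact[host] = now
            return q.popleft()
    return None                          # every ready host is cooling down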
Broken links, timeouts
– How many times to try?
– How long to wait?
– How to recognize crawler traps (server-side programs that generate “infinite” links)? See the retry sketch below.
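A small sketch of bounded retries with increasing waits, plus a crude depth cap as one defense against crawler traps; the retry count, backoff, and depth limit are invented values:

import time
import urllib.request

def fetch(url, retries=3, timeout=10):
    """Try a URL a fixed number of times, waiting longer after each failure."""
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except OSError:
            time.sleep(2 ** attempt)     # back off: 1s, 2s, 4s
    return None                          # give up after `retries` attempts

MAX_DEPTH = 15                           # deeper paths are likely a trap

def looks_like_trap(url):
    return url.count("/") - 2 > MAX_DEPTH   # subtract the "//" in http://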
Web Tasks
Precision is the key (see the sketch below)
– Goal: the first 10-100 results should satisfy the user
– Requires ranking that matches the user’s need
– Recall is not important
• Completeness of the index is not important
• Comprehensive crawling is not important
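Precision at k, the measure behind the "first 10-100 results" goal, is just the fraction of the top k results that are relevant; the data below is invented for illustration:

def precision_at_k(results, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

# 7 of the first 10 results are relevant -> precision@10 = 0.7
assert precision_at_k(list("abcdefghij"), set("abcdefg"), 10) == 0.7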
Browsing
Web directories
– Human-organized taxonomies of Web sites
– Cover a small portion (less than 1%) of Web pages
• Remember that recall (completeness) is not important
• Directories point to logical web sites rather than individual pages
– Directory search returns both categories and sites
– People generally browse rather than search once they identify categories of interest
Metasearch
Search a number of search engines
Advantages
– Do not need to build their own crawler and index
– Cover more of the Web than any of their component search engines
Difficulties
– Need to translate the query into each engine’s query language
– Need to merge the results into a meaningful ranking
Metasearch II
Merging results (see the sketch below)
– Voting scheme based on the component search engines
• No model of component ranking schemes needed
– Model-based merging
• Needs an understanding of the relative rankings, potentially by query type
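A small sketch of the voting approach to merging: a Borda-style count in which each engine awards more points to its higher-ranked results. The result lists are invented for illustration:

from collections import defaultdict

def borda_merge(rankings):
    """rankings: list of ranked result lists, one per component engine."""
    scores = defaultdict(float)
    for ranked in rankings:
        n = len(ranked)
        for position, url in enumerate(ranked):
            scores[url] += n - position   # the top result earns the most points
    return sorted(scores, key=scores.get, reverse=True)

merged = borda_merge([
    ["pageA", "pageB", "pageC"],   # engine 1's ranking
    ["pageB", "pageA", "pageD"],   # engine 2's ranking
])
# pageA and pageB tie at 5 points; pageC and pageD trail with 1 point each.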
Why they are not used for the Web
– Bias towards coverage (i.e. recall), which is not important for most Web queries
– Merging results is largely ad hoc, so individual search engines tend to do better
Big application: the Dark Web
Using Structure in Search
Languages to search content and structure
– Query languages over labeled graphs
• PHIQL: used in the Microplis and PHIDIAS hypertext systems
• Web-oriented: W3QL, WebSQL, WebLog, WQL
Using Structure in Search
Other uses of structure in search
– Relevant pages have neighbors that also tend to be relevant
– Search approaches that collect (and filter) the neighbors of returned pages (see the sketch below)
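A brief sketch of this neighborhood expansion: grow a result set with pages linked to or from the initial results, then filter out weak neighbors. The link graph, scoring function, and threshold are invented for illustration:

def expand_with_neighbors(results, out_links, in_links, score, threshold=0.5):
    """Add sufficiently relevant neighbors of each returned page."""
    expanded = set(results)
    for page in results:
        for neighbor in out_links.get(page, []) + in_links.get(page, []):
            if score(neighbor) >= threshold:   # filter weak neighbors
                expanded.add(neighbor)
    return expanded

expanded = expand_with_neighbors(
    {"p1"},
    out_links={"p1": ["p2", "p3"]},
    in_links={"p1": ["p4"]},
    score=lambda p: 0.8 if p != "p3" else 0.1,
)
# keeps p2 and p4, filters out the low-scoring p3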
Web Query Characteristics
Few terms and operators
– Average of 2.35 terms per query
• 25% of queries have a single term
– Average of 0.41 operators per query
Queries get repeated
– Average of 3.97 instances of each query
– This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”)
Query sessions are short
– Average of 2.02 queries per session
– Average of 1.39 pages of results examined
Data from a 1998 study
– How different today?