The Web
Why is it important?
– “Free” ubiquitous information resource
– Broad coverage of topics and perspectives
– Becoming the dominant information collection
– Growth and jobs
Web Access Methods
– Search (e.g. Google)
– Directories (e.g. Yahoo!)
– Other …
Web Characteristics
Distributed data
– 80 million web sites (hostnames responding) in April 2006
– 40 million active web sites (don’t redirect, …)
High volatility
– Servers come and go …
Large volume
– One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)
– “Dark Web” – content not indexed, not crawlable; estimated at 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)
Web Characteristics
Unstructured data
– Lots of duplicated content (30% estimate)
– Semantic duplication much higher
Quality of data
– No required editorial process
– Many typos and misspellings (impacts IR)
Heterogeneous data
– Different media
– Different languages
These characteristics are not going to change
Search Engine Architecture
[Diagram: Users interact with the Interface; queries go to the Query Engine, which consults the Index; the Indexer builds the Index from pages fetched from the Web by the Crawler – all running on lots and lots of computers.]
Search Engine Architecture
[Same diagram annotated with textbook chapters: Chapter 10, Chapters 2, 4, & 5, Chapters 6 & 7, Chapter 8, and Chapter 3 (Evaluation).]
Hubs and Authorities
Hubs
– Have lots of links to other pages
Authorities
– Have lots of links that point to them
Can use feedback to rank hubs and authorities (see the sketch below)
– Better hubs have links to good authorities
– Better authorities have links from good hubs
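A minimal Python sketch of this mutual-reinforcement idea (the HITS-style iteration); the graph representation and iteration count are assumptions for illustration, not from the slides:

import math

def hits(graph, iterations=20):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # An authority's score sums the hub scores of pages pointing to it.
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        # A hub's score sums the authority scores of pages it points to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize so scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth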
Crawling the Web
Creating a Web Crawler (Web Spider)
Simplest technique (sketched below)
– Start with a set of URLs
– Extract the URLs pointed to by the original set
– Continue using either breadth-first or depth-first traversal
Works for one crawler but hard to coordinate for many crawlers
– Partition the Web by domain name, IP address, or some other technique
– Each crawler has its own set but shares a to-do list
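A minimal sketch of this simple technique using only the Python standard library; the seed list, page limit, and regex-based link extraction are illustrative simplifications:

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)              # to-do list of URLs
    seen = set(seeds)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()         # popleft() gives breadth-first order
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode(
                "utf-8", errors="replace")
        except OSError:
            continue                     # skip dead links and timeouts
        fetched += 1
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)    # make relative links absolute
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)    # extracted URLs join the set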
Crawling the Web
Need to recrawl
– Indexed content is always out of date
– Sites come and go, and sometimes disappear for periods of time only to reappear
Order of URLs traversed makes a difference (see the sketch below)
– Breadth-first matches the hierarchic organization of content
– Depth-first gets to deeper content faster
– Proceeding to “better” pages first can also help (e.g. good hubs and good authorities)
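A small illustration of how the to-do list's discipline sets the traversal order; the URLs and scores are placeholders:

import heapq
from collections import deque

frontier = deque(["http://a.example/", "http://b.example/deep/page"])
url = frontier.popleft()   # FIFO: breadth-first, shallow pages first
url = frontier.pop()       # LIFO: depth-first, reaches deep content sooner

# "Better pages first": a priority queue keyed on an estimated quality
# score (e.g. a hub or authority score); the scores here are invented.
best_first = [(-0.9, "http://hub.example/"), (-0.2, "http://misc.example/")]
heapq.heapify(best_first)
neg_score, url = heapq.heappop(best_first)   # highest-scoring page first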
Server and Author Control of Crawling
Avoid crawling sites that do not want to be crawled
– Legal issue
– Robot exclusion protocol (server-level control)
• a file that indicates which portions of a web site should not be visited by crawlers
• http://.../robots.txt
– Robots META tag (author-level control)
• used to indicate whether a file (page) should be indexed or analyzed for links
• few crawlers implement this
• <meta name=“robots” content=“noindex, nofollow”>
• http://www.robotstxt.org/wc/exclusion.html
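A short sketch of honoring the robot exclusion protocol with Python's standard-library urllib.robotparser; the site URL and user-agent name are placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the exclusion file

# Check a URL before crawling it, identifying our crawler's user agent.
allowed = rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html")
if not allowed:
    print("robots.txt asks crawlers to skip this URL")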
Example robots.txt Files
Google
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
…

CSDL
User-agent: *
Disallow: /FLORA/arch_priv/
Disallow: /FLORA/private/

TAMU Library
User-agent: *
Disallow: /portal/site/chinaarchive/template.REGISTER/
Disallow: /portal/site/Library/template.REGISTER/
…

New York Times
User-agent: *
Disallow: /pages/college/
…
Allow: /pages/
Allow: /2003/
…
User-agent: Mediapartners-Google*
Disallow:
Crawling Goals
Crawling technique may depend on the goal
Types of crawling goals:
– Create a large, broad index
– Create a focused topic- or domain-specific index
• Target topic-relevant sites
• Index preset terms
– Create a subset of content to model characteristics of (part of) the Web
• Need to survey appropriately
• Cannot use simple depth-first or breadth-first traversal
– Create an up-to-date index
• Use estimated change frequencies (see the sketch below)
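A minimal sketch of recrawl scheduling from estimated change frequencies; the pages, observed change counts, and 30-day cap are invented for illustration:

import heapq
import time

def change_rate(changes_observed, window_days):
    """Estimate changes per day from past crawl observations."""
    return changes_observed / window_days

def next_visit(last_crawl, rate):
    """Revisit sooner the more often a page changes (capped at 30 days)."""
    interval_days = min(1.0 / rate if rate > 0 else 30.0, 30.0)
    return last_crawl + interval_days * 86400

now = time.time()
schedule = [
    (next_visit(now, change_rate(20, 10)), "http://news.example/"),    # changes often
    (next_visit(now, change_rate(1, 100)), "http://static.example/"),  # rarely changes
]
heapq.heapify(schedule)                 # crawl the earliest-due page first
due, url = heapq.heappop(schedule)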
Crawling Challenges
Identifying and keeping track of links
– Which to visit
– Which have been visited
Issues (see the normalization sketch below)
– Relative vs. absolute link descriptions
– Alternate server names
– Dynamically generated pages
– Server-side scripting
– Links buried in scripts
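A brief sketch of URL normalization, which lets the crawler recognize that differently written links (relative vs. absolute, mixed-case hosts) refer to the same page; the URLs are illustrative:

from urllib.parse import urljoin, urlparse, urlunparse

def normalize(base, href):
    absolute = urljoin(base, href)       # resolve relative links
    parts = urlparse(absolute)
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),            # host names are case-insensitive
        parts.path or "/",
        "", parts.query, "",             # drop params and fragments
    ))

# Both forms normalize to the same URL, so the page is visited only once.
assert normalize("http://Example.COM/a/", "../b.html") == \
       normalize("http://example.com/x/", "/b.html")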
Crawling Architecture
Crawler components (sketched below)
– Worker threads – attempt to retrieve the data for a URL
– DNS resolver – resolves domain names into IP addresses
– Protocol modules – download content using the appropriate protocol
– Link extractor – finds and normalizes URLs
– URL filter – determines which URLs to add to the to-do list
– URL to-do agent – keeps the list of URLs to visit
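A structural sketch of how these components fit together in one worker thread, using only the standard library; the thread count, regex-based extraction, and filter logic are deliberately minimal assumptions:

import queue
import re
import socket
import threading
import urllib.request
from urllib.parse import urljoin, urlparse

todo = queue.Queue()                     # the URL to-do agent's shared list
seen = set()
lock = threading.Lock()

def worker():
    while True:
        url = todo.get()                 # take the next URL to visit
        try:
            socket.gethostbyname(urlparse(url).hostname)    # DNS resolver
            raw = urllib.request.urlopen(url, timeout=10)   # protocol module
            html = raw.read().decode("utf-8", errors="replace")
        except (OSError, TypeError):
            todo.task_done()
            continue
        for href in re.findall(r'href="([^"]+)"', html):    # link extractor
            link = urljoin(url, href)
            with lock:                                      # URL filter
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    todo.put(link)
        todo.task_done()

for _ in range(8):                       # worker threads fetch in parallel
    threading.Thread(target=worker, daemon=True).start()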
Crawling Issues
Avoid overloading servers
– A brute-force approach can become a denial-of-service attack
– Weak politeness guarantee: only one thread is allowed to contact a given server
– Stronger politeness guarantee: maintain a queue for each server and move URLs into the to-do list based on priority and load factors (see the sketch below)
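A condensed sketch of per-server politeness queues: each host gets its own queue, and a URL is released only after a minimum delay since that host was last contacted. The delay value and function names are illustrative assumptions:

import collections
import time
from urllib.parse import urlparse

MIN_DELAY = 2.0                          # seconds between hits on one host
host_queues = collections.defaultdict(collections.deque)
last_contact = {}

def enqueue(url):
    host_queues[urlparse(url).hostname].append(url)

def next_url():
    """Return a URL whose host has not been contacted too recently."""
    now = time.time()
    for host, q in host_queues.items():
        if q and now - last_contact.get(host, 0.0) >= MIN_DELAY:
            last_contact[host] = now
            return q.popleft()
    return None                          # every ready host is cooling down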
Broken links, timeouts
– How many times to try?
– How long to wait?
– How to recognize crawler traps (server-side programs that generate “infinite” links)? See the retry sketch below.
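A small sketch of bounded retries with increasing waits, plus a crude depth cap as one defense against crawler traps; the retry count, backoff, and depth limit are invented values:

import time
import urllib.request

def fetch(url, retries=3, timeout=10):
    """Try a URL a fixed number of times, waiting longer after each failure."""
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except OSError:
            time.sleep(2 ** attempt)     # back off: 1s, 2s, 4s
    return None                          # give up after `retries` attempts

MAX_DEPTH = 15                           # deeper paths are likely a trap

def looks_like_trap(url):
    return url.count("/") - 2 > MAX_DEPTH   # subtract the "//" in http://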
Web Tasks
Precision is the key (see the sketch below)
– Goal: the first 10-100 results should satisfy the user
– Requires ranking that matches the user’s need
– Recall is not important
• Completeness of the index is not important
• Comprehensive crawling is not important
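Precision at k, the measure behind the "first 10-100 results" goal, is just the fraction of the top k results that are relevant; the data below is invented for illustration:

def precision_at_k(results, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

# 7 of the first 10 results are relevant -> precision@10 = 0.7
assert precision_at_k(list("abcdefghij"), set("abcdefg"), 10) == 0.7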
Browsing
Web directories
– Human-organized taxonomies of Web sites
– Cover a small portion (less than 1%) of Web pages
• Remember that recall (completeness) is not important
• Directories point to logical web sites rather than individual pages
– Directory search returns both categories and sites
– People generally browse rather than search once they identify categories of interest
Metasearch
Search a number of search engines
Advantages
– Do not need to build their own crawler and index
– Cover more of the Web than any of their component search engines
Difficulties
– Need to translate the query into each engine’s query language
– Need to merge the results into a meaningful ranking
Metasearch II
Merging results (see the sketch below)
– Voting scheme based on the component search engines
• No model of component ranking schemes needed
– Model-based merging
• Needs an understanding of the relative rankings, potentially by query type
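A small sketch of the voting approach to merging: a Borda-style count in which each engine awards more points to its higher-ranked results. The result lists are invented for illustration:

from collections import defaultdict

def borda_merge(rankings):
    """rankings: list of ranked result lists, one per component engine."""
    scores = defaultdict(float)
    for ranked in rankings:
        n = len(ranked)
        for position, url in enumerate(ranked):
            scores[url] += n - position   # the top result earns the most points
    return sorted(scores, key=scores.get, reverse=True)

merged = borda_merge([
    ["pageA", "pageB", "pageC"],   # engine 1's ranking
    ["pageB", "pageA", "pageD"],   # engine 2's ranking
])
# pageA and pageB tie at 5 points; pageC and pageD trail with 1 point each.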
Why they are not used for the Web
– Bias towards coverage (i.e. recall), which is not important for most Web queries
– Merging results is largely ad hoc, so individual search engines tend to do better
Big application: the Dark Web
Using Structure in Search
Languages to search content and structure
– Query languages over labeled graphs
• PHIQL: used in the Microplis and PHIDIAS hypertext systems
• Web-oriented: W3QL, WebSQL, WebLog, WQL
Using Structure in Search
Other uses of structure in search
– Relevant pages have neighbors that also tend to be relevant
– Search approaches that collect (and filter) the neighbors of returned pages (see the sketch below)
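A brief sketch of this neighborhood expansion: grow a result set with pages linked to or from the initial results, then filter out weak neighbors. The link graph, scoring function, and threshold are invented for illustration:

def expand_with_neighbors(results, out_links, in_links, score, threshold=0.5):
    """Add sufficiently relevant neighbors of each returned page."""
    expanded = set(results)
    for page in results:
        for neighbor in out_links.get(page, []) + in_links.get(page, []):
            if score(neighbor) >= threshold:   # filter weak neighbors
                expanded.add(neighbor)
    return expanded

expanded = expand_with_neighbors(
    {"p1"},
    out_links={"p1": ["p2", "p3"]},
    in_links={"p1": ["p4"]},
    score=lambda p: 0.8 if p != "p3" else 0.1,
)
# keeps p2 and p4, filters out the low-scoring p3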
Web Query Characteristics
Few terms and operators
– Average of 2.35 terms per query
• 25% of queries have a single term
– Average of 0.41 operators per query
Queries get repeated
– Average of 3.97 instances of each query
– This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”)
Query sessions are short
– Average of 2.02 queries per session
– Average of 1.39 pages of results examined
Data from a 1998 study
– How different today?