Searching the Web
CS3352

Searching the Web
Three forms of searching:
– Specific queries: encyclopaedias, libraries; exploit the hyperlink structure
– Broad queries: web directories, which classify web documents by subject
– Vague queries: search engines, which index portions of the web
Problems with the data
Distributed data
High percentage of volatile data
– much of it is dynamically generated
Large volume
– June 2000: Google's full-text index covered 560 million URLs
Unstructured data
– gifs, pdf, etc.
Redundant data
– mirrors (30% of pages are near duplicates)
Quality of data
– false, poorly written, invalid, mis-spelt
Heterogeneous data
– media, formats, languages, alphabets
Users and the Web
How to specify a query? How to interpret answers?
– Ranking
– Relevance selection
– Summary presentation
– Large document presentation
Main purpose: research, leisure, business, education
– 80% do not modify the query
– 85% look at the first screen only
– 64% of queries are unique
– 25% of users use single keywords
Problem for polysemic words and synonyms
Web search
All queries answered without accessing texts – by indices alone
– Local copies of web pages: expensive (Google cache)
– Remote page access: unrealistic
Links
– Link topology, link popularity, link quality, who links to whom
Page structure
– Words in headings weighted more than words in body text, etc.
Sites
– Sub-collections of documents, mirror site detection
Names
Presenting summaries
Community identification
Indexing
– Refresh rate
Similarity engine
Ranking scheme
Caching and popularity measures
Spamming
Most search engines have rules against:
– invisible text
– meta tag abuse
– heavy repetition
– "domain spam": overt submission of "mirror" sites in an attempt to dominate the listings for particular terms
Excite Spamming
Excite screens out spamming before adding a page to its web page index.
– If it finds a string of words such as:
  money money money money money money money
  it will replace the excess repetition, so that the string essentially becomes:
  money xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx
The more unusual repetition Excite detects, the more heavily it penalises the page.
Excite does not penalise the use of hidden text, but penalties apply if hidden text is used to disguise spam content.
Excite penalises "domain spam."
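A minimal sketch of this kind of repetition filter, in Python. The threshold and the "xxxxx" mask token follow the example above, but the exact rules Excite applied were never published, so treat this purely as an illustration.

    def mask_repetition(text, max_repeats=1, mask="xxxxx"):
        """Replace excess consecutive repetitions of a word with a mask token,
        e.g. "money money money" -> "money xxxxx xxxxx"."""
        out, run_word, run_len = [], None, 0
        for word in text.split():
            if word.lower() == run_word:
                run_len += 1
            else:
                run_word, run_len = word.lower(), 1
            # keep the first occurrence, mask the excess repetitions
            out.append(word if run_len <= max_repeats else mask)
        return " ".join(out)

    print(mask_repetition("money money money money money money money"))
    # -> money xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx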
Centralised architecture
Crawler-indexer (most search engines)
Crawler
– Also called robot, spider, wanderer, walker, knowbot
– A program that traverses the web to send new or updated pages to a main server, where they are indexed
– Runs on a local server and sends requests to remote servers
Centralised use of the index to answer queries
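A toy crawler loop along these lines, in Python; it is single-threaded, ignores robots.txt and politeness delays, and a local dictionary stands in for the main server's index, so it is a sketch rather than anything production-worthy.

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed, max_pages=10):
        """Traverse the web from a seed URL, handing each fetched page to the 'indexer'."""
        index = {}                      # url -> page text (stands in for the main server's index)
        frontier = deque([seed])
        seen = {seed}
        while frontier and len(index) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except OSError:
                continue                # unreachable or badly behaved server: skip it
            index[url] = html           # "send the new or updated page to the main server"
            for href in re.findall(r'href="([^"]+)"', html):
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return index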
Example (AltaVista)
[Diagram: crawler-indexer architecture – Interface, Query engine, Index, Indexer, Crawler.]
1998: 20 multi-processor machines, 130 GB of RAM, 500 GB of disk space
Distributed architecture
Harvest: harvest.transarc.com
Gatherers
– Collect and extract indexing information from one or more web servers at periodic intervals
Brokers
– Provide the indexing mechanism and query interface to the gathered data
– Retrieve information from gatherers or other brokers, incrementally updating their indices
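The gatherer/broker split might be sketched like this in Python; the record format and merge policy are assumptions made for illustration and are not Harvest's actual exchange format.

    def gather(pages):
        """Gatherer: extract indexing information (here, just a word set) from a server's pages."""
        return {url: set(text.lower().split()) for url, text in pages.items()}

    class Broker:
        """Broker: holds an inverted index and updates it incrementally from gatherers."""
        def __init__(self):
            self.inverted = {}                          # term -> set of URLs

        def update(self, records):
            for url, terms in records.items():
                for term in terms:
                    self.inverted.setdefault(term, set()).add(url)

        def query(self, term):
            return self.inverted.get(term.lower(), set())

    broker = Broker()
    broker.update(gather({"http://a.example/1": "web indexing with Harvest"}))
    broker.update(gather({"http://b.example/2": "distributed web search"}))
    print(broker.query("web"))          # both URLs are returned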
Harvest architecture
[Diagram: Harvest architecture – User, Brokers, Replication Manager, Gatherer, Object Cache, Web site.]
Ranking algorithms
Variations of the Boolean and vector space models
– TF-IDF plus hyperlinks between pages
– pages pointed to by a retrieved page
– pages that point to a retrieved page
– Popularity: number of hyperlinks to a page
– Relatedness: number of hyperlinks in common between pages, or pages referenced by the same pages
Examples: WebQuery, PageRank (Google), HITS (Clever)
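As a reminder of the TF-IDF baseline that these link-based schemes extend, a compact Python sketch follows (one common TF-IDF weighting, not any specific engine's formula):

    import math
    from collections import Counter

    def tf_idf(docs):
        """Compute TF * IDF weights for a list of tokenised documents."""
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))     # document frequency
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: (1 + math.log(c)) * math.log(n / df[t])
                            for t, c in tf.items()})
        return weights

    docs = [["web", "search", "ranking"], ["web", "crawler"], ["ranking", "links", "web"]]
    print(tf_idf(docs)[0])      # "web" scores 0 because it appears in every document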
Let's Use Links!
Metasearch
A web server that sends the query to:
– Several search engines
– Web directories
– Databases
Collects the results and unifies them (data fusion)
Aim: better coverage
Issues
– Translation of the query
– Uniform results (fusing rankings, e.g. of pages retrieved by several engines)
– Wrappers
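One simple way to fuse the rankings is to reward pages that appear (and appear early) in several engines' lists; the reciprocal-rank scoring below is just one of many possible fusion rules and is not tied to any particular metasearch engine.

    def fuse(rankings):
        """Merge several ranked result lists into one fused ranking."""
        scores = {}
        for ranking in rankings:
            for rank, url in enumerate(ranking, start=1):
                scores[url] = scores.get(url, 0.0) + 1.0 / rank     # earlier rank = bigger credit
        return sorted(scores, key=scores.get, reverse=True)

    engine_a = ["u1", "u2", "u3"]
    engine_b = ["u2", "u4", "u1"]
    print(fuse([engine_a, engine_b]))   # u2 and u1 rise because both engines return them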
Google
The best web engine:
– comprehensive and relevant results
Biggest index
– 580 million pages visited and recorded
– Uses link data to reach another 500 million pages
Different kinds of index
– smaller indexes containing a higher proportion of the web's most popular pages, as determined by Google's link analysis system
Index refresh
– Updated monthly/weekly
– Daily for popular pages
Serves queries from three data centres
– two on the West Coast of the US, one on the East Coast
Google: let this inspire you…
Larry Page, Co-founder & Chief Executive Officer
Sergey Brin, Co-founder & President
PhD students at Stanford
Google Overview
Crawls the web to create its listings.
Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
Other services also use link popularity, but none to the extent that Google does.
Citation Importance Ranking
Google links
Submission:
– Add URL page (no need to do a "deep" submit)
– The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well
Crawling and index depth:
– Google aims to refresh its index on a monthly basis
– Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks
– This anchor text is associated with the pages the links point at, making it possible for Google to find matching pages even when those pages cannot themselves be indexed
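A minimal illustration of crediting link (anchor) text to the target page rather than the source page; the data structures and URLs are invented for the example.

    def index_anchor_text(links):
        """links: (source_url, target_url, anchor_text) triples.
        Anchor words are credited to the *target* page, so a page can match a
        query even if it has never been fetched and indexed itself."""
        anchor_index = {}                               # term -> set of target URLs
        for _source, target, text in links:
            for term in text.lower().split():
                anchor_index.setdefault(term, set()).add(target)
        return anchor_index

    links = [("http://a.example", "http://travel.example", "cheap travel deals"),
             ("http://b.example", "http://travel.example", "travel guides")]
    print(index_anchor_text(links)["travel"])           # {'http://travel.example'}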
Google Relevancy (1)
Google ranks web pages based on the number, quality and content of links pointing at them (citations).
Number of Links – All things being equal, a page with more links pointing
at it will do better than a page with few or no links to it.
Link Quality – Numbers aren't everything. A single link from an
important site might be worth more than many links from relatively unknown sites.
Google Relevancy (2)
Link Content
– The text in and around links relates to the page they point at. For a page to rank well for "travel," it would need many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
Ranking boosts from text styles
– The appearance of terms in bold text, in header text, or in a large font size is also taken into account. None of these are dominant factors, but they do figure into the overall equation.
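Schematically, these signals get combined into a single score. The weights and field names below are invented for illustration and bear no relation to Google's actual formula.

    def score(page, query_terms):
        """Combine body-text matches, anchor-text matches, a text-style boost and a
        link-based score into one number (illustrative weights only)."""
        text_hits = sum(page["text"].lower().split().count(t) for t in query_terms)
        anchor_hits = sum(page["anchor_text"].lower().split().count(t) for t in query_terms)
        style_boost = 1.2 if any(t in page["headings"].lower() for t in query_terms) else 1.0
        return style_boost * (0.4 * text_hits + 0.4 * anchor_hits + 0.2 * page["link_score"])

    page = {"text": "travel tips and travel guides", "anchor_text": "best travel site",
            "headings": "Travel", "link_score": 3.0}
    print(score(page, ["travel"]))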
PageRank
Usage simulation & citation importance ranking:
– Based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
The user navigates randomly
– Jumps to a random page with probability p
– Follows a random hyperlink from the current page with probability 1-p
– Never goes back to a previously visited page by following a previously traversed link backwards
Google finds a single type of universally important page – intuitively, locations that are heavily visited in a random traversal of the web's link structure.
PageRank
The process is modelled by a Markov chain
– the probability of being at each page is computed; p is set by the system
– W_j = PageRank of page j
– n_i = number of outgoing links on page i
– m = number of nodes in G, that is, the number of web pages in the collection
– The rank a page passes on is normalised by the number of links on that page
– Computed using an iterative method:

  W_j = p/m + (1 - p) · Σ_{(i,j) ∈ G, i ≠ j} W_i / n_i
PageRank
[Diagram: page j with incoming links from pages i1, i2, i3 and outgoing links to pages k1, k2.]

  W_j = p/m + (1 - p) · (W_1/n_1 + W_2/n_2 + W_3/n_3)

where W_1, W_2, W_3 are the ranks of the three pages linking to j and n_1, n_2, n_3 their numbers of outgoing links.
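A direct Python sketch of the iteration above (p is the jump probability, m the number of pages); the example graph, tolerance and iteration cap are arbitrary illustrative choices, and dangling pages are not handled.

    def pagerank(out_links, p=0.15, iters=50, tol=1e-8):
        """Iterate W_j = p/m + (1-p) * sum over in-links (i,j) of W_i / n_i."""
        pages = list(out_links)
        m = len(pages)
        w = {j: 1.0 / m for j in pages}                     # initial guess
        for _ in range(iters):
            new_w = {}
            for j in pages:
                incoming = sum(w[i] / len(out_links[i])
                               for i in pages if j in out_links[i])
                new_w[j] = p / m + (1 - p) * incoming
            if max(abs(new_w[j] - w[j]) for j in pages) < tol:
                return new_w                                # scores have converged
            w = new_w
        return w

    graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}       # page -> pages it links to
    print(pagerank(graph))                                  # "c" collects the most rank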
Google Content
Builds a full-text index of the pages it visits.
It gathers all visible text. It does not read either the meta keywords or
description tags. Descriptions are formed automatically by
extracting the most relevant portions of pages.
If a page has no description, it is probably because Google has never actually visited it.
Google Spamming
The link-popularity ranking system leaves Google relatively immune to traditional spamming techniques.
– It goes beyond the text on pages to decide how good they are: no links, low rank.
A common spam idea
– Create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages across a network of sites.
– This is unlikely to work; do real link building instead, with non-competitive sites that are related to yours.
Site identification
AltaVista
HITS: Hypertext Induced Topic Search
The ranking scheme depends on the query
Considers the set of pages that point to, or are pointed at by, pages in the answer set S
Implemented in IBM's Clever prototype
Scientific American article: http://www.sciam.com/1999/0699issue/0699raghavan.html
HITS (2)
Authorities:
– pages in S that have many links pointing to them
Hubs:
– pages that have many outgoing links
Positive two-way feedback:
– better authority pages have incoming edges from good hubs
– better hub pages have outgoing edges to good authorities
Authorities and Hubs
[Diagram: authorities (blue) and hubs (red).]
HITS: two-step iterative process
Assigns initial scores to candidate hubs and authorities on a particular topic in the set of pages S, then repeats:
1. Use the current guesses about the authorities to improve the estimates of hubs – locate all the best authorities.
2. Use the updated hub information to refine the guesses about the authorities – determine where the best hubs point most heavily and call these the good authorities.
Repeat until the scores converge to the principal eigenvector of the link matrix of S, which can then be used to determine the best authorities and hubs.
  H(p) = Σ_{u ∈ S | p → u} A(u)

  A(p) = Σ_{v ∈ S | v → p} H(v)
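A Python sketch of that two-step update; normalising each score vector (Euclidean norm here) is one common choice, and the tiny link set is only for illustration.

    import math

    def hits(links, iters=50):
        """links: page -> set of pages it points to, within the base set S.
        Repeats A(p) = sum of H(v) over v -> p and H(p) = sum of A(u) over p -> u."""
        pages = set(links) | {u for targets in links.values() for u in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iters):
            auth = {p: sum(hub[v] for v in pages if p in links.get(v, ())) for p in pages}
            hub = {p: sum(auth[u] for u in links.get(p, ())) for p in pages}
            for scores in (auth, hub):                      # normalise each score vector
                norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth

    links = {"h1": {"a1", "a2"}, "h2": {"a1"}}
    hub, auth = hits(links)
    print(max(auth, key=auth.get))      # a1: linked to by both hubs, so the top authority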
HITS issues (3)
Restrict the set of pages to a maximum number
Doesn't work with non-existent, repeated or automatically generated links
Weighting links based on the surrounding content
Diffusion of the topic
– A more general topic contains the original answer
– Analyse the content of each page and score it, combining link weight with page score
Sub-grouping links
HITS is used for web community identification
Cybercommunities
Google vs Clever
Google
1. Assigns initial rankings and retains them independently of any queries, which enables faster response.
2. Looks only in the forward direction, from link to link.
Clever
1. Assembles a different root set for each search term and then prioritises those pages in the context of that particular query.
2. Also looks backward from an authoritative page to see what locations are pointing there. Humans are innately motivated to create hub-like content expressing their expertise on specific topics.
Autonomy
High-performance pattern matching based on Bayesian inference networks
Identifies patterns in text, based on usage and term frequency, that correspond to concepts
X% probability that a document is about a subject
– Encode the signature
– Categorise it
– Link it to related documents with the same signature
Not just a search engine tool.
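Autonomy's engine is proprietary, so the snippet below is only a toy, naive-Bayes-flavoured illustration of scoring how likely a document is to be about a subject from its term statistics; the term probabilities are made up.

    import math
    from collections import Counter

    def subject_log_odds(doc_terms, subject_model, background_model):
        """Score a document against a subject's term distribution versus a
        background distribution (higher = more likely to be about the subject)."""
        score = 0.0
        for term, count in Counter(doc_terms).items():
            p_subject = subject_model.get(term, 1e-6)       # smoothing for unseen terms
            p_background = background_model.get(term, 1e-6)
            score += count * math.log(p_subject / p_background)
        return score

    subject = {"pagerank": 0.02, "crawler": 0.02, "index": 0.03}        # a "web search" concept
    background = {"pagerank": 0.0001, "crawler": 0.0005, "index": 0.001, "the": 0.05}
    print(subject_log_odds(["crawler", "index", "index", "the"], subject, background))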
Human-powered searching
Ask Jeeves
– An answer service using human editors to build the knowledge base. Editors proactively suggest questions and watch what people are actually searching for.
Yahoo
– Uses humans to organise the web. Human editors find sites or review submissions, then place those sites in one or more "categories" that are relevant to them.
Research Issues
– Modelling
– Querying
– Distributed architecture
– Ranking
– Indexing
– Dynamic pages
– Browsing
– User interface
– Duplicated data
– Multimedia
Further reading
http://searchenginewatch.com/
http://www.clpgh.org/clp/Libraries/search.html