Extracting knowledge from the World Wide Web
Presented by Murat Şensoy
Monika Henzinger and Steve Lawrence
Google Inc.
Objective
The World Wide Web provides an exceptional opportunity to automatically analyze a large sample of interests and activity in the world. But How to:
Extract knowledge from the web The Challenge:
Distributed and heterogeneous nature of the web makes large-scale analysis difficult.
Objective
The paper provides an overview of recent methods for:
Sampling the Web
Analyzing and Modeling Web Growth
Sampling the Web
Due to the sheer size of the Web, even simple statistics about it are unknown.
The ability to sample web pages or web servers uniformly at random is very useful for determining statistics.
The question is: how can the Web be sampled uniformly?
Sampling the Web
Two famous sampling methods for the Web are:
- Random Walk
- IP Address Sampling
Sampling the Web with Random Walk
Main idea:
- Visit pages with probability proportional to their PageRank value.
- Sample the visited pages with probability inversely proportional to their PageRank value.
Thus, the probability that a page is sampled is a constant, independent of the page.
PageRank has several definitions. Google’s creators, Brin and Page, published the definition used in Google.
PageRank
Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, in Proceedings of the 7th International World Wide Web Conference, pp. 107–117, 1998.
PageRank
PageRank has another definition, based on a random walk:
The PageRank of a page p is the fraction of steps that the walk spends at p in the limit.
- The initial page is chosen at random from all pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, jump to a page selected at random from all pages.
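The random-walk definition can be simulated directly. Below is a minimal sketch; the three-page example graph, step count, and function name are illustrative, not from the paper:

```python
import random

def random_walk_pagerank(graph, d=0.85, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps a random walk
    spends at each page; graph maps page -> list of out-links."""
    rng = random.Random(seed)
    pages = list(graph)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)                 # initial page chosen at random
    for _ in range(steps):
        visits[page] += 1
        if graph[page] and rng.random() < d:
            page = rng.choice(graph[page])   # with probability d: follow an out-link
        else:
            page = rng.choice(pages)         # with probability 1-d: jump anywhere
    return {p: c / steps for p, c in visits.items()}

web = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
pr = random_walk_pagerank(web)
```

In this toy graph, page A collects links from both B and C, so the walk spends the largest fraction of its steps there.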
PageRank
Two problems arise in the implementation:
- The random walk already assumes that it can find a random page on the web, which is exactly the problem we want to solve.
- Many hosts on the web have a large number of links within the same host and very few leaving it.
PageRank
Henzinger proposed and implemented a modified random walk:
- Given a set of initial pages.
- Choose the start page at random from the initial pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, select a random host among the visited hosts, then jump to a page chosen at random among the pages visited on this host so far.
- All pages in the initial set are also considered visited.
Sampling the Web with Random Walk
The modified random walk visits a page with probability approximately proportional to its PageRank value.
Afterward, the visited pages are sampled with probability inversely proportional to their PageRank value.
Thus, the probability that a page is sampled is a constant independent of the page.
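The second stage, sampling visited pages with probability inversely proportional to PageRank, can be sketched as follows. The acceptance probability is normalized by the smallest PageRank value so it stays in (0, 1]; the function name and the normalization choice are illustrative, not from the paper:

```python
import random

def subsample_uniform(visited_pr, seed=0):
    """Keep each visited page with probability proportional to 1/PageRank,
    so (visit prob ~ PR) * (acceptance prob ~ 1/PR) is a constant."""
    rng = random.Random(seed)
    min_pr = min(visited_pr.values())
    sample = []
    for page, pr in visited_pr.items():
        if rng.random() < min_pr / pr:    # acceptance probability in (0, 1]
            sample.append(page)
    return sample
```

Pages the walk visited often (high PageRank) are kept rarely, and vice versa, so the combined procedure samples each page with roughly the same probability.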
Sampling the Web with Random Walk
An example of statistics generated using this approach:
Sampling the Web with IP Address Sampling
IPv4 addresses: 4 bytes. IPv6 addresses: 16 bytes.
There are about 4.3 billion possible IPv4 addresses.
IP address sampling randomly samples IP addresses and tests for a web server at the standard port (HTTP: 80, HTTPS: 443).
This approach works only for IPv4: the IPv6 address space, with 2^128 addresses, is far too large to explore.
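A minimal sketch of the probing step, assuming a plain TCP connection test is an acceptable proxy for "a web server answers at this address"; a real study would send an HTTP request and retry unresponsive addresses:

```python
import random
import socket

def random_ipv4(rng):
    """Draw one address uniformly from the 2**32 possible IPv4 addresses."""
    n = rng.randrange(2 ** 32)
    return ".".join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def has_web_server(ip, port=80, timeout=1.0):
    """True if something accepts a TCP connection on the given port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

ip = random_ipv4(random.Random(42))
```

Note that many drawn addresses fall in reserved or unrouted ranges, which is one source of wasted probes in this method.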
Sampling the Web with IP Address Sampling
Solution: check multiple times.
Sampling the Web with IP Address Sampling
This method finds many web servers that would not normally be considered part of the publicly indexable web:
- Servers with authorization requirements
- Servers with no content
- Hardware that provides a web interface
Sampling the Web with IP Address Sampling
A number of issues lead to minor biases:
- An IP address may host several web sites.
- Multiple IP addresses may serve identical content.
- Some web servers may not use the standard port.
There is a higher probability of finding larger sites that use multiple IP addresses to serve the same content.
Solution: use the domain name system.
Sampling the Web with IP Address Sampling
The distribution of server types found from sampling 3.6 million IP addresses in February 1999. Lawrence, S. & Giles, C. L. (1999) Nature 400, 107–109.
Analyses from the same study
Only 34.2% of servers contained the common “keyword” or “description” meta-tags on their homepage. Low usage of a simple HTML metadata standard suggests that acceptance of more complex standards such as XML will be very slow.
Discussion On Sampling the Web
Current techniques exhibit biases and do not achieve a uniform random sample.
For the random walk, any implementation is limited to a finite walk.
For IP address sampling, the main challenge is how to sub-sample the pages accessible from a given IP address.
Analyzing and Modeling Web Growth
The Web has a degree distribution following a power law: P(k) ~ k^(-γ).
γ ≈ 2.1 for the in-link distribution and γ ≈ 2.72 for the out-link distribution.
We can also extract valuable information by analyzing and modeling the growth of pages and links on the web.
Analyzing and Modeling Web Growth
This observation led to the design of various models for the Web:
- Preferential Attachment of Barabasi et al.
- Mixed Model of Pennock et al.
- Copy Model of Kleinberg et al.
- The Hostgraph Model
Preferential Attachment
As the network grows, the probability that a given node receives an edge is proportional to that node’s current connectivity.
‘rich get richer’
The probability that a new node is connected to node u is P(u) = k_u / Σ_w k_w, where k_w is the current degree of node w.
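This growth rule can be simulated with the standard trick of keeping a list in which each node appears once per unit of degree, so a uniform draw from that list is a degree-proportional draw. A minimal sketch; the parameters and the clique seed graph are illustrative choices:

```python
import random

def preferential_attachment(n, m=2, seed=0):
    """Grow a graph where each new node attaches m edges to existing
    nodes chosen with probability proportional to current degree."""
    rng = random.Random(seed)
    degree = dict.fromkeys(range(n), 0)
    endpoints = []                       # node v appears degree(v) times
    for i in range(m + 1):               # start from a small clique
        for j in range(i):
            endpoints += [i, j]
            degree[i] += 1
            degree[j] += 1
    for u in range(m + 1, n):
        targets = set()
        while len(targets) < m:                  # uniform draw from endpoints
            targets.add(rng.choice(endpoints))   # = degree-proportional draw
        for v in targets:
            endpoints += [u, v]
            degree[u] += 1
            degree[v] += 1
    return degree

deg = preferential_attachment(2000)
```

In the resulting graph the earliest nodes accumulate far more links than the latest ones, the "rich get richer" effect described above.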
Preferential Attachment
The model suggests that for a node u created at time t_u, the expected degree at time t is m(t/t_u)^0.5.
Thus, older pages get rich faster than newer pages.
The model explains the power-law in-link distribution. However, the model's exponent is 3 (by mean-field theory), whereas the observed exponent is 2.1.
No Evidence
In reality, different link distributions are observed among web pages of the same category.
Winners don’t take all
The early models fail to account for significant deviations from power law scaling common in almost all studied networks.
For example, among web pages of the same category, link distributions can diverge strongly from power law scaling, exhibiting a roughly log-normal distribution.
Moreover, conclusions about the attack and failure tolerance of the Internet based on the early models may not fully hold within specific communities.
Winners don’t take all
NEC researchers (Pennock et al.) discovered that the degree of "rich get richer" or "winners take all" behavior varies in different categories and may be significantly less than previously thought.
Winners don’t take all
Pennock et al. introduced a new model of network growth, mixing uniform and preferential attachment, that accurately accounts for the true connectivity distributions found in web categories, the web as a whole, and other social and biological networks.
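Pennock et al.'s actual model and parameter estimates are in their paper; the sketch below only illustrates the general idea of mixing uniform and preferential attachment, with a mixing probability p chosen arbitrarily:

```python
import random

def mixed_attachment(n, m=2, p=0.5, seed=0):
    """Each new node adds m edges; every edge endpoint is chosen
    preferentially (proportional to degree) with probability p,
    and uniformly over existing nodes otherwise."""
    rng = random.Random(seed)
    endpoints = [0, 1]                   # start with a single edge 0-1
    degree = {0: 1, 1: 1}
    for u in range(2, n):
        targets = []
        for _ in range(m):
            if rng.random() < p:
                targets.append(rng.choice(endpoints))   # preferential
            else:
                targets.append(rng.randrange(u))        # uniform
        for v in targets:
            endpoints += [u, v]
            degree[u] = degree.get(u, 0) + 1
            degree[v] += 1
    return degree

deg = mixed_attachment(500)
```

Varying p interpolates between a pure power law (p = 1) and a much flatter distribution (p = 0), matching the varying degrees of "winners take all" behavior observed across web categories.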
Winners don’t take all
The numbers represent the degree to which link growth is preferential (new links are created to already popular sites).
Copy Model
Kleinberg et al. explained the power-law inlink distributions with a copy model that constructs a directed graph.
- A new node u is added with d outlinks: Dest 0, Dest 1, ..., Dest d-1.
- An existing node v is chosen uniformly at random.
- For the jth outlink of u: with probability α, copy v's jth destination; with probability 1-α, choose the destination uniformly at random among existing nodes.
This model is also a mixture of uniform and preferential influences on network growth.
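A sketch of the copy model's growth rule; the copy factor is written `alpha` here, and the self-linking seed node and parameter values are illustrative choices, not from the paper:

```python
import random

def copy_model(n, d=3, alpha=0.5, seed=0):
    """Each new node u gets d outlinks; its jth link copies the jth
    outlink of a uniformly chosen prototype v with probability alpha,
    and points to a uniformly random existing node otherwise."""
    rng = random.Random(seed)
    outlinks = {0: [0] * d}              # seed node linking to itself
    for u in range(1, n):
        v = rng.randrange(u)             # prototype, uniform over existing nodes
        links = []
        for j in range(d):
            if rng.random() < alpha:
                links.append(outlinks[v][j])    # copy: preferential in effect
            else:
                links.append(rng.randrange(u))  # uniform over existing nodes
        outlinks[u] = links
    return outlinks

g = copy_model(1000)
```

Copying a random node's links implicitly favors popular destinations (they appear in many link lists), which is how this rule reproduces preferential attachment without ever inspecting degrees.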
The Hostgraph Model
Models the Web on the host or domain level.
Each node represents a host.
Each directed edge represents the hyperlinks from pages on the source host to pages on the target host.
The Hostgraph Model
Bharat et al. show that the weighted inlink and weighted outlink distributions in the host graph follow a power law with γ = 1.62 and γ = 1.67, respectively. However, the number of hosts with small degree is considerably smaller than the model predicts.
There is a "flattening" of the curve for low-degree hosts.
The Hostgraph Model
Bharat et al. made a modification to the copy graph model, called the re-link model, to explain this flattening.
- With probability β, add a new node u with d outlinks; with probability 1-β, no new node is added, and instead an existing node is selected and given d additional outlinks. This reduces the number of low-degree nodes.
- To create the d outlinks, an existing node v is chosen uniformly at random, and then d links of v are selected uniformly at random.
- Each selected destination is copied with probability α; with probability 1-α, the destination is chosen uniformly at random among existing nodes.
Communities on the Web
Identification of communities on the web is valuable. Practical applications include:
• Automatic web portals
• Focused search engines
• Content filtering
• Complementing text-based searches
Community identification also allows for analysis of the entire web and the objective study of relationships within and between communities.
Communities on the Web
Flake et al. define a web community as :
A collection of web pages such that each member page has more hyperlinks within the community than outside of the community.
Flake et al. show that the web self-organizes such that these link-based communities identify highly related pages.
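The definition above can be checked mechanically for a candidate page set: count, for each member, the hyperlinks staying inside the set versus leaving it. A sketch with a hypothetical toy link graph (finding such sets efficiently is the hard part; Flake et al. use a max-flow formulation, not reproduced here):

```python
def is_community(graph, members):
    """Flake et al.'s criterion: every member page has more hyperlinks
    inside the candidate set than outside it."""
    members = set(members)
    for page in members:
        inside = sum(1 for t in graph.get(page, []) if t in members)
        outside = len(graph.get(page, [])) - inside
        if inside <= outside:
            return False
    return True

# toy link graph: a densely linked three-page cluster plus two outsiders
web = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "x"],
    "x": ["y"],
    "y": ["x"],
}
```

Here {"a", "b", "c"} satisfies the criterion, while an arbitrary set such as {"a", "x"} does not.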
Communities on the Web
There are alternatives for the indication of Web communities:
Kumar et al. consider dense bipartite subgraphs as indications of communities.
Other approaches :
Bibliometric methods such as cocitation and bibliographic coupling
The PageRank algorithm
The HITS algorithm
Bipartite subgraph identification
Spreading activation energy
Conclusion
There are still many open problems:
- Uniformly sampling the web is still open in practice: which pages should be counted, and how can we reduce biases?
- Web growth models only approximate how the web actually grows: how can current models be refined to improve accuracy while remaining simple enough to understand and analyze?
- Community identification remains an open area: how can its accuracy be improved, and how can communities be structured or presented to account for differences of opinion about what constitutes a community?
Thanks For Your Patience
Appendix
Google’s PageRank
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
Google’s PageRank
Example (d = 0.85):
PageRank(C) = 0.15 + 0.85 × (1.49/2 + 0.78/1) = 1.45
PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation (Brin & Page, 1998).
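The iterative algorithm can be sketched directly from the formula in the appendix. The example graph is illustrative, and dangling pages with no outlinks would need special handling, omitted here:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn));
    with this normalization the values sum to the number of pages."""
    pages = list(links)
    out_count = {p: len(links[p]) for p in pages}  # C(T): outlinks of T
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(iters):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t]
                                 for t in pages if p in links[t])
            for p in pages
        }
    return pr

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(web)
```

Note this scans all pages per target for clarity; a production implementation would instead use the link matrix, as the appendix's eigenvector formulation suggests.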