Extracting knowledge from the World Wide Web
Presented by Murat Şensoy
Monika Henzinger and Steve Lawrence
Google Inc.
Objective
The World Wide Web provides an exceptional opportunity to automatically analyze a large sample of interests and activity in the world. But How to:
Extract knowledge from the web The Challenge:
Distributed and heterogeneous nature of the web makes large-scale analysis difficult.
Objective
The paper provides an overview of recent methods for:
Sampling the Web
Analyzing and Modeling Web Growth
Sampling the Web
Due to the sheer size of the Web, even simple statistics about it are unknown.
The ability to sample web pages or web servers uniformly at random is very useful for determining statistics.
The question is: how can the Web be sampled uniformly?
Sampling the Web
Two famous sampling methods for the Web are:
- Random Walk
- IP Address Sampling
Sampling the Web with Random Walk
Main idea:
- Visit pages with probability proportional to their PageRank value.
- Sample the visited pages with probability inversely proportional to their PageRank value.
Thus, the probability that a page is sampled is a constant, independent of the page.
PageRank has several definitions. Google’s creators, Brin and Page, published the definition used in Google.
PageRank
Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, in Proceedings of the 7th International World Wide Web Conference, pp. 107–117, 1998.
PageRank
PageRank has another definition, based on a random walk:
The PageRank of a page p is the fraction of steps that the walk spends at p in the limit.
- The initial page is chosen at random from all pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, jump to a page selected at random from all pages.
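The random-walk definition can be simulated directly. Below is a minimal sketch; the three-page example graph, step count, and function name are illustrative, not from the paper:

```python
import random

def random_walk_pagerank(graph, d=0.85, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps a random walk
    spends at each page; graph maps page -> list of out-links."""
    rng = random.Random(seed)
    pages = list(graph)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)                 # initial page chosen at random
    for _ in range(steps):
        visits[page] += 1
        if graph[page] and rng.random() < d:
            page = rng.choice(graph[page])   # with probability d: follow an out-link
        else:
            page = rng.choice(pages)         # with probability 1-d: jump anywhere
    return {p: c / steps for p, c in visits.items()}

web = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
pr = random_walk_pagerank(web)
```

In this toy graph, page A collects links from both B and C, so the walk spends the largest fraction of its steps there.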
PageRank
Two problems arise in the implementation:
- The random walk already assumes that it can find a random page on the web, which is exactly the problem we want to solve.
- Many hosts on the web have a large number of links within the same host and very few leaving it.
PageRank
Henzinger proposed and implemented a modified random walk:
- Given a set of initial pages.
- Choose the start page at random from the initial pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, select a random host among the visited hosts, then jump to a page chosen at random among the pages visited on this host so far.
- All pages in the initial set are also considered visited.
Sampling the Web with Random Walk
The modified random walk visits a page with probability approximately proportional to its PageRank value.
Afterward, the visited pages are sampled with probability inversely proportional to their PageRank value.
Thus, the probability that a page is sampled is a constant independent of the page.
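The second stage, sampling visited pages with probability inversely proportional to PageRank, can be sketched as follows. The acceptance probability is normalized by the smallest PageRank value so it stays in (0, 1]; the function name and the normalization choice are illustrative, not from the paper:

```python
import random

def subsample_uniform(visited_pr, seed=0):
    """Keep each visited page with probability proportional to 1/PageRank,
    so (visit prob ~ PR) * (acceptance prob ~ 1/PR) is a constant."""
    rng = random.Random(seed)
    min_pr = min(visited_pr.values())
    sample = []
    for page, pr in visited_pr.items():
        if rng.random() < min_pr / pr:    # acceptance probability in (0, 1]
            sample.append(page)
    return sample
```

Pages the walk visited often (high PageRank) are kept rarely, and vice versa, so the combined procedure samples each page with roughly the same probability.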
Sampling the Web with Random Walk
An example of statistics generated using this approach:
Sampling the Web with IP Address Sampling
IPv4 addresses: 4 bytes. IPv6 addresses: 16 bytes.
There are about 4.3 billion possible IPv4 addresses.
IP address sampling randomly samples IP addresses and tests for a web server at the standard port (HTTP: 80, HTTPS: 443).
This approach works only for IPv4: the IPv6 address space, with 2^128 addresses, is far too large to explore.
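A minimal sketch of the probing step, assuming a plain TCP connection test is an acceptable proxy for "a web server answers at this address"; a real study would send an HTTP request and retry unresponsive addresses:

```python
import random
import socket

def random_ipv4(rng):
    """Draw one address uniformly from the 2**32 possible IPv4 addresses."""
    n = rng.randrange(2 ** 32)
    return ".".join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def has_web_server(ip, port=80, timeout=1.0):
    """True if something accepts a TCP connection on the given port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

ip = random_ipv4(random.Random(42))
```

Note that many drawn addresses fall in reserved or unrouted ranges, which is one source of wasted probes in this method.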
Sampling the Web with IP Address Sampling
Solution: check multiple times.
Sampling the Web with IP Address Sampling
This method finds many web servers that would not normally be considered part of the publicly indexable web:
- Servers with authorization requirements
- Servers with no content
- Hardware that provides a web interface
Sampling the Web with IP Address Sampling
A number of issues lead to minor biases:
- An IP address may host several web sites.
- Multiple IP addresses may serve identical content.
- Some web servers may not use the standard port.
There is a higher probability of finding larger sites that use multiple IP addresses to serve the same content.
Solution: use the domain name system.
Sampling the Web with IP Address Sampling
The distribution of server types found from sampling 3.6 million IP addresses in February 1999. Lawrence, S. & Giles, C. L. (1999) Nature 400, 107–109.
Analyses from the same study
Only 34.2% of servers contained the common “keyword” or “description” meta-tags on their homepage. Low usage of a simple HTML metadata standard suggests that acceptance of more complex standards such as XML will be very slow.
Discussion On Sampling the Web
Current techniques exhibit biases and do not achieve a uniform random sample.
For the random walk, any implementation is limited to a finite walk.
For IP address sampling, the main challenge is how to sub-sample the pages accessible from a given IP address.
Analyzing and Modeling Web Growth
The Web has a degree distribution following a power law: P(k) ~ k^(-γ).
γ ≈ 2.1 for the in-link distribution and γ ≈ 2.72 for the out-link distribution.
We can also extract valuable information by analyzing and modeling the growth of pages and links on the web.
Analyzing and Modeling Web Growth
This observation led to the design of various models for the Web:
- Preferential Attachment of Barabasi et al.
- Mixed Model of Pennock et al.
- Copy Model of Kleinberg et al.
- The Hostgraph Model
Preferential Attachment
As the network grows, the probability that a given node receives an edge is proportional to that node’s current connectivity.
‘rich get richer’
The probability that a new node is connected to node u is P(u) = k_u / Σ_w k_w, where k_w is the current degree of node w.
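This growth rule can be simulated with the standard trick of keeping a list in which each node appears once per unit of degree, so a uniform draw from that list is a degree-proportional draw. A minimal sketch; the parameters and the clique seed graph are illustrative choices:

```python
import random

def preferential_attachment(n, m=2, seed=0):
    """Grow a graph where each new node attaches m edges to existing
    nodes chosen with probability proportional to current degree."""
    rng = random.Random(seed)
    degree = dict.fromkeys(range(n), 0)
    endpoints = []                       # node v appears degree(v) times
    for i in range(m + 1):               # start from a small clique
        for j in range(i):
            endpoints += [i, j]
            degree[i] += 1
            degree[j] += 1
    for u in range(m + 1, n):
        targets = set()
        while len(targets) < m:                  # uniform draw from endpoints
            targets.add(rng.choice(endpoints))   # = degree-proportional draw
        for v in targets:
            endpoints += [u, v]
            degree[u] += 1
            degree[v] += 1
    return degree

deg = preferential_attachment(2000)
```

In the resulting graph the earliest nodes accumulate far more links than the latest ones, the "rich get richer" effect described above.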
Preferential Attachment
The model suggests that for a node u created at time t_u, the expected degree at time t is m(t/t_u)^0.5.
Thus, older pages get rich faster than newer pages.
The model explains the power-law in-link distribution. However, the model's exponent is 3 (by mean-field theory), whereas the observed exponent is 2.1.
No Evidence
In reality, different link distributions are observed among web pages of the same category.
Winners don’t take all
The early models fail to account for significant deviations from power law scaling common in almost all studied networks.
For example, among web pages of the same category, link distributions can diverge strongly from power law scaling, exhibiting a roughly log-normal distribution.
Moreover, conclusions about the attack and failure tolerance of the Internet based on the early models may not fully hold within specific communities.
Winners don’t take all
NEC researchers (Pennock et al.) discovered that the degree of "rich get richer" or "winners take all" behavior varies in different categories and may be significantly less than previously thought.
Winners don’t take all
Pennock et al. introduced a new model of network growth, mixing uniform and preferential attachment, that accurately accounts for the true connectivity distributions found in web categories, the web as a whole, and other social and biological networks.
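Pennock et al.'s actual model and parameter estimates are in their paper; the sketch below only illustrates the general idea of mixing uniform and preferential attachment, with a mixing probability p chosen arbitrarily:

```python
import random

def mixed_attachment(n, m=2, p=0.5, seed=0):
    """Each new node adds m edges; every edge endpoint is chosen
    preferentially (proportional to degree) with probability p,
    and uniformly over existing nodes otherwise."""
    rng = random.Random(seed)
    endpoints = [0, 1]                   # start with a single edge 0-1
    degree = {0: 1, 1: 1}
    for u in range(2, n):
        targets = []
        for _ in range(m):
            if rng.random() < p:
                targets.append(rng.choice(endpoints))   # preferential
            else:
                targets.append(rng.randrange(u))        # uniform
        for v in targets:
            endpoints += [u, v]
            degree[u] = degree.get(u, 0) + 1
            degree[v] += 1
    return degree

deg = mixed_attachment(500)
```

Varying p interpolates between a pure power law (p = 1) and a much flatter distribution (p = 0), matching the varying degrees of "winners take all" behavior observed across web categories.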
Winners don’t take all
The numbers represent the degree to which link growth is preferential (new links are created to already popular sites).
Copy Model
Kleinberg et al. explained the power-law inlink distributions with a copy model that constructs a directed graph.
- A new node u is added with d outlinks: Dest 0, Dest 1, ..., Dest d-1.
- An existing node v is chosen uniformly at random.
- For the jth outlink of u: with probability α, copy v's jth destination; with probability 1-α, choose the destination uniformly at random among existing nodes.
This model is also a mixture of uniform and preferential influences on network growth.
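A sketch of the copy model's growth rule; the copy factor is written `alpha` here, and the self-linking seed node and parameter values are illustrative choices, not from the paper:

```python
import random

def copy_model(n, d=3, alpha=0.5, seed=0):
    """Each new node u gets d outlinks; its jth link copies the jth
    outlink of a uniformly chosen prototype v with probability alpha,
    and points to a uniformly random existing node otherwise."""
    rng = random.Random(seed)
    outlinks = {0: [0] * d}              # seed node linking to itself
    for u in range(1, n):
        v = rng.randrange(u)             # prototype, uniform over existing nodes
        links = []
        for j in range(d):
            if rng.random() < alpha:
                links.append(outlinks[v][j])    # copy: preferential in effect
            else:
                links.append(rng.randrange(u))  # uniform over existing nodes
        outlinks[u] = links
    return outlinks

g = copy_model(1000)
```

Copying a random node's links implicitly favors popular destinations (they appear in many link lists), which is how this rule reproduces preferential attachment without ever inspecting degrees.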
The Hostgraph Model
Models the Web on the host or domain level.
Each node represents a host.
Each directed edge represents the hyperlinks from pages on the source host to pages on the target host.
The Hostgraph Model
Bharat et al. show that the weighted inlink and weighted outlink distributions in the host graph follow a power law with γ = 1.62 and γ = 1.67, respectively. However, the number of hosts with small degree is considerably smaller than the model predicts.
There is a "flattening" of the curve for low-degree hosts.
The Hostgraph Model
Bharat et al. made a modification to the copy graph model, called the re-link model, to explain this flattening.
- With probability β, add a new node u with d outlinks; with probability 1-β, no new node is added, and instead an existing node is selected and given d additional outlinks. This reduces the number of low-degree nodes.
- To create the d outlinks, an existing node v is chosen uniformly at random, and then d links of v are selected uniformly at random.
- Each selected destination is copied with probability α; with probability 1-α, the destination is chosen uniformly at random among existing nodes.
Communities on the Web
Identification of communities on the web is valuable. Practical applications include:
• Automatic web portals
• Focused search engines
• Content filtering
• Complementing text-based searches
Community identification also allows for analysis of the entire web and the objective study of relationships within and between communities.
Communities on the Web
Flake et al. define a web community as :
A collection of web pages such that each member page has more hyperlinks within the community than outside of the community.
Flake et al. show that the web self-organizes such that these link-based communities identify highly related pages.
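The definition above can be checked mechanically for a candidate page set: count, for each member, the hyperlinks staying inside the set versus leaving it. A sketch with a hypothetical toy link graph (finding such sets efficiently is the hard part; Flake et al. use a max-flow formulation, not reproduced here):

```python
def is_community(graph, members):
    """Flake et al.'s criterion: every member page has more hyperlinks
    inside the candidate set than outside it."""
    members = set(members)
    for page in members:
        inside = sum(1 for t in graph.get(page, []) if t in members)
        outside = len(graph.get(page, [])) - inside
        if inside <= outside:
            return False
    return True

# toy link graph: a densely linked three-page cluster plus two outsiders
web = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "x"],
    "x": ["y"],
    "y": ["x"],
}
```

Here {"a", "b", "c"} satisfies the criterion, while an arbitrary set such as {"a", "x"} does not.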
Communities on the Web
There are alternatives for the indication of Web communities:
Kumar et al. consider dense bipartite subgraphs as indications of communities.
Other approaches :
Bibliometric methods such as cocitation and bibliographic coupling
The PageRank algorithm
The HITS algorithm
Bipartite subgraph identification
Spreading activation energy
Conclusion
There are still many open problems:
- Uniformly sampling the web is still open in practice: which pages should be counted, and how can we reduce biases?
- Web growth models only approximate how the web actually grows: how can current models be refined to improve accuracy while remaining simple enough to understand and analyze?
- Community identification remains an open area: how can its accuracy be improved, and how can communities be structured or presented to account for differences of opinion about what constitutes a community?
Thanks For Your Patience
Appendix
Google’s PageRank
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
Google’s PageRank
Example (d = 0.85):
PageRank(C) = 0.15 + 0.85 × (1.49/2 + 0.78/1) = 1.45
PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation (Brin & Page, 1998).
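The iterative algorithm can be sketched directly from the formula in the appendix. The example graph is illustrative, and dangling pages with no outlinks would need special handling, omitted here:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn));
    with this normalization the values sum to the number of pages."""
    pages = list(links)
    out_count = {p: len(links[p]) for p in pages}  # C(T): outlinks of T
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(iters):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t]
                                 for t in pages if p in links[t])
            for p in pages
        }
    return pr

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(web)
```

Note this scans all pages per target for clarity; a production implementation would instead use the link matrix, as the appendix's eigenvector formulation suggests.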