
Extracting knowledge from the World Wide Web


Page 1: Extracting knowledge from the World Wide Web

Extracting knowledge from the World Wide Web

Monika Henzinger and Steve Lawrence, Google Inc.

Presented by Murat Şensoy

Page 2: Extracting knowledge from the World Wide Web

Objective

The World Wide Web provides an exceptional opportunity to automatically analyze a large sample of interests and activity in the world. But how can we extract knowledge from the web?

The challenge: the distributed and heterogeneous nature of the web makes large-scale analysis difficult.

Page 3: Extracting knowledge from the World Wide Web

Objective

The paper provides an overview of recent methods for:

Sampling the Web

Analyzing and Modeling Web Growth

Page 4: Extracting knowledge from the World Wide Web

Sampling the Web

Due to the sheer size of the Web, even simple statistics about it are unknown.

The ability to sample web pages or web servers uniformly at random is very useful for determining statistics.

The question is: how can the Web be sampled uniformly?

Page 5: Extracting knowledge from the World Wide Web

Sampling the Web

Two well-known sampling methods for the Web are:

- Random Walk
- IP Address Sampling

Page 6: Extracting knowledge from the World Wide Web

Sampling the Web with Random Walk

Main Idea :

Visit pages with probability proportional to their PageRank values.

Sample the visited pages with probability inversely proportional to their PageRank values.

Thus, the probability that a page is sampled is a constant, independent of the page.

Page 7: Extracting knowledge from the World Wide Web

PageRank

PageRank has several definitions.

Google's creators, Brin and Page, published the definition of PageRank as used in Google:

Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of the 7th International World Wide Web Conference, pp. 107–117, 1998.

Page 8: Extracting knowledge from the World Wide Web

PageRank

PageRank has an alternative definition based on a random walk.

The PageRank of a page p is the fraction of steps that the walk spends at p in the limit.

- The initial page is chosen uniformly at random from all pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, jump to a page selected uniformly at random from all pages.
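A minimal sketch of this random-walk definition of PageRank; the graph, damping factor, and step count are illustrative assumptions, not values from the paper:

```python
import random

def estimate_pagerank(out_links, d=0.85, steps=1_000_000):
    """Estimate PageRank by simulating the random-surfer walk.

    out_links: dict mapping each page to the list of pages it links to.
    Returns the fraction of steps the walk spent at each page.
    """
    pages = list(out_links)
    visits = {p: 0 for p in pages}
    current = random.choice(pages)          # start at a random page
    for _ in range(steps):
        visits[current] += 1
        links = out_links[current]
        if links and random.random() < d:   # follow an out-link with probability d
            current = random.choice(links)
        else:                               # otherwise jump to a random page
            current = random.choice(pages)
    return {p: count / steps for p, count in visits.items()}
```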

Page 9: Extracting knowledge from the World Wide Web

PageRank

Two problems arise in the implementation:

The random walk already assumes that it can find a random page on the web, which is exactly the problem we want to solve.

Many hosts on the web have a large number of links within the same host and very few links leaving them.

Page 10: Extracting knowledge from the World Wide Web

PageRank

Henzinger et al. proposed and implemented a modified random walk:

- Given a set of initial pages.
- Choose the start page uniformly at random from the initial pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, select a random host among the visited hosts, then jump to a page selected uniformly at random from all pages visited on this host so far.
- All pages in the initial set are also considered to be visited.

Page 11: Extracting knowledge from the World Wide Web

Sampling the Web with Random Walk

The modified random walk visits a page with probability approximately proportional to its PageRank value.

Afterward, the visited pages are sampled with probability inversely proportional to their PageRank value.

Thus, the probability that a page is sampled is a constant independent of the page.
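A minimal sketch of the sampling step, assuming we already have the set of visited pages together with estimated PageRank values (the names and data structures are illustrative):

```python
import random

def uniform_sample(visited, pagerank):
    """Accept each visited page with probability inversely proportional to
    its estimated PageRank, so the visit bias (proportional to PageRank)
    and the acceptance bias cancel, leaving a roughly uniform sample."""
    min_pr = min(pagerank[p] for p in visited)
    # Acceptance probability is scaled by min_pr so it never exceeds 1.
    return [p for p in visited if random.random() < min_pr / pagerank[p]]
```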

Page 12: Extracting knowledge from the World Wide Web

Sampling the Web with Random Walk

An example of statistics generated using this approach:

Page 13: Extracting knowledge from the World Wide Web

Sampling the Web with IP Address Sampling

IPv4 addresses: 4 bytes; IPv6 addresses: 16 bytes.

There are about 4.3 billion possible IPv4 addresses.

IP address sampling randomly samples IP addresses and tests for a web server at the standard port (80 for HTTP or 443 for HTTPS).

This approach works only for IPv4: the IPv6 address space, with 2^128 addresses, is far too large to explore.
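A minimal sketch of the probing step, assuming IPv4 and a short connection timeout (the sample size and timeout are illustrative choices):

```python
import random
import socket

def probe_random_ipv4(n_samples=1000, port=80, timeout=1.0):
    """Sample random IPv4 addresses and test whether a web server
    answers on the given port. Returns the addresses that responded."""
    responding = []
    for _ in range(n_samples):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                responding.append(ip)
        except OSError:
            pass  # no web server (or unreachable) at this address
    return responding
```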

Page 14: Extracting knowledge from the World Wide Web

Sampling the Web with IP Address Sampling

Solution: check multiple times.

Page 15: Extracting knowledge from the World Wide Web

Sampling the Web with IP Address Sampling

This method finds many web servers that would not normally be considered part of the publicly indexable web:

- Servers with authorization requirements
- Servers with no content
- Hardware that provides a web interface

Page 16: Extracting knowledge from the World Wide Web

Sampling the Web with IP Address Sampling

A number of issues lead to minor biases:

- An IP address may host several web sites
- Multiple IP addresses may serve identical content
- Some web servers may not use the standard port

There is a higher probability of finding larger sites that use multiple IP addresses to serve the same content.

Solution: use the domain name system.

Page 17: Extracting knowledge from the World Wide Web

Sampling the Web with IP Address Sampling

The distribution of server types found from sampling 3.6 million IP addresses in February 1999 (Lawrence, S. & Giles, C. L. (1999) Nature 400, 107–109).

Analyses from the same study:

Only 34.2% of servers contained the common "keyword" or "description" meta-tags on their homepage. Low usage of a simple HTML metadata standard suggests that acceptance of more complex standards such as XML will be very slow.

Page 18: Extracting knowledge from the World Wide Web

Discussion On Sampling the Web

Current techniques exhibit biases and do not achieve a uniform random sample.

For random walks, any implementation is limited to a finite walk.

For IP address sampling, the main challenge is how to sub-sample the pages accessible from a given IP address.

Page 19: Extracting knowledge from the World Wide Web

Analyzing and Modeling Web Growth

The Web has a degree distribution that follows a power law:

P(k) ~ k^(-γ)

γ ≈ 2.1 for the in-link distribution

γ ≈ 2.72 for the out-link distribution

We can also extract valuable information by analyzing and modeling the growth of pages and links on the web.
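A minimal sketch of estimating the exponent γ from a sample of in-degrees, using a standard maximum-likelihood estimator for power-law tails (the degree sample and the cutoff k_min are illustrative assumptions):

```python
import math

def estimate_power_law_exponent(degrees, k_min=1):
    """Maximum-likelihood estimate of gamma in P(k) ~ k^(-gamma), using the
    continuous approximation for the tail of degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    n = len(tail)
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + n / log_sum

# Hypothetical usage with a made-up in-degree sample:
# gamma = estimate_power_law_exponent([1, 1, 2, 3, 5, 8, 40, 120], k_min=1)
```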

Page 20: Extracting knowledge from the World Wide Web

Analyzing and Modeling Web Growth

This observation led to the design of various models for the Web:

- Preferential Attachment of Barabási et al.
- Mixed Model of Pennock et al.
- Copy Model of Kleinberg et al.
- The Hostgraph Model

Page 21: Extracting knowledge from the World Wide Web

Preferential Attachment

As the network grows, the probability that a given node receives an edge is proportional to that node’s current connectivity.

‘rich get richer’

The probability that a new node is connected to node u is

P(u) = k_u / Σ_w k_w

where k_u is the current degree of node u and the sum runs over all existing nodes w.
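A minimal sketch of preferential-attachment growth; the seed graph, node count, and edges per node are illustrative assumptions:

```python
import random

def preferential_growth(n_nodes, m=2):
    """Grow an undirected graph where each new node attaches m edges to
    existing nodes chosen with probability proportional to current degree."""
    edges = [(0, 1), (1, 2), (2, 0)]             # small seed so degrees are nonzero
    degree_pool = [u for e in edges for u in e]  # a node appears once per incident edge
    for new in range(3, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(degree_pool))  # 'rich get richer' choice
        for t in targets:
            edges.append((new, t))
            degree_pool.extend([new, t])
    return edges
```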

Page 22: Extracting knowledge from the World Wide Web

Preferential Attachment

The model suggests that for a node u created at time t_u, the expected degree at time t is m(t/t_u)^0.5.

Thus older pages get rich faster than newer pages.

The model explains the power-law in-link distribution. However, the model's exponent is 3 (by mean-field theory), whereas the observed exponent is 2.1.

No evidence: in reality, different link distributions are observed among web pages of the same category.

Page 23: Extracting knowledge from the World Wide Web

Winners don’t take all

The early models fail to account for significant deviations from power-law scaling that are common in almost all studied networks.

For example, among web pages of the same category, link distributions can diverge strongly from power law scaling, exhibiting a roughly log-normal distribution.

Moreover, conclusions about the attack and failure tolerance of the Internet based on the early models may not fully hold within specific communities.

Page 24: Extracting knowledge from the World Wide Web

Winners don’t take all

NEC researchers (Pennock et al.) discovered that the degree of "rich get richer" or "winners take all" behavior varies in different categories and may be significantly less than previously thought.

Page 25: Extracting knowledge from the World Wide Web

Winners don’t take all

Pennock et al. introduced a new model of network growth, mixing uniform and preferential attachment, that accurately accounts for the true connectivity distributions found in web categories, the web as a whole, and other social and biological networks.
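A minimal sketch of such a mixed growth rule; the mixing parameter p and the other values are illustrative and not the parameters fitted by Pennock et al.:

```python
import random

def mixed_growth(n_nodes, m=2, p=0.5):
    """Grow a graph mixing uniform and preferential attachment: each new edge
    endpoint is a uniformly random existing node with probability p, and a
    degree-proportional ('rich get richer') choice with probability 1-p."""
    edges = [(0, 1)]
    degree_pool = [0, 1]          # each node appears once per incident edge
    for new in range(2, n_nodes):
        for _ in range(m):
            if random.random() < p:
                target = random.randrange(new)        # uniform attachment
            else:
                target = random.choice(degree_pool)   # preferential attachment
            edges.append((new, target))
            degree_pool.extend([new, target])
    return edges
```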

Page 26: Extracting knowledge from the World Wide Web

Winners don't take all

The numbers represent the degree to which link growth is preferential (new links are created to already popular sites).

Page 27: Extracting knowledge from the World Wide Web

Copy Model

Kleinberg et al. explained the power-law in-link distributions with a copy model that constructs a directed graph:

- A new node u is added with d out-links.
- An existing node v (the prototype) is chosen uniformly at random.
- For the j-th out-link of u: with probability α, copy the destination of the j-th out-link of v; with probability 1-α, choose the destination uniformly at random among the existing nodes.

This model is also a mixture of uniform and preferential influences on network growth.
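A minimal sketch of the copy model under these rules; the graph size, d, and the copy probability α are illustrative assumptions:

```python
import random

def copy_model(n_nodes, d=3, alpha=0.8):
    """Grow a directed graph: each new node gets d out-links, each either
    copied from a random prototype node (prob. alpha) or pointed at a
    uniformly random existing node (prob. 1 - alpha)."""
    # Seed: a small complete directed graph so every prototype has d out-links.
    out_links = {i: [j for j in range(d + 1) if j != i][:d] for i in range(d + 1)}
    for u in range(d + 1, n_nodes):
        prototype = random.randrange(u)              # existing node chosen uniformly
        links = []
        for j in range(d):
            if random.random() < alpha:
                links.append(out_links[prototype][j])  # copy j-th link of the prototype
            else:
                links.append(random.randrange(u))      # uniform destination
        out_links[u] = links
    return out_links
```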

Page 28: Extracting knowledge from the World Wide Web

The Hostgraph Model

Models the Web on the host or domain level.

Each node represents a host.

Each directed edge represents the hyperlinks from pages on the source host to pages on the target host.

Page 29: Extracting knowledge from the World Wide Web

The Hostgraph Model

Bharat et al. show that the weighted in-link and the weighted out-link distributions in the host graph have a power law distribution with γ = 1.62 and γ = 1.67, respectively. However, the number of hosts with small degree is considerably smaller than predicted by the model.

There is "flattening" of the curve for low degree hosts.

Page 30: Extracting knowledge from the World Wide Web

The Hostgraph Model

Bharat et al. made a modification to the copy graph model, called the re-link model, to explain this flattening.

With some probability no new node is added, so the number of low-degree nodes is reduced:

- With probability β, add a new node u with d out-links; with probability 1-β, select an existing node and give it d additional out-links.
- An existing node v is chosen uniformly at random, then d links of v are selected uniformly at random.
- For each of these d links: with probability α, copy its destination; with probability 1-α, choose the destination uniformly at random among the existing nodes.
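A minimal sketch of one re-link growth step; β (like α above) is an illustrative symbol for the probability that the extraction dropped, and the node numbering 0..n-1 is assumed:

```python
import random

def relink_step(out_links, d=3, alpha=0.8, beta=0.75):
    """One growth step of the re-link model: with probability beta create a
    new node, otherwise give d extra out-links to an existing node; the
    destinations follow the same copy-vs-uniform rule as the copy model."""
    n = len(out_links)                                  # assumes nodes are 0..n-1
    source = n if random.random() < beta else random.randrange(n)
    prototype = random.randrange(n)
    proto_links = random.sample(out_links[prototype],
                                min(d, len(out_links[prototype])))
    new_links = [dest if random.random() < alpha else random.randrange(n)
                 for dest in proto_links]
    out_links.setdefault(source, []).extend(new_links)
    return out_links
```

This step could be applied repeatedly on top of a seed graph such as the one produced by the copy-model sketch above.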

Page 31: Extracting knowledge from the World Wide Web

The Hostgraph Model

Page 32: Extracting knowledge from the World Wide Web

Communities on the Web

Identification of communities on the web is valuable. Practical applications include:

• Automatic web portals

• Focused search engines

• Content filtering

• Complementing text-based searches

Community identification also allows for analysis of the entire web and the objective study of relationships within and between communities.

Page 33: Extracting knowledge from the World Wide Web

Communities on the Web

Flake et al. define a web community as:

A collection of web pages such that each member page has more hyperlinks within the community than outside of the community.

Flake et al. show that the web self-organizes such that these link-based communities identify highly related pages.
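A minimal sketch of checking this definition for a candidate set of pages; the graph representation and names are illustrative, and only out-links are counted for simplicity:

```python
def is_community(members, out_links):
    """Check the Flake et al. condition on a candidate set: every member page
    has more links to pages inside the set than to pages outside it.

    members: iterable of page identifiers; out_links: dict page -> linked pages.
    """
    members = set(members)
    for page in members:
        links = out_links.get(page, [])
        inside = sum(1 for dest in links if dest in members)
        outside = len(links) - inside
        if inside <= outside:
            return False
    return True
```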

Page 34: Extracting knowledge from the World Wide Web

Communities on the Web

Page 35: Extracting knowledge from the World Wide Web

Communities on the Web

There are alternative approaches for identifying web communities:

Kumar et al. consider dense bipartite subgraphs as indications of communities.

Other approaches:

Bibliometric methods such as cocitation and bibliographic coupling

The PageRank algorithm

The HITS algorithm

Bipartite subgraph identification

Spreading activation energy

Page 36: Extracting knowledge from the World Wide Web

Conclusion

There are still many open problems:

The problem of uniformly sampling the web is still open in practice: which pages should be counted, and how can we reduce biases?

Web growth models approximate the true nature of how the web grows: how can the current models be refined to improve accuracy, while keeping the models relatively simple and easy to understand and analyze?

Finally, community identification remains an open area: how can the accuracy of community identification be improved, and how can communities be best structured or presented to account for differences of opinion in what is considered a community?

Page 37: Extracting knowledge from the World Wide Web

Thanks For Your Patience

Page 38: Extracting knowledge from the World Wide Web

Appendix

Page 39: Extracting knowledge from the World Wide Web

Google’s PageRank

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
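A minimal sketch of that iterative computation using exactly the formula above; the graph representation and the fixed iteration count are illustrative assumptions, and dangling pages are ignored for brevity:

```python
def pagerank(in_links, out_degree, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    in_links: dict page -> list of pages T that link to it.
    out_degree: dict page -> C(page), the number of out-links of that page.
    """
    pr = {page: 1.0 for page in in_links}   # initial guess
    for _ in range(iterations):
        pr = {page: (1 - d) + d * sum(pr[t] / out_degree[t]
                                      for t in in_links[page])
              for page in in_links}
    return pr
```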

Page 40: Extracting knowledge from the World Wide Web

Google’s PageRank

Example (d = 0.85):

PageRank(C) = 0.15 + 0.85 × (1.49/2 + 0.78/1) = 1.45

PageRank for 26 million web pages can be computed in a few hours on a medium-sized workstation (Brin & Page, 1998).

Page 41: Extracting knowledge from the World Wide Web

The Hostgraph Model