Web Graph & Link Analysis


http://net.pku.edu.cn/~wbia

Huang Lian'en (黄连恩), [email protected], School of Information Engineering, Peking University

09/17/2013



Web Graph

http://www.touchgraph.com/TGGoogleBrowser.html


Giant Global Graph

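Outline of This Lecture

Web graph measurement: How big is it? How connected is it? How are its nodes distributed? How far apart are its nodes?

Link Analysis: How do we measure the importance of a node on the Web?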

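How Big Is the Web?

Size of the Web: total number of pages

The size of the graph is unknowable and cannot even be defined precisely; what we can do is estimate a lower bound on the number of nodes in the Web graph

Number of pages indexed by search engines (crawled pages), e.g. the CNNIC survey reports on web pages on the Chinese Internet

Can we get closer to the true value?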

Capture-Recapture Model

Unknown number of fish in a lake
• Catch a sample and mark them
• Let them loose
• Recapture a sample and look for marks
• Estimate the population size

n1 = number in first sample = 15
n2 = number in second sample = 10
n12 = number in both samples = 5
N = total population size

Assume that n1/N = n12/n2, therefore 15/N = 5/10

N = (10 x 15) / 5 = 30
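The arithmetic above can be written as a small function. This is a minimal sketch of the estimate under the slide's assumption n1/N = n12/n2; the function name `capture_recapture_estimate` is made up for illustration.

```python
def capture_recapture_estimate(n1: int, n2: int, n12: int) -> float:
    """Estimate the population size N, assuming n1/N = n12/n2.

    n1  -- size of the first (marked) sample
    n2  -- size of the second sample
    n12 -- number of marked individuals found in the second sample
    """
    if n12 == 0:
        # No recaptures: the ratio gives no finite estimate.
        raise ValueError("n12 must be positive to estimate N")
    return n1 * n2 / n12


# The fish example from the slide: 15 marked, 10 recaptured, 5 of them marked.
print(capture_recapture_estimate(15, 10, 5))  # 30.0
```

The estimate is only as good as the assumption that the two samples are independent and drawn uniformly from the same population.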

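Size of the Web

Estimate a lower bound on the size of the indexed Web by analyzing the overlap between search engines (Steve Lawrence and C. Lee Giles, 1998*)

• $P(a) = n_a/N = n_0/n_b$, so $N = n_a n_b / n_0$, where $n_a$ and $n_b$ are the numbers of pages indexed by engines a and b, and $n_0$ is the number indexed by both

• During the test period, HotBot reported 110 million indexed pages
• Lower bound on the size of the indexable Web: 320 million pages (1998)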

Overlap Analysis

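Choose 6 popular search engines (SEs) and assume that the sets of pages they index are independent

Sampling: probe these SEs with 575 queries and analyze the overlap between their results

Use the overlap to estimate the fraction of the indexable Web that each SE covers

Use the known page count of one SE to estimate the size of the whole Web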
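The overlap procedure can be sketched on toy data. The URL sets below are invented for illustration (they are not data from the 1998 study), and the function name is an assumption; only the formula N = n_a / P(a), with P(a) estimated from the overlap, comes from the slides.

```python
def estimate_indexable_size(results_a: set, results_b: set, size_a: int) -> float:
    """Estimate the indexable Web size from the overlap of two engines.

    results_a, results_b -- URLs returned by engines a and b for the same queries
    size_a               -- number of pages engine a reports as indexed
    Assumes the two engines index pages independently.
    """
    overlap = len(results_a & results_b)
    if overlap == 0:
        raise ValueError("no overlap between the samples; cannot estimate")
    # P(a): fraction of b's sampled pages that a has also indexed.
    coverage_a = overlap / len(results_b)
    return size_a / coverage_a


# Hypothetical sampled result sets for two engines.
engine_a = {"u1", "u2", "u3", "u4"}
engine_b = {"u3", "u4", "u5", "u6"}
print(estimate_indexable_size(engine_a, engine_b, 110_000_000))  # 220000000.0
```

If the engines are positively correlated (for instance, both favor popular pages), the overlap is inflated and the result is best read as a lower bound.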

How Connected Is the Web?

The Shape of the Web

A large-scale study (AltaVista crawls) reveals interesting properties of the Web (Andrei Broder, 1999)
• Study of 200 million nodes & 1.5 billion links
• Some parts are unreachable, others have long paths
• Found a bow-tie structure

Bow-tie Components

Strongly Connected Component (SCC)
• Core

Upstream (IN)
• Core can't reach IN

Downstream (OUT)
• OUT can't reach core

Disconnected Tendrils & Tubes

Component Properties

Each component is roughly the same size: ~50 million nodes

Probability of a path between any 2 nodes: ~1 quarter (0.24)

Diameter: the maximum over node pairs of the shortest path length
• Maximal and average diameter are infinite (not every pair is connected)
• At least 28 for the SCC, over 500 for the entire graph

Average path length: 16 (when a directed path exists), 7 (undirected)

Shortest directed path between 2 nodes in the SCC: 16-20 links on average

Question: on a graph this large (200M nodes, 1.5G edges), how was the diameter computed?
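One plausible answer is sampling: run breadth-first search (BFS) from a number of randomly chosen start nodes and aggregate the distances found. The sketch below illustrates that idea on a toy graph; it is not claimed to be the exact procedure of the AltaVista study, and the function names are assumptions. The largest distance seen this way is only a lower bound on the true diameter.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Shortest directed distances (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def sampled_path_stats(adj, num_sources, seed=0):
    """Return (average distance, largest distance seen) over BFS runs from random sources.

    adj must list every node as a key. Only pairs joined by a directed path
    are counted, so unreachable pairs do not make the average infinite.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    sources = rng.sample(nodes, min(num_sources, len(nodes)))
    total, count, longest = 0, 0, 0
    for s in sources:
        for node, d in bfs_distances(adj, s).items():
            if node != s:
                total += d
                count += 1
                longest = max(longest, d)
    average = total / count if count else float("inf")
    return average, longest


# Toy graph: directed cycle 0 -> 1 -> 2 -> 3 -> 0, plus a dead end 3 -> 4.
toy = {0: [1], 1: [2], 2: [3], 3: [0, 4], 4: []}
print(sampled_path_stats(toy, num_sources=5))  # (2.125, 4)
```

Each BFS costs time linear in the number of edges, so a few hundred sampled sources are feasible even when an all-pairs computation is not.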

How Are Nodes on the Web Distributed?

In-degree distribution of sites

Which of the following cases would it be?

Power law

http://www.biomedcentral.com/1471-2148/7/S1/S16/figure/F1?highres=y

Power law

$P(x = k) = C k^{-\lambda}$
• A straight line appears on a log-log plot
• Rare events are not so rare!
• Long tail

Power Law Size and Connectivity

Site sizes (measured in number of pages) follow a power-law distribution, with λ between 1.6 and 1.9

Node degree (connections per node) follows a power-law distribution
• A study at Notre Dame University reported λ = 2.45 for the out-degree distribution
• λ = 2.1 for the in-degree distribution
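Because a power law is a straight line on a log-log plot, the exponent λ can be read off as the negated slope of that line. The sketch below fits the slope by ordinary least squares on synthetic counts; this is a crude estimator chosen for brevity (not necessarily how the cited studies obtained their values), and the function name and data are assumptions.

```python
import math


def fit_power_law_exponent(counts):
    """Estimate lambda in P(k) ~ C * k^(-lambda) from a degree histogram.

    counts -- dict mapping degree k (> 0) to the number of nodes with that degree (> 0)
    Fits a least-squares line to (log k, log count) and returns the negated slope.
    """
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic histogram that follows C * k^(-2.1) exactly.
synthetic = {k: 1e6 * k ** -2.1 for k in range(1, 101)}
print(round(fit_power_law_exponent(synthetic), 3))  # 2.1
```

On real, noisy degree data the sparse tail dominates a naive fit like this one, so the result should be treated as a rough indication only.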

Power Law Distribution - Examples

From "Graph structure in the Web" (AltaVista crawl, 1999)

Random Graph vs. Power Law Graph

Random graphs have a Poisson degree distribution if p is small
• Random uniform graph with random independent edges of fixed probability p
• $P(x = k) = e^{-\lambda} \lambda^k / k!$
• Decays exponentially fast to 0 as k increases towards its maximum value n-1

Power law graphs
• Decay polynomially for large values
• Power law graph: emerging order in a large graph created by many agents
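The difference between exponential and polynomial decay is easy to see numerically. The sketch below evaluates both distributions at a few degrees; the Poisson mean of 5 and the power-law exponent of 2.1 are illustrative choices, and the truncation at `k_max` is an assumption made only to normalize the power law.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(x = k) = e^(-mean) * mean^k / k!  (degrees in a sparse random graph)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def power_law_pmf(k: int, exponent: float, k_max: int = 100_000) -> float:
    """P(x = k) = C * k^(-exponent), with C chosen so that k = 1..k_max sums to 1."""
    c = 1.0 / sum(j ** -exponent for j in range(1, k_max + 1))
    return c * k ** -exponent


# Compare the probability of seeing a node of degree k under each model.
for k in (1, 5, 10, 50):
    print(k, poisson_pmf(k, mean=5.0), power_law_pmf(k, exponent=2.1))
```

At k = 50 the Poisson probability is many orders of magnitude below the power-law one, which is the sense in which "rare events are not so rare" in a power-law graph.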

Examples with Power Law Networks

Examples of networks with a power law distribution
• Internet at the router and interdomain level
• Citation network
• Collaboration network of actors
• Networks associated with metabolic pathways
• Networks formed by interacting genes and proteins
• Network of nervous system connections in C. elegans

How Far Apart Are Nodes on the Web?

What does this mean?

Size: 200M nodes, 1.5G edges

Average length: 16 (directed path exists), 7 (undirected)

Huge graph with small distance

It’s a small world

Small-World Networks

It is a 'small world'
• Millions of people, yet separated by "six degrees" of acquaintance relationships
• Popularized by Milgram's famous experiment

Mathematically
• Diameter of the graph is small (log N) compared to its overall size
• The property seems interesting given the 'sparse' nature of the graph, but …
• This property is 'natural' in 'pure' random graphs

The small world of WWW

Empirical study of the Web graph reveals the small-world property
• Graph generated using a power-law model
• Diameter properties inferred from sampling
• Calculation of the maximum diameter is computationally demanding for large values of n

Average distance (d) in the simulated Web: $d = 0.35 + 2.06 \log_{10}(n)$; e.g. for $n = 10^9$, $d \approx 19$
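The formula is simple enough to check directly. The sketch below assumes the logarithm is base 10, which is the reading consistent with the slide's own example (n = 10^9 giving d of about 19); the function name is made up.

```python
import math


def average_distance(n: float) -> float:
    """Average distance in the simulated Web: d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)


d_now = average_distance(10 ** 9)
d_tenfold = average_distance(10 ** 10)
print(round(d_now, 2))              # about 18.9, i.e. roughly 19 clicks
print(round(d_tenfold - d_now, 2))  # extra clicks after a 10-fold growth
```

Each 10-fold increase in n adds the coefficient 2.06 to d, which is the "about 2 more clicks" behavior of logarithmic scaling.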

Implications for Web

Logarithmic scaling of the diameter makes future growth of the Web manageable
• A 10-fold increase in web pages results in only about 2 additional 'clicks', but …

Robustness and vulnerability

How are diameter and connectivity affected by deleting nodes randomly?
• Scale-free graphs are more robust than random uniform graphs

What if specific nodes are targeted?
• Diameter doubles when the 5% most important nodes are removed

Does the topology change under attack?
• The graph fragments and breaks down
• Phase change at deletion ratio: 0.28 in an exponential graph & 0.18 in a scale-free network

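How Do We Measure Node Importance on the Web?

Evaluating the Importance of Web Pages

The PageRank algorithm and the HITS (Hyperlink-Induced Topic Search) algorithm

Both exploit the link structure of HTML pages to improve query results. When spam pages flood a search engine's result pages, the quality/importance of a page itself starts to matter, in addition to how relevant its content is to the query.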

Larry Page & Sergey Brin

Jon Kleinberg

PageRank

Why and how does it work?

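Measuring Importance

First-order measure ("in-degree")
• "Knows-of" relation: social fame
• Citation relation: degree of recognition

"Higher-order measure": the co-authorship "distance" to a famous person; the shorter it is, the more "prestigious" it seems (e.g., the Erdős number)

Paul Erdős, Liu Xiang

As many people may know A as know B, but if the people who know B are all "important people", we usually consider B more important than A

The same holds for papers, not only people: a paper cited by important papers is likely to be more important

Who is more important?

How can we capture this intuition in a model, so that the computed "importance" reflects it?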

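Reputation Model

Given a population S and a "knows-of" relation R on it, we obtain a directed "relation graph" G. Represent it by an adjacency matrix E, with E(i,j) = 1 if and only if i has "heard of" j (note that there are no degrees of familiarity here). We want to determine p(i): the "reputation" of every individual i ∈ S.

Model 1: $p(i) = \sum_{k=1}^{n} E[k,i]$, i.e. the in-degree of i in G, which is the number of 1s in column i of E. Clear and easy to compute, but "not good enough".

Model 2: $p(i) = \sum_{k=1}^{n} E[k,i]\,p(k)$, i.e. the reputation of i is the sum of the reputations of those who know of i. Clear, and seemingly more "precise"; but is it easy to compute?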
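Model 1 is just a column sum of the adjacency matrix. The sketch below computes it for a small relation on three individuals; the matrix and the function name are illustrative assumptions.

```python
def in_degree_reputation(E):
    """Model 1: p(i) = sum over k of E[k][i], i.e. the in-degree of node i."""
    n = len(E)
    return [sum(E[k][i] for k in range(n)) for i in range(n)]


# E[i][j] = 1 iff individual i has heard of individual j (0-based indices).
E = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
print(in_degree_reputation(E))  # [0, 1, 2]
```

Model 2 has no such direct formula: p appears on both sides of its definition, so it has to be solved as a system of equations.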

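Reputation Model 2

For all i, $p(i) = \sum_{k=1}^{n} E[k,i]\,p(k)$. That is, writing $p = (p(1), p(2), \dots, p(n))^T$,

$p = E^T p$

The questions are:
• Does this equation have a solution?
• If it does, how do we obtain it?
• If it does not, what should we do?

In general, this equation has no nonzero solution!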

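An Example Where $p = E^T p$ Has No Nonzero Solution

S = {1,2,3}, R = {<1,2>,<1,3>,<2,3>}

$E = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}, \quad E^T = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix}$

It is easy to see that the equation holds only if p(1) = 0, p(2) = 0, p(3) = 0.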

In general, $p = E^T p$ requires $E^T$ to have eigenvalue 1, which rarely happens.

[Figure: directed graph on nodes 1, 2, 3]

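The earlier 4-node example has no solution either

$p = E^T p \;\Rightarrow\; (I - E^T)\,p = 0$. From linear algebra, this system has a nonzero solution only if the determinant $|I - E^T| = 0$

But we computed $|I - E^T| = 2$

Even when a solution exists, it may not be unique!

S = {1,2,3}, R = {<1,2>,<2,3>,<3,1>}. It is easy to see that any p(1) = p(2) = p(3) is a solution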
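Both failure modes can be checked by applying $p \leftarrow E^T p$ directly. In the sketch below (function name assumed), the acyclic relation R = {<1,2>,<1,3>,<2,3>} sends every vector to zero within three applications, so a fixed point p = E^T p must itself be zero; the 3-cycle leaves every constant vector unchanged, so its solutions are not unique.

```python
def apply_model_two(E, p):
    """One application of p <- E^T p, i.e. new_p[i] = sum over k of E[k][i] * p[k]."""
    n = len(E)
    return [sum(E[k][i] * p[k] for k in range(n)) for i in range(n)]


# R = {<1,2>,<1,3>,<2,3>} (0-based rows/columns): no cycles.
E_acyclic = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
p = [1.0, 1.0, 1.0]
for _ in range(3):
    p = apply_model_two(E_acyclic, p)
print(p)  # [0.0, 0.0, 0.0]

# R = {<1,2>,<2,3>,<3,1>}: a single directed cycle.
E_cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
for c in (1.0, 2.5):
    print(apply_model_two(E_cycle, [c, c, c]))  # the same constant vector comes back
```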

What should we do?

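The "Random Walker" Model

Imagine a person who browses the Web without ever stopping, each time following a randomly chosen outgoing link of the current page. We ask: in the steady state (after a sufficiently long time), which page will this person be viewing?

Equivalently: in the steady state, each page v has a probability p(v) of being visited, and this can serve as a measure of the page's importance.

We can reasonably suppose that the probability of arriving at v now depends on the probabilities of having been, one step earlier, at the pages that link to v, and on the number of hyperlinks on those pages.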

Random walker model

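In the steady state:

$p(v) = \sum_{u} E[u,v]\,p(u) / d_u$

Here $d_u = \sum_{v} E[u,v]$ is the out-degree of page u, and

$\sum_{u} p(u) = 1$

[Figure: page v with incoming links from pages u1, u2, u3, u4, u5]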

Random Walker Model (continued)

Rewriting: let

$L[u,v] = \dfrac{E[u,v]}{d_u}$

Then

$p[v] = \sum_{u} L[u,v]\,p[u]$

or, in matrix form,

$p = L^T p$, with $\|p\|_1 = 1$

Formally this is the same as the "reputation" model, except that each row of the matrix L sums to 1.

Does that help?

Dangling node (a node with out-degree 0): for such a node the corresponding row of L is all zeros, so it does not sum to 1. Fix: L[u,v] = 1/N if $d_u = 0$
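Building L from an adjacency matrix, including the dangling-node fix, takes only a few lines. This is a minimal sketch with an assumed function name and a toy matrix whose third node has no outgoing links.

```python
def transition_matrix(E):
    """Row-normalize adjacency matrix E: L[u][v] = E[u][v] / d_u.

    A dangling row (out-degree d_u = 0) is replaced by the uniform row 1/N,
    so that every row of L sums to 1.
    """
    n = len(E)
    L = []
    for row in E:
        out_degree = sum(row)
        if out_degree == 0:
            L.append([1.0 / n] * n)
        else:
            L.append([x / out_degree for x in row])
    return L


E = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],  # dangling node: no outgoing links
]
L = transition_matrix(E)
for row in L:
    print(row, sum(row))
```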

Stochastic matrix

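A matrix M with nonnegative entries whose rows each sum to 1 (also called a Markov transition matrix)

L is such a matrix. Clearly, the largest eigenvalue of a stochastic matrix is 1, with the all-ones vector as a corresponding eigenvector: $M\mathbf{1} = \mathbf{1}$

The determinant of a transposed matrix equals the determinant of the original matrix:

$Mx = \lambda x \;\Rightarrow\; |\lambda I - M| = 0 \;\Leftrightarrow\; |\lambda I - M^T| = 0$

So 1 is also the largest eigenvalue of $L^T$!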

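One More Problem

The "random surfing" model above has a steady-state solution provided that the directed graph formed by the pages lets every page be reached by following links

Two situations break this condition:
• "Cycles" forming in the graph (rank bounce)
• Nodes with in-degree or out-degree 0 (rank sink)

The model is therefore usually stated with the requirement that the graph be irreducible (strongly connected) and aperiodic (no cycles that can be entered but never left).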

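Refining the Model Further

Let the surfer follow a hyperlink with probability (1-β) at each step, and with probability β jump to a new, randomly chosen start node. Physically, this means it is always possible to jump into a node with in-degree 0 and to jump out of those "cycles". In the model this becomes

$p = (1-\beta)\,L^T p + \dfrac{\beta}{N}\,\mathbf{1}_N = \left[(1-\beta)\,L^T + \dfrac{\beta}{N}\,\mathbf{1}_N \mathbf{1}_N^T\right] p$, using $\mathbf{1}_N^T p = 1$

β is chosen between 0.1 and 0.2; the probability 1-β of following a link corresponds to the damping factor (Page & Brin 1997)

$G = (1-\beta)\,L^T + \dfrac{\beta}{N}\,\mathbf{1}_N \mathbf{1}_N^T$ is called the Google Matrix, where $\mathbf{1}_N \mathbf{1}_N^T$ is the N×N all-ones matrix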

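Solving for the Eigenvector of the Google Matrix

Power Iteration method: given the Google Matrix G, let $|\lambda_1| \ge |\lambda_2| \ge \dots$ be its eigenvalues and let $q_1$ be the eigenvector belonging to $\lambda_1$

Initialize a vector $p_0$ such that $\|p_0\|_1 = 1$. For k = 1, 2, …, perform the following steps:

$x = G\,p_{k-1}$ (basic iteration); $p_k = x / \|x\|_1$ (normalization step)

It can be shown (convergence rate) that $|p_k - q_1| = O(|\lambda_2/\lambda_1|^k)$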

$p_{i+1} = (1-\beta)\,L^T p_i + \dfrac{\beta}{N}\,\mathbf{1}_N$
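The power iteration can be sketched in plain Python. The function name, the default of 100 iterations, and the 3-node matrix are illustrative assumptions; the update itself is the one from the slides, $p \leftarrow (1-\beta)\,L^T p + (\beta/N)\,\mathbf{1}_N$, followed by L1 normalization.

```python
def pagerank(L, beta=0.15, iterations=100):
    """Power iteration for p = (1-beta) L^T p + (beta/N) 1.

    L must be row-stochastic (every row sums to 1, dangling rows already fixed).
    Starts from the uniform vector, so ||p0||_1 = 1.
    """
    n = len(L)
    p = [1.0 / n] * n
    for _ in range(iterations):
        # Basic iteration: x = G p, written without forming G explicitly.
        x = [(1 - beta) * sum(L[u][v] * p[u] for u in range(n)) + beta / n
             for v in range(n)]
        # Normalization step: divide by the L1 norm (entries are nonnegative).
        norm = sum(x)
        p = [xi / norm for xi in x]
    return p


# Toy row-stochastic matrix; the last row is a dangling node replaced by 1/N.
L = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1 / 3, 1 / 3, 1 / 3],
]
p = pagerank(L)
print([round(x, 4) for x in p])
```

Avoiding the explicit N×N matrix G matters at Web scale: L is sparse, while G is completely dense.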

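Questions

For the Power Iteration algorithm that computes the eigenvector of the Google Matrix:
• Is it guaranteed to terminate after a finite number of iterations?
• Is it guaranteed to yield a meaningful solution (positive, with $\|p\|_1 = 1$)?
• Is it guaranteed to yield a unique solution?
• Does it converge to the same solution regardless of the initial value $p_0$?
• Does it need many iterations to converge?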

Amy Langville, Carl Meyer, Google's PageRank and Beyond:

The Science of Search Engine Rankings. Princeton University Press, 2006.


Example (power iteration)

[Matrix L: the 11×11 row-stochastic transition matrix of the example graph. The nonzero entries of a row are 1, 1/2 or 1/3, and the dangling node's row is 1/11 in every column.]

Solving a Small-Scale Instance

Take β = 0.15: $G = 0.85\,L^T + \dfrac{0.15}{11}\,\mathbf{1}\mathbf{1}^T$, $p_0 = (1/11, 1/11, \dots)^T$

$p_1 = G\,p_0$, …

Power Iteration (50 iterations) gives $p = (0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016, \dots)^T$
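The computation can be reproduced in a few lines. The entries of L for this example are not fully recoverable from the slide, so the edge list below is an assumption: a widely used 11-node illustration with one dangling node, whose rows have the same pattern of values (1, 1/2, 1/3, and one row of 1/11). The function name and the 0-based node numbering are also made up.

```python
def pagerank_from_edges(n, edges, beta=0.15, iterations=100):
    """Power iteration p <- (1-beta) L^T p + (beta/n) 1, with L built from edges.

    A node with no outgoing edge is treated as linking to all n nodes (row of 1/n).
    """
    out_links = [[] for _ in range(n)]
    for u, v in edges:
        out_links[u].append(v)
    p = [1.0 / n] * n
    for _ in range(iterations):
        x = [beta / n] * n
        for u in range(n):
            targets = out_links[u] if out_links[u] else range(n)
            share = (1 - beta) * p[u] / len(targets)
            for v in targets:
                x[v] += share
        total = sum(x)
        p = [xi / total for xi in x]
    return p


# Assumed edge list (u, v) meaning "u links to v"; node 0 is the dangling node.
edges = [(1, 2), (2, 1), (3, 0), (3, 1), (4, 1), (4, 3), (4, 5), (5, 1), (5, 4),
         (6, 1), (6, 4), (7, 1), (7, 4), (8, 1), (8, 4), (9, 4), (10, 4)]
p = pagerank_from_edges(11, edges)
print([round(x, 3) for x in p])
```

With this assumed graph the leading entries come out at 0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016 after rounding, in agreement with the vector quoted on the slide.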

You can try this in MATLAB

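Summary of This Lecture

Properties of the Web graph
• Capture-recapture
• Bow-tie structure
• Power law
• Small world

Graph Link Analysis
• Reputation model
• Random Walker model
• PageRank algorithm: $p_{i+1} = (1-\beta)\,L^T p_i + \dfrac{\beta}{N}\,\mathbf{1}_N$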

Thank You!

Q&A

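HOMEWORK

Larry Page & Sergey Brin, The PageRank citation ranking: bringing order to the web (1999)

Submit a 1-2 page report:

1. Based on the theory of PageRank, design some methods to improve the ranking of your own website

2. Analyze the flaws and shortcomings of PageRank, and propose your own solutions

3. Today's Web has changed a great deal compared with 1998. Consider whether PageRank is still suitable for today's Web, and why.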
