
Page 1: The Linear Algebra Behind Google Ppt

The Linear Algebra Behind Google

by

SINDHUJARANI.V

Roll No. [10212338]

May 1, 2012

Page 2: The Linear Algebra Behind Google Ppt

Outline of the Talk

⇒ Basic working of Google.

⇒ The PageRank algorithm.

⇒ Solving the PageRank problem as an eigensystem.

⇒ Solving the PageRank problem as a linear system.


Page 3: The Linear Algebra Behind Google Ppt

Introduction

⇒ The Internet can be seen as a large directed graph, where the Web pages themselves represent vertices and their links represent edges.

⇒ The PageRank algorithm ranks the vertices by their back links.

⇒ Vertices with more back links are considered more important.


Page 4: The Linear Algebra Behind Google Ppt

Example: Figure 1

[Figure 1: a directed graph on four pages with edges 1→2, 1→3, 1→4, 2→3, 2→4, 3→1, 4→1, and 4→3.]


Page 5: The Linear Algebra Behind Google Ppt

⇒ We have the votes for the pages: x1 = 2, x2 = 1, x3 = 3, and x4 = 2. So page 3 is the most important, pages 1 and 4 are second most important, and page 2 is least important.

Drawback:

⇒ Not all votes are equally important. A vote from a page with low importance should count for less than a vote from a more important page.

⇒ To account for this, each vote's importance is divided by the number of different votes the page casts.


Page 6: The Linear Algebra Behind Google Ppt

Matrix Model

⇒ The new voting scheme can be encoded as a matrix of the form

Aij = \begin{cases} 1/Nj & \text{if } Pj \text{ links to } Pi \\ 0 & \text{otherwise,} \end{cases} \qquad (1)

where Nj is the number of out links from page Pj.

Recursive form:

⇒ The recursive form per page is defined as

ri = \sum_{j \in Li} rj/Nj, \qquad (2)

where ri is the page rank of page Pi, Nj is the number of out links from page Pj, and Li is the set of pages that link to page Pi.


Page 7: The Linear Algebra Behind Google Ppt

⇒ Let's apply this approach to Figure 1. For page 1 the recursive form gives x1 = x3 + x4/2; for page 2, x2 = x1/3; for page 3, x3 = x1/3 + x2/2 + x4/2; and for page 4, x4 = x1/3 + x2/2.

⇒ Now, these linear equations can be written as Ax = x, where x = [x1, x2, x3, x4]^T, and in matrix form

A = \begin{pmatrix} 0 & 0 & 1 & 1/2 \\ 1/3 & 0 & 0 & 0 \\ 1/3 & 1/2 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix},

which transforms the web ranking problem into the "standard" problem of finding an eigenvector for a square matrix.

⇒ In this case, we obtain x1 ≃ 0.387, x2 ≃ 0.129, x3 ≃ 0.290, and x4 ≃ 0.194, so page 1 gets rank 1, page 3 gets rank 2, page 4 gets rank 3, and page 2 gets rank 4.
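The vector above can be reproduced with a few lines of power iteration; this is a minimal illustrative sketch (not part of the original slides), built on the Figure 1 link matrix A:

```python
import numpy as np

# Link matrix A of Figure 1, equation (1): column j holds 1/Nj for every page
# that page j links to.
A = np.array([
    [0,   0,   1.0, 0.5],
    [1/3, 0,   0,   0  ],
    [1/3, 0.5, 0,   0.5],
    [1/3, 0.5, 0,   0  ],
])

x = np.full(4, 0.25)      # start from the uniform vector
for _ in range(1000):     # power iteration: x <- A x
    x = A @ x
    x /= x.sum()          # keep the entries summing to 1

print(np.round(x, 3))     # -> [0.387 0.129 0.29  0.194]
```

The exact eigenvector is [12, 4, 9, 6]/31, which matches the rounded values quoted on the slide.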


Page 8: The Linear Algebra Behind Google Ppt

Speciality of the matrix A

Definition:

A square matrix is called a column stochastic matrix if all of its entries are nonnegative and the entries in each column sum to 1.

⇒ A is a column stochastic matrix.

⇒ A has 1 as an eigenvalue.

⇒ The all-ones row vector e is a left eigenvector of A (eA = e), which is why 1 is an eigenvalue.


Page 9: The Linear Algebra Behind Google Ppt

Difficulties arise when using formula (2)

⇒ Stuck at a page.

⇒ Nonunique rankings.

⇒ Stuck in a subgraph.


Page 10: The Linear Algebra Behind Google Ppt

Stuck at a page

Definition:

A node that has no out links is called a dangling node.

⇒ If the graph has a dangling node, then the link matrix has a column of zeros for that node, so the matrix is not column stochastic.

⇒ To modify the link matrix into a column stochastic matrix, replace every zero in each zero column with 1/n, where n is the dimension of the matrix.

⇒ The modified matrix is

Ā = A + (1/n) e^T d, \qquad (3)

where e is a row vector of ones and d is a row vector defined as

dj = \begin{cases} 1 & \text{if } Nj = 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (4)


Page 11: The Linear Algebra Behind Google Ppt

Example: Figure 2

[Figure 2: a directed graph on six Web pages; page 1 has no out links (a dangling node). Its link structure is encoded in the matrix on the next slide.]


Page 12: The Linear Algebra Behind Google Ppt

For Figure 2, we have d = [1, 0, 0, · · · , 0]. Thus

Ā = A + (1/n) e^T d =
\begin{pmatrix}
0 & 1/2 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 0 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 0 & 1/2 & 1 \\
0 & 0 & 0 & 1/2 & 1/2 & 0 \\
0 & 0 & 0 & 1/2 & 0 & 0
\end{pmatrix}
+
\begin{pmatrix}
1/6 & 0 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
=
\begin{pmatrix}
1/6 & 1/2 & 1/3 & 0 & 0 & 0 \\
1/6 & 0 & 1/3 & 0 & 0 & 0 \\
1/6 & 1/2 & 0 & 0 & 0 & 0 \\
1/6 & 0 & 1/3 & 0 & 1/2 & 1 \\
1/6 & 0 & 0 & 1/2 & 1/2 & 0 \\
1/6 & 0 & 0 & 1/2 & 0 & 0
\end{pmatrix}

With the creation of the matrix Ā, we have a column stochastic matrix.
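As a quick numerical check (an illustrative sketch, not part of the original slides), equations (3) and (4) can be applied to the Figure 2 link matrix and the column sums verified:

```python
import numpy as np

# Link matrix A of Figure 2: column 1 is all zeros because page 1 is dangling.
A = np.array([
    [0, 0.5, 1/3, 0,   0,   0  ],
    [0, 0,   1/3, 0,   0,   0  ],
    [0, 0.5, 0,   0,   0,   0  ],
    [0, 0,   1/3, 0,   0.5, 1.0],
    [0, 0,   0,   0.5, 0.5, 0  ],
    [0, 0,   0,   0.5, 0,   0  ],
])
n = A.shape[0]

d = (A.sum(axis=0) == 0).astype(float)   # equation (4): flag the dangling columns
A_bar = A + np.outer(np.ones(n), d) / n  # equation (3): A + (1/n) e^T d

print(A_bar.sum(axis=0))                 # every column now sums to 1
```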


Page 13: The Linear Algebra Behind Google Ppt

Nonunique rankings

⇒ For our rankings, it is desirable that the dimension of V1(A) (the eigenspace corresponding to the eigenvalue 1) equals 1, so that there is a unique nonzero eigenvector x with \sum_i xi = 1 which can be used for page ranking.

⇒ This is not always true in general.


Page 14: The Linear Algebra Behind Google Ppt

Example: Figure 3

[Figure 3: a directed graph on five pages consisting of two disconnected two-cycles, 1 ↔ 2 and 3 ↔ 4; page 5 links to pages 3 and 4 but receives no links.]


Page 15: The Linear Algebra Behind Google Ppt

⇒ The link matrix of Figure 3 is

A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1/2 \\ 0 & 0 & 1 & 0 & 1/2 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.

We find here that V1(A) is two-dimensional. One possible pair of basis vectors is x = [1/2, 1/2, 0, 0, 0]^T and y = [0, 0, 1/2, 1/2, 0]^T.

⇒ We know that any linear combination of these two vectors yields another vector in V1(A), so we face a problem in the ranking.
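The two-dimensional eigenspace can be confirmed numerically; a small sketch (not part of the original slides) counting the multiplicity of the eigenvalue 1:

```python
import numpy as np

# Link matrix of Figure 3: two disconnected two-cycles, plus page 5 which
# links out to pages 3 and 4 but receives no links.
A = np.array([
    [0, 1, 0, 0, 0  ],
    [1, 0, 0, 0, 0  ],
    [0, 0, 0, 1, 0.5],
    [0, 0, 1, 0, 0.5],
    [0, 0, 0, 0, 0  ],
], dtype=float)

vals, vecs = np.linalg.eig(A)
multiplicity = int(np.isclose(vals, 1.0).sum())
print(multiplicity)   # -> 2, so V1(A) is two-dimensional
```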


Page 16: The Linear Algebra Behind Google Ppt

Overcoming dim(V1(A)) > 1

⇒ To solve this problem we modify equation (2). The analysis that follows is basically a special case of the Perron–Frobenius theorem.

Perron–Frobenius theorem:

Let B be an n × n matrix with nonnegative real entries. Then we have the following:

1. B has a nonnegative real eigenvalue. The largest such eigenvalue, λ(B), dominates the absolute values of all other eigenvalues of B. The domination is strict if the entries of B are strictly positive.

2. If B has strictly positive entries, then λ(B) is a simple positive eigenvalue, and the corresponding eigenvector can be normalized to have strictly positive entries.

3. If B has an eigenvector v with strictly positive entries, then the corresponding eigenvalue λ is λ(B).


Page 17: The Linear Algebra Behind Google Ppt

A Modification of the Link Matrix A

⇒ For an n-page web with no dangling nodes, we replace the matrix A with the matrix

Â = αA + (1 − α)S. \qquad (5)

⇒ For an n-page web with dangling nodes, we replace the matrix A with the matrix

Â = αĀ + (1 − α)S, \qquad (6)

where 0 ≤ α ≤ 1 is called a damping factor, Ā is the dangling-node fix of equation (3), and S denotes the n × n matrix with all entries 1/n. The matrix S is column stochastic, and V1(S) is one-dimensional.


Page 18: The Linear Algebra Behind Google Ppt

Speciality of the matrix Â

1. All the entries Âij satisfy 0 ≤ Âij ≤ 1.

2. Each column sums to one: \sum_i Âij = 1 for all j.

3. If α = 1, we recover the original problem, Â = A.

4. If α = 0, then Â = S.


Page 19: The Linear Algebra Behind Google Ppt

Random walker

⇒ The random walker starts from a random page, and then selects one of the out links from that page in a random fashion.

⇒ The page rank of a specific page can now be viewed as the asymptotic probability that the walker is present at the page.

⇒ This is plausible, as the walker is more likely to wander to pages with many votes (lots of in links), giving him a large probability of ending up at such pages.


Page 20: The Linear Algebra Behind Google Ppt

Stuck in a subgraph

⇒ There is still one possible pitfall in the ranking: the walker may wander into a subsection of the complete graph that does not link to any outside pages.

⇒ The link matrix for this model is a reducible matrix.

⇒ We therefore want the matrix to be irreducible, which makes sure the walker cannot get stuck in a subgraph. Irreducibility is obtained by "teleportation", the ability to jump with a small probability from any page in the link structure to any other page. Mathematically, for a web with no dangling nodes:

Â = αA + (1 − α)(1/n) e^T e. \qquad (7)

For a web with dangling nodes:

Â = αĀ + (1 − α)(1/n) e^T e, \qquad (8)

where e is a row vector of ones, and α is a damping factor.


Page 21: The Linear Algebra Behind Google Ppt

Example: Figure 4

[Figure 4: the same six-page web as Figure 2; page 1 is a dangling node.]


Page 22: The Linear Algebra Behind Google Ppt

⇒ The link matrix for Figure 4, using equation (8), is

Â = αĀ + (1 − α)(1/n) e^T e =
\begin{pmatrix}
1/6 & 7/15 & 19/60 & 1/60 & 1/60 & 1/60 \\
1/6 & 1/60 & 19/60 & 1/60 & 1/60 & 1/60 \\
1/6 & 7/15 & 1/60 & 1/60 & 1/60 & 1/60 \\
1/6 & 1/60 & 19/60 & 1/60 & 7/15 & 11/12 \\
1/6 & 1/60 & 1/60 & 7/15 & 7/15 & 1/60 \\
1/6 & 1/60 & 1/60 & 7/15 & 1/60 & 1/60
\end{pmatrix}.

Here α is set to 0.9, and the resulting matrix Â is column stochastic.

⇒ Adding the term (1 − α)(1/n) e^T e gives an equal chance of jumping to any page.
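The matrix can be rebuilt directly from equation (8); an illustrative sketch (not part of the original slides), assuming α = 0.9 and n = 6:

```python
import numpy as np

# Figure 4 link matrix (page 1 dangling), damping factor alpha = 0.9.
A = np.array([
    [0, 0.5, 1/3, 0,   0,   0  ],
    [0, 0,   1/3, 0,   0,   0  ],
    [0, 0.5, 0,   0,   0,   0  ],
    [0, 0,   1/3, 0,   0.5, 1.0],
    [0, 0,   0,   0.5, 0.5, 0  ],
    [0, 0,   0,   0.5, 0,   0  ],
])
n, alpha = 6, 0.9

d = (A.sum(axis=0) == 0).astype(float)
A_bar = A + np.outer(np.ones(n), d) / n    # dangling fix, equation (3)
A_hat = alpha * A_bar + (1 - alpha) / n    # teleportation, equation (8)

print(np.allclose(A_hat.sum(axis=0), 1))   # column stochastic -> True
print(bool((A_hat > 0).all()))             # strictly positive -> True
print(np.isclose(A_hat[0, 1], 7/15))       # e.g. entry (1,2) is 7/15 -> True
```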


Page 23: The Linear Algebra Behind Google Ppt

Analysis of the matrix A

Definition:

A matrix A is positive if Aij > 0 for all i and j.

⇒ If A is positive and column stochastic, then any eigenvector in V1(A) has all positive or all negative components.

⇒ If A is positive and column stochastic, then V1(A) has dimension 1.


Page 24: The Linear Algebra Behind Google Ppt

Solution Methods for Solving the Page rank Problem

⇒ Computing the page rank is the same as finding the eigenvector corresponding to the largest eigenvalue of the matrix A.

⇒ To solve this we need an iterative method that works well for large sparse matrices.

⇒ There are two approaches to solving the page rank problem:

1. As an eigensystem problem (the power method).

2. As a linear system problem (Jacobi method, Gauss–Seidel method, SOR method, etc.).


Page 25: The Linear Algebra Behind Google Ppt

The power method

⇒ The power method is a simple method for finding the largest eigenvalue and corresponding eigenvector of a matrix.

⇒ It can be used when A has a dominant eigenvalue.

⇒ Consider iterates of the power method applied to Â:

Â x^{(k-1)} = αA x^{(k-1)} + α(1/n) e^T d\, x^{(k-1)} + (1 − α)(1/n) e^T e\, x^{(k-1)} = x^{(k)},

where x^{(k-1)} is a probability vector, and thus e x^{(k-1)} = 1.
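In practice the dense matrix is never formed; the iterate above is computed term by term against the sparse link matrix. A minimal sketch (illustrative, not part of the original slides; `pagerank_step` is a hypothetical helper name):

```python
import numpy as np

# One power-method step computed term by term: alpha*A*x, plus the dangling
# correction alpha*(1/n)*e^T d x, plus teleportation (1-alpha)*(1/n)*e^T e x,
# using the fact that e x = 1 for a probability vector x.
def pagerank_step(A, x, alpha=0.85):
    n = len(x)
    y = alpha * (A @ x)
    dangling_mass = x[A.sum(axis=0) == 0].sum()  # rank sitting on dangling pages
    y += alpha * dangling_mass / n
    y += (1 - alpha) / n
    return y

# Figure 2 link matrix (page 1 dangling).
A = np.array([
    [0, 0.5, 1/3, 0,   0,   0  ],
    [0, 0,   1/3, 0,   0,   0  ],
    [0, 0.5, 0,   0,   0,   0  ],
    [0, 0,   1/3, 0,   0.5, 1.0],
    [0, 0,   0,   0.5, 0.5, 0  ],
    [0, 0,   0,   0.5, 0,   0  ],
])
x = np.full(6, 1 / 6)
for _ in range(100):
    x = pagerank_step(A, x)
print(np.round(x, 3))   # the iterate stays a probability vector
```

For a real web graph `A` would be a SciPy sparse matrix; only the `A @ x` product changes.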


Page 26: The Linear Algebra Behind Google Ppt

Convergence of the power method

⇒ Rescale the power method at each iteration: x_k = A x_{k-1} / ‖A x_{k-1}‖, where ‖ · ‖ can be any vector norm.

⇒ Every positive column stochastic matrix A has a unique vector x with positive components such that Ax = x and ‖x‖₁ = 1. The vector x can be computed as x = lim_{k→∞} A^k x₀ for any initial guess x₀ with positive components such that ‖x₀‖₁ = 1.

⇒ The rate of convergence of the power method is linear, with ratio |λ₂/λ₁|.


Page 27: The Linear Algebra Behind Google Ppt

Linear system problem

⇒ We begin by formulating the page rank problem as a linear system.

⇒ The eigensystem problem Â x = αAx + (1 − α)(1/n) e^T e\, x = x can be rewritten as

(I − αA)x = (1 − α)(1/n) e^T =: b. \qquad (9)

⇒ We split the matrix (I − αA) as

(I − αA) = L + D + U, \qquad (10)

where D is the diagonal part, and L and U are the strictly lower triangular and strictly upper triangular parts, respectively.


Page 28: The Linear Algebra Behind Google Ppt

Properties of (I − αA)

1. (I − αA) is an M-matrix.

2. (I − αA) is nonsingular.

3. The column sums of (I − αA) are 1 − α.

4. ‖I − αA‖₁ = 1 + α, provided at least one nondangling node exists.

5. Since (I − αA) is an M-matrix, (I − αA)^{-1} ≥ 0.

6. The column sums of (I − αA)^{-1} are 1/(1 − α). Therefore ‖(I − αA)^{-1}‖₁ = 1/(1 − α).

7. Thus, the condition number κ₁(I − αA) = (1 + α)/(1 − α).

Definition:

A real matrix A that has Aij ≤ 0 when i ≠ j and Aii ≥ 0 for all i can be expressed as A = sI − B, where s > 0 and B ≥ 0. When s ≥ ρ(B), A is called an M-matrix. An M-matrix can be either singular or nonsingular.


Page 29: The Linear Algebra Behind Google Ppt

Jacobi method

⇒ The Jacobi method can be applied to the Google matrix splitting (10):

(L + D + U)x = b

D x_k = b − (L + U)x_{k-1}

x_k = D^{-1}[b − (L + U)x_{k-1}],

where D is an invertible matrix.
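The update above can be sketched for the Figure 1 web (an illustrative sketch, not part of the original slides, with an assumed α = 0.85):

```python
import numpy as np

# Jacobi iteration x_k = D^{-1}(b - (L+U) x_{k-1}) for (I - alpha*A) x = b.
A = np.array([
    [0,   0,   1.0, 0.5],
    [1/3, 0,   0,   0  ],
    [1/3, 0.5, 0,   0.5],
    [1/3, 0.5, 0,   0  ],
])
alpha = 0.85
n = A.shape[0]

M = np.eye(n) - alpha * A          # M = I - alpha*A = L + D + U
D = np.diag(np.diag(M))            # diagonal part (here simply I, since diag(A) = 0)
LU = M - D                         # strictly lower plus strictly upper parts
b = (1 - alpha) / n * np.ones(n)   # right-hand side of equation (9)

x = np.full(n, 1 / n)
for _ in range(100):
    x = np.linalg.solve(D, b - LU @ x)

print(np.round(x / np.sum(x), 3))  # normalised page rank vector
```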


Page 30: The Linear Algebra Behind Google Ppt

Gauss Seidel method

⇒ The Gauss–Seidel method can be applied to the Google matrix splitting (10):

(L + D + U)x = b

(L + D)x_k = b − U x_{k-1}

x_k = (L + D)^{-1}[b − U x_{k-1}],

where (L + D) is an invertible matrix.

⇒ The Gauss–Seidel method converges much faster than the power and Jacobi methods.

⇒ The disadvantage is that it is very hard to parallelize.


Page 31: The Linear Algebra Behind Google Ppt

SOR method

⇒ The SOR method can be applied to the Google matrix splitting (10):

ω(L + D + U)x = ωb

(ωL + D)x_k = ω(b − U x_{k-1}) + (1 − ω)D x_{k-1}

x_k = (ωL + D)^{-1}[ω(b − U x_{k-1}) + (1 − ω)D x_{k-1}],

where 1 ≤ ω ≤ 2. When ω = 1, the method reduces to Gauss–Seidel. Here (ωL + D) is an invertible matrix.

⇒ The SOR method is more expensive per iteration and less efficient in parallel computing for huge matrix systems.
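The SOR update can be sketched as follows (illustrative, not part of the original slides; `sor_solve` is a hypothetical helper name, shown with ω = 1, i.e. Gauss–Seidel, on the Figure 1 system):

```python
import numpy as np

# SOR iteration x_k = (omega*L + D)^{-1}[omega*(b - U x_{k-1}) + (1-omega)*D x_{k-1}].
def sor_solve(M, b, omega=1.0, iters=100):
    D = np.diag(np.diag(M))
    L = np.tril(M, k=-1)         # strictly lower triangular part of M
    U = np.triu(M, k=1)          # strictly upper triangular part of M
    x = np.zeros_like(b)
    for _ in range(iters):
        rhs = omega * (b - U @ x) + (1 - omega) * (D @ x)
        x = np.linalg.solve(omega * L + D, rhs)
    return x

# Figure 1 system: M = I - alpha*A, b = (1 - alpha)/n * ones.
A = np.array([
    [0,   0,   1.0, 0.5],
    [1/3, 0,   0,   0  ],
    [1/3, 0.5, 0,   0.5],
    [1/3, 0.5, 0,   0  ],
])
alpha, n = 0.85, 4
M = np.eye(n) - alpha * A
b = (1 - alpha) / n * np.ones(n)

x = sor_solve(M, b)              # omega = 1 reduces to Gauss-Seidel
print(np.round(x / np.sum(x), 3))
```

Choosing 1 < ω < 2 over-relaxes the Gauss–Seidel step; the best ω depends on the matrix.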


Page 32: The Linear Algebra Behind Google Ppt

Plot of the number of iterations required for convergence by different methods

[Four convergence plots, one per panel: Jacobi method, Gauss-Seidel method, SOR method, and Power method.]


Page 33: The Linear Algebra Behind Google Ppt

Conclusion

⇒ We discussed the mathematical ideas used in the Google search engine.

⇒ We investigated various problems that arise when computing the Google (link) matrix.

⇒ We took an example of a large matrix representation of the Internet and computed the page rank using different methods.


Page 34: The Linear Algebra Behind Google Ppt

Bibliography

Erik Andersson and Per-Anders Ekstrom. Investigating Google's PageRank algorithm. (2):1–9, 2004.

Pavel Berkhin. A survey on PageRank computing. (1):88–89, 2005.

Kurt Bryan and Tanya Leise. The $25,000,000,000 eigenvector: The linear algebra behind Google. SIAM Review, (3), 2006.

Amy N. Langville and Carl D. Meyer. Deeper inside PageRank. 2004.
