Algorithms (wait, Math?) Everywhere…

Gerald Kruse, PhD
John ’54 and Irene ’58 Dale Professor of MA, CS and IT
Interim Assistant Provost 2013-14
Juniata College
Huntingdon, PA
[email protected]
http://faculty.juniata.edu/kruse

Some Context / Confessions…

Prepare to be underwhelmed. I can’t return the hour or so you spend here.

I am impressed by the elegance of the algorithms I will present today, and I will probably try too hard to explain the underlying math (“but it’s so cool…”).

We like and depend on many automated processes, we just have issues implementing or interacting with them.

But, when we understand an algorithm, we can manipulate it (my CS 315 students “Google Bombed” Juniata… in a good way…).

Are we really surprised to learn that a Google search isn’t “free?”

What movie should we pick?

$1,000,000 to the first algorithm that was 10% better than Netflix’s original algorithm

The first 8% improvement was easy…

“Just A Guy In A Garage”

Psychiatrist father and “hacker” daughter team

Team from Bell Labs ended up winning

Here’s an interesting billboard, from a few years ago in Silicon Valley

First 70 digits of e

2.718281828459045235360287471352662497757247093699959574966967627724077

What happened for those who found the answer?

The answer is 7427466391

Those who typed in the URL, http://7427466391.com, ended up getting another puzzle. Solving that led them to a page with a job application for…

Google!

First Question

(1) Just what does it take to solve that problem?

Calculations (most probably on a computer), knowledge of number theory, a general aptitude and interest in problem solving.
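The "calculations on a computer" part can be sketched in a few lines. This assumes the widely reported wording of the billboard puzzle, "first 10-digit prime found in consecutive digits of e", which the slide text itself does not spell out. The helper names `is_prime` and `first_prime_window` are made up for this illustration, and plain trial division stands in for whatever number theory a solver might actually use.

```python
from decimal import Decimal, getcontext


def is_prime(n: int) -> bool:
    """Trial division; adequate for 10-digit numbers (divisors up to 100,000)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_prime_window(digits: str, width: int = 10):
    """Return the first width-digit prime among consecutive digits, or None."""
    for i in range(len(digits) - width + 1):
        window = digits[i:i + width]
        # A window starting with 0 is not a genuine 10-digit number.
        if window[0] != "0" and is_prime(int(window)):
            return int(window)
    return None


# Compute e with spare precision, then keep only the leading 150 digits.
getcontext().prec = 200
e_digits = str(Decimal(1).exp()).replace(".", "")[:150]

answer = first_prime_window(e_digits)
print(answer)
```

Under those assumptions the search should stop at the number quoted on the slide, 7427466391, which begins about a hundred digits into the expansion.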

Second Question

(2) Why does Google want to hire people who know how to find that number, and what does it have to do with a search engine?

Hmmm… Google wants you to choose it for your web searches.

Maybe their algorithms are mathematically based?

“Google-ing” Google

Results in an early paper from Page, Brin et al. while in graduate school

Search Engines

We’ve all used them, but what is “under the hood?”

• Crawl the web and locate all* public pages
• Index the “crawled” data so it can be searched
• Rank the pages for more effective searching (the “math” part of this talk)

Each word which is searched on is linked with a list of pages (just URLs) which contain it. The pages with the highest rank are returned first.

* can’t get a “snapshot” of the web at a particular instant
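The index-then-rank idea can be sketched with a tiny inverted index. Everything here is hypothetical: the three URLs, their text, and the rank values are invented for illustration, and a real engine would tokenize, store, and rank at a vastly larger scale.

```python
from collections import defaultdict

# Hypothetical crawled pages (URL -> text) and hypothetical precomputed ranks.
pages = {
    "a.example/home": "juniata college math and computer science",
    "b.example/blog": "math puzzles and number theory",
    "c.example/news": "college news and events",
}
rank = {"a.example/home": 0.5, "b.example/blog": 0.3, "c.example/news": 0.2}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, rank, word):
    """Return the URLs containing the word, highest rank first."""
    hits = index.get(word.lower(), ())
    return sorted(hits, key=lambda url: rank[url], reverse=True)


index = build_index(pages)
print(search(index, rank, "math"))
print(search(index, rank, "college"))
```

The lookup itself only touches the list stored for the searched word; the ordering comes entirely from the separately computed ranks, which is where the math in the rest of the talk lives.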

Note:

Google’s PageRank uses the link structure (“crowd sourcing”) of the World Wide Web to determine a page’s rank, it doesn’t grade content of a page.

PageRank is NOT a simple citation index

[Figure: two pages, A and B, with the links pointing to each]

Which is the more popular page below, A or B?

What if the links to A were from unpopular pages, and the one link to B was from www.yahoo.com? (High School…)

NOTE:
(1) Rankings based on citation index would be very easy to manipulate
(2) PageRank has evolved to be a minor part of Google’s search results.

Intuitively PageRank is analogous to popularity

The web as a graph: each page is a vertex, each hyperlink a directed edge.

A page is popular if a few very popular pages point (via hyperlinks) to it.

A page could be popular if many not-necessarily popular pages point (via hyperlinks) to it.

[Figure: three-page web graph with Page A ($N_A = 2$), Page B ($N_B = 1$), and Page C ($N_C = 1$)]

Which of these three would have the highest page rank?

So what is the mathematical definition of PageRank?

In particular, a page’s rank is equal to the sum of the ranks of all the pages pointing to it.

$$\mathrm{Rank}(u) = \sum_{v \in B_u} \frac{\mathrm{Rank}(v)}{N_v}$$

where $B_u$ is the set of pages with links to $u$, and $N_v$ is the number of links from $v$.

Note the scaling of each page rank.

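Writing out the equation for each web-page in our example gives:

[Figure: three-page web graph with Page A ($N_A = 2$), Page B ($N_B = 1$), and Page C ($N_C = 1$)]

$$\begin{aligned}
\mathrm{Rank}(A) &= 0 + 0 + \frac{\mathrm{Rank}(C)}{1} \\
\mathrm{Rank}(B) &= \frac{\mathrm{Rank}(A)}{2} + 0 + 0 \\
\mathrm{Rank}(C) &= \frac{\mathrm{Rank}(A)}{2} + \frac{\mathrm{Rank}(B)}{1} + 0
\end{aligned}$$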
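The three equations for the example can be generated mechanically from the link structure. The sketch below assumes the links are A to B and C, B to C, and C to A, which is what the equations imply (Page A has two outgoing links, Pages B and C one each). The function name `coefficient_matrix` is invented for this illustration.

```python
from fractions import Fraction

# Outgoing links inferred from the three equations: A -> B, C; B -> C; C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)


def coefficient_matrix(links, pages):
    """Entry [i][j] is 1/N_j if page j links to page i, otherwise 0."""
    matrix = [[Fraction(0)] * len(pages) for _ in pages]
    for j, source in enumerate(pages):
        n_out = len(links[source])  # N_v: number of links leaving the source
        for target in links[source]:
            matrix[pages.index(target)][j] = Fraction(1, n_out)
    return matrix


matrix = coefficient_matrix(links, pages)
for page, row in zip(pages, matrix):
    terms = " + ".join(
        f"{coef}*Rank({src})" for coef, src in zip(row, pages) if coef
    )
    print(f"Rank({page}) = {terms}")
```

Each printed line is one row of the system with the zero terms left out; the rows of `matrix` are the coefficients that reappear when the system is written as a matrix-vector product.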

Even though this is a circular definition we can calculate the ranks.

Re-write the system of equations as a Matrix-Vector product.

$$\begin{bmatrix} \mathrm{Rank}(A) \\ \mathrm{Rank}(B) \\ \mathrm{Rank}(C) \end{bmatrix} =
\begin{bmatrix} 0 & 0 & 1 \\ \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{2} & 1 & 0 \end{bmatrix}
\begin{bmatrix} \mathrm{Rank}(A) \\ \mathrm{Rank}(B) \\ \mathrm{Rank}(C) \end{bmatrix}$$

The PageRank vector is simply an eigenvector of the coefficient matrix, with $\lambda = 1$: $x = Ax$.

Wait… what’s an eigenvector?

[Figure: three-page web graph with computed ranks. Page A ($N_A = 2$): PageRank = 0.4; Page B ($N_B = 1$): PageRank = 0.2; Page C ($N_C = 1$): PageRank = 0.4]

Note: we choose the eigenvector with $\|x\|_1 = 1$
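One way to see where 0.4, 0.2, and 0.4 come from is to apply the coefficient matrix repeatedly to a starting guess. This is only a sketch for the three-page example, with pages ordered A, B, C and a uniform starting vector chosen arbitrarily; the helper `mat_vec` is not from the talk.

```python
def mat_vec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Coefficient matrix of the three-page example, pages ordered A, B, C.
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

x = [1 / 3, 1 / 3, 1 / 3]  # arbitrary start whose entries sum to 1
for _ in range(200):
    x = mat_vec(A, x)

ranks = {page: round(value, 6) for page, value in zip("ABC", x)}
print(ranks)
```

Because every column of the matrix sums to 1, the entries of `x` keep summing to 1 at each step, so the limit is already the normalized eigenvector: Pages A and C settle at 0.4 and Page B at 0.2.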

Implementation Details

Billions of web-pages would make a huge matrix

The matrix (in theory) is column-stochastic, which allows for iterative calculation

Previous PageRank is used as an initial guess

Random-Surfer term handles computational difficulties associated with a “disconnected graph”

Wait… what else gets searched?

Attempts to Manipulate Search Results Via a “Google Bomb”

Liberals vs. Conservatives!

In 2007, Google addressed Google Bombs, too many people thought the results were intentional and not merely a function of the structure of the web

Juniata’s own “Google Bomb”

At Juniata, CS 315 is my “Analysis and Algorithms” course

Miscellaneous points

• Try a search in Google on “PigeonRank.”

• What types of sites would Google NOT give good results on?

• PageRank has been deprecated. Google is continuously trying new ranking algorithms.

SPAM filters

• A “rules” approach… filter out all messages with things like, “Dear Friend” or “Click.”

• The first 80% is captured easily, with few false-positives.

• But the last few % (remember Netflix) will be difficult to catch, the rules will offer many more false-positives, and the SPAMM’ers can adapt.

• A statistical approach, called a Bayesian filter, is much more effective.

• It “learns” from a given set of SPAM and non-SPAM emails, automatically counting the frequency of words.

• Some words are incriminating, like “Madam,” others almost guarantee the email is non-SPAM, like “describe,” or “example.”
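The word-counting idea behind a Bayesian filter can be sketched as a small naive Bayes classifier. The six training messages are invented for illustration, words are assumed to be independent given the class, and add-one smoothing is one common choice rather than anything the talk specifies.

```python
import math
from collections import Counter

# Hypothetical training messages; a real filter learns from far more mail.
spam = ["dear friend click here", "madam click now", "dear madam win money"]
ham = ["please describe the example", "an example for class", "describe your friend"]


def word_counts(messages):
    return Counter(word for message in messages for word in message.split())


spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocabulary = set(spam_counts) | set(ham_counts)


def log_likelihood(words, counts):
    """Sum of log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counts.values()) + len(vocabulary)
    return sum(math.log((counts[word] + 1) / total) for word in words)


def spam_probability(message):
    """P(spam | message) by Bayes' rule; words never seen in training are ignored."""
    words = [w for w in message.lower().split() if w in vocabulary]
    prior_spam = len(spam) / (len(spam) + len(ham))
    log_spam = math.log(prior_spam) + log_likelihood(words, spam_counts)
    log_ham = math.log(1 - prior_spam) + log_likelihood(words, ham_counts)
    return 1 / (1 + math.exp(log_ham - log_spam))


print(round(spam_probability("dear madam"), 3))
print(round(spam_probability("describe this example"), 3))
```

With this toy data the first message scores well above one half and the second well below it, purely because of how often each word appeared in the two training sets. No hand-written rule mentions "Madam" or "describe".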

Bibliography

[1] S. Brin, L. Page, et al., The PageRank Citation Ranking: Bringing Order to the Web, http://dbpubs.stanford.edu/pub/1999-66, Stanford Digital Libraries Project (January 29, 1998).

[2] K. Bryan and T. Leise, The $25,000,000,000 Eigenvector: The Linear Algebra behind Google, SIAM Review, 48 (2006), pp. 569-581.

[3] G. Strang, Linear Algebra and Its Applications, Brooks-Cole, Boston, MA, 2005.

[4] D. Poole, Linear Algebra: A Modern Introduction, Brooks-Cole, Boston, MA, 2005.

Any Questions?

Slides available at http://faculty.juniata.edu/kruse

The following slides give some of the more in-depth mathematics behind Google

A Graphical Interpretation of a 2-Dimensional Eigenvector

http://cnx.org/content/m10736/latest/

If we have some 2-D vector x, and some 2 x 2 matrix A, generally their product, A*x = b, will result in a new vector, b, which is pointing in a different direction and having a different length than x.

But, if the vector (v in the image at the left) is an eigenvector of A, then A*v will give a vector which is in the same direction as v, but just scaled to a different length, by λ.

Note that λ is called an eigenvalue of A.
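A small numeric check makes the picture concrete. The 2 x 2 matrix and both vectors below are chosen only for illustration and are not taken from the slide's image; `is_parallel` is a made-up helper that tests whether two 2-D vectors lie on the same line.

```python
def mat_vec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def is_parallel(u, v, tol=1e-12):
    """True when 2-D vectors u and v lie on one line (zero cross product)."""
    return abs(u[0] * v[1] - u[1] * v[0]) < tol


A = [[2.0, 1.0],
     [1.0, 2.0]]  # hypothetical matrix chosen for illustration

x = [1.0, 0.0]  # a generic vector
v = [1.0, 1.0]  # an eigenvector of this particular A

b = mat_vec(A, x)   # [2.0, 1.0]: direction and length both change
Av = mat_vec(A, v)  # [3.0, 3.0]: same direction as v, scaled by 3

print(b, is_parallel(b, x))
print(Av, is_parallel(Av, v))
```

Here λ = 3 for the vector `v`, since every component of `A*v` is exactly three times the matching component of `v`, while `A*x` turns away from `x`.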

Note that the coefficient matrix is column-stochastic*

$$A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn}
\end{bmatrix}, \qquad 0 \le a_{ij} \le 1, \qquad a_{1j} + a_{2j} + \cdots + a_{nj} = \sum_{i=1}^{n} a_{ij} = 1$$

Every column-stochastic matrix has 1 as an eigenvalue.

* As long as there are no “dangling nodes” and the graph is connected.

In Page, Brin, et al. [1], they suggest dangling nodes most likely would occur from pages which haven’t been crawled yet, and so they “simply remove them from the system until all the PageRanks are calculated.”

It is interesting to note that a column-substochastic matrix does have a positive eigenvalue and corresponding eigenvector with non-negative entries, which is called the Perron eigenvector, as detailed in Bryan and Leise [2].
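The definition is easy to test in code, and the test also hints at why 1 is always an eigenvalue: if every column sums to 1, a row vector of ones times the matrix gives back the row of ones, so 1 is an eigenvalue of the transpose and therefore of the matrix itself. The function names and the second matrix `D` (with one all-zero column) are assumptions made for this sketch.

```python
from fractions import Fraction


def column_sums(matrix):
    """Row vector of ones times the matrix, i.e. the list of column sums."""
    n = len(matrix)
    return [sum(matrix[i][j] for i in range(n)) for j in range(n)]


def is_column_stochastic(matrix):
    """All entries in [0, 1] and every column summing to exactly 1."""
    entries_ok = all(0 <= a <= 1 for row in matrix for a in row)
    return entries_ok and all(total == 1 for total in column_sums(matrix))


half = Fraction(1, 2)

# Coefficient matrix of the three-page example, pages ordered A, B, C.
A = [[0, 0, 1], [half, 0, 0], [half, 1, 0]]

# Hypothetical matrix whose last column is all zero (a page with no out-links).
D = [[0, half, 0], [half, 0, 0], [half, half, 0]]

print(is_column_stochastic(A), column_sums(A))
print(is_column_stochastic(D), column_sums(D))
```

Exact fractions are used so the column sums compare equal to 1 without floating-point tolerance. The second matrix fails the test only because of its zero column, which is the column-substochastic case.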

Dangling Nodes have no outgoing links

[Figure: three-page web graph with Page A, Page B, and Page C]

$$A = \begin{bmatrix} 0 & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \end{bmatrix}$$

In this example, Page C is a dangling node. Note that its associated column in the coefficient matrix is all 0. Matrices like these are called column-substochastic.

A disconnected graph could lead to non-unique rankings

[Figure: five-page web graph with Page A, Page B, Page C, Page D, and Page E, split into two subwebs]

$$A = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & \tfrac{1}{2} \\
0 & 0 & 1 & 0 & \tfrac{1}{2} \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

Notice the block diagonal structure of the coefficient matrix. Note: Re-ordering via permutation doesn’t change the ranking, as in [2].

In this example, the eigenspace associated with eigenvalue $\lambda = 1$ is two-dimensional. Which eigenvector should be used for ranking?
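The non-uniqueness can be checked directly. The sketch assumes the pages are ordered A through E and linked as the matrix entries indicate (A and B link to each other, C and D link to each other, E links to C and D). The three candidate vectors are picked for illustration.

```python
from fractions import Fraction


def mat_vec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


half, quarter = Fraction(1, 2), Fraction(1, 4)

# Coefficient matrix of the five-page disconnected example, pages A..E.
A = [
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, half],
    [0, 0, 1, 0, half],
    [0, 0, 0, 0, 0],
]

only_ab = [half, half, 0, 0, 0]               # all rank in the A-B subweb
only_cd = [0, 0, half, half, 0]               # all rank in the C-D subweb
mixed = [quarter, quarter, quarter, quarter, 0]  # an even blend of the two

for name, x in [("only_ab", only_ab), ("only_cd", only_cd), ("mixed", mixed)]:
    print(name, mat_vec(A, x) == x, sum(x))
```

All three vectors satisfy Ax = x and sum to 1, yet they rank the pages differently, so the eigenvalue equation alone cannot decide between the two subwebs.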

Add a “random-surfer” term to the simple PageRank formula.

Let S be an n x n matrix with all entries 1/n. S is column-stochastic, and we consider the matrix M, which is a weighted average of A and S.

$$M = (1 - m)A + mS$$

This models the behavior of a real web-surfer, who might jump to another page by directly typing in a URL or by choosing a bookmark, rather than clicking on a hyperlink. Originally, m=0.15 in Google, according to [2].

$$x = Mx$$

can also be written as:

$$x = (1 - m)Ax + ms$$

Important Note: We will use this formulation with A when computing x, and s is a column vector with all entries 1/n, where $Sx = s$ if $\sum_i x_i = 1$.

M for our previous disconnected graph, with m=0.15

[Figure: five-page web graph with Page A, Page B, Page C, Page D, and Page E]

$$M = \begin{bmatrix}
0.03 & 0.88 & 0.03 & 0.03 & 0.03 \\
0.88 & 0.03 & 0.03 & 0.03 & 0.03 \\
0.03 & 0.03 & 0.03 & 0.88 & 0.455 \\
0.03 & 0.03 & 0.88 & 0.03 & 0.455 \\
0.03 & 0.03 & 0.03 & 0.03 & 0.03
\end{bmatrix}$$

The eigenspace associated with $\lambda = 1$ is one-dimensional, and the normalized eigenvector is

$$(0.2,\ 0.2,\ 0.285,\ 0.285,\ 0.03)$$

So the addition of the random surfer term permits comparison between pages in different subwebs.
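The entries of M and the quoted eigenvector can be verified numerically. The sketch rebuilds M = (1 - m)A + mS for the five-page example with m = 0.15, assuming pages ordered A through E, and checks that multiplying by the stated vector returns the same vector. The variable names are illustrative only.

```python
def mat_vec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


m = 0.15  # random-surfer weight quoted in the talk
n = 5

# Coefficient matrix of the disconnected example, pages ordered A..E.
A = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

# Every entry of S is 1/n, so M[i][j] = (1 - m) * A[i][j] + m / n.
M = [[(1 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

q = [0.2, 0.2, 0.285, 0.285, 0.03]
Mq = mat_vec(M, q)

for row in M:
    print([round(entry, 3) for entry in row])
print([round(value, 6) for value in Mq])
```

The three distinct entries come out as 0.03 (no link), 0.455 (one of two links), and 0.88 (the only link), and `Mq` matches `q` up to rounding, so `q` is indeed an eigenvector for eigenvalue 1.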

Iterative Calculation

By many estimates, the web currently contains at least 8 billion pages. How does Google compute an eigenvector for something this large?

One possibility is the power method.

In [2], it is shown that every positive (all entries are > 0) column-stochastic matrix M has a unique vector q with positive components such that Mq = q, with $\|q\|_1 = 1$, and it can be computed as $q = \lim_{k \to \infty} M^k x_0$, for any initial guess $x_0$ with positive components and $\|x_0\|_1 = 1$.

Iterative Calculation continued

Rather than calculating the powers of M directly, we could use the iteration $x_k = M x_{k-1}$.

Since M is positive, $M x_{k-1}$ would be an $O(n^2)$ calculation.

As we mentioned previously, Google uses the equivalent expression in the computation:

$$x_k = (1 - m) A x_{k-1} + m s$$

These products can be calculated without explicitly creating the huge coefficient matrix, since A contains mostly 0’s.

The iteration is guaranteed to converge, and it will converge quicker with a better first guess, so the previous PageRank vector is used as the initial vector.
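The sparse form of the iteration can be sketched using only link lists, so the n x n matrix is never built. This is an illustration under stated assumptions: the function name `pagerank`, the stopping tolerance, and the five-page link structure (the disconnected example, with E linking to C and D) are choices made here, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, m=0.15, tol=1e-12, start=None):
    """Iterate x_k = (1 - m) A x_{k-1} + m s using only the link lists.

    Assumes no dangling nodes: every page has at least one outgoing link.
    Returns the rank dictionary and the number of iterations used.
    """
    pages = sorted(links)
    n = len(pages)
    x = dict(start) if start else {page: 1 / n for page in pages}
    iterations = 0
    while True:
        new = {page: m / n for page in pages}      # the m*s term
        for source, targets in links.items():      # the (1 - m) A x term
            share = (1 - m) * x[source] / len(targets)
            for target in targets:
                new[target] += share
        iterations += 1
        change = sum(abs(new[page] - x[page]) for page in pages)
        x = new
        if change < tol:
            return x, iterations


links = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": ["C", "D"]}

ranks, cold_iterations = pagerank(links)
_, warm_iterations = pagerank(links, start=ranks)

print({page: round(value, 6) for page, value in ranks.items()})
print(cold_iterations, warm_iterations)
```

Each pass does work proportional to the number of links rather than to n squared, and restarting from the previously computed ranks needs far fewer passes than starting from the uniform vector, which is the reason given for reusing the previous PageRank vector.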

This gives a regular matrix

In matrix notation we have

$$R = AR + E$$

Since $\|R\|_1 = 1$ we can rewrite as

$$R = (A + (E \times \mathbf{1}))R, \qquad \text{note: } (E \times \mathbf{1})R = E$$

The new coefficient matrix is regular, so we can calculate the eigenvector iteratively.

This iterative process is a series of matrix-vector products, beginning with an initial vector (typically the previous PageRank vector). These products can be calculated without explicitly creating the huge coefficient matrix.