
Data Mining Techniques
CS 6220 - Section 2 - Spring 2017

Lecture 9: Link Analysis
Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

Web search before PageRank

• Human-curated (e.g. Yahoo, Looksmart)
  • Hand-written descriptions
  • Wait time for inclusion
• Text search (e.g. WebCrawler, Lycos)
  • Prone to term spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: Links as Votes

• Pages with more inbound links are more important
• Inbound links from important pages carry more weight

Not all pages are equally important:

[Figure: web graph contrasting pages with many inbound links vs. few/no inbound links, and links from important pages vs. links from unimportant pages]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Example: PageRank Scores

[Figure: example web graph annotated with PageRank scores: B 38.4, C 34.3, E 8.1, D 3.9, F 3.9, A 3.3; the five remaining pages score 1.6 each]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: Recursive Formulation

• A link’s vote is proportional to the importance of its source page

• If page j with importance rj has n out-links, each link gets rj / n votes

• Page j’s own importance is the sum of the votes on its in-links

[Figure: page j has in-links from pages i and k and 3 out-links; each out-link carries rj/3 votes]

rj = ri/3 + rk/4

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Equivalent Formulation: Random Surfer

• At time t a surfer is on some page i
• At time t+1 the surfer follows a link to a new page at random
• Define rank ri as the fraction of time spent on page i

[Figure: same graph as above; rj = ri/3 + rk/4]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: The “Flow” Model

Each page i with out-degree di passes ri/di along each out-link:

rj = Σi→j ri/di

[Figure: three-page graph y, a, m; edges carry ry/2, ry/2, ra/2, ra/2, rm]

"Flow" equations:

ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

• 3 equations, 3 unknowns
• Impose constraint: ry + ra + rm = 1
• Solution: ry = 2/5, ra = 2/5, rm = 1/5

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: The “Flow” Model

rj = Σi→j ri/di

"Flow" equations:

ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

In matrix form: r = M·r, where Mji = 1/di if i → j and 0 otherwise.

Matrix M is stochastic (i.e. columns sum to one).

[Figure: same three-page graph y, a, m with edge weights ry/2, ry/2, ra/2, ra/2, rm]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
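To make the matrix form concrete, here is a minimal NumPy sketch (illustrative names, not from the slides) that builds the column-stochastic M for the y/a/m example and solves the flow equations directly:

```python
import numpy as np

# Column-stochastic link matrix for the y/a/m example:
# y links to {y, a}, a links to {y, m}, m links to {a}.
# M[j, i] = 1/d_i if page i links to page j, else 0.
M = np.array([
    [0.5, 0.5, 0.0],   # rank flowing into y
    [0.5, 0.0, 1.0],   # rank flowing into a
    [0.0, 0.5, 0.0],   # rank flowing into m
])

# Solve r = M r subject to sum(r) = 1 by stacking the
# normalization constraint onto the (redundant) flow equations.
A = np.vstack([M - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # -> [0.4, 0.4, 0.2], i.e. ry = ra = 2/5, rm = 1/5
```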

PageRank: Eigenvector Problem

• PageRank: Solve for eigenvector r = M r with eigenvalue λ = 1

• Eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if a = M b then Σ ai = Σ bi)

• Problem: There are billions of pages on the internet. How do we solve for an eigenvector with on the order of 10^10 elements?

PageRank: Power Iteration

Model for random surfer:
• At time t = 0 pick a page at random
• At each subsequent time t follow an outgoing link at random

Probabilistic interpretation: let pt be the probability distribution over pages at time t; one step of the walk gives pt+1 = M·pt.

[Figure: power iteration steps on the y/a/m example graph]

pt converges to r. Iterate until |pt − pt−1| < ε.

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
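A minimal power-iteration sketch (illustrative; reuses the M defined above):

```python
import numpy as np

def power_iteration(M, eps=1e-9, max_iter=1000):
    """Iterate p <- M p until the rank vector stops changing."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(max_iter):
        p_next = M @ p
        if np.abs(p_next - p).sum() < eps:   # L1 convergence test
            return p_next
        p = p_next
    return p

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))  # converges to [0.4, 0.4, 0.2]
```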

Intermezzo: Markov Chains

• Markov property
• Irreducibility
• Ergodicity
• Stationary distribution (for ergodic chains)

Aside: Ergodicity

• PageRank assumes a random-walk model for individual surfers
• Equivalent assumption: a flow model in which equal fractions of surfers follow each link at every time step
• Ergodicity: the equilibrium of the flow model is the same as the asymptotic distribution of an individual random walk

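A quick numerical illustration of this equivalence (a minimal sketch using the three-page example from earlier): iterating the flow update from two different starting distributions reaches the same stationary distribution.

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# This chain is irreducible and aperiodic (y has a self-link),
# so any starting distribution converges to the same equilibrium.
for p in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])):
    for _ in range(100):
        p = M @ p
    print(p)  # both print the stationary distribution [0.4, 0.4, 0.2]
```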

PageRank: Problems

1. Dead ends
• Nodes with no outgoing links
• Where do surfers go next?

2. Spider traps
• A subgraph with no outgoing links to the wider graph
• Surfers are "trapped" with no way out (the chain is not irreducible)

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Power Iteration: Dead Ends

[Figure: y/a/m example where m has no out-links; probability mass leaks out of the system at m]

Probability is not conserved.

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Power Iteration: Dead Ends

[Figure: same example; the dead end's column is replaced by teleports to all pages]

Teleporting at dead ends fixes the "probability sink" issue.

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
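A minimal sketch of this fix (assumed convention: replace each all-zero column of M with the uniform distribution, i.e. teleport out of dead ends):

```python
import numpy as np

def fix_dead_ends(M):
    """Replace all-zero columns (dead ends) with uniform teleport columns."""
    M = M.copy()
    n = M.shape[0]
    dead = M.sum(axis=0) == 0      # columns with no outgoing links
    M[:, dead] = 1.0 / n           # surfer jumps to a random page instead
    return M
```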

Power Iteration: Spider Traps

[Figure: y/a/m example where m links only to itself]

Probability accumulates in traps (surfers get stuck).

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Solution: Random Teleports

Model for teleporting random surfer:
• At time t = 0 pick a page at random
• At each subsequent time t:
  • With probability β follow an outgoing link at random
  • With probability 1−β teleport to a new location picked uniformly at random

PageRank equation [Page & Brin 1998]:

rj = Σi→j β·ri/di + (1−β)·1/N

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
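Putting the pieces together, an end-to-end sketch (illustrative names; β = 0.85 is a commonly cited damping value, not one fixed by the slide):

```python
import numpy as np

def pagerank(M, beta=0.85, eps=1e-9, max_iter=1000):
    """Power iteration with random teleports: p <- beta*M*p + (1-beta)/N."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)
    teleport = np.full(n, (1.0 - beta) / n)
    for _ in range(max_iter):
        p_next = beta * (M @ p) + teleport
        p_next /= p_next.sum()      # re-normalize (guards against dead-end leakage)
        if np.abs(p_next - p).sum() < eps:
            break
        p = p_next
    return p_next
```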

Power Iteration: Teleports

[Figure: power iteration with teleports on the y/a/m spider-trap example (one can use power iteration as normal)]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)


Computing PageRank

• M is sparse: only store the nonzero entries
• Space is roughly proportional to the number of links
• Say 10 links per page: 4 bytes × 10 × 1 billion pages = 40 GB
• Still won't fit in memory, but will fit on disk

source node | degree | destination nodes
0 | 3 | 1, 5, 7
1 | 5 | 17, 64, 113, 117, 245
2 | 2 | 13, 23

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
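A sketch of one update pass over this edge-list representation (hypothetical record layout matching the table above):

```python
def update_pass(edges, r_old, n, beta=0.85):
    """One PageRank pass over (src, degree, dests) records streamed from disk."""
    r_new = [(1.0 - beta) / n] * n
    for src, degree, dests in edges:        # scan M once, sequentially
        share = beta * r_old[src] / degree
        for dst in dests:
            r_new[dst] += share
    return r_new

# Records in the format of the table above:
edges = [(0, 3, [1, 5, 7]), (1, 5, [17, 64, 113, 117, 245]), (2, 2, [13, 23])]
```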

Block-based Update Algorithm

• Break r_new into k blocks that fit in memory
• Scan M and r_old once for each block

src | degree | destination
0 | 4 | 0, 1, 3, 5
1 | 2 | 0, 5
2 | 2 | 3, 4

[Figure: r_new (entries 0-5, processed one block at a time), matrix M, and r_old (entries 0-5)]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Block-Stripe Update Algorithm

Break M into stripes: each stripe contains only destination nodes in the corresponding block of r_new.

Stripe for block {0, 1}:
src | degree | destination
0 | 4 | 0, 1
1 | 3 | 0
2 | 2 | 1

Stripe for block {2, 3}:
src | degree | destination
0 | 4 | 3
2 | 2 | 3

Stripe for block {4, 5}:
src | degree | destination
0 | 4 | 5
1 | 3 | 5
2 | 2 | 4

[Figure: r_new blocks, the three stripes of M, and r_old]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
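A sketch of the block-stripe pass (illustrative: `stripes[b]` is assumed to hold the (src, degree, dests) records whose destinations fall in block b; degree is still the full out-degree):

```python
def block_stripe_pass(stripes, blocks, r_old, n, beta=0.85):
    """Update r_new one block at a time; each stripe of M is scanned once."""
    r_new = [0.0] * n
    for b, block in enumerate(blocks):          # block = list of node ids
        for j in block:
            r_new[j] = (1.0 - beta) / n
        for src, degree, dests in stripes[b]:   # only dests inside this block
            share = beta * r_old[src] / degree
            for dst in dests:
                r_new[dst] += share
    return r_new
```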

Problems: Term Spam

• How do you make your page appear to be about movies?
• (1) Add the word "movie" 1,000 times to your page
  • Set the text color to the background color, so only search engines see it
• (2) Or, run the query "movie" on your target search engine
  • See what page comes first in the listings
  • Copy it into your page, and make it "invisible"
• These and similar techniques are term spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Google's Solution to Term Spam

• Believe what people say about you, rather than what you say about yourself
  • Use words in the anchor text (the words that appear underlined to represent the link) and its surrounding text
• Use PageRank as a tool to measure the "importance" of Web pages

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Problems 2: Link Spam

• Once Google became the dominant search engine, spammers began to work out ways to fool it
• Spam farms were developed to concentrate PageRank on a single page
• Link spam: creating link structures that boost the PageRank of a particular page

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Link Spamming

• Three kinds of web pages from a spammer's point of view:
  • Inaccessible pages
  • Accessible pages
    • e.g., blog comment pages (the spammer can post links to his pages)
  • Owned pages
    • Completely controlled by the spammer
    • May span multiple domain names

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Link Farms

• Spammer's goal: maximize the PageRank of target page t
• Technique:
  • Get as many links as possible from accessible pages to target page t
  • Construct a "link farm" to get a PageRank multiplier effect

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Link Farms

[Figure: target page t receives links from accessible pages (fed by inaccessible pages) and links to M owned farm pages 1, 2, …, M (millions of farm pages), each of which links back to t. One of the most common and effective organizations for a link farm.]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
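For reference, the multiplier effect can be made explicit (this derivation follows the analysis in Mining of Massive Datasets rather than anything legible on the slide; x, y, M, N are the symbols used there): let x be the PageRank contributed to t by accessible pages, y the rank of t, M the number of farm pages, and N the total number of pages. Each farm page has rank βy/M + (1−β)/N, all of which flows back to t:

y = x + β·M·(β·y/M + (1−β)/N) + (1−β)/N

Ignoring the negligible last term and solving for y:

y = x/(1−β²) + c·M/N, where c = β/(1+β)

With β = 0.85, 1/(1−β²) ≈ 3.6, so the farm multiplies the external contribution x by about 3.6 and adds rank proportional to the farm's share M/N of all pages.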

PageRank: Extensions

• Topic-specific PageRank:
  • Restrict teleportation to a set S of pages related to a specific topic
  • Set the teleport distribution to p0_i = 1/|S| if i ∈ S, and p0_i = 0 otherwise
• Trust propagation:
  • Use a set S of trusted pages as the teleport set
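A sketch of the topic-specific variant (illustrative; relative to the standard iteration, only the teleport vector changes):

```python
import numpy as np

def topic_pagerank(M, S, beta=0.85, eps=1e-9, max_iter=1000):
    """PageRank where teleports land only on the topic set S."""
    n = M.shape[0]
    v = np.zeros(n)
    v[list(S)] = 1.0 / len(S)       # teleport distribution restricted to S
    p = v.copy()
    for _ in range(max_iter):
        p_next = beta * (M @ p) + (1.0 - beta) * v
        if np.abs(p_next - p).sum() < eps:
            break
        p = p_next
    return p_next
```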

Hidden Markov Models

Time Series with Distinct States

Can we use a Gaussian Mixture Model?

[Figure: time series with distinct states, its histogram, the fitted mixture, and the posterior on states; estimate from a GMM vs. estimate from an HMM]

Hidden Markov Models

• Idea: mixture model + Markov chain for states
• Can model correlation between subsequent states (more likely to stay in the same state than to switch)
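A generative sketch of this idea (illustrative parameters): sample a state chain from a "sticky" transition matrix, then emit a Gaussian observation per state.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.5])           # initial state distribution
A = np.array([[0.95, 0.05],         # A[k, l] = p(z_{t+1} = l | z_t = k):
              [0.05, 0.95]])        # sticky states -> correlated sequences
mu, sigma = np.array([-2.0, 2.0]), np.array([1.0, 1.0])

T = 200
z = np.empty(T, dtype=int)
x = np.empty(T)
z[0] = rng.choice(2, p=pi)
x[0] = rng.normal(mu[z[0]], sigma[z[0]])
for t in range(1, T):
    z[t] = rng.choice(2, p=A[z[t - 1]])          # Markov chain on states
    x[t] = rng.normal(mu[z[t]], sigma[z[t]])     # Gaussian emission per state
```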

Reminder: Random Surfers in PageRank

[Figure: y/a/m random-surfer graph with edge weights y/2, y/2, a/2, a/2, m]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)


A = Mᵀ (the HMM transition matrix A is the transpose of PageRank's column-stochastic matrix M, so the rows of A sum to one)

Hidden Markov Models

[Figure: graphical models for a Gaussian mixture vs. a Gaussian HMM]

Review: Gaussian Mixtures

Expectation Maximization:
1. Update cluster probabilities
2. Update parameters

Forward-backward Algorithm

Expectation step for HMMs. Model:

z1 ~ Discrete(π)
zt+1 | zt = k ~ Discrete(Ak)
xt | zt = k ~ Normal(μk, σk)

Posterior on states:

γt,k = p(zt = k | x1:T, θ)
     = p(x1:t, zt = k) · p(xt+1:T | zt = k) / p(x1:T)
     ∝ αt,k · βt,k

Forward recursion:

αt,l := p(x1:t, zt = l) = p(xt | μl, σl) Σk Akl αt−1,k

Backward recursion:

βt,k := p(xt+1:T | zt = k) = Σl βt+1,l p(xt+1 | μl, σl) Akl
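A compact sketch of these recursions (illustrative; per-step normalization is added for numerical stability, which rescales α and β but leaves γ unchanged):

```python
import numpy as np

def forward_backward(x, pi, A, mu, sigma):
    """Posterior state marginals gamma[t, k] = p(z_t = k | x_{1:T})."""
    x = np.asarray(x)
    T, K = len(x), len(pi)
    # Emission likelihoods e[t, k] = Normal(x_t | mu_k, sigma_k)
    e = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * e[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                  # forward: alpha_t = e_t * (A^T alpha_{t-1})
        alpha[t] = e[t] * (A.T @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):         # backward: beta_t = A (e_{t+1} * beta_{t+1})
        beta[t] = A @ (e[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```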

Parameter Updates

E-step: posterior probabilities on transitions
M-step updates
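The formulas on this slide are not legible in this copy; for reference, the standard Baum-Welch updates for this Gaussian HMM are (an assumption based on standard EM for HMMs, not transcribed from the slide):

ξt,k,l = p(zt = k, zt+1 = l | x1:T, θ) ∝ αt,k Akl p(xt+1 | μl, σl) βt+1,l

Akl ← Σt ξt,k,l / Σt γt,k
μk ← Σt γt,k xt / Σt γt,k
σk² ← Σt γt,k (xt − μk)² / Σt γt,k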

Other Examples for HMMs

Handwritten Digits
• State 1: Sweeping arc
• State 2: Horizontal line

(excerpt from Bishop, Pattern Recognition and Machine Learning, §13.2)

Figure 13.11: Top row: examples of on-line handwritten digits. Bottom row: synthetic digits sampled generatively from a left-to-right hidden Markov model that has been trained on a data set of 45 handwritten digits.

One of the most powerful properties of hidden Markov models is their ability to exhibit some degree of invariance to local warping (compression and stretching) of the time axis. To understand this, consider the way in which the digit '2' is written in the on-line handwritten digits example. A typical digit comprises two distinct sections joined at a cusp. The first part of the digit, which starts at the top left, has a sweeping arc down to the cusp or loop at the bottom left, followed by a second more-or-less straight sweep ending at the bottom right. Natural variations in writing style will cause the relative sizes of the two sections to vary, and hence the location of the cusp or loop within the temporal sequence will vary. From a generative perspective such variations can be accommodated by the hidden Markov model through changes in the number of transitions to the same state versus the number of transitions to the successive state. Note, however, that if a digit '2' is written in the reverse order, that is, starting at the bottom right and ending at the top left, then even though the pen tip coordinates may be identical to an example from the training set, the probability of the observations under the model will be extremely small. In the speech recognition context, warping of the time axis is associated with natural variations in the speed of speech, and again the hidden Markov model can accommodate such a distortion and not penalize it too heavily.

13.2.1 Maximum likelihood for the HMM

If we have observed a data set X = {x1, …, xN}, we can determine the parameters of an HMM using maximum likelihood. The likelihood function is obtained from the joint distribution (13.10) by marginalizing over the latent variables:

p(X|θ) = ΣZ p(X, Z|θ).     (13.11)

Because the joint distribution p(X, Z|θ) does not factorize over n (in contrast to the mixture distribution considered in Chapter 9), we cannot simply treat each of the summations over zn independently. Nor can we perform the summations explicitly, because there are N variables to be summed over, each of which has K states, resulting in a total of K^N terms. Thus the number of terms in the summation grows exponentially with the length of the chain.

RNA splicing
• State 1: Exon (relevant)
• State 2: Splice site
• State 3: Intron (ignored)
