
Data Mining Techniques
CS 6220 - Section 2 - Spring 2017

Lecture 9: Link Analysis
Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

Web search before PageRank

• Human-curated (e.g. Yahoo, Looksmart)
  • Hand-written descriptions
  • Wait time for inclusion
• Text search (e.g. WebCrawler, Lycos)
  • Prone to term spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: Links as Votes

• Pages with more inbound links are more important
• Inbound links from important pages carry more weight

Not all pages are equally important:

[Figure: web graph contrasting pages with many inbound links vs. few/no inbound links, and links from important pages vs. links from unimportant pages]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Example: PageRank Scores

[Figure: example web graph annotated with PageRank scores: B 38.4, C 34.3, E 8.1, D 3.9, F 3.9, A 3.3; the five remaining pages score 1.6 each]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: Recursive Formulation

• A link’s vote is proportional to the importance of its source page

• If page j with importance rj has n out-links, each link gets rj / n votes

• Page j’s own importance is the sum of the votes on its in-links

[Figure: page j has in-links from pages i and k and 3 out-links; each out-link carries rj/3 votes]

rj = ri/3 + rk/4

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Equivalent Formulation: Random Surfer

• At time t a surfer is on some page i
• At time t+1 the surfer follows a link to a new page at random
• Define rank ri as the fraction of time spent on page i

[Figure: same graph as above; rj = ri/3 + rk/4]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: The “Flow” Model

Each page i with out-degree di passes ri/di along each out-link:

rj = Σi→j ri/di

[Figure: three-page graph y, a, m; edges carry ry/2, ry/2, ra/2, ra/2, rm]

"Flow" equations:

ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

• 3 equations, 3 unknowns
• Impose constraint: ry + ra + rm = 1
• Solution: ry = 2/5, ra = 2/5, rm = 1/5

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

PageRank: The “Flow” Model

rj = Σi→j ri/di

"Flow" equations:

ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

In matrix form: r = M·r, where Mji = 1/di if i → j and 0 otherwise.

Matrix M is stochastic (i.e. columns sum to one).

[Figure: same three-page graph y, a, m with edge weights ry/2, ry/2, ra/2, ra/2, rm]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
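To make the matrix form concrete, here is a minimal NumPy sketch (illustrative names, not from the slides) that builds the column-stochastic M for the y/a/m example and solves the flow equations directly:

```python
import numpy as np

# Column-stochastic link matrix for the y/a/m example:
# y links to {y, a}, a links to {y, m}, m links to {a}.
# M[j, i] = 1/d_i if page i links to page j, else 0.
M = np.array([
    [0.5, 0.5, 0.0],   # rank flowing into y
    [0.5, 0.0, 1.0],   # rank flowing into a
    [0.0, 0.5, 0.0],   # rank flowing into m
])

# Solve r = M r subject to sum(r) = 1 by stacking the
# normalization constraint onto the (redundant) flow equations.
A = np.vstack([M - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # -> [0.4, 0.4, 0.2], i.e. ry = ra = 2/5, rm = 1/5
```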

PageRank: Eigenvector Problem

• PageRank: Solve for eigenvector r = M r with eigenvalue λ = 1

• Eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if a = M b then Σ ai = Σ bi)

• Problem: There are billions of pages on the internet. How do we solve for an eigenvector with on the order of 10^10 elements?

PageRank: Power Iteration

Model for random surfer:
• At time t = 0 pick a page at random
• At each subsequent time t follow an outgoing link at random

Probabilistic interpretation: let pt be the probability distribution over pages at time t; one step of the walk gives pt+1 = M·pt.

[Figure: power iteration steps on the y/a/m example graph]

pt converges to r. Iterate until |pt − pt−1| < ε.

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
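A minimal power-iteration sketch (illustrative; reuses the M defined above):

```python
import numpy as np

def power_iteration(M, eps=1e-9, max_iter=1000):
    """Iterate p <- M p until the rank vector stops changing."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(max_iter):
        p_next = M @ p
        if np.abs(p_next - p).sum() < eps:   # L1 convergence test
            return p_next
        p = p_next
    return p

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))  # converges to [0.4, 0.4, 0.2]
```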

Intermezzo: Markov Chains

• Markov property
• Irreducibility
• Ergodicity
• Stationary distribution (for ergodic chains)

Aside: Ergodicity

• PageRank assumes a random-walk model for individual surfers
• Equivalent assumption: a flow model in which equal fractions of surfers follow each link at every time step
• Ergodicity: the equilibrium of the flow model is the same as the asymptotic distribution of an individual random walk

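A quick numerical illustration of this equivalence (a minimal sketch using the three-page example from earlier): iterating the flow update from two different starting distributions reaches the same stationary distribution.

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# This chain is irreducible and aperiodic (y has a self-link),
# so any starting distribution converges to the same equilibrium.
for p in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])):
    for _ in range(100):
        p = M @ p
    print(p)  # both print the stationary distribution [0.4, 0.4, 0.2]
```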

PageRank: Problems

1. Dead ends
• Nodes with no outgoing links
• Where do surfers go next?

2. Spider traps
• A subgraph with no outgoing links to the wider graph
• Surfers are "trapped" with no way out (the chain is not irreducible)

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Power Iteration: Dead Ends

[Figure: y/a/m example where m has no out-links; probability mass leaks out of the system at m]

Probability is not conserved.

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Power Iteration: Dead Ends

[Figure: same example; the dead end's column is replaced by teleports to all pages]

Teleporting at dead ends fixes the "probability sink" issue.

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
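A minimal sketch of this fix (assumed convention: replace each all-zero column of M with the uniform distribution, i.e. teleport out of dead ends):

```python
import numpy as np

def fix_dead_ends(M):
    """Replace all-zero columns (dead ends) with uniform teleport columns."""
    M = M.copy()
    n = M.shape[0]
    dead = M.sum(axis=0) == 0      # columns with no outgoing links
    M[:, dead] = 1.0 / n           # surfer jumps to a random page instead
    return M
```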

Power Iteration: Spider Traps

[Figure: y/a/m example where m links only to itself]

Probability accumulates in traps (surfers get stuck).

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Solution: Random Teleports

Model for teleporting random surfer:
• At time t = 0 pick a page at random
• At each subsequent time t:
  • With probability β follow an outgoing link at random
  • With probability 1−β teleport to a new location picked uniformly at random

PageRank equation [Page & Brin 1998]:

rj = Σi→j β·ri/di + (1−β)·1/N

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
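Putting the pieces together, an end-to-end sketch (illustrative names; β = 0.85 is a commonly cited damping value, not one fixed by the slide):

```python
import numpy as np

def pagerank(M, beta=0.85, eps=1e-9, max_iter=1000):
    """Power iteration with random teleports: p <- beta*M*p + (1-beta)/N."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)
    teleport = np.full(n, (1.0 - beta) / n)
    for _ in range(max_iter):
        p_next = beta * (M @ p) + teleport
        p_next /= p_next.sum()      # re-normalize (guards against dead-end leakage)
        if np.abs(p_next - p).sum() < eps:
            break
        p = p_next
    return p_next
```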

Power Iteration: Teleports

[Figure: power iteration with teleports on the y/a/m spider-trap example (one can use power iteration as normal)]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)


Computing PageRank

• M is sparse: only store the nonzero entries
• Space is roughly proportional to the number of links
• Say 10 links per page: 4 bytes × 10 × 1 billion pages = 40 GB
• Still won't fit in memory, but will fit on disk

source node | degree | destination nodes
0 | 3 | 1, 5, 7
1 | 5 | 17, 64, 113, 117, 245
2 | 2 | 13, 23

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
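A sketch of one update pass over this edge-list representation (hypothetical record layout matching the table above):

```python
def update_pass(edges, r_old, n, beta=0.85):
    """One PageRank pass over (src, degree, dests) records streamed from disk."""
    r_new = [(1.0 - beta) / n] * n
    for src, degree, dests in edges:        # scan M once, sequentially
        share = beta * r_old[src] / degree
        for dst in dests:
            r_new[dst] += share
    return r_new

# Records in the format of the table above:
edges = [(0, 3, [1, 5, 7]), (1, 5, [17, 64, 113, 117, 245]), (2, 2, [13, 23])]
```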

Block-based Update Algorithm

• Break r_new into k blocks that fit in memory
• Scan M and r_old once for each block

src | degree | destination
0 | 4 | 0, 1, 3, 5
1 | 2 | 0, 5
2 | 2 | 3, 4

[Figure: r_new (entries 0-5, processed one block at a time), matrix M, and r_old (entries 0-5)]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Block-Stripe Update Algorithm

Break M into stripes: each stripe contains only destination nodes in the corresponding block of r_new.

Stripe for block {0, 1}:
src | degree | destination
0 | 4 | 0, 1
1 | 3 | 0
2 | 2 | 1

Stripe for block {2, 3}:
src | degree | destination
0 | 4 | 3
2 | 2 | 3

Stripe for block {4, 5}:
src | degree | destination
0 | 4 | 5
1 | 3 | 5
2 | 2 | 4

[Figure: r_new blocks, the three stripes of M, and r_old]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
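A sketch of the block-stripe pass (illustrative: `stripes[b]` is assumed to hold the (src, degree, dests) records whose destinations fall in block b; degree is still the full out-degree):

```python
def block_stripe_pass(stripes, blocks, r_old, n, beta=0.85):
    """Update r_new one block at a time; each stripe of M is scanned once."""
    r_new = [0.0] * n
    for b, block in enumerate(blocks):          # block = list of node ids
        for j in block:
            r_new[j] = (1.0 - beta) / n
        for src, degree, dests in stripes[b]:   # only dests inside this block
            share = beta * r_old[src] / degree
            for dst in dests:
                r_new[dst] += share
    return r_new
```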

Problems: Term Spam

• How do you make your page appear to be about movies?
• (1) Add the word "movie" 1,000 times to your page
  • Set the text color to the background color, so only search engines see it
• (2) Or, run the query "movie" on your target search engine
  • See what page comes first in the listings
  • Copy it into your page, and make it "invisible"
• These and similar techniques are term spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Google's Solution to Term Spam

• Believe what people say about you, rather than what you say about yourself
  • Use words in the anchor text (the words that appear underlined to represent the link) and its surrounding text
• Use PageRank as a tool to measure the "importance" of Web pages

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Problems 2: Link Spam

• Once Google became the dominant search engine, spammers began to work out ways to fool it
• Spam farms were developed to concentrate PageRank on a single page
• Link spam: creating link structures that boost the PageRank of a particular page

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Link Spamming

• Three kinds of web pages from a spammer's point of view:
  • Inaccessible pages
  • Accessible pages
    • e.g., blog comment pages (the spammer can post links to his pages)
  • Owned pages
    • Completely controlled by the spammer
    • May span multiple domain names

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Link Farms

• Spammer's goal: maximize the PageRank of target page t
• Technique:
  • Get as many links as possible from accessible pages to target page t
  • Construct a "link farm" to get a PageRank multiplier effect

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

Link Farms

[Figure: target page t receives links from accessible pages (fed by inaccessible pages) and links to M owned farm pages 1, 2, …, M (millions of farm pages), each of which links back to t. One of the most common and effective organizations for a link farm.]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
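For reference, the multiplier effect can be made explicit (this derivation follows the analysis in Mining of Massive Datasets rather than anything legible on the slide; x, y, M, N are the symbols used there): let x be the PageRank contributed to t by accessible pages, y the rank of t, M the number of farm pages, and N the total number of pages. Each farm page has rank βy/M + (1−β)/N, all of which flows back to t:

y = x + β·M·(β·y/M + (1−β)/N) + (1−β)/N

Ignoring the negligible last term and solving for y:

y = x/(1−β²) + c·M/N, where c = β/(1+β)

With β = 0.85, 1/(1−β²) ≈ 3.6, so the farm multiplies the external contribution x by about 3.6 and adds rank proportional to the farm's share M/N of all pages.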

PageRank: Extensions

• Topic-specific PageRank:
  • Restrict teleportation to a set S of pages related to a specific topic
  • Set the teleport distribution to p0_i = 1/|S| if i ∈ S, and p0_i = 0 otherwise
• Trust propagation:
  • Use a set S of trusted pages as the teleport set
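A sketch of the topic-specific variant (illustrative; relative to the standard iteration, only the teleport vector changes):

```python
import numpy as np

def topic_pagerank(M, S, beta=0.85, eps=1e-9, max_iter=1000):
    """PageRank where teleports land only on the topic set S."""
    n = M.shape[0]
    v = np.zeros(n)
    v[list(S)] = 1.0 / len(S)       # teleport distribution restricted to S
    p = v.copy()
    for _ in range(max_iter):
        p_next = beta * (M @ p) + (1.0 - beta) * v
        if np.abs(p_next - p).sum() < eps:
            break
        p = p_next
    return p_next
```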

Hidden Markov Models

Time Series with Distinct States

Can we use a Gaussian Mixture Model?

[Figure: time series with distinct states, its histogram, the fitted mixture, and the posterior on states; estimate from a GMM vs. estimate from an HMM]

Hidden Markov Models

• Idea: mixture model + Markov chain for states
• Can model correlation between subsequent states (more likely to stay in the same state than to switch)
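A generative sketch of this idea (illustrative parameters): sample a state chain from a "sticky" transition matrix, then emit a Gaussian observation per state.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.5])           # initial state distribution
A = np.array([[0.95, 0.05],         # A[k, l] = p(z_{t+1} = l | z_t = k):
              [0.05, 0.95]])        # sticky states -> correlated sequences
mu, sigma = np.array([-2.0, 2.0]), np.array([1.0, 1.0])

T = 200
z = np.empty(T, dtype=int)
x = np.empty(T)
z[0] = rng.choice(2, p=pi)
x[0] = rng.normal(mu[z[0]], sigma[z[0]])
for t in range(1, T):
    z[t] = rng.choice(2, p=A[z[t - 1]])          # Markov chain on states
    x[t] = rng.normal(mu[z[t]], sigma[z[t]])     # Gaussian emission per state
```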

Reminder: Random Surfers in PageRank

[Figure: y/a/m random-surfer graph with edge weights y/2, y/2, a/2, a/2, m]

(adapted from: Mining of Massive Datasets, http://www.mmds.org)


A = Mᵀ (the HMM transition matrix A is the transpose of PageRank's column-stochastic matrix M, so the rows of A sum to one)

Hidden Markov Models

[Figure: graphical models for a Gaussian mixture vs. a Gaussian HMM]

Review: Gaussian Mixtures

Expectation Maximization:
1. Update cluster probabilities
2. Update parameters

Forward-backward Algorithm

Expectation step for HMMs. Model:

z1 ~ Discrete(π)
zt+1 | zt = k ~ Discrete(Ak)
xt | zt = k ~ Normal(μk, σk)

Posterior on states:

γt,k = p(zt = k | x1:T, θ)
     = p(x1:t, zt = k) · p(xt+1:T | zt = k) / p(x1:T)
     ∝ αt,k · βt,k

Forward recursion:

αt,l := p(x1:t, zt = l) = p(xt | μl, σl) Σk Akl αt−1,k

Backward recursion:

βt,k := p(xt+1:T | zt = k) = Σl βt+1,l p(xt+1 | μl, σl) Akl
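A compact sketch of these recursions (illustrative; per-step normalization is added for numerical stability, which rescales α and β but leaves γ unchanged):

```python
import numpy as np

def forward_backward(x, pi, A, mu, sigma):
    """Posterior state marginals gamma[t, k] = p(z_t = k | x_{1:T})."""
    x = np.asarray(x)
    T, K = len(x), len(pi)
    # Emission likelihoods e[t, k] = Normal(x_t | mu_k, sigma_k)
    e = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * e[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                  # forward: alpha_t = e_t * (A^T alpha_{t-1})
        alpha[t] = e[t] * (A.T @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):         # backward: beta_t = A (e_{t+1} * beta_{t+1})
        beta[t] = A @ (e[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```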

Parameter Updates

E-step: posterior probabilities on transitions
M-step updates
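The formulas on this slide are not legible in this copy; for reference, the standard Baum-Welch updates for this Gaussian HMM are (an assumption based on standard EM for HMMs, not transcribed from the slide):

ξt,k,l = p(zt = k, zt+1 = l | x1:T, θ) ∝ αt,k Akl p(xt+1 | μl, σl) βt+1,l

Akl ← Σt ξt,k,l / Σt γt,k
μk ← Σt γt,k xt / Σt γt,k
σk² ← Σt γt,k (xt − μk)² / Σt γt,k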

Other Examples for HMMs

Handwritten Digits
• State 1: Sweeping arc
• State 2: Horizontal line

(excerpt from Bishop, Pattern Recognition and Machine Learning, §13.2)

Figure 13.11: Top row: examples of on-line handwritten digits. Bottom row: synthetic digits sampled generatively from a left-to-right hidden Markov model that has been trained on a data set of 45 handwritten digits.

One of the most powerful properties of hidden Markov models is their ability to exhibit some degree of invariance to local warping (compression and stretching) of the time axis. To understand this, consider the way in which the digit '2' is written in the on-line handwritten digits example. A typical digit comprises two distinct sections joined at a cusp. The first part of the digit, which starts at the top left, has a sweeping arc down to the cusp or loop at the bottom left, followed by a second more-or-less straight sweep ending at the bottom right. Natural variations in writing style will cause the relative sizes of the two sections to vary, and hence the location of the cusp or loop within the temporal sequence will vary. From a generative perspective such variations can be accommodated by the hidden Markov model through changes in the number of transitions to the same state versus the number of transitions to the successive state. Note, however, that if a digit '2' is written in the reverse order, that is, starting at the bottom right and ending at the top left, then even though the pen tip coordinates may be identical to an example from the training set, the probability of the observations under the model will be extremely small. In the speech recognition context, warping of the time axis is associated with natural variations in the speed of speech, and again the hidden Markov model can accommodate such a distortion and not penalize it too heavily.

13.2.1 Maximum likelihood for the HMM

If we have observed a data set X = {x1, …, xN}, we can determine the parameters of an HMM using maximum likelihood. The likelihood function is obtained from the joint distribution (13.10) by marginalizing over the latent variables:

p(X|θ) = ΣZ p(X, Z|θ).     (13.11)

Because the joint distribution p(X, Z|θ) does not factorize over n (in contrast to the mixture distribution considered in Chapter 9), we cannot simply treat each of the summations over zn independently. Nor can we perform the summations explicitly, because there are N variables to be summed over, each of which has K states, resulting in a total of K^N terms. Thus the number of terms in the summation grows exponentially with the length of the chain.

RNA splicing
• State 1: Exon (relevant)
• State 2: Splice site
• State 3: Intron (ignored)
