Link Analysis for Web Information Retrieval

DESCRIPTION

Talk from February 2008 @ FADOC, Complutense University, Madrid

Page 1: Link Analysis for Web Information Retrieval

Link Analysis for Web Information Retrieval

C. Castillo

Hypothesis

Levels of link analysis

Ranking

Web spam

... detection

... links

... contents

... both

Summary

Link Analysis for Web Information Retrieval
With Applications to Adversarial IR

Carlos Castillo (1)

[email protected]

With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)

1. Yahoo! Research Barcelona – Catalunya, Spain
2. Università di Roma “La Sapienza” – Rome, Italy
3. Yahoo! Research Santiago – Chile
4. ISTI-CNR – Pisa, Italy
5. Università degli Studi di Milano – Milan, Italy

Page 2: Link Analysis for Web Information Retrieval

When you have a hammer

Page 3: Link Analysis for Web Information Retrieval

Everything looks like a graph!

Page 4: Link Analysis for Web Information Retrieval

1 Hypothesis
2 Levels of link analysis
3 Ranking
4 Web spam
5 ... detection
6 ... links
7 ... contents
8 ... both
9 Summary

Page 5: Link Analysis for Web Information Retrieval

Links are not placed at random

Topical locality hypothesis

Link endorsement hypothesis

Page 7: Link Analysis for Web Information Retrieval

Topical locality hypothesis

“We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]

Page 8: Link Analysis for Web Information Retrieval

[Figure: average text similarity (y-axis, 0.2 to 0.7) versus link distance (x-axis, 1 to 5)]

[Baeza-Yates et al., 2006], data from UK 2006

Page 9: Link Analysis for Web Information Retrieval

Link similarity cases

Link (geodesic) distance

Co-citation

Bibliographic coupling

Page 10: Link Analysis for Web Information Retrieval

Co-citation

Page 11: Link Analysis for Web Information Retrieval

Bibliographic coupling

Page 12: Link Analysis for Web Information Retrieval

(Both can be generalized)

(Both co-citation and bibliographic coupling can be generalized. E.g., SimRank [Jeh and Widom, 2002] generalizes the idea of co-citation to several levels.)
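
As a rough illustration of the two similarity cases named above, the sketch below counts co-citation (pages that link to both of two pages) and bibliographic coupling (pages that both of two pages link to). The helper names and the toy graph are assumptions made for this example and do not come from the talk.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (out_links, in_links) as dicts of sets for (source, target) pairs."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links


def cocitation(in_links, a, b):
    """Number of pages that link to both a and b."""
    return len(in_links.get(a, set()) & in_links.get(b, set()))


def bibliographic_coupling(out_links, a, b):
    """Number of pages that both a and b link to."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))


# Hypothetical toy graph: p and q both link to x and y; r links only to x.
edges = [("p", "x"), ("p", "y"), ("q", "x"), ("q", "y"), ("r", "x")]
out_links, in_links = build_neighbors(edges)
xy_cocitation = cocitation(in_links, "x", "y")
pq_coupling = bibliographic_coupling(out_links, "p", "q")
pr_coupling = bibliographic_coupling(out_links, "p", "r")
```

Both measures are raw counts here; dividing by the size of the union of the two neighbor sets would be one possible normalization.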

Page 13: Link Analysis for Web Information Retrieval

Link endorsement hypothesis

Links are assumed to be endorsements (votes, positive opinions) [Li, 1998]

But they can represent:

Disagreement

Self citations

Nepotism

Citations to methodological documents

etc.

Page 15: Link Analysis for Web Information Retrieval

Furthermore

They measure quantity, not quality (e.g., “Stop the numbers game!” in Communications of the ACM a few months ago)

Self-citations are frequent

In some topics there is more linking

Citations go from newer to older

New documents get few citations [Baeza-Yates et al., 2002]

Many of the citations are irrelevant

Page 21: Link Analysis for Web Information Retrieval

Nevertheless

Both the topical locality hypothesis and the link endorsement hypothesis are meaningful on the Web

Analogy with economics

Think of the hypotheses in economics that require many buyers and sellers, zero transaction costs, perfect information, etc.

Page 25: Link Analysis for Web Information Retrieval

How to find meaningful patterns?

Several levels of analysis:

Macroscopic view: overall structure

Microscopic view: nodes

Mesoscopic view: regions

Page 28: Link Analysis for Web Information Retrieval

Macroscopic view, e.g. Bow-tie

[Broder et al., 2000]

Page 30: Link Analysis for Web Information Retrieval

Macroscopic view, e.g. Jellyfish

[Tauro et al., 2001] - Internet Autonomous Systems (AS) topology

Page 32: Link Analysis for Web Information Retrieval

Microscopic view, e.g. Degree

[Barabasi, 2002] and others

Page 33: Link Analysis for Web Information Retrieval

“While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabasi, 2002]

Page 34: Link Analysis for Web Information Retrieval

Other scale-free networks

Power grid designs

Sexual partners in humans

Collaboration of movie actors in films

Citations in scientific publications

Protein interactions

Page 35: Link Analysis for Web Information Retrieval

Microscopic view, e.g. Degree

[Figure: degree distributions in four panels: Greece, Chile, Spain, Korea]

[Baeza-Yates et al., 2007] compares this distribution in 8 countries ... guess what the result is?
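
To make the "degree" view concrete, here is a minimal sketch of how an in-degree distribution could be tabulated from an edge list. The function name and the toy edges are assumptions for illustration; pages that receive no links are not counted in this version.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each in-degree k to the fraction of linked-to pages with in-degree k."""
    in_degree = Counter(dst for _, dst in edges)   # page -> number of in-links
    histogram = Counter(in_degree.values())        # in-degree -> number of pages
    total = sum(histogram.values())
    return {k: count / total for k, count in sorted(histogram.items())}


# Hypothetical toy graph: one popular page and two pages with a single in-link.
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "x"), ("b", "y")]
degree_dist = in_degree_distribution(toy_edges)
```

On a real crawl one would usually plot this distribution on log-log axes, where a power law appears as a roughly straight line.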

Page 36: Link Analysis for Web Information Retrieval

Mesoscopic view, e.g. Hop-plot

Page 38: Link Analysis for Web Information Retrieval

Mesoscopic view, e.g. Hop-plot

[Figure: hop-plots showing frequency (y-axis, 0.0 to 0.3) versus distance (x-axis, 5 to 30) for four graphs: .it (40M pages), .uk (18M pages), .eu.int (800K pages), and a synthetic graph (100K pages)]

[Baeza-Yates et al., 2006]
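
A hop-plot of this kind can be approximated as the distribution of shortest-path distances between pairs of pages. The sketch below is one way to compute it, assuming the graph is small enough to run a breadth-first search from every node; the function names and the toy graph are illustrative and are not taken from the cited study, which may have used sampling or a different normalization.

```python
from collections import Counter, deque


def bfs_distances(out_links, source):
    """Shortest link distance from source to every page reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in out_links.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def hop_plot(out_links):
    """Fraction of reachable ordered pairs (u, v), u != v, at each distance."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    counts = Counter()
    for u in nodes:
        for d in bfs_distances(out_links, u).values():
            if d > 0:
                counts[d] += 1
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())} if total else {}


# Hypothetical toy graph given as page -> list of pages it links to.
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
toy_hops = hop_plot(toy_graph)
```

For graphs with millions of pages, running the search from a random sample of start pages is a common shortcut.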

Page 40: Link Analysis for Web Information Retrieval

Models

Preferential attachment

Copy model

Hybrid models

Page 43: Link Analysis for Web Information Retrieval

Preferential attachment

“A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected.” [Barabasi and Albert, 1999]

“rich get richer”
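
The two mechanisms in the quote (growth and preferential attachment) can be simulated in a few lines. The sketch below is a simplified version in which every new node adds exactly one link; that single-link choice, the function name, and the fixed random seed are assumptions made for this example.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one existing
    node chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(1, 0)]        # start from two connected nodes
    endpoints = [0, 1]      # each node appears once per unit of degree
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges


pa_edges = preferential_attachment(1000)
pa_in_degree = Counter(dst for _, dst in pa_edges)
```

Sorting `pa_in_degree` by count typically shows a few early nodes holding a large share of the links, which is the "rich get richer" effect.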

Page 46: Link Analysis for Web Information Retrieval

Counting in-links does not work

“With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998]

Page 47: Link Analysis for Web Information Retrieval

PageRank: simplified version

$$\mathrm{PageRank}'(u) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}'(v)}{|\Gamma^+(v)|}$$

$\Gamma^-(\cdot)$: in-links
$\Gamma^+(\cdot)$: out-links
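
The simplified formula can be evaluated by repeated substitution. The following is a minimal sketch under two assumptions that are not stated on the slide: every page has at least one out-link, and scores start from a uniform distribution. The function name and the three-page graph are illustrative.

```python
def simplified_pagerank(out_links, iterations=100):
    """Iterate PageRank'(u) = sum over in-links v of PageRank'(v) / |out-links of v|.

    Assumes every page is a key of out_links and has at least one out-link.
    """
    nodes = list(out_links)
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for v in nodes:
            share = rank[v] / len(out_links[v])   # v splits its score evenly
            for u in out_links[v]:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_rank = simplified_pagerank(toy_web)
```

On this toy graph the scores settle at 0.4 for a and c and 0.2 for b, which satisfies the formula exactly.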

Page 48: Link Analysis for Web Information Retrieval

Iterations with pseudo-PageRank

Page 50: Link Analysis for Web Information Retrieval

So far, so good, but ...

The Web includes many pages with no out-links; these will accumulate all of the score

We would like Web pages to accumulate ranking

We add random jumps (teleportation)

Page 51: Link Analysis for Web Information Retrieval

PageRank

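$$\mathrm{PageRank}(u) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}(v)}{|\Gamma^+(v)|}$$

$\Gamma^-(\cdot)$: in-links
$\Gamma^+(\cdot)$: out-links
$\epsilon/N$: jump to a random page with probability $\epsilon \approx 0.15$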
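
A minimal sketch of the formula with random jumps follows. One detail goes beyond the slide and is an assumption of this example: the score held by pages with no out-links is spread evenly over all pages, so the scores keep summing to 1. The function name and the toy graph are illustrative.

```python
def pagerank(out_links, epsilon=0.15, iterations=100):
    """Iterate PageRank(u) = epsilon/N + (1 - epsilon) * sum_v PageRank(v)/|out(v)|.

    Every page must be a key of out_links. Pages without out-links spread their
    score uniformly over all pages (an assumption of this sketch).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = epsilon / n + (1 - epsilon) * dangling / n
        new_rank = {u: base for u in nodes}
        for v in nodes:
            for u in out_links[v]:
                new_rank[u] += (1 - epsilon) * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Hypothetical toy graph; page d receives no links at all.
toy_pages = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_scores = pagerank(toy_pages)
```

A page with no in-links, such as d here, ends up with exactly the random-jump share epsilon/N.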

Page 52: Link Analysis for Web Information Retrieval

HITS

Two scores per page: “hub score” and “authority score”.

Page 55: Link Analysis for Web Information Retrieval

Iterations

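Initialize: hub(u, 0) = auth(u, 0) = 1

Iterate:

$$\mathrm{hub}(u, t) = \sum_{v \in \Gamma^+(u)} \frac{\mathrm{auth}(v, t-1)}{|\Gamma^-(v)|}$$

$$\mathrm{auth}(u, t) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{hub}(v, t-1)}{|\Gamma^+(v)|}$$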
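
The hub and authority updates can be sketched as follows. The code starts every score at 1, which is an assumption of this example: with these update rules, starting from all-zero scores would leave every score at zero. The function name and the toy graph are illustrative.

```python
def hits(out_links, iterations=50):
    """Iterate the degree-normalized hub/authority updates written above."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    out_n = {u: set(out_links.get(u, ())) for u in nodes}
    in_n = {u: set() for u in nodes}
    for u, targets in out_n.items():
        for v in targets:
            in_n[v].add(u)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Both updates read the scores from the previous step (t - 1).
        new_hub = {u: sum(auth[v] / len(in_n[v]) for v in out_n[u]) for u in nodes}
        new_auth = {u: sum(hub[v] / len(out_n[v]) for v in in_n[u]) for u in nodes}
        hub, auth = new_hub, new_auth
    return hub, auth


# Hypothetical toy graph: h1 links to a1 and a2, h2 links only to a1.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"]}
toy_hub, toy_auth = hits(toy_links)
```

In this toy graph h1 ends up as the stronger hub and a1 as the stronger authority, while pages with no out-links get hub score 0 and pages with no in-links get authority score 0.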

Page 57: Link Analysis for Web Information Retrieval

What is on the Web?

Page 58: Link Analysis for Web Information Retrieval

What is on the Web [2.0]?

Page 59: Link Analysis for Web Information Retrieval

What else is on the Web?

“The sum of all human knowledge plus porn” – Robert Gilbert

Source: www.milliondollarhomepage.com

Page 60: Link Analysis for Web Information Retrieval

What’s happening on the Web?

There is a fierce competition

for your attention

Page 61: Link Analysis for Web Information Retrieval

What’s happening on the Web?

Search engines are to some extent

arbiters of this competition

and they must watch it closely, otherwise ...

Page 62: Link Analysis for Web Information Retrieval

Some cheating occurs

1986 FIFA World Cup, Argentina vs England

Page 63: Link Analysis for Web Information Retrieval

Simple web spam

Page 64: Link Analysis for Web Information Retrieval

Hidden text

Page 65: Link Analysis for Web Information Retrieval

Made for advertising

Page 66: Link Analysis for Web Information Retrieval

Search engine?

Page 67: Link Analysis for Web Information Retrieval

Fake search engine

Page 68: Link Analysis for Web Information Retrieval

“Normal” content in link farms

Page 70: Link Analysis for Web Information Retrieval

Cloaking

Page 71: Link Analysis for Web Information Retrieval

Redirection

Page 72: Link Analysis for Web Information Retrieval

Redirects using Javascript

Simple redirect

<script>
document.location="http://www.topsearch10.com/";
</script>

“Hidden” redirect

<script>
var1=24; var2=var1;
if(var1==var2) {
  document.location="http://www.topsearch10.com/";
}
</script>

Page 73: Link Analysis for Web Information Retrieval

Problem: obfuscated code

Obfuscated redirect

<script>
var a1="win",a2="dow",a3="loca",a4="tion.",
    a5="replace",a6="('http://www.top10search.com/')";
var i,str="";
for(i=1;i<=6;i++) {
  str += eval("a"+i);
}
eval(str);
</script>

Page 74: Link Analysis for Web Information Retrieval

Problem: really obfuscated code

Encoded JavaScript

<script>
var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
var e = '', i;
eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
</script>

More examples: [Chellapilla and Maykov, 2007]

Page 75: Link Analysis for Web Information Retrieval

There are many attempts at cheating on the Web

Most of these are spam:

1,630,000 results for “free mp3 hilton viagra” in SE1

1,760,000 results for “credit vicodin loan” in SE2

1,320,000 results for “porn mortgage” in SE3

Page 76: Link Analysis for Web Information Retrieval

Costs

Costs for users: lower precision for some queries

Costs for search engines: wasted storage space, network resources, and processing cycles

Costs for the publishers: resources invested in cheating and not in improving their contents

Page 77: Link Analysis for Web Information Retrieval

Adversarial IR Issues on the Web

Link spam

Content spam

Cloaking

Comment/forum/wiki spam

Spam-oriented blogging

Click fraud ×2

Reverse engineering of ranking algorithms

Web content filtering

Advertisement blocking

Stealth crawling

Malicious tagging

. . . more?

Page 78: Link Analysis for Web Information Retrieval

Link Analysis forWeb Information

Retrieval

C. Castillo

Hypothesis

Levels of linkanalysis

Ranking

Web spam

... detection

... links

... contents

... both

Summary

Adversarial IR Issues on the Web

Link spam

Content spam

Cloaking

Comment/forum/wiki spam

Spam-oriented blogging

Click fraud ×2

Reverse engineering of ranking algorithms

Web content filtering

Advertisement blocking

Stealth crawling

Malicious tagging

. . . more?

Page 79: Link Analysis for Web Information Retrieval

Link Analysis forWeb Information

Retrieval

C. Castillo

Hypothesis

Levels of linkanalysis

Ranking

Web spam

... detection

... links

... contents

... both

Summary

Adversarial IR Issues on the Web

Link spam

Content spam

Cloaking

Comment/forum/wiki spam

Spam-oriented blogging

Click fraud ×2

Reverse engineering of ranking algorithms

Web content filtering

Advertisement blocking

Stealth crawling

Malicious tagging

. . . more?

Page 80: Link Analysis for Web Information Retrieval

Link Analysis forWeb Information

Retrieval

C. Castillo

Hypothesis

Levels of linkanalysis

Ranking

Web spam

... detection

... links

... contents

... both

Summary

Adversarial IR Issues on the Web

Link spam

Content spam

Cloaking

Comment/forum/wiki spam

Spam-oriented blogging

Click fraud ×2

Reverse engineering of ranking algorithms

Web content filtering

Advertisement blocking

Stealth crawling

Malicious tagging

. . . more?


Opportunities for Web spam

Spamdexing:

Keyword stuffing

Link farms

Spam blogs (splogs)

Cloaking

Adversarial relationship

Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.


Motivation

[Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:

“in a number of these distributions, outlier values are associated with web spam”


Machine Learning


Training of a Decision Tree


Decision Tree (error = 15%)


Decision Tree (error = 15% → 12%)


Machine Learning (cont.)


Feature Extraction


Challenges: Machine Learning

Machine Learning Challenges:

Instances are not really independent (graph)

Learning with few examples

Scalability


Challenges: Information Retrieval

Information Retrieval Challenges:

Feature extraction: which features?

Feature aggregation: page/host/domain

Feature propagation (graph)

Recall/precision tradeoffs

Scalability


Challenges: Data

Data is difficult to collect

Data is expensive to label

Labels are sparse

Humans do not always agree


Agreement


Results

Labels

Label               Frequency   Percentage
Normal                  4,046       61.75%
Borderline                709       10.82%
Spam                    1,447       22.08%
Can not classify          350        5.34%

Agreement

Category      Kappa   Interpretation
normal         0.62   Substantial agreement
spam           0.63   Substantial agreement
borderline     0.11   Slight agreement
global         0.56   Moderate agreement

Reference collection [Castillo et al., 2006]
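The kappa values in the agreement table measure how much two judges agree beyond what chance alone would produce. The slide does not say which kappa variant was used for the reference collection, so the following is only a minimal sketch of Cohen's kappa for two raters; the function name `cohen_kappa` and the toy labels are illustrative assumptions, not data from the collection.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items (illustrative)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both raters pick the same category
    # if each one labelled at random according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use one identical category
    return (observed - expected) / (1.0 - expected)


rater_1 = ["spam", "spam", "normal", "normal"]
rater_2 = ["spam", "normal", "normal", "normal"]
print(cohen_kappa(rater_1, rater_2))
```

In this toy case the raters agree on 3 of 4 hosts, chance agreement is 0.5, and kappa is 0.5.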


Topological spam: link farms

Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]


Handling large graphs

For large graphs, random access is not possible.

Large graphs do not fit in main memory

Streaming model of computation


Semi-streaming model

Memory size enough to hold some data per-node

Disk size enough to hold some data per-edge

A small number of passes over the data


Restriction

Semi-streaming model: graph on disk

for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
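The generic loop above can be instantiated in many ways. As a hedged illustration, the sketch below fills it in with a PageRank-style computation: per-node scores live in memory, while the edges stay in a file that is scanned sequentially once per pass. The file format (one `src dest` pair of integer node ids per line), the function name `semi_streaming_pagerank`, and the default parameters are assumptions made for this example.

```python
def semi_streaming_pagerank(edge_path, num_nodes, passes=20, alpha=0.85):
    """PageRank-style scores using O(N) memory and sequential edge scans."""
    # Preliminary pass: per-node out-degree, kept in memory.
    out_degree = [0] * num_nodes
    with open(edge_path) as edges:
        for line in edges:
            src, _dest = map(int, line.split())
            out_degree[src] += 1

    # INITIALIZE-MEM: one float per node.
    score = [1.0 / num_nodes] * num_nodes

    for _ in range(passes):  # iteration step, one sequential scan each
        nxt = [0.0] * num_nodes
        with open(edge_path) as edges:
            for line in edges:  # follow links in the graph
                src, dest = map(int, line.split())
                # COMPUTE(src, dest): push a share of src's score to dest.
                nxt[dest] += alpha * score[src] / out_degree[src]
        # NORMALIZE: mass lost to damping and to nodes without out-links
        # is spread uniformly so the scores keep summing to 1.
        leaked = 1.0 - sum(nxt)
        score = [value + leaked / num_nodes for value in nxt]

    return score  # POST-PROCESS would go here (e.g. sorting by score)
```

Only the `score`, `nxt`, and `out_degree` arrays are held in memory, which matches the "some data per node" restriction; the edge file is never accessed randomly.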


Link-Based Features

Degree-related measures

PageRank

TrustRank [Gyongyi et al., 2004]

Truncated PageRank [Becchetti et al., 2006]

Estimation of supporters [Becchetti et al., 2006]

140 features per host (2 pages per host)


Degree-Based

[Figure: two histograms of degree-based features, each comparing normal and spam hosts.]


TrustRank

TrustRank [Gyongyi et al., 2004]

A node with high PageRank but far away from a core set of “trusted nodes” is suspicious

Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step

Trusted nodes: data from http://www.dmoz.org/
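The random walk described above can be sketched as a power iteration in which the restart always lands on the trusted set. This is a minimal in-memory illustration, not the implementation from [Gyongyi et al., 2004]: the adjacency-dict representation, the function name `trustrank`, the uniform restart over trusted nodes, and the handling of nodes without out-links are all assumptions.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """Random walk that returns to the trusted set with probability 1 - alpha.

    out_links maps every node to the list of nodes it links to; every link
    target must also appear as a key. trusted is a non-empty set of nodes.
    """
    nodes = list(out_links)
    # Restart distribution: uniform over the trusted nodes, zero elsewhere.
    seed = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    trust = dict(seed)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * seed[n] for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = alpha * trust[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:
                # Assumption: a node without out-links sends its mass
                # back to the trusted set.
                for n in trusted:
                    nxt[n] += alpha * trust[src] / len(trusted)
        trust = nxt
    return trust


graph = {"a": ["b"], "b": ["a"], "s": ["t"], "t": ["s"]}
scores = trustrank(graph, trusted={"a"})
print(scores)
```

In the toy graph, nodes `s` and `t` are unreachable from the trusted node `a`, so they receive no trust regardless of how they link to each other.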


TrustRank Idea


TrustRank / PageRank

[Figure: histogram of TrustRank / PageRank for normal and spam hosts.]


High and low-ranked pages are different

[Figure: number of nodes (×10^4) at each distance from 1 to 20, with one curve each for the top 0%-10%, top 40%-50%, and top 60%-70% of pages.]

Areas below the curves are equal if we are in the same strongly-connected component


Probabilistic counting

[Figure: each page holds a bit vector (e.g. 100010); bit vectors are propagated along links towards the target page using the “OR” operation, and the bits set are counted to estimate supporters.]

[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
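To make the bit-propagation idea concrete, here is a small sketch of the classic Flajolet-Martin estimator combined with "OR" propagation along links. It is not the improved algorithm of [Becchetti et al., 2006], which counts bits set; this version uses the position of the lowest unset bit instead. The function names, the number of sketches, and the in-memory `in_links` dictionary are assumptions for illustration.

```python
import random

PHI = 0.77351  # Flajolet-Martin correction constant


def random_fm_bitmask(rng, bits=32):
    """Return a mask with one bit set; bit i is chosen with probability 2^-(i+1)."""
    i = 0
    while i < bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i


def lowest_zero_bit(mask):
    """Index of the least significant bit that is not set."""
    index = 0
    while mask & 1:
        mask >>= 1
        index += 1
    return index


def estimate_supporters(in_links, distance, sketches=64, seed=0):
    """Estimate, for every node, how many nodes reach it within `distance` links.

    in_links maps every node to the list of nodes linking to it; every
    source must also appear as a key. The count includes the node itself.
    """
    rng = random.Random(seed)
    nodes = list(in_links)
    masks = {n: [random_fm_bitmask(rng) for _ in range(sketches)] for n in nodes}
    for _ in range(distance):
        nxt = {}
        for n in nodes:
            merged = list(masks[n])
            for src in in_links[n]:
                # Propagation of bits using the OR operation.
                merged = [a | b for a, b in zip(merged, masks[src])]
            nxt[n] = merged
        masks = nxt
    return {
        n: 2 ** (sum(lowest_zero_bit(m) for m in masks[n]) / sketches) / PHI
        for n in nodes
    }
```

Because each round only reads the masks of direct in-neighbours, the propagation step fits the semi-streaming loop: one sequential pass over the edges per unit of distance.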


Bottleneck number

$b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. Minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.

[Figure: histogram of the bottleneck number for normal and spam hosts.]
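A worked example may help with the bottleneck number. The sketch below computes it exactly with a breadth-first search over in-links on a small in-memory graph. It assumes that $N_j(x)$ is the set of nodes that can reach x in at most j links, with $N_0(x) = \{x\}$; the slide does not spell this out, and on a Web-scale graph the neighborhood sizes would be estimated rather than enumerated. The name `bottleneck_number` is illustrative.

```python
def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| for j = 1..d (d >= 1)."""
    reached = {x}
    frontier = {x}
    sizes = [1]  # sizes[j] = |N_j(x)|, with N_0(x) = {x}
    for _ in range(d):
        # Nodes that link to the current frontier and were not seen before.
        frontier = {
            src
            for node in frontier
            for src in in_links.get(node, ())
            if src not in reached
        }
        reached |= frontier
        sizes.append(len(reached))
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))


# x has two supporters at distance 1, one more at distance 2, none after that.
toy = {"x": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
print(bottleneck_number(toy, "x", 1))  # sizes 1, 3
print(bottleneck_number(toy, "x", 3))  # sizes 1, 3, 4, 4
```

For the toy graph the neighborhood sizes are 1, 3, 4, 4, so the bottleneck number is 3 at distance 1 and drops to 1 at distance 3, where the neighborhood stops growing.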


Content-Based Features

Most of these reported in [Ntoulas et al., 2006]:

Number of words in the page and title

Average word length

Fraction of anchor text

Fraction of visible text

Compression rate

From [Castillo et al., 2007]:

Corpus precision and corpus recall

Query precision and query recall

Independent trigram likelihood

Entropy of trigrams
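Several of the features listed above are simple to compute from the visible text of a page. The following sketch shows three of them under stated assumptions: words are obtained by whitespace splitting, the compression rate is taken as uncompressed size divided by zlib-compressed size, and the trigram entropy is the Shannon entropy of the word-trigram distribution. The exact definitions used in [Ntoulas et al., 2006] and [Castillo et al., 2007] may differ, and `content_features` is a hypothetical helper name.

```python
import math
import zlib
from collections import Counter


def content_features(text):
    """Compute a few content-based features from the text of one page."""
    words = text.lower().split()
    if not words:
        raise ValueError("text contains no words")
    raw = text.encode("utf-8")

    # Distribution of consecutive word triples.
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = 0.0
    if total:
        entropy = -sum(
            (count / total) * math.log2(count / total)
            for count in trigrams.values()
        )

    return {
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Highly repetitive (e.g. keyword-stuffed) text compresses well,
        # which yields a large ratio.
        "compression_ratio": len(raw) / len(zlib.compress(raw)),
        "trigram_entropy": entropy,
    }


print(content_features("buy cheap pills " * 50))
```

The repeated phrase in the usage line mimics keyword stuffing: it has only three distinct trigrams and compresses to a small fraction of its original size.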


Average word length

Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.


Corpus precision

Figure: Histogram of the corpus precision in non-spam vs. spam pages.


Query precision

Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.


General hypothesis

Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages

Ideas for exploiting this: clustering, propagation, stacked learning


[Castillo et al., 2007]


Topological dependencies: in-links

Histogram of fraction of spam hosts in the in-links

0 = no in-link comes from spam hosts

1 = all of the in-links come from spam hosts

[Figure: histogram with two series, in-links of non-spam hosts and in-links of spam hosts; the horizontal axis runs from 0.0 to 1.0.]


Topological dependencies: out-links

Histogram of fraction of spam hosts in the out-links

0 = none of the out-links points to spam hosts

1 = all of the out-links point to spam hosts

[Figure: histogram with two series, out-links of non-spam hosts and out-links of spam hosts; the horizontal axis runs from 0.0 to 1.0.]


Idea 1: Clustering

Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
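The relabelling step can be sketched as follows, assuming the classifier output and the cluster assignment are already available as dictionaries. How the hosts are clustered is outside this sketch, and the names `smooth_by_cluster`, `spam_prob`, and `cluster_of`, the 0.5 threshold, and the tie-breaking rule are illustrative assumptions.

```python
from collections import defaultdict


def smooth_by_cluster(spam_prob, cluster_of, threshold=0.5):
    """Give every host the majority label of its cluster.

    spam_prob maps host -> classifier score in [0, 1];
    cluster_of maps host -> cluster id. Returns host -> True if spam.
    """
    members = defaultdict(list)
    for host, cluster_id in cluster_of.items():
        members[cluster_id].append(host)

    labels = {}
    for hosts in members.values():
        spam_votes = sum(spam_prob[h] >= threshold for h in hosts)
        # Strict majority; ties are resolved as non-spam (an assumption).
        is_spam = 2 * spam_votes > len(hosts)
        for h in hosts:
            labels[h] = is_spam
    return labels


scores = {"h1": 0.9, "h2": 0.8, "h3": 0.2, "h4": 0.1, "h5": 0.6}
clusters = {"h1": 0, "h2": 0, "h3": 0, "h4": 1, "h5": 1}
print(smooth_by_cluster(scores, clusters))
```

In the usage example, `h3` is flipped to spam because two of the three hosts in its cluster are predicted spam, while `h5` is flipped to non-spam by the tie rule.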


Idea 1: Clustering (cont.)

Initial prediction:


Idea 1: Clustering (cont.)

Clustering:


Idea 1: Clustering (cont.)

Final prediction:


Idea 1: Clustering – Results

                       Baseline   Clustering
Without bagging
True positive rate        75.6%        74.5%
False positive rate        8.5%         6.8%
F-Measure                 0.646        0.673
With bagging
True positive rate        78.7%        76.9%
False positive rate        5.7%         5.0%
F-Measure                 0.723        0.728

Reduces error rate


Idea 2: Propagate the label

Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
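One way to read this idea is as a random walk whose restart distribution is proportional to the predicted spamicity, run either along the links (forward) or against them (backward). The sketch below follows that reading; the function names, the damping value, and the treatment of nodes without out-links are assumptions, and the final threshold on the propagated score is left to the caller.

```python
def reverse_graph(out_links):
    """Return the graph with every link direction flipped."""
    reversed_links = {node: [] for node in out_links}
    for src, targets in out_links.items():
        for dest in targets:
            reversed_links[dest].append(src)
    return reversed_links


def propagate_spamicity(out_links, spam_prob, backwards=False,
                        alpha=0.85, iterations=20):
    """Random walk with restart; restarts are proportional to spamicity.

    out_links maps every node to its link targets (all targets are keys);
    spam_prob maps every node to a score in [0, 1], not all zero.
    """
    graph = reverse_graph(out_links) if backwards else out_links
    total = sum(spam_prob.values())
    restart = {n: spam_prob[n] / total for n in graph}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * restart[n] for n in graph}
        for src, targets in graph.items():
            if targets:
                share = alpha * score[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:
                # Assumption: a dead end restarts the walk.
                for n in graph:
                    nxt[n] += alpha * score[src] * restart[n]
        score = nxt
    return score


links = {"s1": ["s2"], "s2": ["s1", "u"], "u": [], "n": []}
initial = {"s1": 0.9, "s2": 0.8, "u": 0.1, "n": 0.1}
print(propagate_spamicity(links, initial))
```

In the usage example, hosts `u` and `n` start with the same low spamicity, but `u` ends with a higher propagated score because a likely-spam host links to it.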


Idea 2: Propagate the label (cont.)

Initial prediction:


Idea 2: Propagate the label (cont.)

Propagation:


Idea 2: Propagate the label (cont.)

Final prediction, applying a threshold:


Idea 2: Propagate the label – Results

                       Baseline   Fwds.   Backwds.    Both
Classifier without bagging
True positive rate        75.6%   70.9%      69.4%   71.4%
False positive rate        8.5%    6.1%       5.8%    5.8%
F-Measure                 0.646   0.665      0.664   0.676
Classifier with bagging
True positive rate        78.7%   76.5%      75.0%   75.2%
False positive rate        5.7%    5.4%       4.3%    4.7%
F-Measure                 0.723   0.716      0.733   0.724


Idea 3: Stacked graphical learning

Meta-learning scheme [Cohen and Kou, 2006]

Derive initial predictions

Generate an additional attribute for each object by combining predictions on neighbors in the graph

Append additional attribute in the data and retrain


Idea 3: Stacked graphical learning (cont.)

Let $p(x) \in [0, 1]$ be the prediction of a classification algorithm for a host x using k features

Let N(x) be the set of pages related to x (in some way)

Compute

$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$

Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
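The feature computation above amounts to averaging the first-pass predictions over each host's neighbors. A minimal sketch follows, assuming predictions and neighbor lists are held in dictionaries; the helper names and the fallback for hosts with no neighbors (reuse the host's own prediction) are assumptions, since the formula is undefined when N(x) is empty. Retraining the classifier on the extended rows is not shown.

```python
def stacked_feature(predictions, neighbors):
    """f(x): mean first-pass prediction p(g) over the neighbors g of x."""
    feature = {}
    for x, p_x in predictions.items():
        related = [g for g in neighbors.get(x, ()) if g in predictions]
        if related:
            feature[x] = sum(predictions[g] for g in related) / len(related)
        else:
            feature[x] = p_x  # assumption: no neighbors, keep own prediction
    return feature


def append_feature(rows, feature):
    """Extend each k-feature row with f(x), giving k + 1 features."""
    return {x: list(values) + [feature[x]] for x, values in rows.items()}


p = {"a": 0.9, "b": 0.1, "c": 0.5}
n = {"a": ["b", "c"], "b": ["a"], "c": []}
f = stacked_feature(p, n)
print(append_feature({"a": [1.0, 2.0], "b": [0.5, 0.1], "c": [0.0, 3.0]}, f))
```

Choosing N(x) as in-neighbors, out-neighbors, or both gives the three variants compared in the results; the procedure can be repeated by feeding the new predictions back into `stacked_feature`.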


Idea 3: Stacked graphical learning (cont.)

Initial prediction:


Idea 3: Stacked graphical learning (cont.)

Computation of new feature:


Idea 3: Stacked graphical learning (cont.)

New prediction with k + 1 features:


Idea 3: Stacked graphical learning - Results

                       Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate        78.7%        84.4%         78.3%          85.2%
False positive rate        5.7%         6.7%          4.8%           6.1%
F-Measure                 0.723        0.733         0.742          0.750

Increases detection rate


Idea 3: Stacked graphical learning x2

And repeat ...

                       Baseline   First pass   Second pass
True positive rate        78.7%        85.2%         88.4%
False positive rate        5.7%         6.1%          6.3%
F-Measure                 0.723        0.750         0.763

Significant improvement over the baseline


Concluding remarks

Hypothesis: topical locality + link endorsement

Primitives: similarity, ranking, propagation, etc.

Application to Web spam


Thank you!


Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.

Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). Characterization of national web domains. ACM Transactions on Internet Technology, 7(2).

Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.

Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer.

Barabasi, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.

Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.

Page 164: Link Analysis for Web Information Retrieval

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.

Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2):11–24.

Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM.

Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.

Page 165: Link Analysis for Web Information Retrieval

Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.

Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.

Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.

Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.

Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.

Page 166: Link Analysis for Web Information Retrieval

Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.

Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA. ACM Press.

Li, Y. (1998). Toward a qualitative search engine. IEEE Internet Computing.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.

Page 167: Link Analysis for Web Information Retrieval

Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.

Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the Internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.