Link Analysis for Web Information Retrieval

Talk from February 2008 @ FADOC, Complutense University, Madrid

Link Analysis for Web Information Retrieval

C. Castillo

Hypothesis

Levels of link analysis

Ranking

Web spam

... detection

... links

... contents

... both

Summary

Link Analysis for Web Information Retrieval
With Applications to Adversarial IR

Carlos Castillo (1)

chato@yahoo-inc.com

With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)

1. Yahoo! Research Barcelona, Catalunya, Spain
2. Università di Roma “La Sapienza”, Rome, Italy
3. Yahoo! Research Santiago, Chile
4. ISTI-CNR, Pisa, Italy
5. Università degli Studi di Milano, Milan, Italy

When you have a hammer

Everything looks like a graph!

1 Hypothesis
2 Levels of link analysis
3 Ranking
4 Web spam
5 ... detection
6 ... links
7 ... contents
8 ... both
9 Summary

Links are not placed at random

Topical locality hypothesis

Link endorsement hypothesis

Topical locality hypothesis

“We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]

[Figure: average text similarity (vertical axis, 0.2 to 0.7) versus link distance (horizontal axis, 1 to 5)]

[Baeza-Yates et al., 2006], data from UK 2006

Link similarity cases

Link (geodesic) distance

Co-citation

Bibliographic coupling

Co-citation

Bibliographic coupling

(Both can be generalized)

(Both co-citation and bibliographic coupling can be generalized. E.g., SimRank [Jeh and Widom, 2002] generalizes the idea of co-citation to several levels.)
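The two link-similarity measures named above can be made concrete with a small sketch. It assumes the usual definitions (two pages are co-cited when a third page links to both; two pages are bibliographically coupled when both link to the same third page). The function names and the toy graph are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_counts(out_links):
    """For each pair of pages, count how many pages link to both of them."""
    counts = defaultdict(int)
    for targets in out_links.values():
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return dict(counts)


def coupling_counts(out_links):
    """For each pair of pages, count how many pages both of them link to."""
    counts = {}
    for a, b in combinations(sorted(out_links), 2):
        shared = len(set(out_links[a]) & set(out_links[b]))
        if shared:
            counts[(a, b)] = shared
    return counts


# Hypothetical toy graph: page -> pages it links to.
toy_graph = {
    "p1": ["a", "b"],
    "p2": ["a", "b", "c"],
    "p3": ["c"],
}
cocited = cocitation_counts(toy_graph)
coupled = coupling_counts(toy_graph)
```

In this toy graph, pages a and b are co-cited by both p1 and p2, while p1 and p2 are coupled through their two shared targets.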

Link endorsement hypothesis

Links are assumed to be endorsements (votes, positive opinions) [Li, 1998]

But they can represent:

Disagreement

Self citations

Nepotism

Citations to methodological documents

etc.

Furthermore

They measure quantity, not quality (e.g., “Stop the numbers game!” in Communications of the ACM a few months ago)

Self-citations are frequent

In some topics there is more linking

Citations go from newer to older

New documents get few citations [Baeza-Yates et al., 2002]

Many of the citations are irrelevant

Nevertheless

Both the topical locality hypothesis and the link endorsement hypothesis are meaningful on the Web

Analogy with economics

Think of the hypotheses of many buyers and sellers, zero transaction costs, perfect information, etc. in economics: simplifications that do not hold exactly but are still useful

2 Levels of link analysis

How to find meaningful patterns?

Several levels of analysis:

Macroscopic view: overall structure

Microscopic view: nodes

Mesoscopic view: regions

Macroscopic view, e.g. Bow-tie

[Broder et al., 2000]

Macroscopic view, e.g. Jellyfish

[Tauro et al., 2001] - Internet Autonomous Systems (AS) topology

Microscopic view, e.g. Degree

[Barabasi, 2002] and others

“While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabasi, 2002]

Other scale-free networks

Power grid designs

Sexual partners in humans

Collaboration of movie actors in films

Citations in scientific publications

Protein interactions

Microscopic view, e.g. Degree

[Figure: degree distributions, one panel per country: Greece, Chile, Spain, Korea]

[Baeza-Yates et al., 2007] compares this distribution in 8 countries ... guess what the result is?

Mesoscopic view, e.g. Hop-plot

[Figure: hop-plots (frequency, 0.0 to 0.3, versus distance, 5 to 30) in four panels: .it (40M pages), .uk (18M pages), .eu.int (800K pages), synthetic graph (100K pages)]

[Baeza-Yates et al., 2006]
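A hop-plot of this kind can be sketched with breadth-first search. The sketch assumes the plot shows, for each distance, the fraction of reachable ordered page pairs at that distance; running a full search from every page is only practical for small graphs, and the function name and toy graph are hypothetical.

```python
from collections import Counter, deque


def hop_plot(out_links):
    """Return {distance: fraction of reachable ordered pairs at that distance}."""
    pair_counts = Counter()
    for source in out_links:
        # Breadth-first search gives shortest link distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in out_links.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                pair_counts[d] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    return {d: count / total for d, count in sorted(pair_counts.items())}


# Hypothetical toy graph: page -> pages it links to.
toy_links = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
frequencies = hop_plot(toy_links)
```

On graphs with millions of pages, one would typically sample source pages or use approximate neighbourhood counting instead of an exhaustive search.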

Models

Preferential attachment

Copy model

Hybrid models

Preferential attachment

“A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected.” [Barabasi and Albert, 1999]

“rich get richer”
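The two mechanisms in the quote (growth and preferential attachment) can be simulated in a few lines. This is a minimal sketch under assumptions that are not in the talk: the graph is undirected, it starts from a small fully connected core, and every new node adds a fixed number of links. All names and parameter values are hypothetical.

```python
import random


def preferential_attachment(num_nodes, links_per_node=2, seed=0):
    """Grow a graph in which new nodes prefer already well-connected targets."""
    rng = random.Random(seed)
    core = links_per_node + 1
    # Start from a small fully connected core.
    edges = [(i, j) for i in range(core) for j in range(i)]
    # Every node appears here once per incident edge, so a uniform draw
    # from this list picks a node with probability proportional to its degree.
    endpoints = [node for edge in edges for node in edge]
    for new in range(core, num_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


def degrees(edges):
    """Count how many edges touch each node."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree


edges = preferential_attachment(2000)
degree = degrees(edges)
```

Sorting the resulting degrees shows the “rich get richer” effect: most nodes keep a degree close to `links_per_node`, while a few early nodes accumulate far more links.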

3 Ranking

Counting in-links does not work

“With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998]

PageRank: simplified version

\[ \mathrm{PageRank}'(u) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}'(v)}{|\Gamma^+(v)|} \]

$\Gamma^-(\cdot)$: in-links

$\Gamma^+(\cdot)$: out-links

Iterations with pseudo-PageRank

So far, so good, but ...

The Web includes many pages with no out-links; these will accumulate all of the score

We would like Web pages to accumulate ranking

We add random jumps (teleportation)

PageRank

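\[ \mathrm{PageRank}(u) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}(v)}{|\Gamma^+(v)|} \]

$\Gamma^-(\cdot)$: in-links

$\Gamma^+(\cdot)$: out-links

$\epsilon/N$: jump to a random page with probability $\epsilon \approx 0.15$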
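The formula above can be evaluated by repeated substitution (power iteration). The sketch below takes N to be the number of pages and assumes that the score of pages without out-links is spread uniformly over all pages, which is one common convention and not something the slide specifies. The function name, the number of iterations and the toy graph are hypothetical.

```python
def pagerank(out_links, epsilon=0.15, iterations=50):
    """Power iteration for PageRank with random jumps of probability epsilon."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: epsilon / n for u in nodes}
        dangling = 0.0
        for v in nodes:
            targets = out_links.get(v, [])
            if targets:
                # Each page passes an equal share of its score to its out-links.
                share = (1 - epsilon) * rank[v] / len(targets)
                for u in targets:
                    new_rank[u] += share
            else:
                dangling += rank[v]
        # Assumption: pages with no out-links pass their score to every page.
        for u in nodes:
            new_rank[u] += (1 - epsilon) * dangling / n
        rank = new_rank
    return rank


# Hypothetical toy graph: page -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

In the toy graph, page d has no in-links, so its score stays at the random-jump share ε/N, while page c, which is linked from three pages, ranks highest.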

HITS

Two scores per page: “hub score” and “authority score”.

Iterations

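Initialize:

\[ \mathrm{hub}(u, 0) = \mathrm{auth}(u, 0) = 1 \]

Iterate:

\[ \mathrm{hub}(u, t) = \sum_{v \in \Gamma^+(u)} \frac{\mathrm{auth}(v, t-1)}{|\Gamma^-(v)|} \]

\[ \mathrm{auth}(u, t) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{hub}(v, t-1)}{|\Gamma^+(v)|} \]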
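The iteration above can be sketched as follows, using the degree-normalized update from this slide. Scores start at 1 here, because an all-zero start would stay at zero forever. The function name, the iteration count and the toy graph are hypothetical.

```python
def hits(out_links, iterations=20):
    """Iterate degree-normalized hub and authority scores."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    # Build the in-link lists from the out-link lists.
    in_links = {u: [] for u in nodes}
    for v, targets in out_links.items():
        for u in targets:
            in_links[u].append(v)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Both updates read the scores of the previous step (t - 1).
        new_hub = {
            u: sum(auth[v] / len(in_links[v]) for v in out_links.get(u, []))
            for u in nodes
        }
        new_auth = {
            u: sum(hub[v] / len(out_links[v]) for v in in_links[u])
            for u in nodes
        }
        hub, auth = new_hub, new_auth
    return hub, auth


# Hypothetical toy graph: two hub-like pages pointing to two targets.
toy_pages = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
hub_scores, auth_scores = hits(toy_pages)
```

Page x, which is linked from both h1 and h2, ends with a higher authority score than y, and h1, which links to both targets, ends with a higher hub score than h2.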

4 Web spam

What is on the Web?

What is on the Web [2.0]?

What else is on the Web?

“The sum of all human knowledge plus porn” – Robert Gilbert

Source: www.milliondollarhomepage.com

What’s happening on the Web?

There is a fierce competition

for your attention

Search engines are to some extent

arbiters of this competition

and they must watch it closely, otherwise ...

Some cheating occurs

1986 FIFA World Cup, Argentina vs England

Simple web spam

Hidden text

Made for advertising

Search engine?

Fake search engine

“Normal” content in link farms

Cloaking

Redirection

Redirects using Javascript

Simple redirect

<script>
document.location="http://www.topsearch10.com/";
</script>

“Hidden” redirect

<script>
var1=24; var2=var1;
if(var1==var2) {
  document.location="http://www.topsearch10.com/";
}
</script>

Problem: obfuscated code

Obfuscated redirect

<script>
var a1="win",a2="dow",a3="loca",a4="tion.",
    a5="replace",a6="('http://www.top10search.com/')";
var i,str="";
for(i=1;i<=6;i++) {
  str += eval("a"+i);
}
eval(str);
</script>
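Why obfuscation is a problem can be seen with a naive detector that only inspects the script text. The sketch below is an illustration and not a method from the talk: the regular expression and the function name are hypothetical, and the three script strings are the redirect examples from these slides.

```python
import re

# Hypothetical pattern: a literal assignment to the page location,
# or a literal call to location.replace(...).
REDIRECT_PATTERN = re.compile(
    r"(?:document|window)\.location(?:\.href)?\s*=|location\.replace\s*\(",
    re.IGNORECASE,
)


def looks_like_redirect(script_source):
    """Naive static check: does the script text literally change the location?"""
    return bool(REDIRECT_PATTERN.search(script_source))


simple = 'document.location="http://www.topsearch10.com/";'
hidden = (
    "var1=24; var2=var1; "
    'if(var1==var2) {document.location="http://www.topsearch10.com/";}'
)
obfuscated = (
    'var a1="win",a2="dow",a3="loca",a4="tion.",a5="replace",'
    "a6=\"('http://www.top10search.com/')\";"
    'var i,str="";for(i=1;i<=6;i++){str += eval("a"+i);}eval(str);'
)

detected = [looks_like_redirect(s) for s in (simple, hidden, obfuscated)]
```

The pattern finds the simple and the “hidden” redirect, but not the obfuscated one, because the string `window.location.replace(...)` only exists once the script is executed.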

Problem: really obfuscated code

Encoded javascript

<script>

var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";

var e = '', i;

eval(unescape('s%3Dunescape%28s%29%3Bfor...%3B'));

</script>

More examples: [Chellapilla and Maykov, 2007]

There are many attempts at cheating on the Web

Most of these are spam:

1,630,000 results for “free mp3 hilton viagra” in SE1

1,760,000 results for “credit vicodin loan” in SE2

1,320,000 results for “porn mortgage” in SE3

Costs

Costs for users: lower precision for some queries

Costs for search engines: wasted storage space, network resources, and processing cycles

Costs for the publishers: resources invested in cheating and not in improving their contents

Adversarial IR Issues on the Web

Link spam

Content spam

Cloaking

Comment/forum/wiki spam

Spam-oriented blogging

Click fraud ×2

Reverse engineering of ranking algorithms

Web content filtering

Advertisement blocking

Stealth crawling

Malicious tagging

. . . more?

Opportunities for Web spam

Spamdexing

Keyword stuffing

Link farms

Spam blogs (splogs)

Cloaking

Adversarial relationship

Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.

5 Web spam detection

Motivation

[Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:

“in a number of these distributions, outlier values are associated with web spam”
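The idea of looking for outliers in a per-page statistic can be sketched as follows. This is not the procedure of [Fetterly et al., 2004]: the statistic (out-degree), the median-based outlier rule, the threshold and the toy values are all assumptions made for illustration.

```python
from statistics import median


def flag_outliers(values_by_page, threshold=5.0):
    """Flag pages whose value is far from the median, measured in units of
    the median absolute deviation (MAD)."""
    values = list(values_by_page.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All values are (nearly) identical; nothing stands out.
        return set()
    return {
        page
        for page, value in values_by_page.items()
        if abs(value - center) / mad > threshold
    }


# Hypothetical statistic: number of out-links per page.
out_degree = {"p1": 12, "p2": 9, "p3": 15, "p4": 11, "p5": 10, "p6": 950}
suspicious = flag_outliers(out_degree)
```

An outlier is only a candidate: a flagged page would still need to be checked, for instance by a classifier that combines several such statistics.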

Machine Learning

Training of a Decision Tree

Decision Tree (error = 15%)

Decision Tree (error = 15% → 12%)

Machine Learning (cont.)

Feature Extraction

Challenges: Machine Learning

Machine Learning Challenges:

Instances are not really independent (graph)

Learning with few examples

Scalability

Challenges: Information Retrieval

Information Retrieval Challenges:

Feature extraction: which features?

Feature aggregation: page/host/domain

Feature propagation (graph)

Recall/precision tradeoffs

Scalability


Challenges: Data

Data is difficult to collect

Data is expensive to label

Labels are sparse

Humans do not always agree


Agreement


Results

Labels

Label              Frequency   Percentage
Normal                 4,046       61.75%
Borderline               709       10.82%
Spam                   1,447       22.08%
Can not classify         350        5.34%

Agreement

Category     Kappa   Interpretation
normal        0.62   Substantial agreement
spam          0.63   Substantial agreement
borderline    0.11   Slight agreement
global        0.56   Moderate agreement

Reference collection [Castillo et al., 2006]


Topological spam: link farms

Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]
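The idea of looking for groups of nodes that share their out-links can be illustrated on a toy graph. The sketch below is not the algorithm of [Gibson et al., 2005], which is designed for massive graphs; it is a quadratic-time illustration, and the function names, the Jaccard threshold and the toy data are all assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def candidate_farms(out_links, min_similarity=0.8, min_size=3):
    """Group nodes whose out-link sets are nearly identical.

    out_links: dict mapping node -> set of destination nodes.
    Returns the groups (connected components of the "similar out-links"
    relation) that have at least min_size members.
    The thresholds are illustrative assumptions, not values from the talk.
    """
    nodes = list(out_links)
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Compare every pair of nodes: fine for a toy graph only.
    for a, b in combinations(nodes, 2):
        if jaccard(out_links[a], out_links[b]) >= min_similarity:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return [g for g in groups.values() if len(g) >= min_size]

# Hypothetical toy graph: s1, s2 and s3 all point to the same targets.
toy = {
    "s1": {"t", "u"}, "s2": {"t", "u"}, "s3": {"t", "u"},
    "a": {"b", "c"}, "b": {"c"},
}
farms = candidate_farms(toy)
```

On real Web graphs the pairwise comparison would have to be replaced by a scalable technique such as fingerprinting the out-link sets.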


Handling large graphs

For large graphs, random access is not possible.

Large graphs do not fit in main memory

Streaming model of computation


Semi-streaming model

Memory size enough to hold some data per-node

Disk size enough to hold some data per-edge

A small number of passes over the data


Restriction

Semi-streaming model: graph on disk

for node : 1 ... N do
    INITIALIZE-MEM(node)
end for
for distance : 1 ... d do {Iteration step}
    for src : 1 ... N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src,dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
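As one possible instance of this skeleton, the following sketch computes PageRank while keeping only per-node arrays in memory and reading the links sequentially from disk on every pass. The edge-file format (one "src dest" pair per line, node ids from 0 to n-1), the damping value 0.85, the handling of nodes without out-links and the function names are assumptions made for the example.

```python
import os
import tempfile

def stream_edges(path):
    """Yield (src, dest) pairs by reading the edge file sequentially.

    Assumed file format: one "src dest" pair per line, node ids 0..n-1.
    """
    with open(path) as f:
        for line in f:
            src, dest = line.split()
            yield int(src), int(dest)

def semi_streaming_pagerank(path, n, passes=20, alpha=0.85):
    # Per-node data kept in memory: out-degree and current score.
    out_degree = [0] * n
    for src, _ in stream_edges(path):            # one pass to count out-links
        out_degree[src] += 1
    score = [1.0 / n] * n                        # INITIALIZE-MEM

    for _ in range(passes):                      # iteration step
        incoming = [0.0] * n
        for src, dest in stream_edges(path):     # follow links sequentially
            incoming[dest] += score[src] / out_degree[src]    # COMPUTE
        # Assumption: score held by nodes without out-links is spread uniformly.
        dangling = sum(s for s, d in zip(score, out_degree) if d == 0)
        score = [(1 - alpha) / n + alpha * (x + dangling / n)
                 for x in incoming]              # NORMALIZE
    return score

# Toy graph written to a temporary file: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0 1\n0 2\n1 2\n2 0\n")
scores = semi_streaming_pagerank(tmp.name, n=3, passes=50)
os.remove(tmp.name)
```

The memory footprint is proportional to the number of nodes, and the number of sequential passes over the edge file is `passes + 1`, which matches the restrictions of the model.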


Link-Based Features

Degree-related measures

PageRank

TrustRank [Gyongyi et al., 2004]

Truncated PageRank [Becchetti et al., 2006]

Estimation of supporters [Becchetti et al., 2006]

140 features per host (2 pages per host)


Degree-Based

[Figure: two histograms of degree-based features; series: Normal, Spam]


TrustRank

TrustRank [Gyongyi et al., 2004]

A node with high PageRank, but far away from a core set of “trusted nodes”, is suspicious

Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step

Trusted nodes: data from http://www.dmoz.org/
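The random walk described above can be sketched as a PageRank computation whose restart vector is concentrated on the trusted nodes. This is a minimal illustration and not the full method of [Gyongyi et al., 2004]; the value α = 0.85, the treatment of nodes without out-links, the function name and the toy graph are assumptions.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """Random walk that returns to the trusted set with probability 1 - alpha.

    out_links: dict node -> list of destination nodes (every node is a key).
    trusted: non-empty set of trusted seed nodes.
    alpha=0.85 is an assumed value, not one given in the talk.
    """
    nodes = list(out_links)
    restart = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for src in nodes:
            dests = out_links[src]
            if dests:
                share = alpha * score[src] / len(dests)
                for dest in dests:
                    nxt[dest] += share
            else:
                # Assumption: a walker stuck on a node without out-links
                # jumps back to the trusted set.
                for n in trusted:
                    nxt[n] += alpha * score[src] / len(trusted)
        score = nxt
    return score

# Hypothetical toy graph: the spam pair is unreachable from the trusted seed.
graph = {
    "seed": ["good"],
    "good": ["seed"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
trust = trustrank(graph, trusted={"seed"})
```

In the toy graph the two spam nodes endorse each other but receive no trust, because no path leads to them from the trusted set.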


TrustRank Idea


TrustRank / PageRank

[Figure: histogram of TrustRank / PageRank (x-axis from 0.4 to 9e+03); series: Normal, Spam]


High and low-ranked pages are different

[Figure: number of nodes (× 10^4) at each distance from 1 to 20; curves: Top 0%−10%, Top 40%−50%, Top 60%−70%]

Areas below the curves are equal if we are in the same strongly-connected component


Probabilistic counting

[Figure: bit vectors are propagated along the links towards the target page using the “OR” operation; the bits set at the target page are counted to estimate its supporters]

[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
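The propagation of bits with the OR operation can be sketched as follows. This is a simplified illustration with a fixed bit probability, and it does not reproduce the estimator of [Becchetti et al., 2006] or ANF; the parameters `k` and `eps`, the function names and the convention that a node counts as its own supporter are assumptions.

```python
import math
import random

def propagate_bits(in_links, distance, k=64, eps=0.05, seed=0):
    """Return, for each node, a k-bit integer after `distance` OR steps.

    in_links: dict node -> list of nodes linking to it (every node is a key).
    Each node starts with every bit set independently with probability eps
    (k and eps are assumed values chosen for the example).
    """
    rng = random.Random(seed)
    vec = {n: sum(1 << i for i in range(k) if rng.random() < eps)
           for n in in_links}
    for _ in range(distance):
        nxt = {}
        for node, sources in in_links.items():
            bits = vec[node]
            for src in sources:
                bits |= vec[src]        # propagation using the OR operation
            nxt[node] = bits
        vec = nxt
    return vec

def estimate_supporters(bits, k=64, eps=0.05):
    """Estimate how many nodes contributed to `bits` from the bits set."""
    ones = bin(bits).count("1")
    if ones == k:
        return float("inf")             # saturated: eps was too large
    # A given bit stays 0 with probability (1 - eps) ** supporters.
    return math.log(1 - ones / k) / math.log(1 - eps)

# Chain 0 -> 1 -> 2 -> 3, described by its in-links.
chain = {0: [], 1: [0], 2: [1], 3: [2]}
vectors = propagate_bits(chain, distance=3)
estimates = {n: estimate_supporters(b) for n, b in vectors.items()}
```

Only one bit vector per node has to be kept in memory, and each propagation step is a single sequential pass over the links, so the computation fits a streaming setting.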


Bottleneck number

$b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. Minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.
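For a small graph the bottleneck number can be computed exactly with a breadth-first search over the in-links. In this sketch N_j(x) is taken to be the set of nodes that can reach x in at most j steps, with N_0(x) = {x}; that reading of the definition, the function names and the toy graph are assumptions.

```python
def neighborhood_sizes(in_links, x, d):
    """Return [|N_0(x)|, ..., |N_d(x)|].

    Assumption: N_j(x) is the set of nodes that can reach x in at most
    j steps, and N_0(x) = {x}.
    """
    seen = {x}
    frontier = {x}
    sizes = [1]
    for _ in range(d):
        # Nodes linking to the current frontier that were not seen before.
        frontier = {s for n in frontier for s in in_links.get(n, ())} - seen
        seen |= frontier
        sizes.append(len(seen))
    return sizes

def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| for j = 1..d."""
    sizes = neighborhood_sizes(in_links, x, d)
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))

# Hypothetical toy graph: x has 3 direct supporters that share one supporter.
in_links = {"x": ["a", "b", "c"], "a": ["z"], "b": ["z"], "c": ["z"], "z": []}
sizes = neighborhood_sizes(in_links, "x", 3)
b3 = bottleneck_number(in_links, "x", 3)
```

Here the neighborhood stops growing after distance 2, so the bottleneck number for d = 3 drops to its minimum possible value, which is the behaviour expected of an isolated cluster.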

[Figure: histogram of the bottleneck number (x-axis from 1.11 to 4.52); series: Normal, Spam]


Content-Based Features

Most of these reported in [Ntoulas et al., 2006]:

Number of words in the page and title

Average word length

Fraction of anchor text

Fraction of visible text

Compression rate

From [Castillo et al., 2007]:

Corpus precision and corpus recall

Query precision and query recall

Independent trigram likelihood

Entropy of trigrams
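A few of the features listed above can be sketched with the standard library. The definitions used here are plausible readings only and may differ in detail from those in [Ntoulas et al., 2006] and [Castillo et al., 2007]; in particular, the compression rate is assumed to be the uncompressed size divided by the zlib-compressed size, and the sample texts are invented.

```python
import math
import zlib
from collections import Counter

def average_word_length(words):
    """Mean number of characters per word."""
    return sum(len(w) for w in words) / len(words)

def compression_rate(text):
    """Assumed definition: uncompressed size / zlib-compressed size (bytes)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

def trigram_entropy(words):
    """Shannon entropy (in bits) of the distribution of word trigrams."""
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    return -sum(c / total * math.log2(c / total) for c in trigrams.values())

# Invented sample texts: ordinary prose vs. keyword stuffing.
normal = "the quick brown fox jumps over the lazy dog near the river bank".split()
stuffed = ("cheap pills " * 20).split()

features = {
    "normal": (average_word_length(normal), trigram_entropy(normal)),
    "stuffed": (average_word_length(stuffed), trigram_entropy(stuffed)),
}
```

Repetitive, keyword-stuffed text compresses much better and has a lower trigram entropy than ordinary prose, which is why such features can help separate the two classes.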


Average word length

[Figure omitted; x-axis: average word length from 3.0 to 7.5; series: Normal, Spam]

Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.


Corpus precision

[Figure omitted; x-axis: corpus precision from 0.0 to 0.7; series: Normal, Spam]

Figure: Histogram of the corpus precision in non-spam vs. spam pages.


Query precision

[Figure omitted; x-axis: query precision from 0.0 to 0.6; series: Normal, Spam]

Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.


General hypothesis

Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages

Ideas for exploiting this: clustering, propagation, stacked learning


[Castillo et al., 2007]


Topological dependencies: in-links

Histogram of fraction of spam hosts in the in-links

0 = no in-link comes from spam hosts

1 = all of the in-links come from spam hosts

[Figure: histogram over the fraction of spam hosts in the in-links (x-axis from 0.0 to 1.0); series: in-links of non-spam, in-links of spam]


Topological dependencies: out-links

Histogram of fraction of spam hosts in the out-links

0 = none of the out-links points to spam hosts

1 = all of the out-links point to spam hosts

[Figure: histogram over the fraction of spam hosts in the out-links (x-axis from 0.0 to 1.0); series: out-links of non-spam, out-links of spam]


Idea 1: Clustering

Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
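The majority-voting step can be sketched as below, taking the classifier output and the cluster assignment as given. The function name, the tie-breaking rule and the toy data are assumptions; the classifier and the graph clustering algorithm themselves are not shown.

```python
from collections import Counter, defaultdict

def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host with the majority label of its cluster.

    predictions: dict host -> "spam" or "normal" (output of a classifier).
    cluster_of: dict host -> cluster id (output of a clustering step).
    Assumption: on a tie, hosts keep their original labels.
    """
    votes = defaultdict(Counter)
    for host, label in predictions.items():
        votes[cluster_of[host]][label] += 1

    smoothed = {}
    for host, label in predictions.items():
        ranked = votes[cluster_of[host]].most_common(2)
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        smoothed[host] = label if tie else ranked[0][0]
    return smoothed

# Hypothetical initial predictions and clusters.
predictions = {"a": "spam", "b": "spam", "c": "normal",
               "d": "normal", "e": "normal", "f": "spam"}
cluster_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
final = smooth_by_cluster(predictions, cluster_of)
```

In the toy data host "c" is relabeled as spam and host "f" as normal, because each disagrees with the majority of its cluster.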


Idea 1: Clustering (cont.)

Initial prediction:


Idea 1: Clustering (cont.)

Clustering:


Idea 1: Clustering (cont.)

Final prediction:


Idea 1: Clustering – Results

                       Baseline   Clustering
Without bagging
True positive rate        75.6%        74.5%
False positive rate        8.5%         6.8%
F-Measure                 0.646        0.673
With bagging
True positive rate        78.7%        76.9%
False positive rate        5.7%         5.0%
F-Measure                 0.723        0.728

Reduces error rate


Idea 2: Propagate the label

Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
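A minimal sketch of this propagation, following the links forwards, uses the normalized classifier scores as the restart vector of a random walk. The link-following probability of 0.3, the treatment of nodes without out-links, the function name and the toy data are assumptions rather than values taken from the talk.

```python
def propagate_spamicity(out_links, spamicity, alpha=0.3, iterations=50):
    """Random walk with restart; the restart vector is the normalized spamicity.

    out_links: dict node -> list of destinations (every node is a key).
    spamicity: dict node -> classifier score in [0, 1], not all zero.
    alpha is the probability of following a link (an assumed value).
    """
    total = sum(spamicity.values())
    restart = {n: spamicity[n] / total for n in out_links}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * restart[n] for n in out_links}
        for src, dests in out_links.items():
            if dests:
                for dest in dests:
                    nxt[dest] += alpha * score[src] / len(dests)
            else:
                # Assumption: walkers on nodes without out-links restart.
                for n in out_links:
                    nxt[n] += alpha * score[src] * restart[n]
        score = nxt
    return score

# Hypothetical graph: "u" was predicted non-spam but is linked from a spam host.
graph = {"s1": ["s2"], "s2": ["s1", "u"], "u": [],
         "n1": ["n2"], "n2": ["n1"]}
spamicity = {"s1": 0.9, "s2": 0.8, "u": 0.0, "n1": 0.0, "n2": 0.0}
propagated = propagate_spamicity(graph, spamicity)
```

The final prediction is then obtained by applying a threshold to the propagated scores; the threshold has to be tuned on labeled data and is not fixed here. Propagating backwards amounts to running the same walk on the graph with its links reversed.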


Idea 2: Propagate the label (cont.)

Initial prediction:


Idea 2: Propagate the label (cont.)

Propagation:


Idea 2: Propagate the label (cont.)

Final prediction, applying a threshold:


Idea 2: Propagate the label – Results

                       Baseline   Fwds.   Backwds.   Both
Classifier without bagging
True positive rate        75.6%   70.9%      69.4%   71.4%
False positive rate        8.5%    6.1%       5.8%    5.8%
F-Measure                 0.646   0.665      0.664   0.676
Classifier with bagging
True positive rate        78.7%   76.5%      75.0%   75.2%
False positive rate        5.7%    5.4%       4.3%    4.7%
F-Measure                 0.723   0.716      0.733   0.724


Idea 3: Stacked graphical learning

Meta-learning scheme [Cohen and Kou, 2006]

Derive initial predictions

Generate an additional attribute for each object by combining predictions on neighbors in the graph

Append additional attribute in the data and retrain


Idea 3: Stacked graphical learning (cont.)

Let $p(x) \in [0, 1]$ be the prediction of a classification algorithm for a host $x$ using $k$ features

Let $N(x)$ be the set of pages related to $x$ (in some way)

Compute

$$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$$

Add $f(x)$ as an extra feature for instance $x$ and learn a new model with $k + 1$ features
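The computation of f(x) and the extension of the feature vectors can be sketched in a few lines. The function names, the fallback for hosts with an empty N(x) and the toy values are assumptions; training the classifiers themselves is not shown.

```python
def neighbor_average(predictions, neighbors):
    """f(x): mean prediction p(g) over the neighbors g in N(x).

    Assumption: hosts with an empty N(x) reuse their own prediction,
    since the formula is undefined in that case.
    """
    f = {}
    for x, p_x in predictions.items():
        related = neighbors.get(x, [])
        if related:
            f[x] = sum(predictions[g] for g in related) / len(related)
        else:
            f[x] = p_x
    return f

def add_stacked_feature(features, predictions, neighbors):
    """Return feature vectors with f(x) appended (k -> k + 1 features)."""
    f = neighbor_average(predictions, neighbors)
    return {x: list(vec) + [f[x]] for x, vec in features.items()}

# Hypothetical initial predictions p(x), neighborhoods N(x) and k = 2 features.
predictions = {"a": 0.75, "b": 0.5, "c": 0.25}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
features = {"a": [1.0, 2.0], "b": [0.0, 1.0], "c": [3.0, 0.5]}
extended = add_stacked_feature(features, predictions, neighbors)
```

A new model would then be trained on `extended`. Repeating the procedure with the predictions of that new model gives a second pass.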


Idea 3: Stacked graphical learning (cont.)

Initial prediction:


Idea 3: Stacked graphical learning (cont.)

Computation of new feature:


Idea 3: Stacked graphical learning (cont.)

New prediction with k + 1 features:


Idea 3: Stacked graphical learning - Results

                       Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate        78.7%        84.4%         78.3%          85.2%
False positive rate        5.7%         6.7%          4.8%           6.1%
F-Measure                 0.723        0.733         0.742          0.750

Increases detection rate


Idea 3: Stacked graphical learning x2

And repeat ...

                       Baseline   First pass   Second pass
True positive rate        78.7%        85.2%         88.4%
False positive rate        5.7%         6.1%          6.3%
F-Measure                 0.723        0.750         0.763

Significant improvement over the baseline


Concluding remarks

Hypothesis: topical locality + link endorsement

Primitives: similarity, ranking, propagation, etc.

Application to Web spam


Thank you!


Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.

Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). Characterization of national web domains. ACM Transactions on Internet Technology, 7(2).

Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.

Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer.

Barabasi, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.

Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.


Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.

Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2):11–24.

Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM.

Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.


Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.

Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.

Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.

Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.

Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.


Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.

Jeh, G. and Widom, J. (2002). SimRank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA. ACM Press.

Li, Y. (1998). Toward a qualitative search engine. IEEE Internet Computing.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.


Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.

Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the Internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.
