Talk from February 2008 @ FADOC, Complutense University, Madrid
Link Analysis for Web Information Retrieval
C. Castillo
Hypothesis
Levels of link analysis
Ranking
Web spam
... detection
... links
... contents
... both
Summary
Link Analysis for Web Information Retrieval
With Applications to Adversarial IR
Carlos Castillo (1)
With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)
1. Yahoo! Research Barcelona – Catalunya, Spain
2. Università di Roma “La Sapienza” – Rome, Italy
3. Yahoo! Research Santiago – Chile
4. ISTI-CNR – Pisa, Italy
5. Università degli Studi di Milano – Milan, Italy
When you have a hammer
Everything looks like a graph!
1 Hypothesis
2 Levels of link analysis
3 Ranking
4 Web spam
5 ... detection
6 ... links
7 ... contents
8 ... both
9 Summary
Links are not placed at random
Topical locality hypothesis
Link endorsement hypothesis
Topical locality hypothesis
“We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]
[Figure: average text similarity (vertical axis, 0.2 to 0.7) versus link distance (horizontal axis, 1 to 5)]
[Baeza-Yates et al., 2006], data from UK 2006
Link similarity cases
Link (geodesic) distance
Co-citation
Bibliographic coupling
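The two counting-based similarity cases listed above can be sketched in Python on a small graph. This is a minimal illustration, not code from the talk: the toy graph and the function names are assumptions. Co-citation counts the pages that link to both u and v; bibliographic coupling counts the pages that both u and v link to.

```python
from collections import defaultdict

def build_in_links(out_links):
    """Invert an out-link adjacency dict into an in-link dict."""
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)
    return in_links

def co_citation(in_links, u, v):
    """Number of pages that link to both u and v."""
    return len(in_links.get(u, set()) & in_links.get(v, set()))

def bibliographic_coupling(out_links, u, v):
    """Number of pages that both u and v link to."""
    return len(out_links.get(u, set()) & out_links.get(v, set()))

# Hypothetical toy graph: page -> set of pages it links to.
out_links = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": set(),
    "d": set(),
    "e": set(),
}
in_links = build_in_links(out_links)
```

Here pages "c" and "d" are co-cited by "a" and "b", while "a" and "b" are coupled through their shared out-links to "c" and "d".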
Co-citation
Bibliographic coupling
(Both co-citation and bibliographic coupling can be generalized. E.g.: SimRank [Jeh and Widom, 2002] generalizes the idea of co-citation to several levels)
Link endorsement hypothesis
Links are assumed to be endorsements (votes, positive opinions) [Li, 1998]
But they can represent:
Disagreement
Self citations
Nepotism
Citations to methodological documents
etc.
Furthermore
They measure quantity, not quality (e.g.: “Stop the numbers game!” in Communications of the ACM a few months ago)
Self-citations are frequent
In some topics there is more linking
Citations go from newer to older
New documents get few citations [Baeza-Yates et al., 2002]
Many of the citations are irrelevant
Nevertheless
Both the topical locality hypothesis and the link endorsement hypothesis are meaningful on the Web
Analogy with economics
Think of the assumptions of many buyers and sellers, zero transaction costs, perfect information, etc. in economic sciences
How to find meaningful patterns?
Several levels of analysis:
Macroscopic view: overall structure
Microscopic view: nodes
Mesoscopic view: regions
Macroscopic view, e.g. Bow-tie
[Broder et al., 2000]
Macroscopic view, e.g. Jellyfish
[Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology
Microscopic view, e.g. Degree
[Barabasi, 2002] and others
“While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabasi, 2002]
Other scale-free networks
Power grid designs
Sexual partners in humans
Collaboration of movie actors in films
Citations in scientific publications
Protein interactions
Microscopic view, e.g. Degree
[Figure: four panels, one per country: Greece, Chile, Spain, Korea]
[Baeza-Yates et al., 2007] compares this distribution in 8 countries . . . guess what the result is?
Mesoscopic view, e.g. Hop-plot
[Figure: four hop-plot panels, frequency (0.0 to 0.3) versus distance (5 to 30): .it (40M pages), .uk (18M pages), .eu.int (800K pages), synthetic graph (100K pages)]
[Baeza-Yates et al., 2006]
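A plot of frequency versus distance like the one described here can be computed with a breadth-first search from every node. The sketch below is an illustration only, assuming a small in-memory graph; the function name and toy graph are hypothetical, and a Web-scale graph would need sampling or approximate methods instead of all-pairs BFS.

```python
from collections import Counter, deque

def distance_histogram(out_links):
    """Fraction of reachable ordered pairs (s, t), s != t, at each
    shortest-path distance, found by BFS from every node."""
    counts = Counter()
    for source in out_links:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in out_links[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != source:
                counts[d] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: c / total for d, c in sorted(counts.items())}

# Hypothetical toy graph: page -> list of pages it links to.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
hist = distance_histogram(graph)
```

In this toy graph nine ordered pairs are reachable: four at distance 1, four at distance 2, and one at distance 3.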
Models
Preferential attachment
Copy model
Hybrid models
Preferential attachment
“A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected.” [Barabasi and Albert, 1999]
“rich get richer”
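The two mechanisms in the quote, growth and preferential attachment, can be sketched in a few lines of Python. This is a simplified illustration in which each new node adds exactly one edge; the function names, the single-edge choice, and the fixed seed are assumptions made for the example.

```python
import random

def preferential_attachment(n_nodes, seed=0):
    """Grow an undirected graph: each new node attaches to one existing
    node chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single edge
    endpoints = [0, 1]    # every node appears once per incident edge
    for new in range(2, n_nodes):
        # Picking uniformly from the endpoint list is a degree-proportional pick.
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

edges = preferential_attachment(1000)
deg = degrees(edges)
```

Sorting `deg.values()` in decreasing order is a quick way to see the "rich get richer" effect: a few early nodes collect many links while most nodes keep a degree of one.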
Counting in-links does not work
“With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998]
PageRank: simplified version
\[ \mathrm{PageRank}'(u) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}'(v)}{|\Gamma^+(v)|} \]
Γ−(·): in-links
Γ+(·): out-links
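The simplified formula can be iterated directly. The sketch below is an illustration on a hypothetical three-page graph in which every page has at least one out-link; the function name, the uniform starting score, and the iteration count are assumptions.

```python
def simplified_pagerank(out_links, iterations=100):
    """Iterate PageRank'(u) = sum over in-links v of PageRank'(v) / |out(v)|,
    starting from a uniform score. No random jumps are applied."""
    nodes = list(out_links)
    score = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        for v, targets in out_links.items():
            for u in targets:
                # v splits its current score evenly among its out-links.
                nxt[u] += score[v] / len(targets)
        score = nxt
    return score

# Hypothetical toy graph: page -> list of pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = simplified_pagerank(graph)
```

On this graph the scores settle at 0.4 for "a" and "c" and 0.2 for "b", which can be checked by substituting them into the formula.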
Iterations with pseudo-PageRank
So far, so good, but ...
The Web includes many pages with no out-links; these will accumulate all of the score
We would like Web pages to accumulate ranking
We add random jumps (teleportation)
PageRank
\[ \mathrm{PageRank}(u) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{v \in \Gamma^-(u)} \frac{\mathrm{PageRank}(v)}{|\Gamma^+(v)|} \]
Γ−(·): in-links
Γ+(·): out-links
ε/N: jump to a random page with probability ε ≈ 0.15
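A minimal power-iteration sketch of this formula follows, with N taken as the number of pages in the graph. The function name, the toy graph, and the iteration count are assumptions. The formula is applied literally, so a page without out-links would simply pass nothing on and the scores would no longer sum to one; how to redistribute that score is a separate design choice that this sketch does not make.

```python
def pagerank(out_links, epsilon=0.15, iterations=100):
    """Iterate PageRank(u) = epsilon/N + (1 - epsilon) * sum over in-links v
    of PageRank(v) / |out(v)|, starting from a uniform score."""
    nodes = list(out_links)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page receives the random-jump share epsilon / N.
        nxt = {u: epsilon / n for u in nodes}
        for v, targets in out_links.items():
            for u in targets:
                nxt[u] += (1 - epsilon) * score[v] / len(targets)
        score = nxt
    return score

# Hypothetical toy graph with no dangling pages.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(graph)
```

Because of the random jump, every page keeps at least ε/N of the score regardless of its in-links.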
HITS
Two scores per page: “hub score” and “authority score”.
Iterations
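Initialize:
\[ \mathrm{hub}(u, 0) = \mathrm{auth}(u, 0) = 0 \]
Iterate:
\[ \mathrm{hub}(u, t) = \sum_{v \in \Gamma^+(u)} \frac{\mathrm{auth}(v, t-1)}{|\Gamma^-(v)|} \]
\[ \mathrm{auth}(u, t) = \sum_{v \in \Gamma^-(u)} \frac{\mathrm{hub}(v, t-1)}{|\Gamma^+(v)|} \]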
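The two update rules can be sketched as follows. This is an illustration only: the function name and toy graph are hypothetical, and the scores are started at 1 for every page as an assumption, because starting every score at zero would leave these updates at zero forever.

```python
def hub_authority_scores(out_links, iterations=50):
    """Apply the hub and authority updates synchronously: both new scores
    are computed from the previous step's values."""
    nodes = list(out_links)
    in_links = {u: [] for u in nodes}
    for v, targets in out_links.items():
        for u in targets:
            in_links[u].append(v)
    # Assumed nonzero start (see the lead-in).
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        new_hub = {
            u: sum(auth[v] / len(in_links[v]) for v in out_links[u])
            for u in nodes
        }
        new_auth = {
            u: sum(hub[v] / len(out_links[v]) for v in in_links[u])
            for u in nodes
        }
        hub, auth = new_hub, new_auth
    return hub, auth

# Hypothetical toy graph: two hub-like pages pointing at two targets.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hub_authority_scores(graph)
```

In this toy graph the page "a1", which is linked from both hubs, ends with twice the authority score of "a2", and pages without out-links end with a hub score of zero.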
What is on the Web?
What is on the Web [2.0]?
What else is on the Web?
“The sum of all human knowledge plus porn” – Robert Gilbert
Source: www.milliondollarhomepage.com
What’s happening on the Web?
There is a fierce competition
for your attention
What’s happening on the Web?
Search engines are to some extent
arbiters of this competition
and they must watch it closely, otherwise ...
Some cheating occurs
1986 FIFA World Cup, Argentina vs England
Simple web spam
Hidden text
Made for advertising
Search engine?
Fake search engine
“Normal” content in link farms
Cloaking
Redirection
Redirects using Javascript
Simple redirect
<script>
document.location="http://www.topsearch10.com/";
</script>
“Hidden” redirect
<script>
var1=24; var2=var1;
if(var1==var2) {
  document.location="http://www.topsearch10.com/";
}
</script>
Problem: obfuscated code
Obfuscated redirect
<script>
var a1="win",a2="dow",a3="loca",a4="tion.",
a5="replace",a6="('http://www.top10search.com/')";
var i,str="";
for(i=1;i<=6;i++) {
  str += eval("a"+i);
}
eval(str);
</script>
Problem: really obfuscated code
Encoded javascript
<script>
var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
var e = '', i;
eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
</script>
More examples: [Chellapilla and Maykov, 2007]
There are many attempts at cheating on the Web
Most of these are spam:
1,630,000 results for “free mp3 hilton viagra” in SE1
1,760,000 results for “credit vicodin loan” in SE2
1,320,000 results for “porn mortgage” in SE3
Costs
Costs for users: lower precision for some queries
Costs for search engines: wasted storage space, network resources, and processing cycles
Costs for the publishers: resources invested in cheating and not in improving their contents
Adversarial IR Issues on the Web
Link spam
Content spam
Cloaking
Comment/forum/wiki spam
Spam-oriented blogging
Click fraud ×2
Reverse engineering of ranking algorithms
Web content filtering
Advertisement blocking
Stealth crawling
Malicious tagging
. . . more?
Opportunities for Web spam
Spamdexing
Keyword stuffing
Link farms
Spam blogs (splogs)
Cloaking
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
Motivation
[Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
“in a number of these distributions, outlier values are associated with web spam”
Machine Learning
Training of a Decision Tree
Decision Tree (error = 15%)
Decision Tree (error = 15% → 12%)
Machine Learning (cont.)
Feature Extraction
Challenges: Machine Learning
Machine Learning Challenges:
Instances are not really independent (graph)
Learning with few examples
Scalability
Challenges: Information Retrieval
Information Retrieval Challenges:
Feature extraction: which features?
Feature aggregation: page/host/domain
Feature propagation (graph)
Recall/precision tradeoffs
Scalability
Challenges: Data
Data is difficult to collect
Data is expensive to label
Labels are sparse
Humans do not always agree
Agreement
Results
Labels
Label               Frequency   Percentage
Normal                  4,046       61.75%
Borderline                709       10.82%
Spam                    1,447       22.08%
Can not classify          350        5.34%
Agreement
Category     Kappa   Interpretation
normal        0.62   Substantial agreement
spam          0.63   Substantial agreement
borderline    0.11   Slight agreement
global        0.56   Moderate agreement
Reference collection [Castillo et al., 2006]
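As an illustration of how an agreement figure like the kappa values above can be computed, here is a minimal sketch of Cohen's kappa for two raters. The slides do not say which kappa variant was used, so the two-rater form is an assumption, and the function names are hypothetical. The verbal labels follow the commonly used Landis and Koch scale, which matches the interpretations shown in the table.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def interpret(kappa):
    """Verbal interpretation on the Landis and Koch scale."""
    if kappa < 0:
        return "Poor agreement"
    bands = [(0.20, "Slight"), (0.40, "Fair"),
             (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, name in bands:
        if kappa <= upper:
            return name + " agreement"
    return "Almost perfect agreement"
```

With more than two assessors per host, a multi-rater statistic would be needed instead; the sketch only shows the observed-versus-chance idea.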
Topological spam: link farms
Single-level farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005]
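A deliberately simplified sketch of the "nodes sharing their out-links" idea: group nodes whose out-link sets are identical and report the large groups. This is not the algorithm of [Gibson et al., 2005], which handles approximate overlap in massive graphs; the function name and both thresholds are hypothetical.

```python
from collections import defaultdict


def groups_sharing_out_links(out_links, min_group_size=3, min_out_degree=2):
    """Return groups of nodes that point to exactly the same set of targets.

    out_links maps each node to an iterable of the nodes it links to.
    """
    groups = defaultdict(list)
    for node, targets in out_links.items():
        key = frozenset(targets)
        # Ignore nodes with very few links: sharing one target is common.
        if len(key) >= min_out_degree:
            groups[key].append(node)
    return [sorted(nodes) for nodes in groups.values()
            if len(nodes) >= min_group_size]
```

Exact matching is easy to evade by varying a few links, so a practical detector would compare out-link sets by similarity rather than equality.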
Handling large graphs
For large graphs, random access is not possible.
Large graphs do not fit in main memory
Streaming model of computation
Semi-streaming model
Memory size enough to hold some data per-node
Disk size enough to hold some data per-edge
A small number of passes over the data
Restriction
Semi-streaming model: graph on disk
for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
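One way to instantiate the generic loop above is PageRank: the per-node scores stay in memory while the edges are read sequentially, once per pass. This is a sketch under assumptions not stated in the slides: nodes are numbered from 0, `edge_stream` is a hypothetical callable that returns a fresh iterator over (src, dest) pairs (standing in for a sequential scan of the graph on disk), and `out_degree` is a precomputed per-node list.

```python
def semi_streaming_pagerank(num_nodes, edge_stream, out_degree,
                            passes=20, alpha=0.85):
    """PageRank with O(N) memory and one sequential edge scan per pass."""
    # INITIALIZE-MEM: one score per node, held in main memory.
    score = [1.0 / num_nodes] * num_nodes
    for _ in range(passes):
        incoming = [0.0] * num_nodes
        # Follow links in the graph: a single sequential pass over the edges.
        for src, dest in edge_stream():
            # COMPUTE(src, dest): src shares its score among its out-links.
            incoming[dest] += score[src] / out_degree[src]
        # Score held by nodes without out-links is spread uniformly.
        dangling = sum(score[v] for v in range(num_nodes)
                       if out_degree[v] == 0)
        score = [(1.0 - alpha) / num_nodes
                 + alpha * (incoming[v] + dangling / num_nodes)
                 for v in range(num_nodes)]
        # NORMALIZE: keep the scores summing to one.
        total = sum(score)
        score = [s / total for s in score]
    return score
```

Only the two per-node lists need random access; the edge list is never indexed, which is what lets it live on disk.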
Link-Based Features
Degree-related measures
PageRank
TrustRank [Gyongyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
Degree-Based
Figure: Two histograms of degree-based measures, each comparing Normal and Spam.
TrustRank
TrustRank [Gyongyi et al., 2004]
A node with high PageRank, but far away from a core set of “trusted nodes”, is suspicious
Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step
Trusted nodes: data from http://www.dmoz.org/
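The random walk described above can be sketched as a PageRank-style iteration whose restart distribution is uniform over the trusted set. This is an illustrative sketch, not the exact procedure of [Gyongyi et al., 2004]: the function name, the value of α, the iteration count, and the treatment of nodes without out-links are assumptions. Every link target must also appear as a key of `out_links`, and `trusted` must be a non-empty set of such keys.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """Random walk that returns to the trusted set with probability 1 - alpha."""
    nodes = sorted(out_links)
    # Restart distribution: uniform over trusted nodes, zero elsewhere.
    restart = {v: (1.0 / len(trusted) if v in trusted else 0.0)
               for v in nodes}
    trust = dict(restart)
    for _ in range(iterations):
        nxt = {v: (1.0 - alpha) * restart[v] for v in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                # Follow an out-link chosen uniformly at random.
                share = alpha * trust[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:
                # Assumption: a walker stuck at a dead end also restarts.
                for v in trusted:
                    nxt[v] += alpha * trust[src] / len(trusted)
        trust = nxt
    return trust
```

Nodes that cannot be reached from the trusted set keep a score of zero, so a page with high PageRank and a low TrustRank-to-PageRank ratio stands out as suspicious.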
TrustRank Idea
TrustRank / PageRank
Figure: Distribution of TrustRank / PageRank for Normal vs. Spam.
High and low-ranked pages are different
Figure: Number of nodes (× 10^4) at each distance (1 to 20) for pages ranked in the top 0%−10%, top 40%−50%, and top 60%−70%.
Areas below the curves are equal if we are in the same strongly-connected component
Probabilistic counting
Figure: Each page holds a bit vector (e.g., 100010). Bits are propagated along links towards the target page using the “OR” operation, and the bits set at the target page are counted to estimate its supporters.
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
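A simplified sketch of the bit-propagation idea: every page sets each bit of its vector independently with a small probability, vectors are OR-ed along in-links for a fixed number of rounds, and the number of bits set is inverted into a count. The estimator used here (solving E[bits set] = bits · (1 − (1 − eps)^n) for n) is an illustrative choice, not necessarily the one in [Becchetti et al., 2006]; the function name and the parameters `bits`, `eps`, and `seed` are hypothetical. Every node, including nodes with no in-links, must appear as a key of `in_links`.

```python
import math
import random


def estimate_supporters(in_links, distance, bits=4096, eps=0.01, seed=0):
    """Estimate, per node, how many nodes reach it within `distance` hops.

    The count includes the node itself.
    """
    rng = random.Random(seed)
    nodes = sorted(in_links)

    def random_mask():
        mask = 0
        for i in range(bits):
            if rng.random() < eps:
                mask |= 1 << i
        return mask

    reach = {v: random_mask() for v in nodes}
    for _ in range(distance):
        nxt = {}
        for v in nodes:
            mask = reach[v]
            # Propagation of bits using the OR operation.
            for u in in_links[v]:
                mask |= reach[u]
            nxt[v] = mask
        reach = nxt

    estimates = {}
    for v in nodes:
        # Count bits set, capped so the logarithm below stays finite.
        ones = min(bin(reach[v]).count("1"), bits - 1)
        estimates[v] = math.log(1.0 - ones / bits) / math.log(1.0 - eps)
    return estimates
```

Because OR is idempotent, a page reachable along several paths is still counted once, which is what makes the estimate a count of distinct supporters.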
Bottleneck number
$b_d(x) = \min_{j \le d} \left\{ |N_j(x)| / |N_{j-1}(x)| \right\}$: the minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph, and that they have smaller bottleneck numbers than non-spam pages.
Figure: Histogram of the bottleneck number for Normal vs. Spam.
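A small sketch of computing the bottleneck number exactly with a breadth-first search over in-links. It assumes that N_j(x) is the set of nodes that can reach x in at most j hops, with N_0(x) = {x}; the slide does not spell out this definition, so treat it as an assumption. The function name is hypothetical and d must be at least 1.

```python
from collections import deque


def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| for j = 1..d."""
    # Breadth-first search backwards along in-links, up to depth d.
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if dist[v] == d:
            continue
        for u in in_links.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    # size[j] = number of nodes within distance j (cumulative counts).
    size = [0] * (d + 1)
    for depth in dist.values():
        size[depth] += 1
    for j in range(1, d + 1):
        size[j] += size[j - 1]
    return min(size[j] / size[j - 1] for j in range(1, d + 1))
```

On a graph too large for random access, the neighborhood sizes would come from estimates rather than from an exact search like this one.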
Content-Based Features
Most of these reported in [Ntoulas et al., 2006]:
Number of words in the page and title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
From [Castillo et al., 2007]:
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
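A few of the features listed above can be sketched in a handful of lines. The definitions here are plausible readings, not the exact ones from [Ntoulas et al., 2006] or [Castillo et al., 2007]: words are runs of word characters, the compression rate is taken as uncompressed size divided by zlib-compressed size, and the trigram entropy is the Shannon entropy of the word-trigram distribution. The function name is hypothetical.

```python
import math
import re
import zlib
from collections import Counter


def content_features(text):
    """Compute a few simple content-based features for one page."""
    words = re.findall(r"\w+", text.lower())
    raw = text.encode("utf-8")
    # Distribution of consecutive word triples.
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = 0.0
    if total:
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in trigrams.values())
    return {
        "num_words": len(words),
        "avg_word_length": (sum(map(len, words)) / len(words)
                            if words else 0.0),
        # Highly repetitive pages compress well, giving a large ratio.
        "compression_rate": (len(raw) / len(zlib.compress(raw))
                             if raw else 0.0),
        "trigram_entropy": entropy,
    }
```

A page built by repeating a few keywords shows a high compression rate and a low trigram entropy compared with ordinary prose.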
Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
General hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages
Ideas for exploiting this: clustering, propagation, stacked learning
[Castillo et al., 2007]
Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts
Figure: Histogram over the fraction of spam hosts in the in-links (0.0 to 1.0), with one series for in-links of non-spam and one for in-links of spam.
Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts
Figure: Histogram over the fraction of spam hosts in the out-links (0.0 to 1.0), with one series for out-links of non-spam and one for out-links of spam.
Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
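The relabelling step can be sketched as follows, assuming the classifier output and the cluster assignment are already available as dictionaries. The function name is hypothetical, and resolving ties as non-spam is an assumption; how the hosts are clustered is outside the scope of this sketch.

```python
from collections import defaultdict


def smooth_by_cluster(predicted_spam, cluster_of):
    """Give every host the majority label of its cluster.

    predicted_spam maps host -> bool (initial classifier output);
    cluster_of maps host -> cluster identifier.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)
    smoothed = {}
    for hosts in members.values():
        spam_votes = sum(predicted_spam[h] for h in hosts)
        # Strict majority needed; ties are labelled non-spam (assumption).
        majority_is_spam = spam_votes * 2 > len(hosts)
        for h in hosts:
            smoothed[h] = majority_is_spam
    return smoothed
```

A stray misclassified host inside an otherwise homogeneous cluster is corrected by the vote, which is how the step can lower the error rate.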
Idea 1: Clustering (cont.)
Figures (not reproduced): initial prediction; clustering; final prediction.
Idea 1: Clustering – Results
                        Baseline   Clustering
Without bagging
  True positive rate      75.6%       74.5%
  False positive rate      8.5%        6.8%
  F-Measure               0.646       0.673
With bagging
  True positive rate      78.7%       76.9%
  False positive rate      5.7%        5.0%
  F-Measure               0.723       0.728
✓ Reduces error rate
Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
Idea 2: Propagate the label (cont.)
Figures (not reproduced): initial prediction; propagation; final prediction, applying a threshold.
Idea 2: Propagate the label – Results
                        Baseline   Fwds.   Backwds.   Both
Classifier without bagging
  True positive rate      75.6%    70.9%    69.4%    71.4%
  False positive rate      8.5%     6.1%     5.8%     5.8%
  F-Measure               0.646    0.665    0.664    0.676
Classifier with bagging
  True positive rate      78.7%    76.5%    75.0%    75.2%
  False positive rate      5.7%     5.4%     4.3%     4.7%
  F-Measure               0.723    0.716    0.733    0.724
Idea 3: Stacked graphical learning
Meta-learning scheme [Cohen and Kou, 2006]
Derive initial predictions
Generate an additional attribute for each object by combining predictions on neighbors in the graph
Append the additional attribute to the data and retrain
Idea 3: Stacked graphical learning (cont.)
Let $p(x) \in [0, 1]$ be the prediction of a classification algorithm for a host x using k features
Let N(x) be the set of pages related to x (in some way)
Compute
$$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$$
Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
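The feature computation above can be sketched as follows, with predictions and neighborhoods held in dictionaries. The function names are hypothetical. Falling back to the host's own prediction when N(x) is empty is an assumption, since the formula is undefined in that case. Retraining on the extended rows is left to whatever classifier is in use.

```python
def stacked_feature(predictions, neighbors):
    """f(x): average of the predictions p(g) over the neighbors g of x."""
    f = {}
    for x, p in predictions.items():
        # Only neighbors that have a prediction can contribute.
        related = [g for g in neighbors.get(x, ()) if g in predictions]
        if related:
            f[x] = sum(predictions[g] for g in related) / len(related)
        else:
            f[x] = p  # assumption: no neighbors, reuse the host's own p(x)
    return f


def extend_features(features, f):
    """Append f(x) to each k-dimensional row, giving k + 1 features."""
    return {x: row + [f[x]] for x, row in features.items()}
```

Running the two functions again on the predictions of the retrained model gives a second pass of the same scheme.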
Idea 3: Stacked graphical learning (cont.)
Figures (not reproduced): initial prediction; computation of new feature; new prediction with k + 1 features.
Idea 3: Stacked graphical learning - Results
                      Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate      78.7%       84.4%        78.3%         85.2%
False positive rate      5.7%        6.7%         4.8%          6.1%
F-Measure               0.723       0.733        0.742         0.750
✓ Increases detection rate
Idea 3: Stacked graphical learning x2
And repeat ...
                      Baseline   First pass   Second pass
True positive rate      78.7%       85.2%        88.4%
False positive rate      5.7%        6.1%         6.3%
F-Measure               0.723       0.750        0.763
✓ Significant improvement over the baseline
Concluding remarks
Hypothesis: topical locality + link endorsement
Primitives: similarity, ranking, propagation, etc.
Application to Web spam
Thank you!
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing pagerank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2007). Characterization of national web domains. ACM Transactions on Internet Technology, 7(2).
Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the chilean web structure. Comput. Networks, 50(10):1464–1473.
Baeza-Yates, R., Saint-Jean, F., and Castillo, C. (2002). Web structure, dynamics and page quality. In Proceedings of String Processing and Information Retrieval (SPIRE), volume 2476 of Lecture Notes in Computer Science, Lisbon, Portugal. Springer.
Barabasi, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. (2006). A reference collection for web spam. SIGIR Forum, 40(2):11–24.
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of SIGIR, Amsterdam, Netherlands. ACM.
Chellapilla, K. and Maykov, A. (2007). A taxonomy of javascript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.
Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in markov random fields using very short inhomogeneous markov chains. Technical report.
Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with trustrank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Jeh, G. and Widom, J. (2002). Simrank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA. ACM Press.
Li, Y. (1998). Toward a qualitative search engine. IEEE Internet Computing.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.