
Page 1: Know your Neighbors: Web Spam Detection using the Web Topology

Carlos Castillo, chato@yahoo-inc.com
Debora Donato, debora@yahoo-inc.com
Aristides Gionis, gionis@yahoo-inc.com
Vanessa Murdock, vmurdock@yahoo-inc.com
Fabrizio Silvestri

Presented by Anton Rodriguez-Dmitriev

Page 2: Personal Background

Graduated from FSU
Working on an MSECE
Specializing in Controls
CS minor
Work part-time at STW Technic, LP

Page 3: Web Spam Consequences

Damages the reputation of the search engine
Weakens the trust of the users
Eiron et al. ranked 100 million pages using PageRank: 11 out of the top 20 were pornographic pages
PageRank alone cannot filter spam
Cost is incurred in crawling, indexing and storing spam pages

Page 5: Some popular spamming techniques

Link spam: creating a link structure, usually a tightly knit community of links, to try to affect the outcome of a link-based ranking algorithm.
Content spam: maliciously crafting the content of a Web page using techniques such as keyword stuffing, i.e. inserting keywords that are more related to popular queries than to the actual content of the page.
Cloaking: sending different content to a search engine than to the regular visitors of a website.

Page 6: Topology of the Dataset

Used the WEBSPAM-UK2006 dataset, a publicly available spam collection
Undirected graph, pruned to contain only hosts that share more than 100 links
Black nodes are spam and white nodes are non-spam
Most spammers in the larger connected component are clustered together
Other connected components are single-class
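The pruning and component check described on this slide can be sketched in a few lines of Python. This is only an illustration on made-up data: the `link_counts` mapping, the host names and the labels are assumptions, not the WEBSPAM-UK2006 format.

```python
from collections import defaultdict


def prune_and_components(link_counts, min_links=100):
    """Keep host pairs that share more than `min_links` links, then
    return the connected components of the resulting undirected graph."""
    adj = defaultdict(set)
    for (u, v), shared in link_counts.items():
        if shared > min_links:
            adj[u].add(v)
            adj[v].add(u)

    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal collects one component at a time.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def is_single_class(component, labels):
    """True if every host in the component carries the same label."""
    return len({labels[host] for host in component}) == 1


# Hypothetical toy data: number of links shared by each pair of hosts.
link_counts = {("a", "b"): 250, ("b", "c"): 120, ("c", "d"): 40, ("d", "e"): 300}
labels = {"a": "spam", "b": "spam", "c": "nonspam", "d": "nonspam", "e": "nonspam"}

components = prune_and_components(link_counts)
for component in components:
    print(sorted(component), "single-class:", is_single_class(component, labels))
```

The pair ("c", "d") shares fewer than 100 links, so pruning splits the toy graph into two components, one mixed and one single-class.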

Page 7: Evaluation of the process

Confusion matrix:
a is the number of non-spam examples that were correctly classified
b is the number of non-spam examples that were falsely classified as spam
c is the number of spam examples that were falsely classified as non-spam
d is the number of spam examples that were correctly classified

Page 8: Success Measures

In terms of the confusion-matrix counts a, b, c and d:
True positive rate (or recall): $R = \frac{d}{c + d}$
False positive rate: $\frac{b}{b + a}$
Precision: $P = \frac{d}{b + d}$
F-measure: $F = \frac{2PR}{P + R}$
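As a small worked example, the four measures can be computed directly from the confusion-matrix counts a, b, c and d defined on the previous slide, using the standard definitions (recall d/(c+d), false positive rate b/(b+a), precision d/(b+d), F-measure as the harmonic mean of precision and recall). The counts below are invented for illustration and are not results from the paper.

```python
def ratio(numerator, denominator):
    """Safe division: an empty class yields 0.0 instead of an error."""
    return numerator / denominator if denominator else 0.0


def success_measures(a, b, c, d):
    """a: non-spam kept as non-spam, b: non-spam flagged as spam,
    c: spam missed (classified non-spam), d: spam correctly flagged."""
    recall = ratio(d, c + d)            # true positive rate
    false_positive_rate = ratio(b, b + a)
    precision = ratio(d, b + d)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
measures = success_measures(a=900, b=100, c=50, d=150)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts the recall is 150/200 = 0.75, the false positive rate is 100/1000 = 0.1 and the precision is 150/250 = 0.6.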

Page 9: Link-based Features

Degree-related measures:
  In-degree and out-degree of the hosts and their neighbors
  Edge reciprocity: the number of links that are reciprocal
  Assortativity: the ratio between the degree of a particular page and the average degree of its neighbors
PageRank
TrustRank: uses a subset of hand-picked trusted nodes and propagates their labels through the Web graph
Truncated PageRank: a variant of PageRank that diminishes the influence of a page on the PageRank of its neighbors
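The features listed on this slide can be illustrated on a toy directed graph. The sketch below is not the authors' implementation. It assumes that "degree" means the number of distinct neighbors regardless of direction, and it approximates the TrustRank idea with a PageRank whose random jumps land only on a hand-picked `trusted` set. The function names, parameters and edge list are all hypothetical.

```python
from collections import defaultdict


def degree_features(edges):
    """Per-host in-degree, out-degree, reciprocal links and assortativity ratio."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_nb[u].add(v)
        in_nb[v].add(u)

    def degree(n):
        # Assumption: degree counts distinct neighbors, ignoring direction.
        return len(out_nb.get(n, set()) | in_nb.get(n, set()))

    feats = {}
    for x in set(out_nb) | set(in_nb):
        out_x, in_x = out_nb.get(x, set()), in_nb.get(x, set())
        neighbors = out_x | in_x
        avg_nb_degree = sum(degree(n) for n in neighbors) / len(neighbors)
        feats[x] = {
            "in_degree": len(in_x),
            "out_degree": len(out_x),
            "reciprocal": len(out_x & in_x),   # links that are returned
            "assortativity": degree(x) / avg_nb_degree,
        }
    return feats


def pagerank(edges, damping=0.85, iterations=50, trusted=None):
    """Power-iteration PageRank. If `trusted` is given, random jumps go only
    to those nodes (a TrustRank-style bias); they must be nodes of the graph."""
    nodes = sorted({n for edge in edges for n in edge})
    out_nb = {n: [] for n in nodes}
    for u, v in edges:
        out_nb[u].append(v)
    seeds = set(trusted) if trusted else set(nodes)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            targets = out_nb[u]
            if targets:
                for v in targets:
                    new[v] += damping * rank[u] / len(targets)
            else:
                # Dangling node: spread its score along the jump distribution.
                for v in nodes:
                    new[v] += damping * rank[u] * teleport[v]
        rank = new
    return rank


# Hypothetical toy host graph.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c"), ("e", "c")]
feats = degree_features(edges)
pr = pagerank(edges)
trust = pagerank(edges, trusted={"a"})
print(feats["a"])
print({n: round(s, 3) for n, s in pr.items()})
print({n: round(s, 3) for n, s in trust.items()})
```

In the trusted-seed run, host "e" has no in-links and is not a seed, so it receives no score at all, which is the behavior a trust-propagation feature relies on.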

Page 10: Link-based Features

Estimation of supporters:
  Given two nodes x and y, x is a d-supporter of y if the shortest path from x to y has length d
  $N_d(x)$ is the set of d-supporters of page x
  Bottleneck number: $b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}$
  Spam pages tend to have a smaller bottleneck number than non-spam pages

Figure: histogram of $b_4(x)$ for spam and non-spam
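A minimal sketch of the supporter counts, assuming the bottleneck number is the smallest growth ratio |N_j(x)| / |N_{j-1}(x)| for j up to d, with N_0(x) = {x}. It uses an exact breadth-first search over incoming links, which is only practical for small graphs. The slide title speaks of estimating supporters, so the original work presumably approximates these counts at Web scale. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def supporter_counts(edges, x, max_d):
    """Return [|N_0(x)|, ..., |N_max_d(x)|], where N_j(x) holds the nodes
    whose shortest path to x has length exactly j."""
    in_nb = defaultdict(set)
    for u, v in edges:
        in_nb[v].add(u)

    dist = {x: 0}
    counts = [0] * (max_d + 1)
    counts[0] = 1
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if dist[node] == max_d:
            continue  # do not explore beyond the requested distance
        for pred in in_nb.get(node, ()):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                counts[dist[pred]] += 1
                queue.append(pred)
    return counts


def bottleneck_number(edges, x, d):
    """Smallest growth ratio of the supporter sets up to distance d (d >= 1)."""
    counts = supporter_counts(edges, x, d)
    ratios = []
    for j in range(1, d + 1):
        if counts[j - 1] == 0:
            break  # no supporters left at this distance, ratio undefined
        ratios.append(counts[j] / counts[j - 1])
    return min(ratios)


# Hypothetical toy graph: four hosts link to "t", two more link to "s1".
edges = [("s1", "t"), ("s2", "t"), ("s3", "t"), ("s4", "t"),
         ("u1", "s1"), ("u2", "s1")]
counts = supporter_counts(edges, "t", 2)
print(counts)
print(bottleneck_number(edges, "t", 1), bottleneck_number(edges, "t", 2))
```

Here the supporter set grows from 1 to 4 at distance 1 but shrinks to 2 at distance 2, so the bottleneck number drops from 4.0 to 0.5 when d goes from 1 to 2.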

Page 11: Content-based Features

Most interesting features presented:
Finding the k most frequent words in the dataset, excluding stopwords:
  Corpus precision: the fraction of words in a page that appear in the set of popular terms
  Corpus recall: the fraction of popular terms that appear in the page
Considering the set of the q most popular terms in a query log:
  Query precision and query recall: analogous to corpus precision and recall
Used k and q = 100, 200, 500 and 1000
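The precision and recall features above reduce to two set ratios. The sketch below assumes whitespace tokenization, a tiny made-up stopword list, and that "words in a page" means distinct words; none of these details are given on the slide. The same `precision_recall` function serves for the query-based variants once `popular` is built from a query log instead of the corpus.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


def popular_terms(documents, k):
    """The k most frequent non-stopword terms across all documents."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    return {word for word, _ in counts.most_common(k)}


def precision_recall(page, popular):
    """Precision: share of the page's distinct words that are popular terms.
    Recall: share of the popular terms that occur in the page."""
    words = set(tokenize(page))
    if not words or not popular:
        return 0.0, 0.0
    hits = words & popular
    return len(hits) / len(words), len(hits) / len(popular)


corpus = ["cheap cheap cheap pills pills news", "the news of cheap pills"]
top2 = popular_terms(corpus, k=2)

popular = {"cheap", "pills", "buy", "news"}
precision, recall = precision_recall("buy cheap pills online now", popular)
print(sorted(top2), precision, recall)
```

For the sample page, three of its five distinct words are popular terms (precision 0.6) and three of the four popular terms occur in it (recall 0.75).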

Page 12: Content-based Features

The best features are the corpus precision and query precision
All features were judged based only on histograms

Figure: histogram of the query precision in non-spam vs. spam pages for q = 500.

Page 13: Classifiers

Cost-sensitive decision tree
  Cost of zero for correctly classifying an instance
  Misclassifying spam as normal is R times more costly than classifying a normal host as spam
  R can be used to tune the balance between the true positive rate and the false positive rate
Used “bagging” to help reduce the false positive rate
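The effect of R can be illustrated with a minimum-expected-cost decision rule. This sketch does not train a decision tree and is not the classifier used in the paper. It assumes some model already supplies an estimated spam probability, here taken as the fraction of bagged models voting "spam", and shows how the cost matrix shifts the decision. All names and numbers are hypothetical.

```python
def cost_matrix(R):
    """Costs keyed by (true label, predicted label); correct decisions cost 0."""
    return {
        ("normal", "normal"): 0.0,
        ("spam", "spam"): 0.0,
        ("spam", "normal"): float(R),   # missed spam is R times more costly
        ("normal", "spam"): 1.0,        # false alarm on a normal host
    }


def bagged_spam_probability(votes):
    """Fraction of the bagged models that voted 'spam'."""
    return sum(vote == "spam" for vote in votes) / len(votes)


def min_cost_label(p_spam, R):
    """Pick the label with the lowest expected cost; ties go to 'normal'."""
    costs = cost_matrix(R)
    expected = {}
    for predicted in ("normal", "spam"):
        expected[predicted] = (p_spam * costs[("spam", predicted)]
                               + (1.0 - p_spam) * costs[("normal", predicted)])
    return min(expected, key=expected.get)


# Hypothetical votes from ten bagged classifiers for one host.
votes = ["spam"] * 3 + ["normal"] * 7
p_spam = bagged_spam_probability(votes)
for R in (1, 5, 20):
    print(R, min_cost_label(p_spam, R))
```

Under this rule a host is flagged as spam once its spam probability exceeds 1 / (1 + R), so raising R increases both the true positive rate and the false positive rate.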

Page 14: Conclusion

Experimental evidence led to the hypotheses:
  Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes
  Spam nodes are mainly linked by spam nodes
These tendencies can be exploited to yield better spam detection
Using multiple features, both link-based and content-based, provided better detection
The error rate can be tuned by adjusting the cost matrix
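The two hypotheses can be checked on a labeled graph by measuring, for each host, what fraction of the hosts linking to it are spam. The sketch below is only an illustrative statistic on made-up data, not the smoothing method used in the paper; the function name, edge list and labels are hypothetical.

```python
from collections import defaultdict


def spam_in_neighbor_fraction(edges, labels):
    """For every host with at least one labeled in-neighbor, return the
    fraction of those in-neighbors that are labeled 'spam'."""
    in_nb = defaultdict(set)
    for source, target in edges:
        if source in labels:
            in_nb[target].add(source)
    return {
        host: sum(labels[src] == "spam" for src in sources) / len(sources)
        for host, sources in in_nb.items()
    }


# Hypothetical toy graph: s* hosts are spam, n* hosts are non-spam.
edges = [("s1", "s2"), ("s2", "s1"), ("s3", "s1"), ("n1", "s1"),
         ("n1", "n2"), ("n2", "n1"), ("n3", "n1"), ("s1", "n1")]
labels = {"s1": "spam", "s2": "spam", "s3": "spam",
          "n1": "nonspam", "n2": "nonspam", "n3": "nonspam"}

fractions = spam_in_neighbor_fraction(edges, labels)
for host in sorted(fractions):
    print(host, labels[host], round(fractions[host], 2))
```

In this toy graph the spam host "s1" gets two thirds of its in-links from spam, while the non-spam host "n1" gets one third, which is the kind of gap that a neighbor-based feature could exploit.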

Page 15: Critique

The article presented many features, both link-based and content-based, that can be used for spam detection, and also techniques to optimize detection based on graph topology (smoothing)
The results obtained showed which features and optimizations were effective
The dataset that was used is outdated, so there is no indication of how well the methods would work with newer or more sophisticated spamming techniques
There was no direct comparison between prior research results and the results obtained