SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin, Anupam Joshi Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006 http://ebiquity.umbc.edu

SVMs for the Blogosphere: Blog Identification and Splog Detection

Pranam Kolari, Tim Finin, Anupam Joshi

Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006

http://ebiquity.umbc.edu

March 29, 2006 P. Kolari, T. Finin, A. Joshi :-: Blog Identification and Splog Detection 2

Blogosphere - the brighter side

• Panel View
  – Market Research
  – PR Monitoring

• From Presentations
  – Opinion Extraction
  – Demography based analysis


Blogosphere - the darker side (1)

• From the Panel
  – Blogger is cracking down on splogs
  – SixApart and TypePad
  – Content Hijacking

• From Presentations
  – Removing SPAM an essential part of blog search engine
  – Cost of cleaning up splogs and its effect on results


Blogosphere - the darker side (2)


The Blogosphere

[Figure: diagram of the blogosphere. Labels: BLOG HOSTS (Blogger, msn-spaces, livejournal), PING SERVERS, Information, Audience, SPINGS, SPLOGS]


Spings – weblogs.com


Spings – weblogs.com (2)


Spings – weblogs.com (3)


Splogs – icerocket.com


Splogs – icerocket.com (2)


A Featured Splog?


Splogs – technorati.com (2)


Splogs – The Source!

“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”

“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”

“Holy Grail Of Advertising...”

“Easily Dominate Any Market, Any Search Engine, Any Keyword.”


Spam we target -- summarized

• Non-blogs
  – For increased search engine exposure
  – Through BLOG IDENTIFICATION

• Splogs
  – Adsense clicks for high-paying contexts (i)
  – Unjustifiably increase page-rank (importance) of affiliates – link farms (ii)
  – Combination of (i) and (ii)
  – Through SPLOG DETECTION


This work

Can machine learning models be effective to counter splogs on the blogosphere?

How do they perform when using features local to a blog?


Dataset for Training

• Technorati random sampling
• 500K blogs – May/June 2005
• Dropped those from top blogging hosts
  – Blog identification is an easy task using just URL patterns/domains
• Sampled the rest in different ways to create training datasets
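The slides note that blogs on the top hosts can be identified from URL patterns or domains alone, which is why they were dropped from the training sample. A minimal sketch of such a check follows. The suffix list is an assumption for illustration (the slides name the hosts, not the exact patterns), and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Assumed host suffixes, for illustration only. The slides mention blogspot,
# livejournal and msn as hosts but do not give the patterns actually used.
KNOWN_BLOG_HOST_SUFFIXES = ("blogspot.com", "livejournal.com", "spaces.msn.com")


def is_on_known_blog_host(url):
    """Return True if the URL's hostname is, or is a subdomain of, a known blog host."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in KNOWN_BLOG_HOST_SUFFIXES
    )


if __name__ == "__main__":
    for url in ("http://example.blogspot.com/2005/05/post.html",
                "http://ebiquity.umbc.edu/blogger"):
        print(url, is_on_known_blog_host(url))
```

Pages that fail this check are the ones for which a learned classifier is needed.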


Blog-HomePage/Non-Blog

• Sampled for blog home-pages
• Sampled for external links from these blogs to capture contextually similar pages – but from non-blogs
• All samples were manually verified
• Training set consists of 2100 positive and 2100 negative samples – multiple languages
• Let's call this (BH, NB)


Blog-SubPage/Non-Blog

• Sampled for local-links from BH
• Sampled for out-links similar to NB
• No manual verification
• 2600 positive and 2600 negative samples
• Let's call this (BNH, NB)


Authentic Blog/Splog

• Manually identified 700 splogs (English) in the BH sample

• Sampled for 700 blogs from the rest

• 700 positive and 700 negative samples

• Let's call this (AB, S)


Comparison Baselines

Feature            Precision  Recall  F1
meta               1          .75     .85
RSS/Atom           .96        .90     .93
Text – blog        .88        .79     .83
Text – comment     .83        .87     .85
Text – trackback   .99        .18     .30
Text – 2005        .56        .97     .71

• Blog Identification

• Splog Detection is a known problem!
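The F1 column in the baseline table is consistent with the standard harmonic mean of precision and recall, F1 = 2PR / (P + R). The slides do not state the formula, so the sketch below assumes that standard definition; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Rows taken from the baseline table above.
    for name, p, r in (("RSS/Atom", 0.96, 0.90), ("Text - trackback", 0.99, 0.18)):
        print(name, round(f1_score(p, r), 3))
```

The trackback baseline shows why F1 is reported alongside precision: very high precision with low recall still gives a low combined score.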


Evaluation - Background

• SVMs as implemented by libsvm
• Leave-One-Out cross-validation
• No stop word elimination
• No stemming
• Mutual Information for feature selection
  – Frequency count provided similar results
• Binary feature encoding
  – Other encodings give similar results
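The slides name mutual information for feature selection and binary feature encoding without giving formulas. The sketch below assumes the common definition of mutual information between a term's presence and the class label, and represents each page as a set of tokens. All function names and the toy data are hypothetical; this is not the authors' code, and the SVM step itself (libsvm in the slides) is not shown.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term):
    """Mutual information (in bits) between presence of `term` and the class label."""
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (x, _), c in joint.items() if x == present) / n
        p_y = sum(c for (_, y), c in joint.items() if y == label) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def select_features(docs, labels, k):
    """Return the k terms with the highest mutual information with the label."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: mutual_information(docs, labels, t),
                    reverse=True)
    return ranked[:k]


def binary_encode(doc, features):
    """Encode a document as 0/1 values: 1 if the feature occurs at all, else 0."""
    return [1 if f in doc else 0 for f in features]


if __name__ == "__main__":
    # Toy data: 1 = blog, 0 = non-blog.
    docs = [{"blog", "comment"}, {"blog", "archive"},
            {"shop", "cart"}, {"shop", "blog"}]
    labels = [1, 1, 0, 0]
    top = select_features(docs, labels, 1)
    print(top, binary_encode({"shop", "sale"}, top))
```

Binary encoding ignores how often a term occurs, which matches the slide's remark that other encodings gave similar results.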


New features for blogs

• Hyper-links on a page
  – Tokenized by “/” and “-”
• Anchor-text on a page
• Meta tags
  – From HTML HEAD element
• 4-grams
  – Contiguous blocks of 4 characters
• Combinations
  – words and urls
  – meta and link
  – urls, anchors, meta
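Two of the features above are simple enough to sketch directly: URL tokens obtained by splitting on "/" and "-", and character 4-grams. The sketch assumes those are the only URL delimiters and that 4-grams are taken over the raw text with a sliding window; the slides do not specify either detail, and the function names are illustrative.

```python
import re


def tokenize_url(url):
    """Split a URL on "/" and "-" and drop empty pieces."""
    return [tok for tok in re.split(r"[/-]", url) if tok]


def char_ngrams(text, n=4):
    """Return every contiguous block of n characters, using a sliding window."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(tokenize_url("http://example.com/my-first-post/"))
    print(char_ngrams("splogs"))
```

Combined feature sets such as "urls, anchors, meta" can then be built by taking the union of the token sets from each source.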


Blog Identification – (BH, NB)

Feature       Precision  Recall  F1     Feature Size
Words (w)     .976       .941    .958   19000
Urls (u)      .982       .962    .972   7000
Anchors (a)   .975       .926    .950   8000
Meta (m)      .981       .774    .865   3500
w+u           .985       .966    .975   26000
m+LINK        .973       .939    .956   4000
u+a           .985       .961    .973   15000
u+a+m         .986       .964    .975   18500
4grams        .982       .964    .973   25000


Blog Identification – (BNH, NB)

Feature       Precision  Recall  F1     Feature Size
Words (w)     .976       .930    .952   19000
Urls (u)      .966       .904    .934   7000
Anchors (a)   .962       .897    .923   8000
Meta (m)      .981       .919    .945   3500
w+u           .979       .932    .955   26000
m+LINK        .919       .942    .930   4000
u+a           .977       .919    .947   15000
u+a+m         .989       .940    .964   18500
4grams        .976       .930    .952   25000


Splog Detection - (AB, S)

Feature       Precision  Recall  F1     Feature Size
Words (w)     .887       .864    .875   19000
Urls (u)      .804       .827    .815   7000
Anchors (a)   .854       .807    .830   8000
Meta (m)      .741       .747    .744   3500
w+u           .893       .869    .881   26000
m+LINK        .736       .755    .745   4000
u+a           .858       .833    .845   15000
u+a+m         .866       .841    .853   18500
4grams        .867       .844    .855   25000


A Quick Analysis

• Ping Servers
  – Our analysis in December 2005
  – At least 75% of pings are spings

• Technorati Index
  – Data from week of March 20, 2006
  – Random queries to sample for 10K blogs
  – 3K blogspot, 2.5K livejournal, 1.8K msn
  – We predict that 1.5K blogspot, 250 from LJ are splogs
  – Overall 2.5K/10K are splogs ~ 25% of the fresh index!
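The per-host splog rates implied by these counts can be worked out directly. The sketch below uses only the numbers on the slide; the variable and function names are illustrative. Note that the two per-host predictions sum to 1,750, so the remaining 750 of the 2,500 predicted splogs come from hosts the slide does not break down.

```python
def splog_rate(predicted_splogs, sampled):
    """Fraction of sampled blogs that are predicted to be splogs."""
    return predicted_splogs / sampled


# Counts as stated on the slide (week of March 20, 2006).
sampled = {"blogspot": 3000, "livejournal": 2500}
predicted = {"blogspot": 1500, "livejournal": 250}

rates = {host: splog_rate(predicted[host], sampled[host]) for host in predicted}
overall = splog_rate(2500, 10000)

# Predicted splogs not attributed to a specific host on the slide.
unattributed = 2500 - sum(predicted.values())

if __name__ == "__main__":
    print(rates, overall, unattributed)
```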


Blogosphere Spam - Summary

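[Figure: the blogosphere diagram annotated with spam percentages. Labels: BLOG HOSTS (Blogger, msn-spaces, livejournal), PING SERVERS, Information, Audience. Percentages shown: 75%, 25%, 50%, 10%]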


And it's not getting easier …

But spammers still leave trails that can be exploited


Conclusion

• Blogosphere is prone to spam at various infrastructure points

• Local content based models can be quite effective by themselves

• 75% of pings and further downstream, 25% of fresh content is spam

• Blogger’s problem is now livejournal’s problem, and now everyone’s problem

• Combining local and global splog models is our current direction


Questions?

• Google “Splog Detection”
• memeta
  – http://memeta.umbc.edu
• eBiquity
  – http://ebiquity.umbc.edu
  – http://ebiquity.umbc.edu/blogger
• Check out Umbria’s report on splogs
  – http://www.umbrialistens.com/files/uploads/umbria_splog.pdf