SVMs for the Blogosphere: Blog Identification and Splog Detection
Pranam Kolari, Tim Finin, Anupam Joshi
Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006
http://ebiquity.umbc.edu
March 29, 2006 P. Kolari, T. Finin, A. Joshi :-: Blog Identification and Splog Detection 2
Blogosphere - the brighter side
• Panel View
  – Market Research
  – PR Monitoring
• From Presentations
  – Opinion Extraction
  – Demography-based analysis
Blogosphere - the darker side (1)
• From the Panel
  – Blogger is cracking down on splogs
  – SixApart and TypePad
  – Content Hijacking
• From Presentations
  – Removing spam is an essential part of a blog search engine
  – Cost of cleaning up splogs and its effect on results
Blogosphere - the darker side (2)
The Blogosphere
[Figure: diagram of the blogosphere. Labels: Information, Audience, Blog Hosts (Blogger, msn-spaces, livejournal), Ping Servers, Spings, Splogs]
Spings – weblogs.com
Spings – weblogs.com (2)
Spings – weblogs.com (3)
Splogs – icerocket.com
Splogs – icerocket.com (2)
A Featured Splog?
Splogs – technorati.com (2)
Splogs – The Source!
“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
“Holy Grail Of Advertising...”
“Easily Dominate Any Market, Any Search Engine, Any Keyword.”
Spam we target -- summarized
• Non-blogs
  – For increased search engine exposure
  – Through BLOG IDENTIFICATION
• Splogs
  – Adsense clicks for high-paying contexts (i)
  – Unjustifiably increase page-rank (importance) of affiliates, i.e. link farms (ii)
  – Combination of (i) and (ii)
  – Through SPLOG DETECTION
This work
Can machine learning models effectively counter splogs in the blogosphere?
How do they perform when using features local to a blog?
Dataset for Training
• Technorati random sampling
• 500K blogs – May/June 2005
• Dropped those from top blogging hosts
  – Blog Identification is an easy task using just URL patterns/domains
• Sampled the rest in different ways to create training datasets
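The host-based filtering step above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the domain suffixes and function names are assumptions, since the slides name the hosts (Blogger, msn-spaces, livejournal) but not the URL patterns actually used.

```python
from urllib.parse import urlparse

# Assumed host suffixes, chosen only for illustration; the slides do not
# list the exact URL patterns/domains that were used.
KNOWN_BLOG_HOST_SUFFIXES = ("blogspot.com", "livejournal.com", "spaces.msn.com")


def is_on_known_blog_host(url):
    """Return True if the URL's hostname is, or is a subdomain of, a known host."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in KNOWN_BLOG_HOST_SUFFIXES
    )


def drop_known_hosts(urls):
    """Keep only the URLs that still need a learned blog/non-blog decision."""
    return [url for url in urls if not is_on_known_blog_host(url)]


sample = [
    "http://someone.blogspot.com/",
    "http://example.org/blog/",
    "http://spaces.msn.com/members/x",
]
remaining = drop_known_hosts(sample)
```

Pages on the listed hosts are identified by domain alone, so only the remaining URLs need a classifier.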
Blog-HomePage/Non-Blog
• Sampled for blog home-pages
• Sampled for external links from these blogs to capture contextually similar pages – but from non-blogs
• All samples were manually verified
• Training set consists of 2100 positive and 2100 negative samples – multiple languages
• Let's call this (BH, NB)
Blog-SubPage/Non-Blog
• Sampled for local-links from BH
• Sampled for out-links similar to NB
• No manual verification
• 2600 positive and 2600 negative samples
• Let's call this (BNH, NB)
Authentic Blog/Splog
• Manually identified 700 splogs (English) in the BH sample
• Sampled for 700 blogs from the rest
• 700 positive and 700 negative samples
• Let's call this (AB, S)
Comparison Baselines
Feature            Precision  Recall  F1
meta               1          .75     .85
RSS/Atom           .96        .90     .93
Text - blog        .88        .79     .83
Text - comment     .83        .87     .85
Text - trackback   .99        .18     .30
Text - 2005        .56        .97     .71
• Blog Identification
• Splog Detection is a known problem!
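For reference, the F1 column in the baseline table is the harmonic mean of precision and recall. A minimal sketch, with a hypothetical function name, that recomputes two of the table rows:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the baseline table above.
rss_atom_f1 = round(f1_score(0.96, 0.90), 2)    # RSS/Atom row
trackback_f1 = round(f1_score(0.99, 0.18), 2)   # Text - trackback row
```

The trackback row shows why F1 is reported alongside precision: near-perfect precision with very low recall still gives a low F1.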
Evaluation - Background
• SVMs as implemented by libsvm
• Leave-One-Out cross-validation
• No stop word elimination
• No stemming
• Mutual Information for feature selection
  – Frequency count provided similar results
• Binary feature encoding
  – Other encodings give similar results
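Binary feature encoding and mutual-information feature selection, as listed above, can be sketched in plain Python. This is an illustrative sketch only, not the authors' libsvm pipeline: it assumes each page is already reduced to a set of tokens, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def binary_encode(tokens, vocabulary):
    """1 if the term occurs on the page at all, else 0 (counts are ignored)."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def mutual_information(docs, labels, term):
    """Mutual information (in bits) between term presence and the class label."""
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (has_term, label), count in joint.items():
        p_joint = count / n
        p_term = sum(v for (t, _), v in joint.items() if t == has_term) / n
        p_label = sum(v for (_, c), v in joint.items() if c == label) / n
        mi += p_joint * math.log2(p_joint / (p_term * p_label))
    return mi


def select_top_terms(docs, labels, k):
    """Rank every observed term by mutual information and keep the top k."""
    terms = sorted(set().union(*docs))
    ranked = sorted(terms, key=lambda t: mutual_information(docs, labels, t), reverse=True)
    return ranked[:k]


# Toy data: label 1 = blog, label 0 = non-blog.
docs = [{"blog", "post"}, {"blog", "comment"}, {"shop", "buy"}, {"shop", "post"}]
labels = [1, 1, 0, 0]
vector = binary_encode(["blog", "blog", "post"], ["blog", "buy", "post"])
```

In the toy data, "blog" separates the classes perfectly (1 bit), while "post" appears equally in both classes and carries no information.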
New features for blogs
• Hyper-links on a page
  – Tokenized by “/” and “-”
• Anchor-text on a page
• Meta tags
  – From HTML HEAD element
• 4-grams
  – Contiguous blocks of 4 characters
• Combinations
  – words and urls
  – meta and link
  – urls, anchors, meta
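Two of the feature extractors above are simple enough to sketch directly. This is a minimal illustration under stated assumptions: the function names are hypothetical, the URL splitter uses exactly the two delimiters named on the slide, and the slides do not say how whitespace or markup was handled for 4-grams.

```python
import re


def tokenize_url(url):
    """Split a hyperlink on "/" and "-", dropping empty fragments."""
    return [token for token in re.split(r"[/-]", url) if token]


def char_ngrams(text, n=4):
    """Contiguous blocks of n characters, sliding one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


url_tokens = tokenize_url("http://example.com/my-blog/archive")
four_grams = char_ngrams("weblog")
```

Here `url_tokens` is `["http:", "example.com", "my", "blog", "archive"]` and `four_grams` is `["webl", "eblo", "blog"]`; each distinct token or 4-gram then becomes one binary feature.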
Blog Identification – (BH, NB)
Feature       Precision  Recall  F1     Feature Size
Words (w)     .976       .941    .958   19000
Urls (u)      .982       .962    .972   7000
Anchors (a)   .975       .926    .950   8000
Meta (m)      .981       .774    .865   3500
w+u           .985       .966    .975   26000
m+LINK        .973       .939    .956   4000
u+a           .985       .961    .973   15000
u+a+m         .986       .964    .975   18500
4grams        .982       .964    .973   25000
Blog Identification – (BNH, NB)
Feature       Precision  Recall  F1     Feature Size
Words (w)     .976       .930    .952   19000
Urls (u)      .966       .904    .934   7000
Anchors (a)   .962       .897    .923   8000
Meta (m)      .981       .919    .945   3500
w+u           .979       .932    .955   26000
m+LINK        .919       .942    .930   4000
u+a           .977       .919    .947   15000
u+a+m         .989       .940    .964   18500
4grams        .976       .930    .952   25000
Splog Detection - (AB, S)
Feature       Precision  Recall  F1     Feature Size
Words (w)     .887       .864    .875   19000
Urls (u)      .804       .827    .815   7000
Anchors (a)   .854       .807    .830   8000
Meta (m)      .741       .747    .744   3500
w+u           .893       .869    .881   26000
m+LINK        .736       .755    .745   4000
u+a           .858       .833    .845   15000
u+a+m         .866       .841    .853   18500
4grams        .867       .844    .855   25000
A Quick Analysis
• Ping Servers
  – Our analysis in December 2005
  – At least 75% of pings are spings
• Technorati Index
  – Data from week of March 20, 2006
  – Random queries to sample for 10K blogs
  – 3K blogspot, 2.5K livejournal, 1.8K msn
  – We predict that 1.5K blogspot, 250 from LJ are splogs
  – Overall 2.5K/10K are splogs ~ 25% of the fresh index!
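The per-host splog rates implied by these counts follow from simple division. A small worked check, using only the numbers quoted in the bullets above (variable and function names are hypothetical):

```python
# Sample sizes and predicted splog counts as quoted in the bullets above.
sampled = {"blogspot": 3000, "livejournal": 2500}
predicted_splogs = {"blogspot": 1500, "livejournal": 250}


def splog_rate(splogs, total):
    """Fraction of sampled blogs predicted to be splogs."""
    return splogs / total


per_host = {host: splog_rate(predicted_splogs[host], sampled[host]) for host in sampled}
overall = splog_rate(2500, 10000)
```

This gives 50% for blogspot, 10% for livejournal, and 25% overall.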
Blogosphere Spam - Summary
[Figure: the blogosphere diagram annotated with spam rates. Labels: Information, Audience, Blog Hosts (Blogger, msn-spaces, livejournal), Ping Servers. Values shown: 75%, 25%, 50%, 10%]
And it's not getting easier …
But spammers still leave trails that can be exploited
Conclusion
• Blogosphere is prone to spam at various infrastructure points
• Local content-based models can be quite effective by themselves
• 75% of pings and, further downstream, 25% of fresh content are spam
• Blogger’s problem is now livejournal’s problem, and now everyone’s problem
• Combining local and global splog models is our current direction
Questions?
• Google “Splog Detection”
• memeta
  – http://memeta.umbc.edu
• eBiquity
  – http://ebiquity.umbc.edu
  – http://ebiquity.umbc.edu/blogger
• Check out Umbria’s report on splogs
  – http://www.umbrialistens.com/files/uploads/umbria_splog.pdf