Exploiting Network Structure for Proactive Spam Mitigation
Shobha Venkataraman*
Joint work with Subhabrata Sen§, Oliver Spatscheck§, Patrick Haffner§ & Dawn Song*
*Carnegie Mellon University, §AT&T Labs - Research
2
Daily Mail at Real Server
All incoming mail
Legitimate mail
Over 90% of the mail received any day is spam!
3
Spam Mitigation
Mail Servers
Content-based spam filtering
Scalability bottleneck!
4
Mail Servers
Content-based spam filtering
Spam Mitigation at Network-Level
IP address info:
  Computationally efficient
  Difficult to spoof: handshake required
Coarse-grained but effective technique first?
5
Spam Mitigation under Overload
Mail Servers
?
Goal: bias mail processed towards legitimate mail
Overload: Server gets much more mail than it can process
6
Contributions
Use history & structure of IP addresses as an effective coarse-grained mechanism to differentiate spam from legit mail
Extensive analysis of IP-based properties
  Individual IP analysis: infer significant legitimate senders
  Analysis of IP aggregates with network-aware clustering: infer significant (often transient) spammers
Application to server overload
  Solution techniques derived from analysis results
  Trace-driven simulations show up to a factor of 3 improvement in legit mail accepted
7
Outline
Introduction
IP Analysis
Cluster Analysis
Application under Server Overload
Conclusion
8
Data
Logs from Postfix mail server at enterprise location of a large corporation
  700+ user mailboxes
  Includes all mail sent to mail server
Legitimate mail: mail deemed legitimate by SpamAssassin
Spam: all the rest
Total: 28+ million messages over 6 months
  27 million spam, 1.4 million legitimate
9
IP Analysis
Spam characteristics at granularity of sending mail server’s IP address
Find historical communication patterns of IPs to distinguish bulk of legitimate mail & spam
Use IP spam-ratio to characterize IP behaviour
  Def: fraction of mail sent by an IP address that is spam
  e.g., only legit mail: spam-ratio = 0% (good); only spam: spam-ratio = 100% (bad)
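The definition above can be sketched in a few lines. This is a minimal illustration, assuming each log record is a hypothetical `(sender_ip, is_spam)` pair; the function name and the sample addresses are made up for the example and are not taken from the study's data.

```python
from collections import defaultdict

def ip_spam_ratios(records):
    """Return {ip: spam_ratio}, where spam_ratio = spam messages / all messages from that IP."""
    total = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in records:
        total[ip] += 1
        if is_spam:
            spam[ip] += 1
    return {ip: spam[ip] / total[ip] for ip in total}

# Hypothetical log entries: (sending IP, labelled as spam?)
log = [
    ("192.0.2.1", False), ("192.0.2.1", False),       # only legit mail
    ("198.51.100.7", True), ("198.51.100.7", True),   # only spam
    ("203.0.113.9", True), ("203.0.113.9", False),    # mixed sender
]
ratios = ip_spam_ratios(log)
for ip, ratio in sorted(ratios.items()):
    print(f"{ip}: spam-ratio = {ratio:.0%}")
```

An IP that sends only legit mail gets a ratio of 0, one that sends only spam gets 1, and mixed senders fall in between.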
10
IP Analysis
Questions:
Is IP spam-ratio a good discriminating feature?
  How are IPs/spam/legit mail distributed by spam-ratio?
  Effect on spam mitigation if spam-ratio is perfectly predicted
Can long-term IP history differentiate legit mail from spam?
11
IP Addresses
CDF across IP addresses
Bad IPs: ~90%, spam-ratio > 99%
Nearly all IPs have spam-ratios of 0% or 100%, i.e., they send only legit mail or only spam
Good IPs: ~10%, spam-ratio: 0%
12
Distribution of Spam Volume
IPs with spam-ratio 90%-100% contribute over 99% of spam!
Almost all spam comes from IPs with very high spam-ratio
Define x: IP spam-ratio
Fraction of spam sent by IPs with spam-ratio < x
13
Distribution of Legitimate Mail
IPs with spam-ratio over 95% contribute only a tiny fraction (5%) of legit mail
Very little legit mail comes from IPs with very high spam-ratios
Define x: IP spam-ratio
Fraction of legit mail sent by IPs with spam-ratio < x
14
Effect on Spam Mitigation
IP spam-ratio, if perfectly predicted every day, could identify most legitimate mail!
  e.g., accept mail from IPs with spam-ratio < 95%: accept very little spam, and most legit mail
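The threshold rule in the example can be sketched as follows. This is a hedged illustration, assuming the spam-ratios are already known (the "perfectly predicted" case); the function name, the sample data, and the choice to accept mail from IPs with no known ratio are assumptions of this sketch, not something the slides specify.

```python
def accept_by_spam_ratio(messages, spam_ratio, threshold=0.95):
    """Keep messages whose sender IP has a spam-ratio below `threshold`.

    messages: list of (sender_ip, is_spam) pairs.
    spam_ratio: {ip: ratio in [0, 1]}.
    IPs missing from spam_ratio are treated as ratio 0.0 (an assumption of this sketch).
    """
    return [(ip, is_spam) for ip, is_spam in messages
            if spam_ratio.get(ip, 0.0) < threshold]

# Hypothetical spam-ratios and one day's mail
known_ratios = {"192.0.2.1": 0.0, "198.51.100.7": 1.0, "203.0.113.9": 0.5}
todays_mail = [
    ("192.0.2.1", False),
    ("198.51.100.7", True),
    ("203.0.113.9", False),
    ("203.0.113.9", True),
]
accepted = accept_by_spam_ratio(todays_mail, known_ratios)

legit_total = sum(1 for _, s in todays_mail if not s)
spam_total = sum(1 for _, s in todays_mail if s)
legit_kept = sum(1 for _, s in accepted if not s)
spam_kept = sum(1 for _, s in accepted if s)
print(f"legit accepted: {legit_kept}/{legit_total}, spam accepted: {spam_kept}/{spam_total}")
```

In this toy input all legit mail is kept, while spam is admitted only through the mixed sender whose ratio is below the threshold.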
Spam
Legit mail
15
IP Analysis
Questions:
Is IP spam-ratio a good discriminating feature? YES, if perfectly predicted every day
Can long-term IP history differentiate legit mail from spam?
  Temporal Stability: Do most IP addresses fluctuate significantly in their spamming behaviour every day? No
  Persistence: How much legit mail/spam is contributed by long-lived IPs? Next…
16
IP Persistence: Legit mail & spam
Also: Less than 5% of total IPs are present for 20+ days
Legit mail sent by good IPs (low spam-ratio)
Spam sent by bad IPs (high spam-ratio)
20% comes from IPs present 20+ days
52% comes from IPs present 20+ days
IPs present on many days contribute bulk of legit mail & little spam
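One way to compute a persistence figure of this kind is sketched below. It assumes hypothetical `(day, sender_ip)` records, one per message, and counts an IP as "present" on a day if it sent at least one message that day; the function name and the toy data are illustrative only.

```python
from collections import defaultdict

def share_from_persistent_ips(records, min_days=20):
    """Fraction of messages sent by IPs seen on at least `min_days` distinct days."""
    days_seen = defaultdict(set)
    volume = defaultdict(int)
    for day, ip in records:
        days_seen[ip].add(day)
        volume[ip] += 1
    total = sum(volume.values())
    if total == 0:
        return 0.0
    persistent = sum(n for ip, n in volume.items() if len(days_seen[ip]) >= min_days)
    return persistent / total

# Hypothetical trace: one long-lived sender (1 message on each of 25 days)
# and one transient sender (75 messages, all on a single day).
trace = [(day, "192.0.2.1") for day in range(25)]
trace += [(0, "198.51.100.7")] * 75
share = share_from_persistent_ips(trace, min_days=20)
print(f"share of mail from IPs present 20+ days: {share:.0%}")
```

Running the same computation separately on legit mail from good IPs and on spam from bad IPs would give the two shares compared on this slide.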
17
IP Analysis Summary
Bulk of legit mail comes from a small number of IPs that appear frequently and are consistently good
  History of legit senders can be used to distinguish legit mail
Most spam comes from transient IPs
  Purely blacklisting approach has limitations (also consistent with findings in [RF06])
[RF06] Understanding the Network-level Behavior of Spammers, Ramachandran & Feamster, SIGCOMM ’06
18
Outline
Introduction
IP Analysis
Cluster Analysis
Application under Server Overload
Conclusion
19
Why IP Clusters?
Since spamming IPs are transient, can coarser IP aggregations help?
  Incorporate collective history of individual transient spammers
  Exploit network structure for guilt-by-association
Network-aware clustering [KW00]
  Set of IP prefixes collected from BGP routing tables
  Each IP prefix represents a cluster of IP addresses: an IP belongs to the cluster with the longest matching prefix in the set
  Topologically close, often under common admin control
[KW00] On Network-Aware Clustering of Web Clients, Krishnamurthy & Wang, SIGCOMM ’00
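The longest-matching-prefix rule can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, assuming a small hand-written prefix list in place of real BGP routing-table data; the function name and prefixes are hypothetical, and a linear scan is used for clarity where a production system would more likely use a prefix trie.

```python
import ipaddress

def cluster_of(ip, clusters):
    """Return the most specific prefix in `clusters` containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in clusters if addr in net]
    # Longest matching prefix = the match with the largest prefix length
    return max(matches, key=lambda net: net.prefixlen, default=None)

# Hypothetical prefix set standing in for prefixes collected from BGP tables
clusters = [ipaddress.ip_network(p) for p in (
    "192.0.2.0/24",
    "192.0.2.128/25",   # more specific prefix nested inside the /24
    "198.51.100.0/24",
)]

for ip in ("192.0.2.5", "192.0.2.200", "203.0.113.1"):
    print(ip, "->", cluster_of(ip, clusters))
```

An address covered by both the /24 and the /25 is assigned to the /25, and an address outside every prefix has no cluster.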
20
Cluster Analysis
Use cluster spam-ratio to capture cluster behaviour
  Fraction of mail sent by cluster that is spam (sent by cluster = sent by IPs belonging to cluster)
Questions:
  Granularity: Does cluster spam-ratio approximate IP spam-ratio well, for distinguishing spam? Yes
  Persistence: How much of spam & legit mail do long-lived clusters contribute? Next…
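A cluster spam-ratio aggregates the mail of every IP that falls in the cluster. The sketch below assumes hypothetical `(sender_ip, is_spam)` records and a hand-written prefix list; skipping IPs that match no prefix is a choice made for this example, not something the slides specify.

```python
import ipaddress
from collections import defaultdict

def cluster_spam_ratios(records, prefixes):
    """Return {prefix: spam_ratio}, pooling mail from all IPs inside each prefix."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    total = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in records:
        addr = ipaddress.ip_address(ip)
        matches = [n for n in nets if addr in n]
        if not matches:
            continue  # IP outside every known prefix: skipped (assumption of this sketch)
        cluster = str(max(matches, key=lambda n: n.prefixlen))
        total[cluster] += 1
        spam[cluster] += int(is_spam)
    return {c: spam[c] / total[c] for c in total}

# Hypothetical records: three different spamming IPs, each seen only once,
# all inside the same prefix, plus one legit sender elsewhere.
records = [
    ("198.51.100.7", True),
    ("198.51.100.8", True),
    ("198.51.100.9", True),
    ("192.0.2.1", False),
]
cluster_ratios = cluster_spam_ratios(records, ["198.51.100.0/24", "192.0.2.0/24"])
print(cluster_ratios)
```

Each spamming IP here has almost no individual history, but the cluster they share accumulates all three messages.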
21
Cluster Persistence: Spam
Over 95% of total spam comes from IPs in bad clusters (with high spam-ratios)
90% of total spam comes from bad clusters present for 60+ days
Most of spam comes from bad clusters present for many days
Bad IPs
Bad Clusters
22
Cluster Analysis: Results
Most spam originates from long-lived clusters with high spam-ratios.
Much less legit mail originates from clusters with low spam-ratio, but these clusters are still long-lived.
Network-aware clusters may provide a history for spamming IPs, even if individual IPs are transient.
23
Outline
Introduction
IP Analysis
Cluster Analysis
Application under Server Overload
Conclusion
24
Server Overload Problem
Mail Servers
?
Problem: Server receives much more mail than it can process
Goal: Maximize legitimate mail accepted for processing
25
Motivation to Overload
Server capacity: 100/min
No overload
  Incoming: legit 20/min, spam 80/min
  Accepted: legit 20, spam 80
Overloaded by factor of 2
  Incoming: legit 20/min, spam 180/min
  Accepted: legit 10, spam 90
Spammer has incentive to overload server with spam
Spammer has capacity to overload server greatly with spam
  With large botnets, spammers can increase spam sent
  Not sufficient to increase server capacity
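The numbers in this example follow if the overloaded server accepts mail indiscriminately, so that each class is accepted in proportion to its share of incoming traffic. That proportional-drop model is an assumption made here to reproduce the arithmetic; the function name is hypothetical.

```python
def accepted_under_uniform_drop(legit_rate, spam_rate, capacity):
    """Messages per minute accepted from each class when excess mail is dropped uniformly."""
    offered = legit_rate + spam_rate
    if offered <= capacity:
        return legit_rate, spam_rate       # no overload: everything is accepted
    keep = capacity / offered              # fraction of incoming mail that fits
    return legit_rate * keep, spam_rate * keep

no_overload = accepted_under_uniform_drop(20, 80, capacity=100)
overloaded = accepted_under_uniform_drop(20, 180, capacity=100)
print("no overload:", no_overload)
print("overloaded by factor of 2:", overloaded)
```

Doubling the total load halves the legit mail that gets through, which is why sending more spam pays off for the spammer.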
26
Approach
Use history and structure of IP addresses
  IP/cluster history assigns reputation to incoming mail
  Based on IP reputation & server load, decide which mail is accepted/refused for processing
  Details left to paper
Validate by simulation of server & policies on traces
  Details of simulation in paper
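The slides leave the actual policy to the paper, so the following is only a hypothetical illustration of the general idea of combining reputation with load, not the authors' algorithm. It assumes reputation is a historical spam-ratio per IP, that unknown senders get a neutral score of 0.5, and that the server simply takes the best-reputed messages up to its capacity when overloaded; every name here is made up for the sketch.

```python
def select_under_load(messages, reputation, capacity, default_score=0.5):
    """Pick which messages to accept for processing.

    messages: list of (message_id, sender_ip) pairs.
    reputation: {ip: historical spam-ratio}; lower means a better reputation.
    Unknown IPs get `default_score` (an assumption of this sketch).
    """
    if len(messages) <= capacity:
        return [msg_id for msg_id, _ in messages]   # no overload: accept everything
    # Overload: prefer senders with the lowest historical spam-ratio.
    # sorted() is stable, so arrival order breaks ties.
    ranked = sorted(messages, key=lambda m: reputation.get(m[1], default_score))
    return [msg_id for msg_id, _ in ranked[:capacity]]

# Hypothetical reputations and a burst of four messages against capacity 2
reputation = {"192.0.2.1": 0.0, "198.51.100.7": 1.0}
burst = [
    ("m1", "198.51.100.7"),
    ("m2", "192.0.2.1"),
    ("m3", "203.0.113.9"),   # unknown sender
    ("m4", "192.0.2.1"),
]
chosen = select_under_load(burst, reputation, capacity=2)
print(chosen)
```

When the server is not overloaded the reputation is ignored and all mail is accepted, so the sketch only changes behaviour under load.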
27
Simulation Results
Performance measure, computed for each hour:
  Goodput of policy: % of available legit mail accepted by policy (i.e., accepted for processing by mail server)
Summary: Server goodput (%), averaged over all hours

Overload-factor   Default policy   IP-history policy
No overload       93.7             96.7
2                 61.7             79.6
3                 39.5             68.6
4                 26.8             64.5
5                 20.3             63.0

Factor of 3 improvement!
Detailed analysis of performance in paper
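The goodput measure and the reported improvement can be checked with a short calculation. The goodput values below are the ones given above for overload factors 2 to 5; the function and variable names are illustrative only.

```python
def goodput(legit_accepted, legit_available):
    """Goodput in percent: share of available legit mail accepted for processing."""
    return 100.0 * legit_accepted / legit_available

# Average goodput (%) per overload factor, as reported above
default_policy = {2: 61.7, 3: 39.5, 4: 26.8, 5: 20.3}
ip_history_policy = {2: 79.6, 3: 68.6, 4: 64.5, 5: 63.0}

# Ratio of IP-history goodput to default goodput at each overload factor
improvement = {k: ip_history_policy[k] / default_policy[k] for k in default_policy}
for factor, ratio in improvement.items():
    print(f"overload factor {factor}: {ratio:.2f}x more legit mail accepted")
```

The ratio grows with the overload factor and passes 3 at an overload factor of 5, which is where the "factor of 3" figure comes from.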
28
Conclusion
Use history & structure of IP addresses to prioritize legitimate mail over spam efficiently.
Measurement-based analysis of IP properties
  Individual IP-address analysis helps identify legitimate senders
  Analysis with network-aware clusters helps identify transient spammers
Application to server overload problem
  Trace-driven simulation demonstrates that analysis can help prioritize most of legitimate mail over spam