Exploiting Network Structure for Proactive Spam Mitigation Shobha Venkataraman * Joint work with Subhabrata Sen §, Oliver Spatscheck §, Patrick Haffner

Exploiting Network Structure for Proactive Spam Mitigation

Shobha Venkataraman*

Joint work with Subhabrata Sen§, Oliver Spatscheck§, Patrick Haffner§ & Dawn Song*

*Carnegie Mellon University, §AT&T Labs - Research

2

Daily Mail at Real Server

All incoming mail

Legitimate mail

Over 90% of the mail received any day is spam!

3

Spam Mitigation

Mail Servers

Content-based spam-filtering

Scalability bottleneck!

4

Mail Servers

Content-based spam-filtering

Spam Mitigation at Network-Level

IP address info
- Computationally efficient
- Difficult to spoof: handshake required

Coarse-grained but effective technique first?

5

Spam Mitigation under Overload

Mail Servers

?

Goal: bias mail processed towards legitimate mail

Overload: Server gets much more mail than it can process

6

Contributions

Use history & structure of IP addresses as an effective coarse-grained mechanism to differentiate spam from legit mail

Extensive analysis of IP-based properties

Individual IP Analysis: infer significant legitimate senders

Analysis of IP Aggregates with network-aware clustering: infer significant (often transient) spammers

Application to server overload

Solution techniques derived from analysis results

Trace-driven simulations show up to a factor of 3 improvement in legit mail accepted

7

Outline

Introduction

IP Analysis

Cluster Analysis

Application under Server Overload

Conclusion

8

Data

Logs from Postfix mail server at enterprise location of large corporation

700+ user mailboxes

Includes all mail sent to mail server

Legitimate mail: mail deemed legitimate by SpamAssassin

Spam: all the rest

Total: 28+ million messages over 6 months

27 million spam, 1.4 million legitimate

9

IP Analysis

Spam characteristics at granularity of sending mail server’s IP address

Find historical communication patterns of IPs to distinguish bulk of legitimate mail & spam

Use IP spam-ratio to characterize IP behaviour

Def: Fraction of mail sent by IP address that is spam

e.g., only legit mail: spam-ratio = 0% (good); only spam: spam-ratio = 100% (bad)
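The spam-ratio definition above can be sketched in a few lines of Python. The record format, a sequence of (ip, is_spam) pairs, and the example addresses are assumptions made for illustration; they are not the log format used in the study.

```python
from collections import defaultdict

def ip_spam_ratios(records):
    """Return {ip: fraction of that IP's mail that is spam}.

    `records` is an iterable of (ip, is_spam) pairs. This input
    format is hypothetical, chosen only for illustration.
    """
    total = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in records:
        total[ip] += 1
        if is_spam:
            spam[ip] += 1
    return {ip: spam[ip] / total[ip] for ip in total}

# Made-up log entries using documentation-range addresses.
log = [
    ("192.0.2.1", False), ("192.0.2.1", False),      # only legit mail
    ("198.51.100.7", True), ("198.51.100.7", True),  # only spam
    ("203.0.113.5", True), ("203.0.113.5", False),   # mixed sender
]
ratios = ip_spam_ratios(log)
```

Here the first address gets a spam-ratio of 0% (good), the second 100% (bad), and the mixed sender 50%.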

10

IP Analysis

Questions:

Is IP spam-ratio a good discriminating feature?

How are IPs/spam/legit mail distributed by spam-ratio?

Effect on spam mitigation if spam-ratio is perfectly predicted

Can long-term IP history differentiate legit mail from spam?

11

IP Addresses

CDF across IP addresses

Bad IPs: ~90%, spam-ratio > 99%

Nearly all IPs have spam-ratios of 0% or 100%, i.e., they send only legit mail or only spam

Good IPs: ~10%, spam-ratio 0%

12

Distribution of Spam Volume

IPs with spam-ratio 90%-100% contribute over 99% of spam!

Almost all spam comes from IPs with very high spam-ratio

Plot: fraction of spam sent by IPs with spam-ratio < x, where x is the IP spam-ratio

13

Distribution of Legitimate Mail

IPs with spam-ratio over 95% contribute only a tiny fraction (5%) of legit mail

Very little legit mail comes from IPs with very high spam-ratios

Plot: fraction of legit mail sent by IPs with spam-ratio < x, where x is the IP spam-ratio

14

Effect on Spam Mitigation

IP spam-ratio, if perfectly predicted every day, could identify most legitimate mail!

e.g., accept mail from IPs with spam-ratio < 95%: accept very little spam, and most legit mail

Plot legend: Spam, Legit mail
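The thresholding idea can be illustrated with a small Python sketch that assumes the spam-ratio of every IP is known exactly (the "perfectly predicted" case). The per-IP counts and host names below are made up for illustration and are not taken from the trace.

```python
def accept_below_threshold(per_ip, threshold=0.95):
    """Accept all mail from IPs whose spam-ratio is below `threshold`.

    `per_ip` maps ip -> (spam_count, legit_count); every IP is assumed
    to have sent at least one message. Returns a pair:
    (fraction of all legit mail accepted, fraction of all spam accepted).
    """
    spam_total = sum(s for s, _ in per_ip.values())
    legit_total = sum(l for _, l in per_ip.values())
    spam_ok = legit_ok = 0
    for s, l in per_ip.values():
        if s / (s + l) < threshold:
            spam_ok += s
            legit_ok += l
    legit_frac = legit_ok / legit_total if legit_total else 0.0
    spam_frac = spam_ok / spam_total if spam_total else 0.0
    return legit_frac, spam_frac

# Hypothetical counts: one clean mail host, two bots, one mixed sender.
counts = {
    "mailhost": (0, 50),
    "bot1": (100, 0),
    "bot2": (99, 1),
    "mixed": (1, 9),
}
legit_frac, spam_frac = accept_below_threshold(counts)
```

With these made-up counts, the 95% threshold admits 59 of 60 legitimate messages and 1 of 200 spam messages, which mirrors the qualitative claim on the slide.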

15

IP Analysis

Questions:

Is IP spam-ratio a good discriminating feature? YES, if perfectly predicted every day

Can long-term IP history differentiate legit mail from spam?

Temporal Stability: Do most IP addresses fluctuate significantly in their spamming behaviour every day? No

Persistence: How much legit mail/spam is contributed by long-lived IPs? Next…

16

IP Persistence: Legit mail & spam

Also: Less than 5% of total IPs are present for 20+ days

Legit mail sent by good IPs (low spam-ratio)

Spam sent by bad IPs (high spam-ratio)

20% comes from IPs present 20+ days

52% comes from IPs present 20+ days

IPs present on many days contribute bulk of legit mail & little spam
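A persistence measurement of this kind can be sketched as follows. The input format, one (day, ip) pair per message, is an assumption for illustration; the 20-day cutoff is the one quoted on the slide.

```python
from collections import defaultdict

def share_from_persistent_ips(records, min_days=20):
    """Fraction of messages sent by IPs seen on at least `min_days` days.

    `records` is an iterable of (day, ip) pairs, one per message.
    This format is hypothetical.
    """
    days_seen = defaultdict(set)
    msgs = defaultdict(int)
    for day, ip in records:
        days_seen[ip].add(day)
        msgs[ip] += 1
    total = sum(msgs.values())
    if total == 0:
        return 0.0
    persistent = sum(
        n for ip, n in msgs.items() if len(days_seen[ip]) >= min_days
    )
    return persistent / total

# Made-up example: a steady sender active on 25 days, and a transient
# sender that delivers 25 messages on a single day.
steady = [(day, "192.0.2.1") for day in range(25)]
burst = [(0, "198.51.100.7")] * 25
share = share_from_persistent_ips(steady + burst)
```

Running the same function separately on legit mail and on spam would give the two per-class shares that the slide compares.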

17

IP Analysis Summary

Bulk of legit mail comes from a small number of IPs that appear frequently and are consistently good

Use history of legit senders to distinguish legit mail

Most spam comes from transient IPs

Purely blacklisting approach has limitations (also consistent with findings in [RF06])

[RF06] Understanding the Network-level Behavior of Spammers, Ramachandran & Feamster, SIGCOMM '06

18

Outline


19

Why IP Clusters?

Since spamming IPs are transient, can coarser IP aggregations help?

Incorporate collective history of individual transient spammers

Exploit network structure for guilt-by-association

Network-aware clustering [KW00]

Set of IP prefixes collected from BGP routing tables

Each IP prefix represents a cluster of IP addresses: an IP belongs to the cluster with the longest matching prefix in the set

Topologically close, often under common admin control

[KW00] On Network-Aware Clustering of Web Clients, Krishnamurthy & Wang, SIGCOMM '00
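The longest-matching-prefix rule can be sketched with Python's standard `ipaddress` module. The prefix list below is made up; in the setting described above it would come from BGP routing tables. A linear scan is used for brevity, whereas a real deployment would more likely use a prefix trie.

```python
import ipaddress

def build_clusters(prefixes):
    """Parse CIDR strings into network objects (one per cluster)."""
    return [ipaddress.ip_network(p) for p in prefixes]

def cluster_of(ip, clusters):
    """Return the longest prefix in `clusters` that contains `ip`,
    or None if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in clusters if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda net: net.prefixlen)

# Hypothetical prefix set (not real routing data).
clusters = build_clusters([
    "192.0.0.0/16",
    "192.0.2.0/24",
    "198.51.100.0/24",
])
specific = cluster_of("192.0.2.9", clusters)    # matches both /16 and /24
broader = cluster_of("192.0.77.1", clusters)    # matches only the /16
unmatched = cluster_of("203.0.113.1", clusters)  # no covering prefix
```

An address covered by both the /16 and the /24 is assigned to the /24, since that is the longer match.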

20

Cluster Analysis

Use cluster spam-ratio to capture cluster behaviour

Fraction of mail sent by cluster that is spam (sent by cluster = sent by IPs belonging to cluster)
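As a rough sketch, the cluster spam-ratio can be computed by pooling the mail of all IPs that map to the same cluster. The `ip_to_cluster` dictionary is a hypothetical stand-in for the result of longest-prefix matching, and the records are made up.

```python
from collections import defaultdict

def cluster_spam_ratios(records, ip_to_cluster):
    """Return {cluster: fraction of the cluster's mail that is spam}.

    `records` is an iterable of (ip, is_spam) pairs; `ip_to_cluster`
    maps each IP to its cluster label. Both formats are assumptions.
    """
    total = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in records:
        cluster = ip_to_cluster.get(ip)
        if cluster is None:
            continue  # IP not covered by any known prefix
        total[cluster] += 1
        spam[cluster] += int(is_spam)
    return {c: spam[c] / total[c] for c in total}

# Three transient spamming IPs in one prefix, one legit host in another.
mapping = {
    "198.51.100.1": "198.51.100.0/24",
    "198.51.100.2": "198.51.100.0/24",
    "198.51.100.3": "198.51.100.0/24",
    "192.0.2.1": "192.0.2.0/24",
}
mail = [
    ("198.51.100.1", True),
    ("198.51.100.2", True),
    ("198.51.100.3", True),
    ("192.0.2.1", False),
    ("192.0.2.1", False),
]
cluster_ratios = cluster_spam_ratios(mail, mapping)
```

Each spamming IP appears only once in this example, yet the cluster they share accumulates a clear history, which is the guilt-by-association effect the slides describe.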

Questions:

Granularity: Does cluster spam-ratio approximate IP spam-ratio well, for distinguishing spam? Yes

Persistence: How much of spam & legit mail do long-lived clusters contribute? Next…

21

Cluster Persistence: Spam

Over 95% of total spam comes from IPs in bad clusters (with high spam-ratios)

90% of total spam comes from bad clusters present for 60+ days

Most of spam comes from bad clusters present for many days

Plot legend: Bad IPs, Bad Clusters

22

Cluster Analysis: Results

Most spam originates from long-lived clusters with high spam-ratios.

Much less legit mail originates from clusters with low spam-ratio, but these clusters are still long-lived.

Network-aware clusters may provide a history for spamming IPs, even if individual IPs are transient.

23

Outline


24

Server Overload Problem

Mail Servers

?

Problem: Server receives much more mail than it can process

Goal: Maximize legitimate mail accepted for processing

25

Motivation to Overload

Server Capacity: 100/min

No Overload: incoming Legit: 20/min, Spam: 80/min; processed by server Legit: 20, Spam: 80

Overloaded by factor of 2: incoming Legit: 20/min, Spam: 180/min; processed by server Legit: 10, Spam: 90

Spammer has incentive to overload server with spam

Spammer has capacity to overload server greatly with spam

With large botnets, spammers can increase spam sent

Not sufficient to increase server capacity
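The numbers in this example are consistent with a server that accepts mail without regard to its source, so that legit mail and spam are dropped in the same proportion once capacity is exceeded. The sketch below works through that arithmetic; the uniform-drop model is an assumption used to reproduce the slide's figures, not a description of any particular mail server.

```python
def accepted_under_uniform_drop(legit_rate, spam_rate, capacity):
    """Expected (legit, spam) processed per minute when the server
    accepts mail uniformly at random up to `capacity` per minute."""
    offered = legit_rate + spam_rate
    if offered <= capacity:
        return float(legit_rate), float(spam_rate)
    keep = capacity / offered  # same acceptance probability for all mail
    return legit_rate * keep, spam_rate * keep

no_overload = accepted_under_uniform_drop(20, 80, capacity=100)
overload_x2 = accepted_under_uniform_drop(20, 180, capacity=100)
```

With 200 messages per minute offered to a server that can process 100, every message has a 50% chance of being accepted, so legit mail processed falls from 20 to 10 per minute even though the legit sending rate is unchanged.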

26

Approach

Use history and structure of IP addresses

IP/cluster history assigns reputation to incoming mail

Based on IP reputation & server load, decide which mail is accepted/refused for processing

Details left to paper

Validate by simulation of server & policies on traces

Details of simulation in paper
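The slides leave the actual policy to the paper, so the following is only a hypothetical sketch of how reputation and load might combine: when more mail arrives than the server can process, the messages from the best-reputed senders are accepted first. The function name, the score scale, and the default score for unknown senders are all illustrative assumptions.

```python
def admit(batch, reputation, capacity, default_score=0.5):
    """Choose which messages to accept for processing.

    `batch` is a list of (msg_id, ip) pairs awaiting processing.
    `reputation` maps ip -> score in [0, 1]; higher means more likely
    to send legit mail. `default_score` is used for unknown IPs.
    Returns the list of accepted message ids.
    """
    if len(batch) <= capacity:
        return [msg_id for msg_id, _ in batch]  # no overload: accept all
    ranked = sorted(
        batch,
        key=lambda item: reputation.get(item[1], default_score),
        reverse=True,
    )
    return [msg_id for msg_id, _ in ranked[:capacity]]

# Made-up reputations and a batch that exceeds capacity.
scores = {"192.0.2.1": 0.99, "198.51.100.7": 0.01}
waiting = [
    ("m1", "198.51.100.7"),
    ("m2", "192.0.2.1"),
    ("m3", "203.0.113.5"),  # unknown sender, gets the default score
    ("m4", "198.51.100.7"),
]
accepted = admit(waiting, scores, capacity=2)
```

In this example the known-good sender and the unknown sender are accepted, and both messages from the poorly reputed address are refused.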

27

Simulation Results

Performance measure, computed for each hour:

Goodput of policy: % of available legit mail accepted by policy (i.e., accepted for processing by mail server)

Summary: Server Goodput, averaged over all hours

Overload-factor    Default policy (%)    IP-history policy (%)
No overload        93.7                  96.7
2                  61.7                  79.6
3                  39.5                  68.6
4                  26.8                  64.5
5                  20.3                  63.0

Factor of 3 improvement!

Detailed analysis of performance in paper
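The "factor of 3" figure can be checked directly from the goodput table by dividing the IP-history goodput by the default-policy goodput at each overload factor. The snippet below only restates the numbers from the table.

```python
# Goodput (%) from the summary table: (default policy, IP-history policy).
goodput = {
    "No overload": (93.7, 96.7),
    "2": (61.7, 79.6),
    "3": (39.5, 68.6),
    "4": (26.8, 64.5),
    "5": (20.3, 63.0),
}

# Improvement factor of the IP-history policy over the default policy.
improvement = {
    load: round(history / default, 2)
    for load, (default, history) in goodput.items()
}

for load, factor in improvement.items():
    print(f"overload {load}: {factor}x")
```

The gain is small without overload and grows with the overload factor, exceeding 3x at an overload factor of 5.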

28

Conclusion

Use history & structure of IP addresses to prioritize legitimate mail over spam efficiently.

Measurement-based analysis of IP properties

Individual IP-address analysis helps identify legitimate senders

Analysis with network-aware clusters helps identify transient spammers

Application to server overload problem

Trace-driven simulation demonstrates that analysis can help prioritize most of legitimate mail over spam

29

Thank you! Questions?

(Contact: [email protected])
