Talk (II): E-mail Spam: The Problem, Solutions and Potential Industrial Standards Jenq-Haur Wang Academia Sinica Nov. 16-17, 2006


Outline

• Introduction
• Existing Solutions
  – Regulatory Solutions
  – Technical Solutions
• Potential Industrial Standards

Introduction

• What is spam?
  – E-mail, netnews, instant messaging (“spim”), “Google-spam”, guestbook spam, Weblog comment spam, VoIP (“spit”), …
  – Unsolicited messages flooded to uninterested receivers, usually sent in bulk
• What is e-mail spam?
  – Junk e-mail
  – Unsolicited bulk e-mail (UBE)
  – Unsolicited commercial e-mail (UCE)

Spam Statistics

• Jan. 2001
  – 8% of all e-mail traffic in the US is spam [Brightmail Inc.]
• Jan. 2003
  – 42% [Brightmail Inc.]
• Jul. 2004
  – 65% [Symantec (Brightmail) Inc.]
• In 2002
  – 3 pieces/day/user (average) [Ferris Research]
• By 2005
  – 10 pieces/day/user (average) [Ferris Research]

Spam Statistics (cont.)

[Two chart slides; the figures are not preserved in the text]

Costs of Spam

• Enterprises
  – > US$10 billion for US organizations in 2003 [Ferris Research]
  – US$245,000/year for a company with 14,000 employees [IDC]
• End users
  – 5 spam/day, 30 seconds each -> 15 hours/year [Ferris Research]
  – Loss of productivity
• Burden on ISPs
  – System resource consumption on servers
  – Waste of network bandwidth
  – User complaints
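
The end-user figure above can be checked with a few lines of arithmetic. This is a minimal sketch; the only assumption not on the slide is that spam arrives on all 365 days of the year.

```python
# Reproduces the per-user estimate quoted from Ferris Research.
SPAM_PER_DAY = 5        # messages per user per day (from the slide)
SECONDS_PER_SPAM = 30   # handling time per message (from the slide)
DAYS_PER_YEAR = 365     # assumption: spam arrives every calendar day


def hours_lost_per_year(per_day=SPAM_PER_DAY,
                        seconds_each=SECONDS_PER_SPAM,
                        days=DAYS_PER_YEAR):
    """Return the hours per year one user spends handling spam."""
    return per_day * seconds_each * days / 3600


print(round(hours_lost_per_year(), 1))  # prints 15.2
```

5 x 30 x 365 = 54,750 seconds, which is about 15.2 hours and matches the rounded "15 hours/year" on the slide.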

Latest Spam Statistics

• Email considered spam: 40%
• Daily spam emails sent: 12.4 billion
• Daily spam received per person: 6
• Annual spam received per person: 2,200
• Spam cost to all non-corp. Internet users: $255 million
• Spam cost to all US corporations in 2002: $8.9 billion
• States with anti-spam laws: 26

[Source: Spam Statistics 2006, by Don Evett, TopTenReviews, Inc.]

Latest Spam Statistics (cont.)

• Email address changes due to spam: 16%
• Estimated spam increase by 2007: 63%
• Annual spam in 1,000 employee company: 2.1 million
• Users who reply to spam email: 28%
• Users who purchase from spam email: 8%
• Corporate email that is considered spam: 15-20%
• Wasted corporate time per spam email: 4-5 sec

Email Statistics

• Daily emails sent: 31 billion
• Daily emails sent per email address: 56
• Daily emails sent per person: 174
• Daily emails sent per corporate user: 34
• Daily emails received per person: 10
• Email addresses per person: 3.1 average
• Cost to all Internet users: $255 million

Spam Categories

• Products: 25%
• Financial: 20% ↑
• Adult: 19% ↑
• Scams: 9%
• Health: 7%
• Internet: 7%
• Leisure: 6%
• Spiritual: 4%
• Other: 3%

(Source: http://www.brightmail.com/spamstats.html, Jun. 2004 & http://spam-filter-review.toptenreviews.com/spam-statistics.html, 2006)

Origins of Spam

• Where does the spam come from? [Sophos, “Dirty Dozen” spam producing countries, Apr. 2005]
  – 35.7% (43%): from the US
  – 25.0% ↑ (16%): from South Korea
  – 9.7% (11%): from China
  – …

Major Factors

• Simple SMTP mail relaying mechanism
  – Cannot verify the identity of the sender
    • Forged IP address/sender e-mail address
  – Open mail relay/proxy
• Low cost for sending bulk e-mails
  – Low cost for e-mail address harvesting
    • Web, mailing list, …
  – Bulk mailer programs
  – Low cost for obtaining “free” e-mail addresses

Lifecycle of E-mails

[Diagram, reconstructed from its labels]

• Sender domain: sender -> MUAs -> (SMTP) -> MTAs (queues)
• MTAs looks up the MX records of the receiver domain in DNS
• Receiver domain: MTAs -> (SMTP) -> MTAr -> mailbox -> (POP3/IMAP4) -> MUAr -> recipient
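
The MX lookup step in the diagram decides which receiving MTA the sending MTA contacts first. The sketch below shows only the ordering rule (lowest preference value first, equal preferences tried in random order); it assumes the records have already been fetched from DNS, and the host names are hypothetical.

```python
import random


def order_mx(records, rng=random):
    """Order (preference, host) MX records for delivery attempts.

    Lower preference values are tried first; hosts that share a
    preference are shuffled to spread the load between them.
    """
    by_pref = {}
    for pref, host in records:
        by_pref.setdefault(pref, []).append(host)
    ordered = []
    for pref in sorted(by_pref):
        hosts = by_pref[pref][:]
        rng.shuffle(hosts)
        ordered.extend(hosts)
    return ordered


# Hypothetical MX answer for the receiver domain.
mx_records = [
    (20, "backup.example.net"),
    (10, "mail1.example.net"),
    (10, "mail2.example.net"),
]
targets = order_mx(mx_records)
print(targets)  # the two preference-10 hosts first, the backup last
```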

Existing Solutions

• Regulatory solutions
  – Anti-spam laws
  – Limitations
• Technical solutions
  – Filtering
  – Postage
  – Disposable e-mail address

Regulatory Solutions

• Anti-spam laws
  – http://www.spamlaws.com/
  – Ex: US federal law CAN-SPAM Act (S.877), in effect since Jan. 1, 2004
• Limitations
  – Dependence on evidence drawn from technical information
  – Slow and costly process

Current Status of Anti-Spam Laws

• In the US:
  – Enacted federal laws: CAN-SPAM Act of 2003 (Pub. L. 108-187, S. 877)
  – Enacted state laws: Arkansas, California, Colorado, Connecticut, Delaware, Idaho, Illinois, Indiana, Iowa, Kansas, Louisiana, Maryland, Minnesota, Missouri, Nevada, New Mexico, North Carolina, Ohio, Oklahoma, Pennsylvania, Rhode Island, South Dakota, Tennessee, Utah, Virginia, Washington, West Virginia, Wisconsin, Wyoming, …
• In Europe:
  – European Union, Austria, Belgium, Czech Republic, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, United Kingdom, …
• In other countries:
  – Argentina, Australia, Brazil, Canada, India, Japan, Panama, Peru, Russia, South Korea, Yugoslavia, …
  – Taiwan: “Anti-Hacker” provisions in the Criminal Code (Jun. 3, 2003)

Technical Solutions

• Filtering: to separate bad from good
  – Heuristic-based
  – Classification-based: machine learning
  – Others: peer-to-peer, honeypot
• Postage: to increase the cost of sending e-mails
• Hiding email address
  – Encoding (text to image, JavaScript, …)
  – Disposable email address: separate e-mail address for different correspondence
• Enhancing SMTP mechanism
  – Email path verification
  – Authenticated SMTP

Filtering Technique: Heuristic-based

• Black/White/Grey lists
  – Blacklist: lists of IP addresses that send spam
    • RBLs (Real-time Blackhole Lists), open mail relays, open proxies, …
  – Whitelist: lists of trusted senders
    • Challenge-response mechanism
  – Greylisting: temporary delay of e-mail from unknown senders
• Problems
  – Easy to make mistakes
    • Forged IP address/sender e-mail address
  – Lists need to be updated frequently
    • Changing spammer e-mail addresses
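
Greylisting from the list above can be sketched as follows. This is an illustrative sketch, not a production policy: it assumes the common variant that keys on the (client IP, envelope sender, envelope recipient) triplet, and the 300-second delay and the `Greylist` class name are assumptions made for the example.

```python
class Greylist:
    """Temporarily reject the first delivery attempts of each unknown triplet."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.first_seen = {}        # triplet -> time of its first attempt

    def check(self, client_ip, mail_from, rcpt_to, now):
        """Return 'defer' (temporary SMTP failure) or 'accept' at time `now`."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            return "defer"
        return "accept"


greylist = Greylist(min_delay=300)
attempt = ("192.0.2.7", "sender@example.org", "user@example.com")
first_try = greylist.check(*attempt, now=0)     # unknown triplet
early_retry = greylist.check(*attempt, now=60)  # retried too soon
late_retry = greylist.check(*attempt, now=600)  # retried after the delay
print(first_try, early_retry, late_retry)       # prints: defer defer accept
```

A standards-following MTA queues the message and retries later, so legitimate mail is only delayed; bulk mailers that never retry are filtered out.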

Filtering Technique: Heuristic-based (cont.)

• Keyword-matching rules (ex. MS Outlook)
  – Look for similar messages based on their subject or content
• Problems
  – Exact rules are difficult to formulate and maintain
    • Spam is always changing
  – Chinese menu (madlibs) attack
    • “Make thousands of dollars working at home !!!”
    • “Earn lots of money in the comfort of your own house.”
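
The weakness of exact rules can be shown with the two example messages above. The rule below is a hypothetical keyword rule written for the first message; it is only a sketch of the idea and does not reproduce the rule syntax of any real mail client.

```python
import re

# Hypothetical keyword rule tuned to the first spam example.
RULE = re.compile(
    r"make\s+(thousands\s+of\s+)?dollars|working\s+at\s+home",
    re.IGNORECASE,
)


def matches_rule(text):
    """Return True when the keyword rule fires on the text."""
    return RULE.search(text) is not None


original = "Make thousands of dollars working at home !!!"
variant = "Earn lots of money in the comfort of your own house."

print(matches_rule(original))  # prints True
print(matches_rule(variant))   # prints False
```

The second message carries the same offer with every phrase swapped for a synonym, so the rule misses it; each new phrasing needs another rule.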

Filtering Technique: Classification-based

• Machine learning
  – Text classification methods: TF-IDF, Naïve Bayes, SVM (Support Vector Machine), …
  – Learn spam vs. good
  – Adapt to changing spam
• Problems
  – Need lots of training data
    • Diverse contents in e-mail spam
  – Spammers are learning too
    • Images, synonyms, misspellings, …
  – “One man’s spam is another man’s ham”
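
Of the methods named above, Naïve Bayes is the easiest to show in a few lines. The following is a minimal sketch with whitespace tokenization and add-one smoothing; the four training messages are invented for illustration, and a real filter would need far more data, as the slide notes.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_score(self, text, label):
        """log P(label) + sum of log P(word | label), with add-one smoothing."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total + len(vocab)))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


nb = NaiveBayes()
nb.train("earn money fast", "spam")
nb.train("free money now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")

print(nb.classify("free money"))      # prints spam
print(nb.classify("monday meeting"))  # prints ham
```

Retraining on newly reported messages is what lets such a filter adapt to changing spam, and training it per user reflects the point that one person's spam is another's ham.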

Filtering Techniques: Others

• Distributed (peer-to-peer, collaborative) spam filtering
  – To share the knowledge of spam features
  – SpamNet: Cloudmark
  – SpamWatch: UC Berkeley
• Problems
  – Efficacy
  – Efficiency

Distributed Spam Filtering

• Cloudmark’s SpamNet

[Diagram, reconstructed from its labels]

• Each recipient's MUAr retrieves mail from MTAr over POP3/IMAP4
• A client-side add-in in the MUAr checks incoming messages against SpamNet
• The client-side add-in of another recipient reports spam to SpamNet

Discussions on Filtering-based Approach

• False-positive vs. false-negative
  – Cost-sensitive e-mail classification
• Incoming vs. outgoing e-mail filtering
  – Ex. corporate mail filtering might focus on preventing leaks of confidential data
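
Cost-sensitive classification can be made concrete with a decision threshold. In this sketch the cost ratio is an assumed parameter: it says how many times worse a false positive (good mail filed as spam) is than a false negative (spam delivered). With spam probability p, filing as spam costs ratio x (1 - p) in expectation and delivering costs p, so the message is filed as spam only when p > ratio / (1 + ratio).

```python
def spam_threshold(cost_ratio):
    """Probability above which a message should be filed as spam.

    cost_ratio: cost of a false positive divided by the cost of a
    false negative (an assumed, site-specific parameter).
    """
    return cost_ratio / (1.0 + cost_ratio)


def decide(p_spam, cost_ratio):
    """Pick the label with the lower expected cost."""
    return "spam" if p_spam > spam_threshold(cost_ratio) else "ham"


print(spam_threshold(1))   # prints 0.5 (both errors equally bad)
print(spam_threshold(99))  # prints 0.99
print(decide(0.9, 1))      # prints spam
print(decide(0.9, 99))     # prints ham
```

The same 90%-likely spam is blocked when errors are symmetric but delivered when losing a good message is considered 99 times worse.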

Postage

• Postage: to increase the cost of sending e-mails
  – Money: payment
  – Computation: time
  – Turing tests: challenge-response
• Problems
  – Requires multiple monetary transactions for each e-mail delivery
  – Who pays for infrastructure?
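
The "computation" form of postage can be sketched as a proof-of-work stamp in the spirit of Hashcash. The stamp layout (`recipient:counter`) and the 12-bit difficulty are simplifications chosen for this example and are not the real Hashcash stamp format.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the leading zero bits of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(recipient, bits=12):
    """Sender side: search for a counter whose SHA-1 has `bits` leading zero bits."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp, recipient, bits=12):
    """Receiver side: a single hash, no search."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits


stamp = mint("bob@example.com")
print(verify(stamp, "bob@example.com"))    # prints True
print(verify(stamp, "alice@example.com"))  # prints False
```

Minting takes about 2^bits hash attempts on average while verifying takes one, so the cost falls on the sender and grows with the number of recipients.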

Disposable E-mail Address

• Disposable e-mail address
  – Separate e-mail address for each correspondence
• Channelized e-mail system [R. Hall]
  – Sort incoming mails according to sender address
  – Terminate the address with spam
• Problems
  – How do new senders get your address?
  – What’s the sender address for multiple receivers?
  – Difficult to remember
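
One way to picture per-correspondent addresses is sketched below. It is not R. Hall's channelized system: the plus-addressing form, the HMAC-derived tag, the secret, and the `Channels` class are all assumptions made for illustration.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical per-user secret


def channel_address(correspondent, user="alice", domain="example.com"):
    """Derive a stable disposable address for one correspondent."""
    tag = hmac.new(SECRET, correspondent.lower().encode(), hashlib.sha256)
    return f"{user}+{tag.hexdigest()[:8]}@{domain}"


class Channels:
    """Tracks which disposable addresses are still open."""

    def __init__(self):
        self.open = {}  # address -> correspondent it was issued to

    def issue(self, correspondent):
        address = channel_address(correspondent)
        self.open[address] = correspondent
        return address

    def close(self, address):
        """Terminate an address that has started receiving spam."""
        self.open.pop(address, None)

    def accepts(self, address):
        return address in self.open


channels = Channels()
shop_address = channels.issue("shop@example.org")
accepted_before = channels.accepts(shop_address)
channels.close(shop_address)
accepted_after = channels.accepts(shop_address)
print(accepted_before, accepted_after)  # prints: True False
```

Because the tag is derived rather than remembered, the user does not have to memorize each address, which speaks to the last problem listed; the first two problems remain open in this sketch.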

Enhancing SMTP Mechanism

• Email path verification
  – To trace the real origin of e-mail (sender)
  – Problem: accounting is needed for packet network
• Authenticated SMTP
  – Trusted environment
    • SMTP authentication (RFC 2554), SMTP over SSL/TLS (RFC 3207), digital signatures (PGP, …)
  – Problem: need client-server cooperation

Other Techniques (cont.)

• Reputation-based approach
  – Based on HITS (Hyperlink-Induced Topic Search) algorithm
  – Ranking on email sending/receiving reputation
• Problem
  – Bad reputation for volume senders (mailing lists, newsletters, …)
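
The slide does not say how HITS is applied to mail, so the following is only a sketch under an assumed model: each message is an edge from sender to receiver, hub scores rank sending behaviour, and authority scores rank receiving behaviour. The addresses are hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Run HITS on (sender, receiver) edges; return (hub, authority) dicts."""
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of everyone who sends to the node.
        auth = {n: sum(hub[s] for s, r in edges if r == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of the authority scores of everyone the node sends to.
        hub = {n: sum(auth[r] for s, r in edges if s == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


mail_graph = [
    ("alice", "bob"), ("carol", "bob"), ("dave", "bob"),
    ("bulk", "alice"), ("bulk", "bob"), ("bulk", "carol"), ("bulk", "dave"),
]
hub, auth = hits(mail_graph)
print(max(hub, key=hub.get))    # prints bulk
print(max(auth, key=auth.get))  # prints bob
```

The node that mails everyone dominates the sending score regardless of intent, which is the problem noted above: a newsletter and a spammer look alike from volume alone.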

Existing Anti-Spam Tools

• Open Source Filters
  – SpamAssassin
  – ifile
  – bogofilter
  – POPfile
  – SpamBayes
  – CRM114
• Commercial Products
  – BrightMail
  – SurfControl
  – Anti-virus

Spammers’ Tricks

• Images: MIME
• Invisible ink (hidden text): color
• Misspelling
  – o -> 0
  – i -> l -> 1 -> !
  – S -> 5
• F R E E, g-i=r-l, …
• Ref: John Graham-Cumming: The Spammers’ Compendium, http://www.jgc.org/tsc/index.htm
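
A filter can undo the character tricks listed above by mapping look-alikes to one representative before matching. This is a minimal sketch that covers only the substitutions on the slide; the function names are made up for the example.

```python
import re

# Look-alike characters from the slide, folded to one representative each.
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "i", "l": "i", "!": "i", "5": "s"})


def canonical(text):
    """Lowercase, fold look-alike characters, then drop every non-letter."""
    folded = text.lower().translate(SUBSTITUTIONS)
    return re.sub(r"[^a-z]", "", folded)


def contains_keyword(text, keyword):
    """Compare in canonical form so both sides are folded the same way."""
    return canonical(keyword) in canonical(text)


print(contains_keyword("F R E E", "free"))         # prints True
print(contains_keyword("g-i=r-l", "girl"))         # prints True
print(contains_keyword("M0NEY", "money"))          # prints True
print(contains_keyword("meeting agenda", "free"))  # prints False
```

Dropping spaces also lets a keyword match across word boundaries, so this kind of folding trades fewer misses for more false positives.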

Potential Industrial Standards

• Sender/Domain authentication for e-mails
  – Sender ID Framework (Microsoft)
  – DKIM (Yahoo, Cisco)
    • DomainKeys (Yahoo)
    • Identified Internet Mail (Cisco)
  – SPF
    • Sender Permitted From (AOL)

Structures of E-mails

• Envelope: SMTP (RFC 2821)
• Header & body: RFC 2822
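
The split between envelope and header is what makes sender forgery easy, and it is the split the authentication proposals work on. The sketch below uses Python's standard `email` package to parse a header and body, and a plain dictionary to stand in for the envelope of an SMTP session; all addresses are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

# Envelope: what the client states in the SMTP dialogue (MAIL FROM / RCPT TO).
envelope = {
    "mail_from": "bounce@bulk.example.net",
    "rcpt_to": ["customer@example.com"],
}

# Header and body: the message content itself.
raw_message = (
    "From: Your Bank <service@bank.example.com>\n"
    "To: customer@example.com\n"
    "Subject: Account notice\n"
    "\n"
    "Please verify your account.\n"
)

message = message_from_string(raw_message)
header_address = parseaddr(message["From"])[1]
body = message.get_payload()

# Nothing in SMTP itself forces the two sender identities to agree.
header_domain = header_address.split("@")[1]
envelope_domain = envelope["mail_from"].split("@")[1]
mismatch = header_domain != envelope_domain
print(header_address, envelope["mail_from"], mismatch)
```

The recipient's mail program shows only the header `From`, while delivery used the envelope sender, so a check has to state which of the two identities it validates.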

Sender ID Framework (MS)

[Diagram slide; the figure is not preserved in the text]

DomainKeys

[Diagram slide; the figure is not preserved in the text]

IIM: Authentication/Authorization Model

Messages must pass two tests before they are authenticated:

• Authenticate the message
  – Receiving domain authenticates the message, i.e. verifies that the message was not altered in any consequential manner prior to reaching the receiving domain
• Authorize the sender
  – Receiving domain asks sending domain to confirm that whoever signed the message was authorized to do so (without having to identify the sender)

Identified Internet Mail

[Diagram slide; the figure is not preserved in the text]

DomainKeys Identified Mail (DKIM)

• Derived from Yahoo DomainKeys and Cisco Identified Internet Mail
  – IETF Working Group formed
  – IETF Internet draft
• Message header authentication
  – DNS identifiers
  – Public keys in DNS
• End-to-end
  – Between origin/receiver administrative domains
  – Not path-based
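
"Public keys in DNS" means the receiver derives a DNS name from the signature header and fetches the signer's key from there. The sketch below only parses the tag list and builds that query name (`selector._domainkey.domain`); it performs no DNS lookup and no cryptographic verification, and the signature values are placeholders.

```python
def parse_tag_list(value):
    """Parse a 'tag=value; tag=value' list as used in DKIM-Signature headers."""
    tags = {}
    for part in value.split(";"):
        if "=" in part:
            name, _, tag_value = part.partition("=")
            tags[name.strip()] = tag_value.strip()
    return tags


def key_query_name(tags):
    """DNS name whose TXT record holds the signer's public key."""
    return f"{tags['s']}._domainkey.{tags['d']}"


# Hypothetical header value; bh= and b= would carry base64 data in practice.
signature = (
    "v=1; a=rsa-sha256; d=example.com; s=mail2006; "
    "h=from:to:subject; bh=PLACEHOLDER; b=PLACEHOLDER"
)
tags = parse_tag_list(signature)
print(key_query_name(tags))  # prints mail2006._domainkey.example.com
```

Because the key is found through the signing domain (`d=`) rather than through the connecting IP address, the check survives relaying, which is the sense of "end-to-end, not path-based" above.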

SPF

• Sender Policy Framework
  – Derived from Sender Permitted From (SPF, AOL)
  – By Meng Weng Wong, CTO of Pobox
  – Current specification: SPFv1 (RFC 4408)
  – Reverse MX records
  – Adopted by many mail server implementations
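
An SPF check compares the connecting IP address with the policy the sender's domain publishes in DNS. The sketch below evaluates only the `ip4:`, `ip6:` and `all` mechanisms of a record that is assumed to be already fetched; other mechanisms (`a`, `mx`, `include`, …) are skipped, so it is an illustration rather than a conforming SPFv1 evaluator.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_spf(record, client_ip):
    """Evaluate the ip4:/ip6:/all terms of an SPF record against client_ip."""
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        return "none"
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"  # default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        if term.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip.version == network.version and ip in network:
                return QUALIFIERS[qualifier]
        elif term == "all":
            return QUALIFIERS[qualifier]
    return "neutral"  # no mechanism matched


# Hypothetical policy: only 192.0.2.0/24 may send mail for the domain.
record = "v=spf1 ip4:192.0.2.0/24 -all"
print(check_spf(record, "192.0.2.10"))    # prints pass
print(check_spf(record, "198.51.100.5"))  # prints fail
```

The first matching mechanism decides the result, so the closing `-all` rejects every address that the earlier terms did not authorize.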

Tips for End Users (1/2)

• Never give out your personal e-mail address to strangers
• Use separate e-mail addresses for business and public use (“disposable”)
• Never respond to unsolicited e-mail
• Do not click on links within unsolicited e-mail, including deceptive unsubscribe links

Tips for End Users (2/2)

• Read the subject line of every e-mail carefully, and use the preview feature on mail programs
• If your e-mail address appears on a Web site, ask the site's manager to encode it
• Use e-mail service providers that filter spam
• Install an anti-spam program on your computer

Conclusion

• Anti-spam is a battle
  – “Every time we discover a feature to catch spam, spammers will find a work-around”
• Some advice
  – Filtering is just one part of the solution
  – Try to raise the costs for spammers
  – Be nice to your e-mail address
  – Mail delivery has to be improved

References

• IRTF ASRG: http://asrg.sp.am/
• Sender ID: http://www.microsoft.com/mscorp/safety/technologies/senderid/technology.mspx
• DKIM: http://dkim.org/
• DomainKeys: http://antispam.yahoo.com/domainkeys
• Identified Internet Mail: http://www.identifiedmail.com/
• SPF Project: http://www.openspf.org/
• RFCs and Internet Drafts

References for Research

• MIT Spam Conference (2003-2006)
  – http://www.spamconference.org/
• Conference on Email and Anti-Spam (CEAS) (2004-2006)
  – http://www.ceas.cc/
• TREC (Text REtrieval Conference) Spam Track (2005-2006)
  – http://trec.nist.gov/data/spam.html