Text Categorization Moshe Koppel Lecture 10: Spam Detection Some slides from Joshua Goodman

Text Categorization

Moshe Koppel

Lecture 10: Spam Detection

Some slides from Joshua Goodman

Obligatory Scare Slide

• There’s lots of spam

• The proportion of spam is growing – it will soon exceed 100% of all email sent

• It costs the world gazillions of dollars

• Spam is BAD

• (Actually, lately it looks like spam email has been mostly defeated.)

Kinds of spam

• Active spam – ads and scams

– email

– chatbots

– commentbots

• Passive spam – websites

– link farms for SEO

– adsense parking lots

The differences between these are increasingly artificial

Special Issues

Spam detection is basically a text cat problem, but there are some special issues:

• Collecting data – non-spam email is private

• Asymmetry – must never class good mail as spam

• Adversarial – spammers try to defeat filters

Collecting Data

• Gmail has user feedback

– LOTS of examples

– Haphazardly labeled

– How much info do they keep about each email?

Problem of False Positives

• False positives more costly than false negatives

• Research must report recall-precision curves; the key point is precision ~ 1
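As a rough illustration of reading a recall-precision curve at the "precision ~ 1" end, the sketch below sweeps a score threshold and reports the best spam recall achievable at a required precision. The scores, labels, and function names are made up for the example; they are not taken from any study cited in these slides.

```python
from typing import List, Optional, Tuple

def precision_recall_curve(scores: List[float],
                           is_spam: List[bool]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for every distinct score.

    A message is flagged as spam when its score >= threshold.
    """
    total_spam = sum(is_spam)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for s, label in zip(scores, is_spam) if s >= threshold]
        true_pos = sum(flagged)
        precision = true_pos / len(flagged)  # flagged is never empty here
        recall = true_pos / total_spam if total_spam else 0.0
        points.append((threshold, precision, recall))
    return points

def best_recall_at_precision(scores: List[float], is_spam: List[bool],
                             min_precision: float = 1.0) -> Tuple[float, Optional[float]]:
    """Highest recall (and its threshold) among points with precision >= min_precision."""
    ok = [(r, t) for t, p, r in precision_recall_curve(scores, is_spam)
          if p >= min_precision]
    return max(ok) if ok else (0.0, None)

# Hypothetical filter scores (higher = more spam-like) and true labels.
scores = [0.99, 0.95, 0.90, 0.80, 0.60, 0.40, 0.10]
labels = [True, True, True, False, True, False, False]

recall, threshold = best_recall_at_precision(scores, labels, min_precision=1.0)
print(f"recall {recall:.2f} at threshold {threshold}")
```

On this toy data, demanding perfect precision means giving up the spam message scored 0.60, because one good message scores higher than it.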

Adversarial Problem

• Spammers reverse engineer global filters; use nasty tricks to circumvent them

• This is what makes spam detection an interesting problem

Basic Spam

• Let’s start with some garden variety spam

• This is easily detected by standard text cat tricks

It cost you nothing (Yes! $0) to give Us a call, We will contact You back

Absolutely No exams/Tests/classes/books/Interviews

No Pre-School qualification Needed!

-----------------------------

Inside USA: 1-718-989-5XXX

0utside USA: +1-718-989-5XXX

-----------------------------

Degree, Bacheelor, masteerMBA, PhDD available in the field of your choice that's Right, You can even become a doctor & receive all the benefits That omes With it!

Please Leave Below 3 INFO in voicemail:

1) your Name

2) your Country

3) your Phone No. (with Countrycode)

Call Now! 24 hours a day, 7 Days a week to recieve Your call

Most Honorable Sir,

  

I am Ehud Olmert, formerly the Prime Minister of Israel. I URGENTLY REQUIRE YOUR ASSISTANCE IN A MOST DISCRETE MATTER. As a result of certain events in my country, it has become necessary for me to transfer a considerable sum of cash to a foreign bank account. I turn to you as a MOST HONORABLE AND TRUSTED PERSON for your discrete assistance.

The total amount involved is THIRTY MILLION NEW ISRAELI SHEKELS only [30,000.000.00 NIS] and we wish to transfer this money into safe foreigners account abroad. I am only contacting you as a foreigner because this money cannot be approved to a local person here, but to a foreigner who has information about the account, which I shall give to you upon your positive response. I am revealing this to you with believe in God that you will never let me down in this business, you are the FIRST AND THE ONLY PERSON that I am contacting for this business, so please reply urgently so that I will inform you the next step to take urgently.

At the conclusion of this business, you will be given 40% of the total amount, 50% will be for us while 10% will be for the expenses both parties may incurred during this transaction. PLEASE, TREAT THIS PROPOSAL AS TOP SECRET.

Early Work Sahami et al ‘98

• Learner: Naïve Bayes

• Feature Set: Words, Phrases, Structural Features

• Feature Selection: top 500 infogain

• Evaluation Data: ~1700 Messages, ~88% Spam

• Results: Spam precision 100%, Spam recall 98.3%
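For readers who want to see the learner concretely, here is a minimal Naïve Bayes sketch over word features with add-one smoothing. It is not Sahami et al.'s system: it omits phrases, structural features, and infogain feature selection, and the class name and training messages are invented for illustration.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Toy multinomial Naive Bayes over lowercase whitespace-separated words."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def log_odds(self, text: str) -> float:
        """log P(spam | text) - log P(ham | text); needs >= 1 example per class."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            logp = math.log(self.doc_counts[label] / total_docs)  # class prior
            for word in text.lower().split():
                if word in vocab:  # words never seen in training are ignored
                    count = self.word_counts[label][word]
                    logp += math.log((count + 1) / (total_words + len(vocab)))
            score[label] = logp
        return score["spam"] - score["ham"]

    def is_spam(self, text: str, threshold: float = 0.0) -> bool:
        # Raising the threshold trades spam recall for fewer false positives.
        return self.log_odds(text) > threshold

# Invented training messages, for illustration only.
nb = NaiveBayesSpamFilter()
nb.train("free money call now", "spam")
nb.train("get your degree now free", "spam")
nb.train("meeting moved to thursday", "ham")
nb.train("please review the lecture slides", "ham")

print(nb.is_spam("free degree call now"))
print(nb.is_spam("please review the slides"))
```

Because good mail must almost never be classed as spam, a deployed filter would use a threshold well above zero rather than the even-odds default shown here.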

Early Work Sahami et al ‘98

Hand Crafted Features

– 35 Phrases

• ‘Free Money’

• ‘Only $’

• ‘be over 21’

– 20 Domain Specific Features

• Domain type of sender (.edu, .com, etc)

• Sender name resolutions (internal mail)

• Has attachments

• Time received

• Percent of non-alphanumeric characters in subject
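A few of these hand-crafted features are easy to sketch. The code below is only an illustration: the function names, the feature-dictionary layout, and the choice to ignore whitespace when computing the non-alphanumeric percentage are assumptions, not details from the original paper. Only the three phrases named on the slide are included.

```python
from typing import Dict, Union

# The three example phrases listed on the slide (the paper used 35).
SPAM_PHRASES = ["free money", "only $", "be over 21"]

def sender_domain_type(sender: str) -> str:
    """Top-level domain of the sender address, e.g. 'edu' or 'com'."""
    domain = sender.rsplit("@", 1)[-1]
    return domain.rsplit(".", 1)[-1].lower()

def pct_non_alphanumeric(subject: str) -> float:
    """Percent of non-space subject characters that are not letters or digits."""
    chars = [c for c in subject if not c.isspace()]
    if not chars:
        return 0.0
    return 100.0 * sum(not c.isalnum() for c in chars) / len(chars)

def extract_features(sender: str, subject: str, body: str,
                     has_attachments: bool) -> Dict[str, Union[bool, str, float]]:
    text = body.lower()
    features: Dict[str, Union[bool, str, float]] = {
        f"phrase:{p}": p in text for p in SPAM_PHRASES
    }
    features["domain_type"] = sender_domain_type(sender)
    features["has_attachments"] = has_attachments
    features["pct_non_alnum_subject"] = pct_non_alphanumeric(subject)
    return features

# Hypothetical message, for illustration only.
example = extract_features(
    sender="bob@deals.example.com",
    subject="FREE!!! $$$",
    body="Get Free Money today, only $5",
    has_attachments=False,
)
print(example)
```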

Later Studies

• The early work was followed by the usual stream of extended feature sets and fancier learning methods (e.g. SVM)

• It is now common to use over 100,000 features

• Learning methods for huge data sets must be very efficient (online algorithms)

• Methods must be adaptive

How to Beat an Adaptive Spam Filter Graham-Cumming ‘04

• Use machine learning to discover words that beat an adaptive filter

– Take a message that is near the spam threshold

– Send it to the target filter 10,000 times, each time adding 5 random words

– Train an ‘evil’ filter to learn which messages beat the target filter

– Use the ‘evil’ filter to modify new spam messages

• Found single-word additions that get new spam past the filter

Other Tricks

• Fill messages with real text taken from books, sites, etc.

• Can even generate real-looking texts using Markovian language models

The Hitchhiker Chaffer

• Content Chaff

– Random passages from the Hitchhiker’s Guide

– Footers from valid mail

“This must be Thursday,” said Arthur to himself, sinking low over his beer, “I never could get the hang of Thursdays.”

Express yourself with MSN Messenger 6.0…

Hitchhiker Chaffer’s Later Work

• There is nothing fancy about this spam

– “A spam filter will catch that in its sleep” – anonymous

• Or maybe not…

Hitchhiker Chaffer’s Later Work

• Hidden Text

• Content Chaff

• URL Spamming

Also included a number of unusual statements made by candidates during, ‘On display? I eventually had to go down to the cellar to find them.’

http://join.msn.com/?Page=features/es

More Tricks

• Encoded Text

• Distorted Text

Secret Decoder Ring Dude

• Another spam that looks easy

• Is it?

Secret Decoder Ring Dude

• Character Encoding

• HTML word breaking

Pharmacy

Prod&#117;c<!LZJ>t<!LG>s
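One way a filter might undo this particular trick is to strip tag-like junk and decode character references before tokenizing. The sketch below uses only the Python standard library; the function name is made up, and the regex is a crude stand-in for a real HTML parser that would mimic what the mail client actually renders.

```python
import html
import re

def normalize_html_text(source: str) -> str:
    """Recover the visible text from obfuscated HTML (rough approximation)."""
    # Drop anything tag-shaped, including bogus declarations such as <!LZJ>.
    without_tags = re.sub(r"<[^>]*>", "", source)
    # Decode character references such as &#117; (the letter "u").
    return html.unescape(without_tags)

print(normalize_html_text("Prod&#117;c<!LZJ>t<!LG>s"))  # prints "Products"
```

The presence of such junk is itself a strong spam signal, so a filter may want both the normalized text and the raw source.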

Diploma Guy

• Word Obscuring

Dplmoia Pragorm

Caerte a mroe prosoeprus
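The obscured words above keep their first and last letters and shuffle the letters in between, which suggests a simple countermeasure: index a vocabulary by (first letter, sorted interior letters, last letter) and look scrambled tokens up by the same key. This is a sketch under that assumption; the vocabulary and function names are invented, and it ignores punctuation and key collisions between real words.

```python
from typing import Dict, Iterable, List, Tuple

def scramble_key(word: str) -> Tuple[str, ...]:
    """Key that is unchanged when only the interior letters are permuted."""
    w = word.lower()
    if len(w) <= 3:
        return (w,)  # too short to scramble
    return (w[0], "".join(sorted(w[1:-1])), w[-1])

def build_index(vocabulary: Iterable[str]) -> Dict[Tuple[str, ...], List[str]]:
    index: Dict[Tuple[str, ...], List[str]] = {}
    for word in vocabulary:
        index.setdefault(scramble_key(word), []).append(word)
    return index

def unscramble(text: str, index: Dict[Tuple[str, ...], List[str]]) -> str:
    out = []
    for token in text.split():
        candidates = index.get(scramble_key(token))
        # Take the first vocabulary match; leave unknown tokens untouched.
        out.append(candidates[0] if candidates else token)
    return " ".join(out)

# Tiny hypothetical vocabulary, for illustration only.
INDEX = build_index(["diploma", "program", "create", "a", "more", "prosperous"])

print(unscramble("Dplmoia Pragorm", INDEX))
print(unscramble("Caerte a mroe prosoeprus", INDEX))
```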

More of Diploma Guy

One Pretty Good Text Cat Method

• Optimally compress spam training examples

• Optimally compress non-spam training examples

• Check which of the two compression models better compresses the suspicious message
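The three steps above can be approximated with an off-the-shelf compressor: measure how many extra compressed bytes the message costs when appended to each training corpus, and pick the class where it is cheaper. Note the substitution: zlib (DEFLATE) is used here only because it ships with Python. It is not an optimal compressor, and its 32 KB window makes it unsuitable for large corpora, where an adaptive character-level model would be needed. The corpora and function names are invented for illustration.

```python
import zlib

def compressed_size(data: str) -> int:
    return len(zlib.compress(data.encode("utf-8"), 9))

def extra_bytes(corpus: str, message: str) -> int:
    """Additional compressed bytes needed to encode message after the corpus."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)

def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    # Ties go to "ham": misfiling good mail is the costlier mistake.
    return "spam" if spam_cost < ham_cost else "ham"

# Tiny invented corpora, for illustration only.
SPAM_CORPUS = "\n".join([
    "Call now to receive your diploma, no exams or classes needed",
    "Free money is waiting, reply urgently to claim your prize",
    "Cheap pharmacy products shipped discreetly to your door",
])
HAM_CORPUS = "\n".join([
    "The meeting has moved to Thursday at ten in room four",
    "Please review the attached lecture slides before class",
    "Can you send me the draft of the paper by Friday",
])

print(classify("Reply urgently to claim your diploma, no exams needed",
               SPAM_CORPUS, HAM_CORPUS))
print(classify("Please send me the lecture slides before the meeting on Friday",
               SPAM_CORPUS, HAM_CORPUS))
```

Because the compressor works on raw characters, the same code can be run on HTML source without any tokenization step.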

Why This Works

• Works at level of character n-grams

• Should be applied to HTML source

• Captures weird encodings, word distortions

• Using character n-grams with an SVM would probably also work well

But Spammers Aren’t Sitting Around…

• Embed text in images (can vary non-text parts of image)

• Also, just send link to spam site

Text Cat isn’t the only Trick

• Don’t display images without the user’s okay

• Blacklist IPs that spam comes from

– Can harm legitimate senders (zombies, etc.)

• Charge “postage” for email

– Cash

– Puzzles that waste CPU

– Task easy for humans, hard for computers
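The "puzzles that waste CPU" idea can be sketched as a hash-based proof of work: the sender must find a nonce whose hash, together with the recipient and message ID, starts with a given number of zero bits, while the recipient checks it with a single hash. The stamp format, function names, and difficulty values below are assumptions chosen for the example, not a description of any deployed scheme.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def check_stamp(recipient: str, message_id: str, nonce: int,
                difficulty: int = 16) -> bool:
    """Cheap for the recipient: one hash."""
    payload = f"{recipient}:{message_id}:{nonce}".encode("utf-8")
    return leading_zero_bits(hashlib.sha256(payload).digest()) >= difficulty

def mint_stamp(recipient: str, message_id: str, difficulty: int = 16) -> int:
    """Costly for the sender: about 2**difficulty hashes on average."""
    for nonce in count():
        if check_stamp(recipient, message_id, nonce, difficulty):
            return nonce

# Low difficulty so the example runs quickly.
stamp = mint_stamp("alice@example.com", "msg-1", difficulty=12)
print(stamp, check_stamp("alice@example.com", "msg-1", stamp, difficulty=12))
```

Binding the stamp to the recipient and message means one stamp cannot be reused across a bulk mailing, which is where the cost falls on spammers rather than ordinary senders.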

[Diagram with labels: Sender, Recipient, Message, Response]

CAPTCHAs

• Identify distorted characters

• Supposed to be easy for humans, hard for computers

• Actually, nowadays computers are better at it than humans

Computers vs. Humans

Slight Variation

• Fortunately, for now, humans are still better than computers at identifying character boundaries

New CAPTCHAs

Economics of CAPTCHAs

• CAPTCHAs are taken from books that Google is trying to OCR. We all work for them for free.

• Spammers use Mechanical Turk to solve CAPTCHAs. It’s worth paying for.