
IMPLEMENTATION OF NAÏVE BAYESIAN SPAM FILTER ALGORITHM

Submitted To:

Mrs. Arti Gupta

BY:

ATUL SAURABH 07503872

ALOK NANDAN JHA 07503895

NITISH KUMAR SINHA 07503897

(B-11)


INDEX

PROBLEM STATEMENT

INTRODUCTION

LITERATURE SURVEY

METHODOLOGY PROPOSED

EXPERIMENT AND RESULT

CONCLUSION

FUTURE BLOCK

REFERENCE


ACKNOWLEDGEMENT

A journey is easier when you travel together. Interdependence is certainly more valuable than independence. This report is the result of intensive work and observation, during which we have been accompanied and supported by many people. We are pleased to now have the opportunity to express our gratitude to all of them.

We thank Arti Gupta Ma'am, our IRDM project mentor, for the stimulating ideas that helped us start the project and gain insight into it. Her support and guidance were a constant source of inspiration for us towards the completion of the report.

In a nutshell, this project would have been stuck in the wilderness without her assistance and the stimulating discussions she provided.


PROBLEM STATEMENT

We implement Paul Graham's naive Bayesian spam filter algorithm in C#. The implementation is suitable for incorporation into an ASP.NET blogging, forum, or email application.

The Achilles heel of the spammers is their message. They can circumvent any other barrier we

set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can

write software that recognizes their messages, there is no way they can get around that. In fact,

we have found that we can filter present-day spam acceptably well using nothing more than a

Bayesian combination of the spam probabilities of individual words.


INTRODUCTION

Earlier, a simple silent human-detection script running behind the scenes was enough to ensure that a real person was sitting at a real keyboard and typing blog entries in by hand. Now a new breed of spam is showing up on Blogabond, and it is getting worse every day. Many modern email clients use Bayesian spam filtering, so that is what we are going to implement. Content-based filters are an effective way to stop spam.

Graham reports that, using a slightly tweaked Bayesian filter, he misses fewer than 5 per 1000 spams, with 0 false positives. A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.


LITERATURE SURVEY

SOURCE-

http://www.paulgraham.com/spam.html

Authors: Paul Graham

This essay popularized the naïve Bayesian spam filter. According to the essay, the statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. One looks at spam and thinks: the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. Filtering out that stuff will take a big bite out of incoming spam. The essay then goes on to describe a statistical technique that automates this process: the system is trained on a set of labeled messages and then filters spam automatically.

SOURCE-

Adaptive Naïve Bayesian Anti-Spam Engine

Authors: Wojciech P. Gajewski

World Academy of Science, Engineering and Technology 7, 2005

The paper first discusses the problem of spam, which has seriously troubled the Internet community during the last few years and has now reached an alarming scale. Observations made at CERN (European Organization for Nuclear Research, located in Geneva, Switzerland) show that spam can constitute up to 75% of daily SMTP traffic. The paper then presents the naïve Bayesian classifier, based on a bag-of-words representation of an email, as a widely used way to stop this unwanted flood, since it combines good performance with simple training and classification processes. However, because the patterns of spam change constantly, the classifier must be able to adapt online.


SOURCE-

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

The article discusses Bayes' theorem and how it can be used for spam filtering. Bayes' theorem is used twice in the context of spam:

A first time, to compute the probability that the message is spam, knowing that a given word appears in this message;

A second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them).

Computing the probability that a message containing a given word is spam

The formula used by the software to determine this is derived from Bayes' theorem:

$$p(S \mid W) = \frac{p(W \mid S)\, p(S)}{p(W \mid S)\, p(S) + p(W \mid H)\, p(H)}$$

where:

$p(S \mid W)$ is the probability that a message is spam, knowing that the word is in it;

$p(S)$ is the overall probability that any given message is spam;

$p(W \mid S)$ is the probability that the word appears in spam messages;

$p(H)$ is the overall probability that any given message is not spam (is "ham");

$p(W \mid H)$ is the probability that the word appears in ham messages.

Most Bayesian spam detection software makes the assumption that there is no a priori reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%:

$$p(S) = 0.5, \quad p(H) = 0.5$$

Filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption allows us to simplify the general formula to:

$$p(S \mid W) = \frac{p(W \mid S)}{p(W \mid S) + p(W \mid H)}$$
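As a minimal illustration, the single-word computation can be sketched in Python. The function name `spam_given_word` and its parameter names are hypothetical and not taken from any of the surveyed sources. The sketch computes p(W|S) p(S) / (p(W|S) p(S) + p(W|H) p(H)), and the default prior of 0.5 corresponds to the "not biased" case.

```python
def spam_given_word(p_word_in_spam, p_word_in_ham, p_spam=0.5):
    """Probability that a message is spam, given that it contains the word.

    p_word_in_spam: probability that the word appears in spam messages
    p_word_in_ham:  probability that the word appears in ham messages
    p_spam:         prior probability that any message is spam
                    (0.5 is the "not biased" assumption)
    """
    p_ham = 1.0 - p_spam
    numerator = p_word_in_spam * p_spam
    denominator = numerator + p_word_in_ham * p_ham
    if denominator == 0:
        # The word was never observed in either class, so Bayes' theorem
        # gives no information about it.
        raise ValueError("word has zero probability in both classes")
    return numerator / denominator
```

For example, a word seen in 20% of spam and 5% of ham gives `spam_given_word(0.2, 0.05)`, which is 0.1 / 0.125 = 0.8 under the unbiased prior.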


Combining individual probabilities

The Bayesian spam filtering software makes the "naive" assumption that the words present in the message are independent events. This assumption does not hold in natural languages like English, where the probability of finding an adjective, for example, is affected by the probability of having a noun. With that assumption, one can derive another formula from Bayes' theorem:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

where:

$p$ is the probability that the suspect message is spam;

$p_1$ is the probability $p(S \mid W_1)$ that it is spam, knowing it contains a first word;

$p_2$ is the probability $p(S \mid W_2)$ that it is spam, knowing it contains a second word;

$p_N$ is the probability $p(S \mid W_N)$ that it is spam, knowing it contains an Nth word.
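The combination step can be sketched in a few lines of Python. The function name `combine` is hypothetical. It takes the per-word probabilities p1 ... pN and returns their product divided by the sum of that product and the product of the complements (1 - p1) ... (1 - pN). The sketch assumes every probability lies strictly between 0 and 1, since otherwise both products can be zero.

```python
import math


def combine(probs):
    """Combine per-word spam probabilities under the naive independence assumption."""
    prod_spam = math.prod(probs)                 # p1 * p2 * ... * pN
    prod_ham = math.prod(1.0 - p for p in probs)  # (1 - p1) * ... * (1 - pN)
    return prod_spam / (prod_spam + prod_ham)
```

Two words that each indicate spam with probability 0.9 combine to 0.81 / 0.82, roughly 0.988, while one strongly spammy word (0.99) and one strongly innocent word (0.01) cancel out to about 0.5.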

SOURCE-

http://www.ibm.com/developerworks/linux/library/l-spamf.html

Authors: David Mertz

In this article, the author describes ways that computer code can help eliminate unsolicited commercial e-mail, viruses, Trojans, and worms, as well as frauds perpetrated electronically and other undesired and troublesome e-mail. The problem with spam is that it tends to swamp desirable e-mail. He discusses several spam filtering methods.

1. Basic structured text filters

The e-mail client the author uses can sort incoming e-mail based on simple strings found in specific header fields, the header in general, and/or the body. This capability is very simple and does not even include regular expression matching. Almost all e-mail clients have this much filtering capability.

These few simple filters correctly catch about 80% of the spam. Unfortunately, they also have a

relatively high false positive rate -- enough that one needs to manually examine some of the

spam folders from time to time.

2. Whitelist/verification filters

A fairly aggressive technique for spam filtering is what we would call the "whitelist plus

automated verification" approach.


A whitelist filter passes mail only from explicitly approved senders on to the inbox. Other messages generate a special challenge response to the sender. The challenge contains a unique code that identifies the original message, such as a hash or sequential ID, along with instructions for the sender to reply in order to be added to the whitelist (the reply must contain the code generated by the whitelist filter).
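The challenge-response flow can be illustrated with a small Python sketch. The class `WhitelistFilter` and its methods are hypothetical and do not come from the surveyed article. The sketch assumes a truncated SHA-256 hash of sender and message is an acceptable challenge code, and it ignores the actual sending of mail.

```python
import hashlib


class WhitelistFilter:
    """Toy whitelist filter with automated verification (illustrative only)."""

    def __init__(self, approved):
        self.approved = {address.lower() for address in approved}
        self.pending = {}  # challenge code -> (sender, held message)

    def receive(self, sender, message):
        """Return ("inbox", None) or ("challenge", code) for an incoming message."""
        sender = sender.lower()
        if sender in self.approved:
            return ("inbox", None)
        # Assumption: a hash of sender and body serves as the unique code.
        digest = hashlib.sha256((sender + "\n" + message).encode("utf-8"))
        code = digest.hexdigest()[:12]
        self.pending[code] = (sender, message)
        return ("challenge", code)

    def confirm(self, sender, code):
        """Handle a reply to a challenge; release the held message if it is valid."""
        entry = self.pending.get(code)
        if entry is None or entry[0] != sender.lower():
            return None
        del self.pending[code]
        self.approved.add(entry[0])
        return entry[1]
```

A sender who answers the challenge with the right code is added to the whitelist and the held message is released; a reply from a different address with the same code is ignored.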

3. Rule-based rankings

This approach evaluates a large number of patterns, mostly regular expressions, against a candidate message. Some matched patterns add to a message score, while others subtract from it. If a message's score exceeds a certain threshold, it is filtered as spam; otherwise it is considered legitimate. The drawback is that the rules need to be updated as the products and scams advanced by spammers evolve.
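A rule-based ranking can be sketched as follows. The patterns, weights, and threshold below are invented for illustration and are not the rules of any real filter; only the scoring scheme (matching patterns add to or subtract from a score that is compared with a threshold) follows the description above.

```python
import re

# Hypothetical rules: (compiled pattern, weight). Positive weights suggest
# spam, negative weights suggest legitimate mail.
RULES = [
    (re.compile(r"^dear friend", re.IGNORECASE), 2.0),
    (re.compile(r"!{3,}"), 1.5),
    (re.compile(r"\bunsubscribe\b", re.IGNORECASE), 1.0),
    (re.compile(r"\bmeeting agenda\b", re.IGNORECASE), -2.0),
]
THRESHOLD = 3.0  # assumed cut-off


def rule_score(message):
    """Sum the weights of all rules whose pattern matches the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message):
    """Classify a message as spam when its score exceeds the threshold."""
    return rule_score(message) > THRESHOLD
```

The maintenance cost mentioned above shows up directly in this sketch: every new spam pattern requires a new hand-written entry in `RULES`.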

4. Bayesian word distribution filters

Paul Graham wrote a provocative essay in August 2002. The general idea is that some words

occur more frequently in known spam, and other words occur more frequently in legitimate

messages. Using well-known mathematics, it is possible to generate a "spam-indicative

probability" for each word.

Graham's idea has several noteworthy benefits:

1. It can generate a filter automatically from corpora of categorized messages rather than

requiring human effort in rule development.

2. It can be customized to individual users' characteristic spam and legitimate messages.

3. It can be implemented in a very small number of lines of code.

4. It works surprisingly well.


METHODOLOGY PROPOSED

We started with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it. We scan the entire text, including headers and embedded HTML and JavaScript, of each message in each corpus. We currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. (There is probably room for improvement here.) We ignore tokens that are all digits, and we also ignore HTML comments, not even considering them as token separators. We count the number of times each token (ignoring case, currently) occurs in each corpus.

At this stage we end up with two large hash tables, one for each corpus, mapping tokens to number of occurrences.
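The tokenizing and counting step can be sketched in Python as below. The names `tokenize` and `count_tokens` are hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits. It is an illustration of the rules just described, not the project's C# code.

```python
import re
from collections import Counter

# HTML comments are deleted outright, so the text on either side is joined
# and the comment does not act as a token separator.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Token characters: letters, digits, apostrophes, dollar signs, dashes.
TOKEN = re.compile(r"[A-Za-z0-9'$-]+")


def tokenize(text):
    """Split a message into lower-case tokens, dropping all-digit tokens."""
    text = HTML_COMMENT.sub("", text)
    return [tok for tok in TOKEN.findall(text.lower()) if not tok.isdigit()]


def count_tokens(messages):
    """Build the token -> occurrence count table for one corpus."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts
```

Calling `count_tokens` once on the spam corpus and once on the nonspam corpus yields the two tables described above.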

Next we create a third hash table, this time mapping each token to the probability that an email containing it is spam, which we calculate as follows:

(let ((g (* 2 (or (gethash word good) 0)))   ; nonspam count, doubled
      (b (or (gethash word bad) 0)))          ; spam count
  (unless (< (+ g b) 5)                       ; skip tokens seen too rarely
    (max .01                                  ; clamp result to [.01, .99]
         (min .99 (float (/ (min 1 (/ b nbad))
                            (+ (min 1 (/ g ngood))
                               (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and bad are the hash tables we created in the first step, and ngood and nbad are the number of nonspam and spam messages respectively. We want to bias the probabilities slightly to avoid false positives, and we've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. We only consider words whose weighted count (spam occurrences plus doubled nonspam occurrences) is at least five.

And then there is the question of what probability to assign to words that occur in one corpus but not the other. Here we use .01 and .99, the values Graham proposed. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.

We consider each corpus to be a single long stream of text for purposes of counting occurrences. We use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.
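For readers less familiar with Lisp, the same calculation can be sketched in Python. The function name `token_probability` is hypothetical. The keyword parameters loosely mirror the names used in the report's parameter listings (GoodTokenWeight, MinCountForInclusion, MinScore, MaxScore), but that mapping is an assumption, and the defaults are the values from the Lisp code above. The sketch also assumes ngood and nbad are both nonzero.

```python
def token_probability(word, good, bad, ngood, nbad,
                      good_weight=2, min_count=5,
                      min_score=0.01, max_score=0.99):
    """Spam probability for one token, or None if the token is too rare.

    good, bad:   dicts mapping token -> occurrence count in each corpus
    ngood, nbad: number of nonspam and spam messages (assumed nonzero)
    """
    g = good_weight * good.get(word, 0)  # nonspam count, weighted to reduce false positives
    b = bad.get(word, 0)                 # spam count
    if g + b < min_count:
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamping also covers tokens seen in only one corpus (0 or 1 before clamping).
    return max(min_score, min(max_score, p))
```

With 100 messages in each corpus, a token seen 20 times in spam and once in nonspam scores 0.2 / 0.22, roughly 0.91, while a token seen only in spam is clamped to .99.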


When new mail arrives, it is scanned into tokens, and the fifteen most interesting tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. If probs is a list of the fifteen individual probabilities, we calculate the combined probability as the product of the probabilities divided by the sum of that product and the product of their complements.

One question that arises in practice is what probability to assign to a word we have never seen, i.e. one that doesn't occur in the hash table of word probabilities. We use .4, the value Graham proposed. If we have never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
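The classification step can be sketched as follows. The function name `classify` is hypothetical. Two details are assumptions that the text above does not fix: each distinct token is counted once, and a message is called spam when the combined probability exceeds 0.9. The stored probabilities are assumed to be clamped strictly between 0 and 1.

```python
import math


def classify(tokens, token_probs, n_interesting=15, unknown=0.4, threshold=0.9):
    """Return (combined spam probability, is_spam) for a tokenized message.

    token_probs: dict mapping token -> spam probability
    unknown:     probability assigned to tokens missing from token_probs
    threshold:   assumed decision cut-off
    """
    # Assumption: each distinct token contributes once.
    probs = [token_probs.get(tok, unknown) for tok in set(tokens)]
    # Most interesting first: largest distance from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n_interesting]
    prod_spam = math.prod(top)
    prod_ham = math.prod(1.0 - p for p in top)
    score = prod_spam / (prod_spam + prod_ham)
    return score, score > threshold
```

An empty message yields a neutral score of 0.5, because both products are then 1.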


EXPERIMENT AND RESULT

INPUT PARAMETERS:

ALGO A:

public int GoodTokenWeight = 4;
public int MinTokenCount = 0;
public int MinCountForInclusion = 3;
public double MinScore = 0.015;
public double MaxScore = 0.99;
public double LikelySpamScore = 0.9998;
public double CertainSpamScore = 0.9999;
public int CertainSpamCount = 15;
public int InterestingWordCount = 17;

ALGO B:

public int GoodTokenWeight = 2;
public int MinTokenCount = 0;
public int MinCountForInclusion = 5;
public double MinScore = 0.011;
public double MaxScore = 0.99;
public double LikelySpamScore = 0.9998;
public double CertainSpamScore = 0.9999;
public int CertainSpamCount = 10;
public int InterestingWordCount = 15;


SCREEN SHOT:

PARAMETERS            SPAM ASSASSIN                     NAÏVE BAYESIAN                   GMAIL SPAM FILTER

FALSE POSITIVE        12 OUT OF 100                     5 OUT OF 100                     1-2 OUT OF 100

CORRECTLY CLASSIFIED  88 OUT OF 100                     95 OUT OF 100                    98-99 OUT OF 100

IMAGE PARAMETER       NOT CONSIDERED                    NOT CONSIDERED                   CONSIDERED; SOLVED USING OCR

USED                  BASIC BLOCK OF EVERY SPAM FILTER  USED AT USER END OF SPAM FILTER  WIDELY USED AT DEVELOPER END OF SPAM FILTER


CONCLUSION

The advantage of Bayesian spam filtering is that it can be trained on a per-user basis. The spam

that a user receives is often related to the online user's activities. The word probabilities are

unique to each user and can evolve over time with corrective training whenever the filter

incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is

often superior to pre-defined rules. It can also perform particularly well in avoiding false

positives, where legitimate email is incorrectly classified as spam.

However, there are a few disadvantages too. Bayesian spam filtering is susceptible to Bayesian poisoning, a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources). Spammer tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email's spam score and making it more likely to slip past a Bayesian spam filter. Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked.

A probably more efficient solution to image spam has been proposed by Google and is used by its Gmail email system: performing OCR (Optical Character Recognition) on every mid- to large-size image and analyzing the text inside.

Even though Bayesian filtering is used widely to identify spam email, the technique can classify (or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. There is recent speculation that even the brain uses Bayesian methods to classify sensory stimuli and decide on behavioral responses.


FUTURE BLOCK

We could develop a filter based on word pairs, or even triples, rather than individual words. This should yield a much sharper estimate of the probability. For example, in our current database, the word "offers" has a probability of .96. If we based the probabilities on word pairs, we would end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less.
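Generating word-pair tokens is a small change to the tokenizing step. The helper below is a hypothetical Python sketch that turns a token list into adjacent pairs; the resulting strings could then be counted and scored in the same way as single words.

```python
def word_pairs(tokens):
    """Return adjacent word pairs, e.g. ["this approach", "approach offers"]."""
    return [first + " " + second for first, second in zip(tokens, tokens[1:])]
```

The trade-off is a much larger table: the number of distinct pairs grows far faster than the number of distinct words, so more training mail is needed before each pair has a usable count.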

Recognizing nonspam features may be more important than recognizing spam features. False positives are such a worry that they demand extraordinary measures. In future versions we will probably add a second level of testing designed specifically to avoid false positives. If a mail triggers this second level of filters, it will be accepted even if its spam probability is above the threshold.

We might also focus extra attention on specific parts of the email. For example, about 95% of current spam includes the URL of a site the spammers want you to visit. (The remaining 5% want you to call a phone number, reply by email or to a US mail address, or in a few cases to buy a certain stock.) The URL is in such cases practically enough by itself to determine whether the email is spam.

It might be a good idea to have a cooperatively maintained list of URLs promoted by spammers. One way to create such a list is to test dubious URLs by sending out a crawler to look at the site before the user looks at the email mentioning it. We could use a Bayesian filter to rate the site just as one would an email, and whatever was found on the site could be included in calculating the probability of the email being spam. A URL that led to a redirect would of course be especially suspicious.


REFERENCE

[1] P. Graham. (2002, August). A Plan for Spam [Online]. Available: http://www.paulgraham.com/spam.html

[2] W. P. Gajewski, "Adaptive Naïve Bayesian Anti-Spam Engine," in World Academy of Science, Engineering and Technology 7, 2005.

[3] D. Mertz. Spam filtering techniques [Online]. Available: http://www.ibm.com/developerworks/linux/library/l-spamf.html

[4] Wikipedia. Bayesian spam filtering [Online]. Available: http://en.wikipedia.org/wiki/Bayesian_spam_filtering