IMPLEMENTATION OF NAÏVE BAYESIAN SPAM FILTER ALGORITHM
Submitted To:
Mrs. Arti Gupta
BY:
ATUL SAURABH 07503872
ALOK NANDAN JHA 07503895
NITISH KUMAR SINHA 07503897
(B-11)
INDEX

CONTENT                  PAGE NUMBER
PROBLEM STATEMENT        3
INTRODUCTION             4
LITERATURE SURVEY        5
METHODOLOGY PROPOSED     9
EXPERIMENT AND RESULT    11
CONCLUSION               13
FUTURE BLOCK             14
REFERENCE                15
ACKNOWLEDGEMENT
A journey is easier when you travel together. Interdependence is certainly more valuable than
independence. This report is a result of intensive work and observation whereby we have been
accompanied and supported by many people. It is a pleasant aspect that we now have the
opportunity to express our gratitude to all of them.
We thank Arti Gupta Ma'am, our IRDM Project mentor, for providing us with her initial
stimulating ideas to start the project and gain an insight into it. Her support and
guidance were a constant source of inspiration towards the completion of this report.
In a nutshell, we can say that this project would have been stuck in wilderness without her
assistance that provided the stimulating discussion to work on the project.
PROBLEM STATEMENT
We implement Paul Graham's Naive Bayesian Spam Filter algorithm in C#.
It is suitable for incorporation into an ASP.NET blogging, forum, or email application.
The Achilles heel of the spammers is their message. They can circumvent any other barrier we
set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can
write software that recognizes their messages, there is no way they can get around that. In fact,
we have found that we can filter present-day spam acceptably well using nothing more than a
Bayesian combination of the spam probabilities of individual words.
INTRODUCTION
Earlier, a simple silent human-detection script was run behind the scenes to
ensure that a real person was sitting at a real keyboard and typing blog entries in by hand. Now
we see a new breed of spam showing up on Blogabond, and it is getting worse every day. Modern
email clients all use Bayesian spam filtering, so that is what we are going to implement.
Content-based filters are the way to stop spam.
Using a slightly tweaked Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false
positives. A Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem (from Bayesian statistics) with strong (naive) independence assumptions. Depending on
the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently
in a supervised learning setting.
LITERATURE SURVEY
SOURCE-
http://www.paulgraham.com/spam.html
Authors: Paul Graham
This source was the originator of the naïve Bayesian based spam filter. According to the paper, the
statistical approach is not usually the first one people try when they write spam filters. Most
hackers' first instinct is to try to write software that recognizes individual properties of spam.
One look at spam and one thinks: the gall of these guys to try sending me mail that begins
"Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points.
Filtering out that stuff will take a big bite out of incoming spam. But the paper goes on to discuss
AI techniques to automate this process, i.e. to train the system to automatically filter
spam based on a training set.
SOURCE-
Adaptive Naïve Bayesian Anti-Spam Engine
Authors: Wojciech P. Gajewski
World Academy of science, Engineering and Technology 7 2005
The paper first discusses the problem of spam, which has been seriously troubling the
Internet community during the last few years and has currently reached an alarming scale.
Observations made at CERN (European Organization for Nuclear Research, located in Geneva,
Switzerland) show that spam mails can constitute up to 75% of daily SMTP traffic. The paper
then presents a naïve Bayesian classifier based on a Bag of Words representation of an email as
a widely used means of stopping this unwanted flood, as it combines good performance with
simplicity of the training and classification processes. However, facing the constantly changing
patterns of spam, it is necessary to assure online adaptability of the classifier.
SOURCE-
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
The article discusses Bayes' theorem and how it can be used for spam filtering.
Bayes' theorem is used several times in the context of spam:
A first time, to compute the probability that the message is spam, knowing that a
given word appears in this message;
A second time, to compute the probability that the message is spam, taking into
consideration all of its words (or a relevant subset of them).
Computing the probability that a message containing a given word is spam
The formula used by the software to determine that is derived from Bayes' theorem:

    Pr(S|W) = Pr(W|S) * Pr(S) / (Pr(W|S) * Pr(S) + Pr(W|H) * Pr(H))

Where:
Pr(S|W) is the probability that a message is spam, knowing that the word is in it;
Pr(S) is the overall probability that any given message is spam;
Pr(W|S) is the probability that the word appears in spam messages;
Pr(H) is the overall probability that any given message is not spam (is "ham");
Pr(W|H) is the probability that the word appears in ham messages.
Most Bayesian spam detection software makes the assumption that there is no a priori reason for
any incoming message to be spam rather than ham, and considers both cases to have equal
probabilities of 50%: Pr(S) = Pr(H) = 0.5.
The filters that use this hypothesis are said to be "not biased", meaning that they have no
prejudice regarding the incoming email. This assumption allows us to simplify the general
formula to:

    Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))
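Under the equal-priors assumption, the per-word computation is a one-liner. A minimal sketch in Python (the report's actual implementation is in C#; the function name and sample numbers here are illustrative):

```python
def word_spam_probability(p_word_given_spam, p_word_given_ham):
    """Pr(S|W) under the 'unbiased' assumption Pr(S) = Pr(H) = 0.5.
    The equal priors cancel out of Bayes' theorem, leaving
        Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))."""
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)
```

For example, a word appearing in 60% of spam but only 20% of ham yields a spam probability of 0.75.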
Combining individual probabilities
The Bayesian spam filtering software makes the "naive" assumption that the words present in the
message are independent events. That is wrong in natural languages like English, where the
probability of finding an adjective, for example, is affected by the probability of having a noun.
With that assumption, one can derive another formula from Bayes' theorem:

    p = (p1 * p2 * ... * pN) / (p1 * p2 * ... * pN + (1 - p1) * (1 - p2) * ... * (1 - pN))

where:
p is the probability that the suspect message is spam;
p1 is the probability p(S | W1) that it is spam, knowing it contains a first word;
p2 is the probability p(S | W2) that it is spam, knowing it contains a second word;
...
pN is the probability p(S | WN) that it is spam, knowing it contains an Nth word.
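The combination rule above can be sketched directly in Python (an illustrative helper, not the report's C# code):

```python
from math import prod

def combine_probabilities(probs):
    """Naive combination of per-word spam probabilities p1..pN:
        p = (p1*...*pN) / ((p1*...*pN) + ((1-p1)*...*(1-pN)))"""
    p_all_spam = prod(probs)
    p_all_ham = prod(1.0 - p for p in probs)
    return p_all_spam / (p_all_spam + p_all_ham)
```

Note how evidence compounds: two neutral words (0.5, 0.5) combine to 0.5, while two spammy words (0.9, 0.9) combine to about 0.988.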
SOURCE-
http://www.ibm.com/developerworks/linux/library/l-spamf.html
Authors: David Mertz
In this article, the author describes ways that computer code can help eliminate unsolicited
commercial e-mail, viruses, Trojans, and worms, as well as frauds perpetrated electronically and
other undesired and troublesome e-mail. The problem with spam is that it tends to swamp
desirable e-mail. He discusses various spam filtering methods.
1. Basic structured text filters
The e-mail client we use has the capability to sort incoming e-mail based on simple strings found
in specific header fields, the header in general, and/or in the body. Its capability is very simple
and does not even include regular expression matching. Almost all e-mail clients have this much
filtering capability.
These few simple filters correctly catch about 80% of the spam. Unfortunately, they also have a
relatively high false positive rate -- enough that one needs to manually examine some of the
spam folders from time to time.
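A toy illustration of such a structured text filter (the strings and names here are hypothetical, not taken from any particular client):

```python
# Hypothetical filter strings; real clients let users configure their own.
SPAM_STRINGS = ["dear friend", "viagra", "!!!!!!!!"]

def matches_basic_filter(subject, body):
    """Flag a message if any known spam string occurs as a plain
    substring of the subject or body (no regular expressions)."""
    text = (subject + " " + body).lower()
    return any(s in text for s in SPAM_STRINGS)
```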
2. Whitelist/verification filters
A fairly aggressive technique for spam filtering is what we would call the "whitelist plus
automated verification" approach.
A whitelist filter passes mail only from explicitly approved senders on to the inbox. Other
messages generate a special challenge response to the sender. The whitelist filter's response
contains some kind of unique code that identifies the original message, such as a hash or
sequential ID. This challenge message contains instructions for the sender to reply in order to be
added to the whitelist (the response message must contain the code generated by the whitelist
filter).
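The challenge-response flow described above could be sketched as follows (the addresses, code length, and hashing scheme are all illustrative assumptions):

```python
import hashlib

whitelist = {"alice@example.com"}  # hypothetical pre-approved sender

def handle_incoming(sender, message_id):
    """Deliver mail from whitelisted senders; otherwise issue a
    challenge carrying a code unique to the original message."""
    if sender in whitelist:
        return ("deliver", None)
    code = hashlib.sha256((sender + message_id).encode()).hexdigest()[:12]
    return ("challenge", code)

def handle_challenge_reply(sender, message_id, code):
    """Whitelist the sender if the reply echoes the right code."""
    expected = hashlib.sha256((sender + message_id).encode()).hexdigest()[:12]
    if code == expected:
        whitelist.add(sender)
        return True
    return False
```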
3. Rule-based rankings
In this approach, a large number of patterns -- mostly regular expressions -- are evaluated
against a candidate message. Some matched patterns add to a message's score, while others
subtract from it. If a message's score exceeds a certain threshold, it is filtered as spam;
otherwise it is considered legitimate. But the rules need to be updated as the products and
scams promoted by spammers evolve.
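A toy version of rule-based ranking (the rules, weights, and threshold are invented for illustration; real systems such as SpamAssassin ship far larger rule sets):

```python
import re

# Invented rules and weights, for illustration only.
RULES = [
    (re.compile(r"free money", re.I), 2.5),
    (re.compile(r"!{3,}"), 1.0),
    (re.compile(r"100% guaranteed", re.I), 1.5),
    (re.compile(r"meeting agenda", re.I), -1.0),
]
THRESHOLD = 3.0

def rule_score(message):
    """Sum the weights of every matching pattern."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))

def is_spam(message):
    return rule_score(message) > THRESHOLD
```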
4. Bayesian word distribution filters
Paul Graham wrote a provocative essay in August 2002. The general idea is that some words
occur more frequently in known spam, and other words occur more frequently in legitimate
messages. Using well-known mathematics, it is possible to generate a "spam-indicative
probability" for each word.
Graham's idea has several noteworthy benefits:
1. It can generate a filter automatically from corpora of categorized messages rather than
requiring human effort in rule development.
2. It can be customized to individual users' characteristic spam and legitimate messages.
3. It can be implemented in a very small number of lines of code.
4. It works surprisingly well.
METHODOLOGY PROPOSED
We started with one corpus of spam and one of nonspam mail. At the moment each one
has about 4000 messages in it. We scan the entire text, including headers and embedded
HTML and JavaScript, of each message in each corpus. We currently consider alphanumeric
characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else
to be a token separator. (There is probably room for improvement here.) We ignore
tokens that are all digits, and we also ignore HTML comments, not even considering them
as token separators. We count the number of times each token (ignoring case, currently)
occurs in each corpus.
At this stage we end up with two large hash tables, one for each corpus, mapping tokens
to number of occurrences.
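The tokenization and counting steps above can be sketched in Python (the report's implementation is in C#; this is an illustrative translation):

```python
import re
from collections import Counter

# Alphanumerics, dashes, apostrophes and dollar signs are token
# characters; everything else separates tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9'$-]+")

def tokens(text):
    """Tokenize, folding case and dropping all-digit tokens,
    as described in the methodology above."""
    return [t.lower() for t in TOKEN_RE.findall(text) if not t.isdigit()]

def corpus_counts(messages):
    """Map each token to its number of occurrences across a corpus."""
    counts = Counter()
    for message in messages:
        counts.update(tokens(message))
    return counts
```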
Next we create a third hash table, this time mapping each token to the probability that an
email containing it is spam, which we calculate as follows:
(let ((g (* 2 (or (gethash word good) 0)))
(b (or (gethash word bad) 0)))
(unless (< (+ g b) 5)
(max .01
(min .99 (float (/ (min 1 (/ b nbad))
(+ (min 1 (/ g ngood))
(min 1 (/ b nbad)))))))))
where word is the token whose probability we're calculating, good and bad are the hash
tables we created in the first step, and ngood and nbad are the number of nonspam and
spam messages respectively. We want to bias the probabilities slightly to avoid false
positives, and we've found that a good way to do it is to double all the numbers in good.
This helps to distinguish between words that occasionally do occur in legitimate email
and words that almost never do. We only consider words that occur more than five times
in total.
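The Lisp expression above translates almost line for line into Python (good/bad are the per-corpus token counts and ngood/nbad the message counts, as in the text; this sketch is ours, not the report's C# code):

```python
def spam_probability(word, good, bad, ngood, nbad):
    """Per-token spam probability, following the Lisp above.

    Good-corpus counts are doubled to bias against false positives;
    tokens with fewer than 5 total (weighted) occurrences are skipped
    (None); results are clamped to the interval [.01, .99]."""
    g = 2 * good.get(word, 0)
    b = bad.get(word, 0)
    if g + b < 5:
        return None
    return max(0.01, min(0.99,
               min(1.0, b / nbad) /
               (min(1.0, g / ngood) + min(1.0, b / nbad))))
```

For example, a word seen once in 100 good messages and ten times in 100 spams scores 0.1 / (0.02 + 0.1), about 0.83.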
And then there is the question of what probability to assign to words that occur in one
corpus but not the other. Again we chose .01 and .99. There may be room for
tuning here, but as the corpus grows such tuning will happen automatically anyway.
We considered each corpus to be a single long stream of text for purposes of counting
occurrences. We use the number of emails in each, rather than their combined length, as
the divisor in calculating spam probabilities. This adds another slight bias to protect
against false positives.
When new mail arrives, it is scanned into tokens, and the most interesting fifteen tokens,
where interesting is measured by how far their spam probability is from a neutral .5, are
used to calculate the probability that the mail is spam. From these fifteen individual
probabilities, we calculate the combined probability.
One question that arises in practice is what probability to assign to a word you've never
seen, i.e. one that doesn't occur in the hash table of word probabilities. We used .4, the
number proposed by Graham. If you've never seen a word before, it is probably
fairly innocent; spam words tend to be all too familiar.
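Putting the last two steps together, a sketch of selecting the most interesting tokens and computing the final combined probability (illustrative Python, using the .4 unknown-word probability from above):

```python
UNKNOWN_PROB = 0.4  # probability assigned to never-before-seen words

def interesting_probs(message_tokens, word_probs, n=15):
    """The n probabilities farthest from the neutral .5."""
    probs = [word_probs.get(t, UNKNOWN_PROB) for t in set(message_tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return probs[:n]

def message_spam_probability(message_tokens, word_probs, n=15):
    """Combine the most interesting probabilities as described earlier."""
    p_spam, p_ham = 1.0, 1.0
    for p in interesting_probs(message_tokens, word_probs, n):
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)
```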
EXPERIMENT AND RESULT
INPUT PARAMETERS:
ALGO A:
public int GoodTokenWeight = 4;
public int MinTokenCount = 0;
public int MinCountForInclusion = 3;
public double MinScore = 0.015;
public double MaxScore = 0.99;
public double LikelySpamScore = 0.9998;
public double CertainSpamScore = 0.9999;
public int CertainSpamCount = 15;
public int InterestingWordCount = 17;
ALGO B:
public int GoodTokenWeight = 2;
public int MinTokenCount = 0;
public int MinCountForInclusion = 5;
public double MinScore = 0.011;
public double MaxScore = 0.99;
public double LikelySpamScore = 0.9998;
public double CertainSpamScore = 0.9999;
public int CertainSpamCount = 10;
public int InterestingWordCount = 15;
SCREEN SHOT:

PARAMETERS           | SPAM ASSASSIN                    | NAÏVE BAYESIAN                  | GMAIL SPAM FILTER
FALSE POSITIVE       | 12 out of 100                    | 5 out of 100                    | 1-2 out of 100
CORRECTLY CLASSIFIED | 88 out of 100                    | 95 out of 100                   | 98-99 out of 100
IMAGE PARAMETER      | Not considered                   | Not considered                  | Considered; solved using OCR
USED                 | Basic block of every spam filter | Used at user end of spam filter | Widely used at developer end of spam filter
CONCLUSION
The advantage of Bayesian spam filtering is that it can be trained on a per-user basis. The spam
that a user receives is often related to the online user's activities. The word probabilities are
unique to each user and can evolve over time with corrective training whenever the filter
incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is
often superior to pre-defined rules. It can also perform particularly well in avoiding false
positives, where legitimate email is incorrectly classified as spam.
However, there are a few disadvantages too. Bayesian spam filtering is susceptible to Bayesian
poisoning, a technique used by spammers in an attempt to degrade the effectiveness of spam
filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out
emails with large amounts of legitimate text (gathered from legitimate news or literary
sources). Spammer tactics include insertion of random innocuous words that are not normally
associated with spam, thereby decreasing the email's spam score and making it more likely to slip
past a Bayesian spam filter. Another technique used to try to defeat Bayesian spam filters is
to replace text with pictures, either directly included or linked.
A probably more efficient countermeasure has been proposed by Google and is used by
its Gmail email system: performing OCR (Optical Character Recognition) on every mid- to
large-size image and analyzing the text inside.
Even though Bayesian filtering is used widely to identify spam email, the technique can classify
(or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. There is
recent speculation that even the brain uses Bayesian methods to classify sensory stimuli and
decide on behavioral responses.
FUTURE BLOCK
We could develop a filter based on word pairs, or even triples, rather than individual
words. This should yield a much sharper estimate of the probability. For example, in our
current database, the word "offers" has a probability of .96. If you based the probabilities
on word pairs, you'd end up with "special offers" and "valuable offers" having
probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a
probability of .1 or less.
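A word-pair filter only needs a different tokenization step; the per-pair probabilities would then be computed exactly as for single words. A minimal sketch of the bigram extraction (illustrative):

```python
def bigrams(message_tokens):
    """Adjacent word pairs, e.g. 'special offers' vs. 'approach offers'."""
    return [message_tokens[i] + " " + message_tokens[i + 1]
            for i in range(len(message_tokens) - 1)]
```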
Recognizing nonspam features may be more important than recognizing spam features.
False positives are such a worry that they demand extraordinary measures. We will
probably in future versions add a second level of testing designed specifically to avoid
false positives. If a mail triggers this second level of filters it will be accepted even if its
spam probability is above the threshold.
We might also focus extra attention on specific parts of the email. For example, about
95% of current spam includes the url of a site they want you to visit. (The remaining 5%
want you to call a phone number, reply by email or to a US mail address, or in a few
cases to buy a certain stock.) The url is in such cases practically enough by itself to
determine whether the email is spam.
It might be a good idea to have a cooperatively maintained list of urls promoted by
spammers. A way to create such a list is to test dubious urls by sending out a crawler to
look at the site before the user reads the email mentioning it. We could use a
Bayesian filter to rate the site just as one would an email, and whatever was found on the
site could be included in calculating the probability of the email being spam. A url that
led to a redirect would of course be especially suspicious.
REFERENCE
[1] P. Graham. (2002, August). A Plan for Spam [Online]. Available: www.paulgraham.com/spam.html
[2] W. P. Gajewski, "Adaptive Naïve Bayesian Anti-Spam Engine," World Academy of Science,
Engineering and Technology, vol. 7, 2005.
[3] D. Mertz. Spam Filtering Techniques [Online].
Available: www.ibm.com/developerworks/linux/library/l-spamf.html
[4] Bayesian Spam Filtering [Online]. Available: en.wikipedia.org/wiki/Bayesian_spam_filtering