EMAIL SPAM DETECTION USING MACHINE LEARNING Lydia Song, Lauren Steimle, Xiaoxiao Xu


Page 1: Email Spam Detection using machine Learning

EMAIL SPAM DETECTION USING MACHINE LEARNING

Lydia Song, Lauren Steimle, Xiaoxiao Xu

Page 2: Email Spam Detection using machine Learning

Outline

  • Introduction to Project
  • Pre-processing
  • Dimensionality Reduction
  • Brief discussion of different algorithms
      • K-nearest
      • Decision tree
      • Logistic regression
      • Naïve-Bayes
  • Preliminary results
  • Conclusion

Page 3: Email Spam Detection using machine Learning

Spam Statistics

  • Percentage of spam emails in email traffic averaged 69.9% in February 2014

Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014

[Figure: chart; y-axis: Percentage of spam in email traffic]

Page 4: Email Spam Detection using machine Learning

Spam vs. Ham

Spam = Unwanted communication

Ham = Normal communication

Page 5: Email Spam Detection using machine Learning

Pre-processing

[Figures: Example of Spam Email; Corresponding File in Data Set]

Page 6: Email Spam Detection using machine Learning

Pre-processing

1. Remove meaningless words
2. Create a “bag of words” used in data set
3. Combine similar words
4. Create a feature matrix

[Figure: feature matrix; rows: Email 1, Email 2, ..., Email m; columns: the Bag of Words entries “history”, “last”, “service”]
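Steps 2 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function names `build_vocabulary` and `feature_matrix`, the two toy emails, and the choice of word counts (rather than 0/1 presence) as matrix entries are all assumptions.

```python
from collections import Counter


def build_vocabulary(emails):
    """Return the sorted list of distinct words across all emails (the bag of words)."""
    return sorted({word for email in emails for word in email})


def feature_matrix(emails, vocabulary):
    """Return one row per email and one column per vocabulary word, holding word counts."""
    rows = []
    for email in emails:
        counts = Counter(email)
        rows.append([counts[word] for word in vocabulary])
    return rows


# Hypothetical emails that have already been tokenized and filtered.
emails = [
    ["history", "last", "order", "last"],
    ["service", "order"],
]
vocabulary = build_vocabulary(emails)
matrix = feature_matrix(emails, vocabulary)
```

Each row of `matrix` corresponds to one email and each column to one word, which is the layout the figure describes.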

Page 7: Email Spam Detection using machine Learning

Pre-processing Example

Your history shows that your last order is ready for refilling.

Thank you,

Sam Mcfarland
Customer Services

tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']

filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']

bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']

[Figure: feature matrix; rows: Email 1, Email 2, ..., Email m; columns: the Bag of Words entries “histori”, “last”, “servi”]
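The example above can be reproduced with a short sketch of the first three pre-processing steps. The stop-word set and the suffix rules below are assumptions chosen so that this one email comes out as on the slide; they are a toy stand-in for a real stop-word list and a real stemmer, and the deck does not say which tools were used.

```python
import re

# Assumed stop-word list; real lists are much longer.
STOP_WORDS = {"your", "shows", "that", "is", "for", "you"}

# Hypothetical suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "er", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning for classification."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least four letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


email = (
    "Your history shows that your last order is ready for refilling.\n\n"
    "Thank you,\n\nSam Mcfarland\nCustomer Services"
)
tokens = tokenize(email)
filtered_words = remove_stop_words(tokens)
# dict.fromkeys removes duplicates while keeping first-seen order.
bag_of_words = list(dict.fromkeys(stem(word) for word in filtered_words))
```

A production pipeline would more likely rely on an established stemmer, which is why stems such as “histori” and “servi” appear in the figure.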

Page 8: Email Spam Detection using machine Learning

Dimensionality Growth

  • Add ~100-150 features for each additional email

[Figure: “Growth of Number of Features”; x-axis: Number of Emails Considered (50 to 300); y-axis: Number of Features (0 to 50000)]
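The growth curve can be measured by tracking how many distinct words have been seen after each email is added. The sketch below assumes emails are already lists of pre-processed words; the function name `vocabulary_growth` and the three toy emails are hypothetical.

```python
def vocabulary_growth(emails):
    """Return the cumulative number of distinct words after each email is added."""
    seen = set()
    sizes = []
    for email in emails:
        seen.update(email)
        sizes.append(len(seen))
    return sizes


# Hypothetical pre-processed emails.
emails = [
    ["history", "last", "order"],
    ["order", "service"],
    ["free", "offer", "last"],
]
growth = vocabulary_growth(emails)
```

Plotting `growth` against the email index gives a curve of the kind shown in the figure.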

Page 9: Email Spam Detection using machine Learning

Dimensionality Reduction

  • Add a requirement that words must appear in x% of all emails to be considered a feature

[Figure: “Growth of Features with Cutoff Requirement”; x-axis: Number of Emails Considered (50 to 300); y-axis: Number of Features (0 to 600); series: 5%, 10%, 15%, 20%]
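A cutoff of this kind amounts to a document-frequency filter. In the sketch below, the function name `select_features` is hypothetical, the cutoff is passed as a fraction, and "appear in x% of all emails" is read as "appear in at least x%", which is an assumption.

```python
from collections import Counter


def select_features(emails, cutoff):
    """Keep words that occur in at least `cutoff` (a fraction) of the emails."""
    doc_freq = Counter()
    for email in emails:
        doc_freq.update(set(email))  # count each word once per email
    min_docs = cutoff * len(emails)
    return sorted(word for word, n in doc_freq.items() if n >= min_docs)


# Hypothetical pre-processed emails.
emails = [
    ["free", "offer", "order"],
    ["order", "history"],
    ["free", "order"],
    ["service"],
]
common = select_features(emails, 0.5)
everything = select_features(emails, 0.25)
```

Raising the cutoff shrinks the feature set, which matches the ordering of the 5% to 20% series in the figure.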

Page 10: Email Spam Detection using machine Learning

Dimensionality Reduction: Hashing Trick

  • Before Hashing: 70x9403 Dimensions
  • After Hashing: 70x1024 Dimensions

[Figure: String → Integer → Hash Table Index]

Source: Jorge Stolfi, http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_1_LL.svg#filelinks
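The string → integer → index mapping can be sketched as follows. The bucket count of 1024 comes from the slide; the choice of MD5 as the hash function and the names `bucket_index` and `hash_features` are assumptions. MD5 is used here only because Python's built-in `hash` is salted per process for strings, so it would not give repeatable indices.

```python
import hashlib


def bucket_index(word, n_buckets=1024):
    """Map a string to an integer, then to a hash table index."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_features(tokens, n_buckets=1024):
    """Build a fixed-length count vector; colliding words share a bucket."""
    vector = [0] * n_buckets
    for token in tokens:
        vector[bucket_index(token, n_buckets)] += 1
    return vector


# Hypothetical pre-processed email.
tokens = ["history", "last", "order", "last"]
vector = hash_features(tokens)
```

The vector length stays at `n_buckets` no matter how many distinct words the data set contains, at the cost of occasional collisions.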

Page 11: Email Spam Detection using machine Learning

Outline

  • Introduction to Project
  • Pre-processing
  • Dimensionality Reduction
  • Brief discussion of different algorithms
      • K-nearest
      • Decision tree
      • Logistic regression
      • Naïve-Bayes
  • Preliminary results
  • Conclusion

Page 12: Email Spam Detection using machine Learning

K-Nearest Neighbors

  • Goal: Classify an unknown sample into one of C classes
  • Idea: To determine the label of an unknown sample (x), look at x’s k nearest neighbors

Image from MIT OpenCourseWare
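A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote among the k nearest training samples (the slide specifies neither). The function name `knn_predict` and the toy feature vectors are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Label x with the most common label among its k nearest training samples."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set (e.g. counts of two words).
train_x = [[3, 0], [2, 1], [0, 2], [0, 3], [1, 2]]
train_y = ["spam", "spam", "ham", "ham", "ham"]

label_a = knn_predict(train_x, train_y, [3, 1])
label_b = knn_predict(train_x, train_y, [0, 2])
```

No model is fitted in advance; all the work happens at prediction time, when distances to every training sample are computed.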

Page 13: Email Spam Detection using machine Learning

Decision Tree

  • Convert training data into a tree structure
  • Root node: the first decision node
  • Decision node: if–then decision based on features of training sample
  • Leaf node: contains a class label

Image from MIT OpenCourseWare
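The node types above can be sketched as a small data structure. This illustrates only how a finished tree classifies a sample; the tree here is hand-built and hypothetical, and learning the tree from training data is not shown. The names `Leaf`, `DecisionNode`, and `classify` are assumptions.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class label stored at the leaf


@dataclass
class DecisionNode:
    feature: str    # word whose count is tested
    threshold: int  # go left if count <= threshold, otherwise right
    left: "Node"
    right: "Node"


Node = Union[Leaf, DecisionNode]


def classify(node, counts):
    """Walk from the root through if-then decisions until a leaf is reached."""
    while isinstance(node, DecisionNode):
        value = counts.get(node.feature, 0)
        node = node.left if value <= node.threshold else node.right
    return node.label


# Hypothetical hand-built tree: the root node tests the word "free".
tree = DecisionNode(
    feature="free",
    threshold=0,
    left=Leaf("ham"),
    right=DecisionNode("history", 0, left=Leaf("spam"), right=Leaf("ham")),
)
```

For instance, `classify(tree, {"free": 2})` follows the right branch at the root and then the left branch at the second decision node.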

Page 14: Email Spam Detection using machine Learning

Logistic Regression

  • “Regression” over training examples
  • Transform continuous y to prediction of 1 or 0 using the standard logistic function
  • Predict spam if
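The decision rule itself is not preserved in this transcript. As a sketch under a common convention (an assumption, not necessarily the presenters' rule), the linear score is passed through the standard logistic function and the email is labeled spam (1) when the result is at least 0.5. The weights, bias, and function names below are hypothetical.

```python
import math


def sigmoid(z):
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Squash the linear score bias + w.x into a value between 0 and 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def predict(weights, bias, x, threshold=0.5):
    """Return 1 (spam) if the logistic output reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


# Hypothetical weights; in practice these are fitted to the training examples.
weights = [2.0, -1.5]
bias = -0.5

spam_like = predict(weights, bias, [2, 0])
ham_like = predict(weights, bias, [0, 2])
```

Moving `threshold` away from 0.5 trades recall against precision.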

Page 15: Email Spam Detection using machine Learning

Naïve Bayes

  • Use Bayes' Theorem:
  • Hypothesis (H): spam or not spam
  • Event (e): word occurs
  • For example, the probability an email is spam when the word “free” is in the email
  • “Naïve”: assume the feature values are independent of each other (given the class)
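Bayes' theorem gives P(H | e) = P(e | H) P(H) / P(e). The sketch below applies it to the “free” example and, under the independence assumption, to several words at once by multiplying per-word likelihoods. All probabilities are made-up illustrative numbers, and the function name `posterior_spam` is hypothetical.

```python
def posterior_spam(words, p_given_spam, p_given_ham, p_spam):
    """Return P(spam | words), treating words as independent given the class."""
    like_spam = p_spam
    like_ham = 1.0 - p_spam
    for word in words:
        like_spam *= p_given_spam[word]
        like_ham *= p_given_ham[word]
    # The denominator is P(words), summed over both hypotheses.
    return like_spam / (like_spam + like_ham)


# Hypothetical per-word probabilities, e.g. P("free" occurs | spam) = 0.6.
p_given_spam = {"free": 0.6, "order": 0.2}
p_given_ham = {"free": 0.1, "order": 0.4}

p_free = posterior_spam(["free"], p_given_spam, p_given_ham, p_spam=0.5)
p_both = posterior_spam(["free", "order"], p_given_spam, p_given_ham, p_spam=0.5)
```

With many words the products become very small, so practical implementations usually sum log-probabilities instead of multiplying.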

Page 16: Email Spam Detection using machine Learning

Outline

  • Introduction to Project
  • Pre-processing
  • Dimensionality Reduction
  • Brief discussion of different algorithms
      • K-nearest
      • Decision tree
      • Logistic regression
      • Naïve-Bayes
  • Preliminary results
  • Conclusion

Page 17: Email Spam Detection using machine Learning

Preliminary Results

  • 250 emails in training set, 50 in testing set
  • Use 15% as the “percentage of emails” cutoff
  • Performance measures:
      • Accuracy: % of predictions that were correct
      • Recall: % of spam emails that were predicted correctly
      • Precision: % of emails classified as spam that were actually spam
      • F-Score: weighted harmonic mean of precision and recall
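The four measures can be computed from true and predicted labels as sketched below, with spam encoded as 1 and ham as 0. The F-score here is the balanced F1 variant (equal weight on precision and recall), which is an assumption since the slide does not state the weighting. The function name `scores` and the label vectors are hypothetical.

```python
def scores(y_true, y_pred):
    """Return accuracy, recall, precision, and F1 for binary labels (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_score": f_score}


# Hypothetical labels for eight test emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
result = scores(y_true, y_pred)
```

Accuracy alone can mislead when spam and ham are unbalanced, which is one reason to report recall and precision alongside it.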

Page 18: Email Spam Detection using machine Learning

“Percentage of Emails” Performance

[Figure panels: Linear Regression; Logistic Regression]

Page 19: Email Spam Detection using machine Learning

Preliminary Results

Page 20: Email Spam Detection using machine Learning

Next Steps

  • Implement SVM: Matlab vs. Weka
  • Hashing trick: try different numbers of buckets
  • Regularization

Page 21: Email Spam Detection using machine Learning

Thank you! Any questions?