
Machine Learning Basics with Applications to Email Spam

Detection

UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG UNDER XIAOXIAO XU AND ARYE NEHORAI 

General background information about the process of machine learning

The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data

⦿ Classifier Models
● Evaluation of classifiers

Motivation of this project

⦿ Spam email annoys every personal email account
● 60% of January 2004 emails were spam
● Fraud & phishing

⦿ Spam vs. Ham email

Our Goal

Spam Email example

Ham Email example

The process of email detection

⦿ Motivation of this project
⦿ Pre-processing of data

⦿ Classifier Models
● Evaluation of classifiers

Pre-processing of data

⦿ Convert capital letters to lowercase 

⦿ Remove numbers and extra white space

⦿ Remove punctuations 

⦿ Remove stop-words

⦿ Delete terms with length greater than 20. 
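The pre-processing steps above can be sketched in Python (an illustrative sketch, not the project's actual code; the stop-word list here is a small stand-in for the real one):

```python
import re
import string

# A small illustrative stop-word list; the project's actual list is not shown.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for"}

def preprocess(text):
    # Convert capital letters to lowercase
    text = text.lower()
    # Remove numbers
    text = re.sub(r"\d+", " ", text)
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse extra white space and split into terms
    terms = text.split()
    # Remove stop-words and delete terms with length greater than 20
    return [t for t in terms if t not in STOP_WORDS and len(t) <= 20]
```

For example, `preprocess("Win $1000 NOW! Click the link")` yields `['win', 'now', 'click', 'link']`.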

Pre-processing of data

⦿ Original Email

Pre-processing of data

⦿ After pre-processing

Pre-processing of data

⦿ Extract Terms

Pre-processing of data

⦿ Reduce Terms
● Keep word length < 20

The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers

Different classification methods

⦿ K Nearest Neighbor (KNN)

⦿ Naive Bayes Classifier

⦿ Logistic Regression

⦿ Decision Tree Analysis

What is K Nearest Neighbor

⦿ Use the k "closest" samples (nearest neighbors) to perform classification
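The idea can be sketched in a few lines of Python (a minimal illustration of majority voting over the k nearest samples, not the project's implementation; the term-count vectors are made up):

```python
import math
from collections import Counter

def knn_classify(query, samples, k=3):
    """Label a query vector by majority vote of its k nearest samples.

    samples: list of (feature_vector, label) pairs, where feature vectors
    are equal-length lists of term counts (values here are illustrative).
    """
    # Euclidean distance from the query to every training sample
    dists = [(math.dist(query, vec), label) for vec, label in samples]
    # Take the k closest samples and vote on their labels
    dists.sort(key=lambda d: d[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

samples = [([5, 0, 1], "spam"), ([4, 1, 0], "spam"),
           ([0, 3, 4], "ham"), ([1, 4, 3], "ham")]
print(knn_classify([4, 0, 2], samples))  # → spam
```

With term-count features the choice of distance metric and of k both matter; Euclidean distance and k = 3 are just one reasonable default here.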

What is K Nearest Neighbor

Initial outcome and strategies for improvement

⦿ KNN accuracy was ~64%, which is very low

⦿ KNN classifier does not fit our project 

⦿ Term-list is still too large 

⦿ Try a different classification method and see if the evaluation results are better than the KNN results

⦿ Continue to reduce size of term list by removing terms that are not meaningful

Steps for improvement

⦿ Remove sparsity
⦿ Reduce the length threshold
⦿ Create a hashtable
⦿ Use an alternative classifier
● Naive Bayes classifier

⦿ Calculate a hash key for each term in the term-list
⦿ When a collision occurs, use separate chaining

Hashtable

Naive Bayes classifier
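A multinomial Naive Bayes classifier with add-one (Laplace) smoothing can be sketched as follows (an illustrative sketch under standard assumptions, not the project's code; the example documents are made up):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (term_list, label) pairs. Returns class priors and
    smoothed per-class term log-probabilities (multinomial model)."""
    class_docs = Counter(label for _, label in docs)
    term_counts = {c: Counter() for c in class_docs}
    vocab = set()
    for terms, label in docs:
        term_counts[label].update(terms)
        vocab.update(terms)
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        model[c] = {
            "prior": math.log(class_docs[c] / len(docs)),
            # Add-one smoothing so unseen terms never get probability zero
            "logp": {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                     for t in vocab},
            "unseen": math.log(1 / (total + len(vocab))),
        }
    return model

def classify_nb(model, terms):
    # Pick the class maximizing log prior + sum of term log-likelihoods
    def score(c):
        m = model[c]
        return m["prior"] + sum(m["logp"].get(t, m["unseen"]) for t in terms)
    return max(model, key=score)

docs = [(["free", "money", "now"], "spam"), (["free", "offer"], "spam"),
        (["meeting", "notes"], "ham"), (["project", "meeting"], "ham")]
print(classify_nb(train_nb(docs), ["free", "money"]))  # → spam
```

Working in log space avoids numerical underflow when many term probabilities are multiplied together.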

Secondary Results

⦿ Classification accuracy increases from 62% to 82.36%

Suggestions for further improvement

⦿ Revise pre-processing
⦿ Apply additional classifiers

Thank you

⦿ Questions?
