Machine Learning Basics with Applications to Email Spam Detection UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG UNDER XIAOXIAO XU AND ARYE NEHORAI




General background information about the process of machine learning


The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data

⦿ Classifier Models

● Evaluation of classifiers


Motivation of this project

⦿ Spam email is a nuisance for every personal email account

● 60% of emails sent in January 2004 were spam

● Fraud & Phishing

⦿ Spam vs. Ham email


Our Goal


Spam Email example


Ham Email example


The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data

⦿ Classifier Models

● Evaluation of classifiers


Pre-processing of data

⦿ Convert capital letters to lowercase 

⦿ Remove numbers and extra white space

⦿ Remove punctuation

⦿ Remove stop-words

⦿ Delete terms with length greater than 20. 
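
The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to replace punctuation with spaces are all assumptions made for the example. The length rule follows the "greater than 20" wording above, so a term of exactly 20 characters is kept.

```python
import re
import string

# Illustrative stop-word list (assumption: the project's real list is not shown in the slides).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for", "you", "your"}
MAX_TERM_LENGTH = 20

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stop_words=STOP_WORDS, max_len=MAX_TERM_LENGTH):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                      # convert capital letters to lowercase
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)   # remove punctuation
    tokens = text.split()                    # split() also collapses extra white space
    # drop stop-words and terms longer than max_len characters
    return [t for t in tokens if t not in stop_words and len(t) <= max_len]


if __name__ == "__main__":
    print(preprocess("WIN $1000 NOW!!! Click   the link to claim your prize"))
```

Each email then becomes a list of terms, which is the input assumed by the term-extraction step that follows.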


Pre-processing of data

⦿Original Email


Pre-processing of data

⦿After pre-processing


Pre-processing of data

⦿Extract Terms


Pre-processing of data

⦿ Reduce Terms

● Keep word length < 20


The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data

⦿ Classifier Models

● Evaluation of classifiers


Different classification methods

⦿ K Nearest Neighbor (KNN)

⦿ Naive Bayes Classifier

⦿ Logistic Regression

⦿ Decision Tree Analysis


What is K Nearest Neighbor

⦿ Use the k "closest" samples (nearest neighbors) to perform classification
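
As a rough illustration of the idea, the sketch below classifies a query vector by majority vote among its k nearest training vectors. It assumes each email has already been turned into a numeric term-count vector and uses Euclidean distance; the slides do not say which distance or which value of k the project used, so `euclidean`, `knn_predict`, and `k=3` are hypothetical choices.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k closest training samples."""
    # Pair every training sample's distance with its label, nearest first.
    distances = sorted(
        (euclidean(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy term-count vectors (made up for the example).
    vectors = [[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]]
    labels = ["spam", "spam", "ham", "ham"]
    print(knn_predict(vectors, labels, [2, 0, 1], k=3))
```

Because every prediction compares the query against all training vectors, the cost grows with both the number of emails and the length of the term list.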


What is K Nearest Neighbor


Initial outcome and strategies for improvement

⦿ KNN accuracy was ~64% - very low

⦿ KNN classifier does not fit our project 

⦿ Term-list is still too large 

⦿ Try a different classification method and see whether its evaluation results are better than the KNN results

⦿ Continue to reduce size of term list by removing terms that are not meaningful


Steps for improvement

⦿ Removed sparse terms

⦿ Reduced length threshold

⦿ Created hashtable

⦿ Used alternative classifier

● Naive Bayes Classifier


Hashtable

⦿ Calculate a hash key for each term in the term list.

⦿ When a collision occurs, resolve it with separate chaining.
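
A minimal sketch of a hash table with separate chaining is shown below. The class name `ChainedHashTable`, the polynomial string hash, and the bucket count are assumptions for illustration; the slides do not specify the hash function or table size the project used. Each bucket is a list (the "chain"), so terms whose keys collide simply share a bucket.

```python
class ChainedHashTable:
    """Term -> count table that resolves collisions by separate chaining."""

    def __init__(self, num_buckets=64):
        # One chain (list of [term, count] entries) per bucket.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, term):
        # Simple polynomial string hash (assumption: not the project's actual function).
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def add(self, term, count=1):
        """Increase the stored count for `term`, inserting it if absent."""
        bucket = self.buckets[self._index(term)]
        for entry in bucket:
            if entry[0] == term:
                entry[1] += count
                return
        bucket.append([term, count])  # new term (or collision): extend the chain

    def get(self, term, default=0):
        """Return the stored count for `term`, or `default` if it is absent."""
        for key, value in self.buckets[self._index(term)]:
            if key == term:
                return value
        return default


if __name__ == "__main__":
    table = ChainedHashTable(num_buckets=1)  # a single bucket forces collisions
    for term in ["free", "money", "free"]:
        table.add(term)
    print(table.get("free"), table.get("money"), table.get("meeting"))
```

With a reasonable number of buckets, looking up a term touches only one short chain instead of scanning the whole term list.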


Naive Bayes classifier
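
The slide does not show the formulation the project used, so the following is only a sketch of one common variant: multinomial Naive Bayes with add-one (Laplace) smoothing, scored in log space. The function names `train_naive_bayes` and `predict_naive_bayes` and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter(labels)
    term_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(documents, labels):
        term_counts[label].update(tokens)
    vocabulary = {t for counts in term_counts.values() for t in counts}
    return class_counts, term_counts, vocabulary


def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior for `tokens`."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocabulary:
                continue  # ignore terms never seen in training
            # add-one smoothing avoids log(0) for terms unseen in this class
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    docs = [["win", "money", "now"], ["free", "money"],
            ["meeting", "tomorrow"], ["lunch", "meeting", "now"]]
    labels = ["spam", "spam", "ham", "ham"]
    model = train_naive_bayes(docs, labels)
    print(predict_naive_bayes(model, ["free", "money"]))
```

The "naive" part is the assumption that terms occur independently given the class, which is what lets the score be a simple sum of per-term log probabilities.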


Secondary Results

⦿ Accuracy increased from 62% to 82.36%


Suggestions for further improvement

⦿ Revise pre-processing

⦿ Apply additional classifiers


Thank you

⦿ Questions?