
Machine Learning Basics with Applications to Email Spam Detection

Brittany Edwards, Haoyu Li, and Wei Zhang under Xiaoxiao Xu and Dr. Nehorai

Department of Electrical and Systems Engineering

Introduction
We use machine learning and various classification techniques to create an algorithm that can distinguish spam from ham (legitimate) emails. Spam detection requires several pre-processing steps before the data can be accurately classified. The methods explored were k nearest neighbor, naive Bayes classification, logistic regression, and decision tree classification. After classification, the best method was chosen by evaluating each classifier's precision, recall, and F1 score, and by examining its ROC curve.

Methods
Initial pre-processing removed stop-words using a package in R, the programming language used for this project. The words were then converted to lowercase, punctuation was removed, and words longer than 20 letters were discarded. A hashtable was then built to map similar words to a single general form of the word. After pre-processing, the different classification methods were applied and the results were analyzed.
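Below is a minimal sketch of how this pre-processing pipeline could look in R with the tm package. The poster only says that an R package was used, so the package choice, the variable name email_texts, and the use of stemming as the word-normalization step are assumptions made here for illustration.

library(tm)
library(SnowballC)  # used by tm's stemDocument

# email_texts: hypothetical character vector of raw email bodies
corpus <- VCorpus(VectorSource(email_texts))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stop-words

# Drop words longer than 20 letters
drop_long_words <- content_transformer(function(x) {
  words <- unlist(strsplit(x, "\\s+"))
  paste(words[nchar(words) <= 20], collapse = " ")
})
corpus <- tm_map(corpus, drop_long_words)

# Map similar words to one general form (stemming stands in here for the
# hashtable described above)
corpus <- tm_map(corpus, stemDocument)

# Term-document representation for the classifiers
dtm <- DocumentTermMatrix(corpus)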

Abstract
We use machine learning to create a spam detection algorithm that distinguishes spam from ham emails. Various classification methods were applied and their results were analyzed to find the best outcome.

Summary
The initial results were not strong enough to produce a successful outcome, so further steps were taken to improve the pipeline and increase the chance of building a correct algorithm.

Conclusions
After the preliminary results and the subsequent revisions, the final results yielded a 32.38% increase in accuracy. More pre-processing is needed for the most accurate outcomes and therefore the best possible algorithm. The initial classifier, k nearest neighbor, did not yield the best results, and there are many possible contributing factors. The naive Bayes classifier was much more accurate and led to better results than k nearest neighbor did. The other two methods, logistic regression and decision tree classification, did not work correctly with our model; after further pre-processing revisions, we hope these methods can be used. Progress was made, but more is expected from future efforts.

Results
Using k nearest neighbor with k = 3, the accuracy was ~64%, so the initial conclusion was that this classifier does not fit the model well. The term list was also still too large, so more pre-processing is necessary to remove meaningless terms. Using the naive Bayes classifier, the results yielded a higher accuracy of ~82%. These secondary results were much better than the first, but still not sufficient for an accurate algorithm. They show that progress was made through the revisions in pre-processing and the new classification method.
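A minimal sketch of how the two classifiers and the evaluation metrics reported above might be run in R follows. The poster does not name the classification packages, so class::knn, e1071::naiveBayes, the 70/30 split, and the variable names dtm and labels are assumptions for illustration.

library(class)   # knn
library(e1071)   # naiveBayes

# dtm: document-term matrix from pre-processing; labels: factor of "spam"/"ham"
x <- as.matrix(dtm)
set.seed(1)
train_idx <- sample(seq_len(nrow(x)), size = floor(0.7 * nrow(x)))

# k nearest neighbor with k = 3, as in the first experiment
knn_pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
                cl = labels[train_idx], k = 3)

# naive Bayes on the same split
nb_model <- naiveBayes(x[train_idx, ], labels[train_idx])
nb_pred  <- predict(nb_model, x[-train_idx, ])

# Accuracy, precision, recall, and F1 from the confusion counts
evaluate <- function(pred, truth, positive = "spam") {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = mean(pred == truth),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}
evaluate(knn_pred, labels[-train_idx])
evaluate(nb_pred,  labels[-train_idx])

An ROC curve for the naive Bayes class probabilities could then be examined with a package such as pROC.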

[Poster figures: a spam email example, a ham email example, and two naive Bayes classifier panels]