SMS Spam Filter Design Using R: A Machine Learning Approach


Reza Rahimi, Ph.D. Candidate,

School of Information and Computer Science, University of California, Irvine.

Introduction

• In basic terms, Machine Learning (ML) is about the construction of systems that can learn from data.
• It is used as a tool for knowledge discovery.
• Several important classes of problems can be solved using machine learning techniques:
  – Classification (Prediction):
    • Given a collection of records as a training set.
    • Each record contains a set of attributes, one of which is called the class.
    • The problem is to find a model for the class attribute as a function of the other attributes.
    – Example: Spam or Ham, Handwriting Recognition,…
  – Clustering (Description):
    • Given a set of data points with some attributes, and a similarity measure (metric) among them.
    • The goal is to find clusters such that data points in one cluster are more similar to one another than to points in other clusters.
    – Example: Document Clustering, people categorization,…
  – Association (Description):
    • Given a set of records, each of which contains some items from a given collection.
    • The goal is to produce dependency rules that predict the occurrence of an item based on occurrences of other items.
    – Example: user habit pattern recognition,…
  – Regression (Prediction):
    • Predict the value of a given continuous variable based on the values of other variables.
    • The model of dependency can be linear or nonlinear.
    – Example: Stock prediction
• ML is a very mature and developed area.
• For each of the problem classes mentioned above, it offers a rich set of tools, techniques, and algorithms.
• These tools are available in different languages and frameworks such as R, Matlab, Java, C++, Mahout,…
• The following procedure can be considered the general methodology for problem solving in this framework:

Problem Solving Using Machine Learning Framework

• Get a sense of data: Feature extraction, dimension reduction, noise cancellation,…
• Problem modeling: Classification, Clustering, Association, Regression,…
• Run standard ML algorithms: check the errors according to the standard ML metrics.
• Select the methods that satisfy your performance criteria and metrics.
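The four steps above can be sketched end to end on a toy problem. This is only an illustration and is not part of the original slides: it assumes R's built-in iris data (two of its three species) and uses base-R logistic regression as a stand-in for "a standard ML algorithm". The acceptance threshold is a hypothetical value.

```r
# Illustrative only: the data set, the model, and the threshold are assumptions.
set.seed(1)

# Step 1: get a sense of the data (two classes, two features).
d <- droplevels(subset(iris, Species != "setosa",
                       select = c(Sepal.Length, Sepal.Width, Species)))
summary(d)

# Step 2: model the problem as binary classification.
d$is_virginica <- as.integer(d$Species == "virginica")

# Step 3: run a standard algorithm on a training split and measure the test error.
train_idx <- sample(nrow(d), size = round(0.7 * nrow(d)))
fit <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
           data = d[train_idx, ], family = binomial)
prob <- predict(fit, newdata = d[-train_idx, ], type = "response")
pred <- as.integer(prob > 0.5)
confusion <- table(pred = factor(pred, levels = 0:1),
                   true = factor(d$is_virginica[-train_idx], levels = 0:1))
test_error <- 1 - sum(diag(confusion)) / sum(confusion)

# Step 4: keep the method only if it meets the chosen criterion.
max_error <- 0.40   # hypothetical acceptance threshold
meets_criterion <- test_error <= max_error
print(confusion)
print(test_error)
```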

• In the next section I will describe the design of an SMS spam filter in the R language, based on this methodology.

SMS Spam Filter using R

# This file contains SMS spam filter code with different classifiers in the R language
# Written by: Reza Rahimi
# Initialization: Raw Data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
# loading required packages, libraries and function declaration

# required package for text mining
if(!require("tm"))
  install.packages("tm")

# required package for SVM
if(!require("e1071"))
  install.packages("e1071")

# required package for KNN
if(!require("RWeka"))
  install.packages("RWeka", dependencies = TRUE)

# required package for Adaboost
if(!require("ada"))
  install.packages("ada")

library("tm")
library("e1071")
library("RWeka")
library("ada")

R Codes (Cont.)

# Initialize random generator
set.seed(1245)

# This function builds a vector (Vector Space Model) from a text message
# by counting how often each highly repeated word occurs in it
vsm <- function(message, highlyrepeatedwords){

  tokenizedmessage <- strsplit(message, "\\s+")[[1]]

  # making vector
  v <- rep(0, length(highlyrepeatedwords))
  # seq_along() avoids looping over 1:0 when a vector is empty
  for(i in seq_along(highlyrepeatedwords)){
    for(j in seq_along(tokenizedmessage)){
      if(highlyrepeatedwords[i] == tokenizedmessage[j]){
        v[i] <- v[i] + 1
      }
    }
  }
  return(v)
}

# Loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
print("Loading SMS Spams and Hams!")
# quote = "" keeps quotation marks inside messages from being treated as field delimiters
smstable <- read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", quote = "",
                     colClasses = c("type"="character", "sms"="character"))
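To make the vector space model concrete, here is a small sketch of the same counting idea written without explicit loops, followed by a usage example. The function name vsm_vectorized and the three-word vocabulary are hypothetical and chosen only for illustration.

```r
# Same counting idea as vsm(), expressed with vapply() instead of nested loops.
vsm_vectorized <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  # For each vocabulary word, count how many tokens are equal to it.
  vapply(vocabulary, function(w) sum(tokens == w), numeric(1), USE.NAMES = FALSE)
}

vocabulary <- c("free", "call", "txt")   # hypothetical frequent words
counts <- vsm_vectorized("free call now free", vocabulary)
print(counts)   # one count per vocabulary word, in vocabulary order
```

Each message becomes a fixed-length numeric vector, which is the form the classifiers later in the script expect. Words outside the vocabulary (here "now") are simply ignored.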

R Codes (Cont.)

smstabletmp <- smstable

print("Extracting Ham and Spam Basic Statistics!")

# Encode ham as 1 and spam as 0
smstabletmp$type[smstabletmp$type=="ham"] <- 1
smstabletmp$type[smstabletmp$type=="spam"] <- 0

# Convert character data into numeric
tmp <- as.numeric(smstabletmp$type)

# Basic statistics like mean and variance of spam and hams
# (with the 1/0 encoding, the mean is the fraction of ham messages)
hamavg <- mean(tmp)
print("Average Ham is :"); hamavg

hamvariance <- var(tmp)
print("Var of Ham is :"); hamvariance

print("Extract average token of Hams and Spams!")

nohamtokens <- 0
noham <- 0
nospamtokens <- 0
nospam <- 0

R Codes (Cont.)

# Accumulate message counts and token counts separately for ham and spam
for(i in seq_along(smstable$type)){
  if(smstable[i,1]=="ham"){
    nohamtokens <- length(strsplit(smstable[i,2], "\\s+")[[1]]) + nohamtokens
    noham <- noham + 1
  }else{
    nospamtokens <- length(strsplit(smstable[i,2], "\\s+")[[1]]) + nospamtokens
    nospam <- nospam + 1
  }
}

totaltokens <- nospamtokens + nohamtokens
print("total number of tokens is:")
print(totaltokens)

avgtokenperham <- nohamtokens/noham
print("Average number of tokens per ham message")
print(avgtokenperham)

avgtokenperspam <- nospamtokens/nospam
print("Average number of tokens per spam message")
print(avgtokenperspam)

print("Make two different sets, training data and test data!")
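The per-class token statistics computed by the loop above can also be obtained in a few vectorized calls. The sketch below uses a three-row toy table (sms_toy, with made-up messages) in place of smstable, so it runs without the data file.

```r
# Toy stand-in for smstable; the messages are invented for illustration.
sms_toy <- data.frame(
  type = c("ham", "spam", "ham"),
  sms  = c("see you at noon", "win a free prize now", "ok"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message.
tokens_per_message <- lengths(strsplit(sms_toy$sms, "\\s+"))

# Mean token count per class, and the overall total.
avg_tokens_by_type <- tapply(tokens_per_message, sms_toy$type, mean)
total_tokens <- sum(tokens_per_message)

print(avg_tokens_by_type)
print(total_tokens)
```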

R Codes (Cont.)

# select the percent of data that you want to use as training set
trdatapercent <- 0.3

# training data set
trdata <- NULL

# test data set
tedata <- NULL

# Each message goes to the training set with probability trdatapercent
for(i in seq_along(smstable$type)){
  if(runif(1) < trdatapercent){
    trdata <- rbind(trdata, c(smstable[i,1], tolower(smstable[i,2])))
  }
  else{
    tedata <- rbind(tedata, c(smstable[i,1], tolower(smstable[i,2])))
  }
}

print("Training data size is!")
dim(trdata)

print("Test data size is!")
dim(tedata)
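The same random split can be sketched without a loop by drawing one uniform number per row. The toy table sms_toy below is a hypothetical stand-in for smstable.

```r
set.seed(1245)

# Toy stand-in for smstable (hypothetical content).
sms_toy <- data.frame(
  type = rep(c("ham", "spam"), times = c(8, 2)),
  sms  = paste("Message", 1:10),
  stringsAsFactors = FALSE
)

trdatapercent <- 0.3

# One uniform draw per row decides whether the row goes to the training set.
in_training <- runif(nrow(sms_toy)) < trdatapercent

trdata_toy <- cbind(sms_toy$type[in_training],  tolower(sms_toy$sms[in_training]))
tedata_toy <- cbind(sms_toy$type[!in_training], tolower(sms_toy$sms[!in_training]))

print(dim(trdata_toy))
print(dim(tedata_toy))
```

Because membership is decided row by row, the training set holds roughly, not exactly, 30% of the messages. If an exact size is needed, sample() with a fixed size is the usual alternative.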

R Codes (Cont.)

# Text feature extraction using tm package

trsmses <- Corpus(VectorSource(trdata[,2]))
trsmses <- tm_map(trsmses, stripWhitespace)
# recent versions of tm require base functions to be wrapped in content_transformer()
trsmses <- tm_map(trsmses, content_transformer(tolower))
trsmses <- tm_map(trsmses, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(trsmses)

# terms that occur at least 80 times in the training corpus
highlyrepeatedwords <- findFreqTerms(dtm, 80)

# These highly used words are used as an index to make VSM
# (vector space model) for training data and test data

# vectorized training data set
vtrdata <- NULL

# vectorized test data set
vtedata <- NULL
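To show what "highly repeated words" means, the following base-R sketch counts tokens across a few made-up messages and keeps those that reach a threshold. It is a simplified stand-in: tm applies its own tokenization and defaults, so its result on real data can differ. The messages and the threshold min_count are hypothetical (the script above uses 80).

```r
# Hypothetical toy corpus.
messages <- c("free prize call now", "call me later", "free free entry call")
min_count <- 3   # hypothetical threshold for this toy corpus

# Lower-case, split on whitespace, and count every token across all messages.
tokens <- unlist(strsplit(tolower(messages), "\\s+"))
term_counts <- table(tokens)

# Keep the terms whose total count reaches the threshold.
frequent_terms <- names(term_counts)[as.vector(term_counts) >= min_count]
print(frequent_terms)
```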

R Codes (Cont.)

# Each row is the class label (1 = ham, 0 = spam) followed by the VSM vector
for(i in seq_along(trdata[,2])){
  if(trdata[i,1]=="ham"){
    vtrdata <- rbind(vtrdata, c(1, vsm(trdata[i,2], highlyrepeatedwords)))
  }
  else{
    vtrdata <- rbind(vtrdata, c(0, vsm(trdata[i,2], highlyrepeatedwords)))
  }
}

for(i in seq_along(tedata[,2])){
  if(tedata[i,1]=="ham"){
    vtedata <- rbind(vtedata, c(1, vsm(tedata[i,2], highlyrepeatedwords)))
  }
  else{
    vtedata <- rbind(vtedata, c(0, vsm(tedata[i,2], highlyrepeatedwords)))
  }
}
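Growing a matrix with rbind() inside a loop copies it on every iteration, which gets slow for thousands of messages. One possible alternative, sketched here on toy data, builds all rows first and binds them once. The compact vsm() below is a stand-in with the same behavior as the function defined earlier, and the two-word vocabulary and messages are hypothetical.

```r
# Compact stand-in for vsm(): counts of each vocabulary word in one message.
vsm <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocabulary, function(w) sum(tokens == w), numeric(1), USE.NAMES = FALSE)
}

vocabulary <- c("free", "call")                       # hypothetical
toy <- cbind(c("ham", "spam"), c("call me", "free free call"))

# 1 = ham, 0 = spam, as in the script.
labels <- ifelse(toy[, 1] == "ham", 1, 0)

# Build one feature row per message, then bind all rows in a single call.
features <- do.call(rbind, lapply(toy[, 2], vsm, vocabulary = vocabulary))
vectorized <- cbind(labels, features)
print(vectorized)
```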

R Codes (Cont.)

# Run different classification algorithms
# different SVMs with different kernels
print("----------------------------------SVM-----------------------------------------")
print("Linear Kernel")
svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])], y=vtrdata[,1], type='C-classification', kernel='linear')
summary(svmlinmodel)
predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
tablinear <- table(pred = predictionlin, true = vtedata[,1]); tablinear
# fraction of correctly classified messages (overall accuracy, despite the variable name)
precisionlin <- sum(diag(tablinear))/sum(tablinear)
print("General Error using Linear SVM is (in percent):"); (1-precisionlin)*100
print("Ham Error using Linear SVM is (in percent):"); (tablinear[1,2]/sum(tablinear[,2]))*100
print("Spam Error using Linear SVM is (in percent):"); (tablinear[2,1]/sum(tablinear[,1]))*100

print("Polynomial Kernel")
# type is set explicitly: with a numeric y, svm() would otherwise default to regression
svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])], y=vtrdata[,1], type='C-classification',
                    kernel='polynomial', probability=FALSE)
summary(svmpolymodel)
predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
tabpoly <- table(pred = predictionpoly, true = vtedata[,1]); tabpoly

print("Radial Kernel")
svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])], y=vtrdata[,1], type='C-classification',
                   kernel = "radial", gamma = 0.09, cost = 1, probability=FALSE)
summary(svmradmodel)
predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
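The three error figures printed for the linear kernel can be wrapped in a small helper and reused for the other kernels. The function error_rates and the toy predictions below are hypothetical. The sketch assumes the script's encoding (1 = ham, 0 = spam) and a table with predictions in rows and true labels in columns.

```r
# Error rates (in percent) from a confusion table: rows = predicted, columns = true.
error_rates <- function(confusion) {
  c(general = 1 - sum(diag(confusion)) / sum(confusion),
    ham     = confusion["0", "1"] / sum(confusion[, "1"]),   # ham flagged as spam
    spam    = confusion["1", "0"] / sum(confusion[, "0"])    # spam let through as ham
  ) * 100
}

# Toy predictions; fixing the factor levels guarantees a 2 x 2 table
# even when a classifier predicts only one class.
predicted <- factor(c(1, 1, 0, 1, 0, 0, 1, 1), levels = c(0, 1))
actual    <- factor(c(1, 1, 0, 0, 0, 1, 1, 1), levels = c(0, 1))
tab <- table(pred = predicted, true = actual)

rates <- error_rates(tab)
print(tab)
print(rates)
```

Indexing by the labels "0" and "1" rather than by position makes the ham and spam rates independent of the order in which the levels happen to appear.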

R Codes (Cont.)

print("----------------------------------KNN-----------------------------------------")
# IBk() takes a model formula, and the class column must be a factor for classification
data <- data.frame(sms=vtrdata[,2:length(vtrdata[1,])], type=factor(vtrdata[,1], levels=c(0,1)))
classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
summary(classifier)
evaluate_Weka_classifier(classifier, newdata =
  data.frame(sms=vtedata[,2:length(vtedata[1,])], type=factor(vtedata[,1], levels=c(0,1))))
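For readers unfamiliar with k-nearest neighbours, the idea behind IBk can be sketched in a few lines of base R: find the k training points closest to a new point and take a majority vote. This is a simplified illustration, not Weka's implementation. The function knn_predict and the toy data are hypothetical. As I understand the Weka options, K sets the number of neighbours and X asks Weka to pick the best value up to K by hold-one-out evaluation, which this sketch does not do.

```r
# Majority vote among the k nearest training points (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k) {
  # Distance from new_x to every training row.
  distances <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training set: two well-separated groups in two dimensions.
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("ham", "ham", "ham", "spam", "spam", "spam")

print(knn_predict(train_x, train_y, c(0.5, 0.5), k = 3))
print(knn_predict(train_x, train_y, c(5.5, 5.5), k = 3))
```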

print("---------------------------------Adaboost-------------------------------------")
adaptiveboost <- ada(x=vtrdata[,2:length(vtrdata[1,])], y=vtrdata[,1],
                     test.x=vtedata[,2:length(vtedata[1,])], test.y=vtedata[,1],
                     loss="logistic", type="gentle", iter=100)
summary(adaptiveboost)
varplot(adaptiveboost)

Conclusions

• In these slides I gave a broad overview of ML and the different problems that can be solved in this framework.

• I reviewed in detail one way of implementing an SMS spam filter using ML techniques in the R language.

• ML provides a strong framework for solving problems in the Big Data domain.
