SMS Spam Filter Design Using R: A Machine Learning Approach Reza Rahimi, Ph.D Candidate, School of Information and Computer Science, University of California, Irvine.


Introduction

• In basic terms, Machine Learning (ML) is about the construction of systems that can learn from data.
• It is used as a tool for knowledge discovery.
• Several important classes of problems can be solved using machine learning techniques, including:
    – Classification (Prediction):
        • Given a collection of records as a training set.
        • Each record contains a set of attributes, one of which is called the class.
        • The problem is to find a model for the class attribute as a function of the other attributes.
        – Example: spam or ham, handwriting recognition, …

    – Clustering (Description):
        • Given a set of data points with some attributes, and a similarity measure (metric) among them.
        • The goal is to find clusters such that data points in one cluster are more similar to one another than to points in other clusters.
        – Example: document clustering, people categorization, …
    – Association (Description):
        • Given a set of records, each of which contains some items from a given collection.
        • The goal is to produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
        – Example: user habit pattern recognition, …
    – Regression (Prediction):
        • Predict the value of a given continuous variable based on the values of other variables.
        • The dependency model can be linear or nonlinear.
        – Example: stock prediction
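As a rough illustration of three of these problem classes, the sketch below uses only base R and the built-in `iris` and `cars` datasets. The datasets and model choices (logistic regression, k-means, a linear model) are assumptions picked for brevity and are not part of the original slides. Association rule mining is left out because it needs an add-on package.

```r
set.seed(1245)

# Classification: model a two-level class attribute as a function of other attributes.
two <- droplevels(subset(iris, Species != "setosa"))
clf <- glm(Species ~ Petal.Length + Petal.Width, data = two, family = binomial)
# The fitted probability refers to the second factor level ("virginica").
pred <- ifelse(predict(clf, type = "response") > 0.5, "virginica", "versicolor")
accuracy <- mean(pred == two$Species)

# Clustering: group points by similarity without using any class label.
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Regression: predict a continuous variable (stopping distance) from another (speed).
fit <- lm(dist ~ speed, data = cars)
slope <- coef(fit)[["speed"]]

print(accuracy)
print(table(clusters$cluster))
print(slope)
```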

Problem Solving Using Machine Learning Framework

• ML is a mature and well-developed area.
• For all of the problem classes mentioned, it offers a rich set of tools, techniques, and algorithms.
• These tools are available in different languages and frameworks such as R, Matlab, Java, C++, Mahout, …
• The following procedure can be considered the general methodology for problem solving in this framework:

1. Get a sense of the data: feature extraction, dimension reduction, noise cancellation, …
2. Problem modeling: classification, clustering, association, regression, …
3. Run standard ML algorithms: check the errors according to the standard ML metrics.
4. Select the methods that satisfy your performance criteria and metrics.

• In the next section I will describe the design of an SMS spam filter in the R language based on this methodology.

SMS Spam Filter using R

# This file contains SMS spam filter code with different classifiers in the R language
# Written by: Reza Rahimi
# Initialization: raw data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
# loading required packages, libraries and function declaration

# required package for text mining
if(!require("tm"))
  install.packages("tm")

# required package for SVM
if(!require("e1071"))
  install.packages("e1071")

# required package for KNN
if(!require("RWeka"))
  install.packages("RWeka", dependencies = TRUE)

# required package for Adaboost
if(!require("ada"))
  install.packages("ada")

library("tm")
library("e1071")
library("RWeka")
library("ada")

R Codes (Cont.)

# Initialize random generator
set.seed(1245)

# This function makes a vector (Vector Space Model) from a text message using highly repeated words
vsm<-function(message,highlyrepeatedwords){

  tokenizedmessage<-strsplit(message, "\\s+")[[1]]

  # making vector: v[i] counts how often highlyrepeatedwords[i] occurs in the message
  v<-rep(0, length(highlyrepeatedwords))
  # seq_along() skips the loop for an empty vector, whereas 1:length(x) would iterate over 1 and 0
  for(i in seq_along(highlyrepeatedwords)){
    for(j in seq_along(tokenizedmessage)){
      if(highlyrepeatedwords[i]==tokenizedmessage[j]){
        v[i]<-v[i]+1
      }
    }
  }
  return (v)
}

# Loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# The call expects a tab-separated file whose header row names the columns "type" and "sms".
# quote = "" keeps quotation marks inside messages from being read as field delimiters.
print("Loading SMS Spams and Hams!")
smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", quote = "",
                   colClasses=c("type"="character","sms"="character"))
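To make the vector space model concrete, here is a small self-contained sketch of the same counting idea on a toy message. The function name `vsm_counts`, the three-word vocabulary, and the sample message are hypothetical and chosen only for illustration. The function is a vectorized equivalent of the nested loops in `vsm`, not the author's code.

```r
# Count how often each vocabulary word occurs among the whitespace-separated tokens.
vsm_counts <- function(message, vocabulary) {
  tokens <- strsplit(message, "\\s+")[[1]]
  vapply(vocabulary, function(w) sum(tokens == w), numeric(1), USE.NAMES = FALSE)
}

vocab <- c("free", "call", "now")  # hypothetical list of highly repeated words
v <- vsm_counts("call now for a free free prize", vocab)
print(v)  # 2 1 1: "free" twice, "call" once, "now" once

# A message with no tokens maps to the zero vector.
empty <- vsm_counts("", vocab)
print(empty)
```

Each message becomes a fixed-length numeric vector whose i-th entry is the count of the i-th vocabulary word, which is the form the classifiers later in the deck expect.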

R Codes (Cont.)

smstabletmp<-smstable

print("Extracting Ham and Spam Basic Statistics!")

# Encode the class label: ham = 1, spam = 0
smstabletmp$type[smstabletmp$type=="ham"] <- 1
smstabletmp$type[smstabletmp$type=="spam"] <- 0

# Convert character data into numeric
tmp<-as.numeric(smstabletmp$type)

# Basic statistics: mean and variance of the 0/1 ham indicator
hamavg<-mean(tmp)
print("Average Ham is :");hamavg

hamvariance<-var(tmp)
print("Var of Ham is :");hamvariance

print("Extract average token of Hams and Spams!")

# Counters for the number of messages and tokens per class
nohamtokens<-0
noham<-0
nospamtokens<-0
nospam<-0

R Codes (Cont.)

# Count messages and whitespace-separated tokens for each class
for(i in seq_len(nrow(smstable))){
  if(smstable[i,1]=="ham"){
    nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
    noham<-noham+1
  }else{
    nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
    nospam<-nospam+1
  }
}

totaltokens<-nospamtokens+nohamtokens
print("Total number of tokens is:")
print(totaltokens)

avgtokenperham<-nohamtokens/noham
print("Average number of tokens per ham message")
print(avgtokenperham)

avgtokenperspam<-nospamtokens/nospam
print("Average number of tokens per spam message")
print(avgtokenperspam)

print("Make two different sets, training data and test data!")
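The per-class token averages can also be computed without an explicit loop. The following is a minimal sketch on a made-up three-message table. The data frame `sms` and its contents are assumptions for illustration and do not come from the SMS Spam Collection.

```r
# Hypothetical toy data with the same two columns as the real table.
sms <- data.frame(
  type = c("ham", "spam", "ham"),
  sms  = c("see you at lunch", "win a free prize now", "ok"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message.
tokens_per_message <- lengths(strsplit(sms$sms, "\\s+"))

# Average token count per class.
avg_tokens <- tapply(tokens_per_message, sms$type, mean)
print(avg_tokens)
```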

R Codes (Cont.)

# Select the fraction of data to use as the training set.
# Each message is assigned to the training set with this probability.
trdatapercent<-0.3

# training data set
trdata=NULL

# test data set
tedata=NULL

for(i in seq_len(nrow(smstable))){
  if(runif(1)<trdatapercent){
    trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
  }
  else{
    tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
  }
}

print("Training data size is:")
dim(trdata)

print("Test data size is:")
dim(tedata)
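Growing a matrix with `rbind` inside a loop copies the data on every iteration, and the per-row `runif` draw makes the training set size itself random. One possible alternative, sketched here on hypothetical toy data, draws a fixed-size index sample once. The names `messages`, `train_fraction`, and `train_idx` are illustrative assumptions.

```r
set.seed(1245)

# Hypothetical toy table standing in for the loaded SMS data.
n <- 10
messages <- data.frame(
  type = rep(c("ham", "spam"), length.out = n),
  sms  = paste("Message", seq_len(n)),
  stringsAsFactors = FALSE
)

train_fraction <- 0.3
# Draw the row indices of the training set once, without replacement.
train_idx <- sample(n, size = round(train_fraction * n))

train <- messages[train_idx, ]
test  <- messages[-train_idx, ]
train$sms <- tolower(train$sms)
test$sms  <- tolower(test$sms)

print(dim(train))
print(dim(test))
```

Unlike the loop, this keeps the result as a data frame and fixes the split at exactly 30% of the rows.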

R Codes (Cont.)

# Text feature extraction using tm package

trsmses<-Corpus(VectorSource(trdata[,2]))
trsmses<-tm_map(trsmses, stripWhitespace)
# Recent tm versions need content_transformer() around functions that are not tm transformations
trsmses<-tm_map(trsmses, content_transformer(tolower))
trsmses<-tm_map(trsmses, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(trsmses)

# Terms that occur at least 80 times in the training corpus
highlyrepeatedwords<-findFreqTerms(dtm, 80)

# These highly used words are used as an index to make the VSM
# (vector space model) for the training data and test data

# vectorized training data set
vtrdata=NULL

# vectorized test data set
vtedata=NULL

R Codes (Cont.)

# Each row is the class label (ham = 1, spam = 0) followed by the VSM vector of the message
for(i in seq_len(nrow(trdata))){
  if(trdata[i,1]=="ham"){
    vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
  }
  else{
    vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
  }
}

for(i in seq_len(nrow(tedata))){
  if(tedata[i,1]=="ham"){
    vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
  }
  else{
    vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
  }
}

R Codes (Cont.)

# Run different classification algorithms
# different SVMs with different kernels
print("----------------------------------SVM-----------------------------------------")
print("Linear Kernel")
svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type='C-classification', kernel='linear')
summary(svmlinmodel)
predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
# Rows are predictions, columns are true labels (0 = spam, 1 = ham)
tablinear <- table(pred = predictionlin , true = vtedata[,1]); tablinear
# Share of correctly classified messages (overall accuracy)
accuracylin<-sum(diag(tablinear))/sum(tablinear)
print("General Error using Linear SVM is (in percent):");(1-accuracylin)*100
print("Ham Error using Linear SVM is (in percent):");(tablinear[1,2]/sum(tablinear[,2]))*100
print("Spam Error using Linear SVM is (in percent):");(tablinear[2,1]/sum(tablinear[,1]))*100

print("Polynomial Kernel")
# type is set explicitly: with a numeric y, svm() would otherwise fit a regression model
svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C-classification',
                    kernel='polynomial', probability=FALSE)
summary(svmpolymodel)
predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
tabpoly <- table(pred = predictionpoly , true = vtedata[,1]); tabpoly

print("Radial Kernel")
svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1], type='C-classification',
                   kernel = "radial", gamma = 0.09, cost = 1, probability=FALSE)
summary(svmradmodel)
predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
tabrad <- table(pred = predictionrad, true = vtedata[,1]); tabrad
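The slide reports three error figures for the linear kernel only. A small helper such as the one below could compute the same figures for any of the confusion tables. The function name `error_rates` and the toy label vectors are hypothetical. Fixing the factor levels keeps the table 2x2 even when a classifier never predicts one of the classes.

```r
# Error rates in percent from 0/1 labels (0 = spam, 1 = ham).
error_rates <- function(pred, true) {
  tab <- table(pred = factor(pred, levels = c(0, 1)),
               true = factor(true, levels = c(0, 1)))
  c(
    overall = 1 - sum(diag(tab)) / sum(tab),   # any misclassification
    ham     = tab["0", "1"] / sum(tab[, "1"]), # ham predicted as spam
    spam    = tab["1", "0"] / sum(tab[, "0"])  # spam predicted as ham
  ) * 100
}

# Hypothetical labels: 6 ham and 4 spam messages, one error in each class.
true <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1)
pred <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 1)
rates <- error_rates(pred, true)
print(rates)
```

For a spam filter the ham error is usually the more costly one, since it corresponds to legitimate messages being blocked.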

R Codes (Cont.)

print("----------------------------------KNN-----------------------------------------")
# IBk takes a formula interface and needs the class column to be a factor
data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=factor(vtrdata[,1], levels=c(0,1)))
classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
summary(classifier)
evaluate_Weka_classifier(classifier, newdata =
  data.frame(sms=vtedata[,2:length(vtedata[1,])],type=factor(vtedata[,1], levels=c(0,1))))

print("---------------------------------Adaboost-------------------------------------")
adaptiveboost<-
  ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],test.x=vtedata[,2:length(vtedata[1,])],
      test.y=vtedata[,1], loss="logistic", type="gentle", iter=100)

summary(adaptiveboost)
varplot(adaptiveboost)
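The KNN step relies on RWeka, which in turn needs a Java installation. As a swapped-in alternative rather than the method used in these slides, the `knn` function from the `class` package (a recommended package shipped with standard R distributions) classifies by the same nearest-neighbour vote. The toy feature matrix below is a hypothetical stand-in for word-count vectors.

```r
library(class)

set.seed(1245)

# Hypothetical features: counts of two vocabulary words per message.
train_x <- rbind(c(0, 0), c(1, 0), c(0, 1),   # ham-like messages
                 c(3, 2), c(2, 3), c(4, 4))   # spam-like messages
train_y <- factor(c("ham", "ham", "ham", "spam", "spam", "spam"))

test_x <- rbind(c(0, 0.5), c(3, 3))

# Each test row gets the majority class among its 3 nearest training rows.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
print(pred)
```

With the real data, `train_x` and `test_x` would correspond to the feature columns of `vtrdata` and `vtedata`, and `cl` to the label column of `vtrdata`.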

Conclusions

• These slides gave a broad overview of ML and the different problems that can be solved in this framework.
• I reviewed in detail one way of implementing an SMS spam filter using ML techniques in the R language.
• ML provides a strong framework for solving problems in the Big Data domain.