E-mail Classification in an Instance-Based System Using Header Information and Text Mining Techniques
E. ParsaeiMehr, M. Ganj, and E. BehroozianNejad
Abstract: The increasing volume of unsolicited bulk e-mail, known as spam, is a growing annoyance for most Internet users. Although several machine learning techniques have been proposed, an instance-based system has a lower false positive error. In this paper we present an instance-based system in which the spam training data set is clustered. Our evaluation shows that the new system not only has the same false positive error as a simple instance-based system, but also has a better response time. Furthermore, we analyze the time field in the header information of e-mails to examine whether there is any particular pattern in the times at which spammers send spam.
Index Terms: Text mining, classification, clustering.
1 INTRODUCTION
In recent years, e-mail has become a common and important medium of communication for most Internet users. However, spam, also known as unsolicited commercial or bulk e-mail, is a bane of e-mail communication. One study estimated that over 70% of today's business e-mail is spam [1]. The growing volume of spam therefore causes many serious problems, such as filling users' mailboxes, burying important personal e-mails, wasting storage space and communication bandwidth, and consuming the time users need to delete all the spam. Spam e-mails vary significantly in content and roughly belong to the following categories: money-making scams, fat loss, business improvement, sexually explicit material, making friends, service provider advertisements, etc.
Several solutions have been proposed to overcome the spam problem. Among the proposed methods, much interest has focused on the instance-based technique for spam filtering, because this technique has less false positive error than other machine learning techniques. A user cannot tolerate a legitimate e-mail, which may be very important, not being delivered; a spam filtering technique with very low false positive error is therefore vital for users.
In this paper we extend an instance-based system with a clustering approach. The proposed system offers a number of advantages in the spam filtering domain. As filters are adapted to contend with today's types of spam, spammers alter and obfuscate their e-mails to confuse the filters, disguising them to look more like legitimate e-mail. This dynamic nature of spam means that any filter that is to remain successful over time must be updated. In this regard, the proposed system can learn from misclassified e-mails. For this purpose the system has two bases: Hard Ham, in which legitimate e-mails misclassified as spam by our system are stored, and Hard Spam, in which spam e-mails misclassified as legitimate are stored. These two bases are used when classifying newly arrived e-mails, to avoid misclassifying e-mails that have the same header information as those found in one of the two bases.
Moreover, because the spam e-mails in the training data set are clustered, the representative vectors of the clusters take part in the calculations instead of each spam e-mail in the clusters; hence, the number of calculations decreases significantly and the response time of our system declines.
Furthermore, the information in the header of e-mails, such as from, return path, content type, language and attachment, is considered and investigated in order to avoid false positive errors. The evaluation shows that our system has less false positive error, which can be very harmful for the user, than an instance-based system without clustering.
Finally, we analyze the time field of all legitimate and spam e-mails to detect patterns in when users and spammers send e-mails. The evaluations and results are described in Section 5. The rest of the paper is organized as follows: Section 2 outlines related work on e-mail classification techniques. Our proposed system is described in Section 3, while Section 4 presents the performance evaluation and results of our system. Section 5 describes the results of analyzing the time field of e-mails. Finally, the paper ends with the conclusion in Section 6.
E. ParsaeiMehr is with the Department of Computer, Azad University, Shoushtar Branch, Shoushtar, Khouzestan, Iran.
M. Ganj is with the Department of Computer, Azad University, Shoushtar Branch, Shoushtar, Khouzestan, Iran.
E. BehroozianNejad is with the Department of Computer, Azad University, Shoushtar Branch, Shoushtar, Khouzestan, Iran.
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 12, DECEMBER 2012, ISSN (Online) 2151-9617
https://sites.google.com/site/journalofcomputing
WWW.JOURNALOFCOMPUTING.ORG 33
2012 Journal of Computing Press, NY, USA, ISSN 2151-9617
2 RELATED WORKS
The techniques that have been developed for spam filtering fall into three categories: rule-based, content-based and memory-based. Rule-based techniques investigate keywords or clues in the body and header of e-mails. Black lists, white lists and Ripper belong to this kind of technique. These methods try to detect patterns (such as special words in the body or subject).
Content-based techniques rely on the observation that spam e-mails usually have specific content that makes them different from legitimate e-mails. Hence classification based on content is more logical than rule-based classification, and it also reduces false positive error. Most methods in this category use machine learning techniques. The first example is Naive Bayes, in which the classifier computes the likelihood that an e-mail is spam given the features contained in the e-mail [1]. The model output by the Naive Bayes algorithm labels examples based on the features they contain. Another content-based method is SVM [2],[3]. Its key concepts are that there are two classes, $y_i \in \{-1, 1\}$, and $N$ labeled training examples $(x_1, y_1), \ldots, (x_N, y_N)$, $x_i \in \mathbb{R}^d$, where $d$ is the dimensionality of the feature vector. If the two classes are linearly separable, one can find an optimal weight vector $w^*$ such that $\|w^*\|^2$ is minimum and $y_i(w^* \cdot x_i - b) \geq 1$. The distance between the two hyperplanes defines a margin, and this margin is maximized when the norm of the weight vector $\|w^*\|$ is minimum. The next method is C5.0 (decision tree), in which each branch node represents a choice between a number of alternatives [1] and each leaf node represents a decision. The neural network is another content-based method. It is a non-linear feed-forward network with the sigmoid activation function $f(x) = 1/(1 + e^{-x})$ [4]. This activation function produces an output in the range [0, 1]. The e-mail is classified as junk if the output value is above 0.5.
Another technique for spam filtering is memory-based. The memory-based approach has less false positive error than ML (machine learning) techniques. This technique stores all training instances in a memory structure and uses them for classification. The most basic instance-based method is the k-nearest neighbor (k-NN) algorithm [5]. In this algorithm the distance between the new e-mail and every sample in the training set is calculated. The samples are ranked according to these distances. Then the k samples nearest to the new e-mail are used to assign a class to it.
Another method that uses this technique is the Instance-Based (IB) system. When a new e-mail arrives, the system executes four stages: retrieve, reuse, revision and retain [6],[7]. In the retrieve stage, the IB system selects the stored e-mails with the highest similarity to the arrived e-mail. The system then assigns a class label to the incoming e-mail using a unanimous voting algorithm over the M instance messages that are most similar to the arrived e-mail. Each message in M casts one vote, and by counting the votes the system provides an initial classification. Before the final response of the system, a revision stage is carried out when the assigned class is spam. This re-evaluation is carried out with the goal of guaranteeing the accuracy of the proposed solution [6]. In these situations, the IB system uses the knowledge extracted from the message header (from, return path, content type, language, attachment) to generate a final answer. Concretely, the system searches the instance base for spam e-mails written in the same language as the new one. If one is found, the IB system classifies the incoming message as spam; otherwise it labels the e-mail as legitimate. If the incoming e-mail was labeled legitimate at the adaptation (reuse) stage, the revision stage does nothing. Every time the user checks his mailbox and provides feedback to the system about a previous e-mail classification, the system stores a new instance in the e-mail base for future use.
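The k-NN step with unanimous voting described above can be sketched roughly as follows. This is only an illustration: the function names, the sparse dictionary representation, the use of cosine similarity, and the rule that a single non-spam vote yields "legitimate" are assumptions made for the example, not details taken from the cited systems.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_unanimous(new_vec, instances, k=3):
    """Label new_vec from its k most similar stored instances.

    instances: list of (vector, label) pairs, label is 'spam' or 'legitimate'.
    Assumed voting rule: 'spam' only if every one of the k voters says spam.
    """
    ranked = sorted(instances, key=lambda inst: cosine(new_vec, inst[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    if votes and all(vote == "spam" for vote in votes):
        return "spam"
    return "legitimate"

# Hypothetical toy instance base.
training = [
    ({"win": 1.0, "money": 1.0}, "spam"),
    ({"cheap": 1.0, "money": 1.0}, "spam"),
    ({"meeting": 1.0, "agenda": 1.0}, "legitimate"),
]
```

With this voting rule a larger k makes the filter more conservative, since one legitimate neighbour is enough to veto a spam label.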
3 PROPOSED SYSTEM
This section outlines the CIB system, an IB system in which a clustering approach is used; a simple IB system is modified accordingly. In the CIB system each e-mail is an instance represented as a vector of attributes or features. The features are words that occur in the body or header of the e-mails. In total, the CIB system has four phases: preprocessing, feature selection, clustering and IB classification. Each phase is described below.
3.1 Preprocessing
In this phase, the initial data, which are e-mails in the form of text files, are transformed into a form that the system can process. Specifically, the HTML tags in the text files are removed, and then the words in the header and the body of each e-mail are extracted and stored in a word base.
3.2 Feature Selection
In this phase, the most important words, which serve as the features useful for better classification, are selected. First, the words in the stop list are eliminated from the feature vectors of the e-mails; thus, words like "a", "an" and "the", which are found in both legitimate and spam e-mails, are removed. Second, a feature selection algorithm is applied to reduce the dimensionality of the feature space; in this paper Information Gain (IG) [8] was used to select the most predictive features, as it has been shown to be an effective technique for aggressive feature removal in text classification [9]. Finally, the TF-IDF technique [2] was applied to form the weight vectors of the instances.
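The three steps of this phase (stop-word removal, IG ranking, TF-IDF weighting) could look roughly like the sketch below. The stop list, the binary "term occurs in document" feature used for IG, and the plain tf × log(N/df) weighting are assumptions chosen for illustration; the paper does not specify these details.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative stop list

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def entropy(pos, neg):
    """Binary entropy (in bits) of a group with pos/neg members."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'."""
    n = len(docs)
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]

    def group_entropy(group):
        spam = sum(1 for l in group if l == "spam")
        return entropy(spam, len(group) - spam)

    conditional = (len(with_t) / n) * group_entropy(with_t) \
        + (len(without_t) / n) * group_entropy(without_t)
    return group_entropy(labels) - conditional

def select_features(docs, labels, top_n):
    """Keep the top_n terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return ranked[:top_n]

def tfidf_vector(doc, docs, features):
    """Sparse TF-IDF weights of one tokenized document over the selected features."""
    n = len(docs)
    counts = Counter(doc)
    vec = {}
    for t in features:
        df = sum(1 for d in docs if t in d)
        if counts[t] and df:
            vec[t] = counts[t] * math.log(n / df)
    return vec

# Hypothetical toy corpus.
raw = ["win the money now", "win a prize", "the meeting agenda", "agenda for a meeting"]
labels = ["spam", "spam", "legitimate", "legitimate"]
docs = [tokenize(r) for r in raw]
features = select_features(docs, labels, 3)
```

In this toy corpus the terms that split the two classes perfectly receive the maximum gain of one bit, while a term such as "money", which appears in only one spam message, scores lower.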
3.3 Clustering
In this phase, the new idea is added to the simple IB system. We used the k-means algorithm to cluster the spam instances in the training data set. We experimented with various values of K to find an efficient number of clusters; the results are shown in Section 4. It can be seen that 10 clusters give an efficient performance. More details are given in Section 4.
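A minimal k-means sketch for producing the representative vectors (centroids) of the spam clusters is given below. It assumes dense feature vectors, squared Euclidean distance and random initialisation from the data points with a fixed seed; none of these choices is stated in the paper, and a library implementation would normally be preferred in practice.

```python
import random

def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Return (centroids, assignment) after Lloyd-style iterations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return centroids, assignment

# Hypothetical two-dimensional "spam vectors" forming two obvious groups.
spam_vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, assignment = kmeans(spam_vectors, k=2)
```

The centroids returned here play the role of the representative vectors that the classification phase compares against the incoming e-mail.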
3.4 Classification
In this phase we have changed the four stages of the simple IB system to adopt the clustering approach. Fig. 1 shows the architecture of this phase. The sequence of the stages is labeled with numbers. In Fig. 1 there are two bases named HS (Hard Spam) and HH (Hard Ham): HS is a base of misclassified spam e-mails, whereas HH is a base of misclassified legitimate e-mails. The four stages of this phase are described in more detail below.
3.4.1 Retrieve Stage
In this stage, the similarity of the incoming e-mail to the legitimate e-mails and to the representative vectors of the clusters is calculated. Hence, instead of calculating the similarity between the arrived e-mail and all spam instances, as a simple IB system does, only the representative vectors of the clusters are examined.
3.4.2 Reuse Stage
If the instance selected in the retrieve stage is a cluster, the incoming e-mail is temporarily labeled as spam; otherwise, it is labeled as legitimate.
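The retrieve and reuse stages can be illustrated with the short sketch below. Cosine similarity over dense vectors and the tuple returned by `retrieve` are assumptions made for the example; the paper does not name the similarity measure it uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(new_vec, legit_vectors, cluster_centroids):
    """Return (kind, index, similarity) of the most similar stored item.

    Only legitimate instances and cluster centroids are compared; the
    individual spam instances are not touched at this stage.
    """
    candidates = [("legitimate", i, cosine(new_vec, v))
                  for i, v in enumerate(legit_vectors)]
    candidates += [("cluster", i, cosine(new_vec, c))
                   for i, c in enumerate(cluster_centroids)]
    return max(candidates, key=lambda item: item[2])

def reuse(retrieved):
    """Temporary label: spam if the best match is a cluster centroid."""
    kind, _, _ = retrieved
    return "spam" if kind == "cluster" else "legitimate"

# Hypothetical stored vectors.
legit = [[1.0, 0.0, 0.0]]
centroids = [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
```

For example, `reuse(retrieve([0.0, 1.0, 1.0], legit, centroids))` yields the temporary label "spam" because the closest stored item is the first centroid.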
3.4.3 Revision Stage
This stage, which further investigates the class of the incoming e-mail, has three parts.
The first part is carried out when the temporary class assigned in the reuse stage is spam. The system searches the most similar cluster for spam e-mails written in the same language and having the same header information. If one is found, the CIB system classifies the incoming message as spam; otherwise, it labels the e-mail as legitimate.
The second part is carried out if the class declared in the first part is spam. The system searches the HH (Hard Ham) base for legitimate e-mails written in the same language and having the same header information. If one is found, the CIB system classifies the incoming message as legitimate; otherwise, it labels the e-mail as spam.
The third part is carried out if the class declared in the first part or in the reuse stage is legitimate. The system searches the HS (Hard Spam) base for spam e-mails written in the same language and having the same header information. If one is found, the CIB system classifies the incoming message as spam; otherwise, it labels the e-mail as legitimate.
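The control flow of the three revision parts can be sketched as below. Representing an e-mail as a dictionary, the header field names, and treating "same header information" as exact equality of those fields are all assumptions made for illustration.

```python
HEADER_FIELDS = ("from", "return_path", "content_type", "language", "attachment")

def header_key(email):
    """Header fields compared during revision (field names are illustrative)."""
    return tuple(email.get(field) for field in HEADER_FIELDS)

def has_match(email, base):
    """True if some e-mail in base has identical header fields."""
    key = header_key(email)
    return any(header_key(other) == key for other in base)

def revise(email, temp_class, cluster_members, hh_base, hs_base):
    """Return the final class after the three-part revision stage."""
    label = temp_class
    if label == "spam":
        # Part 1: confirm against the spam stored in the most similar cluster.
        label = "spam" if has_match(email, cluster_members) else "legitimate"
        if label == "spam":
            # Part 2: a matching Hard Ham entry overrides the spam label.
            return "legitimate" if has_match(email, hh_base) else "spam"
    # Part 3: reached when the label is legitimate after reuse or Part 1.
    return "spam" if has_match(email, hs_base) else "legitimate"

# Hypothetical e-mail header.
mail = {"from": "a@example.com", "language": "en"}
```

Note that an e-mail labeled spam in the reuse stage can still end up legitimate in two ways: no matching spam in the cluster, or a matching entry in the HH base.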
3.4.4 Retain Stage
Every time the user checks his mailbox and provides feedback to the system about a previous e-mail classification, the system stores the misclassified e-mails in the two bases of hard instances for future use. Thus, if the user declares that a legitimate e-mail was misclassified, the system stores it in the HH base; on the other hand, if he declares that a spam e-mail was misclassified, the system stores it in the HS base. Fig. 1 shows more details.
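A small sketch of the retain stage follows, using the definitions of the two bases given for the classification phase (HH holds misclassified legitimate e-mails, HS holds misclassified spam). The function signature and the list-based bases are hypothetical.

```python
def retain(email, predicted, actual, hh_base, hs_base):
    """Store a misclassified e-mail in the matching hard-instance base.

    Returns the name of the base that received the e-mail, or None when
    the prediction was correct and nothing is stored.
    """
    if predicted == actual:
        return None
    if actual == "legitimate":
        hh_base.append(email)  # legitimate e-mail wrongly flagged as spam
        return "HH"
    hs_base.append(email)      # spam wrongly delivered as legitimate
    return "HS"

hh, hs = [], []
first = retain({"from": "boss@example.com"}, "spam", "legitimate", hh, hs)
second = retain({"from": "promo@example.com"}, "legitimate", "spam", hh, hs)
third = retain({"from": "friend@example.com"}, "legitimate", "legitimate", hh, hs)
```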
4 PERFORMANCE EVALUATION
In this section the proposed system is compared with a simple IB system and the results are analyzed. The measures by which we evaluated the performance of our system are: FP rate, FN rate, precision rate, recall rate and response time. These measures are defined below:
TP: the number of spam e-mails that are classified correctly
TN: the number of legitimate e-mails that are classified correctly
FN: the number of spam e-mails that are classified as legitimate
FP: the number of legitimate e-mails that are classified as spam
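$$\text{Recall rate} = \frac{TP}{TP + FN} \times 100 \qquad (1)$$
$$\text{Precision rate} = \frac{TP}{TP + FP} \times 100 \qquad (2)$$
$$\text{Error rate} = \frac{FP + FN}{FP + TP + FN + TN} \times 100 \qquad (3)$$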
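Equations (1) to (3) translate directly into code. The sketch below assumes non-zero denominators, and the counts used in the example are invented purely to show the arithmetic; they are not results from the paper.

```python
def recall_rate(tp, fn):
    """Equation (1): percentage of spam that was caught."""
    return 100.0 * tp / (tp + fn)

def precision_rate(tp, fp):
    """Equation (2): percentage of spam verdicts that were correct."""
    return 100.0 * tp / (tp + fp)

def error_rate(tp, tn, fp, fn):
    """Equation (3): percentage of all e-mails that were misclassified."""
    return 100.0 * (fp + fn) / (fp + tp + fn + tn)

# Hypothetical confusion counts for illustration only.
tp, tn, fp, fn = 90, 95, 5, 10
recall = recall_rate(tp, fn)          # 90 / 100
precision = precision_rate(tp, fp)    # 90 / 95
error = error_rate(tp, tn, fp, fn)    # 15 / 200
```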
4.1 Dataset
Table 1 describes the SpamAssassin corpus (publicly available for download at http://spamassassin.apache.org/publiccorpus/) employed in our experiments.
[Figure 1: block diagram. The clustering phase turns the feature vectors into representative vectors of clusters. These, together with the legitimate e-mail instances, feed four numbered stages: (1) retrieve stage (k-NN), (2) reuse stage (investigating the most similar instance), (3) revision stage (looking for the same header information in the HS and HH bases, then declaring the final class), and (4) retain stage (feedback stores misclassified spam in HS and misclassified legitimate e-mails in HH).]
Fig. 1. Architecture of classification phase
4.2 Results of Performance Evaluation
Table 2 shows the error rate for various numbers of clusters.
Table 2 shows that when the number of clusters falls below ten, the error rate of our system increases. This is caused by the increasing density of the clusters: an instance that is similar to the incoming e-mail has little effect on the representative vector because of the high density of the cluster; hence, it is likely that our system does not identify that cluster as the most similar one, and an error occurs.
Table 2 also shows that when the number of clusters is more than ten, the error rate does not improve. The reason is that as the number of clusters increases, the number of spam instances in each cluster decreases; hence, it is likely that not enough spam with the same header information, which is necessary for the unanimous voting to classify, will be found.
In the experiments, we set k = 1 (the number of voters) in the unanimous voting algorithm in the retrieve stage of both the IB system and our system.
Fig. 2, Fig. 3, Fig. 4, Fig. 5 and Fig. 6 show the results of the performance evaluation of both systems.
Table 3 shows the average response time of both systems. The difference between the response times of the two systems stems from the different number of comparisons each system performs for classification.
The average response time of our system is lower than that of the simple IB system. When our system declares the legitimate class, it only compares the arrived e-mail with all legitimate e-mails and the ten representative vectors of the clusters. When it declares the spam class, it compares not only the ten representative vectors but also all spam instances in the most similar cluster. The simple IB system, on the other hand, must compare the incoming e-mail with all stored instances (spam and legitimate e-mails) to declare its class. Hence, its response time is higher.
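The argument about the number of comparisons can be made concrete with a small calculation. The instance counts and the cluster size below are hypothetical round numbers chosen only to illustrate the reasoning, not the sizes used in the experiments.

```python
def comparisons_simple_ib(n_legit, n_spam):
    """Simple IB system: every stored instance is compared."""
    return n_legit + n_spam

def comparisons_cib(n_legit, n_clusters, declared_spam, cluster_size=0):
    """CIB system: legitimate instances plus cluster centroids, and the
    members of the most similar cluster only when spam is declared."""
    total = n_legit + n_clusters
    if declared_spam:
        total += cluster_size
    return total

# Hypothetical sizes: 200 legitimate and 400 spam instances, 10 clusters,
# 40 spam instances in the most similar cluster.
simple = comparisons_simple_ib(200, 400)
cib_legit = comparisons_cib(200, 10, declared_spam=False)
cib_spam = comparisons_cib(200, 10, declared_spam=True, cluster_size=40)
```

Under these assumed sizes the clustered system performs 210 or 250 comparisons per e-mail instead of 600, which is the effect the response-time comparison reflects.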
5 RESULTS OF ANALYSIS OF THE TIME FIELD
For this analysis, we took the time field of all e-mails in the data set and rounded it to the nearest hour (e.g. 6:45:33 to 7, or 6:12:34 to 6).
Fig. 7 shows the percentage distribution of spam and legitimate e-mails over the 24 hours of a day.
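The rounding and the hourly distribution described above could be computed as in the following sketch. It assumes the time field has already been extracted as an "H:MM:SS" string, that exactly 30 minutes rounds up, and that 24 wraps to hour 0; these details are not specified in the paper.

```python
from collections import Counter

def nearest_hour(timestamp):
    """Round an 'H:MM:SS' string to the nearest hour (0 to 23)."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    if minutes * 60 + seconds >= 1800:  # half an hour or more rounds up
        hours += 1
    return hours % 24  # assumption: 23:30 and later wrap to hour 0

def hourly_percentages(timestamps):
    """Percentage of e-mails falling into each hour of the day."""
    counts = Counter(nearest_hour(t) for t in timestamps)
    total = len(timestamps) or 1  # avoid division by zero for an empty list
    return {hour: 100.0 * counts[hour] / total for hour in range(24)}

# Hypothetical time fields.
sample = ["6:45:33", "6:12:34", "7:10:00", "23:59:59"]
distribution = hourly_percentages(sample)
```

Computing one such distribution for spam and one for legitimate e-mails gives the two curves that a plot like Fig. 7 compares.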
TABLE 1
DataSet Description
                  Legitimate e-mail   Spam
Training stage    225                 460
Test stage        75                  140
Total e-mails     300                 700
TABLE 2
Error Rate Values for Various Numbers of Clusters
Number of clusters   6    8    10    12    14    16    18    20
Error rate (%)       8    4    2.5   2.5   2.5   2.5   2.5   2.5
Fig. 2. False positive error rate for the analyzed systems
Fig. 3. False negative error rate for the analyzed systems
Fig. 4. Precision rate for the analyzed systems
Fig. 5. Recall rate for the analyzed systems
Fig. 6. Error rate for the analyzed systems
From Fig. 7, we can conclude that:
1. There is no special time for sending spam e-mails; they can be sent at any hour of the day.
2. The belief that spammers work at night, when fewer Internet users are online and more bandwidth is available, is false: between hours 2 and 7 the percentage of spam e-mails sent is lower than that of legitimate e-mails. This may be due to the following causes:
   - Spammers use their victims, who are other Internet users, to send their e-mails, and these victims are usually active at any time of the day.
   - A company may have put one of its employees in charge of sending many advertising e-mails to the inboxes of many users throughout the world. Hence, he does this during his working hours, not at night.
3. In the hours in which most users send their e-mails (e.g. 4, 7 and 9), the activity of the spammers is lowest.
6 CONCLUSION
In this article we presented an improved instance-based system for spam filtering. The idea was to cluster the spam instances; therefore, in the retrieve stage, the representative vectors of the clusters are used in the calculations instead of all the spam messages. Thus, the number of comparisons and the average response time decrease. Furthermore, the time field of both spam and legitimate e-mails was analyzed and the results were described. The results of the comparison between our system and the simple IB system show that the performance measures have improved.
ACKNOWLEDGMENTS
The research work in this paper is supported by Azad University,
Shoushtar Branch in Iran.
REFERENCES
[1] D. Shih, H. Chiang, and B. Lin, "Collaborative Spam Filtering with Heterogeneous Agents," Expert Systems with Applications, 2008.
[2] H. Drucker, D. Wu, and V. Vapnik, "Support Vector Machines for Spam Categorization," IEEE Transactions on Neural Networks, 1999.
[3] M. Islam, W. Zhou, and M. Choudhury, "Dynamic Feature Selection for Spam Filtering Using Support Vector Machine," International Conference and Information Science, 2007.
[4] B. Yu and Z. Xu, "A Comparative Study for Content-based Dynamic Spam Classification Using Four Machine Learning Algorithms," Knowledge-Based Systems, 2008.
[5] C. Lai, "An Empirical Study of Three Machine Learning Methods for Spam Filtering," Knowledge-Based Systems, 2007.
[6] F. Riverola, E.L. Iglesias, F. Díaz, J.R. Méndez, and J.M. Corchado, "SpamHunting: An Instance-based Reasoning System for Spam Labelling and Filtering," Decision Support Systems, 2007.
[7] S. Delany, P. Cunningham, A. Tsymbal, and L. Coyle, "A Case-based Technique for Tracking Concept Drift in Spam Filtering," Knowledge-Based Systems, 2005.
[8] D. McSherry, "Explaining the Pros and Cons of Conclusions in CBR," Proc. of the 7th European Conference on Case-Based Reasoning, Madrid, Spain, 2004.
[9] L. Galavotti, S. Fabrizio, and M. Simi, "Feature Selection and Negative Evidence in Automated Text Categorization," ICML, 1999.
Elham ParsaeiMehr received the B.S. degree from Shahid Chamran University, Ahvaz, Iran in 2004, and the M.Sc. degree from Azad University, Ahvaz, Iran in 2010. She is currently with the Department of Computer Science, Azad University of Shoushtar, Khouzestan, Iran, as a tutor.
Mohsen Ganj received the B.S. degree from Azad University, Dezful, Iran in 2004, and the M.Sc. degree from Azad University, Dezful, Iran in 2011. He is currently with the Department of Computer Science, Azad University of Shoushtar, Khouzestan, Iran, as a tutor.
Ebrahim BehroozianNejad received the B.S. degree from Azad University, Dezful, Iran in 2004, the M.Sc. degree from Azad University, Tehran, Iran in 2006, and the Ph.D. degree from Azad University, Tehran, Iran in 2011. He is currently with the Department of Computer Science, Azad University of Shoushtar, Khouzestan, Iran, as an assistant professor.
Fig. 7. The distribution of e-mails along the 24 hours of the day