E-mail Classification in An Instance-Based System Using Header Information and Text Mining Techniques

    E. ParsaeiMehr, M. Ganj, and E. BehroozianNejad

    Abstract: The increasing volume of unsolicited bulk e-mail, known as spam, is a growing annoyance for most Internet users. Although several machine learning techniques have been proposed for filtering it, an instance-based system produces fewer false positive errors. In this paper we present an instance-based system in which the spam training data set is clustered. Our evaluation shows that the new system has a false positive error no higher than that of a simple instance-based system while offering a better response time. Furthermore, we analyze the time field in the header information of e-mails to determine whether there is any particular pattern in the times at which spammers send spam.

    Index Terms: Text mining, classification, clustering.

    1 INTRODUCTION

    In recent years, e-mail has become a common and important medium of communication for most Internet users. However, spam, also known as unsolicited commercial or bulk e-mail, is a bane of e-mail communication. One study estimated that over 70% of today's business e-mail is spam [1]. The growing volume of spam causes many serious problems: it fills users' mailboxes, buries important personal e-mails, wastes storage space and communication bandwidth, and consumes users' time in deleting spam. Spam e-mails vary significantly in content and roughly belong to the following categories: money-making scams, fat loss, business improvement, sexually explicit material, friend-making, service provider advertisements, etc.

    Several solutions have been proposed to overcome the spam problem. Among them, much interest has focused on the instance-based technique, because it produces fewer false positive errors than other machine learning techniques. A user cannot tolerate a legitimate e-mail, which may be very important, not being delivered to him; a spam filtering technique with a very low false positive error is therefore vital.

    In this paper we extend an instance-based system with a clustering approach. The proposed system offers a number of advantages in the spam filtering domain. As filters are adapted to contend with today's types of spam, spammers alter and obfuscate their e-mails so that they look more like legitimate e-mail and confuse the filters. This dynamic nature of spam means that any filter must be updated if it is to remain successful over time. In this regard, the proposed system can learn from misclassified e-mails. For this purpose the system has two bases: Hard Ham, in which legitimate e-mails misclassified as spam by our system are stored, and Hard Spam, in which spam e-mails misclassified as legitimate are stored. These two bases are used when classifying newly arrived e-mails, to avoid misclassifying e-mails that have the same header information as those found in one of the two bases.

    Moreover, because the spam e-mails in the training data set are clustered, the representative vectors of the clusters take part in the calculations instead of every spam e-mail in the clusters. The number of calculations therefore decreases significantly, and the response time of our system drops.

    Furthermore, the information in the e-mail header, such as from, return path, content type, language and attachment, is examined in order to avoid false positive errors. The evaluation shows that our system has less false positive error, which can be very harmful for the user, than an instance-based system without clustering.

    Finally, we analyzed the time field of all legitimate and spam e-mails to detect patterns in when users and spammers send e-mails. The results are described in Section 5. The rest of the paper is organized as follows: Section 2 outlines related work on e-mail classification techniques. Our proposed system is described in Section 3, while Section 4 presents the performance evaluation and results of our system. Section 5 describes the results of analyzing the time field of e-mails. Finally, the paper ends with a conclusion in Section 6.

    E. ParsaeiMehr is with the Department of Computer, Azad University, Shoushtar Branch, Shoushtar, Khouzestan, Iran.

    M. Ganj is with the Department of Computer, Azad University, Shoushtar Branch, Shoushtar, Khouzestan, Iran.

    E. BehroozianNejad is with the Department of Computer, Azad University, Shoushtar Branch, Shoushtar, Khouzestan, Iran.

    JOURNAL OF COMPUTING, VOLUME 4, ISSUE 12, DECEMBER 2012, ISSN (Online) 2151-9617

    https://sites.google.com/site/journalofcomputing

    WWW.JOURNALOFCOMPUTING.ORG 33

    2012 Journal of Computing Press, NY, USA, ISSN 2151-9617

    2 RELATED WORKS

    The techniques that have been developed for spam filtering fall into three categories: rule-based, content-based and memory-based. Rule-based techniques inspect keywords or clues in the body and header of e-mails. Blacklists, whitelists and Ripper belong to this category. These methods try to detect patterns, such as particular words in the body or subject.

    Content-based techniques rely on the observation that spam e-mails usually have specific content that distinguishes them from legitimate e-mails. Classification based on content is therefore more sound than rule-based classification, and it also reduces the false positive error. Most methods in this category use machine learning techniques. A first example is Naive Bayes, in which the classifier computes the likelihood that an e-mail is spam given the features it contains [1]. The model output by the Naive Bayes algorithm labels examples based on the features that they contain. Another content-based method is the SVM [2],[3]. Its key concepts are that there are two classes, $y_i \in \{-1, 1\}$, and $N$ labeled training examples $(x_1, y_1), \ldots, (x_N, y_N)$, $x_i \in \mathbb{R}^d$, where $d$ is the dimensionality of the feature vector. If the two classes are linearly separable, one can find an optimal weight vector $w^*$ such that $\|w^*\|^2$ is minimum and $y_i (w^* \cdot x_i - b) \ge 1$. The distance between the two hyperplanes $w^* \cdot x - b = \pm 1$ defines a margin, and this margin is maximized when the norm of the weight vector $\|w^*\|$ is minimum. The next method is C5.0 (decision tree), in which each branch node represents a choice between a number of alternatives [1] and each leaf node represents a decision. The neural network is another content-based method. It is a non-linear feed-forward network with the sigmoid activation function $f(x) = 1/(1 + e^{-x})$ [4]. This activation function produces an output between 0 and 1, and the e-mail is classified as junk if the output value is above 0.5.

    The third category is memory-based. The memory-based approach has less false positive error than other ML (machine learning) techniques. It stores all training instances in a memory structure and uses them for classification. The most basic instance-based method is the k-nearest neighbor (k-NN) algorithm [5]. In this algorithm the distance between the new e-mail and every sample in the training set is calculated, and the samples are ranked by distance. The k samples nearest to the new e-mail are then used to assign it a class.

    Another method that uses this technique is the Instance-Based (IB) system. When a new e-mail arrives, the system executes four stages: retrieve, reuse, revision and retain [6],[7]. In the retrieve stage, the IB system selects the e-mails with the highest similarity to the arrived e-mail. The system then assigns a class label to the incoming e-mail using a unanimous voting algorithm over the M' instance messages that are most similar to the arrived e-mail: each message in M' casts one vote, and counting the votes gives an initial classification. Before the system gives its final response, a revision stage is carried out when the assigned class is spam. This re-evaluation aims to guarantee the accuracy of the proposed solution [6]. In this situation, the IB system uses the knowledge extracted from the message header (from, return path, content type, language, attachment) to generate a final answer. Concretely, the system searches the instance base for spam e-mails written in the same language as the new one. If any is found, the IB system classifies the incoming message as spam; otherwise it labels the e-mail as legitimate. If the incoming e-mail was labeled as legitimate at the adaptation stage, the revision phase does nothing. Every time the user checks his mailbox and gives feedback to the system about a previous e-mail classification, the system stores a new instance in the e-mail base for future use.
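The k-NN ranking and the voting step described above can be illustrated with a minimal Python sketch. It is not the implementation used in [5] or [6]: cosine similarity stands in for the unspecified distance measure, unanimous voting is read here as "spam only if every retrieved neighbour is spam", and the toy instance base and all function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def k_nearest(query, instances, k):
    """Return the labels of the k stored instances most similar to the query."""
    ranked = sorted(instances, key=lambda inst: cosine(query, inst[0]), reverse=True)
    return [label for _, label in ranked[:k]]

def majority_vote(labels):
    """Plain k-NN: the most frequent label among the neighbours wins."""
    return Counter(labels).most_common(1)[0][0]

def unanimous_vote(labels):
    """Assumed reading of unanimous voting: spam only if all neighbours agree."""
    return "spam" if all(label == "spam" for label in labels) else "legitimate"

# Hypothetical toy instance base: (term-weight vector, class label).
instances = [
    ({"free": 1.0, "money": 1.0}, "spam"),
    ({"free": 1.0, "offer": 1.0}, "spam"),
    ({"meeting": 1.0, "agenda": 1.0}, "legitimate"),
    ({"free": 1.0, "lunch": 1.0, "meeting": 1.0}, "legitimate"),
]
query = {"free": 1.0, "money": 1.0, "offer": 1.0}

neighbours = k_nearest(query, instances, k=3)
print(neighbours, majority_vote(neighbours), unanimous_vote(neighbours))
```

With three neighbours, majority voting labels the toy query as spam while unanimous voting keeps it legitimate, which shows how the stricter vote trades false negatives for fewer false positives.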

    3 PROPOSED SYSTEM

    This section outlines the CIB system, an IB system in which a clustering approach is used; the simple IB system is modified accordingly. In the CIB system each e-mail is an instance represented as a vector of attributes, or features. The features are words that occur in the body or header of e-mails. The CIB system has four phases: preprocessing, feature selection, clustering and IB classification. Each phase is described below.

    3.1 Preprocessing

    In this phase, the initial data, which are e-mails in the form of text files, are transformed into a form that the system can process. Specifically, HTML tags in the text files are removed, and the words in the header and the body of each e-mail are then extracted and stored in a word base.

    3.2 Feature Selection

    In this phase, the most important words, taken as the features that are useful for better classification, are selected. First, the words in the stop list are removed from the feature vectors of e-mails; thus words such as "a", "an" and "the", which are found in both legitimate and spam e-mails, are dropped. Second, a feature selection algorithm is applied to reduce the dimensionality of the feature space; in this paper Information Gain (IG) [8] was used to select the most predictive features, as it has been shown to be an effective technique for aggressive feature removal in text classification [9]. Finally, the TF-IDF technique [2] was applied to form the weight vectors of the instances.
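The preprocessing and feature selection phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions, the three-word stop list, the binary "term occurs in the e-mail" formulation of Information Gain, the plain tf * log(N/df) weighting, and the four toy e-mails are all hypothetical choices, since the paper does not specify these details.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative stop list (assumption)

def tokenize(raw):
    """Strip HTML tags, lowercase, split into words, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def entropy(pos, neg):
    """Binary entropy (in bits) of a class split with pos/neg counts."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def information_gain(term, docs, labels):
    """IG of the binary feature 'term occurs in the e-mail' w.r.t. the class."""
    n = len(docs)
    spam = sum(1 for y in labels if y == "spam")
    gain = entropy(spam, n - spam)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    for part in (with_term, without_term):
        if part:
            s = sum(1 for y in part if y == "spam")
            gain -= len(part) / n * entropy(s, len(part) - s)
    return gain

def tf_idf(docs, features):
    """Weight vectors: term frequency times log(N / document frequency)."""
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in features}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append({t: counts[t] * math.log(n / df[t]) for t in features if counts[t]})
    return vectors

# Hypothetical toy corpus.
raw_emails = [
    "<html><b>Free</b> money offer</html>",
    "Win free money now",
    "Meeting agenda for the project",
    "Project meeting at noon",
]
labels = ["spam", "spam", "legitimate", "legitimate"]

docs = [tokenize(r) for r in raw_emails]
vocabulary = sorted({w for d in docs for w in d})
# Keep the 4 terms with the highest IG (ties broken alphabetically).
features = sorted(vocabulary, key=lambda t: (-information_gain(t, docs, labels), t))[:4]
vectors = tf_idf(docs, features)
print(features)
print(vectors[0])
```

Terms such as "free" that occur only in one class receive the maximum gain of 1 bit on this toy corpus, whereas a term that occurs in a single e-mail receives a much lower score and is dropped.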

    3.3 Clustering

    In this phase, the new idea is added to the simple IB system. We used the k-means algorithm to cluster the spam instances in the training data set. We tried various values of k in order to find an efficient number of clusters; the results, reported in Section 4, show that 10 clusters give efficient performance.
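As an illustration of the clustering phase, the following is a minimal k-means sketch in plain Python. It assumes that the representative vector of a cluster is its centroid (the mean of its members), which the paper does not state explicitly; the random initialization, the seed, and the two-dimensional toy points are hypothetical.

```python
import random

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, assignment) after Lloyd-style k-means iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random points
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: no point changed cluster
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return centroids, assignment

# Hypothetical spam weight vectors forming two obvious groups.
spam_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
representatives, assignment = k_means(spam_vectors, k=2)
print(sorted(representatives))
```

In the CIB setting the call would use the TF-IDF vectors of the spam training e-mails and k = 10, and only the returned representatives would be compared with an incoming e-mail in the retrieve stage.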

    3.4 Classification

    In this phase we have changed the four stages of the simple IB system to adopt the clustering approach. Fig. 1 shows the architecture of this phase; the sequence of the stages is labeled with numbers. In Fig. 1 there are two bases named HS (Hard Spam) and HH (Hard Ham): HS is a base of misclassified spam e-mails, whereas HH is a base of misclassified legitimate e-mails. The four stages of this phase are described below.

    3.4.1 Retrieve Stage

    In this stage, the similarity of the incoming e-mail to the legitimate e-mails and to the representative vectors of the clusters is calculated. Hence, instead of calculating the similarity between the arrived e-mail and all spam instances, as a simple IB system does, only the representative vectors of the clusters are examined.

    3.4.2 Reuse Stage

    If the instance selected in the retrieve stage is a cluster, the incoming e-mail is temporarily labeled as spam; otherwise, it is labeled as legitimate.
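The retrieve and reuse stages can be sketched together as follows. The sketch assumes cosine similarity (the paper only speaks of "similarity"), a single nearest instance, and dense weight vectors; the function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(email, legit_instances, cluster_reps):
    """Find the most similar candidate among legitimate e-mails and cluster
    representatives; individual spam instances are not compared here."""
    candidates = [("legitimate", i, v) for i, v in enumerate(legit_instances)]
    candidates += [("cluster", i, v) for i, v in enumerate(cluster_reps)]
    kind, index, _ = max(candidates, key=lambda c: cosine(email, c[2]))
    return kind, index

def reuse(kind):
    """Temporary label: spam if the retrieved instance is a cluster."""
    return "spam" if kind == "cluster" else "legitimate"

# Hypothetical weight vectors.
legit_instances = [[0.0, 0.9, 0.1], [0.1, 0.8, 0.0]]
cluster_reps = [[0.9, 0.0, 0.1], [0.5, 0.1, 0.8]]

kind, index = retrieve([0.8, 0.1, 0.2], legit_instances, cluster_reps)
print(kind, index, reuse(kind))
```

The number of similarity computations is the number of legitimate instances plus the number of clusters, independent of how many spam e-mails were used for training.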

    3.4.3 Revision Stage

    This stage, which examines the class of the incoming e-mail more closely, consists of three parts. The first part is executed when the temporary class declared in the reuse stage is spam:

        The system searches the most similar cluster for spam e-mails written in the same language and with the same header information. If any is found, the CIB system classifies the incoming message as spam; otherwise, it labels the e-mail as legitimate.

    If the class declared in the first part is spam, the second part is executed:

        The system searches the HH (Hard Ham) base for legitimate e-mails written in the same language and with the same header information. If any is found, the CIB system classifies the incoming message as legitimate; otherwise, it labels the e-mail as spam.

    If the class declared in the first part or in the reuse stage is legitimate, the third part is executed:

        The system searches the HS (Hard Spam) base for spam e-mails written in the same language and with the same header information. If any is found, the CIB system classifies the incoming message as spam; otherwise, it labels the e-mail as legitimate.
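The three parts of the revision stage amount to a small decision procedure, sketched below. The sketch assumes that "same language and same header information" means exact equality of the five header fields named in the paper; the `Header` class, the `revise` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Header:
    """Header fields used for matching (exact equality is an assumption)."""
    sender: str
    return_path: str
    content_type: str
    language: str
    has_attachment: bool

def revise(temp_class, header, cluster_headers, hard_ham, hard_spam):
    """Return the final class given the temporary class from the reuse stage.

    cluster_headers: headers of the spam e-mails in the most similar cluster
    hard_ham:        headers stored in the HH base
    hard_spam:       headers stored in the HS base
    """
    label = temp_class
    if label == "spam":
        # Part 1: confirm spam only if the cluster holds a matching header.
        label = "spam" if header in cluster_headers else "legitimate"
        if label == "spam":
            # Part 2: a match in HH overrides the spam decision.
            return "legitimate" if header in hard_ham else "spam"
    # Part 3: the label is legitimate here; a match in HS overrides it.
    return "spam" if header in hard_spam else "legitimate"

h = Header("ads@example.com", "ads@example.com", "text/html", "en", False)
print(revise("spam", h, {h}, set(), set()))
print(revise("spam", h, {h}, {h}, set()))
print(revise("legitimate", h, set(), set(), {h}))
```

Because the headers are hashable, each lookup is a set membership test, so the revision stage adds little to the response time compared with the similarity computations of the retrieve stage.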

    3.4.4 Retain Stage

    Every time the user checks his mailbox and gives feedback to the system about a previous e-mail classification, the system stores the misclassified e-mails in the two bases of hard instances for future use. Thus, if the user declares that a legitimate e-mail was misclassified as spam, the system stores it in the HH base; if he declares that a spam e-mail was misclassified as legitimate, the system stores it in the HS base. Fig. 1 gives more details.
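A minimal sketch of the retain stage follows. It uses the definitions of the two bases given in the introduction and at the start of Section 3.4 (HH holds legitimate e-mails misclassified as spam, HS holds spam e-mails misclassified as legitimate). Representing a stored e-mail by a tuple of header fields is an assumption, and the function name is hypothetical.

```python
def retain(header, predicted, actual, hard_ham, hard_spam):
    """Store a misclassified e-mail's header in the matching hard base.

    predicted: class assigned by the system ("spam" or "legitimate")
    actual:    class reported by the user's feedback
    """
    if predicted == actual:
        return  # correctly classified e-mails are not stored
    if actual == "legitimate":
        hard_ham.add(header)   # legitimate e-mail that was labeled spam
    else:
        hard_spam.add(header)  # spam e-mail that was labeled legitimate

hard_ham, hard_spam = set(), set()
# Hypothetical header tuples: (from, return path, content type, language, attachment)
newsletter = ("news@example.org", "news@example.org", "text/html", "en", False)
promo = ("win@example.net", "bounce@example.net", "text/html", "en", True)

retain(newsletter, predicted="spam", actual="legitimate",
       hard_ham=hard_ham, hard_spam=hard_spam)
retain(promo, predicted="legitimate", actual="spam",
       hard_ham=hard_ham, hard_spam=hard_spam)
print(len(hard_ham), len(hard_spam))
```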

    4 PERFORMANCE EVALUATION

    In this section the proposed system is compared with a simple IB system and the results are analyzed. The measures by which we evaluated the performance of our system are the FP rate, FN rate, precision rate, recall rate and response time. The underlying counts and rates are defined below:

    TP: the number of spam e-mails that are classified correctly
    TN: the number of legitimate e-mails that are classified correctly
    FN: the number of spam e-mails that are classified as legitimate
    FP: the number of legitimate e-mails that are classified as spam

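    $$\text{Recall rate} = \frac{TP}{TP + FN} \times 100 \qquad (1)$$

    $$\text{Precision rate} = \frac{TP}{TP + FP} \times 100 \qquad (2)$$

    $$\text{Error rate} = \frac{FP + FN}{TP + FP + FN + TN} \times 100 \qquad (3)$$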
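Equations (1) to (3) translate directly into code. The sketch below takes the four counts as defined above; the function name and the example counts are hypothetical and are not results from the paper.

```python
def rates(tp, tn, fp, fn):
    """Return (recall, precision, error) in percent, following Eqs. (1)-(3)."""
    recall = 100.0 * tp / (tp + fn)
    precision = 100.0 * tp / (tp + fp)
    error = 100.0 * (fp + fn) / (tp + fp + fn + tn)
    return recall, precision, error

# Made-up counts, for illustration only.
recall, precision, error = rates(tp=90, tn=100, fp=5, fn=5)
print(round(recall, 2), round(precision, 2), error)
```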

    4.1 Dataset

    Table 1 describes the SpamAssassin corpus (publicly available for download at http://spamassassin.apache.org/publiccorpus/) employed in our experiments.

    [Figure: block diagram. Labels recoverable from the diagram: clustering phase; feature vectors; representative vectors of clusters; legitimate e-mail instances; HS and HH bases; header information of hard instances; four numbered stages: 1 retrieve stage (k-NN), 2 reuse stage (investigating the most similar instance), 3 revision stage (looking for the same header), 4 retain stage (feedback); most similar instance; obtained class; declaring final class; misclassified spam; misclassified legitimate.]

    Fig. 1. Architecture of classification phase

    4.2 Results of Performance Evaluation

    In Table 2, the error rate values for various numbers of clusters are shown.

    Table 2 shows that when the number of clusters drops below ten, the error rate of our system increases. This is caused by the greater density of the clusters: because each cluster holds many instances, an instance that is similar to the incoming e-mail has little effect on the representative vector. The system is then likely not to identify that cluster as the most similar one, and an error occurs.

    Table 2 also shows that when the number of clusters is more than ten, the error rate does not improve. As the number of clusters increases, the number of spam instances in each cluster decreases; it therefore becomes likely that not enough spam e-mails with the same header information, which are necessary for the unanimous vote, will be found.

    In the experiments, we set k = 1 (the number of voters) in the unanimous voting algorithm in the retrieve stage of both the IB system and our system.

    Figs. 2 to 6 show the results of the performance evaluation of both systems.

    Table 3 shows the average response time of both systems. The difference between the response times of the two systems stems from the different number of comparisons that each performs for classification.

    The average response time of our system is lower than that of the simple IB system. When our system declares the legitimate class, it compares the arrived e-mail only with all legitimate e-mails and the ten representative vectors of the clusters. When it declares the spam class, it must additionally compare the e-mail with all spam instances in the most similar cluster. In the simple IB system, by contrast, declaring the class of an incoming e-mail requires comparison with all stored instances (spam and legitimate e-mails), so the response time increases.
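The comparison counts behind this argument can be made concrete with a short calculation. The numbers below are illustrative assumptions only (225 legitimate and 460 spam training e-mails, 10 clusters, and a most similar cluster of 46 spam e-mails, i.e. an even split); they are not measurements from the paper, and the function names are hypothetical.

```python
def comparisons_simple_ib(n_legit, n_spam):
    """Simple IB system: compare with every stored instance."""
    return n_legit + n_spam

def comparisons_cib(n_legit, n_clusters, declared_spam, cluster_size=0):
    """CIB system: legitimate instances plus cluster representatives, plus
    the members of the most similar cluster when spam is declared."""
    base = n_legit + n_clusters
    return base + cluster_size if declared_spam else base

# Illustrative (assumed) sizes.
simple = comparisons_simple_ib(n_legit=225, n_spam=460)
cib_legit = comparisons_cib(n_legit=225, n_clusters=10, declared_spam=False)
cib_spam = comparisons_cib(n_legit=225, n_clusters=10, declared_spam=True,
                           cluster_size=46)
print(simple, cib_legit, cib_spam)
```

Under these assumptions the simple IB system performs 685 comparisons per e-mail, while the clustered system performs 235 when it declares the legitimate class and 281 when it declares spam.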

    5 RESULTS OF ANALYSIS OF THE TIME FIELD

    For this analysis, we took the time field of all e-mails in the data set and rounded it to the nearest hour (e.g. 6:45:33 to 7 and 6:12:34 to 6).

    Fig. 7 shows the percentage distribution of spam and legitimate e-mails over the 24 hours of a day.
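The rounding and the hourly distribution can be sketched as follows. Two details are assumptions not stated in the paper: a time of exactly half past the hour is rounded up, and a result of 24 wraps around to hour 0. The function names and sample timestamps are hypothetical.

```python
from collections import Counter

def nearest_hour(timestamp):
    """Round an 'H:MM:SS' time string to the nearest hour in 0..23."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    rounded = hours + (1 if minutes * 60 + seconds >= 1800 else 0)
    return rounded % 24  # assumption: 23:30 and later wrap to hour 0

def hourly_percentages(timestamps):
    """Percentage of e-mails falling in each of the 24 hours."""
    counts = Counter(nearest_hour(t) for t in timestamps)
    total = len(timestamps)
    return {hour: 100.0 * counts.get(hour, 0) / total for hour in range(24)}

# Hypothetical time fields.
sample = ["6:45:33", "6:12:34", "7:10:00", "23:45:00"]
distribution = hourly_percentages(sample)
print(nearest_hour("6:45:33"), nearest_hour("6:12:34"))
print(distribution[6], distribution[7], distribution[0])
```

Computing one such distribution for the spam e-mails and one for the legitimate e-mails gives the two series that a plot like Fig. 7 compares.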

    TABLE 1
    DataSet Description

                        Legitimate e-mail    Spam
    Training stage      225                  460
    Test stage          75                   140
    Total e-mails       300                  700

    Fig. 5. Recall rate for the analyzed systems

    TABLE 2
    Error Rate Values for Various Numbers of Clusters

    Number of clusters    6    8    10     12     14     16     18     20
    Error rate            8    4    2.5    2.5    2.5    2.5    2.5    2.5

    Fig. 6. Error rate for the analyzed systems

    Fig. 2. False Positive error rate for the analyzed systems

    Fig. 4. Precision rate for the analyzed systems

    Fig. 3. False negative error rate for the analyzed systems

    From Fig. 7, we can conclude that:

    1. There is no special time for sending spam e-mails; they can be sent at any hour of the day.

    2. The belief that spammers work at night, when fewer Internet users are active and more bandwidth is available, is false: between hours 2 and 7 the percentage of spam e-mails sent is lower than that of legitimate e-mails. This may have the following causes:

        - The spammers use their victims, who are other Internet users, to send their e-mails, and these victims may be active at any time of the day.

        - A company may have put one of its employees in charge of sending many advertising e-mails to the inboxes of many users throughout the world. He therefore does this during his working hours, not at night.

    3. In the hours in which most users send their e-mails (e.g. 4, 7 and 9), the activity of the spammers is at its lowest.

    6 CONCLUSION

    In this article we presented an improved instance-based system for spam filtering. The idea was to cluster the spam instances so that, in the retrieve stage, the incoming e-mail is compared with the representative vectors of the clusters instead of with all spam messages. The number of comparisons and the average response time therefore decrease. Furthermore, the time field of both spam and legitimate e-mails was analyzed and the results were described. The comparison between our system and the simple IB system shows that the performance measures have improved.

    ACKNOWLEDGMENTS

    The research work in this paper is supported by Azad University, Shoushtar Branch, Iran.

    REFERENCES

    [1] D. Shih, H. Chiang, and B. Lin, "Collaborative Spam Filtering with Heterogeneous Agents," Expert Systems with Applications, 2008.

    [2] H. Drucker, D. Wu, and V. Vapnik, "Support Vector Machines for Spam Categorization," IEEE Transactions on Neural Networks, 1999.

    [3] M. Islam, W. Zhou, and M. Choudhury, "Dynamic Feature Selection for Spam Filtering Using Support Vector Machine," International Conference and Information Science, 2007.

    [4] B. Yu and Z. Xu, "A Comparative Study for Content-based Dynamic Spam Classification Using Four Machine Learning Algorithms," Knowledge-Based Systems, 2008.

    [5] C. Lai, "An Empirical Study of Three Machine Learning Methods for Spam Filtering," Knowledge-Based Systems, 2007.

    [6] F. Riverola, E.L. Iglesias, F. Díaz, J.R. Méndez, and J.M. Corchado, "SpamHunting: An Instance-based Reasoning System for Spam Labelling and Filtering," Decision Support Systems, 2007.

    [7] S. Delany, P. Cunningham, A. Tsymbal, and L. Coyle, "A Case-based Technique for Tracking Concept Drift in Spam Filtering," Knowledge-Based Systems, 2005.

    [8] D. McSherry, "Explaining the Pros and Cons of Conclusions in CBR," Proc. of the 7th European Conference on Case-Based Reasoning, Madrid, Spain, 2004.

    [9] L. Galavotti, S. Fabrizio, and M. Simi, "Feature Selection and Negative Evidence in Automated Text Categorization," ICML, 1999.

    Elham ParsaeiMehr received the B.S. degree from Shahid Chamran University, Ahvaz, Iran in 2004, and the M.Sc. degree from Azad University, Ahvaz, Iran in 2010. She is currently with the Department of Computer Science, Azad University of Shoushtar, Khouzestan, Iran, as a tutor.

    Mohsen Ganj received the B.S. degree from Azad University, Dezful, Iran in 2004, and the M.Sc. degree from Azad University, Dezful, Iran in 2011. He is currently with the Department of Computer Science, Azad University of Shoushtar, Khouzestan, Iran, as a tutor.

    Ebrahim BehroozianNejad received the B.S. degree from Azad University, Dezful, Iran in 2004, the M.Sc. degree from Azad University, Tehran, Iran in 2006, and the Ph.D. degree from Azad University, Tehran, Iran in 2011. He is currently with the Department of Computer Science, Azad University of Shoushtar, Khouzestan, Iran, as an assistant professor.

    Fig. 7. The distribution of e-mails along the 24 hours of the day
