8
Association Rule Mining for Suspicious Email Detection: A Data Mining Approach S.Appavu alias Balamurugan, Aravind, Athiappan, Bharathiraja, Muthu Pandian and Dr.R.Rajaram Abstract-Email has been an efficient and popular Work done by various researches suggests that communication mechanism as the number of internet user's deceptive writing is characterized by reduced increase. In many security informatics applications it is frequency of first-person pronouns and exclusive important to detect deceptive communication in email. This l J 1 1 paper proposes to apply Association Rule Mining for Suspected words and elevated frequency of negative emotion Email Detection.(Emails about Criminal activities).Deception words and action verbs [KS05]. We apply this model theory suggests that deceptive writing is characterized by of deception to the set of E-mail dataset and reduced frequency of first person pronouns and exclusive words preprocess the email body and to train the system we and elevated frequency of negative emotion words and action used Apriori algorithm to generate a classifier that verbs We apply this model of deception to the set of Email l s . . dataset, then applied Apriori algorithm to generate the rules categorize the email as deceptive or not. .The rules generated are used to test the email as deceptive or 1.1. Motivation not. In particular we are interested in detecting emails about Concern about National security has increased criminal activities. After classification we must be able to differentiate the emails giving information about past criminal Si ctly sinceThe terrorIs anttak ondra activities(Informative email) and those acting as September 2001.The CIA, FBI and other federal alerts(warnings) for the future criminal activities. This agencies are actively collecting domestic and foreign differentiation is done using the features considering the tense intelligence to prevent future attacks. These efforts used in the emails. Experimental results show that simple have in turn motivated us to collect data's and Associative classifier provides promising detection rates. undertake this paper work as a challenge. Index Terms- Data Mining, Deceptive Theory, Association Rule Data mining is a powerful tool that enables criminal Mining, Apriori algorithm, Tense. investigators who may lack extensive training as 1. INTRODUCTION data analyst to explore large databases quickly and E-mail has become one of today's standard means of efficiently. Computers can process thousands of communication. The large percentage of the total instructions in seconds, saving precious time. In traffic over the internet is the email. Email data is addition, installing and running software often costs also growing rapidly, creating needs for automated less than hiring and training personality. Computers ¢ . r 1 . ~~are also less prone to errors than human analysis. So, to detect crime a spectrum of techniques aeas espoet rosta ua analysis. So,todetctcrimeaspectrumotechniqus investigators. So this system helps and supports the should be applied to discover and identify patterns .v .ar and make predictions. ivsiaos and make predictions. To our knowledge, this is the first attempt to apply Data mining has emerged to address problems of Association rule mining to task of suspicious Email understanding ever-growing volumes of information Detection (Emails about criminal activities). The for structured data, finding patterns within data that are used to develop useful knowledge. As individuals rasoni th e have iluded gthe conc e incrasethei usge o elctroic ommuicaion extracting the informative emails using the tense n ' (Past tense) of the verbs used in the emails. Apart there has been research into detecting deception n from the informative emails, other emails are these new forms of communication. Models of considered as the alerting emails for the future deception assume that deception leaves a footprint. occurrences of hazard activities. The remainder of this paper is organized as follows: S.Appavu alias Balamurugan is with the Dept of Information Section 2 gives an overview of Problem Statement & Technology, Thiagarajar College of Engineering, Madurai-15, related work in Email classification. In section 3 we Tamilnadu, India.E-mail: app [email protected] introduce our new Suspicious Email detection approach. Experimental results are described in Dr.R.Rajaram is with the Dept of Computer Science, section 4 .We summarize our research and discuss Thiagarajar College of Engineering, Madurai-15, Tamilnadu, som fuue workadizeon in sect 5. India. 2. PROBLEM STATEMENTS AND RELATED WORK l1-4244-1l330-3/07/$25.OO 02007 IEEE. 31 B

Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

Association Rule Mining for SuspiciousEmail Detection: A Data Mining Approach

S.Appavu alias Balamurugan, Aravind, Athiappan, Bharathiraja, Muthu Pandian andDr.R.Rajaram

Abstract-Email has been an efficient and popular Work done by various researches suggests thatcommunication mechanism as the number of internet user's deceptive writing is characterized by reducedincrease. In many security informatics applications it is frequency of first-person pronouns and exclusiveimportant to detect deceptive communication in email. This l J 1 1paper proposes to apply Association Rule Mining for Suspected words and elevated frequency of negative emotionEmail Detection.(Emails about Criminal activities).Deception words and action verbs [KS05]. We apply this modeltheory suggests that deceptive writing is characterized by of deception to the set of E-mail dataset andreduced frequency of first person pronouns and exclusive words preprocess the email body and to train the system weand elevated frequency of negative emotion words and action used Apriori algorithm to generate a classifier thatverbs We apply this model of deception to the set of Email

l s. .dataset, then applied Apriori algorithm to generate the rules categorize the email as deceptive or not..The rules generated are used to test the email as deceptive or 1.1. Motivationnot. In particular we are interested in detecting emails about Concern about National security has increasedcriminal activities. After classification we must be able todifferentiate the emails giving information about past criminal Si ctly sinceThe terrorIs anttak ondraactivities(Informative email) and those acting as September 2001.The CIA, FBI and other federalalerts(warnings) for the future criminal activities. This agencies are actively collecting domestic and foreigndifferentiation is done using the features considering the tense intelligence to prevent future attacks. These effortsused in the emails. Experimental results show that simple have in turn motivated us to collect data's andAssociative classifier provides promising detection rates. undertake this paper work as a challenge.Index Terms- Data Mining, Deceptive Theory, Association Rule Data mining is a powerful tool that enables criminalMining, Apriori algorithm, Tense. investigators who may lack extensive training as

1. INTRODUCTION data analyst to explore large databases quickly and

E-mail has become one of today's standard means of efficiently. Computers can process thousands ofcommunication. The large percentage of the total instructions in seconds, saving precious time. Intraffic over the internet is the email. Email data is addition, installing and running software often costsalso growing rapidly, creating needs for automated less than hiring and training personality. Computers

¢ . r 1 . ~~are also less prone to errors than humananalysis. So, to detect crime a spectrum of techniques aeas espoet rosta uaanalysis. So,todetctcrimeaspectrumotechniqus investigators. So this system helps and supports theshould be applied to discover and identify patterns .v .ar

and make predictions. ivsiaosand makepredictions. To our knowledge, this is the first attempt to applyData mining has emerged to address problems of Association rule mining to task of suspicious Emailunderstanding ever-growing volumes of information Detection (Emails about criminal activities). Thefor structured data, finding patterns within data thatare used to develop useful knowledge. As individuals rasoni th e have iluded gtheconc e

incrasetheiusge o elctroic ommuicaion extracting the informative emails using the tensen' (Past tense) of the verbs used in the emails. Apartthere has been research into detecting deception n from the informative emails, other emails arethese new forms of communication. Models of considered as the alerting emails for the future

deception assume that deception leaves a footprint. occurrences of hazard activities.The remainder of this paper is organized as follows:

S.Appavu alias Balamurugan is with the Dept of Information Section 2 gives an overview of Problem Statement &Technology, Thiagarajar College of Engineering, Madurai-15, related work in Email classification. In section 3 weTamilnadu, India.E-mail: app [email protected] introduce our new Suspicious Email detection

approach. Experimental results are described inDr.R.Rajaram is with the Dept of Computer Science, section 4 .We summarize our research and discussThiagarajar College of Engineering, Madurai-15, Tamilnadu, som fuueworkadizeon in sect 5.India.

2. PROBLEM STATEMENTS AND RELATEDWORK

l1-4244-1l330-3/07/$25.OO 02007 IEEE. 31 B

Page 2: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

strengthened. Also we can prevent the occurrences ofIt's hard to remember what our lives were like future attackswithout email. Ranking up there with the web as oneof the most useful features of the Internet, billions of rplriiciir ofDetctiii f Sispdou Emilmessages are sent each year. Though email wasoriginally developed for sending simple textmessages, it has become more robust in the last few Tense &eyears. So, it is one possible source of data fromwhich potential problem can be detected. Thus the -irna1 Ei l S .,.,.:u.E-iilproblem is to find a system that identifies the Tense =Pas>onse F-uturedeception in communication through emails. Evenafter classification of deceptive emails we must beable to differentiate the informative emails from the -alerting emails. We refer to informative emails as Fig. 1. A Tree Structure of Detection of Suspicious Emailthose giving details about the already happened Many techniques such as NaYve bayeshazardous events and the alert emails are those which [LEW98,CDAR97,ABSSOO], Nearest Neighborremain us to prevent those hazard events to occur in [GL97],Support Vector Machines [JOA 98],the fore coming days. Regression [YC94],Decision Trees[ADW98],TF-IDF

Style Classifiers [SM83,BS95,ROC71] andExample of SUSpiCIOUS and normal email. Association classifiers [LHM98,WZL99] have been

Suspicious Email Normal Email developed for text classification.[COH 96] compares results for email classification

Sender: X Sender: y of a new rule induction method and adaptation ofSub: Bomb Blast Sub: Hi Rocchio's relevance feed back algorithm [ROC71] inBody: Today there will be bomb Body: Hope ur fine! [ILA95]. [SDHH98] employs NaYve Bayes Classifierblast in parliament house How are u & familyand the US consulates in members? to filter junk email.[B0098] uses a combination ofIndia at 11.46 am. Stop nearest neighbor and TF-IDF approaches .Naiveit if you could. Cut Bayes classifier is used for classifying email in torelations with the U.S.A.long live Osama multiple categories in [RENOO].Support VectorFinladen Asadullah Alkalfi. machines approach is implemented for email

Example of classifyin Susauthorship classification in [VEOO]. A comparison ofExformampleofeclassifying SsiouinoAetad binary classification using NaYve bayes and decision

trees [QUI93] approaches is performed inAlert Email Informative Email [DLWOO].TF-IDF style classifier defined in [BS95]

is implemented in [SK006] and is extended forSender: X Sender: y incremental case in [SKOOA].Approach toSub: Bomb Blast Sub: WTC AttackedBody: Today there will be bomb Body: The World Trade Center Anomalous email detection S consdered [ZD]blast in parliament house was attacked on 9/11/01 by showed approaches to detect Anomalous emailand the US consulates in Osarna Bin Laden and his involves the deployment of data mining techniques.

Indiaat146ampfollowers. [CMSCT] Proposed a model based on the NeuralIt If you could. Cutrelations with the U.S.A. Network to classify personal emails and the use oflong liveOsamna principal component analysis as a preprocessor of

NN to reduce the data in terms of bothThe informative emails provides us with the data dimensionality as well as size.about the past historical criminal activities by Using association rules for classification was firstenhancing some common sense to us such as in the introduced in [LHM98] and further developed inexample shown above we came to know that these [WZL99,MW99,WZHOO,LIOI].Classification basedtypes of email will never have any consequences in on Association rule (CBA) was introduced infuture. [LHM98] and Multiple Association rule (CMAR)The alert emails were identified using the deceptive introduced in [LHM98,LIOl].[KS05] proposed atheory and the future tense verbs used in the emails. method based on the singular value DecompositionBy which the security enforcing methods can be to detect unusual and Deceptive communication in

313~

Page 3: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

emails. The problem with this approach is that not Lexical analysis is the process of converting an inputdeals with incomplete data in an efficient and elegant steam of characters in to a stream ofwords or tokens.way and can not able to incorporate new data The lexical analysis phase produces candidate termsincrementally without having to reprocess the entire that are further checked and retained if they are notmatrix. in a stop list.No work is known to exist that would test Stop Word RemovalAssociative classification specifically to detect email Stop list is a list of words that are most frequent in aconcerning criminal activities. text corpus and are not discriminative of a message3. THE PROPOSED WORK contents, such as prepositions, pronouns and

In this paper, we present an association rule mining conjunction. Examples of stop words are "the",algorithm (Apriori algorithm) to detect suspicious "and", "about", etc.email and the further classification into the alert and Stemminginformative emails. It is developed specifically for Stemming is the process of suffix removal to generatedetect unusual and deceptive communication in email. word stems. Although not always absolutely true, termsThe proposed method is implemented using the java like "bomb", and "bombing" do not make biglanguage. In implementation, there are three parts: difference for the purpose of distinguishing messagesEmail Preprocessing, Building the associative containing trip bombing, for example, and can all beclassifier and validation.Fig.2. Shows a general replaced by their stem "bomb".framework ofthe Associative classifier construction. Consider the email message on Fig 3.1. and Fig

3.2.The body of the first email content is given in redDjlor to denote the reader that it is a suspiciousZ rPf6f6m-g Bddn

ail Th bod ofth secn email cotn show it ..*

. w f^ = < s-1a~~~~~~~~~snot a sus;piCi0oIS.Message Covst BnTest Dataeag EB ;

ps & Constructions ) PruningS Tioeseration &P ruing

3..e-ailureprocssin Stortingu Actcuracy~pihhe3.U~

tr selection email mesae into a represntoation

Phrase ~~~~~Rule

sutalefrathe Apir Clovrithm

Constructions & PrunEtgractio adi presentaThissertronI)

Fig.2.Classifier Construction frameworksbe tfields iA hd18t tte 11.46 th etse3.1. E-Mail Preprocessing i ti lotc0CoLid. lat r e i tiD WLuithhe J.SA.Email preprocessing involves the process of Inq hve QS Fihhdn Set1hh Alktelft.Ltransdorming emaremessages into a erentisentationsuitableh fot he Apror algoritv1hm

suspicious emails from the normal emails. Then thetense of the separated suspicious emails wereexamined to differentiate the alert and informative Fig.3.1. Semi Structured data (alert email)emails. It consists of the following four steps: Textterm extraction, lexical analysis, stop word removaland stemming.Text term Extractionxtras fromamessage body and s fldutmesnt,

peroringitopliswordsr removal andstfemmniaing.th

LemetacinexicalAnalysisopwod emva

313

Page 4: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

' NE'S'S. I 11- [ ]]mx1]'. E] a='..................S],,t,'ES:M'-|-Brlt.:'1 .....Td..-....... jb f pf M0 Ul, n 5 [l I >sE o """LCOrelfi- [ US 0i:gni" 0-1A]a

TFr| BBBAdomcinnamecorn F it7e -zar ffie go 6rrnnto this nat Cmz U.- an:2;t; lalatO ,elease ,f dl.ractalr talAltlfiptTiofon m-c D :,f our Oebltp-l.)k Faoe itnrg 1;- 3l i"ll ':

SulcWTC a .ac itubjed \fT C st%ck f The Lf 319'listf rlf-renes if shold be llStif morto-,^,> lf lC.f 13 l.r iMC2fO:ittit o,fe-- hatshl,-u Id beiLlt at1'lDtJrf t Wo.iaT}1- M,-9>ot shoTh.l' kk4foUe Ie'}t1'.ifr'rOfll-Or If 'lr '

U

di his tfil:ossti M8 sqtsCqitsOei of this attAsh i tuit fbtes----

5iiB0000 peDple IsEt theit lifti Aiso toe Woldo ttside 4soet, OI d-naKiro-\iAsiu

Fig.4.Email message before preprocessingFig.3.2. Semi structured data (informative email) The email after preprocessing is of the form that

removes space and extra characters and displays only3.1.2. Feature Selection the keywords. The Fig.5. gives the view of the emailBased on the theory of deception a deceptive email after preprocessing.will have highly emotional words and action verbs.

"'fll t"Al-illiiae,-, Irtw b 2er si-se

So, such words are set as keywords and extractedfrom the input dataset. Example for highly emotionalwords and action verbs are "lifeless", "anger", "kill","attack", etc.The fuiture tense denoting keywords fl

such as will, shall, may, might, should, can, could, irh,sik

would are used to indicate that the suspicious emailis of the type alert. The past tense denoting keywordssuch as was, were, etc are used to indicate that thesuspicious email is of the informative type.

Prior to classification, a number of preprocessingsteps were performed Fig.5.Email message after preprocessing

1. Emails were converted to plain -text from The output after the preprocessing is in the table.mbox files. format in which the attributes are given as the table

2. Headers and HTML components were headers and the records are given in column. Theremoved. class attribute is to detect either informative, alert or

3. Body of the message was extracted. ver email.

4. The messages body was tokenized in to EmaUil Tense Boi) Blast TeiToist Attack Tluneateni Class

words, stop words were removed, and wordsa1 Lt ) u iiifoiiiitiwere converted in to lower case.p azst II11 ) ) iiiforimitivewereconverted in to lower case. 3 p1e~~~~~~~~~~~~~~~~~~~~~~~ent y y v y 1p1se aleil

y 1 aeiAfter the above mentioned steps the email is 4 fituii II y 'v y aletgiven to the preprocessing program and the Fig.4. 5 past u u 11 u u iomial

SS + h ss1<1 s; ss1 1 1 Ealtack aLelathi lklieai0=sztatta.:h|_Nloilf:!;anipteS <~~6 pesent 11 1 alei

gives the view of the email before preprocessing. pee y v e_____past 1 U1 U U1 v itforiinative

S past _____ uifuniative9 f"ature II f tense alekt10 flitiue y 11 _ 11 y alert

WOUld~~ ~ ~ ~ ~ ~ ~ ~ ~~al 1:e Final oupuoflct preprocessingou e

is ~~~~ ~ ~ ~ ~ ~ ~ ~~~32 Buldn the Assciaiv Classifier eneentngkewod

L+ A + * A- + +L + +L32

Page 5: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

3.2.1. Email Classification Process: experiment and below is the tabulated result afterEmail Classification is the process of finding a set of preprocessing.models (or functions) that describes and distinguish Email used in the Experiments: We have collecteddata classes and concepts, for the purpose of being over 3000 e-mails through a Brainstorming session,able to use the model to predict class of objects some of them are as follows and the first example iswhose label is unknown. a real example,

__=_= | Email Tee KeVwordlKe1 o IC An example:[ TsDaR j 1WaMs Bomh TeaSorlsToday there will be bomb blast in parliament house

and the US consulates in India at 11.46 am. Stop it ifyou could. Cut relations with the U.S.A. Long liveOsama Finladen Asadullah Alkalfi.

5la.if ca-t Rule -: (tense=v1tI andc kewordl=bomb) ->alertMclel ~RWe 2: (ternse=vras) ancL (keyrword= ateacke d)7?-irfo_rrrative

.EmaiilEmaIl .Items or keywordsID1 .Will,Bomb,Blast,terrorist,attack

tzXeVAze-zaik e No+ kcrd|Class|l

= 1 1 hl hiiacI666666666666666 atta:k A166666666666666666666662May, Terrorist, attack,threaten-3 'Was, Blast, attack, threaten,

E- fiBomb, blast,7, = :s 3terrorist,attack,threaten

2 had--bo-b'blst6 Can, Bomb, terrorist, threatenFig.6. Classification Process '7 Was, Hijack, murder

Fig 6 shows a general framework of the 8 Could, Attack, Disasterclassification process. In the example, objects 9 Was, Terrorist, Bomb, blastcorrespond to email messages and object class labels lo Attack, hijack, murdercorrespond to message type. Every email message Might, ac bomb blast ill,contains two terms and a Tense, that are used to 'demolish, disasterpredict a email is suspicious or not. The training set 12 Will, Attack, hijack, murdercontains three email messages. For each message we 13 Was, Terrorist, Bomb, blastrecorded two terms or more (it can be a single word Shall, Attack, bomb, blast, kill,or a Tense), Terml, Term2 and a Tense. And class demolish, disasterlabel was pre assigned to each message manually.Let the following be a simplified informal algorithm, 15 Will, Hijack, murderfor classification model construction. If one or Table. 2. Sample Feature Selection from emailseveral attributes ai occur together in more than onetransaction assigned the same topic T, then output a 3.2.2. Apriori Algorithm for Suspicious Emailrule ai->T.In the first step we built a classification Detectionmodel. The training data contains two transactions ofclass Alert Email that have keyword Bomb/ Blast/kill Association Rule mining searches for interestingand a tense "will/may" in them, one transaction of association or correlation relationships among itemsclass informative email that have keyword in a given large data set. We model email messagesAttacked/Terrorist and a tense "was". Applying the as transaction where items are words or phrases fromApriori algorithm we obtain a model containing two the email. After preprocessing a email message, byrules shown as the classification model. In the eliminating stop words and stemming, emails aresecond step the model just built is tested using test represented by sets of cleaned words di= t.....t1n} asdata containing two transactions. If accuracy is well as category to which they belong Cj. Themeasured as a percentage of messages correctly Apriori algorithm is used for mining frequent item-classified, If accuracy is not satisfactory then one or sets in transactional databases to find frequent sets ofseveral steps ofthe classifier need to be modified. words in the emails of the training set. Given theWe have used Apriori algorithm to classify the frequent sets of words and topical category assignedemails. These are the few set of emails used in the to the transaction from which they were extracted

association rules are deduced with constraints on the

320

Page 6: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

antecedent and consequent of the rules such that the Association rules are constructed from itemsets. Forantecedent always contains words while the example, the itemset {Tense=past, Attack=Y} occursconsequent is exclusively a topical category. in many ofthe transactions where {Bomb=Y} appear

then the following association rule can be writtenThe input to the association rule generating program {Tense=past, Attack=Y, Bomb=Y}gets the values in numerical order hence we are ->{Class=informative}assigning the values to the attribute as given in the This rule's confidence is the percentagetable. of transactions containing {Tense=past, Attack=Y,}

that also contain {Bomb=Y}. The support for theTable 3: Assigned value for each attribute rule is the number of transactions that contain

Suppose and confidence is the two measures that are {Tense=past, Attack=Y} and {Bomb=Y}.Anused in association rule mining. Support can be association rule can have many items in itsdefined as fraction of transaction that contains both antecedent (left hand side) and many items in its

consequent (right hand side). The rule {Tense=past,Attribute Value 1 Value 2 Value 3 Attack=y, Bomb=Y}-> {Class=informative} hasTense Past =1 Present =2 Future =3 antecedent {Tense=past, Attack=Y,Bomb=Y}andBomb Yes =4 No =5 consequent {Class=Informative}.Blast Yes 6 No =7iTerrorist Yes =8 No=99------- lAttack Yes =1 No =11 -,&IThreaten Yes =12 No=13Class Normal =14 Informative =15 Alert =16X & Y.Confidence measure how often items in Yappear in transaction that contain X.

Market Basket analysis is possibly the largestapplication for algorithms that discover AssociationRules. The application of Association Rules is notrestricted to market basket data. This paper attemptto apply the algorithm for suspicious email detectionas informative or alert emails. The algorithm Fig.7. Reading Input fileproduces a list of Itemsets. Each itemset is a ....... U._.........

~~~~~~~~~~~~~~~~~~~~~~~~~~~~tmtaN 91 1itrWkftnucombination of attribute values. For example,Itemsets for the suspected informative email could M, la- fi

include {Tense=past, Attack= Yes, Bomb Yes} hirlE lt lhaiiWlUlEISiO Xtlhnt iand Item sets for the suspected alert email could Qg¾include {Tense=future, Attack= Yes, Bomb Yes}.The support shows how many cases have the Itemsetvalues {Tense =past, Attack Yes, Bomb Y iEmail =Suspicious informative email} and the 'Iconfidence shows the likelihood of EmailSuspicious or Deceptive for a case having Attack=Y and Bomb Y.Suppose an email contain the item ,'3or keyword {Bomb, Blast, Attack}. The itemsets ofsize one from this basket are {Bomb}, {Blast} and Fig.8. Frequent (Large) item sets generation{Attack}; the itemsets of size two are {Bomb,Blast}, {Blast, Attack} and {Bomb, Attack}; anditemsets of size three are {Bomb, Blast, Attack}.Some itemsets will appear in many different emails.For instance, bomb and blast will be found in manyemail samples: the itemset {bomb, blast} occursfrequently. The support for an itemset is the numberof transactions where that itemset appears.

322

Page 7: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

-. {Tense=Past, Bomb=N, Terrorist=N, Blast=N4__________+- Email=Normal}

Output of Apriori algorithm will be a set of rules foreach category. These rules were then used to classify

4,s3-115 the testing data. In the testing stage, rules generatedMIAE 15

in the training stage to be used to classify theisrJ5n4==|S incoming email. Extracted emails should be:U1 preprocessed before comparing with the generated

t5>8-15 isrules. Processing such as lexical analysis, stoplistl5X5Jr>4X==l5 word removal and stemming should be applied to the

extracted data. Resulting emails should be comparedt4A =lS with each rule and the email is categorized to the4,61}=115 i most exactly matching category. If the left part of the

IAA~10 15-----_- ---------_________________________ rule matches with the email, count the categoryofFig.9.Apriori algorithm for Association rules the rule of the document. For this purpose, we willgeneration assign a priority value for each generated rule.The association rules generated are in numerical Higher priority value will be assigned to largevalues hence the visualized output with respect to the frequent itemsets and vice versa. Since we areoutput column ofthe preprocessing. assigning priority, we can increase the classificationThis Item sets are then used to generate Association accuracy. If a rule matches with the emailRules and one such rule is corresponding priority value will be added to theTense=past, Attack=Y, Bomb =Y -> Lma = category. Finally, the category with the highest valueSuspicious lnformatlve IEmal. will be classification for that email.Tense=future, Attack= Y, Bomb =Y -> Email =

Suspicious alert Email.Tense=future, Attack= N, Bomb =N ->Email =

Normal Email. -a

These rules state that if there is Tense=past, attack&bomb then the email is suspected one withinformation that is it does not have any hazardouseffect in future. <4. EXPERIMENTAL RESULTS

The application of data mining to the task ofsuspected email detection is done; experiments were ___carried out on a small email corpus. A mixturecontaining 1000 informative emails, 1000 alertemails and 1000 normal emails. The system wastrained with the training dataset and the default Fig.1O.Classifioer Testingsupport and confidence threshold were used. When This is the input to the testing stagethe training process was finished, the top 20 bestquality rules were taken as the final classification

Th¢eeWnS; 8 IndtiA alwft t w =R6nifi lrules. Some of the rules generated by the Associationrule based classification are:{ITerrorist=y, Attack=y,Tense=past {Suspicious=informative}IThreaten=y, Tense=fuiturej 4Suspicious=alertlTense=future, Terrorist = Y, Bomb =Y e This is the output that is obtained in the execution{Suspicious =alertj stageTense=future, Attack=y, Blast=yj 4 The frequent itemset {Tense = future, Blast =Y,{Suspicious=alert} Bomb Y} and the resulting association rule is{Tense=past, Terrorist I es=ftueady,as Yadf30b=YteAttack=y}=*{Suspiciousrlnfortnmtivc} fEralSsiiu aet.This is a Suspicious Email

322

Page 8: Association Rule Miningfor Suspicious Email Detection: A ...184pc128.csie.ntnu.edu.tw/presentation/08-10-23/Association Rule Mining... · ofthe mostuseful features ofthe Internet,

of alert type that is it will lead to any consequences [CMSCT]B.Cui, A.Mondal, J.Shen, G.Cong, and K.Tan, "Onin future. Effective Email Classification via Neural Networks," In Proc.5. CONCLUSION AND FUTURE WORKS ofDEXA, 2005, PP.85-94.IDLWOOI Y. Diao, H. Lu, and D. Wu, "A comparative study ofEmail is an important vehicle for communication .It classification-based personal e-mail filtering," In Proc. 4this one possible source of data from which potential Pacific-Asia Conf. Knowledge Discovery and Data Miningproblem can be detected. In this paper, we have (PAKDD'00), pages 408-419, Kyoto, JP, 2000.employed Association rule mining based [JHYZ05] Jie Tang, Hang Li, Yunbo Cao and Zhaohui Tang,classification approach to detect deceptive "Email Data Cleaning," KDD'05, Chicago, USA.

[JHMK] Jiawei Han, Micheline Kamber, "Data Miningcommunication in email text as informative or alert Concepts and Techniques" Morgan Kaufmann Publishers".emails. We can find it that a simple Apriori [Joa98] T. Joachims, "Text categorization with support vectoralgorithm can provide better classification result for machines: learning with many relevant features," In Proc. 10thsuspicious email detection. In the near future, we European Conf Machine Learning (ECML'98), pages 137-142,plan to incorporate other techniques like different Chemnitz, Germany, 1998.

[KS05] P.S.Keila and D.B.Skillicorn, "Detecting unusual andways of feature selection, and Classification using Deceptive Communication in Email," Technical reports June,other methods. One major advantage of the 2005.association rule based classifier is that it does not [LHM] Liu,W. Hsu, and Y. Ma, "Integrating classification andassume that terms are independent and its training is association rule mining," In ACMInt. Conf on Knowledge Discovery

and Data Mining (SIGKDD'98), pages 80-86, New York City, NY,relatively fast. Furthermore, the rules are human August 1998.understandable and easy to be maintained or pruned [MO02] Maria-Luiza Antonie and Osmar R.Zaiane., "Textby human being. In this paper, a method of applying Document Categorization by Term Association," WEEAssociation rule mining for suspected email International Conference on Data Mining (ICDM'2002), PP 19-detection is presented using keyword extraction and 26, Maebashi City, Japan, December 9-12, 2002.[OM02] Osmar R.Zaiane ,Maria-Luiza Antonie,"Classifyingconsidering key attribute called Tense. The proposed Text documents by associating terms with text categories," Inwork will be helpful for identifying the deceptive Proceeding of the Thirteenth Australian Data base Conferenceemail and also assist the investigators to get the (ADC'02), Melbourne, Australia, January 28-February 1, 2002.information in time to take effective actions to [Qui93] J. R. Quinlan. C4. 5: Programs for Machine Learning.reduce the criminal activities. Morgan Kaufmann, San Mateo, CA, 1993.

[RenOO] J. Rennie, "An application of machine learning to e-A problem we faced when trying to test out new mail filtering," In Proc. KDD 2000 workshop on Text Mining,ideas dealing with email systems was an inherent Boston, MA, 2000.limitation of the available data. Because we only [SDHH98] M. Sahami, S. Dumais, D. Heckerman, and E.have access to our own data, our results and Horvitz. "A bayesian approach to filtering junk e-mail," In

exermnt oobtrflcsoeis. oProc. AAAI'98 Workshop Learning for Text Categorization,experiments no doubt reflects some bias. Much of Madison, Wisconsin, 1998.

the work published in the email classification domain [SKOO]R. B. Segal and J. 0. Kephart. Swiftfile, " Intelligentalso suffers from the fact that it tries to reach general assistant for organizing E-mail," In AAAI 2000 Springconclusion using very small data sets collected on a symposium on Adaptive User Interfaces, Stanford, CA, 2000.local scale. [WZHOO] K. Wang, S. Zhou, and Y. He, "Growing decision

REFERENCES trees on support-less association rules," In Proc. 6th ACM-SIGKDD Int. Conf. Knowledge Discovery and Data Mining[ABSSOO] R.Agrawal, R. J. Bayardo, and R. Srikant. Athena (KDD'00), pages 265-269, Boston, MA, 2000."Mining-based interactive management of text databases," In [WZL99] K. Wang, S. Zhou, and S. C. Liew, "BuildingProc. 7th Int. Conf. Extending Database Technology hierarchical classifiers using class proximity," In Proc. 25th Int.(EDBT'00), pages 365-379, Konstanz, Germany, 2000. Conf. Very Large Data Bases (VLDB'99), pages 363-374,[AIS93] R. Agrawal, T. Imielinski, and A. Swami, "Mining Edinburgh, UK, 1999.association rules between sets of items in large databases," In [ZD]Zan Huang and Daniel D.Zeng, "A Link PredictionProc. 1993 ACM-SIGMOD Int. Conf: Management ofData, pages 207- Approach to Anomalous Email Detection,"216, Washington, D.C., May 1993.[AS94] R.Agrawal and R.Srikant, "Fast algorithms for miningassociation rules," In Proc. 20th Int. Conf. Very Large DataBases (VLDB'94), pages 487-499, Santiago, Chile, 1994.[BO098] G. Boone. "Concept features in re:agent, anintelligent email agent," In Proc. 2nd Int. Conf. AutonomousAgents (Agents'98), pages 141-148, New York, 1998.[CDAR971 S. Chakrabarti, B. E. Dom, R. Agrawal, and P.Raghavan, "Using taxonomy, discriminants, and signatures fornavigating in text databases," In Proc. 23rd Int. Conf. VeryLarge Data Bases, pages 446-455, Athens, GR, 1997.

323