Data Mining Approach for Deceptive Phishing Detection System

International Journal of Scientific Research Engineering & Technology (IJSRET), Volume 2, Issue 6, pp. 337-344, September 2013, www.ijsret.org, ISSN 2278-0882

IJSRET @ 2013


Mohd. Sirajuddin1, Mr. N. Pavan Kumar2, Ms. R. Divya3, M.A.Rasheed4

1 M.Tech. (CSE) Student, Al Habeeb College of Engineering & Technology, Chevella, Andhra Pradesh, INDIA.

2 Asst. Prof., Department of CSE, Al Habeeb College of Engineering & Technology, Chevella, Andhra Pradesh, INDIA.

3 Asst. Prof., Department of CSE, Al Habeeb College of Engineering & Technology, Chevella, Andhra Pradesh, INDIA.

4 Asst. Prof., Dept. of IT, Muffakham Jah College of Engg. & Tech., Hyderabad, INDIA.

Abstract- Deceptive phishing is a major problem in Instant Messengers (IMs): much sensitive and personal information is disclosed through socially engineered text messages. A solution has been proposed for text messages [2], but detection of phishing carried out through voice chat in Instant Messengers has not yet been addressed, which motivates this work. We propose a solution to this privacy problem in IMs that uses Association Rule Mining (ARM), a data mining approach, integrated with a speech recognition system. Words are recognized from speech with the help of FFT spectrum analysis and LPC coefficients. Online criminals nowadays use voice chat together with text messages, or either of them alone, in IMs to draw out personal information, which threatens privacy. To preserve privacy, we developed and experimented with an Anti Phishing Detection (APD) system for IMs that detects deceptive phishing in text and audio collaboratively.

Keywords- Data Mining; Instant Messenger; Deceptive Phishing; Association Rule Mining (ARM); Anti Phishing Detection (APD); Speech Recognition system; Fast Fourier Transform (FFT); Linear Predictive Coding (LPC)

I. INTRODUCTION

Phishing is a fraudulent trick for stealing a victim's personal information by sending spoofed, socially engineered messages through Instant Messengers. Over the past decades online identity fraud has grown from a small-scale attack into widespread syndicated crime, as identified in e-mails. Concrete work exists to detect deceptive phishing in Instant Messengers for text messages [2], but it is ineffective for voice chat, the fastest means of communication, which e-criminals have now adopted [3].

Data mining techniques emerged to address the problem of understanding ever-growing volumes of structured and unstructured data; the Association Rule Mining technique finds frequent patterns within large data sets [4].

In an Instant Messenger [2] the phisher tries to find out passwords and security-related information through questions, pretending to be a trustworthy chatmate, through voice chat, text messages, or both collaboratively at different intervals of time. In IMs, deceptive phishing has to be tackled dynamically, and no robust techniques have yet been developed to do this, as the existing anti-phishing techniques are equipped to deal with static phishing [5][6].

In the static anti-phishing technique, a blacklist of suspected mail IDs is maintained in centralized blacklist servers [5], which disseminate the vetted blacklist to end users for enforcement. These techniques are ineffective for detecting phishing in Instant Messengers. Two categories of deceptive phishing attacks are popularly employed in IMs: Password Phishing Scenarios and Security Question Phishing Scenarios.

In the second scenario the phisher tries to trace out personal information by acting as a trustworthy chatmate, and thereby gains access to confidential data.

To our knowledge there is no robust technique to deal with such attacks in IMs [6]. This is the first attempt to apply the Association Rule Mining technique to the tables/log files extracted from the transaction database (TDB), using the Information Retrieval system [19] discussed in this paper, when text or audio messages are exchanged between chatmates in an Instant Messenger, as shown in Table 1.

Our contribution is the integration of an Instant Messaging system with a phishing detection system, using the data mining technique of association rules [6] and an information retrieval technique, to detect dynamic phishing in instant messages for both voice and text messages exchanged. In the remainder of the paper the term "messages" covers both voice and text messages, and the term "phishing" means deceptive phishing. The proposed system, named the Anti-phishing Detection system (APD), detects phishing in Instant Messengers.

In this paper we propose an APD that dynamically traces any potential phishing attack when messages are exchanged between chatmates of an Instant Messaging system. Current Instant Messaging systems do not have any means of dealing with phishing.

The remainder of this paper is organized as follows. This section provides an overview of Instant Messaging systems and their existing deficiencies. Section II explains the problem statement and the work done to date, whereas Section III explains the detailed architecture of the proposed APD-IM system and the steps for integrating speech detection with text messages



Chatmate-1 | Chatmate-2
Hello do u hav any pets? | s I hav 2
Whats ur fav food | My fav food is pizza
Who was ur fav teacher | I have many
What is ur fav past time | I play number games
What is ur lucky no | My lucky no is 9

(a) First transaction for first day

Chatmate-1 | Chatmate-2
Where do u stay or asl please | I stay at xxxxxxxxx
In which school did u study | I did my schooling at yyyy
What was ur fav subject | Xxxxxxxxxx
What is ur age | 25, and what about urs
25 years 2 months 2 days 24 hours old | Oh interesting
What is your dob | 20-10-1979
You are 5 months elder than me, | May be not sure
Can I call you my big brother, if don't mind | Hey its ok.

(b) Second transaction for second day

Chatmate-1 | Chatmate-2
I was tired standing at my bank today | Where do u have account?
I have at xxxx place where do u have? | I hav at xxxx.
Where is the location of ur bank? | near to xxxx place
Do u hav online account? | Yes do u?
I have to create one. Do u have any idea about the username | We can keep ids or names in capital letters
ok, thanks for giving advice | Its all right

(c) Third transaction for third day

Chatmate-1 | Chatmate-2
Just a minute, what passwords do you suggest for my account to be highly secure | Keep your Employeeid,
Hmm... Its not so secure, as everyone knows it. | keep your name and use special characters at beginning or end
Its fine, what are special characters | @,~!@#$%, or Shift+number
that many its too complex to remember | hey don't worry its easy to remember
oh is it.... | remember the numbers eg DOB:20-10-1979, press shiftkey+number

(d) Fourth transaction for fourth day

Chatmate-1 | Chatmate-2
Is the procedure for creating online account same as normal account? | First u need to go to bank and show all ur proofs.
Can I use the same technique of creating pswds? | Yes,u can use special Characters as I told earlier.
Is it safe to use special characters as passwords? | Obviously. Its difficult to trace.

(e) Fifth transaction for fifth day

(f) Sixth transaction for sixth day ... Nth transaction for Nth day

collaboratively in IMs. Detection of phishing messages in text is possible [2]; detecting phishing words from audio messages, with or without text messages, is explained in this paper. Section III also explains the general process followed in the proposed system. Section IV shows experimental results, with the patterns generated for threshold support and confidence during a traditional phishing scenario for different transactions. Section V concludes the paper with an outlook on future research: IMs must be enhanced to detect video phishing collaboratively with audio and text messages, integrated efficiently with 3G and 4G mobile technologies at high processing speeds.

Table 1. Chat between two chatmates. Words marked in blue indicate audio speech, whereas black indicates normal text messages exchanged; xxxx and yyyy represent place names.

II. PROBLEM STATEMENT IN INSTANT MESSENGERS AND RELATED WORK

As many as 98,256 phishing attacks were analyzed by the APWG in 2011 [3]; phishers are constantly experimenting and adapting. In a typical phishing scenario through mail, the phisher sets up a fake website, tricks people into logging in to the fake page, and collects confidential and personal information, particularly in the e-banking sector. In most countries Instant Messengers have become a useful everyday tool [8] for quick response; studies of IM text messaging and file transfer frequency discuss worms, their analysis, and countermeasures in IMs [9].

Popular systems such as AOL Instant Messenger, MSN Messenger, ICQ, Yahoo Messenger, Google Talk, Skype, and Internet Relay Chat (IRC) have changed the way we communicate with friends, acquaintances, and business colleagues. Once limited to desktops, popular Instant Messaging systems are finding their way onto handheld devices and cell phones, allowing users to chat from virtually anywhere. The number of corporate instant messaging users is expected to grow to over 500 million by 2012, with an additional 800 million home computer users having IM systems. Unfortunately, while IM systems have the ability to fundamentally change the way we communicate and do business [7], many of today's implementations pose security challenges. Most IM systems presently in use were designed with scalability rather than security in mind with respect to deceptive phishing attacks. Some freeware IM programs lack encryption capabilities, and most have features that bypass traditional corporate firewalls, making it difficult to control instant messaging usage. Some of these systems have insecure password management and are vulnerable to account spoofing and denial-of-service (DoS) attacks. Even worse, no firewall on the market today can scan instant messages for deceptive phishing. While instant messaging may seem like a new technology, it is actually decades old.

The IRC system, developed in 1988 by Jarkko Oikarinen, is still in use. It allows users to form ad hoc discussion groups, chat peer-to-peer with one another, and exchange files, features seen today in many different messengers that provide the same basic service without detecting deceptive phishing messages.

The basic Instant Messaging architecture provides chat, news alerts, and conferences. Instant Messaging resources include a web server and a Lightweight Directory Access Protocol (LDAP) server [10]. In this scenario, first, the LDAP server provides user entries for authentication and lookup; second, chatmates download the Instant Messaging resources from the web server or system application server; third, chatmates are always connected to the Instant Messaging server through an Instant Messaging multiplexor that supports text, audio, and video chatting dynamically.

A comparative study of the AOL, Yahoo, and MSN Instant Messengers, with a taxonomy of features and functions, is discussed in [11] along with the protocols used for passing instant messages. The ability of IM to collect and analyze information in an e-learning environment [14] gave users a flexible, easy learning methodology, coupled with presence and availability management services emerging as a killer application in wireless and wireline networks [12]. Filtering and spam detection gave new life to IMs [9]. Integration of IMs into mobile collaborative learning helped mobile users [13], but the ability to detect and filter deceptive phishing in IMs remains incomplete for audio and text messages.

A phishing detection tool [14] and security and identification indicators for browsers against spoofing and phishing attacks [15], [16] are known, but detecting and identifying phishing websites in real time is a difficult task, as it depends on many factors such as URL and domain identity and security and encryption [17]. Other work identifies the vulnerabilities that allow phishing sites to be created and suggests methods to identify common attacks, helping webmasters and their hosting companies defend their servers [18], and discusses the legal risks for phishing researchers [6]. Nowadays people carry out social phishing in IMs via text and audio messages. Phishing messages in IMs can be detected if text messages alone are sent [2]. But if text and audio messages are used collaboratively for sending messages in IMs, it is difficult to detect phishing attacks.

In this paper we propose the APD-IM system for detecting phishing messages, whether they are text messages, audio messages, or both used collaboratively. Most of the work proposed here relates to finding word parameters from speech and detecting phishing from voice, after filtering out unnecessary voices based on word parameters obtained using the FFT and LPC coefficients [23], [25]. Detection of phishing from text messages was already proposed in previous work [2].

This section has described the significant vulnerabilities present in common Instant Messaging systems and the types of attacks that exploit users and lead to phishing.

III. PROPOSED SYSTEM ARCHITECTURE OF APD-IM

In this paper we present an Association Rule Mining technique (Apriori algorithm) [21] to detect deceptive phishing in suspicious messages (audio and text, or either of them) sent using an Instant Messenger between two or more chatmates.

The messages are stored in the transaction database (TDB). Before storing, unnecessary words are filtered out by searching the ignore words database (IGWDB) using Information Retrieval System techniques (stemming, N-grams, ignore words) [19]. The frequently recurring words are extracted from the TDB dynamically using the Association Rule Mining technique [20] and stored in the transaction pattern database (TPDB); Table 4 illustrates a few of the extracted words, with the unique IDs allocated to them. Rules are then framed dynamically for the words in the TPDB that satisfy the user-defined minimum threshold support and confidence [21]. If the condition is true, the phish words are pushed to the phishing database (PDB), and an alert message is triggered from the PDB to the chatmates. The system is developed specifically to detect phishing in unusual and deceptive communication in IMs for text and voice messages. The parameters for voice detection are found using spectrum analysis with the help of the FFT [23] and LPC coefficients [25], simulated in MATLAB [1]; the parameters are used to differentiate one voice (word) from another. The proposed method is implemented in Java and integrated with IM. The implementation has six major functional parts:

1. Voice and Text detection Modified Architecture for IM.

2. Integration of Voice and Text messages in TDB.

3. Voice recognition using spectrum analysis (FFT and LPC coefficient methodologies).

4. Differentiate words based on parameters using MATLAB and using Spectraplus.

5. Rules extraction using Association Rule Mining technique.

6. General algorithmic approach for Voice and Text detection in IMs.

A. Voice and Text detection Modified Architecture for IM

A modified architecture of the voice and text recognition system for IMs is shown in Fig. 1. Audio and text messages are passed by chatmates in IM collaboratively, or either of them alone. Detecting phishing in such cases is a challenging task. Detection of deceptive phishing in IMs for text messages is possible [2]; detecting phishing words from audio messages, with or without text messages, is explained in this paper. The text and voice messages need to be filtered by removing unnecessary words; for this, text messages and voice messages are stored separately in the database. The text and audio messages are later integrated by merging them dynamically, as explained in Section III.B.

Figure 1. APD-IM architecture of the phishing detection system for voice and text messages in IM.

For voice recognition, a long audio track is broken down into smaller clips as shown in Fig. 2; each of these steps is self-explanatory. The audio track may contain break-ups of voice during chat sessions; this noise has to be identified and removed using a Hidden Markov Model (HMM) [24]. Training of voices is not discussed at length in this paper; we considered an ideal situation of sample voices. The working of the voice processor in IM is shown in Fig. 2. The voice processor converts audio clips from *.amr to *.wav format, removes noise [22] from the long audio track, and classifies it into independent short clips [1]. Each independent clip is sent to the voice database (VDB) for storage, where unique IDs are allocated; these act as input to TDB after unnecessary words are filtered out using the IGWDB database with the help of the Information Retrieval System technique [19].

Figure 2. A long audio track is broken into short clips (voices) by the Clip Classifier (Voice Processor), converted into .WAV format, and sent to the database for storage, where unnecessary clips are filtered out and unique IDs are allocated to each clip; the result acts as input for TDB in IM.

The VDB database stores the word parameters of each clip. The word parameters discussed in Section III.D are extracted dynamically by the voice processor, with the help of FFT and LPC coefficient spectrum analysis using MATLAB [1], as shown in Fig. 2. The word parameters stored in VDB for every clip are checked against the IGWDB database, which holds the ignore-word parameters for voice, to filter out unnecessary word parameters; the result is sent to the VWDB database as shown in Fig. 1. Unique IDs are then allocated based on the set of significant word parameters identified, and sent to the TDB database for later processing. There the ARM technique is applied to find frequent occurrences of words, which are sent to the transaction pattern database (TPDB). The ARM technique is reapplied there to find the phish words that satisfy the user-defined minimum threshold support and confidence. Finally, the phish words identified from TPDB are sent to the PDB database, which sends an alert message to the chatmates in IM upon detection of phishing words.

B. Integration of voice and text messages in TDB

The steps involved in integrating voice and text messages in the transaction database (TDB) are as follows (refer to Fig. 1).

1. If a voice message alone is detected, it is handled dynamically by the speech recognition system, which finds the parameters of the voice (peak, frequency, amplitude, THD, etc.) explained in Section III.D. The frequent occurrences of these parameters are captured using the ARM technique [20] and stored in the voice database (VDB). The VDB is immediately compared with the IGWDB, which consists of unnecessary words such as prepositions and articles. Finally the filtered words are chosen [19], allocated unique IDs, and sent for storage in the TDB.

2. If a text message alone is detected, it is stored in WDB, and unnecessary words are filtered out dynamically by comparison with IGWDB using Information Retrieval techniques. Finally the selected words are sent for storage in TDB with unique IDs, as discussed in [2], [19].

3. If voice and text messages are detected collaboratively, the two databases VDB and WDB are merged as one transaction and stored in the Voice Word Database (VWDB), which is then compared with IGWDB to filter out unnecessary words as explained in points 1 and 2. Finally the selected words are allocated unique IDs and sent for storage in TDB.

4. Voice may also contain items such as 2002 spoken as "two zero zero two", or other such words. Handling these remains a challenging task and is out of the scope of this paper; we have considered an ideal situation of voice [1].
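The merge-and-filter steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ignore-word list, the `tokenize` helper, and the way IDs are numbered are hypothetical stand-ins for IGWDB, the IRS technique, and the paper's ID allocation, and the voice words are assumed to have been recognized already.

```python
import re

# Hypothetical ignore-word list standing in for IGWDB (articles, prepositions, etc.).
IGNORE_WORDS = {"a", "an", "the", "of", "in", "at", "to", "is", "do", "u", "ur", "my", "i"}


def tokenize(message):
    """Lower-case a message and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", message.lower())


def build_transaction(text_words, voice_words, ignore_words, word_ids):
    """Merge text and voice words, drop ignore words, and map the rest to unique IDs.

    word_ids is a dict shared across transactions so that a word keeps its ID.
    Returns the set of IDs that would be stored as one TDB transaction.
    """
    merged = list(text_words) + list(voice_words)  # plays the role of VWDB
    transaction = set()
    for word in merged:
        if word in ignore_words:
            continue
        if word not in word_ids:
            # Assumed numbering scheme: I1, I2, ... in order of first appearance.
            word_ids[word] = "I%d" % (len(word_ids) + 1)
        transaction.add(word_ids[word])
    return transaction


word_ids = {}
text = tokenize("Do u hav online account?")
voice = ["password", "account"]  # words assumed already recognized from audio
tdb_row = build_transaction(text, voice, IGNORE_WORDS, word_ids)
```

Sharing `word_ids` across calls keeps IDs stable from one transaction to the next, which is what later pattern mining over the TDB would rely on.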

C. Voice recognition using Spectrum analysis (FFT and LPC coefficient methodologies)

Speech should initially be transformed and compressed in order to simplify subsequent processing. Many signal analysis techniques are available that can extract useful features. Six major spectral analysis algorithms are available, as shown in Fig. 3. Among them the most popular methods are the Fast Fourier Transform (FFT) and Linear Prediction Coefficients (LPC). Speech signals are converted into a spectrum using the FFT [23], which works with complex values. Similarly, using the LPC spectrum program we obtain a spectrum different from the original one, and that spectrum is then analyzed to find other parameters. The structure of a standard speech recognition approach is illustrated in Fig. 4. Word recognition has various applications, such as mobile communication and on-line and off-line communication. We have used it to detect words from voice in Instant Messengers (IMs) in order to detect phishing words.

Figure 3. Different types of spectral analysis algorithms.

Figure 4. General process of word detection from a speech signal.



Word recognition from a speech signal using spectrum analysis involves feature extraction, preprocessing, pattern matching, and decision making. The word parameters are chosen from the spectrum of the speech signal using statistical methods, which give a range of values for each word as output. These parameters help us differentiate words from each other: every word has a bounded range of values that characterizes it [1].

The word parameters calculated by analysis of the spectrum of the speech signal are: mean, median, standard deviation (STD), root mean square (RMS), maximum peak, minimum peak, width of maximum peak, signal-to-noise ratio (SNR), peak frequency, peak amplitude, total power, total harmonic distortion (THD), THD+Noise, and intermodulation distortion (IMD). These parameters can be obtained using MATLAB and SpectraPlus. Each parameter is bounded within a range of values, and based on these bounds we can differentiate one word from another.
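As a rough illustration of how a few of these parameters could be computed and compared against per-word bounds, consider the Python sketch below. It is not the MATLAB or SpectraPlus procedure used in the paper: the function names, the choice of parameters, and the `ranges` dictionary are assumptions made for the example, and the input may be any list of values (time-domain samples or spectrum magnitudes).

```python
import math
import statistics


def word_parameters(values):
    """Return a few of the statistical parameters named above for one clip."""
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "rms": rms,
        "max_peak": max(values),
        "min_peak": min(values),
    }


def within_range(params, ranges):
    """Check that every listed parameter falls inside its (low, high) bound."""
    return all(low <= params[name] <= high for name, (low, high) in ranges.items())
```

A candidate word would then be accepted when `within_range(word_parameters(clip), ranges_for_word)` holds, where `ranges_for_word` is a hypothetical table of bounds learned from training samples.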

D. Differentiate words based on parameters using MATLAB and using Spectraplus

To recognize a spoken word dynamically, we recorded the word, converted it into .wav format, and stored it in a MATLAB dictionary. Digital signal processing is used to convert the clip samples into a series of data that we can interpret from the .wav file; we retrieved these samples using "wavread" in MATLAB. To represent the signal in the frequency domain we used the Discrete Fourier Transform (DFT), where f denotes frequency in hertz and N denotes the window length in samples, computed with the FFT command in MATLAB. This is done because the length of our signal must be a power of two. The real and imaginary components of the FFT of the signal are stored in the vector x, where x reads the file name; the algorithm is shown in Fig. 5.
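For reference, the standard form of the DFT over an N-sample window can be written as below. The symbols X(k), x(n), and the sampling rate f_s are notation assumed here for illustration; they are not taken from the original text.

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1,
\qquad f_k = \frac{k\, f_s}{N}\ \text{Hz}
```

Each X(k) is complex, and its real and imaginary parts are the components stored in the FFT output vector.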

Figure 5. Algorithm that accepts a .WAV file and produces the spectrum of the signal from which word parameters are derived.
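A minimal Python sketch of the idea behind this algorithm is given below. It is not the MATLAB code of Fig. 5: it uses a naive O(N^2) DFT rather than an FFT, and the input is a synthetic tone standing in for samples read from a .wav file. The function names and the test signal are assumptions made for the example.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive DFT; returns the magnitude of each of the first N//2 frequency bins."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def peak_frequency(samples, sample_rate):
    """Return (frequency in Hz, magnitude) of the strongest non-DC bin."""
    mags = dft_magnitudes(samples)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate / len(samples), mags[k]


# Synthetic stand-in for a recorded clip: a 50 Hz tone sampled at 400 Hz, 64 samples.
rate = 400
clip = [math.sin(2 * math.pi * 50 * t / rate) for t in range(64)]
freq, amp = peak_frequency(clip, rate)
```

Peak frequency and peak amplitude are two of the word parameters listed in Section III.C; the others would be derived from the same magnitude spectrum.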

The spectrum of the signal after the algorithm is applied is shown in Fig. 6, plotted in MATLAB as a time vs. frequency graph. The graph is plotted for the significant and insignificant parameters of a word; the insignificant parameters are neglected and the significant ones are chosen for finding the word. Only the significant parameters, which differentiate the words from each other, are sent from VDB to TDB for storage. For example, the parameters selected by FFT spectrum analysis for 5 different samples of the single word 'MURDER' are shown in Table 2. Among these parameters the significant ones are selected, whereas the insignificant ones are neglected because they may not be effective for differentiating the word in TDB.

Figure 6. Spectrum of the signal from which word parameters are derived.

Table 2. Word parameters selected by FFT spectrum analysis for 5 different samples of the word 'MURDER'.

Some word parameters are the same for two different words. In such cases Linear Predictive Coding (LPC) coefficients are effective: the word parameters are recalculated from the spectrum of the speech signal, which helps differentiate the words from each other using the LDA technique [24]. For example, the voice words KILL and BILL yield the same word parameters. Here µ1 and µ2 are the means of the parameters and σ1 and σ2 are the standard deviations for the words KILL and BILL, which are used to differentiate words that share the same parameters [1].
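One common way to use two means and two standard deviations for separating classes is the Fisher discriminant ratio, (µ1 - µ2)^2 / (σ1^2 + σ2^2). The sketch below assumes that this is the kind of criterion intended; the paper does not show its formula here, so the function names and the choice of criterion are assumptions.

```python
import statistics


def fisher_ratio(values_a, values_b):
    """Fisher discriminant ratio (mu1 - mu2)^2 / (sigma1^2 + sigma2^2) for one parameter."""
    mu1, mu2 = statistics.fmean(values_a), statistics.fmean(values_b)
    var1, var2 = statistics.pvariance(values_a), statistics.pvariance(values_b)
    if var1 + var2 == 0:
        # Degenerate case: no spread in either word's samples.
        return float("inf") if mu1 != mu2 else 0.0
    return (mu1 - mu2) ** 2 / (var1 + var2)


def most_discriminating(params_a, params_b):
    """Pick the parameter name whose values best separate two words."""
    return max(params_a, key=lambda name: fisher_ratio(params_a[name], params_b[name]))
```

Given per-sample parameter values for two words, a parameter with a ratio near zero does not separate them, while a large ratio marks a parameter worth keeping as significant.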

Similarly, the significant parameters selected by LPC spectrum analysis for 5 different samples of the word 'MURDER' are shown in Table 3. Finally, correct word recognition is done with the help of the word parameters. We used MATLAB for reading .wav files and then finding the spectrum of the speech signal; SpectraPlus is sometimes also used for analysis of .wav files, depending on the requirement.

Table 3. Significant word parameters selected by LPC spectrum analysis for 5 different samples of the word 'MURDER'.



E. Rules extraction using Association Rule Mining technique

The significant word parameters that differentiate voice words from each other are stored in VDB and compared with the IGWDB database, using the IRS technique, to filter out unnecessary words. They are then sent to the TDB database, where frequent occurrences of voice words are identified using the ARM technique. The ARM technique is then reapplied to the TPDB database, checking the user-defined support and confidence for the voice words, and the resulting phishing words are sent to the PDB database. Finally, on detection of phishing words, the system reports to the chatmate in IM by checking the PDB database.

As explained earlier, unnecessary words are also filtered out of the stored TDB transactions using the IRS techniques discussed in Section II for text messages and Section III for audio messages. Based on the transactions in Table 1, which consists of 5 transactions between two chatmates, 16 keywords are picked up and given unique IDs from ITEM1 to ITEM16, represented as I1...I16, as shown in Table 4.

Table 4. List of a few words chosen based on frequent occurrences captured using the ARM technique from TDB, as discussed in Sections II and III.

Let us assume that the itemsets that satisfy support = 2 or 20% out of 5 different transactions are [{I1,I2,I12}, {I1,I2,I14}, {I1,I2,I15}, {I1,I2,I16}]; these are considered to be the frequent occurrences obtained from TDB. The rules that satisfy confidence = 100% are [{I1^I12=>I2, I2^I12=>I1, I12=>I1^I2, I1^I14=>I2, I2^I14=>I1, I14=>I1^I2, I1^I15=>I2, I2^I15=>I1, I15=>I1^I2, I1^I16=>I2, I2^I16=>I1, I16=>I1^I2}]. These ARM rules are framed, and based on them the items are sent to the PDB database as phishing words. The ARM technique is then applied again on PDB to find the phishing words. Applying the ARM technique twice improves the accuracy of identifying phishing words. The support is set very low because in IM private information is exposed in very little time, within a small number of transactions. While messages are being sent, some words may appear to be phishing words even though they are not; chatmates have to tolerate this while chatting in IMs.

F. General Algorithmic steps for Text and Speech recognition system in IM

Chat messages include both text and audio, and are monitored in IM by the Anti Phishing Detection (APD) system. If phishing words are found, APD-IM sends an alert message to the chatmate at one or both ends, depending on where the APD-IM system is installed; its architecture is shown in Fig. 2. Text words are detected and stored in the WDB database, whereas audio words are stored in the VDB database, as already explained above in Sections III.B to III.E. The overall working steps of the APD-IM system are explained in Fig. 7.

Figure 7. General flow of the APD-IM system for detecting phishing words.

1. The chatmate, enabled with IM support, establishes a connection with the Instant Messaging server, which checks the authentication of the chatmate through the Directory Server. If the chatmate is authenticated, he can start sending messages.

2. The Instant Messenger server forwards messages, which include both audio (VDB) and text (WDB) or either of them, to the transaction database (TDB). TDB stores the messages exchanged between two or more chatmates after unnecessary words are filtered out by checking the ignore word database (IGWDB) using the IRS technique.

3. The Apriori algorithm is applied on TDB, and the patterns detected are stored in the transaction pattern database (TPDB).

4. Apriori is applied again to TPDB to check for phishing words; if detected, they are sent to the phishing database (PDB).

5. If phishing words are detected, a YES is forwarded to the Instant Messenger server, else a NO.

6. If the result is YES, the Instant Messenger sends an alert message to the victim chatmate about the possible phishing attack; if the result is NO, the Instant Messenger server proceeds further.
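The rule-framing step behind steps 3 and 4 can be illustrated with a short Python sketch that computes support counts and confidence for the rules derived from one frequent itemset. The transactions are hypothetical and are not the data behind Table 4; the function names are likewise assumptions.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, transactions, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    whole = support_count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            base = support_count(antecedent, transactions)
            if base == 0:
                continue
            conf = whole / base  # confidence = support(X and Y) / support(X)
            if conf >= min_conf:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical transactions (sets of word IDs); not the data behind Table 4.
tdb = [
    frozenset({"I1", "I2", "I12"}),
    frozenset({"I1", "I2", "I12"}),
    frozenset({"I1", "I2", "I3"}),
    frozenset({"I4", "I5"}),
    frozenset({"I3", "I6"}),
]
rules = rules_from_itemset(frozenset({"I1", "I2", "I12"}), tdb, min_conf=1.0)
```

With these toy transactions the itemset {I1, I2, I12} has a support count of 2, and the rules that reach 100% confidence have the same shape as those listed in Section III.E (two-item antecedents containing I12, and I12 alone as antecedent).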



The working of the APD-IM algorithm is shown in Fig. 8.

Input: Instant messages in the Transaction Database (TDB), day to day

Output: Alert (phishing) message to the IM chatmate if phishing is detected

Do
    // Apply IRS filtering (IGWDB), pick words, and push them to TDB;
    // this includes both Text and Audio (WDB and VDB),
    // merged and stored in VWDB as discussed in Section III.
{
    Do
        // Scan TDB for relevant patterns: apply the Apriori technique to find
        // patterns in TDB and push them to the Transaction Pattern Database (TPDB)
    {
        Call Apriori algorithm and scan TDB   // generates patterns from TDB and stores them in TPDB
        Push patterns to TPDB
    } Until TDB != NULL

    // Apply Apriori to find min_support and confidence for TPDB (user-defined)
    Re-call Apriori algorithm and scan TPDB
    {
        Derive association rules dynamically for freq_words
        Calculate confidence                  // user-defined
        Check the rules satisfying threshold  // user-defined
        If (confidence satisfies threshold value)
            // Pick relevant words; push phishing words to PDB permanently
        {
            Scan TPDB and push words to PDB
        } While TPDB != NULL   // satisfy min threshold support and confidence
        If PDB == TDB          // check whether phish words are detected in TDB
            Report to Instant Messenger chatmate as phishing word
        Else
            Return to IM       // do nothing
    }
} Until TDB != NULL

Figure 8. Algorithm of APD-IM for storing transactions and reporting phishing detection to the IM chatmate.
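The Apriori call at the heart of this algorithm can be sketched as a level-wise search for frequent itemsets. The version below is a textbook-style illustration in Python, not the Java implementation described in Section IV; the transactions are hypothetical, and `min_support` is an absolute count.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support."""
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


# Hypothetical TDB rows (sets of word IDs) used only to exercise the function.
tdb = [
    frozenset({"I1", "I2", "I12"}),
    frozenset({"I1", "I2", "I12"}),
    frozenset({"I1", "I2", "I3"}),
    frozenset({"I4", "I5"}),
    frozenset({"I3", "I6"}),
]
patterns = apriori(tdb, min_support=2)  # plays the role of TPDB
```

The returned dictionary corresponds to what the algorithm pushes to TPDB; the rule-derivation and threshold checks would then run over these patterns.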

IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS

APD-IM is implemented using Apache Tomcat 6.0 as the web server, which creates a separate session for each chatmate with browser support (IExplorer 6.5 or higher); SQL Server 2005 as the database; Java 6.0 for the Apriori algorithm that finds frequent patterns, using the Information Retrieval system technique on the database; and odbc/jdbc drivers for connectivity. The software simulation tools MATLAB and SPECTRAPLUS are also used for spectrum analysis of the speech signal, calculating word parameters dynamically using the FFT and LPC coefficients.

The sequence of steps is shown in Fig. 2. When messages are sent between the chatmates, several databases are used dynamically: chatData/TDB (stores messages between the current chatmates), chatData_bkp (stores historical chat messages), Ignorewords/IGWDB (stores ignore words such as prepositions, which the IRS neglects), phishwords/PDB (stores phishing words detected dynamically), Transpatters/TPDB (stores frequent patterns detected), voicewords/VDB, Textwords/WDB, and voicetextwords/VWDB. Some of them are shown in Fig. 9.

Figure 9. Database tables (TDB, IGWDB, TPDB, PDB, ChatbackupDB, VDB, WDB, and VWDB) used by the DataProcess program.

The DataProcess program detects phishing words using the TPDB database, which consists of patterns generated between the chatmates from the TDB database. The DataProcess program combines the Information Retrieval System technique with the Apriori algorithm and must always be running in an active state; it identifies frequent patterns in the messages and detects phishing words, as shown in Fig. 10. The detected phishing words are added to the PDB database.

Figure 10. The DataProcess program identifies frequent occurrences of patterns using the ARM technique (Apriori is used) with minimum support and minimum confidence.

The DataProcess program checks that the number of lines exchanged between the chatmates is below 25 (a user-defined limit). The APD-IM system is tested on the number of transactions (lines) between the chatmates, with user-defined minimum support and minimum confidence, versus the number of phishing words detected from the transaction patterns database (TPDB); the results are shown as graphs in Fig. 11(a) and Fig. 11(b).

It is observed that as the number of transactions between the chatmates increases (145 lines), the transaction patterns and phishing words follow a constant straight line, as seen in Fig. 11(b). The system may then fail to detect phishing words as predicted, so frequent deletion of transactions is required.

[Figure: columnar graph of the number of transaction patterns and phishing words detected; series: Transaction Patterns, Phishing words.]

Figure 11. (a) Columnar graph of transaction patterns vs. phishing words detected from transactions for min-skewed-support (2, 3, 4, 5, 6) and min-conf 60%.
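To make the roles of minimum support and minimum confidence in the DataProcess program concrete, the following Python sketch shows a generic Apriori-style search over chat lines treated as word sets. It is an illustration only, not the authors' Java code: the function names, the toy transactions, and the use of an absolute support count are assumptions; the thresholds (support 2, confidence 60%) are borrowed from the settings reported for Fig. 11.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; min_support is an absolute count."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Toy chat lines after ignore-word removal (hypothetical data).
lines = [{"send", "password"}, {"send", "password", "account"},
         {"send", "account"}, {"weather", "today"}]
frequent = frequent_itemsets(lines, min_support=2)
rules = association_rules(frequent, min_confidence=0.6)
```

With these toy lines, the pair {send, password} occurs twice and is kept, while {password, account} occurs once and is discarded. Raising the support threshold shrinks the set of patterns, which is the relationship the graphs in Fig. 11 explore.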



REFERENCES

[1] Gurpreet Singh, “Word recognition from speech signal using spectrum analysis and LPC,” thesis submitted at Thapar University, 2011.

[2] M. Mahmood Ali and L. Rajamani, “Phishing Detection in Instant Messengers using Data Mining Approach,” proceedings of ObCom 2011, Springer-Verlag Berlin Heidelberg, part I, CCIS 269, pp. 490–502, 2012.

[3] “APWG phishing activity trends till December 23rd, 2011.” [Online] http://www.antiphishing.org/phishReportsArchive.html.

[Figure: graph. Y-axis: transaction patterns and phishing words detected; X-axis: total number of transactions between users; series: Transaction Patterns, Phishing words.]

Figure 11. (b) Transaction patterns vs. phishing words detected vs. min-skewed support, with min-conf 60%.

V. CHALLENGES AND FUTURE WORK

APD-IM is designed to detect deceptive phishing in messages in text and audio format. We have shown experimental results for text messages and for acoustic voice messages (converted into words). Extending APD-IM to video instant messaging is considerably more complex, because it requires integrating one more sub-component, Image Processing, into the Multiplexer to capture images from run-time video; this will be discussed later.

Other issues yet to be addressed are:

• Short forms must be mapped to their expansions and stored in a table with unique identifiers.

• When voice input contains numbers, converting between numerals and words (for example, the numeral ‘0’ and the word ‘zero’) is still a challenging task; dates and fractional numbers (5/2) in speech require similar conversion.

• A number may be spoken as “double two” (22), and Roman numerals raise similar problems. A short form such as “kg” can mean kilogram or something else.

• Instant Messengers must be enhanced to detect video phishing collaboratively with audio and text messages.
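The number and short-form issues listed above can be illustrated with a small Python sketch of a token normalizer. It is a hypothetical helper, not part of APD-IM: the tables `DIGIT_WORDS`, `REPEATS` and `SHORT_FORMS`, their entries, and the function name `normalize_tokens` are assumptions, and the sketch does not handle dates, fractions, Roman numerals, or ambiguous short forms such as "kg".

```python
DIGIT_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
REPEATS = {"double": 2, "triple": 3}
# Hypothetical short-form table; a real one would be stored in the database.
SHORT_FORMS = {"pwd": "password", "acct": "account"}


def normalize_tokens(tokens):
    """Merge spoken digits into numerals and expand known short forms."""
    words = [t.lower() for t in tokens]
    result = []
    run = ""  # consecutive spoken digits collected so far
    i = 0
    while i < len(words):
        word = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else None
        if word in REPEATS and nxt in DIGIT_WORDS:
            # "double two" -> "22"
            run += DIGIT_WORDS[nxt] * REPEATS[word]
            i += 2
        elif word in DIGIT_WORDS:
            run += DIGIT_WORDS[word]
            i += 1
        else:
            if run:
                result.append(run)
                run = ""
            result.append(SHORT_FORMS.get(word, word))
            i += 1
    if run:
        result.append(run)
    return result


spoken = normalize_tokens("my pin is double two zero one".split())
```

A word such as "double" is only treated as a repeat marker when a digit word follows it, so ordinary phrases like "double check" pass through unchanged.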

The outlook is promising: APD-IM can be enhanced to meet the requirements of wireless and mobile Instant Messengers for 3G and 4G technologies. APD-IM can be integrated into Instant Messengers if IM distributors are willing to share the data in order to avoid deceptive phishing attacks; we have tested it on our own Instant Messenger test bed.

[4] Ahmed Jawad, Asim Karim, and Imadullah Khan, “Online algorithms for complete itemset counts using set-to-string mappings,” published by IEEE in 2006.

[5] Michael Atighetchi and Partha Pal, “Attribute-based Prevention of Phishing Attacks,” Copyright 2009, BBN Technologies.

xplore in 2009.

[7] HwaMin Lee, Doosoon Park, and Min Hong, “An instant messenger

'08: Proceedings of the 9th ACM SIGITE conference on Information technology education.

Internet in South Korea,” Journal of Computer-Mediated Communication in 2004.

[9] Zhijun Liu, Weili Lin, and Na Li Lee, “Detecting and filtering instant messaging spam - a global and personalized approach,” at Secure Network Protocols (NPSec), 1st IEEE ICNP Workshop on 6 Nov. 2005.

[10] Salim, et al., “Data Retrieval and Security Using Lightweight Directory Access Protocol,” at Knowledge Discovery and Data Mining, 2009. WKDD 2009. Second International Workshop in 2009.

[11] R.B. Jennings, et al., “A study of Internet instant messaging and chat protocols,” IEEE Network, vol. 20, issue 4, pp. 16-21, July-Aug. 2006.

[12] Debbabi and M. Rahman, “The war of presence and instant messaging: right protocols and APIs,” Consumer Communications and Networking Conference, 2004. CCNC 2004. First IEEE on Jan. 2004.

[13] Fu Kai Fang, “Design and implementation of an instant messaging architecture for mobile collaborative learning,” at Computing, Communication, Control, and Management, 2009. CCCM 2009. ISECS International Colloquium on Aug. 2009.

[14] Weider D. Yu, Shruti Nargundkar, and Nagapriya Tiruthani, “A Phishing Detection Tool,” at 33rd Annual IEEE International Computer Software and Applications Conference, Washington, USA, July 2009.

[15] Amir Herzberg and Ahmad Jbara, “Security and Identification Indicators for Browsers against Spoofing and Phishing Attacks,” at ACM Transactions on Internet Technology, Vol. 8, No. 4, Article 16, September 2008.

[16] Juan Chen and Chuanxiong Guo, “Online Detection and Prevention of Phishing,” at Communications and Networking in China, First International Conference in 2006.

[17] Maher Aburrous, et al., “Modelling Intelligent Phishing Detection System for e-Banking using Fuzzy Data Mining,” at International Conference on CyberWorlds in 2009.

[18] Wardman, B. Shukla, and G. Warner, “Identifying vulnerable websites by analysis of common strings in phishing URLs,” at eCrime Researchers Summit, eCRIME '09, Oct. 2009.

[19] Gerald J. Kowalski and Mark T. Maybury, “Information storage and retrieval system theory and implementation,” second edition, 2006, published by Springer.

[20] R. J. Bayardo, “Efficiently mining long patterns from database,” in Proceedings of the 1998 ACM SIGMOD International conference on Management of data, 1998, pp. 85-93.

[21] R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in Proceedings of the ACM - Special Interest Group on Management of Data (ACM SIGMOD), 1996, pp. 1-12.

[22] Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Feb. 1989, published by IEEE.

[23] Jose Leonardo Plaza Aguilar and David Báez López, “A Voice Recognition System for Speech Impaired People,” published by IEEE at CONIELECOMP, 2004.

[24] Hamid Sheikhzadeh and Li Deng, “Waveform based speech recognition using Hidden Filter Model parameter selection and sensitivity to power normalization,” IEEE Transactions on Audio and Speech Processing, vol. 2, January 1994.

[25] Ibrahim N. Abu-Isbeih, Khaled Dagrouq, and Wael Ali-Sawalmeh, “Speaker identification wavelet transform based method,” IEEE 5th International Multi-Conference on Systems, Signals and Devices, 2008.
