
DeltaPhish: Detecting Phishing Webpages in Compromised Websites∗

Igino Corona1,2, Battista Biggio1,2, Matteo Contini2, Luca Piras1,2, Roberto Corda2, Mauro Mereu2, Guido Mureddu2, Davide Ariu1,2, and Fabio Roli1,2

1Pluribus One, via Bellini 9, 09123 Cagliari, Italy
2DIEE, University of Cagliari, Piazza d'Armi, 09123 Cagliari, Italy

Abstract

The large-scale deployment of modern phishing attacks relies on the automatic exploitation of vulnerable websites in the wild, to maximize profit while hindering attack traceability, detection and blacklisting. To the best of our knowledge, this is the first work that specifically leverages this adversarial behavior for detection purposes. We show that phishing webpages can be accurately detected by highlighting HTML code and visual differences with respect to other (legitimate) pages hosted within a compromised website. Our system, named DeltaPhish, can be installed as part of a web application firewall, to detect the presence of anomalous content on a website after compromise, and eventually prevent access to it. DeltaPhish is also robust against adversarial attempts in which the HTML code of the phishing page is carefully manipulated to evade detection. We empirically evaluate it on more than 5,500 webpages collected in the wild from compromised websites, showing that it is capable of detecting more than 99% of phishing webpages, while misclassifying less than 1% of legitimate pages. We further show that the detection rate remains higher than 70% even under very sophisticated attacks carefully designed to evade our system.

∗Preprint version of the work accepted for publication at ESORICS 2017.

1 Introduction

In spite of more than a decade of research, phishing is still a concrete, widespread threat that leverages social engineering to acquire confidential data from victim users [1]. Phishing scams are often part of a profit-driven economy, where stolen data is sold in underground markets [4, 5]. They may even be used to achieve political or military objectives [2, 3]. To maximize profit, as with most current cybercrime activities, modern phishing attacks are automatically deployed on a large scale, exploiting vulnerabilities in publicly-available websites through so-called phishing kits [4, 5, 6, 7, 8]. These toolkits automate the creation of phishing webpages on hijacked legitimate websites, and advertise the newly-created phishing sites to attract potential victims using dedicated spam campaigns. The data harvested by the phishing campaign is then typically sold on the black market, and part of the profit is reinvested to further support the scam campaign [4, 5]. To grasp the scale of this underground economy, note that, according to the most recent Global Phishing Survey by APWG, published in 2014, 59,485 of the 87,901 domains linked to phishing scams (i.e., 71.4%) were actually pointing to legitimate (compromised) websites [8].

Compromising vulnerable, legitimate websites not only enables a large-scale deployment of phishing attacks; it also provides several other advantages for cyber-criminals.


Figure 1: Homepage (left), legitimate (middle) and phishing (right) pages hosted in a compromised website.

First, it does not require them to register domains or deal with hosting services to deploy their scam. This also circumvents recent approaches that detect malicious domains by evaluating abnormal domain behaviors (e.g., burst registrations, typosquatting domain names) induced by the need to automate domain registration [9]. On the other hand, website compromise is only a pivoting step towards the final goal of the phishing scam. In fact, cyber-criminals normally leave the legitimate pages hosted in the compromised website intact. This allows them to hide the presence of website compromise not only from the eyes of its legitimate owner and users, but also from blacklisting mechanisms and browser plug-ins that rely on reputation services (as legitimate sites tend to have a good reputation) [4].

For these reasons, malicious webpages in compromised websites typically remain undetected for a longer period of time. This has also been highlighted in a recent study by Han et al. [4], in which the authors exposed vulnerable websites (i.e., honeypots) to host and monitor phishing toolkits. They reported that the first victims usually connect to phishing webpages within a couple of days after the hosting website has been compromised, while the phishing website is blacklisted by common services like Google Safe Browsing and PhishTank only after approximately twelve days, on average. The same authors also pointed out that the most sophisticated phishing kits include functionality to evade blacklisting mechanisms. The idea is to redirect the victim to a randomly-generated subfolder within the compromised website, where the attacker has previously installed another copy of the phishing kit. Even if the victim realizes that he/she is visiting a phishing webpage, he/she will likely report the randomly-generated URL of the visited webpage (and not that of the redirecting one), which clearly makes blacklisting unable to stop this scam.

To date, several approaches have been proposed for phishing webpage detection (Sect. 2). Most of them are based on comparing the candidate phishing webpage against a set of known targets [10, 11], or on extracting some generic features to discriminate between phishing and legitimate webpages [12, 14].

To our knowledge, this is the first work that leverages the adversarial behavior of cyber-criminals to detect phishing pages in compromised websites, while overcoming some limitations of previous work. The key idea behind our approach, named DeltaPhish (or δPhish, for short), is to compare the HTML code and the visual appearance of potential phishing pages against the corresponding characteristics of the homepage of the compromised (hosting) website (Sect. 3). In fact, phishing pages normally exhibit a much more pronounced difference in aspect and structure with respect to the website homepage than the other legitimate pages of the website do. The underlying reason is that phishing pages must resemble the appearance of the website targeted by the scam, while legitimate pages typically share the same style and aspect as their homepage (see, e.g., Fig. 1).

Our approach is also robust to well-crafted manipulations of the HTML code of the phishing page aimed at evading detection, such as those performed in [15] to mislead Google's Phishing Pages Filter embedded in the Chrome web browser. This is achieved through two distinct adversarial fusion schemes that combine the outputs of our HTML and visual analyses while accounting for potential attacks against them. We consider attacks targeting the HTML code of the phishing page, as altering its visual appearance may significantly affect the effectiveness of the phishing scam. Preserving the visual similarity between a phishing page and the website targeted by the scam is indeed a fundamental trust-building tactic used by miscreants to attract new victims [1].

In Sect. 4, we simulate a case study in which δPhish is deployed as a module of a web application firewall, used to protect a specific website. In this setting, our approach can be used to detect whether users are accessing potential phishing webpages that are uploaded to the monitored website after its compromise. To simulate this scenario, we collect legitimate and phishing webpages hosted in compromised websites from PhishTank, and compare each of them with the corresponding homepage (which can be set as the reference page for δPhish when configuring the web application firewall). We show that, under this setting, δPhish is able to correctly detect more than 99% of the phishing pages while misclassifying less than 1% of legitimate pages. We also show that δPhish can retain detection rates higher than 70% even in the presence of adversarial attacks carefully crafted to evade it. To encourage reproducibility of our research, we have also made our dataset of 1,012 phishing and 4,499 legitimate webpages publicly available, along with the classification results of δPhish.

We conclude our work in Sect. 5, highlighting its main limitations and related open issues for future research.

2 Phishing Webpage Detection

We categorize here previous work on the detection of phishing webpages along two main axes, depending on (i) the detection approach, and (ii) the features used for classification. The detection approach can be target-independent, if it exploits generic features to discriminate between phishing and legitimate webpages, or target-dependent, if it compares the suspect phishing webpage against known phishing targets. In both cases, features can be extracted from the webpage URL, its HTML content and visual appearance, as detailed below.

Target-independent. These approaches exploit features computed from the webpage URL and its domain name [16, 14, 17, 18], from its HTML content and structure, and from other sources, including search engines, HTTP cookies, website certificates [19, 20, 10, 21, 22, 23, 24, 25], and even publicly-available blacklisting services like Google Safe Browsing and PhishTank [26]. Another line of work has considered the detection of phishing emails by analyzing their content along with that of the linked phishing webpages [27].

Target-dependent. These techniques typically compare the potential phishing page to a set of known targets (e.g., PayPal, eBay). HTML analysis has also been exploited to this end, often complemented by the use of search engines to identify phishing pages with similar text and page layout [24, 28], or by the analysis of the pages linked to (or by) the suspect pages [29]. The main difference with target-independent approaches is that most target-dependent approaches consider measures of visual similarity between webpage snapshots or embedded images, using a wide range of image analysis techniques, mostly based on low-level visual features, including color histograms, two-dimensional Haar wavelets, and other well-known image descriptors normally exploited in the field of computer vision [30, 31, 12, 13]. Notably, only a few works have considered the combination of both HTML and visual characteristics [11, 32].

Limitations and Open Issues. The main limitations of current approaches and the related open research issues can be summarized as follows. Although target-dependent approaches are normally more effective than target-independent ones, they require a-priori knowledge of the set of websites that may be potentially targeted by phishing scams, or otherwise try to retrieve them during operation by querying search engines. This makes them clearly unable to detect phishing scams against unknown, legitimate services. On the other hand, target-independent techniques are, in principle, easier to evade, as they exploit generic characteristics of webpages to discriminate between phishing and legitimate pages, instead of making an explicit comparison between webpages. In particular, as shown in [15], it is not only possible to infer enough information on how a publicly-available, target-independent anti-phishing filter (like Google's Phishing Pages Filter) works, but it is also possible to exploit this information to evade detection, by carefully manipulating phishing webpages to resemble the characteristics of the legitimate webpages used to learn the classification system. Evasion becomes clearly more difficult if visual analysis is also performed, as modifying the visual appearance of the phishing page tends to compromise the effectiveness of the phishing scam [1]. However, mainly due to the higher computational complexity of this kind of analysis, only a few approaches have combined HTML and visual features for target-dependent phishing detection [11, 32], and it is not clear to what extent they can be robust against well-crafted adversarial attacks. Another relevant limitation is that no dataset has been made publicly available for comparing different detection approaches on a common benchmark, which clearly hinders research reproducibility.

Our approach overcomes many of the aforementioned limitations. First, it does not require any knowledge of the legitimate websites potentially targeted by phishing scams. Although it may thus be considered a target-independent approach, it is not based on extracting generic features from phishing and legitimate webpages, but rather on comparing the characteristics of the phishing page to those of the homepage hosted in the compromised website. This makes it more robust than other target-independent approaches against evasion attempts in which, e.g., the HTML code of the phishing webpage is obfuscated, as this would make the phishing webpage even more different from the homepage. Furthermore, we explicitly adopt a security-by-design approach while engineering our system, explicitly accounting for well-crafted attacks against it. As we will show, our adversarial fusion mechanisms guarantee high detection rates even under worst-case changes in the HTML code of phishing pages, by effectively leveraging the role of the visual analysis. Finally, we publicly release our dataset to encourage research reproducibility and benchmarking.

3 DeltaPhish

In this section we present DeltaPhish (δPhish). Its name derives from the fact that it determines whether a certain URL contains a phishing webpage by evaluating HTML and visual differences between the input page and the website homepage. The general architecture of δPhish is depicted in Fig. 2. We denote with x ∈ X either the URL of the input webpage or the webpage itself, interchangeably. Accordingly, the set X represents all possible URLs or webpages. The homepage hosted in the same domain as the visited page (or its URL) is denoted with x0 ∈ X. Initially, our system receives the URL of the input webpage x and retrieves that of the corresponding homepage x0. Each of these URLs is passed to a browser automation module (Sect. 3.1), which downloads the corresponding page and outputs its HTML code and a snapshot image. The HTML code of the input page and that of the homepage are then used to compute a set of HTML features (Sect. 3.2). Similarly, the two snapshot images are passed to another feature extractor that computes a set of visual features (Sect. 3.3). The goal of these feature extractors is to map the input page x onto a vector space suitable for learning a classification function. Recall that both feature sets are computed based on a comparison between the characteristics of the input page x and those of the homepage x0. We denote the two mapping functions implemented by the HTML and by the visual feature extractor with δ1(x) ∈ Rd1 and δ2(x) ∈ Rd2, respectively, where d1, d2 are the dimensionalities of the two vector spaces. For compactness of notation, we do not explicitly highlight the dependency of δ1(x) and δ2(x) on x0, even though such functions depend on both x and x0. These two vectorial representations are then used to learn two distinct classifiers, i.e., an HTML- and a Snapshot-based classifier. During operation, these classifiers respectively output dissimilarity scores s1(x) ∈ R and s2(x) ∈ R for each input page x, which essentially measure how different the input page is from the corresponding homepage.


Figure 2: High-level architecture of δPhish. The input URL x and the corresponding homepage x0 are fetched by the browser automation module; HTML features δ1(x) and visual features δ2(x) feed the HTML-based and Snapshot-based classifiers, whose scores s1(x) and s2(x) are combined by the fusion stage into g(x), with g(x) ≥ 0 classified as phishing and g(x) < 0 as legitimate.

Thus, the higher the score, the higher the probability of x being a phishing page. These scores are then combined using different (standard and adversarial) fusion schemes (Sect. 3.4) to output an aggregated score g(x) ∈ R. If g(x) ≥ 0, the input page x is classified as a phish, and as legitimate otherwise.

Before delving into the technical implementation of each module, it is worth remarking that δPhish can be implemented as a module in web application firewalls and, potentially, also as an online blacklisting service (to filter suspicious URLs). Some implementation details that can be used to speed up the processing time of our approach are discussed in Sect. 4.2.

3.1 Browser Automation

The browser automation module launches a browser instance using Selenium1 to gather the snapshot of the landing web page and its HTML source, even if the latter is dynamically generated with (obfuscated) JavaScript code. This is indeed a common case for phishing webpages.

1 http://docs.seleniumhq.org
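As a concrete illustration, this module can be realized in a few lines with Selenium's Python bindings. The following is a minimal sketch assuming a headless Chrome driver and a fixed window size; both are assumptions, as the paper does not specify the browser or its configuration:

```python
# Minimal sketch of the browser automation step (headless Chrome assumed).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_page(url, snapshot_path):
    """Return the rendered HTML of `url` and save a snapshot image."""
    opts = Options()
    opts.add_argument("--headless")           # no visible browser window
    opts.add_argument("--window-size=1366,768")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        html = driver.page_source              # DOM after JavaScript execution
        driver.save_screenshot(snapshot_path)  # rendered snapshot (PNG)
    finally:
        driver.quit()
    return html
```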

3.2 HTML-Based Classification

For HTML-based classification, we define a set of 11 features, obtained by comparing the input page x and the homepage x0 of the website hosted in the same domain. They will be the elements of the d1-dimensional feature vector δ1(x) (with d1 = 11) depicted in Fig. 2. We use the Jaccard index J as a similarity measure to compute most of the feature values. Given two sets A, B, it is defined as the cardinality of their intersection divided by the cardinality of their union:

J(A, B) = |A ∩ B| / |A ∪ B| ∈ [0, 1]. (1)

If A and B are both empty, J(A, B) = 1. The 11 HTML features used by our approach are described below.

(1) URL. We extract all URLs corresponding to hyperlinks in x and x0 by inspecting the href attribute of the <a> tag,2 and create a set for each page. URLs are considered once in each set, without repetition. We then compute the Jaccard index (Eq. 1) of the two extracted sets. For instance, let us assume that x and x0 respectively contain these two URL sets:

Ux : {https://www.example.com/p1/,
https://www.example.com/p2/,
https://support.example.com/}

2 Recall that the <a> tag defines a hyperlink and the href attribute is its destination.


Ux0 : {https://support.example.com/p1,
https://www.example.com/p2/,
https://support.example.com/en-us/ht20}

In this case, since only one element is exactly the same in both sets (i.e., https://www.example.com/p2/), the Jaccard index is J(Ux, Ux0) = 0.2.
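For illustration, a minimal sketch of the Jaccard computation (Eq. 1) and of this URL feature is given below, using BeautifulSoup as one possible HTML parser (an assumption; the paper does not name one). The html_x and html_x0 strings are hypothetical stand-ins for page sources:

```python
# Sketch of feature (1): Jaccard similarity of the hyperlink sets.
from bs4 import BeautifulSoup

def jaccard(a, b):
    """Eq. (1): |A ∩ B| / |A ∪ B|, with J = 1 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def url_set(html):
    """Hyperlink targets of a page: href values of its <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)}

# Hypothetical page sources; in δPhish these come from browser automation.
html_x  = '<a href="https://www.example.com/p2/">login</a>'
html_x0 = '<a href="https://www.example.com/p2/">home</a>'

feat_url = jaccard(url_set(html_x), url_set(html_x0))  # 1.0 in this toy case
```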

(2) 2LD. This feature is similar to the previous one, except that we consider the second-level domains (2LDs) extracted from each URL instead of the full link. The 2LDs are considered once in each set, without repetition. Considering the example given for the previous feature, both Ux and Ux0 will contain only example.com, and, thus, J(Ux, Ux0) = 1.

(3) SS. To compute this feature, we extract the content of the <style> tags from x and x0. They are used to define style information, and every webpage can embed multiple <style> tags. We compare the similarity between the sets of <style> tags of x and x0 using the Jaccard index.

(4) SS-URL. We extract URLs from x and x0 that point to external style sheets by inspecting the href attribute of the <link> tag; e.g., http://example.com/resources/styles.css. We create a set of URLs for x and another for x0 (where every URL appears once in each set, without repetition), and compute their similarity using the Jaccard index (Eq. 1).

(5) SS-2LD. As for the previous feature, we extract all the URLs that link external style sheets in x and x0. However, in this case we only consider the second-level domain of each URL (e.g., example.com). The feature value is then computed again using the Jaccard index (Eq. 1).

(6) I-URL. For this feature, we consider the URLs of linked images in x and in x0, separately, by extracting all the URLs specified in the <img src=...> attributes. The elements of these two sets are image URLs, e.g., http://example.com/img/image.jpg, and are considered once in each set without repetition. We then compute the Jaccard index for these two sets (Eq. 1).

(7) I-2LD. We consider the same image URLs extracted for I-URL, but restricted to their 2LDs. Each 2LD is considered once in each set without repetition, and the feature value is again computed using the Jaccard index (Eq. 1).

(8) Copyright. We extract all significant words, sentences and symbols found in x and x0 that can be related to copyright claims (e.g., ©, copyright, all rights reserved), without repetitions, and excluding stop-words of all human languages. The feature value is then computed using the Jaccard index.

(9) X-links. This is a binary feature. It equals 1 if the homepage x0 is linked in x (accounting for potential redirections), and 0 otherwise.

(10) Title. This feature is also computed using the Jaccard index. We create the two sets to be compared by extracting all words (except stop-words) from the titles of x and x0, respectively, without repetitions. They can be found within the <title> tag, which defines the title of the HTML document, i.e., the one appearing in the browser toolbar and displayed in search-engine results.

(11) Language. This feature is set to 1 if x and x0 use the same language, and to 0 otherwise. To identify the language of a page, we first extract the stop-words of all known human languages from x and x0, separately, and without repetitions. We then assume that the page language is the one associated with the maximum number of matching stop-words.
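A minimal sketch of this stop-word majority vote is shown below; NLTK's stop-word corpus is used as one possible source of per-language stop-word lists (an assumption, as the paper does not name a library):

```python
# Sketch of feature (11): language detection by stop-word counting.
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def detect_language(words):
    """Return the language whose stop-word list best matches `words`."""
    words = {w.lower() for w in words}
    counts = {lang: len(words & set(stopwords.words(lang)))
              for lang in stopwords.fileids()}
    return max(counts, key=counts.get)

def language_feature(words_x, words_x0):
    """1 if both pages use the same language, 0 otherwise."""
    return 1 if detect_language(words_x) == detect_language(words_x0) else 0
```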

Classification. The 11 HTML features map our input page x onto a vector space suitable for classification. Using the compact notation defined at the beginning of this section (see also Fig. 2), we denote the d1-dimensional feature vector corresponding to x as δ1(x) (with d1 = 11). We then train a linear Support Vector Machine (SVM) [33] on these features to classify phishing and legitimate pages. During operation, this classifier computes, for each input page, a dissimilarity score measuring how different the input page is from its homepage:

s1(x) = w1ᵀ δ1(x) + b1. (2)

The feature weights w1 ∈ Rd1 and the bias b1 ∈ R of the classification function are optimized during SVM learning, using a labeled set of training webpages [33].
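For illustration, a minimal sketch of this step with scikit-learn's linear SVM is given below; the random data is a stand-in for the actual 11-dimensional delta vectors, and the fixed C is a placeholder for the cross-validated value (Sect. 4.1):

```python
# Sketch of the HTML-based classifier: a linear SVM on 11 delta features.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
delta1 = rng.rand(200, 11)               # stand-in for δ1(x) vectors
y = rng.choice([-1, 1], size=200)        # -1 legitimate, +1 phishing

clf = LinearSVC(C=1.0).fit(delta1, y)    # C tuned by cross-validation in the paper

# Dissimilarity score s1(x) = w1^T δ1(x) + b1 (Eq. 2).
s1 = clf.decision_function(delta1[:1])
```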

3.3 Snapshot-Based Classification

To analyze differences in the snapshots of the input page x and the corresponding homepage x0, we leverage two state-of-the-art feature representations that are widely used for image classification, namely Histograms of Oriented Gradients (HOGs) [34] and color histograms. We have selected these features since, compared to other popular descriptors (like the Scale-Invariant Feature Transform, SIFT), they typically achieve better performance in the presence of very high inter-class similarities. Unlike HOGs, which are local descriptors, color histograms give a representation of the spatial distribution of colors within an image, providing complementary information for our snapshot analysis.

We exploit these two representations to compute a concatenated (stacked) feature vector for each snapshot image, and then define a way to compute a similarity-based representation from them. The overall architecture of our snapshot-based classifier is depicted in Fig. 3. In the following, we explain in more detail how HOG and color histograms are computed for each snapshot image separately, and how we combine the stacked feature vectors of the input page x and of the homepage x0 to obtain the final similarity-based feature vector.

Image Tiling. To preserve spatial information in our visual representation of the snapshot, we extract visual features not only from the whole snapshot image, but also from its quarters and sixteenths (as depicted in Fig. 4), yielding (1×1) + (2×2) + (4×4) = 21 tiles. HOG descriptors and color histograms are extracted from each tile and stacked, to obtain two vectors of 21 × 300 = 6,300 and 21 × 288 = 6,048 dimensions, respectively.
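The tiling itself reduces to simple array slicing; a minimal sketch under the assumption that the snapshot is an H × W × C array (the dimensions below are hypothetical):

```python
# Sketch of the 21-tile decomposition (whole image, quarters, sixteenths).
import numpy as np

def tiles(img):
    """Yield 1 + 4 + 16 = 21 tiles of `img` (an H x W x C array)."""
    h, w = img.shape[:2]
    for n in (1, 2, 4):                     # 1x1, 2x2, 4x4 grids
        hs, ws = h // n, w // n
        for i in range(n):
            for j in range(n):
                yield img[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]

img = np.zeros((768, 1366, 3), dtype=np.uint8)  # placeholder snapshot
assert len(list(tiles(img))) == 21
```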

HOG features. We compute the HOG features for each of the 21 input image tiles following the steps highlighted in Fig. 5 and detailed below, as in [34]. First, the image is divided into cells of 8×8 pixels. For each cell, a 31-dimensional HOG descriptor is computed, in which each bin represents a quantized direction and its value corresponds to the magnitude of gradients in that direction (we refer the reader to [36, 34, 35] for further details). The second step consists of considering overlapping blocks of 2×2 neighboring cells (i.e., 16×16 pixels). For each block, the 31-dimensional HOG descriptors of the four cells are simply concatenated to form a (31 × 4 =) 124-dimensional stacked HOG descriptor, also referred to as a visual word. In the third step, each visual word extracted from the image tile is compared against a pre-computed vocabulary of K visual words, and assigned to the closest word in the vocabulary (we have used K = 300 visual words in our experiments). Eventually, a histogram of K = 300 bins is obtained for the whole image tile, where each bin represents the occurrence of each pre-computed visual word in the tile. This approach is usually referred to as Bag of Visual Words (BoVW) [37]. The vocabulary can be built using the centroids found by k-means clustering over the whole set of visual words in the training data. Alternatively, a vocabulary computed from a different dataset may also be used.

Color features. To extract our color features, we first convert the image from the RGB (Red-Green-Blue) to the HSV (Hue-Saturation-Value) color space, and perform the same image tiling done for the extraction of the HOG features (see Fig. 4). We then compute a quantized 3D color histogram with 8, 12 and 3 bins for the H, S and V channels, respectively, corresponding to a vector of 8×12×3 = 288 feature values. This technique has been shown to outperform histograms computed in the RGB color space on content-based image retrieval and image segmentation tasks [38].
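A minimal sketch of this histogram computation is given below, using OpenCV for illustration (an assumption; the paper does not name an image library). Note that OpenCV's 8-bit hue channel ranges over [0, 180):

```python
# Sketch of the 8x12x3 HSV color histogram for a single tile.
import cv2

def color_histogram(tile_bgr):
    """288-bin HSV histogram of a tile, normalized to sum to one."""
    hsv = cv2.cvtColor(tile_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None,
                        [8, 12, 3],                 # bins for H, S, V
                        [0, 180, 0, 256, 0, 256])   # channel ranges
    hist = hist.flatten()
    total = hist.sum()
    return hist / total if total > 0 else hist
```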

Both the HOG descriptor and the color histogram obtained from each image tile are normalized to sum up to one (to correctly represent the relative frequency of each bin). The resulting 21 × 300 HOG descriptors and 21 × 288 color histograms are then stacked to obtain a feature vector consisting of d2 = 12,348 feature values, as shown in Fig. 3. In the following, we denote this vector with p for the input page x and with p0 for the homepage x0.

Similarity-based Feature Representation.


Figure 3: Computation of the visual features in δPhish. The input snapshot and the homepage snapshot each undergo image tiling (21 tiles) and color/HOG feature extraction (21 × 288 and 21 × 300 values, respectively); the stacked feature vectors are combined into the similarity-based representation δ2(x), which feeds the Snapshot-based classifier producing s2(x).

Figure 4: δPhish image tiling extracts visual features retaining spatial information.

After computing the visual features p for the input page x and p0 for the homepage x0, we compute the similarity-based representation δ2(x) (Figs. 2-3) from these feature vectors as:

δ2(x) = min(p, p0), (3)

where min returns the coordinate-wise minimum of the two vectors. Thus, the vector δ2(x) also consists of d2 = 12,348 feature values.
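This mapping is a one-liner on the stacked vectors; a minimal sketch with random stand-in features:

```python
# Sketch of Eq. (3): coordinate-wise minimum of the stacked visual vectors.
import numpy as np

rng = np.random.RandomState(0)
p  = rng.rand(12348)          # visual features of the input page x
p0 = rng.rand(12348)          # visual features of the homepage x0
delta2 = np.minimum(p, p0)    # similarity-based representation δ2(x)

# Summing δ2(x) would yield the histogram intersection kernel between the
# two pages; the linear SVM of Eq. (4) instead learns per-coordinate weights.
hik = delta2.sum()
```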

Classification. The similarity-based mapping in Eq. (3) is inspired by the histogram intersection kernel [39], which evaluates the similarity between two histograms u and v as ∑i min(ui, vi). Instead of summing up the values of δ2(x) (which would give exactly the histogram intersection kernel between the input page and the homepage), we learn a linear SVM to estimate a weighted sum:

s2(x) = w2ᵀ δ2(x) + b2, (4)

where, similarly to the HTML-based classifier, w2 ∈ Rd2 and b2 ∈ R are the feature weights and bias, respectively. This enables us to achieve better performance, as, in practice, the classifier itself learns a proper similarity measure between webpages directly from the training data. This is a well-known practice in the area of machine learning, usually referred to as similarity learning [40].

3.4 Classifier Fusion

The outputs of the HTML- and of the Snapshot-based classifiers, denoted in the following with a two-dimensional vector s = (s1(x), s2(x)) (Eqs. 2-4), can be combined using a fixed (untrained) fusion rule, or a classifier (trained fusion). We consider three different combiners in our experiments, as described below.

Maximum. This rule simply computes the overall score as:

g(x) = max(s1(x), s2(x)). (5)

The idea is that, for a page to be classified as legitimate, both classifiers should output a low score.


Figure 5: Computation of the 300 HOG features from an image tile: 8×8-pixel cells yield 31-dimensional HOG descriptors; blocks of 2×2 cells are stacked into 31×4 = 124-dimensional visual words; each visual word is matched to the closest word in a 300-word vocabulary (BoVW), producing a 300-bin frequency histogram for the tile.

If one of the two classifiers outputs a high score and classifies the page as a phish, then the overall system will also classify it as a phishing page. The reason behind this choice is that the HTML-based classifier can be evaded by a skilled attacker, as we will see in our experiments, and we aim to avoid that misleading this classifier suffices to evade the whole system. In other words, we would like our system to be evaded only if both classifiers are successfully fooled by the attacker. For this reason, this simple rule can itself be considered a sort of adversarial fusion scheme.
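The rule itself is trivial; a one-line sketch for clarity:

```python
# Sketch of the max fusion rule of Eq. (5).
def fuse_max(s1, s2):
    """Overall score g(x); g(x) >= 0 means the page is flagged as phishing."""
    return max(s1, s2)

label = "phishing" if fuse_max(-0.8, 0.3) >= 0 else "legitimate"
# One confident base classifier suffices to flag the page.
```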

Trained Fusion. To implement this fusion mechanism, we use an SVM with the Radial Basis Function (RBF) kernel, which computes the overall score as:

g(x) = ∑ⁿᵢ₌₁ yi αi k(s, si) + b, (6)

where k(s, si) = exp(−γ‖s − si‖²) is the RBF kernel function, γ is the kernel parameter, and s = (s1(x), s2(x)) and si = (s1(xi), s2(xi)) are the scores provided by the HTML- and Snapshot-based classifiers for the input page x and for the n pages in our training set D = {xi, yi}ni=1, with yi ∈ {−1, +1} the class label (i.e., −1 for legitimate and +1 for phishing pages). The classifier parameters {αi}ni=1 and b are estimated during training by the SVM learning algorithm, on the set of scores S = {si, yi}ni=1, which can be computed through stacked generalization (to avoid overfitting [41]) as explained in Sect. 4.1.

Adversarial Fusion. In this case, we consider the same trained fusion mechanism described above, but augment the training scores by simulating attacks against the HTML-based classifier. In particular, we add a fraction of samples for which the score of the Snapshot-based classifier is not altered, while the score of the HTML-based classifier is randomly sampled from a uniform distribution in [0, 1]. This is a straightforward way to account for the fact that the score of the HTML-based classifier can be potentially decreased by a targeted attack against that module, and to make the combiner aware of this potential threat.
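A minimal sketch of this augmentation step is given below, with random stand-in score pairs and scikit-learn's RBF SVM as the combiner; the 30% attack fraction follows Sect. 4.1, while C and γ are placeholders for the cross-validated values:

```python
# Sketch of the adversarial fusion training set: phishing score pairs are
# duplicated with the HTML score resampled uniformly in [0, 1], simulating
# evasion of the HTML-based classifier.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
S = rng.rand(300, 2)                    # (s1, s2) score pairs in [0, 1]
y = rng.choice([-1, 1], size=300)       # -1 legitimate, +1 phishing

phish = S[y == 1]
n_atk = int(0.3 * len(phish))           # 30% simulated attacks, as in Sect. 4.1
atk = phish[rng.choice(len(phish), n_atk)].copy()
atk[:, 0] = rng.rand(n_atk)             # HTML score resampled, visual score kept

S_aug = np.vstack([S, atk])
y_aug = np.concatenate([y, np.ones(n_atk)])
combiner = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(S_aug, y_aug)  # Eq. (6)
```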

Some examples of the resulting decision functions are shown in Fig. 7. It is worth remarking that, when using trained fusion rules, the output scores of the HTML- and Snapshot-based classifiers are normalized in [0, 1] using min-max normalization, to facilitate learning (see Sect. 4.1 for further details).

4 Experimental Evaluation

In this section we empirically evaluate δPhish, simulating its application as a module in a web application firewall. Under this scenario, we assume that the monitored website has been compromised (e.g., using a phishing kit), and that it is hosting a phishing webpage. The URLs contacted by users visiting the website are monitored by the web application firewall, which can deny access to a resource if it is deemed suspicious (or which can stop a request if it is deemed a potential attack against the web server). The contacted URLs that are not blocked by the web application firewall are forwarded to δPhish, which detects whether they are substantially different from the homepage (i.e., whether they are potential phishing pages hosted in the monitored website). If δPhish reveals such a sign of compromise, the web application firewall can deny user access to the corresponding URL.


We first discuss the characteristics of the webpages we have collected from legitimate, compromised websites (hosting phishing scams) to build our dataset, along with the settings used to run our experiments (Sect. 4.1). We then report our results, showing that our system can detect most of the phishing pages with very high accuracy, while misclassifying only a few legitimate webpages (Sect. 4.2). We have also performed an adversarial evaluation of our system in which the characteristics of the phishing pages are manipulated to evade detection by the HTML-based classifier. The goal of this adversarial analysis is to show that δPhish can successfully resist even worst-case evasion attempts. Notably, we have not considered attacks against the Snapshot-based classifier, as they would require modifying the visual aspect of the phishing page, thus making it easier for the victim to recognize the phishing scam.

4.1 Experimental Setting

Dataset. Our dataset has been collected from October 2015 to January 2016, starting from active phishing URLs obtained online from the PhishTank feed.3 We have collected and manually validated 1,012 phishing pages. For each phishing page, we have then collected the corresponding homepage from the hosting domain. By parsing the hyperlinks in the HTML code of the homepage, we have collected from 3 to 5 legitimate pages from the same website, and validated them manually. This has allowed us to gather 1,012 distinct sets of webpages, from now on referred to as families, each consisting of a phishing page and some legitimate pages collected from the same website. Overall, our dataset consists of 5,511 distinct webpages, 1,012 of which are phishing pages. We make this data publicly available, along with the classification results of δPhish.4

In these experiments, we consider 20 distinct training-test pairs to average our results. For a fair evaluation, webpages collected from the same domain (i.e., belonging to the same family) are included either in the training data or in the test data.

3 https://www.phishtank.com
4 http://deltaphish.pluribus-one.it/

In each repetition, we randomly select 60% of the families for training, while the remaining 40% are used for testing. We normalize the feature values δ1(x) and δ2(x) using min-max normalization, but estimating the 5th and the 95th percentile of each feature value from the training data, instead of the minimum and the maximum, to reduce the influence of outlying feature values.
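A minimal sketch of this robust variant of min-max scaling; the zero-span guard is our addition, not a detail from the paper:

```python
# Sketch of percentile-based min-max normalization.
import numpy as np

def fit_robust_minmax(X_train):
    """Per-feature 5th/95th percentiles in place of min/max."""
    lo = np.percentile(X_train, 5, axis=0)
    hi = np.percentile(X_train, 95, axis=0)
    return lo, hi

def apply_robust_minmax(X, lo, hi):
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (X - lo) / span
```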

This setting corresponds to the case in which δPhish is trained before deployment on the web application firewall, to detect phishing webpages independently of the specific website being monitored. It is nevertheless worth pointing out that our system can also be trained using only the legitimate pages of the monitored website, i.e., it can be customized depending on the specific deployment.

Classifiers. We consider the HTML- and Snapshot-based classifiers (Sects. 3.2-3.3), using the three fusion rules discussed in Sect. 3.4 to combine their outputs: (i) Fusion (max.), in which the max rule is used to combine the two outputs (Eq. 5); (ii) Fusion (tr.), in which we use an SVM with the RBF kernel as the combiner (Eq. 6); and (iii) Fusion (adv.), in which we also use an SVM with the RBF kernel as the combiner, but augment the training set with phishing webpages adversarially manipulated to evade the HTML-based classifier.

Parameter tuning. For the HTML- and Snapshot-based classifiers, the only parameter to be tuned is the regularization parameter C of the SVM algorithm. For SVM-based combiners exploiting the RBF kernel, we also have to set the kernel parameter γ. In both cases, we exploit a 5-fold cross-validation procedure to tune the parameters, by performing a grid search on C, γ ∈ {0.001, 0.01, 0.1, 1, 10, 100}. As the trained fusion rules require a separate training set for the base classifiers and the combiner (to avoid overfitting), we run a two-level (nested) cross-validation procedure, usually referred to as stacked generalization [41]. In particular, the outer 5-fold cross-validation splits the training data into a further training and validation set. This training set is used to tune the parameters (using an inner 5-fold cross-validation as described above) and train the base classifiers. Then, these classifiers are evaluated on the validation data, and their outputs on each validation sample are stored. We normalize these output scores in [0, 1] using min-max normalization. At the end of the outer cross-validation procedure, we have computed the outputs of the base classifiers for each of the initial training samples, i.e., the set S = {si, yi}ni=1 (Sect. 3.4). We can thus optimize the parameters of the combiner on this data and then learn the fusion rule on all data. For the adversarial fusion, we set the fraction of simulated attacks added to the training score set to 30% (Sect. 3.4).
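A minimal sketch of this stacked-generalization procedure with scikit-learn is given below; the feature matrices are random stand-ins, and using cross_val_predict to obtain out-of-fold scores is one possible realization of the described scheme:

```python
# Sketch of stacked generalization: out-of-fold base-classifier scores
# become the combiner's training set.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.svm import LinearSVC, SVC

grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}

def out_of_fold_scores(X, y):
    base = GridSearchCV(LinearSVC(), grid, cv=5)   # inner CV tunes C
    s = cross_val_predict(base, X, y, cv=5, method="decision_function")
    return (s - s.min()) / (s.max() - s.min())     # min-max normalize to [0, 1]

rng = np.random.RandomState(0)
X1, X2 = rng.rand(100, 11), rng.rand(100, 12348)   # stand-in HTML/visual features
y = rng.choice([-1, 1], size=100)

S = np.column_stack([out_of_fold_scores(X1, y), out_of_fold_scores(X2, y)])
combiner = GridSearchCV(SVC(kernel="rbf"),
                        {"C": grid["C"], "gamma": grid["C"]}, cv=5).fit(S, y)
```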

4.2 Experimental Results

The results for phishing detection are shown in Fig. 6 (left plot), using Receiver-Operating-Characteristic (ROC) curves. Each curve reports the average detection rate of phishing pages (i.e., the true positive rate, TP) against the fraction of misclassified legitimate pages (i.e., the false positive rate, FP).

The HTML-based classifier is able to detect more than 97% of phishing webpages while misclassifying less than 0.5% of legitimate webpages, demonstrating the effectiveness of exploiting differences in the HTML code of phishing and legitimate pages. The Snapshot-based classifier does not reach the same accuracy, since in some cases legitimate webpages may have a rather different visual appearance, and the visual learning task is inherently more complex; the visual classifier is indeed trained on a much higher number of features than the HTML-based one. Nevertheless, the detection rate of the Snapshot-based classifier is higher than 80% at 1% FP, which is still a significant achievement for this classification task. Note finally that both the trained and the max fusion rules achieve accuracies similar to that of the HTML-based classifier, while the adversarial fusion performs slightly worse. This behavior is due to the fact that injecting simulated attacks into the training score set of the combiner causes an increase of the false positive rate (see Fig. 7). This highlights a tradeoff between system security under attack and accuracy in the absence of targeted attacks against the HTML-based classifier.

Processing time. We have run our experiments on a personal computer equipped with an Intel(R) Xeon(R) CPU E5-2630 0 operating at 2.30 GHz and 4 GB of RAM. The processing time of δPhish is clearly dominated by the browser automation module, which has to retrieve the HTML code and snapshot of the considered pages. This typically requires a few seconds (as estimated, on average, on our dataset). The subsequent HTML-based classification is instantaneous, while the Snapshot-based classifier requires more than 1.2 seconds, on average, to compute its similarity score. This delay is mainly due to the extraction of the HOG features, while the color features are extracted in less than 3 ms, on average. The processing time of our approach can be sped up using parallel computation (e.g., through the implementation of a scalable application on a cloud computing service), and a caching mechanism to avoid re-classifying known pages.

Adversarial Evasion. We consider here an attacker that manipulates the HTML code of his/her phishing page to resemble that of the homepage of the compromised website, aiming to evade detection by our HTML-based classifier. We simulate a worst-case scenario in which the attacker has perfect knowledge of such a classifier, i.e., he/she knows the weights assigned by the classifier to each HTML feature. The idea of this evasion attack is to maximally decrease the classification score of the HTML module while manipulating the minimum number of features, as in [42]. In this case, an optimal attack will start by manipulating the features having the highest absolute weight values. For simplicity, we assume a worst-case attack, where the attacker can set a feature value to either 0 or 1, although this may not be possible for all features without compromising the nature of the phishing scam. For instance, in order to set the URL feature to 1 (see Sect. 3.2), an attacker has to use exactly the same set of URLs present in the compromised website's homepage. This might require removing some links from the phishing page, compromising its malicious functionality.
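A minimal sketch of this greedy worst-case manipulation against the linear classifier of Eq. (2); setting each selected feature to the 0/1 value that minimizes its contribution to the score is our reading of the described attack:

```python
# Sketch of the evasion attack: flip the most influential features, as in [42].
import numpy as np

def evade(x, w, b, n_features):
    """Set the `n_features` highest-|weight| features of x to 0 or 1."""
    x_adv = x.copy()
    order = np.argsort(-np.abs(w))           # highest absolute weight first
    for i in order[:n_features]:
        x_adv[i] = 0.0 if w[i] > 0 else 1.0  # value minimizing w[i] * x[i]
    return x_adv, w @ x_adv + b              # manipulated page and its score
```

Since the HTML weights are mostly negative (see Fig. 8), this procedure effectively increases the selected feature values to 1, i.e., it makes the phishing page's sets of URLs, style sheets, images, and title words resemble those of the homepage.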

The distribution of the feature weights (and bias) for the HTML-based classifier (computed over the 20 repetitions of our experiment) is shown in the boxplot of Fig. 8, highlighting two interesting facts. First, features tend to be assigned only negative weights.


Figure 6: ROC curves (left: TP vs. FP for the HTML, Snapshot, and fusion classifiers) and adversarial evaluation (right: TP at FP = 1% vs. number of modified HTML features) of the classifiers.

Figure 7: Examples of decision functions (in colors) for maximum (left), trained fusion (center), and adversarial fusion (right), in the space of the base classifiers' outputs (HTML classifier score vs. image classifier score). Blue (red) points represent legitimate (phishing) pages. Decision boundaries are shown as black lines. Phishing pages manipulated to evade the HTML-based classifier will receive a lower score (i.e., the red points will be shifted to the left), and most likely evade only the trained fusion.

This means that each feature tends to exhibit higher values for legitimate pages, and that the attacker should increase its value to mislead detection. Since the bias is generally positive, a page tends to be classified as a phish unless there is sufficient "evidence" that it is similar to the homepage. Second, the most relevant features (i.e., those which tend to be assigned the lowest negative weights) are Title, URL, SS-URL, and I-URL. These will be, in most cases, the first four features to be increased by the attacker to evade detection, while the remaining features play only a minor role in the classification of phishing and legitimate pages.

The results are reported in Fig. 6 (right plot). It shows how the detection rate achieved by δPhish at 1% FP decreases against an increasing number of HTML features modified by the attacker, for the different fusion schemes and the HTML-based classifier. The first interesting finding is that the HTML-based classifier can be evaded by modifying only a single feature (most likely, URL). The trained fusion remains slightly more robust, although it exhibits a dramatic performance drop already at the early stages of the attack. Conversely, the detection rate of the maximum and adversarial fusion rules under attack remains higher than 70%.


Figure 8: Boxplot of feature weights (w) and bias (b) for the HTML-based classifier, over the features URL, 2LD, SS, SS-URL, SS-2LD, I-URL, I-2LD, Copyright, X-links, Title, and Language.

The underlying reason is that they rely more upon the output of the Snapshot-based classifier than the trained fusion does. In fact, as already mentioned, such schemes explicitly account for the presence of attacks against the base classifiers. Note also that the adversarial fusion outperforms the maximum when only one feature is modified, while achieving a similar detection rate at the later stages of the attack. This clearly comes at the cost of worse performance in the absence of attack. Thus, if one believes that such evasion attempts are likely in practice, he/she may decide to trade accuracy in the absence of attack for an improved level of security against these potential manipulations. This tradeoff can also be tuned in a more fine-grained manner by varying the percentage of simulated attacks used while training the adversarial fusion scheme (which we set to 30%), and by considering a less pessimistic score distribution than the uniform one (e.g., a Beta distribution skewed towards the average score assigned by the HTML-based classifier to phishing pages).

5 Conclusions and Future Work

The widespread presence of public, exploitable websites in the wild has enabled a large-scale deployment of modern phishing scams. We have observed that phishing pages hosted in compromised websites exhibit a different aspect and structure from those of the legitimate pages hosted in the same website, for two main reasons: (i) to be effective, phishing pages should resemble the visual appearance of the website targeted by the scam; and (ii) leaving the legitimate pages intact guarantees that phishing pages remain active for a longer period of time before being blacklisted. Website compromise can thus be regarded as a simple pivoting step in the implementation of modern phishing attacks.

To the best of our knowledge, this is the first work that leverages this aspect for phishing webpage detection. By comparing the HTML code and the visual appearance of a potential phishing page with the homepage of the corresponding website, δPhish exhibits high detection accuracy even in the presence of well-crafted, adversarial manipulation of the HTML code. While our results are encouraging, our proposal has its own limitations. It is clearly not able to detect phishing pages hosted through means other than compromised websites. It may be adapted to address this issue by comparing the webpage to be classified against a set of known phishing targets (e.g., PayPal, eBay); in this case, if the similarity exceeds a given threshold, then the page is classified as a phish. Another limitation is related to the assumption that legitimate pages within a certain website share a similar appearance and HTML code with the homepage. This assumption may indeed be violated, leading the system to misclassify some pages. We believe that such errors can be limited by extending the comparison between the potential phishing page and the website homepage to the other legitimate pages in the website as well (and this can be configured at the level of the web application firewall). This is an interesting direction for future work.

Our adversarial evaluation also has some limitations. We have considered an attacker that deliberately modifies the HTML code of the phishing page to evade detection. A more advanced attacker might also modify the phishing page to evade our Snapshot-based classifier. This is clearly more complex, as he/she should not compromise the visual appearance of the phishing page while aiming to evade our visual analysis. Moreover, the proposed adversarial fusion (i.e., the maximum) already accounts for this possibility, and the attack can be successful only if both the HTML- and Snapshot-based classifiers are fooled.


We leave a more detailed investigation of this aspect to future work, along with the possibility of training our system using only legitimate data, which would alleviate the burden of collecting a set of manually-labeled phishing webpages.

Finally, it is worth remarking that we have experimented on more than 5,500 webpages collected in the wild, which we have also made publicly available for research reproducibility. Despite this, it is clear that our data should be extended to include more examples of phishing and legitimate webpages, hopefully with the help of other researchers, to get more reliable insights on the validity of phishing webpage detection approaches.

Acknowledgments

This work has been partially supported by the DOGANA project, funded by the EU Horizon 2020 framework programme, under Grant Agreement no. 653618.

References

[1] Beardsley, T.: Phishing detection and prevention, practical counter-fraud solutions. Technical report, TippingPoint (2005)

[2] Hong, J.: The state of phishing attacks. Commun. ACM 55(1) (Jan. 2012) 74–81

[3] Khonji, M., Iraqi, Y., Jones, A.: Phishing detection: A literature survey. IEEE Communications Surveys & Tutorials 15(4) (2013) 2091–2121

[4] Han, X., Kheir, N., Balzarotti, D.: PhishEye: Live monitoring of sandboxed phishing kits. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. CCS '16, New York, NY, USA, ACM (2016) 1402–1413

[5] Bursztein, E., Benko, B., Margolis, D., Pietraszek, T., Archer, A., Aquino, A., Pitsillidis, A., Savage, S.: Handcrafted fraud and extortion: Manual account hijacking in the wild. In: IMC '14 (2014) 347–358

[6] Cova, M., Kruegel, C., Vigna, G.: There is no free phish: An analysis of "free" and live phishing kits. In: 2nd WOOT '08, Berkeley, CA, USA, USENIX (2008) 4:1–4:8

[7] Invernizzi, L., Benvenuti, S., Cova, M., Comparetti, P.M., Kruegel, C., Vigna, G.: EvilSeed: A guided approach to finding malicious web pages. In: IEEE Symp. SP '12, Washington, DC, USA, IEEE CS (2012) 428–442

[8] APWG: Global phishing survey: Trends and domain name use in 2014 (2015)

[9] Hao, S., Kantchelian, A., Miller, B., Paxson, V., Feamster, N.: PREDATOR: Proactive recognition and elimination of domain abuse at time-of-registration. In: ACM CCS, ACM (2016) 1568–1579

[10] Basnet, R.B., Sung, A.H.: Learning to detect phishing webpages. J. Internet Services and Inf. Sec. (JISIS) 4(3) (2014) 21–39

[11] Medvet, E., Kirda, E., Kruegel, C.: Visual-similarity-based phishing detection. In: 4th Int'l Conf. SecureComm '08, New York, NY, USA, ACM (2008) 22:1–22:6

[12] Chen, T.C., Stepan, T., Dick, S., Miller, J.: An anti-phishing system employing diffused information. ACM Trans. Inf. Syst. Secur. 16(4) (April 2014) 16:1–16:31

[13] Chen, T.C., Dick, S., Miller, J.: Detecting visually similar web pages: Application to phishing detection. ACM Trans. Intern. Tech. 10(2) (June 2010) 5:1–5:38

[14] Blum, A., Wardman, B., Solorio, T., Warner, G.: Lexical feature based phishing URL detection using online learning. In: 3rd ACM Workshop on Artificial Intelligence and Security. AISec '10, New York, NY, USA, ACM (2010) 54–60

[15] Liang, B., Su, M., You, W., Shi, W., Yang, G.: Cracking classifiers for evasion: A case study on Google's phishing pages filter. In: 25th Int'l Conf. WWW, Montreal, Canada (2016) 345–356

[16] Garera, S., Provos, N., Chew, M., Rubin, A.D.: A framework for detection and measurement of phishing attacks. In: Proceedings of the 2007 ACM Workshop on Recurring Malcode. WORM '07, New York, NY, USA, ACM (2007) 1–8

[17] Le, A., Markopoulou, A., Faloutsos, M.: PhishDef: URL names say it all. In: INFOCOM 2011 Proceedings, IEEE (April 2011) 191–195

[18] Marchal, S., Francois, J., State, R., Engel, T.: Proactive discovery of phishing related domain names. In: 15th RAID. Vol. 7462, Springer (2012) 190–209

[19] Pan, Y., Ding, X.: Anomaly based web phishing page detection. In: 22nd ACSAC (2006) 381–392

[20] Xu, L., Zhan, Z., Xu, S., Ye, K.: Cross-layer detection of malicious websites. In: 3rd CODASPY, New York, NY, USA, ACM (2013) 141–152

[21] Whittaker, C., Ryner, B., Nazif, M.: Large-scale automatic classification of phishing pages. In: NDSS, San Diego, California, USA, The Internet Society (2010)

[22] Xiang, G., Pendleton, B.A., Hong, J., Rose, C.P.: A hierarchical adaptive probabilistic approach for zero hour phish detection. In: 15th ESORICS, Berlin, Heidelberg, Springer-Verlag (2010) 268–285

[23] Xiang, G., Hong, J., Rose, C.P., Cranor, L.: CANTINA+: A feature-rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. 14(2) (September 2011) 21:1–21:28

[24] Britt, J., Wardman, B., Sprague, A., Warner, G.: Clustering potential phishing websites using DeepMD5. In: 5th LEET, Berkeley, CA, USA, USENIX (2012)

[25] Jo, I., Jung, E., Yeom, H.: You're not who you claim to be: Website identity check for phishing detection. In: Int'l Conf. Computer Comm. and Networks (2010) 1–6

[26] Ludl, C., McAllister, S., Kirda, E., Kruegel, C.: On the effectiveness of techniques to detect phishing sites. In: DIMVA '07, Springer-Verlag (2007) 20–39

[27] Fette, I., Sadeh, N., Tomasic, A.: Learning to detect phishing emails. In: 16th Int'l Conf. WWW, ACM (2007) 649–656

[28] Wardman, B., Stallings, T., Warner, G., Skjellum, A.: High-performance content-based phishing attack detection. In: eCrime Researchers Summit (Nov. 2011)

[29] Wenyin, L., Liu, G., Qiu, B., Quan, X.: Antiphishing through phishing target discovery. IEEE Internet Computing 16(2) (2012) 52–61

[30] Chen, K.T., Chen, J.Y., Huang, C.R., Chen, C.S.: Fighting phishing with discriminative keypoint features. IEEE Internet Computing 13(3) (2009) 56–63

[31] Fu, A.Y., Wenyin, L., Deng, X.: Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD). IEEE Transactions on Dependable and Secure Computing 3(4) (2006) 301–311

[32] Afroz, S., Greenstadt, R.: PhishZoo: Detecting phishing websites by looking at them. In: 5th IEEE Int'l Conf. Semantic Computing (2011) 368–375

[33] Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20 (1995) 273–297

[34] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, San Diego, CA, USA, IEEE CS (2005) 886–893

[35] Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9) (2010) 1627–1645

[36] Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms. In Bimbo, A.D., Chang, S.F., Smeulders, A.W.M., eds.: 18th Int'l Conf. Multimedia, Firenze, Italy, ACM (2010) 1469–1472

[37] Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: ECCV Workshop on Statistical Learning in Computer Vision (2004) 1–22

[38] Sural, S., Qian, G., Pramanik, S.: Segmentation and histogram generation using the HSV color space for image retrieval. In: ICIP (2) (2002) 589–592

[39] Swain, M.J., Ballard, D.H.: Color indexing. Int'l J. Comp. Vis. 7(1) (1991) 11–32

[40] Chechik, G., Sharma, V., Shalit, U., Bengio, S.: Large scale online learning of image similarity through ranking. J. Mach. Learn. Res. 11 (March 2010) 1109–1135

[41] Wolpert, D.H.: Stacked generalization. Neural Networks 5 (1992) 241–259

[42] Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., Roli, F.: Evasion attacks against machine learning at test time. In Blockeel et al., eds.: ECML PKDD, vol. 8190 of LNCS, Springer (2013) 387–402
