DECISION MAKING SUPPORT SYSTEM FOR MANAGING …

JAISCR, 2021, Vol. 11, No. 4, pp. 331

Artur Starczewski, Magdalena M. Scherer, Wojciech Ksiazek, Maciej Debski, Lipo Wang

Data 2, Data 3, Data 4, Data 5, Data 6, Data 7, Data 8 and Data 9 sets. When the number of grid cells is specified by the new method, the GBCN algorithm is used to cluster the artificial datasets. Figure 7 shows the results of the algorithm, where each cluster is marked with different signs. It should be noted that despite the fact that the differences of distances and shapes between clusters are significant, all the datasets are clustered correctly by the clustering algorithm. Moreover, the data elements classified as the noise are marked with a circle, and their number is small in all the datasets. Table 2 presents detailed information about the results of experiments conducted with the use of the GBCN algorithm. The table shows the number of used intervals, the number of received clusters, and the number of the noise elements for individual datasets. Generally, only one parameter must be calculated for this algorithm, i.e. the number of intervals, and it is calculated automatically. On the other hand, the well-known CLIQUE algorithm was used in experiments on these artificial datasets and it required two input parameters, i.e. the number of intervals and the density threshold. In this case, the calculation of the parameters is not easy. Table 3 presents the results of clustering for the CLIQUE algorithm, and the parameters which were determined experimentally.

5 Conclusions

In this paper, a new grid-based algorithm GBCN is proposed. This algorithm uses two input parameters, i.e. the number of intervals and the minimal number of elements in clusters. The first parameter is determined automatically, i.e. by the kdist function, which computes the distance between each element of a dataset and its kth nearest neighbor. The distances are used to calculate the size of the intervals for each dimension of the data and then make it possible to determine the right number of grid cells. The next parameter defines the minimal number of elements in clusters and it is equal to 5. It should be noted that this value is very often used by the DBSCAN algorithm. In the conducted experiments, several 2-dimensional datasets were used in which the number of clusters, sizes, and shapes varied within a wide range. From the perspective of the conducted experiments, this new grid-based algorithm and the method of determining the number of intervals are very useful. All the presented results confirm the high efficiency of the newly proposed approach.

Acknowledgements

The paper is financed under the program of the Polish Minister of Science and Higher Education under the name “Regional Initiative of Excellence” in the years 2019-2022; project number 020/RID/2018/19; the amount of financing PLN 12,000,000.00.

References

[1] Agrawal R., Gehrke J., Gunopulos D., Raghavan P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec., Vol. 27, pp. 94-105 (1998).

[2] Boonchoo T., Ao X., Liu Y., Zhao W., He Q.: Grid-based DBSCAN: Indexing and inference. Pattern Recognition, Vol. 90, pp. 271-284 (2019).

[3] Bradley P., Fayyad U.: Refining initial points for k-means clustering. In Proceedings of the fifteenth international conference on knowledge discovery and data mining, New York, AAAI Press, pp. 9-15 (1998).

[4] Chen Y., Tang S., Bouguila N., Wanga C., Du J., Li H.: A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data. Pattern Recognition, Vol. 83, pp. 375-387 (2018).

[5] Darong H., Peng W.: Grid-based dbscan algorithm with referential parameters. Physics Procedia, 24, Part B, pp. 1166-1170 (2012).

[6] Ester M., Kriegel H.P., Sander J., Xu X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceeding of 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226-231 (1996).

[7] Franti P., Rezaei M., Zhao Q.: Centroid index: Cluster level similarity measure. Pattern Recognition, Vol. 47, Issue 9, pp. 3034-3045 (2014).

[8] Gabryel M.: Data Analysis Algorithm for Click Fraud Recognition. Communications in Computer and Information Science, Vol. 920, pp. 437-446 (2018).

[9] Gan J., Tao Y.: Dbscan revisited: mis-claim, un-fixability, and approximation. SIGMOD (2015).

DECISION MAKING SUPPORT SYSTEM FOR MANAGING ADVERTISERS BY AD FRAUD DETECTION

Marcin Gabryel1,∗, Magdalena M. Scherer2, Łukasz Sułkowski3,4,5, Robertas Damasevicius6

1Faculty of Mechanical Engineering and Computer Science, Czestochowa University of Technology, al. Armii Krajowej 36, 42-200 Czestochowa, Poland

2Faculty of Management, Czestochowa University of Technology, Poland

3Faculty of Management and Social Sciences, Jagiellonian University, Cracow, Poland

4Management Department, University of Social Sciences, 90 - 113 Lodz, Poland

5Clark University, Worcester, MA 01610, USA

6Department of Applied Informatics, Vytautas Magnus University, Kaunas 44404, Lithuania

∗E-mail: [email protected]

Submitted: November 2020; Accepted: 22nd September 2021

Abstract

Efficient lead management can substantially enhance online channel marketing programs. In the paper, we classify website traffic as human-origin or bot-origin. We use feedforward neural networks with embedding layers. Moreover, we use one-hot encoding for categorical data. The mouse click data come from seven large retail stores and the lead classification data from three financial institutions. The data are collected by JavaScript code embedded into HTML pages. The three proposed models achieved relatively high accuracy in detecting artificially generated traffic.

Keywords: lead management, feedforward neural networks, embedding, online marketing

1 Introduction

The Internet advertising system consists of: (1) publishers, who provide resources for advertising traffic, (2) advertisers, who buy traffic to deliver advertisements to recipients, and (3) intermediaries (affiliate networks, media houses, programmatic platforms). Publishers both provide and generate the traffic. In other words, a user’s visit to a publisher’s website enables one or more advertisements to be displayed to that user. Thus, the publisher may sell this possibility to those advertisers who are interested in displaying their ads. Most often this is done through an intermediary in the form of an affiliate network.

Unfortunately, to increase revenues, the parties may resort to unethical activities. For example, some advertisers may exhaust their competitors’ marketing budgets through carefully conducted abuse. Similarly, some publishers set up various traps meant to lure users into browsing or clicking ads for products that they are not actually interested in. Affiliate networks, on the other hand, do not react to irregular clicking behaviour, hoping to earn commissions. In some circumstances, malicious advertisers as well as publishers become fraudsters. Under the commonly used pricing models, and despite various security measures, fraudsters are constantly trying to introduce new types of fraud using display, click or action-based scams.

10.2478/jaiscr-2021-0020 – 339

In the last few years, so-called traffic scams (page-view or click fraud) have become one of the main types of online crime and fraud (ad fraud). They consist in increasing revenues from online advertising by automatically generating artificial page views, clicks or web form submissions. This practice brings real (and relatively easy to obtain) income, or profit resulting from the losses of competing companies. On a global scale, ad fraud causes multi-billion losses in the advertising industry. Ad fraud activities may be carried out by:

1. Publishers billed in the model for the effect. This model can take, for example, the form of “cost per click” (CPC), “cost per thousand” (CPM), or “cost per lead” (CPL). A lead is the data of a potential customer. The most common frauds in online advertising are related to performance advertising, i.e. advertising billed for the volume of clicks on the advertisement. In addition to “clicking ads”, dishonest publishers may also produce fraudulent leads, i.e. web forms completed with incorrect or false data, and in extreme cases even fictitious sales, which result in the payment of an unjustified commission and in the advertiser’s time being lost on handling the order.

2. Intermediaries offering support for marketing activities for advertisers, e.g. affiliate partners, owners of websites and comparison websites. Fraudulent activities include artificial generation of clicks or leads, generating worthless clicks or leads, “clicking” ads, entering false data in web forms, automatic completion of applications without the knowledge of the person concerned, creating websites generating additional clicks, and even creating special algorithms and bots.

3. Competitors, striving to build a position for their own ads or to lower the cost per click by clicking on a competitor’s ads (such activities consume the advertising budget without bringing the expected sales results and, as a result, limit the scale of the competitor’s advertising campaign).

According to [4], the global costs of fraud activities are estimated at between USD 6.5 and 19 billion (it is difficult to determine the exact scale of the phenomenon due to the lack of precise data). These types of unfair practices result in ineffective budget allocation for online campaigns. However, research predicts that artificial intelligence, machine learning and big data technologies will be key solutions in minimizing losses caused by fraud, thanks to the ability to analyze the huge amounts of data generated from advertising activities. AI-powered platforms will account for 74% of total online and mobile advertising spending by 2022. However, as artificial intelligence becomes saturated, only platforms with the most effective algorithms will be able to guarantee effective ad protection. According to research, these platforms will have to focus on new data sources to improve the proficiency of artificial intelligence algorithms [10]. The financial sector is one of the most attacked (every fourth advertiser is a victim of ad fraud), right behind the gambling industry and airlines [8].

The work presents two deep neural network-based classifiers with the following aims:

– evaluating internet traffic and deciding whether a human or a bot is present on the website for the CPC and CPM billing models, i.e. the so-called monitoring of clicks. This traffic occurs when the user moves from clicking on an advertisement on the publisher’s website to the advertiser’s website.

– assessing human or bot behaviour on the website in order to secure the website forms in the CPS and CPA billing models, i.e. the so-called lead monitoring. In this case, traffic goes from the publisher to the advertiser’s website in order to fill out the application form (i.e. acquire a lead).

We use data downloaded from the client’s website, generated after a user or bot accesses it. Data such as browser and device parameters, mouse pointer behaviour, keystrokes, page scrolling, screen touch position, etc., are collected using a JavaScript script included in the website’s HTML code. The code is designed to retrieve information about the browser and device, leave a trace for later identification and track user behaviour. The task of the classifier is to return a probability value that allows the obtained lead to be managed so as to better queue the tasks related to its handling (transfer to the call centre for verification, transfer to the sales department to handle the service request, etc.).
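As a rough illustration of how a returned probability could drive the queuing described above, the following Python sketch maps a score to one of the two handling queues named in the text. The function name `route_lead`, the queue labels and the threshold of 0.5 are hypothetical; the paper does not specify a routing rule.

```python
def route_lead(p_human: float, verify_below: float = 0.5) -> str:
    """Map the classifier's probability that a lead is genuine to a queue.

    The threshold and the queue names are illustrative assumptions only.
    """
    if not 0.0 <= p_human <= 1.0:
        raise ValueError("p_human must be a probability in [0, 1]")
    if p_human < verify_below:
        # Doubtful leads go to manual verification first.
        return "call_centre_verification"
    # Leads that look genuine go straight to sales handling.
    return "sales_department"


queue_for_good_lead = route_lead(0.9)
queue_for_doubtful_lead = route_lead(0.2)
```

In practice the threshold would have to be tuned against the cost of a missed fraudulent lead versus the cost of an unnecessary verification call.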

This paper is organized as follows. Section 2 presents the current state of the art on internet frauds. Section 3 describes the classifier models used in the paper. The course of experimental research can be found in Section 4. Section 5 concludes the paper and offers suggestions for future work.

2 Related works

Detecting fraud in the Internet advertising industry is one of the challenges that can be successfully solved using machine learning. The literature, however, almost lacks practical works on detecting computer programs (bots) that perform abuses related to generating artificial Internet traffic, relying only on data downloaded from a website. One of the most interesting comprehensive descriptions of these issues can be found in [14]. However, this study is limited to cataloguing threats and does not mention methods of counteracting them. Solutions that detect various bots by analyzing the web server event log (e.g. [6]), which require modification of the application or redirection of all traffic, are nevertheless quite popular. Despite the declared effectiveness of these methods, the possibility of their practical application is negligible. Access to the information required by these solutions and modifications to the computer systems of website owners are challenging to carry out or require time and additional financial outlays, which usually discourages the use of such solutions. Therefore, the scope of our research was limited to monitoring user behaviour on the website on the browser side. In [7] the behaviour of bots clicking on advertisements on web pages is analysed. The data comes from advertising systems offered by Google, Facebook, LinkedIn and Yahoo. In [11] botnets or DDoS attacks are identified by analyzing log files from WWW servers. Botnets are interconnected compromised machines that are used to perform attacks or fill in web forms [12]. It is believed that as many as 16-25% of machines are part of botnets [1]. One of the methods for detecting bots is presented in [soniya2016detection], where randomized bot command and control traffic is detected at an early stage. In [3] botnet-related traffic is identified by using similarity measures and periodic characteristics of botnets.

Classification is one of the functions successfully performed by feedforward neural networks (perceptron networks). Perceptron networks are the most effective model for analyzing large amounts of data stored in a tabular form. However, the most challenging task is the selection of the network structure and the selection of the network hyperparameter values. Another difficulty is that in our data, there are no labels identifying the sample as “bot” / “non-human” or “human”. Therefore, we used the data with information on sales effectiveness and/or information obtained from external sources (Google Captcha v3).

3 Decision-making Support system

As part of the research, we developed two models to classify data obtained from a website for the purpose of identifying human or program (bot) generated traffic. The models differ slightly from each other because they relate to two different methods of customer financial billing: 1) a model for monitoring clicks and 2) a model for monitoring leads. In the first case, the analysis of human-bot behaviour focuses on the process of opening the page, the time spent on the page and the behaviour on the page (mainly the behaviour of the mouse pointer). In the second case, a web form is filled in, and it is primarily this process that is examined (the way the user fills in the form, including the analysis of the sequence of pressed keys and their combinations).

The method analyzes data collected during clicking on advertisements and price comparison engines. Its outcome indicates whether the behaviour on the website is natural (generated by a human) or generated artificially by a computer program (bot). A data flow diagram is presented in Figure 1. The service and database belong to the system monitoring and data collecting module. The monitoring module also provides dedicated JavaScript code for the advertiser, which allows data to be collected on its website.

After clicking an advert on a publisher’s website, the advertiser’s website opens, and a script monitoring the client’s behaviour is launched. The behaviour-related data are stored in a database and used in the paper.

Figure 1. Data flow diagram for fraud click/leads analysis.

The data collected during the user’s visit to the monitored website are of two types: floating-point numbers and categorical data. A neural network is naturally suited to floating-point inputs, so categorical data require an additional encoding step. The simplest method is one-hot encoding, where for each categorical parameter the neural network has as many inputs as there are available categories. In other words, each category has its own input: the input corresponding to the given category is set to one and all the others are set to zero. This coding method was used in the first network model presented in Figure 2. It is a multilayer perceptron model consisting of two hidden layers, and the output is calculated by the softmax function. The learning set is $X = \{x_1, x_2, \ldots, x_N\}$, $x_i \in \mathbb{R}^m$, where $x_i$ is the $m$-dimensional feature vector, and the set of categories is $D = \{d_1, d_2, \ldots, d_N\}$, $d_i \in \Omega$, with the set of labels $\Omega = \{0, 1, 2\}$ and $N$ the number of samples. We use a multilayer perceptron, which can be presented as a function

$$y = f^{*}(x, \Theta) \tag{1}$$

where the input vector $x$ is mapped to the output vector $y$, and $\Theta$ is the set of parameters that best approximates the function. A single multilayer perceptron layer $k$ takes the following form

$$f^{k}(x, \Theta) = s(Wx + b) \tag{2}$$

where $\Theta$ is the set $\{W, b\}$, and $s(\cdot)$ is the adopted activation function. In our model, $s(\cdot)$ is the ReLU [5] function

$$s(t) = \max(0, t) \tag{3}$$

The described model consists of two layers, so the function (1) takes the form:

$$y = f^{*}(x, \Theta) = f^{2}(f^{1}(x, \Theta_1), \Theta_2) \tag{4}$$

where $\Theta_1$ and $\Theta_2$ are the parameters of, respectively, the first layer $\{W_1, b_1\}$ and the second layer $\{W_2, b_2\}$, $W_1 \in \mathbb{R}^{m \times n_1}$, $b_1 \in \mathbb{R}^{n_1}$, $W_2 \in \mathbb{R}^{n_1 \times n_2}$, $b_2 \in \mathbb{R}^{n_2}$, and $n_1$ and $n_2$ are the numbers of neurons in the first and second layer, respectively. The goal of the multilayer perceptron training is to minimize the difference between the obtained output $y_i$ of the entire network and the expected output $d_i$. For this purpose, the loss function is calculated according to the following formula

$$L(d_i, y_i) = \lVert d_i - y_i \rVert^{2} \tag{5}$$

The aim of training is to find the optimal values of the parameters $\Theta_1$ and $\Theta_2$ that minimize the error between the obtained output and the expected output for the entire training set

$$\Theta_1, \Theta_2 = \arg\min_{\Theta_1, \Theta_2} L(d, y) \tag{6}$$

There are three outputs from the network, each responsible for one of the classes in $\Omega$. For the considered category, the value one should appear at the corresponding network output. Therefore, the softmax [2] function was used at the network output, which takes the following form for each of the $k$ outputs

$$s(t)_k = \frac{e^{t_k}}{\sum_{j=1}^{m} e^{t_j}} \tag{7}$$



where $t \in \mathbb{R}^3$, and $t_j$ is the $j$th element of vector $t$. For softmax, the loss function $L(d_i, y_i)$ is computed by the categorical cross-entropy

$$L(d_i, y_i) = -\sum_{j=1}^{m} d_{ij} \log(y_{ij}) \tag{8}$$

where $m$ is the number of outputs (categories).
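The forward pass and loss defined by equations (2), (3), (7) and (8) can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors' implementation. It assumes a row-vector convention (`x @ W`, consistent with $W_1 \in \mathbb{R}^{m \times n_1}$), a separate output layer `W_out`, `b_out` mapping the second hidden layer to the three softmax outputs, randomly initialised weights, and a toy input size of 8; the hidden sizes 10 and 5 are the values reported for Model 1 in Section 4.

```python
import numpy as np


def relu(t):
    # Equation (3): element-wise max(0, t).
    return np.maximum(0.0, t)


def softmax(t):
    # Equation (7); subtracting the row maximum avoids overflow
    # and does not change the result.
    e = np.exp(t - t.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def forward(x, params):
    # Two hidden ReLU layers followed by an assumed softmax output layer.
    W1, b1, W2, b2, W_out, b_out = params
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return softmax(h2 @ W_out + b_out)


def cross_entropy(d, y, eps=1e-12):
    # Equation (8), one loss value per sample; eps guards against log(0).
    return -np.sum(d * np.log(y + eps), axis=-1)


rng = np.random.default_rng(0)
m, n1, n2, n_out = 8, 10, 5, 3          # m = 8 is a toy input size
params = (
    rng.normal(scale=0.1, size=(m, n1)), np.zeros(n1),
    rng.normal(scale=0.1, size=(n1, n2)), np.zeros(n2),
    rng.normal(scale=0.1, size=(n2, n_out)), np.zeros(n_out),
)
x = rng.normal(size=(4, m))             # four random samples
d = np.eye(n_out)[[0, 2, 1, 0]]         # one-hot encoded expected outputs
y = forward(x, params)
loss = cross_entropy(d, y)
```

Each row of `y` sums to one, so it can be read as a probability distribution over the three labels in $\Omega$.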

Figure 2. Multilayer perceptron network with one-hot encoded input data.

Model 2 uses embedding layers. Embedding is a mapping of separate categories (such as words) into vectors of real numbers. The rationale behind embedding is that machine learning using general patterns of locations and distances between vectors will find certain relationships between the categories. The network diagram is presented in Figure 3. The network approximates the function

$$f^{*}(x_r, x_c, \Theta) = f^{3}\big(f^{2}\big(f^{1}(x_r, \Theta_1), e_1(x_{c1}, \Theta_{c1}), \ldots, e_{n_c}(x_{c n_c}, \Theta_{c n_c}), \Theta_2\big), \Theta_3\big) \tag{9}$$

where $x_r$ are continuous (floating-point) parameters fed to the network input, $x_c \in \{x_{c1}, \ldots, x_{c n_c}\}$ are categorical variables, $n_c$ is the number of categorical variables, $e_k(x_{ck}, \Theta_{ck})$ is the result of the embedding layer for the $k$th variable $x_{ck}$, $\Theta_{ck}$ is the set of weights $W_{ek}$ for categorical variable $k$, $W_{ek} \in \mathbb{R}^{n_e}$, and $n_e$ is the number of neurons for every categorical variable $k$. Parameters $\Theta_1$, $\Theta_2$ and $\Theta_3$ are sets of weights for successive layers of the neural network. The results are returned, similarly to the previous model, by the softmax function.
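The way the embedding outputs $e_k$ enter the second layer in equation (9) can be illustrated with a table lookup followed by concatenation. The sketch below is illustrative only. It assumes that each categorical variable has a lookup table with one $n_e$-dimensional row per category (the usual reading of an embedding layer), and the category counts, the embedding size and the stand-in for $f^{1}(x_r, \Theta_1)$ are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_e = 4                      # embedding size per categorical variable (assumed)
cardinalities = [5, 3]       # hypothetical number of categories per variable
# One lookup table per categorical variable: row i is the vector of category i.
tables = [rng.normal(scale=0.1, size=(c, n_e)) for c in cardinalities]


def embed(x_c, tables):
    # x_c has shape (batch, n_c) and holds integer category identifiers.
    # Each column is looked up in its own table; results are concatenated.
    return np.concatenate(
        [table[x_c[:, k]] for k, table in enumerate(tables)], axis=1
    )


x_c = np.array([[0, 2], [4, 1]])        # two samples, two categorical variables
h_r = rng.normal(size=(2, 6))           # stand-in for f^1(x_r, Theta_1)
embedded = embed(x_c, tables)
z = np.concatenate([h_r, embedded], axis=1)   # input to the second layer f^2
```

Compared with one-hot encoding, the input width grows with `n_e` per variable rather than with the number of categories, which is one common motivation for embeddings on high-cardinality fields.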

Model 3 is identical in structure to Model 2, except that the process of training the neural network is performed differently. First, all embedding layers are trained. The neural network model is shown in Figure 4. Only the categorical variables $x_c$ are fed into the network. The network consists of one layer and its purpose is to approximate the function

$$f^{*}(x_c, \Theta) = f^{1}\big(e_1(x_{c1}, \Theta_{c1}), \ldots, e_{n_c}(x_{c n_c}, \Theta_{c n_c}), \Theta_1\big) \tag{10}$$

where successive variables have the same notation as in Model 2. In the next training step, the embedding layers are transferred to a network with a structure identical to Model 2. The whole model is trained again, but the parameters (weights) of the embedding layers are fixed and no longer trained.
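The second training stage of Model 3, in which the transferred embedding weights stay fixed, amounts to excluding those parameters from the update. The sketch below shows the idea with a plain gradient descent step. The parameter names, the learning rate and the dummy gradients are hypothetical, and the paper does not state which optimiser was used.

```python
import numpy as np


def sgd_step(params, grads, lr=0.1, frozen=frozenset()):
    """Return updated parameters; names listed in `frozen` are left unchanged."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Hypothetical parameters: one pretrained embedding table and one dense layer.
params = {"embed_0": np.ones((3, 2)), "dense_1": np.ones((2, 2))}
# Dummy gradients of 0.5 everywhere, just to make the update visible.
grads = {name: np.full_like(value, 0.5) for name, value in params.items()}

updated = sgd_step(params, grads, lr=0.1, frozen={"embed_0"})
```

Deep learning frameworks usually expose the same behaviour through a per-layer or per-tensor switch that marks parameters as non-trainable.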

Figure 3. A multilayer perceptron network with one-hot encoded input data.

Figure 4. Embedding layer training.

4 Experiments

The experiments were carried out on two datasets:

– Monitoring of mouse clicks – the dataset contains data of 300,000 individual visits to the websites of seven online stores. The data was obtained from the monitored website after clicking on the advertisement. The data has three labels: 'ok' – the input is correct and was made by a human, 'fake' – the input was most likely made by a bot program or unfair competition, and 'undecided' – the data is most often incomplete. Incomplete values are substituted with a default value and a new input is created for a one-hot encoded set of inputs. The labels were obtained on the basis of sales and additionally assessed by an expert.

– Lead monitoring – the dataset contains data from three financial institutions, covering 263,000 individual visits to a website with a web form used to fill in the data necessary to obtain a lead (interest in a loan, credit, etc.). The data has three labels: 'ok' and 'fake', which have a similar meaning as in monitoring clicks, and 'copy' – the data is most likely copied from another database and automatically pasted into the form. The labels were assigned on the basis of the sales obtained and additionally assessed by an expert.
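One possible reading of the handling of incomplete values in the click dataset is that each one-hot encoded variable receives one extra input that is set whenever the value is missing. The sketch below follows that reading. The function name, the example category list and the decision to route unknown values to the same extra input are assumptions, not details given in the paper.

```python
import numpy as np


def one_hot_with_missing(values, categories):
    """One-hot encode `values`, reserving the last column for missing data."""
    index = {category: i for i, category in enumerate(categories)}
    missing_column = len(categories)
    encoded = np.zeros((len(values), len(categories) + 1))
    for row, value in enumerate(values):
        # None and any value outside `categories` fall into the extra column.
        encoded[row, index.get(value, missing_column)] = 1.0
    return encoded


browsers = ["chrome", "firefox", "safari"]       # hypothetical category list
encoded = one_hot_with_missing(["firefox", None, "chrome"], browsers)
```

Every row still contains exactly one non-zero input, so the network sees missing data as just another category.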

The data is collected by JavaScript code installed on the advertiser’s website. All publicly available information that can be obtained from the browser is collected. The most important downloaded data include the following:

– For monitoring clicks: date and time of entry, source (where the traffic came from, which advertisements it came from), identifier of the monitored website, type of the operating system and the browser, information whether the operating system, browser, or set languages were not counterfeited, whether mouse movements or screen scrolling were performed, information about the Internet provider, whether the connection was made through a proxy, the type of IP address (hosting, proxy, VPN, Tor network, botnet – information from an external source), whether there was a sale, whether the parameters provided by JavaScript are correct for a given browser type and are not counterfeited. In addition, information about the user’s behaviour is also taken into account: the number of visits to the page, the area where the mouse moved, the number of page scrolls, the number of unique mouse points indicated, the number of pressed keys.

– For monitoring leads: date and time of entry, source (where the traffic came from, which advertisements it came from), identifier of the monitored website, type of operating system and browser, information whether the operating system, browser, or set languages were not counterfeited, whether mouse movements or screen scrolling were performed, information about the Internet provider, whether the connection was made through a proxy, the type of IP address (hosting, proxy, VPN, Tor network, botnet), whether there was a sale, whether the parameters returned by JavaScript are correct for a given browser type and the information about the browser has not been counterfeited.

In addition, information about the website user’s behaviour is also taken into account: the number of visits to the website, the area in which the mouse moved, the number of page scrolls, the number of unique mouse points indicated, the number of pressed keys, the number of clicks with the left, middle and right mouse buttons, the number of pages remembered in the history, the number of text fields on the page, the number of words pasted from the clipboard, the number of text fields clicked, the number of passes between text fields on the page with the Tab key, the number of all controls in the form that must be filled in, the number of changed values in the controls, the number of texts changed during editing, the number of fields filled with no key pressed, and the number of fields filled very quickly (time is measured in ms).
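Several of the behavioural features listed above are simple counters over the stream of browser events. As an illustration only, the sketch below aggregates such counters on the server side from a list of event records. The record format (a dictionary with a `type` key and, for clicks, a `button` key) and the feature names are hypothetical, since the paper does not describe the format produced by the monitoring script.

```python
from collections import Counter


def behaviour_counts(events):
    """Count events per type; clicks are counted separately per mouse button."""
    counts = Counter()
    for event in events:
        kind = event["type"]
        if kind == "click":
            counts[f"click_{event['button']}"] += 1
        else:
            counts[kind] += 1
    return counts


# Hypothetical event log for a single visit.
events = [
    {"type": "keydown"},
    {"type": "keydown"},
    {"type": "click", "button": "left"},
    {"type": "scroll"},
    {"type": "paste"},
]
features = behaviour_counts(events)
```

A `Counter` returns zero for event types that never occurred, which gives every visit the same set of numeric features regardless of what the user actually did.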

In the experiments, three different models ofneural networks described in Section 3 were tested:

– Model 1 – model of a multilayer perceptron net-work, where one-hot encoded categorical dataand other variables in the form of real numbersare provided as inputs. The network has twohidden layers, the ReLU activation function, andthe Softmax function at the output. The networkdiagram is shown in Figure 2. The investigatednetwork structure has the following parameters:n1 = 10 and n2 = 5. Due to one-hot encoding,the number of network inputs is m = 199.

– Model 2 – a multi-layer network model, where


obtained from the monitored website after clicking on the advertisement. The data has three labels: 'ok' – the input is correct and was made by a human, 'fake' – the input was most likely made by a bot program or by unfair competition, and 'undecided' – the data is most often incomplete. Incomplete values are substituted with a default value, and a new input is created for them in the one-hot encoded set of inputs. The labels were obtained on the basis of sales and additionally assessed by an expert.

– Lead monitoring – the dataset contains data from three financial institutions, covering 263,000 individual visits to a website with a web form used to enter the data necessary to obtain a lead (interest in a loan, credit, etc.). The data has three labels: 'ok' and 'fake', which have a similar meaning as in click monitoring, and 'copy' – the data is most likely copied from another database and automatically pasted into the form. The labels were assigned on the basis of the sales obtained and additionally assessed by an expert.
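The substitution of incomplete values by a default value, mentioned for the click data, can be sketched as follows. This is only an illustration under stated assumptions: the function name `one_hot`, the `"missing"` default token and the browser vocabulary are hypothetical and do not come from the paper.

```python
def one_hot(value, categories, default="missing"):
    """Return a one-hot list for `value`; absent or unknown values map to `default`.

    The vector has len(categories) + 1 entries: the extra last slot is the
    additional input created for the default value.
    """
    vocab = list(categories) + [default]
    key = value if value in categories else default
    return [1.0 if c == key else 0.0 for c in vocab]


# Hypothetical browser categories, for illustration only.
BROWSERS = ["chrome", "firefox", "safari"]

print(one_hot("firefox", BROWSERS))  # [0.0, 1.0, 0.0, 0.0]
print(one_hot(None, BROWSERS))       # [0.0, 0.0, 0.0, 1.0]
```

Concatenating such vectors for every categorical attribute, together with the real-valued attributes, yields the wide input vector used by a one-hot based network.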

The data is collected by JavaScript code installed on the advertiser's website. All publicly available information that can be obtained from the browser is collected. The most important downloaded data include the following:

– For monitoring clicks: date and time of entry, source (where the traffic came from, which advertisement it came from), identifier of the monitored website, type of the operating system and the browser, information whether the operating system, browser, or set languages were counterfeited, whether mouse movements or screen scrolling were performed, information about the Internet provider, whether the connection was made through a proxy, the type of IP number (hosting, proxy, VPN, Tor network, botnet – information from an external source), whether there was a sale, whether the parameters provided by JavaScript are correct for a given browser type and are not counterfeited. In addition, information about the user's behavior is also taken into account: the number of visits to the page, the area where the mouse moved, the number of page scrolls, the number of unique mouse points indicated, the number of pressed keys.

– For monitoring leads: date and time of entry, source (where the traffic came from, which advertisement it came from), identifier of the monitored website, type of the operating system and the browser, information whether the operating system, browser, or set languages were counterfeited, whether mouse movements or screen scrolling were performed, information about the Internet provider, whether the connection was made through a proxy, the type of IP number (hosting, proxy, VPN, Tor network, botnet), whether there was a sale, whether the parameters returned by JavaScript are correct for a given browser type and whether the information about the browser has been counterfeited.

In addition, information about the website user's behaviour is also taken into account: the number of visits to the website, the area in which the mouse moved, the number of page scrolls, the number of unique mouse points indicated, the number of pressed keys, the number of clicks with the left, middle and right mouse buttons, the number of pages remembered in the history, the number of text fields on the page, the number of words pasted from the clipboard, the number of text fields clicked, the number of passes between text fields on the page with the Tab key, the number of all controls in the form that must be filled in, the number of changed values in the controls, the number of texts changed during editing, the number of fields filled with no key pressed, and the number of fields filled very quickly (time is measured in ms).
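A few of the behavioural counters listed above could be derived from per-field observations roughly as in the sketch below. The record layout `FieldEvent`, the function `form_features` and the 200 ms threshold for "filled very quickly" are assumptions made for illustration; the paper does not state how the counters are computed.

```python
from dataclasses import dataclass


@dataclass
class FieldEvent:
    """One form field as observed in the browser (hypothetical record layout)."""
    changed: bool         # whether the field value changed during the visit
    fill_time_ms: float   # time spent filling the field, in milliseconds
    keys_pressed: int     # key presses registered while the field had focus
    pasted_words: int     # words pasted from the clipboard


def form_features(events, fast_ms=200.0):
    """Aggregate per-field events into behavioural counters.

    `fast_ms` is an assumed threshold; the source does not give the value used.
    """
    changed = [e for e in events if e.changed]
    return {
        "changed_fields": len(changed),
        "pressed_keys": sum(e.keys_pressed for e in events),
        "pasted_words": sum(e.pasted_words for e in events),
        "filled_without_keys": sum(1 for e in changed if e.keys_pressed == 0),
        "filled_very_quickly": sum(1 for e in changed if e.fill_time_ms < fast_ms),
    }


# Hypothetical visit: one typed field, one pasted field, one untouched field.
events = [
    FieldEvent(changed=True, fill_time_ms=3500.0, keys_pressed=12, pasted_words=0),
    FieldEvent(changed=True, fill_time_ms=40.0, keys_pressed=0, pasted_words=3),
    FieldEvent(changed=False, fill_time_ms=0.0, keys_pressed=0, pasted_words=0),
]
features = form_features(events)
print(features)
```

A field that changes without any key press and within a few tens of milliseconds is the kind of pattern that the 'copy' label is meant to capture.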

In the experiments, three different models of neural networks described in Section 3 were tested:

– Model 1 – a multilayer perceptron network, where one-hot encoded categorical data and other variables in the form of real numbers are provided as inputs. The network has two hidden layers with the ReLU activation function and the softmax function at the output. The network diagram is shown in Figure 2. The investigated network structure has the following parameters: n1 = 10 and n2 = 5. Due to one-hot encoding, the number of network inputs is m = 199.

– Model 2 – a multilayer network model, where categorical data is fed to the embedding layers, and data that are real numbers are fed to a separate hidden layer. Both signals are then passed to two successive hidden layers, similar to Model 1. The network diagram is shown in Figure 3. The parameters of the network under study are as follows: the number of neurons in the subsequent layers of the network n1 = 5, n2 = 10, n3 = 5, the number of neurons for each category in the embedding layers ne = 2, and the number of network inputs m = 6.

– Model 3 – similar to Model 2, but its weights are trained in a different way. The weights of the embedding layers are trained first. For this purpose, a special neural network with one hidden layer is prepared, shown in Figure 4. After training, the weights are transferred to Model 2, and the network is trained with these weight values kept unchanged. The network parameters are the same as in Model 2.
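The structure of Model 1 (m = 199 inputs, hidden layers of n1 = 10 and n2 = 5 ReLU neurons, three softmax outputs) can be illustrated with a plain-Python forward pass. This is a sketch only: the weights are random placeholders rather than trained values, the helper names are hypothetical, and the authors' actual implementation and framework are not given in the text.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    shift = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a trained network would load learned values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model1(m=199, n1=10, n2=5, n_out=3, seed=0):
    rng = random.Random(seed)
    return [init_layer(m, n1, rng), init_layer(n1, n2, rng), init_layer(n2, n_out, rng)]


def forward(layers, x):
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = relu(dense(x, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return softmax(dense(h2, w3, b3))


layers = build_model1()
x = [0.0] * 199
x[3] = 1.0  # a single active one-hot input, for illustration
probs = forward(layers, x)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(len(probs), n_params)  # 3 2073
```

With these sizes the network has 2073 trainable parameters, almost all of them in the first layer, which is a direct consequence of the wide one-hot input.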

Each network has three outputs corresponding to three data labels: 'ok', 'fake' and 'undecided' for the data from monitoring clicks, and 'ok', 'fake' and 'copy' for the data from monitoring leads. A series of ten training sessions was conducted for each of the models. The models were trained with data derived from monitoring clicks and leads, with the number of network inputs adjusted beforehand for each case. Accuracy, precision and recall were used to evaluate the effectiveness of the tests [9]. Accuracy is the degree of closeness of measurements of a quantity to that quantity's true value

$$\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{11}$$

Precision is the ratio of the number of correctly classified positive cases to the total number of cases classified as positive, whether relevant or irrelevant

$$\mathrm{precision} = \frac{tp}{tp + fp} \tag{12}$$

and recall is the ratio of the number of correctly classified positive cases to the total number of positive cases

$$\mathrm{recall} = \frac{tp}{tp + fn} \tag{13}$$

where tp – true positive, tn – true negative, fp – false positive, fn – false negative; these values can be derived from the confusion matrix [9]. The measure that combines precision and recall is the F1 score, which is their harmonic mean

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{14}$$
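Equations (11)–(14) can be evaluated from a confusion matrix as sketched below. The one-vs-rest treatment of each class and the example matrix are assumptions made for illustration; the text does not state how the per-class values were aggregated for the three-label problems.

```python
def per_class_counts(confusion, k):
    """Return (tp, fp, fn, tn) for class k; rows are true labels, columns predictions."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


def safe_div(num, den):
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def scores(tp, fp, fn, tn):
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)             # Eq. (11)
    precision = safe_div(tp, tp + fp)                           # Eq. (12)
    recall = safe_div(tp, tp + fn)                              # Eq. (13)
    f1 = safe_div(2 * precision * recall, precision + recall)   # Eq. (14)
    return accuracy, precision, recall, f1


# Hypothetical 3-class confusion matrix (labels: ok, fake, undecided).
confusion = [
    [50, 2, 3],
    [4, 30, 1],
    [5, 0, 5],
]
for k, label in enumerate(["ok", "fake", "undecided"]):
    print(label, [round(s, 3) for s in scores(*per_class_counts(confusion, k))])
```

The guard in `safe_div` avoids a division by zero when a class is never predicted or never occurs in the evaluated data.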

Tables 1 and 2 show, respectively, the average results from ten training trials of the three models for click monitoring data and the best results obtained during this training. Tables 3 and 4 present the average results obtained from ten attempts to train the three models on the data obtained from lead monitoring and the best results obtained.

When scrutinizing the best results obtained, it can be noticed that Models 2 and 3 are comparable with each other in terms of learning efficiency. Model 1 clearly gives worse results. It is worth noting, however, that Model 3 showed much better reproducibility of the results during learning. For practical applications, where a system of this type would be trained regularly with new data, Model 3 would require far fewer repetitions of learning in the case of unsatisfactory results, which would speed up both learning and the deployment of the new model into production.
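The difference between the average and the best result over repeated training runs can be illustrated with a short sketch. The accuracy values below are hypothetical and are not taken from the tables; they only show how a model may reach a high best score while its mean stays low because many runs fail.

```python
from statistics import mean, pstdev


def summarize_runs(accuracies):
    """Summarize repeated training runs by mean, best value and spread."""
    return {
        "mean": mean(accuracies),
        "best": max(accuracies),
        "spread": pstdev(accuracies),
    }


# Hypothetical test accuracies from ten training runs of two models.
unstable = [0.98, 0.55, 0.97, 0.60, 0.98, 0.58, 0.97, 0.62, 0.98, 0.57]
stable = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.96, 0.95, 0.94]

for name, runs in (("unstable", unstable), ("stable", stable)):
    summary = summarize_runs(runs)
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

A small spread means that a single training run is likely to be usable, which is the practical advantage attributed to Model 3 above.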

Table 1. Average results obtained for ten attempts to train neural networks for the problem of monitoring clicks.

Model  Set    Acc    Prec  Rec   F1
1      train  0.898  0.89  0.90  0.87
       test   0.897  0.89  0.90  0.87
2      train  0.77   0.66  0.77  0.69
       test   0.78   0.66  0.77  0.69
3      train  0.95   0.93  0.95  0.94
       test   0.95   0.93  0.95  0.94

Table 2. The best results obtained during ten attempts to train neural networks for the problem of monitoring clicks.

Model  Set    Acc    Prec   Recall  F1
1      train  0.975  0.974  0.975   0.974
       test   0.977  0.977  0.977   0.977
2      train  0.987  0.987  0.987   0.987
       test   0.986  0.987  0.986   0.986
3      train  0.987  0.987  0.987   0.987
       test   0.986  0.986  0.986   0.986


Table 3. Average results obtained for ten attempts to train neural networks for the problem of lead monitoring.

Model  Set    Acc    Prec   Rec    F1
1      test   0.879  0.860  0.879  0.860
2      train  0.822  0.725  0.822  0.760
       test   0.822  0.725  0.822  0.760
3      train  0.956  0.937  0.956  0.943
       test   0.956  0.937  0.956  0.943


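Table 4. The best results obtained during ten attempts to train neural networks for the problem of lead monitoring.

Model  Set    Acc    Prec    Rec    F1
1      test   0.975  0.9754  0.975  0.975
2      train  0.995  0.995   0.995  0.995
       test   0.995  0.995   0.995  0.995
3      train  0.995  0.995   0.995  0.995
       test   0.995  0.995   0.995  0.995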

5 Conclusion

In the paper, six neural network models prepared for the classification of network traffic related to Internet advertising were developed, presented and tested. The models were trained with real data obtained from 11 large websites – large Polish e-stores and financial institutions. The presented research may become the basis for intelligent protection against artificially generated internet traffic, in particular against internet bots, virtual machines generating artificial clicks, click farms, fraudsters generating artificial clicks, and abuses related to repeated filling of applications by bots on websites (fraud leads). The results can also contribute to the proper management of advertising expenses, serve as the basis for making complaints, blocking inappropriate visits to the website, or queuing tasks related to handling leads. Lower-quality leads, classified as artificially generated (bot), could be moved to the very end of the queue, so that valuable leads could be handled with priority.

References

[1] AsSadhan, B., Moura, J.M., Lapsley, D., Jones, C., Strayer, W.T.: Detecting botnets using command and control traffic. In: 2009 Eighth IEEE International Symposium on Network Computing and Applications, pp. 156–162. IEEE (2009)

[2] Bengio, Y.: Learning deep architectures for AI. Now Publishers Inc (2009)

[3] Chen, C.M., Lin, H.C.: Detecting botnet by anomalous traffic. Journal of Information Security and Applications 21, 42–51 (2015)

[4] eMarketer: Digital ad fraud 2019. https://www.emarketer.com/content/digital-ad-fraud-2019

[5] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. JMLR Workshop and Conference Proceedings (2011)

[6] Lagopoulos, A., Tsoumakas, G., Papadopoulos, G.: Web robot detection in academic publishing. arXiv preprint arXiv:1711.05098 (2017)

[7] Neal, A., Kouwenhoven, S., Sa, O.: Quantifying online advertising fraud: Ad-click bots vs humans. In: Tech. Rep. Oxford Bio Chronometrics (2015)

[8] Networks, D.: 2018 bad bot report. https://resources.distilnetworks.com/whitepapers/2018-bad-bot-report

[9] Rajaraman, A., Ullman, J.D.: Mining of massive datasets. Cambridge University Press (2011)

[10] Research, J.: Ad fraud - how AI will rescue your budget. https://news.unilead.net/wp-content/uploads/2017/09/Ad-Fraud-How-AI-will-rescueyour-Budget-whitepaper.pdf

[11] Seyyar, M.B., Catak, F.O., Gul, E.: Detection of attack-targeted scans from the Apache HTTP server access logs. Applied Computing and Informatics 14(1), 28–36 (2018)

[12] Silva, S.S., Silva, R.M., Pinto, R.C., Salles, R.M.: Botnets: A survey. Computer Networks 57(2), 378–403 (2013)

[13] Soniya, B., Wilscy, M.: Detection of randomized bot command and control traffic on an end-point host. Alexandria Engineering Journal 55(3), 2771–2781 (2016)

[14] Zhu, X., Tao, H., Wu, Z., Cao, J., Kalish, K., Kayne, J.: Fraud prevention in online digital advertising. Springer (2017)


Table 3. Average results obtained for ten attemptsto train neural networks for the problem of lead

monitoring.


test 0.879 0.860 0.879 0.8602 train 0.822 0.725 0.822 0.760

test 0.822 0.725 0.822 0.7603 train 0.956 0.937 0.956 0.943

test 0.956 0.937 0.956 0.943


of lead monitoring.


test 0.975 0.9754 0.975 0.9752 train 0.995 0.995 0.995 0.995

test 0.995 0.995 0.995 0.9953 train 0.995 0.995 0.995 0.995

test 0.995 0.995 0.995 0.995

5 Conclusion

In the paper, six neural network models pre-pared for the classification of network traffic re-lated to Internet advertising were developed, pre-sented and tested. The models were trained withreal data obtained from 11 large websites – largePolish e-stores and financial institutions. The pre-sented research may become the basis for conduct-ing intelligent protection against artificially gener-ated internet traffic, in particular against internetbots, virtual machines generating artificial clicks,click farms, fraudsters (fraudsters generating artifi-cial clicks) and abuses related to repeated filling ofapplications by bots on websites (fraud leads). Theycan also contribute to the proper management of ad-vertising expenses, be the basis for making com-plaints, blocking inappropriate visits to the websiteor queuing tasks related to handling leads. Worseleads, classified as artificially generated (bot), couldbe moved to the very end of the queue. Then, valu-able leads could be handled with priority.

References[1] AsSadhan, B., Moura, J.M., Lapsley, D., Jones, C.,

Strayer, W.T.: Detecting botnets using commandand control traffic. In: 2009 Eighth IEEE Inter-national Symposium on Network Computing andApplications, pp. 156–162. IEEE (2009)

[2] Bengio, Y.: Learning deep architectures for AI.Now Publishers Inc (2009)

[3] Chen, C.M., Lin, H.C.: Detecting botnet byanomalous traffic. journal of information securityand applications 21, 42–51 (2015)

[4] eMarketer: Digital ad fraud 2019.https://www.emarketer.com/content/digital-ad-fraud-2019

[5] Glorot, X., Bordes, A., Bengio, Y.: Deep sparserectifier neural networks. In: Proceedings ofthe fourteenth international conference on artifi-cial intelligence and statistics, pp. 315–323. JMLRWorkshop and Conference Proceedings (2011)

[6] Lagopoulos, A., Tsoumakas, G., Papadopoulos,G.: Web robot detection in academic publishing.arXiv preprint arXiv:1711.05098 (2017)

[7] Neal, A., Kouwenhoven, S., Sa, O.: Quantifyingonline advertising fraud: Ad-click bots vs humans.In: Tech. Rep. Oxford Bio Chronometrics (2015)

[8] Networks, D.: 2018 bad bot report.https://resources.distilnetworks.com/whitepapers/2018-bad-bot-report

[9] Rajaraman, A., Ullman, J.D.: Mining of massivedatasets. Cambridge University Press (2011)

[10] Research, J.: Ad fraud - how ai will res-cue your budget. https://news.unilead.net/wp-content/uploads/2017/09/Ad-Fraud-How-AI-will-rescueyour-Budget-whitepaper.pdf

[11] Seyyar, M.B., Catak, F.O., Gul, E.: Detection ofattack-targeted scans from the apache http serveraccess logs. Applied computing and informatics14(1), 28–36 (2018)

[12] Silva, S.S., Silva, R.M., Pinto, R.C., Salles, R.M.:Botnets: A survey. Computer Networks 57(2),378–403 (2013)

[13] Soniya, B., Wilscy, M.: Detection of randomizedbot command and control traffic on an end-pointhost. Alexandria Engineering Journal 55(3), 2771–2781 (2016)

[14] Zhu, X., Tao, H., Wu, Z., Cao, J., Kalish, K.,Kayne, J.: Fraud prevention in online digital ad-vertising. Springer (2017)

Marcin Gabryel earned his Ph.D. degree in computer science at Czestochowa University of Technology, Poland, in 2007. He is an assistant professor in the Department of Computer Engineering at Częstochowa University of Technology. His research focuses on developing new methods in computational intelligence and data mining. He has published over 50 research papers. His present research interests include deep learning architectures and their applications in databases and security.

Magdalena Scherer received her M.Sc. degree in computer science from the Czestochowa University of Technology, Poland, in 2008 and her Ph.D. in management in 2016 from the same university. Currently, she is an assistant professor at Czestochowa University of Technology. Her present research interests include machine learning for prediction and classification.

Łukasz Sułkowski holds a professor degree in economics and a doctoral degree in humanities and specializes in issues related to university management. He is the head of the Department of Higher Education Institution Management at the Jagiellonian University as well as a professor at Clark University and at the Academy of Social Sciences. His research interests include organization and management, epistemology and methodology of social sciences and humanities, organizational culture and intercultural management as well as public management and family business management. He is the author of about 400 publications and 10 books closely related to the subject of his interests. He is a member of the following international organizations and associations: American Academy of Management (AAofM - USA), International Family Enterprises Research Association (IFERA), Reseau Pays du Groupe de Vysegrad (PGV) and the European Academy of Management (EURAM).

Robertas Damaševičius received his Ph.D. in Informatics Engineering from Kaunas University of Technology (Lithuania) in 2005. Currently, he is a Professor at the Department of Applied Informatics, Vytautas Magnus University (Lithuania) and Adjunct Professor at the Institute of Mathematics, Silesian University of Technology (Poland). His research interests include sustainable software engineering, human–computer interfaces, assisted living, data mining and machine learning. He is the author of over 300 papers as well as a monograph published by Springer. He is also the Editor-in-Chief of the Information Technology and Control journal and has been the Guest Editor of several invited issues of international journals (Biomed Research International, Computational Intelligence and Neuroscience, Journal of Healthcare Engineering, IEEE Access, and Electronics).

