Prevent XPath and CSS Based Scrapers by Using
Markup Randomization
(Arabic title: Preventing data harvesting based on XPath and CSS using markup randomization)
By
Ahmed Mustafa Ibrahim Diab
Supervised by
Dr. Tawfiq S. Barhoom
Associate Prof. of Applied Computer Technology
A thesis submitted in partial fulfilment
of the requirements for the degree of
Master of Information Technology
September/2018
The Islamic University of Gaza
Deanship of Research and Graduate Studies
Faculty of Information Technology
Master of Information Technology
Declaration (in Arabic)
I, the undersigned, author of the thesis entitled:
Prevent XPath and CSS Based Scrapers by Using Markup Randomization
hereby declare that the work contained in this thesis is the product of my own effort, except where otherwise referenced, and that this thesis, in whole or in part, has not previously been submitted by others to obtain any academic or research degree or title at any other educational or research institution.
Declaration
I understand the nature of plagiarism, and I am aware of the University’s policy on
this.
The work provided in this thesis, unless otherwise referenced, is the researcher's own
work, and has not been submitted by others elsewhere for any other degree or
qualification.
Student's name: Ahmed Mustafa Ibrahim Diab
Signature:
Date: 28/08/2018
Abstract
Web scraping is a useful technique when used ethically, for example in climate studies and many other research fields; it can also be used unethically, for example to violate content ownership, which amounts to data theft.
Several researchers have introduced approaches for addressing this issue, but these solutions handle the problem only partially or only in certain cases, so the problem still requires further effort.
Consequently, this work introduces a new solution for preventing XPath- and CSS-based web scraping that is efficient and applicable to modern web techniques. The proposed solution is based on Markup Randomization: it renames all CSS classes of a web page and then synchronizes those changes back into the HTML page. The main advantage of the proposed solution is that it can be applied to any web page.
Experiments were run over a collected dataset consisting of 30 websites divided into three categories: News, Currency Rates and Weather. The aim of the experiments was to measure similarity, file size and processing time.
Visual similarity tests showed that no visual change occurred during or after applying the solution: most comparison results were 100%, and the few remaining results were above 97% because the pages contained HTML tags not supported by the comparison tools, such as tags in a different namespace like Facebook plugins.
File size also changed during the process: in some experiments the file size decreased because unnecessary HTML elements were removed, while in others it increased because of the length of the generated CSS class names.
The processing time of the solution depends on file size: files with more than 4500 lines take about 5 minutes on average, while files with up to 4500 lines take less than 2 minutes.
Keywords: Anti-Scraper, Anti-Data theft, Web Scrapers.
Abstract (in Arabic)
Web scraping, the automated collection of information from websites, can be used ethically, for example in weather forecasting or in scientific research; on the other hand, it can be used unethically in ways that violate content ownership, which amounts to data theft.
Several researchers have proposed approaches to this problem, but these solutions cannot end it completely because they address it only partially, or only during some of the times the scraper runs, or they cannot be applied to the latest modern web standards.
In contrast, this thesis introduces a new method for preventing web scraping effectively and in a way that works with the latest web standards. The method is based on randomizing the markup: it renames all the style rules (CSS rules) and, at the same time, applies the same changes to the page's HTML markup, and it can be applied easily and without restrictions to every page of the website.
The proposal was evaluated on a dataset prepared for this purpose, consisting of 30 websites of different designs distributed over three categories: news sites, currency sites and weather sites. The aim of the experiments was to measure the similarity of each page before and after applying the proposed method, the change in the size of the code files, and the total time needed to apply the method.
Visual similarity was checked using tools that measure page similarity. The results showed no visible change: in most cases the similarity was 100%, and in some cases it reached 97% because the original code contained tags that are not supported by the measurement tools and that generate different code on each load, such as Facebook plugins.
The change in file size was also measured and compared with the original. The results show that file sizes decrease because of the code optimization performed while applying the proposed method; in some cases there was a natural increase in file size because the original code was already optimized and contained no unnecessary or invisible lines that could be removed.
The total time needed to apply the proposed method depends on the size of the original code files: for files of 4500 lines or more the total time is around 5 minutes, while for files of fewer than 4500 lines it is less than two minutes.
Keywords: web scraping, data theft prevention, web scraping prevention.
Dedication
This research is dedicated to my father Mustafa, my mother Suad, my sister and brothers, my wife, my sons Ezzuddeen and Yassin, my friends, and everyone who encouraged me to complete my study.
Acknowledgment
I would first like to thank my thesis advisor, Associate Professor Tawfiq Soliman Barhoom of the Faculty of Information Technology at the Islamic University of Gaza. The door to Prof. Tawfiq's office was always open whenever I ran into a trouble spot or had a question about my research or writing. He consistently allowed this thesis to be my own work, but steered me in the right direction whenever he thought I needed it.
Finally, I must express my profound gratitude to my father, my mother and my wife for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.
Author
Ahmed Mustafa Ibrahim Diab
Table of Contents
Declaration .................................................................................................................... I
Abstract ........................................................................................................................ II
Abstract (in Arabic) ................................................................................................. III
Dedication .................................................................................................................. IV
Acknowledgment ......................................................................................................... V
Table of Contents ....................................................................................................... VI
List of Tables ........................................................................................................... VIII
List of Figures ............................................................................................................ IX
List of Formulas ......................................................................................................... XI
List of Abbreviations ................................................................................................ XII
Chapter 1 Introduction .................................................................................................. 1
1.1 Statement of the Problem ........................................................................................ 2
1.2 Objectives ............................................................................................................... 3
1.2.1 Main Objectives ..................................................................................................3
1.2.2 Specific Objectives ..............................................................................................3
1.3 Importance of the Research .................................................................................... 3
1.3.1 Motivation ............................................................................................................3
1.4 Scope and Limitation of the Research .................................................................... 4
1.5 Overview of Thesis ................................................................................................. 4
Chapter 2 Theoretical Background ............................................................................... 6
2.1 Introduction ............................................................................................................. 6
2.2 Web Scraping Techniques ...................................................................................... 6
2.2.1 Web Usage Mining ..............................................................................................6
2.2.2 Web Scraping: .....................................................................................................9
2.2.3 Semantic Annotations ..........................................................................................9
2.3 The Custom Scraper ................................................................................................ 9
2.3.1 Web Crawler ........................................................................................................9
2.3.2 Data Extractor ....................................................................................................10
2.3.3 Exporting to CSV ..............................................................................................11
2.4 Scrapple ................................................................................................................ 11
2.5 Extracting Entity Data from Deep Web Precisely ................................................ 12
2.6 XQUERY Wrapper ............................................................................................... 13
2.7 Page Similarity ...................................................................................................... 14
2.7.1 Structure and Style Similarity ............................................................................14
2.7.2 Visual Similarity ................................................................................................17
2.8 Summary ............................................................................................................... 22
Chapter 3 Related Works ............................................................................................ 23
3.1 Introduction ........................................................................................................... 23
3.2 Legal Efforts ......................................................................................................... 23
3.2.1 Copyright Law ...................................................................................................23
3.2.2 Digital Millennium Copyright Act ....................................................................24
3.3 Developer Efforts .................................................................................................. 25
3.3.1 ShieldSquare ......................................................................................................25
3.3.2 ScrapeDefender ..................................................................................................26
3.3.3 ScrapeSentry ......................................................................................................27
3.3.4 Distil Networks ..................................................................................................28
3.4 Researchers Efforts ............................................................................................... 30
3.4.1 Markup Randomization .....................................................................................30
3.4.2 Identification and Clustering .............................................................................31
3.5 Summary ............................................................................................................... 37
Chapter 4 Methodology .............................................................................................. 40
4.1 Introduction ........................................................................................................... 40
4.2 The proposed solution: .......................................................................................... 40
4.2.1 Supported Scrapers ............................................................................................41
4.2.2 Roadmap ............................................................................................................45
4.4 Summary: .............................................................................................................. 51
Chapter 5 Experiments and Discussion ...................................................................... 53
5.1 Introduction ........................................................................................................... 53
5.2 Dataset .................................................................................................................. 53
5.3 Experiment Settings .............................................................................................. 55
5.4 Experiments Process ............................................................................................. 55
5.4.1 Experiment: Processing Time ............................................................................56
5.4.2 Result Discussion: Processing Time ..................................................................58
5.4.3 Experiment: File Size ........................................................................................60
5.4.4 Result Discussion: File size ...............................................................................61
5.4.5 Experiment: Similarity .......................................................................................64
5.4.6 Result Discussion: Similarity ............................................................................67
5.4.7 Re-Run Web Scraper .........................................................................................72
5.5 Summary: .............................................................................................................. 75
Chapter 6 Conclusion .................................................................................................. 76
References ................................................................................................................... 78
List of Tables
Table (3.1): Summary for Related works .................................................................. 38
Table (5.1): Dataset website categories. .................................................................... 53
Table (5.2): Website list with category. ..................................................................... 54
Table (5.3): Machine specifications. .......................................................................... 55
Table (5.4): Total seconds require to apply the proposed solution. ........................... 57
Table (5.5): Results takes less than 2 minutes. .......................................................... 59
Table (5.6): Results takes more than 2 minutes. ........................................................ 59
Table (5.7): Results that take less processing time than most results. ....................... 60
Table (5.8): Website file size before and after applying the proposed solution. ....... 60
Table (5.9): Website HTML file size decreased after applying the proposed solution. ...........
Table (5.10): Website HTML page size increased after applying the proposed solution. ...........
Table (5.11): Web Page Similarity results by applying Matiskay’s tool. .................. 65
Table (5.12): Website page similarity between original and generated website. ...... 66
Table (5.13): Website Category similarity test. ......................................................... 67
Table (5.14): Results for running web scraper after applying the proposed solution. 72
Table (16): Website extracted data before randomization ......................................... 74
List of Figures
Figure (2.1): General Visits Report. ............................................................................ 7
Figure (2.2): Visits Traffic Source. .............................................................................. 7
Figure (2.3): Web Errors. ............................................................................................. 8
Figure (2.4): Visitor Depth .......................................................................................... 8
Figure (2.5): Top Visits Errors. ................................................................................... 8
Figure (2.6): Web Crawler Architecture. ................................................................... 10
Figure (2.7): Scrapple Architecture ........................................................................... 11
Figure (2.8): Scrapple Configuration File Example. ................................................. 12
Figure (2.9): DOM Tree. ............................................................................................ 13
Figure (2.10): Proposed schema model. .................................................................... 14
Figure (2.11): Tree with post order numbering for DOM elements .......................... 15
Figure (2.12): Example of Translated page. .............................................................. 18
Figure (2.13): Example of marked algebra. ............................................................... 19
Figure (2.14): Naïve term compression ..................................................................... 19
Figure (2.15): Vertical compression. ......................................................................... 20
Figure (2.16): Irreducible term. ................................................................................. 20
Figure (2.17): Visual representatives of two different pages. .................................... 21
Figure (3.1): Researchers Parikh et al.'s algorithm for detecting web scrapers. ............... 33
Figure (3.2): Researchers Catalin and Cristian proposed model architecture. .......... 35
Figure (3.3): Results showing suspicious IP address. ................................................ 36
Figure (4.1): The proposed solution based on Markup Randomization. ................... 40
Figure (4.2): Flow Chart for the proposed solution. .................................................. 41
Figure (4.3): Original CSS code example .................................................................. 42
Figure (4.4): Randomized CSS code ......................................................................... 43
Figure (4.5): Original HTML file. ............................................................................. 43
Figure (4.6): Randomized HTML file. ...................................................................... 44
Figure (4.7): The Proposed solution applying steps. ................................................. 45
Figure (4.8): Snippet from a scraped website. ........................................................... 46
Figure (4.9): CSS code before applying the proposed solution. ................................ 50
Figure (4.10): CSS code after applying the proposed solution. ................................. 50
Figure (4.11): HTML code snippet before applying the proposed solution. ............. 51
Figure (4.12): HTML code snippet after applying the proposed solution. ................ 51
Figure (5.1): Total time required for the proposed solution. ..................................... 56
Figure (5.2): Results classification based on time. .................................................... 58
Figure (5.3): Difference between generated file size and original file size. ..................... 61
Figure (5.4): Code snippet before applying the proposed solution. ........................... 64
Figure (5.5): Code snippet after applying the proposed solution. ............................. 64
Figure (5.6): The original offline version of CBSL website. ..................................... 68
Figure (5.7): Generated version of CBSL website. ................................................... 68
Figure (5.8): Facebook Quote Dialog Example ......................................................... 69
Figure (5.9): Facebook generated code replacing the fb-root div. ............................. 70
Figure (5.10): Facebook generated Quote button. ..................................................... 70
Figure (5.11): AddThis setup code. ........................................................................... 71
Figure (5.12): AddThis generated code. .................................................................... 71
Figure (5.13): AddThis generate buttons look and feel. ............................................ 71
Figure (46): Website markup before randomization .................................................. 73
Figure (47): Website markup after randomization .................................................... 74
Figure (6.1): Proposed model based on Markup Randomization. .............................. 76
List of Formulas
Formula (2.1): XPath Formula Pattern. .................................................................... 14
Formula (2.2): Zhang Shasha’s algorithm complexity. ............................................ 15
Formula (2.3): Zhang Shasha’s space complexity. .................................................. 15
Formula (2.4): Jaccard coefficient formula. ............................................................. 16
Formula (2.5): Web page similarity equation. ......................................................... 17
Formula (2.6): Tree edit distance function. .............................................................. 21
List of Abbreviations
API Application Program Interface
AP-TED Adapted Tree Edit Distance
BOT Automated program that runs over the Internet
CAPTCHA
Completely Automated Public Turing test to tell Computers
and Humans Apart
CSS Cascading Style Sheets
CSV Comma Separated Values
DB Database
DDOS Distributed Denial of Service
DMCA The Digital Millennium Copyright Act
DOM Document Object Model
DOS Denial of Service
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
JSON JavaScript Object Notation
OWASP Open Web Application Security Project
PoP Point of Presence
SaaS Software as a Service
SOC Security Operation Centre
TED Tree Edit Distance
URL Uniform Resource Locator
VPN Virtual Private Network
WAF Web Application Firewall
XML Extensible Markup Language
XPath XML Path Language
Chapter 1
Introduction
Web scraping is the process of extracting information from web pages. It can mimic a human visitor opening the website, but unlike a human it is an automated process, carried out either directly over the HTTP protocol or by embedding a web browser. Web scraping resembles web indexing, the search-engine function that indexes website information using bots; the difference is that a search engine mainly collects meta tags when they exist, while a web scraper extracts specific information from the page content itself (Mahto & Singh, 2016).
Because web pages are rich in information and the need to exchange data across the web in an automated fashion keeps growing, the first web scrapers were developed, inspired by search-engine bot functionality.
Web scraping tools can be used both ethically and unethically: ethically when they are used for research purposes without violating privacy or copyright, and unethically when people take content from websites and repost it on their own sites, particularly when the content is unique and creative.
Web scraping is a useful technique that helps many research fields improve their data and knowledge; one of the most practical examples is weather forecasting, where scrapers are used to collect historical weather data (Bonifacio, Barchyn, Hugenholtz, & Kienzle, 2015).
Another use of web scrapers (Mahto & Singh, 2016) is by new startups: because of limited time, the need for data and limited resources, they prefer to scrape data from similar websites initially and then update the scraped data whenever they need to. This is unfair to content owners who hold the ownership rights to the data, such as innovative content and patents. Over time, this issue has caused them significant losses in several forms: data theft, intellectual-property theft and economic loss. This type of unauthorized use can therefore be classified as data theft (the act of stealing computer-based information from an unknowing victim with the intent of compromising privacy or obtaining confidential information), a harmful and unethical practice with destructive effects on companies.
As a result, web scraping has become a pressing problem that needs to be solved, yet so far only a few solutions have been proposed to mitigate it. Researchers (Wetterström & Andersson, 2009) introduced an invention for preventing scraping by using a filter that reproduces the data requested by the client in an unstructured manner, which browsers can still render but which a robot running scraping software cannot process to obtain the desired data. Other researchers (Haque & Singh, 2015) introduced a compound solution based on classifying visits into three categories (Black-List, Gray-List, White-List) and then treating each visitor according to its category; the Gray-List contains suspicious visitors, which are subjected to several techniques to decide whether to block them or not.
Other solutions, (ScrapeDefender), (ScrapeSentry), (ShieldSquare, 2013) and ("Distil Networks," 2018), were provided as commercial tools by developers; they focus on bot identification and clustering, not on the document itself.
This work proposes a solution for preventing CSS- and XPath-based web scraping by using Markup Randomization, which automatically changes the HTML and CSS files on a timely basis so that they differ in markup while remaining identical in visual look and feel. The web scraper therefore becomes ineffective, because it sees a differently marked-up page on each request and would have to update its extraction rules every time it accesses the page. As a consequence of this technique, the scraper stops functioning correctly and can no longer scrape these pages.
1.1 Statement of the Problem
Although web scraping is a content-security issue, most of the proposed research and tools do not focus on the content: some address it only marginally and the rest not at all. As a result, scrapers have not been prevented and keep being upgraded and updated. There is therefore a need for an efficient technique that prevents web scrapers from accessing web page data without causing any visual change to the page, that can be applied in a short time, and that does not significantly affect the size of the website files.
1.2 Objectives
1.2.1 Main Objectives
The main objective of this research is to introduce a new Anti-Scraping solution that protects web pages from web scrapers by changing the markup randomly on a timely basis, and to verify that XPath- and CSS-based scrapers are mitigated and stopped.
1.2.2 Specific Objectives
The specific objectives of the proposed solution are:
1- Study XPath and CSS web scrapers to understand their methodology and techniques.
2- Develop a technique that randomizes the markup as well as the style without any visual effect on the website visitor.
3- Develop the Anti-Scraper, which automatically runs the randomizer on a timely basis.
4- Build a dataset for testing, and measure visual similarity, processing time and file size for the generated documents.
1.3 Importance of the Research
Due to the rapid development of web scrapers, data theft has become the most important issue for content owners, while existing research has not closed the gap or stopped the scrapers, leading to massive damage for website owners. Researchers have proposed many techniques that mitigate the damage, but they are still not enough. A solution that can eliminate the web scraper is still necessary; defending the markup itself is the first measure that should be taken, before spending effort on building obstacles around the document.
Defending the document itself by changing the markup on the fly is therefore the most important step, and it stops the scraper immediately.
1.3.1 Motivation
According to the Distil Networks 2017 Bad Bot report (Duffield, Haffner, Krishnamurthy, & Ringberg, 2018), 42.2% of all internet traffic was not human and 21.8% of the traffic came from bad bots, while 74% of those bots were advanced bots that use anonymous proxies or even mimic human behaviour. In addition, researchers (Mi et al., 2019) listed a group of residential IP proxy providers that supply enormous numbers of IPs for web scraping, which can bypass any security firewall based on IP filtering or digital-fingerprint approaches; consequently, solutions based only on identifying and classifying bots are bound to fail at detecting those scrapers.
1.4 Scope and Limitation of the Research
1- Only XPath and CSS web scrapers are the basis of this study, because most web scrapers are based on XPath or CSS.
2- The Anti-Scraper focuses on changing the HTML markup and CSS randomly on a timely basis.
3- Regular-expression scrapers are out of scope.
4- Optimizing the processing time is not addressed in this proposed solution, but may be in future work.
1.5 Overview of Thesis
This thesis is organized as follows:
1- Theoretical Background: this chapter gives the reader background on web scraping techniques and models, then summarizes two algorithms for web page similarity that are used to evaluate the experiments.
2- Related Works: this chapter reviews the different efforts for preventing or mitigating the web scraping problem and provides insight into the gap between them. These efforts are grouped into three categories: legal, developer and researcher efforts.
3- Methodology: this chapter introduces the proposed solution for preventing web scraping, which consists of three steps (randomize the CSS, synchronize the HTML, and send the new page to the browser), then describes the supported web scrapers that the solution prevents, and finally presents the steps for applying the proposed solution.
4- Experiments and Discussion: this chapter evaluates the proposed solution by measuring how much processing time it needs, how the file size changes, and the visual similarity between the original and the generated web page. These factors fairly reveal the standing of the proposed solution.
5- Conclusion: this chapter summarizes the thesis problem, solution, experiments and results in a few paragraphs and highlights the main outcome of this thesis for businesses.
Chapter 2
Theoretical Background
2.1 Introduction
This chapter is concerned with research on web bots, web scraping and page similarity. Web scraping techniques are discussed in order to understand researchers' efforts to improve and enhance scrapers, so that the proposed solution can deal with them.
Page similarity research is also discussed, since it supports the proposed solution in the experimental section: similarity is the most important factor to be measured.
2.2 Web Scraping Techniques
2.2.1 Web Usage Mining
Web usage mining refers to "the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites" (Mobasher, 2006).
The authors show that web usage data can be extracted from web server logs, and how much knowledge can be obtained when the logs are analysed with dedicated software such as "Nihuo Web Log Analyzer".
This provides a deep view of visitor behaviour; Figures 2.1 to 2.5 show some of the reports produced by the analyzer, the kind of data acquired by web servers and how it can be used to differentiate between a normal visitor, a bot and a scraper.
Figure 2.1 shows the number of visitors per day, which can be used to detect a day with abnormal traffic caused by an attack.
Figure 2.2 shows the countries from which the visits originate: if most visits come from the country the website targets, the traffic appears normal, but if the content targets the U.S. while the visitors come from Asia, this may indicate an attack.
Figure 2.3 shows the number of successful page responses versus errors; if errors exceed successes, this suggests a brute-force attack on the website.
Figure 2.4 shows how many pages are visited in each session, which helps detect bad behaviour: a high rate of deep visits indicates an attack on the website.
Finally, Figure 2.5 is similar to Figure 2.3 but for specific error codes, which helps to understand and differentiate the errors, including authorization and authentication failures, and to know whether someone is trying to access password-protected pages.
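As a simple illustration of how such reports can be derived from a raw server log, the sketch below counts requests and error responses per client IP in an Apache/Nginx combined-format log; the log file name and the thresholds are hypothetical and only meant to show the idea.

import re
from collections import Counter

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

requests, errors = Counter(), Counter()
with open('access.log') as log:              # hypothetical log file
    for line in log:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ip, status = match.group(1), int(match.group(2))
        requests[ip] += 1
        if status >= 400:
            errors[ip] += 1

# Flag clients with an unusually high request count or error ratio (arbitrary thresholds).
for ip, count in requests.most_common(10):
    if count > 1000 or errors[ip] / count > 0.5:
        print(f'suspicious client: {ip} ({count} requests, {errors[ip]} errors)')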
Figure (2.1): General Visits Report.
(Malik & Rizvi, 2011)
Figure (2.2): Visits Traffic Source.
(Malik & Rizvi, 2011)
Figure (2.3): Web Errors.
(Malik & Rizvi, 2011)
Figure (2.4): Visitor Depth
(Malik & Rizvi, 2011)
Figure (2.5): Top Visits Errors.
(Malik & Rizvi, 2011)
2.2.2 Web Scraping
Web scraping converts unstructured information into structured information stored in a central database or spreadsheet. This is done by running one of the scrapers within an application and then defining the criteria and targets for extraction and grouping.
2.2.3 Semantic Annotations
Semantic annotations are notations or metadata used to locate data within a document; a list of semantic data is prepared and a layer is defined for the web scraper before the data is scraped (Malik & Rizvi, 2011).
Another technique (Mahto & Singh, 2016; Mathew, Balakrishnan, & Palani, 2015; Nie, Shen, Yu, Kou, & Yang, 2011; Yu, Guo, Yu, Xian, & Yan, 2014), very common in the literature and implemented in most scraping tools, is DOM-based manipulation with data accessed through XPath and CSS. It is the easiest and simplest technique, it is supported by most programming languages, and the page is treated like an XML document. For this reason, those authors built their scrapers on these techniques and proposed scraping approaches based on DOM manipulation, differing only in the architecture of the methodology, the programming language or the tools used.
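A minimal sketch of this style of DOM-based extraction using the lxml library is shown below; the HTML snippet, the selectors and the class names are illustrative only.

import lxml.html

html = '''
<div class="article">
  <h1 class="title">Dollar exchange rate rises</h1>
  <span class="price">3.65</span>
</div>
'''
doc = lxml.html.fromstring(html)

# XPath-based selection
title = doc.xpath('//h1[@class="title"]/text()')[0]

# CSS-selector-based selection (requires the cssselect package)
price = doc.cssselect('div.article span.price')[0].text

print(title, price)

A scraper of this kind breaks as soon as the class names article, title and price are renamed, which is exactly what the proposed Markup Randomization exploits.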
2.3 The Custom Scraper
This Python-based scraper consists of three parts: a web crawler, a data extractor and a storage method.
The scraper was built with new startups in mind: they need a large amount of data but have no time to collect it, so they need an efficient and fast tool.
2.3.1 Web Crawler
A web crawler is a tool, or set of tools, that iteratively and automatically downloads web pages, extracts URLs from their HTML and fetches them recursively (Thelwall, 2001).
It only needs a list of URLs to visit, called the seed (Mathew et al., 2015); each page is visited and all links inside it are extracted back into the seed list to be visited in turn. Figure 2.6 shows the most common web crawler architecture (a minimal sketch follows the figure), which contains the following components:
1- Downloader: the process that downloads the pages.
2- Queue: holds the list of URLs to download.
3- Scheduler: the process that starts and organizes the downloader.
4- Storage: the process that extracts the metadata of the web page and saves it together with the text of the page.
Figure (2.6): Web Crawler Architecture.
(Thelwall, 2001)
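The sketch below illustrates these components in a few lines of Python, assuming the requests and lxml libraries are available; the seed URL handling, crawl limit and politeness rules are deliberately simplified.

import requests
import lxml.html
from collections import deque

def crawl(seed_url, max_pages=50):
    queue, visited, pages = deque([seed_url]), set(), {}   # queue plays the role of the "seed" list
    while queue and len(pages) < max_pages:
        url = queue.popleft()                              # scheduler picks the next URL
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url, timeout=10).text          # downloader
        pages[url] = html                                  # storage
        doc = lxml.html.fromstring(html)
        doc.make_links_absolute(url)
        queue.extend(doc.xpath('//a/@href'))               # extracted links go back into the queue
    return pages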
2.3.2 Data Extractor
The data extractor extracts information from a single web page; although the page contains many useful resources, the focus is on extracting specific data according to predefined rules. This is achieved by selecting the data using CSS selectors or XPath patterns (Mathew et al., 2015).
2.3.3 Exporting to CSV
After crawling the pages and extracting the data, the list of extracted information held in memory is saved to a CSV file using the Python API (Mathew et al., 2015).
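A minimal sketch of this last step with Python's built-in csv module, assuming the extracted records are held in memory as a list of dictionaries (the field names are illustrative):

import csv

records = [
    {'title': 'Dollar exchange rate rises', 'price': '3.65'},
    {'title': 'Euro exchange rate falls', 'price': '4.20'},
]

with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerows(records)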
2.4 Scrapple
Scrapple is a flexible framework for developing semi-automatic web scrapers (Mathew et al., 2015). Its main purpose and contribution is to reduce the script modifications required to run a scraper such as Scrapy (Kouzis-Loukas, 2016). The parts of figure 2.7 can be explained as follows:
1. Web pages: the web pages to be crawled and scraped.
2. Scrapple: the proposed system, which consists of three processes:
a. Fetching the page: downloads the page markup and stores it.
b. Parsing the element tree: cleans the markup of missing closing tags and whitespace so that it is lighter and faster to parse.
c. Extracting the content: extracts the data from the web page by applying the XPath or CSS patterns.
3. JSON Configuration File: contains the start page for the crawl as well as the criteria for data extraction.
4. Data Format Handler: the final process, which saves the data extracted from the visited web pages to a JSON or CSV file.
Figure (2.7): Scrapple Architecture
(Mathew et al., 2015)
The system architecture emphasizes keeping the configuration outside the Scrapple code: the configuration is moved out of the Python code into a key-value configuration file like the one in figure 2.8. Scrapple then loads the file and reads the configuration after the crawler has accessed the page.
Figure (2.8): Scrapple Configuration File Example.
(Mathew et al., 2015)
Scrapple is very fast because it uses the lxml library (Behnel, Faassen, & Bicking, 2005) for parsing the web page; the authors tested the library against BeautifulSoup (Richardson, 2008) and showed that lxml parses pages considerably faster.
2.5 Extracting Entity Data from Deep Web Precisely
Researchers (Yu et al., 2014) proposed a model for web data extraction that consists of several modules:
Web crawler: an intelligent web crawler that can dive deep into the website and follow the navigation links in static as well as dynamic web pages.
Pretreatment of web resources: two procedures are applied before processing the web pages, first normalizing the HTML page and then eliminating the noisy information.
Locating and extracting the entity data from the Deep Web accurately: data extraction from unstructured to structured form is done through the DOM interface; the document is parsed with JTidy and the web page is transformed into a DOM tree so that each node of the page can be accessed as an object. Figure 2.9 illustrates the DOM tree.
Figure (2.9): DOM Tree.
(Yu et al., 2014)
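A rough sketch of the pretreatment idea, normalizing the markup and removing noisy non-content elements, is shown below using lxml; which tags count as noise is an assumption made here for illustration.

import lxml.html
from lxml.html.clean import Cleaner

def pretreat(html_text):
    # Parsing with lxml already repairs unclosed tags and normalizes the tree.
    doc = lxml.html.fromstring(html_text)
    # Remove elements that usually carry no entity data (assumed to be noise).
    cleaner = Cleaner(scripts=True, javascript=True, style=True,
                      comments=True, page_structure=False)
    return cleaner.clean_html(doc)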
2.6 XQUERY Wrapper
Researchers (Yu et al., 2014) proposed a system to extract data from websites; the approach is based on XQuery. Wikipedia says: "XQuery (XML Query) is a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML, text and with vendor-specific extensions for other data formats (JSON, binary, etc.)" ("XQuery," 2016).
They proposed a schema model for modelling both the web data and the user requirements, illustrated in figure 2.10; it therefore handles all types of data (single and complex). The figure shows the structure of the data in a website and emphasizes its hierarchical nature.
Figure (2.10): Proposed schema model.
(Nie et al., 2011)
This example of the proposed model shows the hierarchical data of the website and differentiates between the types of node each web page has (single and complex).
To annotate the data semantics, each data value is mapped to an attribute, and an exclusive path is then used to annotate the location of the node in the DOM tree. The path is an XQuery expression based on XPath; Formula 2.1 shows the XPath pattern:
P = /T1[p1]/T2[p2]/.../Tm[pm]
Formula (2.1): XPath Formula Pattern.
(Nie et al., 2011)
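For instance, a concrete path following this pattern could look like /html[1]/body[1]/div[2]/table[1]/tr[3]/td[2], where each Ti is a tag name and each pi is a predicate (here a position) selecting one node among its siblings.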
2.7 Page Similarity
2.7.1 Structure and Style Similarity
(Gowda & Mattmann, 2016) proposed a technique for clustering web pages based on their DOM structure and style, which together represent the structural and visual parts of a page.
The researchers used Tree Edit Distance (TED) (Pawlik & Augsten, 2016) to compare DOM trees, while the CSS is compared with the Jaccard similarity (Niwattanakul, Singthongchai, Naenudorn, & Wanapu, 2013) over the CSS class names.
2.7.1.1 Structural Similarity using Tree Edit Distance Measure
Zhang and Shasha's TED algorithm (Zhang & Shasha, 1989) is applied to calculate the similarity between trees because of its simplicity and correctness. Figure (2.11) shows a tree with post-order numbering.
Figure (2.11): Tree with post order numbering for DOM elements
(Gowda & Mattmann, 2016).
The components of the tree are indexed in post order, as shown in Figure (2.11), and the nodes of the DOM tree are indexed in post order correspondingly.
The tree is built incrementally from smaller forests, and the edit cost between two forests is computed by gradually aligning nodes with Insert, Remove and Replace operations, as described in (Zhang & Shasha, 1989).
Dynamic programming is applied to calculate the edit distance between the root nodes of the two DOM trees.
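For illustration only, the freely available zss package provides a Zhang-Shasha implementation; the two tiny trees below are an assumed example rather than DOM trees from the dataset.

from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

# Two small trees: html(body(div, div)) versus html(body(div))
t1 = Node('html', [Node('body', [Node('div'), Node('div')])])
t2 = Node('html', [Node('body', [Node('div')])])

print(simple_distance(t1, t2))  # one node removal, so the edit distance is 1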
(Gowda & Mattmann, 2016) found that the TED algorithm is slow on modern pages because of the complexity of the pages themselves, with their nested tags and rich elements. Zhang and Shasha's algorithm has a time complexity of:
O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2)))
Formula (2.2): Zhang Shasha’s algorithm complexity.
(Zhang & Shasha, 1989)
While it has space complexity of:
O(|T1| × |T2|)
Formula (2.3): Zhang Shasha’s space complexity.
(Zhang & Shasha, 1989)
(Gowda & Mattmann, 2016) therefore chose the AP-TED implementation of TED (Pawlik & Augsten, 2016), which is faster than the traditional TED and reduces the running time by 57% (Pawlik & Augsten, 2016).
TED can be applied efficiently and on a timely basis to compare two DOM trees, the original HTML markup and the randomized one; however, TED alone cannot determine the similarity between two pages, so additional effort is needed to compare the CSS styles of the two documents.
TED cannot be applied to measure CSS similarity, because CSS is not an XML-like document and cannot be represented as a tree. Therefore, (Gowda & Mattmann, 2016) adopted the Jaccard index to measure the CSS style similarity between the original document and the randomized one.
2.7.1.2 Stylistic Similarity using Jaccard Similarity
Cascading Style Sheets (CSS) define the web page style and, thanks to their flexibility, can be adapted into an unlimited number of styles; comparing CSS is therefore an important part of the similarity check.
(Gowda & Mattmann, 2016) applied the Jaccard index by taking D1 and D2 as two web pages and the sets of style class names parsed from their DOMs; the Jaccard similarity coefficient ("Jaccard index," 2018) of the styles is then computed as the fraction of styles overlapping in both:
style similarity = |A ∩ B| / (|A| + |B| − |A ∩ B|)
Formula (2.4): Jaccard coefficient formula.
("Jaccard index," 2018)
The implications of using the Jaccard similarity coefficient on style class names are:
1- Since unique class names are used to compute the similarity, an unequal number of repeated occurrences does not alter the stylistic similarity.
2- Documents displaying similar content possess the same set of class names, and therefore yield a higher value of the Jaccard similarity coefficient.
3- The stylistic similarity measure may also produce false positives for multiple documents from the same website, because the styles are usually kept consistent across all pages of a site; consequently, it only complements the structural similarity measure described in Section 2.7.1.1.
2.7.1.3 Aggregating the Similarities
(Gowda & Mattmann, 2016) proposed a formula for the overall similarity, presented in Formula (2.5), where κ is a constant in [0.0, 1.0] giving the fractional weight of the structural similarity:
similarity = κ · structural similarity + (1 − κ) · stylistic similarity
Formula (2.5): Web page similarity equation.
(Gowda & Mattmann, 2016)
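A compact sketch of Formulas 2.4 and 2.5 in Python, assuming the class names of both documents have already been parsed from their DOMs and that a structural similarity score is available from the TED step:

def stylistic_similarity(classes_a, classes_b):
    # Jaccard coefficient over the sets of CSS class names (Formula 2.4).
    a, b = set(classes_a), set(classes_b)
    if not a and not b:
        return 1.0
    return len(a & b) / (len(a) + len(b) - len(a & b))

def overall_similarity(structural, stylistic, k=0.5):
    # Weighted combination of the two measures (Formula 2.5); k is a chosen constant.
    return k * structural + (1 - k) * stylistic

# Example with illustrative class-name sets and an assumed structural score of 0.9
style = stylistic_similarity({'title', 'price', 'menu'}, {'title', 'menu'})
print(overall_similarity(0.9, style))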
Finally, this technique is used in the experimental section to calculate the similarity between the two copies of each web page, the original and the randomized one. Moreover, Matiskay ("HTML Similarity," 2017) implemented this paper on GitHub as "HTML Similarity", using Python as the scripting language to realize the idea.
2.7.2 Visual Similarity
(Alpuente & Romero, 2009) proposed a technique for comparing the visual structure of web pages. HTML tags are classified according to their visual effect, transforming the page into a normalized form in which groups of HTML tags are mapped to a common canonical one. The authors then proposed a method for calculating the distance between two web pages, using processes such as compression that decrease the complexity and improve the running time. The next sections cover the steps of this methodology.
2.7.2.1 Visual Structure of Web Pages:
(Alpuente & Romero, 2009) distinguish between the visual effects of HTML tags: many tags produce the same visual impression, which allows the tags to be grouped by visual effect into the following tag classes:
1- grp: table, ul, html, body, tbody, div and p.
2- row: tr, li, h1, h2, hr.
3- col: td.
4- text: otherwise.
All HTML tags are then translated into these group tags, so the new graph of the page looks like figure 2.12 (a small sketch of the mapping follows the figure).
Figure (2.12): Example of Translated page.
(Alpuente & Romero, 2009)
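A small sketch of this classification step; the mapping below simply encodes the four classes listed above.

GRP = {'table', 'ul', 'html', 'body', 'tbody', 'div', 'p'}
ROW = {'tr', 'li', 'h1', 'h2', 'hr'}
COL = {'td'}

def visual_class(tag_name):
    # Map an HTML tag to its canonical visual class (grp, row, col or text).
    tag = tag_name.lower()
    if tag in GRP:
        return 'grp'
    if tag in ROW:
        return 'row'
    if tag in COL:
        return 'col'
    return 'text'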
2.7.2.2 Web Compression
Translating the page produces a clear visual structure, from which repeating structures can be detected. Since the comparison does not depend on the concrete number of child elements of a given class, rows are equivalent to a table with one column, so they can be grouped.
2.7.2.2.1 Marked term
The number of nodes is counted before the transformation so that no information is lost, and terms that appear twice are grouped; figure 2.13 (a, b) shows a tree before and after marking the terms, respectively.
Figure (2.13): Example of marked algebra.
(Alpuente & Romero, 2009)
The marked algebra for this is τ([N]ΣV), where "[N]" represents the number of times the term t is duplicated in the marked term [N]t. For example, two rows containing the same text appear twice, so they are combined into the single form [1]grp([2]row([1]text)) = grp([2]row(text)).
2.7.2.2.2 Horizontal compression
Simplifying the trees is important for the analysis time, so repeating tags are grouped as shown in figure 2.14.
Figure (2.14): Naïve term compression
(Alpuente & Romero, 2009)
2.7.2.2.3 Vertical compression
Since HTML is a markup language with semi-structured, nested elements, chains of nested tags can also be compressed vertically.
Figure (2.15): Vertical compression.
(Alpuente & Romero, 2009)
This process eliminates all the empty containers in the page: grp tags are grouped, while text nodes are not, because they carry data that is sensitive and must not be lost.
2.7.2.2.4 Shrinking and Join
Both vertical and horizontal compression are completed by shrinking the chains and joining the subterms, in the following steps:
1- Tags belonging to a chain of tags that does not influence the appearance of the resulting page are removed first.
2- The subterms are joined. Since both the vertical and horizontal transformations are confluent and terminating, repeatedly applying this operation generates an irreducible term after a finite number of steps.
Figure (2.16): Irreducible term.
(Alpuente & Romero, 2009)
2.7.2.3 Comparison based on visual structure
Comparing two web pages is essentially comparing two trees; after the trees have been normalized and transformed, an edit distance is used for the comparison.
2.7.2.3.1 Tree edit distance
To use TED, a cost function must be defined for each edit operation, as follows: let λ be a fresh constant symbol that represents the empty marked term, and let nd1, nd2 ∈ [N]ΣV be two marked trees. Each edit operation is then represented as:
(𝑛𝑑1 → 𝑛𝑑2) ∈ ([𝑁]𝛴𝑉 × [𝑁]𝛴𝑉)\(𝜆, 𝜆)
Formula (2.6): Tree edit distance function.
Therefore, (nd1 → nd2) is:
1- a relabeling if nd1 ≠ λ and nd2 ≠ λ,
2- a deletion if nd2 ≡ λ,
3- an insertion if nd1 ≡ λ.
2.7.2.3.2 Comparison of Web pages
The two trees that were transformed, shrunk and joined in the previous steps are compared by applying the edit distance, measuring the similarity between the two web pages relative to their number of nodes. The two trees are illustrated in figure 2.17.
Figure (2.17): Visual representatives of two different pages.
(Alpuente & Romero, 2009)
Applying all the steps above to the two example pages gives:
|Tzip| = 15 and |Szip| = 12
δ(tzip, szip) = 2
cmp(t, s) ≈ 0.92
where Tzip and Szip are the irreducible terms of the trees T and S, δ is the edit distance between them, and cmp is the comparison function; the similarity between the two different web pages is therefore about 92%.
2.7.2.4 Implementation
The researchers published their code on their university website; it still exists, but it did not work for us because of HTML5 standards, so in this work the implementation was adapted and upgraded before being used.
2.8 Summary
This chapter reviewed the most recent types of web bots and web scrapers; the idea behind each type of scraper was discussed briefly, and each scraper model and its methodology was studied.
Web scrapers were also classified based on the nature of their core activity, and several proposed scraping models were discussed.
Page similarity research was also reviewed, and the core idea of calculating page similarity was broken down into steps, because page similarity is the main factor used to evaluate the proposed solution.
Chapter 3
Related Works
3.1 Introduction
Many efforts have been made to mitigate and stop web scraping; they can be classified into legal, developer and researcher efforts.
Recently, researchers (Wetterström & Andersson, 2009) addressed the problem by proposing a model that prevents the web scraper by securing the web page itself, while the other researchers' efforts are distributed over identifying, classifying and blocking the access of web bots altogether.
The Markup Randomizer is the solution suggested here to prevent the web scraper entirely, by changing the HTML markup together with the corresponding CSS in order to break the web scraper's selection rules.
3.2 Legal Efforts
This section presents a few of the legal instruments that can deal with the web scraping issue, which is closely tied to copyright and to fair use of others' property; the Copyright law, the Digital Millennium Copyright Act (DMCA) and the trespass-to-chattels tort are discussed in the next subsections.
3.2.1 Copyright Law
Copyright law (Mitchell, 2015) was first adopted in Switzerland in 1886. "Copyright is a legal right created by the law of a country that grants the creator of original work exclusive rights for its use and distribution. This is usually only for a limited time. The exclusive rights are not absolute but limited by limitations and exceptions to copyright law, including fair use" ("Copyright," 2018).
Copyright covers creative content only; statistics and facts are not included.
In the case of web scrapers, there are two copyright concerns, one of which is acceptable while the other may expose the scraper to a lawsuit:
1- Illegal usage of others' content: creative works such as poetry are not allowed to be copied to your website.
2- Legal usage of others' content:
a- Statistics and facts: publishing a fact about something that is copyrighted is acceptable.
b- Information about how frequently copyrighted content is posted over time is also acceptable.
c- Creative content shared verbatim may not violate copyright law if the data consists of prices, names, company executives or some other factual piece of information.
3.2.2 Digital Millennium Copyright Act
DMCA (Mitchell, 2015) is "a United States copyright law that implements two
1996 treaties of the World Intellectual Property Organization (WIPO). It criminalizes
production and dissemination of technology, devices, or services intended to
circumvent measures (commonly known as digital rights management or DRM) that
control access to copyrighted works"("Digital Millennium Copyright Act of 1998,"
1998).
Within the DMCA, a safe harbor "is a provision of a statute or a regulation that specifies that certain conduct will be deemed not to violate a given rule. It is usually found in connection with a vaguer, overall standard" ("Safe harbor (law)," 2018).
1- Under the safe harbor provision, if you scrape a web page whose content the website has not declared to be copyrighted, you are safe; once you are notified that the content is copyrighted, you must remove it.
2- You cannot circumvent security measures, e.g. password protection, in order to access and harvest the content.
3- You may use content under the "fair use" rule, which requires taking into account the proportion of the copyrighted work you have used and the purpose of the usage.
To summarize the laws: never publish material without the rights and permissions to do so. Storing the material in your own offline database is fine, but republishing it on your websites is not. Analysing that database and publishing statistics, author data or even meta-analysis data is fine. Another acceptable usage is selecting a few quotes or brief samples for your meta-analysis to make your point, but you should check that this qualifies as "fair use".
3.3 Developer Efforts
Some developers have built their own tools to prevent, detect and monitor web scrapers. They advertise their success and clients, but they have published no academic papers about the methodology they apply, presumably keeping the recipe hidden because of market competition.
3.3.1 ShieldSquare
ShieldSquare (ShieldSquare, 2013) is a software service that provides real-time anti-scraping protection with the following features:
1- Actively detect/prevent website scraping & screen scraping
2- Prevent price scraping bots from competitors
3- Enhance your website’s user experience
4- Get complete visibility into bot traffic on your website
5- See comprehensive insights on BOT types and their sources
3.3.1.1 ShieldSquare Methodology
ShieldSquare provides automated bot prevention and detection for websites and mobile apps without affecting the real user experience. It detects bots by building a signature for each unique visitor to the site. The ShieldSquare architecture is shown in figure 3.1 below.
Figure (3.1): ShieldSquare Model Architecture.
(ShieldSquare, 2013)
3.3.1.2 ShieldSquare Process:
1- When a page visit happens, ShieldSquare API call and JavaScript embedded on
the page collects and sends various parameters about the visitor to the backend
ShieldSquare Engine. Using proprietary technologies and smart algorithms,
ShieldSquare engine builds a unique fingerprint for each visitor.
2- Based on the exhaustive bot detection tests done on the previous activity of this
visitor, the cloud engine classifies the visitor as a human, search engine crawler,
or a bad bot. Based on the classification, if the visitor is a friendly entity (human
or search engine crawler), then ShieldSquare transparently allows the user to pass
by sending API response code as Allow. All of this is achieved in a few
milliseconds without impacting user experience.
3- In the event of a bad bot, ShieldSquare sends the corresponding response code back
to the application. Based on the response codes, you can implement actions like
blocking the bot, challenging with a CAPTCHA, feeding fake data, etc.
ShieldSquare, thus covers all routes and provides you flexibility to choose the
desired response to act against bots as per your business needs.
Although ShieldSquare contains multiple analysis and defense levels, it does not prevent web scrapers entirely: scraping techniques are upgraded quickly, so scrapers can eliminate the barriers and avoid the detection and catching techniques. Because of that, their approach may reduce the number of bots, but it never guarantees that a website is safe from them.
On the other hand, ShieldSquare requires each webpage or mobile app page to check whether the visitor is a real visitor or a bot, which costs performance. As a result, the problem still needs a paradigm that protects the whole website at the web server level and requires no interaction from developers to ensure that every request is handled without exceptions.
3.3.2 ScrapeDefender
ScrapeDefender (ScrapeDefender) is a tool to stop web scrapers with three main functions, Scan, Secure and Monitor, detailed in the following points:
1- Scan: ScrapeDefender routinely scans your site for web scraping vulnerabilities, alerts you about what it finds and recommends solutions.
2- Secure: ScrapeDefender provides bullet-proof protection that stops web scrapers dead in their tracks. Your content is locked down and secure.
3- Monitor: ScrapeDefender provides smart monitoring using intrusion detection techniques and alerts you about suspicious scraping activity when it occurs.
The securing process is achieved by using patented technology: a firewall that prevents scrapers and denies their activity on the website, locking the content down and protecting it from bad bots.
ScrapeDefender performs multiple checks over time, so the firewall can prevent all known scraper patterns and keep the content safe; however, if a scraping technique appears with a different behaviour, meaning new patterns, the firewall will not prevent that scraper.
On the other hand, if attackers launch a DDoS attack against the website, the firewall will go down and the website will either stop or be left alone with the scrapers. As a result, the scrapers will reach the valuable content and take control over the website.
3.3.3 ScrapeSentry
ScrapeSentry (ScrapeSentry, 2018) blocks scrapers from violating intellectual
property with the ability to distinguish the good and bad scrapers whether human or
bot.
ScrapeSentry is a software as a service (SaaS) anti-scraping service 24/7
delivered from the Sentor Security Operations Centre (SOC). These Services include
monitoring, analysis, investigation, blocking policy development, enforcement, and
support.
ScrapeSentry can be installed either on a span port or directly on the webservers
aggregating traffic to a passively located appliance containing the ScrapeSentry
platform.
The policies are applied through interaction with the infrastructure, such as load balancers, webservers or the client's application. If they detect any type of
unauthorized usage, they will either automatically block the visitor or alert the Sentor
SOC for further investigation and intervention in minutes.
The ScrapeSentry service monitors traffic for any suspicious or bad usage. When the system detects malicious traffic, it analyzes it, takes action based on the analysis result, and generates an alert to a security analyst who acts according to the client-specific Incident Response Plan.
ScrapeSentry has great reviews from its clients, as listed on their website. Like the other solutions, they filter the request and then take an action according to its analysis, so the problem still exists: if a new bot is developed with a different footprint, the system will be blind to it and never detect it until the security officers fix it.
Another weak point, again shared with the others, is that they add a new layer to the request lifecycle that filters requests; if the website comes under a DDoS attack that brings that layer down, the scraper will scrape everything until the layer comes back.
3.3.4 Distil Networks
Distil Networks ("Distil Networks," 2018) blocks every OWASP automated threat, such as Web Scraping, Denial of Service or even Skewing, with their bot defense product. It is an excellent product because it is the first one to cover Web Pages, APIs and Mobile Apps, which makes it a distinct service. Although it covers all of those production environment tiers, it is worth noting that web scraping does not really apply to APIs, because API responses are plain data without any presentational layer in the output. Distil Networks describes a holistic bot defense mechanism containing the following processes:
1- Robot exclusion standard:
This approach aims to bar well-behaved bots by adding directives to the robots.txt file on the site. However, web scrapers do not cooperate with these instructions.
2- Manual:
A manual process to stop or reduce web scrapers by adding rules to a firewall or by adding network infrastructure that hides the network and the original server IP address. In any case, this may consume excessively expensive hours with little added value.
3- Web application firewalls (WAF):
WAFs are designed to protect web applications from being misused because of
the presence of common software vulnerabilities. Web scrapers are not focusing on
vulnerabilities but rather intending to mimic real users. In this manner, other than being
programmed to block manually identified IP addresses (see last point), they are of little
use for controlling web scraping.
4- Login enforcement:
Some sites require login to access the most valued data; nevertheless, this is no protection from web scrapers, as it is simple for the perpetrators to create their own accounts and program their web scrapers accordingly.
Strong authentication or CAPTCHAs (see next point) can be deployed, yet these add a burden for genuine clients, whose initially casual interest may be discouraged by the effort of account creation.
5- Are you a human?
One clear way to check web scraping is to ask users to show they are human.
This is the goal of CAPTCHAs (Completely Automated Public Turing test to tell
Computers and Humans Apart). They aggravate a few clients who discover them
difficult to decipher and, obviously, workarounds have been developed. One of the
bad-bot exercises depicted by OWASP is CAPTCHA Bypass (OAT-0093). There are
additionally CAPTCHA farms, where the test posed by the CAPTCHA is outsourced
to teams of low-cost humans via sites on the dark web.
6- Geo-fencing:
Geo-fencing means sites are only exposed inside the geographic areas in which they do business. This will not stop web scraping as such, but it forces the perpetrators to make the additional effort of appearing to run their web scrapers from a particular geographic area, which may simply involve using a VPN link to a local point of presence (PoP).
7- Flow enforcement:
Enforcing the path genuine clients take through a website can ensure they are validated at every step. Web scrapers are frequently hardwired to go straight to high-value targets and have difficulties if forced to follow a typical client's predetermined flow.
8- Direct bot detection and mitigation:
The objective here is the direct detection of scrapers through a range of techniques, including behaviour analysis and digital fingerprinting, using dedicated bot detection and control technology designed for the task. Across multiple clients, providers of such technologies can improve their understanding of web scrapers and other bots through machine learning, to the benefit of all.
Referring to the direct bot detection and mitigation process described above, they focus on preventing the bot from reaching the web server as a whole, but they have no plan for the cases where a bot successfully reaches the page and steals the content; it is therefore still not sufficient or dependable, which is why they add the term "Mitigation" to their proposed technology.
3.4 Researchers Efforts
There are relatively few works addressing the web scraping issue; they are discussed in this section together with their relation to the proposed solution.
Most of the researchers focus on analysing bot behaviour and then classifying bots as good or bad, while one line of work concentrates on the document itself because it is the target of the scraper. The next sections discuss the randomization and identification efforts respectively.
3.4.1 Markup Randomization
Researchers (Wetterström & Andersson, 2009) presented an invention for preventing the scraping of the information content of a database used for providing a website with data. Their invention depends on using an anti-scraping filter, or filtering means, which performs some processing on the data requested by clients before it is sent to them, in order to prevent scraping. The method of preventing information scraping comprises the following steps:
1- Receiving the requested structured data record from the database.
2- Splitting all the elements or the fields of the data into data containers, called
cells, in a predetermined way.
3- Giving each cell a unique sort-id, which is generated by a random number
generator, and location information, which determine the location of the cell
inside the web page.
4- The cells are sorted by the sort-id to establish a new unstructured data, to be
sent to the requesting client.
5- Each cell is encoded into a markup language, e.g. HTML.
6- The resulting file is delivered to the requesting client.
As a result of sorting the data containers into the unstructured manner, a robot with
scraping software would not be able to interpret the content, because it can only deal
with structured data.
On the other hand, the unstructured placement of the data containers or cells would
not cause any problem for the displaying of the file as a web page. The web browser
will ignore the cells structural placement in the code, which is based upon the sort-id,
and will visually sort the data according to the location information.
Thus, the scraping robot is prevented from using a file generated by the proposed filter (a minimal illustration of the idea follows).
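As a minimal illustration of the idea (this sketch is by the present author, not the patented implementation), the cells below appear in the markup in a scrambled order, while a CSS property restores the visual order for the browser; a position-based scraper reading the raw markup therefore sees the wrong sequence:

<!-- Illustration only: the markup order follows random sort-ids, while the
     visual order is restored by the CSS "order" property on the flex items. -->
<div style="display:flex">
  <span style="order:3">4.2</span>       <!-- visually third -->
  <span style="order:1">Changes</span>   <!-- visually first -->
  <span style="order:2">EUR/USD</span>   <!-- visually second -->
</div>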
(Wetterström & Andersson, 2009) proposed a good solution because it solves part of the problem, namely position-based XPath scrapers, but it is not efficient today: when the model reorders the HTML tags within a page, the page style breaks because the elements are randomly ordered. Another problem is that HTML5/CSS3-based websites are built in a way that cannot be reordered, because the stylesheet is tied to the elements in the HTML file.
On the other hand, they cannot deal with CSS-based scrapers; such a scraper still functions well because the classes are not changed, only the order, so the scraper accesses the data regardless of the layout.
The last weak point is that the paper never discusses performance issues or caching of the files, so the performance of the system would be poor and would not help website owners.
3.4.2 Identification and Clustering
Two researchers (Haque & Singh, 2015) proposed a new model to mitigate web scrapers based on historical analysis of visits. They created three lists for visitors' IP addresses (black-list, gray-list, white-list) and handle each visitor depending on its class. For a black-listed visitor the model blocks the visit and denies session initiation; for a white-listed visitor the session is initiated successfully without any barriers. If the visit is classified as gray-listed, the model handles it with one of the suggested defenses listed below:
Defense levels:
1- The model may display a CAPTCHA before the visitor views the content.
2- The model may identify the scraper through browser information that is usually not sent by real browsers.
3- The model may change the markup randomly to stop the scraper from getting data using old CSS and XPath selectors.
4- The model may convert the information to an image so that the scraper will not reach any valuable text.
5- The model may run a frequency analysis to check whether the number of visits is normal or abnormal.
6- The model may run an interval analysis; if the intervals between visits are too similar, the visitor may be classified as gray-listed and redirected to bot-differentiating techniques such as CAPTCHAs. This may be efficient as a long-term strategy.
7- The model may run a traffic analysis, which is very necessary these days because modern scrapers use many IP addresses; with this technique such scrapers can be detected.
8- The model may run a URL analysis of the visited pages to check the ratio between data-rich and non-rich pages, so that scrapers can be identified.
9- The model may use Honeypots and Honeynets, which are very common in networking companies like Amazon and CloudFlare.
Their solution is good in that it provides a multi-tier defense; on the other hand, it is not enough, because a scraper may evolve until it is treated as white-listed. There is therefore a need to focus more on the content itself so the scraper cannot deal with it. The markup randomization they propose could stop only CSS-based selectors; if the scraper uses XPath, it is not mitigated and keeps functioning well.
Another weak point is that no idea is suggested for caching the generated randomized HTML markup, which means the model generates a new randomized HTML file every time the page is accessed. This causes a harmful load on the server, and if the server receives too many sessions it will go down, so the possibility of a Distributed Denial-of-Service (DDoS) (Mirkovic & Reiher, 2004) increases, which is not acceptable in any way.
Another group of researchers (Parikh, Singh, Yadav, & Rathod, 2018) adopted machine learning for detecting web scraper patterns, which helps detect attackers at run time. They built a tool with a graphical interface so that the customer can easily identify them as well; these tools are targeted at enterprise businesses. The tool they developed is intended to trap the attackers' signatures by using the following techniques:
1- Logstash (Turnbull, 2013): an open source tool for sysadmins and developers for collecting, parsing and transforming logs.
2- Kibana (Gupta, 2015): a tool for visualizing Elasticsearch (Gormley & Tong, 2015) data and navigating the Elastic Stack.
3- Flagging the attacker patterns from the logs.
4- Extracting attacker features from the logs.
The researchers then defined their algorithm, illustrated in the figure below:
Figure (3.1): Researchers Parikh et al algorithm for detecting web scrapers.
(Parikh et al., 2018)
(Flowchart steps: read website logs → feed logs to the Elasticsearch database → visualize using Kibana → detect the attacks → block the attackers in real time.)
They finally discuss the results in section VII, titled "Expected Results", which represents the overall summary of the paper; they talk about visualization, pattern matching and extracting the various feature anomalies.
Their effort is good, but the paper does not contain the graphs required to prove their work; in particular, it ends with "Expected Results", which means no actual results exist. Another weak point is that they say visualization justifies the data and is the main part of their methodology, yet it is missing from the paper; at the very least they should have attached two figures, one representing data for a regular person and one for a suspicious web scraper.
Their model is based on Apache logs, which is good, but it is not efficient at all, because an intelligent web scraper may increase the interval between visits so that its log entries look normal and it cannot be distinguished from a legitimate visitor. On the other hand, they do not have any digital biometric for the scraper, so they use its IP address as its identity. Therefore, this proposed system cannot be considered reliable; it needs to be reworked and supported with concrete, illustrated results.
Researchers Catalin and Cristian (Catalin & Cristian, 2017) proposed an efficient method for the pre-processing phase of mining suspicious web crawlers. It is intended to automatically capture data from network traffic as input for mining algorithms, as a pre-processing step of data mining, after which the potential threats are visualized. The Catalin & Cristian model contains multiple phases, as follows:
1- Framework Architecture.
2- Experiment Setup and Configuration.
3- Results Section.
The Framework Architecture section presents the architecture they propose and its components, as shown in Figure (3.2):
Figure (3.2): Researchers Catalin and Cristian proposed model architecture.
(Catalin & Cristian, 2017)
Unusually, the researchers did not use the (Logstash, Elasticsearch and Kibana) stack; instead they used Snort (Beale, Baker, & Esler, 2007) and Splunk (Duffield, Haffner, Krishnamurthy, & Ringberg, 2018) to collect network traffic and filter it into a specific folder, after which it is ready for the mining algorithms.
They then set up the environment and servers and adapted the tools for identifying suspicious bots, summarized in the following points:
1- Snort automatically analyzes the traffic.
2- Snort then filters the suspicious signals into an external folder.
3- Splunk then automatically analyzes the output of Snort to identify possible threats, and finally a human expert visualizes the data to discover hidden patterns.
They note that bot activity in the logs contains the traditional information about the visitor (user agent, IP address and geolocation); however, since this data is not enough to distinguish bad bots from humans or normal bots, they identify additional digital biometrics that may help the IDS figure out the bad bots, as follows:
1- Number of hits per IP address.
2- Crawling speed.
3- Recurring hits.
4- Hits generating 404 errors.
5- Cookies
Experiments show that the Snort IDS can process huge amounts of data within seconds; they report 99,552 packets/sec, which is a very high rate.
The last section of their model covers the results: they use Splunk to visualize the correlated results, which clearly show the suspicious IP addresses, as in Figure 3.3:
Figure (3.3): Results showing suspicious IP address.
(Duffield et al., 2018)
This pre-processing method for identification is very advanced and is no doubt the most intelligent technique reviewed here, because it presents an excellent method that starts from collecting the data with an IDS, which is intelligent enough to deal with advanced bots. The adoption of Splunk within the framework architecture helps the human expert, not only the machine, to identify anomalies as well as new scraper patterns, which amounts to a kind of digital biometric for the scrapers.
Although these efforts are good within their scope, the system does not completely cover the issue: it pays attention to how to extract and pre-process the data, and the idea still needs to be completed to wrap the main problem and cover all sub-problems. Another point is that it depends on the Snort IDS, which is very good software but can be bypassed, as in the following example:
Encoded URL: http://www.site.com/%73%68%65%6C%6C%2E%70%68%70
Translates to: http://www.site.com/shell.php
The previous example shows that if the attacker tries to hit a specific web page on the server, they can encode the URL so that the IDS treats the two URLs as different; this is not the only issue, but it points out some weaknesses of their proposed method.
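A small PHP check (an illustration only) makes the point: the two strings differ for a naive signature comparison, yet they decode to the same resource.

<?php
// The encoded and the plain URL point to the same resource, but a plain
// string comparison, as used by a signature rule, treats them as different.
$encoded = 'http://www.site.com/%73%68%65%6C%6C%2E%70%68%70';
$plain   = 'http://www.site.com/shell.php';

var_dump($encoded === $plain);            // bool(false): the signature misses it
var_dump(urldecode($encoded) === $plain); // bool(true): both are shell.php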
Also, they mention that a human expert should review the results, which will exhaust both the company and the expert, while some fraction of errors will still occur, whether caused by the human or by the IDS. Moreover, if the scraper bypasses the system through a DDoS attack on the IDS or through camouflage, the whole system becomes useless.
3.5 Summary
This chapter presented the works related to the proposed solution, grouped into three categories: legal, developer and researcher efforts. Table 3.1 summarizes all of them.
First, the legal efforts are the laws introduced to organize copyright as well as the fair use of digital information, websites and web servers in general. While these efforts are very good, they still do not force scrapers to stop their activity, and identifying the real people behind the scraping for prosecution is not an easy task.
Second, the developer or commercial efforts were developed to address web scraping mainly by identifying scrapers or blocking them from accessing the web page in different ways, such as traps, CAPTCHAs and IP blocking; although they can prevent some trivial scrapers, they do not protect the document itself.
Third and finally, the researchers' efforts aim to prevent scraping by at least detecting and identifying scrapers so that the site administrator can take action against them. The Wetterström & Andersson technique, which changes the web document structure, is the most closely related to this work, but it does not work today because it does not support the current HTML5 and CSS3 standards and therefore cannot stop current web scrapers.
Table (3.1): Summary for Related works
Category | Authors | Advantages | Disadvantages | Difference
Legal | Copyright Law | Protects the original content for a limited time. | Does not automatically prevent scrapers; it needs legal actions. | Protects the content all the time; prevents web scrapers in real time.
Legal | DMCA | Protects content from digital users. | Offline usage is fine; storing data in a database is fine. | Prevents the scraper from getting the data and using it offline or storing it in a database.
Developer | ShieldSquare | Prevents web scrapers in real time based on detection. | Does not support new web scraper patterns; based on log analysis. | No need for logs; protects the document all the time while preserving the same look and feel.
Developer | ScrapeDefender | Detects and prevents web scrapers using firewalls. | DDoS can take the firewall down; the detecting code runs on the client side and may be bypassed. | No need for firewalls or for special code running on the client side; protects the web page itself.
Developer | ScrapeSentry | A technique based on detecting and blocking the web scraper that can be installed easily on any web server. | Based on log analysis; attached to the web server, which decreases web server performance. | No need for logs or additional effort on the web server; protects the web page itself.
Developer | Distil Networks | Huge digital biometric network for detecting and preventing web scrapers. | Proposed for mitigating web scrapers; based on log analysis. | Prevents the web scrapers immediately; protects the web page itself.
Researcher | Markup Randomization | Encrypts and randomizes the HTML. | Does not cover CSS; not designed for the new web standards; does not prevent XPath web scrapers. | Supports HTML5/CSS3 standards; prevents XPath web scrapers.
Researcher | Identification and Clustering | Based on intrusion detection and log analysis. | Intrusion detection can be avoided; logs are not enough for detecting web scrapers; requires a human expert to help the classifier with unknown and new web scraper patterns. | Protects the web page itself; no need for log analysis or intrusion detection, which can be bypassed; no need for a human expert for classification.
Chapter 4
Methodology
4.1 Introduction
This chapter presents the proposed solution for preventing web scrapers based on XPath and CSS selectors using Markup Randomization; it also presents the dataset elicitation and finally the roadmap of this thesis.
4.2 The Proposed Solution
The proposed solution based on Markup Randomization is a technique to protect each single web page from XPath- and CSS-based web scrapers; it consists of the main steps presented in Figure 4.1.
Figure (4.1): The proposed solution based on Markup Randomization.
The proposed solution contains the following processes (a sketch is given after this list):
1- CSS Randomization: randomize all CSS rule names by generating a random string of 16 characters within the range (a-z, A-Z) and building a dictionary object that contains the mapping between the original rule name and the generated rule name.
2- HTML Sync with new CSS: sync the HTML page with the new CSS rule names by using the generated dictionary object for the mapping.
3- Cache the randomized HTML and CSS files on disk so they can be served to the client very fast.
4- Send the randomized version to the browser: serve the client a randomized version of the website.
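To make the flow concrete, the following minimal PHP sketch (written as an illustration, not the exact production code) wires these steps together. It assumes the decryptRules() and convertHTML() methods listed later in this chapter are wrapped in a class, here called MarkupRandomizer, and that the mapping dictionary built by decryptRules() can be read back from it; the cache paths are hypothetical.

<?php
// Illustrative sketch only: MarkupRandomizer, its exposed $dictionary and the
// cache paths are assumptions; decryptRules()/convertHTML() are the methods
// shown later in this chapter.
require 'vendor/autoload.php';

function serveRandomizedPage(string $htmlFile, string $cssFile, string $cacheDir): string
{
    $cachedHtml = $cacheDir . '/' . md5($htmlFile) . '.html';
    $cachedCss  = $cacheDir . '/' . md5($cssFile) . '.css';

    // Step 4 (fast path): serve the cached randomized version when it exists.
    if (file_exists($cachedHtml)) {
        return file_get_contents($cachedHtml);
    }

    $randomizer = new MarkupRandomizer();

    // Step 1: randomize all CSS rule names and build the mapping dictionary.
    $randomizedCss = $randomizer->decryptRules(file_get_contents($cssFile));

    // Step 2: sync the HTML page with the new CSS rule names.
    $randomizedHtml = $randomizer->convertHTML($randomizer->dictionary, $htmlFile);

    // Step 3: cache both randomized files on disk for later requests.
    file_put_contents($cachedCss, $randomizedCss);
    file_put_contents($cachedHtml, $randomizedHtml);

    // Step 4: the caller sends this markup to the browser.
    return $randomizedHtml;
}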
This framework can be easily adapted within the web server request life cycle as shown
in Figure 4.2.
Figure (4.2): Flow Chart for the proposed solution.
This flow chart shows that the proposed solution starts after a web page is requested: it checks whether a cached version of that page is available and, if so, sends it to the user; if no cached version is available, the proposed solution generates a new web page, caches it, and then returns it to the user.
4.2.1 Supported Scrapers
4.2.1.1 CSS-Based Scrapers
This type of scraper is designed to extract data from a webpage using CSS selectors. For example, suppose a web page contains two element values and the original markup is:
<div class="title">Data</div>
<div class="news_details">Data</div>
Therefore, the scraper should write the following code to extract those fields.
$('.title').text();
$('.news_details').text();
This code returns the values of the two fields to be stored in the database. The problem is that the CSS class of each field never changes; therefore, the scraper reaches the data whenever it accesses the page.
With the proposed solution, the page markup as well as the CSS is changed automatically on a schedule, so when the scraper is configured to extract fields by CSS classes, its author will find that the scraper has stopped working and never returns data.
This automatic change is the expected result of the proposed solution; the CSS code snippets before and after the change are shown in Figures 4.3 and 4.4.
Figure (4.3): Original CSS code example
Figure (4.4): Randomized CSS code
This change in the CSS requires a corresponding change in the HTML to fit the new CSS rules, so the proposed solution preserves the old rule names, creates a dictionary file that contains the mapping between the old and new rule names, and stores the file temporarily on disk. An example of a randomized HTML file, as well as the original, is shown below in Figures 4.5 and 4.6.
Figure (4.5): Original HTML file.
Figure (4.6): Randomized HTML file.
4.2.1.2 XPath-Based Scrapers
Another type of scraper is designed to extract data from a web page using XPath selectors. For example, suppose a web page contains a table to be extracted and the original markup is the following:
<html>
<body>
<h1>Data</h1>
<table>
<tr><td>Changes</td></tr>
<tr><td class="change-value">4.2</td></tr>
<tr><td class="change-value">3.3</td></tr>
</table>
</body>
</html>
Therefore, the scraper should write the following code to extract those fields
$('//*[@class="change-value"]').text();
This code returns the value of each td element that carries the change-value class. To prevent the XPath-based scraper from extracting the data, the following approaches can be used:
1- Randomize the CSS attributes as in the previous section (the approach used in this work); a small demonstration is sketched below.
2- Add new empty invisible tags to the randomized HTML file so that the scrapers will not find the data matching those tags.
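A minimal demonstration of the first approach, written for illustration with PHP's built-in DOM extension (the markup and the randomized class name below are made up), shows that the old XPath selector no longer matches anything after randomization:

<?php
// Illustration only: count how many nodes an XPath query matches in a snippet.
function countMatches(string $html, string $query): int
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR); // suppress warnings for fragments
    $xpath = new DOMXPath($dom);
    return $xpath->query($query)->length;
}

$original   = '<table><tr><td class="change-value">4.2</td></tr></table>';
$randomized = '<table><tr><td class="qWcRtYbNmKdFgHjL">4.2</td></tr></table>';
$query      = '//*[@class="change-value"]';

echo countMatches($original, $query), "\n";   // 1: the scraper still finds the value
echo countMatches($randomized, $query), "\n"; // 0: the scraper is stopped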
After generating the randomized markup, the following files are saved to disk:
1- Randomized CSS.
2- Randomized HTML.
3- The Mapping File.
This will enhance the performance of the proposed solution. Cron jobs are the ideal way to automate the randomization process for each webpage on the website and to ensure that the markup is unique and refreshed all the time (an example entry is sketched after the list below). The following steps are executed on each run of the cron job:
1- Delete the old cached versions of the randomized CSS, randomized HTML and mapping file.
2- Generate the new randomized files.
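As a sketch only, and assuming the randomizer is exposed as a PHP CLI script (the path and the six-hour schedule below are hypothetical), the cron entry could look like this:

# Hypothetical crontab entry: regenerate the randomized files every 6 hours
0 */6 * * * php /var/www/tools/randomize_markup.php --site=/var/www/html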
4.2.2 Roadmap
In this section, all steps needed to implement and test the proposed solution are discussed; Figure 4.7 illustrates the steps for applying the solution.
Figure (4.7): The Proposed solution applying steps.
4.2.2.1 Defining
The first step is to define the websites and build the dataset that will be used in the following steps; an offline version of each website is also created and saved, containing all needed files such as the HTML, CSS and JavaScript files.
4.2.2.2 Scraping
The web scraper is run on each website in the dataset to extract its data, and the results are stored in a file to be compared later with the results from the randomized version (a sketch of this step is given after Figure 4.8). Figure 4.8 presents example data after running the web scraper on the genuine version of a website.
Figure (4.8): Snippet from a scraped website.
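For illustration only (the file names are hypothetical and this is not the exact scraper used in the experiments), the scraping step can be sketched in PHP with the built-in DOM extension: elements carrying the CSS class "title" in an offline page are selected and their text is written to a CSV file for the later comparison.

<?php
// Illustrative sketch: extract all elements with the CSS class "title" from an
// offline copy of a page and store their text in a CSV file.
$dom = new DOMDocument();
$dom->loadHTMLFile('dataset/news-site/index.html', LIBXML_NOERROR);
$xpath = new DOMXPath($dom);

$out = fopen('results/news-site-original.csv', 'w');
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " title ")]';
foreach ($xpath->query($query) as $node) {
    fputcsv($out, [trim($node->textContent)]);
}
fclose($out);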
4.2.2.3 Applying the Solution
The proposed solution is applied to each web page contained in the dataset, and the randomized version of each web page is saved for testing purposes. The total time required for the randomization process is calculated, as is the difference in file size introduced during the process; finally, the visual similarity between the original page and the generated page is calculated, which is very important for the result discussion.
Figures 4.9 and 4.10 show a CSS code snippet of a website before and after applying the proposed solution, and Figures 4.11 and 4.12 show an HTML code snippet before and after applying it.
The following PHP code presents our methodology for the CSS rule randomization and the HTML synchronization.
public function decryptRules($Rules)
{
$oCssParser = new Sabberworm\CSS\Parser($Rules);
$oCssDocument = $oCssParser->parse();
foreach ($oCssDocument->getAllDeclarationBlocks() as $oBlock) {
foreach ($oBlock->getSelectors() as $oSelector) {
$newSelector = $this->convertRule($oSelector->getSelector());
$oSelector->setSelector($newSelector);
}
}
return $oCssDocument->render();
}
private function convertRule($ruleName)
{
switch ($this->getSelectorsCount($ruleName)){
case 0:
return $ruleName;
case 1:
return $this->getNewNameORExists($ruleName);
break;
default:
$matches = null;
$returnValue = preg_match_all($this->pattern,$ruleName , $matches);
foreach($matches[0] as $match)
{
$new_rule = $this->convertRule($match);
$ruleName = str_replace($match,$new_rule,$ruleName);
}
return $ruleName;
}
}
private function getSelectorsCount($ruleName)
{
/*
* return how many (dots) on the selector string.
* */
$matches = array();
return preg_match_all($this->pattern,$ruleName , $matches);
}
private function getNewNameORExists($ruleName)
{
/*
* check if the current selector name is already decrypted or now and then:
* return new name in case of not decrypted yet.
* Or return the decrypted name.
* */
/*
* TODO
* Loop for all sub-roles and replace die command
* */
$matches = array();
if(preg_match_all($this->pattern,$ruleName , $matches)>1)
{
die('Loop for all sub-roles and replace die command');
}
else{
$real_role = $matches[0][0];
$start_key = substr($real_role, 0, 1);
if(!array_key_exists($real_role,$this->dictionary)){
$this->dictionary[$real_role] = $start_key . $this->getRandomString($real_role);
}
$the_rule = str_replace($real_role, $this->dictionary[$real_role], $ruleName);
return $the_rule;
}
}
private function getRandomString($length)
{
$chars = array_merge(range('a', 'z'), range('A', 'Z'), array('_'));
$length = intval($length) > 0 ? intval($length) : 16;
$max = count($chars) - 1;
$str = "";
while ($length--) {
shuffle($chars);
$rand = mt_rand(0, $max);
$str .= $chars[$rand];
}
return $str;
}
public function convertHTML($dictionary, $page)
{
set_time_limit(0);
$dom = new Dom;
$opt_a = array("cleanupInput"=>false );
$dom->loadFromFile($page,$opt_a);
$totalClasses = count($dictionary);
$UsedClasses = 0;
foreach ($dictionary as $oldKey => $newKey) {
$a = $dom->find($oldKey);
if(count($a)>0)
{
}
foreach ($a as $node) {
$UsedClasses++;
$type = substr($newKey, 0, 1);
if ($type == '.') {
$attrClass = $node->getAttribute('class');
$splitClass = explode(" ", $attrClass);
$strClass = "";
foreach ($splitClass as $key) {
$strClass .= substr($dictionary[".".$key], 1) . " ";
}
$node->setAttribute('class', $strClass);
} else
$node->setAttribute('id', substr($newKey, 1));
}
}
return $dom->root->outerHtml();
}
Figure (4.9): CSS code before applying the proposed solution.
Figure (4.10): CSS code after applying the proposed solution.
Figure (4.11): HTML code snippet before applying the proposed solution.
Figure (4.12): HTML code snippet after applying the proposed solution.
4.2.2.4 Evaluating
To evaluate the proposed solution, the following processes are carried out:
1- Check whether the web scraper is prevented by trying to scrape the generated website again (a sketch is given after this list).
2- Calculate the total time required for applying the proposed solution.
3- Measure the visual similarity between the original version and the randomized version of each web page.
4- Calculate the difference in the HTML and CSS file sizes before and after applying the proposed solution.
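A minimal sketch of the first check (the file names are hypothetical): the scraper output gathered from the original page is compared with the output gathered after randomization; an empty result file means the scraper was prevented.

<?php
// Illustration only: compare the scraper output before and after randomization.
$before = file('results/news-site-original.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
$after  = file('results/news-site-randomized.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];

printf("Rows scraped before randomization: %d\n", count($before));
printf("Rows scraped after randomization:  %d\n", count($after));
echo count($after) === 0 ? "Web scraper prevented.\n" : "Web scraper still extracts data.\n";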
4.3 Summary:
This chapter presented the proposed solution for protecting websites from web scrapers based on a technique called Markup Randomization. The proposed solution can deal with XPath- and CSS-based web scrapers, which have the same internal structure but differ slightly in the way they select a particular node in the DOM.
The proposed solution consists of three steps: randomizing the CSS rule names, syncing the HTML file with the randomized CSS file, and finally sending the randomized version to the client.
Applying the proposed solution is done through four steps. The first is to define the dataset of websites used for testing the proposed solution and to create an offline version of each website so it can be used in the next steps.
The second step is to run the web scraper on each website in the dataset to make sure the website is scrapable and to extract its data, which helps in the next sections.
The third step is to apply the proposed solution to each single web page to generate a new web page that has the same look and feel.
The final step is to evaluate that the generated document cannot be scraped while maintaining the look and feel.
Chapter 5
Experiments and Discussion
5.1 Introduction
This chapter presents the experiments for the proposed solution, which is based on markup randomization and intended to change the markup while preserving the same look and feel. Experiments were established to measure three factors, processing time, file size and visual similarity, and the results are presented and discussed in this section. Finally, the web scraper is re-run to check whether it is prevented or not.
5.2 Dataset
The dataset is a set of websites from three main categories, News, Weather Forecasting and Stock Markets; each category contains 10 websites, and Table 5.1 shows the categories with a description. These websites were collected manually by searching Google using keywords related to each category and then opening each website to check whether it has fresh content or not.
Table (5.1): Dataset website categories.
Category Name Category Description
News A set of websites that present daily-updated news.
Weather Forecasting A set of websites that contains daily-weekly-monthly
predications for the climate properties e.g. humidity,
wind speed.
Stock Markets A set of websites that contains currency prices
updated from the stock immediately.
As the table shows, all of these websites have sensitive content that is updated frequently, which means it would cause a lot of damage to the content owner, who pays a lot to populate and edit this data, if a particular website stole his content: visits would degrade and a competitor website would hijack his site's rating over time.
The selected websites are listed in Table 5.2, which shows each website and its category.
Table (5.2): Website list with category.
# Website Category
1 Bbc News
2 Businessinsider News
3 Buzzfeed News
4 Gizmodo News
5 Huffingtonpost News
6 Mashable News
7 Techcrunch News
8 Thedailybeast News
9 Thenextweb News
10 Thinkprogress News
11 Cbsl Stock Market
12 Forex Stock Market
13 forex-ratings Stock Market
14 Marksandspencer Stock Market
15 Nrb Stock Market
16 Xe Stock Market
17 x-rates Stock Market
18 Wellingtonfx Stock Market
19 Bnm Stock Market
20 Centralbank Stock Market
21 Accuweather Weather forecasting
22 Intellicast Weather forecasting
23 weather-forecast Weather forecasting
24 Yr Weather forecasting
25 holiday-weather Weather forecasting
26 Timeanddate Weather forecasting
27 Nwac Weather forecasting
28 Jnto Weather forecasting
29 Forecast Weather forecasting
30 Bernews Weather forecasting
5.3 Experiment Settings
The experiments were carried out in a cloud server environment on which the proposed solution is applied. Table 5.3 lists the machine specifications.
Table (5.3): Machine specifications.
Machine Cloud Server
CPU 12 cores of Intel Xeon CPU E5-2650L v3 @ 1.80GHz
RAM 16 GB
OS Ubuntu 16.04
Hard Drive Virtual Cloud SSD
Because the proposed solution is built with PHP 7, the Ubuntu Linux distribution was chosen to run the experiments, since PHP is much faster on Linux. A cloud server was selected because of the need for many processes at a low price, and it can be extended and scaled at any point without any extra configuration or reinstallation.
5.4 Experiments Process
Experiments were done over the dataset to check the following factors:
1- Processing Time: the total processing time required for applying the proposed solution; less time means better suitability for production environments.
2- File Size: due to resource limitations, it is highly recommended to test whether the size of the generated randomized markup increases or decreases.
3- Similarity: checking whether the visual look and feel changed after applying the proposed solution, which shows whether the proposed solution is correct and runs as intended.
4- Re-Test Web Scraper: re-run the web scraper to check whether it is prevented or not.
5.4.1 Experiment: Processing Time
Processing time is the main concern for any business because there is a trade-off between fast page rendering for the regular visitor and stopping the scraper bots. The regular user hits the website because he wants to open it for a specific purpose right away. Assume a currency exchange dealer wants to exchange an amount for a client who is waiting for him; if the website takes a long time to render and show up, he will certainly shut down and close his business because of its unreliability.
On the other hand, when the scraper bot tries to scrape data from the website, the page is still rendered to the bot, but the HTML markup as well as the CSS markup should have been randomized by the proposed solution, so the scraper stops.
As a result, the whole processing time is shown in Figure 5.1, which presents the total time for generating a new randomized web page; the results for all 30 websites are shown in Table 5.4.
Figure (5.1): Total time required for the proposed solution.
Table (5.4): Total seconds required to apply the proposed solution.
Website Total Seconds
Bnm 122
Cbsl 6
Centralbank 361
Forex 12
forex-ratings 137
Marksandspencer 264
Nrb 14
Wellingtonfx 1
Xe 39
x-rates 15
Bbc 91
Businessinsider 120
Buzzfeed 107
Gizmodo 64
Huffingtonpost 99
Mashable 88
Techcrunch 432
Thedailybeast 53
Thenextweb 1
Thinkprogress 68
Accuweather 98
Bernews 54
Forecast 121
holiday-weather 274
Intellicast 23
Jnto 5
Nwac 65
Timeanddate 36
weather-forecast 156
Yr 148
Regarding Table 5.4, most of the web pages required little processing time to apply the proposed solution; a few websites have odd values, which are discussed in the next section.
5.4.2 Result Discussion: Processing Time
The processing time for applying the proposed solution regularly takes less than 2 minutes, and only a few results take more than two minutes, as shown in Figure 5.2. Most results took less than two minutes because the required time for applying the proposed solution is coupled with the HTML and CSS line counts; Table 5.5 lists the results that take less than two minutes.
Figure (5.2): Results classification based on time.
(Pie chart legend: less than 25 seconds, less than two minutes, more than two minutes.)
Table (5.5): Results taking less than 2 minutes.
Category Website Time
Currencies x-rates 0:00:36
Weather Forecast 0:00:39
News Techcrunch 0:00:53
News Businessinsider 0:00:54
Currencies Wellingtonfx 0:01:04
Currencies Nrb 0:01:05
Currencies Forex 0:01:08
Currencies Cbsl 0:01:28
Currencies forex-ratings 0:01:31
News Mashable 0:01:38
News Gizmodo 0:01:39
News Buzzfeed 0:01:47
Weather holiday-weather 0:02:00
Weather Bernews 0:02:01
News Thinkprogress 0:02:17
Weather Accuweather 0:02:22
News Bbc 0:02:28
Weather Intellicast 0:02:44
The above-range experiments are the web pages with more lines of HTML markup as well as CSS; this is caused by the larger number of replacements needed, since roughly each line of the body element needs at least one replacement, and therefore processing takes much longer, as shown in Table 5.6.
Table (5.6): Results taking more than 2 minutes.
Category Website HTML lines CSS lines Total lines Time
Currencies Bnm 5014 3221 8235 0:07:12
Currencies Centralbank 2785 4057 6842 0:06:01
News Thenextweb 984 4722 5706 0:04:34
News Huffingtonpost 1383 2397 3780 0:04:24
Finally, the below-range experiments are the web pages that take less processing time than expected, as shown in Table 5.7; this is caused by one of the following:
1- The CSS is not too long.
2- The HTML is not too long.
3- The CSS is not 100% used in the HTML document.
Table (5.7): Results that take less processing time than most results.
Category Website HTML lines CSS lines Total lines Seconds
Currencies Marksandspencer 1789 1836 3625 1
Weather Yr 202 124 326 1
Weather Nwac 501 164 665 5
Currencies Xe 1660 61 1721 6
Weather weather-forecast 383 877 1260 12
Weather Jnto 595 369 964 14
Weather Timeanddate 492 699 1191 15
News Thedailybeast 1033 418 1451 23
5.4.3 Experiment: File Size
Server resources are an important point and should be measured for any proposed solution, because servers are all about resources. As a result, the file size change was tested and tracked between the two versions of each page, the page before applying the randomizer and the page after applying it; the relation between the file size before and after is illustrated in Table 5.8.
Table (5.8): Website file size before and after applying the proposed solution.
Website Size before Size after Diff (Size before / Size after)
Bbc 267 112 2.383929
Businessinsider 92 77 1.194805
Buzzfeed 223 127 1.755906
Gizmodo 174 179 0.972067
Huffingtonpost 276 63 4.380952
Mashable 261 40 6.525
Techcrunch 330 269 1.226766
Thedailybeast 197 62 3.177419
Thenextweb 154 114 1.350877
Thinkprogress 158 129 1.224806
Cbsl 68 43 1.581395
Forex 27 26 1.038462
forex-ratings 53 50 1.06
Marksandspencer 97 61 1.590164
Nrb 54 24 2.25
Xe 67 48 1.395833
x-rates 30 22 1.363636
Wellingtonfx 11 12 0.916667
Bnm 73 64 1.140625
Centralbank 151 102 1.480392
Accuweather 110 48 2.291667
Intellicast 71 49 1.44898
weather-forecast 71 30 2.366667
Yr 85 72 1.180556
holiday-weather 87 45 1.933333
Timeanddate 25 23 1.086957
Nwac 147 106 1.386792
Jnto 40 38 1.052632
Forecast 65 71 0.915493
Bernews 88 123 0.715447
5.4.4 Result Discussion: File size
The file size results were a bit different from the processing time results, as shown in Figure 5.3, because developers do not follow the web standards when writing the CSS documents as well as the HTML documents.
Figure (5.3): Difference between generated file size and original file size.
The proposed solution restructures all those files in the final step; therefore, the generated HTML and CSS are improved in most cases and the size of the generated documents is smaller than the original, see Table 5.9, although in some cases the size increased, as illustrated in Table 5.10.
Table (5.9): Website HTML file size decreased after applying the proposed solution.
Website Size before Size after
Bbc 267 112
Businessinsider 92 77
buzzfeed 223 127
huffingtonpost 276 63
mashable 261 40
techcrunch 330 269
thedailybeast 197 62
thenextweb 154 114
thinkprogress 158 129
cbsl 68 43
forex 27 26
forex-ratings 53 50
marksandspencer 97 61
nrb 54 24
xe 67 48
x-rates 30 22
bnm 73 64
centralbank 151 102
accuweather 110 48
intellicast 71 49
weather-forecast 71 30
yr 85 72
holiday-weather 87 45
timeanddate 25 23
nwac 147 106
jnto 40 38
To discuss the results in Table 5.9, it is necessary to understand the web page's HTML markup and CSS. Both the HTML and the CSS may contain the following unnecessary elements:
1- Comments: comments in CSS are any text wrapped by "/*" and "*/", and comments in HTML are text wrapped by "<!--" and "-->".
2- White spaces: one white space or more.
3- Line breaks: one or more line breaks added by pressing the "Enter" key.
Table (5.10): Website HTML page size increased after applying the proposed solution.
Website Size before Size after
Gizmodo 174 179
Wellingtonfx 11 12
Forecast 65 71
Bernews 88 123
When the proposed solution finishes the randomization process, it removes all unnecessary lines and comments from the original copy of the markup (a sketch of this normalization is given below). A file with many unnecessary elements therefore shrinks immediately and the difference is obvious, while a file with few unnecessary elements does not shrink in most cases, or may even grow a bit, because the generated CSS class names are longer than the original ones. File size matters greatly in production environments: if the file size can be shrunk, then many versions of the randomized web page can be generated in advance, which means more applicability for the system.
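The following stand-alone PHP sketch (an illustration only; the actual proposed solution performs the equivalent normalization while re-rendering the parsed documents) shows the kind of clean-up that explains the size reduction:

<?php
// Illustration of the normalization that shrinks the files: strip comments,
// collapse white-space and drop blank lines.
function stripCssNoise(string $css): string
{
    $css = preg_replace('!/\*.*?\*/!s', '', $css); // remove /* ... */ comments
    $css = preg_replace('/\s+/', ' ', $css);       // collapse white-space and line breaks
    return trim($css);
}

function stripHtmlNoise(string $html): string
{
    $html = preg_replace('/<!--.*?-->/s', '', $html);      // remove <!-- ... --> comments
    $html = preg_replace('/^[ \t]+|[ \t]+$/m', '', $html); // trim each line
    return preg_replace("/\n{2,}/", "\n", $html);          // drop blank lines
}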
To show the difference between the files before and after applying the proposed solution, Figures 5.4 and 5.5 present a specific code snippet before and after applying it.
Figure (5.4): Code snippet before applying the proposed solution.
Figure (5.5): Code snippet after applying the proposed solution.
5.4.5 Experiment: Similarity
Similarity is an important factor to check when comparing the original and generated versions of a web page, to see whether the proposed solution preserves the visual look and feel of each web page or breaks it.
Two groups of researchers (Alpuente & Romero, 2009; Gowda & Mattmann, 2016) proposed two different ways to compare two web pages; therefore both methods were used in the experiments.
Unfortunately, the first technique, proposed by (Gowda & Mattmann, 2016), failed to measure the similarity properly, while the second one, proposed by (Alpuente & Romero, 2009), succeeded. As a result, the second approach was adapted for the latest HTML5 standards and can be used with confidence. The next sections contain the full review of the results.
5.4.5.1 Visual Similarity using the Gowda et al. Method:
Tests were done using Matiskay's ("HTML Similarity," 2017) Python tool, which is an implementation of the Gowda et al. (Gowda & Mattmann, 2016) technique. The tool has two main parts (a sketch of the second metric follows the list):
1- An HTML similarity part, applying the tree edit distance (TED) on the two documents.
2- A CSS similarity part, applying the Jaccard similarity between the sets of CSS classes.
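To illustrate the second metric (this is a sketch of the Jaccard measure itself, not the tool's code; the class names below are made up), the score is the ratio of shared CSS class names to the union of the class names of the two pages, which is why it collapses once the classes are randomized even though the pages render identically:

<?php
// Illustration of the Jaccard similarity between two sets of CSS class names.
function jaccard(array $classesA, array $classesB): float
{
    $a = array_unique($classesA);
    $b = array_unique($classesB);
    $union = count(array_unique(array_merge($a, $b)));
    return $union === 0 ? 1.0 : count(array_intersect($a, $b)) / $union;
}

echo jaccard(['title', 'news_details'], ['title', 'news_details']), "\n";  // 1: identical class sets
echo jaccard(['title', 'news_details'], ['qWcRtYbNm', 'dFgHjKlPo']), "\n"; // 0: after randomization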
The results of the similarity test for each web page are shown in Table 5.11.
Table (5.11): Web page similarity results obtained with Matiskay's tool.
Website Similarity
Thinkprogress 8%
Businessinsider 10%
Buzzfeed 19%
Bbc 21%
Nrb 21%
Accuweather 22%
Nwac 35%
Jnto 39%
Bnm 41%
Forecast 41%
Cbsl 41%
Thedailybeast 45%
x-rates 45%
Huffingtonpost 45%
holiday-weather 46%
Xe 46%
Bernews 47%
Mashable 47%
Centralbank 48%
Thenextweb 48%
Marksandspencer 48%
Forex 48%
Intellicast 48%
Techcrunch 48%
Timeanddate 48%
Gizmodo 48%
weather-forecast 48%
Wellingtonfx 49%
Yr 50%
forex-ratings 50%
5.4.5.2 Visual Similarity using the Romero and Maria Method:
Visual similarity was tested using the xml2maude tool (Alpuente & Romero, 2009), which compares the similarity of two web pages through a series of normalization and transformation steps, generates the tree edit distance for each web page, and calculates the similarity using the formulas suggested by (Alpuente & Romero, 2009).
Each website was tested individually and each website category was also tested as a whole; Table 5.12 and Table 5.13 illustrate the results respectively.
Table (5.12): Website page similarity between original and generated website.
Website Similarity
Accuweather 99.82%
Bbc 100.00%
Bernews 100.00%
Bnm 99.95%
Businessinsider 99.66%
Buzzfeed 100.00%
Cbsl 99.75%
Centralbank 98.98%
Forecast 99.11%
Forex 100.00%
forex-ratings 100.00%
Gizmodo 100.00%
holiday-weather 97.03%
Huffingtonpost 99.92%
Intellicast 100.00%
Jnto 98.67%
Marksandspencer 100.00%
Mashable 100.00%
Nrb 97.74%
Nwac 100.00%
Techcrunch 100.00%
Thedailybeast 99.92%
Thenextweb 99.89%
Thinkprogress 100.00%
Timeanddate 100.00%
weather-forecast 100.00%
Wellingtonfx 100.00%
Xe 100.00%
x-rates 100.00%
Yr 100.00%
Table (5.13): Website Category similarity test.
Category Similarity
News 99.94%
Currency 99.64%
Weather 99.46%
5.4.6 Result Discussion: Similarity
Two methods for similarity were applied. First, the methodology proposed by (Gowda & Mattmann, 2016) and implemented in Python by Matiskay ("HTML Similarity," 2017) was applied, but the results did not match the expectations and it was too hard to find a relation between the characteristics of each website and the results, such as:
1- A relation between the calculated similarity and the file size.
2- A relation between the calculated similarity and the CSS coverage inside the HTML.
Thus, Matiskay's implementation of web page similarity fails to work for this model, even though the two web pages have the same look and feel, as shown in Figures 5.6 and 5.7.
Figure (5.6): The original offline version of CBSL website.
Figure (5.7): Generated version of CBSL website.
As a result, another solution was used that compares the two web pages visually rather than by other means; it compares the two web pages by transforming and compressing them and then calculating the similarity.
The results of Romero and Maria's approach match the expectations because it measures the difference between the original documents and the generated documents and then calculates how similar the generated document is to the original. The similarity values ranged from 97.0% to 100%, depending on:
1- How many unsupported tags are cleared while applying the proposed solution, such as the following tags:
a. <b:if> and <b:else/>
b. <gcse:search/>
2- How many run-time-generated DOM elements are inserted, updated or deleted, because the comparison tool dismisses all of them, such as:
a. Facebook social buttons and dialogs: the code snippet in Figure 5.8 demonstrates an example of Facebook changing the DOM at run time; the empty div tag with id "fb-root" is replaced by the code snippet shown in Figure 5.9 to show the quote button illustrated in Figure 5.10.
Figure (5.8): Facebook Quote Dialog Example
(Facebook, 2018).
Figure (5.9): Facebook generated code replacing the fb-root div.
Figure (5.10): Facebook generated Quote button.
b. AddThis social buttons: many types of buttons with counters and statistics are produced and maintained by AddThis as a service. For example, share buttons can be embedded using the code demonstrated in Figure 5.11; at run time it is replaced by the code illustrated in Figure 5.12 and finally rendered as in Figure 5.13.
Figure (5.11): AddThis setup code.
(AddThis, 2018)
Figure (5.12): AddThis generated code.
Figure (5.13): AddThis generate buttons look and feel.
5.4.7 Re-Run Web Scraper
The web scraper was executed three times to extract data but failed to get any data at all; as presented in Table 5.14, all websites succeeded in stopping the web scraper and their data is protected.
Table (5.14): Results for running web scraper after applying the proposed solution.
# Website Prevent Web Scraper
1 Bbc YES
2 Businessinsider YES
3 Buzzfeed YES
4 Gizmodo YES
5 Huffingtonpost YES
6 Mashable YES
7 Techcrunch YES
8 Thedailybeast YES
9 Thenextweb YES
10 Thinkprogress YES
11 Cbsl YES
12 Forex YES
13 forex-ratings YES
14 Marksandspencer YES
15 Nrb YES
16 Xe YES
17 x-rates YES
18 Wellingtonfx YES
19 Bnm YES
20 Centralbank YES
21 Accuweather YES
22 Intellicast YES
23 weather-forecast YES
24 Yr YES
25 holiday-weather YES
26 Timeanddate YES
27 Nwac YES
28 Jnto YES
29 Forecast YES
30 Bernews YES
For instance, Figure 5.14 shows a code snippet from the original website before randomizing the markup and Figure 5.15 shows the code snippet after the randomization; Table 5.15 contains the data that was scraped from the website before randomizing the markup, while no data was extracted after the randomization.
Figure (5.14): Website markup before randomization.
Figure (5.15): Website markup after randomization.
Table (15): Website extracted data before randomization
News Title: McMaster: Evidence of Russian meddling in the US election is 'now really incontrovertible'
News Url: http://www.businessinsider.com/mcmaster-russia-meddling-us-election-incontrovertible-2018-2

News Title: The Mueller indictments — here's which Russians were charged with interfering in the 2016 US election
News Url: http://www.businessinsider.com/russians-mueller-charged-with-interfering-2016-election-2018-2

News Title: Twitter users are being called out for posting fake claims of racially motivated assaults at 'Black Panther' showings
News Url: http://www.businessinsider.com/twitter-users-post-fake-claims-assaults-black-panther-showings-2018-2

News Title: A hedge fund that focuses solely on marijuana is crushing it
News Url: http://www.businessinsider.com/bi-prime-navy-capital-investing-in-the-public-marijuana-market-2018-2

News Title: Video shows buildings swaying violently during a massive earthquake in Mexico
News Url: http://www.businessinsider.com/mexico-city-earthquake-video-building-2018-2
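The following is a hedged sketch, in Python with lxml, of the kind of XPath-based scraper referred to above and of why it fails after randomization. The class name "post-title" and the sample markup are hypothetical illustrations and are not taken from the actual Businessinsider page.

```python
# Sketch: a scraper hard-coded against a class name finds nothing once the
# class has been renamed by Markup Randomization.
from lxml import html

def extract_headlines(page_source: str):
    tree = html.fromstring(page_source)
    # The XPath expression is hard-coded against the original class name.
    return [
        (a.text_content().strip(), a.get('href'))
        for a in tree.xpath('//h2[contains(@class, "post-title")]/a')
    ]

original = '<h2 class="post-title"><a href="/news/1">Example headline</a></h2>'
randomized = '<h2 class="c3f9a1b27d4e"><a href="/news/1">Example headline</a></h2>'

print(extract_headlines(original))    # [('Example headline', '/news/1')]
print(extract_headlines(randomized))  # [] -> the selector no longer matches
```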
5.5 Summary:
This chapter presented the experiments conducted on the proposed solution and
discussed their results in terms of processing time, file size, and similarity.
The processing time required to apply the proposed solution to each web page was
measured because it is the most important factor in real production environments. File
size was also tested because website performance depends heavily on resource size.
Finally, similarity was measured between the markup before and after applying the
proposed solution, which proves that the markup changed while no visual effect
occurred during the process.
Time is the most important factor in real environments, so a lower processing time
makes the proposed solution more applicable than a higher one. The experiments show
that applying the proposed solution to a particular web page takes less than 2 minutes
for pages whose total markup is under 4500 lines. Longer times can be reduced at the
source: developers can build pages with fewer lines and fewer CSS class attributes by
defining a root CSS class for each block and using element selectors for the required
styles.
File size improved in most cases because the proposed solution normalizes the web
pages by removing all unnecessary white-space, line breaks, and code comments. For
pages that were already normalized, the size increased by at most 9% in most cases,
while one case increased by 39%.
Similarity tests show that most of the web pages exhibit no visual changes after
applying the proposed solution, while a few pages show a 1-3% change, explained by
run-time generated code from third-party scripts or by unresolved third-party HTML
tags.
Finally, re-testing the web scraper against all dataset websites shows that every
website is protected when the proposed solution is applied periodically.
Chapter 6
Conclusion
Web scraping is a trending legal and business issue affecting many kinds of websites,
such as blogs and online business websites. Web scraping activity steals the original
content and republishes it immediately, without preserving the intellectual property
or copyrights of the online businesses.
Figure (6.1): Proposed model based on Markup Randomization (CSS Randomization, HTML sync with the new CSS, then sending the randomized version to the browser).
The proposed solution, based on the Markup Randomization model shown in Figure 6.1,
protects websites from web scrapers by generating a randomized version of each web
page that is visually identical to the original. Repeating the process over a time
span that can be defined and adjusted by the website administrator prevents the web
scraper and permanently solves the web scraping issue.
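As an illustration of the two randomization steps in Figure 6.1 (CSS randomization followed by HTML sync), the following is a minimal Python sketch. It is a simplified reconstruction and not the thesis implementation: it assumes the page's CSS is available as text next to the HTML, uses BeautifulSoup for the HTML side, and handles only plain ".class" selectors.

```python
# Minimal sketch of the markup-randomization idea: rename every CSS class used
# in the page to a fresh random token, then synchronize the same renaming in
# the HTML class attributes.
import re
import secrets
from bs4 import BeautifulSoup

def randomize_markup(html_text: str, css_text: str):
    soup = BeautifulSoup(html_text, 'html.parser')

    # 1. Collect every class name actually used in the HTML.
    used_classes = {c for tag in soup.find_all(class_=True) for c in tag['class']}

    # 2. Map each class to a fresh random identifier (new names on every run).
    mapping = {c: 'c' + secrets.token_hex(6) for c in used_classes}

    # 3. CSS randomization: rewrite ".old-name" selectors to the new names.
    #    (Simplified: a real implementation would use a CSS parser so that
    #    strings such as url(image.png) are never touched.)
    new_css = re.sub(
        r'\.([A-Za-z_][\w-]*)',
        lambda m: '.' + mapping.get(m.group(1), m.group(1)),
        css_text,
    )

    # 4. HTML sync: rewrite the class attributes with the same mapping.
    for tag in soup.find_all(class_=True):
        tag['class'] = [mapping.get(c, c) for c in tag['class']]

    return str(soup), new_css
```

Because a fresh mapping is generated on every run, a scraper that hard-codes yesterday's class names finds nothing today, while the rendered page stays visually identical.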
Experiments were done over a dataset of 30 websites from three categories (News,
Weather forecasting, and Currency markets) to test the total processing time required
by the randomization, the file size changes before and after processing, and finally
the visual similarity between the generated and the original web page.
Results show that the processing time is less than 2 minutes for instances whose total
HTML and CSS is under 4500 lines. The file size decreased in all but a few exceptional
cases, thanks to removing unnecessary white-space, line breaks, and code comments; for
web pages that were already normalized it increased by at most 9% in most cases, with
only one case increasing by 39%. Lastly, the visual similarity test
proved that most of the web pages show no visual changes after the proposed solution
is applied, while a few websites show a 1-3% change, explained by run-time generated
code from third-party scripts or by unresolved third-party HTML tags.
The proposed markup randomization solution is, to the best of our knowledge, distinct
from previous work: it protects websites from web scrapers, is applicable to the
latest web standards and technologies, and can be embedded within web cache systems or
act as an intermediate layer in front of the web server.
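As one possible shape for that intermediate layer, the sketch below shows a Flask after-request hook that rewrites outgoing HTML responses. This is only a hedged illustration of the deployment idea, and randomize_markup_for_page is a hypothetical placeholder for the randomization step sketched above, not part of any existing library.

```python
# Sketch: applying the randomization as an intermediate layer, here expressed
# as a Flask after_request hook; a cache plugin or reverse proxy could play
# the same role.
from flask import Flask

app = Flask(__name__)

def randomize_markup_for_page(html_text: str) -> str:
    # Hypothetical placeholder: in a real deployment this would call the
    # CSS-randomization and HTML-sync step sketched earlier.
    return html_text

@app.after_request
def randomize_outgoing_html(response):
    # Only touch successful HTML responses; leave images, JSON, etc. alone.
    if response.status_code == 200 and response.mimetype == 'text/html':
        response.set_data(randomize_markup_for_page(response.get_data(as_text=True)))
    return response
```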
Online businesses can now depend on the proposed solution because it protects them
from the web scraping problem, which leads to more success in the competition against
competitors who steal their prices. Bloggers can also take advantage of the proposed
solution to increase their traffic and SEO ratings, with more revenue as a result.
Future works:
Many enhancements have been left for the future due to the lack of resources and time;
future work should address the following points:
1- Web scrapers that look up specific content by intelligent methods such as regular
expressions, semantic search, or even machine learning.
2- The proposed solution takes a long time to generate the new HTML and has time
complexity O(n·x), where n is the number of CSS classes and x is the number of classes
used in the HTML document.
3- Changing the structure of the HTML in a way that misleads the web scraper.
References
AddThis. (2018). AddThis. Retrieved from https://www.addthis.com/
Alpuente, M., & Romero, D. (2009). A visual technique for web pages comparison.
Electronic Notes in Theoretical Computer Science, 235, 3-18.
Beale, J., Baker, A. R., & Esler, J. (2007). Snort: IDS and IPS toolkit: Syngress.
Behnel, S., Faassen, M., & Bicking, I. (2005). lxml: XML and HTML with Python.
Bonifacio, C., Barchyn, T. E., Hugenholtz, C. H., & Kienzle, S. W. (2015). CCDST:
A free Canadian climate data scraping tool. Computers & Geosciences, 75, 13-
16.
Catalin, M., & Cristian, A. (2017). An efficient method in pre-processing phase of
mining suspicious web crawlers. Paper presented at the System Theory,
Control and Computing (ICSTCC), 2017 21st International Conference on.
Copyright. (2018). Retrieved from https://en.wikipedia.org/wiki/Copyright
Digital Millennium Copyright Act of 1998. (1998). Retrieved from
https://en.wikipedia.org/wiki/Digital_Millennium_Copyright_Act_of_1998
Distil Networks. (2018, 06/30/2018). Retrieved from
https://www.crunchbase.com/organization/distil
Duffield, N., Haffner, P., Krishnamurthy, B., & Ringberg, H. A. (2018). Systems and
methods for rule-based anomaly detection on IP network flow. In: Google
Patents.
Facebook. (2018). Quote Plugin. Retrieved from
https://developers.facebook.com/docs/plugins/quote#example
Gormley, C., & Tong, Z. (2015). Elasticsearch: The Definitive Guide: A Distributed
Real-Time Search and Analytics Engine. O'Reilly Media, Inc.
Gowda, T., & Mattmann, C. A. (2016). Clustering Web Pages Based on Structure and
Style Similarity (Application Paper). Paper presented at the 2016 IEEE 17th
International Conference on Information Reuse and Integration (IRI).
Gupta, Y. (2015). Kibana Essentials: Packt Publishing Ltd.
Haque, A., & Singh, S. (2015). Anti-scraping application development. Paper
presented at the Advances in Computing, Communications and Informatics
(ICACCI), 2015 International Conference on.
HTML Similarity. (2017). Retrieved from https://github.com/matiskay/html-similarity
Jaccard index. (2018). Retrieved from https://en.wikipedia.org/wiki/Jaccard_index
Kouzis-Loukas, D. (2016). Learning Scrapy: Packt Publishing Ltd.
Mahto, D. K., & Singh, L. (2016). A dive into Web Scraper world. Paper presented at
the Computing for Sustainable Global Development (INDIACom), 2016 3rd
International Conference on.
Malik, S. K., & Rizvi, S. (2011). Information extraction using web usage mining, web
scrapping and semantic annotation. Paper presented at the Computational
Intelligence and Communication Networks (CICN), 2011 International
Conference on.
Mathew, A., Balakrishnan, H., & Palani, S. (2015). Scrapple: a Flexible Framework
to Develop Semi-Automatic Web Scrapers. International Review on
Computers and Software (IRECOS), 10(5), 475-480.
Mi, X., Liu, Y., Feng, X., Liao, X., Liu, B., Wang, X., . . . Sun, L. (2019). Resident
Evil: Understanding Residential IP Proxy as a Dark Service. Paper presented
at the Resident Evil: Understanding Residential IP Proxy as a Dark Service.
Mirkovic, J., & Reiher, P. (2004). A taxonomy of DDoS attack and DDoS defense
mechanisms. ACM SIGCOMM Computer Communication Review, 34(2), 39-
53.
Mitchell, R. (2015). Web scraping with Python: collecting data from the modern web.
O'Reilly Media, Inc.
Mobasher, B. (2006). Web usage mining. Web data mining: Exploring hyperlinks,
contents and usage data, 12.
Nie, T., Shen, D., Yu, G., Kou, Y., & Yang, D. (2011). Construct the XQuery-based
wrapper for extracting web data. Paper presented at the Fuzzy Systems and
Knowledge Discovery (FSKD), 2011 Eighth International Conference on.
Niwattanakul, S., Singthongchai, J., Naenudorn, E., & Wanapu, S. (2013). Using of
Jaccard coefficient for keywords similarity. Paper presented at the Proceedings
of the International MultiConference of Engineers and Computer Scientists.
Parikh, K., Singh, D., Yadav, D., & Rathod, M. (2018). Detection of web scraping
using machine learning.
Pawlik, M., & Augsten, N. (2016). Tree edit distance: Robust and memory-efficient.
Information Systems, 56, 157-173.
Richardson, L. (2008). Beautiful Soup: HTML/XML parser for Python.
Safe harbor (law). (2018). Retrieved from
https://en.wikipedia.org/wiki/Safe_harbor_(law)
ScrapeDefender. (2018). ScrapeDefender. Retrieved from http://scrapedefender.com/
ScrapeSentry. (2018). ScrapeSentry. Retrieved from https://www.scrapesentry.com/
ShieldSquare. (2013). ShieldSquare Bot Mitigation and Bot Management solution.
Retrieved from https://www.shieldsquare.com/
Thelwall, M. (2001). A web crawler design for data mining. Journal of Information
Science, 27(5), 319-325.
Turnbull, J. (2013). The Logstash Book: James Turnbull.
Wetterström, R., & Andersson, S. (2009). Web information scraping protection. In:
Google Patents.
XQuery. (2016). Retrieved from https://en.wikipedia.org/wiki/Web_scraping
Yu, H.-t., Guo, J.-y., Yu, Z.-t., Xian, Y.-t., & Yan, X. (2014). A novel method for
extracting entity data from Deep Web precisely. Paper presented at the The
26th Chinese Control and Decision Conference (2014 CCDC).
Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance
between trees and related problems. SIAM journal on computing, 18(6), 1245-
1262.