Prevent XPath and CSS Based Scrapers by Using
Markup Randomization
(Arabic title: Preventing data harvesting based on XPath and CSS using markup randomization)
By
Ahmed Mustafa Ibrahim Diab
Supervised by
Dr. Tawfiq S. Barhoom
Associate Prof. of Applied Computer Technology
A thesis submitted in partial fulfilment
of the requirements for the degree of
Master of Information Technology
September/2018
The Islamic University of Gaza
Deanship of Research and Graduate Studies
Faculty of Information Technology
Master of Information Technology
Declaration (in Arabic)
I, the undersigned, author of the thesis entitled:
Prevent XPath and CSS Based Scrapers by Using Markup Randomization
hereby declare that the work contained in this thesis is the product of my own effort, except where otherwise referenced, and that this thesis, in whole or in part, has not previously been submitted by others to obtain any academic or research degree or title at any other educational or research institution.
Declaration
I understand the nature of plagiarism, and I am aware of the University’s policy on
this.
The work provided in this thesis, unless otherwise referenced, is the researcher's own
work, and has not been submitted by others elsewhere for any other degree or
qualification.
Student's name: Ahmed Mustafa Ibrahim Diab
Signature:
Date: 28/08/2018
Abstract
Web scraping is a useful technique when used ethically, for example in climate studies and many other research fields; it can also be used unethically, for example to violate content ownership, which amounts to data theft.
Several researchers have introduced approaches for addressing this issue, but these solutions handle the problem only partially or only in certain cases, so the problem still requires further effort.
Consequently, this work introduces a new solution for preventing XPath- and CSS-based web scraping that is efficient and applicable to modern web techniques. The proposed solution is based on Markup Randomization: it renames all CSS classes of a web page and then synchronizes those changes back into the HTML page. The main advantage of the proposed solution is that it can be applied to any web page.
Experiments were run over a collected dataset consisting of 30 websites divided into three categories: News, Currency Rates and Weather. The aim of the experiments was to measure similarity, file size and processing time.
Visual similarity tests showed that no visual change occurred during or after applying the solution: most comparison results were 100%, and the few remaining results were above 97% because the pages contained HTML tags not supported by the comparison tools, such as tags in a different namespace like Facebook plugins.
File size also changed during the process: in some experiments the file size decreased because unnecessary HTML elements were removed, while in others it increased because of the length of the generated CSS class names.
The processing time of the solution depends on file size: files with more than 4500 lines take about 5 minutes on average, while files with up to 4500 lines take less than 2 minutes.
Keywords: Anti-Scraper, Anti-Data theft, Web Scrapers.
Abstract (in Arabic)
Web scraping, the automated collection of information from websites, can be used ethically, for example in weather forecasting or in scientific research; on the other hand, it can be used unethically in ways that violate content ownership, which amounts to data theft.
Several researchers have proposed approaches to this problem, but these solutions cannot end it completely because they address it only partially, or only during some of the times the scraper runs, or they cannot be applied to the latest modern web standards.
In contrast, this thesis introduces a new method for preventing web scraping effectively and in a way that works with the latest web standards. The method is based on randomizing the markup: it renames all the style rules (CSS rules) and, at the same time, applies the same changes to the page's HTML markup, and it can be applied easily and without restrictions to every page of the website.
The proposal was evaluated on a dataset prepared for this purpose, consisting of 30 websites of different designs distributed over three categories: news sites, currency sites and weather sites. The aim of the experiments was to measure the similarity of each page before and after applying the proposed method, the change in the size of the code files, and the total time needed to apply the method.
Visual similarity was checked using tools that measure page similarity. The results showed no visible change: in most cases the similarity was 100%, and in some cases it reached 97% because the original code contained tags that are not supported by the measurement tools and that generate different code on each load, such as Facebook plugins.
The change in file size was also measured and compared with the original. The results show that file sizes decrease because of the code optimization performed while applying the proposed method; in some cases there was a natural increase in file size because the original code was already optimized and contained no unnecessary or invisible lines that could be removed.
The total time needed to apply the proposed method depends on the size of the original code files: for files of 4500 lines or more the total time is around 5 minutes, while for files of fewer than 4500 lines it is less than two minutes.
Keywords: web scraping, data theft prevention, web scraping prevention.
Dedication
This research is dedicated to my father Mustafa, my mother Suad, my sister and brothers, my wife, my sons Ezzuddeen and Yassin, my friends, and everyone who encouraged me to complete my study.
Acknowledgment
I would first like to thank my thesis advisor, Associate Professor Tawfiq Soliman Barhoom of the Faculty of Information Technology at the Islamic University of Gaza. The door to Prof. Tawfiq's office was always open whenever I ran into a trouble spot or had a question about my research or writing. He consistently allowed this thesis to be my own work, but steered me in the right direction whenever he thought I needed it.
Finally, I must express my profound gratitude to my father, my mother and my wife for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.
Author
Ahmed Mustafa Ibrahim Diab
Table of Contents
Declaration .................................................................................................................... I
Abstract ........................................................................................................................ II
Abstract (in Arabic) ................................................................................................. III
Dedication .................................................................................................................. IV
Acknowledgment ......................................................................................................... V
Table of Contents ....................................................................................................... VI
List of Tables ........................................................................................................... VIII
List of Figures ............................................................................................................ IX
List of Formulas ......................................................................................................... XI
List of Abbreviations ................................................................................................ XII
Chapter 1 Introduction .................................................................................................. 1
1.1 Statement of the Problem ........................................................................................ 2
1.2 Objectives ............................................................................................................... 3
1.2.1 Main Objectives ..................................................................................................3
1.2.2 Specific Objectives ..............................................................................................3
1.3 Importance of the Research .................................................................................... 3
1.3.1 Motivation ............................................................................................................3
1.4 Scope and Limitation of the Research .................................................................... 4
1.5 Overview of Thesis ................................................................................................. 4
Chapter 2 Theoretical Background ............................................................................... 6
2.1 Introduction ............................................................................................................. 6
2.2 Web Scraping Techniques ...................................................................................... 6
2.2.1 Web Usage Mining ..............................................................................................6
2.2.2 Web Scraping: .....................................................................................................9
2.2.3 Semantic Annotations ..........................................................................................9
2.3 The Custom Scraper ................................................................................................ 9
2.3.1 Web Crawler ........................................................................................................9
2.3.2 Data Extractor ....................................................................................................10
2.3.3 Exporting to CSV ..............................................................................................11
2.4 Scrapple ................................................................................................................ 11
2.5 Extracting Entity Data from Deep Web Precisely ................................................ 12
2.6 XQUERY Wrapper ............................................................................................... 13
2.7 Page Similarity ...................................................................................................... 14
2.7.1 Structure and Style Similarity ............................................................................14
2.7.2 Visual Similarity ................................................................................................17
2.8 Summary ............................................................................................................... 22
Chapter 3 Related Works ............................................................................................ 23
3.1 Introduction ........................................................................................................... 23
3.2 Legal Efforts ......................................................................................................... 23
3.2.1 Copyright Law ...................................................................................................23
3.2.2 Digital Millennium Copyright Act ....................................................................24
3.3 Developer Efforts .................................................................................................. 25
3.3.1 ShieldSquare ......................................................................................................25
3.3.2 ScrapeDefender ..................................................................................................26
3.3.3 ScrapeSentry ......................................................................................................27
3.3.4 Distil Networks ..................................................................................................28
3.4 Researchers Efforts ............................................................................................... 30
3.4.1 Markup Randomization .....................................................................................30
3.4.2 Identification and Clustering .............................................................................31
3.5 Summary ............................................................................................................... 37
Chapter 4 Methodology .............................................................................................. 40
4.1 Introduction ........................................................................................................... 40
4.2 The proposed solution: .......................................................................................... 40
4.2.1 Supported Scrapers ............................................................................................41
4.2.2 Roadmap ............................................................................................................45
4.4 Summary: .............................................................................................................. 51
Chapter 5 Experiments and Discussion ...................................................................... 53
5.1 Introduction ........................................................................................................... 53
5.2 Dataset .................................................................................................................. 53
5.3 Experiment Settings .............................................................................................. 55
5.4 Experiments Process ............................................................................................. 55
5.4.1 Experiment: Processing Time ............................................................................56
5.4.2 Result Discussion: Processing Time ..................................................................58
5.4.3 Experiment: File Size ........................................................................................60
5.4.4 Result Discussion: File size ...............................................................................61
5.4.5 Experiment: Similarity .......................................................................................64
5.4.6 Result Discussion: Similarity ............................................................................67
5.4.7 Re-Run Web Scraper .........................................................................................72
5.5 Summary: .............................................................................................................. 75
Chapter 6 Conclusion .................................................................................................. 76
References ................................................................................................................... 78
List of Tables
Table (3.1): Summary for Related works .................................................................. 38
Table (5.1): Dataset website categories. .................................................................... 53
Table (5.2): Website list with category. ..................................................................... 54
Table (5.3): Machine specifications. .......................................................................... 55
Table (5.4): Total seconds require to apply the proposed solution. ........................... 57
Table (5.5): Results takes less than 2 minutes. .......................................................... 59
Table (5.6): Results takes more than 2 minutes. ........................................................ 59
Table (5.7): Results that take less processing time than most results. ....................... 60
Table (5.8): Website file size before and after applying the proposed solution. ....... 60
Table (5.9): Website HTML file size decreased after applying the proposed solution. ...........
Table (5.10): Website HTML page size increased after applying the proposed solution. ...........
Table (5.11): Web Page Similarity results by applying Matiskay’s tool. .................. 65
Table (5.12): Website page similarity between original and generated website. ...... 66
Table (5.13): Website Category similarity test. ......................................................... 67
Table (5.14): Results for running web scraper after applying the proposed solution. 72
Table (16): Website extracted data before randomization ......................................... 74
List of Figures
Figure (2.1): General Visits Report. ............................................................................ 7
Figure (2.2): Visits Traffic Source. .............................................................................. 7
Figure (2.3): Web Errors. ............................................................................................. 8
Figure (2.4): Visitor Depth .......................................................................................... 8
Figure (2.5): Top Visits Errors. ................................................................................... 8
Figure (2.6): Web Crawler Architecture. ................................................................... 10
Figure (2.7): Scrapple Architecture ........................................................................... 11
Figure (2.8): Scrapple Configuration File Example. ................................................. 12
Figure (2.9): DOM Tree. ............................................................................................ 13
Figure (2.10): Proposed schema model. .................................................................... 14
Figure (2.11): Tree with post order numbering for DOM elements .......................... 15
Figure (2.12): Example of Translated page. .............................................................. 18
Figure (2.13): Example of marked algebra. ............................................................... 19
Figure (2.14): Naïve term compression ..................................................................... 19
Figure (2.15): Vertical compression. ......................................................................... 20
Figure (2.16): Irreducible term. ................................................................................. 20
Figure (2.17): Visual representatives of two different pages. .................................... 21
Figure (3.1): Researchers Parikh et al.'s algorithm for detecting web scrapers. ............... 33
Figure (3.2): Researchers Catalin and Cristian proposed model architecture. .......... 35
Figure (3.3): Results showing suspicious IP address. ................................................ 36
Figure (4.1): The proposed solution based on Markup Randomization. ................... 40
Figure (4.2): Flow Chart for the proposed solution. .................................................. 41
Figure (4.3): Original CSS code example .................................................................. 42
Figure (4.4): Randomized CSS code ......................................................................... 43
Figure (4.5): Original HTML file. ............................................................................. 43
Figure (4.6): Randomized HTML file. ...................................................................... 44
Figure (4.7): The Proposed solution applying steps. ................................................. 45
Figure (4.8): Snippet from a scraped website. ........................................................... 46
Figure (4.9): CSS code before applying the proposed solution. ................................ 50
Figure (4.10): CSS code after applying the proposed solution. ................................. 50
Figure (4.11): HTML code snippet before applying the proposed solution. ............. 51
Figure (4.12): HTML code snippet after applying the proposed solution. ................ 51
Figure (5.1): Total time required for the proposed solution. ..................................... 56
Figure (5.2): Results classification based on time. .................................................... 58
Figure (5.3): Difference between generated file size and original file size. ..................... 61
Figure (5.4): Code snippet before applying the proposed solution. ........................... 64
Figure (5.5): Code snippet after applying the proposed solution. ............................. 64
Figure (5.6): The original offline version of CBSL website. ..................................... 68
Figure (5.7): Generated version of CBSL website. ................................................... 68
Figure (5.8): Facebook Quote Dialog Example ......................................................... 69
Figure (5.9): Facebook generated code replacing the fb-root div. ............................. 70
Figure (5.10): Facebook generated Quote button. ..................................................... 70
Figure (5.11): AddThis setup code. ........................................................................... 71
Figure (5.12): AddThis generated code. .................................................................... 71
Figure (5.13): AddThis generate buttons look and feel. ............................................ 71
Figure (46): Website markup before randomization .................................................. 73
Figure (47): Website markup after randomization .................................................... 74
Figure (6.1): Proposed model based on Markup Randomization. .............................. 76
List of Formulas
Formula (2.1): XPath Formula Pattern. .................................................................... 14
Formula (2.2): Zhang Shasha’s algorithm complexity. ............................................ 15
Formula (2.3): Zhang Shasha’s space complexity. .................................................. 15
Formula (2.4): Jaccard coefficient formula. ............................................................. 16
Formula (2.5): Web page similarity equation. ......................................................... 17
Formula (2.6): Tree edit distance function. .............................................................. 21
List of Abbreviations
API Application Program Interface
AP-TED Adapted Tree Edit Distance
BOT Automated program that runs over the Internet
CAPTCHA
Completely Automated Public Turing test to tell Computers
and Humans Apart
CSS Cascading Style Sheets
CSV Comma Separated Values
DB Database
DDOS Distributed Denial of Service
DMCA The Digital Millennium Copyright Act
DOM Document Object Model
DOS Denial of Service
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
JSON JavaScript Object Notation
OWASP Open Web Application Security Project
PoP Point of Presence
SaaS Software as a Service
SOC Security Operation Centre
TED Tree Edit Distance
URL Uniform Resource Locator
VPN Virtual Private Network
WAF Web Application Firewall
XML Extensible Markup Language
XPath XML Path Language
Chapter 1
Introduction
Web scraping is the process of extracting information from web pages. It can mimic a human visitor opening the website, but unlike a human it is an automated process, carried out either directly over the HTTP protocol or by embedding a web browser. Web scraping resembles web indexing, the search-engine function that indexes website information using bots; the difference is that a search engine mainly collects meta tags when they exist, while a web scraper extracts specific information from the page content itself (Mahto & Singh, 2016).
Because web pages are rich in information and the need to exchange data across the web in an automated fashion keeps growing, the first web scrapers were developed, inspired by search-engine bot functionality.
Web scraping tools can be used both ethically and unethically: ethically when they are used for research purposes without violating privacy or copyright, and unethically when people take content from websites and repost it on their own sites, particularly when the content is unique and creative.
Web scraping is a useful technique that helps many research fields improve their data and knowledge; one of the most practical examples is weather forecasting, where scrapers are used to collect historical weather data (Bonifacio, Barchyn, Hugenholtz, & Kienzle, 2015).
Another use of web scrapers (Mahto & Singh, 2016) is by new startups: because of limited time, the need for data and limited resources, they prefer to scrape data from similar websites initially and then update the scraped data whenever they need to. This is unfair to content owners who hold the ownership rights to the data, such as innovative content and patents. Over time, this issue has caused them significant losses in several forms: data theft, intellectual-property theft and economic loss. This type of unauthorized use can therefore be classified as data theft (the act of stealing computer-based information from an unknowing victim with the intent of compromising privacy or obtaining confidential information), a harmful and unethical practice with destructive effects on companies.
As a result, web scraping has become a pressing problem that needs to be solved, yet so far only a few solutions have been proposed to mitigate it. Researchers (Wetterström & Andersson, 2009) introduced an invention for preventing scraping by using a filter that reproduces the data requested by the client in an unstructured manner, which browsers can still render but which a robot running scraping software cannot process to obtain the desired data. Other researchers (Haque & Singh, 2015) introduced a compound solution based on classifying visits into three categories (Black-List, Gray-List, White-List) and then treating each visitor according to its category; the Gray-List contains suspicious visitors, which are subjected to several techniques to decide whether to block them or not.
Other solutions, (ScrapeDefender), (ScrapeSentry), (ShieldSquare, 2013) and ("Distil Networks," 2018), were provided as commercial tools by developers; they focus on bot identification and clustering, not on the document itself.
This work proposes a solution for preventing CSS- and XPath-based web scraping by using Markup Randomization, which automatically changes the HTML and CSS files on a timely basis so that they differ in markup while remaining identical in visual look and feel. The web scraper therefore becomes ineffective, because it sees a differently marked-up page on each request and would have to update its extraction rules every time it accesses the page. As a consequence of this technique, the scraper stops functioning correctly and can no longer scrape these pages.
1.1 Statement of the Problem
Although web scraping is a content-security issue, most of the proposed research and tools do not focus on the content: some address it only marginally and the rest not at all. As a result, scrapers have not been prevented and keep being upgraded and updated. There is therefore a need for an efficient technique that prevents web scrapers from accessing web page data without causing any visual change to the page, that can be applied in a short time, and that does not significantly affect the size of the website files.
1.2 Objectives
1.2.1 Main Objectives
The main objective of this research is to introduce a new Anti-Scraping solution that protects web pages from web scrapers by changing the markup randomly on a timely basis, and to verify that XPath- and CSS-based scrapers are mitigated and stopped.
1.2.2 Specific Objectives
The specific objectives of the proposed solution are:
1- Study XPath and CSS web scrapers to understand their methodology and techniques.
2- Develop a technique that randomizes the markup as well as the style without any visual effect on the website visitor.
3- Develop the Anti-Scraper, which automatically runs the randomizer on a timely basis.
4- Build a dataset for testing, and measure visual similarity, processing time and file size for the generated documents.
1.3 Importance of the Research
Due to the rapid development of web scrapers, data theft has become the most important issue for content owners, while existing research has not closed the gap or stopped the scrapers, leading to massive damage for website owners. Researchers have proposed many techniques that mitigate the damage, but they are still not enough. A solution that can eliminate the web scraper is still necessary; defending the markup itself is the first measure that should be taken, before spending effort on building obstacles around the document.
Defending the document itself by changing the markup on the fly is therefore the most important step, and it stops the scraper immediately.
1.3.1 Motivation
According to the Distil Networks 2017 Bad Bot report (Duffield, Haffner, Krishnamurthy, & Ringberg, 2018), 42.2% of all internet traffic was not human and 21.8% of the traffic came from bad bots, while 74% of those bots were advanced bots that use anonymous proxies or even mimic human behaviour. In addition, researchers (Mi et al., 2019) listed a group of residential IP proxy providers that supply enormous numbers of IPs for web scraping, which can bypass any security firewall based on IP filtering or digital-fingerprint approaches; consequently, solutions based only on identifying and classifying bots are bound to fail at detecting those scrapers.
1.4 Scope and Limitation of the Research
1- Only XPath and CSS web scrapers are the basis of this study, because most web scrapers are based on XPath or CSS.
2- The Anti-Scraper focuses on changing the HTML markup and CSS randomly on a timely basis.
3- Regular-expression scrapers are out of scope.
4- Optimizing the processing time is not addressed in this proposed solution, but may be in future work.
1.5 Overview of Thesis
This thesis is organized as follows:
1- Theoretical Background: this chapter gives the reader background on web scraping techniques and models, then summarizes two algorithms for web page similarity that are used to evaluate the experiments.
2- Related Works: this chapter reviews the different efforts for preventing or mitigating the web scraping problem and provides insight into the gap between them. These efforts are grouped into three categories: legal, developer and researcher efforts.
3- Methodology: this chapter introduces the proposed solution for preventing web scraping, which consists of three steps (randomize the CSS, synchronize the HTML, and send the new page to the browser), then describes the supported web scrapers that the solution prevents, and finally presents the steps for applying the proposed solution.
4- Experiments and Discussion: this chapter evaluates the proposed solution by measuring how much processing time it needs, how the file size changes, and the visual similarity between the original and the generated web page. These factors fairly reveal the standing of the proposed solution.
5- Conclusion: this chapter summarizes the thesis problem, solution, experiments and results in a few paragraphs and highlights the main outcome of this thesis for businesses.
Chapter 2
Theoretical Background
2.1 Introduction
This chapter is concerned with research on web bots, web scraping and page similarity. Web scraping techniques are discussed in order to understand researchers' efforts to improve and enhance scrapers, so that the proposed solution can deal with them.
Page similarity research is also discussed, since it supports the proposed solution in the experimental section: similarity is the most important factor to be measured.
2.2 Web Scraping Techniques
2.2.1 Web Usage Mining
Web usage mining refers to "the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites" (Mobasher, 2006).
The authors show that web usage data can be extracted from web server logs, and how much knowledge can be obtained when the logs are analysed with dedicated software such as "Nihuo Web Log Analyzer".
This provides a deep view of visitor behaviour; Figures 2.1 to 2.5 show some of the reports produced by the analyzer, the kind of data acquired by web servers and how it can be used to differentiate between a normal visitor, a bot and a scraper.
Figure 2.1 shows the number of visitors per day, which can be used to detect a day with abnormal traffic caused by an attack.
Figure 2.2 shows the countries from which the visits originate: if most visits come from the country the website targets, the traffic appears normal, but if the content targets the U.S. while the visitors come from Asia, this may indicate an attack.
Figure 2.3 shows the number of successful page responses versus errors; if errors exceed successes, this suggests a brute-force attack on the website.
Figure 2.4 shows how many pages are visited in each session, which helps detect bad behaviour: a high rate of deep visits indicates an attack on the website.
Finally, Figure 2.5 is similar to Figure 2.3 but for specific error codes, which helps to understand and differentiate the errors, including authorization and authentication failures, and to know whether someone is trying to access password-protected pages.
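As a simple illustration of how such reports can be derived from a raw server log, the sketch below counts requests and error responses per client IP in an Apache/Nginx combined-format log; the log file name and the thresholds are hypothetical and only meant to show the idea.

import re
from collections import Counter

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

requests, errors = Counter(), Counter()
with open('access.log') as log:              # hypothetical log file
    for line in log:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ip, status = match.group(1), int(match.group(2))
        requests[ip] += 1
        if status >= 400:
            errors[ip] += 1

# Flag clients with an unusually high request count or error ratio (arbitrary thresholds).
for ip, count in requests.most_common(10):
    if count > 1000 or errors[ip] / count > 0.5:
        print(f'suspicious client: {ip} ({count} requests, {errors[ip]} errors)')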
Figure (2.1): General Visits Report.
(Malik & Rizvi, 2011)
Figure (2.2): Visits Traffic Source.
(Malik & Rizvi, 2011)
Figure (2.3): Web Errors.
(Malik & Rizvi, 2011)
Figure (2.4): Visitor Depth
(Malik & Rizvi, 2011)
Figure (2.5): Top Visits Errors.
(Malik & Rizvi, 2011)
2.2.2 Web Scraping
Web scraping converts unstructured information into structured information stored in a central database or spreadsheet. This is done by running one of the scrapers within an application and then defining the criteria and targets for extraction and grouping.
2.2.3 Semantic Annotations
Semantic annotations are notations or metadata used to locate data within a document; a list of semantic data is prepared and a layer is defined for the web scraper before the data is scraped (Malik & Rizvi, 2011).
Another technique (Mahto & Singh, 2016; Mathew, Balakrishnan, & Palani, 2015; Nie, Shen, Yu, Kou, & Yang, 2011; Yu, Guo, Yu, Xian, & Yan, 2014), very common in the literature and implemented in most scraping tools, is DOM-based manipulation with data accessed through XPath and CSS. It is the easiest and simplest technique, it is supported by most programming languages, and the page is treated like an XML document. For this reason, those authors built their scrapers on these techniques and proposed scraping approaches based on DOM manipulation, differing only in the architecture of the methodology, the programming language or the tools used.
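A minimal sketch of this style of DOM-based extraction using the lxml library is shown below; the HTML snippet, the selectors and the class names are illustrative only.

import lxml.html

html = '''
<div class="article">
  <h1 class="title">Dollar exchange rate rises</h1>
  <span class="price">3.65</span>
</div>
'''
doc = lxml.html.fromstring(html)

# XPath-based selection
title = doc.xpath('//h1[@class="title"]/text()')[0]

# CSS-selector-based selection (requires the cssselect package)
price = doc.cssselect('div.article span.price')[0].text

print(title, price)

A scraper of this kind breaks as soon as the class names article, title and price are renamed, which is exactly what the proposed Markup Randomization exploits.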
2.3 The Custom Scraper
This Python-based scraper consists of three parts: a web crawler, a data extractor and a storage method.
The scraper was built with new startups in mind: they need a large amount of data but have no time to collect it, so they need an efficient and fast tool.
2.3.1 Web Crawler
A web crawler is a tool, or set of tools, that iteratively and automatically downloads web pages, extracts URLs from their HTML and fetches them recursively (Thelwall, 2001).
It only needs a list of URLs to visit, called the seed (Mathew et al., 2015); each page is visited and all links inside it are extracted back into the seed list to be visited in turn. Figure 2.6 shows the most common web crawler architecture (a minimal sketch follows the figure), which contains the following components:
1- Downloader: the process that downloads the pages.
2- Queue: holds the list of URLs to download.
3- Scheduler: the process that starts and organizes the downloader.
4- Storage: the process that extracts the metadata of the web page and saves it together with the text of the page.
Figure (2.6): Web Crawler Architecture.
(Thelwall, 2001)
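The sketch below illustrates these components in a few lines of Python, assuming the requests and lxml libraries are available; the seed URL handling, crawl limit and politeness rules are deliberately simplified.

import requests
import lxml.html
from collections import deque

def crawl(seed_url, max_pages=50):
    queue, visited, pages = deque([seed_url]), set(), {}   # queue plays the role of the "seed" list
    while queue and len(pages) < max_pages:
        url = queue.popleft()                              # scheduler picks the next URL
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url, timeout=10).text          # downloader
        pages[url] = html                                  # storage
        doc = lxml.html.fromstring(html)
        doc.make_links_absolute(url)
        queue.extend(doc.xpath('//a/@href'))               # extracted links go back into the queue
    return pages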
2.3.2 Data Extractor
The data extractor extracts information from a single web page; although the page contains many useful resources, the focus is on extracting specific data according to predefined rules. This is achieved by selecting the data using CSS selectors or XPath patterns (Mathew et al., 2015).
2.3.3 Exporting to CSV
After crawling the pages and extracting the data, the list of extracted information held in memory is saved to a CSV file using the Python API (Mathew et al., 2015).
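A minimal sketch of this last step with Python's built-in csv module, assuming the extracted records are held in memory as a list of dictionaries (the field names are illustrative):

import csv

records = [
    {'title': 'Dollar exchange rate rises', 'price': '3.65'},
    {'title': 'Euro exchange rate falls', 'price': '4.20'},
]

with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerows(records)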
2.4 Scrapple
Scrapple is a flexible framework for developing semi-automatic web scrapers (Mathew et al., 2015). Its main purpose and contribution is to reduce the script modifications required to run a scraper such as Scrapy (Kouzis-Loukas, 2016). The parts of figure 2.7 can be explained as follows:
1. Web pages: the web pages to be crawled and scraped.
2. Scrapple: the proposed system, which consists of three processes:
a. Fetching the page: downloads the page markup and stores it.
b. Parsing the element tree: cleans the markup of missing closing tags and whitespace so that it is lighter and faster to parse.
c. Extracting the content: extracts the data from the web page by applying the XPath or CSS patterns.
3. JSON Configuration File: contains the start page for the crawl as well as the criteria for data extraction.
4. Data Format Handler: the final process, which saves the data extracted from the visited web pages to a JSON or CSV file.
Figure (2.7): Scrapple Architecture
(Mathew et al., 2015)
The system architecture emphasizes keeping the configuration outside the Scrapple code: the configuration is moved out of the Python code into a key-value configuration file like the one in figure 2.8. Scrapple then loads the file and reads the configuration after the crawler has accessed the page.
Figure (2.8): Scrapple Configuration File Example.
(Mathew et al., 2015)
Scrapple is very fast because it uses the lxml library (Behnel, Faassen, & Bicking, 2005) for parsing the web page; the authors tested the library against BeautifulSoup (Richardson, 2008) and showed that lxml parses pages considerably faster.
2.5 Extracting Entity Data from Deep Web Precisely
Researchers (Yu et al., 2014) proposed a model for web data extraction that consists of several modules:
Web crawler: an intelligent web crawler that can dive deep into the website and follow the navigation links in static as well as dynamic web pages.
Pretreatment of web resources: two procedures are applied before processing the web pages, first normalizing the HTML page and then eliminating the noisy information.
Locating and extracting the entity data from the Deep Web accurately: data extraction from unstructured to structured form is done through the DOM interface; the document is parsed with JTidy and the web page is transformed into a DOM tree so that each node of the page can be accessed as an object. Figure 2.9 illustrates the DOM tree.
Figure (2.9): DOM Tree.
(Yu et al., 2014)
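A rough sketch of the pretreatment idea, normalizing the markup and removing noisy non-content elements, is shown below using lxml; which tags count as noise is an assumption made here for illustration.

import lxml.html
from lxml.html.clean import Cleaner

def pretreat(html_text):
    # Parsing with lxml already repairs unclosed tags and normalizes the tree.
    doc = lxml.html.fromstring(html_text)
    # Remove elements that usually carry no entity data (assumed to be noise).
    cleaner = Cleaner(scripts=True, javascript=True, style=True,
                      comments=True, page_structure=False)
    return cleaner.clean_html(doc)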
2.6 XQUERY Wrapper
Researchers (Yu et al., 2014) proposed a system to extract data from websites; the approach is based on XQuery. Wikipedia says: "XQuery (XML Query) is a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML, text and with vendor-specific extensions for other data formats (JSON, binary, etc.)" ("XQuery," 2016).
They proposed a schema model for modelling both the web data and the user requirements, illustrated in figure 2.10; it therefore handles all types of data (single and complex). The figure shows the structure of the data in a website and emphasizes its hierarchical nature.
Figure (2.10): Proposed schema model.
(Nie et al., 2011)
This example of the proposed model shows the hierarchical data of the website and differentiates between the types of node each web page has (single and complex).
To annotate the data semantics, each data value is mapped to an attribute, and an exclusive path is then used to annotate the location of the node in the DOM tree. The path is an XQuery expression based on XPath; Formula 2.1 shows the XPath pattern:
P = /T1[p1]/T2[p2]/.../Tm[pm]
Formula (2.1): XPath Formula Pattern.
(Nie et al., 2011)
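For instance, a concrete path following this pattern could look like /html[1]/body[1]/div[2]/table[1]/tr[3]/td[2], where each Ti is a tag name and each pi is a predicate (here a position) selecting one node among its siblings.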
2.7 Page Similarity
2.7.1 Structure and Style Similarity
(Gowda & Mattmann, 2016) proposed a technique for clustering web pages based on their DOM structure and style, which together represent the structural and visual parts of a page.
The researchers used Tree Edit Distance (TED) (Pawlik & Augsten, 2016) to compare DOM trees, while the CSS is compared with the Jaccard similarity (Niwattanakul, Singthongchai, Naenudorn, & Wanapu, 2013) over the CSS class names.
2.7.1.1 Structural Similarity using Tree Edit Distance Measure
Zhang and Shasha's TED algorithm (Zhang & Shasha, 1989) is applied to calculate the similarity between trees because of its simplicity and correctness. Figure (2.11) shows a tree with post-order numbering.
Figure (2.11): Tree with post order numbering for DOM elements
(Gowda & Mattmann, 2016).
The components of the tree are indexed in post order, as shown in Figure (2.11), and the nodes of the DOM tree are indexed in post order correspondingly.
The tree is built incrementally from smaller forests, and the edit cost between two forests is computed by gradually aligning nodes with Insert, Remove and Replace operations, as described in (Zhang & Shasha, 1989).
Dynamic programming is applied to calculate the edit distance between the root nodes of the two DOM trees.
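For illustration only, the freely available zss package provides a Zhang-Shasha implementation; the two tiny trees below are an assumed example rather than DOM trees from the dataset.

from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

# Two small trees: html(body(div, div)) versus html(body(div))
t1 = Node('html', [Node('body', [Node('div'), Node('div')])])
t2 = Node('html', [Node('body', [Node('div')])])

print(simple_distance(t1, t2))  # one node removal, so the edit distance is 1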
(Gowda & Mattmann, 2016) found that the TED algorithm is slow on modern pages because of the complexity of the pages themselves, with their nested tags and rich elements. Zhang and Shasha's algorithm has a time complexity of:
O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2)))
Formula (2.2): Zhang Shasha’s algorithm complexity.
(Zhang & Shasha, 1989)
While it has space complexity of:
O(|T1| × |T2|)
Formula (2.3): Zhang Shasha’s space complexity.
(Zhang & Shasha, 1989)
(Gowda & Mattmann, 2016) therefore chose the AP-TED implementation of TED (Pawlik & Augsten, 2016), which is faster than the traditional TED and reduces the running time by 57% (Pawlik & Augsten, 2016).
TED can be applied efficiently and on a timely basis to compare two DOM trees, the original HTML markup and the randomized one; however, TED alone cannot determine the similarity between two pages, so additional effort is needed to compare the CSS styles of the two documents.
TED cannot be applied to measure CSS similarity, because CSS is not an XML-like document and cannot be represented as a tree. Therefore, (Gowda & Mattmann, 2016) adopted the Jaccard index to measure the CSS style similarity between the original document and the randomized one.
2.7.1.2 Stylistic Similarity using Jaccard Similarity
Cascading Style Sheets (CSS) define the web page style and, thanks to their flexibility, can be adapted into an unlimited number of styles; comparing CSS is therefore an important part of the similarity check.
(Gowda & Mattmann, 2016) applied the Jaccard index by taking D1 and D2 as two web pages and the sets of style class names parsed from their DOMs; the Jaccard similarity coefficient ("Jaccard index," 2018) of the styles is then computed as the fraction of styles overlapping in both:
style similarity = |A ∩ B| / (|A| + |B| − |A ∩ B|)
Formula (2.4): Jaccard coefficient formula.
("Jaccard index," 2018)
The implications of using the Jaccard similarity coefficient on style class names are:
1- Since unique class names are used to compute the similarity, an unequal number of repeated occurrences does not alter the stylistic similarity.
2- Documents displaying similar content possess the same set of class names, and therefore yield a higher value of the Jaccard similarity coefficient.
3- The stylistic similarity measure may also produce false positives for multiple documents from the same website, because the styles are usually kept consistent across all pages of a site; consequently, it only complements the structural similarity measure described in Section 2.7.1.1.
2.7.1.3 Aggregating the Similarities
(Gowda & Mattmann, 2016) proposed a formula for the overall similarity, presented in Formula (2.5), where κ is a constant in [0.0, 1.0] giving the fractional weight of the structural similarity:
similarity = κ · structural similarity + (1 − κ) · stylistic similarity
Formula (2.5): Web page similarity equation.
(Gowda & Mattmann, 2016)
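A compact sketch of Formulas 2.4 and 2.5 in Python, assuming the class names of both documents have already been parsed from their DOMs and that a structural similarity score is available from the TED step:

def stylistic_similarity(classes_a, classes_b):
    # Jaccard coefficient over the sets of CSS class names (Formula 2.4).
    a, b = set(classes_a), set(classes_b)
    if not a and not b:
        return 1.0
    return len(a & b) / (len(a) + len(b) - len(a & b))

def overall_similarity(structural, stylistic, k=0.5):
    # Weighted combination of the two measures (Formula 2.5); k is a chosen constant.
    return k * structural + (1 - k) * stylistic

# Example with illustrative class-name sets and an assumed structural score of 0.9
style = stylistic_similarity({'title', 'price', 'menu'}, {'title', 'menu'})
print(overall_similarity(0.9, style))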
Finally, this technique is used in the experimental section to calculate the similarity between the two copies of each web page, the original and the randomized one. Moreover, Matiskay ("HTML Similarity," 2017) implemented this paper on GitHub as "HTML Similarity", using Python as the scripting language to realize the idea.
2.7.2 Visual Similarity
(Alpuente & Romero, 2009) proposed a technique for comparing the visual structure of web pages. HTML tags are classified according to their visual effect, transforming the page into a normalized form in which groups of HTML tags are mapped to a common canonical one. The authors then proposed a method for calculating the distance between two web pages, using processes such as compression that decrease the complexity and improve the running time. The next sections cover the steps of this methodology.
2.7.2.1 Visual Structure of Web Pages:
(Alpuente & Romero, 2009) distinguish between the visual effects of HTML tags: many tags produce the same visual impression, which allows the tags to be grouped by visual effect into the following tag classes:
1- grp: table, ul, html, body, tbody, div and p.
2- row: tr, li, h1, h2, hr.
3- col: td.
4- text: otherwise.
All HTML tags are then translated into these group tags, so the new graph of the page looks like figure 2.12 (a small sketch of the mapping follows the figure).
Figure (2.12): Example of Translated page.
(Alpuente & Romero, 2009)
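A small sketch of this classification step; the mapping below simply encodes the four classes listed above.

GRP = {'table', 'ul', 'html', 'body', 'tbody', 'div', 'p'}
ROW = {'tr', 'li', 'h1', 'h2', 'hr'}
COL = {'td'}

def visual_class(tag_name):
    # Map an HTML tag to its canonical visual class (grp, row, col or text).
    tag = tag_name.lower()
    if tag in GRP:
        return 'grp'
    if tag in ROW:
        return 'row'
    if tag in COL:
        return 'col'
    return 'text'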
2.7.2.2 Web Compression
Translating the page produces a clear visual structure, from which repeating structures can be detected. Since the comparison does not depend on the concrete number of child elements of a given class, rows are equivalent to a table with one column, so they can be grouped.
2.7.2.2.1 Marked term
The number of nodes is counted before the transformation so that no information is lost, and terms that appear twice are grouped; figure 2.13 (a, b) shows a tree before and after marking the terms, respectively.
Figure (2.13): Example of marked algebra.
(Alpuente & Romero, 2009)
The marked algebra for this is τ([N]ΣV), where "[N]" represents the number of times the term t is duplicated in the marked term [N]t. For example, two rows containing the same text appear twice, so they are combined into the single form [1]grp([2]row([1]text)) = grp([2]row(text)).
2.7.2.2.2 Horizontal compression
Simplifying the trees is important for the analysis time, so repeating tags are grouped as shown in figure 2.14.
Figure (2.14): Naïve term compression
(Alpuente & Romero, 2009)
2.7.2.2.3 Vertical compression
Since HTML is a markup language with semi-structured, nested elements, chains of nested tags can also be compressed vertically.
Figure (2.15): Vertical compression.
(Alpuente & Romero, 2009)
This process eliminates all the empty containers in the page: grp tags are grouped, while text nodes are not, because they carry data that is sensitive and must not be lost.
2.7.2.2.4 Shrinking and Join
Both vertical and horizontal compression are completed by shrinking the chains and joining the subterms, in the following steps:
1- Tags belonging to a chain of tags that does not influence the appearance of the resulting page are removed first.
2- The subterms are joined. Since both the vertical and horizontal transformations are confluent and terminating, repeatedly applying this operation generates an irreducible term after a finite number of steps.
Figure (2.16): Irreducible term.
(Alpuente & Romero, 2009)
2.7.2.3 Comparison based on visual structure
Comparing two web pages is essentially comparing two trees; after the trees have been normalized and transformed, an edit distance is used for the comparison.
2.7.2.3.1 Tree edit distance
To use TED, a cost function must be defined for each edit operation, as follows: let λ be a fresh constant symbol that represents the empty marked term, and let nd1, nd2 ∈ [N]ΣV be two marked trees. Each edit operation is then represented as:
(𝑛𝑑1 → 𝑛𝑑2) ∈ ([𝑁]𝛴𝑉 × [𝑁]𝛴𝑉)\(𝜆, 𝜆)
Formula (2.6): Tree edit distance function.
Therefore, (nd1 → nd2) is:
1- a relabeling if nd1 ≠ λ and nd2 ≠ λ,
2- a deletion if nd2 ≡ λ,
3- an insertion if nd1 ≡ λ.
2.7.2.3.2 Comparison of Web pages
The two trees that were transformed, shrunk and joined in the previous steps are compared by applying the edit distance, measuring the similarity between the two web pages relative to their number of nodes. The two trees are illustrated in figure 2.17.
Figure (2.17): Visual representatives of two different pages.
(Alpuente & Romero, 2009)
Applying all the steps above to the two example pages gives:
|Tzip| = 15 and |Szip| = 12
δ(tzip, szip) = 2
cmp(t, s) ≈ 0.92
where Tzip and Szip are the irreducible terms of the trees T and S, δ is the edit distance between them, and cmp is the comparison function; the similarity between the two different web pages is therefore about 92%.
2.7.2.4 Implementation
The researchers published their code on their university website; it still exists, but it did not work for us because of HTML5 standards, so in this work the implementation was adapted and upgraded before being used.
2.8 Summary
This chapter reviewed the most recent types of web bots and web scrapers; the idea behind each type of scraper was discussed briefly, and each scraper model and its methodology was studied.
Web scrapers were also classified based on the nature of their core activity, and several proposed scraping models were discussed.
Page similarity research was also reviewed, and the core idea of calculating page similarity was broken down into steps, because page similarity is the main factor used to evaluate the proposed solution.
Chapter 3
Related Works
3.1 Introduction
Many efforts have been made to mitigate and stop web scraping; they can be classified into legal, developer and researcher efforts.
Recently, researchers (Wetterström & Andersson, 2009) addressed the problem by proposing a model that prevents the web scraper by securing the web page itself, while the other researchers' efforts are distributed over identifying, classifying and blocking the access of web bots altogether.
The Markup Randomizer is the solution suggested here to prevent the web scraper entirely, by changing the HTML markup together with the corresponding CSS in order to break the web scraper's selection rules.
3.2 Legal Efforts
This section presents a few of the legal instruments that can deal with the web scraping issue, which is closely tied to copyright and to fair use of others' property; the Copyright law, the Digital Millennium Copyright Act (DMCA) and the trespass-to-chattels tort are discussed in the next subsections.
3.2.1 Copyright Law
Copyright law (Mitchell, 2015) was first adopted in Switzerland in 1886. "Copyright is a legal right created by the law of a country that grants the creator of original work exclusive rights for its use and distribution. This is usually only for a limited time. The exclusive rights are not absolute but limited by limitations and exceptions to copyright law, including fair use" ("Copyright," 2018).
Copyright covers creative content only; statistics and facts are not included.
In the case of web scrapers, there are two copyright concerns, one of which is acceptable while the other may expose the scraper to a lawsuit:
1- Illegal usage of others' content: creative works such as poetry are not allowed to be copied to your website.
2- Legal usage of others' content:
a- Statistics and facts: publishing a fact about something that is copyrighted is acceptable.
b- Information about how frequently copyrighted content is posted over time is also acceptable.
c- Creative content shared verbatim may not violate copyright law if the data consists of prices, names, company executives or some other factual piece of information.
3.2.2 Digital Millennium Copyright Act
DMCA (Mitchell, 2015) is "a United States copyright law that implements two
1996 treaties of the World Intellectual Property Organization (WIPO). It criminalizes
production and dissemination of technology, devices, or services intended to
circumvent measures (commonly known as digital rights management or DRM) that
control access to copyrighted works"("Digital Millennium Copyright Act of 1998,"
1998).
Within the DMCA, a safe harbor "is a provision of a statute or a regulation that specifies that certain conduct will be deemed not to violate a given rule. It is usually found in connection with a vaguer, overall standard" ("Safe harbor (law)," 2018).
1- Under the safe harbor provision, if you scrape a web page whose content the website has not declared to be copyrighted, you are safe; once you are notified that the content is copyrighted, you must remove it.
2- You cannot circumvent security measures, e.g. password protection, in order to access and harvest the content.
3- You may use content under the "fair use" rule, which requires taking into account the proportion of the copyrighted work you have used and the purpose of the usage.
To summarize the laws: never publish material without the rights and permissions to do so. Storing the material in your own offline database is fine, but republishing it on your websites is not. Analysing that database and publishing statistics, author data or even meta-analysis data is fine. Another acceptable usage is selecting a few quotes or brief samples for your meta-analysis to make your point, but you should check that this qualifies as "fair use".
3.3 Developer Efforts
Some developers have built their own tools to prevent, detect and monitor web scrapers. They advertise their success and clients, but they have published no academic papers about the methodology they apply, presumably keeping the recipe hidden because of market competition.
3.3.1 ShieldSquare
ShieldSquare (ShieldSquare, 2013) is a software service that provides real-time anti-scraping protection with the following features:
1- Actively detect/prevent website scraping & screen scraping
2- Prevent price scraping bots from competitors
3- Enhance your website’s user experience
4- Get complete visibility into bot traffic on your website
5- See comprehensive insights on BOT types and their sources
3.3.1.1 ShieldSquare Methodology
ShieldSquare provides automated bot prevention and detection for websites and mobile apps without affecting the real user experience. It detects bots by building a signature for each unique visitor to the site. The ShieldSquare architecture is shown in figure 3.1 below.
Figure (3.1): ShieldSquare Model Architecture.
(ShieldSquare, 2013)
3.3.1.2 ShieldSquare Process:
1- When a page visit happens, ShieldSquare API call and JavaScript embedded on
the page collects and sends various parameters about the visitor to the backend
ShieldSquare Engine. Using proprietary technologies and smart algorithms,
ShieldSquare engine builds a unique fingerprint for each visitor.
2- Based on the exhaustive bot detection tests done on the previous activity of this
visitor, the cloud engine classifies the visitor as a human, search engine crawler,
or a bad bot. Based on the classification, if the visitor is a friendly entity (human
or search engine crawler), then ShieldSquare transparently allows the user to pass
by sending API response code as Allow. All of this is achieved in a few
milliseconds without impacting user experience.
3- In the event of a bad bot, ShieldSquare sends the corresponding response code back
to the application. Based on the response codes, you can implement actions like
blocking the bot, challenging with a CAPTCHA, feeding fake data, etc.
ShieldSquare, thus covers all routes and provides you flexibility to choose the
desired response to act against bots as per your business needs.
Although ShieldSquare contains multiple analysis and defense levels, it does not prevent web scrapers entirely: scraping techniques are upgraded quickly, so scrapers can eliminate the barriers and avoid the detection and catching techniques. Because of that, their approach may reduce the number of bots, but it never guarantees that a website is safe from them.
On the other hand, ShieldSquare requires each webpage or mobile app page to check whether the visitor is a real visitor or a bot, which costs performance. As a result, the problem still needs a paradigm that protects the whole website at the web server level and requires no interaction from developers to ensure that every request is handled without exceptions.
3.3.2 ScrapeDefender
ScrapeDefender (ScrapeDefender) is a tool to stop web scrapers with three main functions, Scan, Secure and Monitor, detailed in the following points:
1- Scan: ScrapeDefender routinely scans your site for web scraping vulnerabilities, alerts you about what it finds and recommends solutions.
2- Secure: ScrapeDefender provides bullet-proof protection that stops web scrapers dead in their tracks. Your content is locked down and secure.
3- Monitor: ScrapeDefender provides smart monitoring using intrusion detection techniques and alerts you about suspicious scraping activity when it occurs.
The securing process is achieved by using patented technology: a firewall that prevents scrapers and denies their activity on the website, locking the content down and protecting it from bad bots.
ScrapeDefender performs multiple checks over time, so the firewall can prevent all known scraper patterns and keep the content safe; however, if a scraping technique appears with a different behaviour, meaning new patterns, the firewall will not prevent that scraper.
On the other hand, if attackers launch a DDoS attack against the website, the firewall will go down and the website will either stop or be left alone with the scrapers. As a result, the scrapers will reach the valuable content and take control over the website.
3.3.3 ScrapeSentry
ScrapeSentry (ScrapeSentry, 2018) blocks scrapers from violating intellectual
property with the ability to distinguish the good and bad scrapers whether human or
bot.
ScrapeSentry is a software as a service (SaaS) anti-scraping service 24/7
delivered from the Sentor Security Operations Centre (SOC). These Services include
monitoring, analysis, investigation, blocking policy development, enforcement, and
support.
ScrapeSentry can be installed either on a span port or directly on the webservers
aggregating traffic to a passively located appliance containing the ScrapeSentry
platform.
The policies are applied through interaction with the infrastructure, such as load balancers, webservers or the client's application. If they detect any type of
unauthorized usage, they will either automatically block the visitor or alert the Sentor
SOC for further investigation and intervention in minutes.
The ScrapeSentry service monitors traffic for any suspicious or bad usage. When the system detects malicious traffic, it analyzes it, takes action based on the analysis result, and generates an alert to a security analyst who acts according to the client-specific Incident Response Plan.
ScrapeSentry has great reviews from its clients, as listed on their website. Like the other solutions, they filter the request and then take an action according to its analysis, so the problem still exists: if a new bot is developed with a different footprint, the system will be blind to it and never detect it until the security officers fix it.
Another weak point, again shared with the others, is that they add a new layer to the request lifecycle that filters requests; if the website comes under a DDoS attack that brings that layer down, the scraper will scrape everything until the layer comes back.
3.3.4 Distil Networks
Distil Networks ("Distil Networks," 2018) blocks every OWASP automated threat, such as Web Scraping, Denial of Service or even Skewing, with their bot defense product. It is an excellent product because it is the first one to cover Web Pages, APIs and Mobile Apps, which makes it a distinct service. Although it covers all of those production environment tiers, it is worth noting that web scraping does not really apply to APIs, because API responses are plain data without any presentational layer in the output. Distil Networks describes a holistic bot defense mechanism containing the following processes:
1- Robot exclusion standard:
This approach aims to bar well-behaved bots by adding directives to the robots.txt file on the site. However, web scrapers do not cooperate with these instructions.
2- Manual:
A manual process to stop or reduce web scrapers by adding rules to a firewall or by adding network infrastructure that hides the network and the original server IP address. In any case, this may consume excessively expensive hours with little added value.
3- Web application firewalls (WAF):
WAFs are designed to protect web applications from being misused because of
the presence of common software vulnerabilities. Web scrapers are not focusing on
vulnerabilities but rather intending to mimic real users. In this manner, other than being
programmed to block manually identified IP addresses (see last point), they are of little
use for controlling web scraping.
4- Login enforcement:
Some sites require login to access the most valued data; nevertheless, this is no protection from web scrapers, as it is simple for the perpetrators to create their own accounts and program their web scrapers accordingly.
Strong authentication or CAPTCHAs (see next point) can be deployed, yet these add a burden for genuine clients, whose initially casual interest may be discouraged by the effort of account creation.
5- Are you a human?
One clear way to check web scraping is to ask users to show they are human.
This is the goal of CAPTCHAs (Completely Automated Public Turing test to tell
Computers and Humans Apart). They aggravate a few clients who discover them
difficult to decipher and, obviously, workarounds have been developed. One of the
bad-bot exercises depicted by OWASP is CAPTCHA Bypass (OAT-0093). There are
additionally CAPTCHA farms, where the test posed by the CAPTCHA is outsourced
to teams of low-cost humans via sites on the dark web.
6- Geo-fencing:
Geo-fencing means sites are only exposed inside the geographic areas in which they do business. This will not stop web scraping as such, but it forces the perpetrators to make the additional effort of appearing to run their web scrapers from a particular geographic area, which may simply involve using a VPN link to a local point of presence (PoP).
7- Flow enforcement:
Enforcing the path genuine clients take through a website can ensure they are validated at every step. Web scrapers are frequently hardwired to go straight to high-value targets and have difficulties if forced to follow a typical client's predetermined flow.
8- Direct bot detection and mitigation:
The objective here is the direct detection of scrapers through a range of techniques, including behaviour analysis and digital fingerprinting, using dedicated bot detection and control technology designed for the task. Across multiple clients, providers of such technologies can improve their understanding of web scrapers and other bots through machine learning, to the benefit of all.
Referring to the direct bot detection and mitigation process described above, they focus on preventing the bot from reaching the web server as a whole, but they have no plan for the cases where a bot successfully reaches the page and steals the content; it is therefore still not sufficient or dependable, which is why they add the term "Mitigation" to their proposed technology.
3.4 Researchers Efforts
There are relatively few works addressing the web scraping issue; they are discussed in this section together with their relation to the proposed solution.
Most of the researchers focus on analysing bot behaviour and then classifying bots as good or bad, while one line of work concentrates on the document itself because it is the target of the scraper. The next sections discuss the randomization and identification efforts respectively.
3.4.1 Markup Randomization
Researchers (Wetterström & Andersson, 2009) presented an invention for preventing the scraping of the information content of a database used for providing a website with data. Their invention depends on using an anti-scraping filter, or filtering means, which performs some processing on the data requested by clients before it is sent to them, in order to prevent scraping. The method of preventing information scraping comprises the following steps:
1- Receiving the requested structured data record from the database.
2- Splitting all the elements or the fields of the data into data containers, called
cells, in a predetermined way.
3- Giving each cell a unique sort-id, which is generated by a random number
generator, and location information, which determine the location of the cell
inside the web page.
4- The cells are sorted by the sort-id to establish a new unstructured data, to be
sent to the requesting client.
5- Each cell is encoded into a markup language, e.g. HTML.
6- The resulting file is delivered to the requesting client.
As a result of sorting the data containers into the unstructured manner, a robot with
scraping software would not be able to interpret the content, because it can only deal
with structured data.
On the other hand, the unstructured placement of the data containers or cells would
not cause any problem for the displaying of the file as a web page. The web browser
will ignore the cells structural placement in the code, which is based upon the sort-id,
and will visually sort the data according to the location information.
Thus, the scraping robot is prevented from using a file generated by the proposed filter (a minimal illustration of the idea follows).
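As a minimal illustration of the idea (this sketch is by the present author, not the patented implementation), the cells below appear in the markup in a scrambled order, while a CSS property restores the visual order for the browser; a position-based scraper reading the raw markup therefore sees the wrong sequence:

<!-- Illustration only: the markup order follows random sort-ids, while the
     visual order is restored by the CSS "order" property on the flex items. -->
<div style="display:flex">
  <span style="order:3">4.2</span>       <!-- visually third -->
  <span style="order:1">Changes</span>   <!-- visually first -->
  <span style="order:2">EUR/USD</span>   <!-- visually second -->
</div>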
(Wetterström & Andersson, 2009) proposed a good solution because it solves part of the problem, namely position-based XPath scrapers, but it is not efficient today: when the model reorders the HTML tags within a page, the page style breaks because the elements are randomly ordered. Another problem is that HTML5/CSS3-based websites are built in a way that cannot be reordered, because the stylesheet is tied to the elements in the HTML file.
On the other hand, they cannot deal with CSS-based scrapers; such a scraper still functions well because the classes are not changed, only the order, so the scraper accesses the data regardless of the layout.
The last weak point is that the paper never discusses performance issues or caching of the files, so the performance of the system would be poor and would not help website owners.
3.4.2 Identification and Clustering
Two researchers (Haque & Singh, 2015) proposed a new model to mitigate web scrapers based on historical analysis of visits. They created three lists for visitors' IP addresses (black-list, gray-list, white-list) and handle each visitor depending on its class. For a black-listed visitor the model blocks the visit and denies session initiation; for a white-listed visitor the session is initiated successfully without any barriers. If the visit is classified as gray-listed, the model handles it with one of the suggested defenses listed below:
Defense levels:
1- The model may display a CAPTCHA before the visitor views the content.
2- The model may identify the scraper through browser information that is usually not sent by real browsers.
3- The model may change the markup randomly to stop the scraper from getting data using old CSS and XPath selectors.
4- The model may convert the information to an image so that the scraper will not reach any valuable text.
5- The model may run a frequency analysis to check whether the number of visits is normal or abnormal.
6- The model may run an interval analysis; if the intervals between visits are too similar, the visitor may be classified as gray-listed and redirected to bot-differentiating techniques such as CAPTCHAs. This may be efficient as a long-term strategy.
7- The model may run a traffic analysis, which is very necessary these days because modern scrapers use many IP addresses; with this technique such scrapers can be detected.
8- The model may run a URL analysis of the visited pages to check the ratio between data-rich and non-rich pages, so that scrapers can be identified.
9- The model may use Honeypots and Honeynets, which are very common in networking companies like Amazon and CloudFlare.
Their solution is good in that it provides a multi-tier defense; on the other hand, it is not enough, because a scraper may evolve until it is treated as white-listed. There is therefore a need to focus more on the content itself so the scraper cannot deal with it. The markup randomization they propose could stop only CSS-based selectors; if the scraper uses XPath, it is not mitigated and keeps functioning well.
Another weak point is that no idea is suggested for caching the generated randomized HTML markup, which means the model generates a new randomized HTML file every time the page is accessed. This causes a harmful load on the server, and if the server receives too many sessions it will go down, so the possibility of a Distributed Denial-of-Service (DDoS) (Mirkovic & Reiher, 2004) increases, which is not acceptable in any way.
Another group of researchers (Parikh, Singh, Yadav, & Rathod, 2018) adopted machine learning for detecting web scraper patterns, which helps detect attackers at run time. They built a tool with a graphical interface so that the customer can easily identify them as well; these tools are targeted at enterprise businesses. The tool they developed is intended to trap the attackers' signatures by using the following techniques:
1- Logstash (Turnbull, 2013): an open source tool for sysadmins and developers for collecting, parsing and transforming logs.
2- Kibana (Gupta, 2015): a tool for visualizing Elasticsearch (Gormley & Tong, 2015) data and navigating the Elastic Stack.
3- Flagging the attacker patterns from the logs.
4- Extracting attacker features from the logs.
The researchers then defined their algorithm, illustrated in the figure below:
Figure (3.1): Researchers Parikh et al algorithm for detecting web scrapers.
(Parikh et al., 2018)
(Flowchart steps: read website logs → feed logs to the Elasticsearch database → visualize using Kibana → detect the attacks → block the attackers in real time.)
They finally discuss the results in section VII, titled "Expected Results", which represents the overall summary of the paper; they talk about visualization, pattern matching and extracting the various feature anomalies.
Their effort is good, but the paper does not contain the graphs required to prove their work; in particular, it ends with "Expected Results", which means no actual results exist. Another weak point is that they say visualization justifies the data and is the main part of their methodology, yet it is missing from the paper; at the very least they should have attached two figures, one representing data for a regular person and one for a suspicious web scraper.
Their model is based on Apache logs, which is good, but it is not efficient at all, because an intelligent web scraper may increase the interval between visits so that its log entries look normal and it cannot be distinguished from a legitimate visitor. On the other hand, they do not have any digital biometric for the scraper, so they use its IP address as its identity. Therefore, this proposed system cannot be considered reliable; it needs to be reworked and supported with concrete, illustrated results.
Researchers Catalin and Cristian (Catalin & Cristian, 2017) proposed an efficient method for the pre-processing phase of mining suspicious web crawlers. It is intended to automatically capture data from network traffic as input for mining algorithms, as a pre-processing step of data mining, after which the potential threats are visualized. The Catalin & Cristian model contains multiple phases, as follows:
1- Framework Architecture.
2- Experiment Setup and Configuration.
3- Results Section.
The Framework Architecture section presents the architecture they propose and its components, as shown in Figure (3.2):
Figure (3.2): Researchers Catalin and Cristian proposed model architecture.
(Catalin & Cristian, 2017)
Unusually, the researchers did not use the (Logstash, Elasticsearch and Kibana) stack; instead they used Snort (Beale, Baker, & Esler, 2007) and Splunk (Duffield, Haffner, Krishnamurthy, & Ringberg, 2018) to collect network traffic and filter it into a specific folder, after which it is ready for the mining algorithms.
They then set up the environment and servers and adapted the tools for identifying suspicious bots, summarized in the following points:
1- Snort automatically analyzes the traffic.
2- Snort then filters the suspicious signals into an external folder.
3- Splunk then automatically analyzes the output of Snort to identify possible threats, and finally a human expert visualizes the data to discover hidden patterns.
They note that bot activity in the logs contains the traditional information about the visitor (user agent, IP address and geolocation); however, since this data is not enough to distinguish bad bots from humans or normal bots, they identify additional digital biometrics that may help the IDS figure out the bad bots, as follows:
1- Number of hits per IP address.
2- Crawling speed.
3- Recurring hits.
4- Hits generating 404 errors.
5- Cookies
Experiments show that the Snort IDS can process huge amounts of data within seconds; they report 99,552 packets/sec, which is a very high rate.
The last section of their model covers the results: they use Splunk to visualize the correlated results, which clearly show the suspicious IP addresses, as in Figure 3.3:
Figure (3.3): Results showing suspicious IP address.
(Duffield et al., 2018)
This pre-processing method for identification is very advanced and is no doubt the most intelligent technique reviewed here, because it presents an excellent method that starts from collecting the data with an IDS, which is intelligent enough to deal with advanced bots. The adoption of Splunk within the framework architecture helps the human expert, not only the machine, to identify anomalies as well as new scraper patterns, which amounts to a kind of digital biometric for the scrapers.
Although these efforts are good within their scope, the system does not completely cover the issue: it pays attention to how to extract and pre-process the data, and the idea still needs to be completed to wrap the main problem and cover all sub-problems. Another point is that it depends on the Snort IDS, which is very good software but can be bypassed, as in the following example:
Encoded URL: http://www.site.com/%73%68%65%6C%6C%2E%70%68%70
Translates to: http://www.site.com/shell.php
The previous example shows that if the attacker tries to hit a specific web page on the server, they can encode the URL so that the IDS treats the two URLs as different; this is not the only issue, but it points out some weaknesses of their proposed method.
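A small PHP check (an illustration only) makes the point: the two strings differ for a naive signature comparison, yet they decode to the same resource.

<?php
// The encoded and the plain URL point to the same resource, but a plain
// string comparison, as used by a signature rule, treats them as different.
$encoded = 'http://www.site.com/%73%68%65%6C%6C%2E%70%68%70';
$plain   = 'http://www.site.com/shell.php';

var_dump($encoded === $plain);            // bool(false): the signature misses it
var_dump(urldecode($encoded) === $plain); // bool(true): both are shell.php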
Also, they mention that a human expert should review the results, which will exhaust both the company and the expert, while some fraction of errors will still occur, whether caused by the human or by the IDS. Moreover, if the scraper bypasses the system through a DDoS attack on the IDS or through camouflage, the whole system becomes useless.
3.5 Summary
This chapter presented the works related to the proposed solution, grouped into three categories: legal, developer and researcher efforts. Table 3.1 summarizes all of them.
First, the legal efforts are the laws introduced to organize copyright as well as the fair use of digital information, websites and web servers in general. While these efforts are very good, they still do not force scrapers to stop their activity, and identifying the real people behind the scraping for prosecution is not an easy task.
Second, the developer or commercial efforts were developed to address web scraping mainly by identifying scrapers or blocking them from accessing the web page in different ways, such as traps, CAPTCHAs and IP blocking; although they can prevent some trivial scrapers, they do not protect the document itself.
Third and finally, the researchers' efforts aim to prevent scraping by at least detecting and identifying scrapers so that the site administrator can take action against them. The Wetterström & Andersson technique, which changes the web document structure, is the most closely related to this work, but it does not work today because it does not support the current HTML5 and CSS3 standards and therefore cannot stop current web scrapers.
Table (3.1): Summary for Related works
Category | Authors | Advantages | Disadvantages | Difference
Legal | Copyright Law | Protects the original content for a limited time. | Does not automatically prevent scrapers; it needs legal actions. | Protects the content all the time; prevents web scrapers in real time.
Legal | DMCA | Protects content from digital users. | Offline usage is fine; storing data in a database is fine. | Prevents the scraper from getting the data and using it offline or storing it in a database.
Developer | ShieldSquare | Prevents web scrapers in real time based on detection. | Does not support new web scraper patterns; based on log analysis. | No need for logs; protects the document all the time while preserving the same look and feel.
Developer | ScrapeDefender | Detects and prevents web scrapers using firewalls. | DDoS can take the firewall down; the detecting code runs on the client side and may be bypassed. | No need for firewalls or for special code running on the client side; protects the web page itself.
Developer | ScrapeSentry | A technique based on detecting and blocking the web scraper that can be installed easily on any web server. | Based on log analysis; attached to the web server, which decreases web server performance. | No need for logs or additional effort on the web server; protects the web page itself.
Developer | Distil Networks | Huge digital biometric network for detecting and preventing web scrapers. | Proposed for mitigating web scrapers; based on log analysis. | Prevents the web scrapers immediately; protects the web page itself.
Researcher | Markup Randomization | Encrypts and randomizes the HTML. | Does not cover CSS; not designed for the new web standards; does not prevent XPath web scrapers. | Supports HTML5/CSS3 standards; prevents XPath web scrapers.
Researcher | Identification and Clustering | Based on intrusion detection and log analysis. | Intrusion detection can be avoided; logs are not enough for detecting web scrapers; requires a human expert to help the classifier with unknown and new web scraper patterns. | Protects the web page itself; no need for log analysis or intrusion detection, which can be bypassed; no need for a human expert for classification.
Chapter 4
Methodology
4.1 Introduction
This chapter presents the proposed solution for preventing web scrapers based on XPath and CSS selectors using Markup Randomization; it also presents the dataset elicitation and finally the roadmap of this thesis.
4.2 The Proposed Solution
The proposed solution based on Markup Randomization is a technique to protect each single web page from XPath- and CSS-based web scrapers; it consists of the main steps presented in Figure 4.1.
Figure (4.1): The proposed solution based on Markup Randomization.
The proposed solution contains the following processes (a sketch is given after this list):
1- CSS Randomization: randomize all CSS rule names by generating a random string of 16 characters within the range (a-z, A-Z) and building a dictionary object that contains the mapping between the original rule name and the generated rule name.
2- HTML Sync with new CSS: sync the HTML page with the new CSS rule names by using the generated dictionary object for the mapping.
3- Cache the randomized HTML and CSS files on disk so they can be served to the client very fast.
4- Send the randomized version to the browser: serve the client a randomized version of the website.
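To make the flow concrete, the following minimal PHP sketch (written as an illustration, not the exact production code) wires these steps together. It assumes the decryptRules() and convertHTML() methods listed later in this chapter are wrapped in a class, here called MarkupRandomizer, and that the mapping dictionary built by decryptRules() can be read back from it; the cache paths are hypothetical.

<?php
// Illustrative sketch only: MarkupRandomizer, its exposed $dictionary and the
// cache paths are assumptions; decryptRules()/convertHTML() are the methods
// shown later in this chapter.
require 'vendor/autoload.php';

function serveRandomizedPage(string $htmlFile, string $cssFile, string $cacheDir): string
{
    $cachedHtml = $cacheDir . '/' . md5($htmlFile) . '.html';
    $cachedCss  = $cacheDir . '/' . md5($cssFile) . '.css';

    // Step 4 (fast path): serve the cached randomized version when it exists.
    if (file_exists($cachedHtml)) {
        return file_get_contents($cachedHtml);
    }

    $randomizer = new MarkupRandomizer();

    // Step 1: randomize all CSS rule names and build the mapping dictionary.
    $randomizedCss = $randomizer->decryptRules(file_get_contents($cssFile));

    // Step 2: sync the HTML page with the new CSS rule names.
    $randomizedHtml = $randomizer->convertHTML($randomizer->dictionary, $htmlFile);

    // Step 3: cache both randomized files on disk for later requests.
    file_put_contents($cachedCss, $randomizedCss);
    file_put_contents($cachedHtml, $randomizedHtml);

    // Step 4: the caller sends this markup to the browser.
    return $randomizedHtml;
}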
This framework can be easily adapted within the web server request life cycle as shown
in Figure 4.2.
Figure (4.2): Flow Chart for the proposed solution.
This flow chart shows that the proposed solution starts after a web page is requested: it checks whether a cached version of that page is available and, if so, sends it to the user; if no cached version is available, the proposed solution generates a new web page, caches it, and then returns it to the user.
4.2.1 Supported Scrapers
4.2.1.1 CSS-Based Scrapers
This type of scraper is designed to extract data from a webpage using CSS selectors. For example, suppose a web page contains two element values and the original markup is:
<div class="title">Data</div>
<div class="news_details">Data</div>
Therefore, the scraper should write the following code to extract those fields.
$('.title').text();
$('.news_details').text();
This code returns the values of the two fields to be stored in the database. The problem is that the CSS class of each field never changes; therefore, the scraper reaches the data whenever it accesses the page.
With the proposed solution, the page markup as well as the CSS is changed automatically on a schedule, so when the scraper is configured to extract fields by CSS classes, its author will find that the scraper has stopped working and never returns data.
This automatic change is the expected result of the proposed solution; the CSS code snippets before and after the change are shown in Figures 4.3 and 4.4.
Figure (4.3): Original CSS code example
Figure (4.4): Randomized CSS code
This change in the CSS requires a corresponding change in the HTML to fit the new CSS rules, so the proposed solution preserves the old rule names, creates a dictionary file that contains the mapping between the old and new rule names, and stores the file temporarily on disk. An example of a randomized HTML file, as well as the original, is shown below in Figures 4.5 and 4.6.
Figure (4.5): Original HTML file.
Figure (4.6): Randomized HTML file.
4.2.1.2 XPath-Based Scrapers
Another type of scraper is designed to extract data from a web page using XPath selectors. For example, suppose a web page contains a table to be extracted and the original markup is the following:
<html>
<body>
<h1>Data</h1>
<table>
<tr><td>Changes</td></tr>
<tr><td class="change-value">4.2</td></tr>
<tr><td class="change-value">3.3</td></tr>
</table>
</body>
</html>
Therefore, the scraper should write the following code to extract those fields
$('//*[@class="change-value"]').text();
This code returns the value of each td element that carries the change-value class. To prevent the XPath-based scraper from extracting the data, the following approaches can be used:
1- Randomize the CSS attributes as in the previous section (the approach used in this work); a small demonstration is sketched below.
2- Add new empty invisible tags to the randomized HTML file so that the scrapers will not find the data matching those tags.
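A minimal demonstration of the first approach, written for illustration with PHP's built-in DOM extension (the markup and the randomized class name below are made up), shows that the old XPath selector no longer matches anything after randomization:

<?php
// Illustration only: count how many nodes an XPath query matches in a snippet.
function countMatches(string $html, string $query): int
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR); // suppress warnings for fragments
    $xpath = new DOMXPath($dom);
    return $xpath->query($query)->length;
}

$original   = '<table><tr><td class="change-value">4.2</td></tr></table>';
$randomized = '<table><tr><td class="qWcRtYbNmKdFgHjL">4.2</td></tr></table>';
$query      = '//*[@class="change-value"]';

echo countMatches($original, $query), "\n";   // 1: the scraper still finds the value
echo countMatches($randomized, $query), "\n"; // 0: the scraper is stopped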
After generating the randomized markup, the following files are saved to disk:
1- Randomized CSS.
2- Randomized HTML.
3- The Mapping File.
This will enhance the performance of the proposed solution. Cron jobs are the ideal way to automate the randomization process for each webpage on the website and to ensure that the markup is unique and refreshed all the time (an example entry is sketched after the list below). The following steps are executed on each run of the cron job:
1- Delete the old cached versions of the randomized CSS, randomized HTML and mapping file.
2- Generate the new randomized files.
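As a sketch only, and assuming the randomizer is exposed as a PHP CLI script (the path and the six-hour schedule below are hypothetical), the cron entry could look like this:

# Hypothetical crontab entry: regenerate the randomized files every 6 hours
0 */6 * * * php /var/www/tools/randomize_markup.php --site=/var/www/html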
4.2.2 Roadmap
In this section, all steps needed to implement and test the proposed solution are discussed; Figure 4.7 illustrates the steps for applying the solution.
Figure (4.7): The Proposed solution applying steps.
4.2.2.1 Defining
The first step is to define the websites and build the dataset that will be used in the following steps; an offline version of each website is also created and saved, containing all needed files such as the HTML, CSS and JavaScript files.
4.2.2.2 Scraping
The web scraper is run on each website in the dataset to extract its data, and the results are stored in a file to be compared later with the results from the randomized version (a sketch of this step is given after Figure 4.8). Figure 4.8 presents example data after running the web scraper on the genuine version of a website.
Figure (4.8): Snippet from a scraped website.
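For illustration only (the file names are hypothetical and this is not the exact scraper used in the experiments), the scraping step can be sketched in PHP with the built-in DOM extension: elements carrying the CSS class "title" in an offline page are selected and their text is written to a CSV file for the later comparison.

<?php
// Illustrative sketch: extract all elements with the CSS class "title" from an
// offline copy of a page and store their text in a CSV file.
$dom = new DOMDocument();
$dom->loadHTMLFile('dataset/news-site/index.html', LIBXML_NOERROR);
$xpath = new DOMXPath($dom);

$out = fopen('results/news-site-original.csv', 'w');
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " title ")]';
foreach ($xpath->query($query) as $node) {
    fputcsv($out, [trim($node->textContent)]);
}
fclose($out);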
4.2.2.3 Applying the Solution
The proposed solution is applied to each web page contained in the dataset, and the randomized version of each web page is saved for testing purposes. The total time required for the randomization process is calculated, as is the difference in file size introduced during the process; finally, the visual similarity between the original page and the generated page is calculated, which is very important for the result discussion.
Figures 4.9 and 4.10 show a CSS code snippet of a website before and after applying the proposed solution, and Figures 4.11 and 4.12 show an HTML code snippet before and after applying it.
The following PHP code presents our methodology for the CSS rule randomization and the HTML synchronization.
public function decryptRules($Rules)
{
$oCssParser = new Sabberworm\CSS\Parser($Rules);
$oCssDocument = $oCssParser->parse();
foreach ($oCssDocument->getAllDeclarationBlocks() as $oBlock) {
foreach ($oBlock->getSelectors() as $oSelector) {
$newSelector = $this->convertRule($oSelector->getSelector());
$oSelector->setSelector($newSelector);
}
}
return $oCssDocument->render();
}
private function convertRule($ruleName)
{
switch ($this->getSelectorsCount($ruleName)){
case 0:
return $ruleName;
case 1:
return $this->getNewNameORExists($ruleName);
break;
default:
$matches = null;
$returnValue = preg_match_all($this->pattern,$ruleName , $matches);
foreach($matches[0] as $match)
{
$new_rule = $this->convertRule($match);
$ruleName = str_replace($match,$new_rule,$ruleName);
}
return $ruleName;
}
}
private function getSelectorsCount($ruleName)
{
/*
* return how many (dots) on the selector string.
* */
$matches = array();
return preg_match_all($this->pattern,$ruleName , $matches);
}
private function getNewNameORExists($ruleName)
{
/*
* check if the current selector name is already decrypted or now and then:
* return new name in case of not decrypted yet.
* Or return the decrypted name.
* */
/*
* TODO
* Loop for all sub-roles and replace die command
* */
$matches = array();
if(preg_match_all($this->pattern,$ruleName , $matches)>1)
{
die('Loop for all sub-roles and replace die command');
}
else{
$real_role = $matches[0][0];
$start_key = substr($real_role, 0, 1);
if(!array_key_exists($real_role,$this->dictionary)){
$this->dictionary[$real_role] = $start_key . $this->getRandomString($real_role);
}
$the_rule = str_replace($real_role, $this->dictionary[$real_role], $ruleName);
return $the_rule;
}
}
private function getRandomString($length)
{
$chars = array_merge(range('a', 'z'), range('A', 'Z'), array('_'));
$length = intval($length) > 0 ? intval($length) : 16;
$max = count($chars) - 1;
$str = "";
while ($length--) {
shuffle($chars);
$rand = mt_rand(0, $max);
$str .= $chars[$rand];
}
return $str;
}
public function convertHTML($dictionary, $page)
{
set_time_limit(0);
$dom = new Dom;
$opt_a = array("cleanupInput"=>false );
$dom->loadFromFile($page,$opt_a);
$totalClasses = count($dictionary);
$UsedClasses = 0;
foreach ($dictionary as $oldKey => $newKey) {
$a = $dom->find($oldKey);
if(count($a)>0)
{
}
foreach ($a as $node) {
$UsedClasses++;
$type = substr($newKey, 0, 1);
if ($type == '.') {
$attrClass = $node->getAttribute('class');
$splitClass = explode(" ", $attrClass);
$strClass = "";
foreach ($splitClass as $key) {
$strClass .= substr($dictionary[".".$key], 1) . " ";
}
$node->setAttribute('class', $strClass);
} else
$node->setAttribute('id', substr($newKey, 1));
}
}
return $dom->root->outerHtml();
}
Figure (4.9): CSS code before applying the proposed solution.
Figure (4.10): CSS code after applying the proposed solution.
Figure (4.11): HTML code snippet before applying the proposed solution.
Figure (4.12): HTML code snippet after applying the proposed solution.
4.2.2.4 Evaluating
To evaluate the proposed solution, the following processes are carried out:
1- Check whether the web scraper is prevented by trying to scrape the generated website again (a sketch is given after this list).
2- Calculate the total time required for applying the proposed solution.
3- Measure the visual similarity between the original version and the randomized version of each web page.
4- Calculate the difference in the HTML and CSS file sizes before and after applying the proposed solution.
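A minimal sketch of the first check (the file names are hypothetical): the scraper output gathered from the original page is compared with the output gathered after randomization; an empty result file means the scraper was prevented.

<?php
// Illustration only: compare the scraper output before and after randomization.
$before = file('results/news-site-original.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
$after  = file('results/news-site-randomized.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];

printf("Rows scraped before randomization: %d\n", count($before));
printf("Rows scraped after randomization:  %d\n", count($after));
echo count($after) === 0 ? "Web scraper prevented.\n" : "Web scraper still extracts data.\n";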
4.3 Summary:
This chapter presented the proposed solution for protecting websites from web scrapers based on a technique called Markup Randomization. The proposed solution can deal with XPath- and CSS-based web scrapers, which have the same internal structure but differ slightly in the way they select a particular node in the DOM.
The proposed solution consists of three steps: randomizing the CSS rule names, syncing the HTML file with the randomized CSS file, and finally sending the randomized version to the client.
Applying the proposed solution is done through four steps. The first is to define the dataset of websites used for testing the proposed solution and to create an offline version of each website so it can be used in the next steps.
The second step is to run the web scraper on each website in the dataset to make sure the website is scrapable and to extract its data, which helps in the next sections.
The third step is to apply the proposed solution to each single web page to generate a new web page that has the same look and feel.
The final step is to evaluate that the generated document cannot be scraped while maintaining the look and feel.
Chapter 5
Experiments and Discussion
5.1 Introduction
This chapter presents the experiments for the proposed solution, which is based on markup randomization and intended to change the markup while preserving the same look and feel. Experiments were established to measure three factors, processing time, file size and visual similarity, and the results are presented and discussed in this section. Finally, the web scraper is re-run to check whether it is prevented or not.
5.2 Dataset
The dataset is a set of websites from three main categories, News, Weather Forecasting and Stock Markets; each category contains 10 websites, and Table 5.1 shows the categories with a description. These websites were collected manually by searching Google using keywords related to each category and then opening each website to check whether it has fresh content or not.
Table (5.1): Dataset website categories.
Category Name Category Description
News A set of websites that present daily-updated news.
Weather Forecasting A set of websites that contains daily-weekly-monthly
predications for the climate properties e.g. humidity,
wind speed.
Stock Markets A set of websites that contains currency prices
updated from the stock immediately.
As the table shows, all of these websites have sensitive content that is updated frequently, which means it would cause a lot of damage to the content owner, who pays a lot to populate and edit this data, if a particular website stole his content: visits would degrade and a competitor website would hijack his site's rating over time.
The selected websites are listed in Table 5.2, which shows each website and its category.
Table (5.2): Website list with category.
# Website Category
1 Bbc News
2 Businessinsider News
3 Buzzfeed News
4 Gizmodo News
5 Huffingtonpost News
6 Mashable News
7 Techcrunch News
8 Thedailybeast News
9 Thenextweb News
10 Thinkprogress News
11 Cbsl Stock Market
12 Forex Stock Market
13 forex-ratings Stock Market
14 Marksandspencer Stock Market
15 Nrb Stock Market
16 Xe Stock Market
17 x-rates Stock Market
18 Wellingtonfx Stock Market
19 Bnm Stock Market
20 Centralbank Stock Market
21 Accuweather Weather forecasting
22 Intellicast Weather forecasting
23 weather-forecast Weather forecasting
24 Yr Weather forecasting
25 holiday-weather Weather forecasting
26 Timeanddate Weather forecasting
27 Nwac Weather forecasting
28 Jnto Weather forecasting
29 Forecast Weather forecasting
30 Bernews Weather forecasting
5.3 Experiment Settings
The experiments were carried out in a cloud server environment on which the proposed solution is applied. Table 5.3 lists the machine specifications.
Table (5.3): Machine specifications.
Machine Cloud Server
CPU 12 cores of Intel Xeon CPU E5-2650L v3 @ 1.80GHz
RAM 16 GB
OS Ubuntu 16.04
Hard Drive Virtual Cloud SSD
Because the proposed solution is built with PHP 7, the Ubuntu Linux distribution was chosen to run the experiments, since PHP is much faster on Linux. A cloud server was selected because of the need for many processes at a low price, and it can be extended and scaled at any point without any extra configuration or reinstallation.
5.4 Experiments Process
Experiments were done over the dataset to check the following factors:
1- Processing Time: the total processing time required for applying the proposed solution; less time means better suitability for production environments.
2- File Size: due to resource limitations, it is highly recommended to test whether the size of the generated randomized markup increases or decreases.
3- Similarity: checking whether the visual look and feel changed after applying the proposed solution, which shows whether the proposed solution is correct and runs as intended.
4- Re-Test Web Scraper: re-run the web scraper to check whether it is prevented or not.
5.4.1 Experiment: Processing Time
Processing time is the main concern for any business because there is a trade-off between fast page rendering for the regular visitor and stopping the scraper bots. The regular user hits the website because he wants to open it for a specific purpose right away. Assume a currency exchange dealer wants to exchange an amount for a client who is waiting for him; if the website takes a long time to render and show up, he will certainly shut down and close his business because of its unreliability.
On the other hand, when the scraper bot tries to scrape data from the website, the page is still rendered to the bot, but the HTML markup as well as the CSS markup should have been randomized by the proposed solution, so the scraper stops.
As a result, the whole processing time is shown in Figure 5.1, which presents the total time for generating a new randomized web page; the results for all 30 websites are shown in Table 5.4.
Figure (5.1): Total time required for the proposed solution.
Table (5.4): Total seconds required to apply the proposed solution.
Website Total Seconds
Bnm 122
Cbsl 6
Centralbank 361
Forex 12
forex-ratings 137
Marksandspencer 264
Nrb 14
Wellingtonfx 1
Xe 39
x-rates 15
Bbc 91
Businessinsider 120
Buzzfeed 107
Gizmodo 64
Huffingtonpost 99
Mashable 88
Techcrunch 432
Thedailybeast 53
Thenextweb 1
Thinkprogress 68
Accuweather 98
Bernews 54
Forecast 121
holiday-weather 274
Intellicast 23
Jnto 5
Nwac 65
Timeanddate 36
weather-forecast 156
Yr 148
Regarding Table 5.4, most of the web pages required little processing time to apply the proposed solution; a few websites have odd values, which are discussed in the next section.
5.4.2 Result Discussion: Processing Time
The processing time for applying the proposed solution regularly takes less than 2 minutes, and only a few results take more than two minutes, as shown in Figure 5.2. Most results took less than two minutes because the required time for applying the proposed solution is coupled with the HTML and CSS line counts; Table 5.5 lists the results that take less than two minutes.
Figure (5.2): Results classification based on time.
(Pie chart legend: less than 25 seconds, less than two minutes, more than two minutes.)
Table (5.5): Results taking less than 2 minutes.
Category Website Time
Currencies x-rates 0:00:36
Weather Forecast 0:00:39
News Techcrunch 0:00:53
News Businessinsider 0:00:54
Currencies Wellingtonfx 0:01:04
Currencies Nrb 0:01:05
Currencies Forex 0:01:08
Currencies Cbsl 0:01:28
Currencies forex-ratings 0:01:31
News Mashable 0:01:38
News Gizmodo 0:01:39
News Buzzfeed 0:01:47
Weather holiday-weather 0:02:00
Weather Bernews 0:02:01
News Thinkprogress 0:02:17
Weather Accuweather 0:02:22
News Bbc 0:02:28
Weather Intellicast 0:02:44
The above-range experiments are the web pages with more lines of HTML markup as well as CSS; this is caused by the larger number of replacements needed, since roughly each line of the body element needs at least one replacement, and therefore processing takes much longer, as shown in Table 5.6.
Table (5.6): Results taking more than 2 minutes.
Category Website HTML lines CSS lines Total lines Time
Currencies Bnm 5014 3221 8235 0:07:12
Currencies Centralbank 2785 4057 6842 0:06:01
News Thenextweb 984 4722 5706 0:04:34
News Huffingtonpost 1383 2397 3780 0:04:24
Finally, the below-range experiments are the web pages that take less processing time than expected, as shown in Table 5.7; this is caused by one of the following:
1- The CSS is not too long.
2- The HTML is not too long.
3- The CSS is not 100% used in the HTML document.
Table (5.7): Results that take less processing time than most results.
Category Website HTML lines CSS lines Total lines Seconds
Currencies Marksandspencer 1789 1836 3625 1
Weather Yr 202 124 326 1
Weather Nwac 501 164 665 5
Currencies Xe 1660 61 1721 6
Weather weather-forecast 383 877 1260 12
Weather Jnto 595 369 964 14
Weather Timeanddate 492 699 1191 15
News Thedailybeast 1033 418 1451 23
5.4.3 Experiment: File Size
Server resources are an important point and should be measured for any proposed solution, because servers are all about resources. As a result, the file size change was tested and tracked between the two versions of each page, the page before applying the randomizer and the page after applying it; the relation between the file size before and after is illustrated in Table 5.8.
Table (5.8): Website file size before and after applying the proposed solution.
Website Size before Size after Diff (Size before / Size after)
Bbc 267 112 2.383929
Businessinsider 92 77 1.194805
Buzzfeed 223 127 1.755906
Gizmodo 174 179 0.972067
Huffingtonpost 276 63 4.380952
Mashable 261 40 6.525
Techcrunch 330 269 1.226766
Thedailybeast 197 62 3.177419
Thenextweb 154 114 1.350877
Thinkprogress 158 129 1.224806
Cbsl 68 43 1.581395
Forex 27 26 1.038462
forex-ratings 53 50 1.06
Marksandspencer 97 61 1.590164
Nrb 54 24 2.25
Xe 67 48 1.395833
x-rates 30 22 1.363636
Wellingtonfx 11 12 0.916667
Bnm 73 64 1.140625
Centralbank 151 102 1.480392
Accuweather 110 48 2.291667
Intellicast 71 49 1.44898
weather-forecast 71 30 2.366667
Yr 85 72 1.180556
holiday-weather 87 45 1.933333
Timeanddate 25 23 1.086957
Nwac 147 106 1.386792
Jnto 40 38 1.052632
Forecast 65 71 0.915493
Bernews 88 123 0.715447
5.4.4 Result Discussion: File size
The file size results were a bit different from the processing time results, as shown in Figure 5.3, because developers do not follow the web standards when writing the CSS documents as well as the HTML documents.
Figure (5.3): Difference between generated file size and original file size.
The proposed solution restructures all those files in the final step; therefore, the generated HTML and CSS are improved in most cases and the size of the generated documents is smaller than the original, see Table 5.9, although in some cases the size increased, as illustrated in Table 5.10.
Table (5.9): Website HTML file size decreased after applying the proposed solution.
Website Size before Size after
Bbc 267 112
Businessinsider 92 77
buzzfeed 223 127
huffingtonpost 276 63
mashable 261 40
techcrunch 330 269
thedailybeast 197 62
thenextweb 154 114
thinkprogress 158 129
cbsl 68 43
forex 27 26
forex-ratings 53 50
marksandspencer 97 61
nrb 54 24
xe 67 48
x-rates 30 22
bnm 73 64
centralbank 151 102
accuweather 110 48
intellicast 71 49
weather-forecast 71 30
yr 85 72
holiday-weather 87 45
timeanddate 25 23
nwac 147 106
jnto 40 38
To discuss the results in Table 5.9, it is necessary to understand the web page's HTML markup and CSS. Both the HTML and the CSS may contain the following unnecessary elements:
1- Comments: comments in CSS are any text wrapped by "/*" and "*/", and comments in HTML are text wrapped by "<!--" and "-->".
2- White spaces: one white space or more.
3- Line breaks: one or more line breaks added by pressing the "Enter" key.
Table (5.10): Website HTML page size increased after applying the proposed solution.
Website Size before Size after
Gizmodo 174 179
Wellingtonfx 11 12
Forecast 65 71
Bernews 88 123
When the proposed solution finishes the randomization process, it removes all unnecessary lines and comments from the original copy of the markup (a sketch of this normalization is given below). A file with many unnecessary elements therefore shrinks immediately and the difference is obvious, while a file with few unnecessary elements does not shrink in most cases, or may even grow a bit, because the generated CSS class names are longer than the original ones. File size matters greatly in production environments: if the file size can be shrunk, then many versions of the randomized web page can be generated in advance, which means more applicability for the system.
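The following stand-alone PHP sketch (an illustration only; the actual proposed solution performs the equivalent normalization while re-rendering the parsed documents) shows the kind of clean-up that explains the size reduction:

<?php
// Illustration of the normalization that shrinks the files: strip comments,
// collapse white-space and drop blank lines.
function stripCssNoise(string $css): string
{
    $css = preg_replace('!/\*.*?\*/!s', '', $css); // remove /* ... */ comments
    $css = preg_replace('/\s+/', ' ', $css);       // collapse white-space and line breaks
    return trim($css);
}

function stripHtmlNoise(string $html): string
{
    $html = preg_replace('/<!--.*?-->/s', '', $html);      // remove <!-- ... --> comments
    $html = preg_replace('/^[ \t]+|[ \t]+$/m', '', $html); // trim each line
    return preg_replace("/\n{2,}/", "\n", $html);          // drop blank lines
}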
To show the difference between the files before and after applying the proposed solution, Figures 5.4 and 5.5 present a specific code snippet before and after applying it.
Figure (5.4): Code snippet before applying the proposed solution.
Figure (5.5): Code snippet after applying the proposed solution.
5.4.5 Experiment: Similarity
Similarity is an important factor to check when comparing the original and generated versions of a web page, to see whether the proposed solution preserves the visual look and feel of each web page or breaks it.
Two groups of researchers (Alpuente & Romero, 2009; Gowda & Mattmann, 2016) proposed two different ways to compare two web pages; therefore both methods were used in the experiments.
Unfortunately, the first technique, proposed by (Gowda & Mattmann, 2016), failed to measure the similarity properly, while the second one, proposed by (Alpuente & Romero, 2009), succeeded. As a result, the second approach was adapted for the latest HTML5 standards and can be used with confidence. The next sections contain the full review of the results.
5.4.5.1 Visual Similarity using the Gowda et al. Method:
Tests were done using Matiskay's ("HTML Similarity," 2017) Python tool, which is an implementation of the Gowda et al. (Gowda & Mattmann, 2016) technique. The tool has two main parts (a sketch of the second metric follows the list):
1- An HTML similarity part, applying the tree edit distance (TED) on the two documents.
2- A CSS similarity part, applying the Jaccard similarity between the sets of CSS classes.
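To illustrate the second metric (this is a sketch of the Jaccard measure itself, not the tool's code; the class names below are made up), the score is the ratio of shared CSS class names to the union of the class names of the two pages, which is why it collapses once the classes are randomized even though the pages render identically:

<?php
// Illustration of the Jaccard similarity between two sets of CSS class names.
function jaccard(array $classesA, array $classesB): float
{
    $a = array_unique($classesA);
    $b = array_unique($classesB);
    $union = count(array_unique(array_merge($a, $b)));
    return $union === 0 ? 1.0 : count(array_intersect($a, $b)) / $union;
}

echo jaccard(['title', 'news_details'], ['title', 'news_details']), "\n";  // 1: identical class sets
echo jaccard(['title', 'news_details'], ['qWcRtYbNm', 'dFgHjKlPo']), "\n"; // 0: after randomization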
The results of the similarity test for each web page are shown in Table 5.11.
Table (5.11): Web page similarity results obtained with Matiskay's tool.
Website Similarity
Thinkprogress 8%
Businessinsider 10%
Buzzfeed 19%
Bbc 21%
Nrb 21%
Accuweather 22%
Nwac 35%
Jnto 39%
Bnm 41%
Forecast 41%
Cbsl 41%
Thedailybeast 45%
x-rates 45%
Huffingtonpost 45%
holiday-weather 46%
Xe 46%
Bernews 47%
Mashable 47%
Centralbank 48%
Thenextweb 48%
Marksandspencer 48%
Forex 48%
Intellicast 48%
Techcrunch 48%
Timeanddate 48%
Gizmodo 48%
weather-forecast 48%
Wellingtonfx 49%
Yr 50%
forex-ratings 50%
5.4.5.2 Visual Similarity using the Romero and Maria Method:
Visual similarity was tested using the xml2maude tool (Alpuente & Romero, 2009), which compares the similarity of two web pages through a series of normalization and transformation steps, generates the tree edit distance for each web page, and calculates the similarity using the formulas suggested by (Alpuente & Romero, 2009).
Each website was tested individually and each website category was also tested as a whole; Table 5.12 and Table 5.13 illustrate the results respectively.
Table (5.12): Website page similarity between original and generated website.
Website Similarity
Accuweather 99.82%
Bbc 100.00%
Bernews 100.00%
Bnm 99.95%
Businessinsider 99.66%
Buzzfeed 100.00%
Cbsl 99.75%
Centralbank 98.98%
Forecast 99.11%
Forex 100.00%
forex-ratings 100.00%
Gizmodo 100.00%
holiday-weather 97.03%
Huffingtonpost 99.92%
Intellicast 100.00%
Jnto 98.67%
Marksandspencer 100.00%
Mashable 100.00%
Nrb 97.74%
Nwac 100.00%
Techcrunch 100.00%
Thedailybeast 99.92%
Thenextweb 99.89%
Thinkprogress 100.00%
Timeanddate 100.00%
weather-forecast 100.00%
Wellingtonfx 100.00%
Xe 100.00%
x-rates 100.00%
Yr 100.00%
Table (5.13): Website Category similarity test.
Category Similarity
News 99.94%
Currency 99.64%
Weather 99.46%
5.4.6 Result Discussion: Similarity
Two methods for similarity were applied. First, the methodology proposed by (Gowda & Mattmann, 2016) and implemented in Python by Matiskay ("HTML Similarity," 2017) was applied, but the results did not match the expectations and it was too hard to find a relation between the characteristics of each website and the results, such as:
1- A relation between the calculated similarity and the file size.
2- A relation between the calculated similarity and the CSS coverage inside the HTML.
Thus, Matiskay's implementation of web page similarity fails to work for this model, even though the two web pages have the same look and feel, as shown in Figures 5.6 and 5.7.
Figure (5.6): The original offline version of CBSL website.
Figure (5.7): Generated version of CBSL website.
As a result, another solution was used that compares the two web pages visually rather than by other means; it compares the two web pages by transforming and compressing them and then calculating the similarity.
The results of Romero and Maria's approach match the expectations because it measures the difference between the original documents and the generated documents and then calculates how similar the generated document is to the original. The similarity values ranged from 97.0% to 100%, depending on:
1- How many unsupported tags are cleared while applying the proposed solution, such as the following tags:
a. <b:if> and <b:else/>
b. <gcse:search/>
2- How many run-time-generated DOM elements are inserted, updated or deleted, because the comparison tool dismisses all of them, such as:
a. Facebook social buttons and dialogs: the code snippet in Figure 5.8 demonstrates an example of Facebook changing the DOM at run time; the empty div tag with id "fb-root" is replaced by the code snippet shown in Figure 5.9 to show the quote button illustrated in Figure 5.10.
Figure (5.8): Facebook Quote Dialog Example
(Facebook, 2018).
Figure (5.9): Facebook generated code replacing the fb-root div.
Figure (5.10): Facebook generated Quote button.
b. AddThis social buttons: many types of buttons with counters and statistics are produced and maintained by AddThis as a service. For example, share buttons can be embedded using the code demonstrated in Figure 5.11; at run time it is replaced by the code illustrated in Figure 5.12 and finally rendered as in Figure 5.13.
Figure (5.11): AddThis setup code.
(AddThis, 2018)
Figure (5.12): AddThis generated code.
Figure (5.13): AddThis generate buttons look and feel.
5.4.7 Re-Run Web Scraper
The web scraper was executed three times to extract data but failed to get any data at all; as presented in Table 5.14, all websites succeeded in stopping the web scraper and their data is protected.
Table (5.14): Results for running web scraper after applying the proposed solution.
# Website Prevent Web Scraper
1 Bbc YES
2 Businessinsider YES
3 Buzzfeed YES
4 Gizmodo YES
5 Huffingtonpost YES
6 Mashable YES
7 Techcrunch YES
8 Thedailybeast YES
9 Thenextweb YES
10 Thinkprogress YES
11 Cbsl YES
12 Forex YES
13 forex-ratings YES
14 Marksandspencer YES
15 Nrb YES
16 Xe YES
17 x-rates YES
18 Wellingtonfx YES
19 Bnm YES
20 Centralbank YES
21 Accuweather YES
22 Intellicast YES
23 weather-forecast YES
24 Yr YES
25 holiday-weather YES
26 Timeanddate YES
27 Nwac YES
28 Jnto YES
29 Forecast YES
30 Bernews YES
For instance, Figure 5.14 shows a code snippet from the original website before randomizing the markup and Figure 5.15 shows the code snippet after the randomization; Table 5.15 contains the data that was scraped from the website before randomizing the markup, while no data was extracted after the randomization.
Figure (5.14): Website markup before randomization.
Figure (5.15): Website markup after randomization.
Table (15): Website extracted data before randomization
News Title: McMaster: Evidence of Russian meddling in the US election is 'now really incontrovertible'
News Url: http://www.businessinsider.com/mcmaster-russia-meddling-us-election-incontrovertible-2018-2

News Title: The Mueller indictments — here's which Russians were charged with interfering in the 2016 US election
News Url: http://www.businessinsider.com/russians-mueller-charged-with-interfering-2016-election-2018-2

News Title: Twitter users are being called out for posting fake claims of racially motivated assaults at 'Black Panther' showings
News Url: http://www.businessinsider.com/twitter-users-post-fake-claims-assaults-black-panther-showings-2018-2

News Title: A hedge fund that focuses solely on marijuana is crushing it
News Url: http://www.businessinsider.com/bi-prime-navy-capital-investing-in-the-public-marijuana-market-2018-2

News Title: Video shows buildings swaying violently during a massive earthquake in Mexico
News Url: http://www.businessinsider.com/mexico-city-earthquake-video-building-2018-2
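The following is a hedged sketch, in Python with lxml, of the kind of XPath-based scraper referred to above and of why it fails after randomization. The class name "post-title" and the sample markup are hypothetical illustrations and are not taken from the actual Businessinsider page.

```python
# Sketch: a scraper hard-coded against a class name finds nothing once the
# class has been renamed by Markup Randomization.
from lxml import html

def extract_headlines(page_source: str):
    tree = html.fromstring(page_source)
    # The XPath expression is hard-coded against the original class name.
    return [
        (a.text_content().strip(), a.get('href'))
        for a in tree.xpath('//h2[contains(@class, "post-title")]/a')
    ]

original = '<h2 class="post-title"><a href="/news/1">Example headline</a></h2>'
randomized = '<h2 class="c3f9a1b27d4e"><a href="/news/1">Example headline</a></h2>'

print(extract_headlines(original))    # [('Example headline', '/news/1')]
print(extract_headlines(randomized))  # [] -> the selector no longer matches
```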
5.5 Summary:
This chapter presented the experiments conducted on the proposed solution and
discussed their results in terms of processing time, file size, and similarity.
The processing time required to apply the proposed solution to each web page was
measured because it is the most important factor in real production environments. File
size was also tested because website performance depends heavily on resource size.
Finally, similarity was measured between the markup before and after applying the
proposed solution, which proves that the markup changed while no visual effect
occurred during the process.
Time is the most important factor in real environments, so a lower processing time
makes the proposed solution more applicable than a higher one. The experiments show
that applying the proposed solution to a particular web page takes less than 2 minutes
for pages whose total markup is under 4500 lines. Longer times can be reduced at the
source: developers can build pages with fewer lines and fewer CSS class attributes by
defining a root CSS class for each block and using element selectors for the required
styles.
File size improved in most cases because the proposed solution normalizes the web
pages by removing all unnecessary white-space, line breaks, and code comments. For
pages that were already normalized, the size increased by at most 9% in most cases,
while one case increased by 39%.
Similarity tests show that most of the web pages exhibit no visual changes after
applying the proposed solution, while a few pages show a 1-3% change, explained by
run-time generated code from third-party scripts or by unresolved third-party HTML
tags.
Finally, re-testing the web scraper against all dataset websites shows that every
website is protected when the proposed solution is applied periodically.
Chapter 6
Conclusion
Web scraping is a trending legal and business issue affecting many kinds of websites,
such as blogs and online business websites. Web scraping activity steals the original
content and republishes it immediately, without preserving the intellectual property
or copyrights of the online businesses.
Figure (6.1): Proposed model based on Markup Randomization (CSS Randomization, HTML sync with the new CSS, then sending the randomized version to the browser).
The proposed solution, based on the Markup Randomization model shown in Figure 6.1,
protects websites from web scrapers by generating a randomized version of each web
page that is visually identical to the original. Repeating the process over a time
span that can be defined and adjusted by the website administrator prevents the web
scraper and permanently solves the web scraping issue.
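As an illustration of the two randomization steps in Figure 6.1 (CSS randomization followed by HTML sync), the following is a minimal Python sketch. It is a simplified reconstruction and not the thesis implementation: it assumes the page's CSS is available as text next to the HTML, uses BeautifulSoup for the HTML side, and handles only plain ".class" selectors.

```python
# Minimal sketch of the markup-randomization idea: rename every CSS class used
# in the page to a fresh random token, then synchronize the same renaming in
# the HTML class attributes.
import re
import secrets
from bs4 import BeautifulSoup

def randomize_markup(html_text: str, css_text: str):
    soup = BeautifulSoup(html_text, 'html.parser')

    # 1. Collect every class name actually used in the HTML.
    used_classes = {c for tag in soup.find_all(class_=True) for c in tag['class']}

    # 2. Map each class to a fresh random identifier (new names on every run).
    mapping = {c: 'c' + secrets.token_hex(6) for c in used_classes}

    # 3. CSS randomization: rewrite ".old-name" selectors to the new names.
    #    (Simplified: a real implementation would use a CSS parser so that
    #    strings such as url(image.png) are never touched.)
    new_css = re.sub(
        r'\.([A-Za-z_][\w-]*)',
        lambda m: '.' + mapping.get(m.group(1), m.group(1)),
        css_text,
    )

    # 4. HTML sync: rewrite the class attributes with the same mapping.
    for tag in soup.find_all(class_=True):
        tag['class'] = [mapping.get(c, c) for c in tag['class']]

    return str(soup), new_css
```

Because a fresh mapping is generated on every run, a scraper that hard-codes yesterday's class names finds nothing today, while the rendered page stays visually identical.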
Experiments were done over a dataset of 30 websites from three categories (News,
Weather forecasting, and Currency markets) to test the total processing time required
by the randomization, the file size changes before and after processing, and finally
the visual similarity between the generated and the original web page.
Results show that the processing time is less than 2 minutes for instances whose total
HTML and CSS is under 4500 lines. The file size decreased in all but a few exceptional
cases, thanks to removing unnecessary white-space, line breaks, and code comments; for
web pages that were already normalized it increased by at most 9% in most cases, with
only one case increasing by 39%. Lastly, the visual similarity test
proved that most of the web pages show no visual changes after the proposed solution
is applied, while a few websites show a 1-3% change, explained by run-time generated
code from third-party scripts or by unresolved third-party HTML tags.
The proposed markup randomization solution is, to the best of our knowledge, distinct
from previous work: it protects websites from web scrapers, is applicable to the
latest web standards and technologies, and can be embedded within web cache systems or
act as an intermediate layer in front of the web server.
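As one possible shape for that intermediate layer, the sketch below shows a Flask after-request hook that rewrites outgoing HTML responses. This is only a hedged illustration of the deployment idea, and randomize_markup_for_page is a hypothetical placeholder for the randomization step sketched above, not part of any existing library.

```python
# Sketch: applying the randomization as an intermediate layer, here expressed
# as a Flask after_request hook; a cache plugin or reverse proxy could play
# the same role.
from flask import Flask

app = Flask(__name__)

def randomize_markup_for_page(html_text: str) -> str:
    # Hypothetical placeholder: in a real deployment this would call the
    # CSS-randomization and HTML-sync step sketched earlier.
    return html_text

@app.after_request
def randomize_outgoing_html(response):
    # Only touch successful HTML responses; leave images, JSON, etc. alone.
    if response.status_code == 200 and response.mimetype == 'text/html':
        response.set_data(randomize_markup_for_page(response.get_data(as_text=True)))
    return response
```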
Online businesses can now depend on the proposed solution because it protects them
from the web scraping problem, which leads to more success in the competition against
competitors who steal their prices. Bloggers can also take advantage of the proposed
solution to increase their traffic and SEO ratings, with more revenue as a result.
Future works:
Many enhancements have been left for the future due to the lack of resources and time;
future work should address the following points:
1- Web scrapers that look up specific content by intelligent methods such as regular
expressions, semantic search, or even machine learning.
2- The proposed solution takes a long time to generate the new HTML and has time
complexity O(n·x), where n is the number of CSS classes and x is the number of classes
used in the HTML document.
3- Changing the structure of the HTML in a way that misleads the web scraper.
References
AddThis. (2018). AddThis. Retrieved from https://www.addthis.com/
Alpuente, M., & Romero, D. (2009). A visual technique for web pages comparison.
Electronic Notes in Theoretical Computer Science, 235, 3-18.
Beale, J., Baker, A. R., & Esler, J. (2007). Snort: IDS and IPS toolkit: Syngress.
Behnel, S., Faassen, M., & Bicking, I. (2005). lxml: XML and HTML with Python.
Bonifacio, C., Barchyn, T. E., Hugenholtz, C. H., & Kienzle, S. W. (2015). CCDST:
A free Canadian climate data scraping tool. Computers & Geosciences, 75, 13-
16.
Catalin, M., & Cristian, A. (2017). An efficient method in pre-processing phase of
mining suspicious web crawlers. Paper presented at the System Theory,
Control and Computing (ICSTCC), 2017 21st International Conference on.
Copyright. (2018). Retrieved from https://en.wikipedia.org/wiki/Copyright
Digital Millennium Copyright Act of 1998. (1998). Retrieved from
https://en.wikipedia.org/wiki/Digital_Millennium_Copyright_Act_of_1998
Distil Networks. (2018, 06/30/2018). Retrieved from
https://www.crunchbase.com/organization/distil
Duffield, N., Haffner, P., Krishnamurthy, B., & Ringberg, H. A. (2018). Systems and
methods for rule-based anomaly detection on IP network flow. In: Google
Patents.
Facebook. (2018). Quote Plugin. Retrieved from
https://developers.facebook.com/docs/plugins/quote#example
Gormley, C., & Tong, Z. (2015). Elasticsearch: The Definitive Guide: A Distributed
Real-Time Search and Analytics Engine. O'Reilly Media, Inc.
Gowda, T., & Mattmann, C. A. (2016). Clustering Web Pages Based on Structure and
Style Similarity (Application Paper). Paper presented at the 2016 IEEE 17th
International Conference on Information Reuse and Integration (IRI).
Gupta, Y. (2015). Kibana Essentials: Packt Publishing Ltd.
Haque, A., & Singh, S. (2015). Anti-scraping application development. Paper
presented at the Advances in Computing, Communications and Informatics
(ICACCI), 2015 International Conference on.
HTML Similarity. (2017). Retrieved from https://github.com/matiskay/html-similarity
Jaccard index. (2018). Retrieved from https://en.wikipedia.org/wiki/Jaccard_index
Kouzis-Loukas, D. (2016). Learning Scrapy: Packt Publishing Ltd.
Mahto, D. K., & Singh, L. (2016). A dive into Web Scraper world. Paper presented at
the Computing for Sustainable Global Development (INDIACom), 2016 3rd
International Conference on.
Malik, S. K., & Rizvi, S. (2011). Information extraction using web usage mining, web
scrapping and semantic annotation. Paper presented at the Computational
Intelligence and Communication Networks (CICN), 2011 International
Conference on.
Mathew, A., Balakrishnan, H., & Palani, S. (2015). Scrapple: a Flexible Framework
to Develop Semi-Automatic Web Scrapers. International Review on
Computers and Software (IRECOS), 10(5), 475-480.
Mi, X., Liu, Y., Feng, X., Liao, X., Liu, B., Wang, X., . . . Sun, L. (2019). Resident
Evil: Understanding Residential IP Proxy as a Dark Service. Paper presented
at the Resident Evil: Understanding Residential IP Proxy as a Dark Service.
Mirkovic, J., & Reiher, P. (2004). A taxonomy of DDoS attack and DDoS defense
mechanisms. ACM SIGCOMM Computer Communication Review, 34(2), 39-
53.
Mitchell, R. (2015). Web scraping with Python: collecting data from the modern web.
O'Reilly Media, Inc.
Mobasher, B. (2006). Web usage mining. Web data mining: Exploring hyperlinks,
contents and usage data, 12.
Nie, T., Shen, D., Yu, G., Kou, Y., & Yang, D. (2011). Construct the XQuery-based
wrapper for extracting web data. Paper presented at the Fuzzy Systems and
Knowledge Discovery (FSKD), 2011 Eighth International Conference on.
Niwattanakul, S., Singthongchai, J., Naenudorn, E., & Wanapu, S. (2013). Using of
Jaccard coefficient for keywords similarity. Paper presented at the Proceedings
of the International MultiConference of Engineers and Computer Scientists.
Parikh, K., Singh, D., Yadav, D., & Rathod, M. (2018). Detection of web scraping
using machine learning.
Pawlik, M., & Augsten, N. (2016). Tree edit distance: Robust and memory-efficient.
Information Systems, 56, 157-173.
Richardson, L. (2008). Beautiful Soup: HTML/XML parser for Python.
Safe harbor (law). (2018). Retrieved from
https://en.wikipedia.org/wiki/Safe_harbor_(law)
ScrapeDefender. (2018). ScrapeDefender. Retrieved from http://scrapedefender.com/
ScrapeSentry. (2018). ScrapeSentry. Retrieved from https://www.scrapesentry.com/
ShieldSquare. (2013). ShieldSquare Bot Mitigation and Bot Management solution.
Retrieved from https://www.shieldsquare.com/
Thelwall, M. (2001). A web crawler design for data mining. Journal of Information
Science, 27(5), 319-325.
Turnbull, J. (2013). The Logstash Book: James Turnbull.
Wetterström, R., & Andersson, S. (2009). Web information scraping protection. In:
Google Patents.
XQuery. (2016). Retrieved from https://en.wikipedia.org/wiki/Web_scraping
Yu, H.-t., Guo, J.-y., Yu, Z.-t., Xian, Y.-t., & Yan, X. (2014). A novel method for
extracting entity data from Deep Web precisely. Paper presented at the The
26th Chinese Control and Decision Conference (2014 CCDC).
Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance
between trees and related problems. SIAM journal on computing, 18(6), 1245-
1262.