
The Report for My Undergraduate Thesis (Graduation Project)

ISTANBUL TECHNICAL UNIVERSITY FACULTY of COMPUTER and INFORMATICS

EVALUATION of DOM TREE SIMILARITIES

Graduation Project

Teoman Turan 040100014

Department: Computer Engineering

Supervisor: Asst. Prof. Dr. Tolga OVATMAN

June 2015

Declaration of Originality

I declare that

1. In this study, all citations from other sources are clearly indicated by giving reference to the relevant sources, and

2. Sections except for the citations, especially the theoretical studies and the software/hardware that form the main subject of the project, are prepared by me.

Istanbul, 06/29/2015

Teoman Turan

EVALUATION of DOM TREE SIMILARITIES

(SUMMARY)

HTML (Hyper-Text Markup Language) is a markup language used to design web pages. The syntax of an HTML file forms a tree whose major nodes are the elements of the file (the tags written between the signs < and >), and whose minor nodes, connected to the elements as their children, are the attributes of these elements (color, format, link target, image, etc.) and the texts placed between their opening and closing tags. Starting from the root node, which corresponds to the <html> element, the remaining element nodes form a tree according to the order in which they are nested in the syntax. This tree is called the “DOM (Document Object Model) tree”. The main purpose of this graduation project is to develop an algorithm that evaluates the level of similarity between the designs of two HTML files by comparing their DOM trees. Along with the algorithm, a simple graphical user interface (GUI) has been developed; it provides buttons to load and compare HTML files, and text labels that display the results.

The project has been developed under Windows 8.1 (64-bit, English) with all updates installed. The latest version of Eclipse IDE for Java EE Developers (Luna) has been used as the IDE, and the Java programming language has been chosen, with the latest versions of the Java Development Kit (JDK) and the Java Runtime Environment (JRE). For the first part of the development, the Dom4j Java library has been used to parse an HTML file and extract its DOM tree. The rest of the development mostly consists of designing the algorithm that compares two extracted DOM trees, using the object-oriented programming features of Java.

The system returns three kinds of similarity ratio: one with respect to the frequency of the elements in the HTML files; one with respect to the number of attributes and to their parents (which element nodes own at least one attribute?); and one with respect to the number of text nodes and to their parents (which element nodes own at least one text child?). A fourth, overall similarity ratio is calculated from the former three according to their respective weights. Comments in an HTML file, if there are any, are ignored. To use the application, the user only has to press the relevant buttons to load two HTML files by choosing them through the file explorer, and then press the button that calculates the four similarity ratios between them. The results are printed in the same window.

EVALUATION of DOM TREE SIMILARITIES

(SUMMARY, TURKISH VERSION)

HTML (Hyper-Text Markup Language) is a markup language used to design web pages. The syntax of an HTML file forms a tree: its major nodes are the elements of the file (the tags between the signs < and >), and its other nodes, connected to these element nodes as their children, are the attributes (text format, text color, an embedded image, the address a link points to, the language of the page, the width of table columns, etc.) and the texts placed between the tags. Starting from the root node, which corresponds to the <html> element at the very beginning, or more precisely in the outermost shell, of an HTML file, a tree structure emerges according to the order in which the elements of the file are nested. A small example can be seen in the code fragment below.

<html>
  <body>
    <h1>My First Paragraph</h1>
    <p id="para">The capital of Turkey is Ankara.</p>
  </body>
</html>

When this code fragment is laid out as a tree, html, body, h1 and p become the major nodes of the tree. According to the nesting order, html is the root node at the top; the body node descends from it, and body in turn has the child nodes h1 and p. In addition, because of the id="para" attribute in the <p> tag, an id child node is attached to the p node. Likewise, since the text “My First Paragraph” lies between the opening and closing tags of <h1>, a text child node is attached to this element. The same holds for the p node because of the text “The capital of Turkey is Ankara.” The resulting tree is called the “DOM (Document Object Model) tree”.

The main purpose of the graduation project described in this report is to develop an algorithm that measures the level of similarity between the designs of two HTML files by comparing their DOM trees. The system examines and parses the two loaded HTML files and extracts their DOM trees. Then, using the developed algorithm, it returns the similarity percentages between the two DOM trees. Along with the algorithm, a plain user interface has been designed, with buttons for loading the HTML files and running the similarity evaluation, and an area where the results are written.

The project has been developed on the Windows 8.1 (64-bit, English) operating system with all updates installed. The latest version of the development environment Eclipse IDE for Java EE Developers (Luna) has been used as the IDE. The Java programming language, with the latest versions of the Java Development Kit (JDK) and the Java Runtime Environment (JRE), has been preferred because it supports object-oriented programming and offers suitable, easy-to-use libraries for some of the stages. In the first part of the development, the HTML documents have to be read by the system and scanned from beginning to end, and their DOM trees have to be extracted. Dom4j, a Java library, has been used for this. The remainder of the development was spent on designing the algorithm, developing the interface, and testing.

The system first reads and examines two HTML files and extracts their DOM trees with all their nodes and connections; it then compares these trees with the developed algorithm and returns the percentage by which their designs resemble each other. According to the algorithm, the system returns three types of similarity percentage. The first is the similarity ratio with respect to the amount (frequency) of the distinct element nodes in the HTML files. The second is the similarity ratio with respect to the number of text nodes in the HTML files and the number of parent nodes that own at least one text node: the question “Which nodes have text children?” is answered, and the method used for the first type of ratio is applied to the element nodes found to be such parents. The third is the similarity ratio with respect to the number of element attributes in the HTML files and the number of parent nodes that own at least one attribute node. It follows the same method as for the text nodes: the question “Which nodes have attribute children?” is answered, and the method used for the first type of ratio is applied to the element nodes found to be such parents. In addition, an “overall similarity ratio” is calculated as a fourth type. It is an average of the first three ratios, weighted by their influence and importance.

Since the system compares the designs of two HTML files, and hence of two DOM trees, the values inside the text nodes (the texts themselves) and the values taken by the element attributes do not matter. Because “design” is at issue, what matters is the existence of the nodes and their connections to each other. The system also ignores the comments in the HTML files.

The application is easy to use. The user first clicks the relevant button to load an HTML file into the system, then finds and selects the file through the file explorer. Afterwards, the user clicks the button that calculates the similarity ratios. The similarity ratios computed in the background are printed in the same window. This procedure can be repeated with different HTML files without leaving the application.

TABLE of CONTENTS

1 – INTRODUCTION
  1.1. The Augmentation of Web Pages and the Results
  1.2. A Brief Introduction to the Main Problem
  1.3. The Study Done and the Results
  1.4. The Sections of the Report
2 – THE PROJECT DESCRIPTION and PLAN
  2.1. The Project Description
  2.2. The Project Plan
3 – THEORETICAL INFORMATION
  3.1. HTML (Hyper-Text Markup Language)
  3.2. DOM (Document Object Model)
4 – ANALYSIS and MODELLING
  4.1. Understanding the Main Problem
  4.2. Modelling
    4.2.1. The Programming Language and the Development Environment
    4.2.2. The Project Hierarchy
    4.2.3. The Programming Concept
    4.2.4. The Modelling of Classes
    4.2.5. The Comparison and Similarity Evaluation Algorithm
5 – DESCRIPTION, IMPLEMENTATION and TEST
  5.1. Classes and Methods
    5.1.1. ElementNode.java
    5.1.2. AttributeNode.java
    5.1.3. TextNode.java
    5.1.4. TreeSim.java
6 – EXPERIMENTAL RESULTS
7 – THE RESULTS and SUGGESTIONS
8 – REFERENCES

1. INTRODUCTION

1.1. The Augmentation of Web Pages and the Results

While the role the World Wide Web (WWW) plays in our lives is growing at a remarkable pace, the total number of web pages naturally increases in parallel. To meet our diversified interests and specific needs, such as doing research or acquiring media, the total number of web sites around the world has approached 1 billion. According to the statistics published by Internet Live Stats, where the term “web site” means a distinct hostname, there were 968,882,453 web sites visited by 2,295,249,355 Internet users as of mid-2014, which means that each web site is visited by approximately 3 users. [1]

This huge growth has the following consequence: since it is impossible to come up with enough distinct designs for almost a billion web sites, the differences among the designs of web pages that serve the same purpose, such as the official web site of a product or company, a bulletin board, or a video sharing platform, are shrinking. Web page templates are a good illustration of this. When a new web page offering a service of the same kind has to be designed, one of a handful of ready-made templates can be used today. As a result, it is now possible to see thousands of web pages whose schematic structures are close to each other. Figure 1.1 below shows a sample template that can be used for the designs of a great number of new web sites.

The fact that web page designs are converging raises an issue that deserves study: the evaluation of web page similarities.

Figure 1.1: A sample template

1.2. A Brief Introduction to the Main Problem

A web site consists of a number of interconnected web pages. As stated in the summary of the thesis, a web page is written in HTML, and an HTML file forms the source of a web page. The content of the web page can be described as a reflection of how the statements in the source HTML file are constructed. The placement of the elements in an HTML file, their attributes, and the texts placed among them form the schema, in other words the design, of the corresponding web page. This placement order forms a tree called the DOM tree. All of this leads to the following observation: how similar two web pages are can be evaluated by an algorithm that compares the DOM trees extracted from these pages. Hence, the main problem underlying this thesis is to devise an algorithm that compares two DOM trees with respect to a chosen property and then measures the level of similarity between them accordingly.

1.3. The Study Done and the Results

In light of the facts explained in Section 1.2, the first step is to extract the nodes of the DOM trees and the connections among them by parsing the HTML files. To achieve this, Dom4j has been used, a Java library that provides classes, data types and methods for parsing an HTML file and extracting its DOM tree. Using the facilities of this library together with the object-oriented features of the Java programming language, the following DOM components are extracted from an HTML file whose path is passed to the system as a parameter:

• Element nodes: The major nodes, corresponding to the tags between the marks < and >.

• Attribute nodes: One kind of minor node, corresponding to the attributes of the elements, i.e. the statements inside the tags that indicate specific properties of an element, such as the path of an embedded image, the font size, the text format, the type of script used, the target of a link, or the identity of the element. These nodes are connected to the element nodes as their children.

• Text nodes: Like the attribute nodes, these nodes are connected as children to the element nodes whose opening (<…>) and closing (</…>) tags enclose a text. If an element encloses such a text, its corresponding node has text nodes as children.
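The extraction of these three kinds of component can be illustrated with a minimal sketch. It does not use Dom4j: as a stand-in it relies on the XML parser bundled with the JDK (javax.xml.parsers), so it only accepts well-formed markup. The class name DomComponents, its fields and the "owner@attribute" notation are hypothetical and chosen for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical container for the DOM components of one document. */
final class DomComponents {
    final List<String> elements = new ArrayList<>();   // element names in document order
    final List<String> attributes = new ArrayList<>(); // "owner@attribute", values ignored
    final List<String> textOwners = new ArrayList<>(); // one entry per non-blank text child

    /** Parses well-formed markup with the JDK parser (an assumption, not Dom4j). */
    static DomComponents extract(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            DomComponents result = new DomComponents();
            result.visit(doc.getDocumentElement());
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    /** Pre-order walk: records the element, its attributes and its text children. */
    private void visit(Node element) {
        elements.add(element.getNodeName());
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            attributes.add(element.getNodeName() + "@" + attrs.item(i).getNodeName());
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                visit(child);
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && !child.getNodeValue().trim().isEmpty()) {
                textOwners.add(element.getNodeName());
            }
            // Comment nodes match neither branch, so they are skipped.
        }
    }

    public static void main(String[] args) {
        DomComponents c = extract(
                "<html><body><h1>Title</h1><p id=\"para\">Text</p></body></html>");
        System.out.println(c.elements + " " + c.attributes + " " + c.textOwners);
    }
}
```

Only the names of the nodes are stored, not their values, which matches the statement that the design of the page, rather than its content, is what gets compared.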

Most of the remaining development concerns the construction of the comparison algorithm. Given the lists of DOM components explained above, the algorithm can be summarized very briefly as the ratio of the number of like-for-like nodes to the total number of nodes. In a little more detail, the calculation of the following four similarity ratios (in percentages) has been implemented:

• The similarity ratio between the element nodes of two DOM trees: The frequency, in other words the number of occurrences, of each distinct element node extracted and stored in a list in the first step (for example, the number of li, table, p and script elements) is found for both trees; then the sum of the lower frequencies is divided by the total number of element nodes.

• The similarity ratio between the text nodes of two DOM trees: The frequency of each distinct element node that has at least one text node as a child is found, and then the same procedure as for the element nodes is followed to reach the result.

• The similarity ratio between the attribute nodes of two DOM trees: The procedure used for the text nodes is applied to each distinct element node that has at least one attribute node as a child. The procedure used for the element nodes is also applied to each distinct attribute node. The average of these two results gives the final ratio.

• The overall similarity ratio: This is a weighted sum of the similarities described above: 60% of the element node similarity, 30% of the text node similarity, and 10% of the attribute node similarity.
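The frequency-based ratio and the weighted overall ratio described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code of the application: the class and method names are hypothetical, and since the text does not say which total the sum of the lower frequencies is divided by, the sketch assumes the node count of the larger of the two lists.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating the frequency-based similarity ratios. */
final class FrequencySimilarity {

    /** Counts how many times each node name occurs in the list. */
    static Map<String, Integer> frequencies(List<String> names) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Sum of the lower per-name frequencies, divided by the size of the
     * larger list (assumed denominator), expressed as a percentage.
     */
    static double similarity(List<String> first, List<String> second) {
        int larger = Math.max(first.size(), second.size());
        if (larger == 0) {
            return 100.0; // two empty lists are treated as identical
        }
        Map<String, Integer> f1 = frequencies(first);
        Map<String, Integer> f2 = frequencies(second);
        int shared = 0;
        for (Map.Entry<String, Integer> entry : f1.entrySet()) {
            shared += Math.min(entry.getValue(), f2.getOrDefault(entry.getKey(), 0));
        }
        return 100.0 * shared / larger;
    }

    /** Weighted sum with the 60% / 30% / 10% weights stated in the text. */
    static double overall(double element, double text, double attribute) {
        return 0.6 * element + 0.3 * text + 0.1 * attribute;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("html", "body", "p", "p", "li");
        List<String> b = Arrays.asList("html", "body", "p", "table");
        // Shared occurrences: html 1 + body 1 + p 1 = 3, larger list has 5 nodes.
        System.out.println(similarity(a, b));
        System.out.println(overall(60.0, 50.0, 20.0));
    }
}
```

The same similarity method can be reused for the text and attribute ratios by passing in the names of the elements that own at least one text or attribute child; for the attribute ratio, the text describes averaging two such results.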

Because only the design of a web page, in other words its schema, structure or skeleton, is examined, the values of attribute nodes and text nodes have not been taken into consideration. The object-oriented programming features of Java have been used to implement the algorithm.

Tests have been run by loading a number of different HTML files whose contents were known to the tester. According to their results, the algorithm can be said to be successful, as the similarity ratios are coherent and satisfying. The more similar the DOM structures of two HTML files are, with elements that share the same or similar features, the higher the similarity ratios returned by the application developed in this thesis project, and vice versa.

1.4. The Sections of the Report

This thesis report consists of the following sections.

• Summary: The project is summarized in this section, both in Turkish and in English.

• Introduction: The main problem behind the project is introduced here.

• The Project Description and Plan: The project, which was undertaken to solve the main problem mentioned above, is described here. The project plan, which states how long each part of the process (survey, development, test, etc.) was planned to take, is presented in the same section.

• Theoretical Information: This section presents the theoretical information used in the implementation of the project, which was collected during the research phase at the beginning.

• Analysis and Modelling: The main problem is analyzed, and the way to realize a system that solves it is presented.

• Description, Implementation and Test: Following the modelling of the project in the previous section, the software implementation of the system is explained here without reproducing most of the source code. The way the system can be tested is also presented in this section.

• Experimental Results: The results of a sufficient number of tests are presented here to demonstrate that the system works without bugs and that the main algorithm running in the background is reliable.

• The Results and Suggestions: In this section, the solution produced in this project is interpreted with respect to factors including the performance of the application and the budget. Some suggestions for those who would like to examine the same issue are also included.

• References: The sources cited in the report are listed in this section, together with their addresses.

2. THE PROJECT DESCRIPTION and PLAN

2.1. The Project Description

“DOM Similarity Evaluator” is a cross-platform application written in the Java programming language and developed with Eclipse IDE for Java EE Developers (Luna). The application developed in this thesis project is simple: it can be launched from a single JAR file, and its graphical user interface is easy to use.

The Eclipse project in the workspace that is the source of the application is named “TreeSimilarity”. Dom4j (dom4j-1.6.1.jar), the library used to parse an HTML file and extract its DOM tree, is also included in the development environment.

The aim of the project is to compare the DOM trees extracted by parsing two HTML files, and then to evaluate (measure) the level of similarity between their designs with respect to their element nodes, text nodes and attribute nodes. This comparison therefore also amounts to a comparison between the designs of the two web pages whose sources are the two HTML files.

2.2. The Project Plan

From the approval of the thesis topic to its submission, the project plan consists of the following phases:

• Theoretical research
• Analysis and Modelling
• Development
• Test
• Documentation

Theoretical research: This is the first phase. It was planned to take approximately 3 months. The purpose of this phase is to understand the main problem, to gather theoretical information on the topic, to decide which methods and technologies will be used, and to plan the development of the project. During this phase, the supervisor also provides some academic sources.

Analysis and Modelling: This is the second phase. It was planned to take approximately 1 month. The purpose of this phase is to analyze the main problem and then to prepare the solution that will be followed during development. The question of what can be done to implement the solution is also answered in this phase.

Development: This is the third phase. It was planned to take approximately 5 months. After the theoretical information has been collected and the project has been analyzed and modelled, the solution is implemented in this phase.

Test: This is the fourth phase. It was planned to take approximately 1 month. After development is completed, the resulting application is tested with various inputs (several HTML files, in this project) to see whether the results are acceptable, satisfying and reasonable.

Documentation: This is the final phase. It was planned to take approximately 1 month. This phase actually covers three kinds of report: the project plan, submitted at the beginning of the project; the interim report, submitted around the midpoint; and the final report, the major one, submitted at the end. Most of the period is spent on the final report, in which the project is introduced, the theoretical information used in the implementation is presented, the project plan is given, and the development process is explained.

Figure 2.1. Gantt Chart

3. THEORETICAL INFORMATION

3.1. HTML (Hyper-Text Markup Language)

Hyper-Text Markup Language, usually referred to by its abbreviation HTML, is a markup language used to create web pages. If a text file that contains HTML statements is saved with the filename extension “.html” or “.htm”, the file becomes the source of a web page whose design is the visual reflection of what the code inside the file describes. The output can easily be seen by opening the file with a web browser, which reads the HTML file and renders it as a web page. This transformation is also called interpreting the HTML code. The two major differences between .html and .htm are that some host servers require the starting page to be named “index.html” rather than “index.htm”, and that the DOS/Windows 3.x platform did not allow a filename extension longer than 3 characters. [2]

The major component of HTML syntax is the HTML element, written as a tag enclosed in the angle brackets < and >. Some elements consist of an opening and closing pair, like <p> and </p>, whereas others are unpaired, like <hr/> and <img/>. The elements in an HTML code form the basic schema of the output page. Each element in the code corresponds to an embedded block or action on the output page.

The second significant component of HTML syntax is the HTML attribute. An attribute indicates a specific property of its owner element. An attribute, or a series of attributes, is placed inside the opening tag, right after the name of the owner element. For instance, an id attribute with the value “x” on a div element means that the identity of that div element, which is unique throughout the whole page, is x. As another example, a src attribute on an img element, whose value is the path of the image file to be embedded, corresponds to the presence of that image on the output page.

The third notable component of an HTML file is text. Texts are usually placed between the opening and closing tags of an element that corresponds to a block containing text. A web browser renders the texts in an HTML file according to the elements that wrap them and the attributes of those elements.

A sample HTML code is given below.

<html>
  <head>
    <title>Sample HTML File</title>
  </head>
  <body>
    <div id="first_block">
      <h1>The Latest Sport News</h1>
      <p id="233"><a href="www.foxsports.com/soccer">Los Angeles Galaxy has won the Major League Soccer title!</a></p>
      <button type="button" onclick="alert('Congrats!')">Are you happy?</button>
    </div>
  </body>
</html>

In this code fragment, the following components are HTML elements: html, head, title, body, div, h1, p, a, button.

The id attribute of the div element takes the value “first_block”, which means that the block is known throughout the whole document by the unique identity first_block. The href attribute of the a element, whose value is www.foxsports.com/soccer, stands for a link to the URL given in the attribute value. The type attribute of the button element indicates the kind of button, and the onclick attribute, whose value is the JavaScript call alert(‘Congrats!’), specifies what happens when the user clicks the button. The resulting page can be seen in Figure 3.1 below. (The button had been clicked before the screenshot was taken.)

Figure 3.1. The output page from the HTML code given above

3.2. DOM (Document Object Model)

According to the W3C definition, the Document Object Model, usually referred to by its abbreviation DOM, is a platform- and language-neutral interface through which programs and scripts can dynamically access the content, structure and style of a structured document. Besides accessing these components, a program or script can also update them via the DOM. [3] The DOM interface can provide a structural representation for HTML, XML and SVG documents, and a component is most often accessed through the DOM using JavaScript. [4]

The nesting order of the components of an HTML, XML or SVG document actually forms a tree structure. This structure is called the “DOM tree”: the element with the outermost opening and closing tags in the document corresponds to the root node, an element nested directly inside it corresponds to its child node or one of its child nodes, and the remaining inner elements are arranged in the same way.

<element1>
  <element2>
    <element3>
    </element3>
    <element4>
    </element4>
  </element2>
</element1>

The sample structure given above as pseudo-code forms the DOM tree shown in Graph 3.1: since element1 is the outermost element, it is the root node. Since element2 is nested one level further in, its node connects to the root node as its child. Since element3 and element4 are both nested directly inside element2, their nodes become the children of the node corresponding to element2.
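The relation between nesting and tree depth can be made concrete with a short sketch that prints one element per line, indented by its depth. The JDK's built-in XML parser is assumed here as a stand-in for Dom4j, and the class name DomOutline is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper that renders the element tree as an indented outline. */
final class DomOutline {

    /** Returns one line per element, indented by two spaces per nesting level. */
    static String outline(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            StringBuilder out = new StringBuilder();
            append(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    /** Writes the element, then recurses into its element children only. */
    private static void append(Node element, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(element.getNodeName()).append('\n');
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                append(child, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<element1><element2><element3></element3>"
                + "<element4></element4></element2></element1>"));
    }
}
```

For the pseudo-code structure above, element1 appears at depth 0, element2 at depth 1, and element3 and element4 at depth 2, which mirrors the parent and child relations described in the text.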

The application developed in this thesis project deals with the HTML DOM. The basic concept explained here also holds for an HTML document, with some additional components that correspond to further kinds of node in the DOM tree. As stated by W3Schools, everything in an HTML DOM corresponds to a node: the document itself corresponds to a document node; HTML elements correspond to element nodes in the tree structure; the attributes of these elements correspond to attribute nodes; texts placed inside elements correspond to text nodes; and comments in an HTML document correspond to comment nodes in the DOM tree. [5]
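As a rough illustration of these node kinds, the sketch below counts the nodes of each kind in a small document. It assumes the standard org.w3c.dom API shipped with the JDK rather than Dom4j, and the class name NodeKindCounter is hypothetical. In this API attributes are reached through getAttributes() rather than as ordinary children, so they are counted separately.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper that counts DOM nodes by kind. */
final class NodeKindCounter {

    static Map<String, Integer> count(String markup) {
        try {
            Node document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Map<String, Integer> counts = new TreeMap<>();
            walk(document, counts);
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    private static void walk(Node node, Map<String, Integer> counts) {
        String kind;
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: kind = "document"; break;
            case Node.ELEMENT_NODE:  kind = "element";  break;
            case Node.COMMENT_NODE:  kind = "comment";  break;
            case Node.TEXT_NODE:
                // Whitespace-only text between tags is not counted as a text node here.
                kind = node.getNodeValue().trim().isEmpty() ? null : "text";
                break;
            default: kind = "other";
        }
        if (kind != null) {
            counts.merge(kind, 1, Integer::sum);
        }
        NamedNodeMap attrs = node.getAttributes(); // null unless the node is an element
        if (attrs != null && attrs.getLength() > 0) {
            counts.merge("attribute", attrs.getLength(), Integer::sum);
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), counts);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(
                "<html><!-- note --><body><p id=\"a\">Hi</p></body></html>"));
    }
}
```

A comparison that ignores comments, as the application described here does, would simply leave the "comment" entry out of any further calculation.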

Element1
  Element2
    Element3
    Element4

Graph 3.1. The DOM Tree of the Sample Structure

In an HTML DOM tree, since elements are the major components of HTML syntax, the nodes corresponding to the elements of an HTML file can be described as the major kind of node. If an element owns at least one attribute, the corresponding attribute nodes are connected to the owning element node as its children. If there is a text between the tags of an element, the corresponding text node is connected to the owning element node as its child. In light of this information, the sample HTML code given in Section 3.1 has the DOM tree given below.

Graph 3.2. The HTML DOM Tree of the Sample HTML Code in Section 3.1

Html
  Head
    Title
      Text: “Sample HTML File”
  Body
    Div
      Id: “first_block”
      h1
        Text: “The Latest Sport News”
      br
      p
        id: “233”
        a
          Href: “www.foxsports.com/soccer”
          Text: “Los Angeles Galaxy has won the MLS title.”
      Button
        Type: “button”
        Text: “Are you happy?”
        Onclick: “alert(‘Congrats!’)”

Within the context of this thesis project, since only the designs of web pages are considered, the values held by text and attribute nodes are not taken into consideration by the comparison algorithm, even though they are included in the representation of the DOM tree.

4. ANALYSIS and MODELLING

4.1. Understanding the Main Problem

The core of the project is to construct an algorithm that compares two DOM trees extracted from two parsed HTML files and then measures the level of similarity between them. This problem contains a more general question: how to compare two trees consisting of different nodes. Thus, beyond parsing an HTML file and extracting its DOM tree, the actual task is to devise a way of comparing two tree structures consisting of different nodes and of evaluating the level of similarity between them. To build the comparison algorithm and evaluate how similar the trees are, the comparison should focus on some feature of a tree structure, such as the difference in frequency between nodes that share the same characteristics. Hence, the project should be modelled around a tree structure: classes representing the node types, a method to iterate over the tree, a method to extract the content and features of each node, and dedicated methods to implement the comparison and the similarity evaluation.

4.2. Modelling

4.2.1. The Programming Language and the Development Environment

The application developed for this thesis project is a Java application that can be launched from a single JAR file. In line with this language choice, Eclipse IDE for Java EE Developers has been chosen as the integrated development environment (IDE).

Since it is a Java application, it is also cross-platform: it can run under any operating system that supports Java, such as Windows, any Linux distribution, or any UNIX operating system.

4.2.2. The Project Hierarchy

As stated in Section 2.1, the development project in the Eclipse IDE workspace is named TreeSimilarity. Under the src folder there is a single package for the whole project, named treesim, which acts as an umbrella containing all Java files. Each file with the .java extension contains the necessary classes. The main file, from which the project is launched after compilation and where the main method resides, is TreeSim.java. The remaining files contain the classes that represent the node types: AttributeNode.java, ElementNode.java, and TextNode.java.

Figure 4.1. The Project Hierarchy in Eclipse IDE

4.2.3. The Programming Concept

The Java programming language is based on object-oriented programming; even the main method of the source code is encapsulated in a class. Owing to its beneficial features, object-oriented programming has been used throughout the implementation of the solution to the analyzed problem. Moreover, Java’s container data types such as ArrayList have been used extensively, as they provide convenient solutions for data storage.

The vast majority of the project is formed by the TreeSim class. The other classes, which represent the node types, are used by this class.

4.2.4. The Modelling of Classes

ElementNode class represents element nodes. It contains the following data: the name of an element and the frequency of that element, that is, how many instances of it exist in the tree.

AttributeNode class represents attribute nodes. It contains the following data: the name of an attribute and the frequency of that attribute.

TextNode class represents a text node. Since the value of a text node, which is the text output itself, is not taken into consideration because only the design is dealt with, the class contains data regarding the parent of a text node: the name of the parent element node to which the text node is connected, and the frequency of that parent element node.

TreeSim class forms the core of the project as it encapsulates the main method. Along with the main method, it contains almost all of the remaining methods in the project and the code that implements the graphical user interface of the application. The three classes modelled above are used by this class.
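The data held by the node classes can be sketched as follows. This is only a minimal illustration of the model described above, not the project's actual source: the field names, the constructor signatures, and the choice to start the frequency at 1 are assumptions. AttributeNode, which would mirror ElementNode, is omitted for brevity.

```java
// Sketch of two of the node classes; all member names are assumed for illustration.
final class ElementNode {
    private String name;    // tag name of the element, e.g. "div"
    private int frequency;  // how many instances of this element exist in the tree

    ElementNode(String name) {
        this.name = name;
        this.frequency = 1; // assumption: the object is created on the first occurrence
    }

    String getName() { return name; }
    void setName(String name) { this.name = name; }
    int getFrequency() { return frequency; }
    void setFrequency(int frequency) { this.frequency = frequency; }
}

final class TextNode {
    private String parentName;   // name of the element node that owns the text
    private int parentFrequency; // frequency recorded for that parent element

    TextNode(String parentName) {
        this.parentName = parentName;
        this.parentFrequency = 1; // same assumption as in ElementNode
    }

    String getParentName() { return parentName; }
    void setParentName(String parentName) { this.parentName = parentName; }
    int getParentFrequency() { return parentFrequency; }
    void setParentFrequency(int parentFrequency) { this.parentFrequency = parentFrequency; }
}
```

Note that the text value itself is deliberately absent from TextNode, which matches the decision to compare only the design of the pages.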

4.2.5. The Comparison and Similarity Evaluation Algorithm

This is the most significant part of the development, as it forms the core of the project and models a solution to the main problem.

Before the comparison begins, a method has to be implemented that iterates over the DOM tree extracted from a parsed HTML file, visits all nodes, and stores the data of each visited node, encapsulated in the class representing its node type, in a private storage field. After all necessary data has been collected, the following comparison procedures are applied according to the node types.

Element nodes: First, all distinct elements in the iterated DOM tree are collected together with their frequency information. For example, let the iteration over a DOM tree give the following list of element nodes.

html head title body div h1 p div h2 p a button div h1 p table ul li li li li

The distinct elements and their frequencies are then as follows.

html: 1
head: 1
title: 1
body: 1
div: 3
h1: 2
h2: 1
p: 3
a: 1
button: 1
table: 1
ul: 1
li: 4

The same procedure is followed for the other DOM tree. Once the frequency data for all distinct elements of both trees has been acquired, the common elements, that is, the elements that exist in both trees, are considered: for each of them, the lower of its two frequencies is added to a dedicated frequency list. The sum of this list is then divided by the total number of element nodes in whichever tree has more of them. Multiplying the result by 100 gives the similarity ratio with respect to element nodes.
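The element-node calculation described above can be sketched as follows. This is an illustrative sketch rather than the project's own code: the class and method names are hypothetical, element names are passed in as plain strings in traversal order, and returning 100 for two empty trees is an assumption made only to avoid a division by zero.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the frequency-based element similarity.
final class ElementSimilaritySketch {

    // Counts how many times each distinct name occurs in the traversal list.
    static Map<String, Integer> frequencies(List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    // Sums the lower frequency of every common name, divides the sum by the
    // node count of the larger tree, and scales the result to a percentage.
    static double similarityPercent(List<String> first, List<String> second) {
        if (first.isEmpty() && second.isEmpty()) {
            return 100.0; // assumption: two empty trees are treated as identical
        }
        Map<String, Integer> firstCounts = frequencies(first);
        Map<String, Integer> secondCounts = frequencies(second);
        int shared = 0;
        for (Map.Entry<String, Integer> entry : firstCounts.entrySet()) {
            Integer other = secondCounts.get(entry.getKey());
            if (other != null) {
                shared += Math.min(entry.getValue(), other);
            }
        }
        int largerTotal = Math.max(first.size(), second.size());
        return 100.0 * shared / largerTotal;
    }
}
```

For instance, comparing the 21-element list given above with a hypothetical second list "html head title body div p p" gives a sum of lower frequencies of 7 (html, head, title, body, and div contribute 1 each, p contributes 2), so the ratio is 7 / 21, about 33.3%.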

Attribute nodes: The procedure followed for element nodes is applied to the attribute nodes themselves as well, which yields one ratio value. In addition, the same procedure is applied to the element nodes that have at least one attribute, in other words, the parent element nodes of the attribute nodes, which yields a second ratio value. The average of these two values gives the final similarity ratio with respect to attribute nodes.

Text nodes: The procedure followed for element nodes is applied to the element nodes that have text nodes as children, in other words, the parent element nodes of the text nodes. This yields the similarity ratio with respect to text nodes.

Overall: Considering how much a difference in each node type matters between two DOM trees, the order of influence can be taken as follows: element nodes > text nodes > attribute nodes. In accordance with this order of importance, weights of 60% for element nodes, 30% for text nodes, and 10% for attribute nodes have been assigned. The overall similarity ratio between two DOM trees is the weighted sum of the three ratios under this distribution.
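The combination of the three ratios can be sketched as follows. The weights are the ones stated above; the class and method names are hypothetical, and reading "based on this distribution" as a weighted sum of percentages is an assumption about how the ratios are combined.

```java
// Hypothetical helper combining the per-node-type ratios (all in percent).
final class OverallSimilaritySketch {
    static final double ELEMENT_WEIGHT = 0.60;   // element nodes
    static final double TEXT_WEIGHT = 0.30;      // text nodes
    static final double ATTRIBUTE_WEIGHT = 0.10; // attribute nodes

    // Attribute ratio: average of the ratio over attribute names and the
    // ratio over the parent elements that carry at least one attribute.
    static double attributeRatio(double attributeNamesRatio, double attributeParentsRatio) {
        return (attributeNamesRatio + attributeParentsRatio) / 2.0;
    }

    // Overall ratio: weighted sum of the three per-type ratios (assumed form).
    static double overall(double elementRatio, double textRatio, double attributeRatio) {
        return ELEMENT_WEIGHT * elementRatio
                + TEXT_WEIGHT * textRatio
                + ATTRIBUTE_WEIGHT * attributeRatio;
    }
}
```

With hypothetical ratios of 80% for elements, 50% for text nodes, and 40% for attributes, the weighted sum is 48 + 15 + 4 = 67%, which shows how strongly the element ratio dominates the result.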

5. DESCRIPTION, IMPLEMENTATION, AND TEST

5.1. Classes and Methods

5.1.1. ElementNode.java

Figure 5.1. ElementNode Class

This class represents the structure of an element node. It holds the name of the element and its frequency, that is, how many instances of it exist in the tree. Since these are private fields, getter and setter methods are also provided for them.

5.1.2. AttributeNode.java

Figure 5.2. AttributeNode Class

This class represents the structure of an attribute node. It holds the name of the attribute and its frequency. Getter and setter methods are also provided for these private fields.

5.1.3. TextNode.java

Figure 5.3. TextNode Class

This class represents the structure of a text node. It holds the name of the parent element node that owns the text node, and the frequency of that parent. Getter and setter methods are also provided for these private fields.

5.1.4. TreeSim.java

This class can be described as the main class of the project since it encapsulates the main method. Along with the main method, almost all of the other methods are implemented here.

Figure 5.4. TreeSim Class

The body of the class begins with a number of private fields such as containers, GUI components, and arrays. The methods of the class are the main point of interest and are described below.

main(String[]): This is the main method of the class. After compilation, the application is launched from this method.

TreeSim(): This is the default constructor of the class. It contains the statements that load the GUI components and attach action events to them.

actionPerformed(ActionEvent): This GUI method tells the system what must happen when a button is clicked. The action events passed to it as a parameter are linked to the GUI components (buttons, in this case) inside the body of TreeSim(). When the action event of pressing the “Calculate the Similarity” button is triggered, the comparison and similarity evaluation algorithm is executed.

commonAvailabilityChecker methods: These methods check whether a given node exists in both DOM trees. The sort of node each method checks for can be understood from the method name. If the node is found in both DOM trees, the method returns true; otherwise, it returns false.

extractUnique methods: These methods extract the distinct elements of a DOM tree that has been iterated.
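The behaviour of these two method families can be sketched as follows. The signatures below are hypothetical, since the actual methods appear only in Figure 5.4; the sketch assumes that each tree has already been reduced to a list of node names and shows one generic version instead of one method per node type.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical stand-ins for the extractUnique and commonAvailabilityChecker methods.
final class NodeListHelpersSketch {

    // Returns the distinct names of a traversal list, keeping first-seen order.
    static List<String> extractUnique(List<String> names) {
        return new ArrayList<>(new LinkedHashSet<>(names));
    }

    // Returns true only if the given name occurs in both traversal lists.
    static boolean isCommon(String name, List<String> first, List<String> second) {
        return first.contains(name) && second.contains(name);
    }
}
```

The linear contains() lookups keep the sketch short; for large trees a HashSet per tree would make each check constant-time on average.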

preParentOrderFirstDOM(Node): This method iterates over the first DOM tree, which is extracted by parsing the first HTML file, and discloses the information stored in each node and the connections between these nodes. The gathered data is stored in the node arrays for the first DOM tree.

preParentOrderSecondDOM(Node): This method is the equivalent of the previous one for the second DOM tree.
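A pre-order traversal of this kind can be sketched as follows. Note the substitution: the project uses the Dom4j library, whereas this sketch uses the JDK's built-in org.w3c.dom API so that it runs without external dependencies. The class, field, and method names are hypothetical, plain lists stand in for the project's node arrays, and skipping whitespace-only text nodes is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical collector that visits a DOM tree parent-first (pre-order).
final class PreOrderTraversalSketch {
    final List<String> elementNames = new ArrayList<>();
    final List<String> attributeNames = new ArrayList<>();
    final List<String> textParentNames = new ArrayList<>();

    // Parses well-formed markup and returns its root element.
    static Node parseRoot(String wellFormedMarkup) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            return document.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    // Records the current node first, then recurses into its children.
    void visit(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            elementNames.add(node.getNodeName());
            NamedNodeMap attributes = node.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                attributeNames.add(attributes.item(i).getNodeName());
            }
        } else if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            // Only the owning element is recorded; the text value is ignored.
            textParentNames.add(node.getParentNode().getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            visit(children.item(i));
        }
    }
}
```

One collector instance per HTML file would play the role of the two separate methods described above, and the three lists can then be fed to the frequency-based comparison.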

5.2. Test

When the application is launched, the window of its simple GUI appears on screen, as shown in the figure below.

Figure 5.5. The Application

“Load the First HTML File” button: This button opens a file explorer to select the first HTML file to be loaded into the system.

“Load the Second HTML File” button: This button does the same for the second HTML file.

“Calculate the Similarity” button: Once this button is pressed, the similarity ratios displayed in the lower line are calculated according to the algorithm developed within the context of the thesis project.

“Quit the Application” button: The user can click this button to exit the application. A confirmation dialog box with “Yes” and “No” choices then appears on screen.

Figure 5.6. A Sample Run

6. EXPERIMENTAL RESULTS

For this thesis report, several HTML files whose contents (their syntax and their rendered outputs) are already known have been tested by loading them into the system. Two of them are very similar to each other, whereas the other pair consists of files that are very dissimilar to each other. The files containing longer HTML statements were obtained from a template portal. The other two files, which contain smaller pieces of code, were written manually.

Evaluating the similarity between the DOM trees of two similar HTML files:

Figure 6.1. The First HTML File

Figure 6.2. The Second HTML File

Figure 6.3. The Output of the First HTML File

Figure 6.4. The Output of the Second HTML File

Figure 6.5. The Test Result

As can be seen from Figure 6.5, the result is satisfying and reasonable for the loaded HTML files. The overall similarity ratio is high, but it does not approach 100% because of the design dissimilarities (in structure, text ownership, and attributes).

Evaluating the similarity between the DOM trees of two dissimilar HTML files:

Figure 6.6. The First HTML File – 2

Figure 6.7. The Second HTML File – 2

Figure 6.8. The Output of the First HTML File – 2

Figure 6.9. The Output of the Second HTML File – 2

Figure 6.10. The Test Result – 2

As can be seen from Figure 6.10, the overall similarity ratio is very low. This is satisfying and reasonable, because the HTML files are clearly very dissimilar to each other.

In light of these test results, the algorithm can be claimed to be successful, and the application ran the algorithm without errors in these tests.

7. THE RESULT and SUGGESTIONS

First of all, as the test results in the previous section show, the algorithm can be claimed to be successful, and the application ran the algorithm without errors in those tests.

The graphical user interface (GUI) of the application is very simple. It is easy to use, but it should be enhanced to be more stylish and useful.

A considerable drawback of the application and the algorithm is that the system currently expects HTML source code that is free of errors; it has no error detection and correction mechanism. In addition, the Dom4j library can fail for elements that consist of unpaired (single) tags in which the “/” mark does not appear right before the closing bracket, >.

The project is open source, so it can easily be improved.

8. REFERENCES

[1] Internet Live Stats, The total number of Websites, http://www.internetlivestats.com/total-number-of-websites/

[2] Sight Specific, What is the difference between the HTM and HTML extension, http://www.sightspecific.com/~mosh/www_faq/ext.html

[3] W3C, Document Object Model, http://www.w3.org/DOM/

[4] Mozilla Developer Network, Document Object Model (DOM), https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model

[5] W3Schools, The HTML DOM Element Object, http://www.w3schools.com/jsref/dom_obj_all.asp