
[VNSGU JOURNAL OF SCIENCE AND TECHNOLOGY] Vol. 5, No. 1, July 2016, 119–133, ISSN: 0975-5446

Comparative study of link analysis web page ranking algorithms based on weights of links to extract pertinent links

PATEL Hemangini S.
Bhagwan Mahavir College of Comp. App. (BCA), Bharthana, Vesu.
[email protected]

DESAI Apurva A.
Department of Computer Science, Veer Narmad South Gujarat University, Surat.
[email protected]

Abstract

Ranking is a key factor in information retrieval. The web is a huge set of pages that provides an almost unlimited source of information and consists of myriad hyperlinks. A search engine database includes a huge quantity of web pages, so the ranking of web pages is crucial for satisfying user requirements. Hypertext Induced Topic Search (HITs) computes hub and authority scores, and the Page Rank algorithm computes the rank of a particular web page. In this paper, the significance of web pages is analysed and compared using ranking algorithms. A modified weighted HITs (WHITs) algorithm is proposed, and its performance is compared with existing algorithms such as Hypertext Induced Topic Search (HITs), Page Rank, Norm(p) and sNorm(p) to help the search engine extract pertinent and valuable links. The algorithms are tested over various short term queries.

Keywords: Information Retrieval, HITs, weighted HITs (WHITs), Page Rank, SALSA, Norm(p), sNorm(p).

1. Introduction

The prime objective of information retrieval is to discover all documents pertinent to a user query within a set of documents. With the advent of the web, novel resources of information became available [1]. Link analysis ranking algorithms rank the qualitative and pertinent documents. The main task of a ranking algorithm is to recognize high-rank authority documents within the bulk of pages. Most of this research has centred on presenting improved link-based ranking algorithms; here, weighted HITs (WHITs) is presented to improve on existing algorithms (primarily HITs), which allocate the same weight to every link. This assumption results in topic drift.

For web search, the HITs algorithm is considered highly significant; to some measure, one may expect HITs to be more useful than other link-based features because it is query dependent: it attempts to determine the importance of pages with respect to a specified query [2]. Page Rank is a query-independent algorithm used by the Google search engine, based on the connectivity structure of web pages. The Page Rank importance of a page is weighted by every link to the page, proportionally to the quality of the page holding the link; i.e., the Page Rank importance of a page is spread uniformly over all the pages it points to [3].


Hence, the Page Rank of a web page is computed as the sum of the Page Ranks of all pages linking to it (its in-links), each divided by the number of links on the linking page (its out-links).
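As a minimal illustration of this computation (a sketch only, not the implementation used by any search engine), the following Python fragment iterates the summation described above on a small adjacency-list graph; the damping factor d = 0.85 is the value commonly quoted in the PageRank literature.

```python
def page_rank(graph, d=0.85, iterations=50):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum the rank of every page q that links to p,
            # divided by the number of out-links of q.
            incoming = sum(rank[q] / len(graph[q])
                           for q in pages if p in graph[q])
            new_rank[p] = (1 - d) / n + d * incoming
        rank = new_rank
    # Dangling pages simply leak rank in this simplified sketch.
    return rank

# Tiny hypothetical graph: A and B link to C, C links back to A.
print(page_rank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```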

HITs initially constructs a neighbourhood graph for the query by using a content-based web search engine to collect the top 200 corresponding web results. The graph holds all pages that these top 200 web pages link to, and all web pages that link to these top 200 pages. Then, for each webpage p, authority and hub values are computed iteratively: the authority score of webpage p is the sum of the hub scores of all web pages that point to p, and the hub score of page p is the sum of the authority scores of all web pages that p points to. The iteration proceeds on the neighbourhood graph until the values converge.
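The following sketch illustrates this iterative hub/authority computation, assuming the neighbourhood graph is already available as a set of directed edges (the crawling and top-200 collection steps are omitted).

```python
import math

def hits(edges, iterations=20):
    """edges: set of (source, target) pairs from the neighbourhood graph."""
    nodes = {u for e in edges for u in e}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of p = sum of hub scores of pages pointing to p.
        auth = {p: sum(hub[q] for q, r in edges if r == p) for p in nodes}
        # Hub of p = sum of authority scores of pages p points to.
        hub = {p: sum(auth[q] for r, q in edges if r == p) for p in nodes}
        # Normalise so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```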

The SALSA algorithm combines ideas from both the Page Rank and HITs algorithms. In this algorithm, a random walk is performed on the bipartite hub-and-authority graph, alternating between hubs and authorities. When at a node on the authority side of the bipartite graph, the algorithm chooses one of the node's in-links uniformly at random and moves to a hub node on the hub side. When at a node on the hub side, the algorithm chooses one of the node's out-links uniformly at random and moves to an authority.
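A simplified illustration of the authority-side walk is sketched below; it simulates the two-step transition (back to a hub, then forward to an authority) on a bipartite edge set, and the resulting distribution approximates the stationary authority scores that SALSA uses.

```python
def salsa_authority(edges, iterations=50):
    """edges: set of (hub, authority) pairs of the bipartite graph."""
    authorities = {a for _, a in edges}
    in_hubs = {a: [h for h, x in edges if x == a] for a in authorities}
    out_auths = {h: [a for x, a in edges if x == h] for h, _ in edges}
    score = {a: 1.0 / len(authorities) for a in authorities}
    for _ in range(iterations):
        nxt = {a: 0.0 for a in authorities}
        for a in authorities:
            # Step back to one of a's hubs uniformly at random, then
            # forward to one of that hub's authorities uniformly at random.
            for h in in_hubs[a]:
                for b in out_auths[h]:
                    nxt[b] += score[a] / (len(in_hubs[a]) * len(out_auths[h]))
        score = nxt
    return score
```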

The Norm(p) and sNorm(p) algorithms belong to the family of additive online learning algorithms [4, 5, 6]. A norm is a function which assigns a positive length to every vector in a given vector space. These algorithms work on the principle of special treatment of the authority weights, which can be implemented using a norm or an operator. In this way, we exploit the fact that small authority weights contribute less to the hub weight. The simplest approach is to scale the weights, and the most common choice of scaling factor is the authority weight itself. Since higher authority weights are more significant in the calculation of the hub weight, the hub weight of a given node i is set to the p-norm of the vector of authority weights of all the nodes pointed to by node i. Norm(p) scales the hub weight, but here the interest is in finding authorities with higher weight; this is available in sNorm(p) due to its symmetric setting of the authority and hub weights.
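A sketch of this norm-based update is given below (an illustrative reading of the description above, using the same edge-set representation as the earlier HITs sketch); in sNorm(p), the authority update would symmetrically take the p-norm of the incoming hub weights.

```python
def norm_p_iteration(edges, p=2, iterations=20):
    """Norm(p)-style update: the hub weight of i is the p-norm of the
    authority weights of the nodes that i points to."""
    nodes = {u for e in edges for u in e}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {i: sum(hub[j] for j, k in edges if k == i) for i in nodes}
        hub = {i: sum(auth[j] ** p for k, j in edges if k == i) ** (1.0 / p)
               for i in nodes}
        # Normalise both vectors (2-norm) so the iteration stays bounded.
        for vec in (auth, hub):
            s = sum(v * v for v in vec.values()) ** 0.5 or 1.0
            for n in vec:
                vec[n] /= s
    return auth, hub
```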

In the present study, the relative effectiveness of the ranking algorithms HITs, Page Rank [2], SALSA [3], Norm(p) and sNorm(p) [4] is compared with the further proposed algorithm, weighted HITs (WHITs), and the algorithms are also compared according to a set of criteria.

2. Related Work

The idea of using hyperlink analysis arose around 1997 and manifested itself in the PageRank [7, 8] and HITs [2] algorithms for ranking web search results. There have been various attempts to improve the effectiveness of link analysis algorithms. Various researchers [9, 10, 11, 12] have addressed the difficulty of searching and querying the Web by considering its structure information as well as the meta-information incorporated in hyperlinks and the text adjacent to them. One line of research is based on analyzing the mathematical properties of link analysis algorithms. Langville and Meyer inspected PageRank for essential properties such as the existence and uniqueness of an eigenvector and convergence of the power iteration [13]. Dang and Croft [14] have proposed utilizing anchor text because of the likeness of anchor text to queries. Borodin et al. [15] have examined a variety of theoretical properties of Page Rank, HITs and SALSA, including their similarity, locality and stability. Kraft et al. have analyzed anchor text for query refinement because of the likeness of queries and anchor texts [16]. Lee et al. [17] have determined whether the intention of a query is navigational or informational by locating all anchors that have the same text as the query term, in order to decide whether to discover an exact web page or to visit multiple pages. Zhang et al. [18] and Liu et al. [19] developed the I-HITS algorithm, based on the likeness of the page and the query, by considering the similarity of the target page and the query with the anchor text; this improves the capability of differentiating link importance and keeps away topic drift. Craswell et al. [20] treat in-links by viewing the anchor text as (an element of) a "query" in the Okapi BM25 model, to which the linked document is an "answer"; this can succeed even when the required page is outside the crawl or contains no text. They also concluded that anchor text information, without additional pre-processing or fine-tuning, is more helpful than content information for the site discovery task.

By considering anchor texts, one can additionally improve Web search rankings, especially for navigational queries, named page finding, homepage finding, and ad-hoc search tasks [21]. Furthermore, Eiron and McCurley [22] have argued that anchor text is extremely short, that web search engine users usually tend to submit very short queries, and that anchor text summarizes the target document rather than the source document. On a statistical basis, one of their interesting observations is the likeness of anchor texts to real user search queries, which initially motivated the investigation of anchor text for query refinement. An ARC (Automatic Resource Compiler) was built as a component of the CLEVER system to automatically produce lists of hubs and authorities for indistinct queries [23]; it uses the anchor text, as well as a window of terms around the anchor text, to find out whether a target page is related to the query topic, and adjusts the weights of the links in its web graph accordingly. By contrast, our study is based on a link-based modification of HITs, known as weighted HITs (WHITs), which collects all anchors of out-links and titles of in-links and doubles the weight of links whose anchors and titles match the query term.

3. Data Set

The study presented in this paper is based on data sets collected from the Google AJAX Search API and the Bing Search API for short term queries Q, which generally consist of one or two terms, as shown in Table 1. The root set contains around 100 highest-ranked nodes (t) after eliminating duplicates. Using the nodes of the root set, out-links, anchors of out-links, in-links and titles of in-links are collected to build the base set (S). Afterwards, the web graph containing the maximum number of nodes that are interconnected by links (after normalizing links embedded in the same web page) is selected from the base set. These nodes are represented as connections between links in order to manage and calculate their respective rank values effectively. On this web graph, the Page Rank, HITs, weighted HITs, Norm(p) and sNorm(p) algorithms are applied to check the relative effectiveness of the ranking orders. The base sets for each query Q are built in the manner described by Kleinberg; the numerical data are given in Table 1.

Table 1: Experimental data for various queries

Sr. No. | Query (Q) | Nodes (t) | Out-links | In-links | Links | After normalization (Base Set S)
1. | Java | 102 | 11546 | 1912 | 13458 | 10806
2. | Jaguar | 102 | 16527 | 744 | 17373 | 12711
3. | Harvard | 95 | 27243 | 4271 | 31514 | 13192
4. | Search engine | 100 | 8264 | 2273 | 10637 | 9152
5. | Kyoto University | 94 | 6393 | 700 | 7093 | 6070
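The root-set/base-set construction described above can be summarised by the following sketch; `search_top_results`, `get_outlinks` and `get_inlinks` are hypothetical placeholders standing in for the Google/Bing API calls and link extraction used in the study, not real library functions.

```python
def build_base_set(query, search_top_results, get_outlinks, get_inlinks, t=100):
    """Root set: top-t results for the query (duplicates removed).
    Base set: root set plus pages linked to by, or linking to, root-set pages."""
    root_set = list(dict.fromkeys(search_top_results(query)))[:t]
    base_set = set(root_set)
    for page in root_set:
        base_set.update(get_outlinks(page))   # pages the root page links to
        base_set.update(get_inlinks(page))    # pages linking to the root page
    return root_set, base_set
```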

4. Web Page Ranking Algorithms

Jon Kleinberg's HITs algorithm discovers superior authorities and hubs for a given topic by assigning two statistics to a page: an authority weight and a hub weight. Page Rank is one of the most significant ranking techniques used to measure the importance of web pages. SALSA is based on the concept of Markov chains and uses the stochastic properties of random walks performed on the collection of web pages; in this approach, a neighborhood graph is determined first, and on that neighborhood graph one-step-backward and one-step-forward random walks are performed [3]. Norm(p) works on the principle that small authority weights should add a lesser amount to the computation of the hub weights [4]. The authority threshold algorithm works on the principle of special treatment of the authority weights; that is, high authority weights should be more significant in the calculation of the hub weight [5, 6]. The parameter value passed to the algorithm is the value of p, where p ∈ [1, ∞]; as p rises, the p-norm is increasingly dominated by the top weights. For example, for p = 2, we basically scale each weight by itself. The sNorm(p) algorithm [5, 6] makes this symmetric by setting the authority weight of a node to the p-norm of the vector of the hub weights of the hubs that point to that node.
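As a small worked example with hypothetical authority weights, suppose node i points to authorities with weights 0.6, 0.3 and 0.1. Then

$$h(i) = \Big(\sum_{j:\, i \to j} a(j)^{p}\Big)^{1/p}, \qquad h(i)\big|_{p=2} = \sqrt{0.36 + 0.09 + 0.01} \approx 0.678, \qquad h(i)\big|_{p=3} = (0.244)^{1/3} \approx 0.625,$$

so as p grows the largest authority weight (0.6) increasingly dominates the hub weight, whereas the plain HITs sum would give 1.0.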

4.1 Weighted HITs Algorithm

Authorities are often not particularly self-descriptive. If one tries to discover the major "search engines", it would be a severe mistake to confine attention to the set of all pages containing the expression "search engines". Even though this set is huge, it does not contain the majority of the natural authorities that one would like to discover (e.g. pages like Google, Yahoo!, Excite, Bing, AltaVista, InfoSeek, etc.). Likewise, there is no reason to expect the home pages of Honda or Toyota to contain the phrase "Japanese automobile manufacturers", or the home pages of Lotus or Microsoft to contain the phrase "software companies" [9]. As such, non-self-descriptive web pages are not selected by a term-based search engine; they cannot be included in the relevance set and could not be selected in the results by Page Rank. With respect to the HITs and SALSA algorithms and the Norm(p) and sNorm(p) algorithms, the hub-and-authority concept makes it easy to identify popular pages on a given subject, even if the query words do not appear anywhere in the page. According to these observations, the subsequent method has been recommended: compile a subset of the Web by creating the root set from an existing search engine, include every page that either links to a page in the root set or is linked to by a page in the root set, and construct the base set for which the hub and authority scores are computed.

The base set is created in such a way that a fine authority page need not contain the query text (such as "search engine"). The "expansion" of the root set into the base set improves the universal pool of finer hubs and authorities and resolves the ambiguity to some extent. To exploit this, the adjacency matrix is constructed so that links whose out-link anchors and in-link titles contain the query word receive double weight, in order to capture pages which are highly authoritative but not self-descriptive. This is helpful for mining pertinent links for queries which are not directly connected to the related links.

Hence, in order to decrease the computational complexity, calculating the likeness of the destination page and the query Q is simplified to calculating the likeness of the anchor text of out-links, as well as the titles of in-links, with the query Q. The algorithm calculates authority and hub scores using the weighted matrix, instead of the unweighted matrix of the HITs algorithm. So it will ultimately increase the weights of links which are mostly not self-descriptive, on the basis of the likeness of out-link anchors and in-link titles with the query Q. For a given page i in S, a weighted authority score Wa(i) and a weighted hub score Wh(i) are assigned using the weighted matrix:

$$Wa(i) = \sum_{(j,i) \in E} Wh(j) \qquad (1)$$

$$Wh(i) = \sum_{(i,j) \in E} Wa(j) \qquad (2)$$
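A sketch of this double weighting and of the updates in equations (1) and (2) is shown below. It is an illustrative reading of the scheme described above rather than the exact implementation used in the study: `anchor_text` and `title` are assumed lookups built while crawling the base set, the query match is a simple substring test, and the doubled link weights appear as an explicit factor in the sums.

```python
def whits(edges, anchor_text, title, query, iterations=20):
    """edges: (source, target) pairs of the base-set graph.
    anchor_text[(i, j)]: anchor text of the link i -> j (out-link anchor).
    title[j]: title of page j (in-link title)."""
    q = query.lower()
    # Double the weight of links whose out-link anchor or in-link title
    # contains the query term; all other links keep weight 1.
    w = {(i, j): 2.0 if q in anchor_text.get((i, j), "").lower()
                     or q in title.get(j, "").lower() else 1.0
         for (i, j) in edges}
    nodes = {u for e in edges for u in e}
    wa = {n: 1.0 for n in nodes}
    wh = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Equations (1) and (2) computed over the weighted matrix.
        wa = {i: sum(w[(j, i)] * wh[j] for j, k in edges if k == i) for i in nodes}
        wh = {i: sum(w[(i, j)] * wa[j] for k, j in edges if k == i) for i in nodes}
        # Normalise both score vectors so the iteration stays bounded.
        for vec in (wa, wh):
            s = sum(v * v for v in vec.values()) ** 0.5 or 1.0
            for n in vec:
                vec[n] /= s
    return wa, wh
```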


The calculation of the HITs algorithm is performed on our data set as described and compared with the modified weighted HITs (WHITs) algorithm. The results, showing relatively increased weights for the highest authorities, are given in Table 2 for the top 10 authority ranks for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University".

From this, it can be seen that the nodes whose weights are doubled by considering anchors of out-links as well as titles of in-links are ranked highest. For the query "search engine", the top authoritative pages describing the major search engines are listed in the top 10 authorities produced by the weighted HITs (WHITs) algorithm. As shown in Table 2, weighted HITs (WHITs) increased the weights of the pages which are authoritative and provided almost all available major search engines. In a similar way, it increases the weights of pertinent links for the queries "Java", "Jaguar", "Harvard" and "Kyoto University", respectively.

Table 2: Top ten authorities and weighted authorities for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University"

Sr. No. | Query | General HITs authority weight | Link | WHITs authority weight | Link
1. | Java | 0.3490 | https://plus.google.com | 0.3221 | http://www.oracle.com/technetwork/java/index.html
 | | 0.2510 | http://www.oracle.com/technetwork/java/index.html | 0.2997 | http://www.oracle.com
 | | 0.2071 | http://www.youtube.com | 0.2663 | http://java.com
 | | 0.1885 | http://www.oracle.com | 0.2590 | http://www.oracle.com/technetwork/java/javase/downloads/index.html
 | | 0.1698 | http://java.com | 0.2262 | https://www.oracle.com
 | | 0.1661 | http://www.facebook.com | 0.2147 | https://cloud.oracle.com
 | | 0.1605 | http://www.oracle.com/technetwork/java/javase/downloads/index.html | 0.2147 | http://www.java.net
 | | 0.1592 | https://twitter.com | 0.1871 | https://community.oracle.com
 | | 0.1573 | http://twitter.com | 0.1841 | http://education.oracle.com
 | | 0.1530 | https://www.oracle.com | 0.1757 | https://blogs.oracle.com
2. | Jaguar | 0.2188 | http://www.jaguarusa.com/index.html | 0.6969 | http://www.jaguarusa.com/index.html
 | | 0.1582 | http://www.jaguar.co.uk/index.html | 0.6147 | http://www.jaguarusa.com/
 | | 0.1507 | http://www.jaguar.com/index.html | 0.1449 | http://www.jaguar.com/index.html
 | | 0.1434 | http://www.jaguar.com.au/index.html | 0.1360 | http://www.jaguar.co.uk/index.html
 | | 0.1390 | http://www.jaguar.ie/index.html | 0.1144 | http://www.jaguar.co.za/index.html
 | | 0.1384 | http://www.jaguar.in/index.html | 0.1141 | http://www.jaguar.com.au/index.html
 | | 0.1361 | http://www.jaguar.co.za/index.html | 0.1113 | http://www.jaguar.in/index.html
 | | 0.1177 | http://www.jaguar.com | 0.1023 | http://www.jaguar.ie/index.html
 | | 0.1086 | http://jaguar.pl | 0.0572 | https://twitter.com
 | | 0.1071 | http://www.jaguar.com.my | 0.0366 | http://instagram.com
3. | Harvard | 0.3300 | http://twitter.com | 0.3306 | http://www.hbs.edu/
 | | 0.2914 | https://twitter.com | 0.3048 | http://hms.harvard.edu/
 | | 0.2682 | http://www.harvard.edu | 0.2864 | https://www.hsph.harvard.edu/
 | | 0.2634 | https://www.facebook.com | 0.2672 | https://www.hms.harvard.edu/
 | | 0.2239 | http://www.harvard.edu/ | 0.2545 | http://www.gsd.harvard.edu/
 | | 0.2118 | https://plus.google.com | 0.2500 | http://alumni.harvard.edu/
 | | 0.2087 | http://www.facebook.com | 0.2409 | https://college.harvard.edu/
 | | 0.1850 | http://www.youtube.com | 0.2284 | https://www.gocrimson.com/
 | | 0.1816 | http://www.linkedin.com | 0.1856 | https://library.harvard.edu/
 | | 0.1639 | http://news.harvard.edu | 0.1611 | http://www.harvard.edu/
 | | 0.1558 | http://trademark.harvard.edu | 0.1585 | http://hpac.harvard.edu/
4. | Search Engine | 0.1199 | http://www.google.com | 0.1209 | http://www.google.com
 | | 0.1176 | http://www.bing.com | 0.1191 | http://www.bing.com
 | | 0.1129 | http://www.ask.com | 0.1136 | http://www.ask.com
 | | 0.1083 | http://www.yahoo.com | 0.1088 | http://www.yahoo.com
 | | 0.1008 | http://www.lycos.com | 0.1011 | http://www.lycos.com
 | | 0.0975 | http://www.facebook.com | 0.0987 | http://www.facebook.com
 | | 0.0960 | http://www.ixquick.com | 0.0971 | http://www.ixquick.com
 | | 0.0960 | http://www.webcrawler.com | 0.0962 | http://www.webcrawler.com
 | | 0.0929 | http://www.galaxy.com | 0.0931 | http://www.excite.com
 | | 0.0929 | http://www.excite.com | 0.0931 | http://www.galaxy.com
5. | Kyoto University | 0.7126 | http://www.kyoto-u.ac.jp/en | 0.7216 | http://www.kyoto-u.ac.jp/en
 | | 0.6204 | http://www.kyoto-u.ac.jp/en/ | 0.5809 | http://www.kyoto-u.ac.jp/en/
 | | 0.1158 | http://www.kyoto-u.ac.jp | 0.2267 | http://www.kyoto-u.ac.jp
 | | 0.0740 | http://www.opir.kyoto-u.ac.jp | 0.1037 | http://www.opir.kyoto-u.ac.jp
 | | 0.0523 | http://www.kyoto-u.ac.jp/en/faculties-and-graduate/ | 0.0866 | http://www.opir.kyoto-u.ac.jp/kuprofile/
 | | 0.0523 | http://www.oc.kyoto-u.ac.jp/en/ | 0.0473 | http://www.t.kyoto-u.ac.jp/en
 | | 0.0513 | http://twitter.com | 0.0452 | http://www.kyoto-u.ac.jp/en/faculties-and-graduate/
 | | 0.0499 | http://www.asafas.kyoto-u.ac.jp/en/ | 0.0452 | http://www.oc.kyoto-u.ac.jp/en/
 | | 0.0494 | https://www.facebook.com | 0.0444 | http://sph.med.kyoto-u.ac.jp
 | | 0.0440 | http://www.med.kyoto-u.ac.jp | 0.0437 | http://www.t.kyoto-u.ac.jp

The calculations of HITs, Page Rank, Norm(p), sNorm(p) and weighted HITs (WHITs) are performed on our data set and then compared with weighted HITs (WHITs). The comparison shows that the ranking of the nodes whose weights are doubled by considering the anchors of links is enhanced, and those nodes are ranked highest. The top authoritative pages describing the major search engines are listed in the top 10 authorities produced by the weighted HITs (WHITs) algorithm. As shown in Table 3, weighted HITs increased the weights of the pages which are authoritative and provided almost all available search engines. The results are shown in Table 3 for the top 10 authority ranks for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University", in which WHITs outperforms the other algorithms for pertinent links: it discovered links which are generally not discovered by the rest of the algorithms.

Table 3: Comparison of the weights of the top ten weighted authorities for the queries "Java", "Jaguar", "Harvard", "Search Engine" and "Kyoto University" with the rest of the algorithms

Sr. No. | Query | No. | General HITs authority weight | Page Rank | Norm(p), p = 2 | Norm(p), p = 3 | sNorm(p), p = 2 | Weighted HITs authority weight | Link
1. | Java | 1. | 0.2510 | 0.0095 | 0.2434 | 0.2486 | 0000 | 0.3221 | http://www.oracle.com/technetwork/java/index.html
 | | 2. | 0.1885 | 0.0017 | 0.1851 | 0.1873 | 0000 | 0.2997 | http://www.oracle.com
 | | 3. | 0.1698 | 0.0012 | 0.1668 | 0.1687 | 0000 | 0.2663 | http://java.com
 | | 4. | 0.1605 | 0.0010 | 0.1572 | 0.1593 | 0000 | 0.2590 | http://www.oracle.com/technetwork/java/javase/downloads/index.html
 | | 5. | 0.1530 | 0.0044 | 0.1500 | 0.1520 | 0000 | 0.2262 | https://www.oracle.com
 | | 6. | 0.1398 | 0.0010 | 0.1369 | 0.1389 | 0000 | 0.2147 | https://cloud.oracle.com
 | | 7. | 0.1398 | 0.0010 | 0.1369 | 0.1389 | 0000 | 0.2147 | http://www.java.net
 | | 8. | 0.1507 | 0.0010 | 0.1477 | 0.1497 | 0000 | 0.1871 | https://community.oracle.com
 | | 9. | 0.1461 | 0.0010 | 0.1431 | 0.1451 | 0000 | 0.1841 | http://education.oracle.com
 | | 10. | 0.1101 | 0.0015 | 0.1082 | 0.1094 | 0000 | 0.1757 | https://blogs.oracle.com
2. | Jaguar | 1. | 0.2188 | 0.0146 | 0.2155 | 0.2050 | 0.0015 | 0.6969 | http://www.jaguarusa.com/index.html
 | | 2. | 0.0970 | 0.0131 | 0.0937 | 0.0863 | 0.0001 | 0.6147 | http://www.jaguarusa.com/
 | | 3. | 0.1507 | 0.0091 | 0.1352 | 0.1265 | 0.0001 | 0.1449 | http://www.jaguar.com/index.html
 | | 4. | 0.1582 | 0.0100 | 0.1519 | 0.1454 | 0.0001 | 0.1360 | http://www.jaguar.co.uk/index.html
 | | 5. | 0.1361 | 0.0019 | 0.1240 | 0.1168 | 0.0001 | 0.1144 | http://www.jaguar.co.za/index.html
 | | 6. | 0.1434 | 0.0029 | 0.1356 | 0.1294 | 0.0001 | 0.1141 | http://www.jaguar.com.au/index.html
 | | 7. | 0.1384 | 0.0024 | 0.1250 | 0.1174 | 0.0001 | 0.1113 | http://www.jaguar.in/index.html
 | | 8. | 0.1390 | 0.0027 | 0.1268 | 0.1195 | 0.0001 | 0.1023 | http://www.jaguar.ie/index.html
 | | 9. | 0.0932 | 0.0080 | 0.2531 | 0.2976 | 0.0001 | 0.0572 | https://twitter.com
 | | 10. | 0.0545 | 0.0043 | 0.1358 | 0.1572 | 0000 | 0.0366 | http://instagram.com
3. | Harvard | 1. | 0.1419 | 0.0126 | 0.1419 | 0.1420 | 0.0189 | 0.3306 | http://www.hbs.edu/
 | | 2. | 0.1423 | 0.0078 | 0.1423 | 0.1424 | 0.0043 | 0.3048 | http://hms.harvard.edu/
 | | 3. | 0.0443 | 0.0128 | 0.0443 | 0.0443 | 0000 | 0.2864 | https://www.hsph.harvard.edu/
 | | 4. | 0.0751 | 0.0068 | 0.0752 | 0.0752 | 0.0043 | 0.2672 | https://www.hms.harvard.edu/
 | | 5. | 0.0785 | 0.0093 | 0.0785 | 0.0785 | 0.0045 | 0.2545 | http://www.gsd.harvard.edu/
 | | 6. | 0.0326 | 0.0082 | 0.0326 | 0.0657 | 0.0084 | 0.2500 | http://alumni.harvard.edu/
 | | 7. | 0.0657 | 0.0080 | 0.0657 | 0.0697 | 0.0076 | 0.2409 | https://college.harvard.edu/
 | | 8. | 0.0697 | 0.0059 | 0.0697 | 0.0196 | 0000 | 0.2284 | https://www.gocrimson.com/
 | | 9. | 0.0196 | 0.0106 | 0.0196 | 0.2239 | 0000 | 0.1856 | https://library.harvard.edu/
 | | 10. | 0.2239 | 0.0153 | 0.2239 | 0.0327 | 0.0595 | 0.1611 | http://www.harvard.edu/
4. | Search Engine | 1. | 0.1199 | 0.0034 | 0.1199 | 0.1213 | 0000 | 0.1209 | http://www.google.com
 | | 2. | 0.1176 | 0.0007 | 0.1176 | 0.1191 | 0000 | 0.1191 | http://www.bing.com
 | | 3. | 0.1129 | 0.0004 | 0.1129 | 0.1139 | 0000 | 0.1136 | http://www.ask.com
 | | 4. | 0.1083 | 0.0009 | 0.1083 | 0.1092 | 0000 | 0.1088 | http://www.yahoo.com
 | | 5. | 0.1008 | 0.0011 | 0.1008 | 0.1016 | 0000 | 0.1011 | http://www.lycos.com
 | | 6. | 0.0975 | 0.0042 | 0.0975 | 0.0983 | 0000 | 0.0987 | http://www.facebook.com
 | | 7. | 0.0960 | 0.0003 | 0.0960 | 0.0968 | 0000 | 0.0971 | http://www.ixquick.com
 | | 8. | 0.0960 | 0.0003 | 0.0960 | 0.0967 | 0000 | 0.0962 | http://www.webcrawler.com
 | | 9. | 0.0929 | 0.0003 | 0.0929 | 0.0936 | 0000 | 0.0931 | http://www.excite.com
 | | 10. | 0.0929 | 0.0003 | 0.0929 | 0.0936 | 0000 | 0.0931 | http://www.galaxy.com
5. | Kyoto University | 1. | 0.7126 | 0.0272 | 0.7077 | 0.6723 | 0.0250 | 0.7216 | http://www.kyoto-u.ac.jp/en
 | | 2. | 0.6204 | 0.0199 | 0.6139 | 0.5698 | 0.0250 | 0.5809 | http://www.kyoto-u.ac.jp/en/
 | | 3. | 0.1158 | 0.0608 | 0.1258 | 0.1734 | 0.0009 | 0.2267 | http://www.kyoto-u.ac.jp
 | | 4. | 0.0740 | 0.0084 | 0.0770 | 0.0912 | 0.0011 | 0.1037 | http://www.opir.kyoto-u.ac.jp
 | | 5. | 0.0440 | 0.0109 | 0.0448 | 0.0476 | 0.0028 | 0.0866 | http://www.opir.kyoto-u.ac.jp/kuprofile/
 | | 6. | 0.0384 | 0.0014 | 0.0399 | 0.0475 | 0.0007 | 0.0473 | http://www.t.kyoto-u.ac.jp/en
 | | 7. | 0.0523 | 0.0046 | 0.0541 | 0.0628 | 0.1129 | 0.0452 | http://www.kyoto-u.ac.jp/en/faculties-and-graduate/
 | | 8. | 0.0523 | 0.0046 | 0.0541 | 0.0628 | 0.0007 | 0.0452 | http://www.oc.kyoto-u.ac.jp/en/
 | | 9. | 0.0330 | 0.0007 | 0.0348 | 0.0442 | 0000 | 0.0444 | http://sph.med.kyoto-u.ac.jp
 | | 10. | 0.0276 | 0.0015 | 0.0275 | 0.0267 | 0000 | 0.0437 | http://www.t.kyoto-u.ac.jp

Figures 1-5 show the HITs, Page Rank, Norm(p), sNorm(p) and WHITs authority weights for the dataset queries; as depicted in the figures, WHITs outperforms the other algorithms for our dataset queries.

[Figure 1: "Java" authority weights]
[Figure 2: "Jaguar" authority weights]
[Figure 3: "Harvard" authority weights]
[Figure 4: "Search Engine" authority weights]
[Figure 5: "Kyoto University" authority weights]

5. Comparison of Various Link-Based Web Page Ranking Algorithms

On the basis of the analysis, a comparison of various link-based web page ranking algorithms is carried out using some basic criteria such as main technique, key parameters, relevancy, size of matrix, etc.

Table 5: Comparison of link-based (eigenvector-based, or linear link analysis) ranking algorithms

Algorithms compared: HITs, Page Rank, SALSA, Norm(p), sNorm(p) and weighted HITs (WHITs).

Basic criteria — HITs: link analysis algorithm. Page Rank: link analysis algorithm based on a random surfer. SALSA: link analysis algorithm based on a Markov chain. Norm(p): ranking algorithm. sNorm(p): ranking algorithm. WHITs: link analysis algorithm.

Mining procedure used — HITs: web structure and web content. Page Rank: web structure. SALSA: web structure and web content. Norm(p): web structure. sNorm(p): web structure. WHITs: web structure and web content.

Key parameters — HITs: back and forward links. Page Rank: back-links. SALSA: back and forward links. Norm(p): back and forward links. sNorm(p): back and forward links. WHITs: back and forward links.

Used by — HITs: IBM Clever. Page Rank: Google. SALSA: Twitter uses a SALSA-like algorithm; research model. Norm(p): research model. sNorm(p): research model. WHITs: research model.

Query dependency — HITs: query dependent. Page Rank: query independent. SALSA: query dependent. Norm(p): query dependent. sNorm(p): query dependent. WHITs: query dependent.

Neighborhood — HITs: applied to the neighborhood of pages adjacent to the results of a query, a directed sub-graph (1000-5000 nodes). Page Rank: whole Web. SALSA: sub-graph (bipartite undirected graph). Norm(p): sub-graph of nodes. sNorm(p): sub-graph of nodes. WHITs: applied to the neighborhood of pages adjacent to the results of a query, a directed sub-graph (1000-5000 nodes).

Model — HITs: hubs and authorities. Page Rank: authorities and a Markov model of a random walk. SALSA: hubs, authorities and Markov chains. Norm(p): hubs and authorities with scaling of the weights themselves. sNorm(p): hubs and authorities with scaling of the weights themselves. WHITs: hubs and authorities with double weighting.

Authority and hub computation — HITs: calculates authority and hub scores using the un-weighted matrix. Page Rank: calculates the authority score using a row-weighted matrix. SALSA: calculates its hub and authority scores using both row and column weighting. Norm(p): small authority weights should contribute less to the calculation of the hub weights. sNorm(p): symmetric for authority and hub weights. WHITs: calculates authority and hub scores using the weighted matrix.

Complexity (worst case) — HITs: < O(n^2). Page Rank: O(log n). SALSA: < O(n^2). Norm(p): —. sNorm(p): —. WHITs: < O(n^2).

Mutual reinforcement — HITs: mutual reinforcement between authority and hub web pages. Page Rank: does not distinguish hubs and authorities; ranks pages by authority. SALSA: departure from HITs, i.e. the Tightly Knit Community (TKC) effect. Norm(p): mutual reinforcement between authority and hub web pages. sNorm(p): mutual reinforcement between authority and hub web pages. WHITs: mutual reinforcement between authority and hub web pages.

Stability — HITs: can be unstable; changing a few links can lead to quite different rankings. Page Rank: can be unstable; changing a few links can lead to quite different rankings. SALSA: can be unstable. Norm(p): can be unstable. sNorm(p): can be unstable. WHITs: can be unstable; changing a few links can lead to quite different rankings.

Response time — HITs: crucial, because the rank is calculated at query time. Page Rank: the rank is calculated offline, so response time is not a crucial issue. SALSA: crucial, because the rank is calculated at query time. Norm(p): larger response time. sNorm(p): larger response time. WHITs: crucial, because the rank is calculated at query time.

Relevancy — HITs: more relevant, as it uses hyperlink structure and considers the content. Page Rank: less relevant, as it ranks the page at indexing time. SALSA: more relevant. Norm(p): more relevant. sNorm(p): more relevant. WHITs: more relevant, as it uses hyperlink structure and considers the content.

Size of matrix — HITs: matrix calculation is done on the basis of the neighbourhood directed graph. Page Rank: the world's largest matrix calculation problem. SALSA: matrix calculation is done on the basis of a bipartite undirected graph. Norm(p): matrix calculation is done on the basis of the neighbourhood graph. sNorm(p): matrix calculation is done on the basis of the neighbourhood graph. WHITs: double-weighted matrix calculation is done on the basis of the neighbourhood directed graph.

Dual ranking — HITs: hub and authority rank. Page Rank: does not supply a dual ranking; random walk. SALSA: hub Markov chain and authority Markov chain. Norm(p): a higher authority weight should be more essential in computing the hub weight. sNorm(p): symmetric, by setting the authority weight of a node to the p-norm of the vector of the hub weights of the hubs that point to the node. WHITs: hub and authority rank with a double-weighted matrix.

Efficiency — HITs: less efficient for today's search engines, which need to handle millions of queries per day. Page Rank: scores are precomputed, giving much greater efficiency at query time. SALSA: less efficient. Norm(p): less efficient. sNorm(p): less efficient. WHITs: less efficient for today's search engines, which need to handle millions of queries per day.

Analysis scope — HITs: single page. Page Rank: single page. SALSA: single page. Norm(p): single page. sNorm(p): single page. WHITs: single page.

Quality of result — HITs: less than Page Rank. Page Rank: medium. SALSA: average. Norm(p): medium. sNorm(p): medium. WHITs: improved over HITs.

Computing the eigenvector — HITs: summing the weights of linked nodes at each step. Page Rank: computes the eigenvector using a Markov chain. SALSA: computes the eigenvector using a Markov chain. Norm(p): computes the eigenvector by scaling the weights themselves. sNorm(p): computes the eigenvector by scaling the weights themselves. WHITs: summing the double-weighted linked nodes, whose anchor text and titles are matched with the query, at each step.

Convergence — HITs: converges to a fixed point if iterated for an indefinite period; the authority vector a converges to the principal eigenvector of A^T A and the hub vector h converges to the principal eigenvector of A A^T; generally, 20 iterations produce fairly stable results. Page Rank: the percentage of time spent at each page converges to a fixed value; the Page Rank algorithm generally converges in about 52 iterations. SALSA: the matrices A and H should be irreducible; if the neighborhood graph G is connected, then both A and H are irreducible; if it is not connected, performing the power method on A and H will not result in convergence to a unique dominant eigenvector. Norm(p): varies with the importance of the parameter p and converges to a fixed point. sNorm(p): varies with the importance of the parameter p and converges to a fixed point. WHITs: converges to a fixed point if iterated indefinitely; the authority vector a converges to the principal eigenvector of A^T A with weighted input and the hub vector h converges to the principal eigenvector of A A^T with weighted input; generally, 20 iterations produce fairly stable results.

Vulnerability — HITs: query dependency; irrelevant authorities/hubs problem; mutually reinforcing relationships; topic drift; effect of additional pages. Page Rank: —. SALSA: query dependency; TKC effect. Norm(p): query dependent; mutually reinforcing. sNorm(p): query dependent; mutually reinforcing. WHITs: query dependency; mutually reinforcing relationships.

6. Conclusions

Most web information retrieval tools use textual information while ignoring link information that could be very valuable. Link analysis algorithms for ranking and the utilization of anchor text in IR were explored in this study. In our experiments, the weighted HITs (WHITs) algorithm, which doubles the weights of links of nodes whose out-link anchors and in-link titles match the query, was found to outperform other algorithms such as HITs, Page Rank and the Norm(p) family of algorithms for authority pages. Hence, with WHITs and anchor texts, one can improve Web search rankings for pages which are not self-descriptive. In future, it will also be helpful for homepage finding, named page finding, navigational queries and ad hoc search tasks.

References

[1] M. Henzinger, Link analysis in web information retrieval, IEEE Data Engineering Bulletin. (2000) 1-6.
[2] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM. 46(5) (1999) 604-632.
[3] R. Lempel, S. Moran, The stochastic approach for link-structure analysis (SALSA) and the TKC effect, Computer Networks. 33 (2000) 387-401.
[4] C. Rudin, Ranking with a P-Norm Push, Springer-Verlag Berlin Heidelberg. (2006) 1-20.
[5] M. Kumar, Web Page Ranking Solution through sNorm(p) Algorithm Implementation (Doctoral dissertation, Thapar University, Patiala). (2008).
[6] M. Kumar, A New Approach for Web Page Ranking Solution: sNorm(p) Algorithm, International Journal of Computer Applications (0975-8887). Volume 9, No. 10, November (2010).
[7] S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems. 30(1-7) (1998) 107-117.
[8] L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank citation ranking: Bringing order to the web. (1998).
[9] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S. Kumar, P. Raghavan and A. Tomkins, Mining the link structure of the World Wide Web, IEEE Computer. 32(8) (1999) 60-67.
[10] T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk, Evaluating strategies for similarity search on the web, in Proceedings of the 11th International Conference on World Wide Web, ACM. (2002) 432-442.
[11] I. Varlamis, M. Vazirgiannis, M. Halkidi, and B. Nguyen, THESUS, a closer view on web content management enhanced with link semantics, IEEE Transactions on Knowledge and Data Engineering. 16(6) (2004) 685-700.
[12] D. Gibson, J. Kleinberg, and P. Raghavan, Inferring web communities from link topology, in Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia, ACM. (1998) 225-234.
[13] A. N. Langville, C. D. Meyer, Deeper inside PageRank, Internet Mathematics. 1(3) (2003) 335-380.
[14] V. Dang and W. B. Croft, Query reformulation using anchor text, in Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, WSDM '10. (2010) 41-50.
[15] A. Borodin, G. Roberts, J. Rosenthal, P. Tsaparas, Finding authorities and hubs from link structures on the world wide web, in Proceedings of the 10th International World Wide Web Conference. (2001) 415-429.
[16] R. Kraft and J. Zien, Mining anchor text for query refinement, in Proceedings of WWW. (2004) 666-674.
[17] U. Lee, Z. Liu, and J. Cho, Automatic identification of user goals in web search, in Proceedings of the 14th International Conference on WWW '05, ACM. (2005) 391-400.
[18] X. Zhang, H. Yu, C. Zhang, and X. Liu, An Improved Weighted HITS Algorithm Based on Similarity and Popularity, in Second International Multi-Symposiums on Computer and Computational Sciences (IMSCCS 2007), IEEE. (2007) 477-480.
[19] X. Liu, H. Lin and C. Zhang, An improved HITS algorithm based on page-query similarity and page popularity, Journal of Computers. 7(1) (2012) 130-134.
[20] N. Craswell, D. Hawking, and S. Robertson, Effective site finding using link anchor information, in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM. (2001) 250-257.
[21] H. S. Patel, A. A. Desai, An Anchor Based Information Retrieval For Link Analysis: A Survey, VNSGU Journal of Science and Technology. Vol. 4, No. 1, 22-35, ISSN: 0975-5446, July (2015).
[22] N. Eiron and K. S. McCurley, Analysis of anchor text for web search, in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press. (2003) 459-460.
[23] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan, Automatic resource compilation by analyzing hyperlink structure and associated text, in Proceedings of the 7th World Wide Web Conference. (1998).