
Webpage Ranking Algorithms
Second Exam Report

Grace Zhao
Department of Computer Science
Graduate Center, CUNY

Exam Committee
Professor Xiaowen Zhang, Mentor, College of Staten Island
Professor Ted Brown, Queens College
Professor Xiangdong Li, New York City College of Technology

Initial version: March 8, 2015
Revision: May 1, 2015


Abstract

Traditional link analysis algorithms exploit the context information inherent in the hyperlink structure of the Web, with the premise that a link from page A to page B denotes an endorsement of the quality of B. The exemplary PageRank algorithm weighs backlinks with a random surfer model; Kleinberg's HITS algorithm promotes the use of hubs and authorities over a base set; Lempel and Moran traverse this structure through their bipartite stochastic algorithm; Li examines the structure from head to tail, counting ballots over hypertext. The Semantic Web and the technologies it has inspired bring new core factors into the ranking equation. While continuing to improve the importance and relevancy of search results, semantic ranking algorithms strive to improve search quality along two dimensions: (1) understanding the meaning of the search query, and (2) assessing the relevancy of a result to the user's intention. This survey gives an overview of eight selected search ranking algorithms.


Contents

1 Introduction
2 Background
  2.1 Core Concepts
    2.1.1 Search Engine
    2.1.2 Hyperlink Structure
    2.1.3 Search Query
    2.1.4 Web Graph
    2.1.5 Base Set of Webpages
    2.1.6 Semantic Web
    2.1.7 Resource Description Framework and Ontology
  2.2 Mathematical Notations
3 Classical Ranking Algorithms
  3.1 PageRank [1][2]
  3.2 HITS [3][4]
  3.3 SALSA [5]
  3.4 HVV [6][7]
4 Semantic Ranking Algorithms
  4.1 OntoRank [8][9]
  4.2 TripleRank [10]
  4.3 RareRank [11]
  4.4 Semantic Re-Rank [12]
5 Conclusion and Future Research Direction


1 Introduction

According to Netcraft.com, there were 915,780,262 websites worldwide as of December 2014 1, compared to 2,738 websites twenty years earlier in 1994 2. Today Google claims that the Web is "made up of over 60 trillion individual pages and constantly growing 3." The vast amount of information available on the Internet is a double-edged sword: an information blessing or, alternatively, a potential information nightmare. Web search engines play a key role in today's life, capturing the blessing and abating the nightmare by sifting the vast information on the Web and providing the most relevant and authoritative information, in a digestible amount, to the Internet user.

In order to cope with ever-growing web data and ever-increasing user demands, while fighting against web spam, search engine engineers constantly refine their ranking algorithms and bring new factors into the equation. Google is said to use over 200 factors 4 5 in its ranking algorithms.

The Web has evolved rapidly over the past two decades: from "the Web of documents" in the 1990s, to "the Web of people" in the early 2000s, to the present "Web of data and social networks" [13]. During this evolution, the emerging Semantic Web (SW) set out a new Web platform that augments highly unordered web data with suitable semi-structured, self-describing data and metadata, which greatly improved the quality of search results in SW-enabled environments. Most importantly, SW brings the vision of adding "meaning" to Web resources via knowledge representation and reasoning. Understanding a user's intention in performing a search is the key to providing accurate and customized search results.

The structure of this report is organized as follows: In Section 2, I will introduce some core concepts in the field of search engines and ranking algorithms, along with the mathematical notations used in this report.

1 http://news.netcraft.com/archives/category/web-server-survey/
2 www.internetlivestats.com/total-number-of-websites/
3 http://www.google.com/insidesearch/howsearchworks/thestory/
4 http://www.google.com/insidesearch/howsearchworks/thestory/
5 http://backlinko.com/google-ranking-factors


In Section 3, I will review four iconic ranking algorithms: PageRank, Hyperlink-Induced Topic Search (HITS), the Stochastic Approach for Link-Structure Analysis (SALSA), and Hyperlink Vector Voting (HVV). Semantic ranking algorithms will be examined in Section 4, and the report concludes in Section 5 with a summary and future research directions.

2 Background

2.1 Core Concepts

2.1.1 Search Engine

A search engine is typically composed of a crawler, an indexer, and a ranker. The web crawler, also called a spider, discovers and gathers websites and webpages on the Web. An indexer uses various methods to index the contents of a website or of the Internet as a whole. The ranking engine, which processes the query and produces a relevant result set, is the "brain" of the search engine. Ranking algorithms work closely with the indexed data and metadata.

Apart from crawler-based search engines, human-powered directories, such as the Yahoo directory, depend on human editors' manual effort and discretion to build their listings and search databases.

2.1.2 Hyperlink Structure

A hyperlink has two components: the link address and the link text (hypertext). The link address is a Uniform Resource Locator (URL), which Li [6] refers to as the head anchor at the destination; Li calls the hypertext the tail anchor on the source page. Since a URL is a unique identifier on the Web, ranking algorithms tend to use the hyperlink address as the webpage (document) ID.

Internal hyperlinks are links that go from one page on a domain to a different page on the same domain; they are commonly used for internal navigation and are sometimes referred to as intralinks. An external link, or interlink, points at a page on a different domain. Most ranking algorithms examine the interlinks carefully but give little or no consideration to the intralinks. The web document containing a hyperlink is known as the source document, and the destination document to which the hyperlink points is the target document.


Examining the nature of the links between sources and targets reveals an underlying graph structure: hyperlinks can be viewed as directed edges between source nodes and target nodes. "Backlinks, also known as incoming links, inbound links, inlinks, and inward links, are the in-edges to a node, either a website or webpage." 6 A forward link is defined as an out-edge from the source node to the target node.

Figure 1: Links b and d are backlinks of C

Search engines tend to give special attention to the number and quality of backlinks to a node, since these are considered an indication of the node's popularity or authority. This is similar to the importance the library community gives to the number of citations of an academic paper [14].

Almost all classical ranking algorithms perform hyperlink analysis. Algorithms of this type are called Link Analysis Ranking (LAR) algorithms.

6 http://en.wikipedia.org/wiki/Backlink


2.1.3 Search Query

There are three categories of web search queries 7: transactional, informational, and navigational. These are often called "do, know, go." Ranking algorithms often work with informational queries. Query topics can be broad or narrow. The former pertains to "topics for which there is an abundance of information on the Web, sometimes as many as millions of relevant resources (with varying degrees of relevance)" [5]. Narrow-topic queries refer to those for which very few resources exist on the Web. Different techniques may therefore be required to handle each kind of query.

Search Engine Persuasion [15] refers to the observation that "there may be millions of sites pertaining in some manner to broad-topic queries, but most users will only browse through the first k (e.g., 10) results returned by the search engine."

To simplify the discussion of ranking algorithms, the user queries and anchor links referred to in this report are textual only.

2.1.4 Web Graph

It is important to understand the general graph structure of the Web in order to design ranking algorithms.

The Web Graph is the graph of webpages together with the hypertext links between them [16]. Each webpage or website, identified by a URL, is a vertex of the Web Graph, and each link between two webpages is an edge.

Broder et al. [17] presented a bow-tie structure of the Web Graph in 2000 (see Figure 2), based on a crawl of 200 million pages via the Alta Vista search engine.

In the center of the bow-tie structure is a giant Strongly Connected Component (SCC), a strongly connected directed subgraph in which every node can reach every other node. IN denotes the set of pages that can reach the SCC but cannot be reached from it; OUT consists of pages that can be reached from the SCC but cannot reach the SCC.

7 http://en.wikipedia.org/wiki/Web_search_query


TENDRILS are the orphan web resources that can neither reach the SCC nor be reached from it.

The 2000 study concluded that the SCC contains 28% of the nodes on the Web. In addition, Broder et al. showed that the in-degree and out-degree distributions follow a heavy-tailed power law.

A subsequent study [18] of the Web Graph contradicted Broder's power-law presumption. Meusel et al. analyzed 3.5 billion webpages gathered by the Common Crawl Foundation and claimed that the distributions may instead follow a log-log law. They also found that the SCC, "LSCC" in their terminology (the paper does not clarify why an 'L' is placed in front of 'SCC'), covers 51.28% of web resources on the Web, a 40% increase over the decade (see Figure 3).

Figure 2: Web Graph depicted by Broder et al. in 2000[17]

The Web Graph can be further categorized into the Page-Level Graph, the Host Graph, and the Pay-Level-Domain (PLD) Graph 8.

8 http://webdatacommons.org/hyperlinkgraph


Figure 3: Web Graph depicted by Meusel et al. in 2014[18]

2.1.5 Base Set of Webpages

Since the Web Graph is colossal, searching through the entire graph is practically impossible. Therefore, every ad hoc link-analysis ranking algorithm starts with a set of initial webpages, which Kleinberg calls a "focused subgraph of the WWW" [3]. The general method of obtaining such a set is to use a spider and Breadth-First Search (BFS).

The algorithms that obtain this initial set can be either query-independent or query-dependent.

Brin and Page's PageRank algorithm computes a query-independent authority score for every page. The HITS algorithm by Kleinberg, in contrast, crawls a root set (a collection of pages likely to contain the most authoritative pages for a given topic) and then augments it into a "focused subgraph of the WWW," the base set, by adding pages linked from, or linking to, the root set.

2.1.6 Semantic Web

The term "semantic" relates to meaning in language or logic, such as the meaning of a word. When used in the context of the Semantic


Web, however, the term refers to formally defined meaning that can be used in computation.

The Semantic Web, envisioned by Tim Berners-Lee in the late 1990s and early 2000s, is an extension of the current Web in which information is given well-defined meaning, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users 9. The Semantic Web is also called the "web of data." The relations among data (resources) do not resemble the hyperlinks of the WWW, which connect the current page to a target page; SW relationships can be established between any two resources, not necessarily involving a "current" page. Another major difference is that the relationship (i.e., the link) itself is named. The definition of those relations allows for better, automatic interchange of data 10.

The vision of the Semantic Web was well captured and illustrated in the Semantic Web Tower 11. One more axis, "P", was later added to the tower; P stands for perception or people (Figure 4). The "perception" axis signifies the abstraction and adaptation of the technologies in the tower towards people. In this way, Tim Berners-Lee's original vision was "adjusted" over the years to acknowledge that human involvement is inevitable.

2.1.7 Resource Description Framework and Ontology

The Resource Description Framework (RDF) 12 is the fundamental exchange protocol of the Semantic Web. It is a metadata data model for web resources in the form of subject-predicate-object (triple) expressions. A triple is usually represented by a Uniform Resource Identifier (URI) for the source resource (subject), a URI or literal for the target resource (object), and a URI pointing to a property definition (predicate). A collection of RDF statements intrinsically represents a labeled, directed multi-graph, suitable for knowledge representation.
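To make the triple structure concrete, here is a minimal Python sketch (mine, not part of the report) using the rdflib library; the example namespace and the resource "alice" are made up for illustration.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/")   # hypothetical namespace for this sketch

g = Graph()
alice = EX["alice"]

# Each g.add() call stores one subject-predicate-object triple.
g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))

# Serializing shows the labeled, directed multi-graph as Turtle text.
print(g.serialize(format="turtle"))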

9 http://www.cs.umd.edu/~golbeck/LBSC690/SemanticWeb.html
10 http://www.w3.org/RDF/FAQ
11 http://www.w3.org/RDF/Metalog/docs/sw-easy.html
12 http://www.w3.org/RDF/


Figure 4: Semantic Web Tower

Ontologies are considered one of the pillars of the Semantic Web. SW technologies have gone through ups and downs since their inception in the late 1990s; ontology, however, has drawn continuous, keen interest from both industry and academia.

An ontology is a formal knowledge description of concepts and their relationships [19]. Ontologies, sometimes called "concept maps," play an important role in knowledge-based systems 13. To build an ontology, a well-defined lexicon and logic system (an ontology language, a taxonomy/metadata system, and a vocabulary) has to be in place to safeguard the validity and soundness of the ontology. An ontology is usually written in RDF-based languages such as Resource Description Framework Schema (RDFS) 14 or the Web Ontology Language (OWL) 15.

13 A system that is able to find implicit consequences of its explicitly represented knowledge [20].

14 http://www.w3.org/TR/rdf-schema/
15 http://www.w3.org/TR/owl-semantics


A SW vocabulary can be considered a special, usually lightweight, form of ontology, or sometimes simply a collection of URIs with a described meaning.

An ontology can describe a body of knowledge or a process. A well-built ontology enables subsequent data display, analysis, inferencing, entailments, and the like. DBpedia, one of the largest online SW knowledge bases (KB), currently describes 4.22 million things in its consistent 2014-version ontology 16. Savvy users can query the KB and draw inferences using its vocabularies.

Ontologies help bring semi-structured web data into order and optimize query results, particularly in domain-specific web communities.

Any document available on the Web featuring SW technologies such as RDF or ontologies is considered a Semantic Web Document (SWD).

2.2 Mathematical Notations

We closely follow the notations in [21].

Let S be the base set of webpages (nodes). Let G = (S, E) be the underlying directed graph, where |S| = n. If node i has a hyperlink pointing to node j, a directed edge is placed from i to j.

The graph can be transformed into an n × n adjacency matrix P, where P[i, j] = 1 if there is a link from i to j, and 0 otherwise.

We define the backlink set B and the forward link set F for a node i as follows:

B(i) = {j : P[j, i] = 1}

F(i) = {j : P[i, j] = 1}

|B(i)| is the in-degree of node i, and |F(i)| is the out-degree of node i.
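The notation above maps directly onto a small amount of code. The following Python sketch (not from the report) builds a toy adjacency matrix and computes B(i), F(i), and the corresponding degrees; the graph itself is made up.

import numpy as np

# A small 4-node example graph; P[i, j] = 1 means node i links to node j.
P = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])

def backlinks(P, i):
    # B(i): nodes j with an edge j -> i.
    return np.flatnonzero(P[:, i])

def forward_links(P, i):
    # F(i): nodes j with an edge i -> j.
    return np.flatnonzero(P[i, :])

for i in range(P.shape[0]):
    print(i, "in-degree:", len(backlinks(P, i)), "out-degree:", len(forward_links(P, i)))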

16 http://wiki.dbpedia.org/Ontology2014


An authority node in the graph G is defined as a node with nonzero in-degree, and a hub node is one with nonzero out-degree. Let A be the set of authority nodes and H the set of hub nodes. If G has no isolated nodes, then S = A ∪ H.

A = {s_a | s_a ∈ S and in-degree(s_a) > 0}

H = {s_h | s_h ∈ S and out-degree(s_h) > 0}

A link weight is a non-zero real number.

3 Classical Ranking Algorithms

All ranking algorithms introduced in this section are LAR algorithms. PageRank, HITS, and SALSA are tightly related; all are eigenvector-based ranking algorithms. HVV adopts information retrieval techniques and takes both the link address and the link text as ranking factors.

3.1 PageRank[1][2]

PageRank recognizes that not all links carry the same weight. For example, links from the New York Times website should weigh more than links from an unpopular personal blog. To assess the rank weight, PageRank leans towards backlinks (citations): a page has a high rank if the sum of the ranks of its backlinks is high. This covers two cases: 1) when a page has many backlinks, and 2) when a page has a few highly ranked backlinks. In general, the more backlinks a webpage has on the WWW, the more important it is.

Let us start with a slightly simplified version of PageRank. For a node u, let c be a normalization factor; the ranking R is:

R(u) = c \sum_{v \in B(u)} \frac{R(v)}{|F(v)|},    (1)

where c < 1 because a number of pages have no forward links and their weight would otherwise be lost. The equation can be computed iteratively from any initial set of ranks until it converges.


However, there is a problem with this simplified version of PageRank. Suppose two nodes point only to each other, and some other node points to one of them. During the iteration process, the pair will accumulate rank but never dispatch any of it. This situation is called a rank sink; it is similar to an "absorbing state" in a Markov chain. To overcome this problem, a rank source and random surfer behavior are introduced:

Definition: For some node u ∈ S, let the vector E be the rank source. Then:

R(u) = c \sum_{v \in B(u)} \frac{R(v)}{|F(v)|} + c E(u),    (2)

such that c is maximized and ||R||_1 = 1 (||R||_1 denotes the L1 norm of R). E(u) is the entry of the rank-source vector E corresponding to u.

Besides acting as a decay factor, E can also be viewed as a way of modeling random surfer behavior: "The surfer periodically 'gets bored' and jumps to a random page chosen based on the distribution in E" [1].

In [2], there is a variation of the PageRank algorithm that better illustrates the random surfer model:

R(u) = \frac{1 - d}{N} + d \sum_{v \in B(u)} \frac{R(v)}{|F(v)|}.    (3)

The original paper writes (1 − d) instead of (1 − d)/N, which would make the sum of all rankings N. However, Page and Brin state in the same paper that "the sum of all PageRanks is one."

The damping factor d, 0 < d < 1, usually set to 0.85, enables the following two behaviors:

1. From a given state s (a webpage), with probability d, an outgoing link of s is picked uniformly at random and the surfer moves to its destination, state s′.

2. With probability 1 − d, the surfer chooses a node s′ uniformly at random and jumps to it, with no consideration of s. This is the core of the random surfer model.


It is likely that [2] is the predecessor of [1]. The introduction of the rank-source vector E is the key difference between the two PageRank formulations.

PageRank can also be understood as a Markov chain in which the states are the node set S and the transitions follow the edge set E. The normalized link weights can then be understood as transition probabilities.

In PageRank, S can be almost any set over webpages, such as E, or the entire WWW. The PageRank algorithm is summarized in Algorithm 1.

Algorithm 1 PageRank algorithm
1: R_0 ← S
2: do
3:   R_{i+1} ← P R_i
4:   d ← ||R_i||_1 − ||R_{i+1}||_1
5:   R_{i+1} ← R_{i+1} + d E
6:   δ ← ||R_{i+1} − R_i||_1
7: while δ > ε

Note: E is a user-defined parameter. In most cases E can be uniform over all webpages with value α. However, different choices of E can generate "customized" page ranks.
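As a concrete illustration of Equation 3 and the iterative computation in Algorithm 1, here is a minimal Python power-iteration sketch (my own, not the authors' code); the toy adjacency matrix and the uniform treatment of dangling nodes are assumptions made for the example.

import numpy as np

def pagerank(P, d=0.85, eps=1e-8, max_iter=100):
    # Equation 3: R(u) = (1 - d)/N + d * sum over v in B(u) of R(v)/|F(v)|.
    N = P.shape[0]
    out_deg = P.sum(axis=1)
    # Dangling nodes (no forward links) are spread uniformly here, one common
    # convention; the report instead absorbs the lost weight into the factor c.
    M = np.where(out_deg[:, None] > 0, P / np.maximum(out_deg, 1)[:, None], 1.0 / N)
    R = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        R_next = (1 - d) / N + d * (M.T @ R)
        if np.abs(R_next - R).sum() < eps:
            return R_next
        R = R_next
    return R

# Hypothetical 4-page graph.
P = np.array([[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
print(pagerank(P))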

Since PageRank is query-independent, it cannot by itself distinguish between pages that are authoritative in general and pages that are authoritative on the query topic. A generally authoritative node may not be considered valuable within the domain of the query topic.

PageRank was the embryo of the Google search engine.

3.2 HITS[3][4]

Unlike PageRank, HITS puts backlinks and forward links on an equal footing. It finds a set of relevant authoritative pages and a set of hub pages on the WWW, relative to a broad-topic query, by extracting information from the hyperlink structure of the Web.


To define the ranking criteria, the author proposed two notions, or rather two distinct types of webpages: hubs and authorities (see the definitions in Section 2.2). A mutually reinforcing relationship ties hubs and authorities together: "a good hub is a page that points to many good authorities, and a good authority is a page that is pointed to by many good hubs" [3].

Kleinberg defined the hub weight of a node to be the sum of the authority weights of its forward links, and the authority weight to be the sum of the hub weights of its backlinks.

Let h denote the n-dimensional vector of hub weights, where h_i, the i-th coordinate of h, is the hub weight of node i. Let a be the n-dimensional vector of authority weights, where a_i, the i-th coordinate of a, is the authority weight of node i.

a_i = \sum_{j \in B(i)} h_j.    (4)

h_i = \sum_{j \in F(i)} a_j.    (5)

With adjacency matrix P, we have

a = P^T h and h = P a.

The author states that if a hub node points to many pages with large authority weights, it should receive a large hub weight, and if an authority node is pointed to by many pages with large hub weights, it should receive a large authority value. He therefore proposed a two-level iterative weight-propagation algorithm, the H operation and the A operation, for computing the hub and authority weights respectively, using Equations 4 and 5. After each iteration, the vectors a and h are normalized to unit length. The iteration stops upon convergence.

Let k be a natural number and let z denote the vector (1, 1, 1, ..., 1) ∈ R^n. Algorithm 2 gives the pseudocode of HITS.


Figure 5: A densely linked set of hubs and authorities

Algorithm 2 HITS algorithm

1: Iterate(S, k)
2: Set a_0 := z.
3: Set h_0 := z.
4: for i = 1, 2, ..., k do
5:   Apply the A operation to (a_{i−1}, h_{i−1}), obtaining new authority weights a′_i.
6:   Apply the H operation to (a′_i, h_{i−1}), obtaining new hub weights h′_i.
7:   Normalize a′_i, obtaining a_i.
8:   Normalize h′_i, obtaining h_i.
9: end for
10: Return (a_k, h_k).

A "filter" step then selects the top c authorities and top c hubs (Algorithm 3). Let k and c be natural numbers.

For arbitrarily large k, Kleinberg proved that the sequences of vectors a_k and h_k converge to fixed points a* and h*. The vectors a* and h* are the principal eigenvectors of P^T P and P P^T, respectively.

Algorithm 3 HITS filter algorithm
1: Filter(S, k, c)
2: (a_k, h_k) := Iterate(S, k).
3: Report the pages with the c largest coordinates in a_k as authorities.
4: Report the pages with the c largest coordinates in h_k as hubs.

Lempel and Moran further illustrate the concept via two association matrices, the authority matrix AU and the hub matrix HU [5].

AU =def P^T P is the co-citation matrix of the set S. [AU]_{i,j} is the number of pages that jointly point at pages i and j. Kleinberg's iterative algorithm converges to authority weights which correspond to the entries of the (unique, normalized) principal eigenvector of AU.

HU =def P P^T is the bibliographic coupling matrix. [HU]_{i,j} is the number of pages that are jointly pointed at by nodes i and j [5]. Kleinberg's iterative algorithm converges to hub weights which correspond to the entries of HU's (unique, normalized) principal eigenvector.
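The Iterate procedure is easy to express with the adjacency matrix directly. Below is an illustrative Python sketch (not from the report) of the A and H operations of Equations 4 and 5 with per-step normalization, run on a made-up toy matrix.

import numpy as np

def hits(P, k=50):
    # Alternate Equations 4 and 5, normalizing a and h after each step.
    n = P.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(k):
        a = P.T @ h          # A operation: authority = sum of hub weights of backlinks
        h = P @ a            # H operation: hub = sum of authority weights of forward links
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

P = np.array([[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
a, h = hits(P)
print("authorities:", a.round(3), "hubs:", h.round(3))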

The HITS algorithm was used by a number of search engines, such as Yahoo! and Alta Vista.

3.3 SALSA[5]

Lempel and Moran present SALSA, an LAR algorithm with a stochastic approach. The algorithm was inspired by the random surfer model of PageRank and by the mutually reinforcing relationship between hubs and authorities of HITS.

SALSA, a "weighted in/out-degree analysis of the link-structure of WWW subgraphs," is similar to Kleinberg's algorithm in that both are broad-topic dependent, work on a focused subgraph, and employ the same meta-algorithm.

SALSA computes the weights by simulating random walks through two different Markov chains: a chain of hubs, H, and a chain of authorities, A, over the nodes of S. This differs from PageRank, which performs "a single random leap on the entire WWW," with the leap chosen according to the distribution E and with no distinction made between hubs and authorities. SALSA's bipartite random-surfing model is also a departure from HITS' mutually reinforcing relationship.


Lempel and Moran construct a bipartite undirected graph G′ = (A, H, E). A and H are two independent sets, and a node i can appear in both A and H.

Figure 6: Transforming the directed graph G into the bipartite graph G′

SALSA launches two distinct random walks. Each walk visits nodes on only one side of the graph, the authority side or the hub side. In the initial state, the random walk starts from some authority node selected uniformly at random. In the next step, the random surfer crosses to the hub side by following one of the node's incoming links, chosen uniformly at random; in the step after, the surfer follows one of that hub's outgoing links, again chosen uniformly at random, and moves back to an authority node. The random walk proceeds by alternating between such backward and forward steps.

Each node is assigned an authority weight and a hub weight, defined as the stationary distributions of these random walks.

Here is the process of defining the authority and hub random-walk matrices A and H, two doubly stochastic matrices. First, two matrices are derived from P.

Let P_r denote the matrix derived from P by normalizing the entries such that, for each row, the sum of the entries is 1, and let P_c denote the matrix derived from P by normalizing the entries such that, for each column, the sum of the entries is 1. Then A consists of the nonzero rows and columns of P_c^T P_r, and H consists of the nonzero rows and columns of P_r P_c^T. The stationary distributions of the SALSA algorithm are the principal eigenvectors of these matrices.

The transition probabilities of the two Markov chains can be computed directly via Equations 6 and 7.

P_a(i, j) = \sum_{k \in B(i) \cap B(j)} \frac{1}{|B(i)|} \cdot \frac{1}{|F(k)|}.    (6)

P_h(i, j) = \sum_{k \in F(i) \cap F(j)} \frac{1}{|F(i)|} \cdot \frac{1}{|B(k)|}.    (7)

SALSA can be seen as a variation of HITS [15]. In the H operation of the HITS algorithm, each hub broadcasts its weight to the authorities, and the authorities sum up the weights of the hubs that point to them. In SALSA, however, each hub divides its weight equally among the authorities to which it points. Similarly, SALSA modifies the A operation so that each authority divides its weight equally among the hubs that point to it. This yields Equations 8 and 9:

a_i = \sum_{j \in B(i)} \frac{h_j}{|F(j)|}.    (8)

h_i = \sum_{j \in F(i)} \frac{a_j}{|B(j)|}.    (9)
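For comparison with HITS, here is a small Python sketch (mine, not the authors') of the modified update rules in Equations 8 and 9, again on a made-up toy graph.

import numpy as np

def salsa(P, k=50):
    # Iterate Equations 8 and 9: weights are divided by out-degree / in-degree before summing.
    out_deg = np.maximum(P.sum(axis=1), 1)   # |F(j)|, guarded against zero
    in_deg = np.maximum(P.sum(axis=0), 1)    # |B(j)|
    n = P.shape[0]
    a = np.ones(n) / n
    h = np.ones(n) / n
    for _ in range(k):
        a = P.T @ (h / out_deg)   # Equation 8
        h = P @ (a / in_deg)      # Equation 9
        a /= a.sum()
        h /= h.sum()
    return a, h

P = np.array([[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
a, h = salsa(P)
print("authority weights:", a.round(3), "hub weights:", h.round(3))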

The authors argue that in some cases the mutually reinforcing relationship may produce a topological phenomenon called the Tightly Knit Community (TKC) effect: a highly connected subgraph within the Web may cause the mutual reinforcement approach to identify false authorities or hubs. SALSA, with its looser coupling, is less vulnerable to this effect.

SALSA is said to be computationally lighter than HITS, since its ranking is equivalent to a weighted in/out-degree ranking. Both HITS and SALSA are ad hoc algorithms, meaning they are computed at


query time; their computational costs are therefore crucial and directly affect the response time of a search engine. PageRank, in contrast, is query-independent and can be computed off-line.

In a study [15] comparing 34 term-based queries, SALSA performed best among the three aforementioned algorithms at finding highly relevant pages.

The SALSA algorithm was implemented by Twitter.

3.4 HVV[6][7]

HVV, like PageRank, holds that "what a site says about itself is not considered reliable," but "only what others say that page is about!" 17 In other words, HVV emphasizes the importance of backlinks. Like HITS and SALSA, HVV is an ad hoc algorithm; however, it follows an entirely different ranking approach than the other three. For a given query, HVV computes the page rank based on the weighted terms in the hypertext of backlinks.

Li calls the hyperlink URL pointing to the destination the head anchor, and the hypertext the tail anchor (at the source). The "document ID" is usually defined as the hyperlink's head anchor, and the hypertext is treated as the content of the link.

A traditional information retrieval method, the Vector Space Model (VSM), is the building block of HVV. VSM is often used to compute the similarity between two documents, or between a query and a document. VSM handles the full text of a document, whereas HVV deals only with the hyperlinks within a webpage.

Like VSM, HVV represents both the document (webpage) and the query as vectors.

In HVV, a document vector is a vector of vectors: each dimension of the document vector is a hyperlink vector, and a document can have zero or more link vectors. Each dimension of a link vector is the weight of a term (excluding stop words) extracted from the hypertext.

17 http://tech-insider.org/internet/research/1997/1210.html


This differs from VSM, where the document vector is a vector of document terms.

D_j = (L_1, L_2, ..., L_n),

where D_j is a document ID, and L_i is the i-th hyperlink vector whose head anchor is D_j.

L_l = (w_{l,1}, w_{l,2}, ..., w_{l,m}),

where m is the number of unique terms in the link vector L_l. The value of each link-vector dimension is calculated using a term-weighting method, such as the popular Term Frequency - Inverse Document Frequency (tf-idf) model [22].

Like VSM, the query vector in HVV is a vector of weights, one for each keyword in the query.

Q = (w_1, w_2, ..., w_n),

where w_i is the weight of the i-th term in the query.

In VSM, the relevance score between a document and a query is the dot product of the document vector and the query vector. Since the document vector in HVV is a vector of link vectors, the HVV ranking score is defined as the sum of the dot products between the query vector and each hyperlink vector of a given document.

R = \sum_{t=1}^{n} (Q \cdot L_t),    (10)

where R is the ranking score, Q is the query vector, and L_t is the t-th link vector.
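A toy Python sketch (not from the papers) of Equation 10 follows; the vocabulary, query weights, and per-anchor term weights are invented for the example, where tf-idf weights would be used in practice.

import numpy as np

# Hypothetical vocabulary and a query "semantic ranking" with unit keyword weights.
vocab = ["semantic", "ranking", "web", "graph"]
Q = np.array([1.0, 1.0, 0.0, 0.0])

# Each backlink's hypertext reduced to term weights over the vocabulary (made-up numbers).
link_vectors = [
    np.array([0.8, 0.0, 0.3, 0.0]),   # anchor text roughly "semantic web"
    np.array([0.0, 0.9, 0.0, 0.2]),   # anchor text roughly "ranking graph"
]

# Equation 10: sum of dot products between the query vector and each link vector.
R = sum(float(Q @ L) for L in link_vectors)
print("HVV score:", R)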


The vector model of classic information retrieval suggests that the more rare words two documents share, the more similar they are considered to be.

Since the HVV ranker searches only hyperlinks rather than full body text, the index of Rankdex (an HVV-based search engine) was significantly smaller than other search engine indexes. Document size is no longer a factor in relevance ranking, and thus shorter documents are more likely to be selected. In addition, Li suggested that images, graphics, and sounds, which are not searchable by conventional methods, become searchable through the hyperlink descriptions pointing to them. The same is true of foreign-language documents if there are hyperlinks to them in the user's native language.

The HVV algorithm underpinned the Baidu search engine.

4 Semantic Ranking Algorithms

The Semantic Web and the semantic technologies it has inspired have been at the center of search engine research in recent years. Consumers increasingly expect search engines to understand natural language and perceive the intent behind the words they type, and search engine researchers are seeking new horizons to take up the challenge.

In 2011, Microsoft, Yahoo, and Google jointly launched the Schema.org initiative, which defines a set of HTML markup terms that serve as clues to the meaning of a page and help search engines recognize specific people, events, attributes, and so on. Meanwhile, semantic search came into being.

Much of semantic search research is directed at adding semantic annotations to data in order to "improve search accuracy by understanding searcher's intent and the contextual meaning of terms as they appear in the searchable database, whether on the Web or within a closed system, to generate more relevant results." 18

18 http://en.wikipedia.org/wiki/Semantic_search


In this report, we focus on semantic search over the Semantic Web. Data on the Semantic Web is divided into two categories: ontological data and instance data. The actual data the user is interested in are the instance data belonging to a class, while the domain knowledge and relationships are described primarily as class relationships in the ontology.

RDF is the standard model for data interchange on the Semantic Web. It defines the main concepts, such as classes and properties, and how they interact to create meaning. In a semantic graph, the entities (classes, instances, property entities) are the nodes, and the relationships (properties) between the entities are the edges.

In contrast to traditional rankers, which swim in the traditional Web Graph, semantic spiders and semantic rankers leap through a conceptual network of the Web: the Web Knowledge Graph.

Jindal [23] classified semantic ranking into three types: Entity, Relationship, and Semantic Document ranking.

4.1 OntoRank[8][9]

Swoogle's OntoRank is a term-based, query-dependent ranking algorithm for the Semantic Web. Instead of crawling webpages, Swoogle looks for SWDs, such as ontologies and RDF documents, published on the Web.

Two types of SWDs are defined in the paper: SW ontologies (SWOs), and SW databases (SWDBs), documents that mostly describe instance data and individuals. The SWO is said to play the role of the TBox in a Description Logic (DL) knowledge base, and the SWDB the ABox. A DL knowledge base typically comprises two components, a TBox and an ABox. The TBox contains intensional knowledge in the form of a terminology or taxonomy and is built through declarations that describe general properties of concepts. The ABox contains extensional knowledge that is specific to the individuals of the domain of discourse [20].

Swoogle extracts metadata from the harvested SWOs and SWDBs and builds or extends its RDF graph accordingly. The graph is a directed, labeled graph, where the edges represent named links with explicit semantics between two resources, which are represented by the graph nodes. Swoogle identifies three categories of metadata: (i)


basic metadata, which considers the syntactic and semantic features of a SWD; (ii) relations, which consider the explicit semantics between individual SWDs; and (iii) analytical results, such as SWO/SWDB classification and SWD ranking.

For the directional relations (links) among SWDs, in other words the navigational paths between the nodes, Swoogle classifies four types of interlinks. For SWDs i and j:

1. link(j, imports, i) denotes that j imports all terms and content of i.

2. link(j, uses-term, i) denotes that j uses some of the terms defined by i without importing i.

3. link(j, extends, i) denotes that j extends the definitions of terms defined by i.

4. link(j, asserts, i) denotes that j makes assertions about the individuals defined by i.

In all four cases, j provides a backlink to i.

Heavily influenced by PageRank, Swoogle's Ontology Rank (OntoRank) uses the number of backlinks to assess the importance of a SWD. OntoRank follows a variant of the random surfer model called the rational surfer model. Let link(i, l, j) be the semantic link from SWD i to SWD j using semantic tag l, let d be a constant between 0 and 1, and let weight(l) be the user's preference for choosing semantic links with tag l. The initial ranking of a SWD i is defined as:

R(i) = (1 - d) + d \sum_{j \in B(i)} R(j) \frac{f(j, i)}{f(j)},    (11)

f(j, i) = \sum_{link(j, l, i)} weight(l),    (12)

f(j) = \sum_{k \in F(j)} f(j, k).    (13)

In the rational surfer model, in addition to the damping factor d, the forward link is chosen with unequal probability f(j, i)/f(j), where j is the current SWDB, i is the SWD that j links to, and f(j, i) is


the sum of all link weights from j to i.

The final rank, the OntoRank, of SWD i is defined as:

OntoRank(i) = R(i) + \sum_{link(j, imports, i)} R(j),    (14)

where link(j, imports, i) is taken over the transitive closure of the imports relation, i.e., all SWDs j that directly or indirectly import SWO i. Thus i receives accumulated scores from its internal and external nodes. Evidently, OntoRank gives priority to SWOs over instance data.
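To illustrate Equations 11-14, here is a small Python sketch of my own; the link tags, their weights, and the three-document graph are hypothetical, and only direct imports are accumulated (the paper uses the transitive closure).

# Hypothetical semantic links: (source SWD, tag, target SWD); tag weights are made up.
links = [("a", "imports", "b"), ("a", "uses-term", "c"),
         ("c", "extends", "b"), ("b", "asserts", "c")]
weight = {"imports": 1.0, "uses-term": 0.5, "extends": 0.8, "asserts": 0.3}
nodes = {"a", "b", "c"}
d = 0.85

def f(j, i=None):
    # f(j, i): summed tag weights of links from j to i (Equation 12);
    # f(j) with i omitted: total outgoing weight of j (Equation 13).
    return sum(weight[t] for (s, t, o) in links if s == j and (i is None or o == i))

R = {n: 1.0 for n in nodes}
for _ in range(50):                                     # iterate Equation 11
    new_R = {}
    for i in nodes:
        sources = {s for (s, t, o) in links if o == i}  # B(i)
        new_R[i] = (1 - d) + d * sum(R[j] * f(j, i) / f(j) for j in sources)
    R = new_R

# Equation 14 (direct imports only in this sketch).
onto_rank = {i: R[i] + sum(R[j] for (j, t, o) in links if t == "imports" and o == i)
             for i in nodes}
print(onto_rank)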

Apart from ranking SWDs using OntoRank, Swoogle's TermRank ranks Semantic Web terms (SWTs) found on the Semantic Web. Given a term t and a SWD d, fq(t, d) denotes the number of occurrences of t in d. Let the SWD collection be D_t = {d | fq(t, d) > 0}; |D_t| denotes the number of SWDs that use t. Then,

TermRank(t) = \sum_{fq(t,d) > 0} \frac{OntoRank(d) \times TW(d, t)}{\sum_{fq(t',d) > 0} TW(d, t')},    (15)

TW(d, t) = fq(t, d) \times |D_t|.    (16)

A general user can query with keywords, and the SWDs matching those keywords are returned in ranked order. An advanced user can query the underlying knowledge base using keywords, content-based constraints, and language- and encoding-based constraints.

In Swoogle's metadata database, 13.29% of SWDs are classified as SWOs. About half of all SWDs have a rank of 0.15, which means they are not referred to by any other SWDs. The mean rank is 0.8376, which implies that the SWDs the Swoogle spider has found are poorly connected.

OntoRank is a Semantic Document Ranking model.

4.2 TripleRank[10]

TripleRank is a HITS-inspired algorithm for authority ranking in the context of RDF knowledge bases. It uses a 3-dimensional tensor model to represent SW triples, bringing geometric structure into the


linear algebraic world. It approaches the Semantic Web graph from a different vantage point, associating the graph with a function that returns the link relations (properties, types) between two resources.

Definition: Let G = (V, E, Γ, θ), where V is a set of SWDs or SW resources, E is a set of links between SWDs/resources, Γ is a set of literals, and the function θ : V → E returns the URI of the property that links two resources.

Franz et al. model the graph using a 3rd-order tensor (a 3-way array), with object, subject, and predicate/property as the modes. An order-3 tensor has three kinds of fibers (rows, columns, and tubes) and accordingly allows three-way slicing: horizontal, vertical, and frontal [24] (see Figure 7). Sliced in the right direction, each slice is an adjacency matrix with respect to one link type or property. HITS could then be used to calculate hub and authority scores for each slice; however, decomposing the graph this way results in very sparse, disconnected matrices. The tensor model, analyzed through a Parallel Factor Analysis (PARAFAC) decomposition, not only connects all the link properties together but may also detect further hidden relationships. PARAFAC is based on the trilinear model x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} [25].

Figure 7: (Top) FIBERS: (A) columns, (B) rows, and (C) tubes of a 3rd-order tensor. (Bottom) SLICES: (A) horizontal, (B) vertical, and (C) frontal slices of a 3-way tensor [24]

Formally, a tensor T ∈ R^{k×l×m} is decomposed by PARAFAC into component matrices U_1 ∈ R^{k×n}, U_2 ∈ R^{l×n}, and U_3 ∈ R^{m×n}. PARAFAC thereby decomposes the tensor as a sum of rank-one (Kruskal) tensors [26]: T = \sum_{k=1}^{n} U_1^k ∘ U_2^k ∘ U_3^k, where U_i^k is the k-th column of U_i and ∘ is the outer product. If U_1, U_2, U_3 represent subject, object, and property respectively, then, in line with HITS, the largest entries of U_1^1 correspond to the largest hub scores and the largest entries of U_2^1 to the largest authorities. As such, PARAFAC can be considered a 3-D version of HITS.
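The tensor construction itself is straightforward; the following Python sketch (mine, with made-up triples) builds the subject × object × predicate tensor and shows that each frontal slice is the adjacency matrix of one predicate. An actual PARAFAC decomposition (e.g., via a tensor library) would then factor the whole tensor, which is not shown here.

import numpy as np

# Hypothetical triples (subject, predicate, object) over three resources and two predicates.
resources = ["a", "b", "c"]
predicates = ["cites", "authorOf"]
triples = [("a", "cites", "b"), ("c", "cites", "b"), ("a", "authorOf", "c")]

r = {x: i for i, x in enumerate(resources)}
p = {x: i for i, x in enumerate(predicates)}

# T[i, j, k] = 1 if resource i links to resource j via predicate k.
T = np.zeros((len(resources), len(resources), len(predicates)))
for s, pred, o in triples:
    T[r[s], r[o], p[pred]] = 1.0

# Each frontal slice T[:, :, k] is the adjacency matrix for one property, so per-slice
# HITS is possible, but slices are sparse and disconnected; PARAFAC factors the full
# tensor into U1 (hub-like), U2 (authority-like), and U3 (property) components.
for k, name in enumerate(predicates):
    print(name)
    print(T[:, :, k])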

The authors proposed a pre-processing step before ranking, which resonates with the traditional IR principle that rare words weigh more than frequent words:

1. Predicates linking the majority of resources are pruned, as they convey little information and dominate the data set.

2. Statements with less frequent predicates are amplified more strongly than more common statements.

The authors evaluated TripleRank in faceted search, a technique that allows users to explore a collection of information by applying multiple filters, and concluded that the TripleRank approach results in substantially increased recall without loss of precision.

TripleRank is a Relationship Ranking model.

4.3 RareRank[11]

RareRank stands for the "Rational Research Ranking" model. This semantic ranker emulates a researcher's search behavior in a scientific research environment. It argues that a researcher tends to make a "rational choice" when searching for an answer, as opposed to taking a "random walk" as in PageRank.

RareRank focuses on entities and the relationships among them. The authoritativeness of an entity, say a book, is based on three factors: citations (backlinks), the popularity of its authors, and relevancy to the query topic. As such, a newly written document can rise to prominence if it is highly related to the query topic or its author is highly venerated in the field, something missing from citation-based algorithms [14].


In RareRank, two types of graphs are present in the system: the ontology schema graph and the knowledge base (instance data) graph; both are directed, labeled, and weighted.

Definition: Let the schema graph be G_s = (V, E, Ω(e, v)), where V is a set of classes, V = {v_i | v_i ∈ V, 0 < i ≤ |V|}; E is a set of predicates, E = {e_j | e_j ∈ E, 0 < j ≤ |E|}; and Ω(e, v) is the set of weights of predicates e whose domain is class v, Ω(e, v) = {ω(e_j, v_i) | ω(e_j, v_i) ∈ Ω(e, v), ω(e_j, v_i) ∈ [0, 1]}. |F(v_i)| denotes the number of outgoing links from v_i.

The schema graph designates the relations between ontological classes and their transition weights.

Definition: Let the knowledge base graph be G_k = (V′, E′), where V′ is the set of all instances defined in G_k, instantiated from the classes in V: V′ = {v′_i | v′_i ∈ V′, 0 < i ≤ |V′|}; and E′ is the set of all predicate instances defined in G_k, instantiated from the predicates in E: E′ = {e′_j | e′_j ∈ E′, 0 < j ≤ |E′|}. Let N denote the number of instances in G_k.

The knowledge base graph consists of instances (or entities) and their relationships instantiated from the schema ontology. The weight of a relation from an instance i_d in its domain to an instance i_r in its range is determined by (i) the weight of the relation between the corresponding classes in the schema graph, (ii) how many instances of the same type as i_r that i_d links to, and (iii) the strength of the association between the instances. The RareRank score integrates both relevance (via the domain-topic ontology) and quality.

In RareRank, computation of the ranking scores is based on the convergence of a Markov chain. An important property of a Markov chain is that, from any starting point, the chain converges to its stationary distribution as long as the transition probability matrix P is irreducible and aperiodic.

π P = π,

where π is the stationary probability vector associated with eigenvalue 1. This eigenvector holds the ranking values of all the entities in the graph and can be obtained with the power iteration method. The transition probability matrix, constructed over both the ontology


schema graph and the knowledge base graph, is therefore the focal point of RareRank.

There are four types of teleport operations that govern the RareRank transition probabilities. The notion of teleport in RareRank means that from each node there is a probability of reaching every other node in the graph. In general, there are two teleporting scenarios: the class in question has no outgoing links in the ontology schema graph, or it has outgoing links in the schema graph.

1. Full Teleport Probability – This applies when the class has no outgoing links in the ontology schema, which also implies that the corresponding instance in the knowledge base has no outgoing links. The teleport probability of this type is denoted pr^{ft}.

   In the schema matrix: pr^{ft}_s = 1.

   In the knowledge base: pr^{ft}_k = 1/N.

2. Base Teleport Probability – This is the probability of initiating a teleport operation when a class has outgoing links in the ontology schema (and an instance of the class in the knowledge base thus possibly has outgoing links), namely if \sum_{j}^{|F(v_i)|} ω(e_j, v_i) = 1. This teleport probability, pr^{bt}, is set to 1 − d, where d is the damping factor.

   In the schema: pr^{bt}_s = 1 − d.

   In the knowledge base: pr^{bt}_k = (1 − d)/N.

   RareRank sets d = 0.95 (recall that PageRank sets it to 0.85) in order to minimize the "random surfer" behavior and, in turn, increase the "rationality" of the ranker.

3. Schema Imbalance Teleport Probability – If the sum of transition probabilities from one class to all other classes is less than 1, i.e., \sum_{j}^{|F(v_i)|} ω(e_j, v_i) ∈ (0, 1) in the schema, the teleport probability, denoted pr^{it}, takes up the difference between 1 and the sum.

   In the schema: pr^{it}_s = d (1 − \sum_{j}^{|F(v_i)|} ω(e_j, v_i)).

   In the knowledge base: pr^{it}_k = d (1 − \sum_{j}^{|F(v_i)|} ω(e_j, v_i)) / N.

4. Link Zero-Instantiation Teleport Probability – When a predicate e_j is defined in the schema but not instantiated in the knowledge base, the weight of the predicate is transferred to teleporting, pr^{zt}:

   In the knowledge base: pr^{zt}_k = d (\sum_{e_j \notin E'}^{|F(v_i)|} ω(e_j, v_i)) / N.

In addition to the teleport probabilities, RareRank has a jumping case that contributes to the transition probabilities. If a predicate e_j with domain v_i is present in G_k, the transition probability is defined as:

pr(i, j) = d \frac{ω(e_j, v_i)}{|(e_j, v_i)|},

where |(e_j, v_i)| is the number of times that the predicate e_j with domain v_i is instantiated in G_k.

Thus, the transition probability in G_s and the transition probability from instance i to j in G_k can be computed via Equations 17 and 18 respectively:

pr_s = 1 if there are no outlinks, and pr_s = pr^{bt}_s + pr^{it}_s + pr^{zt}_s otherwise.    (17)

pr_k = 1/N if there are no outlinks, and pr_k = pr^{bt}_k + pr^{it}_k + pr^{zt}_k + pr(i, j) otherwise.    (18)
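Since the RareRank scores are simply the stationary distribution π of the assembled transition matrix, a generic power-iteration sketch (not specific to the paper's matrix construction) illustrates the final step; the 3 × 3 matrix is made up.

import numpy as np

def stationary_distribution(P, eps=1e-10, max_iter=1000):
    # Power iteration for pi P = pi, assuming P is row-stochastic, irreducible, and aperiodic.
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < eps:
            return nxt
        pi = nxt
    return pi

# Toy 3-entity transition matrix combining link-following and teleport mass (rows sum to 1).
P = np.array([[0.05, 0.90, 0.05],
              [0.45, 0.05, 0.50],
              [0.90, 0.05, 0.05]])
print(stationary_distribution(P).round(4))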

The characteristic feature of RareRank is that it adds a terminological topic ontology to the ranking equation, in addition to the knowledge base, to simulate a more structured and "rational" research environment, with the relationships between entities simulating the behavior of a rational researcher. Computation of the RareRank scores is based on a set of teleport rules plus a jumping feature for building the transition probability matrix, and is guaranteed to converge to an invariant distribution.

RareRank is an Entity Ranking model.

4.4 Semantic Re-Rank[12]

Wang et al. propose a re-ranking method that first fetches the top N results returned by an authoritative search engine, such as Google, and then employs lexical semantic similarity to re-rank the results.

The paper critiques three limitations of keyword search: (i) a few keywords may not effectively convey the user's intention; (ii) homonyms, homophones, homographs 19, and different orderings of the keywords may produce inaccurate results under exact keyword matching; and (iii) a webpage that contains none of the search keywords yet is highly relevant to the topic has no chance of receiving attention from keyword search.

The semantic re-ranking method proposed in the paper is said to have "downplayed the limitations" of keyword search, to better adapt to human thinking patterns, and thus to attune the search results to the user's search intention.

The re-ranking procedure is carried out in three steps:

1. Converting the returned ranking position of a candidate document into an importance score

19 http://www.vocabulary.com/articles/chooseyourwords/homonym-homophone-homograph/


2. Computing the semantic similarity score and relevance score between each query keyword and each non-stop word in the document

3. Computing the new ranking via a linear combination of the importance score and similarity score of the candidate document

The importance score θ of a candidate webpage w is calculated as:

θ(w) = \frac{1 − (w − 1)/N}{\log_2(w + 1)},    (19)

where w is the original ranking position and N is the number of fetched webpages for the query.

This formula is based on the Discounted Cumulative Gain (DCG) [27] formula commonly used to evaluate search result quality; DCG measures the usefulness, or gain, of a document based on its position in the result list.

The authors developed an ontology platform, WorkiNet, that integrates Wikipedia into WordNet 20. Beyond the words and concepts already collected in WordNet, WorkiNet adopted 1,782,276 new concepts from Wikipedia. The authors state that in order to exploit semantic similarity, an ontology must be specified first.

The algorithm calculates the semantic similarities between the query keywords and each non-stop word in the candidate document in order to obtain the relevancy score. Wang et al. borrowed Leacock and Chodorow's formula [28] (Equation 20) to calculate the similarity score π between two concepts c_i, c_j (words on the candidate webpage versus the query keywords):

π(c_i, c_j) = \max\left[ −\log \frac{len(c_i, c_j)}{2D} \right],    (20)

where len(c_i, c_j) is the length of the shortest path from c_i to c_j in WorkiNet, and D is the maximum depth of the taxonomy.

20 http://wordnet.princeton.edu


The above formula does not consider where the two concepts appear in the ontology. However, it is generally understood that a node's upper-level nodes in the ontology graph tend to be more general in concept than its sibling nodes (of a similar level of generalization or specification) or its child nodes (of greater specification). Hence, "sibling-concepts with larger depth are more likely to have semantic correlations than the higher ones[29]." The authors therefore proposed a more sophisticated formula:

π(ci, cj) = log[len(ci, cj) / (d(ci) + d(cj))] / log[1 / (2(D + 1))],    (21)

where d(ci) is the length of the path from ci to the root in the WorkiNet ontology graph.
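Equation 21 can be sketched the same way; the concept depths and path lengths are passed in as precomputed values, and the sample numbers are hypothetical:

import math

def depth_aware_similarity(path_len, depth_i, depth_j, max_depth):
    """Equation 21: similarity that favors concept pairs sitting deep
    (specific) in the ontology.  path_len is the shortest path between
    the concepts, depth_i and depth_j their distances from the root,
    and max_depth is D."""
    return (math.log(path_len / (depth_i + depth_j))
            / math.log(1.0 / (2.0 * (max_depth + 1))))

# Two deep sibling concepts score higher than two shallow ones even
# though the path between them has the same length.
print(depth_aware_similarity(2, 12, 12, 16))  # deep pair, ~0.70
print(depth_aware_similarity(2, 2, 2, 16))    # shallow pair, ~0.20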

After the similarity score of each word in a webpage has been obtained, the relevance score σ of the webpage w is computed:

σ(w) = ∑j∈w π(j) fq(j) / ∑j∈w fq(j),    (22)

where π(j) is the similarity score of word j and fq(j) is the number of occurrences of word j in webpage w.
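Reading π(j) as the similarity score assigned to word j and fq(j) as its frequency in the page, Equation 22 is a frequency-weighted average; a small sketch with toy values:

def relevance_score(word_scores, word_counts):
    """Equation 22: frequency-weighted average of per-word similarity
    scores.  word_scores maps each non-stop word j to pi(j);
    word_counts maps it to fq(j), its occurrences in the webpage."""
    num = sum(word_scores[w] * word_counts[w] for w in word_counts)
    den = sum(word_counts.values())
    return num / den if den else 0.0

# Toy page: "ranking" is both frequent and close to the query, so it
# dominates the page's relevance score.
scores = {"ranking": 0.9, "semantic": 0.7, "banana": 0.1}
counts = {"ranking": 5, "semantic": 2, "banana": 1}
print(round(relevance_score(scores, counts), 3))  # 0.75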

Finally, the authors use a linear combination to determine the final re-ranking of the candidate pages:

R(w) = α · θ(w) + (1 − α) · σ(w),  α ∈ [0, 1],    (23)

where α is the adjusting parameter.
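Putting the pieces together, Equation 23 is a straightforward linear blend of θ and σ; the self-contained sketch below (illustrative values, not from the paper) re-scores and re-sorts a list of fetched pages:

import math

def rerank(pages, alpha=0.5):
    """Equation 23: blend the importance score theta (Eq. 19) with the
    relevance score sigma (Eq. 22) and sort in descending order.
    `pages` is a list of (original_position, sigma) pairs with 1-based
    positions; alpha in [0, 1] weights the original ranking."""
    n = len(pages)
    theta = lambda pos: (1.0 - (pos - 1) / n) / math.log2(pos + 1)
    scored = [(alpha * theta(pos) + (1 - alpha) * sigma, pos)
              for pos, sigma in pages]
    return sorted(scored, reverse=True)

# A highly relevant page originally ranked 5th overtakes the original
# top hits once semantic relevance carries enough weight.
pages = [(1, 0.30), (2, 0.40), (3, 0.20), (4, 0.10), (5, 0.95)]
for score, pos in rerank(pages, alpha=0.3):
    print(f"original position {pos}: combined score {score:.3f}")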

In their experiments, the authors report that setting α to 0 produces the worst result, and that the results improve as α increases.

Since the algorithm performs a full-text scan of each candidate page, the computation is intensive. However, it can be done off-line.



5 Conclusion and Future Research Direction

This report presents a summary of my learning process in search ranking algorithms and the Semantic Web. It is by no means a thorough and complete review of the field, nor do I attempt to draw conclusions about the competitive edges of these ranking algorithms. The algorithms surveyed in this paper are primarily generic, non-domain-specific, broad-topic, textual-term-based search algorithms. In addition to generic search, localized search (such as Yelp), industry search (such as airline search), language-specific search, and domain-specific search, to name a few, are other landscapes in the land of Internet search. Re-ranking is gaining attention in both industry and research as a means to provide high-quality and highly relevant search results.

It is my intent to conduct further research in the area of domain-specific semantic re-ranking algorithms and systems.



References

[1] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: bringing order to the web. 1999.

[2] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.

[3] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

[4] David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In Proceedings of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and space—structure in hypermedia systems, pages 225–234. ACM, 1998.

[5] Ronny Lempel and Shlomo Moran. SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (TOIS), 19(2):131–160, 2001.

[6] Yanhong Li. Toward a qualitative search engine. IEEE Internet Computing, 2(4):24–29, 1998.

[7] Yanhong Li and Larry Rafsky. Beyond relevance ranking: Hyperlink vector voting. In RIAO, volume 97, pages 648–650, 1997.

[8] Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal Doshi, and Joel Sachs. Swoogle: a search and metadata engine for the semantic web. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 652–659. ACM, 2004.

[9] Tim Finin, Li Ding, Rong Pan, Anupam Joshi, Pranam Kolari, Akshay Java, and Yun Peng. Swoogle: Searching for knowledge on the semantic web. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 1682. AAAI Press / MIT Press, 2005.

[10] Thomas Franz, Antje Schultz, Sergej Sizov, and Steffen Staab. TripleRank: Ranking semantic web data by tensor decomposition. Springer, 2009.



[11] Wang Wei, Payam Barnaghi, and Andrzej Bargiela. Rational research model for ranking semantic entities. Information Sciences, 181(13):2823–2840, 2011.

[12] Ruofan Wang, Shan Jiang, Yan Zhang, and Min Wang. Re-ranking search results using semantic similarity. In Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on, volume 2, pages 1047–1051. IEEE, 2011.

[13] Wendy Hall and Thanassis Tiropanis. Web evolution and web science. Computer Networks, 56(18):3859–3865, 2012.

[14] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. Computer, 32(6):67–71, 1999.

[15] Massimo Marchiori. The quest for correct information on the web: Hyper search engines. Computer Networks and ISDN Systems, 29(8):1225–1235, 1997.

[16] Jean-Loup Guillaume and Matthieu Latapy. The webgraph: an overview. In Actes d'ALGOTEL'02 (Quatriemes Rencontres Francophones sur les aspects Algorithmiques des Telecommunications), 2002.

[17] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. Computer Networks, 33(1):309–320, 2000.

[18] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. Graph structure in the web—revisited: a trick of the heavy tail. In Proceedings of the companion publication of the 23rd international conference on World Wide Web, pages 427–432. International World Wide Web Conferences Steering Committee, 2014.

[19] Jihyun Lee, Jun-Ki Min, Alice Oh, and Chin-Wan Chung. Effective ranking and search techniques for web resources considering semantic relationships. Information Processing & Management, 50(1):132–155, 2014.

[20] Franz Baader. The Description Logic Handbook: theory, implementation, and applications. Cambridge University Press, 2003.

[21] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, and Panayiotis Tsaparas. Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology (TOIT), 5(1):231–297, 2005.



[22] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[23] Vikas Jindal, Seema Bawa, and Shalini Batra. A review of ranking approaches for semantic search on web. Information Processing & Management, 50(2):416–425, 2014.

[24] Bulent Yener, Evrim Acar, Pheadra Aguis, Kristin Bennett, Scott L. Vandenberg, and George E. Plopper. Multiway modeling and analysis in stem cell systems biology. BMC Systems Biology, 2(1):63, 2008.

[25] Richard A. Harshman and Margaret E. Lundy. PARAFAC: Parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39–72, 1994.

[26] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[27] Kalervo Jarvelin and Jaana Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.

[28] Claudia Leacock and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283, 1998.

[29] Michael Sussna. Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the second international conference on Information and knowledge management, pages 67–74. ACM, 1993.
