
An Algorithm for Construction of High Efficient Web Page Tree


1,3 Debajyoti Mukhopadhyay, *2,3 Sukanta Sinha

1 Calcutta Business School, D.H. Road, Bishnupur 743503, India, [email protected]

*2 Corresponding Author, Tata Consultancy Services, Whitefield Rd, Bangalore 560066, India, [email protected]

3 WIDiCoReL, Green Tower C-9/1, Golf Green, Kolkata 700095, India

doi: 10.4156/jcit.vol5.issue5.5

Abstract

Finding information on the Internet requires a search engine, but users typically do not want to spend a long time searching for one topic. To offer a better solution, we introduce a Web search engine that uses an Ontology to retrieve relevant pages. To make the search faster, we incorporate a new concept: an Index based on a High Efficient Relevant page Tree. This tree contains the relevant pages and is generated from the Relevant Page Tree built during the original crawl.

Keywords: Search engine, Ontology, Ontology based search, Relevance, Domain specific search, Relevance Page Tree, High Efficient Relevance Tree

1. Introduction

The Internet is a huge reservoir of information. To find information on the Internet we need a document retrieval system called a search engine. A Web search engine mainly searches for documents in the World Wide Web (WWW) [1][2]. Researchers have earlier proposed various methods for retrieving relevant information faster from the Web, including focused crawling and mining the Web’s link structure [3][4]. Further research was carried out to make crawling even faster and more focused on the relevant topic [5][6][7]. However, the concept of Ontology was not included in those works, which made searching take longer because crawling was done over the entire WWW.

In our proposed work, we introduce the concept of Ontology to make searching more efficient and more relevant to the needs of the user. In one of our earlier works, we proposed the concept of Ontology together with the construction of a Relevant Page Tree (RPaT) [8]. A Relevant Page Tree is a tree constructed by a domain specific crawler, and it consists of the Web pages of one particular domain. However, retrieving data from this model took a long time, especially when handling large data stores.

Against this background, we propose a new High Efficient Relevance page Tree (HERT), which is generated from RPaT and contains the same number of Web pages. In HERT, we use an indexing mechanism and divide the tree into multiple levels based on relevance ranges, which offers faster access to Web pages. The proposed approach covers both the construction of HERT from RPaT and a technique for searching HERT.

This paper discusses Ontology based domain specific searching in Section 2. Section 3 describes the existing model, the Relevant Page Tree. Section 4 presents the proposed approach, and Section 5 shows the performance of our model using a graph plot. Finally, Section 6 concludes the paper.

2. Ontology Based On Domain Specific Searching

In this section we describe domain specific searching and how an ontology is used to find domain specific pages.

2.1. Domain Specific Searching

Journal of Convergence Information Technology Volume 5, Number 5, July 2010

As the name suggests, domain specific searching means searching for a topic within a particular domain. To search within a domain, we use the ontology for that domain. An Ontology is basically a set of information kept in an organized way [9] [10]; it is a formal description of concepts and the relationships between them. Each domain uses a different ontology. When we search for a topic, we select the ontology related to its domain and retrieve the pages related to that domain. Definitions associate the names of entities in the ontology with human-readable text that describes what the names mean. The ontology can also contain rules that constrain the interpretation and use of these terms.

2.2. Ontology Based Web Page Finding

An ontology can be used to find domain specific Web pages. A domain specific crawler uses an ontology to describe the area of interest, in the same way as a search in a search engine uses a list of keywords to describe the area of interest. A problem with standard keyword based search queries is that it is difficult to express advanced queries. With an ontology it is possible to express richer and more accurate queries. The system has an ontology that describes the area in which the search will be performed, and the user enters parameters that say what should be weighted in the search. The program then crawls the Web for pages containing text that describes the area given by the ontology.

Definition 1. Ontology - A set of domain related key information which is kept in an organized way based on its importance.

Definition 2. Seed URL - A set of base URLs from which the crawler starts crawling Web pages from the Internet.

Definition 3. Weight Table - A table with two columns; the first column holds the Ontology terms and the second column holds the weight value of each Ontology term. A weight value must lie between 0 and 1.

Definition 4. Syntable - A table with two columns; the first column holds the Ontology terms and the second column holds the synonyms of each term. If a term has more than one synonym, the synonyms are separated by commas (,).

Definition 5. Relevance Limit - A predefined static relevance cut-off value used to decide whether a Web page is domain specific or not.

3. Existing Model of Relevant Page Tree

In this section we describe the Relevant Page Tree (RPaT) and how it is generated. Every crawler needs some seed URLs to retrieve Web pages [8]. To retrieve relevant Web pages we need the Ontology, Weight Table and Syntable [11] [12]. First the crawler takes one seed URL and calculates its relevance value. If the relevance value is greater than the predefined Relevance Limit, the page is called a relevant page and the crawler keeps it; otherwise the crawler rejects it. The crawler then crawls through the relevant pages until it reaches a certain predefined depth. After that it takes another seed URL and repeats these operations until the seed URL database becomes empty. These operations are performed directly over the Internet [13], and the resulting graph is called the Relevant Page Tree (RPaT).

Figure 1. Relevant Page Tree from Original Crawling
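The crawl described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' crawler: `build_rpat`, `fetch_links` and `relevance` are hypothetical names, the Web is replaced by a small in-memory link table, and the relevance limit of 10 and depth of 3 are made up. The toy pages and relevance values mirror the rows listed in Table 2.

```python
from collections import deque

def build_rpat(seed_urls, fetch_links, relevance, relevance_limit, max_depth):
    """Return {url: (relevance_value, [parent_urls])} for every accepted page.

    fetch_links(url) and relevance(url) are caller-supplied stand-ins
    (assumptions) for downloading a page and scoring it against the ontology.
    """
    rpat = {}
    for seed in seed_urls:                    # one seed at a time, until none remain
        queue = deque([(seed, None, 0)])      # (url, parent, depth)
        while queue:
            url, parent, depth = queue.popleft()
            if url in rpat:                   # already accepted: record the extra parent
                if parent is not None and parent not in rpat[url][1]:
                    rpat[url][1].append(parent)
                continue
            score = relevance(url)
            if score <= relevance_limit:      # page does not cross the Relevance Limit
                continue
            rpat[url] = (score, [] if parent is None else [parent])
            if depth < max_depth:             # stop following links at the depth limit
                for child in fetch_links(url):
                    queue.append((child, url, depth + 1))
    return rpat

# Toy data: links and relevance values taken from the rows of Table 2.
toy_links = {"a": ["d", "e"], "d": ["j"], "e": ["j"]}
toy_scores = {"a": 32, "d": 23, "e": 13, "j": 46}
tree = build_rpat(["a"], lambda u: toy_links.get(u, []), toy_scores.get,
                  relevance_limit=10, max_depth=3)
# tree["j"] is (46, ["d", "e"]): page j is reached from both d and e
```

Because a page can be reached from several accepted pages, the result is strictly a graph; storing a list of parents per page is one way to keep that information for later parent selection.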

The Relevant Page Tree (RPaT) is shown in Fig.1. Each node in RPaT contains two parts: the page URL and the Relevance Value. Here pages a, b and c are seed URLs, and their relevance values are 32, 15 and 25 respectively.

4. Proposed Approach

In our approach we construct HERT from RPaT. We first generate RPaT from the original crawl and then use this RPaT as the input for HERT construction. Because RPaT is based on an Ontology and HERT is built from the RPaT pages, HERT is based on the same Ontology.

4.1. High Efficient Relevant Page Tree

In this section we describe the High Efficient Relevant Page Tree (HERT). The name has two parts: High Efficient and Relevant Page. High Efficient means fast access, i.e. reduced search time, and Relevant Page means pages related to our domain, i.e. to our Ontology. HERT holds the relevant pages in an organized way and is generated from RPaT; a sample HERT is shown in Fig.2. Because the RPaT pages are related to an Ontology, the HERT generated from that RPaT is related to the same Ontology. Each node of HERT contains a page URL and a relevance value. HERT is divided into relevance span levels, and each level has an Index: Index 0 points to the Root page, Index 1 points to the next level, and so on. To construct HERT we need the “Maximum Relevance Span Value” (α), the “Minimum Relevance Span Value” (β) and the “Number of Relevance Span Levels” (n), from which the Gap Factor is calculated using the formula given below:

Gap Factor (ρ) = (α - β) / n.

Now we define ranges such as β to β + ρ, β + ρ to β + 2ρ, β + 2ρ to β + 3ρ and so on.
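The gap factor and the resulting ranges can be computed as in the following sketch. The function names `span_ranges` and `span_level` are hypothetical. Following the worked example in Section 4.5, each range is treated as open at its lower end and closed at its upper end, and the top range has no upper bound; values at or below β are reported as belonging to no range, which is an assumption.

```python
def span_ranges(alpha, beta, n):
    """Return (rho, ranges); ranges[i] is (low, high], and the last high is None."""
    if n <= 0 or alpha <= beta:
        raise ValueError("need n > 0 and alpha > beta")
    rho = (alpha - beta) / n                      # Gap Factor
    ranges = []
    for i in range(n):
        low = beta + i * rho
        high = beta + (i + 1) * rho if i < n - 1 else None   # None = open-ended
        ranges.append((low, high))
    return rho, ranges

def span_level(value, ranges):
    """0-based position of the range containing value, or None if value <= beta."""
    for i, (low, high) in enumerate(ranges):
        if value > low and (high is None or value <= high):
            return i
    return None

rho, ranges = span_ranges(alpha=50, beta=10, n=4)
# rho is 10.0 and ranges is [(10.0, 20.0), (20.0, 30.0), (30.0, 40.0), (40.0, None)]
# span_level(32, ranges) is 2, i.e. the range above 30 and up to 40
```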

Figure 2. High Efficient Relevant Page Tree (HERT)

4.2. Searching Technique

In this section we describe the searching techniques of both HERT and RPaT.

4.2.1. RPaT Searching Technique

RPaT searching generally uses a level based traversal algorithm. Suppose we want to find Page ‘h’ in Fig.1. The search starts from Page ‘a’, then visits the next Page ‘b’, and so on; the path is shown in Fig.3.

Figure 3. RPaT Searching
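A level based traversal of this kind can be sketched as a breadth-first search that starts from the seed pages. The function name `rpat_search` and the small child table below are hypothetical; in particular, the edges are invented for illustration and are not a transcription of Fig.1.

```python
from collections import deque

def rpat_search(children, seeds, target):
    """Visit pages level by level, left to right.

    Returns the list of pages visited up to and including target,
    or None when target is not reachable from the seeds.
    """
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        visited.append(page)
        if page == target:
            return visited
        for child in children.get(page, []):
            if child not in seen:          # a page with two parents is visited once
                seen.add(child)
                queue.append(child)
    return None

# Hypothetical tree: three seeds on the first level, five pages below them.
toy_children = {"a": ["d", "e"], "b": ["f"], "c": ["g", "h"]}
path = rpat_search(toy_children, ["a", "b", "c"], "h")
# path is ["a", "b", "c", "d", "e", "f", "g", "h"]: every earlier page is visited first
```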

4.2.2. HERT Searching Technique

Now suppose we want to find the same Page ‘h’ in Fig.2. We first look up the RANGE_INDEX table shown in Table 1, which contains the index of each range. For a given search we first find the range, then find the index that corresponds to that range, and start searching from that index using linear search. Fig.4 shows how searching takes place in HERT.

Figure 4. HERT Searching

Table 1. RANGE_INDEX Table

INDEX    STR_RNG  END_RNG
Index0   **       **
Index1   40       **
Index2   30       40
Index3   20       30
Index4   10       20
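The lookup in Table 1 followed by a linear scan can be sketched as below. This is an illustration under assumptions: the caller is assumed to know the relevance value (or at least the range) of the page being sought, Index0 (the root) is left out, `find_index` and `hert_search` are hypothetical names, and the level contents, including the relevance value 22 for Page ‘h’, are invented for the example.

```python
# Rows of Table 1 as (index, STR_RNG, END_RNG); None stands for '**'.
RANGE_INDEX = [
    ("Index1", 40, None),
    ("Index2", 30, 40),
    ("Index3", 20, 30),
    ("Index4", 10, 20),
]

def find_index(rel_val):
    """Return the index whose range (STR_RNG, END_RNG] contains rel_val, else None."""
    for name, start, end in RANGE_INDEX:
        if rel_val > start and (end is None or rel_val <= end):
            return name
    return None

def hert_search(levels, target_url, target_rel):
    """Jump to the level of target_rel, then scan only that level linearly.

    levels maps an index name to a list of (url, rel_val) pairs.
    Returns the pages visited up to and including the target, or None.
    """
    visited = []
    for url, _rel in levels.get(find_index(target_rel), []):
        visited.append(url)
        if url == target_url:
            return visited
    return None

# Hypothetical level contents, ordered by descending relevance within a level.
toy_levels = {
    "Index1": [("j", 46)],
    "Index2": [("a", 32)],
    "Index3": [("c", 25), ("n", 24), ("d", 23), ("h", 22)],
    "Index4": [("b", 15), ("e", 13)],
}
path = hert_search(toy_levels, "h", 22)
# path is ["c", "n", "d", "h"]: pages on the other levels are never touched
```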

4.3. Our Challenges

To construct HERT from RPaT we first define the ranges. Suppose there are four ranges: x1-x2, x2-x3, x3-x4 and above x4. Now suppose that no page belongs to the range x3-x4, while the other ranges contain many pages. Such a situation would hamper HERT construction, so we introduce the Dummy Page concept to overcome it. Initially we build HERT using only dummy pages. Fig.5 shows a HERT with four levels; each level contains a dummy page that satisfies the relevance span of that level.

Figure 5. Dummy Pages for HERT initialization

4.4. Generation of RPaT

In this section we describe how RPaT is generated [3]. We first crawl Web pages from the Web. Then, for each crawled Web page, we calculate its relevance value using the Weight Table and Syntable of our ontology terms. Table 2 shows how Web pages are stored in the database. Each URL has a page id (P_ID), a relevance value and up to ten parent page ids (PP_ID1 to PP_ID10). Suppose Page ‘a’ is a Seed URL, which has no parent page. In that case we set PP_ID1 to ‘*’ and the other PP_IDs to ‘**’, where ‘**’ denotes that the field is not applicable. Page ‘j’, on the other hand, is linked from two parents, ‘d’ and ‘e’. Therefore we store parent page ‘d’ in PP_ID1 and parent page ‘e’ in PP_ID2, and set the other PP_IDs to ‘**’.

Table 2. RPaT Page Repository

P_ID  URL  REL_VAL  PP_ID1  PP_ID2  …  PP_ID10
1     a    32       *       **      …  **
…
4     d    23       1       **      …  **
5     e    13       1       **      …  **
…
10    j    46       4       5       …  **
14    n    24       10      **      …  **
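One possible in-memory form of these repository rows is sketched below. The class `RPaTRecord` and the helper `select_parent` are hypothetical names; the parent columns are folded into a list, so a seed URL simply has an empty list instead of ‘*’ and ‘**’ markers. The helper picks the parent with the highest relevance value, which is the choice Step4 of the algorithm in Section 4.5 makes when a page has several parents.

```python
from dataclasses import dataclass, field

@dataclass
class RPaTRecord:
    p_id: int
    url: str
    rel_val: float
    parent_ids: list = field(default_factory=list)   # at most ten entries (PP_ID1..PP_ID10)

def select_parent(record, repository):
    """Return the P_ID of the parent with the highest relevance, or None for a seed."""
    if not record.parent_ids:
        return None
    return max(record.parent_ids, key=lambda pid: repository[pid].rel_val)

# The five rows shown in Table 2.
rows = [
    RPaTRecord(1, "a", 32),
    RPaTRecord(4, "d", 23, [1]),
    RPaTRecord(5, "e", 13, [1]),
    RPaTRecord(10, "j", 46, [4, 5]),
    RPaTRecord(14, "n", 24, [10]),
]
repository = {r.p_id: r for r in rows}
chosen = select_parent(repository[10], repository)
# chosen is 4: page d (relevance 23) outranks page e (relevance 13)
```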

4.5. Algorithm for Construction of HERT from RPaT

In this section we propose an algorithm that generates HERT from RPaT. The inputs are the RPaT constructed from Original Crawling, the Number of Relevance Span Levels, the Maximum Relevance Span value and the Minimum Relevance Span value. From these inputs we construct HERT using our algorithm.

INPUT: Relevant Page Tree (RPaT) constructed from Original Crawling, Number of Relevance Span Levels, Maximum Relevance Span and Minimum Relevance Span.

OUTPUT: High Efficient Relevant Page Tree (HERT).

Step1: Take the RPaT constructed from Original Crawling, the Number of Relevance Span Levels, the Maximum Relevance Span and the Minimum Relevance Span from the user, and generate one Dummy Page for each Relevance Span Level.

Step2: Take one Page ‘P’ from RPaT, check its Relevance value and find its Relevance Span Level.

Step3: If this Relevance Span Level contains only a Dummy Page Then replace the Dummy Page.

Step4: If more than one Parent is found in RPaT Then select the page with the highest relevance value as the parent of Page ‘P’ in RPaT.

Step5: Find the Parent of Page ‘P’ in HERT and insert Page ‘P’ in HERT as follows:

a) If Relevance Span of Parent in RPaT > Relevance Span of ‘P’ Then find the parent page in the parent Relevance Span Level of HERT.
   i) If not found Then go to the higher levels of HERT until the parent is found.
   ii) After finding the parent of ‘P’ in HERT, If a Child exists Then come down through the left most Child and add Page ‘P’ to HERT on the corresponding Relevance Span Level such that all pages on its left side have a higher relevance value than ‘P’. Otherwise, add Page ‘P’ as the left most page of the corresponding Relevance Span Level and make the Parent of the Next Page the Parent of Page ‘P’.

b) If Relevance Span of Parent in RPaT = Relevance Span of ‘P’ Then add Page ‘P’ to HERT as the Next Right Child of ‘Parent of Parent of P in RPaT’.

c) If Relevance Span of Parent in RPaT < Relevance Span of ‘P’ Then Page ‘P’ is called an Orphan Page; add Page ‘P’ to HERT on its Relevance Span Level such that all pages on the left side of Page ‘P’ always have higher relevance. Now, If Page ‘P’ is the Right Most Page in HERT Then the Parent of ‘P’ is the Parent of the Left Side Page, Otherwise the Parent of the Right Side Page.

Step6: GOTO Step2 until all the pages in RPaT have been traversed.

Step7: End.
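The level bookkeeping of Steps 1 to 4 can be sketched as follows. This is a deliberately simplified illustration and not the full algorithm: it creates the dummy pages, places each page on its relevance span level in descending order of relevance, and only reports which branch of Step5 (a, b or c) would apply; the parent re-linking inside HERT is not reproduced. The name `build_hert_levels` is hypothetical, level 0 is the lowest range (the reverse of the Index numbering in Table 1), and placing values at or below the minimum span on level 0 is an assumption.

```python
def build_hert_levels(pages, alpha, beta, n):
    """pages: (url, rel_val, parent_rel_vals) tuples in RPaT level order.

    Returns (levels, cases). levels[i] lists (url, rel_val) for span level i,
    highest relevance first. cases maps url to 'seed', 'a', 'b' or 'c'.
    """
    rho = (alpha - beta) / n                          # Gap Factor

    def span(value):                                  # ranges are (low, high]
        for i in range(n - 1):
            if value <= beta + (i + 1) * rho:
                return i
        return n - 1                                  # top range is open-ended

    levels = [[(f"dum{i + 1}", None)] for i in range(n)]   # Step1: one dummy per level
    cases = {}
    for url, rel, parent_rels in pages:               # Step2: take the next page P
        lvl = span(rel)
        row = levels[lvl]
        if len(row) == 1 and row[0][1] is None:       # Step3: level holds only a dummy
            row[0] = (url, rel)
        else:                                         # keep higher relevance on the left
            pos = 0
            while pos < len(row) and row[pos][1] >= rel:
                pos += 1
            row.insert(pos, (url, rel))
        if not parent_rels:
            cases[url] = "seed"
        else:
            parent_span = span(max(parent_rels))      # Step4: highest-relevance parent
            cases[url] = ("a" if parent_span > lvl
                          else "b" if parent_span == lvl else "c")
    return levels, cases

# Pages a, b, c (seeds) plus the Table 2 rows d, e, j, n, with parent relevance values.
sample = [("a", 32, []), ("b", 15, []), ("c", 25, []), ("d", 23, [32]),
          ("e", 13, [32]), ("j", 46, [23, 13]), ("n", 24, [46])]
levels, cases = build_hert_levels(sample, alpha=50, beta=10, n=4)
# levels[1] is [("c", 25), ("n", 24), ("d", 23)]; cases["j"] is "c" (an Orphan Page)
```

Keeping every level sorted is what lets a later search stop scanning early or follow a fixed order; the branch labels show where the parent handling of Step5 would plug in.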

Now let us consider an example in which we construct HERT from RPaT using the above algorithm. Here the RPaT of Fig.1 is taken as input, the Maximum Relevance Span Value is 50, the Minimum Relevance Span Value is 10 and the Number of Relevance Span Levels is 4. To generate the ranges we calculate the gap factor: ρ = (50 - 10) / 4 = 10. The ranges are therefore >10 && <=20, >20 && <=30, >30 && <=40, and >40. We then generate an initial HERT that contains only Dummy Pages, as shown in Fig.5 and described in Section 4.3. The steps below show how our algorithm generates HERT from RPaT.

We traverse the RPaT pages level-wise, from left to right. First we take Page ‘a’ from RPaT, check its relevance value and find its relevance span level. We observe that the relevance span level of Page ‘a’ contains only a dummy page, so according to Step3 of our algorithm we replace the dummy page. Fig.6 shows how HERT looks after dummy page dum3 is replaced by Page ‘a’.

Figure 6. HERT after Insertion of Page a

According to Step3 of the algorithm, we insert Page ‘b’ and Page ‘c’ in the same way, simply replacing the dummy pages. HERT after insertion of Page ‘b’ and Page ‘c’ is shown in Fig.7.

Figure 7. HERT after Insertion of Page b and Page c

Now we insert Page ‘d’ into HERT using Step5 (a) (ii) of our algorithm (Fig.8).

Figure 8. HERT after Insertion of Page d

Next we consider Page ‘e’ and insert it into HERT using Step5 (a) of our algorithm. First we find the parent using Step5 (a) (i), and then we insert Page ‘e’ into HERT using Step5 (a) (ii). Here the parent of Page ‘e’ in RPaT is Page ‘a’. So we come down through the left child of Page ‘a’, reach Page ‘c’, and add Page ‘e’ at the proper child position of Page ‘c’. Fig.9 shows HERT after insertion of Page ‘e’.

Figure 9. HERT after Insertion of Page e

By using Step5 (b) of our algorithm, Page ‘f’ is inserted in HERT (Fig.10). The parent of Page ‘f’ in RPaT is Page ‘b’, and both pages are in the same relevance range level. Here “Parent of ‘Parent of Page f in RPaT’ in HERT” is Page ‘c’, and we add Page ‘f’ at the next right child position of Page ‘b’.

Figure 10. HERT after Insertion of Page f

When we consider Page ‘g’, we find that the relevance value of its parent page in RPaT is lower than the relevance value of Page ‘g’ itself. According to our algorithm, Page ‘g’ is therefore an Orphan Page, and it is inserted into HERT using Step5 (c). Fig.11 shows HERT after insertion of Page ‘g’.

Figure 11. HERT after Insertion of Page g

Now we consider Page ‘h’. We find that the relevance value of its parent page in RPaT is higher than the relevance value of Page ‘h’, and that the parent page also belongs to the parent relevance span level in HERT. So we insert Page ‘h’ using Step5 (a) (ii) of our algorithm. Fig.12 shows HERT after insertion of Page ‘h’.

Figure 12. HERT after Insertion of Page h

When we consider Page ‘i’, we find that the relevance value of its parent page in RPaT is higher than the relevance value of Page ‘i’, and that the parent page also belongs to the parent relevance span level in HERT. So we insert Page ‘i’ using Step5 (a) (ii) of our algorithm. Fig.13 shows HERT after insertion of Page ‘i’.

Figure 13. HERT after Insertion of Page i

When we consider Page ‘j’, we find that the relevance value of its parent page in RPaT is lower than the relevance value of Page ‘j’ itself. According to our algorithm, Page ‘j’ is therefore an Orphan Page, and it is inserted into HERT using Step5 (c) (see Fig.14).

Figure 14. HERT after Insertion of Page j

Now we consider Page ‘k’, and again we find that the relevance value of its parent page in RPaT is lower than the relevance value of Page ‘k’ itself. According to our algorithm, Page ‘k’ is therefore an Orphan Page, and it is inserted into HERT using Step5 (c) (Fig.15).

Figure 15. HERT after Insertion of Page k

When we insert Page ‘l’ into HERT, we find that Page ‘l’ has two parents in RPaT. So we select the parent using Step4 and then insert Page ‘l’ using Step5 (a) (Fig.16).

Figure 16. HERT after Insertion of Page l

When we consider Page ‘m’, we see that the relevance value of its parent page in RPaT is lower than the relevance value of Page ‘m’ itself. According to our algorithm, Page ‘m’ is also an Orphan Page, and it is inserted into HERT using Step5 (c). Fig.17 shows HERT after insertion of Page ‘m’.

Figure 17. HERT after Insertion of Page m

Again, when we insert Page ‘n’, we see that the parent of Page ‘n’ in RPaT is Page ‘j’ and that the relevance value of Page ‘j’ is higher than that of Page ‘n’, but the relevance span level difference is 2. Also, Page ‘j’ does not yet have any child. So we insert Page ‘n’ using Step5 (a) (ii) (see Fig.18).

Figure 18. HERT after Insertion of Page n

When we insert Page ‘o’ into HERT, we find that Page ‘o’ has two parents in RPaT. So we select the parent using Step4 and then insert Page ‘o’ using Step5 (a). In the same manner, we insert all RPaT pages into HERT using our algorithm and finally obtain Fig.19.

Figure 19. HERT after Insertion of all RPaT Pages

5. Performance Analysis

In this section we describe our test settings and the performance of our system.

5.1. Test Settings

In this section we describe the parameter settings used to generate the HERT page repository.

5.1.1. RPaT Pages

The RPaT page repository is used as the input for generating the HERT page repository. To build the RPaT page repository, we use seed URLs, a Weight Table and a Syntable for a particular domain.

Seed URLs are the set of URLs from which the crawler starts crawling, i.e., downloading Web pages from the WWW. To start the crawler we provide some seed URLs chosen according to our Ontology, as shown in Table 3.

Table 3. Seed URLs

http://icc-cricket.yahoo.com/
http://www.cricketnext.com/index.html
http://www.cricketworld.com/

Each Ontology term has a certain importance within its domain, and we assign a weight to each Ontology term based on that importance. The strategy for assigning weights is that a more significant term receives a higher weight, while terms that are common to more than one domain receive a lower weight. Some weight values for the corresponding Ontology terms are shown in Table 4.

Table 4. Weight Table

Ontology term   Weight
cricket         0.9
wicket keeper   0.8
umpire          0.4
bat             0.2
match           0.1

To get appropriate results for a domain specific Web page we use the Syntable, which contains two fields: one for the Ontology term and another for the Synterm. In this table we store all Ontology terms and their synonyms. Some of them are shown in Table 5.

Table 5. Syntable

Ontology term   Synterm
Match           competition, contest
Stamp           stick, wicket
Ball            conglobate, conglomerate
Umpire          judge, moderator, referee
Catch           capture
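The paper does not spell out how the relevance value is computed from these two tables, so the following is only a sketch of one plausible scheme: every occurrence of an Ontology term, or of one of its synonyms, adds the weight of that term to the score. The function name `relevance` and the scoring rule itself are assumptions. The weights come from Table 4; only the Table 5 synonyms whose terms also appear in Table 4 are used, with the terms lowercased.

```python
import re

WEIGHTS = {"cricket": 0.9, "wicket keeper": 0.8, "umpire": 0.4, "bat": 0.2, "match": 0.1}
SYNONYMS = {"match": ["competition", "contest"],
            "umpire": ["judge", "moderator", "referee"]}

def relevance(text, weights=WEIGHTS, synonyms=SYNONYMS):
    """Assumed score: sum of weight * occurrences of each term or its synonyms."""
    text = text.lower()
    score = 0.0
    for term, weight in weights.items():
        for phrase in [term] + synonyms.get(term, []):
            # count whole-word (or whole-phrase) matches only
            pattern = r"\b" + re.escape(phrase) + r"\b"
            score += weight * len(re.findall(pattern, text))
    return score

score = relevance("The umpire stopped the cricket match; the referee agreed.")
# contributions: cricket 0.9, umpire 0.4, referee (synonym of umpire) 0.4, match 0.1
```

A page would then be accepted when its score exceeds the Relevance Limit of Definition 5. Whole-word matching keeps a short term such as bat from matching inside longer words.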

Using the seed URLs, term weights and term synonyms, we get the RPaT records; some of them are shown below. These records are one input for generating HERT. The other inputs are maximum span value = 25, minimum span value = 5 and number of relevance span levels = 4. Each record contains the fields P_ID - URL - REL_VAL - PP_ID1 - PP_ID2 - PP_ID3 - PP_ID4 - PP_ID5 - PP_ID6 - PP_ID7 - PP_ID8 - PP_ID9 - PP_ID10.

1 - http://icc-cricket.yahoo.com/ - 10.7 - * - ** - ** - ** - ** - ** - ** - ** - ** - **,
51 - http://icc-cricket.yahoo.com/news/news.html - 6.2 - 1 - ** - ** - ** - ** - ** - ** - ** - ** - **,
52 - http://icc-cricket.yahoo.com/media-release/media-release.html - 7.2 - 51 - ** - ** - ** - ** - ** - ** - ** - ** - **,
53 - http://icc-cricket.yahoo.com/news/media-notes.html - 5.8 - 51 - ** - ** - ** - ** - ** - ** - ** - ** - **,
54 - http://icc-cricket.yahoo.com/scorecards/schedule1.html - 8.9 - 51 - ** - ** - ** - ** - ** - ** - ** - ** - **,
55 - http://icc-cricket.yahoo.com/scorecards/results.html - 6.5 - 51 - ** - ** - ** - ** - ** - ** - ** - ** - **,
56 - http://icc-cricket.yahoo.com/wcl/news-wcl.html - 7.1 - 51 - ** - ** - ** - ** - ** - ** - ** - ** - **.

5.2. Test Results

In this section we show some test results through graph plots.

5.2.1. HERT Pages

In HERT, we maintain a separate database table for each level. We use four tables, HERT_DB_LEVEL_1, HERT_DB_LEVEL_2, HERT_DB_LEVEL_3 and HERT_DB_LEVEL_4, for Index1, Index2, Index3 and Index4 respectively. The HERT page storage distribution is shown in Fig.20.

Figure 20. HERT Page Storage Distribution

Each HERT page repository contains five fields: Page Identifier (P_ID), Uniform Resource Locator (URL), Relevance value (REL_VAL), Parent Page Identifier (PP_ID) and Next Page P_ID (NP_ID). In HERT, the Page Identifier (P_ID) comes from the RPaT Page Repository: each URL there has a unique P_ID, and we store the same P_ID for that URL in the HERT page repository. We also take the relevance value from the RPaT page repository. The PP_ID and NP_ID fields are filled in according to our algorithm. The algorithm finds the parent of a page, and we store the P_ID of that parent in the PP_ID field. The algorithm also finds the page after which the new page is inserted on the same level; we update the NP_ID field of that previous page with the P_ID of the new page, which tracks the order of the pages on the same level. Some HERT repository records are given below:

1 - http://icc-cricket.yahoo.com/ - 10.7 - * - 6,
51 - http://icc-cricket.yahoo.com/news/news.html - 6.2 - 1 - 71,
52 - http://icc-cricket.yahoo.com/media-release/media-release.html - 7.2 - 73 - 62,
53 - http://icc-cricket.yahoo.com/news/media-notes.html - 5.8 - 76 - 86,
54 - http://icc-cricket.yahoo.com/scorecards/schedule1.html - 8.9 - 69 - 63,
55 - http://icc-cricket.yahoo.com/scorecards/results.html - 6.5 - 97 - 96,
56 - http://icc-cricket.yahoo.com/wcl/news-wcl.html - 7.1 - 81 - 48.

5.2.2. Performance of HERT Searching Over RPaT Searching

Fig.21 shows the performance of HERT searching compared with RPaT searching. RPaT searching generally uses level-wise searching, whereas HERT searching uses Index based searching. Each level is represented by an index, and when we search for a page we go directly to the index of its level. We then traverse the pages of that level using linear search. In HERT we do not need to traverse pages that belong to other relevance span levels, which is why it is faster than RPaT searching.
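The linear traversal inside one level can follow the NP_ID links stored in the HERT repository of Section 5.2.1. The sketch below keeps one level as a dictionary keyed by P_ID and maintains the NP_ID chain in descending order of relevance; `insert_in_level` and `walk_level` are hypothetical names, and a plain dictionary stands in for a table such as HERT_DB_LEVEL_1. It uses the relevance values of six of the sample records, so the resulting links differ from the published NP_ID values, which come from a much larger repository.

```python
def insert_in_level(level, head, p_id, rel_val):
    """Insert a page into one level, keeping higher relevance first.

    level maps P_ID -> {"rel": REL_VAL, "np": NP_ID or None}.
    head is the P_ID of the first page on the level (None if empty).
    Returns the head after insertion.
    """
    node = {"rel": rel_val, "np": None}
    level[p_id] = node
    if head is None or level[head]["rel"] < rel_val:
        node["np"] = head                     # new page becomes the first page
        return p_id
    prev = head
    while (level[prev]["np"] is not None
           and level[level[prev]["np"]]["rel"] >= rel_val):
        prev = level[prev]["np"]
    node["np"] = level[prev]["np"]
    level[prev]["np"] = p_id                  # update the previous page's NP_ID
    return head

def walk_level(level, head):
    """Return the P_IDs of a level in NP_ID order (the linear search order)."""
    order = []
    current = head
    while current is not None:
        order.append(current)
        current = level[current]["np"]
    return order

level, head = {}, None
for p_id, rel in [(51, 6.2), (52, 7.2), (53, 5.8), (54, 8.9), (55, 6.5), (56, 7.1)]:
    head = insert_in_level(level, head, p_id, rel)
order = walk_level(level, head)
# order is [54, 52, 56, 55, 51, 53], i.e. relevance 8.9, 7.2, 7.1, 6.5, 6.2, 5.8
```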

[Figure 21 plot: Time (Sec), from 0 to 1, against Number of Relevant Pages (Thousands), from 1 to 5; series: HERT Search Time and RPaT Search Time]

Figure 21. Time taken in HERT Searching and RPaT Searching

For our comparative study, we applied a set of Search Strings to both RPaT and HERT. First we took 1000 URLs from RPaT and executed our HERT generation algorithm. Then we applied all the Search Strings to both models, RPaT and HERT, each containing 1000 URLs, and measured the search time taken by each. We then averaged the search time for each model and plotted the result. In the same way we took 2000, 3000, 4000 and 5000 URLs, calculated the search times and plotted them. The graph shows that HERT searching is faster than RPaT searching.

6. Conclusion

Web searchers use domain specific search engines to find pages related to a particular domain. However, a single domain nowadays contains a large number of pages. To retrieve information quickly, we need a structure that provides faster access. In our approach we generate a High Efficient Relevant page Tree (HERT). This tree has features such as Index based access and Relevance Span Levels. For this reason, HERT pages can be accessed faster than with RPaT search techniques.

7. References

[1] W.Willinger, R.Govindan, S.Jamin, V.Paxson and S.Shenker, “Scaling phenomena in the Internet”; in Proceedings of the National Academy of Sciences, 99, suppl. 1, pp. 2573–2580

[2] J.J.Rehmeyer, “Mapping a medusa: The Internet spreads its tentacles”; Science News 171 (June 23, 2007); pp. 387-388; Available at Sciencenews.org

[3] S.Chakrabarti, M.Berg, B.E.Dom, “Focused Crawling: a New Approach to Topic-specific Web Resource Discovery”; in Proceedings of the Eighth International World Wide Web Conference, Elsevier, Toronto, Canada; 1999; pp. 545-562

[4] S.Chakrabarti, B.E.Dom, R.Kumar, P.Raghavan, S.Rajagopalan, A.Tomkins, D.Gibson, J.Kleinberg, “Mining the Web’s Link Structure”; IEEE Computer, Vol.32, No.8, August 1999; pp.60-67

[5] A.Kundu, R.Dutta, R.Dattagupta, D.Mukhopadhyay, “Mining the Web with Hierarchical Crawlers – A Resource Sharing based Crawling Approach”; International Journal on Intelligent Information and Database Systems, Vol.3, No.1, 2009; pp.90-106

[6] A.Kundu, R.Dutta, D.Mukhopadhyay, Y.C.Kim, “A Hierarchical Web Page Crawler for Crawling the Internet Faster”; in Proceedings of the International Conference on Electronics and Information Technology Convergence, Korea; 2006; pp.61-67

[7] D.Mukhopadhyay, S.Mukherjee, S.Ghosh, S.Kar, Y.C.Kim, “Architecture of A Scalable Dynamic Parallel Web Crawler with High Speed Downloadable Capability for a Web Search Engine”; in the Proceedings of the 6th International Workshop MSPT 2006, Youngil Publication, Korea; November 2006; pp.103-108

[8] D.Mukhopadhyay, A.Biswas, S.Sinha, “A New Approach to Design Domain Specific Ontology Based Web Crawler”; 10th International Conference on Information Technology, ICIT2007, Rourkela, India; IEEE Computer Society Press, California, USA; December 17-20, 2007; pp.289-291

[9] N.F.Noy, D.L.McGuinness, “Ontology Development 101: A Guide to Creating Your First Ontology”; Available on: http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noymcguinness.html [Accessed May 2005]

[10] J.Heflin, J.Hendler, “Dynamic Ontologies on the Web”; Department of Computer Science University of Maryland College Park, MD 20742

[11] P.J.Hane, "Beyond Keyword Searching—Oingo and Simpli.com Introduce Meaning-Based Searching"; InfoToday, Posted On December 20, 1999

[12] A.Gangemi, R.Navigli, P.Velardi, "The OntoWordNet Project: Extension and Axiomatization of Conceptual Relations in WordNet"; In Proc. of International Conference on Ontologies, Databases and Applications of SEmantics (ODBASE 2003), Catania, Sicily (Italy), 2003; pp. 820-838

[13] T.Berners-Lee, “Weaving the Web:The Original Design and Ultimate Destiny of the World Wide Web by its Inventor”; New York, Harper SanFrancisco; 1999
