Refined and Incremental Centroid-based Approach for Genre Categorization of Web Pages


Chaker JEBARI

King Saud University

College of Computer & Information Sciences

Computer Science Department

[email protected]

NLPIX’2008 Workshop April 22, Beijing, China

WWW’2008 Conference

Overview

Introduction

Related works

Centroid-based categorization

My approach

Experiments

Comparison

Introduction

Web page categorization has become increasingly useful for enhancing search engine results

As the number of web pages increases every day, topic categorization alone becomes insufficient

Genre is another criterion used to classify web pages (Jebari and Ounelli, 2007)

Introduction

The genre of web pages (cybergenre) is characterized by the triple <content, form, functionality> (Shepherd and Watters, 1999).

Web genres change over time

I propose a refined and incremental approach for genre categorization of web pages

Related works

Authors | Features | Machine learning | Corpora | Accuracy

(Meyer zu Eissen and Stein, 2004) | presentation-related, closed word, text statistics, syntactic features | SVM | KI-04 | 0.70

(Kennedy and Shepherd, 2005) | content, form, functionality | Neural network | 321 web pages | 0.70

(Boese and Howe, 2005) | style, form, content | Logistic regression | KI-04 & WebKB | 0.80

(Santini, 2007) | frequencies of common words, part-of-speech trigrams, HTML tags, punctuation marks, … | Naïve Bayes | KI-04 | 0.70

(Kanaris and Stamatatos, 2007) | n-grams extracted from both text and structure | SVM | KI-04 & WebKB | 0.90 – 0.95

Centroid-based categorization finds a description (centroid or prototype) that summarizes all documents belonging (or not) to a given category.

The time and memory required by centroid-based models are proportional to the number of categories rather than to the number of training documents, as is the case for other machine learning techniques (Naïve Bayes, k-nearest neighbors, decision trees, etc.).


Centroid-based models can add more training documents and easily recalculate centroids.

Many models have been proposed to calculate centroids (Rocchio, average, sum, normalized sum models, etc).

Normalized sum is the most powerful of these models


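The centroid C_j for a category c_j is defined as follows:

$$ C_j = \frac{1}{\left\lVert \sum_{d_i \in c_j} d_i \right\rVert} \sum_{d_i \in c_j} d_i $$

A document d_i is assigned to the category with the highest similarity, calculated as follows:

$$ \mathrm{sim}(d_i, C_j) = \frac{d_i \cdot C_j}{\lVert d_i \rVert \, \lVert C_j \rVert} $$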
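The two definitions above can be illustrated with a minimal sketch. It assumes that documents are already available as equal-length term-weight vectors; the toy vectors, the genre names and the function names are invented for illustration and are not taken from the presented system.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalized_sum_centroid(docs):
    """Sum the document vectors of one category and scale the sum to unit length."""
    total = [sum(column) for column in zip(*docs)]
    length = norm(total)
    return [x / length for x in total] if length else total

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

def assign(doc, centroids):
    """Return the category whose centroid is most similar to doc."""
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))

# Toy 3-term vectors (hypothetical data).
training = {
    "article": [[2.0, 0.0, 1.0], [1.0, 0.0, 1.0]],
    "shop": [[0.0, 3.0, 0.0], [0.0, 1.0, 1.0]],
}
centroids = {c: normalized_sum_centroid(d) for c, d in training.items()}
label = assign([1.0, 0.0, 0.5], centroids)
print(label)
```

Only one vector per category is stored after training, which is why the cost of classification grows with the number of categories and not with the number of training documents.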


My approach

[Diagram of the approach. Training: training web pages, construction of centroids, centroids. Categorization: new web page, pre-processing into URL, logical structure and hypertext structure, followed by categorization and combination.]

Construction of centroids

c = {c1, …, ck}: set of k predefined categories

C = {C1, …, Cj, …, Ck}: set of genre centroids computed using the normalized sum formula

I discarded the web pages whose similarity with the centroid of their category is less than a predefined threshold S0 (noisy web pages).

For each category cj, I calculate a new set of training web pages sj as follows:

Construction of centroids

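$$ s_j = \{\, p_i \mid p_i \in c_j \ \text{and}\ \mathrm{sim}(p_i, C_j) \ge S_0 \,\} $$

where p_i is a web page and sim is the cosine similarity.

The centroid S_j obtained after refining, using the normalized sum formula, is defined as follows:

$$ S_j = \frac{1}{\left\lVert \sum_{p_i \in s_j} p_i \right\rVert} \sum_{p_i \in s_j} p_i $$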
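The refining step can be sketched as follows, again assuming that pages are plain term-weight vectors. The function names and the toy data are hypothetical, and falling back to the initial centroid when every page is discarded is an assumption of this sketch, not something stated in the slides.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

def normalized_sum(vectors):
    """Unit-length sum of a non-empty list of vectors."""
    total = [sum(column) for column in zip(*vectors)]
    length = norm(total)
    return [x / length for x in total] if length else total

def refine(pages, threshold):
    """Drop the pages of one category that are too far from its centroid.

    Returns (s_j, S_j): the retained pages and the centroid rebuilt from them.
    """
    initial = normalized_sum(pages)
    kept = [p for p in pages if cosine(p, initial) >= threshold]
    # Assumption: keep the initial centroid if no page passes the threshold.
    refined = normalized_sum(kept) if kept else initial
    return kept, refined

# Two similar pages and one outlier (hypothetical data).
pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept, refined = refine(pages, threshold=0.7)
print(len(kept), refined)
```

In this toy run the third page has a cosine similarity of about 0.50 with the initial centroid, so it is treated as noisy and the refined centroid moves towards the two remaining pages.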

Pre-Processing

Feature extraction: URL, Logical structure (the content of title and Hn tags) and Hypertext structure (the content of anchors)

Remove special characters and stop words

Stemming remaining words

Weighting terms using NormTFIDF, with its three parameters set to 0.5, -1 and -0.5 (Lertnattee and Theeramunkong, 2004)
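A rough sketch of the feature extraction and cleaning steps, using only the Python standard library. The stop-word list, the class and function names and the sample page are illustrative assumptions. Stemming and NormTFIDF weighting are deliberately left out, since the slides do not specify which stemmer was used.

```python
import re
from html.parser import HTMLParser

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # tiny illustrative list

class StructureExtractor(HTMLParser):
    """Collect text inside title/Hn tags (logical) and anchors (hypertext)."""
    LOGICAL = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.logical, self.hypertext = [], []
        self._open = []  # tags of interest that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.LOGICAL or tag == "a":
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the most recently opened tag with this name.
            index = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[index]

    def handle_data(self, data):
        if "a" in self._open:
            self.hypertext.append(data)
        if any(t in self.LOGICAL for t in self._open):
            self.logical.append(data)

def tokenize(text):
    """Lowercase, keep alphabetic runs only, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def extract_features(url, html):
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": tokenize(url),
        "logical": tokenize(" ".join(parser.logical)),
        "hypertext": tokenize(" ".join(parser.hypertext)),
    }

page = ("<html><head><title>The Course Syllabus</title></head><body>"
        "<h1>Lectures</h1><p>ignored</p>"
        "<a href='x.html'>Download notes</a></body></html>")
features = extract_features("http://example.edu/courses/cs101/index.html", page)
print(features)
```

Each of the three token lists would then feed its own classifier, so a page is represented three times, once per feature.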

Categorization of a new page

Categorization of new web pages is performed one by one (incremental categorization).

For each new web page p, I calculate its cosine similarity with all centroids.

I refine the centroids whose similarity with the page p is greater than or equal to S0.

Categorization of new pages

The refining step consists of adding the new page p to the centroid of the corresponding genre and renormalizing the centroid.

Each normalized centroid S_j is associated with the non-normalized centroid NS_j.

Refinement of the centroid S_j can be performed by the following operations:

$$ NS_j = NS_j + p \quad \text{and} \quad S_j = \frac{NS_j}{\lVert NS_j \rVert} $$
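The incremental update can be sketched as below. The `Centroid` class, the function name and the toy vectors are hypothetical. The sketch keeps the non-normalized sum next to the normalized centroid so that adding a page costs one vector addition and one renormalization.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

class Centroid:
    """Pairs the non-normalized centroid NS_j with its normalized form S_j."""

    def __init__(self, pages):
        self.ns = [sum(column) for column in zip(*pages)]  # NS_j
        self.s = self._normalize()                         # S_j

    def _normalize(self):
        length = math.sqrt(sum(x * x for x in self.ns))
        return [x / length for x in self.ns] if length else list(self.ns)

    def add(self, page):
        """NS_j = NS_j + p, then S_j = NS_j / ||NS_j||."""
        self.ns = [a + b for a, b in zip(self.ns, page)]
        self.s = self._normalize()

def categorize_incrementally(page, centroids, threshold):
    """Label the page, then refine every centroid with similarity >= threshold."""
    sims = {genre: cosine(page, c.s) for genre, c in centroids.items()}
    for genre, sim in sims.items():
        if sim >= threshold:
            centroids[genre].add(page)
    return max(sims, key=sims.get)

centroids = {"article": Centroid([[1.0, 0.0]]), "shop": Centroid([[0.0, 1.0]])}
label = categorize_incrementally([0.8, 0.2], centroids, threshold=0.5)
print(label, centroids["article"].ns)
```

Similarities are computed before any centroid is modified, so the label of a page does not depend on the order in which the centroids are updated.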

Combination

The aim is to combine the outputs of three homogeneous classifiers, which use the URL, the logical structure and the hypertext structure, respectively (Jebari, 2007).

I used decision templates for the combination (Kuncheva et al., 2001)
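As a hedged illustration of decision templates: each page yields a decision profile, a matrix with one row per classifier and one column per genre. The template of a genre is the element-wise mean of the profiles of its training pages, and a new page receives the genre whose template is closest to its profile. The squared Euclidean distance used here is one of several similarity measures discussed by Kuncheva et al.; which one the presented system uses is not stated, and the scores below are invented.

```python
def decision_template(profiles):
    """Element-wise mean of a list of (classifiers x genres) decision profiles."""
    n = len(profiles)
    rows, cols = len(profiles[0]), len(profiles[0][0])
    return [[sum(p[r][k] for p in profiles) / n for k in range(cols)]
            for r in range(rows)]

def squared_distance(a, b):
    """Squared Euclidean distance between two matrices of equal shape."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def combine(profile, templates):
    """Return the genre whose template is closest to the given profile."""
    return min(templates, key=lambda g: squared_distance(profile, templates[g]))

# Rows: URL, logical structure, hypertext structure classifiers.
# Columns: support for "article" and "shop" (hypothetical scores).
training_profiles = {
    "article": [
        [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
        [[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]],
    ],
    "shop": [
        [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
    ],
}
templates = {g: decision_template(p) for g, p in training_profiles.items()}
new_profile = [[0.6, 0.4], [0.4, 0.6], [0.8, 0.2]]
label = combine(new_profile, templates)
print(label)
```

In this toy case the logical-structure classifier alone would favour "shop", but the full profile is closer to the "article" template, which is the kind of disagreement the combination is meant to resolve.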

Experiments

Corpora:

KI-04

Genre | # of web pages

Article | 127

Download | 151

Link collection | 205

Private portrayal | 126

Non private portrayal | 163

Discussion | 127

Help | 139

Shop | 167

WebKB

Genre | # of web pages

Student | 1541

Faculty | 1063

Staff | 126

Department | 170

Project | 474

Course | 875

Experiments

Aims:

Measure the effect of vocabulary size in genre categorization of web pages

Measure the usefulness of refining, incrementing and combination in genre categorization of web pages

Comparison with other works and machine learning techniques

Experimental setup:

I used the micro-averaged accuracy as a performance measure

I used the 5×2 cross-validation methodology
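For readers unfamiliar with 5×2 cross-validation, here is a minimal sketch of how the ten train/test splits can be generated. The corpus is shuffled five times, cut in half each time, and each half is used once for training and once for testing. The function name and the fixed seed are assumptions of this sketch.

```python
import random

def five_by_two_splits(n_items, seed=0):
    """Yield ten (train_indices, test_indices) pairs: 5 replications x 2 folds."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    half = n_items // 2
    for _ in range(5):
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        yield first, second  # train on the first half, test on the second
        yield second, first  # then swap the roles

splits = list(five_by_two_splits(10))
print(len(splits))
```

Every page is used for testing exactly once per replication, so each replication produces two accuracy figures that can later be compared across classifiers.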

Results

Effect of vocabulary size: micro-averaged accuracy for each feature on both the (a) KI-04 and (b) WebKB corpora, obtained by varying the number of terms between 5 and 3000.

[Figure, panels (a) and (b): micro-averaged accuracy versus number of terms, with one curve each for URL, logical structure and hypertext structure.]

Results

Usefulness of refining: micro-averaged accuracy for each feature on both the (a) KI-04 and (b) WebKB corpora, obtained by varying the refining threshold between 0 and 1 in steps of 0.1.

[Figure, panels (a) and (b): micro-averaged accuracy versus refining threshold, with one curve each for URL, logical structure and hypertext structure.]

Results

Usefulness of incrementing: I varied the proportion of testing web pages for each feature between 10% and 90% in steps of 10%, and measured the micro-averaged accuracy on both the KI-04 (a) and WebKB (b) corpora.

[Figure, panels (a) and (b): micro-averaged accuracy versus percentage of test web pages, with one curve each for URL, logical structure and hypertext structure.]

Results

Usefulness of combination: micro-averaged accuracy for each classifier (URL, logical structure, hypertext structure and the combined classifier) on both the KI-04 (a) and WebKB (b) corpora.

[Figure: micro-averaged accuracy, on a scale from 0.7 to 1, for the URL, logical structure, hypertext structure and DT combination classifiers on KI-04 and WebKB.]

Comparison

Problems:

No publicly available, standard benchmark corpora for the genre categorization task

No agreed definition of web page genres, and each study focuses on a different set of genres

Comparison with other works:

Only Kanaris and Stamatatos (Kanaris and Stamatatos, 2007) report good micro-averaged accuracy on the KI-04 corpus, because their approach, like mine, relies on structural information (Jebari and Ounalli, 2004).

Comparison

Author | KI-04 | WebKB

[14] | 0.70 | -

[1] | 0.75 | 0.80

[9] | 0.84 | -

[17] | 0.70 | -

My approach | 0.96 | 0.98

Comparison

Comparison with other machine learning techniques:

I have compared my approach with other categorization techniques implemented in the program Rainbow (http://www.cs.cmu.edu/~mccallum/bow/rainbow/)

I have used Rocchio, Naïve Bayes (NB), K Nearest Neighbors (KNN) with K=30, SVM with Fisher kernel and TreeNode because they are widely used in genre categorization of documents.

Comparison

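KI-04

Technique | URL | Logical | Hypertext

SVM | << | ~ | <<

ROCCHIO | << | < | <<

NB | << | << | <<<

KNN | << | << | <<<

TreeNode | <<< | <<< | <<<

WebKB

Technique | URL | Logical | Hypertext

SVM | ~ | << | ~

ROCCHIO | << | < | <<

NB | << | << | <<<

KNN | <<< | << | <<<

TreeNode | <<< | <<< | <<<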

To show that the obtained results are meaningful and not due to chance, I used the 5×2 cross-validation t-test (Dietterich, 1998).
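The statistic of the 5×2 cross-validation t-test can be computed as sketched below, following the definition given by Dietterich (1998). For each of the five replications, the input holds the difference in error rate between the two classifiers on fold 1 and on fold 2. The numbers are invented, and the function name is an assumption of this sketch.

```python
import math

def five_by_two_cv_t(diffs):
    """Return the 5x2cv t statistic for five (fold 1, fold 2) difference pairs."""
    if len(diffs) != 5:
        raise ValueError("expected five replications")
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    denom = math.sqrt(sum(variances) / 5)
    if denom == 0:
        raise ValueError("all replications have zero variance")
    # The numerator is the difference observed on fold 1 of replication 1.
    return diffs[0][0] / denom

# Hypothetical error-rate differences between two classifiers.
diffs = [(0.10, 0.12), (0.11, 0.09), (0.12, 0.10), (0.10, 0.11), (0.09, 0.12)]
t = five_by_two_cv_t(diffs)
print(round(t, 3))
```

Under the null hypothesis the statistic approximately follows a t distribution with 5 degrees of freedom, so an absolute value above 2.571 indicates a difference that is significant at the 5% level in a two-sided test.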

Comparison

Time is a very important aspect for comparison.

The following figures compare the time that each classification technique needs to execute, in both the training and classification phases, for each corpus and for each feature.

Comparison

Training and test time spent for the URL feature on both the KI-04 (a) and WebKB (b) corpora.

[Figure, panels (a) and (b): test time plotted against training time for Rocchio, NB, KNN, SVM, TreeNode and my approach.]

Comparison

Training and test time spent for the logical structure feature on both the KI-04 (a) and WebKB (b) corpora.

[Figure, panels (a) and (b): test time plotted against training time for Rocchio, NB, KNN, SVM, TreeNode and my approach.]

Comparison

Training and test time spent for the hypertext structure feature on both the KI-04 (a) and WebKB (b) corpora.

[Figure, panels (a) and (b): test time plotted against training time for each classification technique.]

Conclusion

The approach proposed in this paper uses three new features (the URL address, logical and hypertext structures).

My approach implements three new aspects (refinement, incrementing and combination) that were not explored in previous studies on genre categorization.

The experiments conducted show the usefulness of each aspect in genre categorization.

The comparison with other approaches shows that my approach is the fastest and outperforms many known categorization techniques.

References

Jebari, C., and Ounalli, H. The usefulness of Logical Structure in Flexible Document Categorization. International Journal of Information Technology, 2004, vol. 1, no. 3, pp. 117-121

Jebari, C. Combining Classifiers for web page genre categorization. In "Towards Genre-Enabled Search Engines: The Impact of NLP" International Workshop held in conjunction with International Conference in Recent Advances on Natural Language Processing RANLP07, Borovets, Bulgaria. 2007.

Jebari, C., and Ounelli, H. Genre Categorization of web pages, IEEE Computer Society, 2007. ACM Press.

Shepherd, M., and Watters, C. The functionality attribute of cybergenres. In Proceedings of the 32nd Hawaiian International Conference on System Sciences, January 1999, Hawaii.

References

Meyer zu Eissen, S., and Stein, B. Genre Classification of Web Pages: User Study and Feasibility Analysis. In Biundo S., Fruhwirth T. and Palm G. (eds.). KI2004: Advances in Artificial Intelligence, Springer. Berlin-Heidelberg-New York, pp. 256-269, 2004.

Kennedy, A., and Shepherd, M. Automatic Identification of Home Pages. In Proceedings of the 38th Hawaii International Conference on System Sciences, 2005.

Boese, E. S., and Howe, A. E. Effect of web document evolution on genre classification. Proceedings of the 14th ACM International conference on Information and knowledge management, pp. 632-639. 2005.

Santini, M. Automatic identification of genre in web pages. Ph.D Thesis, University of Brighton, UK, 2007.

References

Kanaris, I., and Stamatatos, E. Webpage Genre Identification Using Variable-length Character n-grams. Proceeding of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence. 2007.

Lertnattee, V., and Theeramunkong, T. Effect of term distributions on centroid-based text categorization. Journal of Information Sciences, 2004, vol. 158, no. 1, p. 89-115.

Kuncheva, L.I., Bezdek, J.C., and Duin, R.P.W. Decision templates for multiple classifier fusion. Pattern Recognition, 34 (2), 2001, 299-314.

Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1998, 1895-1923.
