Refined and Incremental Centroid-based Approach for Genre Categorization of Web Pages. Chaker JEBARI, King Saud University, College of Computer & Information Sciences, Computer Science Department. WWW’2008 Conference, NLPIX’2008 Workshop, April 22, Beijing, China.
Refined and Incremental Centroid-based Approach for Genre Categorization of Web Pages
Chaker JEBARI
King Saud University
College of Computer & Information Sciences
Computer Science Department
NLPIX’2008 Workshop April 22, Beijing, China
WWW’2008 Conference
Overview
Introduction
Related works
Centroid-based categorization
My approach
Experiments
Comparison
Introduction
Web page categorization is becoming increasingly useful for enhancing search engine results
As the number of web pages increases every day, topic categorization alone becomes insufficient
Genre is another criterion used to classify web pages (Jebari and Ounelli, 2007)
Introduction
The genre of web pages (cybergenre) is characterized by the triple <content, form, functionality> (Shepherd and Watters, 1999).
Web genres change over time
I propose a refined and incremental approach for genre categorization of web pages
Related works
Authors | Features | Machine learning | Corpora | Accuracy
(Meyer zu Eissen and Stein, 2004) | presentation-related, closed word, text statistics, syntactic features | SVM | KI-04 | 0.70
(Kennedy and Shepherd, 2005) | content, form, functionality | Neural network | 321 web pages | 0.70
(Boese and Howe, 2005) | style, form, content | Logistic regression | KI-04 & WebKB | 0.80
(Santini, 2007) | frequencies of common words, part-of-speech trigrams, HTML tags, punctuation marks, … | Naïve Bayes | KI-04 | 0.70
(Kanaris and Stamatatos, 2007) | n-grams extracted from both text and structure | SVM | KI-04 & WebKB | 0.90–0.95
Centroid-based categorization
Centroid-based categorization finds a description (centroid or prototype) that summarizes all documents belonging (or not) to a given category.
The time and memory required by centroid-based models are proportional to the number of categories rather than to the number of training documents, unlike other machine learning techniques (Naïve Bayes, K nearest neighbors, decision trees, etc.).
Centroid-based categorization
Centroid-based models can take in additional training documents and easily recalculate the centroids.
Many models have been proposed to calculate centroids (Rocchio, average, sum, normalized sum, etc.).
The normalized sum is among the most powerful models.
Centroid-based categorization
The centroid $C_j$ for a category $c_j$ is defined as follows:
$$C_j = \frac{1}{\left\| \sum_{d_i \in c_j} d_i \right\|} \sum_{d_i \in c_j} d_i$$
A document $d_i$ is assigned to the category with the highest similarity, calculated as follows:
$$sim(d_i, C_j) = \frac{d_i \cdot C_j}{\|d_i\| \, \|C_j\|}$$
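The normalized sum centroid and the cosine-based assignment can be sketched in a few lines of Python. This is only an illustration: documents are assumed to be sparse term-weight dictionaries, and the function names and the toy "shop" and "help" data are hypothetical, not taken from the slides.

```python
import math

def norm(v):
    """Euclidean length of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return dot / (nu * nv)

def normalized_sum_centroid(docs):
    """Sum the document vectors of one category, then scale the sum to unit length."""
    total = {}
    for d in docs:
        for t, w in d.items():
            total[t] = total.get(t, 0.0) + w
    length = norm(total)
    return {t: w / length for t, w in total.items()} if length else {}

def classify(doc, centroids):
    """Return the category whose centroid is most similar to the document."""
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))

# Toy training data (hypothetical).
training = {
    "shop": [{"cart": 2.0, "price": 1.0}, {"price": 2.0, "buy": 1.0}],
    "help": [{"faq": 2.0, "howto": 1.0}, {"faq": 1.0, "support": 1.0}],
}
centroids = {c: normalized_sum_centroid(docs) for c, docs in training.items()}
predicted = classify({"price": 1.0, "cart": 1.0}, centroids)
print(predicted)
```

Only one vector per category is kept after training, which is why memory and classification time grow with the number of categories rather than with the number of training documents.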
My approach
[Figure: overview of the approach. Box labels: Training web pages; Construction of centroids; Centroids; New web page; Pre-processing; URL; Logical structure; Hypertext structure; Categorization; Combination.]
Construction of centroids
c = {c1, …, ck}: set of k predefined categories
C = {C1, …, Cj, …, Ck}: set of genre centroids computed using the normalized sum formula
I discarded web pages whose similarity with the centroid of their category is below a predefined threshold S0 (noisy web pages).
For each category cj, I calculate a new set of training web pages sj as follows:
Construction of centroids
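$$s_j = \{\, p_i \in c_j \mid sim(p_i, C_j) \ge S_0 \,\}$$
where $p_i$ is a web page and $sim$ is the cosine similarity.
The centroid $S_j$ obtained after refining, using the normalized sum formula, is defined as follows:
$$S_j = \frac{1}{\left\| \sum_{p_i \in s_j} p_i \right\|} \sum_{p_i \in s_j} p_i$$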
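A minimal sketch of the refining step follows, assuming pages are sparse term-weight dictionaries. The helper names and the toy pages are hypothetical. The sketch builds the initial centroid, drops pages whose cosine similarity to it falls below the threshold, and rebuilds the centroid from the remaining pages.

```python
import math

def norm(v):
    """Euclidean length of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(w * v.get(t, 0.0) for t, w in u.items()) / (nu * nv)

def centroid(pages):
    """Normalized sum of a list of page vectors."""
    total = {}
    for p in pages:
        for t, w in p.items():
            total[t] = total.get(t, 0.0) + w
    length = norm(total)
    return {t: w / length for t, w in total.items()} if length else {}

def refine(pages, s0):
    """Keep the pages with sim(p, C_j) >= s0 and rebuild the centroid from them."""
    initial = centroid(pages)
    kept = [p for p in pages if cosine(p, initial) >= s0]
    return kept, centroid(kept)

# Two shop-like pages and one off-topic (noisy) page, all hypothetical.
pages = [
    {"cart": 1.0, "price": 1.0},
    {"price": 1.0, "buy": 1.0},
    {"weather": 1.0},
]
kept, refined = refine(pages, s0=0.5)
print(len(kept), sorted(refined))
```

With a threshold of 0.5 the off-topic page is discarded, so its term no longer contributes to the refined centroid.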
Pre-Processing
Feature extraction: URL, logical structure (the content of title and Hn tags) and hypertext structure (the content of anchors)
Remove special characters and stop words
Stem the remaining words
Weight terms using NormTFIDF with parameters 0.5, -1 and -0.5 (Lertnattee and Theeramunkong, 2004)
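The feature extraction and cleaning steps can be sketched with Python's standard `html.parser` module. The class and function names, the stop-word list and the sample page are assumptions made for illustration. Stemming and NormTFIDF weighting are left out because the slides do not specify them in enough detail.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

class StructureExtractor(HTMLParser):
    """Collects text inside title/h1..h6 tags (logical) and a tags (hypertext)."""
    LOGICAL_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.logical_depth = 0
        self.anchor_depth = 0
        self.logical_text = []
        self.anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LOGICAL_TAGS:
            self.logical_depth += 1
        elif tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag in self.LOGICAL_TAGS:
            self.logical_depth = max(0, self.logical_depth - 1)
        elif tag == "a":
            self.anchor_depth = max(0, self.anchor_depth - 1)

    def handle_data(self, data):
        if self.logical_depth:
            self.logical_text.append(data)
        if self.anchor_depth:
            self.anchor_text.append(data)

def tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def extract_features(url, html):
    """Return the token lists for the URL, logical and hypertext features."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": tokenize(url),
        "logical": tokenize(" ".join(parser.logical_text)),
        "hypertext": tokenize(" ".join(parser.anchor_text)),
    }

page = ("<html><head><title>The Online Shop</title></head><body>"
        "<h1>Special offers</h1><p>Welcome</p>"
        "<a href='/help'>Help and FAQ</a></body></html>")
features = extract_features("http://www.example.org/shop/cart.html", page)
print(features["logical"], features["hypertext"])
```

Each of the three token lists would then be stemmed and weighted separately, giving one vector per feature for the three classifiers.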
Categorization of a new page
Categorization of new web pages is performed one by one (incremental categorization).
For each new web page p, I calculate its cosine similarity with all centroids.
I refine the centroids whose similarity with the page p is greater than or equal to S0.
Categorization of new pages
The refining step consists of adding the new page p to the centroid of the corresponding genre and renormalizing the centroid.
Each normalized centroid $S_j$ is associated with its non-normalized centroid $NS_j$.
Refinement of the centroid $S_j$ can be performed by the following operations:
$$NS_j \leftarrow NS_j + p \quad \text{and} \quad S_j = \frac{NS_j}{\|NS_j\|}$$
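The incremental update can be sketched as follows. Each genre keeps both the running sum NS_j and its unit-length version S_j, so adding a page costs one vector addition and one renormalization. The class name, the function name and the toy data are hypothetical, and returning the most similar genre as the prediction is an assumption for this single-classifier sketch.

```python
import math

def norm(v):
    """Euclidean length of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(w * v.get(t, 0.0) for t, w in u.items()) / (nu * nv)

class IncrementalCentroid:
    """Holds the non-normalized centroid NS_j and the normalized centroid S_j."""

    def __init__(self):
        self.ns = {}  # NS_j: running sum of page vectors
        self.s = {}   # S_j: NS_j scaled to unit length

    def add(self, page):
        """NS_j <- NS_j + p, then S_j = NS_j / ||NS_j||."""
        for t, w in page.items():
            self.ns[t] = self.ns.get(t, 0.0) + w
        length = norm(self.ns)
        self.s = {t: w / length for t, w in self.ns.items()} if length else {}

def categorize_and_refine(page, centroids, s0):
    """Score the page against every centroid, then refine those with sim >= s0."""
    sims = {g: cosine(page, c.s) for g, c in centroids.items()}
    for g, sim in sims.items():
        if sim >= s0:
            centroids[g].add(page)
    return max(sims, key=sims.get), sims

centroids = {"shop": IncrementalCentroid(), "help": IncrementalCentroid()}
centroids["shop"].add({"price": 1.0, "cart": 1.0})
centroids["help"].add({"faq": 1.0})

genre, sims = categorize_and_refine({"price": 1.0, "buy": 1.0}, centroids, s0=0.4)
print(genre, centroids["shop"].ns)
```

Similarities are computed before any centroid is updated, so the order in which genres are visited does not affect the scores for the current page.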
Combination
The aim is to combine the outputs of three homogeneous classifiers, which use the URL, the logical structure and the hypertext structure respectively (Jebari, 2007).
I used the decision templates for combination (Kuncheva et al., 2001)
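A minimal sketch of decision-template fusion in the sense of Kuncheva et al. (2001) is shown here. Each page yields a decision profile with one row per classifier and one column per genre. The template of a genre is the mean profile of its training pages, and a new page receives the genre whose template is closest. The squared Euclidean distance, the function names and all numbers are assumptions; the slides do not say which similarity measure was used.

```python
def decision_templates(profiles, labels):
    """Average the decision profiles of the training pages of each genre."""
    sums, counts = {}, {}
    for dp, y in zip(profiles, labels):
        if y not in sums:
            sums[y] = [[0.0] * len(row) for row in dp]
            counts[y] = 0
        for i, row in enumerate(dp):
            for k, value in enumerate(row):
                sums[y][i][k] += value
        counts[y] += 1
    return {y: [[v / counts[y] for v in row] for row in m] for y, m in sums.items()}

def squared_distance(a, b):
    """Squared Euclidean distance between two profiles of equal shape."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def combine(dp, templates):
    """Return the genre whose template is closest to the decision profile dp."""
    return min(templates, key=lambda y: squared_distance(dp, templates[y]))

# Rows: URL, logical structure, hypertext structure classifiers.
# Columns: similarity to the "shop" and "help" centroids (made-up values).
train_profiles = [
    [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2]],
    [[0.7, 0.3], [0.9, 0.2], [0.6, 0.1]],
    [[0.2, 0.8], [0.1, 0.9], [0.3, 0.6]],
    [[0.3, 0.7], [0.2, 0.8], [0.1, 0.9]],
]
train_labels = ["shop", "shop", "help", "help"]
templates = decision_templates(train_profiles, train_labels)

# The URL classifier leans towards "help", the other two towards "shop".
new_profile = [[0.4, 0.6], [0.8, 0.2], [0.7, 0.3]]
decision = combine(new_profile, templates)
print(decision)
```

Because the whole profile is compared with the templates, one dissenting classifier can be outvoted by the overall pattern of the other two.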
Experiments
Corpora:
KI-04
Genre | # of web pages
Article | 127
Download | 151
Link collection | 205
Private portrayal | 126
Non private portrayal | 163
Discussion | 127
Help | 139
Shop | 167
WebKB
Genre | # of web pages
Student | 1541
Faculty | 1063
Staff | 126
Department | 170
Project | 474
Course | 875
Experiments
Aims:
Measure the effect of vocabulary size on genre categorization of web pages
Measure the usefulness of refining, incrementing and combination in genre categorization of web pages
Compare with other works and machine learning techniques
Experimental setup:
I used micro-averaged accuracy as the performance measure
I used the 5×2 cross-validation methodology
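The 5×2 cross-validation protocol can be sketched as five random half-and-half splits, each used twice with the two halves swapping roles. The function name and the fixed seed are assumptions, and the sketch does not stratify by genre, which the original experiments may or may not have done.

```python
import random

def five_by_two_splits(n_items, seed=0):
    """Yield (replication, fold, train_indices, test_indices) for 5x2 cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    half = n_items // 2
    for replication in range(5):
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        # Each half serves once as training set and once as test set.
        yield replication, 0, list(first), list(second)
        yield replication, 1, list(second), list(first)

runs = list(five_by_two_splits(10))
print(len(runs))
```

Each of the ten runs would train a classifier on `train_indices` and score it on `test_indices`, giving ten accuracy values per classifier.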
Results
[Figure: two plots, (a) KI-04 and (b) WebKB, of micro-averaged accuracy (y-axis) against the number of terms (x-axis), with one series each for URL, Logical Structure and Hypertext Structure.]
Effect of vocabulary size: micro-averaged accuracy for each feature, on both the (a) KI-04 and (b) WebKB corpora, obtained by varying the number of terms between 5 and 3000.
Results
[Figure: two plots, (a) KI-04 and (b) WebKB, of micro-averaged accuracy (y-axis) against the refining threshold (x-axis, 0 to 1), with one series each for URL, Logical Structure and Hypertext Structure.]
Usefulness of refining: micro-averaged accuracy for each feature, on both the (a) KI-04 and (b) WebKB corpora, obtained by varying the refining threshold between 0 and 1 in steps of 0.1.
Results
Usefulness of incrementing: I varied the proportion of test web pages for each feature between 10% and 90% in steps of 10%. The micro-averaged accuracy obtained on the KI-04 (a) and WebKB (b) corpora is shown below.
[Figure: two plots, (a) KI-04 and (b) WebKB, of micro-averaged accuracy (y-axis) against the percentage of test web pages (x-axis, 10 to 90), with one series each for URL, Logical Structure and Hypertext Structure.]
Results
[Figure: chart of micro-averaged accuracy (y-axis, 0.7 to 1) for the URL, Logical Structure, Hypertext Structure and DT Combination classifiers, with one series each for KI-04 and WebKB.]
Usefulness of combination: micro-averaged accuracy for each classifier (URL, logical, hypertext and combined classifiers) on both the KI-04 and WebKB corpora.
Comparison
Problems:
No publicly available standard benchmark corpora for the genre categorization task
No agreed definition of web page genres; each study focuses on a different set of genres
Comparison with other works:
Only Kanaris and Stamatatos (2007) report good micro-averaged accuracy on the KI-04 corpus, because they rely on structural information, as in my approach (Jebari and Ounelli, 2004).
Comparison
Author | KI-04 | WebKB
[14] | 0.70 | -
[1] | 0.75 | 0.80
[9] | 0.84 | -
[17] | 0.70 | -
My approach | 0.96 | 0.98
Comparison
Comparison with other machine learning techniques:
I compared my approach with other categorization techniques implemented in the Rainbow program (http://www.cs.cmu.edu/~mccallum/bow/rainbow/).
I used Rocchio, Naïve Bayes (NB), K Nearest Neighbors (KNN) with K=30, SVM with a Fisher kernel and TreeNode, because they are widely used in genre categorization of documents.
Comparison
KI-04
Technique | URL | Logical | Hypertext
SVM | << | ~ | <<
ROCCHIO | << | < | <<
NB | << | << | <<<
KNN | << | << | <<<
TreeNode | <<< | <<< | <<<
WebKB
Technique | URL | Logical | Hypertext
SVM | ~ | << | ~
ROCCHIO | << | < | <<
NB | << | << | <<<
KNN | <<< | << | <<<
TreeNode | <<< | <<< | <<<
To show that the obtained results are meaningful and not due to chance, I used the 5×2 cross-validation t-test (Dietterich, 1998).
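The 5×2 cross-validation t statistic of Dietterich (1998) can be computed as in the sketch below. For each of the five replications, the two values are the differences in score between the two classifiers on the two folds. The statistic is the first difference divided by the square root of the mean of the five per-replication variance estimates, and it is compared with a t distribution with 5 degrees of freedom. The function name and the sample differences are made up for illustration.

```python
import math

def five_by_two_cv_t(diffs):
    """Return the 5x2cv t statistic for five (fold-1, fold-2) score differences."""
    if len(diffs) != 5:
        raise ValueError("expected exactly five replications")
    variance_sum = 0.0
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2.0
        # Variance estimate s_i^2 for replication i.
        variance_sum += (p1 - mean) ** 2 + (p2 - mean) ** 2
    if variance_sum == 0.0:
        raise ValueError("all differences are identical; the statistic is undefined")
    return diffs[0][0] / math.sqrt(variance_sum / 5.0)

# Hypothetical accuracy differences (classifier A minus classifier B).
diffs = [(0.10, 0.12), (0.09, 0.11), (0.11, 0.10), (0.12, 0.08), (0.10, 0.11)]
t_value = five_by_two_cv_t(diffs)

# 2.571 is the two-sided 5% critical value of Student's t with 5 degrees of freedom.
significant = abs(t_value) > 2.571
print(round(t_value, 3), significant)
```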
Comparison
Time is a very important aspect for comparison.
The following figures compare the time each classification technique needs in both the training and classification phases, for each corpus and each feature.
Comparison
[Figure: two plots, (a) KI-04 and (b) WebKB, of test time (y-axis) against training time (x-axis) for Rocchio, NB, KNN, SVM, TreeNode and my approach.]
Training and test time spent using the URL feature on both the KI-04 (a) and WebKB (b) corpora
Comparison
[Figure: two plots, (a) KI-04 and (b) WebKB, of test time (y-axis) against training time (x-axis) for Rocchio, NB, KNN, SVM, TreeNode and my approach.]
Training and test time spent using the logical structure on both the KI-04 (a) and WebKB (b) corpora
Comparison
[Figure: two plots, (a) KI-04 and (b) WebKB, of test time (y-axis) against training time (x-axis) for Rocchio, NB, KNN, SVM, TreeNode and my approach.]
Training and test time spent using the hypertext structure on both the KI-04 (a) and WebKB (b) corpora
Conclusion
The approach proposed in this paper uses three new features (the URL address, logical and hypertext structures).
My approach implements three new aspects (refinement, incrementing and combination) which were not explored in previous studies on genre categorization.
Conducted experiments show the usefulness of each aspect in genre categorization.
The comparison with other approaches shows that my approach is the fastest and outperforms many known categorization techniques.
References
Jebari, C., and Ounelli, H. The usefulness of Logical Structure in Flexible Document Categorization. International Journal of Information Technology, 2004, vol. 1, no. 3, pp. 117-121.
Jebari, C. Combining Classifiers for web page genre categorization. In "Towards Genre-Enabled Search Engines: The Impact of NLP" International Workshop held in conjunction with International Conference in Recent Advances on Natural Language Processing RANLP07, Borovets, Bulgaria. 2007.
Jebari, C., and Ounelli, H. Genre Categorization of web pages, IEEE Computer Society, 2007. ACM Press.
Shepherd, M., and Watters, C. The functionality attribute of cybergenres. In Proceedings of the 32nd Hawaiian International Conference on System Sciences, January 1999, Hawaii.
References
Meyer zu Eissen, S., and Stein, B. Genre Classification of Web Pages: User Study and Feasibility Analysis. In Biundo S., Fruhwirth T. and Palm G. (eds.). KI2004: Advances in Artificial Intelligence, Springer. Berlin-Heidelberg-New York, pp. 256-269, 2004.
Kennedy, A., and Shepherd, M. Automatic Identification of Home Pages. In Proceedings of the 38th Hawaii International Conference on System Sciences, 2005.
Boese, E. S., and Howe, A. E. Effect of web document evolution on genre classification. Proceedings of the 14th ACM International conference on Information and knowledge management, pp. 632-639. 2005.
Santini, M. Automatic identification of genre in web pages. Ph.D Thesis, University of Brighton, UK, 2007.
References
Kanaris, I., and Stamatatos, E. Webpage Genre Identification Using Variable-length Character n-grams. Proceedings of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence. 2007.
Lertnattee, V., and Theeramunkong, T. Effect of term distributions on centroid-based text categorization. Journal of Information Sciences, 2004, vol. 158, no. 1, p. 89-115.
Kuncheva, L.I., Bezdek, J.C., and Duin, R.P.W. Decision templates for multiple classifier fusion. Pattern Recognition, 34 (2), 2001, 299-314.
Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10(7): 1895-1923. 1998.