
K.-Y. Whang, J. Jeon, K. Shim, J. Srivastava (Eds.): PAKDD 2003, LNAI 2637, pp. 350–355, 2003. © Springer-Verlag Berlin Heidelberg 2003

An Integrated System of Mining HTML Texts and Filtering Structured Documents

Bo-Hyun Yun1, Myung-Eun Lim1, and Soo-Hyun Park2

1 Dept. of Human Information Processing, Electronics and Telecommunications Research Institute

161, Kajong-Dong, Yusong-Gu, Daejon, 305-350, Korea
{ybh, melim}@etri.re.kr

2 School of Business IT, Kookmin University, 861-1, Cheongrung-dong, Sungbuk-ku, Seoul, 136-702, Korea

[email protected]

Abstract. This paper presents a method of mining HTML documents into structured documents and of filtering structured documents by using both slot weighting and token weighting. The goal of the mining algorithm is to find slot-token patterns in HTML documents. In order to express user interests in structured document filtering, both slots and tokens are considered. Our preference computation algorithm applies vector similarity and Bayesian probability to filter structured documents. The experimental results show that it is important to consider hyperlinking and unlabelling in mining HTML texts, and that slot and token weighting can enhance the performance of structured document filtering.

1 Introduction

The Web presents users with huge amounts of information, and some may feel they will miss something if they do not review all available data before making a decision. These needs result in HTML text mining and structured document filtering. The goal of mining HTML documents is to transform HTML texts into a structural format, thereby reducing the information in texts to slot-token patterns. Structured document filtering provides users with only the information that satisfies their interests.

Conventional approaches to mining HTML texts can be divided into three types: the FST (Finite-State Transducer) method [1, 5, 7], the relational rule learning method [2, 6, 10], and the knowledge-based method [4]. The knowledge-based method is easy to understand and implement but requires prior knowledge construction. Thus, we choose the knowledge-based method for mining HTML texts, while considering hyperlinking and unlabelling. The existing methods [8, 11] of filtering structured documents use a naïve Bayesian classifier, a decision tree, and a Bayesian classification algorithm. However, these methods use only token weighting in slot-token patterns for structured documents.

In this paper, we propose an integrated system that mines HTML texts by converting them to slot-token patterns and then filters the resulting structured documents. First, the information extraction (IE) system extracts structured information, such as slot-token patterns, from Web pages. Then, our information filtering (IF) system predicts the user's preference for that information and filters only the information the user is predicted to be interested in. We analyze user feedback from user logs and construct a profile using slot preference and token preference.

2 Mining HTML Texts and Filtering Structured Documents

Fig. 1 shows our integrated system of mining HTML texts and filtering structured documents. The IE system mines Web documents into structured documents, and the IF system filters the extracted information and provides the filtered information to users.

Fig. 1. An Integrated System Configuration

Given HTML documents, our IE system interprets each document and mines the meaningful information fragments. The system finds the repeated patterns among the objects of the interpreted document. Using the repeated patterns, we induce the wrapper of the HTML document. To extract the information, the wrapper interpreter loads the generated wrapper and maps the objects onto the wrapper. Finally, the wrapper extracts the information and stores it as an XML-based document.
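As a concrete illustration of this flow, the following Python sketch applies a hand-built wrapper to an HTML fragment and stores the result as XML. The rule set, function names, and the regular-expression representation of the wrapper are illustrative assumptions, not the actual implementation.

# Minimal sketch of knowledge-based wrapper extraction (illustrative rules only).
import re
import xml.etree.ElementTree as ET

# Hand-built "knowledge": slot label -> pattern locating the token that follows the label.
WRAPPER_RULES = {
    "title":    re.compile(r"Title\s*:\s*(?:<[^>]*>\s*)*([^<]+)"),
    "genre":    re.compile(r"Genre\s*:\s*(?:<[^>]*>\s*)*([^<]+)"),
    "director": re.compile(r"Director\s*:\s*(?:<[^>]*>\s*)*([^<]+)"),
}

def apply_wrapper(html):
    """Map each slot to the token extracted from the page, if any."""
    record = {}
    for slot, pattern in WRAPPER_RULES.items():
        match = pattern.search(html)
        if match:
            record[slot] = match.group(1).strip()
    return record

def to_xml(record):
    """Store the extracted slot-token pairs as an XML-based document."""
    root = ET.Element("document")
    for slot, token in record.items():
        ET.SubElement(root, slot).text = token
    return ET.tostring(root, encoding="unicode")

page = "<tr><td>Title:</td><td>Oldboy</td></tr><tr><td>Genre:</td><td>Thriller</td></tr>"
print(to_xml(apply_wrapper(page)))   # <document><title>Oldboy</title><genre>Thriller</genre></document>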

Because of unlabeled data, mining HTML texts may often fail to extract structured information. In order to assign labels to unlabeled data, we assume that target documents consist mostly of labeled information and a small amount of unlabeled information. To recognize the proper label for a piece of information, we observe the previous extraction results and calculate each token's probability within the slot. Equation (1) gives the probability that the tokens within a slot belong to a labeled slot.

p(s_i \mid T_j) = p(s_i)\, p(T_j \mid s_i) \cong p(s_i) \cdot \frac{1}{v} \sum_{k=1}^{v} p(t_{jk} \mid s_i)    (1)

Here, s_i denotes slot i, T_j denotes the token set of the jth slot over all templates, p is a probability, and v is the total number of templates in the document. The probability that the label of the jth slot is s_i equals the probability of the token set in slot j.
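A rough sketch of how this labeling step could be realized is given below, assuming that previous extraction results are available as per-label token lists and that the summation is taken over the tokens of the unlabeled slot; the data structures, the smoothing constant, and the uniform prior are illustrative assumptions.

from collections import Counter

# Illustrative scoring of a candidate label s_i for an unlabeled token set T_j (cf. equation (1)).
history = {                                   # tokens previously extracted under each label
    "genre":    ["thriller", "drama", "comedy", "thriller"],
    "director": ["park", "bong", "kim"],
}
slot_prior = {"genre": 0.5, "director": 0.5}  # p(s_i), assumed uniform here

def label_score(tokens, slot, alpha=1e-3):
    """p(s_i) times the smoothed average of p(t_k | s_i) over the unlabeled tokens."""
    counts = Counter(history[slot])
    total = sum(counts.values())
    avg = sum((counts[t] + alpha) / (total + alpha * len(counts)) for t in tokens) / len(tokens)
    return slot_prior[slot] * avg

unlabeled = ["thriller", "comedy"]            # token set of a slot with no label
print(max(history, key=lambda s: label_score(unlabeled, s)))   # -> "genre"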

To refer to sub-linked pages, we have to integrate main pages with sub-linked pages and extract the slot-token patterns. The integration phase analyzes the structure of sites and integrates main pages with sub-linked pages if the linked pages contain valid information. The extraction phase, on the other hand, reads the wrapper and extracts information from both main pages and sub-linked pages.

Structured document filtering can be divided into a learning part and a filtering part. In the learning stage, when user logs are obtained through user feedback, the indexer loads structured documents consisting of slot-token patterns and computes token and slot preferences within the documents. Token preference is the frequency of the tokens, and slot preference is assigned from the token preferences. Finally, the user profile is updated. In the filtering stage, when new structured documents arrive from the IE system, the indexer extracts the tokens and computes the token preference within each slot. Then, document similarity is calculated by using slot preference. Finally, the filtered documents are suggested to users.
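The two stages can be pictured with the following sketch, in which the profile is simply a nested mapping from slots to token preferences; the structures and function names are hypothetical placeholders for the computations of equations (2)–(5) below.

from collections import defaultdict

def learn_profile(rated_docs):
    """Learning stage: build per-slot token preferences from user feedback.
    rated_docs: list of (structured_doc, rating), where structured_doc maps slot -> tokens."""
    profile = defaultdict(lambda: defaultdict(float))
    for doc, rating in rated_docs:
        for slot, tokens in doc.items():
            for token in tokens:
                profile[slot][token] += rating        # token preference as rating-weighted frequency
    return profile

def filter_documents(new_docs, profile, top_n=5):
    """Filtering stage: rank incoming structured documents by a simple preference score."""
    def score(doc):
        return sum(profile[slot].get(t, 0.0) for slot, tokens in doc.items() for t in tokens)
    return sorted(new_docs, key=score, reverse=True)[:top_n]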

Slot preference is the degree of importance a user attaches to a slot when choosing information. We assume that if slot preference is high, users judge the value of information by that slot; conversely, if slot preference is low, that slot does not help the user choose information.

To evaluate slot preference, we start from the following hypotheses:

• There is at least one slot in a structured document which a user relies on to evaluate information. For example, some users judge a 'movie' document according to its 'genre' and 'cast'.

• A user refers only to slots with high preference when evaluating information. For example, a user may be interested in the 'genre' and 'director' of a movie, but not in its 'title' or 'casting'.

A base slot is a slot used as a baseline for choosing information. If a slot's preference is sufficiently high, that slot becomes one of the base slots. The filtering system predicts the preference of a document by using only the tokens of its base slots.

In the vector space model, a user preference is composed of tokens selected from the documents rated by the user. Token weights represent the user's interests. If a user has a long-term interest, positive or negative, in a certain token, then the variance of that token's weight will be low. On the other hand, if the user has no particular interest in a token, its weight will vary widely.

The importance of a slot is high if the slot contains many tokens that are important in the user profile; otherwise, its importance is low. We therefore determine the slot preference by the number of important tokens within the slot, where important tokens are selected by their variance in each slot. Equation (2) defines the token variance and equation (3) defines the slot weighting.

\sigma_a = \frac{1}{N} \sum_{i=1}^{n} \left( w_{ai} - \overline{w}_a \right)^2, \qquad w_{ai} = \frac{freq_{ai}}{\sum_{i=1}^{n} freq_{ai}}    (2)

Here, σ_a is the variance of token a, w_{ai} is the preference of token a in slot i, and \overline{w}_a is the mean of these weights; freq_{ai} is the token frequency, n is the number of preference weights, and N is the total number of tokens in all slots.

SW_m = \frac{1}{k} \sum_{i=1}^{k} x_i, \qquad x_i = \begin{cases} 1 & \text{if } \sigma_i < threshold \\ 0 & \text{otherwise} \end{cases}    (3)

In equation (3), SW_m is the preference of slot m and k is the total number of tokens in the slot. x_i is an indicator variable: it is 1 if the variance of token i is smaller than the threshold (a low variance indicates a stable long-term interest); otherwise, it is 0.
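Under the assumption that the per-token weights are collected across the user's rated documents, equations (2) and (3) can be sketched as follows; the threshold value and the data layout are illustrative.

import statistics

def slot_weight(token_weights, threshold=0.01):
    """SW_m for one slot. token_weights: {token: [w_a1, w_a2, ...]}, the weights of each
    token across rated documents. A token counts as important (x_i = 1) when the variance
    of its weights is below the threshold, i.e. the user's interest in it is stable."""
    k = len(token_weights)
    if k == 0:
        return 0.0
    important = sum(1 for weights in token_weights.values()
                    if statistics.pvariance(weights) < threshold)
    return important / k                      # SW_m = (1/k) * sum of indicators x_i

genre_tokens = {"thriller": [0.30, 0.28, 0.31], "comedy": [0.02, 0.40, 0.10]}
print(slot_weight(genre_tokens))              # 0.5: only 'thriller' shows a stable interest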

The filtering is done by calculating the similarity between documents and profiles [9]. The vectors of documents and profiles are generated, and only base slots are used in calculating the similarity. Equation (4) shows the slot similarity. S_{Pa} is the token vector of slot a in the profile and S_{Da} is the corresponding vector of the document; tw is a token's weight in the document slot and pw is a token's weight in the profile slot.

SS_a = S_{Pa} \cdot S_{Da} = \frac{\sum_i tw(i)\, pw(i)}{\sqrt{\sum_i tw(i)^2}\, \sqrt{\sum_i pw(i)^2}}    (4)

Equation (5) shows the final document preference. SS_i is the slot similarity of slot i and SW_i is the slot preference; s is the number of slots whose preference weight is larger than the threshold.

Sim(D, P) = \frac{1}{s} \sum_{i=1}^{s} SS_i \cdot SW_i    (5)
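Taken together, equations (4) and (5) can be sketched as follows; the profile and document representation (slot -> {token: weight}) and the base-slot threshold are illustrative assumptions.

import math

def slot_similarity(profile_slot, doc_slot):
    """SS_a: cosine similarity between the profile's and the document's token-weight vectors."""
    tokens = set(profile_slot) | set(doc_slot)
    dot = sum(profile_slot.get(t, 0.0) * doc_slot.get(t, 0.0) for t in tokens)
    norm_p = math.sqrt(sum(w * w for w in profile_slot.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_slot.values()))
    return dot / (norm_p * norm_d) if norm_p and norm_d else 0.0

def document_preference(profile, doc, slot_weights, threshold=0.3):
    """Sim(D, P): average of SS_i * SW_i over the base slots (slots with SW_i above threshold)."""
    base = [s for s, sw in slot_weights.items() if sw >= threshold]
    if not base:
        return 0.0
    return sum(slot_similarity(profile.get(s, {}), doc.get(s, {})) * slot_weights[s]
               for s in base) / len(base)

profile = {"genre": {"thriller": 0.8, "drama": 0.2}}
doc     = {"genre": {"thriller": 1.0}, "title": {"oldboy": 1.0}}
print(document_preference(profile, doc, {"genre": 0.7, "title": 0.1}))   # ~0.68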

3 Experimental Results

The evaluation data for mining Web pages are seven movie sites, including Core Cinema and Joy Cinema. Because movie sites are updated periodically, wrapper induction is performed on recent data to detect slot-token patterns. Because our knowledge for wrapper induction is written in Korean, we test our method on Korean movie sites. We define 12 target slots, such as title, genre, director, actor, grade, music, production, and running time.

We evaluate three mining methods: "knowledge only", "link extraction", and "label detection". "Knowledge only" mines movie sites using only the knowledge base, without considering hyperlinking or unlabelling. "Link extraction" additionally uses hyperlinking. "Label detection" considers both hyperlinking and unlabelling.

Fig. 2. Results of Mining HTML Texts: (a) precision per site; (b) overall precision


The precision for each site is shown in Fig. 2 (a) and the global precision in Fig. 2 (b). The performance of "link extraction" and "label detection" is significantly better than that of "knowledge only". In other words, the experimental results show that it is important to consider hyperlinking and unlabelling when mining HTML texts, and that our system can transform HTML texts into slot-token patterns.

The experimental data for filtering structured documents is the EachMovie dataset [3]. This dataset contains user ratings of movie information: 2,811,983 ratings by 72,916 users of 1,628 movies, with ratings ranging between 0 and 1. From this data, we choose only the 6,044 users with more than 100 ratings, on the assumption that such users are reliable. Of the 1,628 movies, real data for 1,387 are gathered from the IMDb site (http://www.imdb.com). This information contains title, director, genre, plot outline, user comments, cast overview, and so on.

To compare relative performance, we test three methods: "unst", "st", and "st with sw". "Unst" treats a structured document as an unstructured one: slots and tokens are handled uniformly as tokens, without distinguishing between them. "St" handles structured documents but does not use slot weighting; the user profile is constructed from slots and tokens but holds only token weights, and document similarity is calculated from the weights of tokens within the same slot of the user profile and the document. "St with sw" uses both slot weighting and token weighting. We randomly select 1,000 of the 6,044 users and divide their rating data into training and test sets with ratios of 1:1 and 3:1, in order to observe how performance changes with the amount of training data. We repeat each test 50 times per method and measure precision and recall.

The experimental results of filtering structured documents are shown in Fig. 3. Figs. 3 (a) and (b) show that the precision of "st" is higher than that of "unst" and that the precision of "st with sw" is higher than that of "st". From this result, we can see that structured document filtering methods should consider both slot and token weighting. Comparing Fig. 3 (a) and (b), the precision with the 3:1 data ratio is slightly higher than with the 1:1 ratio, which means that increasing the amount of training data can improve performance.

Fig. 3. Results of Filtering Structured Documents: (a) is the result for the 3:1 ratio, and (b) for the 1:1 ratio


4 Conclusion

This paper has described a method of mining HTML documents into structured documents and of filtering structured documents by using both slot weighting and token weighting. Our integrated system mines HTML texts with an IE system composed of wrapper generation and interpretation. We used Bayesian probability to solve the unlabelling problem and integrated main pages with sub-linked pages to handle hyperlinking. Structured documents are then filtered by the IF system using slot and token weighting.

The experimental results show that it is important to consider hyperlinking and unlabelling in mining HTML texts, and that our system can transform HTML texts into slot-token patterns. In structured document filtering, slot weighting can enhance filtering performance. For future work, we plan to evaluate how the total number of tokens affects slot weighting.

References

[1] Chun-Nan Hsu and Ming-Tzung Dung, Generating Finite-State Transducers for Semi-structured Data Extraction from the Web, Information Systems, vol. 23, no. 8, pp. 521–538, 1998.

[2] Dayne Freitag, Toward General-Purpose Learning for Information Extraction, Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, 1998.

[3] EachMovie data download site, http://www.research.compaq.com/SRC/eachmovie/data/.

[4] Heekyoung Seo, Jaeyoung Yang, and Joongmin Choi, Knowledge-based Wrapper Generation by Using XML, Workshop on Adaptive Text Extraction and Mining (ATEM 2001), pp. 1–8, Seattle, USA, 2001.

[5] Ion Muslea, Steven Minton, and Craig A. Knoblock, Hierarchical Wrapper Induction for Semistructured Information Sources.

[6] Mary Elaine Califf and Raymond J. Mooney, Relational Learning of Pattern-Match Rules for Information Extraction, Proceedings of the 16th National Conference on Artificial Intelligence, pp. 328–334, Orlando, FL, July 1999.

[7] Naveen Ashish and Craig Knoblock, Semi-automatic Wrapper Generation for Internet Information Sources, Proceedings of the Second International Conference on Cooperative Information Systems, Charleston, SC, 1997.

[8] Raymond J. Mooney, Content-Based Book Recommending Using Learning for Text Categorization, Proceedings of the 5th ACM Conference on Digital Libraries, June 2000.

[9] Robert B. Allen, User Models: Theory, Method, and Practice, International Journal of Man-Machine Studies, vol. 32, pp. 511–543, 1990.

[10] Stephen Soderland, Learning Information Extraction Rules for Semi-structured and Free Text, Machine Learning, 34(1–3):233–272, 1999.

[11] Yanlei Diao, Hongjun Lu, and Dekai Wu, A Comparative Study of Classification-Based Personal E-mail Filtering, Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, Japan, April 2000.