
Text Data Mining: Discovery of Important Keywords in the Cyberspace

Hiroki Arimura (Department of Informatics, Kyushu University / PRESTO, JST; arim@i.kyushu-u.ac.jp)
Junichiro Abe (Department of Informatics, Kyushu University; j-abe@i.kyushu-u.ac.jp)
Ryoichi Fujino (Kyushu University; fujino@po0.kix.or.jp)
Hiroshi Sakamoto (Department of Informatics, Kyushu University; hiroshi@i.kyushu-u.ac.jp)
Shinichi Shimozono (Department of Artificial Intelligence, Kyushu Institute of Technology; sin@ai.kyutech.ac.jp)
Setsuo Arikawa (Department of Informatics, Kyushu University; arikawa@i.kyushu-u.ac.jp)

Abstract

This paper describes applications of the optimized pattern discovery framework to text and Web mining. In particular, we introduce a class of simple combinatorial patterns over phrases, called proximity phrase association patterns, and consider the problem of finding the patterns that optimize a given statistical measure within the whole class of patterns in a large collection of unstructured texts. For this class of patterns, we develop fast and robust text mining algorithms based on techniques in computational geometry and string matching. Finally, we successfully apply the developed text mining algorithms to experiments on interactive document browsing in a large text database and keyword discovery from Web bases.

1. Introduction

The rapid progress of computer and network technologies makes it easy to collect and store large amounts of unstructured or semi-structured texts such as web pages, HTML/XML archives, e-mails, and text files. These text data can be thought of as large-scale text databases, and thus it becomes important to develop efficient tools to discover interesting knowledge from such text databases.

There is a large body of data mining research on discovering interesting rules or patterns from well-structured data such as transaction databases with boolean or numeric attributes [1, 8]. However, it is difficult to directly apply the traditional data mining technologies to text or semi-structured data, since these text databases consist of (i) heterogeneous and (ii) huge collections of (iii) unstructured or semi-structured data. Therefore, there are still only a small number of studies on text data mining, e.g. [5, 6].

Our research goal is to devise an efficient semi-automatic tool that supports human discovery from large text databases. Therefore, we require a fast pattern discovery algorithm that can work in time O(n) to O(n log n) to respond in real time on an unstructured data set of total size n. Furthermore, such an algorithm has to be robust in the sense that it can work on a large amount of noisy and incomplete unstructured data without the assumption of an unknown hypothesis class. To achieve this goal, we adopt the framework of optimized pattern discovery [8] described below, and develop efficient and robust pattern discovery algorithms combining advanced techniques in string algorithms, computational geometry, and computational learning theory.

*This research is supported in part by the Ministry of Education, Science, Sports, and Culture of Japan, Grant-in-Aid for Scientific Research on Priority Areas "Discovery Science".

2. Framework of text mining

2.1. Optimized pattern discovery

The framework of optimized pattern discovery adopted in this paper was originally proposed by Fukuda et al. [8] in the field of data mining and is also known as agnostic PAC learning [11] in computational learning theory. In optimized pattern discovery, a pattern discovery algorithm tries to find a pattern from a given hypothesis space that optimizes a statistical measure function, such as the classification

0-7695-1022-1/01 $10.00 © 2001 IEEE

error [1], information entropy [13], Gini index [3], and χ² index [13], to discriminate a given target (or positive) data set from another background (negative) data set [3, 13].

More precisely, we define optimized pattern discovery as follows. Suppose that we are given a set S = {s_1, ..., s_m} of texts and an objective function ξ : S → {0, 1}, where each s_i is called a document. The value ξ(s_i) of the objective function indicates that the document s_i is interesting (positive) if ξ(s_i) = 1 and not interesting (negative) otherwise.

Let P be a (possibly infinite) class of patterns, where for any pattern π ∈ P and any string s, we define π(s) = 1 if π matches s and π(s) = 0 otherwise. Let S be a set of documents and ξ be an objective function. Then, a pattern π defines a contingency table (M_1, M_0, N_1, N_0), where N_1 (resp. N_0) is the number of all positive (resp. negative) documents in S, and M_1 (resp. M_0) is the number of all positive (resp. negative) documents s ∈ S such that π(s) = 1. An impurity function is any real-valued function ψ : [0, 1] → R such that (i) it takes its maximum value at ψ(1/2), (ii) it takes its minimum value ψ(0) = ψ(1) = 0, and (iii) ψ is concave, i.e., ψ((x + y)/2) ≥ (ψ(x) + ψ(y))/2 for every x, y ∈ [0, 1]. The following are examples of impurity functions:

The prediction error: ψ_1(x) = min(x, 1 − x).
The information entropy: ψ_2(x) = −x log_2 x − (1 − x) log_2(1 − x).
The Gini index: ψ_3(x) = 2x(1 − x).
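For reference, the three impurity functions transcribe directly into code. This sketch is ours; the function names are not from the paper.

```python
import math

def prediction_error(x: float) -> float:
    # psi_1(x) = min(x, 1 - x)
    return min(x, 1.0 - x)

def information_entropy(x: float) -> float:
    # psi_2(x) = -x log2 x - (1 - x) log2(1 - x), with 0 log 0 taken as 0
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def gini_index(x: float) -> float:
    # psi_3(x) = 2 x (1 - x)
    return 2.0 * x * (1.0 - x)

# All three vanish at the endpoints, as required by condition (ii):
for psi in (prediction_error, information_entropy, gini_index):
    assert psi(0.0) == psi(1.0) == 0.0
```

Each function peaks at x = 1/2 (values 1/2, 1, and 1/2, respectively), matching condition (i).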

Then, the evaluation function based on ψ over S and ξ is

    G_{ψ,ξ}(π) = ψ(M_1/N_1) · N_1 + ψ(M_0/N_0) · N_0,

where (M_1, M_0, N_1, N_0) is the contingency table defined by the pattern π over S and ξ. Now, we state our data mining problem, called the optimal pattern discovery problem, as follows. Let P be the class of candidate patterns and ψ be any impurity function.

Optimal Pattern Discovery Problem
Given: a set S of documents and an objective function ξ : S → {0, 1}.
Problem: Find an optimal pattern π ∈ P that minimizes the cost G_{ψ,ξ}(π) within P.

In what follows, we consider the information entropy measure only, but the framework is not limited to it. From recent developments in learning theory, it is known that any algorithm that efficiently solves, e.g., classification error minimization, can approximate arbitrary unknown probability distributions and thus can work in noisy environments [11].
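As a concrete sketch of this optimization, the following snippet computes the contingency table and the entropy-based cost G_{ψ,ξ}(π) for single-word patterns and searches the candidate class exhaustively. The toy corpus, the word-level matching, and all names are our illustrative assumptions, not the paper's implementation.

```python
import math

def entropy(x: float) -> float:
    # binary entropy psi_2, with 0 log 0 taken as 0
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def cost(pattern, docs, label):
    """G_{psi,xi}(pi) = psi(M1/N1)*N1 + psi(M0/N0)*N0, where the
    pattern matches a document when it occurs in it as a word."""
    pos = [d for d in docs if label(d)]
    neg = [d for d in docs if not label(d)]
    n1, n0 = len(pos), len(neg)
    m1 = sum(pattern in d.split() for d in pos)
    m0 = sum(pattern in d.split() for d in neg)
    return entropy(m1 / n1) * n1 + entropy(m0 / n0) * n0

def best_pattern(candidates, docs, label):
    # exhaustive search over the candidate class P
    return min(candidates, key=lambda p: cost(p, docs, label))

docs = ["shipping strike in gulf", "iranian attack on shipping",
        "grain trade rises", "metal prices steady"]
label = lambda d: d in docs[:2]                 # first two are positive
candidates = {w for d in docs for w in d.split()}
print(best_pattern(candidates, docs, label))    # -> shipping (cost 0)
```

A word occurring in every positive document and no negative one gets cost 0, the minimum, which is why "shipping" wins here.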

2.2. Keyword discovery by optimized pattern discovery

An intuition behind the application of optimized pattern discovery to text mining can be explained as follows.

Suppose that we are given as the target set a collection of text files with unknown vocabulary, say, Reuters newswires for the year 1987 [12]. We want to take a look at the contents and find a set of topic keywords characterizing the major topics arising in that year.

A possible way to find such keywords or phrases is to find the keywords that frequently appear in the target set, as in traditional data mining. However, this does not work in most text collections because in a typical English text, the most frequent keywords are stopwords like "the" or "an" (see Table 1 (a) and Table 3 (a)). These keywords are basic constituents of English grammar and convey no information on the contents of the text collection. Such frequent but less informative stopwords may hide less frequent informative keywords. The traditional information retrieval technique called stopword elimination may not work either.

A basic idea behind our method is to use an average set of texts as a control set for canceling the occurrences of frequent, non-informative keywords. The control set will be a set of documents randomly drawn from the whole text collection or the internet. We can easily observe that most stopwords appear evenly in the target and the control set, while informative keywords appear more frequently in the target set than in the control set. Therefore, the optimized pattern discovery algorithm will find those keywords or phrases that appear more frequently in the target set than in the control set by minimizing a given statistical measure such as the information entropy or the prediction error (see Table 1 (b)-(c) and Table 3 (b)-(c)).

2.3. The class of patterns

The class of patterns we consider is the class of proximity phrase association patterns [3]. We mean by a phrase any string of tokens, which may be either letters or words, of arbitrary length. A phrase association pattern (phrase pattern) is an expression of the form

    ⟨(attacks), (iranian oil platform); 8⟩,

which expresses that the phrase (attacks) first appears in a document and then the phrase (iranian oil platform) follows within eight words. A phrase pattern can contain an arbitrary but bounded number of phrases as its components. If the order of the phrases in a pattern matters, as in the above example, then we call it ordered, and otherwise unordered. Proximity phrase association patterns can be regarded as a generalization of association rules in transaction databases [1] such that (i) each item is a phrase of arbitrary


Table 1. Phrase mining with entropy optimization to capture the (ship) category. To see the effectiveness of entropy optimization, we mine only patterns with d = 1, i.e., single phrases. (a) The best ten frequent phrases found by traditional frequent pattern mining. (b) Short phrases of smallest rank, 1-10, and (c) long phrases of middle rank, 261-270, found by entropy minimization mining. The data set consists of 19,043 articles of 15.2MB from Reuters newswires in 1987.

(a) Frequency maximization: ranks 1-10.
(b) Entropy minimization: ranks 1-10, including (shipping), (iranian), and (iran).
(c) Entropy minimization: ranks 261-270, including (iranian oil platform).

length, (ii) items are ordered, and (iii) a proximity constraint is introduced.
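The matching semantics of ordered proximity patterns can be sketched as follows. This is our reading of the informal definition above (each phrase must start within k words after the previous phrase ends), not the authors' matcher; it backtracks over all occurrences rather than taking only the leftmost one.

```python
def matches(pattern, k, words):
    """Does the ordered proximity phrase pattern <p1,...,pd; k> match a
    document, given as a list of words?  Each phrase is a tuple of words."""
    def occurrences(phrase, start, end):
        n = len(phrase)
        return [i for i in range(start, min(end, len(words) - n + 1))
                if tuple(words[i:i + n]) == phrase]

    def match_from(j, pos):
        if j == len(pattern):
            return True
        # the first phrase may start anywhere; later ones within k words
        end = len(words) if j == 0 else pos + k + 1
        return any(match_from(j + 1, i + len(pattern[j]))
                   for i in occurrences(pattern[j], pos, end))

    return match_from(0, 0)

doc = "attacks on the iranian oil platform".split()
pat = [("attacks",), ("iranian", "oil", "platform")]
print(matches(pat, 8, doc))   # True: the gap is only two words
print(matches(pat, 1, doc))   # False: a one-word proximity is too tight
```

The backtracking matters: with a greedy leftmost match, a pattern like ⟨(a), (b); 0⟩ would wrongly fail on "a a b".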

3. A fast and robust text mining algorithm for ordered patterns

If the maximum number of phrases in a pattern is bounded by a constant d, then the frequent pattern problems for both unordered and ordered proximity phrase association patterns are solvable by the Enumerate-Scan algorithm [15], a modification of a naive generate-and-test algorithm, in O(n^{d+1}) time.
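A minimal generate-and-test procedure for the d = 2 case might look like this. It is our illustration of the Enumerate-Scan idea (enumerate every candidate phrase pair, then scan all documents), not the original algorithm or its optimized form.

```python
from itertools import product

def enumerate_scan(docs, k, min_count, max_phrase_len=2):
    """Count every ordered two-phrase pattern <p1, p2; k> over the corpus
    and keep those occurring in at least min_count documents."""
    split_docs = [d.split() for d in docs]

    def phrases(words):
        # every phrase of up to max_phrase_len words occurring in words
        return {tuple(words[i:i + n])
                for n in range(1, max_phrase_len + 1)
                for i in range(len(words) - n + 1)}

    def occurs(p1, p2, words):
        # some occurrence of p1 followed by p2 starting within k words
        for i in range(len(words) - len(p1) + 1):
            if tuple(words[i:i + len(p1)]) == p1:
                end = i + len(p1)
                for j in range(end, min(end + k + 1,
                                        len(words) - len(p2) + 1)):
                    if tuple(words[j:j + len(p2)]) == p2:
                        return True
        return False

    candidates = set().union(*(phrases(w) for w in split_docs))
    counts = {}
    for p1, p2 in product(candidates, repeat=2):
        c = sum(occurs(p1, p2, w) for w in split_docs)
        if c >= min_count:
            counts[(p1, p2)] = c
    return counts

docs = ["seamen were still on strike", "seamen were on strike"]
freq = enumerate_scan(docs, k=2, min_count=2)
print(freq[(("seamen",), ("were",))])   # 2: the pair occurs in both
```

The quadratic loop over candidate pairs is exactly what makes the naive approach expensive, motivating the faster algorithms the section describes.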


5. Application to interactive document browsing

We applied our text mining method to interactive document browsing. Based on the Split-Merge algorithm [4], we developed a prototype system on a Unix workstation and ran experiments on a medium-sized English text collection.

5.1. Data sets and a prototype system

We used an English text collection, called the Reuters-21578 data [12], which consists of news articles of 27MB on international affairs and trade from February to August 1987. The sample set is a collection of unstructured texts of total size 15.2MB obtained from Reuters-21578 by removing all but category tags. The target (positive) set consists of 285 articles with category (ship), and the background (negative) set consists of 18,758 articles with other categories such as (trade), (grain), (metal), and so on. The average length of articles is 799 letters.

In experiments on the Reuters-21578 data above of 15.2MB, the prototype system finds the best 600 patterns under the entropy measure in seconds for d = 2 and in a few minutes for d = 4, with proximity k = 2 words, using a few hundred megabytes of main memory on a Sun Ultra 60 (UltraSPARC-II 300MHz, g++ on Solaris 2.6, 512MB main memory) [9].

5.2. The first experiment

In Table 1 we show the list of the phrase patterns discovered by our mining system, which capture the category (ship) relative to the other categories of Reuters newswires. In Table 1 (a), we show the list of most frequent keywords discovered by the traditional frequency maximization method. In contrast, in Table 1 (b) and (c) we show the lists of optimal patterns discovered by the entropy minimization method. The patterns of smallest rank (1-10) contain the topic keywords in the major news stories for the period in 1987 (Table 1 (b)). Such keywords are hard to find by traditional frequent pattern discovery because of the existence of high-frequency words such as (the) and (are). The patterns of medium rank (261-270) are long phrases, such as (lloyds shipping intelligence) and (iranian oil platform), which serve as a summary (Table 1 (c)) and cannot be represented by any combination of non-contiguous keywords.

5.3. The second experiment

Table 2 shows an experiment on interactive text browsing, where we try to find an article containing a specific topic in a collection of documents by using optimized pattern discovery combined with keyword search.
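As an illustration of combining optimized pattern discovery with keyword search, the following sketch mines single-word patterns by entropy cost and then narrows the target set by a chosen keyword. The scorer and helper names are ours, not the prototype's.

```python
import math

def entropy(x: float) -> float:
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def mine_keywords(target, background, top=5):
    """Rank single words by the entropy cost of their contingency table
    over target vs. background (lower = more discriminating).  A toy
    stand-in for the prototype's phrase miner."""
    words = sorted({w for d in target for w in d.split()})
    def cost(w):
        m1 = sum(w in d.split() for d in target)
        m0 = sum(w in d.split() for d in background)
        return (entropy(m1 / len(target)) * len(target)
                + entropy(m0 / len(background)) * len(background))
    return sorted(words, key=cost)[:top]

def drill_down(target, keyword):
    """Articles mentioning the chosen keyword become the new target set;
    the remaining old target becomes the new background set."""
    hit = [d for d in target if keyword in d.split()]
    miss = [d for d in target if keyword not in d.split()]
    return hit, miss

target = ["union seamen on strike", "union pay talks stall",
          "tanker hit in gulf"]
new_target, new_background = drill_down(target, "union")
print(len(new_target), len(new_background))   # 2 1
```

Repeating mine_keywords on the narrowed sets and drilling down again gives the multi-stage session that Table 2 records.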

First, we suppose that a user is looking for articles related to a labor problem, but he does not know specific keywords well enough to identify such articles. (a) Starting with the original target and background sets related to the (ship) category in the last section, the user first finds topic keywords in the original target set using optimized pattern mining with d = 1, i.e., phrase mining. Let (union) be a keyword found. (b) Then, he builds a new target set by drawing all articles with keyword (union) from the original target set. The last target set is used as the new background set. As a result, we obtained a list of topic phrases related to (union). (c) Using the term (seamen) found in the last stage, we try to find long patterns consisting of four phrases such that the first phrase is (seamen), using proximity k = 2 words. In the table, the pattern ⟨(seamen) (were) (still on strike)⟩ found by the algorithm indicates that there is a strike by a union of seamen.

6. Discovery of important keywords in the Web

In this section, we apply our text mining system to the discovery of important keywords that characterize a community of web pages. In a Web search, a user may use a search engine to find a set of Web pages by giving a keyword relevant to an interesting topic. However, for too ambiguous keywords the search engine often returns a large number of documents as answers, and thus postprocessing is required to select the pages that the user is really interested in.

6.1. Method

In the experiments, the mining system receives two keywords, called the target and the control, as inputs, and computes collections of Web pages called the base sets of the target and the control keywords as follows. For a keyword w, let the base set of w be the set of the best, say 200, pages obtained by making a query w to a Web search engine, together with those pages pointed to by the hyperlinks from the first set of pages. It is well known that the base set of a keyword forms a well-connected community [10]. Then, the base sets corresponding to the target and the control keywords are given to the text mining system as positive and negative examples, respectively, and the system discovers the phrases that characterize the base set of the target keyword relative to the base set of the control keyword by using entropy minimization.

6.2. Data

We used AltaVista [2] as a Web search engine, and Perl and Java scripts to collect Web pages from the internet.
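The base-set construction described above can be sketched abstractly. Here `search` and `links` are hypothetical stand-ins for the AltaVista queries and the page-collection scripts, which we do not reproduce.

```python
def base_set(keyword, search, links, top=200):
    """Base set of a keyword: the best `top` pages returned by the
    search engine plus every page they point to via hyperlinks.
    `search` maps a keyword to a ranked page list; `links` maps a page
    to the pages it links to."""
    root = search(keyword)[:top]
    linked = {q for p in root for q in links.get(p, [])}
    return set(root) | linked

# toy four-page web standing in for real crawled pages
links = {"p1": ["p3"], "p2": ["p3", "p4"]}
search = lambda kw: ["p1", "p2"] if kw == "honda" else []
print(sorted(base_set("honda", search, links)))   # ['p1', 'p2', 'p3', 'p4']
```

Passing the search engine and link graph as parameters keeps the set construction separate from the network code, which is convenient for testing.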


Table 2. Document browsing by optimized pattern discovery. (a) First, a user mines the original target set using optimized pattern discovery over phrases with the background set (d = 1, k = 0). The user selected the term (union) of rank 51. (b) Next, the user mines the subset of articles relating to (union) by optimized phrases (d = 1, k = 0). He obtained topic terms on seamen. (c) Finally, the user discovers optimized patterns starting with (seamen) on the same target set (d = 4, k = 2 words). The underlined pattern indicates a strike by a union of seamen.

(a) Stage 1
Rank Pattern
46 (loading)
47 (u s flag)
48 (platforms)
49 (s flag)
50 (the strike)
51 (union)
52 (kuwait)
53 (waters)
54 (missiles)
55 (navy)

(b) Stage 2
Rank Pattern
1 (union)
2 (u.s.)
3 (seamen)
4 (strike)
5 (pay)
6 (gulf)
7 (employers)
8 (labor)
9 (de)
10 (redundancies)

(c) Stage 3
Rank Pattern
1 (seamen) (and) (the)
2 (seamen) (a) (pay)
3 (seamen) (were) (strike.)
4 (seamen) (were) (still on strike.)
5 (seamen) (were) (on strike.)
6 (seamen) (to) (after)
7 (seamen) (said) (were)
8 (seamen) (get) (pay)
9 (seamen) (on) (strike.)
10 (seamen) (leaders) (of)

Table 3. Mining the Web: characterizing communities of Web pages. In each table the first, second, third, and fourth columns show, respectively, the rank, the number of occurrences in the positive documents, the number of occurrences in the negative documents, and the phrase found. The number of phrases is d = 1. (a) Frequency maximization, POS = (honda). (b) Entropy minimization, POS = (honda), NEG = (softbank). (c) Entropy minimization, POS = (honda), NEG = (toyota).

(a) Frequency maximization: ranks 1-20, with positive occurrence counts from 17,893 down to 3,085.
(b) Entropy minimization (NEG = (softbank)): ranks 1-20; rank 2 is (prelude), with 2,125 positive and 0 negative occurrences.
(c) Entropy minimization (NEG = (toyota)): ranks 1-20, including (prelude si), (honda prelude), (99 time), (honda s), and (date 10).


The target keyword is (honda), the name of an automobile company, and the control keywords are (softbank), a company in the internet business, and (toyota), another automobile company. From these keywords (honda), (softbank), and (toyota), we obtained collections of plain text files of size 4.90MB (52,535 sentences), 3.65MB (34,170 sentences), and 5.05MB (40,757 sentences), respectively, after removing all HTML tags. Although the keywords (honda) and (toyota) can also be used as the names of persons, they were used as the names of automobile companies in most pages of high rank.

    6.3. Results

In Table 3, we show the best 20 phrases found in the target set (honda), varying the control set. In Table 3 (a), we show the phrases found by traditional frequent pattern mining with the empty control [1]. In the next two tables, we show the phrases found by mining based on entropy minimization, where the target is (honda) and the control varies between (softbank) and (toyota). We see that in (b) the mining system found more general phrases, e.g., (cars), (parts), (engine), in the target (honda) with the control (softbank), while in (c) the system found more specific phrases, e.g., (prelude si), (valkyrie), (accord), with the control (toyota). This difference in patterns comes from the fact that the target keyword (honda) is more similar to the control keyword (toyota) than to (softbank).

    7. Conclusion

In this paper, we investigated a new access method for large text databases on the internet based on text mining. First, we formalized the text mining problem as the optimized pattern discovery problem using a statistical measure. Then, we gave fast and robust pattern discovery algorithms, which are applicable to a large collection of unstructured text data in real time. We ran computer experiments on interactive document browsing and keyword discovery from the Web, which showed the efficiency of our method on real text databases.

In this paper, we regarded Web pages as unstructured texts. However, it is often more natural to consider Web pages as markup texts or semi-structured data [6, 14]. Using such structure information, we may obtain more interesting patterns that characterize a given dataset well. Thus, it will be an interesting future problem to develop an efficient algorithm for finding optimal patterns in such semi-structured data.

References

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules. In Proc. VLDB'94, 487-499, 1994.
[2] Compaq Computer K.K., AltaVista. http://www.altavista.com, 2000.
[3] H. Arimura, A. Wataki, R. Fujino, S. Arikawa, A fast algorithm for discovering optimal string patterns in large text databases. In Proc. ALT'98, LNAI 1501, 247-261, 1998.
[4] H. Arimura, S. Arikawa, S. Shimozono, Efficient discovery of optimal word-association patterns in large text databases. New Generation Computing, 18, 49-60, 2000.
[5] W. W. Cohen, Y. Singer, Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2), 141-173, 1999.
[6] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery, Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118, 69-114, 2000.
[7] R. Fujino, H. Arimura, S. Arikawa, Discovering unordered and ordered phrase association patterns for text mining. In Proc. PAKDD 2000, LNAI 1805, 281-293, 2000.
[8] T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Data mining using two-dimensional optimized association rules. In Proc. SIGMOD'96, 13-23, 1996.
[9] T. Kasai, T. Itai, H. Arimura, S. Arikawa, Exploratory document browsing using optimized text data mining. In Proc. Data Mining Workshop, 24-30, 1999 (in Japanese).
[10] J. M. Kleinberg, Authoritative sources in a hyperlinked environment. In Proc. SODA'98, 668-677, 1998.
[11] M. J. Kearns, R. E. Schapire, L. M. Sellie, Toward efficient agnostic learning. Machine Learning, 17(2-3), 115-141, 1994.
[12] D. Lewis, Reuters-21578 text categorization test collection, Distribution 1.0. AT&T Labs-Research, http://www.research.att.com, 1997.
[13] S. Morishita, On classification and regression. In Proc. DS'98, LNAI 1532, 49-59, 1998.


[14] H. Sakamoto, H. Arimura, S. Arikawa, Identification of tree translation rules from examples. In Proc. the 5th International Colloquium on Grammatical Inference (ICGI 2000), LNAI 1891, 241-255, 2000.
[15] J. T. L. Wang, G. W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, K. Zhang, Combinatorial pattern discovery for scientific data: Some preliminary results. In Proc. SIGMOD'94, 115-125, 1994.
