
Journal of Intelligent & Fuzzy Systems 14 (2003) 13–24, IOS Press

Knowledge discovery in virtual community texts: Clustering virtual communities

A.M. Oudshoff (a), I.E. Bosloper (b), T.B. Klos (c) and L. Spaanenburg (d, *)

(a) KPN Mobile, P.O. Box 30139, 2500 GC The Hague, The Netherlands
(b) ECCOO, Broerstraat 4, 9712 CP Groningen, The Netherlands
(c) CWI, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
(d) Lund University, Department of Information Technology, P.O. Box 118, 22100 Lund, Sweden

Abstract. Automatic knowledge discovery from texts (KDT) is proving to be a promising method for businesses today to deal with the overload of textual information. In this paper, we first explore the possibilities for KDT to enhance communication in virtual communities, and then we present a practical case study with real-life Internet data. The problem in the case study is to manage the very successful virtual communities known as 'clubs' of the largest Dutch Internet Service Provider. It is possible for anyone to start a club about any subject, resulting in over 10,000 active clubs today. At the beginning, the founder assigns the club to a predefined category. This often results in illogical or inconsistent placements, which means that interesting clubs may be hard to locate for potential new members. The ISP therefore is looking for an automated way to categorize clubs in a logical and consistent manner. The method used is the so-called bag-of-words approach, previously applied mostly to scientific texts and structured documents. Each club is described by a vector of word occurrences of all communications within that club. Latent Semantic Indexing (LSI) is applied to reduce the dimensionality problem prior to clustering. Clustering is done by the Within Groups Clustering method using a cosine distance measure appropriate for texts. The results show that KDT and the LSI method can successfully be applied for clustering the very volatile and unstructured textual communication on the Internet.

Keywords: Knowledge discovery in text, clustering, virtual community, latent semantic indexing, web portal

1. Introduction

The Internet has grown by large factors in size and popularity over just a few years. Such a steep outgrowth creates scaling problems of a size unheard of in a digital world. By early 2000, the World-Wide Web already contained 19 terabytes of information in 1 billion documents, and 1.5 million documents were added daily [4]. Such an abundance of information, stored with just an informal organizational structure over a very wide area, is bound to create a processing overload.

Search engines have been developed as a first line of defense. They promise to find some relevant information on the basis of a few keywords. But even when aided by a pre-engineered index database, the operation needs skill and experience to provide near-acceptable results. Furthermore, the first-generation engines inspected only static links and consequently reached only 300 million documents [1]. Meta-engines and deep-crawlers are required to go beyond that point.

∗Corresponding author. E-mail: [email protected].

The search space can already be decreased by organizational measures. In business-to-business applications, a strict and enforced standardization of the directory structure and file naming conventions allows larger information chunks than mere files to be placed at the focal point of attention. Placing the entire product documentation at the disposal of a user community can be easily and efficiently accomplished through an optimized storage agreement.

At first sight, the situation is more difficult in consumer applications. Here a strict standardization is more difficult to enforce, and there seems to be no opportunity for a structured decrease of the search space.

1064-1246/03/$8.00 © 2003 – IOS Press. All rights reserved


A possible exception is the virtual community, where users create and find their own corner. But again the scaling problem plays havoc: when the number of communities grows, the self-styled structure becomes hard to enforce or even maintain.

Not only on the Internet, but also in any organization, the amount of textual information that needs to be processed continues to grow. It has been estimated that more than 80% of the information in an organization is stored in textual form. Given the large increase in the amount of numerical and symbolic data stored in databases, such as transaction records in the telecommunications or credit card industry, we can only imagine the quantity of textual information in organizations today. Knowledge Discovery in Databases technology is finally being widely adopted by organizations to process the numerical and symbolic information stored in databases and make more informed business decisions. We propose that Knowledge Discovery in Text (KDT) can similarly be used to streamline the large amounts of textual information and put the knowledge contained in these texts to business use.

In this paper we investigate the use of KDT for the self-configuration of a virtual community portal. First we introduce the problem area: what is a virtual community and why does an Internet Service Provider (ISP) host communities? We then survey the current application of Knowledge Discovery in Text, in general and within virtual communities. After a short discussion of the MIDAS methodology, we discuss how such a systematic procedure helps to chart the commonalities between the communities, and illustrate this with the results of a recent study executed with a Dutch provider. The paper ends with a short display of the results obtained and future research directions.

2. Problem domain

An ISP provides services such as Internet access, website hosting, content provisioning through portal and information services, marketplace hosting, etcetera. Most ISPs charge their customers a monthly fee for Internet access, but there is an increasing number of ISPs that provide free Internet access. They make money either on the kickback fee provided by telecommunications companies based on the telephone traffic they generate, or by advertisements placed on their portals.

Both sources of income depend on the number of visitors that an ISP manages to attract and retain over a period of time. It is therefore vital for free access ISPs to offer enough interesting content on their portals. This content should attract new customers and keep current customers returning as often as possible and staying online as long as possible.

2.1. Virtual communities

In [10], a virtual community is defined as an "on-line social network of a group of people with a common interest". Virtual communities have become a popular way to create meeting rooms for similar souls. People with a special interest are attracted to a site where they are guaranteed to find other people with the same interest. This can be a common hobby, a common fanship or even a common occupation. The needs that a virtual community can serve are: information, relationships, relaxation and transactions. As [23] points out, virtual communities are a means to get people to return regularly to the same place on the Internet. This is why ISPs in general and free access ISPs in particular find it useful to host virtual communities.

"Het Net" is the largest Dutch free access ISP, with over 1 million regular visitors. As stated above, this ISP has a business interest in hosting virtual communities. As starting and maintaining communities can be a very time-consuming task, "Het Net" decided to host a virtual community portal, where each visitor can start his or her own virtual community on virtually any topic. This way, most of the effort is left to the visitors, and it is up to Het Net to provide the infrastructure and services surrounding this virtual community portal. A virtual community of Het Net is known as a club, and there are currently almost 20,000 clubs, of which more than 10,000 are active on a regular basis.

A typical virtual community offers a range of rooms and memorabilia. A club on the portal of "Het Net" consists largely of an agenda, a music & movies collection, a chat room, a discussion forum, a photo gallery, a flea market, recent news topics and links to other sites (Fig. 1).

2.2. The club clustering problem

Every visitor of the portal can start a new club on virtually any topic that he or she is interested in. The portal provides a tree-like structure of categories in which the creator can place his or her club. The two main problems of this approach are, first, inconsistency and, second, maintenance. Inconsistency automatically ensues when humans are required to categorize items, as is well known from e.g. library cataloguing studies.

Fig. 1. Homepage of a virtual community.

In the club context this results in inconsistent club placement, where e.g. some of the Britney Spears fan clubs reside in the "music" category and other Britney Spears fan clubs reside in the "fan clubs" directory. The maintenance problem is related to the growth and decline of interest in certain topics over time. Currently, there are so many Britney Spears fan clubs on "Het Net" that this is a category of clubs in itself. However, it is likely that in the future this number of clubs will gradually decrease and eventually only a few will survive. If "Het Net" does not keep up with these changes, the category tree of clubs will become very unbalanced, and thus it will be harder for potential new members to find the clubs that they are interested in within a reasonable amount of time.

Knowledge Discovery in Texts holds the promise of automating or supporting tasks that involve text documents. The communication within a club can be considered a text document. Examples of texts in a club are chat messages, posts on the forum, the club description, etcetera. As there will be lively communication within the active communities, one may suppose that the collection of these messages brings the knowledge that guides the dynamic organization of the clubs on the portal. The purpose of our case study is to investigate the usefulness of KDT to automate or support the regular construction of a balanced club category tree.

A main issue in this problem domain is the nature of the text documents used. Problems in the automatic processing of club communication include: lack of attention to correctness and clarity of spelling and style, emotional coloring rather than objective semantics, as well as the particular vocabulary. As research has demonstrated, the vocabulary in speech is only 20% of the full language; the volatile communication within a community by chatting and email comes closer to oral than to written communication. Such observations lead to the question whether community communication has sufficient content to allow for automatic processing.

The problem is therefore to extract from the content of the textual messages a structure of the interests, backgrounds or other participant attributes that provides enough information on the topic of the club to allow for the construction of a balanced and useful topic category tree. In this context, useful refers to the ability of new visitors to find the clubs they are potentially interested in within a reasonable amount of time. Members of a club can always locate their clubs because the portal provides a personalized club shortlist for members.

Success in this application of KDT within virtual communities may pave the way for other useful applications in this volatile and unstructured text domain. In the next section, we will discuss a number of potential applications.


3. KDT for virtual communities

We define Knowledge Discovery in Texts (KDT) as: "The process of extracting interesting and non-trivial patterns or knowledge from textual documents". This definition is analogous to the definition of data mining, substituting 'large amounts of data' with 'textual documents'. Important in the definition is the fact that KDT is a process, i.e. it is not one magical technique, but constitutes a complete process of collecting information, pre-processing texts, mining the pre-processed information and using the results in an intelligent way. This process will be described in more detail in Section 4.

3.1. Definitions

KDT and KDD (Knowledge Discovery in Databases) are very similar in the mining step, and many techniques familiar from data mining can be used to perform text mining as well. The main difference is in the pre-processing step. Even more than in KDD, the pre-processing step in KDT is crucial to find a representation of the texts that can be used by the text mining techniques to produce meaningful results. The current KDT literature describes an abundance of possibly useful text preprocessing steps. There are however no clear guidelines on when to use which preprocessing step(s). In Section 4.2 we will discuss the preprocessing steps chosen for our case study.

One question that surfaces when studying the available KDT literature is that of the difference between the field of KDT and that of natural language processing (NLP). The goal of NLP has been defined as "to better understand natural language by using computers". Both fields study texts and try to extract information. There are however three main differences, as pointed out by [11]:

– KDT uses induction techniques such as classification to produce results;
– KDT results are comprehensible and actionable, whereas NLP may produce statistical tables and principal component analysis results;
– KDT is usually applied to many texts at the same time, whereas NLP is mainly concerned with studying one text at a time.

3.2. KDT applications

In summary, KDT is a process that induces meaningful and actionable information from collections of text documents. Its applications can be roughly divided into four main areas: (a) Information Retrieval, (b) Information Filtering, (c) Information Routing, and (d) Knowledge Extraction.

Textual information such as the contents of a library or a collection of web pages usually needs to be accessible, in order to retrieve information from the collection whenever it is needed. This means that textual information needs to be organized in such a way that it can be navigated and searched. This field of expertise is generally known as information retrieval, and a large amount of the available KDT literature is concerned with it. Text mining techniques can for example be applied to automatically categorize or label new documents in a collection, which was previously a labor-intensive and rather subjective task.

New information is generated every day, for example newspaper articles and electronic mail messages. To reduce the amount of information reaching, for example, the employees of a company, information filtering techniques can be applied. Simple versions of these techniques can be found in Microsoft Outlook's Inbox Assistant, which can apply rules to delete messages with certain keywords in the subject or coming from a certain source. More intelligent techniques model the interests of a user and provide only those articles or messages that the user will be interested in.
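Keyword- and sender-based rules of the kind just described can be sketched in a few lines. This is a toy illustration only; the message fields, addresses and keywords below are invented, and real mail rules (such as those in Outlook's Inbox Assistant) are configured rather than coded:

```python
# Toy rule-based information filtering: drop messages whose subject contains
# a banned keyword or whose sender is on a block list (all data invented).
def keep_message(msg, banned_keywords, banned_senders):
    """Return False if the message matches any filtering rule."""
    subject = msg["subject"].lower()
    if any(kw in subject for kw in banned_keywords):
        return False
    if msg["sender"] in banned_senders:
        return False
    return True

inbox = [
    {"sender": "news@example.com", "subject": "Weekly digest"},
    {"sender": "spam@example.com", "subject": "WIN a prize now"},
]
kept = [m for m in inbox if keep_message(m, {"win", "prize"}, {"spam@example.com"})]
print([m["subject"] for m in kept])  # ['Weekly digest']
```

The more intelligent techniques mentioned above would replace the hand-written rules with a learned model of the user's interests.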

Text mining techniques can also be applied to assess the content of a message and then route it to the appropriate recipient. These techniques are now starting to be used in e.g. e-mail handling in customer contact centers. Some early experiences are discussed in [19].

The last main area of text mining applications is to extract knowledge about one or more texts, in a more concise form than the texts themselves. Examples include automatic intelligent summarization, trend detection in documents about a certain topic, analysis of co-occurring words in texts, and document clustering where each cluster of documents is described very briefly, which provides a quick overview of even large collections of documents.

3.3. Text mining techniques

The result of pre-processing texts is usually a weighted term vector per document. This numeric representation can be used in the actual mining step in the KDT process. Well-known data mining methods such as classification, association and clustering are applied in the text mining process. Also, special text mining techniques such as summarization can be applied.
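The term vector representation mentioned above can be sketched as follows. This is a minimal illustration with invented toy documents, using plain word counts before any weighting is applied:

```python
from collections import Counter

# Minimal bag-of-words sketch: one word-frequency vector per document over
# a shared vocabulary (documents invented for illustration).
docs = ["the band played a song", "the song was a hit song"]
counts = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*counts))          # shared vocabulary, sorted
vectors = [[c[w] for w in vocab] for c in counts]  # one count vector per doc
print(vocab)
print(vectors)
```

Each document becomes a row of counts over the same vocabulary, which is the input the mining methods below operate on.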

– Classification: the method of classification is used to label each document with specific predefined categories. Classification is usually a supervised learning process, where the labeling is based on the labeling of documents in a training collection. Techniques that can be used for text classification include, among others, decision trees, neural networks, k-nearest neighbor, rule induction, naïve Bayes and support vector machines. Classification is used for example in information retrieval, to classify documents based on their content in order to be able to retrieve them efficiently.

– Clustering: the method of clustering is used to group documents into document clusters. Documents within a cluster are similar to one another, while documents in different clusters are dissimilar. Clustering is usually achieved in an unsupervised training process, where the desired clustering is unknown beforehand. The most difficult part of clustering is to establish a good similarity measure, which is used to compute the effectiveness of a specific clustering, in order to select the best clustering. Clustering is used in information retrieval, to order document collections. It can also be used in information filtering: if a document is clustered into the same group as a number of documents which the user is interested in, it is likely to be of interest to this user as well. In much the same way, clustering can be used in information routing. Last but not least, clustering is also used in knowledge extraction: a clustering of a large number of documents with a short description of each cluster can provide quick insight into the structure and content of a large document collection.

– Association: the method of association is used to detect patterns in term usage in documents. Through association techniques it can be discovered which items co-occur more frequently than is expected on the basis of their individual occurrences. This indicates a relation between these items. Association is mainly used in knowledge extraction. For example, association can be used to detect changes in term usage over time, by comparing association results of document collections at different points in time.

– Summarization: summarization is a specific text mining technique, used to provide concise and intelligent summaries of long documents. This technique can be used in information retrieval: a user can first read a summary, and if he is interested he can retrieve the entire document. In the Netherlands, this service is already offered for mobile devices by the small Sumatra company. Also, summarization is used in knowledge extraction, because a summary provides knowledge without the user needing to read the entire document.
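The similarity measure at the heart of the clustering method above can be sketched concretely. The case study uses a cosine distance between term vectors; the snippet below shows only that measure on invented toy vectors, not the Within Groups Clustering procedure itself:

```python
import math

# Cosine similarity between two term vectors: 1.0 for identical direction,
# 0.0 for documents sharing no terms (vectors invented for illustration).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term-frequency vectors over a tiny shared vocabulary (toy data):
docs = {"club_a": [3, 0, 1], "club_b": [2, 0, 1], "club_c": [0, 4, 0]}
# club_a is far more similar to club_b than to club_c:
print(cosine(docs["club_a"], docs["club_b"]) > cosine(docs["club_a"], docs["club_c"]))  # True
```

A clustering algorithm then groups documents so that within-cluster cosine similarity is high; the choice of this measure matters more than the grouping procedure itself, as the text notes.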

3.4. KDT applications in virtual communities

As described above, KDT may be applied to automate or support the construction of a topic category tree for virtual communities. There are however many other possible applications of KDT to enhance the communication in virtual communities. These applications can be divided into the same four categories as in Section 3.2: (a) Information Retrieval, (b) Information Filtering, (c) Information Routing, and (d) Knowledge Extraction.

– Information retrieval: Community information on the Internet should be organized in such a way that (potential) community members can easily find the information they require. An example text mining application is to organize the many different clubs of "Het Net" based on the content of the information exchange within clubs. Another example is the construction of a personalized list of communities that a particular person may be interested in. This list can be constructed based on a comparison of the texts that the person has accessed and the texts in the communities.

– Information filtering: Members of virtual communities have access to large amounts of information. Information filtering techniques can be applied to send them only the information they are really interested in. This service can even take the form of a "personal community visor": a view of Internet communities based on the topics of interest to a particular user. Using this visor, an Internet user can quickly detect which other communities discuss topics that are related to his or her own interests, and can also assess the amount of information on each topic at these communities.

– Information routing: New information that becomes available on the Internet can automatically be analyzed and used to inform possibly interested users of its existence. E-mail messages posted in a community forum can automatically be redirected to community members who might be able to respond. Also, these techniques can be used to automatically remove unwanted content, e.g. messages containing racist or sexist remarks.

– Knowledge extraction: Text mining techniques can be used to provide for the information needs of virtual community users, e.g. automatic intelligent summarization of the content of one or several information items in the community. These techniques can also be used to help community owners by providing insight into the topics discussed, the trends or patterns in topics in different communities, the factors that influence the success and duration of a club, the factors that influence the success of banner advertisements at certain locations, etcetera. Knowledge extraction can also be used to detect a latent interest in a community in a certain topic, based on the topic drift in closely related communities.

4. The case study

The structured discovery of knowledge from data and text needs a phased development process. A number of such techniques have been proposed in the past. They share the global division of labor into (a) data collection, (b) data preparation, (c) mining and (d) visualization [3]. As they also share a lack of generality and tend to focus on specific parts of the process, KPN Research has developed for internal use the Mining Data Successfully (MIDAS for short) procedure [8]. It consists of 8 subsequent phases, which will be presented here as a further detailing of the more common 4-step procedure.

4.1. Data collection

The data collection starts with the phase of problem definition. What is the actual text-mining problem? Clearly, a problem can only be solved once it has been defined. In the classical sense of the physical experiment, this means that the search question must be formulated, the existence of a data set must be established and validated, and finally the nature of the desired answer must be defined. In short, the problem needs to be formulated in terms of text-mining, and the solving strategy sketched and analyzed that will either give a desired answer or show irrefutably that the answer is not possible.

In the next step the required data must be collected and analyzed. It must be determined what kind of data is needed, where this data can be found and how it can be made available to the experiment. Text-mining is usually performed on a collection of text documents that can differ in format (HTML, Word or PDF) and be stored on different media (tape, hard disk or compact disk). During the data collection, all relevant documents are retrieved from the various sources and media into a single container. A single static set to operate on eases all subsequent processing: the documents are at a single location, in a single format, and will not be changed by further activities inside the community.

This requirement is sufficient but not necessary. The procedure can also work within a dynamic web environment, but will then require a more elaborate scheme of time stamping and synchronization. Furthermore, the single and central static storage ensures that all data will be available once collected and will not be removed while the process is under way. In other words, it only requires agents for the data collection and not guards during the lifetime of the project [14].

For the project reported here, the documents are based on the archives within the virtual communities of "Het Net". This is a dynamic environment, as these communities are on-line 24 hours a day. Club members access the community database through a browser, with some safety arrangements. The content of the database changes constantly as the members pass by and communicate through messages and images.

For the ease of the experiment, we have refrained from direct access to the database. We rather wanted a well-behaved environment for the experiment, one that would enable us to draw irrefutable conclusions. In other words, we have simply made an entire dump of the database content on 15 November 2000 and developed only on this copy. Despite the obvious dynamics of community communication, we will thus largely focus on a historical analysis, i.e. we will analyze the full collection of text documents on archive up to a specified moment. In this paper, that moment was 15 November 2000, resulting in a collection of 900 Mbytes of information [2].

A coarse overview of the available tables in the database is given in Table 1. All non-textual data such as images and music fragments are not contained in the database but stored on dedicated file-servers. We have exempted such files from our experiment. Also, the database tables with scarce or less relevant textual information have been discarded for our experiments. This leaves us with only pure text in the categories Forum, Photo and News. The tables for the data storage in these categories are merged to create a single community text warehouse to be operated upon.

Table 1
Content of the club database

Category             #Records  Short description                                     Usability
Agenda                  11985  Minutes of community life                             Very little
Album                   68583  Portrait gallery                                      Medium
Bargain                  3195  Internal flea-market                                  Medium but scarce
Community               15409  Membership administration                             Relevant
Document                51078  File directory                                        Zero
Forum                  196346  Threaded links to the chat rooms                      High
Photo                  484415  Annotated images                                      Medium
Link to other sites    112556  Private and personal links to outside the community  Very little
News                   100725  The news corner                                       Medium to High

Subsequently we start to identify the variables with relevance to the search question. This assumes a reasonable knowledge of the application area. Such variables or features are not necessarily directly available and must often be deduced from the existing data. Text-mining is characterized by its insatiable desire for input features. Because of the large amount of mutual correlation, this does not imply a very large search space, and we will therefore see later that the dimensions of the problem space can be appreciably brought down.

Last but not least, we need to select the tools and/or algorithms to be used during the coming text-mining process. Often this will not be a monolithic approach, but rather a judicious set of tool applications to achieve the desired effect. In the next section we will come back to this issue, with a direct focus on ways to achieve the club clustering from text documents.

4.2. Data preparation

In the second phase, the selected and unified data sets must be transformed and filtered to optimally suit the application of the text-mining tools. The result of data preparation in text mining problems is usually a weighted term vector per document, suitable for further processing with techniques well known from data mining. The algorithmic side of this problem has been widely researched in the text-mining literature. There exist many nice techniques that solve a limited part of the entire problem. Hence we will see a deliberate sequence of tool applications to create clean data to operate on. In terms of data-mining, such steps involve domain transformation (as of a time sequence into spectral data), attribute typing, data coding and the division of the data set into a training, a test and a validation part. In text-mining we see the same principles, but in some disguise.

We follow Luhn [16], who proposed "that the frequency of word occurrence in an article furnishes a useful measurement of word significance". The majority of text mining literature uses this same "bag of words" approach, where the result of the pre-processing step is usually one vector per document describing the frequency of word occurrences in that document. In order to convert documents into such vectors, the following activities can be performed:

Noise removal: this eliminates in succession all kinds of irrelevant data, such as mark-up tags, punctuation marks and spelling errors, cleaning up the input text. In this case study, spelling errors have not been corrected due to the lack of an easily available algorithm for the Dutch language.
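The first two noise removal steps, stripping mark-up tags and punctuation, can be sketched with simple regular expressions. This is only an illustrative sketch of the idea, not the case study's actual pipeline, and spelling correction is skipped here as it was in the case study:

```python
import re

# Noise removal sketch: strip mark-up tags and punctuation, normalize
# whitespace and case, before counting words.
def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)    # drop HTML mark-up tags
    text = re.sub(r"[^\w\s]", " ", text)    # drop punctuation marks
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean("<b>Hello,</b> club!!"))  # hello club
```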

Domain transformation: abbreviations and low-frequency words can be substituted with the normal phrase. Subsequently, terms can be extracted and reduced to their stems. Stemming has not been used here, again due to the lack of a stemmer for the Dutch language. Instead, lemmatization has been used: comparing the terms in the community texts to dictionary terms in Celex [5]. If words could be mapped to more than one meaning in the dictionary, the most frequent use of the term has been chosen. The terms can also be enriched when a hierarchy of the terms is defined by the user or preloaded with the system. Terms can be added based on, for example, hyponym relations. Hyponym is the linguistic term for the ‘is a’ relationship – a knife is a weapon, therefore ‘knife’ is a hyponym of ‘weapon’. A rule-based learner could be aided by a feature engineering method that maps words with low information gain to common hypernyms that yield a higher information gain. This feature engineering method could rely on WordNet, a large on-line thesaurus that contains information about synonymy and hyponymy. In the same way, synonyms can be detected and transformed to one term used for further processing. In this case study, no hyponymy or synonymy relations have been used.
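The dictionary-based lemmatization described above, including the most-frequent-reading rule for ambiguous words, can be sketched as follows. The lexicon here is a toy stand-in for a Celex-style resource; its entries and frequencies are illustrative only:

```python
# Toy stand-in for a Celex-style lexicon: maps inflected forms to
# (lemma, corpus frequency) pairs. Entries are illustrative only.
LEXICON = {
    "clubs": [("club", 120)],
    "played": [("play", 300)],
    "saw": [("see", 250), ("saw", 40)],   # ambiguous: verb vs. noun
}

def lemmatize(token: str) -> str:
    """Map a token to its most frequent lemma; keep unknown tokens as-is."""
    entries = LEXICON.get(token)
    if not entries:
        return token
    # Ambiguous forms: pick the most frequent reading, as in the paper.
    return max(entries, key=lambda e: e[1])[0]

print([lemmatize(t) for t in ["clubs", "saw", "soccer"]])   # ['club', 'see', 'soccer']
```

A hyponym-based enrichment step would work analogously, replacing terms by entries from a term hierarchy such as WordNet.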

Feature extraction: meaningful words are detected. In part, this is performed by checking lists of words without meaning (stop word removal); in part, by building histograms, following the advice of Luhn [16]. For the virtual communities, we have discarded the first 100 most frequent terms in the community texts, and we have discarded terms with fewer than 6 occurrences, resulting in a reduction from 350,000 to 60,000 terms. Usually, terms are weighted to reflect, for example, their use in a headline or abstract, or to better represent the usefulness of the word within a document, for example by normalizing term frequency with document length. In this case study, the tf*idf weighting scheme has been used [9]. This traditional term weighting scheme is used in all kinds of modified forms; the formal definition we used is given below:

A_{ij} = tf \cdot idf = \frac{\log t_{ij}}{n_j} \cdot \log\left(\frac{ndocs}{d_i}\right)

where t_{ij} is the occurrence count of term i in document j, n_j is the length of document j, d_i is the number of documents in which term i appears, and ndocs is the total number of documents in the collection.
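The tf*idf scheme defined above can be computed directly from raw term counts. This is a minimal sketch following the paper's definition literally (function and variable names are ours):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each document's term counts with the paper's scheme:
    A_ij = (log t_ij / n_j) * log(ndocs / d_i)."""
    counts = [Counter(d) for d in docs]
    ndocs = len(docs)
    df = Counter(term for c in counts for term in c)   # d_i: document frequency
    vectors = []
    for c in counts:
        n_j = sum(c.values())                          # n_j: document length
        vectors.append({t: (math.log(tf) / n_j) * math.log(ndocs / df[t])
                        for t, tf in c.items()})
    return vectors

docs = [["soccer", "soccer", "club"], ["music", "club"]]
vecs = tfidf_vectors(docs)
# "club" appears in every document, so log(ndocs/d_i) = log(1) = 0:
print(vecs[0]["club"])   # 0.0
```

Note that with this literal definition a term occurring exactly once in a document also gets weight zero (log 1 = 0); many tf*idf variants therefore use 1 + log(tf) instead.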

Additionally, word and noun phrases can be extracted. Word phrases such as ‘New York’ only bear meaning when they are recognized as being one term; when words are counted and these word phrases are split, they lose their meaning. Extracting noun phrases such as ‘artificial intelligence’ from a document requires two separate algorithms. The first is a tagging algorithm to assign part-of-speech tags (noun, verb, preposition, etc.) to the individual words, and the second is an algorithm to group the tagged words into noun phrases. Word and noun phrase extraction has not been used in this case study.
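Although phrase extraction was not used in the case study, the two-stage pipeline just described can be sketched with a toy tagger. The hardcoded tag table stands in for a real trained part-of-speech tagger and is purely illustrative:

```python
# Stage 1: a toy part-of-speech "tagger" (a lookup table, for illustration).
TAGS = {"artificial": "ADJ", "intelligence": "NOUN", "new": "ADJ",
        "york": "NOUN", "is": "VERB", "fun": "NOUN"}

# Stage 2: group maximal runs of adjectives/nouns ending in a noun.
def noun_phrases(tokens):
    phrases, current = [], []
    for tok in tokens:
        tag = TAGS.get(tok, "OTHER")
        if tag in ("ADJ", "NOUN"):
            current.append(tok)
        else:
            if current and TAGS.get(current[-1]) == "NOUN":
                phrases.append(" ".join(current))
            current = []
    if current and TAGS.get(current[-1]) == "NOUN":
        phrases.append(" ".join(current))
    return phrases

print(noun_phrases(["artificial", "intelligence", "is", "fun"]))
# ['artificial intelligence', 'fun']
```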

It is well known in data mining that pre-processing is time consuming: “In particular, the pre-processing phase is crucial to the efficiency of the process, since according to the results in different domain areas and applications, pre-processing can require as much as 80 per cent of the total effort.” [18]. In text mining this holds even more, as pre-processing is more difficult and domain dependent than in data mining, for example through the use of different languages in a document collection. Also, the effect of most pre-processing activities on the result of text mining is not unambiguous, and different authors disagree on the usefulness of activities. Most authors, for example, use some form of term filtering, but [21] shows that restricting the number of words has only a minor effect on the performance of a text retrieval system.

4.3. Document mining

The result of pre-processing is usually a weighted term vector per document. This numeric representation can be used in the actual mining. To achieve an efficient process, it is of interest to remove the potential correlation in the input features and to create an orthogonal base. This shows in a (sometimes drastic) reduction in dimensionality. Latent Semantic Indexing (LSI) [7] attempts this by replacing the histogram vector of terms by their eigenvectors. Alternatives are clustering into semantic categories or even random projection [11]. We used LSI and achieved a reduction from 60,000 to 80 dimensions.
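As a sketch of the reduction step, LSI can be computed as a truncated singular value decomposition of the term-by-document matrix (cf. Deerwester et al. [7]). The sizes below are toy values, not those of the study, which reduced 60,000 terms to 80 dimensions:

```python
import numpy as np

# Toy term-by-document matrix (rows: terms, columns: documents).
rng = np.random.default_rng(0)
A = rng.random((60, 12))
k = 3                      # target dimensionality (the paper used 80)

# Truncated SVD: keep the k largest singular values and vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T   # each document as a k-dim vector

print(docs_k.shape)   # (12, 3): 12 documents, reduced to 3 dimensions
```

Clustering then operates on the rows of `docs_k` instead of the raw 60,000-dimensional term vectors.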

Well-known data mining methods such as classification, association and clustering are applied in the text mining process. Also, special text mining techniques such as summarization can be applied. However, we found that LSI imposes a constraint on such methods, as apparently not all mining techniques operate efficiently on the LSI-generated term space [15].

Clustering is used to group documents. Documents within a cluster should be similar to one another, and between clusters documents should be dissimilar. Clustering is usually achieved in an unsupervised training process, where the desired clustering is unknown beforehand. The most difficult part of clustering is to establish a good similarity measure, which is used to compute the effectiveness of a specific clustering, in order to select the best clustering. Clustering is used in information retrieval, to order document collections. It can also be used in information filtering: if a document is clustered into the same group as a number of documents which the user is interested in, it is likely to be of interest to this user as well. In much the same way, clustering can be used in information routing. Last but not least, clustering is also used in knowledge extraction: a clustering of a large number of documents with a short description of each cluster can provide quick insight into the structure and content of a large document collection.

There are several algorithms to perform clustering. Clustering algorithms can be classified according to two aspects [17]: the generated structure, which can be hierarchical, flat or overlapping, and the technique used to implement the structure. This technique can be either partitional or agglomerative. Partitional methods divide the set into several clusters at once and then shuffle examples between clusters in order to increase the similarity within clusters and decrease the similarity between clusters. Agglomerative clustering methods gradually add examples to clusters until the final clustering is achieved. Agglomerative cluster methods can be seed-based, which means that they randomly pick a number of examples from the set as the cluster centres, and gradually add the remaining examples to these clusters. Other, non-seed-based methods regard every example as a cluster of its own, and continuously fuse the most similar clusters.

In this case study, we have used a hierarchical agglomerative clustering (HAC) method. Advantages of this method are its simplicity, speed and global optimisation [22].

At the core of any clustering algorithm lies a means to quantify the attraction between features. From the vector representation of the input features follows the notion of distance, measured in a non-linear n-dimensional space. There is more than one way to measure such a distance: either Euclidean (it can be measured with a ‘ruler’) or based on similarity. For example, in terms of road distance (a Euclidean distance) York is closer to Manchester than it is to Canterbury. However, if distance is measured in terms of the characteristics of a city, York is closer to Canterbury.

Euclidean metrics measure true straight-line distances in Euclidean space. Non-Euclidean metrics apply to distances that are not straight-line, but which obey certain rules; the Manhattan or City Block metric is an example of this type. Semi-metrics obey the first three rules but may not obey the ‘triangle’ rule; the Cosine measure is an example of this type. This is a pattern similarity measure. The cosine of the angle between two vectors x and y,

Similarity(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2}}

is identical to their correlation coefficient when the vectors are mean-centred.
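The cosine measure above is straightforward to implement; a minimal version:

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm

# Parallel vectors are maximally similar regardless of their length:
print(cosine_similarity([3, 4], [6, 8]))   # 1.0
# Orthogonal vectors share no terms at all:
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
```

The length-invariance shown in the first call is why the cosine measure suits texts: a long and a short document about the same topic get a high similarity despite very different term counts.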

From the dimensional reduction obtained by applying LSI it follows that the problem space is inherently both non-linear and multi-dimensional. Consequently, the clustering must be performed with semi-metrics, as experimentally confirmed in this project independently from [13].

Clustering methods will create clusters either sequentially or in parallel. This is purely an algorithmic difference. Actual clustering tools will also provide a number of cost functions that the algorithm will use to decide on the interpretation of the measured distances. Altogether, this gives the miner a lot of freedom and therefore a choice problem. We have pre-selected the following cost principles: (a) average linkage between groups, (b) average linkage within groups, (c) centroid, (d) median, (e) k-nearest neighbour and (f) Ward.

The HAC algorithm fits very well with LSI vectors [20], because it organizes documents in a tree-like structure. The most popular implementation of the HAC algorithm uses the nearest neighbour cost function: it repeatedly joins the most similar pair of objects that are not yet in the same cluster, where the distance between two clusters is the distance between the closest pair of points, one from each of the two clusters. The types of clusters that HAC was designed for are characterized as (a) long straggly clusters, (b) chains or (c) ellipsoidal clusters. From its thoroughly understood and well-developed theoretical basis comes a very efficient implementation. No foreknowledge in terms of a cluster centroid or representative is required; also, there is no need for re-computation of the similarity matrix during the clustering. However, it has difficulty in handling poorly separated or intertwined clusters.
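The nearest-neighbour (single-link) variant just described can be sketched in a few lines. This is a naive O(n³) toy for clarity, not the efficient SLINK implementation of [22]:

```python
def single_link_hac(points, target_clusters, dist):
    """Agglomerative clustering with the nearest-neighbour cost function:
    repeatedly merge the two clusters whose closest pair of members is nearest."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest cross-cluster pair.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the closest pair of clusters
    return clusters

euclid = lambda a, b: abs(a - b)   # toy 1-D distance
print(single_link_hac([0.0, 0.1, 5.0, 5.2, 9.9], 3, euclid))
# [[0.0, 0.1], [5.0, 5.2], [9.9]]
```

With document vectors, `dist` would be a cosine-based distance rather than this 1-D toy.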

A second choice in clustering is an appropriate measure for the distance between two documents. The problem space of LSI term vectors is inherently both non-linear and multi-dimensional. Consequently, the clustering must be performed with a semi-metric distance measure such as the cosine distance, as experimentally confirmed in this project independently from [13]. In this case study, the cosine distance has been applied, as this is the most appropriate measure for texts.

A third choice in clustering with the HAC algorithm concerns the cost function, i.e. the distance between clusters, used to determine which two clusters are most similar and should be joined in the next step. In line with the conclusions in [24], we initially used the computationally efficient “nearest neighbour” measure. However, the application of the nearest neighbour method to the case study was only moderately successful: the construction of a single very large cluster in the presence of many small ones organizes the communities in an unbalanced tree, which would hamper ease of access through the portal (Fig. 2). It was therefore concluded that computationally less simple distance measures should be evaluated. We have pre-selected the following cost principles: (a) average linkage between groups, (b) average linkage within groups, (c) centroid, (d) median and (e) k-nearest neighbour [13].
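The difference between these cost functions can be illustrated on a toy 1-D example. The sketch below (function names are ours, not from any library) contrasts the nearest-neighbour cost with an "average linkage within groups" style cost: single link makes chaining onto a large cluster look cheap, while the within-groups average penalizes merges that stretch a cluster:

```python
from itertools import combinations

def nearest_neighbour_cost(c1, c2, dist):
    """Single-link cost: distance of the closest cross-cluster pair."""
    return min(dist(a, b) for a in c1 for b in c2)

def within_groups_cost(c1, c2, dist):
    """Average linkage within groups: mean pairwise distance inside
    the would-be merged cluster, penalizing merges that stretch it."""
    merged = c1 + c2
    pairs = list(combinations(merged, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

dist = lambda a, b: abs(a - b)
chain, outlier = [0.0, 1.0, 2.0], [3.0]
print(nearest_neighbour_cost(chain, outlier, dist))  # 1.0: chaining looks cheap
print(within_groups_cost(chain, outlier, dist))      # ~1.67: the merge stretches the cluster
```

Repeated over many merges, the single-link behaviour in the first call is what produces the one giant cluster observed in the case study.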


Fig. 2. Histogram of cluster size after HAC clustering with nearest neighbours. (Axes: cluster number 1–20; number of clubs, log scale 1–10,000.)

Fig. 3. Comparison of cluster methods. (Axes: cluster number 1–20; cluster size, log scale 1–10,000. Methods compared: average linkage between groups, average linkage within groups, centroid, nearest neighbour, median.)

4.4. Evaluation and use

A comparison of clustering mechanisms has been performed on a first division into 20 clusters. This confirms the previous observation that HAC tends to result in extremely unbalanced categorizations (Fig. 2). The “average linkage within groups” method results in the only clustering with clusters of size 100 to 1000 clubs; all other methods create one cluster with almost 10,000 clubs and 19 clusters of on average 10 clubs. This is not a desirable feature of a topic category tree in a portal hosting almost 20,000 clubs.

Figure 3 shows clearly that the within-groups average linkage method is best suited for our purposes, as it provides the most balanced categorization. The clusters are still large, but this is not a problem because each cluster can be subdivided in a second run.

After selecting the average linkage within groups clustering algorithm, the quality of the resulting categorization has been evaluated. The clubs were first clustered into eight groups, or main topic categories. After that, the clubs in each main topic category were again clustered into eight groups, or minor topic categories. The resulting clusters or categories were labelled manually by inspecting the content, i.e. by inspecting the names of the clubs. Figure 4 shows a schematic overview of a part of the result. All clubs in a cluster are covered by their respective label, with only a few exceptions. For instance, the ‘Dutch soccer players’ cluster also contains a club called ‘Girls!’. Like most of the anomalies, this one can be explained by inspecting the content of the club: it seems that these girls like to talk about soccer players.

Fig. 4. Topic category tree after clustering. (Top-level clusters: Music; Computers & amusement; Sports; Dragon Ball & Pokemon; Chat, Movies; Associations, TV amusement; Animals; Sex, animals; Cars, motorbikes, trains. Within Sports: International soccer; Dutch soccer players; Dutch major league; Feyenoord; Minor league; Athletics, tennis, swimming; Bicycling (Tour de France); Basketball.)

The necessity of preselecting the number of desired clusters is problematic. Due to the still somewhat unbalanced clustering results, the clusters differ in size. The large clusters cover a much broader topic than the small ones, which is not desirable. For instance, ‘Associations, TV Amusement, Animals’ contains over 2000 clubs and therefore covers a broad topic. The small clusters are more specifically targeted at a small topic, e.g. the ‘Pokemon & Dragon Ball’ cluster only contains 250 clubs. In this case, a cluster ‘Amusement’ on the top level would be more appropriate; the ‘Pokemon & Dragon Ball’ cluster would fit in perfectly. A more flexible clustering algorithm, which determines the most appropriate number of clusters, could solve this problem.

Also, some topics are represented in different clusters. For instance, ‘chat & soaps’ contains clubs about television soaps, but the ‘Associations, TV Amusement, Animals’ section also contains clubs about (other) soaps. This makes a directed search difficult. One could distinguish user actions into doing a directed search on one side of the spectrum, and browsing on the other side [6]. This clustering would be of more use in a browsing environment.

The above evaluation shows that automatic clustering of clubs on the “Het Net” portal is indeed feasible. The first results render a topic category tree that is relatively balanced and logical. This automatic clustering can be the basis for a final topic category tree. This final step would involve moving the few illogically placed clubs, such as the “Girls!” club, and rebalancing the very large or very small topics or subtopics. This final step would require relatively little work, compared to the huge task of sifting through 10,000 clubs manually and determining what categories to place them in. Due to the general nature of our approach, we are confident that the same result can be achieved for other virtual communities or other textual web content.

An open question in clustering and in this case study is the reproducibility of the results. There is no guarantee that the clustering of clubs at a later time will yield roughly the same categories. It is not desirable to completely change the topic category tree every month or so. Therefore, further research is needed to determine how previous clustering results can be taken into account to ensure some stability in the topic tree.

5. Conclusions

This paper provides a step-by-step critique of the application of KDT to the categorization of virtual communities by mining the large variety of volatile and unstructured texts communicated in the clubs of “Het Net”. The purpose is to give a proof-of-concept for the maturity of KDT technology in such a turbulent environment, demonstrating the feasibility of automatic maintenance of communities as hosted by an Internet Service Provider.

As a first attempt, the experiment is reasonably successful. Text mining provides a drastic reduction of the feature space and clears the way for a restructuring of the access to the 10,000 clubs. The weak point for an all-out automation remains the clustering itself. Here, known improvements to the creation of a balanced structure tree need to be introduced.

At the moment, the detailed analysis of the required steps and the conscientious integration into a KDT suite of algorithms ease further experimentation. This serves fast-turnaround experimentation with alternative tooling and also supports future in-line trend monitoring. The fact that, even for the modest case study outlined here, the resulting categorization was intuitively almost correct gives sufficient credibility.


Acknowledgements

This work was performed while Oudshoff, Bosloper and Klos were at KPN Research Laboratories in Groningen (The Netherlands), and Spaanenburg was on sabbatical leave from Rijksuniversiteit Groningen. A short version of this paper was presented at the BNAIC’01 conference in Amsterdam (The Netherlands).

References

[1] K.D. Bollacker, S. Lawrence and C.L. Giles, CiteSeer: an autonomous web agent for automatic retrieval and identification of interesting publications, Nature 400 (1998), 107–109.

[2] I.E. Bosloper, Categorizing communities on the Net, MSc Thesis, Groningen University, The Netherlands, 2001.

[3] R. Brachman and T. Anand, The process of knowledge discovery in databases: a human-centered approach, Advances in Knowledge Discovery and Data Mining (1996), 37–57.

[4] Bright Planet, The deep Web: surfacing hidden value, white paper (http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp), 2000.

[5] CELEX, The Dutch Centre for Lexical Information, http://www.kun.nl/celex/.

[6] D.R. Cutting et al., A Cluster-based Approach to Browsing Large Document Collections, Proceedings 15th International SIGIR, 1992, pp. 318–329.

[7] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science 41(6) (1990), 391–407.

[8] M.P. Dieben, S.H. Kloosterman and A.M. Oudshoff, MIDAS of hoe een database in een goudmijn verandert (in Dutch: MIDAS, or how a database turns into a gold mine), Internal report (KPN Research, Groningen), 1996.

[9] S.T. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments and Computers 23(2) (1991), 229–236.

[10] J. Hagel and A.G. Armstrong, Net Gain, Harvard Business School Press, 1997.

[11] Y. Kodratoff, Knowledge discovery in texts: a definition, and applications, Foundations of Intelligent Systems, 11th International Symposium ISMIS’99, Proceedings, 1999, pp. 16–29.

[12] T. Kohonen et al., Self organization of a massive document collection, IEEE Transactions on Neural Networks 11(3) (2000), 574–585.

[13] G.N. Lance and W.T. Williams, A general theory of classificatory sorting strategies, Computer Journal 9(4) (1967), 373–380.

[14] D. Landau et al., TextVis: an integrated visual environment for text mining, Proceedings Second European Symposium on Principles of Data Mining and Knowledge Discovery PKDD’98, Nantes, France, 1998, pp. 56–64.

[15] T. Letsche and M. Berry, Large-scale information retrieval with latent semantic indexing, Information Sciences 100 (1997), 105–137.

[16] H.P. Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development 2 (1958), 159–165.

[17] Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul and D. Pelleg, Ephemeral document clustering for web applications, IBM Research Report RJ 10186, 2000.

[18] H. Mannila, Data mining: machine learning, statistics, and databases, Proceedings of the 8th International Conference on Scientific and Statistical Database Management, Stockholm, Sweden, 1996.

[19] M.A.H. Offenberg, ICT impact and organizational change, MSc Thesis, Groningen University, The Netherlands, 1998.

[20] E. Rasmussen, Clustering Algorithms, ch. 6, in: Information Retrieval: Data Structures and Algorithms, W.B. Frakes and R. Baeza-Yates, eds, Prentice Hall, New Jersey, 1992.

[21] G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Technical Report 87-881, Cornell University, Department of Computer Science, 1987.

[22] R. Sibson, SLINK: an optimally efficient algorithm for the single link cluster method, Computer Journal 16(1) (1973), 30–34.

[23] L. Wladimiroff, T.W. Geurts and S. Thie, Virtual communities: Service aan en interactie met klanten (in Dutch: service to and interaction with customers), Internal Report (KPN Research, Groningen), 1998.

[24] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility Demonstration, Proceedings SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 46–54.
