
Learning Lightweight Ontologies from

Text across Different Domains using the

Web as Background Knowledge

Wilson Yiksen Wong
M.Sc. (Information and Communication Technology), 2005

B.IT. (HONS) (Data Communication), 2003

This thesis is presented for the degree of

Doctor of Philosophy

of The University of Western Australia

School of Computer Science and Software Engineering.

September 2009


To my wife Saujoe

and

my parents and sister


Abstract

The ability to provide abstractions of documents in the form of important con-

cepts and their relations is a key asset, not only for bootstrapping the Semantic

Web, but also for relieving us from the pressure of information overload. At present,

the only viable solution for arriving at these abstractions is manual curation. In

this research, ontology learning techniques are developed to automatically discover

terms, concepts and relations from text documents.

Ontology learning techniques rely on extensive background knowledge, ranging

from unstructured data such as text corpora, to structured data such as a semantic

lexicon. Manually-curated background knowledge is a scarce resource for many do-

mains and languages, and the effort and cost required to keep the resource abreast
of the times is often high. More importantly, the size and coverage of manually-curated

background knowledge is often inadequate to meet the requirements of most on-

tology learning techniques. This thesis investigates the use of the Web as the sole

source of dynamic background knowledge across all phases of ontology learning for

constructing term clouds (i.e. visual depictions of terms) and lightweight ontolo-

gies from documents. To appreciate the significance of term clouds and lightweight

ontologies, a system for ontology-assisted document skimming and scanning is de-

veloped.

This thesis presents a novel ontology learning approach that is devoid of any

manually-curated resources, and is applicable across a wide range of domains (the

current focus is medicine, technology and economics). More specifically, this research

proposes and develops a set of novel techniques that take advantage of Web data to

address the following problems: (1) the absence of integrated techniques for cleaning

noisy data; (2) the inability of current term extraction techniques to systematically

explicate, diversify and consolidate their evidence; (3) the inability of current corpus

construction techniques to automatically create very large, high-quality text corpora

using a small number of seed terms; and (4) the difficulty of locating and preparing

features for clustering and extracting relations.

This dissertation is organised as a series of published papers that contribute to

a complete and coherent theme. The work on the individual techniques of the

proposed ontology learning approach has resulted in a total of nineteen published

articles: two book chapters, four journal articles, and thirteen refereed conference

papers. The proposed approach consists of several major contributions to each task

in ontology learning. These include (1) a technique for simultaneously correcting

noise such as spelling errors, expanding abbreviations and restoring improper casing


in text; (2) a novel probabilistic measure for recognising multi-word phrases; (3) a

probabilistic framework for recognising domain-relevant terms using formal word

distribution models; (4) a novel technique for constructing very large, high-quality

text corpora using only a small number of seed terms; and (5) novel techniques for

clustering terms and discovering coarse-grained semantic relations using featureless

similarity measures and dynamic Web data. In addition, a comprehensive review

is included to provide background on ontology learning and recent advances in this

area. The implementation details of the proposed techniques are provided at the

end, together with a description on how the system is used to automatically discover

term clouds and lightweight ontologies for document skimming and scanning.


Acknowledgements

First and foremost, this dissertation would not have come into being without the

continuous support provided by my supervisors Dr Wei Liu and Prof Mohammed

Bennamoun. Their insightful guidance, financial support and broad interest made

my research journey at the School of Computer Science and Software Engineering

(CSSE) an extremely fruitful and enjoyable one. I am proud to have Wei and

Mohammed as my mentors and personal friends.

I would also like to thank Dr Krystyna Haq, Mrs Jo Francis and Prof Robyn

Owens for being there to answer my questions on general research skills and scholar-

ships. A very big thank you goes to the Australian Government and the University

of Western Australia for sponsoring this research under the International Postgrad-

uate Research Scholarship and the University Postgraduate Award for International

Students. I am also very grateful to CSSE, and Dr David Glance of the Centre

for Software Practice (CSP) for providing me with a multitude of opportunities to

pursue this research further.

I would like to thank the other members of CSSE including Prof Rachell Cardell-

Oliver, Assoc/Prof Chris McDonald and Prof Michael Wise for their advice. My

appreciation goes to my office mates Faisal, Syed, Suman and Majigaa. A special

thank you to the members of the CSSE’s support team, namely, Laurie, Ashley,

Ryan, Sam and Joe, for always being there to restart the virtual machine and to

fix my laptop computers after accidental spills. Not forgetting the amicable peo-

ple in CSSE’s administration office, namely, Jen Redman, Nicola Hallsworth, Ilse

Lorenzen, Rachael Offer, Jayjay Jegathesan and Jeff Pollard for answering my ad-

ministrative and travel needs, and making my stay at CSSE an extremely enjoyable

one.

I also had the pleasure of meeting with many reputable researchers during my

travel whose advice has been invaluable. To name a few, Prof Kyo Kageura, Prof

Udo Hahn, Prof Robert Dale, Assoc/Prof Christian Gutl, Prof Arno Scharl, Prof

Albert Yeap, Dr Timothy Baldwin, and Assoc/Prof Stephen Bird. A special thank

you to the wonderful people at the Department of Information Science, University of

Otago for being such a gracious host during my visit to Dunedin. I would also like to

extend my gratitude for the constant support and advice provided by researchers at

the Curtin University of Technology, namely, Prof Moses Tade, Assoc/Prof Hongwei

Wu, Dr Nicoleta Balliu and Prof Tharam Dillon. In addition, my appreciation goes

to my previous mentors Assoc/Prof Ongsing Goh and Prof Shahrin Sahib of the

Technical University of Malaysia Malacca (UTeM), and Assoc/Prof R. Mukundan


of the University of Canterbury. I should also acknowledge my many friends and

colleagues at the Faculty of Information and Communication Technology at UTeM.

My thank you also goes to the anonymous reviewers who have commented on all

publications that have arisen from this thesis.

Last but not least, I will always remember the unwavering support provided by

my wife Saujoe, my parents and my only sister, without which I would not have

cruised through this research journey so pleasantly. Also a special appreciation to

the city of Perth for being such a nice place to live in and to undertake this research.


Contents

List of Figures v

Publications Arising from this Thesis xiv

Contribution of Candidate to Published Work xviii

1 Introduction 1

1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Overview of Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.1 Text Preprocessing (Chapter 3) . . . . . . . . . . . . . . . . . 4

1.3.2 Text Processing (Chapter 4) . . . . . . . . . . . . . . . . . . . 5

1.3.3 Term Recognition (Chapter 5) . . . . . . . . . . . . . . . . . . 6

1.3.4 Corpus Construction for Term Recognition (Chapter 6) . . . . 7

1.3.5 Term Clustering and Relation Acquisition (Chapter 7 and 8) . 9

1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.5 Layout of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Background 13

2.1 Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2 Ontology Learning from Text . . . . . . . . . . . . . . . . . . . . . . 14

2.2.1 Outputs from Ontology Learning . . . . . . . . . . . . . . . . 15

2.2.2 Techniques for Ontology Learning . . . . . . . . . . . . . . . . 17

2.2.3 Evaluation of Ontology Learning Techniques . . . . . . . . . . 21

2.3 Existing Ontology Learning Systems . . . . . . . . . . . . . . . . . . 24

2.3.1 Prominent Ontology Learning Systems . . . . . . . . . . . . . 25

2.3.2 Recent Advances in Ontology Learning . . . . . . . . . . . . . 33

2.4 Applications of Ontologies . . . . . . . . . . . . . . . . . . . . . . . . 36

2.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3 Text Preprocessing 41

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.3 Basic ISSAC as Part of Text Preprocessing . . . . . . . . . . . . . . . 45

3.4 Enhancement of ISSAC . . . . . . . . . . . . . . . . . . . . . . . . . . 48


3.5 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 52

3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.7 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.8 Other Publications on this Topic . . . . . . . . . . . . . . . . . . . . 56

4 Text Processing 57

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.3 A Probabilistic Measure for Unithood Determination . . . . . . . . . 60

4.3.1 Noun Phrase Extraction . . . . . . . . . . . . . . . . . . . . . 61

4.3.2 Determining the Unithood of Word Sequences . . . . . . . . . 62

4.4 Evaluations and Discussions . . . . . . . . . . . . . . . . . . . . . . . 68

4.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 70

4.6 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.7 Other Publications on this Topic . . . . . . . . . . . . . . . . . . . . 71

5 Term Recognition 73

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2 Notations and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.3.1 Existing Probabilistic Models for Term Recognition . . . . . . 79

5.3.2 Existing Ad-Hoc Techniques for Term Recognition . . . . . . . 80

5.3.3 Word Distribution Models . . . . . . . . . . . . . . . . . . . . 84

5.4 A New Probabilistic Framework for Determining Termhood . . . . . . 89

5.4.1 Parameters Estimation for Term Distribution Models . . . . . 95

5.4.2 Formalising Evidences in a Probabilistic Framework . . . . . . 100

5.5 Evaluations and Discussions . . . . . . . . . . . . . . . . . . . . . . . 111

5.5.1 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . . . 112

5.5.2 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . 116

5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

5.7 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

5.8 Other Publications on this Topic . . . . . . . . . . . . . . . . . . . . 125

6 Corpus Construction for Term Recognition 127

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.2 Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

6.2.1 Webpage Sourcing . . . . . . . . . . . . . . . . . . . . . . . . 130


6.2.2 Relevant Text Identification . . . . . . . . . . . . . . . . . . . 132

6.2.3 Variability of Search Engine Counts . . . . . . . . . . . . . . . 133

6.3 Analysis of Website Contents for Corpus Construction . . . . . . . . 134

6.3.1 Website Preparation . . . . . . . . . . . . . . . . . . . . . . . 136

6.3.2 Website Filtering . . . . . . . . . . . . . . . . . . . . . . . . . 137

6.3.3 Website Content Localisation . . . . . . . . . . . . . . . . . . 143

6.4 Evaluations and Discussions . . . . . . . . . . . . . . . . . . . . . . . 145

6.4.1 The Impact of Search Engine Variations on Virtual Corpus

Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

6.4.2 The Evaluation of HERCULES . . . . . . . . . . . . . . . . . 149

6.4.3 The Performance of Term Recognition using SPARTAN-based

Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

6.6 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6.7 Other Publications on this Topic . . . . . . . . . . . . . . . . . . . . 160

7 Term Clustering for Relation Acquisition 161

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

7.2 Existing Techniques for Term Clustering . . . . . . . . . . . . . . . . 163

7.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.3.1 Normalised Google Distance . . . . . . . . . . . . . . . . . . . 165

7.3.2 Ant-based Clustering . . . . . . . . . . . . . . . . . . . . . . . 168

7.4 The Proposed Tree-Traversing Ants . . . . . . . . . . . . . . . . . . . 171

7.4.1 First-Pass using Normalised Google Distance . . . . . . . . . . 172

7.4.2 n-degree of Wikipedia: A New Distance Metric . . . . . . . . 176

7.4.3 Second-Pass using n-degree of Wikipedia . . . . . . . . . . . . 178

7.5 Evaluations and Discussions . . . . . . . . . . . . . . . . . . . . . . . 179

7.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 191

7.7 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

7.8 Other Publications on this Topic . . . . . . . . . . . . . . . . . . . . 192

8 Relation Acquisition 195

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

8.3 A Hybrid Technique for Relation Acquisition . . . . . . . . . . . . . . 197

8.3.1 Lexical Simplification . . . . . . . . . . . . . . . . . . . . . . . 200

8.3.2 Word Disambiguation . . . . . . . . . . . . . . . . . . . . . . 202


8.3.3 Association Inference . . . . . . . . . . . . . . . . . . . . . . . 204

8.4 Initial Experiments and Discussions . . . . . . . . . . . . . . . . . . . 206

8.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 208

8.6 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

9 Implementation 211

9.1 System Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 211

9.2 Ontology-based Document Skimming and Scanning . . . . . . . . . . 219

9.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

10 Conclusions and Future Work 229

10.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 229

10.2 Limitations and Implications for Future Research . . . . . . . . . . . 231

Bibliography 233


List of Figures

1.1 An overview of the five phases in the proposed ontology learning

system, and how the details of each phase are outlined in certain

chapters of this dissertation. . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Overview of the ISSAC and HERCULES techniques used in the text

preprocessing phase of the proposed ontology learning system. ISSAC

and HERCULES are described in Chapters 3 and 6, respectively. . . . 5

1.3 Overview of the UH and OU measures used in the text processing

phase of the proposed ontology learning system. These two measures

are described in Chapter 4. . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Overview of the TH and OT measures used in the term recognition

phase of the proposed ontology learning system. These two measures

are described in Chapter 5. . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5 Overview of the SPARTAN technique used in the corpus construc-

tion phase of the proposed ontology learning system. The SPARTAN

technique is described in Chapter 6. . . . . . . . . . . . . . . . . . . . 8

1.6 Overview of the ARCHILES technique used in the relation acquisi-

tion phase of the proposed ontology learning system. The ARCHILES

technique is described in Chapter 8, while the TTA clustering tech-

nique and noW measure are described in Chapter 7. . . . . . . . . . 9

2.1 The spectrum of ontology kinds, adapted from Giunchiglia & Za-

ihrayeu [89]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2 Overview of the outputs, tasks and techniques of ontology learning. . 15

3.1 Examples of spelling errors, ad-hoc abbreviations and improper casing

in a chat record. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.2 The accuracy of basic ISSAC from previous evaluations. . . . . . . . 48

3.3 The breakdown of the causes behind the incorrect replacements by

basic ISSAC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.4 Accuracy of enhanced ISSAC over seven evaluations. . . . . . . . . . 54

3.5 The breakdown of the causes behind the incorrect replacements by

enhanced ISSAC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


4.1 The output by Stanford Parser. The tokens in the “modifiee” column

marked with squares are head nouns, and the corresponding tokens

along the same rows in the “word” column are the modifiers. The

first column “offset” is subsequently represented using the variable i. 61

4.2 The output of the head-driven noun phrase chunker. The tokens

which are highlighted with a darker tone are the head nouns. The

underlined tokens are the corresponding modifiers identified by the

chunker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.3 The probabilities of the areas with darker shade are the denominators

required by the evidences e1 and e2 for the estimation of OU(s). . . . 66

4.4 The performance of OU (from Experiment 1) and UH (from Exper-

iment 2) in terms of precision, recall and accuracy. The last column

shows the difference between the performance of Experiment 1 and 2. 69

5.1 Summary of the datasets employed throughout this chapter for ex-

periments and evaluations. . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 Distribution of 3,058 words randomly sampled from the domain cor-

pus d. The line with the label “KM” is the aggregation of the

individual probability of occurrence of word i in a document, 1 −

P (0; αi, βi) using K-mixture with αi and βi defined in Equations 5.21

and 5.20. The line with the label “ZM-MF” is the manually fitted

Zipf-Mandelbrot model. The line labeled “RF” is the actual rate of

occurrence computed as fi/F . . . . . . . . . . . . . . . . . . . . . . . 86

5.3 Parameters for the manually fitted Zipf-Mandelbrot model for the set

of 3,058 words randomly drawn from d. . . . . . . . . . . . . . . . . . 87

5.4 Distribution of the same 3,058 words as employed in Figure 5.2. The

line with the label “ZM-OLR” is the Zipf-Mandelbrot model fitted us-

ing ordinary least squares method. The line labeled “ZM-WLS” is the

Zipf-Mandelbrot model fitted using weighted least squares method,

while “RF” is the actual rate of occurrence computed as fi/F . . . . . 97

5.5 Summary of the sum of squares of residuals, SSR and the coefficient

of determination, R2 for the regression using manually estimated pa-

rameters, parameters estimated using ordinary least squares (OLS),

and parameters estimated using weighted least squares (WLS). Ob-

viously, the smaller the SSR is, the better the fit. As for 0 ≤ R2 ≤ 1,

the upper bound is achieved when the fit is perfect. . . . . . . . . . . 98


5.6 Parameters for the automatically fitted Zipf-Mandelbrot model for

the set of 3,058 words randomly drawn. . . . . . . . . . . . . . . . . 98

5.7 Distribution of the 1,954 terms extracted from the domain corpus

d sorted according to the corresponding scores provided by OT and

TH. The single dark smooth line stretching from the left (highest

value) to the right (lowest value) of the graph is the scores assigned

by the respective measures. As for the two oscillating lines, the dark

line is the domain frequencies while the light one is the contrastive

frequencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.8 Distribution of the 1,954 terms extracted from the domain corpus d

sorted according to the corresponding scores provided by NCV and

CW . The single dark smooth line stretching from the left (highest

value) to the right (lowest value) of the graph is the scores assigned

by the respective measures. As for the two oscillating lines, the dark

line is the domain frequencies while the light one is the contrastive

frequencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.9 The means µ of the scores, standard deviations σ of the scores, sum of

the domain frequencies and of the contrastive frequencies of all term

candidates, and their ratio. . . . . . . . . . . . . . . . . . . . . . . . . 115

5.10 The Spearman rank correlation coefficients ρ between all possible

pairs of measure under evaluation. . . . . . . . . . . . . . . . . . . . . 115

5.11 An example of a contingency table. The values in the cells TP , TN ,

FP and FN are employed to compute the precision, recall, Fα and

accuracy. Note that |TC| is the total number of term candidates in

the input set TC, and |TC| = TP + FP + FN + TN . . . . . . . . . 116


5.12 The collection of all contingency tables for all termhood measures X

across all the 10 bins BXj . The first column contains the rank of the

bins and the second column shows the number of term candidates in

each bin. The third general column “termhood measures, X” holds

all the 10 contingency tables for each measure X which are organised

column-wise, bringing the total number of contingency tables to 40

(i.e. 10 bins, organised in rows by 4 measures). The structure of the

individual contingency tables follows the one shown in Figure 5.11.

The last column is the row-wise sums of TP + FP and FN + TN .

The rows beginning from the second row until the second last are the

rank bins. The last row is the column-wise sums of TP + FN and

FP + TN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.13 Performance indicators for the four termhood measures in 10 respec-

tive bins. Each row shows the performance achieved by the four

measures in a particular bin. The columns contain the performance

indicators for the four measures. The notation pre stands for preci-

sion, rec is recall and acc is accuracy. We use two different α values,

resulting in two F-scores, namely, F0.1 and F1. The values of the

performance measures with darker shades are the best performing ones. 120

6.1 A diagram summarising our web partitioning technique. . . . . . . . . 135

6.2 An illustration of an example sample space on which the probabilities

employed by the filter are based upon. The space within the dot-filled

circle consists of all webpages from all sites in J containing W . The m

rectangles represent the collections of all webpages of the respective

sites u1, ..., um. The shaded but not dot-filled portion of the space

consists of all webpages from all sites in J that do not contain W . The

individual shaded but not dot-filled portion within each rectangle is

the collection of webpages in the respective sites ui ∈ J that do not

contain W . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

6.3 A summary of the number of websites returned by the respective

search engines for each of the two domains. The number of common

sites is also provided. . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

6.4 A summary of the Spearman’s correlation coefficients between web-

sites before and after re-ranking by PROSE. The native columns

show the correlation between the websites when sorted according to

their native ranks provided by the respective search engines. . . . . . 147


6.5 The number of sites with OD less than −6 after re-ranking using

PROSE based on page count information provided by the respective

search engines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.6 A listing of the 43 sites included in SPARTAN-V. . . . . . . . . . . . 151

6.7 The number of documents and tokens from the local and virtual cor-

pora used in this evaluation. . . . . . . . . . . . . . . . . . . . . . . . 154

6.8 The contingency tables summarising the term recognition results us-

ing the various specialised corpora. . . . . . . . . . . . . . . . . . . . 156

6.9 A summary of the performance metrics for term recognition. . . . . . 156

7.1 Example of TTA at work . . . . . . . . . . . . . . . . . . . . . . . . . 172

7.2 Experiment using 15 terms from the wine domain. Setting sT = 0.92

results in 5 clusters. Cluster A is simply red wine grapes or red

wines, while Cluster E represents white wine grapes or white wines.

Cluster B represents wines named after famous regions around the

world and they can either be red, white or rose. Cluster C represents

white noble grapes for producing great wines. Cluster D represents

red noble grapes. Even though uncommon, Shiraz is occasionally

admitted to this group. . . . . . . . . . . . . . . . . . . . . . . . . . . 183

7.3 Experiment using 16 terms from the mushroom domain. Setting

sT = 0.89 results in 4 clusters. Cluster A represents poisonous mush-

rooms. Cluster B comprises edible mushrooms which are prominent

in East Asian cuisine except for Agaricus Blazei. Nonetheless, this

mushroom was included in this cluster probably due to its high con-

tent of beta glucan for potential use in cancer treatment, just like Shi-

itake. Moreover, China is the major exporter of Agaricus Blazei, also

known as Himematsutake, further relating this mushroom to East

Asia. Cluster C and D comprise edible mushrooms found mainly

in Europe and North America, and are more prominent in Western

cuisines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184


7.4 Experiment using 20 terms from the disease domain. Setting sT =

0.86 results in 7 clusters. Cluster A represents skin diseases. Cluster

B represents a class of blood disorders known as anaemia. Cluster

C represents other kinds of blood disorders. Cluster D represents

blood disorders characterised by the relatively low count of leukocytes

(i.e. white blood cells) or platelets. Cluster E represents digestive

diseases. Cluster F represents cardiovascular diseases characterised

by both the inflammation and thrombosis (i.e. clotting) of arteries

and veins. Cluster G represents cardiovascular diseases characterised

by the inflammation of veins only. . . . . . . . . . . . . . . . . . . . . 185

7.5 Experiment using 16 terms from the animal domain. Setting sT =

0.60 produces 2 clusters. Cluster A comprises birds and Cluster B

represents mammals. . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

7.6 Experiment using 16 terms from the animal domain (the same dataset

from the experiment in Figure 7.5). Setting sT = 0.72 results in

5 clusters. Cluster A represents birds. Cluster B includes hoofed

mammals (i.e. ungulates). Cluster C corresponds to predatory felines
while Cluster D represents predatory canines. Cluster E constitutes

animals kept as pets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

7.7 Experiment using 15 terms from the animal domain plus an addi-

tional term “Google”. Setting sT = 0.58 (left screenshot), sT = 0.60

(middle screenshot) and sT = 0.72 (right screenshot) result in 2 clus-

ters, 3 clusters and 5 clusters, respectively. In the left screenshot,

Cluster A acts as the parent for the two recommended clusters “bird”

and “mammal”, while Cluster B includes the term “Google”. In the

middle screenshot, the recommended clusters “bird” and “mammal”

were clearly reflected through Cluster A and C respectively. By set-

ting sT higher, we dissected the recommended cluster “mammal” to

obtain the discovered sub-clusters C, D and E as shown in the right

screenshot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7.8 Experiment using 31 terms from various domains. Setting sT = 0.70

results in 8 clusters. Cluster A represents actors and actresses. Clus-

ter B represents musicians. Cluster C represents countries. Cluster

D represents politics-related notions. Cluster E is transport. Clus-

ter F includes finance and accounting matters. Cluster G constitutes

technology and services on the Internet. Cluster H represents food. . 189


7.9 Experiment using 60 terms from various domains. Setting sT = 0.76

results in 20 clusters. Cluster A and B represent herbs. Cluster C

comprises pastry dishes while Cluster D represents dishes of Italian

origin. Cluster E represents computing hardware. Cluster F is a

group of politicians. Cluster G represents cities or towns in France

while Cluster H includes countries and states other than France. Clus-

ter I constitutes trees of the genus Eucalyptus. Cluster J represents

marsupials. Cluster K represents finance and accounting matters.

Cluster L comprises transports with four or more wheels. Cluster

M includes plant organs. Cluster N represents beverages. Cluster

O represents predatory birds. Cluster P comprises birds other than

predatory birds. Cluster Q represents two-wheeled transports. Clus-

ter R and S represent predatory mammals. Cluster T includes trees

of the genus Acacia. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

8.1 An overview of the proposed relation acquisition technique. The main

phases are term mapping and term resolution, represented by black

rectangles. The three steps involved in resolution are simplification,

disambiguation and inference. The techniques represented by the

white rounded rectangles were developed by the authors, while exist-

ing techniques and resources are shown using grey rounded rectangles. 197

8.2 Figure 8.2(a) shows the subgraph WT constructed for T = {‘baking
powder’, ‘whole wheat flour’} using Algorithm 8, which is later pruned

to produce a lightweight ontology in Figure 8.2(b). . . . . . . . . . . 201

8.3 The computation of mutual information for all pairs of contiguous

constituents of the composite terms “one cup whole wheat flour” and

“salt to taste”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

8.4 A graph showing the distribution of noW distance and the stepwise

difference for the sequence of word senses for the term “pepper”. The

set of mapped terms is M = {“fettuccine”, “fusilli”, “tortellini”, “vinegar”,
“garlic”, “red onion”, “coriander”, “maple syrup”, “whole wheat
flour”, “egg white”, “baking powder”, “buttermilk”}. The line “step-

wise difference” shows the ∆i−1,i values. The line “average stepwise

difference” is the constant value µ∆. Note that the first sense s1 is

located at x = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203


8.5 The result of clustering the non-existent term “conchiglioni” and

the mapped terms M = {“fettuccine”, “fusilli”, “tortellini”, “vinegar”,
“garlic”, “red onion”, “coriander”, “maple syrup”, “whole wheat flour”,
“egg white”, “baking powder”, “buttermilk”, “carbonara”, “pancetta”}

using TTA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

8.6 The results of relation acquisition using the proposed technique for

the genetics and the food domains. The labels “correctly xxx” and

“incorrectly xxx” represent the true positives (TP) and false positives

(FP). Precision is computed as TP/(TP + FP ). . . . . . . . . . . . . 206

8.7 The lightweight domain ontologies generated using the two sets of

input terms. The important vertices (i.e. NCAs, input terms, vertices

with degree more than 3) have darker shades. The concepts genetics

and food in the center of the graph are the NCAs. All input terms

are located along the side of the graph. . . . . . . . . . . . . . . . . . 207

9.1 The online interface for the HERCULES module. . . . . . . . . . . . . . 212

9.2 The input section of the interface algorithm issac.pl shows the er-

ror sentence “Susan’s imabbirity to Undeerstant the msg got her INTu

trubble.”. The correction provided by ISSAC is shown in the results

section of the interface. The process log is also provided through this

interface. Only a small portion of the process log is shown in this figure. 213

9.3 The online interface algorithm unithood.pl for the module unithood.

The interface shows the collocational stability of different phrases de-

termined using unithood. The various weights involved in determin-

ing the extent of stability are also provided in these figures. . . . . . . 214

9.4 The online interfaces for querying the virtual and local corpora cre-

ated using the SPARTAN module. . . . . . . . . . . . . . . . . . . . . . 215

9.5 Online interfaces related to the termhood module. . . . . . . . . . . . 216

9.6 Online interfaces related to the ARCHILES module. . . . . . . . . . . . 217

9.7 The interface data lightweightontology.pl for browsing pre-constructed

lightweight ontologies for online news articles using the ARCHILES

module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

9.8 The screenshot of the aggregated news services provided by Google

(the left portion of the figure) and Yahoo (the right portion of the

figure) on 11 June 2009. . . . . . . . . . . . . . . . . . . . . . . . . . 220

9.9 A splash screen on the online interface for document skimming and

scanning at http://explorer.csse.uwa.edu.au/research/. . . . . 221


9.10 The cross-domain term cloud summarising the main concepts occur-

ring in all the 395 articles listed in the news browser. This cloud

currently contains terms in the technology, medicine and economics

domains. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

9.11 The single-domain term cloud for the domain of medicine. This cloud

summarises all the main concepts occurring in the 75 articles listed

below in the news browser. Users can arrive at this single-domain

cloud from the cross-domain cloud in Figure 9.10 by clicking on the

[domain(s)] option in the latter. . . . . . . . . . . . . . . . . . . . . 222

9.12 The single-domain term cloud for the medicine domain. Users can

view a list of articles describing a particular topic by clicking on the

corresponding term in the single-domain cloud. . . . . . . . . . . . . 223

9.13 The use of document term cloud and information from lightweight

ontology to summarise individual news articles. Based on the term

size in the clouds, one can arrive at the conclusion that the news

featured in Figure 9.13(b) carries more domain-relevant (i.e. medically
related) content than the news in Figure 9.13(a). . . . . . . . . . . . 224

9.14 The document term cloud for the news “Tai Chi may ease arthritis

pain”. Users can focus on a particular concept in the annotated news

by clicking on the corresponding term in the document cloud. . . . . 225


Publications Arising from this Thesis

This thesis contains published work and/or work prepared for publication, some of

which has been co-authored. The bibliographical details of the work and where it

appears in the thesis are outlined below.

Book Chapters (Fully Refereed)

[1] Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood

and Termhood for Term Recognition. M. Song and Y. Wu (eds.), Handbook

of Research on Text and Web Mining Technologies, IGI Global.

This book chapter combines and summarises the ideas in [9][10] and [3][11][12],

which form Chapter 4 and Chapter 5, respectively.

[2] Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering.

M. Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining

Technologies, IGI Global.

The clustering algorithm reported in [5], which contributes to Chapter 7, was

generalised in this book chapter to work with both terms and Internet domain

names.

Journal Publications (Fully Refereed)

[3] Wong, W., Liu, W. & Bennamoun, M. (2009) A Probabilistic Framework for

Automatic Term Recognition. Intelligent Data Analysis, Volume 13, Issue 4,

Pages 499-539. (Chapter 5)

[4] Wong, W., Liu, W. & Bennamoun, M. (2009) Constructing Specialised Cor-

pora through Domain Representativeness Analysis of Websites. Accepted with

revision by Language Resources and Evaluation. (Chapter 6)

[5] Wong, W., Liu, W. & Bennamoun, M. (2007) Tree-Traversing Ant Algo-

rithm for Term Clustering based on Featureless Similarities. Data Mining and

Knowledge Discovery, Volume 15, Issue 3, Pages 349-381. (Chapter 7)


[6] Liu, W. & Wong, W. (2009) Web Service Clustering using Text Mining

Techniques. International Journal of Agent-Oriented Software Engineering,

Volume 3, Issue 1, Pages 6-26.

This paper is an invited submission. It extends the work reported in [17].

Conference Publications (Fully Refereed)

[7] Wong, W., Liu, W. & Bennamoun, M. (2006) Integrated Scoring for Spelling

Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text.

In the Proceedings of the 5th Australasian Conference on Data Mining (AusDM),

Sydney, Australia.

The preliminary ideas in this paper were refined and extended to contribute

towards [8], which forms Chapter 3 of this thesis.

[8] Wong, W., Liu, W. & Bennamoun, M. (2007) Enhanced Integrated Scoring

for Cleaning Dirty Texts. In the Proceedings of the IJCAI Workshop on Ana-

lytics for Noisy Unstructured Text Data (AND), Hyderabad, India. (Chapter 3)

[9] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining the Unithood

of Word Sequences using Mutual Information and Independence Measure. In

the Proceedings of the 10th Conference of the Pacific Association for Compu-

tational Linguistics (PACLING), Melbourne, Australia.

The ideas in this paper were refined and reformulated as a probabilistic frame-

work to contribute towards [10], which forms Chapter 4 of this thesis.

[10] Wong, W., Liu, W. & Bennamoun, M. (2008) Determining the Unithood of

Word Sequences using a Probabilistic Approach. In the Proceedings of the 3rd

International Joint Conference on Natural Language Processing (IJCNLP),

Hyderabad, India. (Chapter 4)

[11] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for

Learning Domain Ontologies using Domain Prevalence and Tendency. In the

Proceedings of the 6th Australasian Conference on Data Mining (AusDM),

Gold Coast, Australia.


The ideas in this paper were refined and reformulated as a probabilistic frame-

work to contribute towards [3], which forms Chapter 5 of this thesis.

[12] Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for

Learning Domain Ontologies in a Probabilistic Framework. In the Proceedings

of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast,

Australia.

The ideas and experiments in this paper were further extended to contribute

towards [3], which forms Chapter 5 of this thesis.

[13] Wong, W., Liu, W. & Bennamoun, M. (2008) Constructing Web Corpora

through Topical Web Partitioning for Term Recognition. In the Proceedings

of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auck-

land, New Zealand.

The preliminary ideas in this paper were improved and extended to contribute

towards [4], which forms Chapter 6 of this thesis.

[14] Wong, W., Liu, W. & Bennamoun, M. (2006) Featureless Similarities for

Terms Clustering using Tree-Traversing Ants. In the Proceedings of the Inter-

national Symposium on Practical Cognitive Agents and Robots (PCAR), Perth,

Australia.

The preliminary ideas in this paper were refined to contribute towards [5],

which forms Chapter 7 of this thesis.

[15] Wong, W., Liu, W. & Bennamoun, M. (2009) Acquiring Semantic Relations

using the Web for Constructing Lightweight Ontologies. In the Proceedings of

the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining

(PAKDD), Bangkok, Thailand. (Chapter 8)

[16] Enkhsaikhan, M., Wong, W., Liu, W. & Reynolds, M. (2007) Measuring

Data-Driven Ontology Changes using Text Mining. In the Proceedings of the

6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia.

This paper reports a technique for detecting changes in ontologies. The on-

tologies used for evaluation in this paper were generated using the clustering

technique in [5] and the term recognition technique in [3].


[17] Liu, W. & Wong, W. (2008) Discovering Homogenous Service Communities

through Web Service Clustering. In the Proceedings of the AAMAS Workshop

on Service-Oriented Computing: Agents, Semantics, and Engineering (SO-

CASE), Estoril, Portugal.

This paper reports the results of discovering web service clusters using the

extended clustering technique described in [2].

Conference Publications (Refereed on the basis of abstract)

[18] Wong, W., Liu, W., Liaw, S., Balliu, N., Wu, H. & Tade, M. (2008) Auto-

matic Construction of Lightweight Domain Ontologies for Chemical Engineer-

ing Risk Management. In the Proceedings of the 11th Conference on Process

Integration, Modelling and Optimisation for Energy Saving and Pollution Re-

duction (PRES), Prague, Czech Rep.

This paper reports the results of the preliminary integration of ideas in [1-15].

[19] Wong, W. (2008) Discovering Lightweight Ontologies using the Web. In the

Proceedings of the 9th Postgraduate Electrical Engineering & Computing Sym-

posium (PEECS), Perth, Australia.

This paper reports the preliminary results of the integration of ideas in [1-15]

into a system for document skimming and scanning.

Note:

• The 14 publications [1-5] and [7-15] describe research work on developing vari-

ous techniques for ontology learning. The contents in these papers contributed

directly to Chapters 3-8 of this thesis. The 5 publications [6] and [16-19] are

application papers that arose from the use of these techniques in several areas.

• All publications, except [18] and [19], are ranked B or higher by the Australasian Computing Research and Education Association (CORE). Data Mining and Knowledge Discovery (DMKD), Intelligent Data Analysis (IDA) and Language Resources and Evaluation (LRE) have 2008/2009 ISI journal impact factors of 2.421, 0.426 and 0.283, respectively.


Contribution of Candidate to Published Work

Some of the published work included in this thesis has been co-authored. The extent

of the candidate’s contribution towards the published work is outlined below.

• Publications [1-5] and [7-15]: The candidate is the first author of these 14

papers, with 80% contribution. He co-authored them with his two supervi-

sors. The candidate designed and implemented the algorithms, performed the

experiments and wrote the papers. The candidate’s supervisors reviewed the

papers and provided useful advice for improvements.

• Publications [6] and [17]: The candidate is the second author of these

papers with 50% contribution. His primary supervisor (Dr Wei Liu) is the

first author. The candidate contributed to the clustering algorithm used in

these papers, and wrote the experiment sections.

• Publication [16]: The candidate is the second author of this paper with

20% contribution. His primary supervisor (Dr Wei Liu) and two academic col-

leagues are the remaining authors. The candidate contributed to the clustering

algorithm and term recognition technique used in this paper. The candidate

conducted half of the experiments reported in this paper.

• Publication [18]: The candidate is the first author of this paper with 40%

contribution. He co-authored the paper with two academic colleagues and
three researchers from the Curtin University of Technology. All techniques
reported in this paper were contributed by the candidate. The candidate wrote

all sections in this paper with advice from his primary supervisor (Dr Wei Liu)

and the domain experts from the Curtin University of Technology.

• Publication [19]: The candidate is the sole author of this paper with 100%

contribution.


CHAPTER 1

Introduction

“If HTML and the Web made all the online documents

look like one huge book, [the Semantic Web] will make

all the data in the world look like one huge database.”

- Tim Berners-Lee, Weaving the Web (1999)

Imagine that every text document you encounter comes with an abstraction of

what is important. Then we would no longer have to meticulously sift through every

email, news article, search result or product review every day. If every document on

the World Wide Web (the Web) has an abstraction of important concepts and rela-

tions, we will be one crucial step closer to realising the vision of a Semantic Web. At

the moment, the widely adopted technique for creating these abstractions is manual

curation. For instance, authors of news articles create their own summaries. Regular

users assign descriptive tags to webpages using Web 2.0 portals. Webmasters pro-

vide machine-readable metadata to describe their webpages for the Semantic Web.

The need to automate the abstraction process becomes evident when we consider

the fact that more than 90% of the data in the world appear in unstructured forms

[87]. Indeed, search engine giants such as Yahoo!, Google and Microsoft’s Bing are

slowly and strategically gearing towards the presentation of webpages using visual

summary and abstraction.

In this research, ontology learning techniques are proposed and developed to au-

tomatically discover terms, concepts and relations from documents. Together, these

ontological elements are represented as lightweight ontologies. As with any processes

that involve extracting meaningful information from unstructured data, ontology

learning relies on extensive background knowledge. This background knowledge can

range from unstructured data such as a text corpus (i.e. a collection of documents) to

structured data such as a semantic lexicon. From here on, we shall take background

knowledge [32] in a broad sense as “information that is essential to understanding a

situation or problem”1. More and more researchers in ontology learning are turning

to Web data to address certain inadequacies of static background knowledge. This

thesis investigates the systematic use of the Web as the sole source of dynamic back-

ground knowledge for automatically learning term clouds (i.e. visual depictions of

terms) and lightweight ontologies from text across different domains. The signifi-

cance of term clouds and lightweight ontologies is best appreciated in the context of

document skimming and scanning as a way to alleviate the pressure of information
overload. Imagine hundreds of news articles, medical reports, product reviews and
emails summarised using connected key concepts (i.e. lightweight ontologies) that
stand out visually (i.e. term clouds). This thesis has produced an interface to do
exactly this.

1 This definition is from http://wordnetweb.princeton.edu/perl/webwn?s=background%20knowledge.

1.1 Problem Description

Ontology learning from text is a relatively new research area that draws on

the advances from related disciplines, especially text mining, data mining, natural

language processing and information retrieval. The requirement for extensive back-

ground knowledge, be it in the form of text corpora or structured data, remains one

of the greatest challenges facing the ontology learning community, and hence, the

focus of this thesis.

The adequacy of background knowledge in language processing is determined by

two traits, namely, diversity and redundancy. Firstly, languages vary and evolve

across different geographical regions, genres, and time [183]. For instance, a general

lexicon for the English language such as WordNet [177] is of little or no use to a

system that processes medical texts or texts in other languages. Similarly, a text

corpus conceived in the early 90s such as the British National Corpus (BNC) [36]

cannot cope with the need to process texts that contain words such as “iPhone”

or “metrosexual”. Secondly, redundancy of data is an important prerequisite in both

statistical and symbolic language processing. Redundancy allows language process-

ing techniques to arrive at conclusions regarding many linguistic events based on

observations and induction. If we observe “politics” and “hypocrisy” often enough,

we can say they are somehow related. Static background knowledge has neither

adequate diversity nor redundancy. According to Engels & Lech [66], “Many of

the approaches found use of statistical methods on larger corpora...Such approaches

tend to get into trouble when domains are dynamic or when no large corpora are

present...”. Indeed, many researchers realised this, and hence, gradually turned to

Web data for the solution. For instance, in ontology learning, Wikipedia is used for

relation acquisition [154, 236, 216], and word sense disambiguation [175]. Web search

engines are employed for text corpus construction [15, 228], similarity measurement

[50], and word collocation [224, 41].
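To make the role of redundancy concrete, the short sketch below (an illustration added for this overview, not part of the published techniques) computes the Normalised Google Distance, one of the featureless Web-based similarity measures revisited in Chapter 7, from raw search engine page counts. All count values, the page total n and the function name are hypothetical.

import math

def ngd(fx, fy, fxy, n):
    # Normalised Google Distance from raw page counts.
    # fx, fy : pages containing term x alone and term y alone
    # fxy    : pages containing both terms
    # n      : estimated total number of pages indexed by the search engine
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical counts: terms that frequently co-occur yield a smaller distance.
print(ngd(fx=4.2e6, fy=1.9e6, fxy=8.0e5, n=5.0e10))  # a related pair, e.g. "politics" and "hypocrisy"
print(ngd(fx=4.2e6, fy=1.9e6, fxy=1.2e2, n=5.0e10))  # an unrelated pair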

However, the present use of Web data is typically confined to isolated cases where

static background knowledge has outrun its course. There is currently no study

concentrating on the systematic use of Web data as background knowledge for all

phases of ontology learning. Research that focuses on the issue of diversity and


redundancy of background knowledge in ontology learning is long overdue. How

do we know if we have the necessary background knowledge to carry out all our

ontology learning tasks? Where do we look for more background knowledge if we

know that what we have is inadequate?

1.2 Thesis Statement

The thesis of this research is that the process of ontology learning, which includes

discovering terms, concepts and coarse-grained relations, from text across a wide

range of domains can be effectively automated by relying solely upon background

knowledge on the Web. In other words, our proposed system employs Web data

as the sole background knowledge for all techniques across all phases of ontology

learning. The effectiveness of the proposed system is determined by its ability to

satisfy two requirements:

(1) Avoid using any static resources commonly used by current ontology learning

systems (e.g. semantic lexicons, text corpora).

(2) Ensure the applicability of the system across a wide range of domains (current

focus is technology, medicine and economics).

At the same time, this research addresses the following four problems by taking

advantage of the diversity and redundancy of Web data as background knowledge:

(1) The absence of integrated techniques for cleaning noisy text.

(2) The inability of current term extraction techniques, which are heavily influ-

enced by word frequency, to systematically explicate, diversify and consolidate

termhood evidence.

(3) The inability of current corpus construction techniques to automatically create

very large, high-quality text corpora using a small number of seed terms.

(4) The difficulty of locating and preparing features for clustering and acquiring

relations between terms.

1.3 Overview of Solution

The ultimate goal of ontology learning in the context of this thesis is to dis-

cover terms, concepts and coarse-grained relations from documents using the Web

as the sole source of background knowledge. Chapter 2 provides a thorough review


Figure 1.1: An overview of the five phases in the proposed ontology learning system,

and how the details of each phase are outlined in certain chapters of this dissertation.

of the existing techniques for discovering these ontological elements. An ontology

learning system comprising five phases, namely, text preprocessing, text processing,

corpus construction, term recognition and relation acquisition is proposed in this

research. Figure 1.1 provides an overview of the system. The common design and

development methodology for the core techniques in each phase is: (1) first, per-

form in-depth study of the requirements of each phase to determine the types of

background knowledge required, (2) second, identify ways of exploiting data on the

Web to satisfy the background knowledge requirements, and (3) third, devise high-

performance techniques that take advantage of the diversity and redundancy of the

background knowledge. The system takes as input a set of seed terms and natu-

ral language texts, and produces three outputs, namely, text corpora, term clouds,

and lightweight ontologies. The solution to each phase is described in the following

subsections.

1.3.1 Text Preprocessing (Chapter 3)

Figure 1.2 shows an overview of the techniques for text preprocessing. Unlike

data developed in controlled settings, Web data come in varying degrees of quality and may contain spelling errors, abbreviations and improper casings. This calls

for serious attention to the issue of data quality. A review of several prominent

techniques for spelling error correction, abbreviation expansion and case restora-

tion was conducted. Despite the blurring of the boundaries between these different

errors in online data, there is little work on integrated correction techniques. For


Figure 1.2: Overview of the ISSAC and HERCULES techniques used in the text

preprocessing phase of the proposed ontology learning system. ISSAC and HERCULES are described in Chapters 3 and 6, respectively.

instance, is “ocat” a spelling error (with the possibilities “coat”, “cat” or “oat”), or

an abbreviation (with the expansion “Ontario Campaign for Action on Tobacco”)?

A technique called Integrated Scoring for Spelling Error Correction, Abbreviation

Expansion and Case Restoration (ISSAC) is proposed and developed for cleaning

potentially noisy texts. ISSAC relies on edit distance, online dictionaries, search en-

gine page counts and Aspell [9]. An experiment using 700 chat records showed that

ISSAC achieved an average accuracy of 98%. In addition, a heuristic technique,

called Heuristic-based Cleaning Utility for Web Texts (HERCULES), is proposed

for extracting relevant contents from webpages amidst HTML tags, boilerplates,

etc. Due to the significance of HERCULES to the corpus construction phase, the

details of this technique are provided in Chapter 6.

1.3.2 Text Processing (Chapter 4)

Figure 1.3 shows an overview of the techniques for text processing. The cleaned

texts are processed using the Stanford Parser [132] and Minipar [150] to obtain part-

of-speech and grammatical information. This information is then used for chunking

noun phrases and extracting instantiated sub-categorisation frames (i.e. syntac-

tic triples, ternary frames) in the form of <arg1,connector,arg2>. Two measures

based on search engine page counts are introduced as part of the noun phrase chunk-


Figure 1.3: Overview of the UH and OU measures used in the text processing

phase of the proposed ontology learning system. These two measures are described

in Chapter 4.

ing process. These two measures are used to determine the collocational stability

of noun phrases. Noun phrases are considered as unstable if they can be further

broken down to create non-overlapping units that refer to semantically distinct con-

cepts. For example, the phrase “Centers for Disease Control and Prevention” is

a stable and semantically meaningful unit, while “Centre for Clinical Interventions

and Royal Perth Hospital” is an unstable compound. The first measure, called

Unithood (UH), is an adaptation of existing word association measures, while the

second measure, called Odds of Unithood (OU), is a novel probabilistic measure to

address the ad-hoc nature of combining evidence. An experiment using 1,825 test

cases in the health domain showed that OU achieved a higher accuracy at 97.26%

compared to UH at only 94.52%.
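To give a flavour of what page-count-based collocational evidence looks like, the following is a minimal illustrative sketch, not the UH or OU measures of Chapter 4, that scores a noun phrase with a pointwise mutual information style statistic over search engine page counts; the page_count function, its canned values and the assumed Web size are placeholders for a real search engine API.

```python
import math

def page_count(query):
    # Placeholder for a real search engine hit-count lookup;
    # canned values are returned purely for illustration.
    fake_counts = {
        "Centers for Disease Control and Prevention": 2_500_000,
        "Centers for Disease Control": 2_700_000,
        "Prevention": 90_000_000,
    }
    return fake_counts.get(query, 1)

def pmi_association(phrase, left, right, total_pages=10**10):
    # Pointwise mutual information of the two constituents,
    # estimated from page counts: log P(xy) / (P(x) P(y)).
    p_xy = page_count(phrase) / total_pages
    p_x = page_count(left) / total_pages
    p_y = page_count(right) / total_pages
    return math.log(p_xy / (p_x * p_y))

if __name__ == "__main__":
    score = pmi_association(
        "Centers for Disease Control and Prevention",
        "Centers for Disease Control",
        "Prevention",
    )
    print(f"PMI-style association: {score:.2f}")
```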

1.3.3 Term Recognition (Chapter 5)

Figure 1.4 shows an overview of the techniques for term recognition. Noun

phrases in the two arguments arg1 and arg2 of the instantiated sub-categorisation

frames are used to create a list of term candidates. The extent to which each term

candidate is relevant to the corresponding document is determined. Several existing

techniques for measuring termhood using various term characteristics as termhood

evidence are reviewed. Major shortcomings of existing techniques are identified and

discussed, including the heavy influence of word frequency (especially techniques


Figure 1.4: Overview of the TH and OT measures used in the term recognition

phase of the proposed ontology learning system. These two measures are described

in Chapter 5.

based on TF-IDF), mathematically-unfounded derivation of weights and implicit

assumptions regarding term characteristics. An analysis is carried out using word

distribution models and text corpora to predict word occurrences for quantifying

termhood evidence. Models that are considered include K-Mixture, Poisson, and

Zipf-Mandelbrot. Based on the analysis, two termhood measures are proposed,

which combine evidence based on explicitly defined term characteristics. The first

is a heuristic measure called Termhood (TH), while the second, called Odds of Ter-

mhood (OT), is based on a novel probabilistic measure founded on the Bayes Theo-

rem for formalising termhood evidence. These two measures are compared against

two existing ones using the GENIA corpus [130] for molecular biology as benchmark.

An evaluation using 1, 954 term candidates showed that TH and OT achieved the

best precision at 98.5% and 98%, respectively, for the first 200 terms.
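As a purely illustrative aside, and not the actual analysis of Chapter 5, the sketch below shows the kind of prediction a simple Poisson model makes: given a term's total frequency in a corpus, the expected number of documents containing it at least once, which can then be contrasted with the observed document frequency as one piece of termhood evidence. The counts used here are made up.

```python
import math

def expected_doc_freq_poisson(total_freq, num_docs):
    # Under a Poisson model with rate lambda = total_freq / num_docs,
    # the probability that a document contains the word at least once
    # is 1 - exp(-lambda); scale by the number of documents.
    lam = total_freq / num_docs
    return num_docs * (1.0 - math.exp(-lam))

if __name__ == "__main__":
    # Hypothetical counts: a bursty technical term vs. a common word.
    for word, cf, df in [("kinase", 500, 40), ("result", 500, 420)]:
        edf = expected_doc_freq_poisson(cf, num_docs=1000)
        print(f"{word}: observed df={df}, Poisson-expected df={edf:.1f}")
```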

1.3.4 Corpus Construction for Term Recognition (Chapter 6)

Figure 1.5 shows an overview of the techniques for corpus construction. Given

that our goal is to rely solely on dynamic background knowledge in ontology learning,

it is important to see how we can avoid the use of manually-curated text corpora dur-

ing term recognition. Existing techniques for automatic corpus construction using

data from the Web are reviewed. Most of the current techniques employ the naive


Figure 1.5: Overview of the SPARTAN technique used in the corpus construction

phase of the proposed ontology learning system. The SPARTAN technique is de-

scribed in Chapter 6.

query-and-download approach using search engines to construct text corpora. These

techniques require a large number of seed terms (in the order of hundreds) to create

very large text corpora. They also disregard the fact that the webpages suggested by

search engines may have poor relevance and quality. A novel technique called Spe-

cialised Corpora Construction based on Web Texts Analysis (SPARTAN) is proposed

and developed for constructing specialised text corpora from the Web based on the

systematic analysis of website contents. A Probabilistic Site Selector (PROSE) is

proposed as part of SPARTAN to identify the most suitable and authoritative data

for contributing to the corpora. A heuristic technique called HERCULES, mentioned

before in the text preprocessing phase, is included in SPARTAN for extracting rel-

evant contents from the downloaded webpages. A comparison using the Cleaneval

development set (http://cleaneval.sigwac.org.uk/devset.html) and a text comparison module based on vector space (http://search.cpan.org/~stro/Text-Compare-1.03/lib/Text/Compare.pm) showed that HERCULES achieved an 89.19% similarity with the gold standard. An evaluation

was conducted to show that SPARTAN requires only a small number of seed terms

(three to five), and that SPARTAN-based corpora are independent of the search

engine employed. The performance of term recognition across four different corpora

(both automatically constructed and manually curated) was assessed using the OT



measure, 1,300 term candidates, and the GENIA corpus as benchmark. The evaluation showed that term recognition using the SPARTAN-based corpus achieved the

best precision at 99.56%.

1.3.5 Term Clustering and Relation Acquisition (Chapters 7 and 8)

Figure 1.6: Overview of the ARCHILES technique used in the relation acquisition

phase of the proposed ontology learning system. The ARCHILES technique is

described in Chapter 8, while the TTA clustering technique and noW measure are

described in Chapter 7.

Figure 1.6 shows an overview of the techniques for relation acquisition. The

flat lists of domain-relevant terms obtained during the previous phase are organ-

ised into hierarchical structures during the relation acquisition phase. A review of

the techniques for acquiring semantic relations between terms was conducted. Cur-

rent techniques rely heavily on the presence of syntactic cues and static background

knowledge such as semantic lexicons for acquiring relations. A novel technique

named Acquiring Relations through Concept Hierarchy Disambiguation, Associa-

tion Inference and Lexical Simplification (ARCHILES) is proposed for constructing

lightweight ontologies using coarse-grained relations derived from Wikipedia and

search engines. ARCHILES combines word disambiguation, which uses the distance

measure n-degree of Wikipedia (noW), and lexical simplification to handle complex

and ambiguous terms. ARCHILES also includes association inference using a novel

multi-pass Tree-Traversing Ant (TTA) clustering algorithm with the Normalised


Web Distance (NWD), a generalisation of the Normalised Google Distance (NGD) [50] that can employ any available Web search engine, as the similarity measure to cope with terms not covered

by Wikipedia. This technique can be used to complement conventional techniques

for acquiring fine-grained relations. Two small experiments using 11 terms in the

genetics domain and 31 terms in the food domain revealed precision scores between

80% and 100%. Details of TTA and noW are provided in Chapter 7. The

descriptions of ARCHILES are included in Chapter 8.

1.4 Contributions

The standout contribution of this dissertation is the exploration of a complete

solution to the complex problem of automatic ontology learning from text. This

research has produced several other contributions to the field of ontology learning.

The complete list is as follows:

• A technique which consolidates various evidence from existing tools and from

search engines for simultaneously correcting spelling errors, expanding abbre-

viations and restoring improper casing.

• Two measures for determining the collocational strength of word sequences

using page counts from search engines, namely, an adaptation of existing word

association measures, and a novel probabilistic measure.

• In-depth experiments on parameter estimation and linear regression involving

various word distribution models.

• Two measures for determining term relevance based on explicitly defined term

characteristics and the distributional behaviour of terms across different cor-

pora. The first measure is a heuristic measure, while the second measure

is based on a novel probabilistic framework for consolidating evidence using

formal word distribution models.

• In-depth experiments on the effects of search engine and page count variations

on corpus construction.

• A novel technique for corpus construction that requires only a small num-

ber of seed terms to automatically produce very large, high-quality text cor-

pora through the systematic analysis of website contents. The on-demand



construction of new text corpora enables this and many other term recogni-

tion techniques to be widely applicable across different domains. A generally-

applicable heuristic technique is also introduced for removing HTML tags and

boilerplates, and extracting relevant content from webpages.

• In-depth experiments on the peculiarities of clustering terms as compared to

other forms of feature-based data clustering.

• A novel technique for constructing lightweight ontologies in an iterative pro-

cess of lexical simplification, association inference through term clustering, and

word disambiguation using only Wikipedia and search engines. A generally-

applicable technique is introduced for multi-pass term clustering using feature-

less similarity measurement based on Wikipedia and page counts by search

engines.

• Demonstration of the use of term clouds and lightweight ontologies to assist

the skimming and scanning of documents.

1.5 Layout of Thesis

Overall, this dissertation is organised as a series of papers published in interna-

tionally refereed book chapters, journals and conferences. Each paper constitutes an independent piece of work on ontology learning. However, these papers together contribute to a complete and coherent theme. In Chapter 2, a background to ontology learning and a review of several prominent ontology learning systems are presented. The core content of this dissertation is laid out in Chapters 3 to 8. Each of these

chapters describes one of the five phases in our ontology learning system.

• Chapter 3 (Text Preprocessing) features an IJCAI workshop paper on the text

cleaning technique called ISSAC.

• In Chapter 4 (Text Processing), an IJCNLP conference paper describing the

two word association measures UH and OU is included.

• An Intelligent Data Analysis journal paper on the two term relevance measures

TH and OT is included in Chapter 5 (Term Recognition).

• In Chapter 6 (Corpus Construction for Term Recognition), a Language Re-

sources and Evaluation journal paper is included to describe the SPARTAN

technique for automatically constructing text corpora for term recognition.


• A Data Mining and Knowledge Discovery journal paper that describes the

TTA clustering technique and noW distance measure is included in Chapter

7 (Term Clustering for Relation Acquisition).

• In Chapter 8 (Relation Acquisition), a PAKDD conference paper is included

to describe the ARCHILES technique for acquiring coarse-grained relations

using TTA and noW.

After the core content, Chapter 9 elaborates on the implementation details of

the proposed ontology learning system, and the application of term clouds and

lightweight ontologies for document skimming and scanning. In Chapter 10, we

summarise our conclusions and provide suggestions for future work.


CHAPTER 2

Background

“A while ago, the Artificial Intelligence research community got

together to find a way to enable knowledge sharing...They proposed an

infrastructure stack that could enable this level of information exchange,

and began work on the very difficult problems that arise.”

- Thomas Gruber, Ontology of Folksonomy (2007)

This chapter provides a comprehensive review of ontology learning. It also

serves as a background introduction to ontologies in terms of what they are, why

they are important, how they are obtained and where they can be applied. The

definition of an ontology is first introduced before a discussion on the differences

between lightweight ontologies and the conventional understanding of ontologies

is provided. Then the process of ontology learning is described, with a focus on

types of output, commonly-used techniques and evaluation approaches. Finally,

several current applications and prominent systems are explored to appreciate the

significance of ontologies and the remaining challenges in ontology learning.

2.1 Ontologies

Ontologies can be thought of as directed graphs consisting of concepts as nodes,

and relations as the edges between the nodes. A concept is essentially a mental

symbol often realised by a corresponding lexical representation (i.e. natural language

name). For instance, the concept “food” denotes the set of all substances that can

be consumed for nutrition or pleasure. In Information Science, an ontology is a

“formal, explicit specification of a shared conceptualisation” [92]. This definition

imposes the requirement that the names of concepts, and how the concepts are

related to one another have to be explicitly expressed and represented using formal

languages such as Web Ontology Language (OWL). An important benefit of a formal

representation is the ability to specify axioms for reasoning to determine validity

and to define constraints in ontologies.

As research into ontology progresses, the definition of what constitutes an on-

tology evolves. The extent of relational and axiomatic richness, and the formality

of representation eventually gave rise to a spectrum of ontology kinds [253] as il-

lustrated in Figure 2.1. At one end of the spectrum, we have ontologies that make

little or no use of axioms referred to as lightweight ontologies [89]. At the other

end, we have heavyweight ontologies [84] that make intensive use of axioms for spec-

ification. Ontologies are fundamental to the success of the Semantic Web as they


Figure 2.1: The spectrum of ontology kinds, adapted from Giunchiglia & Zaihrayeu

[89].

enable software agents to exchange, share, reuse and reason about concepts and

relations using axioms. In the words of Tim Berners-Lee [24], “For the semantic

web to function, computers must have access to structured collections of informa-

tion and sets of inference rules that they can use to conduct automated reasoning”.

However, the truth remains that the automatic learning of axioms is not an easy

task. Despite certain success, many ontology learning systems are still struggling

with the basics of extracting terms and relations [84]. For this reason, the majority

of ontology learning systems out there that claim to learn ontologies are in fact

creating lightweight ontologies. At the moment, lightweight ontologies appear to

be the most common type of ontologies in a variety of Semantic Web applications

(e.g. knowledge management, document retrieval, communities of practice, data

integration) [59, 75].

2.2 Ontology Learning from Text

Ontology learning from text is the process of identifying terms, concepts, rela-

tions and optionally, axioms from natural language text, and using them to construct

and maintain an ontology. Even though the area of ontology learning is still in its

infancy, many proven techniques from established fields such as text mining, data

mining, natural language processing, information retrieval, as well as knowledge rep-

resentation and reasoning have powered a rapid growth in recent years. Information

retrieval provides various algorithms to analyse associations between concepts in

texts using vectors, matrices [76] and probabilistic theorems [280]. On the other

hand, machine learning and data mining provide ontology learning with the ability to


extract rules and patterns out of massive datasets in a supervised or unsupervised

manner based on extensive statistical analysis. Natural language processing pro-

vides the tools for analysing natural language text on various language levels (e.g.

morphology, syntax, semantics) to uncover concept representations and relations

through linguistic cues. Knowledge representation and reasoning enables the onto-

logical elements to be formally specified and represented such that new knowledge

can be deduced.

Figure 2.2: Overview of the outputs, tasks and techniques of ontology learning.

In the following subsections, we look at the types of output, common techniques

and evaluation approaches of a typical ontology learning process.

2.2.1 Outputs from Ontology Learning

There are five types of output in ontology learning, namely, terms, concepts,

taxonomic relations, non-taxonomic relations and axioms. Some researchers [35]

refer to this as the “Ontology Learning Layer Cake”. To obtain each output, cer-

tain tasks have to be accomplished and the techniques employed for each task may

vary between systems. This view of output-task relation that is independent of any

implementation details promotes modularity in designing and implementing ontol-

ogy learning systems. Figure 2.2 shows the output and the corresponding tasks.

Each output is a prerequisite for obtaining the next output as shown in the figure.


Terms are used to form concepts which in turn are organised according to relations.

Relations can be further generalised to produce axioms.

Terms are the most basic building blocks in ontology learning. Terms can be

simple (i.e. single-word) or complex (i.e. multi-word), and are considered as lexical

realisations of everything important and relevant to a domain. The main tasks as-

sociated with terms are to preprocess texts and extract terms. Preprocessing ensures

that the input texts are in an acceptable format. Some of the techniques relevant

to preprocessing include noisy text analytics and the extraction of relevant contents

from webpages (i.e. boilerplate removal). The extraction of terms usually begins

with some kind of part-of-speech tagging and sentence parsing. Statistical or prob-

abilistic measures are then used to determine the extent of collocational strength

and domain relevance of the term candidates.

Concepts can be abstract or concrete, real or fictitious. Broadly speaking, a

concept can be anything about which something is said. Concepts are formed by

grouping similar terms. The main tasks are therefore to form concepts and label

concepts. The task of forming concepts involves discovering the variants of a term

and grouping them together. Term variants can be determined using predefined

background knowledge, syntactic structure analysis or through clustering based on

some similarity measures. As for deciding on the suitable label for a concept, existing

background knowledge such as WordNet may be used to find the name of the nearest

common ancestor. If a concept is determined through syntactic structure analysis,

the heads of the complex terms can be used as the corresponding label. For instance,

the common head noun “tart” can be used as the label for the concept comprising “egg tart”, “French apple tart”, “chocolate tart”, etc.

Relations are used to model the interactions between the concepts in a domain.

There are two types of relations, namely, taxonomic relations and non-taxonomic

relations. Taxonomic relations are the hypernymies between concepts. The main

task is to construct hierarchies. Organising concepts into a hierarchy involves the

discovery of hypernyms and hence, some researchers may also refer to this task as

extracting taxonomic relations. Hierarchy construction can be performed in various

ways such as using predefined relations from existing background knowledge, using

statistical subsumption models, relying on semantic relatedness between concepts,

and utilising linguistic and logical rules or patterns. Non-taxonomic relations are the

interactions between concepts (e.g. meronymy, thematic roles, attributes, possession

and causality) other than hypernymy. The less explicit and more complex use of words for specifying relations other than hypernymy makes the tasks of discovering and labelling non-taxonomic relations more challenging.

Discovering and labelling non-taxonomic relations are mainly reliant on the analysis

of syntactic structures and dependencies. In this aspect, verbs are taken as good

indicators of non-taxonomic relations, and help from domain experts is usually

required to label such relations.

Lastly, axioms are propositions or sentences that are always taken as true. Ax-

ioms act as a starting point for deducing other truths, verifying the correctness of existing ontological elements and defining constraints. The task involved here is to discover axioms. The task of learning axioms usually involves the generalisation or deduction

of a large number of known relations that satisfy certain criteria.

2.2.2 Techniques for Ontology Learning

The techniques employed by different systems may vary depending on the tasks

to be accomplished. The techniques can generally be classified into statistics-based,

linguistics-based, logic-based, or hybrid. Figure 2.2 illustrates the various commonly-

used techniques, and each technique may be applicable to more than one task.

The various statistics-based techniques for accomplishing the tasks in ontology

learning are mostly derived from information retrieval, machine learning and data

mining. The lack of consideration for the underlying semantics and relations between

the components of a text makes statistics-based techniques more prevalent in the

early stages of ontology learning. Some of the common techniques include clustering

[272], latent semantic analysis [252], co-occurrence analysis [34], term subsumption

[77], contrastive analysis [260] and association rule mining [239]. The main idea

behind these techniques is that the extent of occurrence of terms and their contexts

in documents often provides reliable estimates about the semantic identity of terms.

• In clustering, a measure of relatedness (e.g. similarity, distance) is employed to assign terms into groups for discovering concepts or constructing hierarchies [152]. The process of clustering can either begin with individual

terms or concepts and grouping the most related ones (i.e. agglomerative

clustering), or begin with all terms or concepts and dividing them into smaller

groups to maximise within-group relatedness (i.e. divisive clustering). Some

of the major issues in clustering are working with high-dimensional data, and

feature extraction and preparation for similarity measurement. This gave rise

to a class of feature-less similarity and distance measures based solely on the

co-occurrence of words in large text corpora. The Normalised Web Distance (NWD) is one example [262]; a minimal sketch of such a page-count-based distance is given after this list.


• Relying on raw data to measure relatedness may lead to data sparseness [35].

In latent semantic analysis, dimension reduction techniques such as singular

value decomposition are applied on the term-document matrix to overcome the

problem [139]. In addition, inherent relations between terms can be revealed

by applying correlation measures on the dimensionally-reduced matrix, leading

to the formation of groups.

• The analysis of the occurrence of two or more terms within a well-defined

unit of information such as sentence or more generally, n-gram is known as co-

occurrence analysis. Co-occurrence analysis is usually coupled with some mea-

sures to determine the association strength between terms or the constituents

of terms. Some of the popular measures include dependency measures (e.g.

mutual information [47]), log-likelihood ratios [206] (e.g. chi-square test), rank

correlations (e.g. Pearson’s and Spearman’s coefficient [244]), distance mea-

sures (e.g. Kullback-Leiber divergence [161]), and similarity measures (e.g.

cosine measures [223]).

• In term subsumption, the conditional probabilities of the occurrence of terms

in documents are employed to discover hierarchical relations between them

[77]. A term subsumption measure is used to quantify the extent of a term x

being more general than another term y. The higher the subsumption value,

the more general term x is with respect to y.

• The extent of occurrence of terms in individual documents and in text corpora

is employed for relevance analysis. Some of the common relevance measures

from information retrieval include the Term Frequency-Inverse Document Fre-

quency (TF-IDF) [215] and its variants, and others based on language mod-

elling [56] and probability [83]. Contrastive analysis [19] is a kind of relevance

analysis based on the heuristic that general language-dependent phenomena

should spread equally across different text corpora, while special-language phe-

nomena should portray odd behaviours.

• Given a set of concept pairs, association rule mining is employed to describe the

associations between the concepts at the appropriate level of abstraction [115].

In the example by [162], given the already known concept pairs (chips, beer) and (peanuts, soda), association rule mining is then employed to generalise the pairs to provide (snacks, drinks). The key to determining the degree of

abstraction in association rules is provided by user-defined thresholds such as


confidence and support.
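To make the idea of a feature-less, page-count-based measure concrete, below is a minimal sketch of the standard Normalised Google Distance formula [50], of which NWD is a generalisation; the page_count function, its canned values and the assumed index size are placeholders for a real search engine.

```python
import math

def page_count(query):
    # Placeholder for a search engine hit count; canned values for illustration.
    counts = {"politics": 4_000_000_000, "hypocrisy": 60_000_000,
              "politics hypocrisy": 9_000_000}
    return counts.get(query, 1)

def ngd(x, y, total_pages=10**11):
    # Normalised Google Distance: values near 0 mean the terms almost always
    # co-occur, larger values mean they are less related.
    fx, fy = math.log(page_count(x)), math.log(page_count(y))
    fxy = math.log(page_count(f"{x} {y}"))
    return (max(fx, fy) - fxy) / (math.log(total_pages) - min(fx, fy))

if __name__ == "__main__":
    print(f"NGD(politics, hypocrisy) = {ngd('politics', 'hypocrisy'):.3f}")
```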

Linguistics-based techniques are applicable to almost all tasks in ontology learn-

ing and are mainly dependent on natural language processing tools. Some of the techniques include part-of-speech tagging, sentence parsing, syntactic structure analysis and dependency analysis. Other techniques rely on the use of semantic lexicons, lexico-syntactic patterns, semantic templates, subcategorisation frames, and

seed words.

• Part-of-speech tagging and sentence parsing provide the syntactic structures

and dependency information required for further linguistic analysis. Some examples of part-of-speech taggers are the Brill Tagger [33] and TreeTagger [219]. Principar [149], Minipar [150] and the Link Grammar Parser [247] are among the commonly used sentence parsers. Other more comprehensive toolkits for nat-

ural language processing include General Architecture for Text Engineering

(GATE) [57], and Natural Language Toolkit (NLTK) [25]. Despite the place-

ment under the linguistics-based category, certain parsers are built on statis-

tical parsing systems. For instance, the Stanford Parser [132] is a lexicalised

probabilistic parser.

• Syntactic structure analysis and dependency analysis examine syntactic and

dependency information to uncover terms and relations at the sentence level.

In syntactic structure analysis, words and modifiers in syntactic structures (e.g.

noun phrases, verb phrases and prepositional phrases) are analysed to discover

potential terms and relations. For example, ADJ-NN or DT-NN can be extracted

as potential terms, while phrases containing other parts of speech, such as verbs, are ignored. In particular, the head-modifier principle has been employed exten-

sively to identify complex terms related through hyponymy with the heads of

the terms assuming the hypernym role [105]. In dependency analysis, gram-

matical relations such as subject, object, adjunct and complement are used

for determining more complex relations [86, 48].

• A semantic lexicon can either be general, such as WordNet [177], or domain-specific, such as the Unified Medical Language System (UMLS) [151]. A semantic lexicon offers easy access to a large collection of predefined words and relations. Concepts from a semantic lexicon are usually organised in sets of similar words (i.e. synsets). These synonyms are employed for discovering variants of terms [250]. Relations from semantic lexicons have also proven useful for ontology learning. These relations include hypernym-hyponym (i.e. parent-child relation) and meronym-holonym (i.e. part-whole relation). Much of the work related to the use of relations in WordNet can be found in the areas of word sense disambiguation [265, 145] and lexical acquisition [190].

• The use of lexico-syntactic patterns was proposed by [102], and has been em-

ployed to extract hypernyms [236] and meronyms. Lexico-syntactic patterns

capture hypernymy relations using patterns such as NP such as NP, NP,...,

and NP. For extracting meronyms, patterns such as NP is part of NP can be

useful. The use of patterns provides reasonable precision but the recall is low [35]; a minimal regular-expression sketch of one such pattern is given after this list. Due to the cost and time involved in manually producing such patterns, efforts [234] have been made to study the possibility of learning them. Seman-

tic templates [238, 257] are similar to lexico-syntactic patterns in terms of their

purpose. However, semantic templates offer more detailed rules and conditions

to extract not only taxonomic relations but also complex non-taxonomic rela-

tions.

• In linguistic theory, the subcategorisation frame [5, 85] of a word is the number

and kinds of other words that it selects when appearing in a sentence. For

example, in the sentence “Joe wrote a letter”, the verb “write” selects “Joe”

and “letter” as its subject and object, respectively. In other words, “Person”

and “Written-Communication” are the restrictions of selection for the subject

and object of the verb “write”. The restrictions of selection extracted from

parsed texts can be used in conjunction with clustering techniques to discover

concepts [68].

• The use of seed words (i.e. seed terms) [281] is a common practice in many

systems to guide a wide range of tasks in ontology learning. Seed words provide

good starting points for the discovery of additional terms relevant to that

particular domain [110]. Seed words are also used to guide the automatic

construction of text corpora from the Web [15].
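As a concrete illustration of the lexico-syntactic pattern idea mentioned above, the following sketch matches one simplified "NP such as NP, NP and NP" pattern with a regular expression over raw text; a real implementation would operate over tagged and chunked noun phrases, so the pattern, the example sentence and the splitting heuristics here are illustrative assumptions only.

```python
import re

# One simplified lexico-syntactic pattern:
# "<hypernym> such as <hyponym>(, <hyponym>)* (and|or) <hyponym>"
PATTERN = re.compile(
    r"(?P<hypernym>\w+(?:\s\w+)?)\s+such as\s+"
    r"(?P<hyponyms>[\w\s]+(?:,\s*[\w\s]+)*(?:\s(?:and|or)\s[\w\s]+)?)",
    re.IGNORECASE,
)

def extract_hypernymy(sentence):
    # Return (hyponym, hypernym) pairs found by the pattern, if any.
    match = PATTERN.search(sentence)
    if not match:
        return []
    hypernym = match.group("hypernym").strip()
    hyponyms = re.split(r",|\band\b|\bor\b", match.group("hyponyms"))
    return [(h.strip(), hypernym) for h in hyponyms if h.strip()]

if __name__ == "__main__":
    text = "He enjoys baked goods such as egg tart, chocolate tart and French apple tart."
    for hypo, hyper in extract_hypernymy(text):
        print(f"{hypo}  is-a  {hyper}")
```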

Logic-based techniques are the least common in ontology learning and are mainly

adopted for more complex tasks involving relations and axioms. Logic-based tech-

niques have connections with advances in knowledge representation and reasoning,

and machine learning. The two main techniques employed are inductive logic pro-

gramming [141, 283] and logical inference [227].


• In inductive logic programming, rules are derived from an existing collection of concepts and relations which are divided into positive and negative examples. The derived rules prove all the positive examples and none of the negative ones. In an

example by Oliveira et al. [191], induction begins with the first positive ex-

ample “tigers have fur”. With the second positive example “cats have fur”, a

generalisation of “felines have fur” is obtained. Given the third positive exam-

ple “dogs have fur”, the technique will attempt to generalise that “mammals

have fur”. When the negative example “humans do not have fur” is encountered, the previous generalisation will be dropped, giving only “canines

and felines have fur”.

• In logical inference, implicit relations are derived from existing ones using

rules such as transitivity and inheritance. Using the classic example, given the

premises “Socrates is a man” and “All men are mortal”, we can discover a

new attribute relation stating that “Socrates is mortal”. Despite the power of

inference, invalid or conflicting relations may be introduced if the rules are not completely specified. Consider the example where “human eats chicken” and “chicken eats worm” yield a new relation that is not valid. This happens because the intransitivity of the relation “eat” was not explicitly specified in advance (a minimal sketch of this kind of inference is given after this list).
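A minimal sketch of this kind of rule-based inference, where transitivity is applied only to relations explicitly declared transitive, might look as follows; the triples and the naive fixed-point loop are illustrative, not how a production reasoner works.

```python
def infer_transitive(facts, transitive_relations):
    # facts: set of (subject, relation, object) triples.
    # Repeatedly add (a, r, c) whenever (a, r, b) and (b, r, c) hold and r is
    # declared transitive; stop when nothing new can be derived.
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(inferred):
            for (b2, r2, c) in list(inferred):
                if r1 == r2 and b == b2 and r1 in transitive_relations:
                    new_fact = (a, r1, c)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred

if __name__ == "__main__":
    facts = {("socrates", "is_a", "man"), ("man", "is_a", "mortal"),
             ("human", "eats", "chicken"), ("chicken", "eats", "worm")}
    # "is_a" is transitive; "eats" deliberately is not, so no
    # ("human", "eats", "worm") triple is derived.
    result = infer_transitive(facts, transitive_relations={"is_a"})
    print(sorted(result - facts))
```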

2.2.3 Evaluation of Ontology Learning Techniques

Evaluation is an important aspect of ontology learning, just as in any other research area. Evaluation allows individuals who use ontology learning systems to

assess the resulting ontologies, and to possibly guide and refine the learning process.

An interesting aspect about evaluation in ontology learning, as opposed to informa-

tion retrieval and other areas, is that ontologies are not an end product but rather,

a means to achieve some other tasks. In this sense, an evaluation approach is also

useful to assist users in choosing the best ontology that fits their requirements when

faced with a multitude of options.

In document retrieval, the object of evaluation is documents and how well sys-

tems provide documents that satisfy user queries, either qualitatively or quantita-

tively. However, in ontology learning, we cannot simply measure how well a system

constructs an ontology without raising more questions. For instance, is the ontology

good enough? If so, with respect to what application? An ontology is made up of

different layers such as terms, concepts and relations. If an ontology is inadequate

for an application, then which part of the ontology is causing the problem? Consid-


ering the intricacies of evaluating ontologies, a myriad of evaluation approaches have

been proposed in the past few years. Generally, these approaches can be grouped

into one of the four main categories depending on the kind of ontologies that are

being evaluated and the purpose of the evaluation [30]:

• The first approach evaluates the adequacy of ontologies in the context of other

applications. For example Porzel & Malaka [202] evaluated the use of ontolog-

ical relations in the context of speech recognition. The output from the speech

recognition system is compared with a gold standard generated by humans.

• The second approach uses domain-specific data sources to determine to what

extent the ontologies are able to cover the corresponding domain. For instance,

Brewster et al. [31] described a number of methods to evaluate the ‘fit’ between

an ontology and the domain knowledge in the form of text corpora.

• The third approach is used for comparing ontologies using benchmarks includ-

ing other ontologies [164].

• The last approach relies on domain experts to assess how well an ontology meets

a set of predefined criteria [158].

Due to the complex nature of ontologies, evaluation approaches can also be

distinguished by the layers of an ontology (e.g. term, concept, relation) they evaluate

[202]. More specifically, evaluations can be performed to assess the (1) correctness at

the terminology layer, (2) coverage at the conceptual layer, (3) wellness at the taxonomy layer, and (4) adequacy of the non-taxonomic relations.

The focus of evaluation at the terminology layer is to determine if the terms

used to identify domain-relevant concepts are included and correct. Some form of

lexical reference or benchmark is typically required for evaluation in this layer. Typ-

ical precision and recall measures from information retrieval are used together with

exact matching or edit distance [164] to determine performance at the terminology

layer. The lexical precision and recall reflect how well the extracted terms cover

the target domain. Lexical Recall (LR) measures the number of relevant terms extracted ($e_{relevant}$) divided by the total number of relevant terms in the benchmark ($b_{relevant}$), while Lexical Precision (LP) measures the number of relevant terms extracted ($e_{relevant}$) divided by the total number of terms extracted ($e_{all}$). LR and LP are defined as [214]:

$LP = \frac{e_{relevant}}{e_{all}}$    (2.1)

$LR = \frac{e_{relevant}}{b_{relevant}}$    (2.2)

The precision and recall measures can also be combined to compute the corresponding $F_\beta$-score. The general formula for non-negative real $\beta$ is:

$F_\beta = \frac{(1 + \beta^2)(precision \times recall)}{\beta^2 \times precision + recall}$    (2.3)

Evaluation measures at the conceptual level are concerned with whether the de-

sired domain-relevant concepts are discovered or otherwise. Lexical Overlap (LO)

measures the intersection between the discovered concepts ($C_d$) and the recommended concepts ($C_m$). LO is defined as:

$LO = \frac{|C_d \cap C_m|}{|C_m|}$    (2.4)

Ontological Improvement (OI) and Ontological Loss (OL) are two additional mea-

sures to account for newly discovered concepts that are absent from the benchmark,

and for concepts which exist in the benchmark but were not discovered, respectively.

They are defined as [214]:

$OI = \frac{|C_d - C_m|}{|C_m|}$    (2.5)

$OL = \frac{|C_m - C_d|}{|C_m|}$    (2.6)
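The measures above translate almost directly into code. The following is a minimal sketch of LP, LR, the F-score of Equation (2.3), and the conceptual-layer measures LO, OI and OL of Equations (2.4) to (2.6), using small made-up term sets.

```python
def lexical_precision_recall(extracted, benchmark):
    # LP = |extracted & benchmark| / |extracted|
    # LR = |extracted & benchmark| / |benchmark|
    relevant = extracted & benchmark
    return len(relevant) / len(extracted), len(relevant) / len(benchmark)

def f_score(precision, recall, beta=1.0):
    # F_beta as in Equation (2.3).
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def concept_measures(discovered, recommended):
    # LO, OI and OL as in Equations (2.4)-(2.6).
    lo = len(discovered & recommended) / len(recommended)
    oi = len(discovered - recommended) / len(recommended)
    ol = len(recommended - discovered) / len(recommended)
    return lo, oi, ol

if __name__ == "__main__":
    extracted = {"egg tart", "apple tart", "chocolate", "oven"}
    benchmark = {"egg tart", "apple tart", "chocolate", "pastry"}
    lp, lr = lexical_precision_recall(extracted, benchmark)
    print(f"LP={lp:.2f} LR={lr:.2f} F1={f_score(lp, lr):.2f}")
    lo, oi, ol = concept_measures(extracted, benchmark)
    print(f"LO={lo:.2f} OI={oi:.2f} OL={ol:.2f}")
```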

Evaluation at the taxonomy layer is more complicated. Performance measures

for the taxonomy layer are typically divided into local and global [60]. The similarity

of the concepts’ positions in the learned taxonomy and in the benchmark is used

to compute the local measure. The global measure is then derived by averaging

the local scores for all concept pairs. One of the few measures for the taxonomy

layer is the Taxonomic Overlap (TO) [164]. The computation of the global similarity

between two taxonomies begins with the local overlap of their individual terms. The

semantic cotopy, the set of all super- and sub-concepts, of a term varies depending

on the taxonomy. The local similarity between two taxonomies given a particular

term is determined based on the overlap of the term’s semantic cotopy. The global

taxonomic overlap is then defined as the average of the local overlaps of all the

terms in the two taxonomies. The same idea can be applied to compare the adequacy of non-taxonomic relations.
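As an illustration of the semantic cotopy idea, the sketch below computes, under simplifying assumptions (each taxonomy given as a child-to-parent map, shared concept labels, and a Jaccard-style local overlap rather than the exact formulation in [164]), a local overlap per concept and averages it into a global score.

```python
def invert(parent):
    # Build a parent -> [children] map from a child -> parent map.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    return children

def cotopy(concept, parent, children):
    # Semantic cotopy: the concept plus all of its super- and sub-concepts.
    result, node = {concept}, concept
    while node in parent:                       # collect ancestors
        node = parent[node]
        result.add(node)
    stack = [concept]                           # collect descendants
    while stack:
        for child in children.get(stack.pop(), []):
            result.add(child)
            stack.append(child)
    return result

def taxonomic_overlap(parent_a, parent_b):
    # Global score: average of the local cotopy overlaps (Jaccard here)
    # over the concepts that appear in both taxonomies.
    children_a, children_b = invert(parent_a), invert(parent_b)
    concepts_a = set(parent_a) | set(children_a)
    concepts_b = set(parent_b) | set(children_b)
    local_scores = []
    for c in concepts_a & concepts_b:
        ca = cotopy(c, parent_a, children_a)
        cb = cotopy(c, parent_b, children_b)
        local_scores.append(len(ca & cb) / len(ca | cb))
    return sum(local_scores) / len(local_scores)

if __name__ == "__main__":
    # Two toy taxonomies given as child -> parent maps.
    learned = {"egg tart": "tart", "apple tart": "tart", "tart": "food"}
    gold = {"egg tart": "tart", "apple tart": "pastry", "tart": "pastry",
            "pastry": "food"}
    print(f"taxonomic overlap = {taxonomic_overlap(learned, gold):.2f}")
```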


2.3 Existing Ontology Learning Systems

Before looking into some of the prominent systems and recent advances in on-

tology learning, a recap of three previous independent surveys is conducted. The

first is a report by the OntoWeb Consortium [90], a body funded by the Information

Society Technologies Programme of the Commission of the European Communi-

ties. This survey listed 36 approaches for ontology learning from text. Some of the

important findings presented by this review paper are:

• There is no detailed methodology that guides the ontology learning process

from text.

• There is no fully automated system for ontology learning. Some of the systems

act as tools to assist in the acquisition of lexical-semantic knowledge, while

others help to extract concepts and relations from annotated corpora with the

involvement of users.

• There is no general approach for evaluating the accuracy of ontology learning,

and for comparing the results produced by different systems.

The second survey, released during the same time as the OntoWeb Consortium

survey, was performed by Shamsfard & Barforoush [226]. The authors claimed to

have studied over fifty different approaches before selecting and including seven

prominent ones in their survey. The main focus of the review was to introduce a

framework for comparing ontology learning approaches. The approaches included in

the review merely served as test cases to be fitted into the framework. Consequently,

the review provided an extensive coverage of the state-of-the-art of the relevant

techniques but was limited in terms of discussions on the underlying problems and

future outlook. The review arrived at the following list of problems:

• Much work has been conducted on discovering taxonomic relations, while non-

taxonomic relations were given less attention.

• Research into axiom learning was nearly unexplored.

• The focus of most research is on building domain ontologies. Most of the tech-

niques were designed to make heavy use of domain-specific patterns and static

background knowledge, with little regard to the portability of the systems

across different domains.


• Current ontology learning systems are evaluated within the confinement of

their domains. Finding a formal, standard method to evaluate ontology learn-

ing systems remains an open problem.

• Most systems are either semi-automated or tools for supporting domain ex-

perts in curating ontologies. Complete automation and elimination of user

involvement requires more research.

Lastly, Ding & Foo [62] presented a survey of 12 major ontology learning projects.

The authors wrapped up their survey with the following findings:

• Input data are mostly structured. Learning from free texts remains within the

realm of research.

• The task of discovering relations is very complex and a difficult problem to

solve. It has turned out to be the main impedance to the progress of ontology

learning.

• The techniques for discovering concepts have reached a certain level of matu-

rity.

A closer look into the three survey papers revealed a consensus on several aspects of

ontology learning that required more work. These conclusions are in fact in line with

the findings of our literature review in the following Sections 2.3.1 and 2.3.2. These

conclusions are (1) fully automated ontology learning is still in the realm of research,

(2) current approaches are heavily dependent on static background knowledge, and

may face difficulty in porting across different domains and languages, (3) there is no

common evaluation platform for ontology learning, and (4) there is a lack of research

on discovering relations. The validity of some of these conclusions will become more

evident as we look into several prominent systems and recent advances in ontology

learning in the following two sections.

2.3.1 Prominent Ontology Learning Systems

A summary of the techniques used by five prominent ontology learning systems,

and the evaluation of these techniques are provided in this section.

OntoLearn

OntoLearn [178, 182, 259, 260], together with Consys (for ontology validation by

experts) and SymOntoX (for updating and managing ontology by experts) are part


of a project for developing an interoperable infrastructure for small and medium

enterprises in the tourism sector under the Federated European Tourism Information System (FETISH; more information is available via http://sourceforge.net/projects/fetishproj/, last accessed 25 May 2009). OntoLearn employs both linguistics and statistics-based

techniques in four major tasks to discover terms, concepts and taxonomic relations.

• Preprocess texts and extract terms: Domain and general corpora are first

processed using part-of-speech tagging and sentence parsing tools to produce

syntactic structures including noun phrases and prepositional phrases. For

relevance analysis, the approach adopts two metrics known as Domain Rel-

evance (DR) and Domain Consensus (DC). Domain relevance measures the

specificity of term $t$ with respect to the target domain $D_k$ through comparative analysis across a list of predefined domains $D_1, \ldots, D_n$. The measure is defined as

$DR(t, D_k) = \frac{P(t|D_k)}{\sum_{i=1 \ldots n} P(t|D_i)}$

where $P(t|D_k)$ and $P(t|D_i)$ are estimated as $\frac{f_{t,k}}{\sum_{t \in D_k} f_{t,k}}$ and $\frac{f_{t,i}}{\sum_{t \in D_i} f_{t,i}}$, respectively. $f_{t,k}$ and $f_{t,i}$ are the frequencies of term $t$ in domain $D_k$ and $D_i$, respectively. Domain consensus, on the other hand, is used to measure the appearance of a term in a single document as compared to the overall occurrence in the target domain. The domain consensus of a term $t$ in domain $D_k$ is an entropy defined as

$DC(t, D_k) = \sum_{d \in D_k} P(t|d) \log \frac{1}{P(t|d)}$

where $P(t|d)$ is the probability of encountering term $t$ in document $d$ of domain $D_k$. A minimal computational sketch of these two measures is given after this list.

Dk.

• Form concepts: After the list of relevant terms has been identified, concepts

and glossary from WordNet are employed for associating the terms to existing

concepts and to provide definitions. The authors named this process seman-

tic interpretation. If multi-word terms are involved, the approach evaluates all

possible sense combinations by intersecting and weighting common semantic

patterns in the glossary until it selects the best sense combinations.

• Construct hierarchy: Once semantic interpretation has been performed on the

terms to form concepts, taxonomic relations are discovered using hypernyms

from WordNet to organise the concepts into domain concept trees.
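A minimal computational sketch of the DR and DC measures described in the first task above is given below; the representation of domains as lists of term-frequency dictionaries, the zero-division guards, and the estimation of $P(t|d)$ as the term's frequency in $d$ normalised over the target domain are simplifying assumptions, not OntoLearn's implementation.

```python
import math

def domain_relevance(term, target, domains):
    # domains: {name: [doc_counts, ...]} where doc_counts maps term -> frequency.
    def p_term_in_domain(dom):
        total = sum(sum(doc.values()) for doc in domains[dom]) or 1
        freq = sum(doc.get(term, 0) for doc in domains[dom])
        return freq / total
    denom = sum(p_term_in_domain(d) for d in domains) or 1
    return p_term_in_domain(target) / denom

def domain_consensus(term, target, domains):
    # Entropy of the term's frequency distribution over the documents
    # of the target domain (one way of estimating P(t|d)).
    freqs = [doc.get(term, 0) for doc in domains[target]]
    total = sum(freqs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for f in freqs:
        if f > 0:
            p = f / total
            entropy += p * math.log(1.0 / p)
    return entropy

if __name__ == "__main__":
    domains = {
        "medicine": [{"kinase": 4, "patient": 6}, {"kinase": 3, "therapy": 2}],
        "economics": [{"market": 5, "patient": 1}, {"inflation": 4}],
    }
    print(f"DR(kinase, medicine) = {domain_relevance('kinase', 'medicine', domains):.2f}")
    print(f"DC(kinase, medicine) = {domain_consensus('kinase', 'medicine', domains):.2f}")
```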



An evaluation of the term extraction technique was performed using the F-

measure. A tourism corpus containing about 200,000 words was manually constructed from the Web. The evaluation was done by manually looking at 6,000 of the 14,383 candidate terms and marking all the terms judged as good domain terms

and comparing the obtained list with the list of terms automatically filtered by the

system. A precision of 85.42% and recall of 52.74% were achieved.

Text-to-Onto

Text-to-Onto [51, 162, 163, 165] is a semi-automated system that is part of an

ontology management infrastructure called KAON (more information is available via http://kaon.semanticweb.org/, last accessed 25 May 2009). KAON is a comprehensive tool suite for ontology creation and management. The authors claimed that the approach has been applied to the tourism and insurance sectors, but no further information was presented. Instead, ontologies for some toy domains have been constructed using this approach; these ontologies can be downloaded from http://kaon.semanticweb.org/ontologies, and the term toy domain is in wide use in the research community to describe work in extremely restricted domains. Text-to-Onto employs both linguistics and statistics-based

techniques in six major tasks to discover terms, concepts, taxonomic relations and

non-taxonomic relations.

• Preprocess texts and extract terms: Plain text extraction is performed to ex-

tract plain domain texts from semi-structured sources (i.e. HTML documents)

and other formats (e.g. PDF documents). Abbreviation expansion is per-

formed on the plain texts using rules and dictionaries to replace abbreviations

and acronyms. Part-of-speech tagging and sentence parsing are performed on

the preprocessed texts to produce syntactic structures and dependencies. Syn-

tactic structure analysis is performed using weighted finite state transducers

to identify important noun phrases as terms. These natural language process-

ing tools are provided by a system called Saarbruecken Message Extraction

System (SMES) [184].

• Form concepts: Concepts from a domain lexicon are required to assign new terms to predefined concepts. Unlike other approaches that employ general background knowledge such as WordNet, the lexicon adopted by Text-to-Onto is domain-specific, containing over 120,000 terms. Each term is associated with concepts available in a concept taxonomy. Other techniques for concept formation, such as co-occurrence analysis, are also performed, but no additional information was provided.

• Construct hierarchy: Once the concepts have been formed, taxonomic relations

are discovered by exploiting the hypernyms from WordNet. Lexico-syntactic

patterns are also employed to identify hypernymy relations in the texts. The

authors refer to these hypernyms as oracle, denoted by H. The projection

H(t) will return a set of tuples (x, y) where x is a hypernym for term t and y

is the number of times the algorithm has found evidence for it. Using the cosine measure for similarity and the oracle, bottom-up hierarchical clustering is carried out with a list T of n terms as input. Given two terms which are similar according to the cosine measure, the algorithm orders them as sub-concepts if one is a hypernym of the other. If this case does not apply, the most frequent common hypernym h is selected to create a new concept that accommodates both terms as siblings (a minimal sketch of this step is given after this list).

• Discover non-taxonomic relations and label non-taxonomic relations: For non-

taxonomic relations extraction, association rules together with two user-defined

thresholds (i.e. confidence, support) are employed to determine associations

between concepts at the right level of abstraction. Typically, users start with

low support and confidence to explore general relations, and later increase the values to explore more specific relations. User participation is required to

validate and label the non-taxonomic relations.
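To make the hierarchy construction step above more concrete, the following is a minimal sketch (in Python) of how the oracle H and the cosine-based similarity judgement could interact when placing a pair of terms; the helper names and the toy oracle are hypothetical, and this is an illustration of the described behaviour rather than the authors' implementation.

def most_frequent_common_hypernym(h1, h2):
    # h1, h2: dicts mapping hypernym -> evidence count, i.e. the tuples (x, y)
    # returned by the projection H(t) for each term.
    common = set(h1) & set(h2)
    if not common:
        return None
    return max(common, key=lambda x: h1[x] + h2[x])

def place_pair(t1, t2, oracle):
    # Called only for pairs of terms judged similar by the cosine measure.
    h1, h2 = oracle(t1), oracle(t2)
    if t2 in h1:                        # t2 is a hypernym of t1
        return ("subconcept", t1, t2)
    if t1 in h2:                        # t1 is a hypernym of t2
        return ("subconcept", t2, t1)
    h = most_frequent_common_hypernym(h1, h2)
    if h is not None:                   # create a new parent; both terms become siblings
        return ("siblings_under", t1, t2, h)
    return ("unrelated", t1, t2)

# Toy oracle for two tourism terms:
oracle = lambda t: {"hotel": {"accommodation": 3, "building": 1},
                    "motel": {"accommodation": 2}}.get(t, {})
print(place_pair("hotel", "motel", oracle))
# -> ('siblings_under', 'hotel', 'motel', 'accommodation')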

An evaluation of the relation discovery technique was performed using a measure

called the Generic Relations Learning Accuracy (RLA). Given a set of discovered

relations D, precision is defined as |D ∩ R|/|D| and recall as |D ∩ R|/|R| where R

is the set of non-taxonomic relations prepared by domain experts. RLA is a measure to capture intuitive notions for relation matches such as utterly wrong, rather bad, near miss and direct hit. RLA is the average accuracy with which the instances of discovered relations match their best counterparts in a manually-curated gold standard. As the learning algorithm is controlled by support and confidence parameters, the evaluation is done by varying the support and the confidence values. When both the support and the confidence thresholds are set to 0, 8,058 relations were produced with an RLA of 0.51. Both the number of relations and the recall decrease with growing support and confidence. Precision increases at first but drops when so few relations are discovered that almost none is a direct hit. The best RLA of 0.67 is achieved with a support of 0.04 and a confidence of 0.01.
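As a brief illustration of how the two ratios above behave, the snippet below computes precision and recall for hypothetical sets of discovered and gold-standard relations; the relation triples and counts are made up for illustration and are not taken from the cited evaluation.

# Hypothetical relations represented as (concept, label, concept) triples.
D = {("hotel", "offers", "room"), ("hotel", "located_in", "city"),
     ("room", "has", "price"), ("hotel", "offers", "breakfast")}
R = {("hotel", "offers", "room"), ("room", "has", "price"),
     ("hotel", "employs", "staff")}

overlap = len(D & R)              # |D ∩ R| = 2
precision = overlap / len(D)      # |D ∩ R| / |D| = 2/4 = 0.50
recall = overlap / len(R)         # |D ∩ R| / |R| = 2/3 ≈ 0.67
print(precision, recall)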


ASIUM

ASIUM [71, 70, 69] is a semi-automated ontology learning system that is part of

an information extraction infrastructure called INTEX, by the Laboratoire d’Automatique

Documentaire et Linguistique de l’Universite de Paris 7. The aim of this approach

is to learn semantic knowledge from texts and use the knowledge for the expansion

(i.e. portability from one domain to the other) of INTEX. The authors mentioned

that the system has been tested by Dassault Aviation, and has been applied to a toy

domain using cooking recipe corpora in French. ASIUM employs both linguistics

and statistics-based techniques to carry out five tasks to discover terms, concepts

and taxonomic relations.

• Preprocess texts and discover subcategorisation frames: Sentence parsing is

applied on the input text using functionalities provided by a sentence parser

called SYLEX [54]. SYLEX produces all interpretations of parsed sentences in-

cluding attachments of noun phrases to verbs and clauses. Syntactic structure

and dependency analysis is performed to extract instantiated subcategorisation

frames in the form of <verb><syntactic role|preposition:head noun>∗

where the wildcard character ∗ indicates the possibility of multiple occurrences.

• Extract terms and form concepts: The nouns in the arguments of the sub-

categorisation frames extracted from the previous step are gathered to form

basic classes based on the assumption “head words occurring after the same,

different prepositions (or with the same, different syntactic roles), and with the

same, different verbs represent the same concept” [68]. To illustrate, suppose

that we have the nouns “ballpoint pen”, “pencil” and “fountain pen” occurring

in different clauses as adjunct of the verb “to write” after the preposition

“with”. At the same time, these nouns are the direct object of the verb “to

purchase”. From the assumption, these nouns are thus considered as variants

representing the same concept.

• Construct hierarchy: The basic classes from the previous task are successively

aggregated to form concepts of the ontology and reveal the taxonomic relations

using clustering. Distance between all pairs of basic classes is computed and

two basic classes are only aggregated if the distance is less than the threshold

set by the user. On the one hand, two classes containing the same words with the same frequencies have a distance of 0. On the other hand, a pair of classes without a single common word has a distance of 1 (a sketch of such a class distance is given after this list). The clustering


algorithm works bottom-up and performs first-best using basic classes as input, building the ontology level by level. User participation is required to validate each new cluster before it can be aggregated into a concept.
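The sketch below gives one possible form of such a class distance over basic classes represented as head-noun frequency maps: identical classes score 0 and classes sharing no word score 1. ASIUM's exact formula is not reproduced in the cited papers' summary above, so this should be read only as an assumed stand-in with the same boundary behaviour; two basic classes would then be aggregated only when this distance falls below the user-defined threshold.

def class_distance(c1, c2):
    # c1, c2: dicts mapping head nouns in a basic class to their frequencies.
    shared = sum(min(c1[w], c2[w]) for w in set(c1) & set(c2))
    total = sum(c1.values()) + sum(c2.values())
    return 1.0 - (2.0 * shared) / total

write_class = {"pen": 3, "pencil": 2, "chalk": 1}
buy_class   = {"pen": 3, "pencil": 2, "chalk": 1}
cook_class  = {"pan": 4, "oven": 1}

print(class_distance(write_class, buy_class))   # 0.0: same words, same frequencies
print(class_distance(write_class, cook_class))  # 1.0: no common word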

An evaluation of the term extraction technique was performed using the precision

measure. The evaluation uses texts from the French journal Le Monde that have

been manually filtered to ensure the presence of terrorist event descriptions. The

results were evaluated by two domain experts who were not aware of the ontology

building process using the following indicators: OK if extracted information is cor-

rect, FALSE if extracted information is incorrect, NONE if there was no extracted

information, and FALSE for all other cases. Two precision values are computed,

namely, precision1, which is the ratio between OK and FALSE, and precision2, which is the same as precision1 but also takes NONE into consideration. Precision1 and precision2 have values of 86% and 89%, respectively.

TextStorm/Clouds

TextStorm/Clouds [191, 198] is a semi-automated ontology learning system that

is part of an idea sharing and generation system called Dr. Divago [197]. The

aim of this approach is to build and refine domain ontology for use in Dr. Divago

for searching resources in a multi-domain environment to generate musical pieces

or drawings. No information was provided on the availability of any real-world

applications, nor testing on toy domains. TextStorm/Clouds employs logic and

linguistics-based techniques to carry out six tasks to discover terms, taxonomic

relations, non-taxonomic relations and axioms.

• Preprocess texts and extract terms: The part-of-speech information in Word-

Net is used to annotate the input text. Later, syntactic structure and depen-

dency analysis is performed using an augmented grammar to extract syntactic

structures in the form of binary predicates. The Prolog-like binary predicates

represent relations between two terms. Two types of binary predicates are

considered. The first type captures terms in the form of subject and object

connected by a main verb. The second type captures the property of com-

pound nouns, usually in the form of modifiers. For example, the sentence “Ze-

bra eat green grass” will result in two binary predicates namely eat(Zebra,

grass) and property(grass, green). When working with dependent sen-

tences, finding the concepts may not be straightforward and this approach

performs anaphora resolution to resolve ambiguities. The anaphora resolution


uses a history list of discourse entities generated from preceding sentences

[6]. In the presence of an anaphora, the most recent entities are given higher

priority.

• Construct hierarchy, discover non-taxonomic relations and label non-taxonomic

relations: Next, the binary predicates are employed to gradually aggregate

terms and relations to an existing ontology with user participation. Hy-

pernymy relations appear in binary predicates in the form of is-a(X,Y)

while part-of(X,Y) and contain(X,Y) provide good indicators for meronyms.

Attribute-value relations are obtainable from the predicates in the form of

property(X,Y). During the aggregation process, users may be required to in-

troduce new predicates to connect certain terms and relations to the ontology.

For example, in order to attach the predicate is-a(predator, animal) to

an ontology with the root node living entity, the user will have to introduce

is-a(animal, living entity).

• Extract axioms: The approach employs inductive logic programming to learn

regularities by observing the recurrent concepts and relations in the predicates.

For instance, using the extracted predicates below, the approach

1: is-a(panther, carnivore)

2: eat(panther, zebra)

3: eat(panther, gazelle)

4: eat(zebra, grass)

5: is-a(zebra,herbivore)

6: eat(gazelle, grass)

7: is-a(gazelle,herbivore)

will arrive at the conclusions that

1: eat(A, zebra):- is-a(A, carnivore)

2: eat(A, grass):- is-a(A, herbivore)

These axioms describe relations between concepts in terms of their context (i.e.

the set of neighbourhood connections that the arguments have).

Using the accuracy measure, the performance of the binary predicate extraction

task was evaluated to determine if the relations hold between the corresponding

concepts. A total of 21 articles from the scientific domain were collected and analysed

by the system. Domain experts then determined the coherence of the predicates and


their accuracy with respect to the corresponding input text. The authors reported an average accuracy of 52%.

SYNDIKATE

SYNDIKATE [96, 95] is a stand-alone automated ontology learning system. The

authors have applied this approach in two toy domains, namely, information tech-

nology and medicine. However, no information was provided on the availability of

any real-world applications. SYNDIKATE employs purely linguistics-based tech-

niques to carry out five tasks to discover terms, concepts, taxonomic relations and

non-taxonomic relations.

• Extract terms: Syntactic structure and dependency analysis is performed on

the input text using a lexicalised dependency grammar to capture binary va-

lency5 constraints between a syntactic head (e.g. noun) and possible modifiers

(e.g. determiners, adjectives). In order to establish a dependency relation be-

tween a head and a modifier, the term order, morpho-syntactic features com-

patibility and semantic criteria have to be met. Anaphora resolution based on

the centering model is included to handle pronouns.

• Form concepts, construct hierarchy, discover non-taxonomic relations and la-

bel non-taxonomic relations: Using predefined semantic templates, each term

in the syntactic dependency graph is associated with a concept in the domain

knowledge and at the same time, used to instantiate the text knowledge base.

The text knowledge base is essentially an annotated representation of the in-

put texts. For example, the term “hard disk” in the graph is associated with

the concept HARD DISK in domain knowledge, and at the same time, an in-

stance called HARD DISK3 will be created in the text knowledge base. The

approach then tries to find all relational links between conceptual correlates

of two words in the subgraph if both grammatical and conceptual constraints

are fulfilled. The linkage may either be constrained by dependency relations,

by intervening lexical materials, or by conceptual compatibility between the

concepts involved. In the case where unknown words occur, semantic inter-

pretation of the dependency graph involving unknown lexical items in the text

knowledge base is employed to derive concept hypothesis. The structural pat-

terns of consistency, mutual justification and analogy relative to the already

5Valency refers to the capacity of a verb to take a specific number and type of arguments (noun

phrase positions).


available concept descriptions in the text knowledge base will be used as initial

evidence to create linguistic and conceptual quality labels. An inference en-

gine is then used to estimate the overall credibility of the concept hypotheses

by taking into account the quality labels.

An evaluation using the precision, recall and accuracy measures was conducted

to assess the concepts and relations extracted by this system. The use of semantic

interpretation to discover the relations between conceptual correlates yielded 57%

recall and 97% precision, and 31% recall and 94% precision, for medicine and infor-

mation technology texts, respectively. As for the formation of concepts, an accuracy

of 87% was achieved. The authors also presented the performance of other aspects

of the system. For example, sentence parsing in the system exhibits a linear time

complexity while a third-party parser runs in exponential time complexity. This

behaviour was caused by the latter’s ability to cope with ungrammatical input. The

incompleteness of the system’s parser results in a 10% loss of structural information

as compared to the complete third-party parser.

2.3.2 Recent Advances in Ontology Learning

Since the publication of the three survey papers [62, 226, 90], the research activ-

ities within the ontology learning community have mainly focused on (1) the advancement of relation acquisition techniques, (2) the automatic labelling of concepts and relations, (3) the use of structured and unstructured Web data for relation

acquisition, and (4) the diversification of evidence for term recognition.

On the advancement of relation acquisition techniques, Specia & Motta [237]

presented an approach for extracting semantic relations between pairs of entities

from texts. The approach makes use of a lemmatiser, syntactic parser, part-of-speech

tagger, and word sense disambiguation models for language processing. New entities

are recognised using a named-entity recognition system. The approach also relies

on a domain ontology, a knowledge base, and lexical databases. Extracted entities

that exist in the knowledge base are semantically annotated with their properties.

Ciaramita et al. [48] employ syntactic dependencies as potential relations. The

dependency paths are treated as bi-grams, and scored with statistical measures of

correlation. At the same time, the arguments of the relations can be generalised to

obtain abstract concepts using algorithms for Selectional Restrictions Learning [208].

Snow et al. [234, 235] also presented an approach that employs the dependency

paths extracted from parse trees. The approach is trained using sets of

text containing known hypernym pairs. The approach then automatically discovers


useful dependency paths that can be applied to new corpora for identifying new

hypernyms.

On automatic concept and relation labelling, Kavalec & Svatek [123] studied the feasibility of label identification for relations using a semantically-tagged corpus

and other background knowledge. The authors suggested that the use of verbs,

identified through part-of-speech tagging, can be viewed as a rough approximation

of relation labels. With the help of a semantically-tagged corpus to resolve the verbs to the correct word sense, the quality of relation labelling may be increased. In

addition, the authors also suggested that abstract verbs identified through generali-

sation via WordNet can be useful labels. Jones [119] proposed in her PhD research a semi-automated technique for identifying concepts and a simple technique for labelling

concepts using user-defined seed words. This research was carried out exclusively

using small lists of words as input. In another PhD research by Rosario [211], the

author proposed the use of statistical semantic parsing to extract concepts and rela-

tions from bioscience text. In addition, the research presented the use of statistical

machine learning techniques to build a knowledge representation of the concepts.

The concepts and relations extracted by the proposed approach are intended to be

combined by some other systems to produce larger propositions which can then be

used in areas such as abductive reasoning or inductive logic programming. This

approach has only been tested with a small amount of data from toy domains.

On the use of Web data for relation acquisition, Sombatsrisomboon [236] pro-

posed a simple 3-step technique for discovering taxonomic relations (i.e. hyper-

nym/hyponym) between pairs of terms using search engines. Search engine queries

are first constructed using the term pairs and patterns such as X is a/an Y. The

webpages provided by search engines are then gathered to create a small corpus.

Sentence parsing and syntactic structure analysis are performed on the corpus to dis-

cover taxonomic relations between the terms. Such use of Web data redundancy

and patterns can also be extended to discover non-taxonomic relations. Sanchez

& Moreno [217] proposed methods for discovering non-taxonomic relations using

Web data. The authors developed a technique for learning domain patterns using

domain-relevant verb phrases extracted from webpages provided by search engines.

These domain patterns are then used to extract and label non-taxonomic relations

using linguistic and statistical analysis. There is also an increasing interest in the

use of structured Web data such as Wikipedia for relation acquisition. Pei et al.

[196] proposed an approach for constructing ontologies using Wikipedia. The ap-

proach uses a two-step technique, namely, name mapping and logic-based map-


ping, to deduce the type of relations between concepts in Wikipedia. Similarly,

Liu et al. [154] developed a technique called Catriple for automatically extract-

ing triples using Wikipedia’s categorical system. The approach focuses on category

pairs containing both explicit property and explicit value (e.g. “Category:Songs

by artist”-“Category:The Beatles songs”, where “artist” is the property and “The Beatles” is the value), and category pairs containing an explicit value but an implicit property (e.g. “Category:Rock songs”-“Category:British rock songs”, where “British” is a value

with no property). Sentence parsers and syntactic rules are used to extract the

explicit properties and values from the category names. Weber & Buitelaar [267]

proposed a system called Information System for Ontology Learning and Domain

Exploration (ISOLDE) that derives domain ontologies using manually-curated text corpora, a general-purpose named-entity tagger, and structured data on the Web (i.e. Wikipedia, Wiktionary and a German online dictionary known as DWDS).

On the diversification of evidence for term recognition, Sclano & Velardi [222] de-

veloped a system called TermExtractor for identifying relevant terms in two steps.

TermExtractor uses a sentence parser to parse texts and extract syntactic struc-

tures such as noun compounds, and ADJ-N and N-PREP-N sequences. The list of

term candidates is then ranked and filtered using a combination of measures for

realising different evidence, namely, Domain Pertinence (DP), Domain Consensus

(DC), Lexical Cohesion (LC) and Structural Relevance (SR). Wermter & Hahn [269]

incorporated a linguistic property of terms as evidence, namely, limited paradigmatic modifiability, into an algorithm for extracting terms. The property of paradigmatic

modifiability is concerned with the extent to which the constituents of a multi-word

term can be modified or substituted. The more we are able to substitute the con-

stituents by other words, the less probable it is that the corresponding multi-word

lexical unit is a term. There is also increasing interest in automatically constructing the text corpora required for term extraction using Web data. Agbago & Barriere

[4] proposed the use of richness estimators to assess the suitability of webpages pro-

vided by search engines for constructing corpora for use by terminologists. Baroni

& Bernardini [15] developed the BootCat technique for bootstrapping text corpora

and terms using Web data and search engines. The technique requires as input a

set of seed terms. The seeds are used to build a corpus using webpages suggested

by search engines. New terms are then extracted from the initial corpus, which in

turn are used as seeds to build larger corpora.
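The seeds-to-corpus-to-seeds cycle behind BootCat can be sketched as follows; the search and term extraction functions are placeholders for whichever search engine API and term scorer are actually used, so the code only illustrates the bootstrapping loop rather than the published implementation.

import itertools
import random

def bootstrap_corpus(seed_terms, search, extract_terms, iterations=3, tuple_size=3):
    # seed_terms: initial domain terms supplied by the user.
    # search(query) -> list of webpage texts (placeholder for a search engine API).
    # extract_terms(corpus) -> list of new candidate terms (placeholder scorer).
    corpus, seeds = [], set(seed_terms)
    for _ in range(iterations):
        # Build queries from random combinations of the current seeds.
        combos = list(itertools.combinations(sorted(seeds), tuple_size))
        queries = [" ".join(c) for c in random.sample(combos, min(10, len(combos)))]
        for query in queries:
            corpus.extend(search(query))
        # Extract new terms from the enlarged corpus and feed them back as seeds.
        seeds |= set(extract_terms(corpus))
    return corpus, seeds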

There are several other recent advances that fall outside the aforementioned


groups. Novacek & Smrz [187] developed two frameworks for bottom-up generation

and merging of ontologies called OLE and BOLE. The latter is a domain-specific

adaptation of the former for learning bio-ontologies. OLE is designed and imple-

mented as a modular framework consisting of several components for providing

solutions to different tasks in ontology learning. For example, the OLITE module

is responsible for preprocessing plain text and creating mini ontologies. PALEA

is a module responsible for extracting new semantic relation patterns while OLE-

MAN merges the mini ontologies resulting from the OLITE module and updates

the base domain ontology. The authors mentioned that any techniques for auto-

mated ontology learning can be employed as an independent part of any module. Another piece of research that contributed from a systemic point of view is the CORPO-

RUM system [72]. OntoExtract is part of the CORPORUM OntoBuilder toolbox

that analyses natural language texts for generating lightweight ontologies. No spe-

cific detail was provided regarding the techniques employed by OntoExtract. The

author [72] merely mentioned that OntoExtract uses a repository of background

knowledge to parse, tokenise and analyse texts on both the lexical and syntactic

level, and generates nodes and relations between key terms. Liu et al. [156] pre-

sented an approach to semi-automatically extend and refine ontologies using text

mining techniques. The approach makes use of news from media sites to expand a

seed ontology by first creating a semantic network through co-occurrence analysis,

trigger phrase analysis, and disambiguation based on the WordNet lexical dictio-

nary. Spreading activation is then applied on the resulting semantic network to find

the most probable candidates for inclusion in the extended ontology.

2.4 Applications of Ontologies

Ontologies are an important part of the standard stack for the Semantic Web6

by the World Wide Web Consortium (W3C). Ontologies are used to exchange data

among multiple heterogeneous systems, provide services in an agent-based environ-

ment, and promote the reusability of knowledge bases. While the dream of realising

the Semantic Web is still years away, ontologies have already found their way into

a myriad of applications such as document retrieval, question answering, image re-

trieval, agent interoperability and document annotation. Some of the research areas

which have found use for ontologies are:

• Document retrieval: Paralic & Kostial [194] developed a document retrieval

6http://www.w3.org/2001/sw/


system based on the use of ontologies. The authors demonstrated that the

retrieval precision and recall of the ontology-based information retrieval sys-

tem outperform those of techniques based on latent semantic indexing and full-text

search. The system registers every new document to several concepts in the

ontology. Whenever retrieval requests arrive, resources are retrieved based on

the associations between concepts, and not on partial or exact term matching.

Similarly, Vallet et al. [255] and Castells et al. [39] proposed a model that

uses knowledge in ontologies for improving document retrieval. The retrieval

model includes an annotation weighting algorithm and a ranking algorithm

based on the classic vector-space model. Keyword-based search is incorpo-

rated into their approach to ensure robustness in the event of incompleteness

of the ontology.

• Question answering: Atzeni et al. [10] reported the development of an ontology-

based question answering system for the Web sites of two European universi-

ties. The system accepts questions and produces answers in natural language.

The system is being investigated in the context of a European Union project

called MOSES.

• Image retrieval: Hyvonen et al. [111] developed a system that uses ontologies

to assist image retrieval. Images are first annotated with concepts in an on-

tology. Users are then presented with the same ontology to facilitate focused

image retrieval and the browsing of semantically-related images using the right

concepts.

• Multi-agent system interoperability: Malucelli & Oliveira [167] proposed an

ontology-based service to assist the communication and negotiation between

agents in a decentralised and distributed system architecture. The agents

typically have their own heterogeneous private vocabularies. The service uses a

central ontology agent to monitor and lead the communication process between

the heterogeneous agents without having to map all the ontologies involved.

• Document annotation: Corcho [55] surveyed several approaches for annotating

webpages with ontological elements for improving information retrieval. Many

of the approaches described in the survey paper rely on manually-curated

ontologies for annotation using a variety of tools such as SHOE Annotator7

7http://www.cs.umd.edu/projects/plus/SHOE/KnowledgeAnnotator.html


[103], CREAM [101], MnM8 [258], and OntoAnnotate [241].

In addition to the above-mentioned research areas, ontologies have also been

deployed in certain applications across different domains. One of the most successful application areas of ontologies is bioinformatics. Bioinformatics has thrived on the

advances in ontology learning techniques and the availability of manually-curated

terminologies and ontologies (e.g. Unified Medical Language System [151], Gene

Ontology [8] and other small domain ontologies at www.obofoundry.org). The

computable knowledge in ontologies is also proving to be a valuable resource for

reasoning and knowledge discovery in biomedical decision support systems. For

example, the inference that a disease of the myocardium is a heart problem is possible

using the subsumption relations in an ontology of disease classification based on

anatomic locations [52]. In addition, terminologies and ontologies are commonly

used for annotating biological datasets, biomedical literature and patient records,

and improving the access and retrieval of biomedical information [27]. For instance,

Baker et al. [13] presented a document query and delivery system for the field of

lipidomics9. The main aim of the system is to overcome the navigation challenges

that hinder the translation of scientific literature into actionable knowledge. The

system allows users to access tagged documents containing lipid, protein and disease

names using a description logic-based query capability that comes with the semi-

automatically created lipid ontology. The lipid ontology contains a total of 672

concepts. The ontology is the result of merging existing biological terminologies,

knowledge from domain experts, and output from a customised text mining system

that recognises lipid-specific nomenclature.

Another visible application of ontologies is in the manufacturing industry. Cho

et al. [43] looked at the current approach for locating and comparing parts informa-

tion in an e-procurement setting. At present, buyers are faced with the challenge

of accessing and navigating through different parts libraries from multiple suppliers

using different search procedures. The authors introduced the use of the “Parts

Library Concept Ontology” to integrate heterogeneous parts libraries and enable the

consistent identification and systematic structuring of domain concepts. Lemaignan

et al. [143] presented a proposal for a manufacturing upper ontology. The authors

stressed the importance of ontologies as a common way of describing manufacturing processes for product lifecycle management. The use of ontologies ensures

the uniformity in assertions throughout a product’s lifecycle, and the seamless flow

8 http://projects.kmi.open.ac.uk/akt/MnM/index.html

9 Lipidomics is the study of pathways and networks of cellular lipids in biological systems.


of data between heterogeneous manufacturing environments. For instance, assume

that we have these relations in an ontology:

isMadeOf(part,rawMaterial)

isA(aluminium,rawMaterial)

isA(drilling,operation)

isMachinedBy(rawMaterial,operation)

and the drilling operation has the attributes drillSpeed and drillDiameter. Using

these elements, we can easily specify rules such as if isMachinedBy(aluminium,drilling)

and the drillDiameter is less than 5mm, then drillSpeed should be 3000 rpm

[143]. This ontology allows a uniform interpretation of assertions such as

isMadeOf(part,aluminium) anywhere along the product lifecycle, thus facilitating

the inference of standard information such as the drill speed.
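A minimal sketch of how such a rule could be evaluated over the asserted relations is given below; the relation names follow the example above, while the encoding of the facts and the rule is only an assumption made for illustration.

# Assertions along the product lifecycle, following the example relations above.
facts = {
    ("isMadeOf", "part", "aluminium"),
    ("isA", "aluminium", "rawMaterial"),
    ("isA", "drilling", "operation"),
    ("isMachinedBy", "aluminium", "drilling"),
}
attributes = {"drilling": {"drillDiameter": 4.0}}   # millimetres

def infer_drill_speed(facts, attributes):
    # If aluminium is machined by drilling and the drill diameter is below 5 mm,
    # the example rule fixes the drill speed at 3000 rpm.
    if ("isMachinedBy", "aluminium", "drilling") in facts \
            and attributes["drilling"]["drillDiameter"] < 5.0:
        return 3000
    return None

print(infer_drill_speed(facts, attributes))   # -> 3000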

2.5 Chapter Summary

In this chapter, an overview of ontologies and ontology learning from text was provided. In particular, we looked at the types of output, techniques and evaluation methods related to ontology learning. The differences between a heavyweight (i.e. formal) and a lightweight ontology were also explained. Several prominent ontology learning systems, and some recent advances in the field, were summarised. Finally, some noticeable current applications were included to demonstrate the applicabil-

ity of ontologies to a wide range of domains. The use of ontologies for real-world

applications in the area of bioinformatics and the manufacturing industry was high-

lighted.

Overall, it was concluded that the automatic and practical construction of full-

fledged formal ontologies from text across different domains is currently beyond

the reach of conventional systems. Many current ontology learning systems are

still struggling to achieve high-performance term recognition, let alone more com-

plex tasks (e.g. relation acquisition, axiom learning). An interesting point revealed

during the literature review is that most systems ignore the fact that the static background knowledge relied upon by their techniques is a scarce resource and may not have adequate size and coverage. In particular, all existing term recognition techniques rest on the false assumption that the required domain corpora will always be available. Only recently has there been growing interest in automatically constructing text corpora using Web data. However, the governing philosophy behind


these existing corpus construction techniques is inadequate for creating very large

high-quality text corpora. In regard to relation acquisition, existing techniques rely

heavily on static background knowledge, especially semantic lexicons such as Word-

Net. While there is an increasing interest in the use of dynamic Web data for relation

acquisition, more research work is still required. For instance, new techniques are

appearing every now and then that make use of Wikipedia for finding semantic re-

lations between two words. However, these techniques often leave out the details

on how to cope with words that do not appear in Wikipedia. Moreover, the use of

clustering techniques for acquiring semantic relations may appear less attractive due

to the complications in feature extraction and preparation. The literature review

also exposes the lack of treatment for data cleanliness during ontology learning. As

the use of Web data becomes more common, integrated techniques for removing

noise in texts are becoming a necessity.

All in all, it is safe to conclude that there is currently no single system that sys-

tematically uses dynamic Web data to meet the requirements for every stage of the

ontology learning process. There are several key areas that require more attention,

namely, (1) integrated techniques for cleaning noisy text, (2) high-performance term

recognition techniques, (3) high-quality corpus construction for term recognition,

and (4) dynamic Web data for clustering and relation acquisition. Our proposed

ontology learning system is designed specifically to address these key areas. In the

subsequent six chapters (i.e. Chapter 3 to 8), details are provided on the design,

development and testing of novel techniques for the five phases (i.e. text prepro-

cessing, text processing, term recognition, corpus construction, relation acquisition)

of the proposed system.


CHAPTER 3

Text Preprocessing

Abstract

An increasing number of ontology learning systems are gearing towards the use

of online sources such as company intranets and the World Wide Web. Despite this rise, little work can be found on the preprocessing and cleaning of noisy

texts from online sources. This chapter presents an enhancement of the Integrated

Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration

(ISSAC) technique. ISSAC is implemented as part of the text preprocessing phase

in an ontology learning system. New evaluations performed on the enhanced ISSAC

using 700 chat records reveal an improved accuracy of 98% as compared to 96.5%

and 71% based on the use of basic ISSAC and of Aspell, respectively.

3.1 Introduction

Ontologies are gaining applicability across a wide range of applications such as

information retrieval, knowledge acquisition and management, and the Semantic

Web. The manual construction and maintenance of ontologies was never a long-term

solution due to factors such as the high cost of expertise and the constant change in

knowledge. These factors have prompted an increasing effort in automatic and semi-

automatic learning of ontologies using texts from electronic sources. A particular

source of text that is becoming popular is the World Wide Web.

The quality of texts from online sources for ontology learning can vary anywhere

between noisy and clean. On the one hand, the quality of texts in the form of

blogs, emails and chat logs can be extremely poor. The sentences in noisy texts are

typically full of spelling errors, ad-hoc abbreviations and improper casing. On the

other hand, clean sources are typically prepared and conformed to certain standards

such as those in the academia and journalism. Some common clean sources include

news articles from online media sites, and scientific papers. Different text quality

requires different treatments during the preprocessing phase and noisy texts can be

much more demanding.

An increasing number of approaches are gearing towards the use of online sources

0This chapter appeared in the Proceedings of the IJCAI Workshop on Analytics for Noisy

Unstructured Text Data (AND), Hyderabad, India, 2007, with the title “Enhanced Integrated

Scoring for Cleaning Dirty Texts”.


such as corporate intranets [126] and documents retrieved by search engines [51] for

different aspects of ontology learning. Despite such growth, only a small number of

researchers [165, 187] acknowledge the effect of text cleanliness on the quality of their

ontology learning output. With the prevalence of online sources, this “...annoying

phase of text cleaning...”[176] has become inevitable and ontology learning systems

can no longer ignore the issue of text cleanliness. An effort by Tang et al. [246]

showed that the accuracy of term extraction in text mining improved by 38-45%

(F1-measure) with the additional cleaning performed on the input texts (i.e. emails).

Integrated techniques for correcting spelling errors, abbreviations and improper

casing are becoming increasingly appealing as the boundaries between different er-

rors in online sources are blurred. Along the same line of thought, Clark [53] argued that “...a unified tool is appropriate because of certain specific sorts of er-

rors”. To illustrate this idea, consider the error word “cta”. Do we immediately

take it as a spelling error and correct it as “cat”, or is there a problem with the

letter casing, which makes it a probable acronym? It is obvious that the problems

of spelling error, abbreviation and letter casing are inter-related to a certain extent.

The challenge of providing a highly accurate integrated technique for automatically

cleaning noisy text in ontology learning remains to be addressed.

In an effort to provide an integrated technique to solve spelling errors, ad-hoc

abbreviations and improper casing simultaneously, we have developed an Integrated

Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration

(ISSAC)1 technique [273]. The basic ISSAC uses six weights from different sources

for automatically correcting spelling errors, expanding abbreviations and restoring improper casing. These include the original rank by the spell checker Aspell [9],

reuse factor, abbreviation factor, normalised edit distance, domain significance and

general significance. Despite the achievement of 96.5% in accuracy by the basic

ISSAC, several drawbacks have been identified that require additional work. In

this chapter, we present the enhancement of the basic ISSAC. New evaluations

performed on seven different sets of chat records yield an improved accuracy of 98%

as compared to 96.5% and 71% based on the use of basic ISSAC and of Aspell,

respectively.

In Section 2, we present a summary of work related to spelling error detection and

correction, abbreviation expansion, and other cleaning tasks in general. In Section 3,

1This foundation work on ISSAC appeared in the Proceedings of the 5th Australasian Con-

ference on Data Mining (AusDM), Sydney, Australia, 2006, with the title “Integrated Scoring for

Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text”.


we summarise the basic ISSAC. In Section 4, we propose the enhancement strategies

for ISSAC. The evaluation results and discussions are presented in Section 5. We

summarise and conclude this chapter with future outlook in Section 6.

3.2 Related Work

Spelling error detection and correction is the task of recognising misspellings in

texts and providing suggestions for correcting the errors. For example, detecting

“cta” as an error and suggesting that the error to be replaced with “cat”, “act” or

“tac”. More information is usually required to select a correct replacement from

a list of suggestions. Two of the most studied classes of techniques are minimum

edit distance and similarity key. The idea of minimum edit distance techniques

began with Damerau [58] and Levenshtein [146]. Damerau-Levenshtein distance is

the minimal number of insertions, deletions, substitutions and transpositions needed

to transform one string into the other. For example, changing the word “wear” to

“beard” requires a minimum of two operations, namely, a substitution of ‘w’ with

‘b’, and an insertion of ‘d’. Many variants were developed subsequently such as

the algorithm by Wagner & Fischer [266]. The second class of techniques is the

similarity key. The main idea behind similarity key techniques is to map every

string into a key such that similarly spelt strings will have identical keys [135].

Hence, the key, computed for each spelling error, will act as a pointer to all similarly

spelt words (i.e. suggestions) in the dictionary. One of the earliest implementation

is the SOUNDEX system [189]. SOUNDEX is a phonetic algorithm for indexing

words based on their pronunciation in English. SOUNDEX works by mapping a

word into a key consisting of its first letter followed by a sequence of numbers. For

example, SOUNDEX replaces the letter li ∈ {A, E, I, O, U, H, W, Y} with 0 and li ∈ {R} with 6, and hence, “wear” → w006 → w6 and “ware” → w060 → w6. Since

SOUNDEX, many improved variants were developed such as the Metaphone and

the Double-metaphone algorithm [199], Daitch-Mokotoff Soundex [138] for Eastern

European languages, and others [108]. One famous implementation that

utilises the similarity key technique is Aspell [9]. Aspell is based on the Metaphone

algorithm and the near-miss strategy from its predecessor Ispell [134]. Aspell begins

by converting a misspelt word to its soundslike equivalent (i.e. metaphone) and then

finding all words that have a soundslike within one or two edit distances from the

original word’s soundslike2. These soundslike words are the basis of the suggestions

by Aspell.

2Source from http://aspell.net/man-html/Aspell-Suggestion-Strategy.html
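The minimum edit distance discussed above can be made concrete with a short sketch; the function below is a generic Damerau-Levenshtein computation with unit costs for insertion, deletion, substitution and transposition, and is not taken from any of the cited systems.

def damerau_levenshtein(a, b):
    # Dynamic-programming table where d[i][j] is the distance between the
    # first i characters of a and the first j characters of b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("wear", "beard"))   # 2: substitute 'w' with 'b', insert 'd'
print(damerau_levenshtein("cta", "cat"))      # 1: transpose 't' and 'a'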


Most of the work in detecting and correcting spelling errors, and expanding ab-

breviations are carried out separately. The task of abbreviation expansion deals

with recognising shorter forms of words (e.g. “abbr.” or “abbrev.”), acronyms (e.g.

“NATO”) and initialisms (e.g. “HTML”, “FBI”), and expanding them to their cor-

responding words3. The work on detecting and expanding abbreviations are mostly

conducted in the realm of named-entity recognition and word-sense disambiguation.

The technique presented by Schwartz & Hearst [221] begins with the extraction of

all abbreviations and definition candidates based on the adjacency to parentheses.

A candidate is considered as the correct definition for an abbreviation if both appear in the same sentence, and the candidate has no more than min(|A| + 5, |A| ∗ 2)

words, where |A| is the number of characters in an abbreviation A. Park & Byrd

[195] presented an algorithm based on rules and heuristics for extracting definitions

for abbreviations from texts. Several factors are employed in this technique such as

syntactic cues, priority of rules, distance between abbreviation and definition and

word casing. Pakhomov [193] proposed a semi-supervised technique that employs

a hand-crafted table of abbreviations and their definitions for training a maximum

entropy classifier. For case restoration, improper letter casings in words are detected

and restored. For example, detecting the letter ‘j’ in “jones” as improper and cor-

recting the word to produce “Jones”. Lita et al. [153] presented an approach for

restoring cases based on the context in which the word exists. The approach first

captures the context surrounding a word and approximates the meaning using n-

grams. The casing of the letters in a word will depend on the most likely meaning of

the sentence. Mikheev [176] presented a technique for identifying sentence bound-

aries, disambiguating capitalised words and identifying abbreviations using a list of

common words. The technique can be described in four steps: identify abbreviations

in texts, disambiguate ambiguously capitalised words, assign unambiguous sentence

boundaries and disambiguate sentence boundaries if an abbreviation is followed by

a proper name.
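The length constraint used by Schwartz & Hearst can be captured in a few lines; the sketch below checks only the word-count condition described above and is not a reimplementation of their full extraction algorithm.

def within_length_limit(abbreviation, definition_candidate):
    # A definition candidate may have at most min(|A| + 5, |A| * 2) words,
    # where |A| is the number of characters in the abbreviation A.
    max_words = min(len(abbreviation) + 5, len(abbreviation) * 2)
    return len(definition_candidate.split()) <= max_words

print(within_length_limit("HMM", "hidden Markov model"))   # True: 3 words <= 6
print(within_length_limit("ANN", "a network of artificial neurons trained by backpropagation"))
# False: 8 words > 6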

In the context of ontology learning and other related areas such as text min-

ing, spelling error correction and abbreviation expansion are mainly carried out as

part of the text preprocessing (i.e. text cleaning, text normalisation) phase. Some

other common tasks in text preprocessing include plain text extraction (i.e. format

conversion, HTML/XML tag stripping, table identification [185]), sentence bound-

ary detection [243], case restoration [176], part-of-speech tagging [33] and sentence

3Some researchers refer to this relationship as abbreviation and definition or short-form and

long-form.


parsing [149]. A review by Gomez-Perez & Manzano-Macho [90] showed that nearly

all ontology learning systems in the survey perform only shallow linguistic analysis

such as part-of-speech tagging during the text preprocessing phase. These exist-

ing systems require the input to be clean and hence, the techniques for correcting

spelling errors, expanding abbreviations and restoring cases are considered as un-

necessary. Ontology learning systems such as Text-to-Onto [165] and BOLE [187]

are the few exceptions. In addition to shallow linguistic analysis, these systems

incorporate some cleaning tasks. Text-to-Onto extracts plain text from various for-

mats such as PDF, HTML, XML, and identifies and replaces abbreviations using

substitution rules based on regular expressions. The text preprocessing phase of

BOLE consists of sentence boundary detection, irrelevant sentence elimination and

text tokenisation using Natural Language Toolkit (NLTK).

In a text mining system for extracting topics from chat records, Castellanos

[38] presented a comprehensive list of text preprocessing techniques. The system

employs a thesaurus, constructed using the Smith-Waterman algorithm [233], for

correcting spelling errors and identifying abbreviations. In addition, the system

removes program codes from texts, and detects sentence boundary based on simple

heuristics (e.g. shorter lines in program codes, and punctuation marks followed by

an upper case letter). Tang et al. [246] presented a cascaded technique for cleaning

emails prior to text mining. The technique is composed of four passes: non-text

filtering for eliminating irrelevant data such as email header, sentence normalisation,

case restoration, and spelling error correction for transforming relevant text into

canonical form. Many of the techniques mentioned above perform only one out

of the three cleaning tasks (i.e. spelling error correction, abbreviation expansion,

case restoration). In addition, the evaluations conducted to obtain the accuracy are

performed in different settings (e.g. no common benchmark, test data or agreed measure

of accuracy). Hence, it is not possible to compare these different techniques based

on the accuracy reported in the respective papers. As pointed out earlier, only

a small number of integrated techniques are available for handling all three tasks.

Such techniques are usually embedded as part of a larger text preprocessing module.

Consequently, the evaluations of the individual cleaning task in such environments

are not available.

3.3 Basic ISSAC as Part of Text Preprocessing

ISSAC was designed and implemented as part of the text preprocessing phase

in an ontology learning system that uses chat records as input. The use of chat


records has required us to place more effort in ensuring text cleanliness during the preprocessing phase. Figure 3.1 highlights the various spelling errors, ad-hoc abbreviations and improper casing that occur much more frequently in chat records than in clean texts.

Figure 3.1: Examples of spelling errors, ad-hoc abbreviations and improper casing in a chat record.

Prior to spelling error correction, abbreviation expansion and case restoration,

three tasks are performed as part of the text preprocessing phase. Firstly, plain text

extraction is conducted to remove HTML and XML tags from the chat records us-

ing regular expressions and Perl modules, namely, XML::Twig4 and HTML::Strip5.

Secondly, identification of URLs, emails, emoticons6 and tables is performed. Such

information is extracted and set aside for assisting in other business intelligence

analysis. Tables are removed using the signatures of a table such as multiple spaces

between words, and words aligned in columns for multiple lines [38]. Thirdly, sen-

tence boundary detection is performed using Lingua::EN::Sentence7 Perl module.

Firstly, each sentence in the input text (e.g. chat record) is tokenised to obtain a

set of words T = {t1, ..., tw}. The set T is then fed into Aspell. For each word e that

Aspell considers as erroneous, a list of ranked suggestions S is produced. Initially,

4 http://search.cpan.org/dist/XML-Twig-3.26/

5 http://search.cpan.org/dist/HTML-Strip-1.06/

6 An emoticon, also called a smiley, is a sequence of ordinary printable characters or a small image, intended to represent a human facial expression and convey an emotion.

7 http://search.cpan.org/dist/Lingua-EN-Sentence/


S = {s1,1, ..., sn,n} is an ordered list of n suggestions where sj,i is the jth suggestion

with rank i (smaller i indicates higher confidence in the suggested word). If e appears

in the abbreviation dictionary, the list S is augmented by adding all the correspond-

ing m expansions in front of S as additional suggestions with rank 1. In addition,

the error word e is appended at the end of S with rank n + 1. These augmentations

produce an extended list S = {s1,1, ..., sm,1, sm+1,1, ..., sm+n,n, sm+n+1,n+1}, which is

a combination of m suggestions from the abbreviation dictionary (if e is a potential

abbreviation), n suggestions by Aspell, and the error word e itself. Placing the error

word e back into the list of possible replacements serves one purpose: to ensure that

if no better replacement is available, we keep the error word e as it is. Once the

extended list S is obtained, each suggestion sj,i is re-ranked using ISSAC. The new

score for the jth suggestion with original rank i is defined as

NS(sj,i) = i−1 + NED(e, sj,i) + RF(e, sj,i) + AF(sj,i) + DS(l, sj,i, r) + GS(l, sj,i, r)

where

• NED(e, sj,i) ∈ (0, 1] is the normalised edit distance defined as (ED(e, sj,i) +

1)−1 where ED is the minimum edit distance between e and sj,i.

• RF(e, sj,i) ∈ {0, 1} is the boolean reuse factor for providing more weight to

suggestion sj,i that has been previously used for correcting error e. The reuse

factor is obtained through a lookup against a history list that ISSAC keeps

to record previous corrections. RF (e, sj,i) provides factor 1 if the error e has

been previously corrected with sj,i and 0 otherwise.

• AF(sj,i) ∈ {0, 1} is the abbreviation factor for denoting that sj,i is a potential abbreviation. Via a lookup against the abbreviation dictionary, AF(sj,i) yields factor 1 if suggestion sj,i exists in the dictionary and 0 otherwise. When the

scoring process takes place and the corresponding expansions for potential

abbreviations are required, www.stands4.com is consulted. A copy of the

expansion is stored in a local abbreviation dictionary for future reference.

• DS(l, sj,i, r) ∈ [0, 1] measures the domain significance of suggestion sj,i based

on its appearance in the domain corpora by taking into account the neigh-

bouring words l and r. This domain significance weight is inspired by the

TF-IDF [210] measure commonly used for information retrieval. The weight is


defined as the ratio between the frequency of occurrence of sj,i (individually,

and within l and r) in the domain corpora and the sum of the frequencies of

occurrences of all suggestions (individually, and within l and r).

• GS(l, sj,i, r) ∈ [0, 1] measures the general significance of suggestion sj,i based

on its appearance in the general collection (e.g. webpages indexed by the Goo-

gle search engine). The purpose of this general significance weight is similar

to that of the domain significance. In addition, the use of dynamic Web data

allows ISSAC to cope with language change that is not possible with static

corpora and Aspell. The weight is defined as the ratio between the number

of documents in the general collection containing sj,i within l and r and the

number of documents in the general collection that contains sj,i alone. Both

the ratios in DS and GS are offset by a measure similar to that of the IDF

[210]. For further details on DS and GS, please refer to Wong et al. [273].
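Putting the above pieces together, the sketch below shows how the six weights could be combined into NS(sj,i) and used to pick the best replacement from the extended suggestion list. The individual weight functions are passed in as callables because their exact computation (Aspell ranks, history list, abbreviation dictionary, corpus and Web counts) is described above but not reproduced here; the dummy weights in the demonstration are purely illustrative.

def issac_rerank(error, suggestions, left, right, ned, rf, af, ds, gs):
    # suggestions: list of (word, rank) pairs, already extended with the
    # abbreviation expansions (rank 1) and the error word itself (rank n + 1).
    scored = []
    for word, rank in suggestions:
        ns = (1.0 / rank                  # inverse of the original Aspell rank
              + ned(error, word)          # normalised edit distance, (0, 1]
              + rf(error, word)           # reuse factor, {0, 1}
              + af(word)                  # abbreviation factor, {0, 1}
              + ds(left, word, right)     # domain significance, [0, 1]
              + gs(left, word, right))    # general significance, [0, 1]
        scored.append((ns, word))
    scored.sort(reverse=True)
    return scored[0][1]                   # suggestion with the highest NS

# Toy demonstration with dummy weights (illustration only):
zero = lambda *args: 0.0
ned = lambda e, s: 1.0 / (abs(len(e) - len(s)) + 1)   # crude stand-in for (ED + 1)^-1
print(issac_rerank("cta", [("cat", 1), ("act", 2), ("cta", 3)],
                   "the", "sat", ned, zero, zero, zero, zero))   # -> 'cat'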

3.4 Enhancement of ISSAC

The list of suggestions and the initial ranks provided by Aspell are an integral

part of ISSAC. Figure 3.2 summarises the accuracy of basic ISSAC obtained from

the previous evaluations [273] using four sets of chat records (where each set contains

100 chat records). The achievement of 74.4% accuracy by Aspell from the previous

evaluations, given the extremely poor nature of the texts, demonstrated the strength

of the Metaphone algorithm and the near-miss strategy. The further increase of 22%

in accuracy using basic ISSAC demonstrated the potential of the combined weights

NS(sj,i).

                                              Evaluation 1   Evaluation 2   Evaluation 3   Evaluation 4   Average
Number of correct replacements using ISSAC    97.06%         97.07%         95.92%         96.20%         96.56%
Number of correct replacements using Aspell   74.61%         75.94%         71.81%         75.19%         74.39%

Figure 3.2: The accuracy of basic ISSAC from previous evaluations.

Based on the previous evaluation results, we discuss in detail the three causes

behind the remaining 3.5% of errors which were incorrectly replaced. Figure 3.3

shows the breakdown of the causes behind the incorrect replacements by the basic

ISSAC. The three causes are summarised as follows:


Causes of incorrect replacements by basic ISSAC:
  Correct replacement not in suggestion list    2.00%
  Inadequate/erroneous neighbouring words       1.00%
  Anomalies                                     0.50%

Figure 3.3: The breakdown of the causes behind the incorrect replacements by basic ISSAC.

1. The accuracy of the corrections by basic ISSAC is bounded by the coverage

of the list of suggestions S produced by Aspell. About 2% of the wrong

replacements are due to the absence of the correct suggestions produced by

Aspell. For example, the error “prder” in the context of “The prder number”

was incorrectly replaced by both Aspell and basic ISSAC as “parader” and

“prder”, respectively. After a look into the evaluation log, we realised that the

correct replacement “order” was not in S.

2. The use of the two immediate neighbouring words l and r to inject more con-

textual consideration into domain and general significance has contributed to

a huge increase in accuracy. Nonetheless, the use of l and r in ISSAC is

by no means perfect. About 1% of the wrong replacements are due to two

flaws related to l and r, namely, neighbouring words with incorrect spelling,

and inadequate neighbouring words. Incorrectly spelt neighbouring words in-

ject false contextual information into the computation of DS and GS. The

neighbouring words may also be considered as inadequate due to their indis-

criminative nature. For example, the left word “both” in “both ocats are” is

too general and does not offer much discriminatory power for distinguishing

between suggestions such as “coats”, “cats” and “acts”.

3. The remaining 0.5% are anomalies that basic ISSAC cannot address. There are two cases of anomalies: the equally likely nature of all possible suggestions, and the contrasting values of certain weights. As an example of the first case, consider the error “cheung” in the context of “Janice cheung has”. The left word is

correctly spelt and has adequately confined the suggestions to proper names.

In addition, the correct replacement “Cheung” is present in the suggestion list

S. Despite all these, both Aspell and ISSAC decided to replace “cheung” with

“Cheng”. A look into the evaluation log reveals that the surname “Cheung”

is as common as “Cheng”. In such cases, the probability of replacing e with


the correct replacement is 1/c, where c is the number of suggestions with approximately the same NS(sj,i). The second case of anomalies is due to the contrasting values of certain weights, especially NED and i^{-1}, which cause wrong replacements. For example, in the case “cannot chage an”, basic ISSAC replaced the error “chage” with “charge” instead of “change”. All the other weights for “change” are comparatively higher (i.e. DS and GS) than, or the same (i.e. RF, NED and AF) as, those for “charge”. Such an inclination indicates that “change” is the most appropriate replacement given the various cues. Nonetheless, the original rank by Aspell for “charge” is i = 1 while that for “change” is i = 6. As a smaller i indicates higher confidence, the inverse of the original Aspell rank, i^{-1}, drags down the combined weight for “change”.

In this chapter, we approach the enhancement of ISSAC from the perspective of the first and second causes. For this purpose, we proposed three modifications to the basic ISSAC:

1. We proposed the use of additional spell checking facilities as the answer to the first cause (i.e. compensating for the inadequacy of Aspell). Google spellcheck, which is based on statistical analysis of words on the World Wide Web (http://www.google.com/help/features.html), appears to be the ideal candidate for complementing Aspell. Using the Google SOAP API (http://www.google.com/apis), we can gain easy access to one of the many functions provided by Google, namely, Google spellcheck. Our new evaluations show that Google spellcheck works well for certain errors where Aspell fails to suggest the correct replacements. Similar to the expansions for abbreviations and the suggestions by Aspell, the suggestion provided by Google is added at the front of the list S with rank 1, placing it on the same rank as the first suggestion by Aspell (see the sketch after this list).

2. The basic ISSAC relies only on Aspell for determining if a word is an error.

For this purpose, we decided to include Google spellcheck as a complement.

If a word is detected as a possible error by either Aspell or Google spellcheck,

then we have adequate evidence to proceed and correct it using enhanced

ISSAC. In addition, errors that result in valid words are not recognised by

Aspell. For example, Aspell does not recognise “hat” as an error. If we were

to take into consideration the neighbours that it co-occurs with, namely, “suret

hat they”, then “hat” is certainly an error. Google contributes in this aspect.



In addition, the use of Google spellcheck has also indirectly provided ISSAC

with a partial solution to the second cause (i.e. erroneous neighbouring words).

Whenever Google is checking a word for spelling error, the neighbouring words

are simultaneously examined. For example, while providing a suggestion for

the error “tha”, Google simultaneously takes into consideration the neighbours, namely, “sure tha tthey”, and suggests that the word to its right, “tthey”, be replaced with “they”. Google spellcheck’s ability to consider contextual information stems from its large search engine index and the statistical evidence that

comes with it. Word collocations are ruled out as statistically improbable

when their co-occurrences are extremely low. In such cases, Google attempts

to suggest better collocates (i.e. neighbouring words).

3. We have altered the reuse factor RF by eliminating the use of the history list that gives more weight to suggestions that have been previously chosen to correct particular errors. We have come to realise that there is no guarantee that a particular replacement for an error is correct. When a replacement is incorrect

and is stored in the history list, the reuse factor will propagate the wrong

replacement to the subsequent corrections. Therefore, we adapted the reuse

factor to support the use of Google spellcheck in the form of entries in a local

spelling dictionary. There are two types of entries in the spelling dictionary. The main type is the suggestions by Google for spelling errors. These entries are automatically updated every time Google suggests a replacement

for an error. The second type, which is optional, is the suggestions for errors

provided by users. Hence the modified reuse factor will now assign the weight

of 1 to suggestions that are provided by Google spellcheck or predefined by

users.
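Taken together, the three modifications change how the suggestion list S is assembled before scoring. The following is a minimal Python sketch of that assembly step, mirroring steps 17 to 21 of Algorithm 1; build_suggestion_list, google_suggestion and the two dictionaries are hypothetical stand-ins, not the names used in the actual system.

```python
def build_suggestion_list(t, aspell_suggestions, google_suggestion,
                          abbrev_dict, spelling_dict):
    """Assemble the list S for an error t as (suggestion, rank) pairs.

    The Google suggestion, the spelling dictionary entry and any abbreviation
    expansions are placed at the front with rank 1, the n Aspell suggestions
    keep their original ranks 1..n, and the error itself is appended with
    rank n + 1.
    """
    n = len(aspell_suggestions)
    front = []
    if google_suggestion:                              # modification 1
        front.append((google_suggestion, 1))
    if t in spelling_dict:                             # modification 3
        front.append((spelling_dict[t], 1))
    for expansion in abbrev_dict.get(t.lower(), []):   # abbreviation expansions
        front.append((expansion, 1))
    ranked_aspell = [(s, i + 1) for i, s in enumerate(aspell_suggestions)]
    return front + ranked_aspell + [(t, n + 1)]

# Example usage with made-up suggestions for the error "prder".
S = build_suggestion_list("prder", ["parader", "purder", "prude"], "order",
                          abbrev_dict={}, spelling_dict={})
print(S)  # [('order', 1), ('parader', 1), ('purder', 2), ('prude', 3), ('prder', 4)]
```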

Despite a certain level of superiority that Google spellcheck exhibits in the three

enhancements, Aspell remains necessary. Google spellcheck is based on the occur-

rences of words on the World Wide Web. Determining whether a word is an error

or not depends very much on its popularity. Even if a word does not exist in the

English dictionary, Google will not judge it as an error as long as its popularity

exceeds some threshold set by Google. This popularity approach has both its pros

and cons. On the one hand, such an approach is suitable for recognising proper nouns, especially emerging ones, such as “iPod” and “Xbox”. On the other hand, a word such as “thanx” in the context of “[ok] [thanx] [for]” is not considered an error by Google even though it should be corrected.


The algorithm for text preprocessing, which comprises the basic ISSAC together with all its enhancements, is described in Algorithm 1.

3.5 Evaluation and Discussion

Evaluations are conducted using chat records provided by 247Customer.com, a provider of customer lifecycle management services. These chat records offer a rich source of domain information in a natural setting (i.e. conversations between customers and agents). Consequently, they are filled with spelling errors, ad-hoc abbreviations, improper casing and many other problems that are considered intolerable by existing language and speech applications. These chat records are therefore an ideal source for evaluating ISSAC.

Four sets of test data, each comprising an XML file of 100 chat sessions, were employed in the previous evaluations [273]. To evaluate the enhanced ISSAC, we have

included an additional three sets, which brings the total number of chat records to

700. The chat records and the Google search engine constitute the domain corpora

and the general collection, respectively. GNU Aspell version 0.60.4 [9] is employed

for detecting errors and generating suggestions. Similar to the previous evaluations,

determining the correctness of replacements by Aspell and enhanced ISSAC is a

delicate process that must be performed manually. For example, it is difficult to

automatically determine whether the error “itme” should be replaced with “time”

or “item” without more information (e.g. the neighbouring words). The evaluation of the errors and replacements is conducted in a unified manner. The errors are

not classified into spelling errors, ad-hoc abbreviations or improper casing. For ex-

ample, should the error “az” (“AZ” is the abbreviation for the state of “Arizona”)

in the context of “Glendale az <” be considered as an abbreviation or improper

casing? The boundaries between the different noises that occur in real-world texts,

especially those from online sources, are not clear.
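For readers wishing to reproduce this setup, a minimal sketch of querying GNU Aspell for suggestions through its Ispell-compatible pipe mode (aspell -a) is shown below. The parsing assumes the usual "& word count offset: sugg1, sugg2, ..." reply format; treat it as an illustration rather than the exact interface used in the evaluations.

```python
import subprocess

def aspell_suggest(word, lang="en"):
    """Query GNU Aspell in pipe mode and return its ranked suggestions for word.
    An empty list means Aspell either accepts the word or has no suggestions."""
    proc = subprocess.run(["aspell", "-a", "--lang=" + lang],
                          input=word + "\n", capture_output=True, text=True)
    for line in proc.stdout.splitlines():
        if line.startswith("&"):
            # Format: '& <original> <count> <offset>: sugg1, sugg2, ...'
            return [s.strip() for s in line.split(":", 1)[1].split(",")]
    return []

if __name__ == "__main__":
    print(aspell_suggest("prder"))
```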

After a careful evaluation of all replacements suggested by Aspell and by en-

hanced ISSAC for all 3,313 errors, we discovered a further improvement in accuracy

using the latter. As shown in Figure 3.4, the use of the first suggestions by Aspell

as replacements for spelling errors yields an average of 71%, which is a decrease

from 74.4% in the previous evaluations. With the addition of the various weights

which form basic ISSAC, an average increase of 22% was noted, resulting in an im-

proved accuracy of 96.5%. As predicted, the enhanced ISSAC scored a much better

accuracy at 98%. The increase of 1.5% over basic ISSAC is attributable to the suggestions from Google that complement the inadequacies of Aspell. A previous


Algorithm 1 Enhanced ISSAC
 1: input: chat records or other online documents
 2: Remove all HTML or XML tags from input documents
 3: Extract and keep URLs, emails, emoticons and tables
 4: Detect and identify sentence boundary
 5: for each document do
 6:   for each sentence in the document do
 7:     tokenise the sentence to produce a set of words T = {t1, ..., tw}
 8:     for each word t ∈ T do
 9:       Identify left l and right r word for t
10:       if t consists of all upper case then
11:         Turn all letters in t to lower case
12:       else if t consists of all digits then
13:         next
14:       Feed t to Aspell
15:       if t is identified as error by Aspell or Google spellcheck then
16:         initialise S, the set of suggestions for error t, and NS, an array of new scores for all suggestions for error t
17:         Add the n suggestions for word t produced by Aspell to S according to the original rank from 1 to n
18:         Perform a lookup in the abbreviation dictionary and add all the corresponding m expansions for t at the front of S, all with rank 1
19:         Perform a lookup in the spelling dictionary and add the retrieved suggestion at the front of S with rank 1
20:         Add the error word t itself at the end of S, with rank n + 1
21:         The final S is {s1,1, s2,1, ..., sm+1,1, sm+2,1, ..., sm+n+1,n, sm+n+2,n+1} where j and i in sj,i are the element index and the rank, respectively
22:         for each suggestion sj,i ∈ S do
23:           Determine i^{-1}, NED between error e and the jth suggestion, RF by looking into the spelling dictionary, AF by looking into the abbreviation dictionary, DS, and GS
24:           Sum the weights and push the sum into NS
25:         Correct word t with the suggestion that has the highest combined weights in array NS
26: output: documents with spelling errors corrected, abbreviations expanded and improper casing restored.


                                          Evaluation 1   Evaluation 2   Evaluation 3   Evaluation 4
  Correct replacements (enhanced ISSAC)        98.45%         97.91%         98.40%         98.23%
  Correct replacements (basic ISSAC)           97.06%         97.07%         95.92%         96.20%
  Correct replacements (Aspell)                74.61%         75.94%         71.81%         75.19%

(a) Evaluations 1 to 4.

                                          Evaluation 5   Evaluation 6   Evaluation 7   Average
  Correct replacements (enhanced ISSAC)        97.39%         97.85%         97.86%    98.01%
  Correct replacements (basic ISSAC)           95.64%         96.65%         97.14%    96.53%
  Correct replacements (Aspell)                63.62%         65.79%         70.24%    71.03%

(b) Evaluations 5 to 7.

Figure 3.4: Accuracy of enhanced ISSAC over seven evaluations.

Causes of incorrect replacements by enhanced ISSAC:
  Correct replacement not in suggestion list    0.80%
  Inadequate/erroneous neighbouring words       0.70%
  Anomalies                                     0.50%

Figure 3.5: The breakdown of the causes behind the incorrect replacements by enhanced ISSAC.

error “prder” within the context of “The prder number” that could not be corrected

by basic ISSAC due to the first cause was solved after our enhancements. The

correct replacement “order” was suggested by Google. Another error, “ffer” in the context of “youo ffer on”, that could not be corrected due to the second cause was successfully replaced by “offer” after Google simultaneously corrected the left word “youo” to “you”. The increase in accuracy by 1.5% is in line with the drop in the number of errors with wrong replacements due to (1) the absence of correct replacements in Aspell’s suggestions, and (2) the erroneous neighbouring words. There is a visible

drop in the number of errors with wrong replacements due to the first and the sec-

ond cause, from the existing 2% (in Figure 3.3) to 0.8% (in Figure 3.5), and 1% (in


Figure 3.3) to 0.7% (in Figure 3.5), respectively.

3.6 Conclusion

As an increasing number of ontology learning systems are opening up to the use

of online sources, the need to handle noisy text becomes inevitable. Regardless of

whether we acknowledge this fact, the quality of ontologies and the proper func-

tioning of the systems are, to a certain extent, dependent on the cleanliness of the

input texts. Most of the existing techniques for correcting spelling errors, expanding

abbreviations and restoring cases are studied separately. We, along with an increas-

ing number of researchers, have acknowledged the fact that many noises in text are

composite in nature (i.e. multi-error). As we have demonstrated throughout this

chapter, many errors are difficult to classify as either spelling errors, ad-hoc

abbreviations or improper casing.

In this chapter, we presented the enhancement of the ISSAC technique. The

basic ISSAC was built upon the well-known spell checker Aspell to simultaneously provide solutions to spelling errors, abbreviations and improper casing. This scoring

mechanism combines weights based on various information sources, namely, original

rank by Aspell, reuse factor, abbreviation factor, normalised edit distance, domain

significance and general significance. In the course of evaluating basic ISSAC, we

have uncovered and discussed in detail three causes behind the replacement errors.

We approached the enhancement of ISSAC from the first and the second cause,

namely, the absence of correct replacements from Aspell’s suggestions, and the in-

adequacy of the neighbouring words. We proposed three modifications to the basic

ISSAC, namely, (1) the use of Google spellcheck for compensating the inadequacy

of Aspell, (2) the incorporation of Google spellcheck for determining if a word is

erroneous, and (3) the alteration of the reuse factor RF by shifting from the use of

a history list to a spelling dictionary. Evaluations performed using the enhanced

ISSAC on seven sets of chat records revealed a further improved accuracy at 98%

from the previous 96.5% using basic ISSAC.

Even though the idea for ISSAC was first motivated and conceived within the

paradigm of ontology learning, we see great potential for further improvement and fine-tuning for a wide range of uses, especially in language and speech applications.

We hope that a unified technique such as ISSAC will pave the way for more research

into providing a complete solution for text preprocessing (i.e. text cleaning) in

general.


3.7 Acknowledgement

This research was supported by the Australian Endeavour International Post-

graduate Research Scholarship, and the Research Grant 2006 by the University of

Western Australia. The authors would like to thank 247Customer.com for providing

the evaluation data. Gratitude also goes to the developer of GNU Aspell, Kevin Atkinson.

3.8 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2006) Integrated Scoring for Spelling Er-

ror Correction, Abbreviation Expansion and Case Restoration in Dirty Text. In the

Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney,

Australia.

This paper contains the preliminary ideas on basic ISSAC, which were extended

and improved to contribute towards the conference paper on enhanced ISSAC that

forms Chapter 3.


CHAPTER 4

Text Processing

Abstract

In ontology learning, research on word collocational stability or unithood is typ-

ically performed as part of a larger effort for term recognition. Consequently, in-

dependent work dedicated to the improvement of unithood measurement is limited.

In addition, existing unithood measures were mostly empirically motivated and de-

rived. This chapter presents a dedicated probabilistic measure that gathers linguistic

evidence from parsed text and statistical evidence from Web search engines for deter-

mining unithood during noun phrase extraction. Our comparative study using 1,825

test cases against an existing empirically-derived function revealed an improvement

in terms of precision, recall and accuracy.

4.1 Introduction

Automatic term recognition is the process of extracting stable noun phrases from

text and filtering them for the purpose of identifying terms which characterise certain

domains of interest. This process involves the determination of unithood and ter-

mhood. Unithood, which is the focus of this chapter, refers to “the degree of strength

or stability of syntagmatic combinations or collocations” [120]. Measures for de-

termining unithood can be used to decide whether or not word sequences can form

collocationally stable and semantically meaningful compounds. Compounds are con-

sidered as unstable if they can be further broken down to create non-overlapping

units that refer to semantically distinct concepts. For example, the noun phrase

“Centers for Disease Control and Prevention” is a stable and meaningful unit while

“Centre for Clinical Interventions and Royal Perth Hospital” is an unstable com-

pound that refers to two separate entities. For this reason, unithood measures are

typically used in term recognition for finding stable and meaningful noun phrases,

which are considered as likelier terms. Recent reviews [275] showed that existing

research on unithood is mostly conducted as part of larger efforts for termhood mea-

surement. As a result, there is only a small number of existing measures dedicated

to determining unithood. In addition, existing measures are usually derived using

word frequency from static corpora, and are modified as needed. As such, the significance of the different weights that compose the measures typically assumes an empirical viewpoint [120].

(This chapter appeared in the Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India, 2008, with the title “Determining the Unithood of Word Sequences using a Probabilistic Approach”.)

The three objectives of this chapter are (1) to separate the measurement of

unithood from the determination of termhood, (2) to devise a probabilistic measure

which requires only one threshold for determining the unithood of word sequences

using dynamic Web data, and (3) to demonstrate the superior performance of the

new probabilistic measure against existing empirical measures. In regard to the first

objective, we derive our probabilistic measure free from any influence of termhood

determination. Following this, our unithood measure will be an independent tool

that is applicable not only to term recognition, but also to other tasks in information

extraction and text mining. Concerning the second objective, we devise our new

measure, known as the Odds of Unithood (OU), using Bayes Theorem and several

elementary probabilities. The probabilities are estimated using Google page counts

to eliminate problems related to the use of static corpora. Moreover, only one

threshold, namely, OUT is required to control the functioning of OU. Regarding the

third objective, we compare our new OU against an existing empirically-derived

measure called Unithood (UH)1 [275] in terms of their precision, recall and accuracy.

In Section 4.2, we provide a brief review of some existing techniques for measuring unithood. In Section 4.3, we present our new probabilistic measure and the

accompanying theoretical and intuitive justification. In Section 4.4, we summarise

some findings from our evaluations. Finally, we conclude this chapter with an out-

look to future work in Section 4.5.

4.2 Related Works

Some of the common measures of unithood include pointwise mutual informa-

tion (MI) [47] and log-likelihood ratio [64]. In mutual information, the co-occurrence

frequencies of the constituents of complex terms are utilised to measure their de-

pendency. The mutual information for two words a and b is defined as:

MI(a, b) = \log_2 \frac{p(a, b)}{p(a)\,p(b)} \qquad (4.1)

1 This foundation work on dedicated unithood measures appeared in the Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia, 2007, with the title “Determining the Unithood of Word Sequences using Mutual Information and Independence Measure”.


where p(a) and p(b) are the probabilities of occurrence of a and b. Many measures

that apply statistical techniques assume strict normal distribution and indepen-

dence between the word occurrences [81]. For handling extremely uncommon words

or small-sized corpora, the log-likelihood ratio delivers the best precision [136]. Log-

likelihood ratio attempts to quantify how much more likely one pair of words is to

occur compared to the others. Despite its potential, “How to apply this statistic

measure to quantify structural dependency of a word sequence remains an inter-

esting issue to explore.” [131]. Seretan et al. [224] examined the use of mutual

information, log-likelihood ratio and t-tests with search engine page counts for de-

termining the collocational strength of word pairs. However, no performance results

were presented.
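As an illustration of how such association measures can be driven by page counts rather than a local corpus, the sketch below computes the pointwise mutual information of Equation 4.1 from hypothetical page counts, with probabilities estimated against an assumed index size |N|.

```python
import math

def mi(n_a, n_b, n_ab, index_size):
    """Pointwise mutual information (Equation 4.1) from page counts.

    n_a, n_b   -- page counts for the individual words a and b
    n_ab       -- page count for a query containing both a and b
    index_size -- assumed number of documents |N| in the collection
    """
    p_a, p_b, p_ab = n_a / index_size, n_b / index_size, n_ab / index_size
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical counts: two words that co-occur far more often than chance.
print(mi(n_a=5_000_000, n_b=8_000_000, n_ab=900_000, index_size=8_000_000_000))
```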

Wong et al. [275] presented a hybrid measure inspired by mutual information in

Equation 4.1, and Cvalue in Equation 4.3. The authors employ search engine page

counts for the computation of statistical evidence to replace the use of frequencies

obtained from static corpora. The authors proposed a measure known as Unithood

(UH) for determining the mergeability of two lexical units ax and ay to produce

a stable sequence of words s. The word sequences are organised as a set W =

s, ax, ay where s = axbay is a term candidate, b can be any preposition, the

coordinating conjunction “and” or an empty string, and ax and ay can either be

noun phrases in the form ADJ∗N+ or another s (i.e. defining a new s in terms of

another s). The authors define UH as:

UH(a_x, a_y) =
\begin{cases}
1 & \text{if } (MI(a_x, a_y) > MI^+) \;\lor \\
  & \quad (MI^+ \geq MI(a_x, a_y) \geq MI^- \;\land\; ID(a_x, s) \geq ID_T \;\land\; ID(a_y, s) \geq ID_T \;\land\; IDR^+ \geq IDR(a_x, a_y) \geq IDR^-) \\
0 & \text{otherwise}
\end{cases}
\qquad (4.2)

where MI+, MI−, IDT , IDR+ and IDR− are thresholds for determining the merge-

ability decision, and MI(ax, ay) is the mutual information between ax and ay, while

ID(ax, s), ID(ay, s) and IDR(ax, ay) are measures of lexical independence of ax

and ay from s. For brevity, let z be either ax or ay, and the independence measure


ID(z, s) is then defined as:

ID(z, s) =
\begin{cases}
\log_{10}(n_z - n_s) & \text{if } n_z > n_s \\
0 & \text{otherwise}
\end{cases}

where nz and ns are the Google page counts for z and s, respectively. IDR(ax, ay) is computed as ID(ax, s)/ID(ay, s). Intuitively, UH(ax, ay) states that the two lexical units ax

and ay can only be merged in two cases, namely, (1) if ax and ay have extremely

high mutual information (i.e. higher than a certain threshold MI+), or (2) if ax

and ay achieve average mutual information (i.e. within the acceptable range of two

thresholds MI+ and MI−) due to their extremely high independence from s (i.e.

higher than the threshold IDT ).
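To make the interplay of these thresholds explicit, the following is a minimal Python sketch of the UH decision rule in Equation 4.2. The mutual information and independence values are assumed to have been computed beforehand from page counts, and the default thresholds are the ones suggested by Wong et al. [275] and used later in Section 4.4.

```python
def uh(mi_xy, id_x, id_y, idr_xy,
       mi_plus=0.9, mi_minus=0.02, id_t=6, idr_plus=1.35, idr_minus=0.93):
    """Mergeability decision of Equation 4.2.

    Returns 1 if the two lexical units ax and ay should be merged into s, else 0.
    """
    if mi_xy > mi_plus:
        return 1
    if (mi_plus >= mi_xy >= mi_minus and
            id_x >= id_t and id_y >= id_t and
            idr_plus >= idr_xy >= idr_minus):
        return 1
    return 0

# Example: moderate mutual information but both units highly independent of s.
print(uh(mi_xy=0.5, id_x=7.2, id_y=6.8, idr_xy=1.05))   # prints 1
```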

Frantzi [79] proposed a measure known as Cvalue for extracting complex terms.

The measure is based upon the claim that a substring of a term candidate is a

candidate itself given that it demonstrates adequate independence from the longer

version it appears in. For example, “E. coli food poisoning”, “E. coli” and “food

poisoning” are acceptable as valid complex term candidates. However, “E. coli food”

is not. Given a word sequence a to be examined for unithood, the Cvalue is defined

as:

Cvalue(a) =
\begin{cases}
\log_2 |a| \; f_a & \text{if } |a| = g \\
\log_2 |a| \left( f_a - \frac{\sum_{l \in L_a} f_l}{|L_a|} \right) & \text{otherwise}
\end{cases}
\qquad (4.3)

where |a| is the number of words in a, La is the set of longer term candidates that

contain a, g is the longest n-gram considered, fa is the frequency of occurrence

of a, and a /∈ La. While certain researchers [131] consider Cvalue as a termhood

measure, others [180] accept it as a measure of unithood. One can observe that

longer candidates tend to gain higher weights due to the inclusion of log2 |a| in

Equation 4.3. In addition, the weights computed using Equation 4.3 are purely

dependent on the frequency of a alone.
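A short sketch of Equation 4.3 may help clarify the role of the nested candidates in La. The frequencies below are hypothetical, and the extra guard for an empty La follows Frantzi's original treatment of non-nested candidates.

```python
import math

def cvalue(a, freq, longer_candidates, g):
    """Cvalue of Equation 4.3 for a candidate string a.

    freq              -- dictionary mapping candidate strings to frequencies
    longer_candidates -- the set L_a of longer candidates that contain a
    g                 -- length (in words) of the longest n-gram considered
    """
    length = len(a.split())
    if length == g or not longer_candidates:
        # non-nested (or maximal-length) candidates use the first case
        return math.log2(length) * freq[a]
    nested_avg = sum(freq[l] for l in longer_candidates) / len(longer_candidates)
    return math.log2(length) * (freq[a] - nested_avg)

freq = {"E. coli food poisoning": 12, "E. coli": 40, "food poisoning": 33}
print(cvalue("food poisoning", freq, {"E. coli food poisoning"}, g=4))
```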

4.3 A Probabilistic Measure for Unithood Determination

The determination of the unithood (i.e. collocational strength) of word sequences

using the new probabilistic measure is composed of two parts. Firstly, a list of

noun phrases is extracted using syntactic and dependency analysis. Secondly, the

collocational strength of word sequences is examined based on several pieces of probabilistic evidence.


Figure 4.1: The output by Stanford Parser. The tokens in the “modifiee” column

marked with squares are head nouns, and the corresponding tokens along the same

rows in the “word” column are the modifiers. The first column “offset” is subse-

quently represented using the variable i.

4.3.1 Noun Phrase Extraction

Most techniques for extracting noun phrases rely on regular expressions, and

part-of-speech and dependency information. Our extraction technique is imple-

mented as a head-driven noun phrase chunker [271] that feeds on the output of the Stanford Parser [132]. Figure 4.1 shows a sample output by the parser for the

sentence “They’re living longer with HIV in the brain, explains Kathy Kopnisky of

the NIH’s National Institute of Mental Health, which is spending about millions in-

vestigating neuroAIDS.”. Note that the words are lemmatised to obtain the root

form. The noun phrase chunker begins by identifying a list of head nouns from the

parser’s output. The head nouns are marked with squares in Figure 4.1. As the

name suggests, the chunker uses the head nouns as the starting point, and proceeds

to the left and later right in an attempt to identify maximal noun phrases using

the head-modifier information. For example, the head “Institute” is modified by


Figure 4.2: The output of the head-driven noun phrase chunker. The tokens which

are highlighted with a darker tone are the head nouns. The underlined tokens are

the corresponding modifiers identified by the chunker.

“NIH’s”, “National” and “of”. Since modifiers of the type prep and poss cannot

be straightforwardly chunked, the phrase “National Institute” was produced instead

as shown in Figure 4.2. Similarly, the phrase “Mental Health” was also identified

by the chunker. The fragments of noun phrases identified by the chunker which are

separated by the coordinating conjunction “and” or prepositions (e.g. “National In-

stitute”, “Mental Health”) are organised as pairs in the form of (ax, ay) and placed

in the set A. The i in ai is the word offset generated by the Stanford Parser (i.e. the

“offset” column in Figure 4.1). If ax and ay are located immediately next to each

other in the sentence, then x + 1 = y. If the pair is separated by a preposition or a

conjunction, then x + 2 = y.
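The pairing step can be sketched as follows, assuming the chunker output has been rendered as (start, end, phrase) spans over the sentence tokens; this layout and the small list of prepositions are illustrative assumptions, not the actual data structures of the chunker.

```python
PREPOSITIONS = {"of", "for", "in", "on", "with", "at", "by", "to"}

def pair_fragments(fragments, tokens):
    """Pair noun phrase fragments for unithood testing.

    fragments -- list of (start, end, phrase) spans, a hypothetical rendering
                 of the chunker output
    tokens    -- the sentence tokens, used to inspect the word between fragments
    Returns (ax, b, ay) triples where b is '', a preposition or 'and'.
    """
    pairs = []
    for (s1, e1, ax), (s2, e2, ay) in zip(fragments, fragments[1:]):
        if s2 == e1 + 1:                      # fragments are immediately adjacent
            pairs.append((ax, "", ay))
        elif s2 == e1 + 2:                    # separated by exactly one token
            b = tokens[e1 + 1].lower()
            if b == "and" or b in PREPOSITIONS:
                pairs.append((ax, b, ay))
    return pairs

tokens = ["National", "Institute", "of", "Mental", "Health"]
fragments = [(0, 1, "National Institute"), (3, 4, "Mental Health")]
print(pair_fragments(fragments, tokens))
# [('National Institute', 'of', 'Mental Health')]
```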

4.3.2 Determining the Unithood of Word Sequences

The next step is to examine the collocational strength of the pairs in A. Word

pairs in (ax, ay) ∈ A that have very high unithood or collocational strength are

combined to form stable noun phrases and hence, potential domain-relevant terms.

Each pair (ax, ay) that undergoes the examination for unithood is organised as W =

{s, ax, ay} where s is the hypothetically-stable noun phrase composed of s = ax b ay

and b can either be an empty string, a preposition, or the coordinating conjunction

“and”. Formally, the unithood of any two lexical units ax and ay can be defined as

Definition 4.3.2.1. The unithood of two lexical units is the “degree of strength or

stability of syntagmatic combinations and collocations” [120] between them.

It then becomes obvious that the problem of measuring the unithood of any word

sequences requires the determination of their “degree” of collocational strength as


mentioned in Definition 4.3.2.1. In practical terms, the “degree” mentioned above

provides us with a quantitative means to determine if the units ax and ay should be

combined to form s, or remain as separate units. The collocational strength of ax

and ay that exceeds a certain threshold demonstrates to us that s has the potential

of being a stable compound and hence, a better term candidate than ax and ay

separated. It is worth mentioning that the size (i.e. number of words) of ax and

ay is not limited to 1. For example, we can have ax=“National Institute”, b=“of”

and ay=“Allergy and Infectious Diseases”. In addition, the size of ax and ay has no

effect on the determination of their unithood using our measure.

As we have discussed in Section 4.2, most of the conventional measures employ

frequency of occurrence from local corpora, and statistical tests or information-

theoretic measures to determine the coupling strength of word pairs. The two main

problems associated with such measures are:

• Data sparseness is a problem that is well-documented by many researchers

[124]. The problem is inherent to the use of local corpora and it can lead to

poor estimation of parameters or weights; and

• The assumptions of independence and normality of word distribution are two

of the many problems in language modelling [81]. While the independence

assumption reduces text to simply a bag of words, the assumption of the

normal distribution of words will often lead to incorrect conclusions during

statistical tests.

As a general solution, we innovatively employ search engine page counts in a proba-

bilistic framework for measuring unithood. We begin by defining the sample space N as the set of all documents indexed by the Google search engine. We can estimate the index size of Google, |N |, using function words as predictors. Function words such as “a”, “is” and “with”, as opposed to content words, appear with frequencies that are relatively stable over different domains. Next, we perform random draws (i.e. trials) of documents from N . For each lexical unit w ∈ W , there is a corre-

sponding set of outcomes (i.e. events) from the draw. The three basic sets which

are of interest to us are:

Definition 4.3.2.2. Basic events corresponding to each w ∈ W :

• X is the event that ax occurs in the document

• Y is the event that ay occurs in the document


• S is the event that s occurs in the document

It should be obvious to the readers that since the documents in S also contain

the two units ax and ay, S is a subset of X∩Y or S ⊆ X∩Y . It is worth noting that

even though S ⊆ X ∩Y , it is highly unlikely that S = X ∩Y since the two portions

ax and ay may exist in the same document without being conjoined by b. Next,

subscribing to the frequency interpretation of probability, we obtain the probability

of the events in Definition 4.3.2.2 in terms of search engine page counts:

P(X) = \frac{n_x}{|N|}, \qquad P(Y) = \frac{n_y}{|N|}, \qquad P(S) = \frac{n_s}{|N|} \qquad (4.4)

where nx, ny and ns are the page counts returned by a search engine using the terms

[+“ax”], [+“ay”] and [+“s”], respectively. The pair of quotes that encapsulates the

search terms is the phrase operator, while the character “+” is the required operator

supported by the Google search engine. As discussed earlier, the independence

assumption required by certain information-theoretic measures may not always be

valid. In our case, P(X ∩ Y) ≠ P(X)P(Y) since the occurrences of ax and ay

in documents are inevitably governed by some hidden variables. Following this,

we define the probabilities for two new sets which result from applying some set

operations on the basic events in Definition 4.3.2.2:

P(X \cap Y) = \frac{n_{xy}}{|N|}, \qquad P(X \cap Y \setminus S) = P(X \cap Y) - P(S) \qquad (4.5)

where nxy is the page count returned by Google for the search using [+“ax” +“ay”].

Defining P (X ∩ Y ) in terms of observable page counts, rather than a combina-

tion of two independent events allows us to avoid any unnecessary assumption of

independence.
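A minimal sketch of these estimates is given below. The page counts passed in are assumed to come from queries using the phrase and "+" operators described above, and the index size |N| is an assumed constant rather than an estimate derived from function words.

```python
INDEX_SIZE = 8_000_000_000   # assumed estimate of |N|, the number of indexed documents

def probabilities(n_x, n_y, n_s, n_xy, index_size=INDEX_SIZE):
    """Estimate the probabilities of Equations 4.4 and 4.5 from page counts.

    n_x, n_y -- page counts for the queries [+"ax"] and [+"ay"]
    n_s      -- page count for the query [+"s"]
    n_xy     -- page count for the query [+"ax" +"ay"]
    """
    p_x = n_x / index_size
    p_y = n_y / index_size
    p_s = n_s / index_size
    p_xy = n_xy / index_size
    p_xy_not_s = p_xy - p_s          # P(X ∩ Y \ S) = P(X ∩ Y) − P(S)
    return p_x, p_y, p_s, p_xy, p_xy_not_s

# Hypothetical counts for ax = "National Institute", ay = "Mental Health"
# and s = "National Institute of Mental Health".
print(probabilities(n_x=2_000_000, n_y=5_000_000, n_s=800_000, n_xy=900_000))
```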

Next, referring back to our main problem discussed in Definition 4.3.2.1, we are

required to estimate the collocation strength of the two units ax and ay. Since

there is no standard metric for such measurement, we address the problem from a

probabilistic perspective. We introduce the probability that s is a stable compound

given the evidence s possesses:

Definition 4.3.2.3. The probability of unithood:

P(U|E) = \frac{P(E|U)\,P(U)}{P(E)}


where U is the event that s is a stable compound and E is the evidence belonging

to s. P (U |E) is the posterior probability that s is a stable compound given the

evidence E. P (U) is the prior probability that s is a unit without any evidence, and

P (E) is the prior probability of evidence held by s. As we shall see later, these two

prior probabilities are immaterial in the final computation of unithood. Since s can

either be a stable compound or not, we can state that,

P(\bar{U}|E) = 1 - P(U|E) \qquad (4.6)

where \bar{U} is the event that s is not a stable compound. Since Odds = P/(1 - P), we multiply both sides of Definition 4.3.2.3 by (1 - P(U|E))^{-1} to obtain,

\frac{P(U|E)}{1 - P(U|E)} = \frac{P(E|U)\,P(U)}{P(E)\,(1 - P(U|E))} \qquad (4.7)

By substituting Equation 4.6 into Equation 4.7 and then applying the multiplication rule P(\bar{U}|E)\,P(E) = P(E|\bar{U})\,P(\bar{U}), we obtain:

\frac{P(U|E)}{P(\bar{U}|E)} = \frac{P(E|U)\,P(U)}{P(E|\bar{U})\,P(\bar{U})} \qquad (4.8)

We proceed to take the log of the odds in Equation 4.8 (i.e. logit) to get:

\log \frac{P(E|U)}{P(E|\bar{U})} = \log \frac{P(U|E)}{P(\bar{U}|E)} - \log \frac{P(U)}{P(\bar{U})} \qquad (4.9)

While it is obvious that certain words tend to co-occur more frequently than others

(i.e. idioms and collocations), such phenomena are largely arbitrary [231]. This

makes the task of deciding on what constitutes an acceptable collocation difficult.

The only way to objectively identify stable and meaningful compounds is through

observations in samples of the language (e.g. text corpus) [174]. In other words,

assigning the apriori probability of collocational strength without empirical evidence

is both subjective and difficult. As such, we are left with the option to assume that

the probability of s being a stable unit and not being a stable compound without

evidence is the same (i.e. P(U) = P(\bar{U}) = 0.5). As a result, the second term in

Equation 4.9 evaluates to 0:

\log \frac{P(U|E)}{P(\bar{U}|E)} = \log \frac{P(E|U)}{P(E|\bar{U})} \qquad (4.10)

We introduce a new measure for determining the odds of s being a stable compound

known as the Odds of Unithood (OU):


(a) The area with a darker shade is the set X ∩ Y \ S. Computing the ratio of P(S) and the probability of this area gives us the first evidence.
(b) The area with a darker shade is the set S′. Computing the ratio of P(S) and the probability of this area (i.e. P(S′) = 1 − P(S)) gives us the second evidence.

Figure 4.3: The probabilities of the areas with the darker shade are the denominators required by the evidences e1 and e2 for the estimation of OU(s).

Definition 4.3.2.4. Odds of unithood

OU(s) = \log \frac{P(E|U)}{P(E|\bar{U})}

Assuming that the evidence in E are independent of one another, we can evaluate

OU(s) in terms of:

OU(s) = \log \frac{\prod_i P(e_i|U)}{\prod_i P(e_i|\bar{U})} = \sum_i \log \frac{P(e_i|U)}{P(e_i|\bar{U})} \qquad (4.11)

where ei is the individual evidence of s.

With the introduction of Definition 4.3.2.4, we can examine the degree of collo-

cation strength of ax and ay in forming a stable and meaningful s in terms of OU(s).

With the base of the log in Definition 4.3.2.4 more than 1, the upper and lower bound

of OU(s) would be +∞ and −∞, respectively. OU(s) = +∞ and OU(s) = −∞

correspond to the highest and the lowest degrees of stability of the two units ax and

ay appearing as s, respectively. A high OU(s) indicates the suitability of the two

units ax and ay to be merged to form s. Ultimately, we have reduced the vague

problem of unithood determination introduced in Definition 4.3.2.1 into a practical

and computable solution in Definition 4.3.2.4. The evidence that we employ for de-

termining unithood is based on the occurrence of s (the event S if the readers recall

from Definition 4.3.2.2). We are interested in two types of occurrence of s, namely,

(1) the occurrence of s given that ax and ay have already occurred or X ∩ Y , and


(2) the occurrence of s as it is in our sample space, N . We refer to the first evidence

e1 as local occurrence, while the second one e2 as global occurrence. We will discuss

the justification behind each type of occurrence in the following paragraphs. Each

evidence ei captures the occurrence of s within different confinements. We estimate

the evidence using the elementary probabilities already defined in Equations 4.4 and

4.5.

The first evidence e1 captures the probability of occurrence of s within the con-

finement of ax and ay or X∩Y . As such, P (e1|U) can be interpreted as the probabil-

ity of s occurring within X ∩Y as a stable compound or P (S|X ∩Y ). On the other

hand, P(e1|Ū) captures the probability of s occurring in X ∩ Y not as a unit. In other words, P(e1|Ū) is the probability of s not occurring in X ∩ Y, or equivalently,

equal to P ((X ∩ Y \ S)|(X ∩ Y )). The set X ∩ Y \ S is shown as the area with a

darker shade in Figure 4.3(a). Let us define the odds based on the first evidence as:

O_L = \frac{P(e_1|U)}{P(e_1|\bar{U})} \qquad (4.12)

Substituting P(e1|U) = P(S|X ∩ Y) and P(e1|Ū) = P((X ∩ Y \ S)|(X ∩ Y)) into

Equation 4.12 gives us:

O_L = \frac{P(S|X \cap Y)}{P((X \cap Y \setminus S)|(X \cap Y))} = \frac{P(S \cap (X \cap Y))}{P(X \cap Y)} \cdot \frac{P(X \cap Y)}{P((X \cap Y \setminus S) \cap (X \cap Y))} = \frac{P(S \cap (X \cap Y))}{P((X \cap Y \setminus S) \cap (X \cap Y))}

and since S ⊆ (X ∩ Y ) and (X ∩ Y \ S) ⊆ (X ∩ Y ),

O_L = \frac{P(S)}{P(X \cap Y \setminus S)} \quad \text{if } P(X \cap Y \setminus S) \neq 0

and OL = 1 if P(X ∩ Y \ S) = 0.

The second evidence e2 captures the probability of occurrence of s without con-

finement. If s is a stable compound, then its probability of occurrence in the sample

space would simply be P (S). On the other hand, if s occurs not as a unit, then its

probability of non-occurrence is 1 − P (S). The complement of S, which is the set

S ′ is shown as the area with a darker shade in Figure 4.3(b). Let us define the odds

based on the second evidence as:

O_G = \frac{P(e_2|U)}{P(e_2|\bar{U})} \qquad (4.13)


Substituting P(e2|U) = P(S) and P(e2|Ū) = 1 − P(S) into Equation 4.13 gives us:

O_G = \frac{P(S)}{1 - P(S)}

Intuitively, the first evidence attempts to capture the extent to which the exis-

tence of the two lexical units ax and ay is attributable to s. Referring back to OL,

whenever the denominator P (X ∩ Y \ S) becomes less than P (S), we can deduce

that ax and ay actually exist together as s more than in other forms. At one extreme

when P (X ∩ Y \ S) = 0, we can conclude that the co-occurrence of ax and ay is

exclusively for s. As such, we can also refer to OL as a measure of exclusivity for

the use of ax and ay with respect to s. This first evidence is a good indication for

the unithood of s since the more the existence of ax and ay is attributed to s, the

stronger the collocational strength of s becomes. Concerning the second evidence,

OG attempts to capture the extent to which s occurs in general usage (i.e. World

Wide Web). We can consider OG as a measure of pervasiveness for the use of s. As

s becomes more widely used in text, the numerator in OG increases. This provides a

good indication on the unithood of s since the more s appears in usage, the likelier

it becomes that s is a stable compound instead of an occurrence by chance. As a

result, the derivation of OU in terms of OL and OG offers a comprehensive way of

determining unithood.

Finally, expanding OU(s) in Equation 4.11 using Equations 4.12 and 4.13 gives

us:

OU(s) = \log O_L + \log O_G = \log \frac{P(S)}{P(X \cap Y \setminus S)} + \log \frac{P(S)}{1 - P(S)} \qquad (4.14)

As such, the decision on whether ax and ay should be merged to form s is made

based solely on OU defined in Equation 4.14. We merge ax and ay if their odds of

unithood exceeds a certain threshold, OUT .
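Putting Equations 4.4, 4.5 and 4.14 together, the decision procedure can be sketched as follows. The page counts are hypothetical, natural logarithms are assumed, and the threshold value of -8.39 is the one reported in the evaluation in Section 4.4.

```python
import math

def odds_of_unithood(n_s, n_xy, index_size=8_000_000_000):
    """OU(s) = log O_L + log O_G (Equation 4.14), estimated from page counts.

    n_s  -- page count for the full candidate s
    n_xy -- page count for the co-occurrence of its two parts ax and ay
    """
    p_s = n_s / index_size
    p_xy_not_s = (n_xy - n_s) / index_size               # P(X ∩ Y) − P(S)
    o_l = p_s / p_xy_not_s if p_xy_not_s > 0 else 1.0    # exclusivity evidence
    o_g = p_s / (1.0 - p_s)                              # pervasiveness evidence
    return math.log(o_l) + math.log(o_g)

def merge(n_s, n_xy, threshold=-8.39):
    """Merge ax and ay into s when OU(s) exceeds the threshold OU_T."""
    return odds_of_unithood(n_s, n_xy) > threshold

print(merge(n_s=800_000, n_xy=900_000))   # True for these hypothetical counts
```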

4.4 Evaluations and Discussions

For this evaluation, we employed 500 news articles from Reuters in the health

domain gathered between December 2006 and May 2007. These 500 articles are fed

into the Stanford Parser whose output is then used by our head-driven noun phrase

chunker [271, 275] to extract word sequences in the form of nouns and noun phrases.

Pairs of word sequences (i.e. ax and ay) located immediately next to each other,

or separated by a preposition or the conjunction “and” in the same sentence are


measured for their unithood. Using the 500 news articles, we managed to obtain

1,825 pairs of words to be tested for unithood.

We performed a comparative study of our new probabilistic measure against the

empirical measure described in Equation 4.2. Two experiments were conducted.

In the first one, the decisions on whether or not to merge the 1,825 pairs were

performed automatically using our probabilistic measure OU. These decisions are

known as the actual results. At the same time, we inspected the same list of word

pairs to manually decide on their unithood. These decisions are known as the ideal

results. The threshold OUT employed for our evaluation is determined empirically

through experiments and is set to −8.39. However, since only one threshold is

involved in deciding mergeability, training algorithms and datasets may be employed

to automatically decide on an optimal number. This option is beyond the scope of

this chapter. The actual and ideal results for this first experiment are organised into

a contingency table (not shown here) for identifying the true and the false positives,

and the true and the false negatives. In the second experiment, we conducted the

same assessment as carried out in the first one but the decisions to merge the 1,825

pairs are based on the UH measure described in Equation 4.2. The thresholds

required for this measure are based on the values suggested by Wong et al. [275],

namely, MI+ = 0.9, MI− = 0.02, IDT = 6, IDR+ = 1.35, and IDR− = 0.93.
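For completeness, the sketch below shows how precision, recall and accuracy are derived from such a contingency table; the counts are placeholders, not the actual figures behind Figure 4.4.

```python
def evaluate(tp, fp, tn, fn):
    """Precision, recall and accuracy from a 2x2 contingency table of
    mergeability decisions (actual vs. ideal)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy

# Placeholder counts over 1,825 test cases (not the actual evaluation figures).
print(evaluate(tp=900, fp=0, tn=800, fn=125))
```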

Figure 4.4: The performance of OU (from Experiment 1) and UH (from Experiment

2) in terms of precision, recall and accuracy. The last column shows the difference

between the performance of Experiment 1 and 2.

Using the results from the contingency tables, we computed the precision, recall

and accuracy for the two measures under evaluation. Figure 4.4 summarises the

performance of OU and UH in determining the unithood of 1,825 pairs of lexical

units. One will notice that our new measure OU outperformed the empirical measure

UH in all aspects, with an improvement of 2.63%, 3.33% and 2.74% for precision,

recall and accuracy, respectively. Our new measure achieved a 100% precision with

a lower recall at 95.83%. However, more evaluations using larger datasets and


statistical tests for significance are required to further validate the performance of

the probabilistic measure OU.

As with any measures that employ thresholds as a cut-off point for accepting

or rejecting certain decisions, we can improve the recall of OU by decreasing the

threshold OUT. In this way, there will be fewer false negatives (i.e. pairs which are supposed to be merged but are not) and hence the recall rate increases. Unfor-

tunately, recall will improve at the expense of precision since the number of false

positives will definitely increase from the existing 0. Since our application (i.e.

ontology learning) requires perfect precision in determining the unithood of noun

phrases, OU is the ideal candidate. Moreover, with only one threshold (i.e. OUT )

required in controlling the performance of OU, we are able to reduce the amount of

time and effort spent on optimising our results.

4.5 Conclusion and Future Work

In this chapter, we highlighted the significance of unithood, and that its mea-

surement should be given equal attention by researchers in ontology learning. We

focused on the development of a dedicated probabilistic measure for determining

the unithood of word sequences. We refer to this measure as the Odds of Unit-

hood (OU). OU is derived using Bayes Theorem and is founded upon two evidences, namely, local occurrence and global occurrence. Elementary probabilities estimated using page counts from Web search engines are utilised to quantify the two evidences.

The new probabilistic measure OU is then evaluated against an existing empirical

measure known as Unithood (UH). Our new measure OU achieved a precision and

a recall of 100% and 95.83%, respectively, with an accuracy of 97.26% in measuring the unithood of 1,825 test cases. OU outperformed UH by 2.63%, 3.33% and 2.74%

in terms of precision, recall and accuracy, respectively. Moreover, our new measure

requires only one threshold, as compared to five in UH to control the mergeability

decision.

More work is required to establish the coverage and the depth of the World Wide

Web with regard to the determination of unithood. While the Web has demonstrated

reasonable strength in handling general news articles, we have yet to study its ap-

propriateness in dealing with unithood determination for technical text (i.e. the

depth of the Web). Similarly, the extent to which the Web is able to satisfy the requirements of unithood determination for a wider range of domains (i.e. the coverage of the Web) remains an open question. Studying the effect of noise (e.g. keyword spamming) and multiple word senses on unithood determination using the Web is


another future research direction.

4.6 Acknowledgement

This research was supported by the Australian Endeavour International Post-

graduate Research Scholarship, and the Research Grant 2006 by the University of

Western Australia.

4.7 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2007) Determining the Unithood of Word

Sequences using Mutual Information and Independence Measure. In the Proceed-

ings of the 10th Conference of the Pacific Association for Computational Linguistics

(PACLING), Melbourne, Australia.

This paper presents the work on the adaptation of existing word association measures

to form the UH measure. The ideas on UH were later reformulated to give rise to

the probabilistic measure OU. The description of OU forms the core contents of this

Chapter 4.

Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and

Termhood for Term Recognition. M. Song and Y. Wu (eds.), Handbook of Research

on Text and Web Mining Technologies, IGI Global.

This book chapter combines the ideas on the UH measure, from Chapter 4, and

the TH measure, from Chapter 5.


CHAPTER 5

Term Recognition

Abstract

Term recognition identifies domain-relevant terms which are essential for discov-

ering domain concepts and for the construction of terminologies required by a wide

range of natural language applications. Many techniques have been developed in

an attempt to numerically determine or quantify termhood based on term charac-

teristics. Some of the apparent shortcomings of existing techniques are the ad-hoc

combination of termhood evidence, mathematically-unfounded derivation of scores

and implicit assumptions concerning term characteristics. We propose a probabilis-

tic framework for formalising and combining qualitative evidence based on explicitly

defined term characteristics to produce a new termhood measure. Our qualitative

and quantitative evaluations demonstrate consistently better precision, recall and

accuracy compared to three other existing ad-hoc measures.

5.1 Introduction

Technical terms, more commonly referred to as terms, are content-bearing lexi-

cal units which describe the various aspects of a particular domain. There are two

types of terms, namely, simple terms (i.e. single-word terms) and complex terms

(multi-word terms). In general, the task of identifying domain-relevant terms is re-

ferred to as automatic term recognition, term extraction or terminology mining. The

broader scope of term recognition can also be viewed in terms of the computational

problem of measuring termhood, which is the extent of a term’s relevance to a par-

ticular domain [120]. Terms are particularly important for labelling or designating

domain-specific concepts, and for contributing to the construction of terminologies,

which are essentially enumerations of technical terms in a domain. Manual efforts

in term recognition are no longer viable as more new terms come into use and new

meanings may be added to existing terms as a result of information explosion. Cou-

pled with the significance of terminologies to a wide range of applications such as

ontology learning, machine translation and thesaurus construction, automatic term

recognition is the next logical solution.

(This chapter appeared in Intelligent Data Analysis, Volume 13, Issue 4, Pages 499-539, 2009, with the title “A Probabilistic Framework for Automatic Term Recognition”.)


Very often, term recognition is considered as similar or equivalent to named-

entity recognition, information retrieval and term relatedness measurement. An

obvious dissimilarity between named-entity recognition and term recognition is that

the former is a deterministic problem of classification whereas the latter involves

the subjective measurement of relevance and ranking. Hence, unlike the evaluation

of named-entity recognition where various platforms such as the BioCreAtIvE Task

[106] and the Message Understanding Conference (MUC) [42] are readily available,

determining the performance of term recognition remains an extremely subjective

problem. Having closer resemblance to information retrieval in that both involve

relevance ranking, term recognition does have its unique requirements [120]. Un-

like information retrieval where information relevance can be evaluated based on

user information needs, term recognition does not have user queries as evidence for

deciding on the domain relevance of terms. In general, term recognition can be per-

formed with or without initial seedterms as evidence. The seedterms enable term

recognition to be conducted in a controlled environment and offer more predictable

outcomes. Term recognition using seedterms, also referred to as guided term recognition, is in some aspects similar to measuring term relatedness. The relevance of

terms to a domain in guided term recognition is determined in terms of their seman-

tic relatedness with the domain seedterms. Therefore, existing semantic similarity

or relatedness measures based on lexical information (e.g. WordNet [206], Wikipedia

[276]), corpus statistics (e.g. Web corpus [50]), or the combination of both [114] are

available for use. Without using seedterms, term recognition relies solely on term

characteristics as evidence. This term recognition approach is far more difficult and

faces numerous challenges. The focus of this chapter is on term recognition without

seedterms.

In this chapter, we develop a formal framework for quantifying evidence based

on qualitative term characteristics for the purpose of measuring termhood, and ulti-

mately, term recognition. Several techniques have been developed in an attempt to

numerically determine or quantify termhood based on a list of term characteristics.

The shortcomings of existing techniques can be examined from three perspectives.

Firstly, word or document frequency in text corpus has always been the main source

of evidence due to its accessibility and computability. Despite a general agreement

[37] that frequency is a good criterion for discriminating terms from non-terms, frequency alone is insufficient. Many researchers began realising this issue and more diverse evidence [131, 109] was incorporated, especially linguistic evidence such as syntactic and semantic information. Unfortunately, as the types of evidence become


increasingly diversified (e.g. numerical and nominal), the consolidation of evidence

by existing techniques becomes more ad-hoc. This issue is very obvious when one

examines simple but crucial questions as to why certain measures take different bases

for logarithm, or why two weights were combined using addition instead of multipli-

cation. In the words of Kageura & Umino [120], most of the existing techniques “take

an empirical or pragmatic standpoint regarding the meaning of weight”. Secondly,

the underlying assumptions made by many techniques regarding term characteris-

tics for deriving evidence were mostly implicit. This makes the task of characteristic

attribution and tracing inaccuracies in termhood measurement difficult. Thirdly,

many techniques for determining termhood failed to provide ways for selecting the

final terms from a long list of term candidates. According to Cabre-Castellvi et

al. [37], “all systems propose large lists of candidate terms, which at the end of the

process have to be manually accepted or rejected”. In short, the derivation of a for-

mal termhood measure based on term characteristics for term recognition requires

solutions to the following issues:

• the development of a general framework to consolidate all evidence represent-

ing the various term characteristics;

• the determination of the types of evidence to be included to ensure that the

resulting score will closely reflect the actual state of termhood implied by a

term’s characteristics;

• the explicit definition of term characteristics and their attribution to linguistic

theories (if any) or other justifications; and

• the automatic determination of optimal thresholds to identify terms from the

final lists of ranked term candidates.

The main objective of this chapter is to address the development of a new prob-

abilistic framework for incorporating qualitative evidence for measuring termhood

which provides solutions to all four issues outlined above. This new framework

is based on the general Bayes Theorem, and the word distributions required for

computing termhood evidence are founded upon the Zipf-Mandelbrot model. The

secondary objective of this chapter is to demonstrate the performance of this new

term recognition technique in comparison with existing techniques using widely-

accessible benchmarks. In Section 5.2, we summarise the notations and datasets

employed throughout this chapter for the formulation of equations, experiments


and evaluations. Sections 5.3.1 and 5.3.2 summarise several prominent probabilis-

tic models and ad-hoc techniques related to term recognition. In Section 5.3.3, we

discuss several commonly employed word distribution models that are crucial for

formalising statistical and linguistic evidence. We outline in detail our proposed

technique for term recognition in Section 5.4. We evaluate our new technique, both

qualitatively and quantitatively, in Section 5.5 and compare its performance with

several other existing techniques. In particular, Section 5.5.2 includes the detailed

description of an automatic way of identifying actual terms from the list of ranked

term candidates. We conclude this chapter in Section 5.6 with an outlook to future

work.

5.2 Notations and Datasets

In this section, we discuss briefly the types of termhood evidence. This section

can also be used as a reference for readers who require clarification about the nota-

tions used at any point in this chapter. In addition, we summarise the origin and

composition of the datasets employed in various parts of this chapter for experiments

and evaluations. There is a wide array of evidence employed for term recognition

ranging from statistical to linguistic. Word and document frequency are used ex-

tensively to measure the significance of a lexical unit. Depending on how frequency

is employed, one can classify the term recognition techniques as either ad-hoc or

probabilistic. Linguistic evidence, on the other hand, typically includes syntactical

and semantic information. Syntactical evidence relies upon information about how

distinct lexical units assume the role of heads and modifiers to form complex terms.

For semantic evidence, some predefined knowledge is often employed to relate one

lexical unit with others to form networks of related terms.

Frequency refers to the number of occurrences of a certain event or entity. There

are mainly two types of frequency related to the area of term recognition. The first

is document frequency. Document frequency refers to the number of documents in a

corpus that contain some word of interest. There are many different notations but throughout this chapter, we will adopt the notation $N$ as the number of documents in a corpus and $n_a$ as the number of documents in the corpus which contain word $a$. In cases where more than one corpus is involved, $n_{ax}$ is used to denote the number of documents in corpus $x$ containing word $a$. The second type of frequency is term

frequency. Term frequency is the number of occurrences of certain words in a corpus.

In other words, term frequency is independent of the documents in the corpus. We

will employ the notation fa as the number of occurrences of word a in a corpus and


F as the sum of the number of occurrences of all words in a corpus. In the case

where different units of text are involved such as paragraphs, sentences, documents

or even corpora, $f_{ax}$ represents the frequency of candidate $a$ in unit $x$. Given that $W$ is the set of all distinct words in a corpus, then $F = \sum_{\forall a \in W} f_a$. With regard

to term recognition, we will use the notation TC to represent the set of all term

candidates extracted from some corpus for processing, and |TC| is the number of

term candidates in TC. The notation a where a ∈ TC is used to represent a term

candidate and it can either be simple or complex. For complex terms, term candidate

a is made up of constituents where ah is the head and Ma is the set of modifiers. A

term candidate can also be surrounded by a set of context words Ca. The notion

of context words may differ across different techniques. Certain techniques consider

all words surrounding term a located within a fixed-size window as context words

of a, while others may employ grammatical relations to extract context words. The

actual composition of Ca is not of concern at this point. Following this, Ca ∩ TC

is simply the context words of a which are also term candidates themselves (i.e.

context terms).
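Concretely, these frequency notations map onto simple corpus statistics. The sketch below is purely illustrative (a hypothetical toy corpus, not part of the thesis pipeline); it computes $N$, $F$, $f_a$ and $n_a$ for a tokenised corpus:

```python
from collections import Counter

# A toy tokenised corpus: a list of documents, each a list of word tokens.
# Both the corpus and the variable names are illustrative, not thesis data.
corpus = [
    ["protein", "kinase", "activates", "the", "cell"],
    ["the", "cell", "expresses", "the", "protein"],
]

N = len(corpus)                                                # number of documents
term_freq = Counter(tok for doc in corpus for tok in doc)      # f_a for every word a
F = sum(term_freq.values())                                    # total occurrences of all words
doc_freq = Counter(tok for doc in corpus for tok in set(doc))  # n_a for every word a

print(N, F, term_freq["protein"], doc_freq["protein"])         # -> 2 10 2 2
```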

Figure 5.1: Summary of the datasets employed throughout this chapter for experiments and evaluations.

Throughout this chapter, we employ a standard set of corpora for experiment-

ing with the various aspects of term recognition, and also for evaluation purposes.

The corpora that we employ are divided into two groups. The first group, known

as domain corpus, consists of a collection of abstracts in the domain of molecular

biology that is made available through the GENIA corpus [130]. Currently, version

3.0 of the corpus consists of 2,000 abstracts with a total word count of 402,483.


The GENIA corpus is an ideal resource for evaluating term recognition techniques

since the text in the corpus is marked-up with both part-of-speech tags and seman-

tic categories. Biologically-relevant terms in the corpus were manually identified by

two domain experts [130]. Hence, a gold standard (i.e. a list of terms relevant to

the domain), represented as the set G, for the molecular biology domain can be con-

structed by extracting the terms which have semantic descriptors enclosed by cons

tags. For reproducibility of our experiments, the corpus can be downloaded from

http://www-tsujii.is.s.u-tokyo.ac.jp/∼genia/topics/Corpus/. The second

collection of text is called the contrastive corpus and is made up of twelve differ-

ent text collections gathered from various online sources. As the name implies, the

second group of text serves to contrast and discriminate the content of the domain

corpus. The writing style of the contrastive corpus is different from the domain

corpus because the former tends to be prepared using journalistic writing (i.e. writ-

ten in general language with minimal usage of technical terms), targeting general

readers. The contrastive texts were automatically gathered from news providers such as Reuters between February 2006 and July 2007. The summary of the

domain corpus and contrastive corpus is presented in Figure 5.1. Note that for

simplicity, hereafter, $d$ is used to represent the domain corpus and $\bar{d}$ the contrastive corpus.
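As a rough illustration of how the gold standard $G$ described above can be assembled, the following sketch collects annotated term strings from a GENIA-style XML file. It assumes the markup mentioned in the text (terms wrapped in cons elements carrying semantic descriptors); the file name is a placeholder, and this is not the extraction code actually used for the experiments.

```python
import xml.etree.ElementTree as ET

def extract_gold_terms(genia_xml_path):
    """Collect the set G of annotated terms from a GENIA-style XML file.

    Assumes biologically-relevant terms are wrapped in <cons> elements, as
    described above. The path passed in below is a placeholder.
    """
    tree = ET.parse(genia_xml_path)
    gold = set()
    for cons in tree.iter("cons"):
        term = "".join(cons.itertext()).strip().lower()
        if term:
            gold.add(term)
    return gold

# gold = extract_gold_terms("GENIAcorpus3.0.xml")   # hypothetical file name
```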

5.3 Related Works

There are mainly two schools of techniques in term recognition. The first at-

tempts to begin the empirical study of termhood from a theoretically-founded per-

spective, while the second is based upon the belief that a method should be judged

for its quality of being of practical use. These two groups are by no means exclusive

but they form a good platform for comparison. In the first group, probability and

statistics are the main guidance for designing new techniques. Probability theory

acts as the mathematical foundation for modelling the various components in the

corpus, and drawing inferences about different aspects such as relevance and repre-

sentativeness of documents or domains using descriptive and inferential statistics.

In the second group, ad-hoc techniques are characterised by the pragmatic use of

evidence to measure termhood. Ad-hoc techniques are usually put together and

modified as needed as immediate results are observed. Obviously,

such techniques are at most inspired by, but not derived from formal mathematical

models [120]. Many critics claim that such techniques are unfounded and the results

that are reported using these techniques are merely coincidental.


The details of some existing research work on the two groups of techniques rele-

vant to term recognition are presented in the next two subsections.

5.3.1 Existing Probabilistic Models for Term Recognition

There is currently no formal framework dedicated to the determination of ter-

mhood which combines both statistical and qualitative linguistic evidence. Formal

probabilistic models related to dealing with terms in general are mainly studied

within the realm of document retrieval and automatic indexing. In probabilistic in-

dexing, one of the first few detailed quantitative models was proposed by Bookstein

& Swanson [29]. In this model, the differences in the distributional behaviour of

words are employed as a guide to determine if a word should be considered as an

index term. This model is founded upon the research on how function words can

be closely modeled by a Poisson distribution whereas content words deviate from

it [256]. We will elaborate on Poisson and other related models in Section 5.3.3.

An even larger collection of literature on probabilistic models can be found in the

related area of document retrieval. The simplest of all the retrieval models is the

Binary Independence Model [82, 147]. As with all other retrieval models, the Binary

Independence Model is designed to estimate the probability that a document j is

considered as relevant given a specific query k, which is essentially a bag of words.

Let $T = \{t_1, \ldots, t_n\}$ be the set of terms in the collection of documents (i.e. corpus). We can then represent the set of terms $T_j$ occurring in document $j$ as a binary vector $\mathbf{v}_j = \{x_1, \ldots, x_n\}$ where $x_i = 1$ if $t_i \in T_j$ and $x_i = 0$ otherwise. This way, the odds of document $j$, represented by a binary vector $\mathbf{v}_j$, being relevant $R$ to query $k$ can be computed as [83]

$$O(R|k, \mathbf{v}_j) = \frac{P(R|k, \mathbf{v}_j)}{P(\bar{R}|k, \mathbf{v}_j)} = \frac{P(R|k)}{P(\bar{R}|k)} \, \frac{P(\mathbf{v}_j|R, k)}{P(\mathbf{v}_j|\bar{R}, k)}$$

and based on the assumption of independence between the presence and absence of terms,

$$\frac{P(\mathbf{v}_j|R, k)}{P(\mathbf{v}_j|\bar{R}, k)} = \prod_{i=1}^{n} \frac{P(x_i|R, k)}{P(x_i|\bar{R}, k)}$$

Other more advanced models that take into consideration other factors such as

term frequency, document frequency and document length have also been proposed

by researchers such as Sparck Jones et al. [117, 118].
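For illustration, the Binary Independence Model likelihood ratio above reduces to a sum of per-term log ratios once the term probabilities are available. The sketch below is a minimal reading of the model with hypothetical probability estimates; it is not code from the cited retrieval literature.

```python
import math

def bim_log_odds(doc_terms, p_rel, p_nonrel):
    """Log of the Binary Independence Model likelihood ratio for one document.

    doc_terms: set of terms present in the document (the binary vector v_j).
    p_rel[t]:    P(x_t = 1 | R, k), probability that term t appears in a relevant document.
    p_nonrel[t]: P(x_t = 1 | not R, k).
    Both dictionaries are assumed to be estimated elsewhere; values below are illustrative.
    """
    log_odds = 0.0
    for t in p_rel:
        if t in doc_terms:
            log_odds += math.log(p_rel[t] / p_nonrel[t])
        else:
            log_odds += math.log((1 - p_rel[t]) / (1 - p_nonrel[t]))
    return log_odds

# Hypothetical estimates for a two-term vocabulary.
p_rel = {"kinase": 0.8, "football": 0.1}
p_nonrel = {"kinase": 0.2, "football": 0.5}
print(bim_log_odds({"kinase"}, p_rel, p_nonrel))
```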

There is also another line of research which treats the problem of term recognition

as a supervised machine learning task. In this term recognition approach, each


word from a corpus is classified as a term or non-term. Classifiers are trained

using annotated domain corpora. The trained models can then be applied to other

corpora in the same domain. Turney [251] presented a comparative study between a

recognition model based on genetic algorithms and an implementation of the bagged

C4.5 decision tree algorithm. Hulth [109] studied the impact of prior input word

selection on the performance of term recognition. The author uses a classifier trained

on 2,000 abstracts in the domain of information technology to identify terms from

non-terms, and concluded that limiting the input words to NP-chunks offered the

best precision. This study further reaffirmed the benefit of incorporating linguistic

evidence during the measurement of termhood.

5.3.2 Existing Ad-Hoc Techniques for Term Recognition

Most of the current termhood measures for term recognition fall into this ad-hoc

techniques group. Term frequency and document frequency are the main types of

evidence used by ad-hoc techniques. Unlike the use of classifiers described in the

previous section, techniques in this group employ termhood scores for ranking and

selecting terms from non-terms.

Most common ad-hoc techniques that employ raw frequencies are variants of

Term Frequency Inverse Document Frequency (TF-IDF). TF attempts to capture

the pervasiveness of a term candidate within some documents, while IDF measures

the “informativeness” of a term candidate. Despite the mere heuristic background

of TF-IDF, the robustness of this weighting scheme has given rise to a number of

variants and has found its way into many retrieval applications. Certain researchers

[210] have even attempted to provide theoretical justifications as to why the combi-

nation of TF and IDF works so well. Basili et al. [19] proposed a TF-IDF inspired

measure for assigning terms with more accurate weights that reflect their specificity

with respect to the target domain. This contrastive analysis is based on the heuristic

that general language-dependent phenomena should spread similarly across differ-

ent domain corpora and special-language phenomena should portray odd behaviours.

The Contrastive Weight [19] for simple term candidate a in target domain d is

defined as:

$$CW(a) = \log f_{ad} \left( \log \frac{\sum_j \sum_i f_{ij}}{\sum_j f_{aj}} \right) \quad (5.1)$$

where $f_{ad}$ is the frequency of the simple term candidate $a$ in the target domain $d$, $\sum_j \sum_i f_{ij}$ is the sum of the frequencies of all term candidates in all domain corpora, and $\sum_j f_{aj}$ is the sum of the frequencies of term candidate $a$ in all domain corpora.
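To make Equation 5.1 concrete, the sketch below computes the Contrastive Weight for a simple term candidate from per-corpus frequency counts. The Counter-based corpus layout and the counts are illustrative, not the implementation used by Basili et al.

```python
import math
from collections import Counter

def cw_simple(a, target, corpora):
    """Contrastive Weight of Equation 5.1 for a simple term candidate a.

    corpora: {corpus_name: Counter of term frequencies}, one entry per domain corpus;
    `target` names the target domain d. Assumes a occurs at least once in the target.
    """
    f_ad = corpora[target][a]                                         # f_ad
    sum_all = sum(sum(freqs.values()) for freqs in corpora.values())  # sum_j sum_i f_ij
    sum_a = sum(freqs[a] for freqs in corpora.values())               # sum_j f_aj
    return math.log(f_ad) * math.log(sum_all / sum_a)

# Illustrative counts across a target domain corpus and two contrastive ones.
corpora = {
    "molecular_biology": Counter({"kinase": 40, "cell": 60, "bridge": 1}),
    "economics": Counter({"market": 80, "cell": 5}),
    "technology": Counter({"bridge": 30, "kinase": 1}),
}
print(cw_simple("kinase", "molecular_biology", corpora))
```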


For complex term candidates, the frequencies of their heads are utilised to compute their weights. This is necessary because the low frequencies among complex terms make estimations difficult. Consequently, the weight for complex term candidate $a$ in domain $d$ is defined as:

$$CW(a) = f_{ad}\, CW(a_h) \quad (5.2)$$

where $f_{ad}$ is the frequency of the complex term candidate $a$ in the target domain $d$, and $CW(a_h)$ is the contrastive weight for the head $a_h$ of the complex term candidate $a$. The use of the head noun by Basili et al. [19] for computing the contrastive weights of complex term candidates $CW(a)$ reflects the head-modifier principle [105]. The principle suggests that the information being conveyed by complex terms manifests itself in the arrangement of the constituents. The head acts as the key that refers to a general category to which all other modifications of the head belong. The modifiers are responsible for distinguishing the head from other forms in the same category. Wong et al. [274] presented another termhood measure based on contrastive analysis called Termhood (TH) which places emphasis on the difference between the notions of prevalence and tendency. The measure computes a Discriminative Weight (DW) for each candidate $a$ as:

$$DW(a) = DP(a)\, DT(a) \quad (5.3)$$

This weight realises the heuristic that the task of discriminating terms from non-terms is a function of Domain Prevalence (DP) and Domain Tendency (DT). If $a$ is a simple term candidate, its DP is defined as:

$$DP(a) = \log_{10}(f_{ad} + 10)\, \log_{10}\!\left( \frac{\sum_j f_{jd} + \sum_j f_{j\bar{d}}}{f_{ad} + f_{a\bar{d}}} + 10 \right) \quad (5.4)$$

where $\sum_j f_{jd} + \sum_j f_{j\bar{d}}$ is the sum of the frequencies of occurrences of all $a \in TC$ in both the domain and the contrastive corpus, while $f_{ad}$ and $f_{a\bar{d}}$ are the frequencies of occurrences of $a$ in the domain corpus and the contrastive corpus, respectively. DP simply increases, with an offset for overly frequent terms, along with the frequency of $a$ in the domain corpus. If the term candidate is complex, the authors define its DP as:

$$DP(a) = \log_{10}(f_{ad} + 10)\, DP(a_h)\, MF(a) \quad (5.5)$$

The reason behind the use of the DP of the complex term's head (i.e. $DP(a_h)$) in Equation 5.5 is similar to that of CW in Equation 5.2. $DT$, on the other hand, is


employed to determine the extent of the inclination of the usage of term candidate $a$ for domain and non-domain purposes. The authors defined DT as:

$$DT(a) = \log_2\!\left( \frac{f_{ad} + 1}{f_{a\bar{d}} + 1} + 1 \right) \quad (5.6)$$

where $f_{ad}$ is the frequency of occurrences of $a$ in the domain corpus, while $f_{a\bar{d}}$ is the frequency of occurrences of $a$ in the contrastive corpus. If term candidate $a$ is equally common in both the domain and non-domains (i.e. contrastive domains), $DT = 1$. If the usage of $a$ is more inclined towards the target domain, $f_{ad} > f_{a\bar{d}}$, then $DT > 1$, and $DT < 1$ otherwise.
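As a minimal sketch of the simple-term case of the TH measure (Equations 5.3, 5.4 and 5.6; the complex-term variant with $DP(a_h)$ and $MF(a)$ in Equation 5.5 is omitted), the following functions take illustrative counts and return the Discriminative Weight:

```python
import math

def domain_prevalence(f_ad, f_ad_bar, total_d, total_d_bar):
    """Domain Prevalence (Eq. 5.4) for a simple term candidate.
    f_ad, f_ad_bar: candidate frequency in the domain / contrastive corpus;
    total_d, total_d_bar: summed frequencies of all candidates in each corpus."""
    return math.log10(f_ad + 10) * math.log10(
        (total_d + total_d_bar) / (f_ad + f_ad_bar) + 10)

def domain_tendency(f_ad, f_ad_bar):
    """Domain Tendency (Eq. 5.6): greater than 1 when usage leans towards the domain."""
    return math.log2((f_ad + 1) / (f_ad_bar + 1) + 1)

def discriminative_weight(f_ad, f_ad_bar, total_d, total_d_bar):
    """Discriminative Weight (Eq. 5.3) = DP * DT."""
    return (domain_prevalence(f_ad, f_ad_bar, total_d, total_d_bar)
            * domain_tendency(f_ad, f_ad_bar))

# Illustrative counts: a candidate seen 40 times in the domain corpus, twice elsewhere.
print(discriminative_weight(f_ad=40, f_ad_bar=2, total_d=50_000, total_d_bar=80_000))
```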

Besides contrastive analysis, the use of contextual evidence to assist in the correct identification of terms is also common. There are currently two dominant approaches to extract contextual information. Most of the existing researchers such as Maynard & Ananiadou [171] employed fixed-size windows for capturing context words for term candidates. The Keyword in Context (KWIC) [159] index can be employed to identify the appropriate windows of words surrounding the term candidates. Other researchers such as Basili et al. [20], LeMoigno et al. [144] and Wong et al. [278] employed grammatical relations to identify verb phrases or independent clauses containing the term candidates. One of the works along the line of incorporating contextual information is NCvalue by Frantzi & Ananiadou [80]. Part of the NCvalue measure involves the assignment of weights to context words in the form of nouns, adjectives and verbs located within a fixed-size window from the term candidate. Given that $TC$ is the set of all term candidates and $c$ is a noun, verb or adjective appearing with term candidates, $weight(c)$ is defined as:

$$weight(c) = 0.5 \left( \frac{|TC_c|}{|TC|} + \frac{\sum_{e \in TC_c} f_e}{f_c} \right) \quad (5.7)$$

where $TC_c$ is the set of term candidates that have $c$ as a context word, $\sum_{e \in TC_c} f_e$ is the sum of the frequencies of term candidates that appear with $c$, and $f_c$ is the frequency of $c$ in the corpus. After calculating the weights for all possible context words, the sum of the weights of context words appearing with each term candidate is obtained. Formally, for each term candidate $a$ that has a set of accompanying context words $C_a$, the cumulative context weight is defined as:

$$cweight(a) = \sum_{c \in C_a} weight(c) + 1 \quad (5.8)$$

Eventually, the NCvalue for a term candidate is defined as:

$$NCvalue(a) = \frac{1}{\log F}\, Cvalue(a)\, cweight(a) \quad (5.9)$$


where $F$ is the number of words in the corpus. $Cvalue(a)$ is given by

$$Cvalue(a) = \begin{cases} \log_2 |a|\, f_a & \text{if } |a| = g \\ \log_2 |a| \left( f_a - \frac{\sum_{l \in L_a} f_l}{|L_a|} \right) & \text{otherwise} \end{cases} \quad (5.10)$$

where $|a|$ is the number of words that constitute $a$, $L_a$ is the set of potential longer term candidates that contain $a$, $g$ is the longest n-gram considered, and $f_a$ is the frequency of occurrences of $a$ in the corpus. The TH measure by Wong et al. [274] incorporates contextual evidence in the form of the Average Contextual Discriminative Weight (ACDW). ACDW is the average DW of the context words of $a$ adjusted based on the context's relatedness with $a$:

$$ACDW(a) = \frac{\sum_{c \in C_a} DW(c)\, NGD(a, c)}{|C_a|} \quad (5.11)$$

where NGD is the Normalised Google Distance by Cilibrasi & Vitanyi [50] that is used to determine the relatedness between two lexical units without any feature extraction or static background knowledge. The final termhood score for each term candidate $a$ is given by [278]:

$$TH(a) = DW(a) + ACC(a) \quad (5.12)$$

where ACC is the adjusted value of ACDW based on the DW of Equation 5.3.

The inclusion of a semantic relatedness measure by Wong et al. [278] brings us to the use of semantic information during the determination of termhood. Maynard & Ananiadou [171, 172] employed the Unified Medical Language System (UMLS) to compute two weights, namely, positional and commonality. Positional weight is obtained based on the combined number of nodes belonging to each word, while commonality is measured by the number of shared common ancestors multiplied by the number of words. Accordingly, the similarity between two term candidates is defined as [171]:

$$sim(a, b) = \frac{com(a, b)}{pos(a, b)} \quad (5.13)$$

where $com(a, b)$ and $pos(a, b)$ are the commonality and positional weight, respectively, between term candidates $a$ and $b$. The authors then modified the NCvalue discussed in Equation 5.9 by incorporating the new similarity measure as part of a Context Factor (CF). The context factor of a term candidate $a$ is defined as:

$$CF(a) = \sum_{c \in C_a} f_{c|a}\, weight(c) + \sum_{b \in CT_a} f_{b|a}\, sim(a, b) \quad (5.14)$$


where $C_a$ is the set of context words of $a$, $f_{c|a}$ is the frequency of $c$ as a context word of $a$, $weight(c)$ is the weight for context word $c$ as defined in Equation 5.7, $CT_a$ is the set of context words of $a$ which also happen to be term candidates (i.e. context terms), $f_{b|a}$ is the frequency of $b$ as a context term of $a$, and $sim(a, b)$ is the similarity between term candidate $a$ and its context term $b$ as defined in Equation 5.13. The new NCvalue is defined as:

$$NCvalue(a) = 0.8\, Cvalue(a) + 0.2\, CF(a) \quad (5.15)$$
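The pieces of the NCvalue family above reduce to a handful of small functions. The sketch below mirrors Equations 5.7 to 5.10 and 5.15 under the simplifying assumption that all the required counts (frequencies, context-word tallies, similarity values) have already been gathered; the helper names are ours, not those of the original implementations.

```python
import math

def c_value(length, f_a, longer_freqs, g):
    """Cvalue (Eq. 5.10). `longer_freqs`: frequencies of longer candidates containing a."""
    if length == g or not longer_freqs:   # guarding the empty-L_a case is a practical addition
        return math.log2(length) * f_a
    return math.log2(length) * (f_a - sum(longer_freqs) / len(longer_freqs))

def context_word_weight(n_tc_with_c, n_tc, sum_f_tc_with_c, f_c):
    """weight(c) (Eq. 5.7) for a context word c."""
    return 0.5 * (n_tc_with_c / n_tc + sum_f_tc_with_c / f_c)

def nc_value(cval, context_weights, F):
    """Original NCvalue (Eqs. 5.8 and 5.9): Cvalue scaled by the cumulative context weight."""
    cweight = sum(context_weights) + 1
    return (1.0 / math.log(F)) * cval * cweight

def modified_nc_value(cval, context_words, context_terms):
    """NCvalue with a Context Factor (Eqs. 5.14 and 5.15).
    context_words: (f_{c|a}, weight(c)) pairs; context_terms: (f_{b|a}, sim(a, b)) pairs."""
    cf = sum(f * w for f, w in context_words) + sum(f * s for f, s in context_terms)
    return 0.8 * cval + 0.2 * cf

# Illustrative use for a two-word candidate seen 12 times in a 50,000-word corpus.
cval = c_value(length=2, f_a=12, longer_freqs=[3, 2], g=4)
print(nc_value(cval, context_weights=[0.4, 0.7], F=50_000))
print(modified_nc_value(cval, [(5, 0.4), (2, 0.7)], [(3, 0.6)]))
```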

Basili et al. [20] commented that the use of extensive and well-grounded semantic resources by Maynard & Ananiadou [171] faces the issue of portability to other domains. Instead, Basili et al. [20] combined the use of contextual information and the head-modifier principle to capture term candidates and their context words on a feature space for computing similarity using the cosine measure. According to the authors [20], “the term sense is usually determined by its head”. On the contrary, such a statement opposes the fundamental fact, not only in terminology but in general linguistics, that simple terms are polysemous and the modification of such terms is necessary to narrow down their possible interpretations [105]. Moreover, the size of the corpus has to be very large, and the specificity and density of domain terms in the corpus have to be very high to allow for the extraction of adequate features.

In summary, while the existing techniques described above may be intuitively

justifiable, the manner in which the weights were derived remains questionable. To

illustrate, why are the products of the various variables in Equations 5.1 and 5.9

taken instead of their summations? What would happen to the resulting weights

if the products are taken instead of the summations, and the summations taken

instead of the products in Equations 5.7, 5.15 and 5.4? These are just minor but

thought-provoking questions in comparison to more fundamental issues related to the

decomposability and traceability of the weights back to their various constituents or

individual evidence. The two main advantages of decomposability and traceability

are (1) the ability to trace inaccuracies of termhood measurement to their origin (i.e.

what went wrong and why), and (2) the attribution of the significance of the various

weights to their intended term characteristics (i.e. what do the weights measure?).

5.3.3 Word Distribution Models

An alternative to the use of relative frequency as practiced by many ad-hoc

techniques discussed above in Section 5.3.2 is to develop models of the distribution


of words and employ such models to describe the various characteristics of terms,

and the corpus or the domain they represent. It is worth pointing out that the

modelling is done with respect to all words (i.e. terms and non-terms) that a corpus

contains. This is important for capturing the behaviour of both terms and non-

terms in the domain for discrimination purposes. Word distribution models can be

used to normalise frequency of occurrence [7] and to solve problems related to data

sparsity caused by the use of raw frequencies in ad-hoc techniques. The modelling

of word distributions in documents or the entire corpus can also be employed as

means of predicting the rate of occurrence of words. There are mainly two groups of

models related to word distribution. The first group attempts to model the frequency

distribution of all words in an entire corpus while the second group focuses on the

distribution of a single word.

The foundation of the first group of models is the relationship between the frequencies of words and their ranks. One of the most widely-used models in this group is Zipf's Law [285] which describes the relationship between the frequency of a word, $f$, and its rank, $r$, as

$$P(r; s, H) = \frac{1}{r^s H} \quad (5.16)$$

where $s$ is an exponent characterising the distribution. Given that $1 \le r \le |W|$ where $W$ is the set of all distinct words in the corpus, $H$ is defined as the $|W|$-th harmonic number, $H = \sum_{i=1}^{|W|} i^{-s}$. The actual notation for $H$ computed as the $|W|$-th harmonic number is $H_{|W|,s}$. However, for brevity, we will continue with the use of the notation $H$. A generalised version of the Zipfian distribution is the Zipf-Mandelbrot Law [168] whose probability mass function is given by

$$P(r; q, s, H) = \frac{1}{(r + q)^s H} \quad (5.17)$$

where $q$ is a parameter for expressing the richness of word usage in the text. Similarly, $H$ can be computed as $H = \sum_{i=1}^{|W|} (i + q)^{-s}$. There is still a hyperbolic relation between rank and frequency in the Zipf-Mandelbrot distribution. The additional parameter $q$ can be used to model curves in the distribution, something not possible in the original Zipf. There are a few other probability distributions that can or have been used to model word distribution, such as the Pareto distribution [7], the Yule-Simon distribution [230] and the generalised inverse Gauss-Poisson law [12]. All of the distributions described above are discrete power law distributions, except for Pareto, which have the ability to model the unique property of word occurrences,


Figure 5.2: Distribution of 3,058 words randomly sampled from the domain corpus $d$: (a) dispersed according to the domain corpus; (b) dispersed according to the contrastive corpus. The line labeled “KM” is the aggregation of the individual probabilities of occurrence of word $i$ in a document, $1 - P(0; \alpha_i, \beta_i)$, using the K-mixture with $\alpha_i$ and $\beta_i$ defined in Equations 5.21 and 5.20. The line labeled “ZM-MF” is the manually fitted Zipf-Mandelbrot model. The line labeled “RF” is the actual rate of occurrence computed as $f_i/F$.


namely, the “long tail phenomenon”. One of the main problems that hinders the practical use of these distributions is the estimation of the various parameters [112]. To illustrate, Figure 5.3 summarises the parameters of the manually fitted Zipf-Mandelbrot models for the distribution of a set of 3,058 words randomly drawn from our domain corpus $d$. The lines with the label “ZM-MF” shown in Figures 5.2(a) and 5.2(b) show the distributions of the words dispersed according to the domain corpus and the contrastive corpus, respectively. One can notice that the distribution of the words in $\bar{d}$ is particularly difficult to fit because it tends to have a bulge near the end. This is caused by the presence of many domain-specific terms (which are unique to $d$) in the set of 3,058 words. Such domain-specific terms will have extremely low or, most of the time, zero word counts. Nevertheless, a better fit for Figure 5.2(b) can be achieved through more trial and error. In addition to the trial-and-error exercise required in manual fitting, the values in Figure 5.3 clearly show that different parameters are required even for fitting the same set of words using different corpora. The manual fits we have carried out are far from perfect and some automatic fitting mechanism is required if we were to practically employ the Zipf-Mandelbrot model. In the words of Edmundson [65], “a distribution with more or different parameters may be required. It is clear that computers should be used on this problem...”.

Figure 5.3: Parameters for the manually fitted Zipf-Mandelbrot model for the set of 3,058 words randomly drawn from $d$.

In the second group of models, individual word distributions allow us to capture and express the behaviour of individual words in parts of a corpus. The standard probabilistic model for the distribution of some event over fixed-size units is the Poisson distribution. In the conventional case of individual word distribution, the event would be $k$ occurrences of word $i$ and the unit would be a document. The definition of the Poisson distribution is given by

$$P(k; \lambda_i) = \frac{e^{-\lambda_i} \lambda_i^k}{k!} \quad (5.18)$$

where $\lambda_i$ is the average number of occurrences of word $i$ per document, or $\lambda_i = f_i/N$. Obviously, $\lambda_i$ will vary between different words. $P(0; \lambda_i)$ will give the probability


that word $i$ does not exist in a document while $1 - P(0; \lambda_i)$ will give the probability that a candidate has at least one occurrence in a document. Other similarly unsuccessful attempts at better fits are the Binomial model and the Two-Poisson model [82, 240, 83]. These single-parameter distributions (i.e. Poisson and Binomial) have been traditionally employed to model individual word distributions based on unrealistic assumptions such as the independence between word occurrences. As a result, they are poor fits of the actual word distribution. Nevertheless, such variation from the Poisson distribution, colloquially known as non-Poissonness, serves a purpose. It is well-known throughout the literature [256, 45, 169] that the Poisson distribution is only a good fit for function words while content words tend to deviate from it. Using this property, we can also employ the single Poisson as a predictor of whether a lexical unit is a content word or not, and hence as an indicator of possible termhood. A better fit for individual word distribution employs a mixture of Poissons [170, 46]. The Negative Binomial is one such mixture but the involvement of large binomial coefficients makes it computationally unattractive. Another alternative is the K-mixture proposed by Katz [122] that allows the Poisson parameter $\lambda_i$ to vary between documents. The distribution of $k$ occurrences of word $i$ in a document is given by:

$$P(k; \alpha_i, \beta_i) = (1 - \alpha_i)\, \delta_{k,0} + \frac{\alpha_i}{\beta_i + 1} \left( \frac{\beta_i}{\beta_i + 1} \right)^k \quad (5.19)$$

where $\delta_{k,0}$ is the Dirac delta: $\delta_{k,0} = 1$ if $k = 0$ and $\delta_{k,0} = 0$ otherwise. The parameters $\alpha_i$ and $\beta_i$ can be computed as:

$$\beta_i = \frac{f_i - n_i}{n_i} \quad (5.20)$$

$$\alpha_i = \frac{\lambda_i}{\beta_i} \quad (5.21)$$

where $\lambda_i$ is the single Poisson parameter of the observed mean. $\beta_i$ determines the additional occurrences of word $i$ in each document that contains $i$, and $\alpha_i$ can be seen as the fraction of documents with $i$ and without $i$. One of the properties of the K-mixture is that it is always a perfect fit at $k = 0$. This desirable property of the K-mixture can be employed to accurately determine the probability of non-occurrence of word $i$ in a document. $P(0; \alpha_i, \beta_i)$ gives us the probability that word $i$ does not occur in a document and $1 - P(0; \alpha_i, \beta_i)$ gives us the probability that a word has at least one occurrence in a document (i.e. the candidate exists in a document). When $k = 0$,


the K-mixture is reduced to

$$P(0; \alpha_i, \beta_i) = (1 - \alpha_i) + \frac{\alpha_i}{\beta_i + 1} \quad (5.22)$$
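A minimal sketch of Equations 5.19 to 5.21, assuming the raw counts $f_i$, $n_i$ and $N$ are already available (the numbers below are illustrative, and $f_i > n_i$ is assumed so that $\beta_i > 0$):

```python
def k_mixture_params(f_i, n_i, N):
    """K-mixture parameters (Eqs. 5.20 and 5.21) from corpus counts.
    f_i: total occurrences of word i; n_i: documents containing i; N: number of documents.
    Assumes f_i > n_i so that beta_i is non-zero."""
    lam = f_i / N                 # single-Poisson mean
    beta = (f_i - n_i) / n_i      # extra occurrences per document containing i
    alpha = lam / beta
    return alpha, beta

def k_mixture_pmf(k, alpha, beta):
    """P(k; alpha_i, beta_i) (Eq. 5.19): probability of k occurrences in a document."""
    delta = 1.0 if k == 0 else 0.0
    return (1 - alpha) * delta + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k

# Illustrative counts: a word with 120 occurrences spread over 60 of 2,000 documents.
alpha, beta = k_mixture_params(f_i=120, n_i=60, N=2000)
print(1 - k_mixture_pmf(0, alpha, beta))   # probability of at least one occurrence (0.03 here)
```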

Unlike fixed-size textual units such as documents, the notion of domains is elusive.

The lines labeled with “KM” in Figure 5.2 are the result of the aggregation of

the individual probability of occurrence of word i in documents of the respective

corpora. Figures 5.2(a) and 5.2(b) clearly show that models like K-mixture whose

distributions are defined over documents or other units with clear, explicit bound-

aries cannot be employed directly as predictors for the actual rate of occurrence of

words in a domain.

5.4 A New Probabilistic Framework for Determining Termhood

We begin the derivation of the new probabilistic framework for term recognition

by examining the definition of termhood. Based on two prominent review papers

on term recognition [120, 131], we define termhood as:

Definition 5.4.1. Termhood is the degree to which a lexical unit is relevant to a

domain of interest.

As outlined in Section 5.1, our focus is to construct a formal framework which

combines evidence, in the form of term characteristics, instead of seedterms for

term recognition. The aggregated evidence can then be used to determine the extent

of relevance of the corresponding term with respect to a particular domain, as in

the definition of termhood in Definition 5.4.1. The characteristics of terms manifest

themselves in suitable corpora that represent the domain of interest. From here on,

we will use the notation d to interchangeably denote the elusive notion of a domain

and its tangible counterpart, the domain corpus. Since the quality of the termhood

evidence with respect to the domain is dependent on the issue of representativeness

of the corresponding corpus, the following assumption is necessary for us to proceed:

Assumption 1. Corpus d is a balanced, unbiased and randomised sample of the

population text representing the corresponding domain.

The actual discussion on corpus representativeness is nevertheless important but

the issue is beyond the scope of this chapter. Having Assumption 1 in place, we

restate Definition 5.4.1 in terms of probability to allow us to formulate a probabilistic

model for measuring termhood in the next two steps,


Aim 1. What is the probability that a is relevant to domain d given the evidence a

has?

In the second step, we lay the foundation for the various term characteristics

which are mentioned throughout this chapter, and used specifically for formalising

termhood evidence in Section 5.4.2. We subscribe to the definition of ideal terms

as adopted by many researchers [157]. Definition 5.4.2 outlines the primary char-

acteristics of terms. These characteristics rarely exist in real-world settings since

word ambiguity is a common phenomenon in linguistics. Nevertheless, this defini-

tion is necessary to establish a platform for determining the extent of deviation of

the characteristics of terms in actual usage from the ideal cases,

Definition 5.4.2. The primary characteristics of terms in ideal settings are:

• Terms should not have synonyms. In other words, there should be no different

terms implying the same meaning.

• Meaning of terms is independent of context.

• Meaning of terms should be precise and related directly to a concept. In other

words, a term should not have different meanings or senses.

In addition to Definition 5.4.2, there are several other related characteristics of

terms which are of common knowledge in this area. Some of these characteristics

follow from the general properties of words in linguistics. This list is not a standard

and is by no means exhaustive or properly theorised. Nonetheless, as we have

pointed out, such a heuristically-motivated list is one of the foundations of automatic

term recognition. They are as follow:

Definition 5.4.3. The extended characteristics of terms:

1 Terms are properties of domain, not document [19].

2 Terms tend to clump together [28] the same way content-bearing words do

[285].

3 Terms with longer length are rare in a corpus since the usage of words with

shorter length is more predominant [284].

4 Simple terms are often ambiguous and modifiers are required to reduce the

number of possible interpretations.


5 Complex terms are preferred [80] since the specificity of such terms with respect to certain domains is well-defined.

Definition 5.4.2 simply states that a term is unambiguously relevant to a domain. For instance, assume that once we encounter the term “bridge”, it has to immediately mean “a device that connects multiple network segments at the data link layer”, and nothing else. At the same time, such a “device” should not be identifiable using other labels. If this is the case, all we need to do is to measure the extent to which a term candidate is relevant to a domain regardless of its relevance to other domains since an ideal term cannot be relevant to both (as implied in Definition 5.4.2).

This brings us to the third step where we can now formulate our Aim 1 as a conditional probability between two events and pose it using Bayes Theorem,

$$P(R_1|A) = \frac{P(A|R_1)P(R_1)}{P(A)} \quad (5.23)$$

where $R_1$ is the event that $a$ is relevant to domain $d$ and $A$ is the event that $a$ is a candidate with evidence set $V = \{E_1, \ldots, E_m\}$. $P(R_1|A)$ is the posterior probability of candidate $a$ being relevant to $d$ given the evidence set $V$ associated with $a$. $P(R_1)$ and $P(A)$ are the prior probabilities of candidate $a$ being relevant without any evidence, and the probability of $a$ being a candidate with evidence $V$, respectively. One has to bear in mind that Equation 5.23 is founded upon the Bayesian interpretation of probability. Consequently, subjective rather than frequency-based assessments of $P(R_1)$ and $P(A)$ are well-accepted, at least by the Bayesians. As we shall see later, these two prior probabilities will be immaterial in the final computation of weights for the candidates. In addition, we introduce the event that $a$ is relevant to other domains $\bar{d}$, $R_2$, which can be seen as the complementary event of $R_1$. Similar to Assumption 1, we subscribe to the following assumption for $\bar{d}$,

Assumption 2. Contrastive corpus $\bar{d}$ is the set of balanced, unbiased and randomised samples of the population text representing approximately all major domains other than $d$.

Based on the ideal characteristics of terms in Definition 5.4.2, and the new event $R_2$, we can state that $P(R_1 \cap R_2) = 0$. In other words, $R_1$ and $R_2$ are mutually exclusive in ideal settings. Ignoring the fact that a term may appear in certain domains by chance, any candidate $a$ can either be relevant to $d$ or to $\bar{d}$, but definitely not both. Unfortunately, a point worth noting is that “An impregnable barrier between words of a general language and terminologies does not exist.” [157]. For


example, the word “bridge” has multiple meanings and is definitely relevant to more than one domain, or in other words, $P(R_1 \cap R_2)$ is not strictly $0$ in reality. While people in the computer networking domain may accept and use the word “bridge” as a term, it is in fact not an ideal term. Words like “bridge” are often a poor choice of terms (i.e. not ideal terms) simply because they are simple terms, and inherently ambiguous as defined in Definition 5.4.3.4. Instead, a better term for denoting the concept which the word “bridge” attempts to represent would be “network bridge”. As such, we assume that:

Assumption 3. Each concept represented using a polysemous simple term in a corpus has a corresponding unambiguous complex term representation occurring in the same corpus.

From Assumption 3, since all important concepts of a domain have unambiguous manifestations in the corpus, the possibility of the ambiguous counterparts achieving lower ranks during our termhood measurement will have no effect on the overall term recognition output. In other words, polysemous simple terms can be considered as insignificant in our determination of termhood. Based on this alone, we can assume that non-relevance to $d$ approximately implies relevance to $\bar{d}$. This brings us to the next property about the prior probability of relevance of terms. The mutual exclusion and complementation properties of the relevance of terms in $d$, $R_1$, and in $\bar{d}$, $R_2$, are:

• $P(R_1 \cap R_2) \approx 0$

• $P(R_1 \cup R_2) = P(R_1) + P(R_2) \approx 1$

Even in the presence of evidence, the addition law of probability still has to hold. As such, we can extend this approximation of the sum of the probabilities of relevance without evidence to the corresponding probabilities conditioned on the evidence:

$$P(R_1|A) + P(R_2|A) \approx 1 \quad (5.24)$$

without violating the probability axioms.

Knowing that $P(R_1 \cap R_2)$ is only approximately $0$ in reality, we will need to make sure that the relevance of candidate $a$ in domain $d$ does not happen by chance. The occurrence of a term in a domain is considered as accidental if the concepts represented by the terms are not topical for that domain. Moreover, accidental repeats of the same term in non-topical cases are possible [122]. Consequently, we


need to demonstrate the odds of term candidate $a$ being more relevant to $d$ than to $\bar{d}$:

Aim 2. What are the odds of candidate $a$ being relevant to domain $d$ given the evidence it has?

In this fourth step, we alter Equation 5.23 to reflect our new Aim 2 for determining the odds rather than merely probabilities. Since $Odds = \frac{P}{1-P}$, we can apply an order-preserving transformation by multiplying $\frac{1}{1 - P(R_1|A)}$ into Equation 5.23 to give us the odds of relevance given the evidence candidate $a$ has:

$$\frac{P(R_1|A)}{1 - P(R_1|A)} = \frac{P(A|R_1)P(R_1)}{P(A)(1 - P(R_1|A))} \quad (5.25)$$

and since $1 - P(R_1|A) \approx P(R_2|A)$ from Equation 5.24, we have:

$$\frac{P(R_1|A)}{P(R_2|A)} = \frac{P(A|R_1)P(R_1)}{P(A)P(R_2|A)} \quad (5.26)$$

and applying the multiplication rule $P(R_2|A)P(A) = P(A|R_2)P(R_2)$ to the right-hand side of Equation 5.26, we obtain

$$\frac{P(R_1|A)}{P(R_2|A)} = \frac{P(A|R_1)}{P(A|R_2)} \frac{P(R_1)}{P(R_2)} \quad (5.27)$$

Equation 5.27 can also be called the odds of relevance of candidate $a$ to $d$ given the evidence $a$ has. The second term in Equation 5.27, $\frac{P(R_1)}{P(R_2)}$, is the odds of relevance of candidate $a$ without evidence. We can use Equation 5.27 as a way to rank the candidates. Taking the log of the odds, we have

$$\log \frac{P(A|R_1)}{P(A|R_2)} = \log \frac{P(R_1|A)}{P(R_2|A)} - \log \frac{P(R_1)}{P(R_2)}$$

$P(A|R_1)$ and $P(A|R_2)$ are the class conditional probabilities of $a$ being a candidate with evidence $V$ given its different states of relevance. Since the probabilities of relevance to $d$ and to $\bar{d}$ of all candidates without any evidence are the same, we can safely ignore the second term (i.e. the odds of relevance without evidence) in Equation 5.27 without committing the prosecutor's fallacy [232]. This gives us

$$\log \frac{P(A|R_1)}{P(A|R_2)} \approx \log \frac{P(R_1|A)}{P(R_2|A)} \quad (5.28)$$

To facilitate the scoring and ranking of the candidates based on the evidence they have, we introduce a new function of the evidence possessed by candidate $a$. We call this new function the Odds of Termhood (OT):


$$OT(A) = \log \frac{P(A|R_1)}{P(A|R_2)} \quad (5.29)$$

Since we are only interested in ranking, and from Equation 5.28, ranking candidates according to $OT(A)$ is the same as ranking the candidates according to our Aim 2 reflected through Equation 5.27. Obviously, from Equation 5.29, our initial predicament of not being able to empirically interpret the prior probabilities $P(A)$ and $P(R_1)$ is no longer a problem.

Assumption 4. Independence between the pieces of evidence in the set $V$.

In the fifth step, we decompose the evidence set $V$ associated with each candidate $a$ to facilitate the assessment of the class conditional probabilities $P(A|R_1)$ and $P(A|R_2)$. Given Assumption 4, we can evaluate $P(A|R_1)$ as

$$P(A|R_1) = \prod_i P(E_i|R_1) \quad (5.30)$$

and $P(A|R_2)$ as

$$P(A|R_2) = \prod_i P(E_i|R_2) \quad (5.31)$$

where $P(E_i|R_1)$ and $P(E_i|R_2)$ are the probabilities of $a$ being a candidate associated with evidence $E_i$ given its different states of relevance $R_1$ and $R_2$, respectively. Substituting Equations 5.30 and 5.31 into 5.29, we get

$$OT(A) = \sum_i \log \frac{P(E_i|R_1)}{P(E_i|R_2)} \quad (5.32)$$

Lastly, for ease of computing the evidence, we define individual scores called evidential weights ($O_i$) provided by each piece of evidence $E_i$ as

$$O_i = \frac{P(E_i|R_1)}{P(E_i|R_2)} \quad (5.33)$$

and substituting Equation 5.33 into 5.32 provides

$$OT(A) = \sum_i \log O_i \quad (5.34)$$

The purpose of OT is similar to many other functions for scoring and ranking term candidates such as those reviewed in Section 5.3.2. However, what differentiates our new function from the existing ones is that OT is founded upon and derived in a probabilistic framework whose assumptions are made explicit. Moreover, as we will discuss in Section 5.4.2, the individual evidence is formulated using probability and the necessary term distributions are derived from the formal distribution models to be discussed in Section 5.4.1.
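Once the class conditional probabilities for each piece of evidence are available, Equation 5.34 is a straightforward sum of log ratios. The sketch below shows that final aggregation step only, with hypothetical probability values; estimating the probabilities themselves from the Zipf-Mandelbrot and K-mixture models is the subject of Sections 5.4.1 and 5.4.2.

```python
import math

def odds_of_termhood(evidential_probs):
    """Odds of Termhood (Eq. 5.34) for one candidate.

    evidential_probs: list of (P(E_i | R1), P(E_i | R2)) pairs, one per piece of evidence.
    The values passed in here are illustrative placeholders.
    """
    return sum(math.log(p_rel / p_nonrel) for p_rel, p_nonrel in evidential_probs)

# Two hypothetical pieces of evidence for a candidate term.
candidate_evidence = [(0.02, 0.001), (0.6, 0.3)]
print(odds_of_termhood(candidate_evidence))   # higher scores rank higher
```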


5.4.1 Parameters Estimation for Term Distribution Models

In Section 5.3.3, we presented a wide range of distribution models for individual terms and for all terms in the corpus. Our intention is to avoid the use of raw frequencies and also relative frequencies for computing the evidential weights, $O_i$. The shortcomings related to the use of raw frequencies have been clearly highlighted in Section 5.3. In this section, we discuss the two models that we employ for computing the evidential weights, namely, the Zipf-Mandelbrot model and the K-mixture model. The Zipf-Mandelbrot model is employed to predict the probability of occurrence of a term candidate in the domain, while the K-mixture model predicts the probability of a certain number of occurrences of a term candidate in a document of the domain. Most of the literature on the Zipf and Zipf-Mandelbrot laws leaves out the single most important aspect that makes these two distributions applicable to real-world applications, namely, parameter estimation. As we have discussed in Section 5.3.3, the manual process of deciding the parameters, namely, $s$, $q$ and $H$ in the case of the Zipf-Mandelbrot distribution, is tedious and may not easily achieve the best fit. Moreover, the parameters for modelling the same set of terms vary across different corpora. A recent paper by Izsak [112] discussed some general aspects of standardising the process of fitting the Zipf-Mandelbrot model. We experimented with linear regression using the ordinary least squares method [200] and the weighted least squares method [63] to estimate the three parameters, namely, $s$, $q$ and $H$. The results of our experiments are reported in this section. We would like to stress that our focus is to search for appropriate parameters to achieve a good fit of the models using observed data (i.e. raw frequencies and ranks). While we do not discount the importance of observing the various assumptions involved in linear regression such as the normality of residuals and the homogeneity of variance, their discussion is beyond the scope of this chapter.

We begin by linearising Equation 5.17 of the Zipf-Mandelbrot model to allow for linear regression. From here on, where there is no confusion, we will refer to the probability mass function of the Zipf-Mandelbrot model, $P(r; q, s, H)$, as $ZM_r$ for clarity. We will take the natural logarithm of both sides of Equation 5.17 to obtain:

$$\ln ZM_r = \ln H - s \ln(r + q) \quad (5.35)$$

Our aim is then to find the line defined by Equation 5.35 that best fits our observed points $(\ln r, \ln \frac{f_r}{F})$; $r = 1, 2, \ldots, |W|$. For sufficiently large $r$, $\ln(r + q)/\ln(r)$ will


approximate 1 and we will have:

$$\ln ZM_r \approx \ln H - s \ln r$$

As a result, $\ln ZM_r$ will be an approximately linear function of $\ln r$ with the points scattered along a straight line with slope $-s$ and an intersection at $\ln H$ on the Y-axis. We can then move on to determine the estimates $\ln H$ and $s$. We attempt to minimise the squared sum of residuals (SSR) between the actual points $\ln \frac{f_i}{F}$ and the predicted points $\ln ZM_i$:

$$SSR = \sum_{i=1}^{|W|} \left( \ln \frac{f_i}{F} - \ln ZM_i \right)^2 \quad (5.36)$$

Given that $|W|$ is the number of words, the least squares estimates of $s$ and $\ln H$ are defined as [1]:

$$s = \frac{|W| \sum_j (\ln \frac{f_j}{F})(\ln j) - \sum_j (\ln \frac{f_j}{F}) \sum_j (\ln j)}{|W| \sum_j (\ln j)^2 - \sum_j (\ln j) \sum_j (\ln j)} \quad (5.37)$$

and

$$\ln H = \frac{\sum_j (\ln ZM_j) - s \sum_j (\ln j)}{|W|} \quad (5.38)$$

Since the approximation of $\ln \frac{f_r}{F}$ is given by $\ln ZM_r = \ln H - s \ln(r + q)$, it holds that $\ln \frac{f_1}{F} \approx \ln ZM_1$. Following this, $\ln \frac{f_1}{F} \approx \ln H - s \ln(1 + q)$. As a result, we can estimate $q$ using $s$ and $\ln H$ as such:

$$\ln \frac{f_1}{F} \approx \ln H - s \ln(1 + q)$$

$$\ln(1 + q) \approx \frac{1}{s} \left( \ln H - \ln \frac{f_1}{F} \right)$$

$$q \approx e^{\frac{1}{s}\left(\ln H - \ln \frac{f_1}{F}\right)} - 1 \quad (5.39)$$

To illustrate our process of automatically fitting the Zipf-Mandelbrot model, please refer to Figure 5.4. Figure 5.6 summarises the parameters of the automatically fitted Zipf-Mandelbrot model. The lines with the label “ZM-OLS” in Figures 5.4(a) and 5.4(b) show the automatically fitted Zipf-Mandelbrot model for the distribution of the same set of 3,058 words employed in Figure 5.2, using the ordinary least squares method. The line “RF” is the actual relative frequency that we are trying to fit. One will notice from the SSR column of Figure 5.5 that the automatic fit provided by the ordinary least squares method achieves relatively good results (i.e. low SSR) in dealing with the curves along the line in Figure 5.4(b).
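The fitting procedure of Equations 5.35 to 5.39 can be compressed into a few lines. The sketch below fits the linearised model by ordinary least squares on a toy frequency profile and then evaluates the resulting Zipf-Mandelbrot probability of Equation 5.17; it is an illustrative reading of the estimators (with the parameter signs taken from the fitted line), not the code used for the experiments reported in this chapter.

```python
import math

def fit_zipf_mandelbrot(freqs):
    """Estimate (s, ln H, q) for the linearised Zipf-Mandelbrot model (Eqs. 5.35-5.39).

    freqs: word frequencies sorted in descending order, so index 0 is rank 1.
    """
    F = sum(freqs)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]   # ln r
    ys = [math.log(f / F) for f in freqs]                  # ln(f_r / F)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / (
        sum(x * x for x in xs) - n * mean_x ** 2)
    s = -slope                            # the fitted line has slope -s
    ln_H = mean_y - slope * mean_x        # intercept of the fitted line
    q = math.exp((ln_H - ys[0]) / s) - 1  # Eq. 5.39, using ln(f_1/F) as the first observation
    return s, ln_H, q

def zipf_mandelbrot_pmf(r, s, q, vocab_size):
    """P(r; q, s, H) of Eq. 5.17 with H = sum_{i=1}^{|W|} (i + q)^(-s)."""
    H = sum((i + q) ** -s for i in range(1, vocab_size + 1))
    return 1.0 / (((r + q) ** s) * H)

# A toy, roughly Zipfian frequency profile (not corpus data from this chapter).
s, ln_H, q = fit_zipf_mandelbrot([1000, 480, 320, 240, 190, 160, 120, 100, 90, 80])
print(s, q, zipf_mandelbrot_pmf(1, s, q, vocab_size=10))
```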


Figure 5.4: Distribution of the same 3,058 words as employed in Figure 5.2: (a) dispersed according to the domain corpus; (b) dispersed according to the contrastive corpus. The line with the label “ZM-OLS” is the Zipf-Mandelbrot model fitted using the ordinary least squares method. The line labeled “ZM-WLS” is the Zipf-Mandelbrot model fitted using the weighted least squares method, while “RF” is the actual rate of occurrence computed as $f_i/F$.


Figure 5.5: Summary of the sum of squares of residuals (SSR) and the coefficient of determination ($R^2$) for the regression using manually estimated parameters, parameters estimated using ordinary least squares (OLS), and parameters estimated using weighted least squares (WLS). Obviously, the smaller the SSR is, the better the fit. As for $0 \leq R^2 \leq 1$, the upper bound is achieved when the fit is perfect.

Figure 5.6: Parameters for the automatically fitted Zipf-Mandelbrot model for the set of 3,058 words randomly drawn.

We also attempted to fit the Zipf-Mandelbrot model using the second type of least squares method, namely, the weighted least squares. The idea is to assign to each point $\ln\frac{f_i}{F}$ a weight that reflects the uncertainty of the observation. Instead of weighting all points equally, they are weighted such that points with a greater weight contribute more to the fit. Most of the time, the weight $w_i$ assigned to the $i$-th point is determined as a function of the variance of that observation, denoted as $w_i = \sigma_i^{-1}$. In other words, we assign points with lower variances greater statistical weights. Instead of using variance, we propose the assignment of weights for the weighted least squares method based on the changes of the slopes at each segment of the distribution. The slope at the point $(x_i, y_i)$ is defined as the slope of the segment between the points $(x_i, y_i)$ and $(x_{i-1}, y_{i-1})$, and it is given by:

$m_{i,i-1} = \dfrac{y_i - y_{i-1}}{x_i - x_{i-1}}$

The weight to be assigned to each point $(x_i, y_i)$ is a function of the conditional accumulation of slopes up to that point. The accumulation of the slopes is conditional on the changes between slopes. The slope of the segment between points $i$ and $i-1$ is added to the cumulative slope if its rate of change from the previous segment (between $i-1$ and $i-2$) lies between 0.9 and 1.1; in other words, the slopes of the two segments are approximately the same. If the change in slopes between the two segments falls outside that range, the cumulative slope is reset to 0. Given that $i = 1, 2, \ldots, |W|$, computing the slope at point 1 uses a prior, non-existent point, with $m_{1,0} = 0$. Formally, we set the weight $w_i$ to be assigned to point $i$ for the weighted least squares method as:

$w_i = \begin{cases} 0 & \text{if } i = 1 \\ m_{i,i-1} + w_{i-1} & \text{if } i \neq 1 \,\wedge\, 0.9 \leq \frac{m_{i,i-1}}{m_{i-1,i-2}} \leq 1.1 \\ 0 & \text{otherwise} \end{cases} \qquad (5.40)$

Consequently, instead of minimising the sum of squares of the residuals (SSR) where all points are treated equally, as in the ordinary least squares of Equation 5.36, we include the new weight $w_i$ defined in Equation 5.40 to give us:

$SSR = \sum_{i=1}^{|W|} w_i \left( \ln\frac{f_i}{F} - \ln ZM_i \right)^2 \qquad (5.41)$
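The slope-based weighting scheme of Equation 5.40 and the weighted SSR of Equation 5.41 can be sketched as follows (again, not the thesis implementation; note that because the log-frequencies decrease with rank the raw segment slopes are negative, and accumulating their absolute values below is our own convention rather than something the text fixes).

import math

def slope_weights(xs, ys):
    # xs, ys: the observed points (ln r, ln f_r/F), ordered by rank
    n = len(xs)
    slopes = [0.0] * n                        # m_{1,0} is taken to be 0
    for i in range(1, n):
        slopes[i] = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
    weights = [0.0] * n                       # w_1 = 0
    for i in range(2, n):
        ratio = slopes[i] / slopes[i - 1] if slopes[i - 1] != 0 else 0.0
        if 0.9 <= ratio <= 1.1:               # consecutive slopes roughly equal: accumulate
            weights[i] = abs(slopes[i]) + weights[i - 1]
        else:                                 # abrupt change in slope: reset
            weights[i] = 0.0
    return weights

def weighted_ssr(xs, ys, weights, s, H, q):
    # weighted sum of squared residuals of Equation 5.41 for a fitted model
    ssr = 0.0
    for x, y, w in zip(xs, ys, weights):
        predicted = math.log(H) - s * math.log(math.exp(x) + q)
        ssr += w * (y - predicted) ** 2
    return ssr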

Referring back to Figure 5.4, the lines with the label "ZM-WLS" demonstrate the fit of the Zipf-Mandelbrot model whose parameters are estimated using the weighted least squares method. The line "RF" is again the actual relative frequency we are trying to fit. Despite the curves, especially in the case of the contrastive corpus $\bar{d}$, the weighted least squares method is able to provide a good fit. The constantly changing slopes, especially in the middle of the distribution, provide an increasing weight to each point, enabling such points to contribute more to the fitting.

In the subsequent sections, we will utilise the Zipf-Mandelbrot model for modelling the distribution of term candidates in both the domain corpus (from which the candidates were extracted) and the contrastive corpus. We employ $P(r; q, s, H)$ of Equation 5.17 in Section 5.3.3 to compute the probability of occurrence of ranked words in both the domain corpus and the contrastive corpus. The parameters $H$, $q$ and $s$ are estimated as shown in Equations 5.38, 5.39 and 5.37, respectively. For standardisation purposes, we introduce the following notations:

• $ZM_{rd}$ provides the probability of occurrence of a word with rank $r$ in the domain corpus $d$; and

• $ZM_{r\bar{d}}$ provides the probability of occurrence of a word with rank $r$ in the contrastive corpus $\bar{d}$.


In addition, we will also be using the K-mixture model as discussed in Section 5.3.3 for predicting the probability of occurrence of the term candidates in documents of the respective corpora. Recall that $P(0; \alpha_i, \beta_i)$ in Equation 5.22 gives us the probability that word $i$ does not occur in a document (i.e. the probability of non-occurrence), and $1 - P(0; \alpha_i, \beta_i)$ gives us the probability that word $i$ has at least one occurrence in a document (i.e. the probability of occurrence). The $\beta_i$ and $\alpha_i$ are computed based on Equations 5.20 and 5.21. The distribution of words in either $d$ or $\bar{d}$ can be obtained by defining the parameters of the K-mixture model over the respective corpora. We will employ the following notations:

• $KM_{ad}$ is the probability of occurrence of word $a$ in documents in the domain corpus $d$; and

• $KM_{a\bar{d}}$ is the probability of occurrence of word $a$ in documents in the contrastive corpus $\bar{d}$.
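Since Equations 5.20-5.22 are not reproduced in this excerpt, the following sketch uses the standard K-mixture parameterisation of Church and Gale ($\lambda = cf/N$, $\beta = (cf - df)/df$, $\alpha = \lambda/\beta$), which the notation above appears to follow; treat the exact parameter formulas, and the handling of the degenerate case, as assumptions of the sketch rather than the thesis' definitions.

def k_mixture_p_occurrence(cf, df, n_docs):
    # cf: collection frequency of the word, df: its document frequency,
    # n_docs: number of documents in the corpus.
    # Returns 1 - P(0; alpha, beta), the probability of at least one
    # occurrence of the word in a document.
    if df == 0:
        return 0.0
    if cf == df:
        # degenerate case: the word never repeats within a document; fall back
        # to the empirical document-level rate (a convention of this sketch)
        return df / n_docs
    lam = cf / n_docs                   # average occurrences per document
    beta = (cf - df) / df               # extra occurrences per document containing the word
    alpha = lam / beta
    p_zero = (1.0 - alpha) + alpha / (beta + 1.0)   # P(0; alpha, beta)
    return 1.0 - p_zero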

5.4.2 Formalising Evidences in a Probabilistic Framework

All existing techniques for term recognition are founded upon some heuristics

or linguistic theories that define what makes a term candidate relevant. However,

there are researchers [120, 107] who criticised such existing methods for the lack of

proper theorisation despite the reasonable intuitions behind them. Definition 5.4.2.1

highlights a list of commonly adopted characteristics for determining the relevance

of terms [120].

Definition 5.4.2.1. Characteristics of term relevance:

1 A term candidate is relevant to a domain if it appears relatively more frequently in that domain than in others.

2 A term candidate is relevant to a domain if it appears only in this one domain.

3 A term candidate relevant to a domain may have biased occurrences in that domain:

3.1 A term candidate of rare occurrence in a domain. Such candidates are also known as "hapax legomena", which manifest themselves as the long tail in Zipf's law.

3.2 A term candidate of common occurrence in a domain.

4 Following from Definitions 5.4.3.4 and 5.4.3.5, a complex term candidate is relevant to a domain if its head is specific to that domain.


We propose a series of evidences, as listed below, to capture the individual characteristics presented in Definitions 5.4.2.1 and 5.4.3. They are as follows:

• Evidence 1: Occurrence of term candidate a

• Evidence 2: Existence of term candidate a

• Evidence 3: Specificity of the head ah of term candidate a

• Evidence 4: Uniqueness of term candidate a

• Evidence 5: Exclusivity of term candidate a

• Evidence 6: Pervasiveness of term candidate a

• Evidence 7: Clumping tendency of term candidate a

These seven sources of evidence are used to compute the corresponding evidential weights $O_i$, which in turn are summed to produce the final ranking using $OT$ as defined in Equation 5.34. Since $OT$ serves as a probabilistically-derived formulaic realisation of our Aim 2, we can consider the various $O_i$ as manifestations of sub-aims of Aim 2. The formulation of the evidential weights begins with the associated definitions and sub-aims. Each sub-aim attempting to realise the associated definition has an equivalent mathematical formulation. The formula is then expanded into a series of probability functions connected through the addition and multiplication rules. A consolidated computational sketch of the seven evidential weights is given at the end of this section. There are four basic probability distributions that are required to compute the various evidential weights:

• P(occurrence of $a$ in $d$) $= P(a, d)$: This distribution provides the probability of occurrence of $a$ in the domain corpus $d$. By ranking the term candidates according to their frequency of occurrence in domain $d$, each term candidate will have a rank $r$. We employ $ZM_{rd}$ described in Section 5.4.1 for this purpose. For brevity, we use $P(a, d)$ to denote the probability of occurrence of term candidate $a$ in the domain corpus $d$.

• P(occurrence of $a$ in $\bar{d}$) $= P(a, \bar{d})$: This distribution provides the probability of occurrence of $a$ in the contrastive corpus $\bar{d}$. By ranking the term candidates according to their frequency of occurrence in the contrastive corpus $\bar{d}$, each term candidate will have a rank $r$. We employ $ZM_{r\bar{d}}$ described in Section 5.4.1 for this purpose. For brevity, we use $P(a, \bar{d})$ to denote the probability of occurrence of term candidate $a$ in the contrastive corpus $\bar{d}$.

• P(occurrence of $a$ in documents in $d$) $= P_K(a, d)$: This distribution provides the probability of occurrence of $a$ in documents in the domain corpus $d$, where the subscript $K$ refers to the K-mixture. One should be immediately reminded of $KM_{ad}$ described in Section 5.4.1. For brevity, we employ $P_K(a, d)$ to denote the probability of occurrence of term candidate $a$ in documents in the domain corpus $d$.

• P(occurrence of $a$ in documents in $\bar{d}$) $= P_K(a, \bar{d})$: This distribution provides the probability of occurrence of $a$ in documents in the contrastive corpus $\bar{d}$. We employ the distribution provided by $KM_{a\bar{d}}$ described in Section 5.4.1. For brevity, we use $P_K(a, \bar{d})$ to denote the probability of occurrence of term candidate $a$ in documents in the contrastive corpus $\bar{d}$.
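To make the roles of these four distributions concrete, the minimal sketch below (names and interface are ours, not the thesis') evaluates the Zipf-Mandelbrot mass $ZM_r = H/(r+q)^s$, which follows from Equation 5.35, at a candidate's rank in each corpus, and pairs it with the candidate's K-mixture document-level probabilities.

from dataclasses import dataclass

@dataclass
class CandidateDistributions:
    p_d: float      # P(a, d): occurrence probability in the domain corpus
    p_dbar: float   # P(a, d-bar): occurrence probability in the contrastive corpus
    pk_d: float     # PK(a, d): probability of occurring in a domain document
    pk_dbar: float  # PK(a, d-bar): probability of occurring in a contrastive document

def zipf_mandelbrot(rank, s, H, q):
    # ZM_r = H / (r + q)^s, from the linearised form in Equation 5.35
    return H / (rank + q) ** s

def candidate_distributions(rank_d, rank_dbar, zm_params_d, zm_params_dbar, km_d, km_dbar):
    # zm_params_* are (s, H, q) tuples fitted on the respective corpora;
    # km_* are the candidate's K-mixture occurrence probabilities.
    return CandidateDistributions(
        p_d=zipf_mandelbrot(rank_d, *zm_params_d),
        p_dbar=zipf_mandelbrot(rank_dbar, *zm_params_dbar),
        pk_d=km_d,
        pk_dbar=km_dbar,
    )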

Since the probability masses $P(a, d)$ and $P(a, \bar{d})$ described above are defined over the sample space of all words in the respective corpus (i.e. either $d$ or $\bar{d}$), we have that for any term candidate $a \in W$:

• $0 \leq P(a, d) \leq 1$

• $0 \leq P(a, \bar{d}) \leq 1$

and

• $\sum_{\forall a \in W} P(a, d) = 1$

• $\sum_{\forall a \in W} P(a, \bar{d}) = 1$

On the other hand, the other two distributions, namely $P_K(a, d)$ and $P_K(a, \bar{d})$, are defined over the sample space of all possible numbers of occurrences $k = 0, 1, 2, \ldots, n$ of a particular term candidate $a$ in a document using the K-mixture model. Hence,

• $0 \leq P_K(a, d) \leq 1$

• $0 \leq P_K(a, \bar{d}) \leq 1$

but

• $\sum_{\forall a \in W} P_K(a, d) \neq 1$

• $\sum_{\forall a \in W} P_K(a, \bar{d}) \neq 1$


In addition to the axioms above, there are three sets of related properties that require further clarification. The first set concerns the events of occurrence and non-occurrence of term candidates in the domain corpus $d$ and in the contrastive corpus $\bar{d}$:

Property 1. Properties of the probability distributions of occurrence and non-occurrence of term candidates in the domain corpus and the contrastive corpus:

1) The events of the occurrence of $a$ in $d$ and in $\bar{d}$ are not mutually exclusive. In other words, the occurrence of $a$ in $d$ does not imply the non-occurrence of $a$ in $\bar{d}$. This is true since any term candidate can occur in $d$, in $\bar{d}$, or in both, either intentionally or by accident.

$P(\text{occurrence of } a \text{ in } d \cap \text{occurrence of } a \text{ in } \bar{d}) \neq 0$

2) The occurrence of words in $d$ does not affect (i.e. is independent of) the probability of their occurrence in the other domains $\bar{d}$, and vice versa.

$P(\text{occurrence of } a \text{ in } d \cap \text{occurrence of } a \text{ in } \bar{d}) = P(a, d)P(a, \bar{d})$

3) The events of occurrence and non-occurrence of the same candidate within the same domain are complementary.

$P(\text{non-occurrence of } a \text{ in } d) = 1 - P(a, d)$

4) Following from 1), the events of occurrence in $d$ and non-occurrence in $\bar{d}$ (and vice versa) of the same term are not mutually exclusive, since candidate $a$ can occur in both $d$ and $\bar{d}$.

$P(\text{occurrence of } a \text{ in } d \cap \text{non-occurrence of } a \text{ in } \bar{d}) \neq 0$

5) Following from 2), the events of occurrence in $d$ and non-occurrence in $\bar{d}$ (and vice versa) of the same term are also independent.

$P(\text{occurrence of } a \text{ in } d \cap \text{non-occurrence of } a \text{ in } \bar{d}) = P(a, d)(1 - P(a, \bar{d}))$

The second set of properties is concerned with complex term candidates. Each complex candidate $a$ is made up of a head $a_h$ and a set of modifiers $M_a$. Since candidate $a$ and its head $a_h$ have the possibility of both occurring in $d$ or in $\bar{d}$, they are not mutually exclusive. As such, the probability of the union of the two events of occurrence is not the sum of the individual probabilities of occurrence. Lastly, we will assume that the occurrences of candidate $a$ and its head $a_h$ within the same domain (i.e. either $d$ or $\bar{d}$) are independent. While this may not be the case in reality, as we shall see later, such a property allows us to provide estimates for many non-trivial situations. As such,

Property 2. The mutual exclusion and independence properties of the occurrence of term candidate $a$ and its head $a_h$ within the same corpus (i.e. either in $d$ or in $\bar{d}$):

$P(\text{occurrence of } a \text{ in } d \cap \text{occurrence of } a_h \text{ in } d) \neq 0$

$P(\text{occurrence of } a \text{ in } d \cap \text{occurrence of } a_h \text{ in } d) = P(a, d)P(a_h, d)$

$P(\text{occurrence of } a \text{ in } d \cup \text{occurrence of } a_h \text{ in } d) = P(a, d) + P(a_h, d) - P(a, d)P(a_h, d)$
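As a purely illustrative numerical check of the union in Property 2 (the figures below are invented and do not come from the corpora used in this chapter):

\[
P(a,d) = 0.02,\quad P(a_h,d) = 0.05 \;\Rightarrow\;
P(a \cup a_h \text{ in } d) = 0.02 + 0.05 - (0.02)(0.05) = 0.069,
\]

slightly below the plain sum $0.07$ because the joint occurrence of the candidate and its head is only counted once.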

The last set of properties concerns the occurrence of candidates in documents in the corpus. Since the probability of occurrence of a candidate in documents is derived from a Poisson mixture, it follows that the probability of occurrence (where $k \geq 1$) of a candidate in documents is the complement of the probability of non-occurrence of that candidate (where $k = 0$).

Property 3. The complementation property of the occurrence and non-occurrence of term candidate $a$ in documents within the same domain (i.e. either in $d$ or in $\bar{d}$):

$P(\text{non-occurrence of } a \text{ in documents in } d) = 1 - P_K(a, d)$

Next, we move on to define the odds that correspond to each of the sources of evidence laid out earlier.

• Odds of Occurrence: The first evidential weight $O_1$ attempts to realise Definition 5.4.2.1.1. $O_1$ captures the odds of whether $a$ occurs in $d$ or in $\bar{d}$. The notion of occurrence is the simplest among all the weights, and it is one upon which most other evidential weights are founded. Formally, $O_1$ can be described as

Sub-Aim 1. What are the odds of term candidate $a$ occurring in $d$?

and can be mathematically formulated as:

$O_1 = \dfrac{P(\text{occurrence of } a \mid R_1)}{P(\text{occurrence of } a \mid R_2)} = \dfrac{P(\text{occurrence of } a \text{ in } d)}{P(\text{occurrence of } a \text{ in } \bar{d})} = \dfrac{P(a, d)}{P(a, \bar{d})} \qquad (5.42)$


• Odds of Existence: Similar to O1, the second evidential weight O2 attempts to

realise Definition 5.4.2.1.1 but keeping in mind Definition 5.4.3.3 and Definition

5.4.3.4. We can consider O2 as a realistic extension of O1 for reasons to be

discussed below. Since the non-occurrence of term candidates in the corpus

does not imply their conceptual absence or non-existence, we would like O2 to

capture the following:

Sub-Aim 2. What are the odds of term candidate a being in existence in d?

The main issue related to the probability of occurrence is the fact that a big

portion of candidates rest along the long tail of Zipf’s Law. Since most of the

candidates are rare, their probabilities of occurrences alone do not reflect their

actual existence or intended usage. What makes the situation worse is that

longer words tend to have lower rates of occurrence [249], based on Definition 5.4.3.3. In the words of Zipf [284], "it seems reasonably clear that shorter words are distinctly more favoured in language than longer words". For example, consider the case where we observe "bridge" occurring more often in the

computer networking domain than its complex counterpart “network bridge”.

The observed events do not imply that the concept represented by the com-

plex term is different or of any less importance to the domain simply because

“network bridge” occurs less than “bridge”. The fact that authors are more

predisposed at using shorter terms whenever possible to represent the same

concept demonstrate the Principle of Least Effort, which is the foundation be-

hind most of Zipf’s Laws. This brings us to Definition 5.4.3.4 which requires

us to assign higher importance to complex terms than their heads appearing as

simple terms. We need to ensure that O2 captures these requirements. We can

extend the lexical occurrence of complex candidates conceptually by including

the lexical occurrence of their heads. Since the events of occurrences of a and

its head ah are not mutually exclusive as discussed in Property 2, we will need

to subtract the probability of the intersection of these two events from the sum

of the two probabilities to obtain the probability of the union. Following this,

and based on the assumptions about the probability of occurrences of complex

candidates and their heads in Property 2, we can mathematically formulate

O2 as

$O_2 = \dfrac{P(\text{existence of } a \mid R_1)}{P(\text{existence of } a \mid R_2)} = \dfrac{P(\text{existence of } a \text{ in } d)}{P(\text{existence of } a \text{ in } \bar{d})}$

$= \dfrac{P(\text{occurrence of } a \text{ in } d \cup \text{occurrence of } a_h \text{ in } d)}{P(\text{occurrence of } a \text{ in } \bar{d} \cup \text{occurrence of } a_h \text{ in } \bar{d})}$

$= \dfrac{P(a, d) + P(a_h, d) - P(a, d)P(a_h, d)}{P(a, \bar{d}) + P(a_h, \bar{d}) - P(a, \bar{d})P(a_h, \bar{d})}$

In the case where candidate $a$ is simple, the probability of occurrence of its head $a_h$, and the probability of both $a$ and $a_h$ occurring, evaluate to zero. As a result, the second evidential weight $O_2$ for a simple term will be equivalent to its first evidential weight $O_1$. Such a formulation satisfies the additional Definition 5.4.3.5 that requires us to allocate higher weights to complex terms.

• Odds of Specificity: The third evidential weight O3 specifically focuses on

Definition 5.4.2.1.4 for complex term candidates. O3 is meant for capturing

the odds of whether the inherently ambiguous head ah of a complex term a is

specific to d. If the heads ah of complex terms are found to occur individually

without a in large numbers across different domains, then the specificity of the

concept represented by ah with regard to d may be questionable. O3 can be

formally stated as:

Sub-Aim 3. What are the odds that the head ah of a complex term candidate

a is specific to d?

The head of a complex candidate is considered to be specific to a domain if the head and the candidate itself both have a higher tendency of occurring together in that domain. The higher the intersection of the events of occurrence of $a$ and $a_h$ in a certain domain, the more specific $a_h$ is to that domain. For example, if the event of both "bridge" and "network bridge" occurring together in the computer networking domain is very high, this means the possibly ambiguous head "bridge" is used in a very specific context in that domain. In such cases, when "bridge" is encountered in the domain of computer networking, one can safely deduce that it refers to the same domain-specific concept as "network bridge". Consequently, the more specific the head $a_h$ is with respect to $d$, the less ambiguous its occurrence is in $d$. It follows from Definition 5.4.2.1.4 that the less ambiguous $a_h$ is, the higher the chances of its complex counterpart $a$ being relevant to $d$. Based on the assumptions about the probability of occurrence of complex candidates and their heads in Property 2, we define the third evidential weight for complex term candidates as:

$O_3 = \dfrac{P(\text{specificity of } a \mid R_1)}{P(\text{specificity of } a \mid R_2)} = \dfrac{P(\text{specificity of } a \text{ to } d)}{P(\text{specificity of } a \text{ to } \bar{d})}$

$= \dfrac{P(\text{occurrence of } a \text{ in } d \cap \text{occurrence of } a_h \text{ in } d)}{P(\text{occurrence of } a \text{ in } \bar{d} \cap \text{occurrence of } a_h \text{ in } \bar{d})} = \dfrac{P(a, d)P(a_h, d)}{P(a, \bar{d})P(a_h, \bar{d})}$

• Odds of Uniqueness: The fourth evidential weight $O_4$ realises Definition 5.4.2.1.2 by capturing the odds of whether $a$ is unique to $d$ or to $\bar{d}$. The notion of uniqueness defined here will be employed in the computation of the next two evidential weights, $O_5$ and $O_6$. Formally, $O_4$ can be described as

Sub-Aim 4. What are the odds of term candidate $a$ being unique to $d$?

A term candidate is considered to be unique if it occurs in only one domain and not others. Based on the assumptions on the probability of occurrence and non-occurrence in Property 1, $O_4$ can be mathematically formulated as:

$O_4 = \dfrac{P(\text{uniqueness of } a \mid R_1)}{P(\text{uniqueness of } a \mid R_2)} = \dfrac{P(\text{uniqueness of } a \text{ to } d)}{P(\text{uniqueness of } a \text{ to } \bar{d})}$

$= \dfrac{P(\text{occurrence of } a \text{ in } d \cap \text{non-occurrence of } a \text{ in } \bar{d})}{P(\text{occurrence of } a \text{ in } \bar{d} \cap \text{non-occurrence of } a \text{ in } d)} = \dfrac{P(a, d)(1 - P(a, \bar{d}))}{P(a, \bar{d})(1 - P(a, d))}$

• Odds of Exclusivity: The fifth evidential weight $O_5$ realises Definition 5.4.2.1.3.1 by capturing the odds of whether $a$ is more exclusive in $d$ or in $\bar{d}$. Formally, $O_5$ can be described as

Sub-Aim 5. What are the odds of term candidate a being exclusive in d?

Something is regarded as exclusive if it exists only in a category (i.e. unique

to that category) with certain restrictions such as limited usage. It is obvious

at this point that a term candidate which is unique and rare in a domain is

considered as exclusive in that domain. There are several ways of realising

the rarity of terms in domains. For example, one can employ some measures


of vocabulary richness or diversity [65] to quantify the extent of dispersion or

concentration of term usage in a particular domain. However, the question of

how such measures can be integrated into probabilistic frameworks such as the

one proposed in this chapter remains a challenge. We propose the view that

terms are considered as rare in a domain if they exist only in certain aspects

of that domain. For example, in the domain of computer networking, we may

encounter terms like “Fiber distributed data interface” and “WiMAX”. They

may both be relevant to the domain but their distributional behaviour in the

domain corpus is definitely different. While both may appear to represent

certain similar concepts such as “high-speed transmission”, their existence are

biased to different aspects of the same domain. The first term may be biased

to certain aspects characterised by concepts such as “token ring” and “local

area network”, while the second may appear biased to aspects like “wireless

network” and “mobile application”. We propose to realise the notion of “do-

main aspects” through the documents that the domain contains. We consider

the documents that make up the domain corpus as discussions of the various

aspects of a domain. Consequently, a term candidate can be considered as rare

if it has a low probability of occurrence in documents in the domain. Please

note the difference in the probability of occurrence in a domain versus the

probability of occurrence in documents in a domain. Following this, Property

3 and the probability of uniqueness discussed as part of O4, we define the fifth

evidential weight as:

$O_5 = \dfrac{P(\text{exclusivity of } a \mid R_1)}{P(\text{exclusivity of } a \mid R_2)} = \dfrac{P(\text{exclusivity of } a \text{ in } d)}{P(\text{exclusivity of } a \text{ in } \bar{d})}$

$= \dfrac{P(\text{uniqueness of } a \text{ to } d)\,P(\text{rarity of } a \text{ in } d)}{P(\text{uniqueness of } a \text{ to } \bar{d})\,P(\text{rarity of } a \text{ in } \bar{d})} = \dfrac{P(a, d)(1 - P(a, \bar{d}))\,P(\text{rarity of } a \text{ in } d)}{P(a, \bar{d})(1 - P(a, d))\,P(\text{rarity of } a \text{ in } \bar{d})}$

If we subscribe to our definition of rarity proposed above, where $P(\text{rarity of } a \text{ in } d) = 1 - P_K(a, d)$, then

$O_5 = \dfrac{P(a, d)(1 - P(a, \bar{d}))(1 - P_K(a, d))}{P(a, \bar{d})(1 - P(a, d))(1 - P_K(a, \bar{d}))}$

The higher the probability that candidate $a$ has no occurrence in documents in $d$, the rarer it becomes in $d$. Whenever the occurrence of $a$ increases in documents in $d$, $1 - P_K(a, d)$ decreases (i.e. the candidate becomes less rare), and this leads to a decrease in the overall $O_5$, or exclusivity. One may have noticed that the interpretation of the notion of rarity using the occurrences of candidates in documents may not be the most appropriate. Since the usage of terms in documents has an effect on the existence of terms in the domain, the independence assumption required to take the product of the probability of uniqueness and the probability of rarity does not hold in reality.

• Odds of Pervasiveness: The sixth evidential weight O6 attempts to capture

Definition 5.4.2.1.3.2. Formally, O6 can be described as

Sub-Aim 6. What are the odds of term candidate a being pervasive in d?

Something is considered to be pervasive if it exists very commonly in only

one category. This makes the notion of commonness the opposite of rarity. A

term candidate is said to be common if it occurs in most aspects of a domain.

In other words, among all documents discussing a domain, the term

candidate has a high probability of occurring in most or nearly all of them.

Following this and Property 3, we define the sixth evidential weight as:

$O_6 = \dfrac{P(\text{pervasiveness of } a \mid R_1)}{P(\text{pervasiveness of } a \mid R_2)} = \dfrac{P(\text{pervasiveness of } a \text{ in } d)}{P(\text{pervasiveness of } a \text{ in } \bar{d})}$

$= \dfrac{P(\text{uniqueness of } a \text{ to } d)\,P(\text{commonness of } a \text{ in } d)}{P(\text{uniqueness of } a \text{ to } \bar{d})\,P(\text{commonness of } a \text{ in } \bar{d})} = \dfrac{P(a, d)(1 - P(a, \bar{d}))\,P(\text{commonness of } a \text{ in } d)}{P(a, \bar{d})(1 - P(a, d))\,P(\text{commonness of } a \text{ in } \bar{d})}$

If we follow the definition of rarity introduced as part of the fifth evidential weight $O_5$, then the notion of commonness is the complement of rarity, which means $P(\text{commonness of } a \text{ in } d) = 1 - P(\text{rarity of } a \text{ in } d)$, or

$O_6 = \dfrac{P(a, d)(1 - P(a, \bar{d}))\,P_K(a, d)}{P(a, \bar{d})(1 - P(a, d))\,P_K(a, \bar{d})}$

Similar to the interpretation of the notion of rarity, the use of the probabil-

ity of non-occurrence in documents may not be the most suitable since the

independence assumption that we need to make does not hold in reality.

• Odds of Clumping Tendency: The last evidential weight involves the use of

contextual information. There are two main issues involved in utilising con-

textual information. First is the question of what constitutes the context of


a term candidate, and secondly, how to cultivate and employ contextual ev-

idence. Regarding the first issue, it is obvious that the importance of terms

and context in characterising domain concepts should be reflected through

their heavy participation in the states or actions expressed by verbs. Follow-

ing this, we put forward a few definitions related to what context is, and the

relationship between terms and context.

Definition 5.4.2.2. Words that are contributors to the same state or action

as a term can be considered as context related to that term.

Definition 5.4.2.3. Relationship between terms and their context

1 In relation to Definition 5.4.3.2, terms tend to congregate at different

parts of text to describe or characterise certain aspects of the domain d.

2 Following the above, terms which clump together will eventually be each other's context.

3 Following the above, context words which are also terms (i.e. context terms) are more likely to be better context since actual terms tend to clump.

4 Context words which are also semantically related to their terms are more

qualified at describing those terms.

Regarding the second issue highlighted above, we can employ context to pro-

mote or demote the rank of terms based on the terms’ tendency to clump.

Following Definition 5.4.2.3, terms with higher tendency to clump or occur to-

gether with their context should be promoted since such candidates are more

likely to be the “actual” terms in a domain. Some readers may be able to

recall Definition 5.4.2, which states that the meaning of terms should be

independent from their context. We would like to point out that the func-

tion of this last evidential weight is not to infer the meaning of terms from

their context, since such an action conflicts with Definition 5.4.2. Instead, we

employ context to investigate and reveal an important characteristic of terms

as defined in Definition 5.4.3.2 and 5.4.2.3, namely, the tendency of terms to

clump. We employ the linguistically-motivated technique by Wong et al. [275]

to extract term candidates together with their context words in the form of

instantiated sub-categorisation frames [271]. This seventh evidential weight

O7 attempts to realise Definition 5.4.3.2 and 5.4.2.3. Formally, O7 can be

described as


Sub-Aim 7. What are the odds that term candidate a clumps with its context

Ca in d?

We can compute the clumping tendency of candidate $a$ and its context words as the probability of candidate $a$ occurring together with any of its context words. The higher the probability of candidate $a$ and its context words occurring together in the same domain, the more likely it is that they clump. Since related context words are more qualified at describing the terms based on Definition 5.4.2.3.4, we have to include a semantic relatedness measure for that purpose. We employ $P_{sim}(a, c)$ to estimate the probability of relatedness between candidate $a$ and its context word $c \in C_a \cap TC$. $P_{sim}(a, c)$ is implemented using the semantic relatedness measure NGD by Cilibrasi & Vitanyi [50, 261], which has been discussed in Section 5.3.2. Letting $C_a \cap TC$ be the set of context terms, the last evidential weight can be mathematically formulated as:

$O_7 = \dfrac{P(\text{clumping of } a \text{ with its related context} \mid R_1)}{P(\text{clumping of } a \text{ with its related context} \mid R_2)} = \dfrac{P(\text{clumping of } a \text{ with its related context in } d)}{P(\text{clumping of } a \text{ with its related context in } \bar{d})}$

$= \dfrac{P(\text{occurrence of } a \text{ with any related } c \in C_a \cap TC \text{ in } d)}{P(\text{occurrence of } a \text{ with any related } c \in C_a \cap TC \text{ in } \bar{d})}$

$= \dfrac{\sum_{\forall c \in C_a \cap TC} P(\text{occurrence of } a \text{ in } d \cap \text{occurrence of } c \text{ in } d)\,P_{sim}(a, c)}{\sum_{\forall c \in C_a \cap TC} P(\text{occurrence of } a \text{ in } \bar{d} \cap \text{occurrence of } c \text{ in } \bar{d})\,P_{sim}(a, c)}$

$= \dfrac{\sum_{\forall c \in C_a \cap TC} P(a, d)P(c, d)\,P_{sim}(a, c)}{\sum_{\forall c \in C_a \cap TC} P(a, \bar{d})P(c, \bar{d})\,P_{sim}(a, c)}$
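To make the seven evidential weights concrete, the sketch below (not from the thesis; names and interface are ours) computes each odds directly from the probabilities defined in this section, using the reformulations derived above. For simple terms the head probabilities are passed as zero, which reduces $O_2$ to $O_1$ and leaves $O_3$ undefined; the final combination into $OT$ is written as a plain sum of the available $O_i$, following the description of Equation 5.34 given earlier (the equation itself is not reproduced in this excerpt).

def evidential_weights(p_d, p_dbar, ph_d, ph_dbar, pk_d, pk_dbar, context, sims):
    # p_d, p_dbar   : P(a, d), P(a, d-bar)
    # ph_d, ph_dbar : P(a_h, d), P(a_h, d-bar); use 0.0 for simple terms
    # pk_d, pk_dbar : PK(a, d), PK(a, d-bar)
    # context       : list of (P(c, d), P(c, d-bar)) for context terms c in C_a and TC
    # sims          : list of Psim(a, c) for the same context terms
    o1 = p_d / p_dbar                                                       # occurrence
    o2 = (p_d + ph_d - p_d * ph_d) / (p_dbar + ph_dbar - p_dbar * ph_dbar)  # existence
    o3 = (p_d * ph_d) / (p_dbar * ph_dbar) if ph_d and ph_dbar else None    # specificity (complex terms only)
    o4 = (p_d * (1 - p_dbar)) / (p_dbar * (1 - p_d))                        # uniqueness
    o5 = o4 * (1 - pk_d) / (1 - pk_dbar)                                    # exclusivity
    o6 = o4 * pk_d / pk_dbar                                                # pervasiveness
    num = sum(p_d * pc_d * s for (pc_d, _), s in zip(context, sims))
    den = sum(p_dbar * pc_dbar * s for (_, pc_dbar), s in zip(context, sims))
    o7 = num / den if den else None                                         # clumping tendency
    return [o1, o2, o3, o4, o5, o6, o7]

def odds_of_termhood(weights):
    # OT as the sum of the available evidential weights (after the description
    # of Equation 5.34 above, which combines the O_i by summation)
    return sum(o for o in weights if o is not None)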

5.5 Evaluations and Discussions

In this evaluation, we studied the ability of our new probabilistic measure known

as the Odds of Termhood (OT) in separating domain-relevant terms from general

ones. We contrasted our new measure with three existing scoring and ranking

schemes, namely, Contrastive Weight (CW), NCvalue (NCV) and Termhood (TH).

The implementations of CW, NCV and TH are in accordance with Equations 5.1 and 5.2, 5.9, and 5.12, respectively. The evaluations of the four termhood measures were

conducted in two parts:

• Part 1: Qualitative evaluation through the analysis and discussion based on

the frequency distribution and measures of dispersion, central tendency and

correlation.


• Part 2: Quantitative evaluation through the use of performance measures,

namely, precision, recall, F-measure and accuracy, and the GENIA annotated

text corpus as the gold standard G. In addition, the approach we chose to

evaluate the termhood measures provides a way to automatically decide on a

threshold for accepting and rejecting the ranked term candidates.

For both parts of our evaluation, we employed a dataset containing a domain corpus

describing the domain of molecular biology, and a contrastive corpus which spans

across twelve different domains other than molecular biology. The datasets are de-

scribed in Figure 5.1 in Section 5.2. Using the part-of-speech tags in the GENIA

corpus, we extracted the maximal noun phrases as term candidates. Due to the large

number of distinct words (over 400,000 as mentioned in Section 5.2) in the GENIA corpus, we have extracted over 40,000 lexically-distinct maximal noun phrases. The

large number of distinct noun phrases is due to the absence of preprocessing to nor-

malise the lexical variants of the same concept. For practical reasons, we randomly

sampled the set of noun phrases for distinct term candidates. The resulting set

TC contains 1,954 term candidates. Following this, we performed the scoring and ranking procedure using the four measures included in this evaluation on the set of 1,954 term candidates.

5.5.1 Qualitative Evaluation

In the first part of the evaluation, we analysed the frequency distributions of the

ranked term candidates generated by the four measures. Figures 5.7 and 5.8 show

the frequency distributions of the candidates ranked in descending order according to the scores assigned by the respective measures. One can notice

the interesting trends from the graphs by CW and NCV in Figures 5.8(b) and

5.8(a). The first half of the graph by CW , prior to the sudden surge of frequency,

consists of only complex terms. Complex terms tend to have lower word counts

compared to simple terms and hence, the disparity in the frequency distribution as

shown in Figure 5.8(b). This is attributed to the biased treatment given to complex

terms evident in Equation 5.2. However, priority is also given to complex terms by

TH but, as one can see from the distribution of candidates by TH, such an undesirable trend does not occur. One explanation is the heavy reliance on frequency by CW, while TH attempts to diversify the evidence in the computation of weights.

While frequency may be a reliable source of evidence, the use of it alone is definitely

inadequate [37]. As for NCV , Figure 5.8(a) reveals that scores are assigned to


Figure 5.7: Distribution of the 1,954 terms extracted from the domain corpus $d$, sorted according to the corresponding scores provided by OT and TH. (a) Candidates ranked in descending order according to the scores assigned by OT. (b) Candidates ranked in descending order according to the scores assigned by TH. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies.


Figure 5.8: Distribution of the 1,954 terms extracted from the domain corpus $d$, sorted according to the corresponding scores provided by NCV and CW. (a) Candidates ranked in descending order according to the scores assigned by NCV. (b) Candidates ranked in descending order according to the scores assigned by CW. The single dark smooth line stretching from the left (highest value) to the right (lowest value) of the graph is the scores assigned by the respective measures. As for the two oscillating lines, the dark line is the domain frequencies while the light one is the contrastive frequencies.


Figure 5.9: The means µ of the scores, standard deviations σ of the scores, sum

of the domain frequencies and of the contrastive frequencies of all term candidates,

and their ratio.

Figure 5.10: The Spearman rank correlation coefficients ρ between all possible pairs

of measures under evaluation.

candidates by NCV based solely on the domain frequency. In other words, the

measure NCV lacks the required contrastive analysis. As we have pointed out,

terms can be ambiguous and we must not ignore the cross-domain distributional

behaviour of terms. In addition, upon inspecting the actual list of ranked candidates,

we noticed that higher scores were assigned to candidates which were accompanied

by more context words. Another positive trait that TH exhibits is its ability to

assign higher scores to terms which occur relatively more frequently in $d$ than in $\bar{d}$. This is evident through the gap between $f_d$ (the dark oscillating line) and $f_{\bar{d}}$ (the light oscillating line), especially at the beginning of the x-axis in Figure 5.7(b). One can notice that candidates towards the end of the x-axis are those with $f_{\bar{d}} > f_d$. The same can be said about our new measure OT. However, the discriminating power of OT is apparently better since the gap between $f_d$ and $f_{\bar{d}}$ is larger and lasts longer. Figure

5.9 summarises the mean and standard deviation of the weights generated by the

various measures. One can notice the extremely high dispersion from the mean of

the scores generated by CW and NCV . We speculate that such trends are due to

the erratic assignments of weights, heavily influenced by frequencies. In addition,

we employed the Spearman rank correlation coefficient to study the possibility of

any correlation between the four ranking schemes under evaluation. Figure 5.10

summarises the correlation coefficients between the various measures. Note that


there is a relatively strong correlation between the ranks produced by our new

probabilistic measure OT and the ranks by the ad-hoc measure TH. The correlation

of TH with OT revealed the possibility of providing mathematical justifications for

the former’s heuristically-motivated ad-hoc technique using a general probabilistic

framework.

5.5.2 Quantitative Evaluation

Figure 5.11: An example of a contingency table. The values in the cells TP , TN ,

FP and FN are employed to compute the precision, recall, Fα and accuracy. Note

that |TC| is the total number of term candidates in the input set TC, and |TC| =

TP + FP + FN + TN .

In the second part of the evaluation, we employed the gold standard G generated

from the GENIA corpus as discussed in Section 5.2 for evaluating our new term

recognition technique using OT and three other existing ones (i.e. TH, NCV and

CW ). We employed four measures [166] common to the field of information retrieval

for performance comparison. These performance measures are precision, recall, Fα-

measure, and accuracy. These measures are computed by constructing a contingency

table as shown in Figure 5.11:

precision =TP

TP + FP

recall =TP

TP + FN

Fα =(1 + α)(precision× recall)

(α× precision) + recall

accuracy =TP + TN

TP + FP + FN + TN

where TP , TN , FP and FN are values from the four cells of the contingency table

shown in Figure 5.11, and α is the weight for recall within the range (0,∞). It

suffices to know that as the α value increases, the weight of recall increases in the


measure [204]. Two common α values are 0.5 and 2. F2 weighs recall twice as

much as precision, and precision in F0.5 weighs two times more than recall. Recall

and precision are evenly weighted in the traditional F1 measure. Before presenting

the results for this second part of the evaluation, there are several points worth

clarifying. Firstly, the gold standard G is an unordered collection of terms

whose domain relevance has been established by experts. Secondly, as part of the

evaluation, each term candidate a ∈ TC will be assigned a score by the respective

termhood measures. These scores are used to rank the candidates in descending

order where larger scores correspond to higher ranks. As a result, there will be

four new sets of ranked term candidates, each corresponding to a measure under

evaluation. For example, $TC_{NCV}$ is the output set after the scoring and ranking of the input term candidates TC by the measure NCV. The outputs from the measures TH, OT and CW are $TC_{TH}$, $TC_{OT}$ and $TC_{CW}$, respectively. We would like to remind the readers that $|TC| = |TC_{TH}| = |TC_{OT}| = |TC_{CW}| = |TC_{NCV}|$. The individual elements of the output sets appear as $a_i$, where $a$ is the term candidate from TC and $i$ is the rank. Next, the challenge lies in how the resulting sets of ranked term candidates should be evaluated using the gold standard G. Generally, as in information retrieval, a binary classification is performed. In other words, we try to find a match in G for every $a_i$ in $TC_X$, where $X$ is any of the measures under evaluation. A positive match indicates that $a_i$ is a term, while no match implies that $a_i$ is a non-term. However, there is a problem with this approach. The elements (i.e. term candidates) in the four output sets are essentially the same. The difference between the four sets lies in the ranks of the term candidates, and not in the candidates themselves. In other words, the simple attempt of trying to find the number of matches in every $TC_X$ with G will produce the same results (i.e. the same precision, recall, F-score and accuracy). Obviously, the ranks $i$ assigned to the ranked term candidates $a_i$ by the different termhood measures have a role to play. Following this, a cut-off point (i.e. threshold) for the ranked term candidates needs to be employed. To have an unbiased comparison of the four measures using the gold standard, we have to ensure that the cut-off rank for each termhood measure is optimal. Manually deciding on the four different "magic numbers" for the four measures is a challenging and undesirable task.

To overcome the challenges involved in performing an unbiased comparative

study on the four termhood measures as discussed above, we propose to fairly exam-

ine their performance and possibly, to decide on the optimal cut-off ranks through

rank binning. We briefly describe the process of rank binning for each set $TC_X$:

• Decide on a standard size $b$ for the bins;

• Create $n = \lceil |TC_X|/b \rceil$ rank bins, where $\lceil y \rceil$ is the ceiling value of $y$;

• Each bin $B^X_j$ is assigned a rank $j$ where $1 \leq j \leq \lceil |TC_X|/b \rceil$, and $X$ is the identifier for the corresponding measure (i.e. NCV, CW, OT or TH). Bin $B^X_1$ is considered a higher bin compared to $B^X_2$, and so on; and

• Distribute the ranked term candidates in set $TC_X$ to their respective bins. Bin $B^X_1$ will hold the first $b$ ranked term candidates from set $TC_X$. In general, bin $B^X_j$ will contain the first $(j \times b)$ ranked term candidates from $TC_X$, where $1 \leq j \leq \lceil |TC_X|/b \rceil$. Obviously, there is an exception for the last bin $B^X_n$, where $n = \lceil |TC_X|/b \rceil$, if $b$ is not a factor of $|TC_X|$ (i.e. $|TC_X|$ is not divisible by $b$). In such exceptions, the last bin $B^X_n$ simply contains all the ranked term candidates in $TC_X$.
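The rank-binning evaluation can be sketched as follows (not the thesis implementation; names are ours). Each bin is cumulative, holding the top $j \times b$ candidates as described above, and the indicators follow the precision, recall, $F_\alpha$ and accuracy formulas given at the beginning of this section.

import math

def evaluate_by_rank_binning(ranked_candidates, gold_standard, b=200, alpha=1.0):
    # ranked_candidates: term candidates sorted by descending score
    # gold_standard: set of candidates judged to be actual terms
    total = len(ranked_candidates)
    n_bins = math.ceil(total / b)
    results = []
    for j in range(1, n_bins + 1):
        cut = min(j * b, total)
        predicted_terms = set(ranked_candidates[:cut])      # bin B_j: top j*b candidates
        predicted_nonterms = set(ranked_candidates[cut:])
        tp = len(predicted_terms & gold_standard)
        fp = len(predicted_terms - gold_standard)
        fn = len(predicted_nonterms & gold_standard)
        tn = len(predicted_nonterms - gold_standard)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_alpha = ((1 + alpha) * precision * recall / (alpha * precision + recall)
                   if alpha * precision + recall else 0.0)
        accuracy = (tp + tn) / total
        results.append({"bin": j, "precision": precision, "recall": recall,
                        "F": f_alpha, "accuracy": accuracy})
    return results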

The results of binning the ranked term candidates produced by the four termhood

measures using the input set of 1,954 term candidates are shown in Figure 5.12. We

would like to point out that the choice of b has effects on the performance indicators

of each bin. Setting the bin size too large may produce deceiving performance

indicators that do not reflect the actual quality of the ranked term candidates. This

occurs when an increasing number of ranked term candidates, which can either

be actual terms or non-terms in the gold standard, are mixed into the same bin.

On the other hand, selecting bin sizes which are too small may defeat the purpose

of collective evaluation of the ranked term candidates. Moreover, a large number of bins will make interpretation of the results difficult. A rule of thumb is to have

appropriate bin sizes selected based on the size of TC and a sensible number of

bins for ensuring interpretable results. For example, setting b = 100 for a set of

500 term candidates is a suitable number since there are only 5 bins. However,

using the same bin size on a set of 5,000 term candidates is inappropriate. In our case, to ensure interpretability of the tables included in this chapter, we set the bin size to b = 200 for our 1,954 term candidates. The results are organised into

contingency tables as introduced in Figure 5.11. Each individual contingency table

in Figure 5.12 contains four cells and is structured in the same way as Figure 5.11.

The individual table summarises the result obtained from the binary classification

performed using the corresponding bin of term candidates on the gold standard G.

Each measure will have several contingency tables where each table contains values


for determining the quality (i.e. relevance to the domain) of the term candidates

that fall in the corresponding bins, as prescribed by the gold standard.

Figure 5.12: The collection of all contingency tables for all termhood measures X across all the 10 bins $B^X_j$. The first column contains the rank of the bins and the second

column shows the number of term candidates in each bin. The third general column

“termhood measures, X” holds all the 10 contingency tables for each measure X

which are organised column-wise, bringing the total number of contingency tables

to 40 (i.e. 10 bins, organised in rows by 4 measures). The structure of the individual

contingency tables follows the one shown in Figure 5.11. The last column is the row-

wise sums of TP + FP and FN + TN . The rows beginning from the second row

until the second last are the rank bins. The last row is the column-wise sums of

TP + FN and FP + TN .

Using the values in the contingency tables in Figure 5.12, we computed the

precision, recall, F-scores and accuracy of the four measures at different bins. The

performance results are summarised in Figure 5.13. The accuracy indicators, acc


Figure 5.13: Performance indicators for the four termhood measures in 10 respective

bins. Each row shows the performance achieved by the four measures in a particular

bin. The columns contain the performance indicators for the four measures. The

notation pre stands for precision, rec is recall and acc is accuracy. We use two

different α values, resulting in two F-scores, namely, F0.1 and F1. The values of the

performance measures with darker shades are the best performing ones.

show the extent to which a termhood measure correctly predicted both terms (i.e.

TP ) and non-terms (i.e. TN). On the other hand, precision, pre measures the

extent to which the candidates predicted as terms (i.e. TP + FP ) are actual terms

(i.e. TP ). The recall indicators, rec capture the ability of the measures in correctly

(i.e. TP ) identifying all terms that exist in the set TC (i.e. TP + FN). As shown

in the last bins (i.e. j = 10) of Figure 5.13, it is trivial to achieve a recall of 100% by

simply binning all term candidates in TC into one large bin during evaluation. Recall

alone is not a good performance indicator and one needs to take into consideration

the number of non-terms which are mistakenly predicted as terms (i.e. FP ) which

is neglected in the computation of recall. Hence, we need to find a balance between

recall and precision instead, which is aptly captured as the F1 score. To demonstrate

an F-score which places more emphasis on precision, we combine both recall and

precision to obtain the F0.1 score as shown in Figure 5.13. Before we begin discussing

the quantitative results, we would like to provide an example of how to interpret the

performance indicators in Figure 5.13 to ensure absolute clarity. If we accept bin

j = 8 using the termhood measure OT as the solution to term recognition, then

84.56% of the solution is precise, and 79.73% of the solution is accurate. In other

words, 84.56% of the predicted terms in bin j = 8 are actual terms, and 79.73%

of the term predictions and non-term predictions are actually true, all according to

the gold standard. A similar approach is used to interpret the performance of all


other measures in all bins.

From Figure 5.13, we notice that the measures TH and OT are consistently bet-

ter in terms of all the performance indicators compared to NCV and CW . Using

any one of the 10 bins and any performance indicators for comparison, TH and

OT offered the best performance. The close resemblance and consistency in the

performance indicators of TH and OT support and objectively confirm the cor-

relation between the two termhood measures as suggested by the Spearman rank

correlation coefficient discovered in Section 5.5.1. OT and TH achieved the best

precision in the first bin at 98% and 98.5%, respectively, obviously by sacrificing recall. The worst performing termhood measure in terms of precision is NCV, with a maximum of only 76.87%. Since the lowest precision of NCV lies in the last

bin, its recall achieves the maximum at 100%. In fact, Figure 5.13 clearly shows

that the maximum values for all performance indicators of NCV rest in the last

bin, and the precision values of NCV are erratically distributed across all the bins.

Ideally, a good termhood measure should attain the highest precision in the first

bin with the subsequent bins achieving decreasing precisions. This is important to

show that actual terms are assigned with higher ranks by the termhood measures.

This reaffirms our suggestion that contrastive analysis which is present in OT , TH

and CW is necessary for term recognition. The frequency distribution of terms

ranked by NCV , as shown in Figure 5.8(a) in the previous Section 5.5.1, clearly

shows the improperly ranked term candidates where only the frequencies from the

domain corpus are considered. Generally, as we are able to observe from the pre-

cision and recall columns, “well-behaved” termhood measures usually have higher

precisions with lower recalls in the first few bins. This is due to the more restrictive

membership of the higher bins where only highly ranked term candidates by the

respective termhood measures are included. The highest recall is achieved when

there are no more false negatives, since all term candidates in set TC are included for scoring and ranking, and all the candidates are simply predicted as terms. In our case, the highest recall is obviously 100%, as we have mentioned earlier, while the lowest

precision is 76.87%. This relation between precision and recall is aptly captured by

the F0.1 and F1 scores. The F0.1 scores begin at much higher values at the higher

bins compared to the F1 scores. This is due to our emphasis on precision instead of

recall as we have justified earlier, and since higher bins have higher precision, F0.1

will inevitably gain higher values. F0.1 becomes lower than F1 when precision falls

below recall.

The cells with darker shades under each performance measure in Figure 5.13


indicate the maximum values for that measure. In other words, the termhood mea-

sure OT has the best accuracy of 81.47% at bin j = 9, or a maximum F0.1 score

of 86.17% at bin j = 5. We can see that the highest F1 score and the best accu-

racy of OT are higher than those of TH. Assuming consistency of these results with

other corpora, OT can be regarded as a measure which attempts to find a balance

between precision and recall. If we weigh precision more, TH will triumph over OT

based on their maximum F0.1 scores. Consequently, we can employ these maximum

values of different performance measures as highly flexible cut-off points for decid-

ing which top n ranked term candidates should be selected and considered as actual

terms. These maximum values optimise the precision and the recall to ensure that

the maximum number of actual terms is selected while minimising the inclusion of

non-terms. This evaluation approach provides a solution to the problem discussed

at the start of this chapter which is also mentioned by Cabre-Castellvi et al. [37],

“all systems propose large lists of candidate terms, which at the end of the process

have to be manually accepted or rejected". In addition, our proposed use of rank

binning and the maximum values of the various performance measures have allowed us to perform an unbiased comparison of all four termhood measures. In short, we have

shown that:

• The new OT termhood measure can provide mathematical justifications for

the heuristically-derived measure TH;

• The new OT termhood measure aims for a balance between precision and re-

call, and offers the most accurate solution to the requirements of term recog-

nition compared to the other measures TH, NCV and CW ; and

• The new OT termhood measure performs on par with the heuristically-derived

measure TH, and they are both consistently the best performing term recog-

nition measures in terms of precision, recall, F-scores and accuracy compared

to NCV and CW .

Some critics may simply disregard the results reported here as unimpressive and

be inclined to compare them with results from other related but distinct disci-

plines such as named-entity recognition or document retrieval. However, one has to

keep in mind several fundamental differences with regard to evaluation in term recognition. Firstly, unlike other established fields, term recognition is largely an

unconsolidated research area which still lacks a common comparative platform [120].

As a result, individual techniques or systems are being developed and tested with


small datasets in highly specialised domains. According to Cabre-Castellvi et al.

[37], “This lack of data makes it difficult to evaluate and compare them.” Secondly, we cannot emphasise enough that term recognition is a subjective task compared to fields such as named-entity recognition and document retrieval. In

the most primitive way, named-entity recognition, which is essentially a classification

problem, can be deterministically performed through a finite set of rigid designa-

tors, resulting in near-human performance in common evaluation forums such as the

Message Understanding Conference (MUC) [42]. While being more subjective than

named-entity recognition, the task of determining document relevance in document

retrieval is guided by explicit user queries, with common evaluation platforms such as

the Text Retrieval Conference (TREC) [264]. On the other hand, term recognition

is based upon the elusive characteristics of terms. Moreover, the set of character-

istics employed differs across a diverse range of term recognition techniques, and within each individual technique, the characteristics may be subject to different implicit interpretations. The challenges of evaluating term recognition techniques become more obvious when one considers the survey by Cabre-Castellvi et al. [37], in which more than half of the systems reviewed remain unevaluated.

5.6 Conclusions

Term recognition is an important task for many natural language systems. Many

techniques have been developed in an attempt to numerically determine or quantify

termhood based on heuristically-motivated term characteristics. We have discussed

several shortcomings related to many existing techniques such as ad-hoc combination

of termhood evidence, mathematically-unfounded derivation of scores, and implicit

and possibly flawed assumptions concerning term characteristics. All these short-

comings lead to issues such as non-decomposability and non-traceability of how the

weights and scores are obtained. These issues bring to light the question of which term characteristics, if any, the different weights and scores are trying to embody, and whether these individual weights or scores are actually measuring

what they are supposed to capture. Termhood measures which cannot be traced or

attributed to any term characteristics are fundamentally flawed.

In this chapter, we stated clearly the four main challenges in creating a formal

and practical technique for measuring termhood. These challenges are (1) the for-

malisation of a general framework for consolidating evidence representing different

term characteristics, (2) the formalisation of the various evidence representing the

different term characteristics, (3) the explicit definition of term characteristics and


their attribution to linguistic theories (if any) or other justifications, and (4) the au-

tomatic determination of optimal thresholds for selecting terms from the final lists

of ranked term candidates. We addressed the first three challenges through a new

probabilistically-derived measure called the Odds of Termhood (OT) for scoring and

ranking term candidates for term recognition. The design of the measure begins

with the derivation of a general probabilistic framework for integrating termhood

evidence. Next, we introduced seven sources of evidence, founded on formal models of word distribution, to facilitate the calculation of OT. This evidence captures the various characteristics of terms, which are either heuristically motivated or based

on linguistic theories. The fact that evidence can be added or removed makes OT

a highly flexible framework that is adaptable to different applications’ requirements

and constraints. In fact, in the evaluation, we have shown a close correlation between our new measure OT and the ad-hoc measure TH. We believe that by adjusting the

inclusion or exclusion of various evidence, other ad-hoc measures can be captured as

well. Our two-part evaluation comparing OT with three other existing ad-hoc measures, namely CW, NCV and TH, has demonstrated the effectiveness of the new

measure and the new framework. A qualitative evaluation studying the frequency

distributions revealed advantages of our new measure OT . A quantitative evalua-

tion using the GENIA corpus as the gold standard and four performance measures

further supported our claim that our new measure OT offers the best performance

compared to the three existing ad-hoc measures. Our evaluation revealed that (1)

the current evidence employed in OT can be seen as probabilistic realisations of

the heuristically-derived measure TH, (2) OT offers a solution to the need for term

recognition which is both accurate and balanced in terms of recall and precision,

and (3) OT performs on par with the heuristically-derived measure TH and they

are both the best performing term recognition measures in terms of precision, recall,

F-scores and accuracy compared to NCV and CW . In addition, our approach of

rank binning and the use of performance measures for deciding on optimal cut-off

ranks addresses the fourth challenge.

5.7 Acknowledgement

This research was supported by the Australian Endeavour International Post-

graduate Research Scholarship, the University Postgraduate Award (International

Students) by the University of Western Australia, the 2008 UWA Research Grant,

and the Curtin Chemical Engineering Inter-University Collaboration Fund. The au-

thors would like to thank the anonymous reviewers for their invaluable comments.


5.8 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learning

Domain Ontologies using Domain Prevalence and Tendency. In the Proceedings of

the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia.

This paper describes a heuristic measure called TH for determining termhood based

on explicitly defined term characteristics and the distributional behaviour of terms

across different corpora. The ideas on TH were later reformulated to give rise to

the probabilistic measure OT. The description of OT forms the core contents of this

Chapter 5.

Wong, W., Liu, W. & Bennamoun, M. (2007) Determining Termhood for Learn-

ing Domain Ontologies in a Probabilistic Framework. In the Proceedings of the 6th

Australasian Conference on Data Mining (AusDM), Gold Coast, Australia.

This paper describes the preliminary attempts at developing a probabilistic framework

for consolidating termhood evidence based on explicitly defined term characteristics

and formal word distribution models. This work was later extended to form the core

contents of this Chapter 5.

Wong, W., Liu, W. & Bennamoun, M. (2008) Determination of Unithood and

Termhood for Term Recognition. M. Song and Y. Wu (eds.), Handbook of Research

on Text and Web Mining Technologies, IGI Global.

This book chapter combines the ideas on the UH measure and the TH measure from Chapters 4 and 5, respectively.


CHAPTER 6

Corpus Construction for Term Recognition

Abstract

The role of the Web for text corpus construction is becoming increasingly signif-

icant. However, its contribution is largely confined to the role of a general virtual

corpus, or poorly derived specialised corpora. In this chapter, we introduce a new

technique for constructing specialised corpora from the Web based on the system-

atic analysis of website contents. Our evaluations show that the corpora constructed

using our technique are independent of the search engines used, and that they out-

perform all corpora based on existing techniques for the task of term recognition.

6.1 Introduction

Broadly, a text corpus is considered to be any collection containing more than one text in a certain language. A general corpus is balanced with regard to the various

types of information covered by the language of choice [173]. In contrast, the con-

tent of a specialised corpus, also known as domain corpus, is biased towards a certain

sub-language. For example, the British National Corpus (BNC) is a general corpus

designed to represent modern British English. On the other hand, the specialised

corpus GENIA contains solely texts from the molecular biology domain. Several

connotations associated with text corpora such as size, representativeness, balance

and sampling are the main topics of ongoing debate within the field of corpus linguis-

tics. In reality, great manual effort is required for constructing and maintaining text

corpora that satisfy these connotations. Although these curated corpora do play

a significant role, several related inadequacies such as the inability to incorporate

frequent changes, rarity of traditional corpora for certain domains, and limited cor-

pus size have hampered the development of corpus-driven applications in knowledge

discovery and information extraction.

The increasingly accessible, diverse and inexpensive information on the World Wide Web (the Web) has attracted the attention of researchers who are in search of alternatives to the manual construction of corpora. Despite issues such as poor reproducibility of results, noise, duplicates and sampling, many researchers [40, 129, 16, 228, 74] agree that the vastness and diversity of the Web remains the most

0This chapter was accepted with revision by Language Resources and Evaluation, 2009, with the

title “Constructing Specialised Corpora through Domain Representativeness Analysis of Websites”.


promising solution to the increasing need for very large corpora. Current work on

using the Web for linguistic purposes can be broadly grouped into (1) the Web it-

self as a corpus, also known as virtual corpus [97], and (2) the Web as a source of

data for constructing locally-accessible corpora known as Web-derived corpora. The

contents of a virtual corpus are distributed over heterogeneous servers, and accessed

using URLs and search engines. It is not difficult to see that these two types of

corpora are not mutually exclusive, and that a Web-derived corpus can be easily

constructed, albeit the downloading time, using the URLs from the corresponding

virtual corpus. The choice between the two types of corpora then becomes a ques-

tion of trade-off between effort and control. On the one hand, applications which

require stable counts and complete access to the texts for processing and analysis

can opt for Web-derived corpora. On the other hand, in applications where speed

and corpus size supersede any other concerns, a virtual corpus alone suffices.

The current state-of-the-art mainly focuses on the construction of Web-derived

corpora, ranging from the simple query-and-download approach using search engines

[15], to the more ambitious custom Web crawlers for very large collections [155, 205].

BootCat [15] is a widely-used toolkit to construct specialised Web-derived corpora.

It employs a naive technique of downloading webpages returned by search engines

without further analysis. [228] extended the use of BootCat to construct a large

general Web-derived corpus using 500 seed terms. This technique requires a large

number of seed terms (in the order of hundreds) to produce very large Web-derived

corpora, and the composition of the corpora varies depending on the search engines

used. Instead of relying on search engines and seed terms, [155] constructed a very

large general Web-derived corpus by crawling the Web using seed URLs. In this

approach, the lack of control and the absence of further analysis cause topic drift as

the crawler traverses further away from the seeds. A closer look into the advances

in this area reveals the lack of systematic analysis of website contents during corpus

construction. Current techniques simply allow the search engines to dictate which

webpages are suitable for the domain based solely on matching seed terms. Others

allow their Web crawlers to run astray without systematic controls.

We propose a technique, called Specialised Corpora Construction based on Web

Texts Analysis (SPARTAN)1 to automatically analyze the contents of websites for

discovering domain-specific texts to construct very large specialised corpora. The

1This foundation work on corpus construction using Web data appeared in the Proceedings of

the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand, 2008

with the title “Constructing Web Corpora through Topical Web Partitioning for Term Recognition”.


first part of our technique analyzes the domain representativeness of websites for

discovering specialised virtual corpora. The second part of the technique selectively

localises the distributed contents of websites in the virtual corpora to create spe-

cialised Web-derived corpora. This technique can also be employed to construct

BNC-style balanced corpora through stratified random sampling from a balanced

mixture of domain-categorised Web texts. During our experiments, we will show

that unlike BootCat-derived corpora which vary greatly across different search en-

gines, our technique is independent of the search engine employed. Instead of blindly

using the results returned by search engines, our systematic analysis allows the

most suitable websites and their contents to surface and to contribute to the spe-

cialised corpora. This systematic analysis significantly improved the quality of our

specialised corpora as compared to BootCat-based corpora, and the naive Seed-

Restricted Querying (SREQ) of the Web. This is verified using the term recognition

task.

In short, the theses of this chapter are as follows:

1) Web-derived corpora are simply localised versions of the corresponding virtual

corpora;

2) The often mentioned problems of using search engines for corpus construction

are in fact a revelation of the inadequacies in current techniques;

3) The use of websites, instead of webpages, as basic units of analysis during

corpus construction is more suitable for constructing very large corpora; and

4) The results provided by search engines cannot be directly accepted for con-

structing specialised corpora. The systematic analysis of website contents is

fundamental in constructing high-quality corpora.

The main contributions of this chapter are (1) a technique for constructing very large,

quality corpora using only a small number of seed terms, (2) the use of systematic

content analysis for re-ranking websites based on their domain representativeness to

allow the corpora to be search engine independent, and (3) processes for extending

user-provided seed terms and localising domain-relevant contents. This chapter is

structured as follows. In Section 6.2, we summarise current work on corpus con-

struction. In Section 6.3, we outline our specialised corpora construction technique.

In Section 6.4, we evaluate the specialised corpora constructed using our technique

in the context of term recognition. We end this chapter with an outlook to future

work in Section 6.5.


6.2 Related Research

The process of constructing corpora using data from the Web generally comprises webpage sourcing and relevant text identification, which are discussed in Sections 6.2.1 and 6.2.2, respectively. In Section 6.2.3, we outline several studies demonstrat-

ing the significance of search engine counts in natural language applications despite

their inconsistencies.

6.2.1 Webpage Sourcing

Currently, there are two main approaches for sourcing webpages to construct

Web-derived corpora, namely, using seed terms as query strings for search engines

[15, 74], and using seed URLs for guiding custom crawlers [155, 207].

The first approach is popular among current corpus construction practices due

to the toolkit known as BootCat [17]. BootCat requires several seed terms as input,

and formulates queries as conjunctions of randomly selected seeds for submission to

the Google search engine. The method then gathers the webpages listed in Google’s search results to create a specialised corpus. There are several shortcomings related

to the construction of large corpora using this technique:

• First, different search engines employ different algorithms and criteria for de-

termining webpage relevance with respect to a certain query string. Since this

technique simply downloads the top webpages returned by a search engine,

the composition of the resulting corpora would vary greatly across different

search engines for reasons beyond knowing and control. It is worth noting

that webpages highly ranked by the different search engines may not have the

necessary coverage of the domain terminology for constructing high-quality

corpora. For example, the ranking by the Google search engine is primarily a

popularity contest [116]. In the words of [228], “...results are ordered...using

page-rank considerations”.

• Second, the aim of creating very large Web-derived corpora using this tech-

nique may be far from realistic. Most major search engines have restrictions

on the number of URLs served for each search query. For instance, the AJAX Search API provided by Google returns a very low 32 search results for each query. The developers of BootCat [15] suggested that 5 to 15 seed terms are

2Google’s Web search interface serves up to 1, 000 results. However, automated crawling and

scraping of that page for URLs will result in a blocking of the IP addresses. The SOAP API by

Google, which allows up to 1, 000 queries per day will be permanently phased out by August 2009.


typically sufficient in many cases. Assuming each URL provides us with a valid readable page, 20 seed terms and their resulting 1,140 three-word combinations would produce a specialised corpus of only 1,140 × 32 = 36,480

webpages. Since the combinations are supposed to represent the same domain,

duplicates will most likely occur when all search results are aggregated [228].

A 10% duplicate and download error rate for every search query reduces the corpus size to 32,832 webpages (a short calculation illustrating these figures is given after this list). For example, in order to produce a small corpus of only 40,000 webpages using BootCat, [228] had to prepare a startling 500 seed terms.

• Third, to overcome issues related to inadequate seed terms for creating very

large corpora, BootCat uses extracted terms from the initial corpus to in-

crementally extend the corpus. [15] suggested using a reference corpus to

automatically identify domain-relevant terms. However, this approach does

not work well since the simple frequency-based techniques used by BootCat

are known for their low to mediocre performance in identifying domain terms

[279]. Without the use of control mechanisms and more precise techniques to

recognise terms, this iterative feedback approach will cause topic drift in the

final specialised corpora. Moreover, the idea of creating corpora by relying on

other existing corpora is not very appealing.
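A short Python calculation reproducing the figures quoted in the second shortcoming above (the combination count and the corpus sizes are exactly those stated in the text):

from math import comb

seed_terms = 20
queries = comb(seed_terms, 3)           # 1,140 three-word combinations
results_per_query = 32                  # cap on results served per query by the API
webpages = queries * results_per_query  # 36,480 webpages at best
after_losses = int(webpages * 0.9)      # 32,832 after ~10% duplicates and download errors
print(queries, webpages, after_losses)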

In a similar approach, [74] used the most frequent words in the BNC, and Microsoft’s Live Search instead of the typical BootCat-preferred Google, to construct a very large BNC-like corpus from the Web. Fletcher provided the reasons behind his choice of Live Search, which include a generous query allowance, higher-quality search results, and greater responsiveness to changes on the Web.

The approach of gathering webpages using custom crawlers based on seed URLs has gained wider acceptance as criticism of the use of search engines has intensified. Issues with the use of search engines, such as unknown algorithms for sorting search results [128] and restrictions on the amount of data that can be obtained [18], have become targets of critics in recent years. Some of the current work based on custom crawlers includes a general corpus of 10 billion words downloaded from the

Web based on seed URLs from dmoz.org by [155]. Similarly, Renouf et al. [205]

developed a Web crawler for finding a large subset of random texts from the Web us-

ing seed URLs from human experts and dmoz.org as part of the WebCorp3 project.

3www.webcorp.org.uk

Ravichandran et al. [203] demonstrated the use of a randomised algorithm to generate noun similarity lists from very large corpora. The authors used URLs from

dmoz.org as seed links to guide their crawlers for downloading 70 million webpages.

After boilerplate and duplicate removal, their corpus was reduced to approximately 31 million documents. Rather than sampling URLs from online directories, Baroni & Ueyama [18] used search engines to obtain webpage URLs for seeding their custom crawlers. The authors used combinations of frequent Italian words for querying Google, and retrieved a maximum of 10 pages per query. The resulting 5,231 URLs were used to seed breadth-first crawling to obtain a final 4 million-document Italian

corpus. The approach of custom crawling is not without its shortcomings. This

approach is typically based on the assumption that webpages of one domain tend to

link to others in the same domain. It is obvious that the reliance on this assump-

tion alone without explicit control will result in topic drift. Moreover, most authors

do not provide explicit statements for addressing important issues such as selection

policy (e.g. when to stop the crawl, where to crawl next), and politeness policy (e.g.

respecting the robot exclusion standard, how to handle disgruntled webmasters due

to the extra bandwidth). This trend of using custom crawlers calls for careful plan-

ning and justification. Issues such as cost-benefit analysis, hardware and software

requirements, and sustainability in the long run have to be considered. Moreover,

poorly-implemented crawlers are a nuisance on the web, consuming bandwidth and

clogging networks at the expense of others [248].

In fact, the worry of unknown ranking and data restriction by search engines [155,

128, 228] exposes the inadequacies of these existing techniques for constructing Web-

derived corpora (e.g. BootCat). These so-called ‘shortcomings’ of search engines

are merely mismatches in expectations. Linguists expect white box algorithms and

unrestricted data access, something we know we will never get. Obviously, these two

issues do place certain obstacles in our quest for very large corpora, but should we

totally avoid search engines given their integral role on the Web? If so, would we risk

missing the forest just for these few trees? The quick alternative, which is infesting

the Web with more crawlers, poses even greater challenges. Rather than reinventing

the wheel, we should think of how existing corpus construction techniques can be

improved using the already available large search engine repositories out there.

6.2.2 Relevant Text Identification

The process of identifying relevant texts, which usually comprises webpage filtering and content extraction, is an important step after the sourcing of webpages.

A filtering phase is fundamental in identifying relevant texts since not all webpages


returned by search engines or custom web crawlers are suitable for specialised cor-

pora. This phase, however, is often absent from most of the existing techniques

such as BootCat. The commonly used techniques include some kind of richness or

density measures with thresholds. For instance, [125] constructed domain corpora

by collecting the top 100 webpages returned by search engines for each seed term.

As a way of refining the corpora, webpages containing only a small number of user-

provided seed terms are excluded. [4] proposed a knowledge-richness estimator that

takes into account semantic relations to support the construction of Web-derived

corpora. Webpages containing both the seed terms and the desired relations are

considered as better candidates to be included in the corpus. The candidate docu-

ments are ranked and manually filtered based on several term and relation richness

measures.

In addition to webpage filtering, content extraction (i.e. boilerplate removal) is

necessary to remove HTML tags and boilerplates (e.g. texts used in navigation bars,

headers, disclaimers). HTMLCleaner by [88] is a boilerplate remover based on the

heuristics that content-rich sections of webpages have longer sentences, a lower number of links, and more function words compared to boilerplate. [67] developed a

boilerplate stripper called NCLEANER based on two character-level n-gram models.

A text segment is considered as a boilerplate and discarded if the ‘dirty’ model (based

on texts to be cleaned) achieves a higher probability compared to the ‘clean’ model

(based on training data).

6.2.3 Variability of Search Engine Counts

Unstable page counts have always been one of the main complaints of critics who

are against the use of search engines for language processing. Much work has been conducted to discredit the use of search engines by demonstrating the arbitrariness

of page counts. The fact remains that page counts are merely estimations [148].

We are not here to argue otherwise. However, for natural language applications

that deal mainly with relative frequencies, ratios and ranking, these variations have

been shown to be insignificant. [181] conducted a study on using page counts for

estimating n-gram frequencies for noun compound bracketing. They showed that

the variability of page counts over time and across search engines do not significantly

affect the results of their task. [140] examined the use of page counts for several

NLP tasks such as spelling correction, compound bracketing, adjective ordering

and prepositional phrase attachment. The authors concluded that for the majority of the tasks conducted, simple and unsupervised techniques perform better when n-


gram frequencies are obtained from the Web. This is in line with the study by [252]

which showed that a simple algorithm relying on page counts outperforms a complex

method trained on a smaller corpus for synonym detection. [124] used search engines

to estimate frequencies for predicate-argument bigrams. They demonstrated the

high correlations between search engines page counts and frequencies obtained from

balanced, carefully edited corpora such as the BNC. Similarly, experiments by [26]

showed that search engine page counts were reliable over a period of 6-month, and

highly consistent with those reported by several manually-curated corpora including

the Brown Corpus [78].

In short, we can safely conclude that page counts from search engines are far

from accurate and stable [148]. Moreover, due to the inherent differences in their

relevance ranking and index sizes, page counts provided by the different search

engines are not comparable. Adequate studies have been conducted to show that

n-gram frequency estimations obtained from search engines indeed work well for a

certain class of applications. As such, one can either make good use of what is available, or keep harping on the primitive issue of unstable page counts. The

key question now is not whether search engine counts are stable or otherwise, but

rather, how they are used.

6.3 Analysis of Website Contents for Corpus Construction

It is apparent from our discussion in Section 6.2 that the current techniques for

constructing corpora from the Web using search engines can be greatly improved.

In this section, we address the question of how corpus construction can benefit

from the current large search engine indexes despite several inherent mismatches in

expectations. Due to the restrictions imposed by search engines, we only have access to a limited number of webpage URLs [128]. As such, the common BootCat technique of downloading ‘off-the-shelf’ webpages returned by search engines to construct corpora is

not the best approach since (1) the number of webpages provided is inadequate,

and (2) not all contents are appropriate for a domain corpus [18]. Moreover, the

authoritativeness of webpages has to be taken into consideration to eliminate low-

quality contents from questionable sources.

Taking these problems into consideration, we have developed a Probabilistic Site Selector (PROSE) to re-rank and filter the websites returned by search engines to construct virtual corpora. We will discuss this analysis mechanism in detail in Sections 6.3.1 and 6.3.2. In addition, Section 6.3.3 outlines the Seed Term Expansion

Process (STEP), the Selective Localisation Process (SLOP), and the Heuristic-based


Cleaning Utility for Web Texts (HERCULES) designed to construct Web-derived corpora from virtual corpora, addressing the need of certain natural language applications for locally accessible texts. An overview of the proposed technique is shown in Figure 6.1.

Figure 6.1: A diagram summarising our web partitioning technique.

A summary of the three phases in SPARTAN is as follows (an illustrative end-to-end sketch is given after the summary):

Input

– A set of seed terms, W = {w1, w2, ..., wn}.

Phase 1: Website Preparation

– Gather the top 1,000 webpages returned by search engines containing the seed terms. Search engines such as Yahoo will serve the first 1,000 pages when accessed using the provided API.

– Generalise the webpages to obtain a set of website URLs, J .

Phase 2: Website Filtering

– Obtain estimates of the inlinks, number of webpages in the website, and

the number of webpages in the website containing the seed terms.

– Analyze the domain representativeness of the websites in J using PROSE.

– Select websites with good domain representativeness to form a new set

J ′. These sites constitute our virtual corpora.


Phase 3: Website Content Localisation

– Obtain a set of expanded seed terms, WX using Wikipedia through the

STEP module.

– Selectively download contents from websites in J ′ based on the expanded

seed terms WX using the SLOP module.

– Extract relevant contents from the downloaded webpages using HER-

CULES.

Output

– A specialised virtual corpus consisting of website URLs with high domain

representativeness.

– A specialised Web-derived corpus consisting of domain-relevant contents

downloaded from the websites in the virtual corpus.
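Purely as an illustration of how the three phases fit together, the following Python sketch strings them into a single pipeline. Every callable passed in (search, prose_select, step, slop, hercules) is a stand-in for the component described in the corresponding subsection below, not an actual implementation.

from urllib.parse import urlsplit

def spartan(seed_terms, m, search, prose_select, step, slop, hercules):
    # Phase 1: website preparation -- gather webpages and generalise them to websites.
    webpage_urls = search(seed_terms, limit=1000)
    candidate_sites = {f"{urlsplit(u).scheme}://{urlsplit(u).netloc}/" for u in webpage_urls}
    # Phase 2: website filtering -- PROSE keeps sites with good domain representativeness.
    virtual_corpus = prose_select(candidate_sites)
    # Phase 3: website content localisation -- expand seeds, then download and clean pages.
    expanded_seeds = step(seed_terms, m)
    raw_pages = slop(virtual_corpus, expanded_seeds)
    web_derived_corpus = [hercules(page) for page in raw_pages]
    return virtual_corpus, web_derived_corpus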

6.3.1 Website Preparation

During this initial preparation phase, a set of candidate websites to represent the domain of interest, D, is generated. Methods such as random walks and random

IP address generation have been suggested to obtain random samples of webpages

[104, 192]. Such random sampling methods may work well for constructing gen-

eral or topic-diverse corpora from the Web if conducted under careful scrutiny. For

our specialised corpora, we employ purposive sampling instead to seek items (i.e.

websites) belonging to a specific, predefined group (i.e. domain D). Since there

is no direct way of deciding if a website belongs to domain D, a set of seed terms W = {w1, w2, ..., wn} is employed as the determining factor. Next, we submit queries

to the search engines for webpages containing the conjunction of the seed terms W .

The set of webpage URLs, which contains the purposive samples that we require, is

returned as the result. At the moment, only webpages in the form of HTML files

or plain text files are accepted. Since most search engines only serve the first 1,000 documents, the size of our sample is no larger than 1,000. We then process the

webpage URLs to obtain the corresponding domain names of the websites. In other

words, only the segment of the URL beginning from the scheme (e.g. http://) until

the authority segment of the hierarchical part is considered for further processing.

For example, in the URL http://web.csse.uwa.edu.au/research/areas/, only

the segment http://web.csse.uwa.edu.au/ is applicable. This collection of dis-

tinct websites (i.e. collection of webpages), represented using the notation J will be

subjected to re-ranking and filtering in the next phase.
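As an illustration, the generalisation of webpage URLs to website URLs described above amounts to keeping only the scheme and authority segments; a minimal Python sketch (with illustrative names) is:

from urllib.parse import urlsplit

def website_of(url):
    # Keep only the scheme and the authority, e.g. http://web.csse.uwa.edu.au/
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"

webpage_urls = ["http://web.csse.uwa.edu.au/research/areas/"]
J = {website_of(u) for u in webpage_urls}   # the collection of distinct candidate websites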


We have selected websites as the basic unit for analysis instead of the typical

webpages for two main reasons. Firstly, websites are collections of related webpages

belonging to the same theme. This allows us to construct a much larger corpus

using the same number of units. For instance, assume that a search engine returns 1,000 distinct webpages belonging to 300 distinct websites. In this example, we can construct a corpus comprising at most 1,000 documents using a webpage as a unit. However, using a website as a unit, we would be able to derive a much larger 90,000-document corpus, assuming an average of 300 webpages per website.

Secondly, the fine granularity and volatility of individual webpages make analysis and maintenance of the corpus difficult. It has been accepted [3, 14, 188] that webpages disappear at a rate of 0.25-0.5% per week [73]. Considering this figure, virtual corpora based on webpage URLs are extremely unstable and require constant monitoring, as pointed out by Kilgarriff [127], to replace sources that have gone offline. Virtual

corpora based on websites as units are far less volatile. This is especially true if the

virtual corpora are composed of highly authoritative websites.

6.3.2 Website Filtering

In this section, we describe our probabilistic website selector called PROSE for

measuring and determining the domain representativeness of candidate websites in

J . The domain representativeness of a website is determined using PROSE based

on the following criteria introduced by [277]:

• The extent to which the vocabulary covered by a website is inclined towards

domain D;

• The extent to which the vocabulary of a website is specific to domain D; and

• The authoritativeness of a website with respect to domain D.

The websites from J which satisfy these criteria are considered as sites with good

domain representativeness, denoted as set J ′. The selected sites in J ′ form our vir-

tual corpus. In the next three subsections, we discuss in detail the notations involved, the means to quantify the three criteria for measuring domain representa-

tiveness, and the ways to automatically determine the selection thresholds.

Notations

Each site ui ∈ J has three pieces of important information, namely, an authority

rank, ri, the number of webpages containing the conjunction of the seed terms in

W , nwi, and the total number of webpages, nΩi. The authority rank, ri is obtained


by ranking the candidate sites in J according to their number of inlinks (i.e. low

numerical value indicates high rank). The inlinks to a website can be obtained using

the “link:” operator in certain search engines (e.g. Google, Yahoo). As for the second

(i.e. nwi) and the third (i.e. nΩi) piece of information, additional queries using the

operator “site:” need to be performed. The total number of webpages in site ui ∈ J

can be estimated by restricting the search (i.e. site search) as “site:ui”. The number

of webpages in site ui containing W can be obtained using the query “w site:ui”,

where w is the conjunction of the seeds in W with the AND operator. Figure 6.2

shows the distribution of webpages within the sites in J . Each rectangle represents

the collection of all webpages of a site in J . Each rectangle is further divided into

the collection of webpages containing seed terms W , and the collection of webpages

not containing W . The size of the collection of webpages for site ui that contain W

is nwi. Using the total number of webpages for the i-th site, nΩi, we estimate the

number of webpages in the same site not containing W as nwi = nΩi − nwi. With

the page counts nwi and nΩi, we can obtain the total page count for webpages not

containing W in J as

nw = N − nw =∑

ui∈J

nΩi −∑

ui∈J

nwi =∑

ui∈J

(nΩi − nwi)

where N is the total number of webpages in J , and nw is the total number of

webpages in J which contains W (i.e. the area within the circle in Figure 6.2).

Probabilistic Site Selector

A site’s domain representativeness is assessed based on three criteria, namely,

vocabulary coverage, vocabulary specificity and authoritativeness. Assuming inde-

pendence, the odds in favour of a site’s ability to represent a domain, defined as the

Odds of Domain Representativeness (OD), is measured as a product of the odds for

realising each individual criterion:

OD(u) = OC(u)OS(u)OA(u) (6.1)

where OC is the Odds of Vocabulary Coverage, OS is the Odds of Vocabulary Speci-

ficity, and OA is the Odds of Authoritativeness. OC quantifies the extent to which

site u is able to cover the vocabulary of the domain represented by W , while OS

captures the chances of the vocabulary of website u being specific to the domain

represented by W . On the other hand, OA measures the chances of u being an au-

thoritative website with respect to the domain represented by W . Next, we define the probabilities that make up these three odds (a short computational sketch of OD and its component odds follows the three definitions below).


Figure 6.2: An illustration of an example sample space on which the probabilities

employed by the filter are based upon. The space within the dot-filled circle consists

of all webpages from all sites in J containing W . The m rectangles represent the

collections of all webpages of the respective sites u1, ..., um. The shaded but not

dot-filled portion of the space consists of all webpages from all sites in J that do not

contain W . The individual shaded but not dot-filled portion within each rectangle

is the collection of webpages in the respective sites ui ∈ J that do not contain W .

• Odds of Vocabulary Coverage: Intuitively, the more webpages from site ui that contain W in comparison with other sites, the likelier it is that ui has a good

coverage of the vocabulary of the domain represented by W . As such, this

factor requires a cross-site analysis of page counts. Let the sample space, set

Y , be the collection of all webpages from all sites in J that contain W . This

space is the area within the circle in Figure 6.2 and the size is |Y | = nw.

Following this, let Z be the set of all webpages in site ui (i.e. any rectangles in

Figure 6.2) with the size |Z| = nΩi. Subscribing to the frequency interpretation

of probability, we compute the probability of encountering a webpage from site

ui among all webpages from all sites in J that contain W as:

$$P_C(n_{wi}) = P(Z \mid Y) = \frac{P(Z \cap Y)}{P(Y)} = \frac{n_{wi}}{n_w} \qquad (6.2)$$

where |Z ∩Y | = nwi is the number of webpages from the site ui containing W .


We compute OC as:

$$OC(u_i) = \frac{P_C(n_{wi})}{1 - P_C(n_{wi})} \qquad (6.3)$$

• Odds of Vocabulary Specificity : This odds acts as an offset for sites which

have a high coverage of vocabulary across many different domains (i.e. the

vocabulary is not specific to a particular domain). This helps us to identify

overly general sites, especially those encyclopaedic in nature which provide

background knowledge across a broad range of disciplines. The vocabulary

specificity of a site can be estimated using the variation in the pagecount of

W from the total pagecount of that site. Within a single site with fixed total

pagecount, an increase in the number of webpages containing W implies a

decrease of pagecount not containing W . In such cases, a larger portion of

the site would be dedicated to discussing W and the domain represented by

W . Intuitively, such phenomenon would indicate the narrowing of the scope

of word usage, and hence, an increase in the specificity of the vocabulary.

As such, the examination of the specificity of vocabulary is confined within a

single site, and hence, is defined over the collection of all webpages within that

site. Let Z be the set of all webpages in site ui and V be the set of all webpages

in site ui that contains W . Following this, the probability of encountering a

webpage that contains W in site ui is defined as:

$$P_S(n_{wi}) = P(V \mid Z) = \frac{P(V \cap Z)}{P(Z)} = \frac{n_{wi}}{n_{\Omega i}} \qquad (6.4)$$

where |V ∩ Z| = |V | = nwi. We compute OS as:

$$OS(u_i) = \frac{P_S(n_{wi})}{1 - P_S(n_{wi})} \qquad (6.5)$$

• Odds of Authoritativeness : We first define a distribution for computing the

probability that website ui is authoritative with respect to W . It has been

demonstrated that the various indicators of a website’s authority such as the

number of inlinks, the number of outlinks and the frequency of visits, follow

the Zipf ’s ranked distribution [2]. As such, the probability that the site ui

with authority rank ri (i.e. a rank based on the number of inlinks to site ui)


is authoritative with respect to W can be defined using the probability mass

function:

$$P_A(r_i) = P(r_i; |J|) = \frac{1}{r_i H_{|J|}} \qquad (6.6)$$

where |J | is the number of websites under consideration, and H|J | is the |J |-th

generalised harmonic number computed as:

$$H_{|J|} = \sum_{k=1}^{|J|} \frac{1}{k} \qquad (6.7)$$

We then compute OA as:

$$OA(u_i) = \frac{P_A(r_i)}{1 - P_A(r_i)} \qquad (6.8)$$
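The following minimal Python sketch pulls Equations 6.1 to 6.8 together, assuming the page counts and the authority rank of each site have already been obtained from a search engine; all names are illustrative.

def odds(p):
    # Convert a probability into odds; p is assumed to be strictly between 0 and 1.
    return p / (1.0 - p)

def harmonic_number(n):
    # H_n = 1 + 1/2 + ... + 1/n (Equation 6.7).
    return sum(1.0 / k for k in range(1, n + 1))

def domain_representativeness(n_wi, n_omega_i, rank_i, n_w, num_sites):
    # OD(u_i) = OC(u_i) * OS(u_i) * OA(u_i) (Equation 6.1).
    #   n_wi      : webpages of site u_i containing the seed terms W
    #   n_omega_i : total webpages of site u_i
    #   rank_i    : authority rank of u_i (1 = most inlinks)
    #   n_w       : webpages containing W summed over all sites in J
    #   num_sites : |J|
    p_c = n_wi / n_w                                   # Equation 6.2
    p_s = n_wi / n_omega_i                             # Equation 6.4
    p_a = 1.0 / (rank_i * harmonic_number(num_sites))  # Equation 6.6
    return odds(p_c) * odds(p_s) * odds(p_a)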

Selection Thresholds

In order to select websites with good domain representativeness, a threshold for

OD is derived automatically as a combination of the individual thresholds related

to OC, OS and OA:

$$OD_T = OA_T \, OC_T \, OS_T \qquad (6.9)$$

Depending on the desired output, these individual thresholds can be determined

using either one of the three options associated with each probability mass function.

All sites ui ∈ J with their odds OD(ui) exceeding ODT will be considered as suitable

candidates for representing the domain. These selected sites, denoted as the set J ′,

constitute our virtual corpus.

We now go through the details of deriving the thresholds for the individual odds (a brief sketch of the threshold-based selection follows these definitions).

• Firstly, the threshold for OC is defined as:

$$OC_T = \frac{\tau_C}{1 - \tau_C} \qquad (6.10)$$

$\tau_C$ can be either $\bar{P}_C$, $P_{C\max}$ or $P_{C\min}$. The mean of the distribution is given by:

$$\bar{P}_C = \frac{\bar{n}_w}{n_w} = \frac{\sum_{u_i \in J} n_{wi}}{|J|} \times \frac{1}{n_w} = \frac{n_w}{|J|} \times \frac{1}{n_w} = \frac{1}{|J|}$$


while the highest and lowest probabilities are defined as:

$$P_{C\max} = \max_{u_i \in J} P_C(n_{wi}), \qquad P_{C\min} = \min_{u_i \in J} P_C(n_{wi})$$

where max PC(nwi) returns the maximum probability of the function PC(nwi)

where nwi ranges over the page counts of all websites ui in J .

• Secondly, the threshold for OS is given by:

$$OS_T = \frac{\tau_S}{1 - \tau_S} \qquad (6.11)$$

where $\tau_S$ can be either $\bar{P}_S$, $P_{S\max}$ or $P_{S\min}$:

$$\bar{P}_S = \frac{\sum_{u_i \in J} P_S(n_{wi})}{|J|}, \qquad P_{S\max} = \max_{u_i \in J} P_S(n_{wi}), \qquad P_{S\min} = \min_{u_i \in J} P_S(n_{wi})$$

Note that $\bar{P}_S \neq 1/|J|$ since the sum of $P_S(n_{wi})$ for all $u_i \in J$ is not equal to 1.

• Thirdly, the threshold for OA is defined as:

$$OA_T = \frac{\tau_A}{1 - \tau_A} \qquad (6.12)$$

where $\tau_A$ can be either $\bar{P}_A$, $P_{A\max}$ or $P_{A\min}$. The expected value of the random variable X for the Zipfian distribution is defined as:

$$\bar{X} = \frac{H_{N,s-1}}{H_{N,s}}$$

and since s = 1 in our distribution of authority ranks, the expected value of the variable r can be obtained through:

$$\bar{r} = \frac{|J|}{H_{|J|}}$$

Using $\bar{r}$, we have $\bar{P}_A$ as:

$$\bar{P}_A = \frac{1}{\bar{r} H_{|J|}} = \frac{1}{|J|}$$

The highest and lowest probabilities are given by:

$$P_{A\max} = \max_{u_i \in J} P_A(r_i), \qquad P_{A\min} = \min_{u_i \in J} P_A(r_i) \qquad (6.13)$$

where max PA(ri) returns the maximum probability of the function PA(ri)

where ri ranges over the authority ranks of all websites ui in J .
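The threshold-based selection of Equation 6.9 can be sketched as follows, assuming each candidate site is represented by a dictionary holding the probabilities and odds computed as in the previous sketch; the dictionary keys and function names are illustrative.

def threshold_odds(probabilities, mode="mean"):
    # tau may be the mean, the maximum or the minimum of the per-site probabilities.
    if mode == "mean":
        tau = sum(probabilities) / len(probabilities)
    elif mode == "max":
        tau = max(probabilities)
    else:
        tau = min(probabilities)
    return tau / (1.0 - tau)

def select_sites(sites, mode="mean"):
    # Keep sites whose OD exceeds OD_T = OA_T * OC_T * OS_T (Equation 6.9).
    # Each element of `sites` is assumed to be a dict holding the probabilities
    # and odds computed as in the previous sketch.
    od_t = (threshold_odds([s["p_a"] for s in sites], mode)
            * threshold_odds([s["p_c"] for s in sites], mode)
            * threshold_odds([s["p_s"] for s in sites], mode))
    return [s for s in sites if s["od"] > od_t]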


6.3.3 Website Content Localisation

This content localisation phase is designed to construct Web-derived corpora

using the virtual corpora created in the previous phase. The three main processes in

this phase are seed term expansion (STEP), selective content downloading (SLOP),

and content extraction (HERCULES).

STEP uses the categorical organisation of Wikipedia topics to discover related

terms to complement the user-provided seed terms. Under each Wikipedia cate-

gory, there is typically a listing of subordinate topics. For instance, there is a category called “Category:Blood cells” which corresponds to the “blood cell” seed term.

STEP begins by finding the category page “Category:w” on Wikipedia which cor-

responds to each w ∈ W (line 3 in Algorithm 2). Under the category page “Cate-

gory:Blood cells” is a listing of the various types of blood cells such as leukocytes,

red blood cell, reticulocytes, etc. STEP relies on regular expressions to scrap the

category page to obtain these related terms (line 4 in Algorithm 2). The related

topics in the category pages are typically structured using the <li> tag. It is im-

portant to note that not all topics listed under a Wikipedia category adhere strictly

to the hypernym-hyponym relation. Nevertheless, the terms obtained through such

means are highly related to the encompassing category since they are determined by human contributors. These related terms can be relatively large in number. As such, we employed the Normalised Web Distance4 (NWD) [276] for selecting the m most related ones (lines 6 and 8 in Algorithm 2). Algorithm 2 summarises STEP. The existing set of seed terms W = {w1, w2, ..., wn} is expanded through this process to become WX = {W1 = {w1, ...}, W2 = {w2, ...}, ..., Wn = {wn, ...}}.

Algorithm 2 STEP(W, m)
 1: initialise WX
 2: for each wi ∈ W do
 3:     page := getcategorypage(wi)
 4:     relatedtopics := scrapepage(page)
 5:     for each a ∈ relatedtopics do
 6:         sim := NWD(a, wi)
 7:     recall the m most related topics (a1, ..., am)
 8:     Wi := {wi, a1, ..., am}
 9:     add Wi to the set WX
10: return WX

4A generalised version of the Normalised Google Distance (NGD) by [50].
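A hypothetical Python rendering of Algorithm 2 is given below. The callables get_category_members and nwd are stand-ins for the Wikipedia category scraping and the Normalised Web Distance computation described above; they are not real library calls.

def step(seed_terms, m, get_category_members, nwd):
    # Hypothetical sketch of STEP (Algorithm 2): expand each seed term with its m
    # most closely related Wikipedia category members, ranked by NWD.
    expanded = []
    for w in seed_terms:
        related = get_category_members(w)       # topics listed under "Category:w"
        related.sort(key=lambda a: nwd(a, w))   # smaller distance = more closely related
        expanded.append([w] + related[:m])      # W_i = {w_i, a_1, ..., a_m}
    return expanded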


SLOP then uses the expanded seed terms WX to selectively download the con-

tents from the websites in J ′. Firstly, all possible pairs of seed terms are obtained

for every combination of sets Wi and Wj from WX :

$$C = \{(x, y) \mid x \in W_i \in W_X \wedge y \in W_j \in W_X \wedge i < j \leq |W_X|\}$$

Using the seed term pairs in C, SLOP localises the webpages for all websites in J ′.

For every site u ∈ J ′, all pairs (x, y) in C are used to construct queries of the form q = “x” “y” site:u. These queries are then submitted to search engines to obtain the URLs of webpages from each site that contain the seed terms. This move ensures

that only relevant pages from a website are downloaded. This prevents the localising

of boilerplate pages such as “about us”, “disclaimer”, “contact us”, “home”, “faq”,

etc whose contents are not suitable for the specialised corpora. Currently, only

HTML and plain text pages are considered. Using these URLs, SLOP downloads

the corresponding webpages to a local repository.
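The construction of the seed term pairs in C and of the per-site queries can be sketched as follows, assuming expanded_seeds is the output of the STEP sketch above; the function name is illustrative.

from itertools import combinations

def slop_queries(expanded_seeds, sites):
    # Build the per-site queries "x" "y" site:u from all cross-set seed term pairs.
    pairs = [(x, y)
             for w_i, w_j in combinations(expanded_seeds, 2)   # i < j <= |W_X|
             for x in w_i
             for y in w_j]
    return [f'"{x}" "{y}" site:{u}' for u in sites for (x, y) in pairs]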

The final step of content localisation makes use of HERCULES to extract con-

tents from the downloaded webpages. HERCULES is based on the following se-

quence of heuristics:

1) all relevant texts are located within the <body> tag.

2) the contribution of invisible elements and formatting tags for determining the

relevance of texts is insignificant.

3) the segmentation of relevant texts, typically into paragraphs, is defined by structural tags such as <br>, <p>, <span>, <div>, etc.

4) the length of sentences in relevant texts is typically greater.

5) the concentration of function words in relevant texts is higher [88].

6) the concentration of certain non-alphanumeric characters such as “|”, “-”, “.”

and “,” in irrelevant texts is higher.

7) other common properties, such as the capitalisation of the first character of sentences and the termination of sentences by punctuation marks, also hold in relevant texts.

HERCULES begins the process by detecting the presence of the <body> and

</body> tags, and extracting the contents between them. If no <body> tag is

present, the complete HTML source code is used. Next, HERCULES removes all


invisible elements (e.g. comments, javascript codes) and all tags without contents

(e.g. images, applets). Formatting tags such as <b>, <i>, <center>, etc are

also discarded. Structural tags are then used to break the remaining texts in the

page into segments. The length of each segment relative to all other segments is

determined. In addition, the ratio of function words and certain non-alphanumeric

characters (i.e. “|”, “-”, “.”, “,”) to the number of words in each segment is mea-

sured. The ratios related to non-alphanumeric characters are particularly useful

for further removing boilerplates such as Disclaimer | Contact Us | ..., or the

reference section of academic papers where the concentration of such characters is

higher than normal. Using these indicators, HERCULES removes segments which

do not satisfy the heuristics 4) to 7). The remaining segments are aggregated and

returned as contents.
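A minimal sketch of the segment-level filtering implied by heuristics 4) to 6) is shown below. The fixed thresholds and the function-word list are illustrative placeholders; HERCULES itself judges each segment relative to the other segments on the page.

import re

FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for"}  # illustrative subset

def keep_segment(segment, min_words=15, min_function_ratio=0.10, max_symbol_ratio=0.05):
    # Decide whether a text segment looks like relevant content rather than boilerplate.
    words = segment.split()
    if len(words) < min_words:                          # heuristic 4: relevant text is longer
        return False
    function_ratio = sum(w.lower().strip('.,') in FUNCTION_WORDS for w in words) / len(words)
    symbol_ratio = len(re.findall(r"[|.,-]", segment)) / len(words)
    return (function_ratio >= min_function_ratio        # heuristic 5: more function words
            and symbol_ratio <= max_symbol_ratio)       # heuristic 6: fewer "|", "-", ".", ","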

6.4 Evaluations and Discussions

In this section, we discuss the results of three experiments conducted to assess

the different aspects of our technique.

6.4.1 The Impact of Search Engine Variations on Virtual Corpus Construction

We conducted a three-part experiment to study the impact of the choice of search

engines on the resulting virtual corpus. In this experiment, we examine the extent

of correlation between the websites ranked by the different search engines. Then,

we study whether or not the websites re-ranked using PROSE achieve a higher level of correlation. A high correlation between the websites re-ranked by PROSE would

suggest that the composition of the virtual corpora will remain relatively stable

regardless of the choice of search engines.

We performed a scaled-down version of the virtual corpus construction procedure outlined in Sections 6.3.1 and 6.3.2. For this experiment, we employed the three

major search engines, namely, Yahoo, Google and Live Search (by Microsoft), and

their APIs for constructing virtual corpora. We chose the seed terms “transcription

factor” and “blood cell” to represent the domain of molecular biology D1, while the

reliability engineering domain D2 is represented using the seed terms “risk man-

agement” and “process safety”. For each domain D1 and D2, we gathered the first 1,000 webpage URLs from the three search engines. We then processed the URLs

to obtain the corresponding websites’ addresses. The set of websites obtained for

domain D1 using Google, Yahoo and Live Search is denoted as J1G, J1Y and J1M ,


respectively. The same notations apply for domain D2. Next, these websites were assigned ranks based on their corresponding webpages’ order of relevance as determined by the respective search engines. We refer to these ranks as native ranks. If a site has multiple webpages included in the search results, the highest rank prevails. This ranking information is kept for use in the later part of this experiment. Figure 6.3 summarises the number of websites obtained from each search engine for each domain.

Figure 6.3: A summary of the number of websites returned by the respective search engines for each of the two domains. The number of common sites is also provided.

In the first part of this experiment, we sorted the 77 common websites for D1,

denoted as J1C = J1G ∩J1Y ∩J1M , and the 103 in J2C = J2G ∩J2Y ∩J2M using their

native ranks (i.e. the ranks generated by the search engines). We then determined

their Spearman’s rank correlation coefficients. The native columns in Figure 6.4(a)

and 6.4(b) show the correlations between websites sorted by different pairs of search

engines. The correlation between websites based on native ranks is moderate, ranging from 0.45 to 0.54. This extent of correlation does not come as a surprise. In

fact, this result supports our implicit knowledge that different search engines rank

the same webpages differently. Assuming the same query, the same webpage will

inevitably be assigned distinct ranks due to the inherent differences in the index

size and the algorithm itself. For this reason, the ranks generated by search engines

(i.e. native ranks) do not necessarily reflect the domain representativeness of the

webpages. In the second part of the experiment, we re-rank the websites in J1,2C

using PROSE. For simplicity, we only employ the coverage and specificity criteria

for determining the domain representativeness of websites, in the form of odds of

domain representativeness (OD). The information required by PROSE, namely, the

number of webpages containing W , nwi, and the total number of webpages, nΩi

are obtained from the respective search engines. In other words, the OD of each

website is estimated three times, each using different nwi and nΩi obtained from the

three different search engines. The three variants of estimation are later translated

into ranks for re-ordering the websites. Due to the varying nature of page counts


across different search engines as discussed in Section 6.2.3, many would expect that

re-ranking the websites using metrics based on such information would yield even worse correlation. On the contrary, the significant increases in correlation between websites after re-ranking by PROSE, as shown in the PROSE columns in Figures 6.4(a) and 6.4(b), demonstrate otherwise.
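For reference, the rank correlations reported above can be reproduced in a few lines; the sketch below uses SciPy's spearmanr on two made-up rank lists, not the actual experimental rankings.

    # Minimal sketch: Spearman's rank correlation between the ranks assigned to the
    # same websites by two search engines (illustrative ranks only).
    from scipy.stats import spearmanr

    ranks_engine_a = [1, 2, 3, 4, 5]
    ranks_engine_b = [2, 1, 5, 3, 4]

    rho, p_value = spearmanr(ranks_engine_a, ranks_engine_b)
    print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")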

(a) Correlations between websites in the molecular biology domain.

(b) Correlations between websites in the reliability engineering domain.

Figure 6.4: A summary of the Spearman’s correlation coefficients between websites

before and after re-ranking by PROSE. The native columns show the correlation

between the websites when sorted according to their native ranks provided by the

respective search engines.

We discuss the reasons behind this interesting finding. As we have mentioned

before, search engine indexes vary greatly. For instance, based on page counts

by Google, we have a 15, 900/23, 800, 000 = 0.0006685 probability of encountering

a webpage from the site www.pubmedcentral.nih.gov that contains the bi-gram

“blood cell”. However, Yahoo provides us with a higher estimate at 0.001440. This is

not because Yahoo is more accurate than Google or vice versa; they are just different.

We have discussed this in detail in Section 6.2.3. This reaffirms that estimations

using different search engines are by themselves not comparable. Consider the next

example n-gram “gasoline”. Google and Yahoo provide estimates of 0.000046 and 0.000093 for the same site, respectively. Again, they are very different from

one another. While the estimations are inconsistent (i.e. Google and Yahoo offer

different page counts for the same n-grams), the conclusion is the same, namely, one has a better chance of encountering a page in www.pubmedcentral.nih.gov that contains “blood cell” than one that contains “gasoline”. In other words, estimations based on search engine counts

have significance only in relation to something else (i.e. relativity). This is exactly

5 This page count and all subsequent page counts derived from Google and Yahoo were obtained on 2 April 2009.


how PROSE works. PROSE determines a site’s OD based entirely on its contents.

OD is computed by PROSE using search engine counts. Even though the analysis

of the same site using different search engines eventually produces different OD, the

object of the study, namely, the content of the site, remains constant. In this sense,

the only variable in the analysis by PROSE is the search engine count. Since the

ODs generated by PROSE are used for inter-comparing the websites in J1,2C (i.e.

ranking), the numerical differences introduced through variable page counts by the

different search engines become insignificant. Ultimately, the same site analysed by

PROSE using unstable page counts by different search engines can still achieve the

same rank.
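The following toy sketch illustrates this relativity argument using the figures quoted above for www.pubmedcentral.nih.gov; the ranking step is a simplification and should not be read as the actual PROSE formulation of OD.

    # Toy illustration: absolute page-count estimates differ across engines,
    # yet the within-engine ordering of n-grams (and hence a rank-based score) is stable.
    estimates = {
        "Google": {"blood cell": 15900 / 23800000, "gasoline": 0.000046},
        "Yahoo":  {"blood cell": 0.001440,          "gasoline": 0.000093},
    }

    for engine, freqs in estimates.items():
        ranking = sorted(freqs, key=freqs.get, reverse=True)
        print(engine, "->", ranking)
    # Both engines rank 'blood cell' above 'gasoline', so a score that is only used
    # to rank websites is largely insensitive to which engine supplied the raw counts.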

In the third part of this experiment, we examine the general ‘quality’ of the

websites ranked by PROSE using information provided by the different search en-

gines. As we have elaborated on in Section 6.3.2, PROSE measures the odds in

favour of the websites’ authority, vocabulary coverage and specificity. Websites

with low OD can be considered as poor representatives of the domain. The ranking of sites by PROSE using information from Google consistently resulted in the largest number of websites with OD less than −6. About 70.13% of the sites in domain D1 and 34.95% in domain D2, when assessed using Google’s counts, are considered poor representatives. On the other

hand, the sites ranked using information by Yahoo and Live Search have relatively

higher OD. To explain this trend, let us consider the seed terms “transcription

factor”, “blood cell”. According to Google, there are 23, 800, 000 webpages in

www.pubmedcentral.nih.gov and out of that number, 1, 180 contain both seed

terms. As for Yahoo, it indexes far fewer webpages (9,051,487) from the same site but offers approximately the same page count of 1,060 for the seed terms. This trend is

consistent when we examined the page count for the non-related n-gram “vehicle”

from the same site. Google and Yahoo report approximately the same page counts of 24,900 and 20,100, respectively. There are a few possibilities. Firstly, the remaining 23,800,000 − 9,051,487 = 14,748,513 webpages indexed by Google really do not contain the n-grams, or secondly, Google overestimated the overall figure of 23,800,000.

The second possibility becomes more evident as we look at the page count by other

search engines6. Live Search reports a total page count of 61, 400 for the same

site with 1, 460 webpages containing the seed terms “transcription factor”, “blood

cell”. Ask.com, with a much larger site index of 15,600,000 pages, has 914 pages containing the seed terms. The index sizes of all these other search engines are much smaller

6Other commonly-used search engines such as AltaVista and AlltheWeb were not cited for

comparison since they used the same search index as Yahoo’s.


Figure 6.5: The number of sites with OD less than −6 after re-ranking using PROSE

based on page count information provided by the respective search engines.

than that of Google’s, and yet, they provided us with approximately the same num-

ber of pages containing the seed terms. Our finding is consistent with the recent

report by Uyar [254] which concluded that the page counts provided by Google are

usually higher than the estimates by other search engines. Due to such inflated

figures by Google, when we take the relative frequency of n-grams using Google’s page counts, the significance of domain-relevant n-grams is greatly undermined.

The seed terms (i.e. “transcription factor”, “blood cell”) achieved a much lower

probability at 1, 180/23, 800, 000 = 0.000049 when assessed using Google’s page

count as compared to the probability by Yahoo 1, 060/9, 051, 487 = 0.000117. This

explains the devaluation of domain-relevant seed terms when assessed by PROSE using information from Google, which in turn lowers the OD of websites.

In short, Live Search and Yahoo are comparatively better search engines for the

task of measuring OD by PROSE. However, the index size of Live Search is undesir-

ably small, a problem agreed upon by other researchers such as [74]. Moreover, the

search facility using the “site:” operator is occasionally turned off by Microsoft, and

it sometimes offers illogical estimates. While this problem is present in all search

engines, it is particularly evident in Live Search when site search is used. For in-

stance, there are about 61, 400 pages from www.pubmedcentral.nih.gov indexed by

Live Search. However, Live Search reports that there are 159,000 pages in that site which contain the n-gram “transcription factor”. For this reason, we preferred the

balance between the index size and the ‘honesty’ in page counts offered by Yahoo.

6.4.2 The Evaluation of HERCULES

We conducted a simple evaluation of our content extraction utility HERCULES

using the Cleaneval development set7. Due to some implementation difficulties, the scoring program provided by Cleaneval could not be used for this evaluation. Instead, we employed a text comparison module8 written in Perl. The module, based on a vector-

7 http://cleaneval.sigwac.org.uk/devset.html
8 http://search.cpan.org/ stro/Text-Compare-1.03/lib/Text/Compare.pm


space model, is used for comparing the contents of the texts cleaned by HERCULES

with the gold standard provided by Cleaneval. The module uses a rudimentary stop

list to filter out common words and then the cosine similarity measure is employed

to compute text similarity. The texts cleaned by HERCULES achieved an average similarity of 0.8919 with the gold standard, with a standard deviation of 0.0832. The

relatively small standard deviation shows that HERCULES is able to consistently

extract contents that meet the standard of human curators. We have made available

an online demo of HERCULES 9.
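As a rough illustration of this evaluation setup (a stop list followed by cosine similarity over term frequency vectors), here is a minimal Python sketch; it merely stands in for the Perl Text::Compare module that was actually used, and the stop list shown is hypothetical.

    # Minimal sketch: cosine similarity between cleaned text and gold standard
    # after removing stop words (the stop list here is illustrative).
    import math
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "by"}

    def vectorise(text):
        tokens = [t.lower().strip(".,") for t in text.split()]
        return Counter(t for t in tokens if t and t not in STOP_WORDS)

    def cosine_similarity(a, b):
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    cleaned = "Transcription factors bind DNA and regulate gene expression."
    gold = "Transcription factors regulate gene expression by binding to DNA."
    print(round(cosine_similarity(vectorise(cleaned), vectorise(gold)), 4))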

6.4.3 The Performance of Term Recognition using SPARTAN-based Corpora

In this section, we evaluated the quality of the corpora constructed using SPAR-

TAN in the context of term recognition for the domain of molecular biology. We

compared the performance of term recognition using several specialised corpora,

namely:

• SPARTAN-based corpora

• the manually-crafted GENIA corpus [130]

• a BootCat-derived corpus

• seed-restricted querying of the Web (SREQ), as a virtual corpus

We employed the gold standard reference provided along with the GENIA cor-

pus for evaluating term recognition. We used the same set of seed terms W = {“human”, “blood cell”, “transcription factor”} for various purposes throughout this

evaluation. The reason behind our choice of seed terms is simple: these are the same

seed terms used for the construction of GENIA, which is our gold standard.

BootCat-Derived Corpus

We downloaded and employed the BootCat toolkit10 with the new support for

Yahoo API to construct a BootCat-derived corpus using the same set of seed terms

W = {“human”, “blood cell”, “transcription factor”}. For reasons discussed in Sec-

tion 6.2.1, BootCat will not be able to construct a large corpus using only three

seed terms. The default settings of 3 terms per tuple, and 10 randomly selected

9 A demo is available at http://explorer.csse.uwa.edu.au/research/algorithm hercules.pl. Note that slow response time is possible when server is under heavy load.
10 http://sslmit.unibo.it/ baroni/bootcat.html


Figure 6.6: A listing of the 43 sites included in SPARTAN-V.

tuples for querying cannot be applied in our case. Moreover, we could not perceive

the benefits of randomly selecting terms for constructing tuples. As such, we gen-

erated all possible combinations of all possible lengths in this experiment. In other words, we have three 1-tuples, three 2-tuples, and one 3-tuple for use. While this

move may appear redundant since all webpages which contain the 3-tuple will also

have the 2-tuples, we can never be sure that the same webpages will be provided as

results by the search engines. In addition, we altered a default setting in the script

by BootCat, collect_urls_from_yahoo.pl, which restricted our access to only the first 100 results for each query. Using the seven seed term combinations and the altered Perl script, we obtained 3,431 webpage URLs for downloading. We then employed the script by BootCat, retrieve_and_clean_pages_from_url_list.pl, to download and clean the webpages, resulting in a final corpus size of N = 3,174 documents with F = 7,641,018 tokens.
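To make the tuple generation step concrete, here is a minimal sketch that enumerates all seed term combinations of every length; it mirrors the description above rather than the actual BootCat scripts.

    # Sketch: generate all 1-, 2- and 3-term query tuples from the three seed terms.
    from itertools import combinations

    seed_terms = ["human", "blood cell", "transcription factor"]

    tuples = [combo
              for length in range(1, len(seed_terms) + 1)
              for combo in combinations(seed_terms, length)]

    for t in tuples:
        # Each tuple becomes one conjunctive search engine query.
        print(" ".join(f'"{term}"' for term in t))

    print(len(tuples), "query tuples in total")  # 3 + 3 + 1 = 7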

SPARTAN-Based Corpora and SREQ

We first constructed a virtual corpus using SPARTAN and the seed terms W . Ya-

hoo is selected as our search engine of choice for this experiment for reasons outlined


in Section 6.4.1. We employed the API11 provided by Yahoo. All requests to Yahoo

are sent to this server process http://search.yahooapis.com/WebSearchService/

V1/webSearch?/. We format our query strings as appid=APIKEY&query=SEEDTERMS

&results=100. Additional options such as start=START are applied to enable

SPARTAN to obtain results beyond the first 100 webpages. This service by Yahoo is

limited to 5, 000 queries per IP address per day. However, the implementation of this

rule is actually quite lenient. In the first phase of SPARTAN, we obtained 176 dis-

tinct websites from the first 1, 000 webpages returned by Yahoo using the conjunction

of the three seed terms. For the second phase of SPARTAN, we selected the average

values as described in Section 6.3.2 for all three thresholds, namely, τC , τS and τA to

derive our selection cut-off point ODT. The selection process using PROSE provided us with a reduced set of 43 sites. The virtual corpus thus contains about N = 84,963,524 documents (i.e. webpages) distributed over 43 websites. In this evaluation, we refer to this virtual corpus as SPARTAN-V, where the letter V stands for virtual.
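The query format described above can be sketched as follows; the API key is a placeholder and the code simply assembles the request URL as documented in this section (the V1 Yahoo Web Search service has since been retired).

    # Sketch: building a Yahoo Web Search API request in the format described above.
    # APIKEY is a placeholder; no assumption is made that the service is still live.
    from urllib.parse import urlencode

    BASE_URL = "http://search.yahooapis.com/WebSearchService/V1/webSearch"

    def build_query(seed_terms, api_key="APIKEY", results=100, start=1):
        query = " ".join(f'"{t}"' for t in seed_terms)   # conjunction of quoted seed terms
        params = {"appid": api_key, "query": query, "results": results, "start": start}
        return f"{BASE_URL}?{urlencode(params)}"

    # Moving the start offset beyond 100 retrieves results past the first page.
    print(build_query(["human", "blood cell", "transcription factor"], start=101))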

We have made available an online query tool for SPARTAN-V12. Figure 6.6 shows

the websites included in the virtual corpus for this evaluation. We then extended the

virtual corpus during the third phase of SPARTAN to construct a Web-derived cor-

pus. We selected the three most related topics for each seed term in W during seed term

expansion by STEP. The seed term “human” has no corresponding category page

on Wikipedia and hence, cannot be expanded. The set of expanded seed terms is

WX = {“human”, “blood cell”, “erythropoiesis”, “reticulocyte”, “haematopoiesis”, “transcription factor”, “CREB”, “C-Fos”, “E2F”}. Using WX, SLOP gathered

80, 633 webpage URLs for downloading. A total of 76, 876 pages were actually

downloaded while the remaining 3, 743 could not be reached for reasons such as

connection error. Finally, HERCULES is used to extract contents from the down-

loaded pages for constructing the Web-derived corpus. About 15% of the webpages

were discarded by HERCULES due to the absence of proper contents. The final

Web-derived corpus, denoted as SPARTAN-L (the letter L refers to local), is composed of N = 64,578 documents with F = 118,790,478 tokens. We have made

available an online query tool for SPARTAN-L13. It is worth pointing out that using

SPARTAN and the same number of seed terms, we can easily construct a corpus

11 More information on Yahoo! Search, including API key registration, is available at http://developer.yahoo.com/search/web/V1/webSearch.html.
12 A demo is available at http://explorer.csse.uwa.edu.au/research/data virtualcorpus.pl. Note that slow response time is possible when server is under heavy load.
13 A demo is available at http://explorer.csse.uwa.edu.au/research/data localcorpus.pl. Note that slow response time is possible when server is under heavy load.


that is at least 20 times larger than a BootCat-derived corpus.

Many researchers have found good use of page counts for a wide range of NLP

applications using search engines as gateways to the Web (i.e. general virtual cor-

pus). In order to justify the need for content analysis during the construction of

virtual corpora by SPARTAN, we included the use of guided search engine queries as

a form of specialised virtual corpus during term recognition. We refer to this virtual

corpus as SREQ, the seed-restricted querying of the Web. Quite simply, we append the conjunction of the seed terms W to every query made to the search engines. In

a sense, we can consider SREQ as the portion of the Web which contains the seed

terms W . For instance, the normal approach for obtaining the general page count

(i.e. the number of pages on the Web) for “TNF beta” is by submitting the n-gram

as a query to any search engine. Using Yahoo, the general virtual corpus has 56,400

documents containing “TNF beta”. In SREQ, the conjunction of the seeds in W is

appended to “TNF beta”, resulting in the query q=“TNF beta” “transcription fac-

tor” “blood cell” “human”. Using this query, Yahoo provides us with 218 webpages,

while the conjunction of the seed terms alone results in the page count N = 149, 000.

We can consider the latter as the size of SREQ (i.e. the total number of documents in SREQ), and the former as the number of documents in SREQ which contain the term “TNF beta”.
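Using the page counts quoted above, the relative frequency of a candidate term under SREQ can be worked out as in the small sketch below; the function name is purely illustrative.

    # Worked example of SREQ relative frequencies using the page counts quoted above.
    def relative_frequency(pages_with_term, corpus_size):
        return pages_with_term / corpus_size

    sreq_size = 149_000        # pages containing the conjunction of the seed terms alone
    tnf_beta_in_sreq = 218     # pages containing "TNF beta" plus the seed term conjunction

    print(relative_frequency(tnf_beta_in_sreq, sreq_size))   # ~0.00146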

GENIA Corpus and the Preparations for Term Recognition

In this section, we evaluate the performance of term recognition using the dif-

ferent corpora discussed in the preceding subsections. Terms are content-bearing

words which are unambiguous, highly specific and relevant to a certain domain of

interest. Most existing term recognition techniques identify terms from among the

candidates through some scoring and ranking mechanisms. The performance of

term recognition is heavily dependent on the quality and the coverage of the text

corpora. Therefore, we find it appropriate to use this task to judge the adequacy

and applicability of both SPARTAN-V and SPARTAN-L in real-world applications.

The term candidates and gold standard employed in this evaluation come with the GENIA corpus [130]. The term candidates were extracted from the GENIA corpus based on the readily-available part-of-speech and semantic markup. A

gold standard, denoted as the set G, was constructed by extracting the terms which

have semantic descriptors enclosed by cons tags. For practicality reasons, we ran-

domly selected 1, 300 term candidates for evaluation, denoted as T . We manually

inspected the list of candidates and compared them against the gold standard. Out


of the 1, 300 candidates, 121 are non-terms (i.e. misses) while the remaining 1, 179

are domain-relevant terms (i.e. hits).

Figure 6.7: The number of documents and tokens from the local and virtual corpora

used in this evaluation.

Instead of relying on some complex measures, we used a simple, unsupervised

technique based solely on the cross-domain distributional behaviour of words for

term recognition. Our intention is to observe the extent of contribution of the quality

of corpora towards term recognition without being obscured by the complexity of

state-of-the-art techniques. We employed relative frequencies to determine whether

a word (i.e. term candidate) is a domain-relevant term or otherwise. The idea

is simple: if a word is encountered more often in a specialised corpus than the

contrastive corpus, then the word is considered as relevant to the domain represented

by the former. As such, this technique places even more emphasis on the coverage and adequacy of the corpora to achieve good term recognition performance. For the contrastive corpus, we have prepared a collection comprising texts from a broad, sweeping range of domains other than our domain of interest, which is molecular biology. Figure 6.7 summarises the composition of the contrastive corpus.

The term recognition procedure is performed as follows. Firstly, we took note of

the total number of tokens F in each local corpus (i.e. BootCat, GENIA, SPARTAN-

L, contrastive corpus). For the two virtual corpora, namely, SPARTAN-V and

SREQ, the total page count (i.e. total number of documents) N is used instead.

Secondly, the word frequency ft for each candidate t ∈ T is obtained from each

local corpus. We use page counts (i.e. document frequencies), nt as substitutes for

the virtual corpora. Thirdly, the relative frequency pt for each t ∈ T is calcu-

lated as either ft/F or nt/N depending on the corpus type (i.e. virtual or local).

Fourthly, we evaluated the performance of term recognition using these relative

frequencies. Please take note that when comparing local corpora (i.e. BootCat,


Algorithm 3 assessBinaryClassification(t,dt,ct,G)

1: initialise decision

2: if dt ≥ ct ∧ t ∈ G then

3: decision := “true positive”

4: else if dt ≥ ct ∧ t ∉ G then

5: decision := “false positive”

6: else if dt < ct ∧ t ∈ G then

7: decision := “false negative”

8: else if dt < ct ∧ t ∉ G then

9: decision := “true negative”

10: return decision

GENIA, SPARTAN-L) with the contrastive corpus, the pt based on word frequency

is used. The pt based on document frequency is used for comparing virtual corpora

(i.e. SPARTAN-V, SREQ) with the contrastive corpus. If the pt by a specialised

corpus (i.e. BootCat, GENIA, SPARTAN-L, SPARTAN-V, SREQ), denoted as dt,

is larger than or equal to the pt by the contrastive corpus, ct, then the candidate

t is classified as a term. The candidate t is classified as a non-term if dt < ct. An

assessment function described in Algorithm 3 is employed to grade the decisions

achieved using the various specialised corpora.
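For illustration, the relative-frequency classification and the assessment function of Algorithm 3 can be written compactly as follows; the corpus statistics shown are placeholders, not values from the actual evaluation.

    # Sketch of the term classification step and the grading in Algorithm 3
    # (the relative frequencies below are placeholders, not the evaluation data).
    def assess(t, d_t, c_t, gold):
        # Follows Algorithm 3, using >= for the positive branches as stated in the text.
        if d_t >= c_t:
            return "true positive" if t in gold else "false positive"
        return "false negative" if t in gold else "true negative"

    specialised = {"transcription factor": 0.0009, "vehicle": 0.00001}   # p_t in specialised corpus
    contrastive = {"transcription factor": 0.00002, "vehicle": 0.00015}  # p_t in contrastive corpus
    gold_standard = {"transcription factor"}

    for term in specialised:
        print(term, "->", assess(term, specialised[term], contrastive[term], gold_standard))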

Term Recognition Results

Contingency tables are constructed using the number of false positives and neg-

atives, and true positives and negatives obtained from Algorithm 3. Figure 6.8

summarises the errors introduced during the classification process for term recog-

nition using several different specialised corpora. We then computed the precision, recall, accuracy, F1 and F0.5 scores using the values in the contingency tables. Figure 6.9

summarises the performance metrics for term recognition using the different corpora.
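The computation of these metrics from a contingency table is standard; the sketch below uses illustrative counts that sum to the 1,300 candidates but are not taken from Figure 6.8.

    # Sketch: computing precision, recall, accuracy, F1 and F0.5 from a contingency table.
    # The counts below are illustrative only (they sum to 1,300 but are not Figure 6.8's values).
    def metrics(tp, fp, fn, tn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        def f_beta(beta):
            b2 = beta * beta
            return (1 + b2) * precision * recall / (b2 * precision + recall)
        return precision, recall, accuracy, f_beta(1.0), f_beta(0.5)

    p, r, a, f1, f05 = metrics(tp=1100, fp=30, fn=79, tn=91)
    print(f"precision={p:.4f} recall={r:.4f} accuracy={a:.4f} F1={f1:.4f} F0.5={f05:.4f}")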

Firstly, in the context of local corpora, Figure 6.9 shows that SPARTAN-L

achieved a better performance compared to BootCat. While SPARTAN-L is merely

2.5% more precise than BootCat, the latter had the worst recall, at 65.06%, among all the corpora included in the evaluation. The poor recall by BootCat

is due to its high false negative rate. In other words, true terms are not classified

as terms by BootCat due to its low-quality composition (e.g. poor coverage, speci-

ficity). Many domain-relevant terms in the vocabulary of molecular biology are not

covered by the BootCat-derived corpus. Despite being 19 times larger than GENIA,


(a) Results using the GENIA corpus.

(b) Results using the SPARTAN-V corpus.

(c) Results using the SPARTAN-L corpus.

(d) Results using the SREQ corpus.

(e) Results using the BootCat corpus.

Figure 6.8: The contingency tables summarising the term recognition results using

the various specialised corpora.

Figure 6.9: A summary of the performance metrics for term recognition.

the F1 score of the BootCat-derived corpus is far from ideal. The SPARTAN-L cor-

pus, which is 295 times larger than GENIA in terms of token size, has the closest

performance to the gold standard at F1 = 92.87%. Assuming that size does matter,

we speculate that a specialised Web-derived corpus at least 419 times larger than

GENIA (using linear extrapolation) would be required to match the latter’s high vo-

cabulary coverage and specificity for achieving a 100% F1 score. At the moment, this

conjecture remains to be tested. Given its inferior performance and effortless setup,


BootCat-derived corpora can only serve as baselines in the task of term recognition

using specialised Web-derived corpora.

Secondly, in the context of virtual corpora, term recognition using SPARTAN-V achieved the best performance across all metrics with a 99.56% precision, even

outperforming the local version SPARTAN-L. An interesting point here is that the

other virtual corpus, SREQ, achieved a good result with precision and recall close

to 90% despite the relative ease of setting up the apparatus required for guided

search engine querying. For this reason, we regard SREQ as the baseline for com-

paring the use of specialised virtual corpora in term recognition. In our opinion, a

9% improvement in precision justifies the additional systematic analysis of website

content performed by SPARTAN for creating a virtual corpus. From our experience,

the analysis of 200 websites generally requires on average, ceteris paribus, 1 to 1.5

hours of processing time using Yahoo API on a standard 1GHz computer with a

256 Mbps Internet connection. The ad-hoc use of search engines for accessing the

general virtual corpus may work for many NLP tasks. However, the relatively poor

performance by SREQ here justifies the need for more systematic techniques such

as SPARTAN when the Web is used as a specialised corpus for tasks such as term

recognition.

Thirdly, comparing virtual and local corpora, only SPARTAN-V scored

a recall above 90% at 96.44%. Upon localising, the recall of SPARTAN-L dropped to

89.40%. This further confirms that term recognition requires large corpora with high

vocabulary coverage, and that the SPARTAN technique has the ability to system-

atically construct virtual corpora with the required coverage. It is also interesting

to note that a large 118 million token local corpus (i.e. SPARTAN-L) matches the

recall of a 149,000 document virtual corpus (i.e. SREQ). However, due to the heterogeneous nature of the Web and the inadequacy of simple seed term restriction,

SREQ scored 6% less than SPARTAN-L in precision. This concurred with our ear-

lier conclusion that ad-hoc querying, as in SREQ, is not the optimal way of using

the Web as specialised virtual corpora. Even the considerably smaller BootCat-

derived corpus achieved a 4% higher precision compared to SREQ. This shows that

size and coverage (there are 46 times more documents in SREQ than in BootCat) contribute only to recall, which explains SREQ’s 24% better recall than BootCat. Due to SREQ’s lack of vocabulary specificity, it had the lowest precision at 90.44%.

Overall, certain tasks indeed benefit from larger corpora, obviously when metic-

ulously constructed. More specifically, tasks which do not require local access to

the texts in the corpora such as term recognition may well benefit from the con-


siderably larger and distributed nature of virtual corpora. This is evident when

the SPARTAN-based corpus fared 3 − 7% less across all metrics upon localising

(i.e. SPARTAN-L). Furthermore, the very close F1 score achieved by the worst per-

forming virtual corpus (i.e. baseline SREQ) with the best performing local corpus

SPARTAN-L shows that virtual corpora may indeed be more suitable for the task

of term recognition. We speculate that several reasons are at play, including the

ever-evolving vocabulary on the Web, and the sheer size of the vocabulary that even

Web-derived corpora cannot match.

In short, in the context of term recognition, the two most important factors which

determine the adequacy of the constructed corpora are coverage and specificity.

On the one hand, larger corpora, even when conceived in an ad-hoc manner, can

potentially lead to higher coverage, which in turn contributes significantly to recall.

On the other hand, the extra effort spent on systematic analysis leads to a more specific vocabulary, which in turn contributes to precision. Most existing techniques

lack focus on either one or both factors, leading to poorly constructed and inadequate

virtual corpora and Web-derived corpora. For instance, BootCat has difficulty in

practically constructing very large corpora, while ad-hoc techniques such as SREQ lack systematic analysis, which results in poor specificity. From our evaluation, only SPARTAN-V achieved a balanced F1 score exceeding 95%. In other words, the virtual corpora constructed using SPARTAN are both adequately large with high coverage and have a sufficiently specific vocabulary to achieve highly desirable term recognition

performance. We can construct much larger specialised corpora using SPARTAN by

adjusting certain thresholds. We can adjust τC , τS and τA to allow for more websites

to be included into the virtual corpora. We can also permit more related terms to

be included as extended seed terms during STEP. This will allow more webpages to

be downloaded to create even larger Web-derived corpora. This is possible since the

maximum number of pages derivable from the 43 websites is 84,963,524 as shown in Figure 6.7. During the localisation phase, only 64,578 webpages, a mere 0.07% of the total, were actually downloaded. In other words, the SPARTAN technique

is highly customisable to create both small and very large Web and Web-derived

corpora using only several thresholds.

6.5 Conclusions

The sheer volume of textual data available on the Web, the ubiquitous coverage

of topics, and the growth of content have become the catalysts in promoting a wider

acceptance of the Web for corpus construction in various applications of knowledge


discovery and information extraction. Despite the extensive use of the Web as a

general virtual corpus, very few studies have focused on the systematic analysis

of website contents for constructing specialised corpora from the Web. Existing

techniques such as BootCat simply pass the responsibility of deciding on suitable

webpages to the search engines. Others allow their Web crawlers to run astray (subsequently resulting in topic drift) without systematic controls while downloading

webpages for corpus construction. In the face of these inadequacies, we introduced

a novel technique called SPARTAN which places emphasis on the analysis of the

domain representativeness of websites for constructing virtual corpora. This tech-

nique also provides the means to extend the virtual corpora in a systematic way

to construct specialised Web-derived corpora with high vocabulary coverage and

specificity.

Overall, we have shown that SPARTAN is independent of the search engines used

during corpus construction. SPARTAN performed the re-ranking of websites pro-

vided by search engines based on their domain representativeness to allow those with

the highest vocabulary coverage, specificity and authority to surface. The system-

atic analysis performed by SPARTAN is adequately justified when the performance

of term recognition using SPARTAN-based corpora achieved the best precision and

recall in comparison to all other corpora based on existing techniques. Moreover,

our evaluation showed that only the virtual corpora constructed using SPARTAN

are both adequately large with high coverage and have a sufficiently specific vocabulary to achieve a balanced term recognition performance (i.e. the highest F1 score). Most

existing techniques lack focus on either one or both factors. We conclude that larger

corpora, when constructed with consideration for vocabulary coverage and speci-

ficity, deliver the prerequisites required for producing consistent and high-quality

output during term recognition.

Several future studies have been planned to further assess SPARTAN. In the near

future, we hope to study the effect of corpus construction using different seed terms

W. We also intend to examine how the content of SPARTAN-based corpora evolves over time and its effect on term recognition. Furthermore, we are also planning to study the possibility of extending the use of virtual corpora to other applications which require contrastive analysis.

6.6 Acknowledgement

This research was supported by the Australian Endeavour International Post-

graduate Research Scholarship. The authors would like to thank the anonymous


reviewers for their invaluable comments.

6.7 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2008) Constructing Web Corpora through

Topical Web Partitioning for Term Recognition. In the Proceedings of the 21st Aus-

tralasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand.

This paper reports the preliminary ideas on the SPARTAN technique for creating

text corpora using data from the Web. The SPARTAN technique was later improved

and extended to form the core contents of this Chapter 6.


CHAPTER 7

Term Clustering for Relation Acquisition

Abstract

Many conventional techniques for concept formation in ontology learning rely on the use of predefined templates and rules, and static background knowledge such as WordNet. Not only are these techniques difficult to scale across different domains and to adapt to knowledge change, but their results are also far from desirable. This chapter proposes a new multi-pass clustering algorithm for concept formation known as Tree-Traversing Ant (TTA) as part of an ontology learning system. This tech-

nique uses Normalised Google Distance (NGD) and n-degree of Wikipedia (noW)

as measures for similarity and distance between terms to achieve highly adaptable

clustering across different domains. Evaluations using seven datasets show promis-

ing results with an average lexical overlap of 97% and an ontological improvement

of 48%. In addition, the evaluations demonstrated several advantages that are not

simultaneously present in standard ant-based and other conventional clustering tech-

niques.

7.1 Introduction

Ontologies are gaining increasing importance in modern information systems for

providing inter-operable semantics. The increasing demand for ontologies makes labour-intensive creation more and more undesirable, if not impossible. Exacerbating the situation is the problem of knowledge change that results from ever growing information sources, both online and offline. Since the late nineties, more and more researchers have started looking for solutions to relieve knowledge engineers from this increasingly acute situation. One of the main research areas with potentially high impact is the automatic or semi-automatic construction and maintenance of ontologies from electronic text. Ontology learning from text is the process of identifying con-

cepts and relations from natural language text, and using them to construct and

maintain ontologies. In ontology learning, terms are the lexical realisations of im-

portant concepts for characterising a domain. Consequently, the task of grouping

together variants of terms to form concepts, known as term clustering, constitutes

0This chapter appeared in Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages

349-381, with the title “Tree-Traversing Ant Algorithm for Term Clustering based on Featureless

Similarities”.


a crucial fundamental step in ontology learning.

Unlike documents [242], webpages [44], and pixels in image segmentation and ob-

ject recognition [113], terms alone are lexically featureless. Similarity of objects can

be established by feature analysis based on visible (e.g. physical and behavioural)

traits. Unfortunately, using object names (i.e. terms) alone, similarity depends

on something less tangible, namely, background knowledge which humans acquired

through their senses over the years. The absence of features requires certain ad-

justments to be made with regard to the term clustering techniques. One of the

most evident adaptation required is the use of context and other linguistic evidence

as features for the computation of similarity. A recent survey [90] revealed that all

ontology learning systems which apply clustering techniques rely on the contextual

cues surrounding the terms as features. The large collection of documents, and pre-

defined patterns and templates required for the extraction of contextual cues makes

the portability of such ontology learning systems difficult. Consequently, non-feature

similarity measures are fast becoming a necessity for term clustering in ontology

learning from text. Along the same line of thought, Lagus et al. [137] stated that “In

principle a document might be encoded as a histogram of its words...symbolic words

as such retain no information of their relatedness”. In addition to the problems as-

sociated with feature extraction in term clustering, much work is still required with

respect to the clustering algorithm itself. Researchers [98] have shown that certain

commonly adopted algorithms such as K-means and average-link agglomerative clustering yield mediocre results in comparison with ant-based algorithms, which represent a rela-

tively new paradigm. Handl et al. [98] demonstrated certain desirable properties in

ant-based algorithms such as the tolerance to different cluster sizes, and the ability

to identify the number of clusters. Despite such advantages, the potential of ant-based algorithms remains relatively unexplored for possible applications in ontology

learning.

In this chapter, we employ the established Normalised Google Distance (NGD)

[50] together with a new hybrid, multi-pass algorithm called Tree-Traversing Ant

(TTA)1 for clustering terms in ontology learning. TTA fuses the strengths of

standard ant-based and conventional clustering techniques with the advantages of

featureless-similarity measures. In addition, a second-pass is introduced in TTA

1This foundation work on term clustering using featureless similarity measures appeared in the

Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR),

Perth, Australia, 2006 with the title “Featureless Similarities for Terms Clustering using Tree-

Traversing Ants”.


for refining the results produced using NGD. During the second-pass, the TTA em-

ploys a new distance measure called n-degree of Wikipedia (noW) for quantifying the

distance between two terms based on Wikipedia’s categorical system. Evaluations

using seven datasets show promising results and reveal several advantages which

are not simultaneously present in existing clustering algorithms. In Section 7.2, we give an introduction to the current term clustering techniques for ontology learning. In Section 7.3, a description of the NGD measure and an introduction to standard ant-based clustering are presented. In Section 7.4, we present the TTA, and how NGD and noW are employed to support term clustering. In Section 7.5, we summarise the results and findings from our evaluations. Finally, we conclude this chapter with an outlook to future work in Section 7.6.

7.2 Existing Techniques for Term Clustering

Faure & Nedellec [69] presented a corpus-based conceptual clustering technique

as part of an ontology learning system called ASIUM. The clustering technique is

designed for aggregating basic classes based on a distance measure inspired by the

Hamming distance. The basic classes are formed prior to clustering in a phase for

extracting sub-categorisation frames [71]. Terms that appear on at least two different occasions with the same verb, and the same preposition or syntactic role, can be regarded as semantically similar such that they can be substituted with one another in

that particular context. These semantically similar terms form the basic classes. The

basic classes form the lowest level of the ontology and are successively aggregated

to construct a hierarchy from bottom-up. Each time, only two basic classes are

compared. The clustering begins by computing the distance between all pairs of

basic classes and aggregating those with a distance less than a user-defined threshold. Two classes containing the same words with the same frequencies have a distance of 0. On the other hand, two classes without a single common word have a distance of 1. In other words, the terms in the basic classes act as features,

allowing for inter-class comparison. The measure for distance is defined as

\[
distance(C_1, C_2) = 1 - \frac{\frac{\sum F_{C_1} \times N_{comm}}{card(C_1)} + \frac{\sum F_{C_2} \times N_{comm}}{card(C_2)}}{\sum_{i=1}^{card(C_1)} f(word_i^{C_1}) + \sum_{i=1}^{card(C_2)} f(word_i^{C_2})}
\]

where card(C1) and card(C2) are the numbers of words in C1 and C2, respectively, and Ncomm is the number of words common to both C1 and C2. ΣFC1 and ΣFC2 are the sums of the frequencies of the words in C1 and C2 which also occur in C2 and C1, respectively. f(word_i,C1) and f(word_i,C2) are the frequencies of the ith word of classes C1 and C2, respectively.
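As a rough illustration of this distance (based on the reconstruction above, which should be checked against Faure & Nedellec's original definition), the sketch below compares two small basic classes represented as word-frequency maps.

    # Toy sketch of the ASIUM-style class distance as reconstructed above
    # (verify against the original ASIUM definition before any serious reuse).
    def asium_distance(c1, c2):
        # c1 and c2 map the words of each basic class to their frequencies.
        common = set(c1) & set(c2)
        n_comm = len(common)
        sum_f_c1 = sum(c1[w] for w in common)   # frequencies of C1 words also in C2
        sum_f_c2 = sum(c2[w] for w in common)   # frequencies of C2 words also in C1
        numerator = (sum_f_c1 * n_comm / len(c1)) + (sum_f_c2 * n_comm / len(c2))
        denominator = sum(c1.values()) + sum(c2.values())
        return 1 - numerator / denominator

    identical = {"car": 2, "truck": 1}
    print(asium_distance(identical, identical))        # 0.0 for identical classes
    print(asium_distance({"car": 2}, {"banana": 3}))   # 1.0 when nothing is shared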


Maedche & Volz [165] presented a bottom-up hierarchical clustering technique

that is part of the ontology learning system Text-to-Onto. This term clustering

technique relies on an all-knowing oracle, denoted by H, which is capable of re-

turning possible hypernyms for a given term. In other words, the performance of

the clustering algorithm has an upper-bound limited by the ability of the oracle to

know all possible hypernyms for a term. The oracle is constructed using WordNet

and lexico-syntactic patterns [51]. During the clustering phase, the algorithm is

provided with a list of terms and the similarity between each pair is computed using

the cosine measure. For this purpose, the syntactic dependencies of each term are

extracted and used as the features for that term. The algorithm is an extremely

long list of nested if-else statements. For the sake of brevity, it suffices to know

that the algorithm examines the hypernymy relations between all pairs of terms

before it decides on the placement of terms as parents, children or siblings of other

terms. Each time the information about the hypernym relations between two terms

is required, the oracle is consulted. The projection H(t) returns a set of tuples (x, y)

where x is a hypernym of term t and y is the number of times the algorithm has

found the evidence for it.

Shamsfard & Barforoush [225] presented two clustering algorithms as part of the

ontology learning system Hasti. Concepts have to be formed prior to the clustering

phase. It suffices to know that the process of forming the concepts and extracting

relations that are used as features for clustering involve a knowledge extractor where

“the knowledge extractor is a combination of logical, template driven and semantic

analysis methods” [227]. In the concept-based clustering technique, a similarity ma-

trix, consisting of the similarities of all possible pairs of concepts, is computed. The

pair with the maximum similarity that is also greater than the merge-threshold is

chosen to form a new super concept. In this technique, each intermediate (i.e. non-

leaf) node in the conceptual hierarchy has at most two children, but the hierarchy is

not a binary tree as each node may have more than one parent. As for the relation-

based clustering technique, only non-taxonomic relations are considered. For every

concept c, a set of assertions about the non-taxonomic relations NF (c) that c has

with other concepts is identified. In other words, these relations can be regarded as

features that allow concepts to be merged according to what they share. If at least

one related concept is common between assertions about that relation, then the set

comprising the other concepts (called merge-set) contains good candidates for merg-

ing. After all the relations have been examined, a list of merge-sets is obtained. The

merge-set with the highest similarity between its members is chosen for merging. In


both clustering algorithms, the similarity measure employed is defined as

\[
similarity(a, b) = \sum_{j=1}^{maxlevel} \left( \sum_{i=1}^{card(cm)} \left( W_{cm(i).r} + \sum_{k=1}^{valence(cm(i).r)} W_{cm(i).arg(k)} \right) \right) \times L_j
\]

where cm = Nf(a) ∩ Nf(b) is the intersection between the sets of assertions (i.e.

common relations) about a and b, and card(cm) is the cardinality of cm. Wcm(i).r is

the weight for each common relation, and the inner sum of Wcm(i).arg(k) over k = 1, ..., valence(cm(i).r) is the sum of the weights of all terms related to the common relations cm. Lj is the level constant

assigned to each similarity level which decreases as the level increases. The main

aspect of the similarity measure is the common features between two concepts a and

b (i.e. the intersection between the sets of non-taxonomic assertions Nf(a)∩Nf(b)).

Each common feature cm(i).r together with the corresponding weight Wcm(i).r and

the weight of the related terms are accumulated. In other words, the more features

two concepts have in common, the higher the similarity between them.

Regardless of how the existing techniques described in this section are named,

they share a common point, namely, the reliance on some linguistic (e.g. subcate-

gorisation frames, lexico-syntactic patterns) or predefined semantic (e.g. WordNet)

resources as features. These features are necessary for the computation of similarity

using conventional measures and clustering algorithms. The ease of scalability across

different domains and the resources required for feature extraction are among the issues our new clustering technique attempts to overcome. In addition, the

new clustering technique fuses the strengths of recent innovations such as ant-based

algorithms and featureless similarity measures that have yet to benefit ontology

learning systems.

7.3 Background

7.3.1 Normalised Google Distance

Normalised Google Distance (NGD) computes the semantic distance between

objects based on their names using only page counts from the Google search engine.

A more generic name for the measure that employs page counts provided by any

Web search engines is the Normalised Web Distance (NWD) [262]. NGD is a non-

feature distance measure which attempts to capture every effective distance (e.g.

Hamming distance, Euclidean distance, edit distances) into a single metric. NGD

is based on the notions of Kolmogorov Complexity [93] and Shannon-Fano coding

[142].


The basis of NGD begins with the idea of the shortest binary program capable

of producing a string x as an output. The Kolmogorov Complexity of the string x,

K(x) is just the length of that program in binary bits. Extending this notion to

include an additional string y produces the Information Distance [23] where E(x, y)

is the length of the shortest binary program that can produce x given y, and y given

x. It was shown that [23]:

\[ E(x, y) = K(x, y) - \min\{K(x), K(y)\} \qquad (7.1) \]

where E(x, x) = 0, E(x, y) > 0 for x ≠ y, and E(x, y) = E(y, x). Next, for every other computable distance D that is non-negative and symmetric, there is a binary

program, given string x and y, with a length equal to D(x, y). Formally,

E(x, y) ≤ D(x, y) + cD

where cD is a constant that depends on the distance D and not x and y. E(x, y) is

called universal because it acts as the lower bound for all computable distances. In

other words, if two strings x and y are close according to some distance D, then they

are at least as close according to E [49]. Since all computable distances compare

the closeness of strings through the quantification of certain common features they

share, we can consider that information distance determines the distance between

two strings according to the feature by which they are most similar.

By normalising the information distance, we have NID(x, y) ∈ (0, 1) where 0 means the two strings are the same and 1 means they are completely different in the sense that they

share no features. The normalised information distance is defined as:

\[ NID(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}} \]

Nonetheless, referring back to Kolmogorov Complexity and Equation 7.1, the non-computability of K(x) implies the non-computability of NID(x, y). Fortunately,

an approximation of K can be achieved using real compression programs [261]. If

C is a compressor, then C(x) denotes the length of the compressed version of string

x. Approximating K(x) with C(x) results in:

\[ NCD(x, y) = \frac{C(x, y) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \]

The derivation of NGD continues by observing the working behind compressors.

Compressors encode source words x into code words x′ such that the length |x′| < |x|.

We can consider these code words from the perspective of Shannon-Fano coding.


Shannon-Fano coding encodes a source word x using a code word of length log(1/p(x)). p(x) can be thought of as a probability mass function that maps each source word x to the code that achieves optimal compression of x. In Shannon-Fano coding, p(x) = nx/N captures the probability of encountering source word x in a text or a stream of data from a source, where nx is the occurrence of x and N is the total

number of source words in the same text. Cilibrasi & Vitanyi [49] discussed the use of

compressors for NCD and concluded that the existing compressors’ inability to take

into consideration external knowledge during compression makes them inadequate.

Instead, the authors proposed to make use of a source that “...stands out as the most

inclusive summary of statistical information” [49], namely, the World Wide Web.

More specifically, the authors proposed the use of the Google search engine to devise

a probability mass function that reflects the Shannon-Fano code. It appears that

the Google equivalent of the Shannon-Fano code, known as the Google code, has length

defined by [49]:

\[ G(x) = \log \frac{1}{g(x)} \qquad G(x, y) = \log \frac{1}{g(x, y)} \]

where g(x) = |x|/N and g(x, y) = |x ∩ y|/N are the new probability mass functions that capture the probability of occurrences of the search terms x and y. x is the set

of webpages returned by Google containing the single search term x (i.e. singleton

set) and similarly, x ∩ y is the set of webpages returned by Google containing both

search term x and y (i.e. doubleton set). N is the summation of all unique singleton

and doubleton sets.

Consequently, the Google search engine can be considered as a compressor for

encoding search terms (i.e. source words) x to produce the meaning (i.e. compressed

code words) that has the length G(x). By rewriting the NCD, we obtain the new

NGD defined as:

\[ NGD(x, y) = \frac{G(x, y) - \min\{G(x), G(y)\}}{\max\{G(x), G(y)\}} \qquad (7.2) \]

All in all, NGD is an approximation of NCD, and hence of NID, to overcome the non-

computability of Kolmogorov Complexity. NGD employs the Google search engine

as a compressor to generate Google codes based on the Shannon-Fano coding. From

the perspective of term clustering, NGD provides an innovative starting point which

demonstrates the advantages of featureless similarity measures. In our new term

clustering technique, we take such innovation a step further by employing NGD in


a new clustering technique that combines the strengths from both conventional and

ant-based algorithms.
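To make Equation 7.2 concrete, the sketch below computes NGD from raw page counts; the counts and the value of N are made up for illustration, since in practice they would come from live search engine queries.

    # Sketch: computing NGD from page counts (all counts below are made up).
    import math

    def ngd(count_x, count_y, count_xy, n):
        # G(x) = log(1 / g(x)) with g(x) = count_x / n, and similarly for y and (x, y).
        g_x, g_y, g_xy = math.log(n / count_x), math.log(n / count_y), math.log(n / count_xy)
        return (g_xy - min(g_x, g_y)) / max(g_x, g_y)

    # Hypothetical page counts for a related pair and an unrelated pair of terms.
    print(ngd(count_x=120_000, count_y=80_000, count_xy=30_000, n=10_000_000))   # smaller distance
    print(ngd(count_x=120_000, count_y=80_000, count_xy=200,    n=10_000_000))   # larger distance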

7.3.2 Ant-based Clustering

The idea of ant-based clustering was first proposed by Deneubourg et al. [61] in

1991 as part of an attempt to explain the different types of emergent technologies

inspired by nature. During simulation, the ants are represented as agents that move

around the environment, a square grid, in random. Objects are randomly placed in

this environment and the ants can pick-up the object, move and drop them. These

three basic operations are influenced by the distribution of the objects. Objects that

are surrounded by dissimilar ones are more likely to be picked up and later dropped

elsewhere in the vicinity of more similar ones. The picking up and dropping of objects are influenced by the following probabilities:

\[ P_{pick}(i) = \left( \frac{k_p}{k_p + f(i)} \right)^2 \qquad P_{drop}(i) = \left( \frac{f(i)}{k_d + f(i)} \right)^2 \]

where f(i) is an estimation of the distribution density of the objects in the ants’

immediate environment (i.e. local neighbourhood) with respect to the object that

the ant is considering picking up or dropping. The choice of f(i) varies depending on

the cost and other factors related to the environment and the data items. As f(i)

decreases below kp, the probability of picking up the object is very high, and the

opposite occurs when f(i) exceeds kp. As for the probability of dropping an object,

high f(i) exceeding kd induces the ants to give up the object, while f(i) less than

kd encourages the ants to hold on to the object. The combination of these three

simple operations and the heuristics behind them gave birth to the notion of basic

ants for clustering, also known as standard ant clustering algorithm (SACA).
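A minimal sketch of these pick and drop probabilities is given below; the neighbourhood function used here is a simple stand-in for f(i), and the constants are illustrative rather than values from the literature.

    # Sketch: basic-ant pick/drop probabilities with a simplified neighbourhood density f(i).
    # k_p, k_d and the similarity values are illustrative constants, not canonical settings.
    def p_pick(f_i, k_p=0.1):
        return (k_p / (k_p + f_i)) ** 2

    def p_drop(f_i, k_d=0.15):
        return (f_i / (k_d + f_i)) ** 2

    def neighbourhood_density(similarities):
        # Simple stand-in for f(i): average similarity of the object to its neighbours.
        return sum(similarities) / len(similarities) if similarities else 0.0

    f_i = neighbourhood_density([0.8, 0.7, 0.9])   # object sits among similar neighbours
    print(round(p_pick(f_i), 3), round(p_drop(f_i), 3))   # low pick, high drop probability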

Gutowitz [94] examined the basic ants described by Deneubourg et al. and

proposed a variant ant known as complexity-seeking ants. Such ants are capable of

sensing local complexity and are inclined to work in regions of high interest (i.e. high

complexity). Regions with high complexity are determined using a local measure

that assesses the neighbouring cells and counts the number of pairs of contrasting

cells (i.e. occupied or empty). Neighbourhoods with all empty or all occupied im-

mediate cells have zero complexity while regions with checkerboard patterns have high


complexity. Hence, these modified ants are able to accomplish their task faster be-

cause they are more inclined to manipulate objects in regions with higher complexity

[263].

Lumer & Faieta [160] further extended and improved the idea of ant-based clus-

tering in terms of the numerical aspects of the algorithm and the convergence time.

The authors represented the objects in terms of numerical vectors and the distance

between the vectors is computed using the Euclidean distance. Hence, given δ(i, j) ∈ [0, 1] as the Euclidean distance between object i (i.e. i is the location of the object in the centre of the neighbourhood) and every other neighbouring object j,

the neighbourhood function f(i) is defined by the authors as:

f(i) = (1/s²) Σ_j (1 − δ(i, j)/α)  if f(i) > 0, and f(i) = 0 otherwise    (7.3)

where s² is the size of the local neighbourhood, and α ∈ [0, 1] is a constant for scaling

the distance among objects. In other words, an ant has to consider the average

similarity of object i with respect to all other objects j in the local neighbourhood

before performing an operation (i.e. pickup or drop). As the value of f(i) is obtained

by averaging the total similarities with the number of neighbouring cells s², empty

cells which do not contribute to the overall similarity must be penalised. In addition,

the radius of perception (i.e. the extent to which objects are taken into consideration

for f(i)) of each ant at the centre of the local neighbourhood is given by (s − 1)/2. The clustering algorithm using the basic ant SACA is defined in Algorithm 4.
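Before turning to later refinements, the following sketch shows one way Equation 7.3 could be evaluated; it assumes, purely for this example, that the grid is stored as a dictionary mapping (x, y) cell coordinates to object vectors and that s and α are supplied by the caller.

import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def neighbourhood_density(grid, centre, s, alpha):
    # Lumer & Faieta's f(i) (Eq. 7.3): average scaled similarity over the s x s
    # neighbourhood around `centre`; empty cells contribute nothing but still count
    # towards the divisor s^2, which penalises sparse neighbourhoods.
    (cx, cy), radius = centre, (s - 1) // 2
    item_i = grid[(cx, cy)]
    total = 0.0
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if (dx, dy) == (0, 0):
                continue
            item_j = grid.get((cx + dx, cy + dy))
            if item_j is not None:
                total += 1.0 - euclidean(item_i, item_j) / alpha
    f_i = total / (s * s)
    return f_i if f_i > 0 else 0.0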

Handl & Meyer [100] introduced several enhancements to make ant-based clus-

tering more efficient. The first is the concept of eager ants where idle phases are

avoided by having the ants immediately pick up objects as soon as existing ones

are dropped. The second is the notion of stagnant control. There are occasions in

ant-based clustering when ants are occupied or blocked due to objects that are difficult to dispose of. In such cases, the ants are forced to drop whatever they are carrying

after a certain number of unsuccessful drops. In a different paper [98], the authors

have also demonstrated that the ant-based algorithm has several advantages:

• tolerance to different cluster sizes

• ability to identify the number of clusters

• performance increases with the size of the datasets

• graceful degradation in the face of overlapping clusters.


Algorithm 4 Basic ant-based clustering defined by Handl et al. [99]
1:  begin
2:  //INITIALISATION PHASE
3:  Randomly scatter data items on the toroidal grid
4:  for each j in 1 to #agents do
5:      i := random_select(remaining_items)
6:      pick_up(agent(j), i)
7:      g := random_select(remaining_empty_grid_locations)
8:      place_agent(agent(j), g)
9:  //MAIN LOOP
10: for each it_ctr in 1 to #iterations do
11:     j := random_select(all_agents)
12:     step(agent(j), stepsize)
13:     i := carried_item(agent(j))
14:     drop := drop_item?(f(i))
15:     if drop = TRUE then
16:         while pick = FALSE do
17:             i := random_select(free_data_items)
18:             pick := pick_item?(f(i))

Nonetheless, the authors have also highlighted two shortcomings of ant-based clustering, namely, the inability to distinguish more refined clusters within coarser-level ones, and the inability to specify the number of clusters, which can be a disadvantage when users have precise ideas about it.

Vizine et al. [263] proposed an adaptive ant clustering algorithm (A2CA) that

improves upon the algorithm by Lumer & Faieta. The authors introduced two major

modifications, namely, a progressive vision scheme and the use of pheromones on grid cells. The progressive vision scheme allows the dynamic adjustment of s². Whenever an ant perceives a larger cluster, it increases its radius of perception from the original (s − 1)/2 to a new (s′ − 1)/2. The second enhancement allows ants to mark regions that are

recently constructed or under construction. The pheromones attract other ants,

resulting in an increase in the probability that relatively smaller regions are deconstructed and in the probability of dropping objects at denser clusters.

Ant-based algorithms have been employed to cluster objects that can be repre-

sented using numerical vectors. Similar to conventional algorithms, the similarity or

distance measures used by existing ant-based algorithms are still feature-based. Consequently, they share similar problems, such as poor portability across domains.

In addition, despite the strengths of standard ant-based algorithms, two disadvan-

tages were identified. In our new technique, we make use of the known strengths of

standard ant-based algorithms and some desirable traits from conventional ones for

clustering terms using featureless similarity.

7.4 The Proposed Tree-Traversing Ants

The Tree-Traversing Ant (TTA) clustering technique is based on dynamic tree

structures as compared to toroidal grids in the case of standard ants. The dynamic

tree begins with one root node r0 consisting of all terms T = {t1, ..., tn}, and branches out to new sub-nodes as required. In other words, the clustering process begins with r0 = {t1, ..., tn}. For example, the first snapshot in Figure 7.1 shows the start of the TTA clustering process with the root node r0 initialised with the terms t1, ..., t10 (i.e. n = 10). Essentially, each node in the tree is a set of terms ru = {t1, ..., tq}. The sizes of new sub-nodes |ru| reduce as fewer and fewer terms are assigned to them in the process of

creating nodes with higher intra-node similarity.

The clustering starts with only one ant, while an unbounded number of ants await to work at each of the new sub-nodes created. In the third snapshot in Figure

7.1, while the first ant moves on to work at the left sub-node r01, a new second

ant proceeds to process the right sub-node r02. The number of possible new sub-

nodes for each main node (i.e. branching factor) in this version of TTA is two. In other words, for each main node rm, we have the sub-nodes {rm1, rm2}. Similar to some of the current enhanced ants, the TTA ants are endowed with short-term memory for remembering similarities and distances acquired through

their senses. The TTA is equipped with two types of senses, namely, NGD and

n-degree of Wikipedia (noW). The standard ants have a radius of perception defined

in terms of cells immediately surrounding the ants. Instead, the perception radius

of TTA ants covers all terms in the two sub-nodes created for each current node. A

current node is simply a node originally consisting of terms to be sorted into the new sub-nodes.

The TTA adopts a two-pass approach for term clustering. During the first-pass,

the TTA recursively breaks nodes into sub-nodes and relocates terms until the ideal

clusters are achieved. The resulting trees created in the first-pass are often good

enough to reflect the natural clusters. Nonetheless, discrepancies do occur due to

certain oddities in the co-occurrences of terms on the World Wide Web that manifest

themselves through NGD. Accordingly, a second-pass that uses noW is introduced for relocating terms which are displaced due to NGD. The second-pass can be regarded as a refinement phase for producing clusters of higher quality.

Figure 7.1: Example of TTA at work

7.4.1 First-Pass using Normalised Google Distance

The TTA begins clustering at the root node which consists of all n terms, r0 = {t1, ..., tn}. Each term can be considered as an element in the node. A TTA ant randomly picks a term, and proceeds to sense its similarity with every other term in that same node. The ant repeats this for all n terms until the similarities of all possible pairs of terms have been memorised. The similarity between two terms tx

and ty is defined as:

s(tx, ty) = 1 − NGD(tx, ty)/α    (7.4)

where NGD(tx, ty) is the distance between terms tx and ty estimated using the original NGD defined in Equation 7.2. α is a constant for scaling the distance between

the two terms. The algorithm then grows two new sub-nodes to accommodate the

two least similar terms ta and tb. The ant moves the first term ta from the main

node rm to the first sub-node while emitting pheromones that trace back to tb in


the process. The ant then follows the pheromone trail back to the second term tb

to move it to the second sub-node.

The second snapshot in Figure 7.1 shows two new sub-nodes r01 and r02. The

ant moved the term t1 to r01 and the least similar term t6 to r02. Nonetheless, prior

to the creation of new sub-nodes and the relocation of terms, an ideal intra-node

similarity condition must be tested. The operation of moving the two least similar

terms from the current node to create and initialise new sub-nodes is essentially a

partitioning process. Eventually, each leaf node will end up with only one term if

the TTA does not know when to stop. For this reason, we adopt an ideal intra-node

similarity threshold sT for controlling the extent of branching out. Whenever an ant

senses that the similarity between the two least similar terms exceeds sT , no further

sub-nodes will be created and the partitioning process at that branch will cease. A

high similarity (higher than sT ) between the two most dissimilar terms in a node

provides a simple but effective indication that the intra-node similarity has reached

an ideal stage. More refined factors such as the mean and standard deviation of

intra-node similarity are possible but have not been considered.

If the similarity between the two most dissimilar terms is still less than sT ,

further branching out will be performed. In this case, the TTA ant repeatedly picks

up the remaining terms on the current node one by one and senses their similarities

with every other term already located in the sub-nodes. Formally, the

probability of picking up term ti by an ant in the first-pass is defined as:

P¹pick(ti) = 1 if ti ∈ rm, 0 otherwise    (7.5)

where rm is the set of terms in the current node. In other words, the probability

of picking up terms by an ant is always 1 as long as there are still terms remain-

ing in the current node. Each term ti ∈ rm is moved to one of the two sub-nodes

ru that has the term tj ∈ ru with the highest similarity with ti. In other words,

an ant considers multiple neighbourhoods prior to dropping a term. Snapshot 3 in

Figure 7.1 illustrates the corresponding two sub-nodes r01 and r02 that have been

populated with all the terms which were previously located at the current node

r0. The standard neighbourhood function f(i) defined in Equation 7.3 represents

the density of the neighbourhood as the average of the similarities between ti and

every other term in its immediate surrounding (i.e. local neighbourhood) confined

by s². Unlike the sense of basic ants, which covers only the surrounding s² cells, the extent of a TTA ant's perception covers all terms in the two sub-nodes (i.e. multiple neighbourhoods) corresponding to the immediate current node. Accordingly, instead of estimating f(i) as the averaged similarity defined over the s² terms

surrounding the ant, the new neighbourhood function fTTA(ti, ru) is defined as the

maximum similarity between term ti ∈ rm and the neighbourhood (i.e. sub-nodes)

ru. The maximum similarity between ti and ru is the highest similarity between ti

and all other terms tj ∈ ru. Formally, we define the density of neighbourhood ru

with respect to term ti during the first-pass as:

f¹TTA(ti, ru) = max{ s(ti, tj) : tj ∈ ru }    (7.6)

where the similarity between the two terms s(ti, tj) is computed using Equation 7.4.

Besides deciding on whether to drop an object or not, like in the case of basic

ants, the TTA ant has to decide on one additional issue, namely, where to drop.

The TTA decides on where to drop a term based on the fTTA(ti, ru) that it has

memorised for all sub-nodes ru of the current node rm. Formally, the decision on

whether to drop term ti ∈ rm on sub-node rv depends on:

P¹drop(ti, rv) = 1 if f¹TTA(ti, rv) = max{ f¹TTA(ti, ru) : ru ∈ {rm1, rm2} }, 0 otherwise    (7.7)

The current version of the TTA clustering algorithm is implemented in two parts.

The first is the main function while the second one is a recursive function. The

main function is defined in Algorithm 5 while the recursive function for the first-

pass elaborated in this subsection is reported in Algorithm 6.

Algorithm 5 Main function
1:  input A list of terms, T = {t1, ..., tn}
2:  Create an initial tree with a root node r0 containing the n terms
3:  Define the ideal intra-node similarity threshold sT and the outlier discrimination threshold δT
4:  //first-pass using NGD
5:  ant := new_ant()
6:  ant.ant_traverse(r0, r0)
7:  //second-pass using noW
8:  leafnodes := ant.pickup_trail()  //return all leaf nodes marked by pheromones
9:  for each rnext ∈ leafnodes do
10:     ant.ant_refine(leafnodes, rnext)


Algorithm 6 Function ant_traverse(rm, r0) using NGD
1:  if |rm| = 1 then
2:      leave_trail(rm, r0)  //leave trail from current leaf node to root node, for use in second-pass
3:      return  //only one term left. return to root
4:  ta, tb := find_most_dissimilar_terms(rm)
5:  if s(ta, tb) > sT then
6:      leave_trail(rm, r0)  //leave trail from current leaf node to root node, for use in second-pass
7:      return  //ideal cluster has been achieved. return to root node
8:  else
9:      rm1, rm2 := grow_sub_nodes(rm)
10:     move_terms(ta, tb, rm1, rm2)
11:     for each term ti ∈ rm do
12:         pick(ti)  //based on Eq. 7.5
13:         for each ru ∈ {rm1, rm2} do
14:             for each term tj ∈ ru do
15:                 s(ti, tj) := sense_similarity(ti, tj)  //based on Eq. 7.4
16:                 remember_similarity(s(ti, tj))
17:             f¹TTA(ti, ru) := sense_neighbourhood()  //based on Eq. 7.6
18:             remember_neighbourhood(f¹TTA(ti, ru))
19:         ∀u, f¹TTA(ti, ru) := recall_neighbourhood()
20:         rv := decide_drop(∀u, f¹TTA(ti, ru))  //based on Eq. 7.7
21:         drop(ti, rv)
22:     antm1 := new_ant()
23:     antm1.ant_traverse(rm1, r0)  //repeat the process recursively for each sub-node
24:     antm2 := new_ant()
25:     antm2.ant_traverse(rm2, r0)  //repeat the process recursively for each sub-node
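To complement the pseudocode above, the following is a compact Python sketch of the first-pass (a reading of Equations 7.4 to 7.7 and Algorithm 6, not the actual implementation developed in this research); ngd is assumed to be a callable returning the Normalised Google Distance between two terms, and each node is represented simply as a list of terms.

from itertools import combinations

def similarity(tx, ty, ngd, alpha=1.0):
    # Eq. 7.4: featureless similarity derived from NGD (alpha scales the distance)
    return 1.0 - ngd(tx, ty) / alpha

def first_pass(node, ngd, s_T, alpha=1.0):
    # Recursively partition a node (list of terms) until the two least similar terms
    # in every leaf are more similar than the threshold s_T.
    if len(node) <= 1:
        return [node]                       # isolated term: left for the second-pass
    # the two least similar terms seed the two new sub-nodes
    t_a, t_b = min(combinations(node, 2),
                   key=lambda pair: similarity(pair[0], pair[1], ngd, alpha))
    if similarity(t_a, t_b, ngd, alpha) > s_T:
        return [node]                       # ideal intra-node similarity reached
    sub_a, sub_b = [t_a], [t_b]
    for t in node:
        if t in (t_a, t_b):
            continue
        # Eqs. 7.6 and 7.7: drop the term in the sub-node holding its most similar term
        f_a = max(similarity(t, u, ngd, alpha) for u in sub_a)
        f_b = max(similarity(t, u, ngd, alpha) for u in sub_b)
        (sub_a if f_a >= f_b else sub_b).append(t)
    return first_pass(sub_a, ngd, s_T, alpha) + first_pass(sub_b, ngd, s_T, alpha)

Calling first_pass(terms, ngd, s_T=0.7), for instance, returns the list of leaf clusters, including any singleton leaves containing isolated terms, which are then handed to the second-pass.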


7.4.2 n-degree of Wikipedia: A New Distance Metric

The use of NGD for quantifying the similarity between two objects based on

their names alone can occasionally produce low-quality clusters. We will highlight

some of these discrepancies during our initial experiments in the next section. The

initial tree of clusters generated by the TTA using NGD demonstrated promising

results. Nonetheless, we reckoned that higher-quality clusters could be generated

if we allowed the TTA ants to visit the nodes again for the purpose of refinement.

Instead of using NGD, we present a new way to gauge the similarity between terms.

Google can be regarded as the gateway to the huge volume of documents on

the World Wide Web. The sheer size of Google’s index enables a relatively reliable

estimate of term usage and occurrence using NGD. The page counts provided by

the Google search engine, which are the essence of NGD, are used to compute

the similarity between two terms based on the mutual information that they both

share at the compressed level. As for Wikipedia, its number of articles is only a small fraction of what Google indexes. Nonetheless, the restrictions imposed on the

authoring of Wikipedia’s articles and their organisations provide a possibly new

way of looking at similarity between terms. n-degree of Wikipedia (noW) [272] is

inspired by a game for Wikipedians. 6-degree of Wikipedia^2 is a task set out to

study the characteristics of Wikipedia in terms of the similarity between its articles.

An article in Wikipedia can be regarded as an entry of encyclopaedic information

describing a particular topic. The articles are organised using categorical indices

which eventually lead to the highest level, namely, “Categories”^3. Each article

can appear under more than one category. Hence, the organisation of articles in

Wikipedia appears more as a directed acyclic graph with a root node instead of a

pure tree structure^4. The huge volume of articles in Wikipedia, the organisation

of articles in a graph structure, the open-source nature of the articles, and the

availability of the articles in electronic form make Wikipedia the ideal candidate

for our endeavour.

We define Wikipedia as a directed graph W := (V,E). W is essentially a network

of linked articles where V = {a1, ..., aω} is the set of articles. We limit the vertices to English articles only. At the moment, ω = |V| is reported to be 1,384,729^5, making it the largest encyclopaedia^6 in merely five years since its conception.

2 http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia
3 http://en.wikipedia.org/wiki/Category:Categories
4 http://en.wikipedia.org/wiki/Wikipedia:Categorization#Categories_do_not_form_a_tree
5 http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons
6 http://en.wikipedia.org/wiki/Wikipedia:Largest_encyclopedia

The interconnections between articles are represented as the set of ordered pairs of vertices

E. At the moment, the edges are uniformly assigned with weight 1. Each article can

be considered as an elaboration of a particular event, an entity or an abstract idea.

In this sense, an article in Wikipedia is a manifestation of the information encoded

in the terms. Consequently, we can represent each term ti using the corresponding

article ai ∈ V in Wikipedia. Hence, the problem of finding the distance between two

terms ti and tj can be reduced to the discovery of how closely situated the two corresponding articles ai and aj are in the Wikipedia categorical indices. The problem

of finding the degree of separation between two articles can be addressed in terms of

the single-source shortest path problem. Since the weights are all positive, we have

resorted to Dijkstra’s Algorithm for finding the shortest-path between two vertices

(i.e. articles). Other algorithms for the shortest-path problem are available. How-

ever, a discussion on these algorithms is beyond the scope of this chapter. Formally,

the noW value between terms tx and ty is defined as

noW(tx, ty) = δ(ax, ay) = Σ_{k=1}^{|SP|} c_ek  if ax ≠ ay ∧ ax, ay ∈ V;  0 if ax = ay ∧ ax, ay ∈ V;  ∞ otherwise    (7.8)

where δ(ax, ay) is the degree of separation between the articles ax and ay, which correspond to the terms tx and ty, respectively. The degree of separation is computed

as the sum of the cost of all edges along the shortest path between articles ax and

ay in the graph of Wikipedia articles W . SP is the set of edges along the shortest

path and ek is the kth edge or element in set SP . |SP | is the number of edges

along the shortest path and c_ek is the cost associated with the kth edge. It is also

worth mentioning that while noW(tx, ty) ≥ 0 for ax, ay ∈ V, no upper bound can

be ascertained. The noW value between terms that do not have corresponding

articles in Wikipedia is set to ∞. There is a hypothesis^7 stating that no two articles

in Wikipedia are separated by more than six degrees. However, there are some

Wikipedians who have shown that certain articles can be separated by up to eight

steps^8. This is the reason why we adopted the name n-degree of Wikipedia instead

of 6-degree of Wikipedia.

7 http://tools.wikimedia.de/sixdeg/index.jsp
8 http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia
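Under the assumption that the link structure W has been loaded into a networkx directed graph whose nodes are article titles and whose edges carry a uniform weight of 1 (the mapping from terms to their corresponding articles is left to the caller), the noW value of Equation 7.8 could be computed roughly as follows.

import networkx as nx

def now(graph, article_x, article_y):
    # n-degree of Wikipedia (Eq. 7.8): cost of the shortest path between two articles,
    # 0 for identical articles, and infinity when an article is missing or unreachable.
    if not (graph.has_node(article_x) and graph.has_node(article_y)):
        return float("inf")
    if article_x == article_y:
        return 0.0
    try:
        # with uniform weights of 1, Dijkstra's algorithm degenerates to breadth-first search
        return nx.shortest_path_length(graph, article_x, article_y, weight="weight")
    except nx.NetworkXNoPath:
        return float("inf")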


7.4.3 Second-Pass using n-degree of Wikipedia

Upon completing the first-pass, there are at most n leaf nodes, in the extreme case where each term in the initial set of all terms T ends up in an individual node (i.e. cluster). There are

only two possibilities for such extreme cases. The first is when the ideal intra-node

similarity threshold sT is set too high while the second is when all the terms are

extremely unrelated. In normal cases, most of the terms will be nicely clustered

into nodes with intra-node similarities exceeding sT . Only a small number of terms

is usually isolated into individual nodes. We refer to these terms as isolated terms.

There are two possibilities that lead to isolated terms in normal cases, namely, (1)

the term has been displaced during the first-pass due to discrepancies related to

NGD, or (2) the term is in fact an outlier. The TTA ants leave pheromone trails on

their return trip to the root node (as in line 2 and line 7 of Algorithm 6) to mark the

paths to the leaf nodes. In order to relocate the isolated terms to other more suitable

nodes, the TTA ants return to the leaf nodes by following the pheromone trails. At

each leaf node rl, the probability of picking up a term ti during the second-pass is

1 if the leaf node has only one term (isolated term):

P²pick(ti) = 1 if |rl| = 1 ∧ ti ∈ rl, 0 otherwise    (7.9)

After picking up an isolated term, the TTA ant continues to move from one leaf

node to the next. At each leaf node, the ant determines whether that particular

leaf node (i.e. neighbourhood) rl is the most suitable one to house the isolated

term ti based on the average distance between ti and all other existing terms in rl.

Formally, the density of neighbourhood rl with respect to the isolated term ti during

the second-pass is defined as:

f²TTA(ti, rl) = ( Σ_{j=1}^{|rl|} noW(ti, tj) ) / |rl|    (7.10)

where |rl| is the number of terms in the leaf node rl and the noW value between

the two terms ti and tj is computed using Equation 7.8.

This process of sensing the distance of the isolated term with all other terms in

a leaf node is performed for all leaf nodes. The probability of the ant dropping the

isolated term ti on the most suitable leaf node rv is evaluated once the ant returns

to the original leaf node that used to contain ti. Back at the original leaf node of

ti, the ant recalls the neighbourhood density f²TTA(ti, rl) that it has memorised for


all neighbourhoods (i.e. leaf nodes). The TTA ant drops the isolated term ti on the

leaf node rv if all terms in rv collectively yield the minimum average distance with

ti that satisfies the outlier discrimination threshold δT . Formally,

P²drop(ti, rv) = 1 if f²TTA(ti, rv) = min{ f²TTA(ti, rl) : rl ∈ L } ∧ f²TTA(ti, rv) ≤ δT, 0 otherwise    (7.11)

where L is the set of all leaf nodes. After the ant has visited all the leaf nodes

and has failed to drop the isolated term, the term will be returned to its original

location. The failure to drop the isolated term in a more suitable node indicates

that the term is an outlier.
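The relocation step itself can be summarised in a short sketch (again an illustrative reading of Equations 7.9 to 7.11 rather than the actual implementation), where leaves is the list of term lists produced by the first-pass and now is a callable returning the noW distance between two terms.

def second_pass(leaves, now, delta_T):
    # Relocate isolated terms (singleton leaves) to the leaf whose terms yield the
    # minimum average noW distance, provided that average satisfies delta_T (Eq. 7.11);
    # otherwise the term stays in its own leaf and is treated as an outlier.
    for leaf in [l for l in leaves if len(l) == 1]:      # Eq. 7.9: only isolated terms are picked up
        t_i = leaf[0]
        candidates = [l for l in leaves if l is not leaf and len(l) > 0]
        if not candidates:
            continue
        def density(l):                                  # Eq. 7.10: average noW distance to a leaf
            return sum(now(t_i, t_j) for t_j in l) / len(l)
        best = min(candidates, key=density)
        if density(best) <= delta_T:                     # Eq. 7.11: drop at the most suitable leaf
            best.append(t_i)
            leaf.remove(t_i)
    return [l for l in leaves if l]                      # discard leaves emptied by relocation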

Referring back to the example in Figure 7.1, assume that snapshot 5 represents

the end of the first-pass where the intra-node similarity of all nodes has satisfied

sT . While all other leaf nodes, namely, r011, r012 and r021 consist of multiple terms,

leaf node r022 contains only one term t6. Hence, at the end of the first-pass, all ants,

namely, ant1, ant2, ant3 and ant4 retreat back to the root node r0. Then, during

the second-pass, one TTA ant is deployed to relocate the isolated term t6 from r022

to either leaf node r011, r012 or r021, depending on the average distances of these leaf

nodes with respect to t6. The algorithm for the second-pass using noW is described

in Algorithm 7. Unlike the ant_traverse() function in Algorithm 6, where each new sub-node is processed as a separate iteration of ant_traverse() using an independent

TTA ant, there is only one ant required throughout the second-pass.

7.5 Evaluations and Discussions

In this section, we focus on evaluations at the conceptual layer of ontologies to

verify the taxonomic structures discovered using TTA. We employ three existing

metrics. The first is known as Lexical Overlap (LO) for evaluating the intersection

between the discovered concepts (Cd) and the recommended (i.e. manually created)

concepts (Cm) [164]. The manually created concepts can be regarded as the reference

for our evaluations. LO is defined as:

LO = |Cd ∩ Cm| / |Cm|    (7.12)

Some minor changes were made in terms of how the intersection between the set

of recommended clusters and discovered clusters (i.e. Cd ∩ Cm) is computed. The


Algorithm 7 Function ant_refine(leafnodes, ru) using noW
1:  if |ru| = 1 then
2:      //current leaf node has isolated term ti
3:      pick(ti)  //based on Eq. 7.9
4:      for each rl ∈ leafnodes do
5:          for each term tj in current leaf node rl do
6:              //jump from one leaf node to the next to sense neighbourhood density
7:              δ(ti, tj) := sense_distance(ti, tj)  //based on Eq. 7.8
8:              remember_distance(δ(ti, tj))
9:          f²TTA(ti, rl) := sense_neighbourhood()  //based on Eq. 7.10
10:         remember_neighbourhood(f²TTA(ti, rl))
11:     //back to original leaf node of term ti after visiting all other leaves
12:     ∀l, f²TTA(ti, rl) := recall_neighbourhood()
13:     rv := decide_drop(∀l, f²TTA(ti, rl))  //based on Eq. 7.11
14:     if rv not null then
15:         drop(ti, rv)  //drop at ideal leaf node
16:     else
17:         drop(ti, ru)  //outlier. no ideal leaf node. drop back at original leaf node

normal way of having exact lexical matching of the concept identifiers cannot be

applied to our experiments. Due to the ability of the TTA to discover concepts with varying levels of granularity depending on sT, we have to take into consideration the possibility of sub-clusters that collectively correspond to some recommended clusters. For our evaluations, the presence of discovered sub-clusters that correspond to a recommended cluster is considered as a valid intersection. In other words, given that Cd = {c1, ..., cn} and Cm = {cx} where cx ∉ Cd, then

|Cd ∩ Cm| = 1 if c1 ∪ ... ∪ cn = cx

The second metric is used to account for valid discovered concepts that are absent

from the reference set, while the third metric ensures that concepts which exist in

the reference set but are not discovered are also taken into consideration. The second

metric is referred to as Ontological Improvement (OI) and the third metric is known

as Ontological Loss (OL). They are defined as [214]:

OI = |Cd − Cm| / |Cm|    (7.13)

OL = |Cm − Cd| / |Cm|    (7.14)

Table 1. Summary of the datasets employed for the experiments. Column Cm lists the recommended clusters and Cd the clusters automatically discovered using TTA.
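A small sketch of how the three metrics could be computed over clusters represented as frozensets of terms is given below. The relaxed treatment of the intersection (a recommended cluster counts as recovered when the discovered clusters contained in it collectively cover it) and the corresponding treatment of OI are this example's interpretation of the description above.

def matched(c_m, discovered):
    # relaxed intersection: discovered clusters contained in c_m must collectively cover it
    parts = [c_d for c_d in discovered if c_d <= c_m]
    return bool(parts) and frozenset().union(*parts) == c_m

def lo_oi_ol(discovered, recommended):
    # Lexical Overlap, Ontological Improvement and Ontological Loss (Eqs. 7.12 to 7.14)
    hits = sum(1 for c_m in recommended if matched(c_m, discovered))
    # discovered clusters that play no part in any recommended cluster count as improvements
    new = sum(1 for c_d in discovered if not any(c_d <= c_m for c_m in recommended))
    total = len(recommended)
    return hits / total, new / total, (total - hits) / total

# toy example with clusters as frozensets of terms
disc = [frozenset({"merlot", "shiraz"}), frozenset({"chardonnay"}), frozenset({"google"})]
reco = [frozenset({"merlot", "shiraz"}), frozenset({"chardonnay", "riesling"})]
print(lo_oi_ol(disc, reco))   # LO = 0.5, OI = 0.5, OL = 0.5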

Ontology learning is an incremental process that involves the continuous maintenance of the ontology every time new terms are added. As such, we do not see clustering

large datasets as a problem. In this section, we employ seven datasets to assess the

quality of the discovered clusters using the three metrics described above. The origin

of the datasets and some brief descriptions are provided below:

• Three of the datasets used for our experiments were obtained from the UCI

Machine Learning Repository^9. These sets are labelled WINE 15T, MUSHROOM 16T and DISEASE 20T. The accompanying numerical attributes, which were designed for use with feature-based similarities, were removed.

9 http://www.ics.uci.edu/~mlearn/MLRepository.html

Table 2. Summary of the evaluation results for all ten experiments using the three metrics LO, OI and OL.

• We also employ the original animals dataset (i.e. ANIMAL 16T) proposed for

use with Self-Organising Maps (SOMs) by Ritter & Kohonen [209].

• We constructed the remaining three datasets called ANIMALGOOGLE 16T,

MIX 31T and MIX 60T. ANIMALGOOGLE 16T is similar to the ANIMAL 16T

dataset except for a single replacement with the term “Google”. The other two

MIX datasets consist of a mixture of terms from a large number of domains.

Table 1 summarises the datasets employed for our experiments. Column Cm lists the recommended clusters and Cd the clusters automatically discovered using TTA. Table 2 summarises the evaluation of TTA using the three metrics for all

ten experiments. The high lexical overlap (LO) shows the good domain coverage of

the discovered clusters. The occasionally high ontological improvement (OI) demon-

strates the ability of TTA in highlighting new, interesting concepts that were ignored

during manual creation of the recommended clusters.

During the experiments, snapshots were produced to show the results in two

parts: results after the first-pass using NGD, and results after the second-pass using

noW. The first experiment uses WINE 15T. The original dataset has 178 nameless

instances spread out over 3 clusters. Each instance has 13 attributes for use with

feature-based similarity measures. We augment the dataset by introducing famous names from the wine domain and removing the numerical attributes. We maintained the three clusters, namely, “white”, “red” and “mix”. “Mix” refers to wines that were


Figure 7.2: Experiment using 15 terms from the wine domain. Setting sT = 0.92

results in 5 clusters. Cluster A is simply red wine grapes or red wines, while Cluster E

represents white wine grapes or white wines. Cluster B represents wines named after

famous regions around the world and they can either be red, white or rose. Cluster

C represents white noble grapes for producing great wines. Cluster D represents

red noble grapes. Even though uncommon, Shiraz is occasionally admitted to this

group.

named after famous wine regions around the world. Such wines can either be red or

white. As shown in Figure 7.2, setting sT = 0.92 produces five clusters. Clusters A

and D are actually sub-clusters for the recommended cluster “red”, while Clusters C

and E are sub-clusters for the recommended cluster “white”. Cluster B corresponds

exactly to the recommended cluster “mix”. The second experiment uses MUSH-

ROOM 16T. The original dataset has 8124 nameless instances spread out over two

clusters. Each instance has 22 nominal attributes for use with feature-based simi-

larity measures. We augment the dataset by introducing names of mushrooms that

fit into one of the two recommended clusters, namely, “edible” and “poisonous”. As

shown in Figure 7.3, setting sT = 0.89 produces 4 clusters. Cluster A corresponds

exactly to the recommended cluster “poisonous”. The remaining three clusters are

actually sub-clusters of the recommended cluster “edible”. Cluster B contains edible

mushrooms prominent in East Asia, while Clusters C and D comprise mushrooms

found mostly in North America and Europe, and are prominent in Western cuisines.

Similarly, the third experiment was conducted using DISEASE 20T with the results

shown in Figure 7.4. At sT = 0.86, TTA discovered hidden sub-clusters within

the four recommended clusters, namely, “skin”, “blood”, “cardiovascular” and “di-

gestion”. In relation to this, Handl et al. [99] highlighted a shortcoming in their


Figure 7.3: Experiment using 16 terms from the mushroom domain. Setting sT =

0.89 results in 4 clusters. Cluster A represents poisonous mushrooms. Cluster B

comprises edible mushrooms which are prominent in East Asian cuisine except for

Agaricus Blazei. Nonetheless, this mushroom was included in this cluster probably

due to its high content of beta glucan for potential use in cancer treatment, just

like Shiitake. Moreover, China is the major exporter of Agaricus Blazei, also known

as Himematsutake, further relating this mushroom to East Asia. Cluster C and D

comprise edible mushrooms found mainly in Europe and North America, and are

more prominent in Western cuisines.

evaluation of ant-based clustering algorithms. The authors stated that the algo-

rithm “...only manages to identify these upper-level structures and fails to further

distinguish between groups of data within them.” In other words, unlike existing

ant-based algorithms, the first three experiments demonstrated that our TTA has

the ability to further distinguish hidden structures within clusters.

The fourth and fifth experiments were conducted using ANIMAL 16T dataset.

This dataset has been employed to evaluate both the standard ant-based clustering

(SACA) and the improved version called A2CA by Vizine et al. [263]. The original

dataset consists of 16 named instances, each representing an animal using binary

feature attributes. Both SACA and A2CA discovered two natural clusters, one for “mammal” and the other for “bird”. While SACA was inconsistent in its results, A2CA yielded a 100% recall rate over ten runs. The authors of A2CA stated

that the dataset can also be represented as three recommended clusters. In the

spirit of the evaluation by Vizine et al., we performed the clustering of the 16

animals using TTA over ten runs. In our case, no features were used. Just like in all experiments in this chapter, the 16 animals were clustered based on their names.


Figure 7.4: Experiment using 20 terms from the disease domain. Setting sT = 0.86

results in 7 clusters. Cluster A represents skin diseases. Cluster B represents a

class of blood disorders known as anaemia. Cluster C represents other kinds of

blood disorders. Cluster D represents blood disorders characterised by the relatively

low count of leukocytes (i.e. white blood cells) or platelets. Cluster E represents

digestive diseases. Cluster F represents cardiovascular diseases characterised by

both the inflammation and thrombosis (i.e. clotting) of arteries and veins. Cluster

G represents cardiovascular diseases characterised by the inflammation of veins only.

As shown in the fourth experiment in Figure 7.5, by setting sT = 0.60, the TTA

automatically discovered the two recommended clusters after the second-pass: “bird”

and “mammal”. While ant-based techniques are known for their intrinsic capability

in identifying clusters automatically, conventional clustering techniques (e.g. K-

means, average link agglomerative clustering) rely on the specification of the number

of clusters [99]. The inability to control the desired number of natural clusters can

be troublesome. According to Vizine et al. [263], “in most cases, they generate a

number of clusters that is much larger than the natural number of clusters”. Unlike

both extremes, TTA has the flexibility in regard to the discovery of clusters. The

granularity and number of discovered clusters in TTA can be adjusted by simply

modifying the threshold sT . By setting higher sT , the number of discovered clusters

for ANIMAL 16T has been increased to five as shown in Figure 7.6. A lower value

of the desired ideal intra-node similarity sT results in less branching out and hence,

fewer clusters. Conversely, setting higher sT produces more tightly coupled terms

where the similarities between elements in the leaf nodes are very high. In the


fifth experiment depicted in Figure 7.6, the value sT was raised to 0.72 and more

refined clusters were discovered: “bird”, “mammal hoofed”, “mammal kept as pet”,

“predatory canine” and “predatory feline”.

Figure 7.5: Experiment using 16 terms from the animal domain. Setting sT = 0.60

produces 2 clusters. Cluster A comprises birds and Cluster B represents mammals.

The next three experiments were conducted using the ANIMALGOOGLE 16T

dataset. These three experiments are meant to reveal another advantage of TTA

through the presence of an outlier, namely, the term “Google”. An outlier can be

simply considered as a term that does not fit into any of the clusters. In Figure 7.7,

TTA successfully isolated the term “Google” while discovering clusters at different

levels of granularity based on different values of sT. As similar terms are clustered into the

same node, outliers are eventually singled out as isolated terms in individual leaf

nodes. Consequently, unlike some conventional techniques such as K-means [282],

clustering using TTA is not susceptible to poor results due to outliers. In fact, there

are two ways of looking at the term “Google”, one as an outlier as described above,

or the second as an extremely small cluster with one term. Either way, the term

“Google” demonstrates two abilities of TTA: the capability of identifying and isolating outliers, and tolerance of differing cluster sizes, like its predecessors. Handl et al.

[99] have shown through experiments that certain conventional clustering techniques

such as K-means and one-dimensional self-organising maps perform poorly in the

face of increasing deviations between cluster sizes.

The last two experiments were conducted using MIX 31T and MIX 60T. Fig-

ure 7.8 shows the results after the first-pass and second-pass using 31 terms while


Figure 7.6: Experiment using 16 terms from the animal domain (the same dataset

from the experiment in Figure 7.5). Setting sT = 0.72 results in 5 clusters. Cluster

A represents birds. Cluster B includes hoofed mammals (i.e. ungulates). Cluster C

corresponds to predatory feline while Cluster D represents predatory canine. Cluster

E constitutes animals kept as pets.

Figure 7.9 shows the final results using 60 terms. Similar to the previous experi-

ments, the first-pass resulted in a number of clusters plus some isolated terms. The

second-pass aims to relocate these isolated terms to the most appropriate clusters.

Despite the rise in the number of terms from 31 to 60, all the clusters formed by

the TTA after the second-pass correspond precisely to their occurrences in real-life

(i.e. natural clusters). With the absolute consistency of the results over ten runs,

these two experiments yield 100% recalls just like the previous experiments. Con-

sequently, we can claim that TTA is able to produce consistent results, unlike the

standard ant-based clustering where the solution does not stabilise and fails to converge. For example, in the evaluation by Vizine et al. [263], the standard ant-based clustering was inconsistent in its performance over the ten runs using the AN-

IMAL 16T dataset. This is a very common problem in ant-based clustering when

“they constantly construct and deconstruct clusters during the iterative procedure of

adaptation” [263].

There is also another advantage of TTA that is not found in the standard ants,

namely, the ability to identify taxonomic relations between clusters. Referring to

all ten experiments conducted, we noticed that there is implicit hierarchical information that connects the discovered clusters. For example, referring to the most


Figure 7.7: Experiment using 15 terms from the animal domain plus an additional

term “Google”. Setting sT = 0.58 (left screenshot), sT = 0.60 (middle screenshot)

and sT = 0.72 (right screenshot) result in 2 clusters, 3 clusters and 5 clusters, respec-

tively. In the left screenshot, Cluster A acts as the parent for the two recommended

clusters “bird” and “mammal”, while Cluster B includes the term “Google”. In the

middle screenshot, the recommended clusters “bird” and “mammal” were clearly

reflected through Cluster A and C respectively. By setting sT higher, we dissected

the recommended cluster “mammal” to obtain the discovered sub-clusters C, D and

E as shown in the right screenshot.

recent experiment in Figure 7.8, the two discovered Clusters A (which contains

“Sandra Bullock”, “Jackie Chan”, “Brad Pitt”) and B (which contains “3 Doors

Down”, “Aerosmith”,“Rod Stewart”) after the second-pass share the same parent

node. We can employ the graph of Wikipedia articles W to find the nearest common

ancestor of the two natural clusters and label it with the category name provided by

Wikipedia. In our case, we can label the parent node of the two natural clusters as

“Entertainers”. In fact, the labels of the natural clusters themselves can be named

using the same approach. For example, the terms in the discovered cluster B (which

contains “3 Doors Down”, “Aerosmith”,“Rod Stewart”) fall under the same category

“American musicians” in Wikipedia and hence, we can accordingly label this cluster

using that category name. In other words, clustering using TTA with the help of

NGD and noW not only produces flexible and consistent natural clusters, but

is also able to identify implicit taxonomic relations between clusters. Nonetheless,

we would like to point out that not all hierarchies of natural clusters formed by

the TTA correspond to real-life hierarchical relations. More research is required to

properly validate this capability of the TTA.
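As a rough illustration of the labelling idea mentioned above (an assumption-laden sketch, not a component of the TTA itself), if the Wikipedia category structure were loaded into a networkx DAG whose edges point from a category to its sub-categories and articles, the nearest common ancestor of two cluster members could be looked up directly.

import networkx as nx

def common_label(category_dag, article_a, article_b):
    # the lowest (most specific) category that is an ancestor of both articles,
    # assuming edges run from category to sub-category/article
    return nx.lowest_common_ancestor(category_dag, article_a, article_b)

# toy category DAG for illustration only
g = nx.DiGraph()
g.add_edges_from([("Entertainers", "American musicians"),
                  ("Entertainers", "Film actors"),
                  ("American musicians", "Aerosmith"),
                  ("Film actors", "Brad Pitt")])
print(common_label(g, "Aerosmith", "Brad Pitt"))   # -> "Entertainers"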


Figure 7.8: Experiment using 31 terms from various domains. Setting sT = 0.70

results in 8 clusters. Cluster A represents actors and actresses. Cluster B represents

musicians. Cluster C represents countries. Cluster D represents politics-related

notions. Cluster E is transport. Cluster F includes finance and accounting matters.

Cluster G constitutes technology and services on the Internet. Cluster H represents

food.

One can notice that in all the experiments in this section, the quality of the

clustering output using TTA was less desirable if we were to only rely on the results

from the first-pass. As pointed out earlier, the second-pass is necessary to produce

naturally-occurring clusters. The results after the first-pass usually contain isolated

terms due to discrepancies in NGD. This is mainly due to the appearance of words

and the popularity of word pairs that are not natural. For example, given the words

“Fox”, “Wolf” and “Entertainment”, the first two should go together naturally.

Unfortunately, due to the popularity of the name “Fox Entertainment”, a Google

search using the pair “Fox” and “Wolf” generates a lower page count compared to

“Fox” and “Entertainment”. A lower page count has adverse effects on Equation

7.2, resulting in lower similarity. Using Equation 7.4, “Fox” and “Entertainment”

achieve a similarity of 0.7488 while “Fox” and “Wolf” yield a lower similarity of

0.7364. Despite such shortcomings, search engine page counts and Wikipedia offer

TTA the ability to handle technical terms and common words of any domain re-

gardless of whether they have been around for some time or merely beginning to

evolve into common use on the Web. Due to the mere reliance on names or nouns

for clustering, some readers may question the ability of TTA in handling various

linguistic issues such as synonyms and word senses. Looking back at Figure 7.4, the


Figure 7.9: Experiment using 60 terms from various domains. Setting sT = 0.76

results in 20 clusters. Cluster A and B represent herbs. Cluster C comprises pastry

dishes while Cluster D represents dishes of Italian origin. Cluster E represents

computing hardware. Cluster F is a group of politicians. Cluster G represents cities

or towns in France while Cluster H includes countries and states other than France.

Cluster I constitutes trees of the genus Eucalyptus. Cluster J represents marsupials.

Cluster K represents finance and accounting matters. Cluster L comprises transports

with four or more wheels. Cluster M includes plant organs. Cluster N represents

beverages. Cluster O represents predatory birds. Cluster P comprises birds other

than predatory birds. Cluster Q represents two-wheeled transports. Cluster R and

S represent predatory mammals. Cluster T includes trees of the genus Acacia.

terms “Buerger’s disease” and “Thromboangiitis obliterans” are actually synonyms

referring to the acute inflammation and thrombosis (clotting) of arteries and veins

of the hands and feet. In the context of the experiment in Figure 7.2, the term “Bor-

deaux” was treated as “Bordeaux wine” instead of the “city of Bordeaux”, and was

successfully clustered together with other wines from other famous regions such as

“Burgundy”. In another experiment in Figure 7.9, the same term “Bordeaux” was

automatically disambiguated and treated as a port-city in the Southwest of France

instead. The TTA then automatically clustered this term together with other cities

in France such as “Chamonix” and “Paris”. In short, TTA has the inherent capa-

bility of coping with synonyms, word senses and the fluctuation in term usage.

The quality of the clustering results is very much dependent on the choice of sT


and to a lesser extent, δT . Nonetheless, as an effective rule-of-thumb, sT should be

set as high as possible. Higher sT will result in more leaf nodes with each having

possibly a smaller number of terms that are tightly coupled together. High sT

will also enable the isolation of potential outliers. The isolated terms and outliers

generated by a high sT can then be further refined in the second-pass. The ideal

range of sT derived through our experiments is within 0.60−0.90. Setting sT too low

will result in very coarse clusters like the ones shown in Figure 7.5 where potential

sub-clusters are left undiscovered. Regarding the value of δT , it is usually set inversely

proportional to sT . As shown during our evaluations, the higher we set sT , the more

we decrease the value of δT . The reason behind the choices of these two threshold

values can be explained as follows: as we lower sT , TTA produces coarser clusters

with loosely coupled terms. The intra-node distance of such clusters is inevitably

higher compared to the finer clusters because the terms in these coarse clusters are

more likely to be less similar. In order for the second-pass to function appropriately

during the relocation of isolated terms and the isolation of outliers, δT has to be set

comparatively higher. Besides, a lower sT will not provide adequate discriminative

ability for the TTA to distinguish or pick out the outliers. Another interesting point

about sT is that setting it to the maximum (i.e. 1.0) results in a divisive

clustering effect. In divisive clustering, the process starts with one, all-inclusive

cluster and at each step, splits the cluster until only singleton clusters of individual

term remain [242].

7.6 Conclusion and Future Work

In this chapter, we introduced a decentralised multi-agent system for term clus-

tering in ontology learning. Unlike document clustering or other forms of clustering

in pattern recognition, clustering terms in ontology learning requires a different ap-

proach. The most evident adjustment required in term clustering is the measure of

similarity and distance. Existing term clustering techniques in many ontology learn-

ing systems remain confined within the realm of conventional clustering algorithms

and feature-based similarity measures. Since there is no explicit feature attached to

terms, these existing techniques have come to rely on contextual cues surrounding

the terms. These clustering techniques require extremely large collections of domain

documents to reliably extract contextual cues for the computation of similarity ma-

trices. In addition, the static background knowledge required for term clustering

such as WordNet, patterns and templates makes such techniques even more difficult

to scale across domains.


Consequently, we introduced the use of featureless similarity and distance mea-

sures called Normalised Google Distance (NGD) and n-degree of Wikipedia (noW)

for term clustering. The use of these two measures as part of a new multi-pass clus-

tering algorithm called Tree-Traversing Ant (TTA) demonstrated excellent results

during our evaluations. Standard ant-based techniques exhibit certain characteris-

tics that have been shown to be useful and superior compared to conventional clus-

tering techniques. The TTA is the result of an attempt to inherit these strengths

while avoiding some inherent drawbacks. In the process, certain advantages from the

conventional divisive clustering were incorporated, resulting in the appearance of a

hybrid between ant-based and conventional algorithms. Seven of the most notable

strengths of the TTA with NGD and noW are (1) the ability to further distinguish hidden structures within clusters, (2) flexibility in regard to the discovery of clusters, (3) the capability of identifying and isolating outliers, (4) tolerance of differing cluster sizes, (5) the ability to produce consistent results, (6) the ability to identify implicit taxonomic relations between clusters, and (7) an inherent capability of coping with synonyms, word senses and fluctuations in term usage.

Nonetheless, much work is still required in certain aspects. One of the main

items of future work we have planned is to ascertain the validity of, and make good use of, the

implicit hierarchical relations discovered using TTA. The next issue that interests

us is the automatic labelling of the natural clusters and the nodes in the hierarchy

using Wikipedia. Labelling has always been a hard problem in clustering, especially

document and term clustering. We are also keen on conducting more studies on the

interaction between the two thresholds in TTA, namely, sT and δT . If possible, we

intend to find ways to enable the automatic adjustment of these threshold values to

maximise the quality of clustering output.

7.7 Acknowledgement

This research was supported by the Australian Endeavour International Post-

graduate Research Scholarship, and a Research Grant 2006 from the University of

Western Australia. The authors would like to thank the anonymous reviewers for

their invaluable comments.

7.8 Other Publications on this Topic

Wong, W., Liu, W. & Bennamoun, M. (2006) Featureless Similarities for Terms

Clustering using Tree-Traversing Ants. In the Proceedings of the International Sym-

posium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia.


This paper reports the preliminary work on clustering terms using featureless simi-

larity measures. The resulting clustering technique called TTA was later refined to

contribute towards the core contents of Chapter 7.

Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. M.

Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technolo-

gies, IGI Global.

The research on TTA reported in Chapter 7 was generalised in this book chapter

to work with both terms and Internet domain names.


CHAPTER 8

Relation Acquisition

Abstract

Common techniques for acquiring semantic relations rely on static domain and

linguistic resources, predefined patterns, and the presence of syntactic cues. This

chapter proposes a hybrid technique which brings together established and novel

techniques in lexical simplification, word disambiguation and association inference

for acquiring coarse-grained relations between potentially ambiguous and composite

terms using only dynamic Web data. Our experiments using terms from two different

domains demonstrate promising preliminary results.

8.1 Introduction

Relation acquisition, also known as relation extraction or relation discovery, is an

important aspect of ontology learning. Traditionally, semantic relations are either

extracted as verbs based on grammatical structures [217], induced through term

co-occurrence using large text corpora [220], or discovered in the form of unnamed

associations through cluster analysis [212]. Challenges faced by conventional tech-

niques include (1) the reliance on static patterns and text corpora together with rare

domain knowledge, (2) the need for named entities to guide relation acquisition, (3)

the difficulty of classifying composite or ambiguous names into the required cate-

gories, and (4) the dependence on grammatical structures and the presence of verbs, which can result in indirect, implicit relations being overlooked. In recent years, there has been a growing trend in relation acquisition using Web data such as Wikipedia [245]

and online ontologies (e.g. Swoogle) [213] to partially address the shortcomings of

conventional techniques.

In this chapter, we propose a hybrid technique which integrates lexical simplifi-

cation, word disambiguation and association inference for acquiring semantic rela-

tions using only Web data (i.e. Wikipedia and Web search engines) for constructing

lightweight domain ontologies. The proposed technique performs an iterative process

of term mapping and term resolution to identify coarse-grained relations between

domain terms. The main contribution of this chapter is the resolution phase which

This chapter appeared in the Proceedings of the 13th Pacific-Asia Conference on Knowledge

Discovery and Data Mining (PAKDD), Bangkok, Thailand, 2009, with the title “Acquiring Se-

mantic Relations using the Web for Constructing Lightweight Ontologies”.


allows our relation acquisition technique to handle complex and ambiguous terms,

and terms not covered by our background knowledge on the Web. The proposed

technique can be used to complement conventional techniques for acquiring fine-

grained relations and to automatically extend online structured data such as Wikipedia. The rest of the chapter is structured as follows. Sections 8.2 and 8.3 present existing

work related to relation acquisition, and the details of our technique, respectively.

The outcome of the initial experiment is summarised in Section 8.4. We conclude

this chapter in Section 8.5.

8.2 Related Work

Techniques for relation acquisition can be classified as symbolic-based, statistics-

based or a hybrid of both. The use of linguistic patterns enables the discovery of

fine-grained semantic relations. For instance, Poesio & Almuhareb [201] developed

specific lexico-syntactic patterns to discover named relations such as part-of and

causation. However, linguistic-based techniques using static rules tend to face dif-

ficulties in coping with the structural diversity of a language. The technique by

Sanchez & Moreno [217] for extracting verbs as potential named relations is re-

stricted to handling verbs in simple tense and verb phrases which do not contain

modifiers such as adverbs. In order to identify indirect relations, statistics-based

techniques such as co-occurrence analysis and cluster analysis are necessary. Co-

occurrence analysis employs the redundancy in large text corpora to detect the

presence of statistically significant associations between terms. However, the tex-

tual resources required by such techniques are difficult to obtain, and remain static

over a period of time. For example, Schutz & Buitelaar [220] manually constructed

a corpus for the football domain containing only 1,219 documents from an online

football site for relation acquisition. Cluster analysis [212], on the other hand,

requires tremendous computational effort in preparing features from texts for sim-

ilarity measurement. The lack of emphasis on indirect relations is also evident in

existing techniques. Many relation acquisition techniques in information extraction

acquire semantic relations with the guidance of named entities [229]. Relation ac-

quisition techniques which require named entities have restricted applicability since

many domain terms with important relations cannot be easily categorised. In addi-

tion, the common practice of extracting triples using only patterns and grammatical

structures tends to disregard relations between syntactically unrelated terms.

In view of the shortcomings of conventional techniques, there is a growing trend

in relation acquisition which favours the exploration of rich, heterogeneous Web data


over the use of static, rare background knowledge. SCARLET [213], which stemmed

from a work in ontology matching, follows this paradigm by harvesting online on-

tologies on the Semantic Web to discover relations between concepts. Sumida et

al. [245] developed a technique for extracting a large set of hyponymy relations in

Japanese using the hierarchical structures of Wikipedia. There is also a group of

researchers who employ Web documents as input for relation acquisition [115]. Sim-

ilar to the conventional techniques, this line of work still relies on the ubiquitous WordNet and other domain lexicons for determining the proper level of abstraction

and labelling of relations between the terms extracted from Web documents. Pei et

al. [196] employed predefined local (i.e. WordNet) and online ontologies to name

the unlabelled associations between concepts in Wikipedia. The labels are acquired

through a mapping process which attempts to find lexical matches for Wikipedia

concepts in the predefined ontologies. The obvious shortcomings include the in-

ability to handle complex and new terms which do not have lexical matches in the

predefined ontologies.

8.3 A Hybrid Technique for Relation Acquisition

Figure 8.1: An overview of the proposed relation acquisition technique. The main

phases are term mapping and term resolution, represented by black rectangles. The

three steps involved in resolution are simplification, disambiguation and inference.

The techniques represented by the white rounded rectangles were developed by

the authors, while existing techniques and resources are shown using grey rounded

rectangles.

The proposed relation acquisition technique is composed of two phases, namely,


Algorithm 8 termMap(t,WT ,M, root, iteration)

1: rslt := map(t)

2: if iteration equals to 1 then

3: if rslt equals to undef then

4: if t is multi-word then return composite

5: else return non-existent

6: else if rslt equals to Nt = (Vt, Et) ∧ Pt ≠ φ then

7: return ambiguous

8: else if rslt equals to Nt = (Vt, Et) ∧ Pt = φ then

9: add neighbourhood Nt to the subgraph WT and iteration← iteration + 1

10: for each u ∈ Vt where (t, u) ∈ Ht ∪ At do

11: termMap(u,WT ,M, root, iteration)

12: M ←M ∪ t

13: return mapped

14: else if iteration more than 1 then

15: if rslt equals to Nt = (Vt, Et) ∧ Pt = φ then

16: add neighbourhood Nt to the subgraph WT and iteration← iteration + 1

17: for each u ∈ Vt where (t, u) ∈ Ht do

18: if u not equal to root then termMap(u,WT ,M, root, iteration)

19: else return // all paths from the origin t will arrive at the root

term mapping and term resolution. The input is a set of domain terms T produced

using a separate term recognition technique. The inclusion of a resolution phase sets

our technique apart from existing techniques which employ Web data for relation

acquisition. This resolution phase allows our technique to handle complex and

ambiguous terms, and terms which are not covered by the background knowledge

on the Web. Figure 8.1 provides an overview of the proposed technique.

In this technique, Wikipedia is seen as a directed acyclic graph W where vertices

V are topics covered by Wikipedia, and edges E are three types of coarse-grained

relations between the topics, namely, hierarchical H, associative A, and polysemous

P , or E = H ∪ A ∪ P . It is worth noting that H, A and P are disjoint sets. These

coarse-grained links are obtained from Wikipedia’s classification scheme, “See Also”

section, and disambiguation pages, respectively. The term mapping phase creates

a subgraph of W for each set T , denoted as WT by recursively querying W for

relations that belong to the terms t ∈ T . The querying aspect is defined as the

function map(t), which finds an equivalent topic u ∈ V for term t, and returns the


Algorithm 9 findNCA(M ,WT )

1: initialise commonAnc = φ, ancestors = φ, continue = true

2: for each m ∈M do

3: Nm := map(m)

4: ancestor := {v : v ∈ Vm ∧ (m, v) ∈ Hm ∪ Am}

5: ancestors← ancestors ∪ ancestor

6: while continue equals to true do

7: for each a ∈ ancestors do

8: initialise pthCnt = 0, sumD = 0

9: for each m ∈M do

10: dist := shortestDirectedPath(m,a,WT )

11: if dist not infinite then

12: pthCnt← pthCnt + 1 and sumD ← sumD + dist

13: if pthCnt equals to |M | then

14: commonAnc← commonAnc ∪ (a, sumD)

15: if commonAnc not equals to φ then

16: continue = false

17: else

18: initialise newAncestors = φ

19: for each a ∈ ancestors do

20: Na := map(a)

21: ancestor := {v : v ∈ Va ∧ (a, v) ∈ Ha ∪ Aa}

22: newAncestors← newAncestors ∪ ancestor

23: ancestors← newAncestors

24: return nca where (nca, dist) ∈ commonAnc and dist is the minimum distance

closed neighbourhood Nt:

map(t) = { Nt = (Vt, Et)   if (∃u ∈ V, u ≡ t)
         { undef           otherwise                                  (8.1)

The neighbourhood for term t is denoted as (Vt, Et) where Et = {(t, y) : (t, y) ∈ Ht ∪ At ∪ Pt ∧ y ∈ Vt} and Vt is the set of vertices in the neighbourhood. The sets

Ht, At and Pt contain hierarchical, associative and polysemous links which connect

term t to its adjacent terms y ∈ Vt. The process of term mapping is summarised in

Algorithm 8. The term mapper in Algorithm 8 is invoked once for every t ∈ T . The

term mapper ceases the recursion upon encountering the base case, which consists


of the root vertices of Wikipedia (e.g. “Main topic classifications”). An input

term t ∈ T which is traced to the root vertex is considered as successfully mapped,

and is moved from set T to set M . Figure 8.2(a) shows the subgraph WT created

for the input set T = {‘baking powder’, ‘whole wheat flour’}. In reality, many terms

cannot be straightforwardly mapped because they do not have lexically equivalent

topics in W due to (1) the non-exhaustive coverage of Wikipedia, (2) the tendency

to modify terms for domain-specific uses, and (3) the polysemous nature of certain

terms. The term mapper in Algorithm 8 returns different values, namely, composite,

non-existent and ambiguous to indicate the causes of mapping failures. The term

resolution phase resolves mapping failures through the iterative process of lexical

simplification, word disambiguation and association inference. Upon the completion

of mapping and resolution of all input terms, any direct or indirect relations between

the mapped terms t ∈ M can be identified by finding paths which connect them in

the subgraph WT .
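
As an illustration of the data structures involved, the following minimal sketch (written in Perl, the implementation language used in Chapter 9, but not the thesis code itself) shows one way to represent the typed links H, A and P of the Wikipedia graph W and the map(t) lookup of Equation 8.1. The topic names and links below are illustrative placeholders rather than actual Wikipedia content.

    #!/usr/bin/perl
    # Minimal sketch: a toy in-memory stand-in for the Wikipedia graph W, with
    # outgoing links grouped by type, and a lookup that returns the closed
    # neighbourhood N_t or undef as in Equation 8.1.
    use strict;
    use warnings;

    # H = hierarchical (category) links, A = associative ("See Also") links,
    # P = polysemous (disambiguation) links. All topics below are illustrative.
    my %W = (
        'baking powder'     => { H => ['leavening agents'], A => [],              P => [] },
        'whole wheat flour' => { H => ['flour'],            A => ['whole grain'], P => [] },
        'leavening agents'  => { H => ['food ingredients'], A => [],              P => [] },
        'flour'             => { H => ['food ingredients'], A => [],              P => [] },
        'food ingredients'  => { H => ['Main topic classifications'], A => [],    P => [] },
        'pepper'            => { H => [], A => [], P => [ 'black pepper', 'pepper (band)' ] },
    );

    # map_term($t): return the closed neighbourhood of $t, or undef when no topic
    # lexically matches $t (the composite/non-existent cases in Algorithm 8).
    sub map_term {
        my ($t) = @_;
        return undef unless exists $W{$t};
        return { term => $t, H => $W{$t}{H}, A => $W{$t}{A}, P => $W{$t}{P} };
    }

    # A term whose neighbourhood has polysemous links is flagged as ambiguous.
    my $n = map_term('pepper');
    print "'pepper' is ambiguous\n" if $n && @{ $n->{P} };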

Finally, we devise a 2-step technique to transform the subgraph WT into a

lightweight domain ontology. Firstly, we identify the nearest common ancestor

(NCA) for the mapped terms. Our simple algorithm for finding NCA is presented

in Algorithm 9. A discussion of more complex algorithms [21, 22] for finding the NCA is beyond the scope of this chapter. Secondly, we identify all directed paths

in WT which connect the mapped terms to the new root NCA and use those paths

to form the final lightweight domain ontology. The lightweight domain ontology for

the subgraph WT in Figure 8.2(a) is shown in Figure 8.2(b). We discuss the details

of the three parts of term resolution in the following three subsections.
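
Before the three resolution steps are described, the following minimal Perl sketch illustrates the spirit of the NCA step in Algorithm 9: a common ancestor reachable from every mapped term is selected, preferring the one with the smallest summed directed distance. The toy subgraph below uses only hierarchical links and assumed topic names; it is a simplified stand-in for the actual subgraph WT, not the thesis implementation.

    #!/usr/bin/perl
    # Minimal sketch, in the spirit of Algorithm 9: find a common ancestor that is
    # reachable from every mapped term and has the smallest summed directed
    # distance. The toy subgraph (child => parents) is assumed data.
    use strict;
    use warnings;

    my %up = (
        'baking powder'     => ['leavening agents'],
        'whole wheat flour' => ['flour', 'whole grain'],
        'leavening agents'  => ['food ingredients'],
        'flour'             => ['food ingredients'],
        'whole grain'       => ['cereals'],
        'cereals'           => ['food ingredients'],
        'food ingredients'  => ['Main topic classifications'],
    );

    # Breadth-first search: shortest directed distance from $src to each ancestor.
    sub distances {
        my ($src) = @_;
        my %d = ( $src => 0 );
        my @queue = ($src);
        while (@queue) {
            my $v = shift @queue;
            for my $p (@{ $up{$v} || [] }) {
                next if exists $d{$p};
                $d{$p} = $d{$v} + 1;
                push @queue, $p;
            }
        }
        return \%d;
    }

    my @M    = ('baking powder', 'whole wheat flour');   # mapped terms
    my @dist = map { distances($_) } @M;

    # Keep only ancestors reachable from every mapped term; sum their distances.
    my %sum;
    ANCESTOR: for my $anc (keys %{ $dist[0] }) {
        my $total = 0;
        for my $d (@dist) {
            next ANCESTOR unless exists $d->{$anc};
            $total += $d->{$anc};
        }
        $sum{$anc} = $total;
    }

    # The nearest common ancestor is the common ancestor with the minimum sum.
    my ($nca) = sort { $sum{$a} <=> $sum{$b} } keys %sum;
    print "NCA: $nca\n";    # prints "food ingredients" for the data above

For the two mapped terms above, the sketch selects "food ingredients", in line with Figure 8.2(b); Algorithm 9 achieves the same effect by expanding the ancestor frontier level by level instead of searching globally.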

8.3.1 Lexical Simplification

The term mapper in Algorithm 8 returns the composite value to indicate the

inability to map a composite term (i.e. multi-word term). Composite terms which

have many modifiers tend to face difficulty during term mapping due to the absence

of lexically equivalent topics in W . To address this, we designed a lexical simplifi-

cation step to reduce the lexical complexity of composite terms in a bid to increase

their chances of re-mapping. A composite term comprises a head noun altered by some pre-modifiers (e.g. adjectives and nouns) or post-modifiers (e.g. prepositional

phrases). These modifiers are important in clarifying or limiting the extent of the

semantics of the terms in a particular context. For instance, the modifier “one cup”

as in “one cup whole wheat flour” is crucial for specifying the amount of “whole

wheat flour” required for a particular pastry. However, the semantic diversity of


(a) The dotted arrows represent additional hierarchical

links from each vertex. The only associative link is be-

tween “whole wheat flour” and “whole grain”.

(b) “Food ingredients” is the NCA.

Figure 8.2: Figure 8.2(a) shows the subgraph WT constructed for T = {‘baking powder’, ‘whole wheat flour’} using Algorithm 8, which is later pruned to produce a

lightweight ontology in Figure 8.2(b).

terms created by certain modifiers is often unnecessary in a larger context. Our

lexical simplifier makes use of this fact to reduce the complexity of a composite term

for re-mapping.

Figure 8.3: The computation of mutual information for all pairs of contiguous con-

stituents of the composite terms “one cup whole wheat flour” and “salt to taste”.

The lexical simplification step breaks down a composite term into two struc-


turally coherent parts, namely, an optional constituent and a mandatory constituent.

A mandatory constituent contains at least the head noun of the composite term, and has to be in common use in the language independently of the

optional constituent. The lexical simplifier then finds the least dependent pair as the

ideally decomposed constituents. The dependencies are measured by estimating the

mutual information of all contiguous constituents of a term. A term with n words has n − 1 possible pairs denoted as <x1, y1>, ..., <xn−1, yn−1>. The mutual information for each pair <x, y> of term t is computed as MI(x, y) = f(t)/(f(x)f(y)) where f is a frequency measure. In a previous work [278], we utilise the page count returned by Web search engines to compute the relative frequency required for mutual information. Given that Z = {t, x, y}, the relative frequency for each z ∈ Z is computed as f(z) = (nz/nZ) e^(−nz/nZ), where nz is the page count returned by Web search engines, and nZ = Σu∈Z nu. Figure 8.3 shows an example of finding the least depen-

dent constituents of two complex terms. Upon identifying the two least dependent

constituents, we re-map the mandatory portion. To retain the possibly significant

semantics delivered by the modifiers, we also attempt to re-map the optional con-

stituents. If the decomposed constituents are in turn not mapped, another iteration

of term resolution is performed. Unrelated constituents will be discarded. For this

purpose, we define the distance of a constituent with respect to the set of mapped

terms M as:

δ(x, y, M) = ( Σm∈M noW(x, y, m) ) / |M|                               (8.2)

where noW(a, b) is a measure of geodesic distance between topics a and b based on Wikipedia, developed by Wong et al. [276] and known as n-degree of Wikipedia (noW). A constituent is discarded if δ(x, y, M) > τ and the current set of mapped terms is not empty, |M| ≠ 0. The threshold τ = δ(M) + σ(M), where δ(M) and σ(M) are the average and the standard deviation of the intra-group distances of M.
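
A minimal Perl sketch of this splitting step is given below. The page counts are invented stand-ins for Web search engine hits, and the split with the lowest mutual information is taken as the least dependent pair; the real system obtains the counts from a search engine API and also applies the noW-based discarding threshold described above, which is omitted here.

    #!/usr/bin/perl
    # Minimal sketch of the lexical simplification split. The page counts n_z are
    # invented stand-ins for Web search engine hits.
    use strict;
    use warnings;

    my @words = qw(one cup whole wheat flour);    # the composite term to simplify

    # Hypothetical page counts for the full term and its candidate constituents.
    my %hits = (
        'one cup whole wheat flour' => 1_200,
        'one'                   => 9_000_000,
        'cup whole wheat flour' => 3_000,
        'one cup'               => 2_500_000,
        'whole wheat flour'     => 800_000,
        'one cup whole'         => 4_000,
        'wheat flour'           => 1_500_000,
        'one cup whole wheat'   => 3_500,
        'flour'                 => 6_000_000,
    );

    # Relative frequency f(z) = (n_z / n_Z) * exp(-n_z / n_Z), with Z = {t, x, y}.
    sub rel_freq {
        my ($nz, $nZ) = @_;
        my $r = $nz / $nZ;
        return $r * exp(-$r);
    }

    my $t = join ' ', @words;
    my ($best_mi, $best_x, $best_y);
    for my $i (1 .. $#words) {                      # the n - 1 contiguous splits
        my $x = join ' ', @words[ 0 .. $i - 1 ];    # candidate optional constituent
        my $y = join ' ', @words[ $i .. $#words ];  # candidate mandatory constituent
        my $nZ = $hits{$t} + $hits{$x} + $hits{$y};
        my $mi = rel_freq($hits{$t}, $nZ)
               / ( rel_freq($hits{$x}, $nZ) * rel_freq($hits{$y}, $nZ) );
        # The least dependent pair is the split with the lowest mutual information.
        ($best_mi, $best_x, $best_y) = ($mi, $x, $y)
            if !defined $best_mi || $mi < $best_mi;
    }
    print "split: [$best_x] + [$best_y]\n";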

8.3.2 Word Disambiguation

The term mapping phase in Algorithm 8 returns the ambiguous value if a term

t has a non-empty set of polysemous links Pt in its neighbourhood. In such cases,

the terms are considered as ambiguous and cannot be directly mapped. To address

this, we include a word disambiguation step which automatically resolves ambiguous

terms using noW [276]. Since all input terms in T belong to the same domain of

interest, the word disambiguator finds the proper senses to replace the ambiguous

terms by virtue of the senses’ relatedness to the already mapped terms. Senses

which are highly related to the mapped terms have lower noW value. For example,


Figure 8.4: A graph showing the distribution of noW distance and the stepwise

difference for the sequence of word senses for the term “pepper”. The set of

mapped terms is M = {“fettuccine”, “fusilli”, “tortellini”, “vinegar”, “garlic”, “red onion”, “coriander”, “maple syrup”, “whole wheat flour”, “egg white”, “baking powder”, “buttermilk”}. The line “stepwise difference” shows the ∆i−1,i values. The

line “average stepwise difference” is the constant value µ∆. Note that the first sense

s1 is located at x = 0.

the term “pepper” is considered as ambiguous since its neighbourhood contains a

non-empty set Ppepper with numerous polysemous links pointing to various senses

in the food, music and sports domains. If the term “pepper” is provided as input

together with terms such as “vinegar” and “garlic”, we can eliminate all semantic

categories except food. Each ambiguous term t has a set of senses St = {s : s ∈ Vt ∧ (t, s) ∈ Pt}. Equation 8.2, denoted here as δ(s, M), is used to measure the distance between a sense s ∈ St and the set of mapped terms M.

The senses are then sorted into a list (s1, ..., sn) in ascending order according to

their distance from the mapped terms. The smaller the subscript, the smaller the distance, and therefore, the closer the sense is to the domain under consideration. An interesting

observation is that many senses for an ambiguous term are in fact minor variations

belonging to the same semantic category (i.e. paradigm). Referring back to our

example term “pepper”, within the food domain alone, multiple possible senses exist

(e.g. “sichuan pepper”, “bell pepper”, “black pepper”). While these senses have their

intrinsic differences, they are paradigmatically substitutable for one another. Using

this property, we devise a senses selection mechanism to identify suitable paradigms

covering highly related senses as substitutes for the ambiguous terms. The mech-

anism computes the difference in noW value as ∆i−1,i = δ(si,M) − δ(si−1,M) for


2 ≤ i ≤ n between every two consecutive senses. We currently employ the aver-

age stepwise difference of the sequence as the cutoff point. The average stepwise

difference for a list of n senses is µ∆ = (Σi=2..n ∆i−1,i) / (n − 1). Finally, the first k senses in

the sequence with ∆i−1,i < µ∆ are accepted as belonging to a single paradigm for

replacing the ambiguous term. Using this mechanism, we have reduced the scope

of the term “pepper” to only the food domain out of the many senses across do-

mains such as music (e.g. “pepper (band)”) and beverage (e.g. “dr pepper”). In

our example in Figure 8.4, the ambiguous term “pepper” is replaced by {“black pepper”, “allspice”, “melegueta pepper”, “cubeb”}. These k = 4 word senses are selected

as replacements since the stepwise difference at point i = 5, ∆4,5 = 0.5 exceeds

µ∆ = 0.2417.
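
The following minimal Perl sketch reproduces the cutoff mechanism on assumed δ(s, M) values for the senses of “pepper”; the distances are illustrative numbers rather than actual noW measurements, but they yield the same kind of selection as the example above.

    #!/usr/bin/perl
    # Minimal sketch of the sense-selection cutoff. The delta(s, M) values below
    # are assumed numbers standing in for Equation 8.2 computed with noW.
    use strict;
    use warnings;

    my %dist = (
        'black pepper'     => 0.30,
        'allspice'         => 0.35,
        'melegueta pepper' => 0.40,
        'cubeb'            => 0.45,
        'bell pepper'      => 0.95,
        'pepper (band)'    => 1.40,
    );

    # Sort senses by their distance to the mapped terms (ascending).
    my @senses = sort { $dist{$a} <=> $dist{$b} } keys %dist;

    # Stepwise differences between consecutive senses and their average.
    my @step = map { $dist{ $senses[$_] } - $dist{ $senses[$_ - 1] } } 1 .. $#senses;
    my $avg  = 0;
    $avg += $_ for @step;
    $avg /= scalar @step;

    # Accept s_1 and every following sense whose step stays below the average.
    my @accepted = ( $senses[0] );
    for my $i (0 .. $#step) {
        last if $step[$i] >= $avg;
        push @accepted, $senses[ $i + 1 ];
    }
    print "replace 'pepper' with: ", join(', ', @accepted), "\n";

With these assumed distances the average step is 0.22, the jump to “bell pepper” exceeds it, and the first four senses are kept, mirroring the k = 4 selection in Figure 8.4.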

8.3.3 Association Inference

Terms that are labelled as non-existent by Algorithm 8 simply do not have any

lexical matches on Wikipedia. We propose to use cluster analysis to infer potential

associations for such non-existent terms. We employ our term clustering algorithm

with featureless similarity measures known as Tree-Traversing Ant (TTA) [276].

TTA is a hybrid algorithm inspired by ant-based methods and hierarchical cluster-

ing which utilises two featureless similarity measures, namely, Normalised Google

Distance (NGD) [50] and noW . Unlike conventional clustering algorithms which in-

volve feature extraction and selection, terms are automatically clustered using TTA

based on their usage prevalence and co-occurrence on the Web.

In this step, we perform term clustering on the non-existent terms together

with the already mapped terms in M to infer hidden associations. The associa-

tion inference step is based on the premise that terms grouped into similar clusters

are bound by some common dominant properties. By inference, any non-existent

terms which appear in the same clusters as the mapped terms should have simi-

lar properties. The TTA returns a set of term clusters C = {C1, ..., Cn} upon the

completion of term clustering for each set of input terms. Each Ci ∈ C is a set

of related terms as determined by TTA. Figure 8.5 shows the results of clustering

the non-existent term “conchiglioni” with 14 mapped terms. The output is a set

of three clusters {C1′, C2, C3}. Next, we acquire the parent topics of all mapped

terms located in the same cluster as the non-existent term by calling the mapping

function in Equation 8.1. We refer to such a cluster as the target cluster. These par-

ent topics, represented as the set R, constitute the potential topics which may be

associated with the non-existent term. In our example in Figure 8.5, the target


Figure 8.5: The result of clustering the non-existent term “conchiglioni” and

the mapped terms M = {“fettuccine”, “fusilli”, “tortellini”, “vinegar”, “garlic”, “red onion”, “coriander”, “maple syrup”, “whole wheat flour”, “egg white”, “baking powder”, “buttermilk”, “carbonara”, “pancetta”} using TTA.

cluster is C1′, and the elements of set R are {“pasta”, “pasta”, “pasta”, “italian cuisine”, “sauces”, “cuts of pork”, “dried meat”, “italian cuisine”, “pork”, “salumi”}. We

devise a prevailing parent selection mechanism to identify the most suitable parent

in R to which we attach the non-existent term. The prevailing parent is determined

by assigning a weight to each parent r ∈ R, and ranking the parents according

to their weights. Given the non-existent term t and a set of parents R, the pre-

vailing parent weight ρr, where 0 ≤ ρr < 1, for each unique r ∈ R is defined as ρr = common(r) · sim(r, t) · subsume(r, t) · δr, where sim(a, b) is given by 1 − NGD(a, b)θ, and NGD(a, b) is the Normalised Google Distance [50] between a and b. θ is a constant within the range (0, 1] for adjusting the NGD distance. The function common(r) = (Σq∈R,q=r 1)/|R| determines the relative frequency of occurrence of r in the set R. δr = 1

if subsume(r, t) > subsume(t, r) and δr = 0 otherwise. The subsumption measure

subsume(x, y) [77] is the probability of x given y computed as n(x, y)/n(y), where

n(x, y) and n(y) are page counts obtained from Web search engines. This measure

is used to quantify the extent of term x being more general than term y. The higher

the subsumption value, the more general term x is with respect to y. Upon ranking

the unique parents in R based on their weights, we select the prevailing parent r

as the one with the largest ρ. A link is then created for the non-existent term t to

hierarchically relate it to r.
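
A minimal Perl sketch of the prevailing-parent ranking is shown below. The parent list, page counts, joint counts and NGD values are invented for illustration (and only a subset of the parents from the example above is used); θ is treated here as a simple multiplicative adjustment of NGD, which is one possible reading of the definition above.

    #!/usr/bin/perl
    # Minimal sketch of the prevailing-parent ranking for a non-existent term.
    # All counts and NGD values are invented stand-ins for Web statistics.
    use strict;
    use warnings;

    my $t = 'conchiglioni';                                            # non-existent term
    my @R = ('pasta', 'pasta', 'pasta', 'italian cuisine', 'sauces');  # parents in R

    my %n     = ( 'conchiglioni' => 50_000,       'pasta'  => 9_000_000,
                  'italian cuisine' => 2_000_000, 'sauces' => 5_000_000 );
    my %joint = ( 'pasta' => 45_000, 'italian cuisine' => 20_000, 'sauces' => 8_000 );
    my %ngd   = ( 'pasta' => 0.20,   'italian cuisine' => 0.45,   'sauces' => 0.60 );
    my $theta = 1.0;          # NGD adjustment constant in (0, 1]

    # subsume(x, y) = n(x, y) / n(y): how general x is with respect to y.
    sub subsume { my ($nxy, $ny) = @_; return $nxy / $ny; }

    my %rho;
    for my $r (keys %ngd) {                       # each unique parent
        my $common = scalar( grep { $_ eq $r } @R ) / scalar(@R);
        my $sim    = 1 - $ngd{$r} * $theta;       # one reading of 1 - NGD(a, b) theta
        my $sub_rt = subsume( $joint{$r}, $n{$t} );   # subsume(r, t)
        my $sub_tr = subsume( $joint{$r}, $n{$r} );   # subsume(t, r)
        my $delta  = $sub_rt > $sub_tr ? 1 : 0;
        $rho{$r}   = $common * $sim * $sub_rt * $delta;
    }

    # The prevailing parent is the one with the largest weight.
    my ($parent) = sort { $rho{$b} <=> $rho{$a} } keys %rho;
    printf "attach '%s' under '%s' (rho = %.3f)\n", $t, $parent, $rho{$parent};

With the invented numbers above, “pasta” dominates the ranking, which is the parent one would expect from the target cluster in Figure 8.5.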


8.4 Initial Experiments and Discussions

Figure 8.6: The results of relation acquisition using the proposed technique for the

genetics and the food domains. The labels “correctly xxx” and “incorrectly xxx”

represent the true positives (TP) and false positives (FP). Precision is computed as

TP/(TP + FP ).

We experimented with the proposed technique shown in Figure 8.1 using two

manually-constructed datasets, namely, a set of 11 terms in the genetics domain, and

a set of 31 terms in the food domain. The system performed the initial mappings

of the input terms at level 0. This resulted in 6 successfully mapped terms and 5

unmapped composite terms in the genetics domain. As for the terms in the food

domain, 14 were mapped, 16 were composite and 1 was non-existent. At level 1, the 5

composite terms in the genetics domain were decomposed into 10 constituents where

8 were remapped and 2 required further level 2 resolution. For the food domain, the

16 composite terms were decomposed into 32 constituents in level 1 where 10, 5 and

3 were still composite, non-existent and discarded, respectively. Together with the

successfully clustered non-existent term and the 14 remapped constituents, there

were a total of 15 remapped terms at level 1. Figure 8.6 summarises the experiment

results.

Overall, the system achieved 100% precision for term mapping, lexical simplification and word disambiguation at all levels using the small set of 11 terms


(a) The lightweight domain ontology for genetics constructed using 11 terms.

(b) The lightweight domain ontology for food constructed using 31 terms.

Figure 8.7: The lightweight domain ontologies generated using the two sets of input

terms. The important vertices (i.e. NCAs, input terms, vertices with degree more

than 3) have darker shades. The concepts genetics and food in the center of the

graph are the NCAs. All input terms are located along the side of the graph.


in the genetics domain as shown in Figure 8.6. As for the set of food-related terms,

there was one false positive (i.e. incorrectly mapped) involving the composite term

“100g baby spinach”, which resulted in an 80% precision in level 2. In level 1, this

composite term was decomposed into the appropriate constituents “100g” and “baby

spinach”. In level 2, the term “baby spinach” was further decomposed and its

constituent “spinach” was successfully remapped. The constituent “baby” in this

case refers to the adjectival sense of “comparatively little”. However, the modifier

“baby” was inappropriately remapped and attached to the concept of “infant”. The

lack of information on polysemes and synonyms for basic English words is the main

cause of this problem. In this regard, we are planning to incorporate dynamic

linguistic resources such as Wiktionary to complement the encyclopaedic nature

of Wikipedia. Other established, static resources such as WordNet can also be

used as a source of basic English vocabulary. Moreover, the incorporation of such

complementary resources can assist in retaining and capturing additional semantics

of complex terms by improving the mapping of constituents such as “dried” and

“sliced”. General words which act as modifiers in composite terms often do not have

corresponding topics in Wikipedia, and are usually unable to satisfy the relatedness

requirement outlined in Section 8.3.1. Such constituents are currently ignored as

shown through the high number of discarded constituents in level 2 in Figure 8.6.

Moreover, the clustering of terms to discover new associations is only performed at

level 1, and non-existent terms at level 2 and beyond are currently discarded.

Upon obtaining the subgraphs WT for the two input sets, the system finds the

corresponding nearest common ancestors. The NCAs for the genetics-related and

the food-related terms are genetics and food, respectively. Using these NCAs, our

system constructed the corresponding lightweight domain ontologies as shown in

Figure 8.7. A detailed account of this experiment is available to the public1.

8.5 Conclusion and Future Work

Acquiring semantic relations is an important part of ontology learning. Many

existing techniques face difficulty in extending to different domains, disregard im-

plicit and indirect relations, and are unable to handle relations between compos-

ite, ambiguous and non-existent terms. We presented a hybrid technique which

combines lexical simplification, word disambiguation and association inference for

acquiring semantic relations between potentially composite and ambiguous terms

using only dynamic Web data (i.e. Wikipedia and Web search engines). During

1 http://explorer.csse.uwa.edu.au/research/ sandbox evaluation.pl


our initial experiment, the technique demonstrated the ability to handle terms from

different domains, to accurately acquire relations between composite and ambiguous

terms, and to infer relations between terms which do not exist in Wikipedia. The

lightweight ontologies discovered using this technique are a valuable resource to com-

plement other techniques for constructing full-fledged ontologies. Our future work

includes the diversification of domain and linguistic knowledge by incorporating on-

line dictionaries to support general words not available on Wikipedia. Evaluation

using larger datasets, and a study of the effect of clustering words beyond level 1, are also required.

8.6 Acknowledgement

This research is supported by the Australian Endeavour International Post-

graduate Research Scholarship, the DEST (Australia-China) Grant, and the Inter-

university Grant from the Department of Chemical Engineering, Curtin University

of Technology.


CHAPTER 9

Implementation

“Thinking about information overload isn’t accurately

describing the problem; thinking about filter failure is.”

- Clay Shirky, the Web 2.0 Expo (2008)

The focus of this chapter is to illustrate the advantages of the seamless and auto-

matic construction of term clouds and lightweight ontologies from text documents.

An application is developed to assist in the skimming and scanning of large amounts

of news articles across different domains, including technology, medicine and eco-

nomics. First, the implementation details of the proposed techniques described in

the previous six chapters are provided. Second, three representative use cases are

described to demonstrate our powerful interface for assisting document skimming

and scanning. The details on how term clouds and lightweight ontologies can be

used for this purpose are provided.

9.1 System Implementation

This section presents the implementation details of the core techniques discussed

in the previous six chapters (i.e. Chapter 3-8). Overall, the proposed ontology

learning system is implemented as a Web application hosted on http://explorer.csse.uwa.edu.au/research/. The techniques are developed entirely using the Perl

programming language. The availability and reusability of a wide range of external

modules on the Comprehensive Perl Archive Network (CPAN) for text mining, nat-

ural language processing, statistical analysis, and other Web services make Perl the

ideal development platform for this research. Moreover, the richer and more con-

sistent regular expression syntax in Perl provides a powerful tool for manipulating

and processing texts. The Prefuse visualisation toolkit1, which uses the Java programming language, and the Perl module SVGGraph2 for creating Scalable Vector Graphics

(SVG) graphs are also used for visualisation purposes. The CGI::Ajax3 module is

employed to enable the system’s online interfaces to asynchronously access back-

end Perl modules. The use of Ajax improves interactivity, bandwidth usage and

load time by allowing back-end modules to be invoked and data to be returned

without interfering with the interface behaviour. Overall, the Web application comprises

1 http://prefuse.org/
2 http://search.cpan.org/dist/SVGGraph-0.07/
3 http://search.cpan.org/dist/CGI-Ajax-0.707/


a suite of modules with about 90,000 lines of properly documented Perl code. The implementation details of the system’s modules are as follows:

(a) The screenshot of a webpage containing a short abstract of a journal article,

hosted at http://www.ncbi.nlm.nih.gov/pubmed/7602115. The relevant content,

which is the abstract, extracted by HERCULES is shown in Figure 9.1(b).

(b) The input section of this interface algorithm hercules.pl shows the HTML

source code of the webpage in Figure 9.1(a). The relevant content extracted from

the HTML source by the HERCULES module is shown in the results section. The

process log, not included in this figure, is also available in the results section.

Figure 9.1: The online interface for the HERCULES module.

• The relevant content extraction technique, described in Chapter 6, is imple-

mented as the HERCULES module that can be accessed and tested via the online

interface algorithm hercules.pl. The HERCULES module uses only regular


Figure 9.2: The input section of the interface algorithm issac.pl shows the error

sentence “Susan’s imabbirity to Undeerstant the msg got her INTu trubble.”. The

correction provided by ISSAC is shown in the results section of the interface. The

process log is also provided through this interface. Only a small portion of the

process log is shown in this figure.

expressions to implement the set of heuristic rules described in Chapter 6 for

removing HTML tags and other non-content elements. Figure 9.1(a) shows

an example webpage that has both relevant content, and boilerplates such as

navigation and complex search features in the header section and other related

information in the right panel. The relevant content extracted by HERCULES is

shown in Figure 9.1(b).

• The integrated technique for cleaning noisy text, described in Chapter 3, is im-

plemented as the ISSAC module accessible via the interface algorithm issac.pl.

The implementation of ISSAC uses the following Perl modules, namely, Text::

WagnerFischer4 for computing the Wagner-Fischer edit distance [266], WWW::

Search::AcronymFinder5 for accessing the online dictionary www.acronymfin

der.com, and Text::Aspell6 for interfacing with the GNU spell checker As-

pell. In addition, the Yahoo APIs for spelling suggestion7 and Web search8 are

used to obtain replacement candidates, and to obtain page counts for deriving

the general significance score. Figure 9.2 shows the correction by ISSAC for

4 http://search.cpan.org/dist/Text-WagnerFischer-0.04/
5 http://search.cpan.org/dist/WWW-Search-AcronymFinder-0.01/
6 http://search.cpan.org/dist/Text-Aspell/
7 http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
8 http://developer.yahoo.com/search/web/V1/webSearch.html


Figure 9.3: The online interface algorithm unithood.pl for the module unithood.

The interface shows the collocational stability of different phrases determined using

unithood. The various weights involved in determining the extent of stability are

also provided in these figures.

the noisy sentence “Susan’s imabbirity to Undeerstant the msg got her INTu

trubble.”.

• The two measures OU and UH described in Chapter 4 are implemented as a

single module called unithood that can be accessed online via the interface

algorithm unithood.pl. The unithood module also uses the Yahoo API for

Web search to access page counts for estimating the collocational strength

of noun phrases. Figure 9.3 shows the results of checking the collocational

strength of the phrases “Centers for Disease Control and Prevention” and

“Drug Enforcement Administration and Federal Bureau of Investigation”. As

mentioned in Chapter 4, phrases containing both prepositions and conjunc-

tions can be relatively difficult to deal with. The unithood module using the

UH measure automatically decides that the second phrase “Drug Enforcement

Administration and Federal Bureau of Investigation” does not form a stable

noun phrase as shown in Figure 9.3. The decision is correct considering that

the unstable phrase can refer to two separate entities in the real world.

• The technique for constructing text corpora described in Chapter 6 is im-

plemented under the SPARTAN module. As part of SPARTAN, three submod-

ules PROSE, STEP and SLOP are implemented to filter websites, expand seed


(a) The data virtualcorpus.pl interface for querying pre-constructed virtual cor-

pora by SPARTAN.

(b) The data localcorpus.pl interface for querying pre-constructed local corpora

by SPARTAN, and some other types of local corpora.

Figure 9.4: The online interfaces for querying the virtual and local corpora created

using the SPARTAN module.

terms and localise webpage contents, respectively. No online corpus con-

struction interface was provided for users due to the extensive storage space

required for downloading and constructing text corpora. Instead, an inter-

face to query pre-constructed virtual and local corpora is made available

via data virtualcorpus.pl and data localcorpus.pl, respectively. The

SPARTAN module uses both the Yahoo APIs for Web search and site search9

throughout the corpus construction process. The Perl modules WWW::Wikipedia10

and LWP::UserAgent11 are used to access Wikipedia during seed term expan-

9 http://developer.yahoo.com/search/siteexplorer/siteexplorer.html
10 http://search.cpan.org/dist/WWW-Wikipedia-1.95/
11 http://search.cpan.org/dist/libwww-perl-5.826/lib/LWP/UserAgent.pm


(a) The online interface algorithm termhood.pl accepts short text snippets as

input and produces term clouds using the termhood module.

(b) The online interface data termcloud.pl for browsing pre-constructed term

clouds using the termhood module. Each term cloud is a summary of important

concepts of the corresponding news article.

(c) The online interface data corpus.pl summarises the text corpora available in

the system for use by the termhood module.

Figure 9.5: Online interfaces related to the termhood module.


(a) The algorithm nwd.pl interface for finding the semantic similarity between

terms using the NWD module.

(b) The algorithm now.pl interface for finding the semantic distance between

terms using the noW module.

(c) The algorithm tta.pl interface for clustering terms using the TTA module with

the support of featureless similarity metrics by the NWD and noW modules.

Figure 9.6: Online interfaces related to the ARCHILES module.


Figure 9.7: The interface data lightweightontology.pl for browsing pre-

constructed lightweight ontologies for online news articles using the ARCHILES mod-

ule.

sion and to access webpages using HTTP style communication. Figure 9.4(a)

shows the interface data virtualcorpus.pl for querying the virtual corpora

constructed using SPARTAN. Some statistics related to the virtual corpora, such

as document frequency and word frequency, are provided in this interface. A

simple implementation based on document frequency is also used in this inter-

face to decide if the search term is relevant to the domain represented by the

corpus or otherwise. For instance, Figure 9.4(a) shows the results of querying

the virtual corpus in the medicine domain using the word “tumor necrosis

factor”. There are 322,065 documents that contain the word “tumor necrosis

factor” out of the total 84 million in the domain corpus. There are, however,

only 5 documents in the contrastive corpus that have this word. Based on these

frequencies, the interface decides that “tumor necrosis factor” is relevant to the medicine domain (a minimal sketch of such a document-frequency test is given after this list of modules). Figure 9.4(b) shows the interface data localcorpus.pl

for querying the localised versions of the virtual corpora, and other types of

local corpora.

• The two measures TH and OT described in Chapter 5 for recognising domain-

relevant terms are implemented as the termhood module. An interface is

created at algorithm termhood.pl to allow users to access the termhood

module online. Figure 9.5(a) shows the result of term recognition for the


input sentence “Melanoma is one of the rarer types of skin cancer. Around

160,000 new cases of melanoma are diagnosed each year.”. The termhood

module presents the output as term clouds containing domain-relevant terms of

different sizes. Larger terms assume a more significant role in representing the

content of the input text. The results section of the interface in Figure 9.5(a)

also provides information on the composition of the text corpora used and the

process log of text processing and term recognition. The termhood module

has the option of using either the text corpora constructed through guided

crawling of online news sites, the corpora (both local and virtual) built using

the SPARTAN module, publicly-available collections (e.g. Reuters-21578, texts

from the Gutenberg project, GENIA), or any combination thereof. Figure

9.5(c) shows the interface data corpus.pl that summarises information about

the available text corpora for use by the termhood module. A list of pre-

constructed term clouds from online news articles is available for browsing at

data termcloud.pl as shown in Figure 9.5(b).

• The relation acquisition technique, described in Chapter 8, is implemented un-

der the ARCHILES module. Due to some implementation challenges, an online

interface cannot be provided for users to directly access the ARCHILES module.

Nevertheless, a list of pre-constructed lightweight ontologies for online news ar-

ticles is available for browsing using the interface data lightweightontology.pl

as shown in Figure 9.7. ARCHILES employs two featureless similarity measures

NWD and noW, and the TTA clustering technique described in Chapter 7

for disambiguating terms and discovering relations between unknown terms.

The NWD and noW modules can be accessed online via algorithm nwd.pl and

algorithm now.pl. The NWD module relies on the Yahoo API for Web search

to access page counts for similarity estimation. The noW module uses the

external Graph and WWW::Wikipedia Perl modules to simulate Wikipedia’s

categorical system, to compute shortest paths for deriving distance values,

and to resolve ambiguous terms. The clustering technique is implemented as

the TTA module with an online interface at algorithm tta.pl. TTA uses both

the noW and NWD modules to determine similarity for clustering terms.
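
The document-frequency relevance test mentioned above for the data virtualcorpus.pl interface can be sketched as follows. The contrastive corpus size and the decision margin are assumed values used purely for illustration; the domain-corpus figures follow the “tumor necrosis factor” example.

    #!/usr/bin/perl
    # Minimal sketch of a document-frequency relevance test: a term is judged
    # relevant to a domain when its relative document frequency in the domain
    # corpus clearly exceeds that in the contrastive corpus.
    use strict;
    use warnings;

    my %domain      = ( docs_with_term => 322_065, total_docs => 84_000_000 );
    my %contrastive = ( docs_with_term => 5,       total_docs => 50_000_000 );  # assumed size

    my $df_domain      = $domain{docs_with_term}      / $domain{total_docs};
    my $df_contrastive = $contrastive{docs_with_term} / $contrastive{total_docs};

    # Compare the two relative document frequencies against an (assumed) margin,
    # guarding against a zero contrastive frequency.
    my $ratio = $df_domain / ( $df_contrastive || 1e-12 );
    print $ratio > 10 ? "relevant to the domain\n" : "not clearly domain-relevant\n";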

9.2 Ontology-based Document Skimming and Scanning

The growth of textual information on the Web is a double-edged sword. On the

one hand, we are blessed with unsurpassed freedom and accessibility to endless in-

formation. We all know that information is power, and so we thought the more the


Figure 9.8: The screenshot of the aggregated news services provided by Google (the

left portion of the figure) and Yahoo (the right portion of the figure) on 11 June

2009.

better. On the other hand, such explosion of information on the Web (i.e. informa-

tion explosion) can be a curse. While information has been growing exponentially

since the conception of the Web, our cognitive abilities have not caught up. We

have short attention span on the Web [133], and we are slow at reading off the

screen [91]. For this reason, users are finding it increasingly difficult to handle the

excessive amount of information being provided on a daily basis, an effect known as

information overload. An interesting study at King’s College London showed that

information overload is actually doing more harm to our concentration than mari-

juana [270]. It has become apparent that “when it comes to information, sometimes


Figure 9.9: A splash screen on the online interface for document skimming and

scanning at http://explorer.csse.uwa.edu.au/research/.

Figure 9.10: The cross-domain term cloud summarising the main concepts occurring

in all the 395 articles listed in the news browser. This cloud currently contains terms

in the technology, medicine and economics domains.

less is more...” [179]. There are two key issues to be considered when attempting

to address the problem of information overload. Firstly, it is becoming increasingly

challenging for retrieval systems to locate relevant information amidst a growing

Web, and secondly, users are finding it more difficult to interpret a growing amount

of relevant information. While many studies have been conducted to improve the

performance of retrieval systems, there is virtually no work on the issue of informa-

tion interpretability. This lack of attention to information interpretability becomes


Figure 9.11: The single-domain term cloud for the domain of medicine. This cloud

summarises all the main concepts occurring in the 75 articles listed below in the

news browser. Users can arrive at this single-domain cloud from the cross-domain

cloud in Figure 9.10 by clicking on the [domain(s)] option in the latter.

obvious as we look at the way Google and Yahoo present search results, news arti-

cles and other documents to the users. At most, these systems rank the webpages

for relevance and generate short snippets with keyword bolding to assist users in

locating what they need. Studies [121, 11] have shown that these summaries often

have poor readability and are inadequate in conveying the gist of the documents. In

other words, the users would still have to painstakingly read through the documents

in order to find the information they need.

We take the two aggregated news services by Google and Yahoo shown in Figure

9.8 as examples to demonstrate the current lack of regard for information inter-

pretability. The left portion of the figure shows the Google News interface, while

the right portion shows the Yahoo News interface. Both interfaces are focused on the

health news category. The interfaces in Figure 9.8 merely show half of all the news

listed on 11 June 2009. The actual listings are considerably longer. A quick look

at both interfaces would immediately reveal the time and cognitive effort that users

have to invest in order to arrive at a summary of the texts or to find a particular piece

of information. Over time, users have adopted the techniques of skimming and scanning

to keep up with such a constant flow of textual documents online [218, 186, 268].


Figure 9.12: The single-domain term cloud for the medicine domain. Users can view

a list of articles describing a particular topic by clicking on the corresponding term

in the single-domain cloud.

Users employ skimming to quickly identify the main ideas conveyed by a document,

usually to decide if the text is interesting and whether one should read it in more

detail. Scanning, on the other hand, is used to obtain specific information from a

document (e.g. a particular page where a certain idea occurred).

This section provides details on the use of term clouds and lightweight ontologies

to aid document skimming and scanning for improving information interpretability.

More specifically, term clouds and lightweight ontologies are employed to assist

users in quickly identifying the overall ideas or specific information in individual

documents or groups of documents. In particular, the following three cases are

examined:

(1) Can the users quickly guess (in 3 seconds or so) from the listing alone what the main topics of interest across all articles for that day are?

(2) Is there a better way to present the gist of individual news articles to the

users other than the conventional, ineffective use of short text snippets as

summaries?

(3) Are there other options besides the typical [find] feature for users to quickly

pinpoint a particular concept in an article or a group of articles?


(a) Abstraction of the news “Tai Chi may ease arthritis pain”.

(b) Abstraction of the news “Omega-3-fatty acids may slow macular disease”.

Figure 9.13: The use of document term cloud and information from lightweight

ontology to summarise individual news articles. Based on the term size in the

clouds, one can arrive at the conclusion that the news featured in Figure 9.13(b)

carries more domain-relevant (i.e. medical related) content than the news in Figure

9.13(a).


Figure 9.14: The document term cloud for the news “Tai Chi may ease arthritis

pain”. Users can focus on a particular concept in the annotated news by clicking on

the corresponding term in the document cloud.

For this purpose, an online interface for document skimming and scanning is in-

corporated into the Web application’s homepage12. While news articles may be

the focus of the current document skimming and scanning system, other text doc-

uments including product reviews, medical reports, emails and search results can

equally benefit from such an automatic abstraction system. Figure 9.9 shows the splash

screen of the interface for document skimming and scanning. This splash screen ex-

plains the need for better means to assist document skimming and scanning while

data (i.e. term clouds and lightweight ontologies) is loading. Figure 9.10 shows

the main interface for skimming and scanning a list of news articles across different

domains. The white canvas on the top right corner containing words of different

colours and sizes is the cross-domain term cloud. This term cloud summarises the

key concepts in all news articles across all domains listed in the news browser panel

below. For instance, Figure 9.10 shows that there are 395 articles across three do-

mains (i.e. technology, medicine and economics) listed in the news browser with a

total of 727 terms in the term cloud. The solutions to the above three use cases

using our ontology-based document skimming and scanning system are as follows:

• Figure 9.11 shows the single-domain term cloud for summarising the key con-

12 http://explorer.csse.uwa.edu.au/research/


cepts in the medicine domain. This term cloud is obtained by simply selecting

the medicine option in the [domain(s)] field. There are 75 articles in the

news browser with a total of 136 terms in the cloud. Looking at this single-

domain term cloud, one would immediately be able to conclude that some of

the news articles are concerned with “diabetes”, “drug”, “gene”, “hormone”,

“heart disease”, “H1N1 swine flu” and so on. One can also say that “dia-

betes” was discussed more intensely in these articles than other topics such as

“diarrhea”. The users are able to grasp the gist of large groups of articles in

a matter of seconds without any complex cognitive effort. Can the same be

accomplished through the typical news listing and text snippets as summaries

shown in Figure 9.8? The use of the cross-domain or single-domain term clouds

for summarising the main topics across multiple documents addresses the first

problem.

• If the users are interested in drilling down on a particular topic, they can do so

by simply clicking on the terms in the cloud. A list of news articles describing

the selected topic is provided in the news browser panel as shown in Figure

9.12. The context in which the selected topic exists is also provided. For

instance, Figure 9.12 shows that the “diabetes” topic is mentioned in the context

of “hypertension” in the news “Psoriasis linked to...”. Clicking on the [back]

option brings the users back to the complete listing of articles in the medicine

domain as in Figure 9.11. The users can also preview the gist of a news article

by simply clicking on the title in the news browser panel. Figures 9.13(a)

and 9.13(b) show the document term clouds for the news “Tai Chi may ease

arthritis pain” and “Omega-3-fatty acids may slow macular disease”. These

document term clouds summarise the content of the news articles and present the key terms in a visually appealing manner to enhance the interpretability and retention of information; a minimal sketch of how such a cloud can be rendered from term scores is given after this list. The interfaces in Figures 9.13(a) and 9.13(b) also

provide information derived from the corresponding lightweight ontologies.

For instance, the root concept in the ontology is shown in the [this news

is about] field. In the news “Tai Chi may ease arthritis pain”, the root

concept is “self-care”. The parent concepts of the key terms in the ontology

are presented as part of the field [the main concepts are]. In addition,

based on the term size in the clouds, one can arrive at the conclusion that

the news featured in Figure 9.13(b) carries more domain-relevant (i.e. medically related) content than the news in Figure 9.13(a). Can the users arrive at such

comprehensive and abstract information regarding a document with minimal


time and cognitive effort using the conventional news listing interfaces shown

in Figure 9.8? The use of document term clouds and lightweight ontologies for

presenting the gist of individual news articles addresses the second problem.

• The use of the following features [click for articles], [find term], [context

terms] and [click to focus] helps users to locate a particular concept at

different levels of granularity. At the document collection level, users can lo-

cate articles containing a particular term using the [click for articles],

the [find term] or the [context terms] features. The [click for articles]

feature allows users to view a list of articles (using the news browser) related

to a particular topic in the cross-domain or the single-domain term cloud. The

[find term] feature can be used anytime to refine and reduce the size of the

cross-domain or the single-domain term cloud. Context terms are provided

together with the listing of articles in the news browser when users select

the [click for articles] feature. Clicking on any term under the column

[context terms], as shown in Figure 9.12, will list all articles containing the

selected term. At the individual document level, news articles are annotated

with the key terms that occur in the document clouds to assist scanning

activities. Users can employ the [click to focus] feature to pinpoint the

occurrence of a particular concept in an article by clicking on the correspond-

ing term in the document cloud. Figure 9.14 shows how a user clicked on

“chronic tension headache” in the document term cloud, which triggered the

auto-scrolling and highlighting of that term in the annotated news. Can the

users pinpoint a particular topic that occurred in a large document collection

or a single lengthy document with minimal time and cognitive effort using the

conventional interfaces shown in Figure 9.8? The various features provided by

this system allow users to quickly pinpoint a particular concept, either in an

article or a group of articles, to address the last problem.
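The term clouds above are, in essence, lists of terms paired with scores produced by the term recognition stage, rendered so that font size grows with the score. The following Python sketch illustrates one way such a rendering could be produced; it is an illustration only, and the linear scaling scheme and the render_cloud helper are assumptions rather than the exact implementation behind the Web application.

    # Minimal sketch: render a term cloud as HTML, scaling font size with term score.
    # The linear scaling between MIN_PT and MAX_PT is an assumed scheme for illustration.
    MIN_PT, MAX_PT = 10, 36

    def render_cloud(term_scores):
        """term_scores: dict mapping each term to its domain-relevance score."""
        if not term_scores:
            return ""
        lo, hi = min(term_scores.values()), max(term_scores.values())
        span = (hi - lo) or 1.0
        pieces = []
        for term, score in sorted(term_scores.items()):
            size = MIN_PT + (MAX_PT - MIN_PT) * (score - lo) / span
            pieces.append('<span style="font-size:%.0fpt">%s</span>' % (size, term))
        return " ".join(pieces)

    # e.g. render_cloud({"diabetes": 0.92, "diarrhea": 0.31, "heart disease": 0.74})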

9.3 Chapter Summary

This chapter provided the implementation details of the proposed ontology learn-

ing system as a Web application. The programming language, external tools and development environment were described. Online interfaces to several mod-

ules of the Web application were made publicly available. The benefits of using

automatically-generated term clouds and lightweight ontologies for document skim-

ming and scanning were highlighted using three use cases. It was qualitatively


demonstrated that conventional news listing interfaces, unlike ontology-based doc-

ument skimming and scanning, are unable to support the following three common

scenarios: (1) to grasp the gist of large groups of articles in a matter of seconds

without any complex cognitive effort, (2) to arrive at a comprehensive and abstract

overview of a document with minimal time and cognitive effort, and (3) to pinpoint

a particular topic that occurred in a large document collection or a single lengthy

document with minimal time and cognitive effort.

In the next chapter, the research work presented throughout this dissertation is

summarised. Plans for system improvement are outlined, and an outlook on future research directions in the area of ontology learning is provided.


CHAPTER 10

Conclusions and Future Work

“We can only see a short distance ahead,

but we can see plenty there that needs to be done.”

- Alan Turing, Computing Machinery and Intelligence (1950)

Term clouds and lightweight ontologies are the key to bootstrapping the Seman-

tic Web, creating better search engines, and providing effective document manage-

ment for individuals and organisations. A major problem faced by current ontology

learning systems is the reliance on rare, static background knowledge (e.g. Word-

Net, British National Corpus). This problem is described in detail in Chapter 1 and

subsequently confirmed by the literature review in Chapter 2. Overall, this research

demonstrates that the use of dynamic Web data as the sole background knowledge

is a viable, long-term alternative for cross-domain ontology learning from text. This

finding verifies the thesis statement in Chapter 1.

In particular, four major research questions identified as part of the thesis state-

ment are addressed in Chapters 3 to 8 with a common theme of taking advantage

of the diversity and redundancy of Web data. These four problems are (1) the

absence of integrated techniques for cleaning noisy data, (2) the inability of cur-

rent term extraction techniques, which are heavily influenced by word frequency, to

systematically explicate, diversify and consolidate their evidence, (3) the inability

of current corpus construction techniques to automatically create very large, high-

quality text corpora using a small number of seed terms, and (4) the difficulty of

locating and preparing features for clustering and extracting relations. As a proof of

concept, Chapter 9 of this thesis demonstrated the benefits of using automatically-

constructed term clouds and lightweight ontologies for skimming and scanning large

numbers of real-world documents. More precisely, term clouds and lightweight on-

tologies are employed to assist users in quickly identifying the overall ideas or specific

information in individual news articles or groups of news articles across different do-

mains, including technology, medicine and economics. Chapter 9 also discussed the

implementation details of the proposed ontology learning system.

10.1 Summary of Contributions

The major contributions to the field of ontology learning that arose from this

thesis (described in Chapters 3 to 8) are summarised as follows.

Chapter 3 addressed the first problem by proposing and implementing ISSAC, one of the first integrated techniques for cleaning noisy text. ISSAC si-


multaneously corrects spelling errors, expands abbreviations and restores proper casing. It was found that in order to cope with language change (e.g. the appearance of new words) and the blurring of boundaries between different types of noise, the use of multiple

dynamic Web data sources (in the form of statistical evidence from search engine

page count and online abbreviation dictionaries) was necessary. Evaluations using

noisy chat records from industry demonstrated high accuracy of correction.
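As a rough illustration of how search engine page counts can drive the choice among competing corrections, the Python sketch below ranks candidate replacements for a noisy token by the hit count of each candidate embedded in its local context. It is not the ISSAC scoring function itself; page_count is a hypothetical hit-count lookup standing in for a real search engine query, and the quoted-phrase query form is an assumption.

    # Illustrative only: rank candidate corrections (spelling fixes, abbreviation
    # expansions, re-cased forms) by Web page counts of the candidate in context.
    def page_count(query):
        # Hypothetical hit-count lookup; plug in a real search engine API here.
        raise NotImplementedError

    def rank_candidates(left_context, candidates, right_context):
        scores = {}
        for cand in candidates:
            query = '"%s %s %s"' % (left_context, cand, right_context)
            scores[cand] = page_count(query)
        return sorted(candidates, key=scores.get, reverse=True)

    # e.g. rank_candidates("see you", ["tomorrow", "2moro", "Tomorrow"], "at noon")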

To address the second problem, Chapter 4 first outlined two measures for deter-

mining word collocation strength during noun phrase extraction. These measures are

UH, an adaptation of existing measures, and OU, a probabilistic measure. UH and

OU rely on page count from search engines to derive statistical evidence required for

measuring word collocation. It was found that the noun phrases extracted based on

the probabilistic measure OU achieved better precision than the heuristic

measure UH. The stable noun phrases extracted using these measures constitute the

input to the next stage of term recognition. Second, in Chapter 5, a novel proba-

bilistic framework for recognising domain-relevant terms from stable noun phrases

was developed. The framework allows different evidence to be added or removed de-

pending on the implementation constraints and the desired term recognition output.

The framework currently incorporates seven types of evidence, which are formalised

using word distribution models into a new probabilistic measure called OT. The

adaptability of this framework is demonstrated through the close correlation be-

tween OT and its heuristic counterpart TH. It was concluded that OT offers the

best term recognition solution (compared to three existing heuristic measures) that

is both accurate and balanced in terms of recall and precision.
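The UH and OU formulations themselves are given in Chapter 4 and are not restated here. Purely to illustrate the underlying idea of substituting search engine page counts for corpus frequencies when measuring collocation strength, the Python sketch below computes a pointwise-mutual-information-style score from hit counts; the assumed index size n and the page counts are inputs the caller must supply, and this is not the UH or OU measure.

    import math

    # Illustrative PMI-style collocation score from search engine page counts;
    # not the UH or OU measure defined in Chapter 4.
    def pmi_from_page_counts(count_x, count_y, count_xy, n=1e10):
        """count_x, count_y: hits for each word; count_xy: hits for the quoted
        two-word phrase; n: assumed number of indexed pages."""
        if min(count_x, count_y, count_xy) <= 0:
            return float("-inf")
        return math.log((count_xy / n) / ((count_x / n) * (count_y / n)), 2)

    # e.g. pmi_from_page_counts(4.2e7, 1.3e8, 2.9e6) for "heart", "disease", "heart disease"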

Chapter 6 solved the third problem by introducing the SPARTAN technique

for corpus construction to alleviate the dependence on manually-crafted corpora

during term recognition. SPARTAN uses a probabilistic filter with statistical infor-

mation gathered from search engine page count to analyse the domain representa-

tiveness of websites for constructing both virtual and local corpora. It was found

that adequately large corpora with high coverage and a sufficiently specific vocabulary are necessary for high-performance term recognition. An extensive evaluation demonstrated

that term recognition using SPARTAN-based corpora achieved the best precision

and recall in comparison to all other corpora based on existing corpus construction

techniques.
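The actual probabilistic filter is specified in Chapter 6. Purely as an illustration of how page counts can be used to gauge how well a website represents a domain, the Python sketch below scores a site by the average fraction of its indexed pages that mention each of a small set of domain seed terms; page_count, the site: query restriction and the averaging scheme are assumptions, not the SPARTAN filter itself.

    # Illustrative only: gauge the domain representativeness of a website from
    # page counts of seed terms restricted to that site.
    def page_count(query):
        # Hypothetical hit-count lookup; plug in a real search engine API here.
        raise NotImplementedError

    def domain_representativeness(site, seed_terms):
        total = page_count("site:%s" % site)
        if total == 0:
            return 0.0
        ratios = [page_count('site:%s "%s"' % (site, t)) / total for t in seed_terms]
        return sum(ratios) / len(ratios)

    # Sites scoring above a chosen threshold would be admitted into the domain corpus.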

Chapter 8 addressed the last problem by proposing a novel technique called

ARCHILES. It employs term clustering, word disambiguation and lexical simpli-

cation techniques with Wikipedia and search engines for acquiring coarse-grained


semantic relations between terms. Chapter 7 discussed in detail the multi-pass clus-

tering algorithm TTA with the featureless relatedness measures noW and NWD

used by ARCHILES. It was found that the use of mutual information during lexi-

cal simplification, TTA and NWD for term clustering, and noW and Wikipedia for

word disambiguation enables ARCHILES to cope with complex, uncommon and

ambiguous terms during relation acquisition.
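NWD belongs to the family of page-count-based relatedness measures popularised by Cilibrasi and Vitanyi's Normalized Google Distance; the Python sketch below gives that distance in its standard form, which may differ in detail from the exact NWD variant defined in Chapter 7. The page counts f(x), f(y), f(x, y) and the index size n are inputs the caller must supply.

    import math

    # Normalized Google/Web Distance in its standard form (after Cilibrasi & Vitanyi);
    # the exact NWD variant used in Chapter 7 may differ in detail.
    def nwd(fx, fy, fxy, n):
        """fx, fy: page counts of the two terms; fxy: pages containing both; n: index size."""
        if min(fx, fy, fxy) <= 0 or n <= 1:
            return float("inf")
        lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
        return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

    # Smaller values indicate more closely related terms, e.g. nwd(4.2e7, 1.3e8, 2.9e6, 1e10).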

10.2 Limitations and Implications for Future Research

There are at least five interesting questions related to the proposed ontology

learning system that remain unexplored. First, can the current text processing

techniques, including the word collocation measures OU and UH, adapt to domains

with highly complex vocabulary such as those involving biological entities (e.g. pro-

teins, genes)? At the very least, the adaptation of existing sentence parsing and

noun phrase chunking techniques will be required to meet the needs of these do-

mains. Second, another area for future work is to incorporate sentence parsing and

named entity tagging capabilities into the current SPARTAN-based corpus construction

technique for creating annotated text corpora. Automatically-constructed anno-

tated corpora will prove to be invaluable resources for a wide range of applications

such as text categorisation and machine translation. Third, there is an increasing

interest in mining opinions or sentiments from text. The two main research inter-

ests in opinion mining are the automatic building of sentiment dictionaries (typically

comprising adjectives and adverbs as sentiment words), and the recognition of sentiments

expressed in text and their relations with other aspects of the text (e.g. who ex-

pressed the sentiment, the sentiment’s target). Can the current term recognition

technique using OT and TH, which focuses on domain-relevant noun phrases, be

extended to handle other parts of speech for opinion mining? If so, the proposed

term recognition technique using SPARTAN-based corpora can ultimately be used

to produce high-quality sentiment clouds and sentiment ontologies. Fourth, the

current system lacks consideration for the temporal aspect of information such as

publication date during the discovery of term clouds and lightweight ontologies.

With the inclusion of the date factor into these abstractions, users can browse and

see the evolution of important concepts and relations across different time peri-

ods. Lastly, care should be taken when interpreting the results from some of the

preliminary experiments reported in this dissertation. In particular, more work is

required to critically evaluate the ARCHILES and ISSAC techniques using larger

datasets, and to demonstrate the significance of the results through statistical tests.


For instance, the ARCHILES technique for acquiring coarse-grained relations be-

tween terms reported in this dissertation has only been tested using small datasets.

Assessments using larger datasets are being planned for the near future. It would

also be interesting to look at how ARCHILES can be used to complement other

techniques for discovering fine-grained semantic relations.

In fact, ARCHILES and all techniques reported in this dissertation are con-

stantly undergoing further tests using real-world text from various domains. New

term clouds and lightweight ontologies are continually being created automatically from recent news articles. These clouds and ontologies are available for browsing via our dedicated Web application (http://explorer.csse.uwa.edu.au/research/). In other words, this research, the new tech-

niques, and the resulting Web application are subjected to continuous scrutiny and

improvement in an effort to achieve better ontology learning performance and to

define new application areas. In order to further demonstrate the system’s over-

all ability at cross-domain ontology learning, practical applications using real-world

text in several domains have been planned. In the long run, ontology learning re-

search will cross paths with advances from ontology merging. As the number of

automatically-created ontologies grows, the need to consolidate and merge them into

one single extensive structure will arise.

This thesis has only looked at the automatic learning of term clouds and ontolo-

gies from cross-domain documents in the English language. Overall, the proposed

system represents only a few important steps in the vast area of ontology learning.

One area of growing interest is cross-media ontology learning. While news articles

may be the focus of the proposed ontology learning system, other text documents

including product reviews, medical reports, financial reports, emails and search re-

sults can equally benefit from such an automatic abstraction service. The automatic

generation of term clouds and lightweight ontologies using different media types

such as audio (e.g. call centre recordings, interactive voice response systems) and video (e.g. teleconferencing, video surveillance) is an interesting research direction

for future researchers. Another research direction that will gain greater attention in

the future is cross-language ontology learning. How well the proposed system will transfer remains to be seen, as the difficulty of handling languages of different morphological and syntactic complexity cannot be underestimated.

All in all, the suggestions and questions raised in this section provide interesting

insights into future research directions in ontology learning from text.


Page 257: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

233

Bibliography

[1] H. Abdi. The method of least squares. In N. Salkind, editor, Encyclopedia of

Measurement and Statistics. CA, USA: Thousand Oaks, 2007.

[2] L. Adamic and B. Huberman. Zipfs law and the internet. Glottometrics,

3(1):143–150, 2002.

[3] E. Adar, J. Teevan, S. Dumais, and J. Elsas. The web changes everything:

Understanding the dynamics of web content. In Proceedings of the 2nd ACM

International Conference on Web Search and Data Mining, Barcelona, Spain,

2009.

[4] A. Agbago and C. Barriere. Corpus construction for terminology. In Proceed-

ings of the Corpus Linguistics Conference, Birmingham, UK, 2005.

[5] A. Agustini, P. Gamallo, and G. Lopes. Selection restrictions acquisition for

parsing and information retrieval improvement. In Proceedings of the 14th

International Conference on Applications of Prolog, Tokyo, Japan, 2001.

[6] J. Allen. Natural language understanding. Benjamin/Cummings, California,

1995.

[7] G. Amati and C. vanRijsbergen. Term frequency normalization via pareto

distributions. In Proceedings of the 24th BCS-IRSG European Colloquium on

Information Retrieval Research, Glasgow, UK, 2002.

[8] M. Ashburner, C. Ball, J. Blake, D. Botstein, H. Butler, M. Cherry, A. Davis,

K. Dolinski, S. Dwight, and J. Eppig. Gene ontology: Tool for the unification

of biology. Nature Genetics, 25(1):25–29, 2000.

[9] K. Atkinson. Gnu aspell 0.60.4. http://aspell.sourceforge.net/, 2006.

[10] P. Atzeni, R. Basili, D. Hansen, P. Missier, P. Paggio, M. Pazienza, and F. Zan-

zotto. Ontology-based question answering in a federation of university sites:

the moses case study. In Proceedings of the 9th International Conference on

Applications of Natural Language to Information Systems (NLDB), Manch-

ester, United Kingdom, 2004.

[11] A. Aula. Enhancing the readability of search result summaries. In Proceed-

ings of the 18th British HCI Group Annual Conference, Leeds Metropolitan

University, UK, 2004.

Page 258: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

234 Bibliography

[12] H. Baayen. Statistical models for word frequency distributions: A linguistic

evaluation. Computers and the Humanities, 26(5-6):347–363, 2004.

[13] C. Baker, R. Kanagasabai, W. Ang, A. Veeramani, H. Low, and M. Wenk.

Towards ontology-driven navigation of the lipid bibliosphere. In Proceedings

of the 6th International Conference on Bioinformatics (InCoB), Hong Kong,

2007.

[14] Z. Bar-Yossef, A. Broder, R. Kumar, and A. Tomkins. Sic transit gloria

telae: Towards an understanding of the webs decay. In Proceedings of the 13th

International Conference on World Wide Web (WWW), New York, 2004.

[15] M. Baroni and S. Bernardini. Bootcat: Bootstrapping corpora and terms

from the web. In Proceedings of the 4th Language Resources and Evaluation

Conference (LREC), Lisbon, Portugal, 2004.

[16] M. Baroni and S. Bernardini. Wacky! working papers on the web as corpus.

GEDIT, Bologna, Italy, 2006.

[17] M. Baroni, A. Kilgarriff, J. Pomikalek, and P. Rychly. Webbootcat: Instant

domain-specific corpora to support human translators. In Proceedings of the

11th Annual Conference of the European Association for Machine Translation

(EAMT), Norway, 2006.

[18] M. Baroni and M. Ueyama. Building general- and special-purpose corpora by

web crawling. In Proceedings of the 13th NIJL International Symposium on

Language Corpora: Their Compilation and Application, 2006.

[19] R. Basili, A. Moschitti, M. Pazienza, and F. Zanzotto. A contrastive approach

to term extraction. In Proceedings of the 4th Terminology and Artificial Intel-

ligence Conference (TIA), France, 2001.

[20] R. Basili, M. Pazienza, and F. Zanzotto. Modelling syntactic context in au-

tomatic term extraction. In Proceedings of the International Conference on

Recent Advances in Natural Language Processing, Bulgaria, 2001.

[21] M. Bender and M. Farach-Colton. The lca problem revisited. In Proceedings

of the 4th Latin American Symposium on Theoretical Informatics, Punta del

Este, Uruguay, 2000.

Page 259: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 235

[22] M. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin.

Lowest common ancestors in trees and directed acyclic graphs. Journal of

Algorithms, 57(2):7594, 2005.

[23] C. Bennett, P. Gacs, M. Li, P. Vitanyi, and W. Zurek. Information distance.

IEEE Transactions on Information Theory, 44(4):1407–1423, 1998.

[24] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web.

http://www.scientificamerican.com/article.cfm?id=the-semantic-web; 20 May

2009, 2001.

[25] S. Bird, E. Klein, E. Loper, and J. Baldridge. Multidisciplinary instruction

with the natural language toolkit. In Proceedings of the 3rd ACL Workshop

on Issues in Teaching Computational Linguistics, Ohio, USA, 2008.

[26] I. Blair, G. Urland, and J. Ma. Using internet search engines to estimate word

frequency. Behavior Research Methods Instruments & Computers, 34(2):286–

290, 2002.

[27] O. Bodenreider. Biomedical ontologies in action: Role in knowledge man-

agement, data integration and decision support. IMIA Yearbook of Medical

Informatics, 1(1):67–79, 2008.

[28] A. Bookstein, S. Klein, and T. Raita. Clumping properties of content-bearing

words. Journal of the American Society of Information Science, 49(2):102–114,

1998.

[29] A. Bookstein and D. Swanson. Probabilistic models for automatic indexing.

Journal of the American Society for Information Science, 25(5):312–8, 1974.

[30] J. Brank, M. Grobelnik, and D. Mladenic. A survey of ontology evaluation

techniques. In Proceedings of the Conference on Data Mining and Data Ware-

houses (SiKDD), Ljubljana, Slovenia, 2005.

[31] C. Brewster, H. Alani, S. Dasmahapatra, and Y. Wilks. Data driven ontol-

ogy evaluation. In Proceedings of the International Conference on Language

Resources and Evaluation (LREC), Lisbon, Portugal, 2004.

[32] C. Brewster, F. Ciravegna, and Y. Wilks. Background and foreground knowl-

edge in dynamic ontology construction: Viewing text as knowledge mainte-

nance. In Proceedings of the SIGIR Workshop on Semantic Web, Toronto,

Canada, 2003.

Page 260: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

236 Bibliography

[33] E. Brill. A simple rule-based part of speech tagger. In Proceedings of the 3rd

Conference on Applied Natural Language Processing, 1992.

[34] A. Budanitsky. Lexical semantic relatedness and its application in natural lan-

guage processing. Technical Report CSRG-390, Computer Systems Research

Group, University of Toronto, 1999.

[35] P. Buitelaar, P. Cimiano, and B. Magnini. Ontology learning from text: An

overview. In P. Buitelaar, P. Cimmiano, and B. Magnini, editors, Ontology

Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005.

[36] L. Burnard. Reference guide for the british national corpus.

http://www.natcorp.ox.ac.uk/XMLedition/URG/, 2007.

[37] T. Cabre-Castellvi, R. Estopa, and J. Vivaldi-Palatresi. Automatic term de-

tection: A review of current systems. In D. Bourigault, C. Jacquemin, and

M. LHomme, editors, Recent Advances in Computational Terminology. John

Benjamins, 2001.

[38] M. Castellanos. Hotminer: Discovering hot topics on the web. In M. Berry,

editor, Survey of Text Mining. Springer-Verlag, 2003.

[39] P. Castells, M. Fernandez, and D. Vallet. An adaptation of the vector-space

model for ontology-based information retrieval. IEEE Transactions on Knowl-

edge and Data Engineering, 19(2):261–272, 2007.

[40] G. Cavaglia and A. Kilgarriff. Corpora from the web. In Proceedings of the

4th Annual CLUCK Colloquium, Sheffield, UK, 2001.

[41] H. Chen, M. Lin, and Y. Wei. Novel association measures using web search

with double checking. In Proceedings of the 21st International Conference on

Computational Linguistics, Sydney, Australia, 2006.

[42] N. Chinchor, D. Lewis, and L. Hirschman. Evaluating message understanding

systems: An analysis of the third message understanding conference (muc-3).

Computational Linguistics, 19(3):409–449, 1993.

[43] J. Cho, S. Han, and H. Kim. Meta-ontology for automated information inte-

gration of parts libraries. Computer-Aided Design, 38(7):713–725, 2006.

[44] B. Choi and Z. Yao. Web page classification. In W. Chu and T. Lin, editors,

Foundations and Advances in Data Mining. Springer-Verlag, 2005.

Page 261: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 237

[45] K. Church and W. Gale. Inverse document frequency (idf): A measure of

deviations from poisson. In Proceedings of the ACL 3rd Workshop on Very

Large Corpora, 1995.

[46] K. Church and W. Gale. Poisson mixtures. Natural Language Engineering,

1(2):Page 163–190, 1995.

[47] K. Church and P. Hanks. Word association norms, mutual information, and

lexicography. Computational Linguistics, 16(1):22–29, 1990.

[48] M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, and I. Rojas. Unsupervised

learning of semantic relations between concepts of a molecular biology ontol-

ogy. In Proceedings of the 19th International Joint Conference on Artificial

Intelligence (IJCAI), 2005.

[49] R. Cilibrasi and P. Vitanyi. Automatic extraction of meaning from the web.

In Proceedings of the IEEE International Symposium on Information Theory,

Seattle, USA, 2006.

[50] R. Cilibrasi and P. Vitanyi. The google similarity distance. IEEE Transactions

on Knowledge and Data Engineering, 19(3):370–383, 2007.

[51] P. Cimiano and S. Staab. Learning concept hierarchies from text with a guided

agglomerative clustering algorithm. In Proceedings of the Workshop on Learn-

ing and Extending Lexical Ontologies with Machine Learning Methods, Bonn,

Germany, 2005.

[52] J. Cimino and X. Zhu. The practical impact of ontologies on biomedical

informatics. IMIA Yearbook of Medical Informatics, 1(1):124–135, 2006.

[53] A. Clark. Pre-processing very noisy text. In Proceedings of the Workshop on

Shallow Processing of Large Corpora at Corpus Linguistics, 2003.

[54] P. Constant. L’analyseur linguistique sylex. In Proceedings of the 5eme Ecole

d’ete du CNET, 1995.

[55] O. Corcho. Ontology-based document annotation: Trends and open re-

search problems. International Journal on Metadata, Semantics and Ontolo-

gies(Volume 1):Issue 1, 2006.

Page 262: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

238 Bibliography

[56] B. Croft and J. Ponte. A language modeling approach to information re-

trieval. In Proceedings of the 21st International Conference on Research and

Development in Information Retrieval, 1998.

[57] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. Gate: an archi-

tecture for development of robust hlt applications. In Proceedings of the 40th

Anniversary Meeting of the Association for Computational Linguistics (ACL),

Philadelphia, USA, 2002.

[58] F. Damerau. A technique for computer detection and correction of spelling

errors. Communications of the ACM, 7(3):171–176, 1964.

[59] J. Davies, D. Fensel, and F. vanHarmelen. Towards the semantic web:

Ontology-driven knowledge management. Wiley, UK, 2003.

[60] K. Dellschaft and S. Staab. On how to perform a gold standard based eval-

uation of ontology learning. In Proceedings of the 5th International Semantic

Web Conference (ISWC), 2006.

[61] J. Deneubourg, S. Goss, N. Franks, A. Sendova-Franks, C. Detrain, and

L. Chretien. The dynamics of collective sorting: Robot-like ants and ant-

like robots. In Proceedings of the 1st International Conference on Simulation

of Adaptive Behavior: From Animals to Animats, France, 1991.

[62] Y. Ding and S. Foo. Ontology research and development: Part 1. Journal of

Information Science, 28(2):123–136, 2002.

[63] N. Draper and H. Smith. Applied regression analysis (3rd ed.). John Wiley

& Sons, 1998.

[64] T. Dunning. Accurate methods for the statistics of surprise and coincidence.

Computational Linguistics, 19(1):61–74, 1994.

[65] H. Edmundson. Statistical inference in mathematical and computational lin-

guistics. International Journal of Computer and Information Sciences, 6(2

Pages 95-129), 1977.

[66] R. Engels and T. Lech. Generating ontologies for the semantic web: Onto-

builder. In J. Davies, D. Fensel, and F. vanHarmelen, editors, Towards the

Semantic Web: Ontology-driven Knowledge Management. England: John Wi-

ley & Sons, 2003.

Page 263: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 239

[67] S. Evert. A lightweight and efficient tool for cleaning web pages. In Proceedings

of the 4th Web as Corpus Workshop (WAC), Morocco, 2008.

[68] D. Faure and C. Nedellec. Asium: Learning subcategorization frames and

restrictions of selection. In Proceedings of the 10th Conference on Machine

Learning (ECML), Germany, 1998.

[69] D. Faure and C. Nedellec. A corpus-based conceptual clustering method for

verb frames and ontology acquisition. In Proceedings of the 1st International

Conference on Language Resources and Evaluation (LREC), Granada, Spain,

1998.

[70] D. Faure and C. Nedellec. Knowledge acquisition of predicate argument struc-

tures from technical texts using machine learning: The system asium. In Pro-

ceedings of the 11th European Workshop on Knowledge Acquisition, Modeling

and Management (EKAW), Dagstuhl Castle, Germany, 1999.

[71] D. Faure and T. Poibeau. First experiments of using semantic knowledge

learned by asium for information extraction task using intex. In Proceedings

of the 1st Workshop on Ontology Learning, Berlin, Germany, 2000.

[72] D. Fensel. Ontology-based knowledge management. Computer, 35(11):56–59,

2002.

[73] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the

evolution of web pages. In Proceedings of the 12th International Conference

on World Wide Web, Budapest, Hungary, 2003.

[74] W. Fletcher. Implementing a bnc-comparable web corpus. In Proceedings of

the 3rd Web as Corpus Workshop, Belgium, 2007.

[75] C. Fluit, M. Sabou, and F. vanHarmelen. Supporting user tasks through

visualisation of lightweight ontologies. In S. Staab and R. Studer, editors,

Handbook on Ontologies in Information Systems. Springer-Verlag, 2003.

[76] B. Fortuna, D. Mladenic, and M. Grobelnik. Semi-automatic construction of

topic ontology. In Proceedings of the Conference on Data Mining and Data

Warehouses (SiKDD), Ljubljana, Slovenia, 2005.

[77] H. Fotzo and P. Gallinari. Learning generalization/specialization relations

between concepts - application for automatically building thematic document

Page 264: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

240 Bibliography

hierarchies. In Proceedings of the 7th International Conference on Computer-

Assisted Information Retrieval (RIAO), Vaucluse, France, 2004.

[78] W. Francis and H. Kucera. Brown corpus manual.

http://icame.uib.no/brown/bcm.html, 1979.

[79] K. Frantzi. Incorporating context information for the extraction of terms.

In Proceedings of the 35th Annual Meeting on Association for Computational

Linguistics, Spain, 1997.

[80] K. Frantzi and S. Ananiadou. Automatic term recognition using contextual

cues. In Proceedings of the IJCAI Workshop on Multilinguality in Software

Industry: the AI Contribution, Japan, 1997.

[81] A. Franz. Independence assumptions considered harmful. In Proceedings of

the 8th Conference on European Chapter of the Association for Computational

Linguistics, Madrid, Spain, 1997.

[82] N. Fuhr. Two models of retrieval with probabilistic indexing. In Proceedings of

the 9th ACM SIGIR International Conference on Research and Development

in Information Retrieval, 1986.

[83] N. Fuhr. Probabilistic models in information retrieval. The Computer Journal,

35(3):243–255, 1992.

[84] F. Furst and F. Trichet. Heavyweight ontology engineering. In Proceedings

of the International Conference on Ontologies, Databases, and Applications of

Semantics (ODBASE), Montpellier, France, 2006.

[85] P. Gamallo, A. Agustini, and G. Lopes. Learning subcategorisation informa-

tion to model a grammar with co-restrictions. Traitement Automatic de la

Langue, 44(1):93–117, 2003.

[86] P. Gamallo, M. Gonzalez, A. Agustini, G. Lopes, and V. deLima. Mapping

syntactic dependencies onto semantic relations. In Proceedings of the ECAI

Workshop on Machine Learning and Natural Language Processing for Ontology

Engineering, 2002.

[87] J. Gantz. The diverse and exploding digital universe: An updated forecast of

worldwide information growth through 2011. Technical Report White paper,

International Data Corporation, 2008.

Page 265: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 241

[88] C. Girardi. Htmlcleaner: Extracting the relevant text from the web pages. In

Proceedings of the 3rd Web as Corpus Workshop, Belgium, 2007.

[89] F. Giunchiglia and I. Zaihrayeu. Lightweight ontologies. Technical Report

DIT-07-071, University of Trento, 2007.

[90] A. Gomez-Perez and D. Manzano-Macho. A survey of ontology learning meth-

ods and techniques. Deliverable 1.5, OntoWeb Consortium, 2003.

[91] J. Gould, L. Alfaro, R. Finn, B. Haupt, A. Minuto, and J. Salaun. Why

reading was slower from crt displays than from paper. ACM SIGCHI Bulletin,

17(SI):7–11, 1986.

[92] T. Gruber. A translation approach to portable ontology specifications. Knowl-

edge Acquisition, 5(2):199–220, 1993.

[93] P. Grunwald and P. Vitanyi. Kolmogorov complexity and information theory.

Journal of Logic, Language(and Information):Volume 12, 2003.

[94] H. Gutowitz. Complexity-seeking ants. In Proceedings of the 3rd European

Conference on Artificial Life, 1993.

[95] U. Hahn and M. Romacker. Content management in the syndikate system:

How technical documents are automatically transformed to text knowledge

bases. Data & Knowledge Engineering, 35(1):137–159, 2000.

[96] U. Hahn and M. Romacker. The syndikate text knowledge base generator. In

Proceedings of the 1st International Conference on Human Language Technol-

ogy Research, San Diego, USA, 2001.

[97] M. Halliday, W. Teubert, C. Yallop, and A. Cermakova. Lexicology and corpus

linguistics: An introduction. Continuum, London, 2004.

[98] J. Handl, J. Knowles, and M. Dorigo. Ant-based clustering: A comparative

study of its relative performance with respect to k-means, average link and

1d-som. Technical Report TR/IRIDIA/2003-24, Universite Libre de Bruxelles,

2003.

[99] J. Handl, J. Knowles, and M. Dorigo. Ant-based clustering and topographic

mapping. Artificial Life, 12(1):35–61, 2006.

Page 266: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

242 Bibliography

[100] J. Handl and B. Meyer. Improved ant-based clustering and sorting. In Pro-

ceedings of the 7th International Conference on Parallel Problem Solving from

Nature, 2002.

[101] S. Handschuh and S. Staab. Authoring and annotation of web pages in

cream. In Proceedings of the 11th International Conference on World Wide

Web (WWW), Hawaii, USA, 2002.

[102] M. Hearst. Automated discovery of wordnet relations. In Christiane Fellbaum,

editor, WordNet: An Electronic Lexical Database and Some of its Applications.

MIT Press, 1998.

[103] J. Heflin and J. Hendler. A portrait of the semantic web in action. IEEE

Intelligent Systems, 16(2):54–59, 2001.

[104] M. Henzinger and S. Lawrence. Extracting knowledge from the world wide

web. PNAS, 101(1):5186–5191, 2004.

[105] A. Hippisley, D. Cheng, and K. Ahmad. The head-modifier principle and

multilingual term extraction. Natural Language Engineering, 11(2):129–157,

2005.

[106] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia. Overview of biocreative:

Critical assessment of information. BMC Bioinformatics, 6(1):S1, 2005.

[107] T. Hisamitsu, Y. Niwa, and J. Tsujii. A method of measuring term represen-

tativeness: Baseline method using co-occurrence di. In Proceedings of the 18th

International Conference on Computational Linguistics, Germany, 2000.

[108] D. Holmes and C. McCabe. Improving precision and recall for soundex re-

trieval. In Proceedings of the International Conference on Information Tech-

nology: Coding and Computing (ITCC), 2002.

[109] A. Hulth. Improved automatic keyword extraction given more linguistic knowl-

edge. In Proceedings of the Conference on Empirical Methods in Natural Lan-

guage Processing (EMNLP), Japan, 2003.

[110] C. Hwang. Incompletely and imprecisely speaking: Using dynamic ontologies

for representing and retrieving information. In Proceedings of the 6th Inter-

national Workshop on Knowledge Representation meets Databases (KRDB),

Sweden, 1999.

Page 267: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 243

[111] E. Hyvonen, A. Styrman, and S. Saarela. Ontology-based image retrieval. In

Proceedings of the 12th International World Wide Web Conference, Budapest,

Hungary, 2003.

[112] J. Izsak. Some practical aspects of fitting and testing the zipf-mandelbrot

model. Scientometrics, 67(1):107–120, 2006.

[113] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing

Survey, 31(3):264–323, 1999.

[114] J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and

lexical taxonomy. In Proceedings of the International Conference Research on

Computational Linguistics (ROCLING), Taiwan, 1997.

[115] T. Jiang, A. Tan, and K. Wang. Mining generalized associations of semantic

relations from textual web content. IEEE Transactions on Knowledge and

Data Engineering, 19(2):164–179, 2007.

[116] F. Jock. An overview of the importance of page rank.

http://www.associatedcontent.com/article/1502284/an overview of the importance of page.h

9 March 2009, 2009.

[117] K. Jones, S. Walker, and S. Robertson. A probabilistic model of information

retrieval: Development and status. Information Processing and Management,

36(6):809–840, 1998.

[118] K. Jones, S. Walker, and S. Robertson. A probabilistic model of information

retrieval: Development and comparative experiments. Information Processing

& Management, 36(6):809–840, 2000.

[119] R. Jones. Learning to Extract Entities from Labeled and Unlabeled Text. PhD

thesis, Carnegie Mellon University, 2005.

[120] K. Kageura and B. Umino. Methods of automatic term recognition: A review.

Terminology, 3(2):259–289, 1996.

[121] T. Kanungo and D. Orr. Predicting the readability of short web summaries.

In Proceedings of the 2nd ACM International Conference on Web Search and

Data Mining, Barcelona, Spain, 2009.

[122] S. Katz. Distribution of content words and phrases in text and language

modelling. Natural Language Engineering, 2(1):15–59, 1996.

Page 268: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

244 Bibliography

[123] M. Kavalec and V. Svatek. A study on automated relation labelling in ontology

learning. In P. Buitelaar, P. Cimmiano, and B. Magnini, editors, Ontology

Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005.

[124] F. Keller, M. Lapata, and O. Ourioupina. Using the web to overcome data

sparseness. In Proceedings of the Conference on Empirical Methods in Natural

Language Processing (EMNLP), Philadelphia, 2002.

[125] M. Kida, M. Tonoike, T. Utsuro, and S. Sato. Domain classification of tech-

nical terms using the web. Systems and Computers in Japan, 38(14):11–19,

2007.

[126] J. Kietz, R. Volz, and A. Maedche. Extracting a domain-specific ontology from

a corporate intranet. In Proceedings of the 4th Conference on Computational

Natural Language Learning, Lisbon, Portugal, 2000.

[127] A. Kilgarriff. Web as corpus. In Proceedings of the Corpus Linguistics (CL),

Lancaster University, UK, 2001.

[128] A. Kilgarriff. Googleology is bad science. Computational Linguistics,

33(1):147–151, 2007.

[129] A. Kilgarriff and G. Grefenstette. Web as corpus. Computational Linguistics,

29(3):1–15, 2003.

[130] J. Kim, T. Ohta, Y. Teteisi, and J. Tsujii. Genia corpus - a semantically

annotated corpus for bio-textmining. Bioinformatics, 19(1):Page 180–182,

2003.

[131] C. Kit. Corpus tools for retrieving and deriving termhood evidence. In Pro-

ceedings of the 5th East Asia Forum of Terminology, Haikou, China, 2002.

[132] D. Klein and C. Manning. Accurate unlexicalized parsing. In Proceedings of

the 41st Meeting of the Association for Computational Linguistics, 2003.

[133] S. Krug. Dont make me think! a common sense approach to web usability.

New Riders, Indianapolis, USA, 2000.

[134] G. Kuenning. International ispell 3.3.02. http://ficus-

www.cs.ucla.edu/geoff/ispell.html, 2006.

Page 269: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 245

[135] K. Kukich. Technique for automatically correcting words in text. ACM Com-

puting Surveys, 24(4):377– 439, 1992.

[136] D. Kurz and F. Xu. Text mining for the extraction of domain relevant terms

and term collocations. In Proceedings of the International Workshop on Com-

putational Approaches to Collocations, Vienna, 2002.

[137] K. Lagus, T. Honkela, S. Kaski, and T. Kohonen. Self-organizing maps of

document collections: A new approach to interactive exploration. In Proceed-

ings of the 2nd International Conference on Knowledge Discovery and Data

Mining, 1996.

[138] A. Lait and B. Randell. An assessment of name matching algorithms. Technical

report, University of Newcastle upon Tyne, 1993.

[139] T. Landauer, P. Foltz, and D. Laham. An introduction to latent semantic

analysis. Discourse Processes, 25(1):259–284, 1998.

[140] M. Lapata and F. Keller. Web-based models for natural language processing.

ACM Transactions on Speech and Language Processing, 2(1):1–30, 2005.

[141] N. Lavrac and S. Dzeroski. Inductive logic programming: Techniques and

applications. Ellis Horwood, NY, USA, 1994.

[142] D. Lelewer and D. Hirschberg. Data compression. Volume 19, Issue 3(Pages

261-296), 1987.

[143] S. Lemaignan, A. Siadat, J. Dantan, and A. Semenenko. Mason: A pro-

posal for an ontology of manufacturing domain. In Proceedings of the IEEE

Workshop on Distributed Intelligent Systems: Collective Intelligence and Its

Applications (DIS), Czech Republic, 2006.

[144] S. LeMoigno, J. Charlet, D. Bourigault, P. Degoulet, and M. Jaulent. Ter-

minology extraction from text to build an ontology in surgical intensive care.

In Proceedings of the ECAI Workshop on Machine Learning and Natural Lan-

guage Processing for Ontology Engine, 2002.

[145] M. Lesk. Automatic sense disambiguation using machine readable dictionaries:

How to tell a pine cone from an ice cream cone. In Proceedings of the 5th

International Conference on Systems Documentation, 1986.

Page 270: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

246 Bibliography

[146] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and

reversals. Soviet Physics Doklady, 10(8):707–710, 1966.

[147] D. Lewis. Naive (bayes) at forty: The independence assumption in informa-

tion retrieval. In Proceedings of the 10th European Conference on Machine

Learning, 1998.

[148] M. Liberman. Questioning reality. http://itre.cis.upenn.edu./ myl/languagelog/archives/001837.html;

26 March 2009, 2005.

[149] D. Lin. Principar: An efficient, broad-coverage, principle-based parser. In Pro-

ceedings of the 15th International Conference on Computational Linguistics,

1994.

[150] D. Lin. Dependency-based evaluation of minipar. In Proceedings of the 1st

International Conference on Language Resources and Evaluation, 1998.

[151] D. Lindberg, B. Humphreys, and A. McCray. The unified medical language

system. Methods of Information in Medicine, 32(4):281–291, 1993.

[152] K. Linden and J. Piitulainen. Discovering synonyms and other related words.

In Proceedings of the CompuTerm 2004, Geneva, Switzerland, 2004.

[153] L. Lita, A. Ittycheriah, S. Roukos, and N. Kambhatla. truecasing. In Pro-

ceedings of the 41st Annual Meeting of the Association for Computational

Linguistics, Japan, 2003.

[154] Q. Liu, K. Xu, L. Zhang, H. Wang, Y. Yu, and Y. Pan. Catriple: Extracting

triples from wikipedia categories. In Proceedings of the 3rd Asian Semantic

Web Conference (ASWC), Bangkok, Thailand, 2008.

[155] V. Liu and J. Curran. Web text corpus for natural language processing. In

Proceedings of the 11th Conference of the European Chapter of the Association

for Computational Linguistics (EACL), Italy, 2006.

[156] W. Liu, A. Weichselbraun, A. Scharl, and E. Chang. Semi-automatic on-

tology extension using spreading activation. Journal of Universal Knowledge

Management, volume 0(1):50–58, 2005.

[157] N. Loukachevitch and B. Dobrov. Sociopolitical domain as a bridge from gen-

eral words to terms of specific domains. In Proceedings of the 2nd International

Global Wordnet Conference, 2004.

Page 271: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 247

[158] A. Lozano-Tello, A. Gomez-Perez, and E. Sosa. Selection of ontologies for

the semantic web. In Proceedings of the International Conference on Web

Engineering (ICWE), Oviedo, Spain, 2003.

[159] H. Luhn. Keyword in context index for technical literature. American Docu-

mentation, 11(4):288–295, 1960.

[160] E. Lumer and B. Faieta. Diversity and adaptation in populations of clustering

ants. In Proceedings of the 3rd International Conference on Simulation of

Adaptive Behavior: From Animals to Animats 3, 1994.

[161] A. Maedche, V. Pekar, and S. Staab. Ontology learning part one - on dis-

covering taxonomic relations from the web. In N. Zhong, J. Liu, and Y. Yao,

editors, Web Intelligence. Springer Verlag, 2002.

[162] A. Maedche and S. Staab. Discovering conceptual relations from text. In

Proceedings of the 14th European Conference on Artificial Intelligence, Berlin,

Germany, 2000.

[163] A. Maedche and S. Staab. The text-to-onto ontology learning environment.

In Proceedings of the 8th International Conference on Conceptual Structures,

Darmstadt, Germany, 2000.

[164] A. Maedche and S. Staab. Measuring similarity between ontologies. In Proceed-

ings of the European Conference on Knowledge Acquisition and Management

(EKAW), Madrid, Spain, 2002.

[165] A. Maedche and R. Volz. The ontology extraction & maintenance framework:

Text-to-onto. In Proceedings of the IEEE International Conference on Data

Mining, California, USA, 2001.

[166] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel. Performance mea-

sures for information extraction. In Proceedings of the DARPA Broadcast News

Workshop, 1999.

[167] A. Malucelli and E. Oliveira. Ontology-services to facilitate agents interop-

erability. In Proceedings of the 6th Pacific Rim International Workshop On

Multi-Agents (PRIMA), Seoul, South Korea, 2003.

[168] B. Mandelbrot. Information theory and psycholinguistics: A theory of word

frequencies. MIT Press, MA, USA, 1967.

Page 272: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

248 Bibliography

[169] C. Manning and H. Schutze. Foundations of statistical natural language pro-

cessing. MIT Press, MA, USA, 1999.

[170] E. Margulis. N-poisson document modelling. In Proceedings of the 15th ACM

SIGIR International Conference on Research and Development in Information

Retrieval, 1992.

[171] D. Maynard and S. Ananiadou. Term extraction using a similarity-based ap-

proach. In Recent Advances in Computational Terminology. John Benjamins,

1999.

[172] D. Maynard and S. Ananiadou. Identifying terms by their family and friends.

In Proceedings of the 18th International Conference on Computational Lin-

guistics, Germany, 2000.

[173] T. McEnery, R. Xiao, and Y. Tono. Corpus-based language studies: An ad-

vanced resource book. Taylor & Francis Group Plc, London, UK, 2005.

[174] K. McKeown and D. Radev. Collocations. In R. Dale, H. Moisl, and H. Somers,

editors, Handbook of Natural Language Processing. Marcel Dekker, 2000.

[175] R. Mihalcea. Using wikipedia for automatic word sense disambiguation. In

Proceedings of the Conference of the North American Chapter of the Associa-

tion for Computational Linguistics (NAACL), Rochester, 2007.

[176] A. Mikheev. Periods, capitalized words, etc. Computational Linguistics,

28(3):289–318, 2002.

[177] G. Miller, R. Beckwith, C. Fellbaum, K. Miller, and D. Gross. Introduction to

wordnet: An on-line lexical database. International Journal of Lexicography,

3(4):235–244, 1990.

[178] M. Missikoff, R. Navigli, and P. Velardi. Integrated approach to web ontology

learning and engineering. IEEE Computer, 35(11):60–63, 2002.

[179] P. Morville. Ambient findability. OReilly, California, USA, 2005.

[180] H. Nakagawa and T. Mori. A simple but powerful automatic term extraction

method. In Proceedings of the International Conference On Computational

Linguistics (COLING), 2002.

Page 273: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

Bibliography 249

[181] P. Nakov and M. Hearst. A study of using search engine page hits as a proxy

for n-gram frequencies. In Proceedings of the International Conference on

Recent Advances in Natural Language Processing (RANLP), Bulgaria, 2005.

[182] R. Navigli and P. Velardi. Semantic interpretation of terminological strings. In

Proceedings of the 3rd Terminology and Knowledge Engineering Conference,

Nancy, France, 2002.

[183] D. Nettle. Linguistic diversity. Oxford University Press, UK, 1999.

[184] G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An information

extraction core system for real world german text processing. In Proceedings

of the 5th International Conference of Applied Natural Language, 1997.

[185] H. Ng, C. Lim, and L. Koo. Learning to recognize tables in free text. In

Proceedings of the 37th Annual Meeting of the Association for Computational

Linguistics on Computational Linguistics, 1999.

[186] J. Nielsen. How little do users read? http://www.useit.com/alertbox/percent-

text-read.html; 4 May 2009, 2008.

[187] V. Novacek and P. Smrz. Bole - a new bio-ontology learning platform. In

Proceedings of the Workshop on Biomedical Ontologies and Text Processing,

2005.

[188] A. Ntoulas, J. Cho, and C. Olston. Whats new on the web? the evolution of the

web from a search engine perspective. In Proceedings of the 13th International

Conference on World Wide Web, New York, USA, 2004.

[189] M. Odell and R. Russell. U.s. patent numbers 1,435,663. U.S. Patent Office,

Washington, D.C., 1922.

[190] T. OHara, K. Mahesh, and S. Nirenburg. Lexical acquisition with wordnet

and the microkosmos ontology. In Proceedings of the Coling-ACL Workshop

on Usage of WordNet in Natural Language Processing Systems, 1998.

[191] A. Oliveira, F. Pereira, and A. Cardoso. Automatic reading and learning from

text. In Proceedings of the International Symposium on Artificial Intelligence

(ISAI), Kolhapur, India, 2001.

[192] E. ONeill, P. McClain, and B. Lavoie. A methodology for sampling the world

wide web. Journal of Library Administration, 34(3):279–291, 2001.

Page 274: Learning Lightweight Ontologies from Text across Different ... · Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge ... ontology

250 Bibliography

[193] S. Pakhomov. Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2001.

[194] J. Paralic and I. Kostial. Ontology-based information retrieval. In Proceedings of the 14th International Conference on Information and Intelligent Systems (IIS), Varazdin, Croatia, 2003.

[195] Y. Park and R. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2001.

[196] M. Pei, K. Nakayama, T. Hara, and S. Nishio. Constructing a global ontology by concept mapping using Wikipedia thesaurus. In Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, Okinawa, Japan, 2008.

[197] F. Pereira and A. Cardoso. Dr. Divago: Searching for new ideas in a multi-domain environment. In Proceedings of the 8th Cognitive Science of Natural Language Processing (CSNLP), Ireland, 1999.

[198] F. Pereira, A. Oliveira, and A. Cardoso. Extracting concept maps with clouds. In Proceedings of the Argentine Symposium of Artificial Intelligence (ASAI), Buenos Aires, Argentina, 2000.

[199] L. Philips. Hanging on the Metaphone. Computer Language Magazine, 7(12):38–44, 1990.

[200] R. Plackett. Some theorems in least squares. Biometrika, 37(1/2):149–157, 1950.

[201] M. Poesio and A. Almuhareb. Identifying concept attributes using a classifier. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, Ann Arbor, USA, 2005.

[202] R. Porzel and R. Malaka. A task-based approach for ontology evaluation. In Proceedings of the 16th European Conference on Artificial Intelligence, Valencia, Spain, 2004.

[203] D. Ravichandran, P. Pantel, and E. Hovy. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Michigan, USA, 2005.

[204] J. Rennie. Derivation of the F-measure. http://people.csail.mit.edu/jrennie/writing/fmeasure.pdf, 2004.

[205] A. Renouf, A. Kehoe, and J. Banerjee. WebCorp: An integrated system for web text search. In M. Hundt, N. Nesselhauf, and C. Biewer, editors, Corpus Linguistics and the Web. Amsterdam: Rodopi, 2007.

[206] P. Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11(1):95–130, 1999.

[207] P. Resnik and N. Smith. The Web as a parallel corpus. Computational Linguistics, 29(3):349–380, 2003.

[208] F. Ribas. On learning more appropriate selectional restrictions. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland, 1995.

[209] H. Ritter and T. Kohonen. Self-organizing semantic maps. Biological Cybernetics, 61(1):241–254, 1989.

[210] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.

[211] B. Rosario. Extraction of Semantic Relations from Bioscience Text. PhD thesis, University of California, Berkeley, 2005.

[212] B. Rozenfeld and R. Feldman. Clustering for unsupervised relation identification. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, 2007.

[213] M. Sabou, M. d'Aquin, and E. Motta. Scarlet: Semantic relation discovery by harvesting online ontologies. In Proceedings of the 5th European Semantic Web Conference, Tenerife, Spain, 2008.

[214] M. Sabou, C. Wroe, C. Goble, and G. Mishne. Learning domain ontologies for web service descriptions: An experiment in bioinformatics. In Proceedings of the 14th International Conference on World Wide Web, 2005.

[215] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

[216] D. Sanchez and A. Moreno. Automatic discovery of synonyms and lexicalizations from the web. In Proceedings of the 8th Catalan Conference on Artificial Intelligence, 2005.

[217] D. Sanchez and A. Moreno. Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering, 64(3):600–623, 2008.

[218] E. Schmar-Dobler. Reading on the Internet: The link between literacy and technology. Journal of Adolescent and Adult Literacy, 47(1):80–85, 2003.

[219] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, United Kingdom, 1994.

[220] A. Schutz and P. Buitelaar. RelExt: A tool for relation extraction from text in ontology extension. In Proceedings of the 4th International Semantic Web Conference (ISWC), Ireland, 2005.

[221] A. Schwartz and M. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In Proceedings of the Pacific Symposium on Biocomputing (PSB), 2003.

[222] F. Sclano and P. Velardi. TermExtractor: A web application to learn the shared terminology of emergent web communities. In Proceedings of the 3rd International Conference on Interoperability for Enterprise Software and Applications (I-ESA), Portugal, 2007.

[223] P. Senellart and V. Blondel. Automatic discovery of similar words. In M. Berry, editor, Survey of Text Mining. Springer-Verlag, 2003.

[224] V. Seretan, L. Nerima, and E. Wehrli. Using the web as a corpus for the syntactic-based collocation identification. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 2004.

[225] M. Shamsfard and A. Barforoush. An introduction to HASTI: An ontology learning system. In Proceedings of the 7th Iranian Conference on Electrical Engineering, Tehran, Iran, 2002.

[226] M. Shamsfard and A. Barforoush. The state of the art in ontology learning: A framework for comparison. Knowledge Engineering Review, 18(4):293–316, 2003.

[227] M. Shamsfard and A. Barforoush. Learning ontologies from natural language texts. International Journal of Human-Computer Studies, 60(1):17–63, 2004.

[228] S. Sharoff. Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini, editors, Wacky! Working Papers on the Web as Corpus. Bologna: GEDIT, 2006.

[229] Y. Shinyama and S. Sekine. Preemptive information extraction using unrestricted relation discovery. In Proceedings of the NAACL Conference on Human Language Technology (HLT), New York, 2006.

[230] H. Simon. On a class of skew distribution functions. Biometrika, 42(3-4):425–440, 1955.

[231] F. Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177, 1993.

[232] B. Smith, S. Penrod, A. Otto, and R. Park. Jurors' use of probabilistic evidence. Behavioral Science, 20(1):49–82, 1996.

[233] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[234] R. Snow, D. Jurafsky, and A. Ng. Learning syntactic patterns for automatic hypernym discovery. In Proceedings of the 17th Conference on Advances in Neural Information Processing Systems (NIPS), 2005.

[235] R. Snow, D. Jurafsky, and A. Ng. Semantic taxonomy induction from heterogeneous evidence. In Proceedings of the ACL 23rd International Conference on Computational Linguistics (COLING), 2006.

[236] R. Sombatsrisomboon, Y. Matsuo, and M. Ishizuka. Acquisition of hypernyms and hyponyms from the WWW. In Proceedings of the 2nd International Workshop on Active Mining, Japan, 2003.

[237] L. Specia and E. Motta. A hybrid approach for relation extraction aimed at the Semantic Web. In Proceedings of the International Conference on Flexible Query Answering Systems (FQAS), Milan, Italy, 2006.

[238] M. Spiliopoulou, F. Rinaldi, W. Black, and G. P. Zarri. Coupling information extraction and data mining for ontology learning in Parmenides. In Proceedings of the 7th International Conference on Computer-Assisted Information Retrieval (RIAO), Vaucluse, France, 2004.

[239] R. Srikant and R. Agrawal. Mining generalized association rules. Future Generation Computer Systems, 13(2-3):161–180, 1997.

[240] P. Srinivasan. On generalizing the two-Poisson model. Journal of the American Society for Information Science, 41(1):61–66, 1989.

[241] S. Staab, A. Maedche, and S. Handschuh. An annotation framework for the Semantic Web. In Proceedings of the 1st International Workshop on MultiMedia Annotation, Tokyo, Japan, 2001.

[242] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. Technical Report 00-034, University of Minnesota, 2000.

[243] M. Stevenson and R. Gaizauskas. Experiments on sentence boundary detection. In Proceedings of the 6th Conference on Applied Natural Language Processing, 2000.

[244] A. Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. PhD thesis, University of Texas at Austin, 2002.

[245] A. Sumida, N. Yoshinaga, and K. Torisawa. Boosting precision and recall of hyponymy relation acquisition from hierarchical layouts in Wikipedia. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, 2008.

[246] J. Tang, H. Li, Y. Cao, and Z. Tang. Email data cleaning. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005.

[247] D. Temperley and D. Sleator. Parsing English with a link grammar. In Proceedings of the 3rd International Workshop on Parsing Technologies, 1993.

[248] M. Thelwall and D. Stuart. Web crawling ethics revisited: Cost, privacy and denial of service. Journal of the American Society for Information Science and Technology, 57(13):1771–1779, 2006.

[249] C. Tullo and J. Hurford. Modelling Zipfian distributions in language. In Proceedings of the ESSLLI Workshop on Language Evolution and Computation, Vienna, 2003.

[250] D. Turcato, F. Popowich, J. Toole, D. Fass, D. Nicholson, and G. Tisher. Adapting a synonym database to specific domains. In Proceedings of the ACL Workshop on Recent Advances in Natural Language Processing and Information Retrieval, Hong Kong, 2000.

[251] P. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336, 2000.

[252] P. Turney. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML), Freiburg, Germany, 2001.

[253] M. Uschold and M. Gruninger. Ontologies and semantics for seamless connectivity. ACM SIGMOD Record, 33(4):58–64, 2004.

[254] A. Uyar. Investigation of the accuracy of search engine hit counts. Journal of Information Science, 35(4):469–480, 2009.

[255] D. Vallet, M. Fernandez, and P. Castells. An ontology-based information retrieval model. In Proceedings of the European Semantic Web Conference (ESWC), Crete, Greece, 2005.

[256] C. van Rijsbergen. Automatic text analysis. In Information Retrieval. University of Glasgow, 1979.

[257] M. Vargas-Vera, J. Domingue, Y. Kalfoglou, E. Motta, and S. Shum. Template-driven information extraction for populating ontologies. In Proceedings of the IJCAI Workshop on Ontology Learning, 2001.

[258] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna. MnM: Ontology driven tool for semantic markup. In Proceedings of the ECAI Workshop on Semantic Authoring, Annotation & Knowledge Markup (SAAKM), Lyon, France, 2002.

[259] P. Velardi, P. Fabriani, and M. Missikoff. Using text processing techniques to automatically enrich a domain ontology. In Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS), Ogunquit, Maine, USA, 2001.

[260] P. Velardi, R. Navigli, A. Cucchiarelli, and F. Neri. Evaluation of OntoLearn, a methodology for automatic learning of ontologies. In P. Buitelaar, P. Cimiano, and B. Magnini, editors, Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005.

[261] P. Vitanyi. Universal similarity. In Proceedings of the IEEE ITSOC Information Theory Workshop on Coding and Complexity, New Zealand, 2005.

[262] P. Vitanyi, F. Balbach, R. Cilibrasi, and M. Li. Normalized information distance. In F. Emmert-Streib and M. Dehmer, editors, Information Theory and Statistical Learning. New York: Springer, 2009.

[263] A. Vizine, L. de Castro, E. Hruschka, and R. Gudwin. Towards improving clustering ants: An adaptive ant clustering algorithm. Informatica, 29(2):143–154, 2005.

[264] E. Voorhees and D. Harman. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, MA, USA, 2005.

[265] J. Véronis and N. Ide. Word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–41, 1998.

[266] R. Wagner and M. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168–173, 1974.

[267] N. Weber and P. Buitelaar. Web-based ontology learning with ISOLDE. In Proceedings of the ISWC Workshop on Web Content Mining with Human Language Technologies, Athens, USA, 2006.

[268] H. Weinreich, H. Obendorf, E. Herder, and M. Mayer. Not quite the average: An empirical study of web use. ACM Transactions on the Web, 2(1):1–31, 2008.

[269] J. Wermter and U. Hahn. Finding new terminology in very large corpora. In Proceedings of the 3rd International Conference on Knowledge Capture, Alberta, Canada, 2005.

[270] G. Wilson. Info-overload harms concentration more than marijuana. New Scientist, (2497):6, 2005.

[271] W. Wong. Practical approach to knowledge-based question answering with natural language understanding and advanced reasoning. Master's thesis, National Technical University College of Malaysia, 2005.

[272] W. Wong, W. Liu, and M. Bennamoun. Featureless similarities for terms clustering using tree-traversing ants. In Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia, 2006.

[273] W. Wong, W. Liu, and M. Bennamoun. Integrated scoring for spelling error correction, abbreviation expansion and case restoration in dirty text. In Proceedings of the 5th Australasian Conference on Data Mining (AusDM), Sydney, Australia, 2006.

[274] W. Wong, W. Liu, and M. Bennamoun. Determining termhood for learning domain ontologies using domain prevalence and tendency. In Proceedings of the 6th Australasian Conference on Data Mining (AusDM), Gold Coast, Australia, 2007.

[275] W. Wong, W. Liu, and M. Bennamoun. Determining the unithood of word sequences using mutual information and independence measure. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), Melbourne, Australia, 2007.

[276] W. Wong, W. Liu, and M. Bennamoun. Tree-traversing ant algorithm for term clustering based on featureless similarities. Data Mining and Knowledge Discovery, 15(3):349–381, 2007.

[277] W. Wong, W. Liu, and M. Bennamoun. Constructing web corpora through topical web partitioning for term recognition. In Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand, 2008.

[278] W. Wong, W. Liu, and M. Bennamoun. Determination of unithood and termhood for term recognition. In M. Song and Y. Wu, editors, Handbook of Research on Text and Web Mining Technologies. IGI Global, 2008.

[279] W. Wong, W. Liu, and M. Bennamoun. A probabilistic framework for automatic term recognition. Intelligent Data Analysis, 13(4):499–539, 2009.

[280] Y. Yang and J. Calmet. OntoBayes: An ontology-driven uncertainty model. In Proceedings of the International Conference on Intelligent Agents, Web Technologies and Internet Commerce (IAWTIC), Vienna, Austria, 2005.

[281] R. Yangarber, R. Grishman, and P. Tapanainen. Automatic acquisition of domain knowledge for information extraction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), Saarbrucken, Germany, 2000.

[282] Z. Yao and B. Choi. Bidirectional hierarchical clustering for web mining. In Proceedings of the IEEE/WIC International Conference on Web Intelligence, 2003.

[283] J. Zelle and R. Mooney. Learning semantic grammars with constructive inductive logic programming. In Proceedings of the 11th National Conference of the American Association for Artificial Intelligence (AAAI), 1993.

[284] G. Zipf. The Psycho-Biology of Language. Houghton Mifflin, Boston, MA, 1935.

[285] G. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.