

Masters Computing Minor Thesis

Concept based Tree Structure Representation for

Paraphrased Plagiarism Detection

By

Kiet Nim

[email protected]

A thesis submitted for the degree of

Master of Science (Computer and Information Science)

School of Computer and Information Science

University of South Australia

November 2012

Supervisor

Dr. Jixue Liu

Associate Supervisor

Dr. Jiuyong Li


Declaration

I declare that this thesis presents original work conducted by myself and does not incorporate, without due reference, any material previously submitted for a degree in any university. To the best of my knowledge, the thesis contains no material previously published or written by another person except where due acknowledgement is made in the text.

Kiet Nim

November 2012


Acknowledgements

I would like to express my sincere gratitude to my supervisors, Dr. Jixue Liu and Dr. Jiuyong Li, professors and researchers at the University of South Australia, for their dedicated support, professional advice, feedback and encouragement throughout the study. In addition, I would like to thank all of my course coordinators for their dedicated and in-depth teaching. Finally, I would like to thank my family for their constant encouragement and full support throughout my study in Australia.


Abstract

In the era of the World Wide Web, searching for information can be performed easily with the support of numerous search engines and online databases. However, this also makes the task of protecting intellectual property from information abuse more difficult. Plagiarism is one such dishonest behavior. Most existing plagiarism detection (PD) systems can efficiently detect literal plagiarism, where an exact copy is made or only minor changes are applied. In cases where plagiarists use intelligent methods to hide their intentions, these systems usually fail to detect plagiarized documents.

The concept-based tree structure representation is a potential solution for detecting paraphrased plagiarism, one of the intelligent plagiarism tactics. By exploiting WordNet as background knowledge, a concept-based feature can be generated. This additional feature, in combination with the traditional term-based feature and the term-based tree structure, can enhance document representation. In particular, the modified model not only captures syntactic information as the term-based model does but also discovers hidden semantic information in a document. Consequently, semantically similar documents can be detected and retrieved.

The contributions of the modified structure are twofold. Firstly, a real-time prototype for high-level plagiarism detection is proposed in this study. Secondly, the additional concept-based feature provides considerable improvements for Document Clustering, in that more semantically related documents can be grouped into the same clusters even when they are expressed in different ways. Consequently, Document Retrieval can retrieve more relevant documents on the same topics.


Table of Contents

Declaration .......................................................................................................................... ii

Acknowledgements ............................................................................................................ iii

Abstract .............................................................................................................................. iv

List of Figures ................................................................................................................... vii

List of Tables ................................................................................................................... viii

Chapter 1 – Introduction ..................................................................................................... 1

1.1. Background .......................................................................................................... 1

1.2. Motivations........................................................................................................... 1

1.3. Fields of Thesis .................................................................................................... 2

1.4. Research Question ................................................................................................ 2

1.5. Contributions ........................................................................................................ 4

Chapter 2 – Literature Review ............................................................................................ 5

2.1. Plagiarism Taxonomy .......................................................................................... 5

2.2. Document Representation .................................................................................... 6

2.2.1. Flat Feature Representation .......................................................................... 6

2.2.2. Structural Representation .............................................................................. 9

2.3. Plagiarism Detection Techniques ....................................................................... 11

2.4. Limitations ......................................................................................................... 12

Chapter 3 – Methodology ................................................................................................. 14

3.1. Document Representation and Indexing ............................................................ 14

3.1.1. Term based Vocabulary Construction ........................................................ 14

3.1.2. Concept based Vocabulary Construction .................................................... 15

3.1.3. Document Representation ........................................................................... 16

3.1.4. Document Indexing ..................................................................................... 18

3.2. Source Detection and Retrieval .......................................................................... 19

3.3. Detail Plagiarism Analysis ................................................................................. 20

3.3.1. Paragraph Level Plagiarism Analysis ......................................................... 20

3.3.2. Sentence Level Plagiarism Analysis ........................................................... 21

Chapter 4 – Experiments ................................................................................................... 22

4.1. Experiment Initialization .................................................................................... 22


4.1.1. The Dataset and Workstation Configuration .............................................. 22

4.1.2. Performance Measures and Parameter Configuration ................................ 23

4.2. Source Detection and Retrieval for Literal Plagiarism ...................................... 23

4.3. Source Detection and Retrieval for Paraphrased Plagiarism ............................. 25

4.4. Study of Parameters ........................................................................................... 27

4.4.1. Size of Term based Vocabulary ............................................................ 27

4.4.2. Size of Concept based Vocabulary ....................................................... 28

4.4.3. Dimensions of Term based PCA feature .................................................... 29

4.4.4. Dimensions of Concept based PCA feature ................................................ 30

4.4.5. Contribution of the Weights and ...................................................... 31

Chapter 5 – Conclusion ..................................................................................................... 33

5.1. Concluding Remarks .......................................................................................... 33

5.2. Future Works ...................................................................................................... 34

References ......................................................................................................................... 35

Appendix A – Source code of the Modified Porter Algorithm ......................................... 38

Appendix B – Output Example of a Term based Vocabulary .......................................... 40

Appendix C – Output Example of a Concept based Vocabulary ...................................... 43


List of Figures

Figure 1 - Taxonomy of Plagiarism (Alzahrani et al. 2012) ........................................................... 7

Figure 2 - Term-Document Matrix (Marksberry 2011) ................................................................... 8

Figure 3 - Singular Value Decomposition of term-document matrix A (Letsche et al. 1997) ........ 9

Figure 4 - 3 layer document-paragraphs-sentences tree representation (Zhang et al. 2011) ......... 11

Figure 5 - Comparison of the original & modified Porter Stemmers ........................................... 14

Figure 6 - Data structure of Term-based Vocabulary .................................................................... 15

Figure 7 - Example of looking for synonyms, hypernyms and hyponyms .................................... 16

Figure 8 - Data structure for the concept-based Vocabulary ......................................................... 16

Figure 9 - Concept based Document Tree Representation ............................................................ 18

Figure 10 - 2 level SOMs for document-paragraph-sentence document tree (Chow et al. 2009) . 19

Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism ........................ 25

Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism ............... 26

Figure 13 - Performance based on different sizes of Term based Vocabulary .............................. 28

Figure 14 - Performance based on different sizes of Concept based Vocabulary ......................... 29

Figure 15 - Performance based on different dimensions of Term based PCA feature .................. 30

Figure 16 - Performance based on different dimensions of Concept based PCA feature ............. 31


List of Tables

Table 1 - Configuration of Parameters for Literal Plagiarism .......................................... 23

Table 2 - Source Detection & Retrieval for Literal Plagiarism ........................................ 24

Table 3 - Configuration of Parameters for Paraphrased Plagiarism ................................. 25

Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism ............................... 26

Table 5 - Performance based on different sizes of Term-based Vocabulary .................... 27

Table 6 - Performance based on different sizes of Concept based Vocabulary ................ 28

Table 7 - Performance based on different dimensions of Term based PCA feature ......... 30

Table 8 - Performance based on different dimensions of Concept based PCA feature .... 31

Table 9 - Performance based on different values of and ...................................... 32


Chapter 1 – Introduction

1.1. Background

In the era of the World Wide Web, more and more documents are being digitized and made available for remote access. Searching for information has become even easier with the support of a variety of search engines and online databases. However, these advantages also make the task of protecting intellectual property from information abuse more difficult. One such dishonest behavior is plagiarism. It is clear that plagiarism has caused significant damage to intellectual property. Most cases have been detected in academic work such as student assignments and research. Lukashenko et al. [1] define plagiarism as the activity of "turning of someone else's work as your own without reference to original source".

Several systems and algorithms have been developed to tackle this problem. However, most of them can only detect word-by-word plagiarism, also referred to as literal plagiarism. These are cases in which plagiarists make an exact copy of, or only minor changes to, original sources. But in cases where significant changes are made, most of these "flat feature" based methods fail to detect plagiarized documents [2]. This type is referred to as intellectual or intelligent plagiarism and includes text manipulation, translation and idea adoption.

In this study, my focus is to improve an existing structural model and conduct several experiments to test the detection of one tactic of text manipulation: paraphrasing.

1.2. Motivations

Paraphrasing is a strategy of intellectual plagiarism used to bypass systems that detect only exact copies or plagiarized documents with minor modifications. For instance, one popular and widely used academic plagiarism detection system is Turnitin. Turnitin can detect word-by-word copying efficiently down to the sentence level. However, by simply paraphrasing detected terms using their synonyms, hyponyms and hypernyms or similar phrases, it can be bypassed easily. Paraphrasing is just one of many existing intelligent tactics, and it is clear that plagiarism has become more and more sophisticated [3]. Therefore, it is urgent to have more powerful mechanisms to protect intellectual property from high-level plagiarism.

Different plagiarism detection (PD) systems represent documents by different non-structural or structural schemes. Non-structural, or flat-feature-based, representations are the earliest mechanisms for document representation. Systems such as COPS [4] and SCAM [5] are typical applications of these schemes: documents are first broken into small chunks of words or sentences, and these chunks are then hashed and registered against a hash table to perform document retrieval (DR) and PD. The systems in [6-8] use character or word n-grams as the units for similarity detection. All of these flat-feature-based systems have in common that they ignore the contextual information of words and terms in a document. Structural representations have therefore been developed recently to overcome this limitation. Among these schemes, two promising candidates that can capture rich textual information are graph [9, 10] and tree structure representations [2, 11-13]. Applications of structural representation have shown significant improvements in the tasks of DR, document clustering (DC) and PD. However, the majority of both non-structural and structural representation schemes are still based on word histograms or features derived from word histograms. They can be used to effectively detect literal plagiarism but are not strong enough to perform intelligent plagiarism detection.

In this research, I focus on analyzing the tree structure representation and the studies of Chow et al. [2, 11-13]. In their works, a document is hierarchically organized into layers. In this way, the tree can capture not only syntactic but also semantic information of a document. While the root represents global information, or the main topics of a document, the other layers capture local information, or sub-topics of the main topics, and the leaf level can be used to perform detailed comparison. Their proposed models have significantly improved the accuracy of DC, DR and PD. However, the features used to represent each layer are still derived from the term-based Vocabulary and, hence, the systems show some limitations when performing intelligent plagiarism detection.

Therefore, this study provides an extension to the term-based tree structure representation, in particular to the features used to represent each layer, in order to detect one specific type of high-level plagiarism: paraphrasing. The modified representation is referred to as the Concept based Tree Structure Representation.

1.3. Fields of Thesis

Document Representation; Information Retrieval; Plagiarism Detection; Text Mining.

1.4. Research Question

As outlined in section 1.2, most existing PD systems implement only flat-feature-based representation for DC, DR and PD. Even though there have been some recent applications of structural representation, the features used in those schemes are still derivatives of word histograms, which ignore semantic similarity between words or terms. Consequently, semantically similar documents might be considered unrelated. Secondly, plagiarism has evolved and become more sophisticated, with multiple forms including text manipulation, translation and idea adoption. These tactics can easily bypass systems based only on flat features. Even though structural-feature-based systems have proved more effective than flat-feature-based systems, they are still vulnerable to such devious techniques. Therefore, it is urgent to develop either new or additional features to improve current structural-feature-based systems in order to protect intellectual property from being abused by high-level plagiarism.

This thesis presents, in detail, the study and development of a new mechanism to detect one particular tactic of sophisticated plagiarism: paraphrasing. The aim of the research is to extend the structural model studied by Chow et al. in [2, 12, 13]. The original tree structure representation, based solely on word histograms, is enhanced with an additional feature that captures multi-dimensional semantic information, referred to as the Concept based feature. The ultimate aim of the study is to answer the following question: "Is the Concept based feature, in combination with the tree structure representation and the Term based feature, capable of discovering plagiarism by paraphrasing and, potentially, higher level plagiarism?"

In addition to the main research question, there are also multiple sub-questions that need to be addressed, including:

What tree model is used to represent a document?

How are the two types of features constructed for each layer?

Why are document organization and indexing necessary?

What is the scheme for candidate detection and retrieval?

What is the scheme for detailed plagiarism analysis?

To answer the main research question, the experiments carried out focus on examining how the concept-based feature, in combination with the term-based tree structure representation, contributes to the tasks of document organization, document retrieval and paraphrased plagiarism detection. However, the experimental model can only be built once all research sub-questions are answered, and they are addressed in detail in the Methodology chapter.

As a brief methodology overview: in the modified structural representation, each node of the tree is represented by two derived vectors, of terms and of concepts. To overcome the "Curse of Dimensionality" caused by the lengths of these vectors, Principal Component Analysis (PCA) [14], a well-known technique for dimensionality reduction, is applied. For the number of tree layers, I choose the 3-layer document-paragraph-sentence model to represent a document. Document organization is also taken into consideration by applying the Self-Organizing Map (SOM) clustering technique [15]. In document retrieval, only documents in the same areas are compared, since comparing documents on different topics is regarded as serving no purpose [16]; for example, a CIS paper is compared against a collection of CIS papers rather than against biology papers. To generate the concept-based feature, I use external background knowledge, WordNet, to first generate the concept-based Vocabulary. The concept-based feature is then derived from this Vocabulary and used together with the term-based feature to represent a document.
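To make the PCA compression step concrete, the following is a minimal NumPy sketch of projecting document vectors onto a few principal components; the toy matrix and the target dimensionality are illustrative only, not the configuration used in the experiments.

```python
import numpy as np

def pca_compress(X, k):
    """Project row vectors of X (documents x vocabulary terms)
    onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)              # center each feature
    # principal directions via SVD of the centered data matrix
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T                 # documents x k

# toy data: 4 documents over a 6-term vocabulary, compressed to 2 dimensions
X = np.array([[3., 0., 1., 0., 2., 0.],
              [2., 0., 0., 1., 3., 0.],
              [0., 4., 0., 2., 0., 1.],
              [0., 3., 1., 2., 0., 2.]])
Z = pca_compress(X, 2)
print(Z.shape)        # (4, 2)
```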

1.5. Contributions

In this thesis, the C-PCA-SOM 3-stage prototype for high-level Plagiarism Detection is introduced. The three stages are: Stage 1 - Document Representation & Indexing; Stage 2 - Source Detection & Retrieval; and Stage 3 - Detail Plagiarism Analysis. In addition, because constant processing time was achieved in the experiments, the prototype can support real-time applications for Document Representation, Document Clustering, Document Retrieval and, potentially, Paraphrased Plagiarism Detection.

The experiments verify that the additional Concept-based feature improves the performance of Source Detection and Retrieval compared with models based solely on the Term-based feature. Furthermore, the enhanced tree structure representation not only captures syntactic information as the original scheme does but also discovers hidden semantic information in a document. By capturing multi-dimensional information, Document Clustering is improved in that more semantically related documents can be grouped into meaningful clusters even though they are expressed differently. As a result, Document Retrieval also benefits, since more documents on the same topics can be detected and retrieved.


Chapter 2 – Literature Review

This chapter provides a comprehensive overview of the literature on different types of plagiarism in section 2.1. In section 2.2, a variety of document representation schemes are discussed, including non-structural (flat-feature-based) and structural representations. Existing plagiarism detection techniques are outlined in section 2.3. Finally, the limitations of these PD techniques and representation schemes are discussed in section 2.4, motivating potential improvements and further studies.

2.1. Plagiarism Taxonomy

When humans began producing written works as part of intellectual documentation, plagiarism also came into existence. Documentation and plagiarism exist in parallel, but they are entirely different sides of the same coin. While one contributes to the knowledge of human society, the other causes serious damage to intellectual property. Recognizing this, the ethical community has developed many techniques to fight plagiarism. However, the battle against this phenomenon is a lifelong one, since plagiarism has also evolved and become more sophisticated. Therefore, to engage such a devious enemy efficiently, it is necessary to have a mapping scheme that identifies and classifies different types of plagiarism into meaningful categories. Many studies have been conducted on this task [1, 3, 17]. Lukashenko et al. [1] point out different types of plagiarism activities, including:

Copy-paste plagiarism (word-for-word copying).

Paraphrasing (using synonyms/phrases to express the same content).

Translated plagiarism (copying content expressed in other languages).

Artistic plagiarism (expressing plagiarized works in different formats, such as images or text).

Idea plagiarism (extracting and using others' ideas).

Code plagiarism (copying others' program code).

No proper use of quotation marks.

Misinformation of references.

More precisely, Alzahrani et al. [3] use a taxonomy that classifies plagiarism into two main categories: literal and intelligent plagiarism (Fig. 1). In the former, plagiarists make an exact copy of, or only a few changes to, original sources; thus, this type of plagiarism can be detected easily. The latter case is much more difficult to detect, because plagiarists try to hide their intentions by using many intelligent ways to change original sources. These tactics include text manipulation, translation and idea adoption. In text manipulation, plagiarists try to change the appearance of the text while keeping its semantic meaning or idea. Paraphrasing is a commonly performed tactic of text manipulation: it transforms text appearance by using synonyms, hyponyms, hypernyms or equivalent phrases. In this research, my main focus is to detect this type of intelligent plagiarism. Plagiarism based on translation is also known as cross-lingual plagiarism; offenders can use translation software to copy text written in other languages and thus bypass monolingual systems. Finally, Alzahrani et al. consider idea adoption to be the most serious and dangerous type of plagiarism, since stealing ideas from others' works without proper referencing is the most disrespectful action toward the authors and their intellectual property. This type of plagiarism is also the hardest to detect, because the plagiarized text might not carry any syntactic information similar to the original sources, and because the plagiarized ideas can be extracted from multiple parts of the original documents.

2.2. Document Representation

Since a vast number of documents are available online, with many more uploaded every day, the demand for efficiently organizing and indexing them for fast retrieval continually poses challenges for the research community. Many schemes have been developed and improved to represent documents more effectively. Instead of using a whole document as a query, these representation schemes can be applied to perform many text processing tasks such as classification, clustering, document retrieval and plagiarism detection. This section discusses two main strategies of document representation as well as available plagiarism detection techniques.

2.2.1. Flat Feature Representation

One of the most popular and widely used models is the Vector Space Model (VSM) [18]. In this model, a weighted vector of term frequency and document frequency is calculated based on a pre-constructed Vocabulary: a list of the most frequent words/terms derived from a given training corpus. The scheme used to perform term weighting is TF-IDF. Term frequency (TF) counts the number of occurrences of each term in a specific document, while inverse document frequency (IDF) down-weights terms according to the number of documents that contain them. In the VSM model, a vector of word histograms is constructed for each document, and all the vectors together form the term-document matrix (Fig. 2) [19]. The similarity between two documents is calculated by applying the Cosine distance function to their vectors [20]. One drawback of the VSM model is that the vectors used to represent documents are usually lengthy due to the size of the Vocabulary, and hence not scalable to large datasets.


Figure 1 - Taxonomy of Plagiarism (Alzahrani et al. 2012)


Figure 2 - Term-Document Matrix (Marksberry 2011)
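As a concrete illustration of the TF-IDF weighting and Cosine comparison described above, the following is a minimal Python sketch; the documents and vocabulary are toy examples, and a real system would use a far larger Vocabulary with stemming and stop-word removal.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Build TF-IDF weighted vectors over a fixed vocabulary."""
    n = len(docs)
    counts = [Counter(d.lower().split()) for d in docs]
    # document frequency: number of documents containing each term
    df = {t: sum(1 for c in counts if c[t] > 0) for t in vocab}
    return [[c[t] * math.log(n / df[t]) if df[t] else 0.0 for t in vocab]
            for c in counts]

def cosine(u, v):
    """Cosine similarity between two weighted vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stock markets fell sharply"]
vocab = ["cat", "dog", "sat", "mat", "log", "stock", "markets"]
vecs = tfidf_vectors(docs, vocab)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

The first two sentences share the term "sat" and therefore score higher than the unrelated third sentence, which shares no vocabulary terms at all.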

To overcome the Curse of Dimensionality in VSM, Latent Semantic Indexing (LSI) [21] was proposed to project lengthy vectors onto a lower number of dimensions while preserving semantic information. This is done by mapping the space spanned by those lengthy vectors to a lower-dimensional subspace, based on the Singular Value Decomposition (SVD) of the VSM term-document matrix (Fig. 3) [22]. Another approach to dimensionality reduction and feature compression is the Self-Organizing Map (SOM) [15]. In a SOM, similar documents are organized close to each other; instead of being represented by a word histogram vector, each document is represented by its winner neuron, or Best Matching Unit, on the map. Applications of SOM such as VSM-SOM [23], WEBSOM [24], LSISOM [25] and those in [23, 26] have shown considerable speed-ups in document clustering and retrieval. SOM can be combined not only with flat feature representation but also with the structural representations discussed in the next section.


Figure 3 - Singular Value Decomposition of term-document matrix A (Letsche et al. 1997)
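The LSI projection just described can be sketched as a truncated SVD of the term-document matrix; this minimal NumPy example uses a hypothetical 5-term, 4-document matrix, not data from the experiments.

```python
import numpy as np

def lsi_project(A, k):
    """LSI sketch: truncate the SVD of the term-document matrix A
    (terms x documents) to rank k; each document becomes a
    k-dimensional vector in the latent semantic space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # scale the truncated document coordinates by the singular values
    return (np.diag(s[:k]) @ Vt[:k, :]).T     # documents x k

# toy term-document matrix: 5 terms x 4 documents (counts are hypothetical)
A = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 1., 0.],
              [0., 0., 0., 2.],
              [0., 0., 1., 1.]])
docs_2d = lsi_project(A, 2)
print(docs_2d.shape)     # (4, 2)
```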

Considering that relying only on a "bag of words" might not be enough, many further studies propose adding features alongside term-based flat features to enhance document representation. In [27], Xue et al. propose using distributional features in combination with the traditional term frequency to improve text categorization. The proposed features include the compactness of the appearances of a word and the position of a word's first appearance. Based on these features, they assign a specific weight to each word: since authors are likely to mention the main content in the earlier parts of a document, words appearing in those parts are considered more important and assigned higher weights. Another approach to "enriching" document representation is to utilize external background knowledge such as WordNet, Wikipedia or thesaurus dictionaries. In [28], Hu et al. use Wikipedia to derive two additional features, concept-based and category-based, from the conventional term-based feature. Their experiments show significant improvements in document clustering. Similar applications of external background knowledge can be found in [28-33]. The study presented in this thesis applies WordNet instead of Wikipedia to generate the concept-based feature and uses it to enhance document representation.
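The concept-mapping idea can be sketched as follows. A tiny hand-made synonym table stands in for WordNet here (the actual system queries WordNet synsets for synonyms, hypernyms and hyponyms), so the table and the example sentences are purely illustrative.

```python
from collections import Counter

# Toy stand-in for WordNet: maps each term to a concept identifier.
# In the thesis, these groupings come from WordNet synsets;
# this hand-made table is illustrative only.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle",
    "buy": "acquire", "purchase": "acquire",
    "big": "large", "huge": "large",
}

def concept_histogram(text):
    """Map each term to its concept and count concept occurrences;
    terms without a known concept map to themselves."""
    hist = Counter()
    for token in text.lower().split():
        hist[CONCEPTS.get(token, token)] += 1
    return hist

a = concept_histogram("i buy a big car")
b = concept_histogram("i purchase a huge automobile")
print(a == b)    # True: the paraphrases share an identical concept histogram
```

Although the two sentences share almost no terms, their concept histograms coincide, which is precisely why the concept-based feature can expose paraphrased text that a term histogram would miss.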

2.2.2. Structural Representation

By using only word histogram vectors to represent documents, flat feature representation ignores the contextual usage and relationships of terms throughout a document [2], which leads to a loss of semantic information. In addition, two documents might be contextually different even though they have the same term distribution. Recognizing this serious limitation, many studies have tried to develop new ways of representing a document that capture not only syntactic but also semantic information. These new schemes are referred to collectively as structural representations.


To capture semantic information, Schenker et al. [9] propose using a directed graph model to represent documents. The graph structure consists of two components: nodes and edges. Nodes (vertices) are the terms appearing in a document, weighted by their number of appearances. Edges link nodes together and indicate relationships between terms; an edge is only formed between two terms that appear immediately next to each other in a sentence. Chow et al. [10] also study the directed graph and further develop another variant, the undirected graph. Their directed model likewise considers the order of term occurrence in a sentence, while the undirected model considers the connections between terms without taking their usage order into account. They further perform Principal Component Analysis (PCA) for dimensionality reduction and SOM for document organization. Their experiments show significant improvements compared with other single-feature-based approaches.
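A directed term graph of the kind described above can be sketched in a few lines; the sentences are toy input, and a real system would apply stemming and stop-word removal before building the graph.

```python
from collections import defaultdict

def build_term_graph(sentences):
    """Directed graph in the style described above: nodes are terms
    weighted by frequency; an edge links each term to the term that
    immediately follows it within a sentence."""
    nodes = defaultdict(int)
    edges = defaultdict(int)
    for sent in sentences:
        terms = sent.lower().split()
        for t in terms:
            nodes[t] += 1
        for a, b in zip(terms, terms[1:]):   # adjacent pairs only
            edges[(a, b)] += 1
    return dict(nodes), dict(edges)

nodes, edges = build_term_graph(["data mining finds patterns",
                                 "text mining finds topics"])
print(nodes["mining"], edges[("mining", "finds")])   # 2 2
```

Because the edges are directed, ("mining", "finds") and ("finds", "mining") are distinct, which is exactly how the directed model preserves term order.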

Another group of models that can capture both syntactic and semantic information is the group of tree-based representation models. The earliest study of the tree structure representation was conducted by Si et al. [16]. They observe that it is unnecessary to compare documents addressing different subjects. Their model organizes a document according to its structure and hence forms a tree: a document may contain many sections, a section may contain many subsections, a subsection may again have many sub-subsections, and so on. This mechanism significantly improves the effectiveness of document comparison, since lower-level comparisons can be terminated at any level of the tree once the Cosine similarity measure exceeds a user-defined value. However, lengthy vectors at each level and a potentially high number of layers make their model unscalable for large corpora. The most recent works of Chow et al. [2, 11-13] use a fixed number of layers (2 or 3), reduced term-based Vocabularies and PCA compression to make the tree structure representation applicable to large datasets. To minimize the time complexity of document retrieval, they further apply SOM to organize documents according to their similarities [2, 11]. Fig. 4 shows the 3-layer document-paragraph-sentence tree representation studied in [12], which is the model I focus on improving in this study. Other choices of layers and representation units can be found in [2, 11, 13].


Figure 4 - 3 layer document-paragraphs-sentences tree representation (Zhang et al. 2011)

2.3. Plagiarism Detection Techniques

According to Lukashenko et al. [1], the task of fighting plagiarism is classified into plagiarism prevention and plagiarism detection. The main difference between the two classes is that detection requires less time to implement but can only achieve a short-term positive effect. On the other hand, although prevention methods are time-consuming to develop and deploy, they have a long-term effect and are hence considered the more significant approach to effectively fighting plagiarism. Prevention, unfortunately, is a global issue and cannot be solved by just one institution. Therefore, most existing techniques fall into the detection category, and much research has been conducted to develop more powerful plagiarism detection techniques.

Alzahrani et al. [3] categorize plagiarism detection techniques into two broad trends: intrinsic and extrinsic plagiarism detection. Intrinsic PD techniques analyze a suspicious document locally, i.e. without collecting a set of candidate documents for comparison [34, 35]. These approaches employ analyses of authors' writing styles and rest on the hypothesis that each writer has a unique style; a change in writing style therefore signals a potential plagiarism case. Features used for this type of PD are stylometric features based on text statistics, syntactic features, POS features, closed-class word sets and structural features. On the other hand, extrinsic PD techniques compare query documents against a set of source documents. Most existing PD systems deploy these extrinsic techniques.

There are several common steps in performing extrinsic PD. Firstly, Document Indexing is applied to store all registered documents into databases for later retrieval. Secondly, Document Retrieval is performed to retrieve the most relevant candidates that might be plagiarized by given query documents. Eventually, Exhaustive Analysis is carried out between candidates and query documents to locate the plagiarized parts.

For extrinsic PD techniques, the majority of exhaustive analysis methods partition all documents into blocks (n-grams or chunks) [4, 36-38]. Units in each block can be characters, words, sentences, paragraphs, etc. These blocks are then hashed and registered in a hash table. To perform PD, suspicious documents are also divided into small blocks and looked up in the hash table. Eventually, similar blocks are retrieved for detailed comparison. COPS [4] and SCAM [5] are two typical implementations of this approach. According to [2, 16], these methods are inapplicable to large corpora because the number of documents keeps increasing over time. Furthermore, they can be bypassed easily by making small changes at the sentence level.

It is noticed that the methods mentioned above apply flat features only and ignore the contextual information of how words/terms are used throughout a document. Two documents with the same term distribution might be contextually different. To tackle this problem, PD systems that utilize structural representation have been proposed [2, 12, 16]. These approaches significantly improve the performance of extrinsic plagiarism detection. Since documents are hierarchically organized into multiple levels, the comparison between query and candidate documents can be terminated at any level where the amount of dissimilarity exceeds a user-defined threshold. Experiments on these structure-based models have shown better performance compared with flat-feature-based systems.

2.4. Limitations

Most existing PD systems are implemented based on flat feature representation. As mentioned in 2.1, they cannot capture the contextual usage of words/terms throughout a document and can be bypassed easily with minor modifications of the original sources. Structural-representation-based PD systems have made significant improvements by capturing this rich textual information. By organizing documents hierarchically, structural models can capture not only syntactic but also semantic information of a document. Recent studies have shown important contributions of structural representation to document organization tasks such as classification and clustering [2, 11, 13]. Consequently, the task of plagiarism detection has also been improved in terms of reduced time complexity and higher detection accuracy: the most relevant documents are first retrieved to narrow the processing scope, and further comparisons are terminated at levels where the representations are sufficiently dissimilar.

Although it has been proved that structural representation can be applied to detect literal plagiarism efficiently, structural-representation-based PD systems still show limitations in detecting intelligent plagiarism. For example, plagiarists can paraphrase and replace detectable words/terms with their synonyms, hyponyms or hypernyms to bypass these systems. The problem arises from the term-based Vocabulary, in which terms with similar meanings are treated as unrelated. For instance, large, huge and enormous carry similar meanings and are exchangeable in usage; in this type of Vocabulary, however, they are considered different terms. Therefore, any feature derived from this Vocabulary is not strong enough to detect sophisticated plagiarism. By replacing words/terms of an original sentence with semantically similar words/terms, a plagiarized sentence will be treated as an unrelated sentence.

In order to discover similar sentences even when they are expressed in different ways, my research exploits the external background knowledge source WordNet to construct one more type of Vocabulary, called the Concept-based Vocabulary. The additional Vocabulary is built by grouping words with similar meanings in the Term-based Vocabulary into single concepts. After that, this Vocabulary is utilized to generate one more feature, the Concept-based feature, to enrich the representation of a document.


Chapter 3 – Methodology

This section outlines the main techniques applied to develop the prototype for paraphrased plagiarism detection. It can be referred to as the 3-stage prototype, comprising: Stage 1 – Document Representation & Indexing, Stage 2 – Source Detection & Retrieval and Stage 3 – Detail Plagiarism Analysis. Stage 1 is discussed in Section 3.1, consisting of the construction of the two types of Vocabulary, the extraction of the corresponding two types of feature to represent a document and, subsequently, the application of SOM to organize documents into meaningful clusters. Section 3.2 gives the details of Stage 2 on how the data stored in Stage 1 is used to perform fast original source identification and retrieval. Finally, Stage 3 of the prototype performs detailed plagiarism analysis based on the candidate documents retrieved in Stage 2. The mechanism of this detailed analysis is outlined in Section 3.3.

3.1. Document Representation and Indexing

3.1.1. Term based Vocabulary Construction

The construction of the term-based Vocabulary is straightforward. Firstly, term extraction is carried out on a training corpus. After that, Word Stemming is applied to transform terms to their simple forms: for example, words such as "computes", "computing" and "computed" are all reduced to "compute". Because the original Porter stemming algorithm only creates "stems" instead of words in their simple forms, making it impossible to look them up in an English dictionary or thesaurus, I have modified the Porter algorithm (the source code, written in Perl, is provided in Appendix A). The modified version tries to stop at the stage where words are in or near their simple forms. As a result, it is possible to search for these words' synonyms, hypernyms and hyponyms via, for example, a thesaurus. Fig. 5 depicts the difference between the original and modified Porter stemmers.

Figure 5 - Comparison of the original & modified Porter Stemmers
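The difference matters downstream: Porter's stem "comput" cannot be looked up in a thesaurus, whereas "compute" can. The toy Python sketch below illustrates the intended behaviour of stopping at a dictionary-searchable simple form; it is far cruder than the actual modified Porter algorithm of Appendix A and handles only a few suffixes:

```python
def to_simple_form(word: str) -> str:
    """Toy suffix stripper: reduce a few inflected forms to a simple,
    dictionary-searchable form (illustration only, not full Porter)."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            stem = word[: -len(suffix)]
            # Restore a dropped final 'e' ("computing" -> "compute").
            return stem if stem.endswith("e") else stem + "e"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # "computes" -> "compute"
    return word
```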


After stemming, Stop Word Removal is performed to remove insignificant words such as "a", "the", "are", etc. Finally, we use the TF-IDF (Term Frequency – Inverse Document Frequency) weighting scheme to weight the significance of each word throughout the corpus. The weights of all terms are then ranked from highest to lowest (most to least significant). In a similar way to Chow et al. [2, 12], the first N1 highest-ranked terms are selected to form the Vocabulary V1 used for the Document and Paragraph levels of the tree structure representation, and the first N2 terms are selected to form the Vocabulary V2 used for the Sentence level. In addition, N1 is much larger than N2. The data structure of the two term-based Vocabularies is denoted in Fig. 6.

Figure 6 - Data structure of Term-based Vocabulary

The data structure is simply an array of terms. Each item contains two values: the string of a term and its corresponding TF-IDF weight, used for sorting. An output example of this type of Vocabulary produced by the implemented program is provided in Appendix B.
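The weighting and ranking step can be sketched as follows. This is a minimal stdlib-only illustration using the common log-IDF variant; the thesis implementation may use a different exact TF-IDF formula:

```python
import math
from collections import Counter

def build_term_vocabulary(corpus_docs, top_n):
    """Rank stemmed, stopword-free terms by total TF-IDF over the corpus
    and keep the top_n most significant ones."""
    n_docs = len(corpus_docs)
    df = Counter()                      # document frequency of each term
    for doc in corpus_docs:
        df.update(set(doc))
    scores = Counter()
    for doc in corpus_docs:
        tf = Counter(doc)
        for term, freq in tf.items():
            idf = math.log(n_docs / df[term])
            scores[term] += freq * idf  # accumulate TF-IDF over the corpus
    # Sort from most to least significant, as in the Vocabulary construction.
    return [term for term, _ in scores.most_common(top_n)]

docs = [["plagiarism", "detect", "tree"], ["tree", "structure"], ["plagiarism", "concept"]]
vocab = build_term_vocabulary(docs, top_n=3)
```

Terms occurring in every document receive a low IDF and sink in the ranking, which is the intended effect of the scheme.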

3.1.2. Concept based Vocabulary Construction

In order to construct the additional concept-based Vocabulary, a source of background knowledge – WordNet, the lexical database for the English language [39] – is exploited. WordNet, whose development was begun by Miller in 1985, has been applied in many text-processing tasks such as document clustering [31], document retrieval [30, 32] and word-sense disambiguation. In WordNet, nouns, verbs, adjectives and adverbs are distinguished and organized into meaningful sets of synonyms. In this section, the mechanism for utilizing WordNet to construct the concept-based Vocabulary is discussed in detail.

Firstly, for each term T in the term-based Vocabulary, its synonyms, hypernyms and hyponyms are extracted from the WordNet database by following the synonym-hypernym-hyponym relationships of the ontology. The result of this step is a "bag" of terms similar to T. For example, Fig. 7 illustrates the result of finding synonyms, hypernyms and hyponyms for the word "absent". After that, it is essential to check these terms' appearances in the term-based Vocabulary. Any term that does not appear in the Term-based Vocabulary is removed, and the remaining terms together form one concept. Clearly, a term that does not appear in the Term-based Vocabulary does not appear in the corpus either and hence must be removed. This procedure of constructing concepts is repeated over the whole Term-based Vocabulary to obtain the Concept-based Vocabulary. Fig. 8 denotes the data structure of the additional Vocabulary.


Figure 7 - Example of looking for synonyms, hypernyms and hyponyms

Figure 8 - Data structure for the concept-based Vocabulary

The data structure is an array of pointers. Each pointer can be considered one concept, and it points to the list of actual words/terms that make up that concept. An example output of a Concept-based Vocabulary is provided in Appendix C.
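The grouping procedure can be sketched as follows. Here `related_terms` stands in for the WordNet synonym/hypernym/hyponym query (in the actual system it would be backed by the WordNet database); the stub thesaurus below is purely hypothetical:

```python
def build_concepts(term_vocabulary, related_terms):
    """For each term, fetch its WordNet-style relatives, drop any that are
    not in the term-based Vocabulary, and keep the rest as one concept."""
    vocab = set(term_vocabulary)
    concepts = []
    for term in term_vocabulary:
        bag = {term} | set(related_terms(term))
        concept = sorted(bag & vocab)   # terms absent from the corpus are removed
        concepts.append(concept)
    return concepts

# Hypothetical stand-in for the WordNet lookup.
toy_thesaurus = {"large": ["huge", "enormous", "big"], "huge": ["large", "enormous"]}

def lookup(term):
    return toy_thesaurus.get(term, [])

concepts = build_concepts(["large", "huge", "tree"], lookup)
```

In a production pass, duplicate concepts produced by symmetric relations (as for "large" and "huge" here) would typically be merged.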

Similar to the construction of Vocabularies V1 and V2, the first M1 concepts are selected to form the Vocabulary used for the Document and Paragraph levels and, similarly, the first M2 concepts are selected to form the Vocabulary used for the Sentence level (M1 is also much larger than M2).

3.1.3. Document Representation

After the construction of the two types of Vocabulary, they are stored to the hard drive, and the computation of each document's tree representation can begin. In this study, I choose the Document-Paragraph-Sentence 3-layer tree representation model mentioned in [12]. Following Zhang et al., each document is first partitioned into paragraphs and each paragraph is in turn partitioned into sentences. This process builds the 3-layer tree representation of each document: the root node represents the whole document, the second layer captures information about the paragraphs of the document, and each paragraph has its sentences situated at the corresponding leaf nodes.

The modification of the original tree structure is carried out in the feature construction stage for each layer. For all layers, term extraction, stemming and stop word removal are applied to extract only significant terms. For the top and second layers, term-based vectors are derived normally by checking and weighting a document's terms that appear in the term-based Vocabulary V1. At the same time, a mapping process maps those terms to their concepts based on the concept-based Vocabulary; the weight of a concept is the sum of its elements' weights. For the bottom layer, instead of word histograms, we use an "appearance indices of terms" vector to indicate the absence/presence of the corresponding terms in a sentence, similar to [12]. In addition, an "appearance indices of concepts" vector is utilized to indicate the absence/presence of the corresponding concepts in the sentence. Up to this stage, each node of the tree is represented by two features: the term-based feature and the additional concept-based feature.

To overcome the "Curse of Dimensionality", the Principal Component Analysis (PCA) algorithm is applied to compress the features at the Document and Paragraph levels. PCA is a well-known tool for feature compression and high-dimensionality reduction. The same training corpus as the one used for constructing the Vocabularies is reused to calculate two PCA rotation matrices independently for the term and concept features. The matrices are also stored on hard disk in order to apply them to query documents later, in the stage of Source Detection and Retrieval. The PCA-projected features are calculated as below:

y = x · W   (1)

Where x = {x1, x2, ..., xk} is the normalized term- or concept-based histogram with k dimensions, W is the k x l PCA rotation matrix and y is the resulting PCA-compressed feature with reduced dimension l (l is much smaller than k).
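Formula (1) is a plain matrix-vector product. Given a rotation matrix computed offline from the training corpus, the projection can be sketched in pure Python (the 3-to-2 toy matrix below is hypothetical):

```python
def pca_project(x, rotation):
    """Project a k-dimensional normalized histogram x onto l principal
    components via a k x l rotation matrix computed offline."""
    k, l = len(rotation), len(rotation[0])
    assert len(x) == k
    return [sum(x[i] * rotation[i][j] for i in range(k)) for j in range(l)]

# Toy 3 -> 2 projection with a hypothetical rotation matrix.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
compressed = pca_project([0.5, 0.3, 0.2], W)
```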

Finally, the tree data of all documents is stored in order to later perform the similarity calculation between suspicious and original documents, paragraphs or sentences for both Source Detection & Retrieval and Detail Plagiarism Analysis. The concept-based tree structure representation of a document is illustrated in Fig. 9.


Figure 9 - Concept based Document Tree Representation

3.1.4. Document Indexing

According to Si et al. [16], it is unnecessary to compare documents addressing different topics. Therefore, document organization is crucial to avoid redundant comparisons and to minimize processing time and computational complexity. For this reason, a powerful clustering technique is applied to organize similar documents into the same clusters. The chosen clustering method is the Self-Organizing Map (SOM), due to its flexibility and time efficiency. All documents in the training dataset have their trees organized on the map. Two SOM maps are constructed independently for the root and paragraph levels, in the same manner as Chow et al. [2].

Initially, the SOM of the paragraph level is built by mapping all paragraphs' PCA-compressed term- and concept-based features of all documents onto the map. The results of the 2nd-level SOM are then used as part of the inputs for the root-level SOM. For the root-level SOM, the features of the root of each document's tree, combined with the resulting winner neurons (also known as Best Matching Units, BMUs) of the document's corresponding child paragraphs, form the input for the top SOM map. The compound input is subsequently mapped to its nearest BMU on this root SOM. The mapping process is repeated a number of times so that all similar documents converge on the two maps. Eventually, the data of the SOMs is stored to be utilized for fast source detection and retrieval in Stage 2. Fig. 10 illustrates how a document tree is organized, or mapped, onto the document- and paragraph-level SOMs in [2].
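Mapping a node to its Best Matching Unit reduces to a nearest-neighbour search over the neuron weight vectors. A minimal sketch follows (Euclidean distance over a flat list of neurons; the real SOM also trains the map by updating neighbourhoods, which is omitted here):

```python
import math

def find_bmu(feature, neurons):
    """Return the index of the neuron whose weight vector is closest
    (Euclidean distance) to the given feature vector."""
    def dist(w):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(feature, w)))
    return min(range(len(neurons)), key=lambda i: dist(neurons[i]))

# Hypothetical 2 x 2 map flattened to a list of weight vectors.
grid = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bmu = find_bmu([0.9, 0.1], grid)
```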


Figure 10 - 2 level SOMs for document-paragraph-sentence document tree (Chow et al. 2009)

In addition, it is worth noting that, by using the output of the paragraph-level SOM as part of the input for the document-level SOM, local information is effectively compressed into global information. In [2], it has been proved that this process can improve the accuracy of detection and retrieval.

3.2. Source Detection and Retrieval

Stage 2 of the prototype is for the detection and retrieval of source documents, or relevant candidates, given suspicious documents. For each query document, in the same way as when constructing the tree representations of corpus documents in Stage 1, the stored term- and concept-based Vocabularies are first loaded to build the query's tree representation. Secondly, the stored PCA projection matrices are also loaded, and feature compression based on these matrices is performed on the query tree representation. After that, the root node of the query tree is used to find its Best Matching Unit on the document-level SOM. Subsequently, n candidate documents associated with the BMU are retrieved. If the number of documents related to the BMU is less than n, the remaining documents are retrieved from the BMU's nearest neighbors, which contain the documents most similar to those in the BMU.

For the n candidate documents, the summed Cosine distance of term- and concept-based PCA vectors between the query document and each candidate is computed. The formula for the summed Cosine distance, or overall similarity, is defined as follows:

D(q, c) = wT · d(fT(q), fT(c)) + wC · d(fC(q), fC(c))   (2)

Where q, c: query and candidate documents
fT: term-based PCA-projected features
fC: concept-based PCA-projected features
d: Cosine distance function

The overall similarity is the sum of the individual similarities of the different types of feature. wT and wC are the weights used to balance the importance of the term- and concept-based features in the overall similarity. In the experiments, different weights are assigned to each feature to study the degree of contribution of different features to the overall performance of source detection and retrieval.
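The summed distance of formula (2) can be sketched as follows, in pure Python; the weight names are illustrative:

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def summed_distance(q_term, c_term, q_concept, c_concept,
                    w_term=0.5, w_concept=0.5):
    """Weighted sum of the term- and concept-feature cosine distances
    between a query document q and a candidate document c."""
    return (w_term * cosine_distance(q_term, c_term)
            + w_concept * cosine_distance(q_concept, c_concept))

# Identical feature pairs yield a distance of zero.
d = summed_distance([1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0])
```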

After the calculation of the summed Cosine distances, we rank them in ascending order and choose the t documents whose distances are lower than the user-defined similarity threshold θd for further analysis. The threshold θd is calculated as below:

θd = (1 + ε) · Dmin   (3)

Where ε ∈ [0, 1] and Dmin is the smallest summed Cosine distance among the candidates. It is noticed that ε = 0 is equivalent to the case of single source retrieval, i.e. only the most similar document will be retrieved.
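Assuming the threshold takes the form of a tolerance over the best distance (so that ε = 0 keeps only the most similar document, matching the single-source-retrieval behaviour described above), candidate selection can be sketched as:

```python
def select_candidates(distances, epsilon=0.0):
    """Keep the documents whose summed cosine distance does not exceed
    (1 + epsilon) times the best distance; epsilon is in [0, 1].
    The exact form of the thesis's threshold formula (3) is assumed here."""
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    best = ranked[0][1]
    threshold = (1.0 + epsilon) * best
    return [doc for doc, dist in ranked if dist <= threshold]

cands = select_candidates({"doc_a": 0.10, "doc_b": 0.11, "doc_c": 0.50}, epsilon=0.2)
```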

3.3. Detail Plagiarism Analysis

The third stage of the prototype involves the calculation of local similarities in order to identify the candidate paragraphs that are most similar to each suspicious paragraph of the query document. It is noticed that, by doing so, exhaustive sentence comparison for unrelated paragraphs can be avoided and the detection process can be sped up. Sentence comparison is then carried out between each sentence of the suspicious paragraph and those of the candidate paragraphs to locate potential plagiarism cases. These cases are summarized and reported to the user for human assessment.

3.3.1. Paragraph Level Plagiarism Analysis

For each suspicious paragraph of the query document, its nearest BMU on the paragraph-level SOM is similarly identified. However, only paragraphs that belong to the t candidates detected in Stage 2 are retrieved; other paragraphs are excluded even though they are also associated with the BMU. This is simply because their root nodes differ from the suspicious paragraph's root node (i.e. documents mentioning different topics serve no purpose in the comparison).

After all candidate paragraphs have been retrieved, the summed Cosine distances between these paragraphs and the corresponding suspicious paragraph are calculated using formula (2). Next, these distances are ranked in ascending order, and the first t' paragraphs whose distances are lower than the similarity threshold θp are selected for the exhaustive sentence-level plagiarism analysis. The threshold θp is defined as follows:

θp = (1 + εp) · Dmin   (4)

Where εp ∈ [0, 1] and Dmin is the smallest summed Cosine distance among the candidate paragraphs; εp = 0 corresponds to single plagiarized paragraph detection, i.e. only the most similar paragraph will be retrieved.

3.3.2. Sentence Level Plagiarism Analysis

After the retrieval of the most relevant paragraphs for each suspicious paragraph, we perform, for each pair of original and suspicious paragraphs, an exhaustive comparison of all of their sentences using the corresponding leaf nodes of the tree representations. Because appearance indices of terms and concepts, rather than histograms, are used as features for the bottom layer, the calculation of sentence similarity is slightly different. In this case, the overall similarity of two sentences is defined as the amount of overlap between their terms and concepts. The sentence similarity is calculated by the following formula:

Sim(sq, sc) = wT · overlap(AT(sq), AT(sc)) + wC · overlap(AC(sq), AC(sc))   (5)

Where AT(sq), AT(sc): appearance indices of Terms
AC(sq), AC(sc): appearance indices of Concepts

The overall overlap between a query sentence and a candidate sentence is the sum of the individual overlaps of the different types of feature. If the summed overlap is larger than the overlap threshold α ∈ [0.5, 1], then this pair of sentences is considered a plagiarism case. The user can flexibly change the overlap threshold to detect more or fewer plagiarism cases. For example, if α = 0.8, any pair of sentences whose degree of overlap is more than 80% is considered a plagiarism case. This exhaustive process is repeated for the remaining pairs of paragraphs. Finally, all plagiarism cases are presented to the user for human assessment.
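The sentence-level check can be sketched as follows, treating the appearance indices as sets and using the fraction of shared indices as the overlap. The exact overlap measure and the weighting are assumptions of this sketch, not quoted from the thesis implementation:

```python
def overlap(a, b):
    """Fraction of shared appearance indices (Jaccard-style overlap)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_plagiarised(q_terms, c_terms, q_concepts, c_concepts,
                   alpha=0.8, w_term=0.5, w_concept=0.5):
    """Flag a sentence pair when the weighted sum of term and concept
    overlaps exceeds the user-defined threshold alpha in [0.5, 1]."""
    score = (w_term * overlap(q_terms, c_terms)
             + w_concept * overlap(q_concepts, c_concepts))
    return score > alpha

flag = is_plagiarised({1, 4, 7}, {1, 4, 7}, {2, 5}, {2, 5}, alpha=0.8)
```

With equal weights the score stays in [0, 1], so a threshold such as α = 0.8 reads directly as "more than 80% overlap".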


Chapter 4 – Experiments

This section outlines the experiments carried out to test the performance of the 3-stage prototype. Firstly, Section 4.1 introduces the dataset used for training and testing plagiarism detection, as well as the configuration of the experiment workstation. Up to the point of writing this thesis, I have conducted experiments on the performance of Source Detection and Retrieval of the implemented system; further experiments to test the full functionality of the prototype are addressed in Chapter 5 as future work. For the experiments on Source Detection and Retrieval, I compare the results of the prototype (C-PCA-SOM) against a variety of systems, including the original tree-based retrieval (PCA-SOM) in [2] and the traditional VSM model. As the candidate for latent semantic compression, many previous studies have chosen the LSI model; in this study, I provide a comparison with the PCA model instead. All comparative models are slightly modified to use the same modified Porter stemmer as the implemented model. For the two SOM-based models, only the top SOM maps are involved in document retrieval, and the contribution of the second-layer SOM maps is temporarily ignored. Section 4.2 provides the results of Source Detection and Retrieval for Literal Plagiarism, and Section 4.3 for Paraphrased Plagiarism. In addition, I carry out empirical tests to study the contribution of different parameters, such as the weights (wT, wC) or the dimensions of the term- and concept-based PCA features, to the accuracy and optimization of the C-PCA-SOM system. The details are reported in Section 4.4.

4.1. Experiment Initialization

4.1.1. The Dataset and Workstation Configuration

The Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) dataset is used to test the implemented system in this study. This dataset formed part of PAN 2010, the international competition on plagiarism detection, and can be downloaded from http://www.webis.de/research/corpora/corpus-pan-pc-10. In detail, 7,859 candidate documents in total make up the original corpus. The test set also contains 7,859 suspicious documents corresponding to those in the corpus, i.e. each document in the corpus has exactly one plagiarized document in the test dataset. The test set consists of 3,792 documents of literal plagiarism (non-paraphrased) cases and 4,067 documents of paraphrased plagiarism cases. For each type of plagiarism, I construct multiple pairs of sub-corpus and sub-test-set with sizes of 50, 100, 200 and 400 randomly selected documents in order to evaluate the C-PCA-SOM system at different data scales. For each sub-corpus, the processes described in Stage 1 of the 3-stage prototype are performed to first organize its documents; the corresponding sub-test-set is later used for evaluation.


The experiments, the computation of the PCA rotation matrices and the SOM clustering are conducted on a PC with a 2.2 GHz Core 2 Duo CPU and 2 GB of RAM.

4.1.2. Performance Measures and Parameter Configuration

To provide comparable results between the different models, Precision and Recall are applied to evaluate their performance, computed as follows:

Precision = No of correctly retrieved documents / No of retrieved documents   (6)

Recall = No of correctly retrieved documents / No of relevant documents   (7)

Since each original document has exactly one plagiarized document, it is sufficient to consider only whether the first retrieved document is the correct candidate. Hence, the scaling parameter is set to zero (ε = 0) in formula (3). In this stage, it is assumed that the contributions of the term- and concept-based features are equivalent, so the balancing weights in formula (2) are configured as wT = wC = 0.5. The empirical study of these parameters is outlined later in Section 4.4.

Apparently, Precision and Recall are equal when using the Webis-CPC-11 dataset. Therefore, I use PR to indicate both of these measures, and further add the "No of correct retrieval" measure to indicate the number of correctly detected source documents for the suspicious documents in the test sets.

4.2. Source Detection and Retrieval for Literal Plagiarism

To begin with, the parameters for each sub-corpus are arbitrarily assigned as in Table 1. The implemented model, C-PCA-SOM, uses all of these parameters, while PCA-SOM ignores the concept-based vocabulary and the concept-based PCA feature dimensions. The PCA model uses only the term-based vocabulary and the term-based PCA feature dimensions, and the VSM model uses only the term-based Vocabulary to construct its term-document matrix. In addition, I set wT = wC = 0.5, as mentioned above, to give the term- and concept-based features equal contributions.

Corpus size | V1 size | V2 size | T/C PCA dimensions | SOM size | SOM iterations
50          | 1500    | 1000    | 40/40              | 6 x 8    | 100
100         | 2500    | 2000    | 80/80              | 7 x 8    | 150
200         | 3500    | 2500    | 130/130            | 8 x 8    | 200
400         | 5000    | 3500    | 220/220            | 8 x 9    | 200

Table 1 - Configuration of Parameters for Literal Plagiarism


The results of the different models are reported in Table 2 and Fig. 11. The diagram illustrates the PRs of the different systems in detecting source documents for Literal Plagiarism cases. It is noticed that the C-PCA-SOM model produces competitive results compared with the other models for the case of single source detection. For the corpus sizes of 100 and 200, the C-PCA-SOM system is even slightly better than the PCA-SOM without the concept-based feature. For the corpus size of 50, all systems generate the same result of 96%. It is observed that PCA and VSM tend to perform better for single source detection cases: even though retrieval takes more time for PCA and VSM, these models compare each query document with all documents in the corpus, so the possibility of missing the real candidate is low. For SOM-based models such as C-PCA-SOM and PCA-SOM, fast retrieval depends completely on the results of the earlier clustering process. This is clarified in Section 4.4, where the change of any parameter can affect the accuracy of document clustering and, consequently, of document retrieval.

Corpus size | Algorithm  | No of correct retrieval | PR
50          | C-PCA-SOM  | 48/50                   | 0.96
50          | PCA-SOM    | 48/50                   | 0.96
50          | PCA        | 48/50                   | 0.96
50          | VSM        | 48/50                   | 0.96
100         | C-PCA-SOM  | 91/100                  | 0.91
100         | PCA-SOM    | 88/100                  | 0.88
100         | PCA        | 91/100                  | 0.91
100         | VSM        | 91/100                  | 0.91
200         | C-PCA-SOM  | 182/200                 | 0.91
200         | PCA-SOM    | 181/200                 | 0.905
200         | PCA        | 185/200                 | 0.925
200         | VSM        | 187/200                 | 0.935
400         | C-PCA-SOM  | 355/400                 | 0.8875
400         | PCA-SOM    | 355/400                 | 0.8875
400         | PCA        | 359/400                 | 0.8975
400         | VSM        | 361/400                 | 0.9025

Table 2 - Source Detection & Retrieval for Literal Plagiarism


Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism

4.3. Source Detection and Retrieval for Paraphrased Plagiarism

In the same manner as the test for Literal Plagiarism, parameters are first configured arbitrarily. Table 3 denotes the parameter configuration for the specific corpuses. The results of Source Detection and Retrieval for Paraphrased Plagiarism cases are reported in Table 4 and Fig. 12. The weights wT and wC are kept the same as in 4.2.

Surprisingly, in the case of Paraphrased Plagiarism, the PCA-SOM model performs better than the C-PCA-SOM in detecting the corresponding candidates. In addition, since only global information is involved in retrieval, the exhaustive VSM and PCA still produce better results in finding the single source document for a suspicious document. This may be because the overall topic of a paraphrased document remains mostly the same as that of the original document. Regarding the different performance of the two SOM-based models, I further investigate the contribution of the concept-based feature to the performance of clustering and subsequent retrieval. At this stage, it can be assumed that the concept-based feature might introduce noise into the clustering process. To clarify whether this assumption is true, I later try different values of the weights wT and wC. The results are reported in Section 4.4.

Corpus size | V1 size | V2 size | T/C PCA dimensions | SOM size | SOM iterations
50          | 1700    | 1100    | 45/45              | 6 x 8    | 100
100         | 2700    | 2300    | 90/90              | 7 x 8    | 150
200         | 3800    | 3000    | 140/140            | 8 x 8    | 200
400         | 5500    | 4000    | 240/240            | 8 x 9    | 200

Table 3 - Configuration of Parameters for Paraphrased Plagiarism


Corpus Size | Algorithm  | No of correct retrieval | PR
50          | C-PCA-SOM  | 44/50                   | 0.88
50          | PCA-SOM    | 46/50                   | 0.92
50          | PCA        | 43/50                   | 0.86
50          | VSM        | 46/50                   | 0.92
100         | C-PCA-SOM  | 83/100                  | 0.83
100         | PCA-SOM    | 86/100                  | 0.86
100         | PCA        | 90/100                  | 0.9
100         | VSM        | 93/100                  | 0.93
200         | C-PCA-SOM  | 160/200                 | 0.8
200         | PCA-SOM    | 167/200                 | 0.835
200         | PCA        | 174/200                 | 0.87
200         | VSM        | 183/200                 | 0.915
400         | C-PCA-SOM  | 274/400                 | 0.685
400         | PCA-SOM    | 288/400                 | 0.72
400         | PCA        | 310/400                 | 0.775
400         | VSM        | 333/400                 | 0.8325

Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism

Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism
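The PR values reported in these tables follow directly from the counts of correct retrievals. Since each suspicious document has exactly one relevant source, precision equals recall here; a minimal sketch (function name is illustrative, not from the thesis):

```python
def retrieval_precision(correct: int, total: int) -> float:
    """PR for single-source retrieval: with exactly one relevant
    source per suspicious document, precision equals recall."""
    return correct / total

# Figures from Table 4, corpus size 400: VSM retrieved 333 of 400 sources.
print(retrieval_precision(333, 400))  # 0.8325
```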


4.4. Study of Parameters

This section provides an empirical study of how different parameters affect the performance of the C-PCA-SOM model: the sizes of the term- and concept-based Vocabularies (Sections 4.4.1 and 4.4.2), the dimensions of the term- and concept-based PCA features (Sections 4.4.3 and 4.4.4) and, lastly, the contribution of the weighting parameters (Section 4.4.5). The experiments are carried out on a compound corpus of 300 documents containing both literal and paraphrased plagiarism cases; 150 documents of each type are chosen randomly to obtain more reliable outcomes. SOM-based document clustering and retrieval are the two processes affected by a change in any parameter and are therefore performed again at each change. Note that performance also varies slightly between runs with the same set of parameters, as the following sections show.

4.4.1. Size of Term based Vocabulary

After term extraction, word stemming, stop word removal and concept construction, we obtain a full term-based Vocabulary of 9868 distinct terms and a full concept-based Vocabulary of 6815 distinct concepts. In the experiment, different sizes of the term-based Vocabulary are tested to measure the contribution of this size to the performance of the C-PCA-SOM prototype. The other parameters are kept fixed as follows: concept vocabulary size = 4000, T/C PCA dimensions = 200/200, SOM size = 8 x 9, SOM training iterations = 150, and equal weights of 0.5 for the term- and concept-based features.

Table 5 and Fig. 13 illustrate the performance of the C-PCA-SOM system for different vocabulary sizes. The size of the term-based Vocabulary does not greatly affect the accuracy of Source Detection and Retrieval: Precision/Recall fluctuates between 85.66% and 87.3%. The optimum in this case is achieved at a size of around 6000.
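Truncating the vocabulary to a given size amounts to keeping the top-weighted terms of the full TF-IDF-sorted listing shown in Appendix B. A minimal sketch (function and variable names are illustrative, not taken from the prototype):

```python
def truncate_vocabulary(weights, size):
    """Keep the `size` highest-weighted terms; mirrors Appendix B's
    listing, which is sorted by TF-IDF weight from most to least
    significant."""
    return sorted(weights, key=weights.get, reverse=True)[:size]

# Toy weights copied from the top of the Appendix B listing:
toy = {"will": 62.502581, "law": 37.291657,
       "father": 34.656919, "work": 33.637597}
print(truncate_vocabulary(toy, 2))  # ['will', 'law']
```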

Vocabulary size   No of correct retrievals   PR
3000              261/300   0.87
4000              259/300   0.863
5000              257/300   0.8566
6000              262/300   0.873
7000              259/300   0.863
8000              261/300   0.87

Table 5 - Performance based on different sizes of Term-based Vocabulary


Figure 13 - Performance based on different sizes of Term based Vocabulary

4.4.2. Size of Concept based Vocabulary

This section outlines the influence of different sizes of the Concept-based Vocabulary on system performance. For the size of the Term-based Vocabulary, the optimum value of 6000 found in Section 4.4.1 is chosen. While the size of the Concept-based Vocabulary is varied, the other parameters are kept the same as in 4.4.1 (term vocabulary size = 6000, T/C PCA dimensions = 200/200, SOM size = 8 x 9, SOM training iterations = 150, equal weights of 0.5).

The results are documented in Table 6 and Fig. 14. Similarly, changing this size does not significantly change the accuracy of candidate retrieval. The highest PR is 88%, corresponding to a size of 6000 for the corpus of 300 documents.

Vocabulary size   No of correct retrievals   PR
2000              259/300   0.863
3000              260/300   0.866
4000              254/300   0.846
5000              259/300   0.863
6000              264/300   0.88

Table 6 - Performance based on different sizes of Concept based Vocabulary


Figure 14 - Performance based on different sizes of Concept based Vocabulary

4.4.3. Dimensions of Term based PCA feature

For the experiments on different dimensions of the Term-based PCA feature, the parameters are configured as in Section 4.4.2, except that the Concept-based Vocabulary size is set to 6000, the value that gave the best PR in the previous section. The full parameter set is: term vocabulary size = 6000, concept vocabulary size = 6000, Concept PCA dimensions = 200, SOM size = 8 x 9, SOM training iterations = 150, equal weights of 0.5.

In this study, a clearer trend emerges than in Sections 4.4.1 and 4.4.2. The results (Table 7 and Fig. 15) show that the dimensionality of the Term-based PCA feature can significantly influence the performance of the C-PCA-SOM model. Specifically, PR increases from 83.6% to 88% as the number of dimensions rises from 50 to 200, but drops sharply (by more than 60 percentage points) from 250 dimensions onward. This clarifies that it is unnecessary to use all terms to build the Term-based Vocabulary, because doing so can introduce "noisy" features that hurt system performance.
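The dimensionality truncation studied here can be sketched with a plain SVD-based PCA. This is an illustration only, not the prototype's actual implementation (which the thesis does not show); all names are mine:

```python
import numpy as np

def pca_reduce(X: np.ndarray, d: int) -> np.ndarray:
    """Project the row vectors of X onto the top-d principal components."""
    Xc = X - X.mean(axis=0)                       # centre the feature vectors
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                          # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))                      # 20 documents, 8 term features
Z = pca_reduce(X, 3)                              # keep 3 dimensions
print(Z.shape)
```

Choosing d too large keeps low-variance components that mostly encode noise, which is one plausible reading of the sharp PR drop at 250 dimensions.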


Term-based PCA dimensions   No of correct retrievals   PR
50                          251/300   0.836
100                         260/300   0.866
150                         264/300   0.88
200                         264/300   0.88
250                         83/300    0.276
300                         87/300    0.29

Table 7 - Performance based on different dimensions of Term based PCA feature

Figure 15 - Performance based on different dimensions of Term based PCA feature

4.4.4. Dimensions of Concept based PCA feature

The parameters for the experiments on different dimensions of the Concept-based PCA feature are set as follows: term vocabulary size = 6000, concept vocabulary size = 6000, Term PCA dimensions = 150, SOM size = 8 x 9, SOM training iterations = 150, equal weights of 0.5. The only modified parameter is the dimensionality of the Term-based PCA feature, which is set to 150 (the value providing the best retrieval result in 4.4.3).

Table 8 and Fig. 16 summarize the outcomes for various Concept-based PCA feature dimensions. Unlike the Term-based PCA feature, increasing the Concept-based PCA dimensionality only slightly raises or lowers the performance of the C-PCA-SOM system on Source Detection and Retrieval. It is reported


from the diagram that PR fluctuates between just over 80% and nearly 90%. The highest PR of 89% is achieved with 250 Concept-based PCA dimensions.

Concept-based PCA dimensions   No of correct retrievals   PR
50                             242/300   0.806
100                            260/300   0.866
150                            255/300   0.85
200                            253/300   0.843
250                            267/300   0.89
300                            256/300   0.853

Table 8 - Performance based on different dimensions of Concept based PCA feature

Figure 16 - Performance based on different dimensions of Concept based PCA feature

4.4.5. Contribution of the Weights

Finally, to investigate the contributions of the Term- and Concept-based features, different values of the two weights, corresponding to the assigned degrees of significance of Terms and Concepts, are studied in detail. For parameter configuration, only the number of dimensions of the Concept-based PCA feature is modified; it is set to 250, which produced the best result in 4.4.4. The full parameter set is: term vocabulary size = 6000, concept vocabulary size = 6000, T/C PCA dimensions = 150/250, SOM size = 8 x 9, SOM iterations = 150.


Table 9 summarizes the Source Detection and Retrieval results. The weight pairs (1.0, 0.0) and (0.0, 1.0) are two special cases: the former is equivalent to the PCA-SOM model, which does not use the Concept-based feature, while the latter uses only the Concept-based feature for similarity calculation, i.e. the Term-based feature is ignored. Using either feature alone already achieves satisfactory performance (PR of 87.6% and 84%, respectively). However, combining the Term- and Concept-based features produces better results when the weights are configured appropriately: for the corpus of 300 documents, the pair (0.4 for Terms, 0.6 for Concepts) gives the highest PR of 88.3%. This study shows that the Concept-based feature can be used to improve Document Representation, Document Clustering, Document Retrieval and, potentially, Plagiarism Detection.
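The thesis's exact similarity formula is not reproduced here (the weight symbols were lost in extraction), but a weighted combination of per-feature cosine similarities is one minimal sketch consistent with this section; all names and the linear-combination form are my assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def combined_similarity(term_a, term_b, con_a, con_b,
                        w_term=0.4, w_concept=0.6):
    # Weighted blend of term- and concept-feature similarities; the
    # 0.4/0.6 defaults mirror the best-performing weight pair in Table 9.
    return w_term * cosine(term_a, term_b) + w_concept * cosine(con_a, con_b)
```

Setting one weight to zero recovers the two special cases discussed above: term-only (PCA-SOM-like) or concept-only similarity.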

In conclusion, even though VSM and PCA produce better results than PCA-SOM and C-PCA-SOM, they are impractical for large datasets. The two SOM-based models can be applied to real-time DR and PR thanks to their near-constant processing time. In addition, C-PCA-SOM, with its additional Concept-based feature, can achieve better performance when the significance of the Term- and Concept-based features is balanced appropriately.
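The near-constant query time of the SOM-based models comes from two-stage lookup: a query is compared against the cluster prototypes first, then only against the documents of the winning cluster, instead of against the whole corpus. A toy sketch (all names illustrative):

```python
import numpy as np

def retrieve(query, centroids, clusters):
    """Two-stage retrieval: pick the nearest cluster centroid, then scan
    only that cluster's documents rather than the entire corpus."""
    c = int(np.argmin([np.linalg.norm(query - m) for m in centroids]))
    doc_id, _ = min(clusters[c], key=lambda d: np.linalg.norm(query - d[1]))
    return c, doc_id

centroids = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
clusters = {0: [("d1", np.array([0.1, 0.2])), ("d2", np.array([1.0, 1.0]))],
            1: [("d3", np.array([9.0, 9.0]))]}
print(retrieve(np.array([0.0, 0.3]), centroids, clusters))
```

With a fixed SOM grid, the number of centroid comparisons stays constant as the corpus grows, which is why the exhaustive VSM and PCA comparisons do not scale the same way.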

Weight (Term)   Weight (Concept)   No of correct retrievals   PR
1.0             0.0                263/300   0.876
0.8             0.2                262/300   0.873
0.6             0.4                261/300   0.87
0.4             0.6                265/300   0.883
0.2             0.8                254/300   0.846
0.0             1.0                252/300   0.84

Table 9 - Performance based on different values of the weights


Chapter 5 – Conclusion

This chapter summarizes the thesis in two sub-sections: the development of the proposed model is summarized in 5.1, and future work is outlined in 5.2.

5.1. Concluding Remarks

Protecting intellectual property from information abuse is a constant challenge for the ethical community. Plagiarism, one such dishonest behaviour, is the act of using other people's work without proper reference. Systems that deploy flat feature representations can effectively detect literal plagiarism, in which only minor changes are made to the source. However, they are vulnerable to intelligent plagiarism: using various tactics, plagiarists can transform the original text while keeping its main ideas. Even though systems applying structural representations have recently been developed, they still rely on term-based features or their derivatives and therefore show limitations when dealing with intelligent plagiarism.

This thesis proposes an enhancement of the term-based tree structure representation model to challenge one specific text manipulation tactic: paraphrasing. The modified model, referred to as the concept based tree structure representation, adds a Concept-based feature constructed by exploiting the WordNet ontology to group semantically similar terms into one concept. The term-based feature ignores this valuable information, since it treats all terms as unrelated. With the new feature, each layer of the original tree is represented not only by the term-based feature but by both the term- and concept-based features; the modified representation therefore captures both syntactic and semantic information of a document. To make the proposed structure applicable to real-time applications, Principal Component Analysis is applied for dimensionality reduction and Self-Organizing Map clustering is used to organize documents into meaningful clusters.

This thesis also introduces the C-PCA-SOM 3-stage prototype as a real-time implementation of the enhanced model. The prototype's performance is tested through multiple experiments on the task of Source Detection and Retrieval. The reported results show that the prototype is competitive with other systems, including VSM, PCA and PCA-SOM. Even though VSM and PCA are better at single source detection and retrieval, they are impractical for large corpuses. The 3-stage prototype, on the other hand, achieves real-time performance and can therefore be deployed in practical systems. Furthermore, by studying the parameters affecting the performance of the C-PCA-SOM model, it is


clarified that using both term- and concept-based features produces better results than using either the term- or the concept-based feature alone.

5.2. Future Works

In future work, Stage 3 of the prototype will be fully tested to verify the contribution of the Concept-based feature to the task of Paraphrased Plagiarism Detection and Analysis. Paraphrasing can bypass systems such as Turnitin, which rely on Term-based features. It is therefore expected that, by using the Concept-based feature, the C-PCA-SOM prototype can discover semantically similar sentences even when they are expressed differently. If positive results are achieved, it can be claimed that the Concept-based feature detects plagiarism by paraphrasing and, potentially, even higher levels of plagiarism.

In addition, it has been reported that another type of semantic feature, the Category-based feature, can improve Document Clustering [28]. I therefore plan to study and integrate this feature into the concept-enhanced structure representation. In [28], Hu et al. use another form of background knowledge, the encyclopedia Wikipedia, to extract the Category-based feature; they tested it on DC and obtained positive outcomes. Part of the future work is thus to investigate the contribution of the Category-based feature and the application of Wikipedia to speeding up DC and DR.

Finally, the experiments show how parameters such as the feature weights influence the performance of the C-PCA-SOM 3-stage prototype. It is therefore also important to study in detail the automatic configuration of these parameters for optimum performance of the prototype.


References

[1] R. Lukashenko, V. Graudina, and J. Grundspenkis, "Computer-based plagiarism
detection methods and tools: an overview," in Proceedings of the 2007
international conference on Computer systems and technologies, Bulgaria, 2007,
pp. 1-6.

[2] T. W. S. Chow and M. K. M. Rahman, "Multilayer SOM With Tree-Structured

Data for Efficient Document Retrieval and Plagiarism Detection," Neural

Networks, IEEE Transactions on, vol. 20, pp. 1385-1402, Sept. 2009.

[3] S. M. Alzahrani, N. Salim, and A. Abraham, "Understanding Plagiarism

Linguistic Patterns, Textual Features, and Detection Methods," Systems, Man, and

Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 42,

pp. 133-149, March 2012.

[4] S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital

documents," SIGMOD Rec., vol. 24, pp. 398-409, May 1995.

[5] N. Shivakumar and H. Garcia-Molina, "SCAM: A Copy Detection Mechanism for

Digital Documents," in 2nd International Conference in Theory and Practice of

Digital Libraries (DL 1995), Austin, Texas, 1995.

[6] C. Grozea, C. Gehl, and M. Popescu, "ENCOPLOT: Pairwise sequence matching

in linear time applied to plagiarism detection," in Proc. SEPLN, Donostia, Spain,

2009, pp. 10-18.

[7] R. Yerra and Y.-K. Ng, "A Sentence-Based Copy Detection Approach for Web

Documents," in Fuzzy Systems and Knowledge Discovery. vol. 3613, L. Wang

and Y. Jin, Eds., ed: Springer Berlin / Heidelberg, 2005, pp. 481-482.

[8] J. Koberstein and Y.-K. Ng, "Using Word Clusters to Detect Similar Web

Documents," in Knowledge Science, Engineering and Management. vol. 4092, J.

Lang, F. Lin, and J. Wang, Eds., ed: Springer Berlin / Heidelberg, 2006, pp. 215-

228.

[9] A. Schenker, M. Last, H. Bunke, and A. Kandel, "Classification of Web

documents using a graph model," in Document Analysis and Recognition, 2003.

Proceedings. Seventh International Conference on, 2003, pp. 240-244.

[10] T. W. S. Chow, H. Zhang, and M. K. M. Rahman, "A new document

representation using term frequency and vectorized graph connectionists with

application to document retrieval," Expert Systems with Applications, vol. 36, pp.

12023-12035, March 2009.

[11] M. K. M. Rahman and T. W. S. Chow, "Content-based hierarchical document

organization using multi-layer hybrid network and tree-structured features,"

Expert Systems with Applications, vol. 37, pp. 2874-2881, Sept. 2010.

[12] H. Zhang and T. W. S. Chow, "A coarse-to-fine framework to efficiently thwart

plagiarism," Pattern Recognition, vol. 44, pp. 471-487, 2011.

[13] H. Zhang and T. W. S. Chow, "A multi-level matching method with hybrid

similarity for document retrieval," Expert Systems with Applications, vol. 39, pp.

2710-2719, Feb. 2012.

[14] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis,"

Chemometrics and Intelligent Laboratory Systems, vol. 2, pp. 37-52, 1987.

[15] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, pp.

1464-1480, 1990.


[16] A. Si, H. V. Leong, and R. W. H. Lau, "CHECK: a document plagiarism detection

system," in Proceedings of the 1997 ACM symposium on Applied computing, San

Jose, California, United States, 1997, pp. 70-77.

[17] L. Sindhu, B. B. Thomas, and S. M. Idicula, "A Study of Plagiarism Detection

Tools and Technologies," International Journal of Advanced Research In

Technology, vol. 1, pp. 64-70, 2011.

[18] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic

indexing," Commun. ACM, vol. 18, pp. 613-620, 1975.

[19] P. Marksberry, "The Toyota Way – a quantitative approach," International

Journal of Lean Six Sigma, vol. 2, pp. 132-150, 2011.

[20] J. Zobel and A. Moffat, "Exploring the similarity space," ACM SIGIR Forum, vol.

32, pp. 18-34, 1998.

[21] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman,

"Indexing by Latent Semantic Analysis," Journal of the American Society for

Information Science, vol. 41, pp. 391-407, 1990.

[22] T. A. Letsche and M. W. Berry, "Large-scale information retrieval with latent

semantic indexing," Information Sciences, vol. 100, pp. 105-137, 1997.

[23] K. Lagus, "Text Retrieval Using Self-Organized Document Maps," Neural

Processing Letters, vol. 15, pp. 21-29, 2002.

[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM – Self-organizing

maps of document collections," Neurocomputing, vol. 21, pp. 101-117, 1998.

[25] N. Ampazis and S. Perantonis, "LSISOM — A Latent Semantic Indexing

Approach to Self-Organizing Maps of Document Collections," Neural Processing

Letters, vol. 19, pp. 157-173, April 2004.

[26] A. Georgakis, C. Kotropoulos, A. Xafopoulos, and I. Pitas, "Marginal median

SOM for document organization and retrieval," Neural Networks, vol. 17, pp.

365-377, 2004.

[27] X. Xue and Z. Zhou, "Distributional Features for Text Categorization,"

Knowledge and Data Engineering, IEEE Transactions on, vol. 21, pp. 428-442,

March 2009.

[28] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou, "Exploiting Wikipedia as

external knowledge for document clustering," in Proceedings of the 15th ACM

SIGKDD international conference on Knowledge discovery and data mining,

Paris, France, 2009, pp. 389-396.

[29] A. Hotho, S. Staab, and G. Stumme, "Ontologies improve text document

clustering," in Data Mining, 2003. ICDM 2003. Third IEEE International

Conference on, 2003, pp. 541-544.

[30] S. Liu, F. Liu, C. Yu, and W. Meng, "An effective approach to document retrieval

via utilizing WordNet and recognizing phrases," in Proceedings of the 27th

annual international ACM SIGIR conference on Research and development in

information retrieval, Sheffield, United Kingdom, 2004, pp. 266-272.

[31] J. Sedding and D. Kazakov, "WordNet-based text document clustering," in

Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural

Language Data, Geneva, 2004, pp. 104-113.

[32] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis, and E. E. Milios,

"Semantic similarity methods in wordNet and their application to information


retrieval on the web," in Proceedings of the 7th annual ACM international

workshop on Web information and data management, Bremen, Germany, 2005,

pp. 10-16.

[33] G. Spanakis, G. Siolas, and A. Stafylopatis, "Exploiting Wikipedia Knowledge

for Conceptual Hierarchical Clustering of Documents," The Computer Journal,

vol. 55, pp. 299-312, March 2012.

[34] S. Meyer zu Eissen, B. Stein, and M. Kulig, "Plagiarism detection without

reference collections," in Advances in data analysis, R. Decker and H.-J. Lenz,

Eds., ed Berlin, Heidelberg: Springer, 2007, pp. 359-366.

[35] S. Eissen and B. Stein, "Intrinsic Plagiarism Detection," in Advances in

Information Retrieval. vol. 3936, M. Lalmas, A. MacFarlane, S. Rüger, A.

Tombros, T. Tsikrika, and A. Yavlinsky, Eds., ed: Springer Berlin / Heidelberg,

2006, pp. 565-569.

[36] N. Shivakumar and H. Garcia-Molina, "Building a scalable and accurate copy

detection mechanism," in Proceedings of the first ACM international conference

on Digital libraries, Bethesda, Maryland, United States, 1996, pp. 160-168.

[37] A. Barrón-Cedeno and P. Rosso, "On automatic plagiarism detection based on n-

grams comparison," in Proc. 31st Eur. Conf. IR Res. Adv. Info. Retrieval, 2009,

pp. 696-700.

[38] E. Stamatatos, "Plagiarism detection using stopword n-grams," Journal of the

American Society for Information Science and Technology, vol. 62, pp. 2512-

2527, Sept. 2011.

[39] G. A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38,

pp. 39-41, 1995.


Appendix A – Source code of the Modified Porter Algorithm

Following is the source code of the Modified Porter Stemmer, written in Perl.

sub stem {
    my @parms = @_;
    foreach( @parms ) {
        $_ = lc $_;

        # Step 0 - remove punctuation
        s/'s$//;
        s/^[^a-z]+//;
        s/[^a-z]+$//;
        next unless /^[a-z]+$/;

        # step1a_rules
        if( /[^s]s$/ ) {
            s/sses$/ss/ || s/ies$/i/ || s/s$//
        }

        # step1b_rules. The business with rule==106 is embedded in the
        # boolean expressions here.
        (/[^aeiouy].*eed$/ && s/eed$/ee/ )
        ||
        ( s/([aeiou].*)ed$/$1/ || s/([aeiouy].*)ing$/$1/ )
        &&
        (
            # step1b1_rules
            s/at$/ate/ || s/bl$/ble/ || s/iz$/ize/ ||
            s/bb$/b/ || s/dd$/d/ || s/ff$/f/ || s/gg$/g/ ||
            s/mm$/m/ || s/nn$/n/ || s/pp$/p/ || s/rr$/r/ ||
            s/tt$/t/ || s/ww$/w/ || s/xx$/x/ ||
            # This is wordsize==1 && CVC...addanE...
            s/^[^aeiouy]+[aeiouy][^aeiouy]$/$&e/
        )
        #DEBUG && warn "step1b1: $_\n"
        ;

        # step1c_rules
        #DEBUG warn "step1c: $_\n" if
        s/([aeiouy].*)y$/$1i/;

        # step2_rules
        if ( s/ational$/ate/ || s/tional$/tion/ || s/enci$/ence/ ||
             s/anci$/ance/ || s/izer$/ize/ || s/iser$/ise/ ||
             s/abli$/able/ || s/alli$/al/ || s/entli$/ent/ ||
             s/eli$/e/ || s/ousli$/ous/ || s/ator$/ate/ )
        {
            my ($l,$m) = ($`,$&);
            #DEBUG warn "step 2: l=$l m=$m\n";
            $_ = $l.$m unless $l =~ /[^aeiou][aeiouy]/;
        }

        # step3_rules
        if ( s/icate$/ic/ || s/ative$// || s/alize$/al/ ||
             s/ical$/ic/ || s/ful$// )
        {
            my ($l,$m) = ($`,$&);
            #DEBUG warn "step 3: l=$l m=$m\n";
            $_ = $l.$m unless $l =~ /[^aeiou][aeiouy]/;
        }

        # step4_rules
        if ( s/al$// || s/able$// || s/ible$// || s/ou$// ||
             s/iti$// || s/ous$// || s/ive$// )
        {
            my ($l,$m) = ($`,$&);
            # Look for two consonant/vowel transitions
            # NB simplified...
            #DEBUG warn "step 4: l=$l m=$m\n";
            $_ = $l.$m unless $l =~ /[^aeiou][aeiouy].*[^aeiou][aeiouy]/;
        }

        # step5b_rules
        #DEBUG warn("step 5b: $_\n") &&
        s/ll$/l/
            if /[^aeiou][aeiouy].*[^aeiou][aeiouy].*ll$/;

        # Cosmetic step
        s/(.)i$/$1y/;
    }
    @parms;
}

The rules of the original Porter Stemming algorithm can be referenced at

http://snowball.tartarus.org/algorithms/porter/stemmer.html


Appendix B – Output Example of a Term based Vocabulary

Following is the output of a stored Term-based Vocabulary constructed from a corpus of 50 documents. Entries are sorted by TF-IDF weight from highest to lowest (most to least significant).

Displaying corpus vocabulary: will => 62.502581 law => 37.291657 father => 34.656919 work => 33.637597 love => 32.864670 baron => 29.315752 thing => 27.084467 time => 25.953446 point => 25.843677 year => 25.843677 henry => 24.383006 gener => 23.189831 well => 23.071794 good => 22.704138 state => 22.135748 great => 21.638097 prince => 21.436863 side => 21.081665 bagot => 20.154579 lord => 20.080123 priest => 19.873719 place => 19.742729 three => 18.561821 french => 18.452617 cry => 18.374454 viola => 18.322345 middle => 18.217576 catherine => 18.217576 long => 18.186291 george => 18.034330 call => 17.653746 eye => 17.222443 slave => 17.211534 city => 16.561432 wolsey => 16.490110 pope => 16.490110 kate => 16.490110 abelard => 16.490110 peter => 16.490110 girl => 16.261359 order => 15.865703 hand => 15.811249 woman => 15.777239 case => 15.457997 nature => 15.457997 life => 15.327286


house => 15.327286 children => 14.906246 john => 14.906246 lady => 14.906246 pound => 14.905289 word => 14.757165 officer => 14.657876 hindenburg => 14.657876 francie => 14.657876 wife => 14.342945 france => 14.342945 pass => 14.194333 offer => 14.169831 mind => 14.148264 class => 13.780840 follow => 13.703082 perhap => 13.599174 number => 13.551132 duty => 13.551132 young => 13.551132 poor => 13.551132 live => 13.531919 body => 13.531919 half => 13.531919 dead => 13.249146 free => 12.969242 length => 12.908650 thought => 12.881664 sever => 12.881664 matery => 12.881664 hear => 12.881664 antiqu => 12.825641 olivia => 12.825641 divorce => 12.825641 highness => 12.825641 idea => 12.465909 accord => 12.465909 master => 12.465909 ground => 12.301745 view => 12.301745 water => 12.301745 answer => 12.301745 interest => 12.301745 footnote => 12.249636 exclaim => 12.249636 school => 12.196019 form => 12.196019 course => 12.196019 company => 12.196019 nee => 12.196019 face => 12.196019 country => 12.196019 mother => 11.593498 condition => 11.593498 account => 11.593498


matter => 11.593498 marshal => 11.593003 price => 11.593003 term => 11.474356 roman => 11.474356 history => 11.474356 save => 11.474356 marry => 11.474356 find => 11.332645 open => 11.332645 leave => 11.071570 single => 11.071570 fact => 11.071570 fall => 11.071570 pressure => 10.993407 coin => 10.993407 solid => 10.993407 orsino => 10.993407 outer => 10.993407 historic => 10.993407 crit => 10.993407 keeper => 10.993407 money => 10.840906 white => 10.840906 william => 10.718431 kill => 10.718431 force => 10.718431 best => 10.611198 move => 10.611198 care => 10.611198 hope => 10.611198 opportun => 10.305332 figure => 10.305332 early => 10.305332 till => 10.305332 pretty => 10.305332 feel => 10.305332 large => 10.305332 beauty => 10.305332 head => 10.199380 loss => 10.040061 fell => 10.040061 full => 10.040061 intellectu => 10.040061 feature => 10.040061 indian => 10.040061 measure => 9.936859 manufacturer => 9.936859 rome => 9.936859 Displaying: 150 / 3340


Appendix C – Output Example of a Concept based Vocabulary

Following is the output of a stored Concept-based Vocabulary constructed from a corpus of 50 documents. Note that every word/term making up this Vocabulary also appears in the Term-based Vocabulary.

Displaying concept based vocabulary: 1 => [will, purpose, intend, remember, faculty, leave] 2 => [law, collection, philosophy, police, principle, force] 3 => [father, leader, priest, parent, mother, title] 4 => [work, minister, slave, serve, operation, investigation, pass, exercise, wait, exchange, claw, duty, care, fill, operate, labor, collaborate, cultivate, move, succeed, double, study, farm, bank, function, mission, carpenter, till, ministry, roll, location, service, process, busy, bring, play, action] 5 => [love, object, emotion, dear, passion, devotion, enjoy, lover] 6 => [baron] 7 => [thing, attribute, matter, statement, change, situation, affair, feast] 8 => [time, hour, experience, future, day, sentence, schedule, moment, case, term, dead, occasion, determine, wee, clock] 9 => [point, finger, phase, charge, reflect, fact, steer, extent, head, guide, position, indicate, sheer, distance, park, mark, characteristic, middle, level, stage, spot, detail, signal, respect, dot, state, corner, direct, degree, punctum, place, measure] 10 => [year, class] 11 => [henry] 12 => [gener] 13 => [well, surface, easily, good, swell] 14 => [great, eager] 15 => [prince] 16 => [side, opinion, region, pull, front, face, root, edge, bottom, unit, hand] 17 => [bagot] 18 => [lord, duke, count, noble, master] 19 => [three, trinity] 20 => [french, nation] 21 => [cry, exclaim, express, shriek, weep, sob, utterance, tear, call, utter, noise] 22 => [viola] 23 => [catherine] 24 => [long, desire] 25 => [george] 26 => [eye, attention] 27 => [city] 28 => [wolsey] 29 => [pope] 30 => [kate] 31 => [abelard] 32 => [peter] 33 => [girl, baby, maid, daughter, woman] 34 => [order, chapter, club, peace, request, arrangement, magnitude, rule, association, commission, command, bull, stay, word, society, edict, hunt, condition] 35 => [nature, disposition, complexion, quality]


36 => [life, history, spirit, person] 37 => [house, business, chamber, household, family] 38 => [children] 39 => [john, room] 40 => [lady, madame] 41 => [pound, walk] 42 => [officer] 43 => [hindenburg] 44 => [francie] 45 => [wife, housewife] 46 => [france] 47 => [offer, market, bid, produce, extend, supply, proposition, project, reward] 48 => [mind, judgment, notice, brain, tend, decision] 49 => [follow, comply, choose, guard, trace, carry, accompany, obey, result, imitate, ascend, watch, observe] 50 => [perhap] 51 => [number, issue, list, base, symbol, figure, size, edition, constant, amount, square, turn, total, company, performance] 52 => [young, animal] 53 => [poor] 54 => [live, dissipate, breathe, camp, people, survive, occupy, taste, exist, board, swing] 55 => [body, property, colony, system, church, school, college, thickness, softness, mass, representation, opposition, public] 56 => [half, moiety] 57 => [free, clear, loose, discharge, liberate, relieve, smooth, forgive, release] 58 => [length, leg, diameter, circumference] 59 => [thought, content, suggestion, plan, consideration, idea, impression, inspiration, explanation, ideal] 60 => [sever, separate] 61 => [matery] 62 => [hear, discover, catch, learn] 63 => [antiqu] 64 => [olivia] 65 => [divorce] 66 => [highness] 67 => [accord, match, agree, grant] 68 => [ground, island, soil, neck, earth, view, reason, forest, plain, teach, background, fasten, land] 69 => [water, liquid, sound, element, main, hush, sea, food, ocean] 70 => [answer, solve, resolve, solution, field, response, reply, resolution, counter] 71 => [interest, enthusiasm, power, benefit, refer, diversion, fee, color, arouse, share, sake, behalf] 72 => [footnote, note] 73 => [form, strike, stamp, category, round, manner, throw, draw, plume, build, sort, mound, model, twist, type, connection, gestalt, add, cast, organize, topography, solid, description, layer, frame, kind, influence, terrace, variety, style, hill, document, blow, spring, column, 
cup, strain, shape, appearance] 74 => [course, direction, education, track, path] 75 => [nee] 76 => [country, open, anchorage, haunt, retreat, scene, kingdom, space, ally]

45

77 => [account, story, relationship, bill, report, profit] 78 => [marshal, gather] 79 => [price, worth, cost, rig] 80 => [roman] 81 => [save, favor, spend, reserve, prevent, deliver, hoard] 82 => [marry] 83 => [find, translate, feel, happen, sight, chance, encounter, sense] 84 => [single] 85 => [fall, shine, loss, break, fail, shrink, season, drop, descend, set, rain, yield, pitch, drip, sin, diminish] 86 => [pressure, compel, press] 87 => [coin, threepence, medallion, crown, penny, quarter, real, sixpence] 88 => [orsino] 89 => [outer] 90 => [historic] 91 => [crit] 92 => [keeper] 93 => [money, fund] 94 => [white, bone, whiteness, alabaster] 95 => [william] 96 => [kill, destruction, stone, sacrifice, fell, dismember, destroy, death, poison, execute] 97 => [best, attempt] 98 => [hope, trust, encouragement, promise] 99 => [opportun] 100 => [early] 101 => [pretty] 102 => [large] 103 => [beauty, glory] 104 => [full, fully, entire] 105 => [intellectu] 106 => [feature, bear, read, possess, temple, chin, cheek, wear] 107 => [indian] 108 => [manufacturer, producer] 109 => [rome] 110 => [remain, stand, stick, rest, persist, continue, linger] 111 => [brought] 112 => [small, minor] 113 => [hold, admit, surround, maintain, protect, cover, weather, defend, fetter, harbour, arrest, lock, sustain, stock, declare, support, nurse, include, apply, retain, book, sleep] 114 => [hair, eyebrow, coat] 115 => [probable] 116 => [success, bite, victory] 117 => [short, suddenly, tract] 118 => [heart, bosom, sum, substance, courage, stuff, marrow, nerve] 119 => [appear, rise, perform, glitter, reappear, occur, manifest] 120 => [consider, compare, debate, expect, regard, reckon, abstract, deliberate, deal, contemplate, weigh] 121 => [alway] 122 => [left] 123 => [fellow, friend, familiar, chap, companion, associate, colleague] 124 => [angry, wild] 125 => [royal] 126 => [german]

46

127 => [foreign] 128 => [boy, son] 129 => [perfect, better] 130 => [priggery] 131 => [cesario] 132 => [hydrogen] 133 => [mycelium] 134 => [song] 135 => [filament] 136 => [janet] 137 => [marriage, union] 138 => [goose] 139 => [sketch, resume, describe, outline] 140 => [gold, golden, yellow] 141 => [thu] 142 => [vary, alter, drift, contradict, differ] 143 => [speak, tone, murmur, mouth, talk, bark, sing, address, converse, mumble, whisper] 144 => [felt] 145 => [light, weak, twilight, flood, expression, burn] 146 => [wrote] 147 => [help, avail, lift, worker, expedite, assistance, resource, provide, relief, attendant] 148 => [continu] 149 => [true] 150 => [attack, pepper, storm, assail, savage, blast, touch, approach, jump, fire, criticism, rush, stroke] Displaying: 150 / 2234