Masters Computing Minor Thesis
Concept based Tree Structure Representation for
Paraphrased Plagiarism Detection
By
Kiet Nim
A thesis submitted for the degree of
Master of Science (Computer and Information Science)
School of Computer and Information Science
University of South Australia
November 2012
Supervisor
Dr. Jixue Liu
Associate Supervisor
Dr. Jiuyong Li
Declaration

I declare that this thesis presents original work conducted by myself and does not incorporate, without reference, any material previously submitted for a degree in any university. To the best of my knowledge, the thesis does not contain any material previously published or written except where due acknowledgement is made in the text.
Kiet Nim
November 2012
Acknowledgements

I would like to express my sincere gratitude to my supervisors, Dr. Jixue Liu and Dr. Jiuyong Li, professors and researchers at the University of South Australia, for their dedicated support, professional advice, feedback and encouragement throughout this study. In addition, I would like to thank all of my course coordinators for their dedicated and in-depth teaching. Finally, I would like to thank my family for their constant encouragement and full support throughout my study in Australia.
Abstract

In the era of the World Wide Web, searching for information can be performed easily with the support of numerous search engines and online databases. However, this also makes the task of protecting intellectual property from abuse more difficult. Plagiarism is one such dishonest behavior. Most existing systems can efficiently detect literal plagiarism, where an exact copy is made or only minor changes are applied. In cases where plagiarists use intelligent methods to hide their intentions, these plagiarism detection (PD) systems usually fail to detect plagiarized documents.
The concept based tree structure representation is a potential solution for paraphrased plagiarism detection – paraphrasing being one of the intelligent plagiarism tactics. By exploiting WordNet as background knowledge, a concept-based feature can be generated. This additional feature, in combination with the traditional term-based feature and the term-based tree structure, can enhance document representation. In particular, the modified model not only captures syntactic information, as the term-based model does, but also discovers hidden semantic information in a document. Consequently, semantically similar documents can be detected and retrieved.
The contributions of the modified structure are twofold. Firstly, a real-time prototype for high level plagiarism detection is proposed in this study. Secondly, the additional concept-based feature provides considerable improvements for the task of Document Clustering, in that more semantically related documents can be grouped into the same clusters even though they are expressed in different ways. Consequently, the task of Document Retrieval can retrieve more relevant documents on the same topics.
Table of Contents

Declaration .......................................................................................................................... ii
Acknowledgements ............................................................................................................ iii
Abstract .............................................................................................................................. iv
List of Figures ................................................................................................................... vii
List of Tables ................................................................................................................... viii
Chapter 1 – Introduction ..................................................................................................... 1
1.1. Background .......................................................................................................... 1
1.2. Motivations........................................................................................................... 1
1.3. Fields of Thesis .................................................................................................... 2
1.4. Research Question ................................................................................................ 2
1.5. Contributions ........................................................................................................ 4
Chapter 2 – Literature Review ............................................................................................ 5
2.1. Plagiarism Taxonomy .......................................................................................... 5
2.2. Document Representation .................................................................................... 6
2.2.1. Flat Feature Representation .......................................................................... 6
2.2.2. Structural Representation .............................................................................. 9
2.3. Plagiarism Detection Techniques ....................................................................... 11
2.4. Limitations ......................................................................................................... 12
Chapter 3 – Methodology ................................................................................................. 14
3.1. Document Representation and Indexing ............................................................ 14
3.1.1. Term based Vocabulary Construction ........................................................ 14
3.1.2. Concept based Vocabulary Construction .................................................... 15
3.1.3. Document Representation ........................................................................... 16
3.1.4. Document Indexing ..................................................................................... 18
3.2. Source Detection and Retrieval .......................................................................... 19
3.3. Detail Plagiarism Analysis ................................................................................. 20
3.3.1. Paragraph Level Plagiarism Analysis ......................................................... 20
3.3.2. Sentence Level Plagiarism Analysis ........................................................... 21
Chapter 4 – Experiments ................................................................................................... 22
4.1. Experiment Initialization .................................................................................... 22
4.1.1. The Dataset and Workstation Configuration .............................................. 22
4.1.2. Performance Measures and Parameter Configuration ................................ 23
4.2. Source Detection and Retrieval for Literal Plagiarism ...................................... 23
4.3. Source Detection and Retrieval for Paraphrased Plagiarism ............................. 25
4.4. Study of Parameters ........................................................................................... 27
4.4.1. Size of Term based Vocabulary ............................................................ 27
4.4.2. Size of Concept based Vocabulary ....................................................... 28
4.4.3. Dimensions of Term based PCA feature .................................................... 29
4.4.4. Dimensions of Concept based PCA feature ................................................ 30
4.4.5. Contribution of the Weights and ...................................................... 31
Chapter 5 – Conclusion ..................................................................................................... 33
5.1. Concluding Remarks .......................................................................................... 33
5.2. Future Works ...................................................................................................... 34
References ......................................................................................................................... 35
Appendix A – Source code of the Modified Porter Algorithm ......................................... 38
Appendix B – Output Example of a Term based Vocabulary .......................................... 40
Appendix C – Output Example of a Concept based Vocabulary ...................................... 43
List of Figures

Figure 1 - Taxonomy of Plagiarism (Alzahrani et al. 2012) ........................................................... 7
Figure 2 - Term-Document Matrix (Marksberry 2011) ................................................................... 8
Figure 3 - Singular Value Decomposition of term-document matrix A (Letsche et al. 1997) ........ 9
Figure 4 - 3 layer document-paragraphs-sentences tree representation (Zhang et al. 2011) ......... 11
Figure 5 - Comparison of the original & modified Porter Stemmers ........................................... 14
Figure 6 - Data structure of Term-based Vocabulary .................................................................... 15
Figure 7 - Example of looking for synonyms, hypernyms and hyponyms .................................... 16
Figure 8 - Data structure for the concept-based Vocabulary ......................................................... 16
Figure 9 - Concept based Document Tree Representation ............................................................ 18
Figure 10 - 2 level SOMs for document-paragraph-sentence document tree (Chow et al. 2009) . 19
Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism ........................ 25
Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism ............... 26
Figure 13 - Performance based on different sizes of Term based Vocabulary .............................. 28
Figure 14 - Performance based on different sizes of Concept based Vocabulary ......................... 29
Figure 15 - Performance based on different dimensions of Term based PCA feature .................. 30
Figure 16 - Performance based on different dimensions of Concept based PCA feature ............. 31
List of Tables

Table 1 - Configuration of Parameters for Literal Plagiarism .......................................... 23
Table 2 - Source Detection & Retrieval for Literal Plagiarism ........................................ 24
Table 3 - Configuration of Parameters for Paraphrased Plagiarism ................................. 25
Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism ............................... 26
Table 5 - Performance based on different sizes of Term-based Vocabulary .................... 27
Table 6 - Performance based on different sizes of Concept based Vocabulary ................ 28
Table 7 - Performance based on different dimensions of Term based PCA feature ......... 30
Table 8 - Performance based on different dimensions of Concept based PCA feature .... 31
Table 9 - Performance based on different values of and ...................................... 32
Chapter 1 – Introduction
1.1. Background

In the era of the World Wide Web, more and more documents are being digitized and made available for remote access. Searching for information has become even easier with the support of a variety of search engines and online databases. However, these advantages also make the task of protecting intellectual property from abuse more difficult. One such dishonest behavior is plagiarism. It is clear that plagiarism has caused significant damage to intellectual property. Most cases have been detected in academic work such as student assignments and research. Lukashenko et al. [1] define plagiarism as the activity of "turning of someone else's work as your own without reference to original source".
Several systems and algorithms have been developed to tackle this problem. However, most of them can only detect word-by-word plagiarism, also referred to as literal plagiarism. These are cases in which plagiarists make an exact copy or only minor changes to original sources. But in cases where significant changes are made, most of these "flat feature" based methods fail to detect plagiarized documents [2]. This type is referred to as intellectual or intelligent plagiarism, which includes text manipulation, translation and idea adoption.

In this study, my focus is to improve an existing structural model and conduct several experiments to test the detection of one tactic of text manipulation – Paraphrasing.
1.2. Motivations

Paraphrasing is a strategy of intellectual plagiarism used to bypass systems that only detect exact copies or plagiarized documents with minor modifications. For instance, one popular and widely used academic plagiarism detection system is Turnitin. Turnitin can detect word-by-word copying efficiently down to the sentence level. However, by simply paraphrasing detected terms using their synonyms/hyponyms/hypernyms or similar phrases, it can be bypassed easily. Paraphrasing is just one of many existing intelligent tactics, and plagiarism in general has become more and more sophisticated [3]. Therefore, it is urgent to have more powerful mechanisms to protect intellectual property from high level plagiarism.
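To make the tactic concrete, the sketch below substitutes terms using a small hand-made synonym table (the table and sentence are invented for this illustration; a real paraphraser, or detector, would draw on a thesaurus such as WordNet). The surface form changes while the meaning survives, which is exactly what defeats word-for-word matching:

```python
# Toy illustration of the paraphrasing tactic: replacing detected terms
# with synonyms keeps the meaning but changes the surface form that
# word-for-word matchers compare. SYNONYMS is a hand-made example table.
SYNONYMS = {
    "big": "large",
    "rapid": "fast",
    "vehicle": "car",
    "purchase": "buy",
}

def paraphrase(sentence: str) -> str:
    """Substitute each word with a synonym where one is known."""
    words = sentence.lower().split()
    return " ".join(SYNONYMS.get(w, w) for w in words)

original = "purchase a big rapid vehicle"
rewritten = paraphrase(original)
print(rewritten)  # surface form now differs almost entirely
print(set(original.split()) & set(rewritten.split()))  # tiny word overlap
```

A detector comparing raw word histograms of `original` and `rewritten` would find almost no overlap, even though the sentences are semantically identical.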
In different plagiarism detection (PD) systems, documents are represented by different
non-structural or structural schemes. Non-structural or flat feature based representations
are the earliest mechanisms for document representation. Systems such as COPS [4] and
SCAM [5] are typical applications of these representation schemes where documents are
firstly broken into small chunks of words or sentences. These chunks are then hashed and
registered against a hash table to perform document retrieval (DR) and PD. Systems in
[6-8] use character or word n-grams as the units for similarity detection. All these flat feature based systems ignore the contextual information of words/terms in a document. Structural representations have therefore been developed recently to overcome this limitation. Among these schemes, two promising candidates that can capture rich textual information are graph [9, 10] and tree structure representations [2, 11-13]. The applications of structural representation have shown significant improvements in the tasks of DR, document clustering (DC) and PD. However, the majority of both non-structural and structural representation schemes are based on word histograms or features derived from word histograms. They can effectively detect literal plagiarism but are not strong enough for intelligent plagiarism detection.
In this research, I focus on analyzing the tree structure representation and the studies of Chow et al. [2, 11-13]. In their works, a document is hierarchically organized into layers. In this way, the tree can capture not only syntactic but also semantic information of a document. While the root represents global information or the main topics of a document, other layers capture local information or sub-topics, and the leaf level can be used to perform detailed comparison. Their proposed models have significantly improved the accuracy of DC, DR and PD. However, the features used to represent each layer are still derived from the term-based Vocabulary and, hence, their systems show limitations when performing intelligent plagiarism detection.
Therefore, this study provides an extension to the term-based tree structure representation, in particular to the features used to represent each layer, in order to detect one specific type of high level plagiarism – Paraphrasing. The modified representation is referred to as the Concept based Tree Structure Representation.
1.3. Fields of Thesis

Document Representation; Information Retrieval; Plagiarism Detection; Text Mining.
1.4. Research Question

As outlined in section 1.2, most existing PD systems implement only flat-feature based representation for DC, DR and PD. Firstly, even though there have been some recent applications of structural representation, the features used in those schemes are still derivatives of word histograms, which ignore semantic similarity between words or terms. Consequently, semantically similar documents might be considered unrelated. Secondly, plagiarism has evolved and become more sophisticated, with multiple forms including text manipulation, translation and idea adoption. These tactics can easily bypass systems based only on flat features. Even though structural feature based systems have been proved more effective than flat feature based systems, they still suffer from such devious techniques. Therefore, it is urgent to develop either new or additional features to
improve current structural feature based systems in order to protect intellectual property
from being abused by high level plagiarism.
This thesis presents, in detail, the study and development of a new mechanism to detect one particular tactic of sophisticated plagiarism – Paraphrasing. The aim of the research is to extend the structural model studied by Chow et al. in [2, 12, 13]. The original tree structure representation, based solely on word histograms, is enhanced with an additional feature that captures multi-dimensional semantic information, referred to as the Concept based feature. The ultimate aim of the study is to answer the following question: "Is the Concept based feature, in combination with the tree structure representation and the Term based feature, capable of discovering plagiarism by paraphrasing and, potentially, higher level plagiarism?"
In addition to the main research question, there are multiple sub-questions that need to be addressed, including:

- What tree model is used to represent a document?
- How are the two types of features constructed for each layer?
- Why are document organization and indexing necessary?
- What is the scheme for candidate detection and retrieval?
- What is the scheme for detailed plagiarism analysis?
To answer the main research question, the experiments focus on examining how the concept-based feature, in combination with the term-based tree structure representation, contributes to the tasks of document organization, document retrieval and paraphrased plagiarism detection. However, the experiment model can only be built once all research sub-questions are answered; they are addressed in detail in the Methodology chapter.
As a brief overview of the methodology: in the modified structural representation, each node of the tree is represented by two derived vectors, one of terms and one of concepts. To overcome the "Curse of Dimensionality" caused by the lengths of these vectors, Principal Component Analysis (PCA) [14], a well-known technique for dimensionality reduction, is further applied. For the number of tree layers, I choose the 3-layer document-paragraph-sentence model to represent a document. The task of document organization is also taken into consideration by applying the Self-Organizing Map (SOM) clustering technique [15]. In document retrieval, only documents in the same areas are compared, since comparing documents on different topics serves no purpose [16]; e.g. a CIS paper is compared against a collection of CIS papers rather than against biology papers. To generate the concept-based feature, I use the external background knowledge source WordNet to first generate the concept-based Vocabulary.
After that, the concept-based feature is derived from this Vocabulary and used together with the term-based feature to represent a document.
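The mapping from terms to concepts can be sketched minimally as follows, assuming a WordNet-style resource that maps each term to a synset id and its hypernym ids (the tiny lexicon and the `synset:` ids below are invented for illustration; the thesis uses WordNet itself):

```python
# Minimal sketch of the concept-based Vocabulary idea: terms map to a
# concept (synset) id plus the ids of their hypernyms. LEXICON is a
# made-up, WordNet-like stand-in for the real resource.
LEXICON = {
    # term: (own concept id, hypernym concept ids)
    "car":        ("synset:car",   ["synset:vehicle"]),
    "automobile": ("synset:car",   ["synset:vehicle"]),
    "truck":      ("synset:truck", ["synset:vehicle"]),
}

def concepts(term: str) -> set:
    """Return the concept ids a term contributes to the concept feature."""
    if term not in LEXICON:
        return set()
    own, hypernyms = LEXICON[term]
    return {own, *hypernyms}

# "car" and "automobile" are different terms but share identical concepts,
# so documents using either word yield the same concept-based feature.
print(concepts("car") == concepts("automobile"))  # True
print(concepts("truck") & concepts("car"))        # shared hypernym concept
```

This is precisely what lets the concept-based feature match paraphrased text: synonym substitution changes the term vector but leaves the concept vector largely intact.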
1.5. Contributions

In this thesis, the C-PCA-SOM 3-stage prototype for high level Plagiarism Detection is introduced. The three stages are: Stage 1 – Document Representation & Indexing; Stage 2 – Source Detection & Retrieval; and Stage 3 – Detail Plagiarism Analysis. In addition, because constant processing time was achieved in the experiments, the prototype can support real-time applications for Document Representation, Document Clustering, Document Retrieval and, potentially, Paraphrased Plagiarism Detection.
Through experiments, it is verified that the introduction of the additional Concept-based feature can improve the performance of Source Detection and Retrieval compared with models based solely on the Term-based feature. Furthermore, it is also shown that the enhanced tree structure representation not only captures syntactic information, as the original scheme does, but also discovers hidden semantic information in a document. By capturing multi-dimensional information about a document, the task of Document Clustering is improved in that more semantically related documents can be grouped into meaningful clusters even though they are expressed differently. As a result, Document Retrieval also benefits, since more documents on the same topics can be detected and retrieved.
Chapter 2 – Literature Review

This chapter provides a comprehensive overview of the literature. Section 2.1 covers different types of plagiarism. Section 2.2 discusses a variety of document representation schemes, including non-structural (flat feature based) and structural representations. Existing plagiarism detection techniques are outlined in section 2.3. Finally, the limitations of these PD techniques and representation schemes are discussed in section 2.4 as opportunities for improvement and further study.
2.1. Plagiarism Taxonomy

When humans began producing papers as part of intellectual documentation, plagiarism also came into existence. Documentation and plagiarism exist in parallel, but they are entirely different sides of the same coin. While one contributes to the knowledge body of human society, the other causes serious damage to intellectual property. Recognizing this, the academic community has developed many techniques to fight plagiarism. However, the battle against this phenomenon is a lifelong one, since plagiarism has also evolved and become more sophisticated. Therefore, to engage such a devious adversary efficiently, it is necessary to have a scheme to identify and classify different types of plagiarism into meaningful categories. Many studies have been conducted to perform this task [1, 3, 17]. Lukashenko et al. [1] point out different types of plagiarism activities, including:
- Copy-paste plagiarism (word-for-word copying).
- Paraphrasing (using synonyms/phrases to express the same content).
- Translated plagiarism (copying content expressed in another language).
- Artistic plagiarism (expressing plagiarized works in different formats, such as images or text).
- Idea plagiarism (extracting and using others' ideas).
- Code plagiarism (copying others' program code).
- Improper use of quotation marks.
- Misinformation in references.
More precisely, Alzahrani et al. [3] use a taxonomy that classifies plagiarism into two main categories: literal and intelligent plagiarism (Fig. 1). In the former, plagiarists make an exact copy or only a few changes to original sources; this type of plagiarism can be detected easily. The latter case is much more difficult to detect, because plagiarists try to hide their intentions by using intelligent ways of changing the original sources. These tactics include text manipulation, translation and idea adoption. In text manipulation, plagiarists try to change the appearance of the text while keeping its semantic meaning or idea. Paraphrasing, a commonly performed tactic of text manipulation, transforms text appearance by using synonyms, hyponyms, hypernyms or equivalent phrases. In this research, my main focus is to detect
this type of intelligent plagiarism. Plagiarism based on translation is also known as cross-lingual plagiarism: offenders can use translation software to copy text written in other languages and thereby bypass monolingual systems. Finally, Alzahrani et al. consider idea adoption to be the most serious and dangerous type of plagiarism, since stealing ideas from others' works without proper referencing is the most disrespectful action toward their authors and toward intellectual property. This type of plagiarism is also the hardest to detect, because the plagiarized text might not carry any syntactic information similar to the original sources, and because the ideas being plagiarized can be extracted from multiple parts of the original documents.
2.2. Document Representation
Since there are a vast number of documents available online and many more are uploaded every day, the demand for efficiently organizing and indexing them for fast retrieval continually poses challenges for the research community. Many schemes have been developed and improved to represent a document more effectively. Instead of using a whole document as a query, these representation schemes can be applied to perform many text processing tasks such as classification, clustering, document retrieval and plagiarism detection. This section discusses two main strategies of document representation as well as available techniques to detect plagiarism.
2.2.1. Flat Feature Representation
One of the most popular and widely used models is the Vector Space Model (VSM) [18]. In this model, a vector weighted by term frequency and document frequency is calculated based on a pre-constructed Vocabulary. This Vocabulary is the list of the most frequent words/terms and is derived from a given training corpus. The scheme used to perform term weighting is TF-IDF: term frequency (TF) counts the number of occurrences of each term in a specific document, while inverse document frequency (IDF) is based on the number of documents that contain a specific term. In the VSM model, a vector of word histograms is constructed for each document, and all vectors together form the term-document matrix (Fig. 2) [19]. The similarity between two documents is calculated by applying the Cosine distance function to their vectors [20]. One drawback of the VSM model is that the vectors used to represent documents are usually lengthy, due to the size of the Vocabulary, and hence not scalable to large datasets.
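The TF-IDF weighting and Cosine similarity just described can be sketched on a toy three-document corpus (the documents and vocabulary are invented for illustration; real systems would also stem and filter stop words):

```python
import math
from collections import Counter

# Toy corpus: three "documents" as lists of words.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vocab = sorted({w for d in docs for w in d})  # pre-constructed Vocabulary

def tfidf(doc):
    """TF-IDF weighted vector of a document over the Vocabulary."""
    tf = Counter(doc)
    n = len(docs)
    vec = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)      # document frequency
        idf = math.log(n / df) if df else 0.0    # inverse document frequency
        vec.append(tf[w] * idf)
    return vec

def cosine(a, b):
    """Cosine similarity between two weighted vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

v0, v1, v2 = (tfidf(d) for d in docs)
print(cosine(v0, v1))  # docs 0 and 1 share terms, so similarity is positive
print(cosine(v0, v2))  # docs 0 and 2 share no exact terms
```

Note that `cosine(v0, v2)` is exactly zero even though "cat"/"cats" and "dog"/"dogs" are near-synonymous, which is the term-matching weakness the concept-based feature of this thesis targets.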
Figure 2 - Term-Document Matrix (Marksberry 2011)
To overcome the Curse of Dimensionality in VSM, Latent Semantic Indexing (LSI) [21] was proposed to project lengthy vectors onto a lower number of dimensions while preserving semantic information. This is done by mapping the space spanned by those lengthy vectors to a lower dimensional subspace, based on the Singular Value Decomposition (SVD) of the VSM term-document matrix (Fig. 3) [22]. Another approach to dimensionality reduction and feature compression is the Self-Organizing Map (SOM) [15]. In a SOM, similar documents are organized close to each other; instead of being represented by a word histogram vector, each document is represented by its winner neuron, or Best Matching Unit, on the map. Applications of SOM such as VSM-SOM [23], WEBSOM [24], LSISOM [25] and those in [23, 26] have shown considerable speed-ups in document clustering and retrieval. SOM can be combined not only with flat feature representation but also with the structural representations discussed in the next section.
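The winner-neuron idea can be sketched as follows, assuming a tiny pre-trained 2x2 map (the neuron weight vectors and the document vector below are made up for illustration; a real SOM would learn its weights from the corpus):

```python
import math

# A toy 2x2 SOM: each grid position holds a learned neuron weight vector.
som = {  # (row, col) -> neuron weight vector
    (0, 0): [1.0, 0.0, 0.0],
    (0, 1): [0.0, 1.0, 0.0],
    (1, 0): [0.0, 0.0, 1.0],
    (1, 1): [0.5, 0.5, 0.0],
}

def bmu(doc_vec):
    """Return the grid position of the Best Matching Unit for a document."""
    def dist(w):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(doc_vec, w)))
    return min(som, key=lambda pos: dist(som[pos]))

# The document is now represented by a single grid coordinate instead of
# its full (possibly very long) histogram vector.
print(bmu([0.9, 0.1, 0.0]))  # nearest neuron is at (0, 0)
```

Retrieval then only needs to compare a query against documents mapped to the same or neighboring neurons, which is the source of the speed-ups reported for the SOM-based systems above.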
Figure 3 - Singular Value Decomposition of term-document matrix A (Letsche et al. 1997)
Since relying only on a "bag of words" might not be enough, many further studies propose adding extra features alongside term-based flat features to enhance document representation. In [27], Xue et al. propose using distributional features in combination with the traditional term frequency to improve text categorization. The proposed features include the compactness of the appearances of a word and the position of its first appearance. Based on these features, they assign a specific weight to a word according to its compactness and position; for instance, authors are likely to state the main content in the earlier parts of a document, so words appearing in these parts are considered more important and assigned higher weights. Another approach to "enrich" document representation is to utilize external background knowledge such as WordNet, Wikipedia or thesaurus dictionaries. In [28], Hu et al. use Wikipedia to derive two additional features, concept-based and category-based, alongside the conventional term-based feature. Their experiments show significant improvements in document clustering. Similar applications of external background knowledge can be found in [28-33]. The study presented in this thesis applies WordNet instead of Wikipedia to generate the concept-based feature and uses it to enhance document representation.
2.2.2. Structural Representation
By using only word histogram vectors to represent documents, flat feature representation ignores the contextual usage and relationships of terms throughout a document [2], leading to the loss of semantic information. In addition, two documents might be contextually different even though they have the same term distribution. Recognizing this serious limitation, many studies have been carried out to develop new ways of representing a document that capture not only syntactic but also semantic information. These new schemes are referred to collectively as structural representation.
To capture semantic information, Schenker et al. [9] propose using a directed graph model to represent documents. The graph structure consists of two components: nodes and edges. Nodes (vertices) are the terms appearing in a document, weighted by their number of appearances. Edges link nodes together and indicate relationships between terms; an edge is only formed between two terms that appear immediately next to each other in a sentence. Chow et al. [10] also study the directed graph and further develop another type of graph, the undirected graph. Their directed model likewise considers the order of term occurrence in a sentence, while the undirected model considers the connections between terms without taking their usage order into account. They further perform Principal Component Analysis (PCA) for dimensionality reduction and SOM for document organization. Their experiments show significant improvements compared with other single-feature based approaches.
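The directed-graph construction described above can be sketched as follows (the sentences are toy input, and preprocessing such as stemming and stop-word removal is omitted for brevity):

```python
from collections import Counter

def build_graph(sentences):
    """Build a directed term graph: nodes weighted by term frequency,
    edges formed between immediately adjacent terms in a sentence."""
    nodes, edges = Counter(), Counter()
    for sent in sentences:
        terms = sent.lower().split()
        nodes.update(terms)                   # node weight = appearances
        edges.update(zip(terms, terms[1:]))   # adjacent pairs, in order
    return nodes, edges

nodes, edges = build_graph(["the cat chased the mouse",
                            "the mouse ran"])
print(nodes["the"])             # term weight
print(edges[("the", "mouse")])  # directed edge weight
```

Because the edges keep the order of adjacent terms, two documents with identical word histograms but different word order produce different graphs, which is exactly the contextual information that flat features discard.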
Another group of models that can capture both syntactic and semantic information of a document is the group of tree-based representation models. The earliest study of the tree structure representation was conducted by Si et al. [16]. They observe that it is unnecessary to compare documents addressing different subjects. Their model organizes a document according to its structure and hence forms a tree: a document may contain many sections, a section may contain many subsections, a subsection might again have many sub-subsections, and so on. This mechanism significantly improves the effectiveness of document comparison, since lower-level comparisons can be terminated at any level of the tree once the Cosine similarity measure exceeds a user-defined value. However, lengthy vectors at each level and a potentially high number of layers make their model unscalable to large corpora. The most recent works of Chow et al. [2, 11-13] use a fixed number of layers (2 or 3), reduced-size term-based Vocabularies and PCA compression to make the tree structure representation applicable to large datasets. To minimize the time complexity of document retrieval, they further apply SOM to organize documents according to their similarities [2, 11]. Fig. 4 shows the 3-layer document-paragraph-sentence tree representation studied in [12], which is the model I focus on improving in this study. Other choices of layers and of representation units for the layers can be found in [2, 11, 13].
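The layering idea can be sketched as a nested structure; the splitting heuristics (blank lines for paragraphs, full stops for sentences) are simplifying assumptions of this sketch, not the preprocessing used in [12]:

```python
def build_doc_tree(text):
    """3-layer tree: root = document, children = paragraphs, leaves = sentences."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {
        "document": text,
        "paragraphs": [
            {"text": p,
             "sentences": [s.strip() for s in p.split(".") if s.strip()]}
            for p in paragraphs
        ],
    }

tree = build_doc_tree("First sentence. Second sentence.\n\nAnother paragraph.")
```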
Figure 4 - 3 layer document-paragraphs-sentences tree representation (Zhang et al. 2011)
2.3. Plagiarism Detection Techniques

According to Lukashenko et al. [1], the task of fighting plagiarism is divided into plagiarism prevention and plagiarism detection. The main difference between the two classes is that detection requires less time to implement but achieves only a short-term positive effect. Although prevention methods are time-consuming to develop and deploy, they have a long-term effect and are therefore considered the more significant approach to fighting plagiarism effectively. Prevention, unfortunately, is a global issue and cannot be solved by a single institution. Therefore, most existing techniques fall into the detection category, and much research has been conducted to develop more powerful plagiarism detection techniques.
Alzahrani et al. [3] categorize plagiarism detection techniques into two broad trends: intrinsic and extrinsic plagiarism detection. Intrinsic PD techniques analyze a suspicious document locally, i.e. without collecting a set of candidate documents for comparison [34, 35]. These approaches employ novel analyses based on authors' writing styles: since each writer is assumed to have a unique writing style, a change in writing style signals a potential plagiarism case. Features used for this type of PD are stylometric features based on text statistics, syntactic features, POS features, closed-class word sets and structural features. Extrinsic PD techniques, on the other hand, compare query documents against a set of source documents. Most existing PD systems deploy extrinsic techniques.
There are several common steps in extrinsic PD. Firstly, Document Indexing stores all registered documents in databases for later retrieval. Secondly, Document Retrieval fetches the candidates most likely to have been plagiarized by a given query document. Finally, Exhaustive Analysis is carried out between the candidates and the query document to locate the plagiarized parts.
In extrinsic PD, the majority of exhaustive analysis methods partition all documents into blocks (n-grams or chunks) [4, 36-38]. The units in each block can be characters, words, sentences, paragraphs, etc. These blocks are hashed and registered in a hash table. To perform PD, suspicious documents are likewise divided into small blocks and looked up in the hash table; matching blocks are then retrieved for detailed comparison. COPS [4] and SCAM [5] are two typical implementations of these approaches. According to [2, 16], these methods are inapplicable to large corpora because the number of documents keeps growing over time. Furthermore, they can be bypassed easily by making small changes at the sentence level.
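The chunk-hashing scheme described above can be sketched as a toy word-trigram fingerprint; this is in the spirit of COPS/SCAM, not their actual algorithms:

```python
def fingerprint(text, n=3):
    """Hash every word n-gram of a document into a set (chunk registration)."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}

source = fingerprint("the quick brown fox jumps over the lazy dog")
suspect = fingerprint("a quick brown fox jumps over a lazy dog")
shared = len(source & suspect)   # chunks the suspect shares with the source
```

Changing a single word removes every n-gram that spans it from the shared set, which is exactly why small sentence-level edits can defeat this family of methods.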
It is noticeable that the methods mentioned above apply flat features only and ignore the contextual information of how words/terms are used throughout a document: two documents with the same term distribution might still be contextually different. To tackle this problem, PD systems that utilize structural representation have been proposed [2, 12, 16]. These approaches significantly improve the performance of extrinsic plagiarism detection. Since documents are hierarchically organized into multiple levels, the comparison between query and candidate documents can be terminated at any level where the amount of dissimilarity exceeds a user-defined threshold. Experiments with these structure-based models have shown better performance compared with flat-feature-based systems.
2.4. Limitations

Most existing PD systems are implemented based on flat-feature representation. As mentioned in 2.1, they cannot capture the contextual usage of words/terms throughout a document and can be bypassed easily with minor modifications of the original sources. Structural-representation-based PD systems have made significant improvements by capturing this richer textual information. By organizing documents hierarchically, structural models capture not only syntactic but also semantic information of a document. Recent studies have shown important contributions of structural representation to document organization tasks such as classification and clustering [2, 11, 13]. Consequently, the task of plagiarism detection has also been improved, both through reduced time complexity and through higher detection accuracy: the most relevant documents are retrieved first to narrow the processing scope, and further comparisons are terminated at levels where the documents are sufficiently dissimilar.
Although it has been proved that structural representation can be applied to detect literal plagiarism efficiently, structural-representation-based PD systems still show some limitations in detecting intelligent plagiarism. For example, plagiarists can paraphrase, replacing detectable words/terms with their synonyms, hyponyms or hypernyms, to bypass detection by these systems. The problem arises from the term-based Vocabulary: in this type of Vocabulary, terms with similar meanings are treated as unrelated. For instance, large, huge and enormous carry similar meanings and are interchangeable in usage, yet they are considered different terms. Therefore, no feature derived from this Vocabulary is strong enough to detect sophisticated plagiarism: by replacing the words/terms of an original sentence with semantically similar words/terms, a plagiarist makes the plagiarized sentence appear to be an unrelated sentence.
In order to discover similar sentences even when they are expressed in different ways, my research exploits the external background knowledge source WordNet to construct an additional type of Vocabulary, called the Concept-based Vocabulary. The additional Vocabulary is built by grouping words with similar meanings in the Term-based Vocabulary into one concept. This Vocabulary is then used to generate an additional feature, called the Concept-based feature, to enrich the representation of a document.
Chapter 3 – Methodology

This section outlines the main techniques applied to develop the prototype for paraphrased plagiarism detection. It can be referred to as the 3-stage prototype, comprising Stage 1 – Document Representation & Indexing, Stage 2 – Source Detection & Retrieval and Stage 3 – Detail Plagiarism Analysis. Stage 1 is discussed in Section 3.1, covering the construction of the two types of Vocabulary, the extraction of the two corresponding types of feature to represent a document and, subsequently, the application of SOM to organize documents into meaningful clusters. Section 3.2 gives the details of Stage 2, namely how the data stored in Stage 1 is used to perform fast original-source identification and retrieval. Finally, Stage 3 of the prototype performs detailed plagiarism analysis based on the candidate documents retrieved in Stage 2; its mechanism is outlined in Section 3.3.
3.1. Document Representation and Indexing
3.1.1. Term based Vocabulary Construction
The construction of the term-based Vocabulary is straightforward. Firstly, term extraction is carried out on a training corpus. After that, Word Stemming is applied to transform terms into their simple forms; for example, words such as "computes", "computing" and "computed" are all reduced to "compute". Because the original Porter stemming algorithm only creates "stems" rather than words in their simple forms, making it impossible to look them up in an English dictionary or thesaurus, I have modified the Porter algorithm (the source code, written in Perl, is provided in Appendix A). The modified version stops at the stage where words are in or near their simple forms. As a result, it is possible to search for these words' synonyms, hypernyms and hyponyms via, for example, a thesaurus. Fig. 5 depicts the difference between the original and modified Porter stemmers.
Figure 5 - Comparison of the original & modified Porter Stemmers
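To illustrate the intent of the modification (the real stemmer is the Perl program in Appendix A), here is a deliberately tiny suffix-rule sketch that stops at dictionary-like forms rather than bare stems; it is far too naive for general text and only mirrors the "compute" example:

```python
def simple_form(word):
    """Toy reduction of inflected forms toward dictionary look-up forms.
    Each rule strips a suffix and restores a trailing 'e' where needed."""
    word = word.lower()
    for suffix, replacement in (("ing", "e"), ("ed", "e"), ("es", "e"), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

forms = {w: simple_form(w) for w in ("computes", "computing", "computed")}
```

Unlike a bare Porter stem such as "comput", the output "compute" can be looked up in a thesaurus, which is the property the modified stemmer is after.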
After stemming, Stop Word Removal is performed to remove insignificant words such as "a", "the", "are", etc. Finally, the TF-IDF (Term Frequency – Inverse Document Frequency) weighting scheme is used to weight the significance of each word throughout the corpus. The weights of all terms are then ranked from highest to lowest (most to least significant). In the same way as Chow et al. [2, 12], the first N1 terms are selected to form the Vocabulary V1 used for the Document and Paragraph levels of the tree structure representation, and the first N2 terms are selected to form the Vocabulary V2 used for the Sentence level, where N1 is much larger than N2. The data structure of the two term-based Vocabularies is shown in Fig. 6.
Figure 6 - Data structure of Term-based Vocabulary
The data structure is simply an array of terms. Each item contains two values: the term string and its corresponding TF-IDF weight, used for sorting. An output example of this type of Vocabulary produced by the implemented program is provided in Appendix B.
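A compact sketch of this construction, assuming whitespace tokenization and one standard TF-IDF variant (the exact weighting formula used in the thesis may differ):

```python
import math
from collections import Counter

def tfidf_vocabulary(docs, n):
    """Rank corpus terms by their summed TF-IDF weight and keep the top n."""
    doc_terms = [doc.lower().split() for doc in docs]
    df = Counter()                         # document frequency of each term
    for terms in doc_terms:
        df.update(set(terms))
    weights = Counter()                    # summed TF-IDF weight per term
    for terms in doc_terms:
        for term, count in Counter(terms).items():
            weights[term] += (count / len(terms)) * math.log(len(docs) / df[term])
    return [term for term, _ in weights.most_common(n)]

vocab = tfidf_vocabulary(["a b b c", "a c c d", "a d"], 3)
```

Note how "a", which occurs in every document, gets IDF 0 and drops out of the ranking, while the rarer "b" ranks highest.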
3.1.2. Concept based Vocabulary Construction
In order to construct the additional concept-based Vocabulary, one of the background
knowledge – WordNet, the lexical database for English language [39] – is exploited.
WordNet is developed by Miller began in 1985 and has been applied in many text-related
processing tasks such as document clustering [31], document retrieval [30, 32] and also
for word-sense disambiguation. In WordNet; nouns, verbs, adjectives and adverbs are
distinguished and organized into meaningful sets of synonyms. In this section, the
mechanism to utilize WordNet for the construction of the concept-based Vocabulary is
discussed in details.
Firstly, for each term T in the term-based Vocabulary, its synonyms, hypernyms and hyponyms are extracted from the WordNet database using the synonym-hypernym-hyponym relationships of the ontology. The result of this step is a "bag" of terms similar to T; for example, Fig. 7 illustrates the result of finding synonyms, hypernyms and hyponyms for the word "absent". After that, these terms are checked against the term-based Vocabulary: any term that does not appear in the Term-based Vocabulary is removed, and the remaining terms together form one concept. Clearly, a term that is absent from the Term-based Vocabulary does not appear in the corpus and must therefore be removed. This procedure is repeated over the whole Term-based Vocabulary to obtain the Concept-based Vocabulary. Fig. 8 shows the data structure of the additional Vocabulary.
Figure 7 - Example of looking for synonyms, hypernyms and hyponyms
Figure 8 - Data structure for the concept-based Vocabulary
The data structure is an array of pointers. Each pointer can be considered one concept and points to the list of actual words/terms that make up that concept. An example output of a Concept-based Vocabulary is provided in Appendix C.
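The concept-construction procedure can be sketched as below; the WordNet lookup is mocked with a small dictionary (`RELATED`) so the example stays self-contained, whereas the real system queries the WordNet database:

```python
# Hypothetical stand-in for WordNet's synonym/hypernym/hyponym lookup.
RELATED = {
    "large": ["big", "huge", "enormous", "vast"],
    "big": ["large", "huge"],
    "car": ["auto", "vehicle", "truck"],
}

def build_concepts(term_vocabulary, related=RELATED):
    """For each term, keep only related terms that also occur in the
    term-based Vocabulary; the survivors together form one concept."""
    vocab = set(term_vocabulary)
    return [[term] + [t for t in related.get(term, []) if t in vocab]
            for term in term_vocabulary]

concepts = build_concepts(["large", "huge", "car"])
```

Here "enormous" and "vast" are dropped because they are not in the (toy) term Vocabulary, mirroring the filtering step described above.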
Similar to the construction of the Vocabularies V1 and V2, the first M1 concepts are selected to form the Vocabulary C1 used for the Document and Paragraph levels and, similarly, the first M2 concepts are selected to form the Vocabulary C2 used for the Sentence level (M1 is also much larger than M2).
3.1.3. Document Representation
After the two types of Vocabulary are constructed, they are stored to the hard drive, and the computation of each document's tree representation can begin. In this study, I choose the Document-Paragraph-Sentence 3-layer tree representation model of [12]. Following Zhang et al., each document is first partitioned into paragraphs and each paragraph is then partitioned into sentences. This process builds the 3-layer tree representation of each document: the root node represents the whole document, the second layer captures information about the paragraphs of the document, and each paragraph has its sentences at the corresponding leaf nodes.
The modification of the original tree structure is carried out in the feature construction stage for each layer. For all layers, term extraction, stemming and stop word removal are applied so that only significant terms are extracted. For the top and second layers, term-based vectors are derived as usual by checking and weighting a document's terms that appear in the term-based Vocabulary V1. At the same time, a mapping process maps those terms to their concepts based on the Vocabulary C1; the weight of a concept is the sum of all its elements' weights. For the bottom layer, instead of word histograms, an "appearance indices of terms" vector is used to indicate the absence/presence of the corresponding terms in a sentence, similar to [12]. In addition, an "appearance indices of concepts" vector is used to indicate the absence/presence of the corresponding concepts in the sentence. Up to this stage, each node of the tree is represented by two features: the term-based feature and the additional concept-based feature.
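The term-to-concept mapping at the upper layers reduces to a weight summation; the names below are illustrative:

```python
def concept_weights(term_weights, concepts):
    """The weight of a concept is the sum of its member terms' weights."""
    return [sum(term_weights.get(t, 0.0) for t in members) for members in concepts]

tw = {"large": 0.5, "huge": 0.25, "car": 0.125}
cw = concept_weights(tw, [["large", "huge"], ["car"]])
```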
To overcome the "Curse of Dimensionality", the Principal Component Analysis (PCA) algorithm is applied to compress the features at the Document and Paragraph levels. PCA is a well-known tool for feature compression and dimensionality reduction. The same training corpus used for constructing the Vocabularies is reused to calculate two PCA rotation matrices, independently for the term and concept features. The matrices are also stored on the hard disk so that they can be applied to query documents later, in the Source Detection and Retrieval stage. The PCA-projected features are calculated as follows:

    y = xW    (1)

where x = (x1, x2, ..., xk) is the normalized term- or concept-based histogram with k dimensions, W is the k x l PCA rotation matrix and y is the resulting PCA-compressed feature with reduced dimension l (l is much smaller than k).
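A numerical sketch of computing a rotation matrix from training features and applying a projection of the form y = xW; mean-centring before the eigendecomposition is a standard PCA step assumed here, and all names are my own:

```python
import numpy as np

def pca_rotation(X, l):
    """k x l rotation matrix: the top-l principal directions of training data X (n x k)."""
    Xc = X - X.mean(axis=0)                 # mean-centre (standard PCA step)
    cov = Xc.T @ Xc / (len(X) - 1)          # k x k covariance matrix
    _, eigvecs = np.linalg.eigh(cov)        # eigenvalues come back in ascending order
    return eigvecs[:, ::-1][:, :l]          # keep the l largest directions

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                # 50 training vectors, k = 8
W = pca_rotation(X, 3)                      # stored, then reused for query documents
y = X[0] @ W                                # project one feature vector
```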
Finally, the data of all documents’ trees is stored to later perform the similarity
calculation between suspicious and original documents, paragraphs or sentences for both
Source Detection – Retrieval and Detail Plagiarism Analysis. The concept-based tree
structure representation of a document is illustrated in Fig. 9.
Figure 9 - Concept based Document Tree Representation
3.1.4. Document Indexing
According to Si et al. [16], it is unnecessary to compare documents mentioning different
topics. Therefore, document organization is crucial to avoid redundant comparisons and
minimize processing time as well as computational complexity. For this reason, one of
the powerful clustering techniques is applied to organize similar documents into same
clusters. The chosen clustering method is Self Organizing Map (SOM) due to its
flexibility and time efficiency. All documents in the training dataset have their trees
organized on the map. 2 SOM maps are constructed independently for the root and
paragraph levels in the same manner as Chow et al. [2].
Initially, the SOM of the paragraph level is built by mapping the PCA-compressed term- and concept-based features of all paragraphs of all documents onto the map. The results of the second-level SOM are then used as part of the inputs to the root-level SOM: the features of the root of each document's tree, combined with the winner neurons (also known as Best Matching Units, BMUs) of the document's child paragraphs, form the input to the top SOM map. The compound input is subsequently mapped to its nearest BMU on the root SOM. The mapping process is repeated a number of times so that all similar documents converge on the two maps. Eventually, the data of the SOMs is stored for fast source detection and retrieval in Stage 2. Fig. 10 illustrates how a document tree is organized, or mapped, onto the document- and paragraph-level SOMs in [2].
Figure 10 - 2 level SOMs for document-paragraph-sentence document tree (Chow et al. 2009)
In addition, it is worth noticing that using the output of the paragraph-level SOM as part of the input to the document-level SOM effectively compresses local information into global information. In [2], this process was shown to improve the accuracy of detection and retrieval.
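A minimal SOM sketch (random sampling, Gaussian neighbourhood, linearly decaying rate and radius); the training schedule and grid size are generic choices for illustration, not the configuration used in [2]:

```python
import numpy as np

def bmu(weights, x):
    """Grid coordinates of the Best Matching Unit for input vector x."""
    d = ((weights - x) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

def train_som(data, rows=6, cols=8, iters=200, lr0=0.5, seed=0):
    """Each step pulls the BMU and its grid neighbours toward a random sample,
    with linearly decaying learning rate and neighbourhood radius."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    radius0 = max(rows, cols) / 2
    for t in range(iters):
        x = data[rng.integers(len(data))]
        decay = 1 - t / iters
        r, c = bmu(weights, x)
        dist2 = ((coords - np.array([r, c])) ** 2).sum(axis=-1)
        h = np.exp(-dist2 / (2 * (radius0 * decay) ** 2 + 1e-12))
        weights += lr0 * decay * h[..., None] * (x - weights)
    return weights

docs = np.random.default_rng(1).normal(size=(30, 10))   # 30 document feature vectors
som = train_som(docs)
cell = bmu(som, docs[0])          # the grid cell a document is indexed under
```

Retrieval then amounts to finding a query's BMU and returning the documents indexed under that cell (and, if needed, its neighbours).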
3.2. Source Detection and Retrieval

Stage 2 of the prototype detects and retrieves the source documents, i.e. relevant candidates, for given suspicious documents. For each query document, the stored term- and concept-based Vocabularies are first loaded to build the query's tree representation, in the same way the trees of the corpus documents were built in Stage 1. Secondly, the stored PCA projection matrices are loaded and feature compression based on these matrices is performed on the query tree representation. After that, the root node of the query tree is used to find its Best Matching Unit on the document-level SOM, and n candidate documents associated with that BMU are retrieved. If the number of documents associated with the BMU is less than n, the remaining documents are retrieved from the BMU's nearest neighbours, which contain the documents most similar to those in the BMU.
For the n candidate documents, the summed Cosine distance of the term- and concept-based PCA vectors between the query document and each candidate is computed. The summed Cosine distance, or overall similarity, is defined as follows:

    D(q, c) = wt · d(Tq, Tc) + wc · d(Cq, Cc)    (2)

where q and c are the query and candidate documents, T denotes the term-based PCA-projected features, C denotes the concept-based PCA-projected features and d is the Cosine distance function. The overall similarity is thus the sum of the individual similarities of the different types of feature, with wt and wc the weights used to balance the importance of the term- and concept-based features. In the experiments, different weights are assigned to each feature to study the degree to which each feature contributes to the overall performance of source detection and retrieval.
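A direct transcription of this weighted distance; the weight names (wt, wc) and default value 0.5 follow the experimental setting described later, and the tiny vectors are made up for illustration:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) *
                        math.sqrt(sum(b * b for b in v)))

def overall_distance(term_q, term_c, concept_q, concept_c, wt=0.5, wc=0.5):
    """Weighted sum of term- and concept-feature Cosine distances, as in (2)."""
    return (wt * cosine_distance(term_q, term_c) +
            wc * cosine_distance(concept_q, concept_c))

# Identical term vectors (distance 0), orthogonal concept vectors (distance 1).
d = overall_distance([1, 0, 1], [1, 0, 1], [0, 1], [1, 0])
```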
After the calculation of the summed Cosine distances, they are ranked in ascending order and the t documents whose distances do not exceed the user-defined similarity threshold θd are chosen for further analysis. The threshold θd is calculated from the smallest summed distance dmin as follows:

    θd = (1 + ε) · dmin    (3)

where ε ∈ [0, 1]. Note that ε = 0 is equivalent to single-source retrieval, i.e. only the most similar document is retrieved.
3.3. Detail Plagiarism Analysis

The third stage of the prototype calculates local similarities in order to identify the candidate paragraphs most similar to each suspicious paragraph of the query document. By doing so, exhaustive sentence comparison for unrelated paragraphs is avoided and the detection process is sped up. Sentence comparison is then carried out between each sentence of the suspicious paragraph and the sentences of the candidate paragraphs to locate potential plagiarism cases. These cases are finally summarized and reported to the user for human assessment.
3.3.1. Paragraph Level Plagiarism Analysis
For each suspicious paragraph of the query document, similarly, its nearest BMU on the
paragraph level SOM is identified. However, only the retrieval of paragraphs that belong
to the t detected candidates in Stage 2 is taken into consideration. The retrieval of other
paragraphs is excluded even though they are also associated with the BMU. It is simply
due to the rationale that their root nodes are different from the suspicious paragraph’s
root node (i.e., documents mention different topics serving no purpose).
After all candidate paragraphs have been retrieved, the summed Cosine distances between these paragraphs and the corresponding suspicious paragraph are calculated, again using formula (2). Next, these distances are ranked in ascending order and the first t' paragraphs whose distances do not exceed the similarity threshold θp are selected for the exhaustive sentence-level plagiarism analysis. The threshold θp is defined as follows:

    θp = (1 + εp) · dmin    (4)

where εp ∈ [0, 1] and εp = 0 corresponds to single plagiarized-paragraph detection, i.e. only the most similar paragraph is retrieved.
3.3.2. Sentence Level Plagiarism Analysis
After the most relevant paragraphs have been retrieved for each suspicious paragraph, the exhaustive comparison of all sentences of each pair of original and suspicious paragraphs is performed using the corresponding leaf nodes of the tree representations. Because appearance indices of terms and concepts, rather than histograms, are used as features for the bottom layer, the calculation of sentence similarity is slightly different: the overall similarity of two sentences is defined as the amount of overlap between their terms and concepts, as follows:

    S(q, c) = wt · overlap(Tq, Tc) + wc · overlap(Cq, Cc)    (5)

where T and C denote the appearance-index vectors of terms and concepts of the query sentence q and the candidate sentence c. The overall overlap between a query sentence and a candidate sentence is thus the sum of the individual overlaps of the different types of feature. If the summed overlap is larger than the overlap threshold α ∈ [0.5, 1], the pair of sentences is considered a plagiarism case. The user can flexibly change the overlap threshold to detect more or fewer plagiarism cases; for example, α = 0.8 means that any pair of sentences with an overlap of more than 80% is considered a plagiarism case. The exhaustive process is repeated for the remaining pairs of paragraphs. Finally, all plagiarism cases are presented to the user for human assessment.
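A sketch of the sentence-level check; the exact overlap normalization is not fully specified above, so normalizing the shared 1-positions by the larger appearance count is one plausible choice, and the vectors and names are illustrative:

```python
def overlap(a, b):
    """Shared 1-positions of two binary appearance-index vectors,
    normalized by the larger number of appearances."""
    shared = sum(x & y for x, y in zip(a, b))
    longer = max(sum(a), sum(b))
    return shared / longer if longer else 0.0

def sentence_similarity(term_q, term_c, concept_q, concept_c, wt=0.5, wc=0.5):
    """Weighted sum of term- and concept-index overlaps, as in (5)."""
    return wt * overlap(term_q, term_c) + wc * overlap(concept_q, concept_c)

sim = sentence_similarity([1, 1, 0, 1], [1, 1, 0, 0], [1, 0, 1], [1, 0, 1])
flagged = sim > 0.8                   # alpha = 0.8, as in the example above
```

The concept overlap is what lets a paraphrase that swaps a term for a synonym still score highly: the term indices diverge, but the concept indices stay aligned.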
Chapter 4 – Experiments

This section outlines the experiments carried out to test the performance of the 3-stage prototype. Section 4.1 introduces the dataset used for training and testing plagiarism detection, as well as the configuration of the experiment workstation. Up to the point of writing this thesis, I have conducted experiments on the Source Detection and Retrieval performance of the implemented system; further experiments testing the full functionality of the prototype are addressed in Chapter 5 as future work. For the experiments on Source Detection and Retrieval, I compare the results of the prototype (C-PCA-SOM) against a variety of systems, including the original tree-based retrieval (PCA-SOM) in [2] and the traditional VSM model. As the candidate for latent semantic compression, many previous studies have chosen the LSI model; in this study, I instead provide the comparison with the PCA model. All comparative models are slightly modified to use the same modified Porter stemmer as the implemented model. For the two SOM-based models, only the top SOM maps are involved in document retrieval; the contribution of the second-layer SOM maps is temporarily ignored. Section 4.2 provides the results of Source Detection and Retrieval for Literal Plagiarism and Section 4.3 for Paraphrased Plagiarism. In addition, I carry out empirical tests to study the contribution of different parameters, such as the weights (wt, wc) and the dimensions of the term- and concept-based PCA features, to the accuracy and optimization of the C-PCA-SOM system. The details are reported in Section 4.4.
4.1. Experiment Initialization
4.1.1. The Dataset and Workstation Configuration
The Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) dataset is used to test the implemented system in this study. This dataset formed part of PAN 2010, the international competition on plagiarism detection, and is available for download at http://www.webis.de/research/corpora/corpus-pan-pc-10. In detail, there are 7,859 candidate documents in total making up the original corpus. The test set also contains 7,859 suspicious documents, one per corpus document, i.e. each document in the corpus has exactly one plagiarized counterpart in the test dataset. The test set consists of 3,792 documents of literal (non-paraphrased) plagiarism cases and 4,067 documents of paraphrased plagiarism cases. For each type of plagiarism, I construct multiple pairs of sub-corpus and sub-test set, with sizes of 50, 100, 200 and 400 randomly selected documents, in order to evaluate the C-PCA-SOM system at different data scales. For each sub-corpus, the processes described in Stage 1 of the 3-stage prototype are performed first to organize its documents; the corresponding sub-test set is then used for evaluation.
The experiments, computation of PCA rotation matrices and SOM clustering are
conducted on a PC with 2.2 GHz Core 2 Duo CPU and 2GB of RAM.
4.1.2. Performance Measures and Parameter Configuration
To provide comparable results between the different models, Precision and Recall are applied to evaluate their performance. They are computed as follows:

    Precision = (number of correctly retrieved documents) / (total number of retrieved documents)    (6)

    Recall = (number of correctly retrieved documents) / (total number of relevant documents)    (7)

Since each original document has exactly one plagiarized document, it suffices to consider whether the first retrieved document is the correct candidate. Hence, the scaling parameter in formula (3) is set to zero (ε = 0). At this stage, it is assumed that the contributions of the term- and concept-based features are equivalent, so the balancing weights in formula (2) are configured as wt = wc = 0.5. The empirical study of these parameters is presented later, in Section 4.4.
Apparently, Precision and Recall are equal under this setting on the Webis-CPC-11 dataset. Therefore, I use PR to indicate both measures and further add the "No. of correct retrievals" measure to indicate the number of correctly detected source documents for the suspicious documents in the test sets.
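The measures reduce to a simple computation; with one document retrieved per query and one relevant source per query, the denominators of (6) and (7) coincide, which is why PR can stand for both:

```python
def precision_recall(n_correct, n_retrieved, n_relevant):
    """Formulas (6) and (7)."""
    return n_correct / n_retrieved, n_correct / n_relevant

# One retrieval and one relevant source per query, e.g. 91 correct out of 100.
p, r = precision_recall(91, 100, 100)
```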
4.2. Source Detection and Retrieval for Literal Plagiarism

To begin with, the parameters for each sub-corpus are assigned as in Table 1. The implemented model, C-PCA-SOM, uses all of these parameters, while PCA-SOM ignores the concept-based Vocabulary and the concept-based PCA feature dimensions. The PCA model uses only the term-based Vocabulary and the term-based PCA feature dimensions, and the VSM model uses only the term-based Vocabulary to construct its term-document matrix. In addition, I set wt = wc = 0.5, as mentioned above, so that the term- and concept-based features contribute equally.
Corpus size | Term vocab. size | Concept vocab. size | T/C PCA dimensions | SOM size | SOM iterations
50          | 1500             | 1000                | 40/40              | 6 x 8    | 100
100         | 2500             | 2000                | 80/80              | 7 x 8    | 150
200         | 3500             | 2500                | 130/130            | 8 x 8    | 200
400         | 5000             | 3500                | 220/220            | 8 x 9    | 200

Table 1 - Configuration of Parameters for Literal Plagiarism
The results of the different models are reported in Table 2 and Fig. 11. The diagram illustrates the PRs of the different systems in detecting source documents for Literal Plagiarism cases. The C-PCA-SOM model produces competitive results compared with the other models for single-source detection. For the corpus sizes of 100 and 200, the C-PCA-SOM system is even slightly better than the PCA-SOM, which lacks the concept-based feature. For the corpus size of 50, all systems generate the same result of 96%. It is observed that PCA and VSM tend to perform better in single-source detection cases: although their retrieval takes more time, these models compare each query document with all documents in the corpus, so the possibility of missing the real candidate is low. For SOM-based models such as C-PCA-SOM and PCA-SOM, fast retrieval depends completely on the results of the earlier clustering process; as clarified in Section 4.4, a change in any parameter can affect the accuracy of document clustering and, consequently, of document retrieval.
Corpus size | Algorithm | No. of correct retrievals | PR
50          | C-PCA-SOM | 48/50   | 0.96
50          | PCA-SOM   | 48/50   | 0.96
50          | PCA       | 48/50   | 0.96
50          | VSM       | 48/50   | 0.96
100         | C-PCA-SOM | 91/100  | 0.91
100         | PCA-SOM   | 88/100  | 0.88
100         | PCA       | 91/100  | 0.91
100         | VSM       | 91/100  | 0.91
200         | C-PCA-SOM | 182/200 | 0.91
200         | PCA-SOM   | 181/200 | 0.905
200         | PCA       | 185/200 | 0.925
200         | VSM       | 187/200 | 0.935
400         | C-PCA-SOM | 355/400 | 0.8875
400         | PCA-SOM   | 355/400 | 0.8875
400         | PCA       | 359/400 | 0.8975
400         | VSM       | 361/400 | 0.9025

Table 2 - Source Detection & Retrieval for Literal Plagiarism
Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism
4.3. Source Detection and Retrieval for Paraphrased Plagiarism

In the same manner as the test for Literal Plagiarism, the parameters are configured first; Table 3 shows the parameter configuration for the specific corpora. The results of Source Detection and Retrieval for Paraphrased Plagiarism cases are reported in Table 4 and Fig. 12. The weights wt and wc are kept the same as in Section 4.2.

Surprisingly, in the case of Paraphrased Plagiarism, the PCA-SOM model performs better than C-PCA-SOM in detecting the corresponding candidates. In addition, since only global information is involved in retrieval, the exhaustive VSM and PCA still produce better results in finding the single source document for a suspicious document; this may be because the overall topic of a paraphrased document remains largely the same as that of the original. To explain the differing performance of the two SOM-based models, I further investigate the contribution of the concept-based feature to the performance of clustering and, later, retrieval. At this stage, it can be hypothesized that the concept-based feature introduces noise into the clustering process. To clarify whether this is the case, I later try different values of the weights wt and wc; the results are reported in Section 4.4.
Corpus size | Term vocab. size | Concept vocab. size | T/C PCA dimensions | SOM size | SOM iterations
50          | 1700             | 1100                | 45/45              | 6 x 8    | 100
100         | 2700             | 2300                | 90/90              | 7 x 8    | 150
200         | 3800             | 3000                | 140/140            | 8 x 8    | 200
400         | 5500             | 4000                | 240/240            | 8 x 9    | 200

Table 3 - Configuration of Parameters for Paraphrased Plagiarism
Corpus size | Algorithm | No. of correct retrievals | PR
50          | C-PCA-SOM | 44/50   | 0.88
50          | PCA-SOM   | 46/50   | 0.92
50          | PCA       | 43/50   | 0.86
50          | VSM       | 46/50   | 0.92
100         | C-PCA-SOM | 83/100  | 0.83
100         | PCA-SOM   | 86/100  | 0.86
100         | PCA       | 90/100  | 0.9
100         | VSM       | 93/100  | 0.93
200         | C-PCA-SOM | 160/200 | 0.8
200         | PCA-SOM   | 167/200 | 0.835
200         | PCA       | 174/200 | 0.87
200         | VSM       | 183/200 | 0.915
400         | C-PCA-SOM | 274/400 | 0.685
400         | PCA-SOM   | 288/400 | 0.72
400         | PCA       | 310/400 | 0.775
400         | VSM       | 333/400 | 0.8325

Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism
Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism
4.4. Study of Parameters

This section provides a comprehensive empirical study of the influence of different parameters on the performance of the C-PCA-SOM model, including: the sizes of the term- and concept-based Vocabularies in Sections 4.4.1 and 4.4.2, the dimensions of the term- and concept-based PCA features in Sections 4.4.3 and 4.4.4 and, lastly, the contribution of the weighting parameters wt and wc in Section 4.4.5. The experiments are carried out on a compound corpus of size 300 containing both literal and paraphrased plagiarism cases; 150 documents of each type of plagiarism are randomly chosen to achieve more reliable outcomes. SOM-based document clustering and retrieval are the two processes affected by a change in any parameter, and they are therefore performed again at each change. Note that, as the following sections show, performance also varies slightly between runs with the same set of parameters.
4.4.1. Size of Term based Vocabulary
After performing term extraction, word stemming, stop word removal and concept construction, we obtain the full term-based Vocabulary of 9868 distinct terms and the full concept-based Vocabulary of 6815 distinct concepts. In the experiments, different sizes of the term-based Vocabulary are assigned to test the contribution of this size to the performance of the C-PCA-SOM prototype. The other parameters are kept fixed as follows: concept-based Vocabulary size = 4000, T/C PCA dimensions = 200/200, SOM size = 8 x 9, SOM training iterations = 150, term and concept weights = 0.5 each.
Table 5 and Fig. 13 illustrate the performance of the C-PCA-SOM system with different choices of the term-based Vocabulary size. It can be seen that this size does not much affect the accuracy of Source Detection and Retrieval: Precision/Recall fluctuates between 85.66% and 87.3%. However, the optimum configuration in this case is achieved at a size of around 6000.
Term vocab size   No of correct retrieval   PR
3000              261/300                   0.87
4000              259/300                   0.863
5000              257/300                   0.8566
6000              262/300                   0.873
7000              259/300                   0.863
8000              261/300                   0.87

Table 5 - Performance based on different sizes of Term-based Vocabulary
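The vocabulary-size parameter studied above amounts to truncating a TF-IDF-ranked term list. A minimal sketch of such truncation, assuming pre-processed documents given as whitespace-separated term strings; the simplified corpus-level TF * IDF weighting and the `build_vocabulary` helper are illustrative assumptions, not the thesis's exact scheme:

```python
import math
from collections import Counter

def build_vocabulary(docs, size):
    """Rank terms by a corpus-level TF-IDF weight and keep the top `size`."""
    tf, df = Counter(), Counter()
    for doc in docs:
        terms = doc.split()
        tf.update(terms)         # corpus-wide term frequency
        df.update(set(terms))    # document frequency
    n = len(docs)
    weight = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return sorted(weight, key=weight.get, reverse=True)[:size]
```

Terms appearing in every document receive an IDF of zero and are the first to fall outside the size limit.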
Figure 13 - Performance based on different sizes of Term based Vocabulary
4.4.2. Size of Concept based Vocabulary
In this section, the influence of different sizes of the Concept-based Vocabulary on the system performance is studied. The size of the Term-based Vocabulary is fixed at the optimum value of 6000 found in Section 4.4.1. While the Concept-based Vocabulary size varies, the other parameters are kept the same as in 4.4.1 (term-based Vocabulary size = 6000, T/C PCA dimensions = 200/200, SOM size = 8 x 9, SOM training iterations = 150, term and concept weights = 0.5 each).
The results are documented in Table 6 and Fig. 14. Similarly, changing this size does not significantly change the accuracy of candidate retrieval. The highest PR value is 88%, corresponding to a Concept-based Vocabulary size of 6000 for the corpus size of 300.
Concept vocab size   No of correct retrieval   PR
2000                 259/300                   0.863
3000                 260/300                   0.866
4000                 254/300                   0.846
5000                 259/300                   0.863
6000                 264/300                   0.88

Table 6 - Performance based on different sizes of Concept based Vocabulary
Figure 14 - Performance based on different sizes of Concept based Vocabulary
4.4.3. Dimensions of Term based PCA feature
For the experiments on different dimensions of the Term-based PCA feature, the parameters are configured as in Section 4.4.2, except that the Concept-based Vocabulary size is now fixed at 6000, the value that provided the best PR in the previous section. The involved parameters are summarized as follows: term-based Vocabulary size = 6000, concept-based Vocabulary size = 6000, Concept PCA dimensions = 200, SOM size = 8 x 9, SOM training iterations = 150, term and concept weights = 0.5 each.
In this study, a clearer trend can be seen compared to the studies in 4.4.1 and 4.4.2. It is observed from the results (Table 7 and Fig. 15) that the change in the dimensions of the Term-based PCA feature can significantly influence the performance of the C-PCA-SOM model. Specifically, PR increases from 83.6% to 88% as the number of dimensions rises from 50 to 200. However, PR drops sharply (by more than 60%) from 250 dimensions onward. Based on this study, it is clear that it is unnecessary to use all terms to build the Term-based Vocabulary, because doing so might introduce "noisy" features that degrade system performance.
Term-based PCA dimensions   No of correct retrieval   PR
50                          251/300                   0.836
100                         260/300                   0.866
150                         264/300                   0.88
200                         264/300                   0.88
250                         83/300                    0.276
300                         87/300                    0.29

Table 7 - Performance based on different dimensions of Term based PCA feature
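The "PCA dimensions" parameter is the number of principal components kept when compressing the document feature vectors. A minimal sketch via the singular value decomposition; this is an illustration, as the thesis's PCA implementation details are not specified here:

```python
import numpy as np

def pca_reduce(X, k):
    """Project each row of X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                         # center the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                            # k-dimensional scores

# e.g. compressing term feature vectors down to 150 dimensions:
# reduced = pca_reduce(term_features, 150)
```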
Figure 15 - Performance based on different dimensions of Term based PCA feature
4.4.4. Dimensions of Concept based PCA feature
The parameters for the experiments on different dimensions of the Concept-based PCA feature are set as follows: term-based Vocabulary size = 6000, concept-based Vocabulary size = 6000, Term PCA dimensions = 150, SOM size = 8 x 9, SOM training iterations = 150, term and concept weights = 0.5 each. The only parameter modified is the dimensionality of the Term-based PCA feature, which is set to 150 (providing the best retrieval result mentioned in 4.4.3).
Table 8 and Fig. 16 summarize the experiment outcomes for various Concept-based PCA feature dimensions. In contrast to the Term-based PCA feature, increasing the Concept-based PCA feature dimensions can slightly enhance or slightly decrease the performance of the C-PCA-SOM system on Source Detection and Retrieval. It is reported
from the diagram that PR fluctuates from over 80% to nearly 90%. The highest PR of 89% is achieved with Concept-based PCA dimensions of 250.
Concept-based PCA dimensions   No of correct retrieval   PR
50                             242/300                   0.806
100                            260/300                   0.866
150                            255/300                   0.85
200                            253/300                   0.843
250                            267/300                   0.89
300                            256/300                   0.853

Table 8 - Performance based on different dimensions of Concept based PCA feature
Figure 16 - Performance based on different dimensions of Concept based PCA feature
4.4.5. Contribution of the Term and Concept Weights
Finally, to investigate the contributions of the Term- and Concept-based features, the usage of different values of the term and concept weights, corresponding to the assigned "degree of significance" of Terms and Concepts, is studied in detail here. For parameter configuration, only the number of dimensions of the Concept-based PCA feature is modified; it is set to 250, which produces the best result in 4.4.4. All parameters are summarized as follows: term-based Vocabulary size = 6000, concept-based Vocabulary size = 6000, T/C PCA dimensions = 150/250, SOM size = 8 x 9, SOM iterations = 150.
Table 9 provides the summary of the Source Detection and Retrieval results. It can be seen that (term weight = 1.0, concept weight = 0.0) and (term weight = 0.0, concept weight = 1.0) are two special cases. The former is the same as the PCA-SOM model, which does not utilize the Concept-based feature. The latter uses only the Concept-based feature for similarity calculation, i.e. the Term-based feature is ignored. Using the two types of feature independently can still achieve satisfactory performance, with PR of 87.6% and 84% respectively. However, the combination of Term- and Concept-based features can produce better results under an appropriate configuration of the weights. It is observed for the corpus size of 300 that the pair (term weight = 0.4, concept weight = 0.6) gives the highest PR of 88.3%. This study shows that the Concept-based feature can be used to improve Document Representation, Document Clustering, Document Retrieval and, potentially, Plagiarism Detection.
In conclusion, even though VSM and PCA can provide better results than PCA-SOM and C-PCA-SOM, they are impractical for large datasets. The two SOM-based models can be applied for real-time DR and PR thanks to their constant processing time. In addition, C-PCA-SOM, with the additional Concept-based feature, can achieve better performance by appropriately balancing the significance of the Term- and Concept-based features.
Term weight   Concept weight   No of correct retrieval   PR
1.0           0.0              263/300                   0.876
0.8           0.2              262/300                   0.873
0.6           0.4              261/300                   0.87
0.4           0.6              265/300                   0.883
0.2           0.8              254/300                   0.846
0.0           1.0              252/300                   0.84

Table 9 - Performance based on different values of the term and concept weights
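The weighted combination studied in Table 9 can be sketched as a blend of two per-feature similarities. Cosine similarity and the `combined_similarity` helper are illustrative assumptions; only the weighting scheme itself comes from the thesis:

```python
import numpy as np

def combined_similarity(term_a, term_b, concept_a, concept_b,
                        w_term=0.4, w_concept=0.6):
    """Blend term- and concept-feature similarities with the two weights.

    The 0.4/0.6 defaults mirror the best-performing pair in Table 9.
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return w_term * cos(term_a, term_b) + w_concept * cos(concept_a, concept_b)
```

Setting the weights to (1.0, 0.0) or (0.0, 1.0) recovers the two special cases discussed above.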
Chapter 5 – Conclusion
This chapter provides a summary of the thesis presented in the previous sections, consisting of 2 sub-sections. The development process of the proposed model is summarized in 5.1 and the work to be carried out in the future is outlined in 5.2.
5.1. Concluding Remarks
Protecting intellectual property from information abuses always poses challenges for the ethical community. Plagiarism, one of those dishonest behaviours, is the act of using the knowledge of other people without proper references. Systems deploying flat feature representations can effectively detect literal plagiarism, in which only minor changes are made. However, they are vulnerable against intelligent plagiarism. By using many intelligent tactics, plagiarists can transform the original text while still keeping the main ideas. Even though systems applying structural representation have been developed recently, they still utilize term-based features or derivatives of these features and, hence, show some limitations when dealing with intelligent plagiarism.
This thesis proposes an enhanced model of the term-based tree structure representation model to challenge one specific type of text manipulation tactic – Paraphrasing. The modified model is referred to as the concept based tree structure representation. It is improved by adding an additional feature called the Concept-based feature. The new feature is constructed by exploiting the ontology WordNet to group semantically similar terms into one concept. It is noticed that the term-based feature ignores this valuable information, since it considers all terms as unrelated. After the construction of the new feature, each layer of the original tree is no longer represented by only the term-based feature but by both the term- and concept-based features. Hence, the modified structure representation can capture not only syntactic but also semantic information of a document. To make the proposed structure applicable for real-time applications, the Principal Component Analysis technique is applied for dimensionality reduction, and the Self-Organizing Map clustering technique is applied to organize documents into meaningful clusters.
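The concept-construction step described above can be sketched as follows. The `SYNONYMS` table is a toy stand-in for WordNet synonym lookups (querying WordNet itself, e.g. through NLTK, is left out to keep the sketch self-contained), and the greedy grouping rule is an illustrative assumption:

```python
def build_concepts(terms, synonyms):
    """Group terms into concepts: a term joins the first concept it shares
    a synonym with; otherwise it starts a new single-term concept."""
    concepts = []
    for term in terms:
        related = synonyms.get(term, set()) | {term}
        for concept in concepts:
            if related & concept:      # shares a synonym with this concept
                concept.add(term)
                break
        else:
            concepts.append({term})
    return concepts

# Toy stand-in for WordNet synonym sets (cf. concept 33 in Appendix C).
SYNONYMS = {
    "girl": {"maid", "daughter", "woman"},
    "maid": {"girl"},
    "woman": {"girl"},
}
```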
This thesis also introduces the C-PCA-SOM 3-stage prototype as a real-time implementation of the enhanced model. The performance of the 3-stage prototype is tested through multiple experiments on the task of Source Detection and Retrieval. The reported results show that the prototype produces competitive performance compared with other systems, including VSM, PCA and PCA-SOM. Even though VSM and PCA are better in the case of single source detection and retrieval, it is impractical to apply them to large corpora. On the other hand, the 3-stage prototype can achieve real-time performance and, hence, can be deployed in practical systems. Furthermore, by studying multiple parameters affecting the performance of the C-PCA-SOM model, it is clarified that using both term- and concept-based features produces better results compared with using either feature alone.
5.2. Future Works
In future work, Stage 3 of the prototype will be fully tested in order to verify the contribution of the Concept-based feature to the task of Paraphrased Plagiarism Detection and Analysis. It can be seen that the paraphrasing technique can bypass systems such as Turnitin, which applies the Term-based feature. Therefore, it is expected that, by using the Concept-based feature, the C-PCA-SOM prototype can discover semantically similar sentences even though they are expressed differently. If positive results are achieved, it can be claimed that the Concept-based feature can be used to detect plagiarism by paraphrasing and, potentially, even higher levels of plagiarism.
In addition, it is reported that another type of semantic feature, called the Category-based feature, can be utilized to improve the task of Document Clustering [28]. Thus, I plan to study and integrate this feature into the concept-enhanced structure representation. In [28], Hu et al. use another form of background knowledge – the encyclopedia Wikipedia – to extract the Category-based feature. They tested it on DC and received positive outcomes. Therefore, part of the future work is to investigate the contribution of the Category-based feature and the application of Wikipedia for speeding up the processes of DC and DR.
Eventually, we have seen how different parameters, such as the term and concept weights, influence the performance of the C-PCA-SOM 3-stage prototype. Thus, it is also important to study in detail the automatic configuration of these parameters for the optimum performance of the prototype.
References
[1] R. Lukashenko, V. Graudina, and J. Grundspenkis, "Computer-based plagiarism
detection methods and tools: an overview," in Proceedings of the 2007
international conference on Computer systems and technologies, Bulgaria, 2007,
pp. 1-6.
[2] T. W. S. Chow and M. K. M. Rahman, "Multilayer SOM With Tree-Structured
Data for Efficient Document Retrieval and Plagiarism Detection," Neural
Networks, IEEE Transactions on, vol. 20, pp. 1385-1402, Sept. 2009.
[3] S. M. Alzahrani, N. Salim, and A. Abraham, "Understanding Plagiarism
Linguistic Patterns, Textual Features, and Detection Methods," Systems, Man, and
Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 42,
pp. 133-149, March 2012.
[4] S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital
documents," SIGMOD Rec., vol. 24, pp. 398-409, May 1995.
[5] N. Shivakumar and H. Garcia-Molina, "SCAM: A Copy Detection Mechanism for
Digital Documents," in 2nd International Conference in Theory and Practice of
Digital Libraries (DL 1995), Austin, Texas, 1995.
[6] C. Grozea, C. Gehl, and M. Popescu, "ENCOPLOT: Pairwise sequence matching
in linear time applied to plagiarism detection," in Proc. SEPLN, Donostia, Spain,
2009, pp. 10-18.
[7] R. Yerra and Y.-K. Ng, "A Sentence-Based Copy Detection Approach for Web
Documents," in Fuzzy Systems and Knowledge Discovery. vol. 3613, L. Wang
and Y. Jin, Eds., ed: Springer Berlin / Heidelberg, 2005, pp. 481-482.
[8] J. Koberstein and Y.-K. Ng, "Using Word Clusters to Detect Similar Web
Documents," in Knowledge Science, Engineering and Management. vol. 4092, J.
Lang, F. Lin, and J. Wang, Eds., ed: Springer Berlin / Heidelberg, 2006, pp. 215-
228.
[9] A. Schenker, M. Last, H. Bunke, and A. Kandel, "Classification of Web
documents using a graph model," in Document Analysis and Recognition, 2003.
Proceedings. Seventh International Conference on, 2003, pp. 240-244.
[10] T. W. S. Chow, H. Zhang, and M. K. M. Rahman, "A new document
representation using term frequency and vectorized graph connectionists with
application to document retrieval," Expert Systems with Applications, vol. 36, pp.
12023-12035, March 2009.
[11] M. K. M. Rahman and T. W. S. Chow, "Content-based hierarchical document
organization using multi-layer hybrid network and tree-structured features,"
Expert Systems with Applications, vol. 37, pp. 2874-2881, Sept. 2010.
[12] H. Zhang and T. W. S. Chow, "A coarse-to-fine framework to efficiently thwart
plagiarism," Pattern Recognition, vol. 44, pp. 471-487, 2011.
[13] H. Zhang and T. W. S. Chow, "A multi-level matching method with hybrid
similarity for document retrieval," Expert Systems with Applications, vol. 39, pp.
2710-2719, Feb. 2012.
[14] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis,"
Chemometrics and Intelligent Laboratory Systems, vol. 2, pp. 37-52, 1987.
[15] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, pp.
1464-1480, 1990.
[16] A. Si, H. V. Leong, and R. W. H. Lau, "CHECK: a document plagiarism detection
system," in Proceedings of the 1997 ACM symposium on Applied computing, San
Jose, California, United States, 1997, pp. 70-77.
[17] L. Sindhu, B. B. Thomas, and S. M. Idicula, "A Study of Plagiarism Detection
Tools and Technologies," International Journal of Advanced Research In
Technology, vol. 1, pp. 64-70, 2011.
[18] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic
indexing," Commun. ACM, vol. 18, pp. 613-620, 1975.
[19] P. Marksberry, "The Toyota Way – a quantitative approach," International
Journal of Lean Six Sigma, vol. 2, pp. 132-150, 2011.
[20] J. Zobel and A. Moffat, "Exploring the similarity space," ACM SIGIR Forum, vol.
32, pp. 18-34, 1998.
[21] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman,
"Indexing by Latent Semantic Analysis," Journal of the American Society for
Information Science, vol. 41, pp. 391-407, 1990.
[22] T. A. Letsche and M. W. Berry, "Large-scale information retrieval with latent
semantic indexing," Information Sciences, vol. 100, pp. 105-137, 1997.
[23] K. Lagus, "Text Retrieval Using Self-Organized Document Maps," Neural
Processing Letters, vol. 15, pp. 21-29, 2002.
[24] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM – Self-organizing
maps of document collections," Neurocomputing, vol. 21, pp. 101-117, 1998.
[25] N. Ampazis and S. Perantonis, "LSISOM — A Latent Semantic Indexing
Approach to Self-Organizing Maps of Document Collections," Neural Processing
Letters, vol. 19, pp. 157-173, April 2004.
[26] A. Georgakis, C. Kotropoulos, A. Xafopoulos, and I. Pitas, "Marginal median
SOM for document organization and retrieval," Neural Networks, vol. 17, pp.
365-377, 2004.
[27] X. Xue and Z. Zhou, "Distributional Features for Text Categorization,"
Knowledge and Data Engineering, IEEE Transactions on, vol. 21, pp. 428-442,
March 2009.
[28] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou, "Exploiting Wikipedia as
external knowledge for document clustering," in Proceedings of the 15th ACM
SIGKDD international conference on Knowledge discovery and data mining,
Paris, France, 2009, pp. 389-396.
[29] A. Hotho, S. Staab, and G. Stumme, "Ontologies improve text document
clustering," in Data Mining, 2003. ICDM 2003. Third IEEE International
Conference on, 2003, pp. 541-544.
[30] S. Liu, F. Liu, C. Yu, and W. Meng, "An effective approach to document retrieval
via utilizing WordNet and recognizing phrases," in Proceedings of the 27th
annual international ACM SIGIR conference on Research and development in
information retrieval, Sheffield, United Kingdom, 2004, pp. 266-272.
[31] J. Sedding and D. Kazakov, "WordNet-based text document clustering," in
Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural
Language Data, Geneva, 2004, pp. 104-113.
[32] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis, and E. E. Milios,
"Semantic similarity methods in wordNet and their application to information
retrieval on the web," in Proceedings of the 7th annual ACM international
workshop on Web information and data management, Bremen, Germany, 2005,
pp. 10-16.
[33] G. Spanakis, G. Siolas, and A. Stafylopatis, "Exploiting Wikipedia Knowledge
for Conceptual Hierarchical Clustering of Documents," The Computer Journal,
vol. 55, pp. 299-312, March 2012.
[34] S. Meyer zu Eissen, B. Stein, and M. Kulig, "Plagiarism detection without
reference collections," in Advances in data analysis, R. Decker and H.-J. Lenz,
Eds., ed Berlin, Heidelberg: Springer, 2007, pp. 359-366.
[35] S. Eissen and B. Stein, "Intrinsic Plagiarism Detection," in Advances in
Information Retrieval. vol. 3936, M. Lalmas, A. MacFarlane, S. Rüger, A.
Tombros, T. Tsikrika, and A. Yavlinsky, Eds., ed: Springer Berlin / Heidelberg,
2006, pp. 565-569.
[36] N. Shivakumar and H. Garcia-Molina, "Building a scalable and accurate copy
detection mechanism," in Proceedings of the first ACM international conference
on Digital libraries, Bethesda, Maryland, United States, 1996, pp. 160-168.
[37] A. Barrón-Cedeno and P. Rosso, "On automatic plagiarism detection based on n-
grams comparison," in Proc. 31st Eur. Conf. IR Res. Adv. Info. Retrieval, 2009,
pp. 696-700.
[38] E. Stamatatos, "Plagiarism detection using stopword n-grams," Journal of the
American Society for Information Science and Technology, vol. 62, pp. 2512-
2527, Sept. 2011.
[39] G. A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38,
pp. 39-41, 1995.
Appendix A – Source code of the Modified Porter Algorithm
Following is the source code of the Modified Porter Stemmer, written in Perl:
sub stem {
    my @parms = @_;
    foreach( @parms ) {
        $_ = lc $_;

        # Step 0 - remove punctuation
        s/'s$//;
        s/^[^a-z]+//;
        s/[^a-z]+$//;
        next unless /^[a-z]+$/;

        # step1a_rules
        if( /[^s]s$/ ) {
            s/sses$/ss/ || s/ies$/i/ || s/s$//
        }

        # step1b_rules. The business with rule==106 is embedded in the
        # boolean expressions here.
        (/[^aeiouy].*eed$/ && s/eed$/ee/ )
        ||
        ( s/([aeiou].*)ed$/$1/ || s/([aeiouy].*)ing$/$1/ )
        &&
        (   # step1b1_rules
            s/at$/ate/ || s/bl$/ble/ || s/iz$/ize/ ||
            s/bb$/b/   || s/dd$/d/   || s/ff$/f/   || s/gg$/g/ ||
            s/mm$/m/   || s/nn$/n/   || s/pp$/p/   || s/rr$/r/ ||
            s/tt$/t/   || s/ww$/w/   || s/xx$/x/   ||
            # This is wordsize==1 && CVC...addanE...
            s/^[^aeiouy]+[aeiouy][^aeiouy]$/$&e/
        )
        #DEBUG && warn "step1b1: $_\n"
        ;

        # step1c_rules
        #DEBUG warn "step1c: $_\n" if
        s/([aeiouy].*)y$/$1i/;

        # step2_rules
        if ( s/ational$/ate/ || s/tional$/tion/ || s/enci$/ence/ ||
             s/anci$/ance/   || s/izer$/ize/    || s/iser$/ise/  ||
             s/abli$/able/   || s/alli$/al/     || s/entli$/ent/ ||
             s/eli$/e/       || s/ousli$/ous/   || s/ator$/ate/ )
        {
            my ($l,$m) = ($`,$&);
            #DEBUG warn "step 2: l=$l m=$m\n";
            $_ = $l.$m unless $l =~ /[^aeiou][aeiouy]/;
        }

        # step3_rules
        if ( s/icate$/ic/ || s/ative$// || s/alize$/al/ ||
             s/ical$/ic/  || s/ful$// )
        {
            my ($l,$m) = ($`,$&);
            #DEBUG warn "step 3: l=$l m=$m\n";
            $_ = $l.$m unless $l =~ /[^aeiou][aeiouy]/;
        }

        # step4_rules
        if ( s/al$//  || s/able$// || s/ible$// || s/ou$// ||
             s/iti$// || s/ous$//  || s/ive$// )
        {
            my ($l,$m) = ($`,$&);
            # Look for two consonant/vowel transitions
            # NB simplified...
            #DEBUG warn "step 4: l=$l m=$m\n";
            $_ = $l.$m unless $l =~ /[^aeiou][aeiouy].*[^aeiou][aeiouy]/;
        }

        # step5b_rules
        #DEBUG warn("step 5b: $_\n") &&
        s/ll$/l/ if /[^aeiou][aeiouy].*[^aeiou][aeiouy].*ll$/;

        # Cosmetic step
        s/(.)i$/$1y/;
    }
    @parms;
}
The rules of the original Porter stemming algorithm can be referenced at
http://snowball.tartarus.org/algorithms/porter/stemmer.html
Appendix B – Output Example of a Term based Vocabulary
Following is the output of a stored Term-based Vocabulary constructed from a corpus of 50 documents. The entries are sorted by TF-IDF weight from highest to lowest (from most to least significant).
Displaying corpus vocabulary: will => 62.502581 law => 37.291657 father => 34.656919 work => 33.637597 love => 32.864670 baron => 29.315752 thing => 27.084467 time => 25.953446 point => 25.843677 year => 25.843677 henry => 24.383006 gener => 23.189831 well => 23.071794 good => 22.704138 state => 22.135748 great => 21.638097 prince => 21.436863 side => 21.081665 bagot => 20.154579 lord => 20.080123 priest => 19.873719 place => 19.742729 three => 18.561821 french => 18.452617 cry => 18.374454 viola => 18.322345 middle => 18.217576 catherine => 18.217576 long => 18.186291 george => 18.034330 call => 17.653746 eye => 17.222443 slave => 17.211534 city => 16.561432 wolsey => 16.490110 pope => 16.490110 kate => 16.490110 abelard => 16.490110 peter => 16.490110 girl => 16.261359 order => 15.865703 hand => 15.811249 woman => 15.777239 case => 15.457997 nature => 15.457997 life => 15.327286
house => 15.327286 children => 14.906246 john => 14.906246 lady => 14.906246 pound => 14.905289 word => 14.757165 officer => 14.657876 hindenburg => 14.657876 francie => 14.657876 wife => 14.342945 france => 14.342945 pass => 14.194333 offer => 14.169831 mind => 14.148264 class => 13.780840 follow => 13.703082 perhap => 13.599174 number => 13.551132 duty => 13.551132 young => 13.551132 poor => 13.551132 live => 13.531919 body => 13.531919 half => 13.531919 dead => 13.249146 free => 12.969242 length => 12.908650 thought => 12.881664 sever => 12.881664 matery => 12.881664 hear => 12.881664 antiqu => 12.825641 olivia => 12.825641 divorce => 12.825641 highness => 12.825641 idea => 12.465909 accord => 12.465909 master => 12.465909 ground => 12.301745 view => 12.301745 water => 12.301745 answer => 12.301745 interest => 12.301745 footnote => 12.249636 exclaim => 12.249636 school => 12.196019 form => 12.196019 course => 12.196019 company => 12.196019 nee => 12.196019 face => 12.196019 country => 12.196019 mother => 11.593498 condition => 11.593498 account => 11.593498
matter => 11.593498 marshal => 11.593003 price => 11.593003 term => 11.474356 roman => 11.474356 history => 11.474356 save => 11.474356 marry => 11.474356 find => 11.332645 open => 11.332645 leave => 11.071570 single => 11.071570 fact => 11.071570 fall => 11.071570 pressure => 10.993407 coin => 10.993407 solid => 10.993407 orsino => 10.993407 outer => 10.993407 historic => 10.993407 crit => 10.993407 keeper => 10.993407 money => 10.840906 white => 10.840906 william => 10.718431 kill => 10.718431 force => 10.718431 best => 10.611198 move => 10.611198 care => 10.611198 hope => 10.611198 opportun => 10.305332 figure => 10.305332 early => 10.305332 till => 10.305332 pretty => 10.305332 feel => 10.305332 large => 10.305332 beauty => 10.305332 head => 10.199380 loss => 10.040061 fell => 10.040061 full => 10.040061 intellectu => 10.040061 feature => 10.040061 indian => 10.040061 measure => 9.936859 manufacturer => 9.936859 rome => 9.936859 Displaying: 150 / 3340
Appendix C – Output Example of a Concept based Vocabulary
Following is the output of a stored Concept-based Vocabulary constructed from a corpus of 50 documents. Note that every term making up this Vocabulary also appears in the Term-based Vocabulary.
Displaying concept based vocabulary: 1 => [will, purpose, intend, remember, faculty, leave] 2 => [law, collection, philosophy, police, principle, force] 3 => [father, leader, priest, parent, mother, title] 4 => [work, minister, slave, serve, operation, investigation, pass, exercise, wait, exchange, claw, duty, care, fill, operate, labor, collaborate, cultivate, move, succeed, double, study, farm, bank, function, mission, carpenter, till, ministry, roll, location, service, process, busy, bring, play, action] 5 => [love, object, emotion, dear, passion, devotion, enjoy, lover] 6 => [baron] 7 => [thing, attribute, matter, statement, change, situation, affair, feast] 8 => [time, hour, experience, future, day, sentence, schedule, moment, case, term, dead, occasion, determine, wee, clock] 9 => [point, finger, phase, charge, reflect, fact, steer, extent, head, guide, position, indicate, sheer, distance, park, mark, characteristic, middle, level, stage, spot, detail, signal, respect, dot, state, corner, direct, degree, punctum, place, measure] 10 => [year, class] 11 => [henry] 12 => [gener] 13 => [well, surface, easily, good, swell] 14 => [great, eager] 15 => [prince] 16 => [side, opinion, region, pull, front, face, root, edge, bottom, unit, hand] 17 => [bagot] 18 => [lord, duke, count, noble, master] 19 => [three, trinity] 20 => [french, nation] 21 => [cry, exclaim, express, shriek, weep, sob, utterance, tear, call, utter, noise] 22 => [viola] 23 => [catherine] 24 => [long, desire] 25 => [george] 26 => [eye, attention] 27 => [city] 28 => [wolsey] 29 => [pope] 30 => [kate] 31 => [abelard] 32 => [peter] 33 => [girl, baby, maid, daughter, woman] 34 => [order, chapter, club, peace, request, arrangement, magnitude, rule, association, commission, command, bull, stay, word, society, edict, hunt, condition] 35 => [nature, disposition, complexion, quality]
36 => [life, history, spirit, person] 37 => [house, business, chamber, household, family] 38 => [children] 39 => [john, room] 40 => [lady, madame] 41 => [pound, walk] 42 => [officer] 43 => [hindenburg] 44 => [francie] 45 => [wife, housewife] 46 => [france] 47 => [offer, market, bid, produce, extend, supply, proposition, project, reward] 48 => [mind, judgment, notice, brain, tend, decision] 49 => [follow, comply, choose, guard, trace, carry, accompany, obey, result, imitate, ascend, watch, observe] 50 => [perhap] 51 => [number, issue, list, base, symbol, figure, size, edition, constant, amount, square, turn, total, company, performance] 52 => [young, animal] 53 => [poor] 54 => [live, dissipate, breathe, camp, people, survive, occupy, taste, exist, board, swing] 55 => [body, property, colony, system, church, school, college, thickness, softness, mass, representation, opposition, public] 56 => [half, moiety] 57 => [free, clear, loose, discharge, liberate, relieve, smooth, forgive, release] 58 => [length, leg, diameter, circumference] 59 => [thought, content, suggestion, plan, consideration, idea, impression, inspiration, explanation, ideal] 60 => [sever, separate] 61 => [matery] 62 => [hear, discover, catch, learn] 63 => [antiqu] 64 => [olivia] 65 => [divorce] 66 => [highness] 67 => [accord, match, agree, grant] 68 => [ground, island, soil, neck, earth, view, reason, forest, plain, teach, background, fasten, land] 69 => [water, liquid, sound, element, main, hush, sea, food, ocean] 70 => [answer, solve, resolve, solution, field, response, reply, resolution, counter] 71 => [interest, enthusiasm, power, benefit, refer, diversion, fee, color, arouse, share, sake, behalf] 72 => [footnote, note] 73 => [form, strike, stamp, category, round, manner, throw, draw, plume, build, sort, mound, model, twist, type, connection, gestalt, add, cast, organize, topography, solid, description, layer, frame, kind, influence, terrace, variety, style, hill, document, blow, spring, column, 
cup, strain, shape, appearance] 74 => [course, direction, education, track, path] 75 => [nee] 76 => [country, open, anchorage, haunt, retreat, scene, kingdom, space, ally]
77 => [account, story, relationship, bill, report, profit] 78 => [marshal, gather] 79 => [price, worth, cost, rig] 80 => [roman] 81 => [save, favor, spend, reserve, prevent, deliver, hoard] 82 => [marry] 83 => [find, translate, feel, happen, sight, chance, encounter, sense] 84 => [single] 85 => [fall, shine, loss, break, fail, shrink, season, drop, descend, set, rain, yield, pitch, drip, sin, diminish] 86 => [pressure, compel, press] 87 => [coin, threepence, medallion, crown, penny, quarter, real, sixpence] 88 => [orsino] 89 => [outer] 90 => [historic] 91 => [crit] 92 => [keeper] 93 => [money, fund] 94 => [white, bone, whiteness, alabaster] 95 => [william] 96 => [kill, destruction, stone, sacrifice, fell, dismember, destroy, death, poison, execute] 97 => [best, attempt] 98 => [hope, trust, encouragement, promise] 99 => [opportun] 100 => [early] 101 => [pretty] 102 => [large] 103 => [beauty, glory] 104 => [full, fully, entire] 105 => [intellectu] 106 => [feature, bear, read, possess, temple, chin, cheek, wear] 107 => [indian] 108 => [manufacturer, producer] 109 => [rome] 110 => [remain, stand, stick, rest, persist, continue, linger] 111 => [brought] 112 => [small, minor] 113 => [hold, admit, surround, maintain, protect, cover, weather, defend, fetter, harbour, arrest, lock, sustain, stock, declare, support, nurse, include, apply, retain, book, sleep] 114 => [hair, eyebrow, coat] 115 => [probable] 116 => [success, bite, victory] 117 => [short, suddenly, tract] 118 => [heart, bosom, sum, substance, courage, stuff, marrow, nerve] 119 => [appear, rise, perform, glitter, reappear, occur, manifest] 120 => [consider, compare, debate, expect, regard, reckon, abstract, deliberate, deal, contemplate, weigh] 121 => [alway] 122 => [left] 123 => [fellow, friend, familiar, chap, companion, associate, colleague] 124 => [angry, wild] 125 => [royal] 126 => [german]
127 => [foreign] 128 => [boy, son] 129 => [perfect, better] 130 => [priggery] 131 => [cesario] 132 => [hydrogen] 133 => [mycelium] 134 => [song] 135 => [filament] 136 => [janet] 137 => [marriage, union] 138 => [goose] 139 => [sketch, resume, describe, outline] 140 => [gold, golden, yellow] 141 => [thu] 142 => [vary, alter, drift, contradict, differ] 143 => [speak, tone, murmur, mouth, talk, bark, sing, address, converse, mumble, whisper] 144 => [felt] 145 => [light, weak, twilight, flood, expression, burn] 146 => [wrote] 147 => [help, avail, lift, worker, expedite, assistance, resource, provide, relief, attendant] 148 => [continu] 149 => [true] 150 => [attack, pepper, storm, assail, savage, blast, touch, approach, jump, fire, criticism, rush, stroke] Displaying: 150 / 2234