Evaluation of Techniques for Automatic Text Extraction
Submitted September 2006, in partial fulfilment of the conditions of the award of the degree M.Sc. in IT
Omar Ali Hussein Azzam
School of Computer Science and Information Technology
University of Nottingham
Supervisor: Dr. Tim Brailsford
I hereby declare that this dissertation is all my own work, except as indicated in the text:
Signature ______________________
Date _____/_____/_____
ABSTRACT
With the rapid growth of the World Wide Web and electronic information services,
information is becoming overwhelmingly available on-line. The problem is that a user cannot cope with all the text that is on the internet. No one has time to read everything, yet we often have to make critical decisions based on what we are able to understand. Some examples from everyday life that require summarization are:
• News Headlines
• Movie Reviews
• Abstracts of Scientific Papers
• Book Reviews or Excerpts
• Highlights of a Meeting

The technology of automatic text summarization has become essential for dealing with such problems. Text summarization is the process of extracting the most important information from a source to produce an abbreviated version for a particular user or task.
Summaries can be produced either by abstraction or by extraction. Abstraction involves rewriting the document as a more concise version. Extraction is the process of selecting the most salient excerpts of the document (sentences, paragraphs, etc.) for inclusion in the summary.
This thesis is concerned with implementing three summarizers using three different extraction methodologies. The first is based on sentence extraction using Latent Semantic Analysis, while the other two are based on paragraph extraction: one uses a bushy-node path algorithm and the other a depth-first node path algorithm.
In addition, a comparison will be performed between the three extraction methodologies using two empirical methods: a questionnaire and field observation. The comparison should yield a guide specifying which methodology is the most effective and the circumstances that favour each technique.
Acknowledgements
I would like to acknowledge all my thanks to ALLAH for his help in this
research.
I would like to acknowledge the superb effort, expertise and perceptive insights
offered by Dr. Tim Brailsford. It has been an honour and a very educational experience
working under his supervision.
I am greatly indebted to James Goulding, at the Department of Computer Science,
University of Nottingham, for his ongoing support.
I would like to send my sincere thanks to all my family. Special thanks to my
father and mother for supporting and sponsoring me for this degree; without their support
and encouragement this work might never have been done. Special thanks to my sister for
her moral support and for always being there for me.
No words can express my gratitude towards Eng. Ramy from the ITI for his cooperation and assistance before and during the Masters.
Contents

Section 1: Literature Review
1 Introduction
  1.1 What is Automatic Text Summarization?
  1.2 Approaches and Methods
  1.3 Motivation
  1.4 Aims and Objectives
  1.5 Method
2 Previous Works
  2.1 Classical Approaches
    2.1.1 Hans Peter Luhn's Approach
    2.1.2 H.P. Edmundson's Approach
  2.2 Corpus-based Approaches
    2.2.1 Morris and Hirst's Approach
    2.2.2 A Trainable Document Summarizer Approach
3 Automatic Text Summary Evaluation
  3.1 Introduction
  3.2 Evaluation Methods

Section 2: Text Extraction Automation
4 Software Development
  4.1 Introduction
  4.2 Technologies and Tools
5 Sentence Extraction
  5.1 Introduction
  5.2 Latent Semantic Analysis
  5.3 Analysis and Design
  5.4 Implementation
  5.5 Example
6 Paragraph Extraction
  6.1 Introduction
  6.2 Text Relationship Maps
  6.3 Paragraph Extraction Methodologies
  6.4 Analysis and Design
  6.5 Implementation
  6.6 Example
    6.6.1 Depth First Node Paragraphs
    6.6.2 Bushy Node Paragraphs

Section 3: Evaluation and Results
7 Evaluations and Comparison
  7.1 Evaluation Procedures
  7.2 Test Data
  7.3 Aspects of the Extraction Process
  7.4 Participant Users
  7.5 Evaluation Techniques
  7.6 Evaluation Equipment
    7.6.1 User Interface
    7.6.2 Evaluation Sheet
  7.7 Evaluation Results
    7.7.1 Expert Users Results
    7.7.2 Medium Users Results
    7.7.3 Beginner Users Results

Section 4: Discussion
8 Discussion
  8.1 Conclusion
  8.2 Future Work

References
Appendices
  Appendix A: Basic classes of the system
  Appendix B: The sentence extractor program
  Appendix C: The paragraph extractor program
  Appendix D: Evaluation Sheet
  Appendix E: Evaluation Samples
  Appendix F: Evaluation Results
List of Figures

Figure 5.1  A class diagram for the class Word
Figure 5.2  A class diagram for the class Sentence
Figure 5.3  A class diagram for the class WordFrequency
Figure 5.4  Class diagram of the Extractor
Figure 5.5  A class diagram for the class SentenceExtractor
Figure 5.6  The class diagram of the sentence extractor
Figure 5.7  The TDM of a document on encryption
Figure 5.8  The original document about E-commerce
Figure 5.9  A summary based on sentence extraction
Figure 6.1  A text relationship map for an article on telecommunication
Figure 6.2  Text relationship map of the telecommunication article after refinement
Figure 6.3  Class diagram of the class Paragraph
Figure 6.4  Class diagram of the ParagraphExtractor class
Figure 6.5  A class diagram of the paragraph extractor
Figure 6.6  A document-document matrix of a document on encryption
Figure 6.7  An illustration of the depth-first node path algorithm
Figure 6.8  A summary based on the depth-first node path algorithm
Figure 6.9  A summary based on the bushy node path algorithm
Figure 7.1  Interface of the extractor system where the user chooses the method
Figure 7.2  Interface of the system where the user chooses the document to perform the extraction on
Figure 7.3  The result page of our extractor where a summary is provided for the document
Figure 7.4  A chart of the questionnaire answers of the expert users
Figure 7.5  A chart of the questionnaire answers of the medium users
Figure 7.6  A chart of the questionnaire answers of the novice users
List of Tables

Table 5.1  An example of a term-document matrix
Table 6.1  A document-document matrix
Section 1: Literature Review
1 Introduction
1.1 What is Automatic Text Summarization?
Text summarization is the process of distilling the most important information from a
source to produce an abbreviated version for a particular user or task. [19]
Automatic text summarization has been strongly recommended in response to the
problems of manual text summarization. The most significant problem is the time
consumed in manually summarizing a text. Another problem is inconsistency:
the summaries extracted by two individuals from the same article can be surprisingly
different; in fact, even a single professional summarizer finds it difficult to maintain
consistency over a period of time. Bias is another major problem in manual
summarizing, since the summarizer may be biased towards his own thoughts
or beliefs.
Although several aspects control the output and style of a summary, most of the
available summarizers do not take this fact into consideration, and even the minority
that do have failed to execute it effectively, which diminishes the significance of the
resulting summaries.
There are two major techniques for creating a summary:
1. Abstract summarization: The result of this summary is an interpretation of the
original text: a smaller text in which concepts are transformed into more compact
ones. For example, "They went to Italy, France, and Germany" becomes
"They went to some European countries". This kind of summarization requires
symbolic world knowledge, which makes it difficult to produce a good summary.
2. Extract summarization: This method uses statistical, linguistic and heuristic
methods, or a combination of these, to produce a summary. The extracted text is
not altered syntactically or in content. Summarizing with this technique is
considerably less complex.
Aspects of Automatic Text Summarization
The output of a summary depends on several aspects. These aspects control the
methodology used to summarize and the style of the summary. The main aspects that will
be reviewed are intent, focus, coverage, background, and genre of text [9] [3].
Intent of the summary describes the potential use of the summary. It can be
indicative, informative, or evaluative. Indicative summaries give an indication of the
content of the source text. They can be used as appetizers for the whole text. Informative
summaries serve as substitutes for the document. Evaluative summaries provide the point
of view of the author on a given subject.
The focus of the summary refers to its scope, which can be either generic or
query-relevant. A generic summary covers the whole text without bias towards any
particular topic, whereas a query-relevant summary is constructed to concentrate on a
specific topic. Put more simply, in a generic summary the user is interested in all
aspects of the document and all the information considered important, while in a
query-relevant summary the user is interested only in specific concepts and needs to be
provided with all the information about those concepts.
Coverage refers to whether the summary is based on a single document or on multiple
documents relating to the same subject matter.
The background of the summary refers to the user's prior knowledge of the subject
area. A user's background in a particular subject may be weak, requiring the summary to
focus on the basic concepts of the subject area, or strong, requiring the summary to focus
on the latest news and events.
Genre of text refers to the nature of the text, whether it is a scientific paper, journal,
book, or a story. It will be demonstrated further why this aspect in particular is very
important.
Lin and Hovy specified three basic stages required for summarizing a topic: topic
identification, topic interpretation and topic generalization. [10]
Topic Identification: Understanding the concepts of the document to obtain the essence
of the text.
Interpretation: Identifying the most important pieces of information contained in the
text.
Generalization: The transformation of the source text into a coherent new text,
done by merging the phrases that are eligible for inclusion in the summary.
1.2 Approaches and Methods
In the previous section it was noted that there are two types of summary, abstract and
extract. Abstract summaries require symbolic world knowledge, semantic parsing and
many other NLP (Natural Language Processing) techniques. To this day the creation
of abstracts remains a challenge. Most existing systems and research have been
concerned with extracts, in which the sentences most indicative of the topic are
selected and merged together.
Although extracts are considered relatively easy, there are some major difficulties
in ensuring a high-quality, readable extract [24]:
• How can the significance of a sentence be identified?
• How can the selected sentences be merged coherently?
Each sentence in the source text is marked with a rank, and the sentences with the
highest rank are chosen for inclusion in the extract. Some of the rules that have been
used throughout the field's history are presented below. All of these rules depend on
the principle that each word is given a grade and a sentence's score is the sum of the
grades of the words it contains. [24][26]
Proper Name: Certain types of nouns like peoples' names or cities are considered
significant.
Pronoun: Sentences containing pronouns are given a higher score than sentences
that do not.
Bonus Words: Based on the hypothesis that the probable relevance of a sentence is
affected by the presence of pragmatic words such as "significant", "impossible",
and "hardly".
Word Frequency: A word's frequency is an indication of its importance. Sentences
containing such frequent words could be considered significant.
Word Position: A word's position within the whole text or the paragraph could also
indicate its eligibility.
Sentence Position: Topic sentences tend to occur very early or very late in a
document and its paragraphs.
Headings: Sentences occurring under certain headings are positively relevant, e.g.
"Introduction", "Purpose", and "Conclusion". Each word has its corresponding
weight.
Title: Words of a title could be considered positively relevant.
Uppercase Word: Uppercase words could be considered highly significant if
they are acronyms.
Numerical Data: Sentences containing numerical data are given a higher score
than sentences that do not.
Weekdays and Months: Sentences containing weekdays or months are scored
higher than those that do not contain them.
Lexical Chains: The previous methods can successfully obtain the most significant
sentences. However, they do not take into consideration the relationships between
the different parts of a text, so the output is a summary lacking cohesion. Cohesion
is a term for sticking together: it means that the text hangs together as one fluent
stream. Lexical cohesion is the cohesion that arises from semantic relationships
between words; all that is required is some recognizable relation between them.
Lexical chains represent the lexical cohesion among an arbitrary number of related
words. Lexical chains will be discussed further in Chapter 2. [23]
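As a minimal illustration of the word-frequency rule above, the following sketch grades each non-stop word by its frequency in the document and scores a sentence as the sum of its words' grades. The tokenization and the tiny stop-word list are simplifying assumptions for illustration, not part of the summarizers implemented later in this thesis.

```java
import java.util.*;

public class FrequencyScorer {
    // Toy stop-word list; a real system would use a much larger one.
    private static final Set<String> STOP =
        new HashSet<>(Arrays.asList("the", "a", "of", "to", "and", "is", "in"));

    // Grade each word by its frequency in the whole document.
    static Map<String, Integer> wordGrades(List<String> sentences) {
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isEmpty() && !STOP.contains(w))
                    freq.merge(w, 1, Integer::sum);
        return freq;
    }

    // A sentence's score is the sum of its words' grades.
    static int score(String sentence, Map<String, Integer> grades) {
        int total = 0;
        for (String w : sentence.toLowerCase().split("\\W+"))
            total += grades.getOrDefault(w, 0);
        return total;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "Encryption protects data.",
            "Encryption algorithms transform data into ciphertext.",
            "The weather is nice.");
        Map<String, Integer> grades = wordGrades(doc);
        for (String s : doc)
            System.out.println(score(s, grades) + "  " + s);
    }
}
```

In practice this frequency grade would be combined with the other rules (position, title words, cue words) rather than used alone.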
1.3 Motivation
The previous sections demonstrated why automatic summarization is essential,
given the problems of manual summarization and the rate and volume of information
flow that users cannot cope with. Critical decisions must nevertheless be made based
on what is understood. Summarization can therefore be a crucial process: if a concept
in the original text is not well highlighted in the summary, the original text may be
misunderstood, which can lead to serious consequences.
However, most of the available summarizers are not reliable for constructing
summaries of critical documents, and the readability of automatic summaries is often
unsatisfactory. It is somewhat ironic that although a large number of text summarizers
are available, they are rarely used in practice.
1.4 Aims and Objectives
It has been mentioned previously that in order to summarize a document there are
some aspects (e.g. intent, background, genre of text, etc.) that should be taken into
consideration, but this is usually not the case: either they are ignored or not implemented
properly.
After realizing how critical automatic text summarization is, this thesis will be
concerned with:
• Implementing an automatic summarizer based on sentence extraction.
• Implementing an automatic summarizer based on bushy node path paragraph extraction.
• Implementing an automatic summarizer based on depth-first node path paragraph extraction.
• Evaluating the text summarizers to specify the most suitable summarizer with respect to the intent aspect.
• Evaluating the text summarizers to specify the most suitable summarizer with respect to the genre-of-text aspect.
• Evaluating the text summarizers to specify the most suitable summarizer with respect to the background aspect.
• Enabling the public to contribute and share their opinions on the automatic summaries with regard to the previous aspects.
• Testing how informative the summarizers are, i.e. the degree to which sufficient information is retained compared with the original document.
• Testing the readability of the summaries.
• Constructing a guide recommending each summarizer for specific cases.
1.5 Method
The three algorithms that will be implemented are Latent Semantic Analysis,
paragraph extraction based on bushy node path and paragraph extraction based on depth-
first node path. The programming language used will be Java. The software will be web
based using Java Servlets. The web server used will be Apache Tomcat 5.5.9.
Since text summarization is a crucial process, the informativeness of a summary is
the most important criterion of its validity. The evaluation in this study will therefore
focus on the informativeness of the summary rather than its convenience. The
summaries will be evaluated in two ways:
FIRST: The summary will be presented to the public, and its validity will be
judged according to specific criteria that participants will answer. These
criteria are:
• Can a user answer all the questions by reading the summary as well as he
could by reading the entire document from which the summary was produced?
• Is the summary misleading in any way?
• Is the summary as elegant as the original document?
• Is the summary readable?
• Does the summary require the user to have background in the subject
area?
• Which summarizer would the user prefer?
This survey will be presented to different types of users with different backgrounds.
The summarizers also will be tested using different text genres.
SECOND: Each summary has specific properties, and the presence of these
properties will be checked. These properties are:
• What is the best compression ratio between the given document and its
summary?
• Redundancy—is any information repeated in the summary?
• Cohesiveness
• Coherence
• Readability (depends on cohesion/ coherence/ intelligibility)
Finally, a guide is provided that recommends the best method to summarize within
each case.
Chapter 2 presents an overview of previous work in the subject area, spanning
two periods: Classical and Corpus-based. Chapter 3 presents an overview of automatic
text summarization evaluation and the evaluation types.
2 Previous Works
The foundations of Automatic Text Summarization were laid in the mid-1950s.
Research in Automatic Text Summarization is divided into two periods, Classical and
Corpus-based. [19]
Classical Approaches began with work originating some 40 years ago. They are
classic in the sense that they use fundamental, surface-level techniques, focusing
on the analysis of static textual features to construct summaries.
Corpus-based Approaches are concerned with how different textual features can
be extracted from text corpora and manually or automatically combined to
produce better abstracts. They recognize that some textual features depend on
the text genre; examples of text genres are newspapers, scientific papers and
TV news. Approaches cannot be standardized across different text genres.
2.1 Classical Approaches
Classical Approaches represent the traditional approaches used at the birth of
automatic text summarization. These approaches mostly depend on discrete features in
the text. The following sections present two of the godfathers of the idea of automating
the abstraction process.
2.1.1 Hans Peter Luhn's Approach
Luhn's statistical approach, presented in his paper "The Automatic Creation of
Literature Abstracts" (1958) [17], depends on term frequency and term normalization.
His method is based on the assumption that the frequency of a word implies its
significance, and that the significance of a sentence can be obtained by analysing
its words.
Luhn believed that manual abstracting could not be optimal, as manual abstracts
may be influenced by the abstracter's background, attitude and beliefs: the abstractor
may be biased towards his own ideas and may not maintain consistency. Automatic
abstracts could eliminate both human effort and bias.
Luhn's algorithm selects significant words by obtaining the stem of each word and
counting its frequency. Some words are too frequent to be significant, and such words
are excluded; the remaining words are scored according to their frequency. Luhn
defines a sentence's significance factor as a value reflecting the number of occurrences
of significant words within the sentence and the linear distance between them caused
by intervening non-significant words. The significance factor is calculated for each
sentence, the sentences whose factor exceeds a certain cut-off are retrieved, and the
selected sentences are combined to constitute the auto-abstract.
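Luhn's significance factor can be sketched as follows. The clustering details here follow common accounts of Luhn's method: a bracket of significant words in which consecutive significant words are separated by at most four non-significant ones, scored as the squared count of significant words divided by the bracket length. The gap limit of four is an assumption taken from the literature rather than from this thesis.

```java
import java.util.*;

public class LuhnFactor {
    // Maximum number of non-significant words allowed between two
    // significant words within the same cluster (assumed value).
    static final int GAP = 4;

    static double significanceFactor(String[] words, Set<String> significant) {
        // Positions of significant words in the sentence.
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < words.length; i++)
            if (significant.contains(words[i].toLowerCase())) idx.add(i);
        if (idx.isEmpty()) return 0.0;

        double best = 0.0;
        int start = 0;  // start of the current cluster (index into idx)
        for (int i = 1; i <= idx.size(); i++) {
            // Close the cluster when the gap grows too large or we run out.
            if (i == idx.size() || idx.get(i) - idx.get(i - 1) - 1 > GAP) {
                int count = i - start;                         // significant words in cluster
                int span = idx.get(i - 1) - idx.get(start) + 1; // total words spanned
                best = Math.max(best, (double) (count * count) / span);
                start = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> sig = new HashSet<>(Arrays.asList("encryption", "key", "cipher"));
        String[] sentence = "the encryption key drives the cipher design".split(" ");
        // 3 significant words spanning 5 positions: factor = 9 / 5 = 1.8
        System.out.println(significanceFactor(sentence, sig));
    }
}
```

A full implementation would also stem words and derive the significant-word set from document-wide frequencies, as the paragraph above describes.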
2.1.2 H.P. Edmundson's Approach
Edmundson was aiming to construct an extracting system to produce indicative
extracts that allow a researcher to screen a body of literature to decide which documents
deserve more detailed attention. This was demonstrated in his paper “New Methods in
Automatic Extracting - 1969”. [2]
Edmundson extended Luhn's method, which was not sufficient to construct
automatic extracts that could supersede manual ones. He designed a way of weighting
the text using four basic methods; the weight of a sentence is a function of the weights
of these four characteristics:
• Cue: This method is based on the hypothesis that the probable relevance of a
sentence is affected by the presence of pragmatic words such as "significant",
"impossible", and "hardly".
• Key: It is like the one proposed by Luhn. It is based on the hypothesis that
words that are highly frequent are positively relevant.
• Title: This method depends on certain specific characteristics of the skeleton
of the document (titles, headings, and format). It is based on the hypothesis
that words of the title and headings are positively relevant. When the author
partitions the body of the document into major sections he summarizes it by
choosing appropriate headings.
• Location: This method is based on the hypothesis that sentences occurring
under certain headings are positively relevant and that topic sentences tend to
occur very early or very late in a document and its paragraphs. There is a
Heading dictionary containing words that appear in headings of documents,
e.g. "Introduction", "Purpose", and "Conclusion". Each word has its
corresponding weight.
After obtaining the four characteristics for each sentence, the weight of the sentence
SentW is calculated using the following equation:
SentW = aC + bK + cT + dL;
Where a, b, c, and d are constant positive integers and C: Cue Weight,
K: Key Weight, T: Title Weight, and L: Location Weight.
The N highest-scoring sentences are retrieved to construct the final abstract.
Generally, a compression rate of 25% of the original text is preferred, so N is
calculated as a function of 25% of the size of the original document and its number
of sentences.
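The weighting scheme above can be sketched as a small program. The unit values chosen here for the constants a, b, c and d are placeholders (Edmundson tuned his weights against manually produced extracts), and the per-sentence feature values are assumed to have been computed elsewhere.

```java
import java.util.*;

public class Edmundson {
    // Illustrative constant positive weights a, b, c, d (assumed values).
    static final int A = 1, B = 1, C = 1, D = 1;

    // SentW = aC + bK + cT + dL, for the Cue, Key, Title and Location weights.
    static int sentW(int cue, int key, int title, int loc) {
        return A * cue + B * key + C * title + D * loc;
    }

    // Retrieve the N highest-scoring sentences, with N = 25% of the
    // document's sentence count (rounded up), in document order.
    static List<Integer> topQuarter(int[][] features) {
        int n = features.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (x, y) ->
            sentW(features[y][0], features[y][1], features[y][2], features[y][3])
          - sentW(features[x][0], features[x][1], features[x][2], features[x][3]));
        int keep = (int) Math.ceil(n * 0.25);
        List<Integer> chosen = new ArrayList<>(Arrays.asList(order).subList(0, keep));
        Collections.sort(chosen);  // restore document order
        return chosen;
    }

    public static void main(String[] args) {
        // Each row holds {cue, key, title, location} weights for one sentence.
        int[][] f = { {0,2,0,3}, {1,5,2,1}, {0,1,0,0}, {0,0,1,0} };
        System.out.println(topQuarter(f));  // indices of the selected sentences
    }
}
```

Restoring document order after selection matters for readability: the abstract should present the chosen sentences in their original sequence.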
2.2 Corpus-based Approaches
The most severe limitation of location and cue-phrase features is their dependence
on the text genre: each genre has its own style of writing, so techniques relying on
such formal clues can be seen as high-risk [19].
Another limitation is that the merging of the sentences that are eligible to constitute
the summary may result in an incoherent summary [19].
Methods that rely more on content do not suffer from this problem. The problem with
these methods is that a detailed semantic representation must be created and a domain
specific knowledge base must be available. [19]
2.2.1 Morris and Hirst's Approach
The simplest method of ensuring coherence of the summary is lexical cohesion.
Coherence is a term for making sense: a coherent text makes sense as a whole. Using
lexical chains in text summarization is efficient, because these relations are easily
identifiable within the source text and vast knowledge bases are not necessary for
the computations. [23]
By using lexical chains, we can statistically find the most important concepts by
looking at the structure of the document rather than at deep semantic meaning. All that
is required to compute them is a generic knowledge base containing nouns and their
associations (a thesaurus). These associations capture concept relations such as
synonymy (a word having the same or nearly the same meaning as another word),
antonymy (a word having the opposite meaning to another word), and hypernymy
(the is-a relation). [4]
Chains are created by taking a new text word and finding a related chain for it
according to relatedness criteria. Morris and Hirst defined a methodology for using
lexical chains in the abstraction process.
Generally, a procedure for constructing lexical chains follows three steps:
1. Selection of a set of candidate words.
2. For each candidate word, find an appropriate chain relying on a relatedness
criterion among members of the chains. More specifically, it must be determined
exactly what counts as a cohesive relationship between words, which can be
done using a thesaurus. According to Morris and Hirst's suggestions, two
words can be considered related if they are connected in the thesaurus in one
or more of the following five ways:
a. Their index entries point to the same thesaurus category, or point to
adjacent categories.
b. The index entry of one contains the other.
c. The index entry of one points to a thesaurus category that contains the
other.
d. The index entry of one points to a thesaurus category that in turn
contains a pointer to category pointed to by the index entry of the other.
e. The index entries of each point to thesaurus categories that in turn
contain a pointer to the same category.
3. If an appropriate chain is found, insert the word into the chain and update the
chain accordingly.
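The three steps above can be sketched with a toy thesaurus that implements only relatedness criterion (a): words whose index entries point to the same category. The thesaurus entries here are invented for illustration; a real system would use Roget's Thesaurus or a similar resource, and would apply all five criteria.

```java
import java.util.*;

public class LexicalChains {
    // Toy thesaurus: word -> thesaurus categories (invented entries,
    // standing in for a real resource such as Roget's Thesaurus).
    static final Map<String, Set<String>> THESAURUS = Map.of(
        "car",    Set.of("vehicle"),
        "truck",  Set.of("vehicle"),
        "engine", Set.of("vehicle", "machine"),
        "flower", Set.of("plant"),
        "tree",   Set.of("plant"));

    // Criterion (a): the two words share a thesaurus category.
    static boolean related(String a, String b) {
        Set<String> ca = THESAURUS.getOrDefault(a, Set.of());
        for (String cat : THESAURUS.getOrDefault(b, Set.of()))
            if (ca.contains(cat)) return true;
        return false;
    }

    // Step 1: candidate words are those present in the thesaurus.
    // Step 2: find a chain containing a related word. Step 3: insert.
    static List<List<String>> buildChains(List<String> words) {
        List<List<String>> chains = new ArrayList<>();
        for (String w : words) {
            if (!THESAURUS.containsKey(w)) continue;  // not a candidate word
            List<String> home = null;
            for (List<String> chain : chains)
                if (chain.stream().anyMatch(m -> related(m, w))) { home = chain; break; }
            if (home == null) { home = new ArrayList<>(); chains.add(home); }
            home.add(w);
        }
        return chains;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("car", "flower", "engine", "tree", "truck", "it");
        System.out.println(buildChains(text));
    }
}
```

For the sample input, "car", "engine" and "truck" fall into one chain and "flower" and "tree" into another, while "it" is skipped as a non-candidate.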
2.2.2 A Trainable Summarizer
To summarize is to reduce in complexity, and hence in length, while retaining some
of the essential qualities of the original. Abstracts are sometimes used as full document
surrogates. They can be accessed easily, and they provide an easily digested
intermediate point between a document's title and its full text that is useful for rapid
relevance assessment. [14]
The paper behind this approach focused on the construction of document extracts
using new discrete sentence-scoring features. [14]
Sentence Length Cut-off Feature: Short sentences tend not to be included in
summaries. So a given threshold of the number of words in a sentence is
specified.
Fixed-Phrase Feature: There is a list of fixed phrases that indicate that the
coming sentences are significant.
Paragraph Feature: This discrete feature focuses on the first ten paragraphs and
the last five paragraphs, considering them to contain substantial information
about the document.
Thematic Word Feature: This feature focuses on the words that occur most
frequently.
Uppercase Word Feature: Uppercase words are often important. They are scored
like the previous feature, with the most frequent uppercase words credited
significantly, under the constraint that the word is not sentence-initial and is not
an abbreviated unit of measurement (such as F, C, or kg); such abbreviations are
discarded.
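The discrete features above can be sketched as simple predicates over a sentence and its position. The word-count threshold and the fixed-phrase list are illustrative assumptions; the original trainable summarizer combined such features in a statistical classifier trained on document/summary pairs rather than using them as hard rules.

```java
import java.util.*;

public class DiscreteFeatures {
    static final int MIN_WORDS = 6;  // sentence-length cut-off (assumed value)
    static final List<String> FIXED_PHRASES =
        Arrays.asList("in conclusion", "this paper", "in summary");

    // Sentence Length Cut-off Feature: short sentences are excluded.
    static boolean lengthCutoff(String sentence) {
        return sentence.trim().split("\\s+").length >= MIN_WORDS;
    }

    // Fixed-Phrase Feature: the sentence contains an indicator phrase.
    static boolean fixedPhrase(String sentence) {
        String s = sentence.toLowerCase();
        return FIXED_PHRASES.stream().anyMatch(s::contains);
    }

    // Paragraph Feature: true for the first ten or last five paragraphs.
    static boolean paragraphFeature(int paragraphIndex, int totalParagraphs) {
        return paragraphIndex < 10 || paragraphIndex >= totalParagraphs - 5;
    }

    // Uppercase Word Feature: an uppercase word that is not sentence-initial
    // and longer than one letter (crudely excluding units like F or C).
    static boolean uppercaseFeature(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        for (int i = 1; i < words.length; i++)
            if (words[i].length() > 1 && words[i].equals(words[i].toUpperCase())
                    && Character.isLetter(words[i].charAt(0)))
                return true;
        return false;
    }

    public static void main(String[] args) {
        String s = "In conclusion, the IETF standardized the protocol.";
        System.out.println(lengthCutoff(s));          // 7 words, passes cut-off
        System.out.println(fixedPhrase(s));           // contains "in conclusion"
        System.out.println(uppercaseFeature(s));      // "IETF" is non-initial uppercase
        System.out.println(paragraphFeature(12, 30)); // a middle paragraph
    }
}
```

The thematic-word feature is omitted here because it is essentially the frequency scoring already sketched in Chapter 1.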
3 Automatic Text Summary Evaluation
The goal of automatic summarization is to take an information source, extract content
from it, and present the most important content to the user in a condensed form and in a
manner sensitive to the user’s or application’s needs [18].
The evaluation of any system is a key point for any research or development
effort. Evaluation has long been of interest in automatic summarization, with extensive
evaluations being carried out as early as the 1960s.
This chapter gives an overview of automatic text summarization evaluation.
Section 3.1 gives an overview of text summarization evaluation and previous research
evaluations, and Section 3.2 describes the evaluation methods applied to automatic
summaries.
3.1 Introduction
Summarization is considered a fast-developing new research area. Therefore, this
needs good evaluation methodologies. Evaluation of automatic summaries faces a lot of
challenges that could make it useless. Evaluation still is not automated which therefore
needs human effort. This will increase the expenses of the evaluation process. As
automatic text summarization process had several factors that should be taken into
consideration, analogous is the evaluation process. Summarization involves compression;
evaluation should differ according to the compression ratio of the summary.
During the evaluation of a summary, two properties must be measured:

Compression Ratio: The ratio between the summary's length and the original
document's length, denoted CR:
CR = length of summary / length of full text

Retention Ratio: The amount of information retained in the summary, denoted RR
and sometimes referred to as the Omission Ratio:
RR = information in summary / information in original text

Any evaluation system for a summarizer must use these two properties. [20]
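The compression ratio is directly computable, whereas the retention ratio requires a way of quantifying "information", which is task-dependent. The sketch below therefore computes only CR, measuring length in words; measuring in characters or sentences would be an equally valid choice.

```java
public class SummaryRatios {
    // Compression ratio CR = length of summary / length of full text,
    // with length measured in words here (an assumption; characters or
    // sentences are also used in practice).
    static double compressionRatio(String summary, String fullText) {
        return (double) wordCount(summary) / wordCount(fullText);
    }

    static int wordCount(String text) {
        return text.trim().isEmpty() ? 0 : text.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        String full = "one two three four five six seven eight";
        String summ = "one two";
        System.out.println(compressionRatio(summ, full)); // 2 of 8 words = 0.25
    }
}
```

A CR of 0.25 corresponds to the 25% compression rate mentioned for Edmundson's extracts in Chapter 2.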
3.2 Evaluation Methods
Methods for evaluating text summarizers can be classified into two categories. The
first is an intrinsic evaluation, comparing to some gold standard. The second is extrinsic
evaluation, measures the system’s performance in a particular task. [20]
Extrinsic Evaluation measures the efficiency and the acceptability of the
generated summaries in some task. The quality of a summary evaluated using
extrinsic evaluation is judged against a set of external criteria, e.g., whether the
summary retains the information needed to satisfy an information need. Extrinsic
evaluation could for example be used for reading comprehension or relevance
assessment.
Intrinsic Evaluation means that the quality of a summary is judged only by
analysis of its textual structure and by comparison of the summary text to other
summaries. These reference summaries should form a gold standard, produced
either by a reference summarization system or by hand.
Most evaluation systems in current use take the intrinsic approach. An intrinsic
evaluation usually focuses on two concepts: coherence and informativeness [18].
Summary Coherence: Automatic extracts are constructed by selecting the most
significant sentences and combining them. Such extracts can suffer from
coherence problems, with gaps in the information flow between the selected
sentences.
Summary Informativeness: It measures how much information from the source
is preserved in the summary. Summary informativeness can also be measured by
comparing the information covered in the reference summary with the
information covered in the auto summary.
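As an illustration, informativeness could be approximated by word-overlap recall against a reference summary. This is a rough sketch of ours, not the measure used by any particular evaluation system:

```java
import java.util.HashSet;
import java.util.Set;

public class Informativeness {
    // Fraction of the reference summary's distinct words that also appear
    // in the automatic summary (a recall-style overlap score in [0, 1]).
    public static double overlapRecall(String reference, String auto) {
        Set<String> ref = tokens(reference);
        if (ref.isEmpty()) return 0;
        Set<String> sys = tokens(auto);
        int hits = 0;
        for (String w : ref) if (sys.contains(w)) hits++;
        return (double) hits / ref.size();
    }

    // Lower-cases and splits on non-word characters; a crude tokenizer.
    private static Set<String> tokens(String text) {
        Set<String> set = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) set.add(t);
        }
        return set;
    }
}
```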
Section 2: Text Extraction Automation
Automatic text extraction is widely used in the summarization process, in which the
most important segments of the document are selected. The better structured the
document, the better the output of the extraction process. A structured document
is one whose segments are well identified. A text segment is a block
of information that can describe an entire topic. Text segments can be units, chapters,
sections, pages, paragraphs or sentences. The segmentation of a document also helps in
text understanding and in information retrieval (IR). The main difficulty in identifying
text segments is automatic termination, i.e. determining the number of topic boundaries
in a document. [4]
There have been many studies on what is best identified as a segment:
sections, pages, paragraphs or sentences? Linguistic theories and work in IR
suggest that a coherent text segment is represented by paragraphs. Other studies suggest
that a sentence best represents a segment.
Chapter 4 will be concerned with the technical view of this study, describing the
technical reasons for the technologies, tools and evaluation used. Chapter 5
will discuss text extraction treating the segments as sentences, and Chapter 6 will
discuss text extraction treating the segments as paragraphs.
4 Software Development
This chapter will discuss the technical part of the dissertation. Three different
algorithms will be implemented: two based on paragraph extraction and one
based on sentence extraction. In addition, the technologies and tools used for the
summarizers will be described.
4.1 Introduction
In order to compare two systems, certain criteria must be fulfilled. The comparison
must be like to like, meaning that the systems should be built by the same developer
with the same tools and design. Fulfilling these criteria eliminates any external
factors that could affect the result of the comparison.
Therefore, this dissertation implements the three different algorithms using the
same tools and functionality, so that the methodology of each algorithm is the only
factor determining the output and style of the summary. The algorithms implemented are:
• Sentence Extraction based on Latent Semantic Analysis.
• Paragraph Extraction based on depth-first node path.
• Paragraph Extraction based on bushy node path.
The first algorithm will be explained in Chapter 5. The other two algorithms will be
explained in Chapter 6.
4.2 Technologies and Tools
This part is concerned mainly with the technologies and tools used to construct the
summarizers.
The programming language used is Java, due to the strong text-processing
capabilities of the Java class library. In addition, Java is platform independent, which
gives it the advantage of running on any platform.
The IDE (Integrated Development Environment) used is NetBeans 5.0. The NetBeans
IDE is a robust, free, open-source Java IDE that provides developers with everything
they need to create cross-platform desktop, web and mobile applications straight out of
the box. NetBeans attains its power from being composed of extensible plug-ins; the
IDE itself can be extended to provide new customized development environments.
The summarizers constructed will be web based, as a client-server
application centralizing all functionality on the server side. Centralization is
considered more secure. In addition, all computation is done on the server, which
limits the computing power needed to use the summarizers on the client side; this
matters because the summarizers perform a great deal of text processing, which requires
considerable computing power. Another advantage is that a centralized system is much
easier to keep consistent: there is only one source, and any change to it is reflected
immediately to all users. Moreover, making the system web based makes it easier to access.
However, we cannot ignore the reliability problem of centralized systems: any
failure of the server means that the whole system is down.
The web tool used is Java Servlets, chosen for its simplicity compared to other web
frameworks like Struts, Spring or JavaServer Faces (JSF), as the summarizers do not
need such extensible frameworks. The web server used is
Tomcat 5.5.9, which is bundled with NetBeans 5.0.
The next two chapters will explain the algorithms used and describe how they are
implemented and provide an example for each.
5 Sentence Extraction
5.1 Introduction
Sentence extraction is considered the oldest extraction technique. It was first
introduced by Luhn in 1958. The weight of a sentence was calculated from the weights
of the words it contains, and the sentences with the highest weights were selected.
This chapter will discuss an implementation of a summarizer based on sentence
extraction. Sentence extraction can be performed in many ways; examples include
word frequency, cue words, key words, title and heading methods, and Latent Semantic
Analysis (LSA). This study will perform sentence extraction based on LSA.
5.2 Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a theory and method for extracting and
representing the contextual-usage meaning of words by statistical computations applied
to a large corpus of text [15].
Let there be a set of documents or text segments S = {s1, s2, …, sm} and a set of
vocabulary W = {w1, w2, …, wn}. LSA transforms these two sets into a relationship
between terms and concepts, and a relationship between the documents and the same
concepts. Therefore, the terms and documents are related indirectly through the concepts.
Latent Semantic Analysis uses a term-document matrix (TDM), which describes the
occurrence of terms in documents, passages, or excerpts [16]. A term-document matrix is a
matrix in which each row is identified by a document and each column is identified by a
word. The value of cellij (in row i and column j) is the relation between document i and
word j. An example of a term-document matrix is shown in Table 5.1. The value
of cell1,1 in Table 5.1 is 3, meaning that the relation between the document "Document 1"
and the word "Automatic" is 3 (whatever this relation is). LSA provides the
relationship between terms and concepts, producing measures of word-word,
word-passage, and passage-passage relations [16].
             Automatic   Text   Summarization   Extraction
Document 1       3         2          6              5
Document 2       5         2          3              4
Document 3       4         1          2              7
Table 5.1 An example of a term-document matrix.
It is important to note that the similarity estimates derived by LSA are not simply
adjacent frequencies, co-occurrence counts, or correlations in usage; LSA depends on
a powerful mathematical analysis that is capable of inferring much deeper
relations, and as a consequence its estimates are often more precise.
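As a sketch of how a raw-count term-document matrix such as Table 5.1 could be built, with rows for documents and columns for words as described above (class and method names are ours; real systems would tokenize more carefully):

```java
public class TermDocumentMatrix {
    // Builds a raw-count term-document matrix: rows = documents,
    // columns = vocabulary words, cell = frequency of the word in the document.
    public static int[][] build(String[] documents, String[] vocabulary) {
        int[][] tdm = new int[documents.length][vocabulary.length];
        for (int i = 0; i < documents.length; i++) {
            // Crude tokenization: lower-case, split on non-word characters.
            for (String token : documents[i].toLowerCase().split("\\W+")) {
                for (int j = 0; j < vocabulary.length; j++) {
                    if (vocabulary[j].equals(token)) tdm[i][j]++;
                }
            }
        }
        return tdm;
    }
}
```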
Applications of LSA:
• Comparing documents in the concept space.
• Finding similar documents across languages, after analyzing a base set of
translated documents.
• Finding relations between terms (synonymy and polysemy).
• Translating a query of terms into the concept space and finding matching
documents (IR); in this case it is called LSI, Latent Semantic Indexing.
• Recently, LSA has been used in auto-tutoring systems for students.
• LSA has been used for grading essays; it was applied to TOEFL (Test of
English as a Foreign Language) [6].
LSA is closely related to neural net models, but it is based on SVD.
SVD, Singular Value Decomposition, is an important factorization of a rectangular
real or complex matrix. SVD can be seen as a generalization of the spectral theorem,
which says that normal matrices can be diagonalized using a basis of eigenvectors, to
arbitrary, not necessarily square, matrices. SVD allows the arrangement of the space to
reflect the major associative patterns in the data and to ignore the smaller, less important
influences. [28]
Steps of LSA:
1. Represent the text as a matrix (a term-document matrix). Each row stands for a
term; each column stands for a passage, excerpt, or other context. The value of
cellij is the frequency of term i in document j. In this step a term weighting is
chosen; many term-weighting methodologies are used, for example:
a. Luhn's method
b. Edmundson's method
2. The cells are subjected to an introductory (preparatory) transformation. Each cell
frequency is weighted by a function that expresses both the word's importance in
the particular passage and its importance in general (dispersion). This can be done
in various ways; one methodology is:
fij × G(i) × L(i,j) [11]
fij: the number of times term i appears in document j.
G(i): the global weighting of term i.
L(i,j): the local weighting of term i in document j.
3. LSA applies SVD to the matrix to decrease its redundancy.
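As an illustration of the local-weighting-times-global-weighting transformation in step 2, here is one possible choice of functions: a log local weighting and an idf-style global weighting. The particular functions and names are our choice for illustration, not prescribed by LSA:

```java
public class TermWeighting {
    // Local weighting: dampens the raw in-document frequency with a logarithm.
    public static double local(int freq) {
        return Math.log(1.0 + freq);
    }

    // Global weighting (idf-style): rewards terms concentrated in few documents.
    // Assumes docsContainingTerm >= 1.
    public static double global(int docsContainingTerm, int totalDocs) {
        return Math.log((double) totalDocs / docsContainingTerm);
    }

    // Transformed cell value for a term with the given raw frequency.
    public static double weight(int freq, int docsContainingTerm, int totalDocs) {
        return local(freq) * global(docsContainingTerm, totalDocs);
    }
}
```

A term absent from a document gets weight 0, and a term appearing in every document gets global weight 0, so it contributes nothing anywhere.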
5.3 Analysis and Design
The previous steps of the LSA method are the standard steps for creating a summary
based on LSA. A summarizer based on LSA does not have to go through each step
precisely; the steps are customized according to user requirements and system
capabilities. Our sentence extractor does not perform each step precisely: it uses
alternative methods and some additional functionality. For example, our algorithm uses
a method other than SVD to decrease the redundancy of the TDM. This method
will be described in detail in the next part.
In order to perform sentence extraction, we need to keep track of the words and
sentences available in the document. Each word in the document is specified by its Id,
value, global frequency, and the Ids of the sentences the word occurs in. The class diagram
of the class Word is illustrated in Figure 5.1. Each sentence is specified by its Id, value,
and the words representing it (WordFrequency). The class diagram of the class Sentence is
illustrated in Figure 5.2.
Figure 5.1 A class diagram for the class Word
Figure 5.2 A class diagram for the class Sentence
The decision whether to include a sentence in the summary thus depends on the
words that represent that sentence. Each word representing a sentence is
specified by the word's id and its frequency in that sentence. The class diagram of
the class WordFrequency is illustrated in Figure 5.3.
Figure 5.3 A class diagram for the class WordFrequency
Any extraction process goes through certain major steps. These major steps are
therefore included in a class called Extractor, which contains the basic functionality of any
extractor, whatever the scope of the segment. In addition, the Extractor class contains
the basic functionality for handling HTTP requests; it therefore extends the built-in Java
servlet class HttpServlet. Figure 5.4 shows a class diagram of the
Extractor class. The source code of the previous classes (Word, WordFrequency,
Sentence, Paragraph, and Extractor) is shown in Appendix A.
Figure 5.4 Class diagram of the Extractor
The central class is the SentenceExtractor class. The SentenceExtractor class extends
the Extractor class, uses the classes mentioned above and applies some functions
to them. Figure 5.5 shows the class diagram of the SentenceExtractor class.
Figure 5.5 A class diagram for the class SentenceExtractor
Figure 5.6 shows a class diagram for our sentence extractor.
Figure 5.6 The class diagram of the sentence extractor.
5.4 Implementation
The basic functionality of our sentence extractor is:
1. Identify words
2. Identify sentences
3. Estimate words' position and global frequency
4. Estimate sentences' weight by obtaining the words representing them.
5. Create a term-document matrix.
6. Minimize the size of the TDM by eliminating irrelevant words and sentences.
7. Construct the summary
Our sentence extractor can perform two services on a text. First, it extracts
the most important sentences in the text. Second, it displays the term-document
matrix for the document, which maps the relation between the words and the sentences.
This matrix could be useful for other studies in text extraction or information retrieval.
The previous points are mainly the basic steps of our algorithm. The following is a
more detailed explanation of the algorithm. The source code of the sentence extractor is
shown in Appendix B.
1. Read the file: The contents of the file are read. An important point is that
this summarizer only works on text data, not on figures or tables.
2. Parse the sentences:
a. A sentence is identified as a sequence of characters terminated by
a full stop (.), an exclamation mark (!), or a question mark (?).
b. Previous studies have shown that short sentences are not important, so
these sentences are discarded from the beginning. There is no agreed
sentence length cut-off, so a sentence length threshold has been specified
with a length that best suits the system.
3. Parse the words: The words are extracted from each sentence.
4. Eliminate unimportant words: Remove the noise words to produce a refined word
set.
a. There are some words that occur too frequently to be significant. These
words are not eligible to represent the document due to their uselessness as
they do not convey any important concept (like any, this, that, etc.). To
solve this problem we can either use a stop list that contains these noise
words or use a high frequency cut-off (threshold) where words which are
over a specific cut-off are ignored. This sentence extractor uses a stop list.
b. Most words with length less than 3 characters (like it, is, or, etc.) are
considered not important. Therefore, these words are neglected from the
beginning.
5. Estimate the global frequency of each word in the word set.
6. Perform another refinement on the words: Since the global frequency of each
word in the set is now available, and this frequency indicates the word's importance,
words with a global frequency below a specific value are discarded.
7. As discussed before, each word has some properties: id, value, and the sentences
containing it. In this step the Ids of the sentences that contain each word are
obtained.
8. Likewise, each sentence has some properties: id, value, and the words in the
sentence together with their frequencies. In this step the algorithm goes through
each sentence and finds the frequency of each word contained in it.
9. Obtain the number of sentences intended to be included in the summary. Usually
the number of sentences required for the summary is derived from the number of
sentences in the original document.
10. Construct the term-document matrix: In this step the term-document matrix is
constructed. The columns represent the words and the rows represent the
sentences. The value of the cell in row i and column j is the frequency of word j in
sentence i. Figure 5.7 is an example of a term-document matrix for a document
about encryption.
Figure 5.7 The TDM of a document about encryption.
11. Perform the intended service: As discussed before, this summarizer can perform
two services: displaying the TDM or summarizing the document. If the intended
service is to display the TDM, the matrix is exported to an Excel sheet as shown
above and the algorithm ends. However, if the intended service is to summarize,
proceed to the next step.
12. Sort the sentences: The sentences are sorted according to their importance. This is
done as follows:
a. For each row (sentence) in the TDM, the values in the cells of that row are
summed.
b. The sentences are sorted according to this sum.
13. Construct the summary: The most important sentences are selected for inclusion
in the summary:
a. After the sentences are sorted in descending order, the first N sentences
are selected, where N is the number of sentences required in the summary.
b. The selected sentences are sorted according to their order in the original
document.
c. The summary is constructed by adding these sorted sentences.
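Steps 12 and 13 above amount to scoring each TDM row by its sum, taking the top N rows, and restoring the original text order. A minimal sketch of that selection (class and method names are ours, not the real extractor's):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SentenceSelector {
    // tdm[i][j] = frequency of word j in sentence i.
    // Returns the indices of the n highest-scoring sentences,
    // re-sorted into their original document order.
    public static List<Integer> select(int[][] tdm, int n) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < tdm.length; i++) indices.add(i);
        // Sort sentence indices by descending row sum (importance).
        indices.sort(Comparator.comparingInt((Integer i) -> -rowSum(tdm[i])));
        List<Integer> chosen =
            new ArrayList<>(indices.subList(0, Math.min(n, indices.size())));
        chosen.sort(null); // restore original text order
        return chosen;
    }

    private static int rowSum(int[] row) {
        int s = 0;
        for (int v : row) s += v;
        return s;
    }
}
```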
5.5 Example
This part shows an example of a document extracted using our sentence extractor.
The document to be summarized is about e-commerce. Figure 5.8 shows the original
document.
Figure 5.8 The original document about E-commerce
The term-document matrix of this document was shown in Figure 5.7. The result of
the sentence extraction on the document about E-commerce is shown in Figure 5.9. The
sentences are separated by a delimiter for readability purposes.
Figure 5.9 A summary based on sentence extraction
6 Paragraph Extraction
This chapter will discuss the process of paragraph extraction, in which the algorithm
extracts the most eligible paragraphs. Section 6.1 provides an introduction to paragraph
extraction. Section 6.2 introduces a new concept used for paragraph extraction, the text
relationship map, which is also called a document-document matrix. Section 6.3 explains
the different methodologies used to select the paragraphs to include in the summary.
Section 6.4 explains the analysis and design of the summarizer. Section 6.5 explains the
algorithm used to implement the summarizer with respect to each method. Finally,
Section 6.6 provides two examples, one for each method.
6.1 Introduction
The idea of extracting paragraphs instead of sentences was first introduced in
1997 [22]. Since a paragraph is considered a self-contained part of the text, it was
expected that the readability and coherence problems seen in summaries generated by
sentence extraction would be reduced. It is agreed that a paragraph can address
multiple topics and is motivated by context, writing style, and presentation.
6.2 Text Relationship Maps
Usually in information retrieval each text segment or excerpt is represented by a
vector of weighted terms as follows:
Di = (di1, di2, di3, …, dit)
Di: document i.
dik: the importance weight of term tk in document i, where k = 1..t.
tk may be words or phrases derived from the document texts by an automatic indexing
procedure.
The vector similarity may be computed as the inner product between corresponding
vector elements:
Sim(Di, Dj) = ∑ (k = 1..t) dik · djk   [22]
The similarity function may be normalized to lie between 0 for disjoint vectors and 1
for completely identical vectors.
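One common way to obtain such a normalized inner product is cosine similarity, dividing the inner product by the vector lengths. A sketch (the choice of normalization and the names are ours for illustration):

```java
public class VectorSimilarity {
    // Inner product of two term-weight vectors, normalized by their
    // lengths (cosine similarity). For non-negative weights the result
    // lies between 0 (disjoint vectors) and 1 (identical direction).
    public static double sim(double[] di, double[] dj) {
        double dot = 0, ni = 0, nj = 0;
        for (int k = 0; k < di.length; k++) {
            dot += di[k] * dj[k];
            ni  += di[k] * di[k];
            nj  += dj[k] * dj[k];
        }
        if (ni == 0 || nj == 0) return 0; // empty vector: define as disjoint
        return dot / (Math.sqrt(ni) * Math.sqrt(nj));
    }
}
```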
To decide which paragraphs are eligible, we want to determine how the paragraphs are
related to each other. This is done using a text relationship map.
Text Relationship Maps: Nodes (paragraphs) are joined by links based on a numerical
similarity computed for each pair of texts using the equation above. All pairs of paragraphs
with a similarity over a specific threshold are connected by links [22].
The importance of a paragraph within the text is likely to be related to the number of
links incident on the corresponding node. Figure 6.1 demonstrates a text relationship map
of the paragraphs of the article Telecommunications from the Funk and Wagnalls
Encyclopedia. The paragraphs are denoted by nodes, and paragraphs which are sufficiently
similar are joined by a link. The similarity threshold used in this map is 0.12.
Figure 6.1 A text relationship map for an article on telecommunications
Figure 6.2 shows the relationship map for the article on Telecommunications at a
similarity threshold of 0.12 with links between distant paragraphs deleted.
Figure 6.2 Text relationship map of the telecommunications article after refinement
Important information about the document can be drawn from a text relationship map.
A text relationship map could be useful in:
1. Identifying related passages covering particular topic areas.
2. Providing information about the homogeneity of the text under consideration.
3. Decomposing a document into segments.
A segment is a contiguous piece of text that is linked internally but largely
disconnected from the adjacent text. Segments are our automatic approximation to
sectioning when a text does not have well-defined sections [4].
To create the extract we have to select the paragraphs that are considered the most
important for inclusion in the summary. The next part will discuss the different
methodologies for selecting these paragraphs.
6.3 Paragraph Extraction Methodologies
The process of extracting paragraphs using a text relationship map can be
accomplished by automatically identifying the important paragraphs (nodes) and passing
across the selected nodes in their text order to construct the extract, or path. The key
question is how to construct the path, since the path obviously determines the quality
and style of the extract.
Mainly, there are four types of paths used to construct an extract [22]:
1. Bushy path: The bushiness of a node is determined by the number of links
connecting it to the other nodes. Such nodes are good overview paragraphs and
can be used in the summary. Say we need N paragraphs in the summary; a
bushy path is then constructed from the N bushiest nodes on the map. The
order of the nodes in the summary is the same as their order in the original text.
2. Depth-first path: Bushy nodes are the nodes with the most connections to
other nodes in the text relationship map. However, that does not mean that the
bushy nodes are connected to each other, so they may not be closely related to
one another. This can produce a summary that covers the article well but lacks
coherence, and its readability might be poor. The solution is to use the depth-first
path instead. The depth-first path is constructed as follows:
a. Start at an important node (a highly bushy node or the first node in the
original text).
b. Visit the next most similar node.
c. Repeat step (b) until the limit of the summary length is reached.
Since each paragraph is similar to the node after it in the path, the summary
will be coherent. The summary consists of the nodes that fall within the path. The
summary contents are controlled by the contents of the first paragraph chosen;
therefore, all aspects of the article may not be covered by a depth-first path.
3. Segmented bushy path: Some nodes may be well connected to each other but not
connected to the rest of the nodes. For example, each group of nodes may be
internally interconnected, with only few connections between the groups. These
node groups are called segments. Using either of the two previous paths will not
construct a good summary here. However, a segmented bushy path could be the
solution: for each segment a bushy path is constructed according to its text order
in the original text. At least one paragraph is selected from each segment, and the
remainder of the extract is constructed by picking more bushy nodes from each
segment, with longer segments contributing more paragraphs. Since all segments
are represented in the extract, this algorithm should enhance the
comprehensiveness of the extract.
4. Augmented segmented bushy path: Authors usually describe their work in the
first couple of paragraphs, so the introductory part and the concepts that follow it
can be considered a segment. A segmented bushy system might ignore the
introductory part, as it is not very bushy, and instead pick a bushier node from the
middle of the text. This could affect the readability of the summary; besides, the
introductory paragraph is too rich a part to be ignored. The augmented segmented
bushy path therefore does what the previous method does but, in addition, always
chooses the introductory paragraph from a segment, and then picks the bushiest
paragraphs according to the size required for the summary.
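The depth-first path (method 2 above) can be sketched against a pairwise similarity matrix: start at a node, repeatedly move to the most similar unvisited node, and stop at the length limit. Names and the paragraph-count limit are our simplifications:

```java
import java.util.ArrayList;
import java.util.List;

public class DepthFirstPath {
    // sim[i][j] = similarity of paragraph j to paragraph i.
    // Builds a path of up to maxLen paragraphs starting from 'start'.
    public static List<Integer> path(double[][] sim, int start, int maxLen) {
        boolean[] visited = new boolean[sim.length];
        List<Integer> path = new ArrayList<>();
        int current = start;
        while (path.size() < maxLen) {
            path.add(current);
            visited[current] = true;
            // Find the most similar node not yet on the path.
            int next = -1;
            for (int j = 0; j < sim.length; j++) {
                if (!visited[j] && (next == -1 || sim[current][j] > sim[current][next])) {
                    next = j;
                }
            }
            if (next == -1) break; // no unvisited node left
            current = next;
        }
        return path;
    }
}
```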
6.4 Analysis and Design
An extract based on paragraph extraction is obtained by constructing a path from the
text relationship map. Our text relationship map is not represented as a graph as shown
previously; it is represented as a document-document matrix, a term used
extensively in IR. The rows of the matrix represent the documents in the corpus, and the
columns represent the same documents of the same corpus. The value of cellij
(in row i and column j) is the similarity of document j to document i. Table
6.1 shows an example of a document-document matrix. The similarity values range from
0 to 1, where 0 means the two documents are completely disjoint and 1 means the two
documents are identical or almost identical. The value of cell1,3 is 0.6, meaning that
"Document 3" is similar to "Document 1" with similarity 0.6.
             Document 1   Document 2   Document 3   Document 4
Document 1       1           0.2          0.6          0.5
Document 2       0.5         1            0.13         0.4
Document 3       0.02        0.444        1            0.22
Document 4       0.4         0.1          0.2          1
Table 6.1 A document-document matrix
There are some important notes about the previous table:
1. The similarity function used in the table is non-commutative: the
similarity of document X to document Y is not the same as the similarity of
document Y to document X. In Table 6.1, for example:
cell12 ≠ cell21
2. The value of cellij always equals 1 when i = j, as this is
the relation between a document and itself.
In order to perform a paragraph extraction, we need to keep track of the words,
sentences and paragraphs available in the document. We have discussed in the previous
chapter the words and sentences. The class diagram of the classes Word and Sentence are
illustrated in Figure 5.1 and Figure 5.2 in Chapter 5. Each paragraph in the document is
specified by its Id, value, number of sentences, and the words representing it
(WordFrequency). Figure 6.3 shows the class diagram of the Paragraph class.
Figure 6.3 Class diagram of the class Paragraph
We also mentioned in Chapter 5 the Extractor class, which performs the basic
functionality of an extractor whatever the scope of the text segment is. A class diagram of
the Extractor class is shown in Figure 5.4. The ParagraphExtractor class is the controller
class of our paragraph extractor. A class diagram of the ParagraphExtractor is shown in
Figure 6.4.
Figure 6.4 Class diagram of the ParagraphExtractor class
Figure 6.5 shows a class diagram for our paragraph extractor.
Figure 6.5 A class diagram of the paragraph extractor
6.5 Implementation
Our paragraph extractor can perform two services on a text. First, it extracts
the most important paragraphs in the text by constructing a path; the paragraph
extractor implements two different paths, the depth-first path and the bushy node
path. Second, it displays the document-document matrix for the document, which
maps the relations between the paragraphs. This matrix could be useful for other
studies in text extraction or information retrieval.
The basic functionality of our paragraph extractor is:
1. Identify paragraphs
2. Identify words
3. Identify sentences
4. Estimate words' position and global frequency
5. Estimate paragraphs' weight by obtaining the words representing them.
6. Create a document-document matrix.
7. Construct the path of the summary through the paragraphs (whether depth-
first or bushy node path).
8. Construct the summary
Some parts of this algorithm were explained in the previous chapter. The previous
points are mainly the basic steps of our algorithm. The source code of the paragraph
extractor is shown in Appendix C. The following is a more detailed explanation of the
algorithm:
1. Read the file: The contents of the file are read. An important point is that this
summarizer only works on text data, not on figures or tables.
2. Parse the paragraphs: A paragraph is terminated by a skipped line. However,
paragraph identification can be tricky, as each style of writing has a different way
of terminating a paragraph.
3. Parse the sentences:
a. A sentence is identified as a sequence of characters that are terminated by
a full stop (.), an exclamation mark (!), or a question mark (?).
b. Previous studies have shown that short sentences are not important, so we
discard these sentences from the beginning. There is no agreed sentence
length cut-off, so we specified a sentence length threshold that best suits
our system.
4. Parse the words: The words are extracted from each sentence.
5. Eliminate unimportant words: Remove the noise words to produce a refined word
set.
a. There are some words that occur too frequently to be significant. These
words are not eligible to represent the document due to their uselessness as
they do not convey any important concept (like any, this, that, etc.). To
solve this problem we can either use a stop list that contains these noise
words or use a high frequency cut-off (threshold) where words which are
over a specific cut-off are ignored. Our paragraph extractor uses a stop list.
b. Most words with length less than 3 characters (like it, is, or, etc.) are
considered not important. Therefore, we neglect these words from the
beginning.
6. This step collects some information on the paragraphs:
a. Get number of sentences in the paragraph.
b. Obtain the words that best represent the paragraph. A word that is eligible
to represent a paragraph should occur several times. We keep track of the
word's Id and its frequency within the paragraph.
7. Perform a refinement on the paragraphs: Each paragraph is represented by some
words, and the number of words representing a paragraph is an indication of its
importance. Therefore, paragraphs with fewer representative words than a
certain threshold are ignored.
8. As discussed before, each word has some properties: id, value, and the sentences
containing it. In this step we find the Ids of the sentences that contain each
word.
9. Sort the words representing the paragraph by importance.
10. Obtain the number of characters intended to be included in the summary. The
number of characters in the summary is directly derived from the number of
characters in the original document.
11. Construct the document-document matrix: The documents here are the paragraphs
of the document. Figure 6.6 is an example of a document-document matrix for a
document about encryption.
Figure 6.6 A document-document matrix of a document about encryption
12. Perform the intended service: If the intended service was to view the document-
document matrix, the matrix is exported to an Excel sheet as shown above in
Figure 6.6. If the intended service was to summarize using a depth-first node path,
go to step 13; if it was to summarize using a bushy node path, go to step 14.
13. Construct the depth-first node path: The depth-first path starts with the first node
"1" and searches for the node most similar to it. It then searches for the node most
similar to that node; Figure 6.7 shows an illustration. This goes on until the
chosen paragraphs reach the limit of the summary. The selected nodes constitute
the summary.
Figure 6.7 An illustration of the depth-first node path algorithm
14. Construct the bushy node path: The nodes that are bushiest are selected.
a. Using the document-document matrix shown in Figure 6.6, we sum the
similarity values of each row. For example, the summation of row 1 is
1 + 0.66667 + 0 + 0 + 0 + 0 + 1 + 0.33333 + 0 = 3.
b. The rows (nodes) are sorted in a descending order with respect to their
summation values.
c. The sorted nodes are selected one by one until they reach the limit.
d. The selected nodes are sorted again with respect to their position in the
original document.
e. The sorted selected nodes finally compose the summary.
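Steps a-e above can be sketched as follows (names are illustrative; in the real system the limit is the summary's character budget rather than a node count):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BushyPath {
    // Select the 'limit' bushiest nodes (largest row sums of the
    // document-document matrix), then restore document order (steps a-e).
    static List<Integer> select(double[][] sim, int limit) {
        int n = sim.length;
        // (a) sum the similarity values of each row
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                rowSum[i] += sim[i][j];
        // (b) sort node ids in descending order of their row sums
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < n; i++) ids.add(i);
        ids.sort(Comparator.comparingDouble((Integer i) -> rowSum[i]).reversed());
        // (c) take nodes one by one until the limit is reached
        List<Integer> chosen = new ArrayList<>(ids.subList(0, Math.min(limit, n)));
        // (d) re-sort the chosen nodes by their position in the original document
        chosen.sort(Comparator.naturalOrder());
        // (e) the sorted selected nodes compose the summary
        return chosen;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {1.0, 0.2, 0.9},
            {0.2, 1.0, 0.1},
            {0.9, 0.1, 1.0},
        };
        System.out.println(select(sim, 2)); // prints [0, 2]
    }
}
```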
6.6 Example
This section shows an example of a document summarized using our paragraph extractor.
The document to be summarized is about e-commerce and is titled Consumer
perceptions of Internet retail service quality. The original document is
shown in Chapter 5 in Figures 5.8, 5.9, and 5.10.
6.6.1 Depth-First Node Path
Figure 6.8 shows the summary of the document Consumer perceptions of Internet
retail service quality shown in Figure 5.8.
Figure 6.8 A summary based on the depth-first node path algorithm
6.6.2 Bushy Node Path
Figure 6.9 shows the summary of the document Consumer perceptions of Internet
retail service quality shown in Figure 5.8.
Figure 6.9 A summary based on the bushy node path algorithm
Section 3: Evaluation and results
7 Evaluation
This chapter evaluates the three extractors implemented, in order to provide
guidance on which one to use in which case.
7.1 Evaluation Procedures
This evaluation will take place as follows:
1. Specify the aspects of the summarization process.
2. Prepare the test data to perform the evaluation on.
3. Bring testers (participants) with different backgrounds to test the system.
4. Present a survey to these testers asking them some questions.
5. Obtain the results.
6. Perform some analysis on the survey results.
7.2 Aspects of the extraction process
We have discussed before in Chapter 3 that the compression rate of the
summarization process controls the quality and style of the summary. Studies have shown
that the best compression rate of the summary would be 25% of the original document
[12]. All the text extractors implemented in this dissertation use this compression rate.
We have also discussed in Chapter 1 the factors that affect the quality and style of the
summary (intent, background, coverage, and focus). We will fix the intent, coverage,
and focus factors, so the background factor will be the only variable.
The intent of the summary will be informative, the coverage will be
single-document, the focus will be generic, and the text genre is
academic papers. The user's knowledge will be the variable in this study.
Basically, a user's knowledge of a subject area falls into one of the
following levels: novice, shallow, medium, or expert.
7.3 Test Data
The test data of the evaluation process was carefully chosen to be
representative of the academic-paper corpus. The test data was chosen from different
fields, such as e-commerce, e-learning, image processing, NLP, IR, and GIS (Geographical
Information Systems).
7.4 Participant Users
The users chosen for this survey were selected to represent the full range of
users' knowledge, so that we can test the summaries with users of different
backgrounds: novice, shallow, medium, or expert.
7.5 Evaluation Techniques
A user chooses an algorithm and a test document and then asks the system to
summarize the chosen document using the specified algorithm. A summary is then
constructed.
After the user reads the summary, the evaluation process begins. We use empirical
methods to perform the evaluation [25]. Empirical methods are evaluation methods that
require the involvement of participants. There are many empirical methods used for
evaluation, such as co-discovery, user workshops, think-aloud protocols, field observation,
and questionnaires. We have chosen two empirical methods for our evaluation:
1. Questionnaires: A group of questions presented to the users. The questionnaire
type used in this study is a fixed-response questionnaire where the users are asked to
register their opinions on a Likert scale (a scale from 1 to 5, where one represents the
lowest weight and five the highest). The results of the questionnaire should tell us
what the user thinks of the summary's informativeness, readability, and coherence.
These are the properties of any evaluation process and are discussed in detail in
Chapter 3. The user will then be asked to grade the summarizer.
2. Field Observation: It involves watching how users interact with the system. We
will observe the user's reactions while reading the summary and while answering the
questionnaire.
7.6 Evaluation Equipment
This part discusses the equipment used to perform the evaluation:
the user interface and the evaluation sheet.
7.6.1 User Interface
A user interface of the system was constructed for evaluation purposes. The user
selects the extraction method and the test document, then submits to view the
summary. Figure 7.1 shows a snapshot of the interface while selecting the extraction
method.
Figure 7.1 Interface of the extractor system where the user chooses the method
Figure 7.2 shows a snapshot of the interface while choosing the test data to perform
the extraction on.
Figure 7.2 Interface of the system where the user chooses the document to perform the extraction on
Figure 7.3 shows the result page: at the top there is a link back to the original
document in case the user wants to return to it, and then the summary is presented.
Figure 7.3 The result page of our extractor, where a summary is provided for the document
7.6.2 Evaluation Sheet
After the user reads the summary he fills in a form. The form first asks some general
questions like the user's name, the document title, the subject area of the document, and
the user's knowledge about the subject area. Then some questions are asked about the
summary. A sample of the questionnaire is shown in Appendix D.
7.7 Evaluation Results
About 90 evaluation sheets were filled in, 30 for each method. The questionnaire was
filled in by users with different backgrounds. Within each method, 10 were filled in by
expert users, 13 by medium users, and 8 by novice users. The numerical
results of this evaluation are shown in Appendix F.
7.7.1 Expert Users Results
The users found that the sentence extractor algorithm and the bushy node path
algorithm equally retain the information of the original document with a mean of 3.5.
However, the depth-first node path algorithm was found to be slightly below medium in
retaining the information of the original document with a mean of 2.8.
The users found that the two paragraph extractors are more misleading than the
sentence extractor. However, the misleadingness ratings of the paragraph extractors are
below medium, which diminishes the risk.
The users found that the sentence extractor provides the most elegant summary with a
mean of 3.33. The depth-first comes second with a mean of 2.88 and the bushy comes
last with a mean of 2.5.
The users found that the depth-first algorithm requires the least background
knowledge about the subject area compared with the other algorithms. The
users see the sentence extractor as requiring slightly more background about the
subject area. The bushy node algorithm is viewed as the one requiring the most
background knowledge about the subject area.
The users graded the bushy node to be the best algorithm with a mean of 4. Next is the
sentence extractor with a mean of 3.8. Finally, the depth-first algorithm comes with a
mean of 3.
Figure 7.4 shows the chart of the expert users.
Figure 7.4 A chart of the questionnaire answers of the expert users. The vertical axis represents the mean and the horizontal axis represents the questions. Series 2: the sentence extractor algorithm. Series 3: the depth-first node path extractor algorithm. Series 4: the bushy node path extractor algorithm.
7.7.2 Medium Users Results
The users found that the sentence extractor algorithm and the depth-first node path
algorithm equally retain the information of the original document with a mean of 3.77.
However, the bushy node path algorithm was found to be slightly below the other two
algorithms in retaining the information of the original document with a mean of 3.58.
The users found that the bushy node algorithm is the most misleading, with a mean of
2.5. The other paragraph extractor is considered less misleading, with a mean of 2.36.
The sentence extractor is considered the least misleading, with a mean of 2.
The users found that the bushy node algorithm provides the most elegant summary
with a mean of 4. Right next to it comes the depth-first with a mean of 3.92 and the
sentence extractor comes last with a mean of 3.15.
The users found that the sentence extractor requires the least background
knowledge about the subject area compared with the other algorithms. The users
see the depth-first algorithm as requiring slightly more background about the
subject area, and the bushy node algorithm again comes last. The differences
between the three are the same.
The users graded the bushy node as the last algorithm, with a mean of 3.5 (which is
still good, as it is above medium). The sentence extractor is considered the best by
these users, with a mean of 3.84; next is the depth-first algorithm with a mean of 3.6.
Figure 7.5 shows a chart of the medium users.
Figure 7.5 A chart of the questionnaire answers of the medium users. The vertical axis represents the mean and the horizontal axis represents the questions. Series 2: the sentence extractor algorithm. Series 3: the depth-first node path extractor algorithm. Series 4: the bushy node path extractor algorithm.
7.7.3 Beginner Users Results
Novice users consider the depth-first algorithm the best at retaining the
information of the original document, with a mean of 3.625. The other two algorithms
are almost the same in information retention, with a mean of 3.2.
The users found that the bushy node path is the most misleading with a mean of 3
(which is exactly the medium). The other two algorithms are rated slightly below the
medium.
The depth-first algorithm was the most elegant with a mean of 3.875. The other two
algorithms have the exact same elegance rate.
Again, the bushy node path algorithm requires the user to have a large amount of
background knowledge, with a mean of 4. Next is the sentence extractor with a mean of
3.25; the depth-first algorithm comes last with a mean of 2.25.
The depth-first algorithm is graded the best, with a mean of 3.875. The other two
algorithms have the exact same mean of 3. Figure 7.6 shows a chart of the beginner (novice) users.
Figure 7.6 A chart of the questionnaire answers of the novice users. The vertical axis represents the mean and the horizontal axis represents the questions. Series 2: the sentence extractor algorithm. Series 3: the depth-first node path extractor algorithm. Series 4: the bushy node path extractor algorithm.
Section 4: Discussion
8 Discussion
In Chapter 1, I stated my aims and objectives for this thesis. I succeeded in achieving
the following:
• I implemented an automatic summarizer that is based on sentence extraction.
• I implemented an automatic summarizer that is based on bushy node path
paragraph extraction.
• I implemented an automatic summarizer that is based on depth-first node path
paragraph extraction.
• I enabled the public to contribute and share their opinions on the automatic
summaries by asking them some questions.
• I presented them with many questions, although I did not use them all, as the
results of these questions could be analyzed in the future.
• I evaluated the text summarizers with regards to the background aspect.
• I have been able to test how informative the summarizers are.
• I have been able to determine which of the summarizers is the safest
(least misleading).
• I performed some analysis on the results of the questionnaire, which yielded
interesting observations. However, more extensive analysis could be
performed on these results, from which very interesting conclusions could be obtained.
Section 8.1 discusses the conclusions I reached and the interesting observations I
discovered. Section 8.2 discusses the future work I intend to do.
8.1 Conclusions
Interesting conclusions were found in this study. These conclusions are backed up by
the results of the evaluation and by my personal experience in testing the algorithms.
The most important conclusions are shown below.
The depth-first algorithm proved to be the most appropriate algorithm for novice
users, while the sentence extractor proved to be the most appropriate for
medium users. Finally, the bushy node algorithm proved to be the most
appropriate for expert users. This method in particular was given very high grades in the
questionnaire, which reassures us that expert users would prefer a bushy node algorithm.
This conclusion seems logical: an individual who is an expert in a specific subject
area would prefer to have the basic (most important) points of the document,
which is what a bushy node algorithm provides.
The bushy node algorithm is the algorithm that requires the user to
have the most background knowledge about the subject area. This is backed up by the
results of the questionnaire, and could be a reason why novice users gave very bad grades
to this summarizer. The sentence extractor algorithm proved to be the safest algorithm,
as it is the least misleading. The depth-first node algorithm proved to construct the
most elegant summaries.
These are some guidelines on which algorithm to use, backed up by the evaluation
results and my personal experience in testing the algorithms:
• It is recommended to use a bushy node algorithm for experts.
• It is recommended to use a depth-first node algorithm for users who are new
to the subject area.
• Students who need a summary of their course material the night before an exam
would find the sentence extractor the best for them. It provides small notes
that are easy to remember.
• If the document is not very large, it is recommended not to use either of the
paragraph extractors.
• If the document is small, i.e. a page or two, it is recommended not to use any
of the extractors.
• If the document to be summarized is a critical document, it is
recommended to use the sentence extractor algorithm.
• No specific algorithm can be recommended if you are looking for elegance,
because the summarizers' elegance grades are very similar. However,
they all provide elegant summaries.
8.2 Future Work
Automatic Text Summarization will always be a very rich field for software
development, as it faces many challenges and provides a massive number of
opportunities.
The summarizers could be enhanced to construct more coherent summaries. This could
be done by adding lexical chains to the system and using pronominal
resolution. Pronominal resolution is the task of determining which entity in the text
a pronoun refers to [8].
More statistics could be computed on the results of the evaluation of the
three algorithms. As mentioned before, these results could provide interesting
information if more analysis is performed on them.
The algorithms could be applied to different text genres and to different focuses
(query-relevant or generic).
Video summarization has become a very interesting field that faces many problems
and challenges; performing summarization on video would be a very interesting project
to do.
References
[1] Choi, F. Y. Y., Wiemer-Hastings, P. and Moore, J., 2001, Latent semantic analysis for text segmentation. In Proceedings of EMNLP (Pittsburgh, USA), 109-117.
[2] Edmundson, H., 1969, New methods in automatic extracting. Journal of the ACM 16(2):264-285.
[3] Eduard H. and Daniel M., 1998, Automatic Text Summarization Tutorial, http://www.isi.edu/~marcu/acl-tutorial.ppt, Accessed (23/6/2006).
[4] Gregory S. and Kathleen F. McCoy, 2000, Efficient Text Summarization Using Lexical Chains.
[5] Firmin T. and Chrzanowski M.J., 1999, An Evaluation of Automatic Text Summarization Systems.
[6] Graesser, A.C., et al., 2001, Intelligent tutoring systems with conversational dialogue. AI Magazine, 22(4).
[7] Halliday M. and Hasan R., 1976, Cohesion in English. Longman, London.
[8] Hassel, M., 2000, Pronominal Resolution in Automatic Text Summarisation. Master Thesis, University of Stockholm, Department of Computer and Systems Sciences (DSV).
[9] Hercules D., 2003, Automatic Text Summarization, www.gslt.hum.gu.se/courses/ia/OHSummarizeSept2003.pdf, Accessed (23/8/2006).
[10] Hovy, E.H. and Lin, C-Y., 1998, Automated Text Summarization in SUMMARIST. In M. Maybury and I. Mani (eds), Intelligent Scalable Text Summarization. Forthcoming.
[11] Hu, X., Cai, Z., Louwerse, M., Olney, A., Penumatsa, P., Graesser, A.C. and TRG, 2003, A revised algorithm for Latent Semantic Analysis. In Proceedings of the 2003 International Joint Conference on Artificial Intelligence, 1489-1491.
[12] Jing, H., Barzilay, R., McKeown, K. and Elhadad, M., 1998, Summarization Evaluation Methods: Experiments and Analysis. Working Notes of the AAAI-98 Spring Symposium on Intelligent Text Summarization, 60-68.
[13] Jing, H., 2000, Sentence reduction for automatic text summarization. In Proceedings of ANLP.
[14] Kupiec, J., Pedersen, J. and Chen, F., 1995, A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 68-73, Seattle, Washington.
[15] Landauer, T. K. and Dumais, S. T., 1997, A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
[16] Landauer, T.K., Foltz, P.W. and Laham, D., 1998, An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
[17] Luhn, H., 1958, The automatic creation of literature abstracts. IBM Journal of Research and Development 2(2).
[18] Mani, I., 2001, Summarization Evaluation: An overview. In Proceedings of the Second NTCIR Workshop on Research in Chinese
[19] Mani, I. and Maybury, M. (eds), 1999, Advances in Automatic Text Summarization, MIT Press.
[20] Martin H. and Hercules D., 2004, Generation of Reference Summaries, http://www.nada.kth.se/~xmartin/papers/ltc_026_hassel_dalianis_final.pdf, Accessed (22/7/2006).
[21] Martin H. and Hercules D., 2005, Generation of reference summaries. In Proceedings of the 2nd Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, 21-23.
[22] Mitra M., Amit S. and Chris B., 1997, Automatic text summarization by paragraph extraction. In ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization, 31-36, Madrid, Spain.
[23] Morris, J. and Hirst, G., 1991, Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics 17(1):21-45.
[24] Pachantouris, G., 2005, GreekSum - A Greek Text Summarizer, Master Thesis, Department of Computer and Systems Sciences, KTH-Stockholm University.
[25] Jordan, Patrick W., 1998, An Introduction to Usability, Taylor & Francis Ltd.
[26] Regina B. and Michael E., 1997, Using lexical chains for text summarization. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain.
[27] Silber, H. G. and McCoy, K. F., 2000, Efficient text summarization using lexical chains. In Proceedings of Intelligent User Interfaces 2000.
[28] Wikipedia Encyclopedia, http://en.wikipedia.org/wiki/Singular_Value_Decomposition, Accessed (18/8/2006).
Appendices
Appendix A: Basic classes of the system
This part lists the basic classes used in the system, which are Extractor, Paragraph,
Sentence, Word, and WordFrequency.
First, the Paragraph class:
//////////////////////////////////////////Paragraph////////////////////////////////////////////////////////////////////////
/*
 * Paragraph.java
 *
 * Created on August 28, 2006, 7:28 PM
 */
/**
 * Represents one paragraph of the document.
 *
 * @author Omar Azzam
 */
public class Paragraph {
    int id;                                       // paragraph id
    String value;                                 // the paragraph text
    int numOfSentences;                           // number of sentences it contains
    int numOfChars;                               // number of characters it contains
    WordFrequency[] representativeWordsFrequency; // its representative words and their frequencies

    /** Creates a new instance of Paragraph */
    public Paragraph() {
    }
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Second, the WordFrequency class:
//////////////////////////////////////////WordFrequency////////////////////////////////////////////////////////////////
/*
 * WordFrequency.java
 *
 * Created on August 24, 2006, 12:53 PM
 */
/**
 * Pairs a word id with its frequency.
 *
 * @author Omar Azzam
 */
public class WordFrequency {
    int id, frequency;

    /** Creates a new instance of WordFrequency */
    public WordFrequency(int id, int frequency) {
        this.id = id;
        this.frequency = frequency;
    }
}
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Third, the Sentence class:
//////////////////////////////////////////Sentence///////////////////////////////////////////////////////////////////
import java.util.ArrayList;
/*
 * Sentence.java
 *
 * Created on August 24, 2006, 12:48 PM
 */
/**
 * Represents one sentence of the document.
 *
 * @author Omar Azzam
 */
public class Sentence {
    Integer id;
    String value;
    ArrayList<WordFrequency> wordFrequency = new ArrayList<WordFrequency>();
    WordFrequency sentenceWordFrequency[];

    /** Creates a new instance of Sentence */
    public Sentence(int id, String value) {
        this.id = new Integer(id);
        this.value = value;
    }
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Fourth, the Word class:
///////////////////////////////////////Word//////////////////////////////////////////////////////////////////////////////////
/*
 * Paragraph.java
 *
 * Created on August 28, 2006, 7:28 PM
 */
/**
 * @author Omar Azzam
 */
public class Paragraph {
    int id;
    String value;
    int numOfSentences;
    int numOfChars;
    WordFrequency[] representativeWordsFrequency;

    /** Creates a new instance of Paragraph */
    public Paragraph() {
    }
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Fifth, the Extractor class:
///////////////////////////////////////Extractor/////////////////////////////////////////////////////////////////////////////
/*
 * Extractor.java
 *
 * Created on August 31, 2006, 3:52 PM
 */
import java.io.*;
import java.util.ArrayList;
import javax.servlet.*;
import javax.servlet.http.*;

/**
 * Base class of all extractors: parses the document into sentences and
 * words, applies the stop list, and computes word frequencies.
 *
 * @author Omar Azzam
 */
public abstract class Extractor extends HttpServlet {
    FileInputStream fis;
    byte b[];
    ArrayList<String> a = new ArrayList<String>(); // the parsed sentences
    Sentence[] sentence;
    int sentenceCapacity = 0;
    String documentData;                           // the raw document text
    String words[];
    Word[] preEnhancedWords;                       // words after stop-list removal
    Word[] enhancedWords;                          // words after noise removal
    int preEnhancedWordsCapacity;
    int enhancedWordsCapacity;
    int wordCapacity = 0;

    /** Handles the HTTP GET method. */
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
    }

    /** Handles the HTTP POST method. */
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
    }

    /** Reads the whole document file into a String. */
    public String readData(String fileName) {
        try {
            fis = new FileInputStream(fileName);
            b = new byte[fis.available()];
            fis.read(b);
        } catch (FileNotFoundException ex) {
            ex.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return new String(b);
    }

    /** Splits the document into sentences at '.', '?' and '!'. */
    public void parseSentences() {
        int cursor = 0;
        String tempSentence;
        documentData = documentData.replaceAll("\r", "");
        documentData = documentData.replaceAll("\n", "");
        documentData = documentData.replaceAll("\t", "");
        String data = new String(documentData);
        cursor = data.indexOf(".");
        if (cursor > data.indexOf("?") && data.indexOf("?") != -1)
            cursor = data.indexOf("?");
        if (cursor > data.indexOf("!") && data.indexOf("!") != -1)
            cursor = data.indexOf("!");
        do {
            tempSentence = data.substring(0, cursor + 1);
            try {
                data = data.substring(cursor + 1);
            } catch (StringIndexOutOfBoundsException strIndExc) {
                break;
            }
            if (tempSentence.length() > 25) { // ignore very short fragments
                a.add(sentenceCapacity, tempSentence);
                sentenceCapacity++;
            }
            cursor = data.indexOf(".");
            if (cursor > data.indexOf("?") && data.indexOf("?") != -1)
                cursor = data.indexOf("?");
            if (cursor > data.indexOf("!") && data.indexOf("!") != -1)
                cursor = data.indexOf("!");
        } while (cursor >= 0);
        sentence = new Sentence[sentenceCapacity];
        for (int z = 0; z < sentenceCapacity; z++) {
            sentence[z] = new Sentence(z, a.get(z));
        }
    }

    /** Removes stop-list words and duplicates from the word list. */
    public void inStopList() {
        preEnhancedWords = new Word[words.length];
        preEnhancedWordsCapacity = 0;
        boolean exists = false;
        // The stop list of the system, restructured as a set lookup; the word
        // list is unchanged from the original chain of comparisons.
        final java.util.Set<String> stopList = new java.util.HashSet<String>(java.util.Arrays.asList(
            "want", "then", "that", "when", "this", "where", "which", "with",
            "these", "those", "know", "have", "from", "about", "your", "what",
            "between", "using", "different", "like", "very", "other", "part",
            "just", "don't", "they", "used", "there", "also", "than", "such",
            "more", "is", "many", "of", "and", "at", "an", "a", "all", "on",
            "no", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "has", "to", "yet", "we", "make", "been",
            "based", "are", "able", "or", "for", "after", "be", "same", "can",
            "even", "find", "it", "in", "his", "her", "own", "the", "most",
            "would", "could", "into", "however", "will", "were", "only",
            "here", "made"));
        for (int i = 0; i < words.length; i++) {
            try {
                // Keep only words longer than three characters that are not
                // in the stop list, and skip duplicate words.
                if (words[i].length() > 3 && !stopList.contains(words[i].toLowerCase())) {
                    exists = false;
                    for (int j = 0; j < preEnhancedWordsCapacity; j++) {
                        try {
                            if (words[i].equalsIgnoreCase(preEnhancedWords[j].value)) {
                                exists = true;
                                break;
                            }
                        } catch (NullPointerException er) {
                            // ignore null entries
                        }
                    }
                    if (!exists) {
                        preEnhancedWords[preEnhancedWordsCapacity] =
                            new Word(preEnhancedWordsCapacity, words[i]);
                        preEnhancedWordsCapacity++;
                    }
                }
            } catch (NullPointerException rr) {
                // ignore null words
            }
        }
    }

    /** Counts how many times each candidate word occurs in the whole document. */
    public void getGlobalFrequencyOfWords() {
        int frequency = 0;
        String tempData;
        String tempWord;
        int startIndex;
        documentData = documentData.toLowerCase();
        tempData = documentData;
        for (int i = 0; i < preEnhancedWordsCapacity; i++) {
            tempData = documentData;
            // Get the global frequency of the word.
            frequency = 0;
            tempWord = preEnhancedWords[i].value;
            tempWord = tempWord.toLowerCase();
            do {
                startIndex = tempData.indexOf(tempWord);
                if (startIndex == -1)
                    break;
                tempData = tempData.substring(startIndex + tempWord.length());
                frequency++;
            } while (startIndex != -1);
            preEnhancedWords[i].globalFrequency = frequency;
        }
    }

    /** Splits each sentence into words. */
    public void parseWords() {
        String tempwordsOfEachSentence[];
        ArrayList<String[]> wordSent = new ArrayList<String[]>();
        int i = 0, j = 0;
        int index = 0;
        String tempSentence;
        for (i = 0; i < sentence.length; i++) {
            tempSentence = new String(sentence[i].value);
            tempSentence = tempSentence.substring(0, tempSentence.length() - 1); // drop the end punctuation
            tempSentence = tempSentence.replaceAll("!", " ");
            tempSentence = tempSentence.replaceAll(",", " ");
            tempSentence = tempSentence.replaceAll(";", " ");
            tempSentence = tempSentence.replaceAll(":", " ");
            tempwordsOfEachSentence = tempSentence.split(" ");
            wordCapacity += tempwordsOfEachSentence.length;
            wordSent.add(i, tempwordsOfEachSentence);
        }
        words = new String[wordCapacity];
        tempwordsOfEachSentence = wordSent.get(j++);
        while (true) {
            for (int k = 0; k < tempwordsOfEachSentence.length; k++) {
                words[index] = tempwordsOfEachSentence[k];
                index++;
            }
            try {
                tempwordsOfEachSentence = wordSent.get(j);
            } catch (IndexOutOfBoundsException rf) {
                break;
            }
            j++;
        }
    }

    /** Keeps only words that occur more than twice in the document. */
    public void removeNoiseWords() {
        enhancedWordsCapacity = 0;
        enhancedWords = new Word[preEnhancedWordsCapacity];
        for (int i = 0; i < preEnhancedWordsCapacity; i++) {
            if (preEnhancedWords[i].globalFrequency > 2)
            {
                enhancedWords[enhancedWordsCapacity] =
                    new Word(enhancedWordsCapacity, preEnhancedWords[i]);
                enhancedWordsCapacity++;
            }
        }
    }

    public abstract void exportToExcelSheet();

    /** Returns a short description of the servlet. */
    public String getServletInfo() {
        return "Short description";
    }
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Appendix B: The sentence extractor program
/*
 * SentenceExtractor.java
 *
 * Created on August 28, 2006, 11:54 AM
 */
import java.io.*;
import java.util.ArrayList;
import javax.servlet.*;
import javax.servlet.http.*;

/**
 * The sentence extractor: builds the term-document matrix and selects
 * the sentences of the summary.
 *
 * @author Omar Azzam
 * @version
 */
public class SentenceExtractor extends Extractor {

    int numOfSentencesRequiredInSummary;
    int termDocumentMatrix[][];
    PrintWriter out;
    Sentence enhancedSentence[];
    int sortedSentencesID[];
    int enhancedSentenceCapacity;
    String fileName = null; // e.g. "C:/Trial.txt"
    Word sortedEnhancedWords[];
    boolean export = false;

    // <editor-fold defaultstate="collapsed" desc="HttpServlet methods.">
    /** Handles the HTTP <code>GET</code> method.
     * @param request servlet request
     * @param response servlet response
     */
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        out = response.getWriter();
        export = false;
        fileName = null;
        documentData = null;
        enhancedSentence = null;
        enhancedSentenceCapacity = 0;
        numOfSentencesRequiredInSummary = 0;
        preEnhancedWords = null;
        preEnhancedWordsCapacity = 0;
        sentence = null;
        sentenceCapacity = 0;
        sortedEnhancedWords = null;
        sortedSentencesID = null;
        termDocumentMatrix = null;
        wordCapacity = 0;
        words = null;

        if (request.getParameter("service").equalsIgnoreCase("export")) {
            response.setContentType("application/vnd.ms-excel");
            export = true;
        }
        fileName = request.getParameter("file");
        documentData = readData(fileName);
        parseSentences();
        parseWords();
        inStopList();
        getPreEnhancedWordsLength();
        getGlobalFrequencyOfWords();
        removeNoiseWords();
        fillinWordClass();
        fillInWordFrequencyClass();
        enhancedSentence = sentence;
        enhancedSentenceCapacity = sentenceCapacity;
        numOfSentencesRequiredInSummary = getNumberOfSentencesRequiredInSummary();
        termDocumentMatrix = createTermDocumentMatrix();
        if (export)
            exportToExcelSheet();
        else {
            // Set the look of the web page.
            int inn = fileName.lastIndexOf("/");
            String tempFileName = fileName.substring(inn + 1);
            tempFileName = tempFileName.replace(".txt", "");
            String title = tempFileName.replaceAll("/", "");
            String tt = null;
            boolean problema = false;
            try {
                tt = fileName.substring(0, inn);
            } catch (Exception e) {
                problema = true;
            }
            if (!problema) {
                tt += "/OriginalData/" + tempFileName + "OriginalDocument.htm";
                title = title.replaceAll("\"", "");
                title = title.replaceAll("C:", "");
                tt.replaceAll("/", "\"");
                out.println("<title>" + title + "</title>");
                out.println("<br><a href='" + tt + "'>Full Document</a>");
            }
            out.println("<br><br><h4>Summary</h4><br><br>");
            String summarySentencesIds = "";
            int temp;
            int[] sortedSentencesId = sortSentences();
            int selectedSentencedId[] = new int[numOfSentencesRequiredInSummary];
            for (int i = 0; i < numOfSentencesRequiredInSummary; i++) {
                selectedSentencedId[i] = sortedSentencesId[i];
            }
            // Restore document order among the selected sentences.
            for (int i = 0; i < numOfSentencesRequiredInSummary - 1; i++) {
                for (int j = i + 1; j < numOfSentencesRequiredInSummary; j++) {
                    if (enhancedSentence[selectedSentencedId[i]].id > enhancedSentence[selectedSentencedId[j]].id) {
                        temp = selectedSentencedId[i];
                        selectedSentencedId[i] = selectedSentencedId[j];
                        selectedSentencedId[j] = temp;
                    }
                }
            }
            for (int i = 0; i < numOfSentencesRequiredInSummary; i++) {
                out.println(enhancedSentence[selectedSentencedId[i]].value);
                out.println("<br>------<br>");
            }
        }
    }

    /** Handles the HTTP <code>POST</code> method.
     * @param request servlet request
     * @param response servlet response
     */
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
    }

    /** Returns a short description of the servlet. */
    public String getServletInfo() {
        return "Short description";
    }

    public void removeNoiseSentences() {
        enhancedSentence = new Sentence[sentence.length];
        enhancedSentenceCapacity = 0;
        boolean eligible = true;
        for (int i = 0; i < sentence.length; i++) {
            if (sentence[i].sentenceWordFrequency.length < 3) {
                enhancedSentence[enhancedSentenceCapacity] = sentence[i];
                enhancedSentence[enhancedSentenceCapacity].id = enhancedSentenceCapacity;
                enhancedSentenceCapacity++;
            }
        }
    }

    public int getNumberOfSentencesRequiredInSummary() {
        return (int) (Math.floor(enhancedSentence.length / 4));
    }

    public void fillinWordClass() {
        int cursor;
        int wordFrequency;
        String tempSentence;
        String tempWord;
        WordFrequency tempWordFrequency;
        for (int i = 0; i < enhancedWordsCapacity; i++) {
            tempWord = enhancedWords[i].value;
            enhancedWords[i].numberOfSentencesContainingIt = 0;
            for (int j = 0; j < sentence.length; j++) {
                tempSentence = sentence[j].value;
                wordFrequency = 0;
                if ((cursor = tempSentence.indexOf(tempWord)) >= 0) {
                    do {
                        wordFrequency++;
                        tempSentence = tempSentence.substring(cursor + tempWord.length());
                        cursor = tempSentence.indexOf(tempWord);
                    } while (cursor >= 0);
                    enhancedWords[i].numberOfSentencesContainingIt++;
                    enhancedWords[i].sentenceContainingIt.add(new Integer(sentence[j].id));
                    tempWordFrequency = new WordFrequency(enhancedWords[i].id, wordFrequency);
                    sentence[j].wordFrequency.add(tempWordFrequency);
                }
            }
        }
        for (int i = 0; i < enhancedWordsCapacity; i++) {
            enhancedWords[i].sentencesIdContainingIt = new int[enhancedWords[i].numberOfSentencesContainingIt];
            for (int j = 0; j < enhancedWords[i].numberOfSentencesContainingIt; j++) {
                enhancedWords[i].sentencesIdContainingIt[j] = enhancedWords[i].sentenceContainingIt.get(j).intValue();
            }
        }
    }

    public void getPreEnhancedWordsLength() {
        for (int i = 0; i < preEnhancedWords.length; i++) {
            if (preEnhancedWords[i] == null) {
                preEnhancedWordsCapacity = i - 1;
                break;
            }
        }
    }

    public void fillInWordFrequencyClass() {
        ArrayList<WordFrequency> tempWordFrequency;
        Sentence tempSentence;
        int wordFrequencyCounter;
        for (int i = 0; i < sentenceCapacity; i++) {
            tempSentence = sentence[i];
            tempWordFrequency = tempSentence.wordFrequency;
            wordFrequencyCounter = 0;
            while (true) {
                try {
                    tempWordFrequency.get(wordFrequencyCounter);
                } catch (IndexOutOfBoundsException igg) {
                    break;
                }
                wordFrequencyCounter++;
            }
            tempSentence.sentenceWordFrequency = new WordFrequency[wordFrequencyCounter];
            for (int j = 0; j < wordFrequencyCounter; j++)
                tempSentence.sentenceWordFrequency[j] = (WordFrequency) tempWordFrequency.get(j);
        }
    }

    public int[][] createTermDocumentMatrix() {
        int tdm[][] = new int[enhancedSentenceCapacity][enhancedWordsCapacity];
        int counter, wordFrequencyCounter;
        Sentence tempSentence;
        Word tempWord;
        WordFrequency[] tempWordFrequency;
        for (int i = 0; i < enhancedSentenceCapacity; i++) {
            counter = 0;
            tempSentence = enhancedSentence[i];
            tempWordFrequency = tempSentence.sentenceWordFrequency;
            for (int j = 0; j < tempWordFrequency.length; j++) {
                tdm[i][tempWordFrequency[j].id] = tempWordFrequency[j].frequency;
            }
        }
        return tdm;
    }

    public void exportToExcelSheet() {
        String wordValues = new String();
        String sentenceValues = new String();
        for (int i = 0; i < enhancedWordsCapacity; i++) {
            wordValues = wordValues + enhancedWords[i].value + "\t";
        }
        out.println(wordValues);
        for (int i = 0; i < enhancedSentenceCapacity; i++) {
            sentenceValues = "";
            for (int j = 0; j < enhancedWordsCapacity; j++) {
                sentenceValues = sentenceValues + termDocumentMatrix[i][j] + "\t";
            }
            out.println(sentenceValues);
        }
    }

    public int[] sortSentences() {
        int tempTDM[][] = termDocumentMatrix;
        int tempSentencesWeight[] = new int[enhancedSentenceCapacity];
        int tempSentencesId[] = new int[enhancedSentenceCapacity];
        int temp;
        int sum;
        for (int i = 0; i < enhancedSentenceCapacity; i++) {
            sum = 0;
            for (int j = 0; j < enhancedWordsCapacity; j++) {
                sum += tempTDM[i][j];
            }
            tempSentencesWeight[i] = sum;
            tempSentencesId[i] = i;
        }
        for (int i = 0; i < tempSentencesId.length - 1; i++) {
            for (int j = i + 1; j < tempSentencesId.length; j++) {
                if (tempSentencesWeight[i] < tempSentencesWeight[j]) {
                    temp = tempSentencesWeight[i];
                    tempSentencesWeight[i] = tempSentencesWeight[j];
                    tempSentencesWeight[j] = temp;
                    temp = tempSentencesId[i];
                    tempSentencesId[i] = tempSentencesId[j];
                    tempSentencesId[j] = temp;
                }
            }
        }
        return tempSentencesId;
    }
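The ranking step performed by sortSentences() in the listing above — weighting each sentence by the sum of its row in the term-document matrix and returning sentence indices in descending order of weight — can be sketched as a small standalone class. This sketch is an illustration only, not part of the thesis servlets; the class and method names are my own.

```java
public class RankSketch {
    // Rank the rows of a term-document matrix by descending total weight,
    // mirroring the exchange-sort style used in sortSentences().
    public static int[] rankRowsByWeight(int[][] tdm) {
        int n = tdm.length;
        int[] weights = new int[n];
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) {
            ids[i] = i;
            for (int j = 0; j < tdm[i].length; j++)
                weights[i] += tdm[i][j]; // sentence weight = row sum
        }
        // Sort ids so the heaviest sentence comes first.
        for (int i = 0; i < n - 1; i++)
            for (int j = i + 1; j < n; j++)
                if (weights[i] < weights[j]) {
                    int t = weights[i]; weights[i] = weights[j]; weights[j] = t;
                    t = ids[i]; ids[i] = ids[j]; ids[j] = t;
                }
        return ids;
    }
}
```

The servlet then takes the first quarter of the ranked indices and re-sorts them into document order before printing.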
}

Appendix C: The paragraph extractor program

/*
 * ParagraphExtractor.java
 *
 * Created on August 28, 2006, 7:00 PM
 */

import java.io.*;
import java.net.*;
import java.util.ArrayList;
import javax.servlet.*;
import javax.servlet.http.*;

/**
 * @author Omar Azzam
 * @version
 */
public class ParagraphExtractor extends Extractor {

    PrintWriter out;
    String documentParagraphs[];
    String fileName;
    int enhancedParagraphsCapacity = 0;
    float documentSize;
    double summaryLimit;
    static int dd = 0;
    boolean export = false;
    boolean summarize = false;
    double upperRatioLimit = 0.3;
    double lowerRatioLimit = 0.2;
    double[][] documentToDocumentMatrix;
    String service = new String();
    String method = new String();
    Paragraph[] paragraphs;
    Paragraph[] enhancedParagraph;

    /** Handles the HTTP <code>GET</code> method.
     * @param request servlet request
     * @param response servlet response
     * @throws javax.servlet.ServletException
     * @throws java.io.IOException
     */
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Initializing data
        documentData = null;
        documentParagraphs = null;
        documentToDocumentMatrix = null;
        enhancedParagraph = null;
        enhancedParagraphsCapacity = 0;
        enhancedWords = null;
        enhancedWordsCapacity = 0;
        export = false;
        fileName = null;
        method = null;
        paragraphs = null;
        preEnhancedWords = null;
        preEnhancedWordsCapacity = 0;
        sentence = null;
        sentenceCapacity = 0;
        summarize = false;
        service = null;
        wordCapacity = 0;
        words = null;

        method = request.getParameter("method");
        out = response.getWriter();
        fileName = request.getParameter("file");
        service = request.getParameter("service");
        fileNames = getFileNames();
        int counterr = 0;
        // Obtaining the physical name of the file
        while (counterr < NUM_OF_TEST_DATA) {
            if (fileName.equalsIgnoreCase(fileNames[counterr][0])) {
                fileName = fileNames[counterr][1];
                break;
            }
            counterr++;
        }
        if (service.equalsIgnoreCase("Summarize"))
            summarize = true;
        else {
            export = true;
            response.setContentType("application/vnd.ms-excel");
        }
        if (method.equalsIgnoreCase("Sentence Extraction")) {
            // This will redirect to another servlet, "SentenceExtractor"
            if (summarize)
                response.sendRedirect("http://localhost:8084/trial2/trial?file=" + fileName + "&service=summarize");
            else
                response.sendRedirect("http://localhost:8084/trial2/trial?file=" + fileName + "&service=export");
            return;
        }
        documentData = readData(fileName);
        documentSize = documentData.length();
        parseParagraphs();
        parseSentences();
        parseWords();
        inStopList();
        getPreEnhancedWordsLength();
        getGlobalFrequencyOfWords();
        removeNoiseWords();
        fillInParagraphClass();
        removeIneligibleParagraphs();
        sortWordFrequenciesOfParagraphs();
        documentToDocumentMatrix = getDocumentToDocumentMatrix();
        if (export) {
            exportToExcelSheet();
            return;
        }
        // Set the look of the web page.
        int inn = fileName.lastIndexOf("/");
        String tempFileName = fileName.substring(inn + 1);
        tempFileName = tempFileName.replace(".txt", "");
        String title = tempFileName.replaceAll("/", "");
        String tt = fileName.substring(0, inn);
        tt += "/OriginalData/" + tempFileName + "OriginalDocument.htm";
        title = title.replaceAll("\"", "");
        title = title.replaceAll("C:", "");
        tt.replaceAll("/", "\"");
        out.println("<title>" + title + "</title>");
        out.println("<br><a href='" + title + "'>Full Document</a>");
        out.println("<br><br><h4>Summary</h4><br><br>");
        // Select the path the summary will go through
        if (method.equalsIgnoreCase("Paragraph Extraction using Depth First Nodes path"))
            createDepthFirstPath();
        else if (method.equalsIgnoreCase("Paragraph Extraction using Bushy Nodes path"))
            createBushyNodePath();
        else
            out.println("Malformed URL");
    }

    public int getNumberOfSentences(String par) {
        int cursor = 0;
        int sentenceCounter = 0;
        do {
            cursor = par.indexOf(".");
            if (cursor > par.indexOf("?") && par.indexOf("?") != -1)
                cursor = par.indexOf("?");
            if (cursor > par.indexOf("!") && par.indexOf("!") != -1)
                cursor = par.indexOf("!");
            par = par.substring(cursor + 1);
            sentenceCounter++;
        } while (cursor != -1);
        return sentenceCounter;
    }

    /*
     * Creates a path for the summary based on the bushy node path, which selects
     * the bushiest nodes.
     */
    public void createBushyNodePath() {
        summaryLimit = Math.floor((double) documentData.length() / 4);
        int currentCapacity = 0;
        double nodesWeights[] = new double[enhancedParagraphsCapacity];
        int bushiestNodes[] = new int[enhancedParagraphsCapacity];
        double sum;
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            sum = 0;
            for (int j = 0; j < enhancedParagraphsCapacity; j++) {
                sum += documentToDocumentMatrix[i][j];
            }
            nodesWeights[i] = sum;
        }
        bushiestNodes = sort(nodesWeights);
        currentCapacity = 0;
        int numOfParagraphsIncludedInSummary = 0;
        for (numOfParagraphsIncludedInSummary = 0;
                numOfParagraphsIncludedInSummary < enhancedParagraphsCapacity;
                numOfParagraphsIncludedInSummary++) {
            currentCapacity += enhancedParagraph[bushiestNodes[numOfParagraphsIncludedInSummary]].value.length();
            if (currentCapacity > summaryLimit)
                break;
        }
        int finalParagraphs[] = sortChosenParagraphs(bushiestNodes, numOfParagraphsIncludedInSummary);
        int ii = 0;
        for (int i = 0; i < numOfParagraphsIncludedInSummary; i++) {
            out.println(enhancedParagraph[finalParagraphs[i]].value);
            out.println("<br>-------------------------------------------------------------------<br>");
            ii++;
        }
    }

    public int[] sortChosenParagraphs(int[] selectedParagrahs, int paragraphsInSummary) {
        int[] sortedParagraphs = new int[paragraphsInSummary];
        for (int i = 0; i < sortedParagraphs.length; i++) {
            sortedParagraphs[i] = selectedParagrahs[i];
        }
        return sort(sortedParagraphs);
    }

    public int[] sort(int[] array) {
        int temp;
        for (int i = 0; i < array.length - 1; i++) {
            for (int j = i + 1; j < array.length; j++) {
                if (array[i] > array[j]) {
                    temp = array[i];
                    array[i] = array[j];
                    array[j] = temp;
                }
            }
        }
        return array;
    }

    /*
     * Sorts the eligible nodes (paragraphs) by their weight and returns an array
     * containing the IDs of the paragraphs sorted in descending order.
     */
    public int[] sort(double[] array) {
        int[] bushiestNodes = new int[enhancedParagraphsCapacity];
        int tempBushyNode;
        double tempArray;
        for (int i = 0; i < bushiestNodes.length; i++)
            bushiestNodes[i] = i;
        for (int i = 0; i < array.length - 1; i++) {
            for (int j = i + 1; j < array.length; j++) {
                if (array[i] < array[j]) {
                    tempBushyNode = bushiestNodes[i];
                    bushiestNodes[i] = bushiestNodes[j];
                    bushiestNodes[j] = tempBushyNode;
                    tempArray = array[i];
                    array[i] = array[j];
                    array[j] = tempArray;
                }
            }
        }
        return bushiestNodes;
    }

    /*
     * Creates a path for the summary based on the depth-first node path, which begins
     * with the first node and repeatedly selects the node most similar to it.
     */
    public void createDepthFirstPath() {
        summaryLimit = Math.floor((double) documentData.length() / 4);
        String summaryPath = "0;";
        double similarity = 0.0;
        int position = -1;
        int counter = 0;
        int ii = 0;
        int jj = 0;
        while (counter <= enhancedParagraphsCapacity && ii < enhancedParagraphsCapacity) {
            position = -1;
            similarity = 0.0;
            for (jj = 0; jj < enhancedParagraphsCapacity; jj++) {
                if (ii == jj)
                    continue;
                if (documentToDocumentMatrix[ii][jj] > similarity) {
                    similarity = documentToDocumentMatrix[ii][jj];
                    position = jj;
                }
            }
            if (position != -1) {
                if (summaryPath.indexOf(position + ";") == -1) {
                    ii = position;
                    summaryPath += position + ";";
                    counter++;
                } else {
                    ii++;
                    if (ii == enhancedParagraphsCapacity)
                        break;
                    summaryPath += ii + ";";
                    counter++;
                }
            } else {
                ii++;
            }
        }
        String eligibleParagraphs[] = summaryPath.split(";");
        Integer eligibleParagraphsId[] = new Integer[eligibleParagraphs.length];
        for (int i = 0; i < eligibleParagraphs.length; i++)
            eligibleParagraphsId[i] = new Integer(eligibleParagraphs[i]);
        double currentCapacity = 0;
        for (int i = 0; i < eligibleParagraphsId.length; i++) {
            currentCapacity += enhancedParagraph[eligibleParagraphsId[i].intValue()].value.length();
            if (currentCapacity > summaryLimit) {
                out.println(enhancedParagraph[eligibleParagraphsId[i].intValue()].value);
                break;
            }
            out.println(enhancedParagraph[eligibleParagraphsId[i].intValue()].value);
            out.println("<br>----------------------------------------------- <br>");
        }
    }

    /*
     * Gets the words that can represent this text excerpt and their frequencies
     * within it.
     */
    public WordFrequency[] getRepresentativeWords(String par) {
        ArrayList<WordFrequency> arrWordFrequency = new ArrayList<WordFrequency>();
        WordFrequency[] representativeWords;
        int numOfRepresentativeWords = 0;
        int wordOccurence;
        for (int i = 0; i < enhancedWordsCapacity; i++) {
            wordOccurence = getWordOccurence(enhancedWords[i].value, par);
            if (wordOccurence > 2) {
                arrWordFrequency.add(new WordFrequency(i, wordOccurence));
                numOfRepresentativeWords++;
            }
        }
        representativeWords = new WordFrequency[numOfRepresentativeWords];
        for (int i = 0; i < numOfRepresentativeWords; i++) {
            representativeWords[i] = arrWordFrequency.get(i);
        }
        return representativeWords;
    }

    /*
     * Returns the number of occurrences of the parameter word in the parameter
     * paragraph.
     */
    public int getWordOccurence(String word, String paragraph) {
        String tempParagraph = new String(paragraph);
        int wordCounter = 0;
        int cursor;
        while (true) {
            try {
                cursor = tempParagraph.indexOf(word);
                if (cursor == -1)
                    break;
                tempParagraph = tempParagraph.substring(cursor + word.length());
            } catch (StringIndexOutOfBoundsException strExc) {
                break;
            }
            wordCounter++;
        }
        return wordCounter;
    }

    // Fill in the properties of the paragraph class.
    public void fillInParagraphClass() {
        paragraphs = new Paragraph[documentParagraphs.length];
        for (int i = 0; i < documentParagraphs.length; i++) {
            paragraphs[i] = new Paragraph();
            paragraphs[i].id = i;
            paragraphs[i].value = documentParagraphs[i];
            paragraphs[i].numOfChars = documentParagraphs[i].length();
            paragraphs[i].numOfSentences = getNumberOfSentences(documentParagraphs[i]);
            paragraphs[i].representativeWordsFrequency = getRepresentativeWords(documentParagraphs[i]);
        }
    }

    /*
     * Removes paragraphs that are considered ineligible. Paragraphs that have fewer
     * than two words to represent them are considered not eligible.
     */
    public void removeIneligibleParagraphs() {
        enhancedParagraph = new Paragraph[paragraphs.length];
        enhancedParagraphsCapacity = 0;
        for (int i = 0; i < paragraphs.length; i++) {
            if (paragraphs[i].representativeWordsFrequency.length > 1) {
                enhancedParagraph[enhancedParagraphsCapacity++] = paragraphs[i];
            }
        }
    }

    /*
     * Parses paragraphs; paragraphs are mostly separated by a blank line.
     */
    public void parseParagraphs() {
        documentParagraphs = documentData.split("\r\n\r\n");
    }
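The paragraph-to-paragraph similarity that getDocumentToDocumentMatrix() and findSimilarity() compute later in this listing reduces to counting the representative word IDs two paragraphs share and dividing by the first paragraph's representative-word count. A minimal standalone sketch of that measure follows; it is not part of the thesis code, and its names are illustrative.

```java
public class SimilaritySketch {
    // Number of word ids common to both paragraphs, divided by the number of
    // representative words of the first paragraph. Note the measure is
    // asymmetric: similarity(i, j) need not equal similarity(j, i).
    public static double similarity(int[] wordIdsI, int[] wordIdsJ) {
        int shared = 0;
        for (int a : wordIdsI)
            for (int b : wordIdsJ)
                if (a == b)
                    shared++;
        return shared / (double) wordIdsI.length;
    }
}
```

In the servlet, these values fill the document-to-document matrix (with 1.0 on the diagonal), which both the bushy-node and depth-first path builders then consume.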
    public void getPreEnhancedWordsLength() {
        for (int i = 0; i < preEnhancedWords.length; i++) {
            if (preEnhancedWords[i] == null) {
                preEnhancedWordsCapacity = i - 1;
                break;
            }
        }
    }

    /** Returns a short description of the servlet. */
    public String getServletInfo() {
        return "Short description";
    }

    // Within each paragraph, sort the words representing it in descending order
    // of frequency.
    public void sortWordFrequenciesOfParagraphs() {
        Paragraph tempParagraph;
        WordFrequency[] tempWordFrequencyArray;
        WordFrequency tempWordFrequency;
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            tempParagraph = enhancedParagraph[i];
            tempWordFrequencyArray = tempParagraph.representativeWordsFrequency;
            for (int j = 0; j < tempWordFrequencyArray.length - 1; j++) {
                for (int k = j + 1; k < tempWordFrequencyArray.length; k++) {
                    if (tempWordFrequencyArray[j].frequency < tempWordFrequencyArray[k].frequency) {
                        tempWordFrequency = tempWordFrequencyArray[j];
                        tempWordFrequencyArray[j] = tempWordFrequencyArray[k];
                        tempWordFrequencyArray[k] = tempWordFrequency;
                    }
                }
            }
            enhancedParagraph[i].representativeWordsFrequency = tempWordFrequencyArray;
        }
    }

    // Get the document-to-document matrix mapping the relations between the
    // paragraphs.
    public double[][] getDocumentToDocumentMatrix() {
        double tempMatrix[][] = new double[enhancedParagraphsCapacity][enhancedParagraphsCapacity];
        WordFrequency[] wordFrequencyI, wordFrequencyJ;
        double similarity;
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            wordFrequencyI = enhancedParagraph[i].representativeWordsFrequency;
            for (int j = 0; j < enhancedParagraphsCapacity; j++) {
                if (i == j)
                    tempMatrix[i][j] = 1;
                else {
                    wordFrequencyJ = enhancedParagraph[j].representativeWordsFrequency;
                    similarity = findSimilarity(wordFrequencyI, wordFrequencyJ);
                    tempMatrix[i][j] = similarity / (double) wordFrequencyI.length;
                }
            }
        }
        return tempMatrix;
    }

    // Finds the similarity between two paragraphs by counting the representative
    // words that are common to both.
    public double findSimilarity(WordFrequency[] wordFrequencyI, WordFrequency[] wordFrequencyJ) {
        double similarity = 0;
        for (int i = 0; i < wordFrequencyI.length; i++)
            for (int j = 0; j < wordFrequencyJ.length; j++)
                if (wordFrequencyI[i].id == wordFrequencyJ[j].id)
                    similarity++;
        return similarity;
    }

    public void exportToExcelSheet() {
        String excelSheet = "";
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            excelSheet = "";
            for (int j = 0; j < enhancedParagraphsCapacity; j++) {
                excelSheet += documentToDocumentMatrix[i][j] + "\t";
            }
            out.println(excelSheet);
        }
    }
}

Appendix D: Evaluation Sheet
Evaluation Sheet

Document Title: …………………….
Subject Area: ……………………….
User Name: …………………………
Knowledge about subject area:   Novice / Shallow / Medium / Expert

Please answer the following questions by choosing one of the check boxes. The check
boxes range from 1 (very low) to 5 (very high). Please choose only one answer per
question.

Question                                                      Range (1 2 3 4 5)   Comments
1. To what extent can the user answer all the questions of
   the original document by only reading the summary?
2. To what extent is the summary misleading?
3. To what extent is the summary elegant?
4. To what extent does the summary require the user to have
   background about the subject area?
5. How would you grade the system overall?
Appendix E: Evaluation Samples
Appendix F: Evaluation Results

Results of the Expert Users:
The mean values of the results of the expert users from the evaluation
Results of the Medium Users:
The mean values of the results of the medium users from the evaluation
Results of the Novice Users:
The mean values of the results of the novice users from the evaluation