
Evaluation of Techniques for Automatic Text Extraction

Submitted September 2006, in partial fulfilment of the conditions of the award of the degree M.Sc. in IT

Omar Ali Hussein Azzam

School of Computer Science and Information Technology

University of Nottingham

Supervisor: Dr. Tim Brailsford

I hereby declare that this dissertation is all my own work, except as indicated in the text:

Signature ______________________

Date _____/_____/_____


ABSTRACT

With the rapid growth of the World Wide Web and electronic information services, information is becoming overwhelmingly available on-line. The problem is that no user can cope with all the text on the internet: no one has time to read everything, yet we often have to make critical decisions based on what we are able to understand. Some examples from everyday life that rely on summarization are:

• News Headlines
• Movie Reviews
• Abstracts of Scientific Papers
• Book Reviews or Excerpts
• Highlights of a Meeting

The technology of automatic text summarization has become essential for dealing with such problems. Text summarization is the process of extracting the most important information from a source to produce an abbreviated version for a particular user or task.

Summaries can be produced either by abstraction or by extraction. Abstraction rewrites the document in a more abbreviated form, while extraction selects the most suitable excerpts of the document (sentences, paragraphs, etc.) for inclusion in the summary.

This thesis is concerned with implementing three summarizers using three different extraction methodologies. The first is based on sentence extraction using Latent Semantic Analysis, while the other two are based on paragraph extraction: one uses a bushy-node path algorithm and the other a depth-first node path algorithm.

In addition, the three extraction methodologies are compared using two empirical methods: a questionnaire and field observation. The comparison yields a guide specifying which methodology is most effective and the circumstances favouring each technique.


Acknowledgements

I would like to acknowledge all my thanks to ALLAH for his help in this research.

I would like to acknowledge the superb effort, expertise and perceptive insights offered by Dr. Tim Brailsford. It has been an honour and a very educational experience working under his supervision.

I am greatly indebted to James Goulding, of the Department of Computer Science at the University of Nottingham, for his ongoing support.

I would like to send my sincere thanks to all my family. Special thanks to my father and mother for supporting and sponsoring me through this degree; without their support and encouragement this work might never have been done. Special thanks to my sister for her moral support and for always being there for me.

No words can express my gratitude towards Eng. Ramy from the ITI for his cooperation and assistance before and during the Masters.


Contents

Section 1: Literature Review
1 Introduction
  1.1 What is Automatic Text Summarization?
  1.2 Approaches and Methods
  1.3 Motivation
  1.4 Aims and Objectives
  1.5 Method
2 Previous Works
  2.1 Classical Approaches
    2.1.1 Hans Peter Luhn's Approach
    2.1.2 H.P. Edmundson's Approach
  2.2 Corpus-based Approaches
    2.2.1 Morris and Hirst's Approach
    2.2.2 A Trainable Document Summarizer Approach
3 Automatic Text Summary Evaluation
  3.1 Introduction
  3.2 Evaluation Methods

Section 2: Text Extraction Automation
4 Software Development
  4.1 Introduction
  4.2 Technologies and Tools
5 Sentence Extraction
  5.1 Introduction
  5.2 Latent Semantic Analysis
  5.3 Analysis and Design
  5.4 Implementation
  5.5 Example
6 Paragraph Extraction
  6.1 Introduction
  6.2 Text Relationship Maps
  6.3 Paragraph Extraction Methodologies
  6.4 Analysis and Design
  6.5 Implementation
  6.6 Example
    6.6.1 Depth First Node Paragraphs
    6.6.2 Bushy Node Paragraphs

Section 3: Evaluation and Results
7 Evaluations and Comparison
  7.1 Evaluation Procedures
  7.2 Test Data
  7.3 Aspects of the Extraction Process
  7.4 Participant Users
  7.5 Evaluation Techniques
  7.6 Evaluation Equipment
    7.6.1 User Interface
    7.6.2 Evaluation Sheet
  7.7 Evaluation Results
    7.7.1 Expert Users Results
    7.7.2 Medium Users Results
    7.7.3 Beginner Users Results

Section 4: Discussion
8 Discussion
  8.1 Conclusion
  8.2 Future Work

References
Appendices
  Appendix A: Basic classes of the system
  Appendix B: The sentence extractor program
  Appendix C: The paragraph extractor program
  Appendix D: Evaluation Sheet
  Appendix E: Evaluation Samples
  Appendix F: Evaluation Results


List of Figures

Figure 5.1  A class diagram for the class Word
Figure 5.2  A class diagram for the class Sentence
Figure 5.3  A class diagram for the class WordFrequency
Figure 5.4  Class diagram of the Extractor
Figure 5.5  A class diagram for the class SentenceExtractor
Figure 5.6  The class diagram of the sentence extractor
Figure 5.7  The TDM of a document on encryption
Figure 5.8  The original document about E-commerce
Figure 5.9  A summary based on sentence extraction
Figure 6.1  A text relationship map for an article on Telecommunications
Figure 6.2  Text relationship map of the Telecommunications article after refinement
Figure 6.3  Class diagram of the class Paragraph
Figure 6.4  Class diagram of the ParagraphExtractor class
Figure 6.5  A class diagram of the paragraph extractor
Figure 6.6  A document-document matrix of a document on encryption
Figure 6.7  An illustration of using the depth-first node path algorithm
Figure 6.8  A summary based on the depth-first node path algorithm
Figure 6.9  A summary based on the bushy node path algorithm
Figure 7.1  Interface of the extractor system where the user chooses the method
Figure 7.2  Interface of the system where the user chooses the document to perform the extraction on
Figure 7.3  The result page of our extractor where a summary is provided for the document
Figure 7.4  A chart for the questionnaire answers of the expert users
Figure 7.5  A chart for the questionnaire answers of the medium users
Figure 7.6  A chart for the questionnaire answers of the novice users

List of Tables

Table 5.1  An example of a term-document matrix
Table 6.1  A document-document matrix


Section 1: Literature Review

1 Introduction

1.1 What is Automatic Text Summarization?

Text summarization is the process of distilling the most important information from a source to produce an abbreviated version for a particular user or task. [19]

Automatic text summarization came to be strongly recommended once the problems of manual text summarization were recognized. The most significant problem is the time consumed in summarizing a text manually. Another is inconsistency: the summaries extracted by two individuals from the same article can be surprisingly different; in fact, even a single professional summarizer would find it difficult to maintain consistency over a period of time. Bias is another major problem in manual summarizing, as the summarizer may be biased towards his own thoughts or beliefs.

Although the summarization process has several aspects that control the output and style of a summary, most of the available summarizers do not take this fact into consideration, and even the minority that do have failed to implement it effectively, which diminishes the significance of their summaries.


There are two major techniques for creating a summary:

1. Abstract Summarization: The result of this summary is an interpretation of the original text. The result is a smaller text in which concepts are transformed into shorter expressions. For example, “They went to Italy, France, and Germany” becomes “They went to some European countries”. This kind of summarization requires symbolic word knowledge, which makes it difficult to produce a good summary.

2. Extract Summarization: This method uses statistical, linguistic and heuristic methods, or a combination of these, to produce a summary. The result is not altered syntactically or in content. Summarizing with this technique is considerably less complex.

Aspects of Automatic Text Summarization

The output of a summary depends on several aspects. These aspects control the methodology used to summarize and the style of the summary. The main aspects reviewed here are intent, focus, coverage, background, and genre of text [9] [3].

The intent of the summary describes its potential use. It can be indicative, informative, or evaluative. Indicative summaries give an indication of the content of the source text; they can be used as appetizers for the whole text. Informative summaries serve as substitutes for the document. Evaluative summaries provide the point of view of the author on a given subject.

The focus of the summary refers to its scope, which can be either generic or query-relevant. A generic summary summarizes the whole text without bias to any particular topic, whereas a query-relevant summary is constructed to concentrate on a specific topic. Put more simply, in a generic summary the user is interested in all the aspects of the document and all the information considered important, while in a query-relevant summary the user is interested only in specific concepts and needs to be provided with all the information about those concepts.

Coverage refers to whether the summary is based on a single document or on multiple documents relating to the same subject matter.

The background of the summary refers to the user's prior knowledge of the subject area. A user's background in a particular subject could be weak, requiring the summary to focus on the concepts of the subject area, or strong, requiring the summary to focus on the latest news and events.

Genre of text refers to the nature of the text: whether it is a scientific paper, journal, book, or story. It will be demonstrated later why this aspect in particular is very important.

Lin and Hovy specified three basic stages required for summarizing a topic: topic identification, topic interpretation, and topic generalization. [10]

Topic Identification: Understanding the concepts of the document to obtain the essence of the text.

Interpretation: Identifying the most important pieces of information contained in the text.

Generalization: The transformation of the source text into a coherent new text. This is done by merging the phrases that are eligible for use in the summary.

1.2 Approaches and Methods

In the previous part it was noted that there are two types of summary, abstract and extract. Abstract summaries require symbolic word knowledge, semantic parsing and many other NLP (Natural Language Processing) techniques, and to this day the creation of abstracts remains a challenge. Most existing systems and research efforts have been concerned with extracts, in which the sentences most indicative of the topic are selected and merged together.

Although extracts are considered relatively easy, there are some major difficulties in ensuring a high-quality and readable extract [24]:

• How to identify the significance of a sentence?
• How to merge the selected sentences coherently?

Each sentence in the source text is given a rank, and the sentences with the highest rank are chosen for inclusion in the extract. Some of the scoring rules used over the years are presented below; all of them rely on the principle that each word is given a grade and a sentence's score is the sum of the grades of the words it contains (a minimal sketch of this scheme appears after the list). [24][26]

Proper Name: Certain types of nouns, such as the names of people or cities, are considered significant.


Pronoun: Sentences containing pronouns are given a higher score than sentences that do not.

Bonus Words: Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words such as "significant", "impossible", and "hardly".

Word Frequency: A word's frequency is an indication of its importance; sentences containing such frequent words can be considered significant.

Word Position: A word's position within the whole text or within its paragraph can also indicate its eligibility.

Sentence Position: Topic sentences tend to occur very early or very late in a document and its paragraphs.

Headings: Sentences occurring under certain headings are positively relevant, e.g. "Introduction", "Purpose", and "Conclusion"; each heading word has a corresponding weight.

Title: Words of the title can be considered positively relevant.

Uppercase Word: Uppercase words can be considered highly significant if they are acronyms.

Numerical Data: Sentences containing numerical data are given a higher score than sentences that do not.

Weekdays and Months: Sentences containing weekdays or months are scored higher than ones that do not.

Lexical Chains: The previous methods can successfully obtain the most significant sentences, but they do not take into consideration the relationships between the different parts of a text, so the output is a summary lacking cohesion. Cohesion is a term for sticking together; it means that the text all hangs together as one fluent stream. Lexical cohesion is the cohesion that arises from semantic relationships between words; all that is required is some recognizable relation between the words. Lexical chains represent the lexical cohesion among an arbitrary number of related words, and will be discussed further in Chapter 2. [23]
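As an illustration of the word-grade scheme underlying these rules, here is a minimal Java sketch (the class and method names are hypothetical, and only the word-frequency and title rules from the list are wired in):

```java
import java.util.*;

// A minimal sketch of rule-based sentence scoring: each word receives a grade
// and a sentence's score is the sum of the grades of the words it contains.
// Only two of the rules above (Word Frequency and Title) are illustrated.
public class RuleBasedScorer {

    public static double scoreSentence(List<String> sentence,
                                       Map<String, Integer> frequency,
                                       Set<String> titleWords) {
        double score = 0.0;
        for (String word : sentence) {
            score += frequency.getOrDefault(word, 0);   // Word Frequency rule
            if (titleWords.contains(word)) {
                score += 2.0;                            // Title rule (illustrative bonus)
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = Map.of("summarization", 5, "text", 7, "the", 40);
        Set<String> title = Set.of("text", "summarization");
        List<String> s = List.of("text", "summarization", "is", "useful");
        System.out.println(scoreSentence(s, freq, title)); // (7+2) + (5+2) = 16.0
    }
}
```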

1.3 Motivation

In the previous sections it was demonstrated that automatic summarization is essential, given the problems of manual summarization and the rate and volume of information flow that users cannot cope with. Critical decisions must nonetheless be taken based on what is understood. Summarization can therefore be a crucial process: if a concept in the original text is not well highlighted in the summary, the original text may be misunderstood, which can have serious consequences.

However, most of the available summarizers are not reliable for constructing summaries of critical documents. In addition, the readability of the automatic summaries is highly unsatisfactory. It is somewhat ironic that although a large number of text summarizers are available, they are rarely used in practice.


1.4 Aims and Objectives

It was mentioned previously that in order to summarize a document, certain aspects (e.g. intent, background, genre of text) should be taken into consideration, but this is usually not the case: either they are ignored or they are not implemented properly.

Given how critical automatic text summarization is, this thesis is concerned with:

• Implementing an automatic summarizer based on sentence extraction.
• Implementing an automatic summarizer based on bushy node path paragraph extraction.
• Implementing an automatic summarizer based on depth-first node path paragraph extraction.
• Evaluating the text summarizers to determine the most suitable summarizer with respect to the intent aspect.
• Evaluating the text summarizers to determine the most suitable summarizer with respect to the genre of text aspect.
• Evaluating the text summarizers to determine the most suitable summarizer with respect to the background aspect.
• Enabling the public to contribute and share their opinions on the automatic summaries with regard to the previous aspects.
• Testing how informative the summarizers are, i.e. the degree of sufficient information compared to the original document.
• Testing the readability of the summaries.
• Constructing a guide recommending each summarizer for specific cases.


1.5 Method

The three algorithms implemented are Latent Semantic Analysis, paragraph extraction based on a bushy node path, and paragraph extraction based on a depth-first node path. The programming language used is Java, and the software is web based, using Java Servlets. The web server used is Apache Tomcat 5.5.9.

Since text summarization is a crucial process, the informativeness of the summary is the most important criterion of its validity. The evaluation in this study therefore focuses on the informativeness of the summary rather than its convenience. The summaries are evaluated in two ways:

FIRST: The summary is presented to the public, and its validity is questioned according to specific criteria that the participants answer. These criteria are:

• Can a user answer all the questions by reading the summary, as he would by reading the entire document from which the summary was produced?
• Is the summary misleading in any way?
• Is the summary as elegant as the original document?
• Is the summary readable?
• Does the summary require the user to have a background in the subject area?
• Which summarizer would the user prefer?

This survey is presented to different types of users with different backgrounds, and the summarizers are also tested on different text genres.


SECOND: Each summary has specific properties, whose presence is checked for. These properties are:

• Compression: what is the best compression ratio between the given document and its summary?
• Redundancy: is any information repeated in the summary?
• Cohesiveness
• Coherence
• Readability (which depends on cohesion, coherence, and intelligibility)

Finally, a guide is provided that recommends the best method to summarize in each case.

Chapter 2 presents an overview of the previous work in the subject area, covering two periods, Classical and Corpus-based. Chapter 3 presents an overview of automatic text summarization evaluations and the evaluation types.


2 Previous Works

The foundations of Automatic Text Summarization were laid in the mid 1950s. Research in Automatic Text Summarization can be divided into two periods, Classical and Corpus-based. [19]

Classical Approaches begin with work originating in the field's first decades. They are classic in the sense that they use fundamental, surface-level techniques, focusing on the analysis of static textual features to construct summaries.

Corpus-based Approaches are concerned with the question of how different textual features can be extracted from text corpora and manually or automatically combined to produce better abstracts. They recognize that some textual features depend on the text genre; examples of text genres are newspapers, scientific papers, TV news, etc. Approaches cannot be standardized across the different text genres.

2.1 Classical Approaches

Classical Approaches represent the traditional approaches used at the birth of

automatic text summarization. These approaches mostly depend on discrete features in

the text. The following sections present two of the godfathers of the idea of automating

the abstraction process.


2.1.1 Hans Peter Luhn's Approach

Luhn’s statistical approach, presented in his paper “The Automatic Creation of Literature Abstracts” (1958) [17], depends on term frequency and term normalization. Luhn’s method is based on the assumption that the frequency of a word implies its significance, and that the significance of a sentence is obtained by analysing its words.

Luhn believed that manual abstracting could not be optimal, as manual abstracts can be influenced by the abstracter’s background, attitude, and beliefs: the abstracter may be biased towards his own ideas and cannot maintain consistency. Automatic abstracts could eliminate both human effort and bias.

Luhn’s algorithm selects the significant words by obtaining the stem of each word and counting its frequency. Some words occur too frequently to be significant, and such words are not considered. Words within the text are given scores relative to their frequency. Each sentence then has a significance factor, which Luhn defines as a factor reflecting the number of occurrences of significant words within the sentence and the linear distance between them due to the intervention of non-significant words. The significance factor is calculated for each sentence, and the sentences whose significance factor exceeds a certain cut-off are retrieved. Finally, the selected sentences are combined to constitute the auto-abstract.
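Luhn states the significance factor in prose rather than code; the following Java sketch uses a common reconstruction of it, in which significant words at most four non-significant words apart form a cluster scored as (significant words)² / (cluster span). The gap limit, the significance test and the example data are assumptions for illustration:

```java
import java.util.*;

// A sketch of Luhn's significance factor under a common reconstruction.
public class LuhnScorer {
    static final int GAP = 4; // max run of non-significant words inside a cluster (assumption)

    static double significanceFactor(List<String> words, Set<String> significant) {
        double best = 0.0;
        int clusterStart = -1, lastSig = -1, count = 0;
        for (int i = 0; i < words.size(); i++) {
            if (!significant.contains(words.get(i))) continue;
            if (lastSig >= 0 && i - lastSig - 1 <= GAP) {
                count++;                                           // extend the current cluster
            } else {
                best = Math.max(best, clusterScore(count, clusterStart, lastSig));
                clusterStart = i;                                  // start a new cluster
                count = 1;
            }
            lastSig = i;
        }
        return Math.max(best, clusterScore(count, clusterStart, lastSig));
    }

    static double clusterScore(int count, int start, int end) {
        return count == 0 ? 0.0 : (double) (count * count) / (end - start + 1);
    }

    public static void main(String[] args) {
        Set<String> sig = Set.of("cipher", "key", "encryption");
        List<String> s = List.of("the", "cipher", "uses", "a", "public", "key", "for", "encryption");
        System.out.println(significanceFactor(s, sig)); // one cluster of 3 words over span 7: 9/7 ≈ 1.29
    }
}
```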


2.1.2 H.P. Edmundson's Approach

Edmundson aimed to construct an extracting system producing indicative extracts that allow a researcher to screen a body of literature and decide which documents deserve more detailed attention. This was demonstrated in his paper “New Methods in Automatic Extracting” (1969). [2]

Edmundson extended Luhn’s method, which was not sufficient to construct automatic extracts that could supersede manual extracts. Edmundson designed a way to weigh the text using four basic methods, the weight of a sentence being a function of the weights of these four characteristics:

• Cue: Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words such as "significant", "impossible", and "hardly".

• Key: Like the method proposed by Luhn, based on the hypothesis that highly frequent words are positively relevant.

• Title: Depends on specific characteristics of the skeleton of the document (titles, headings, and format). It is based on the hypothesis that the words of the title and headings are positively relevant: when the author partitions the body of the document into major sections, he summarizes it by choosing appropriate headings.


• Location: Based on the hypothesis that sentences occurring under certain headings are positively relevant, and that topic sentences tend to occur very early or very late in a document and its paragraphs. A heading dictionary contains words that appear in the headings of documents, e.g. "Introduction", "Purpose", and "Conclusion"; each word has a corresponding weight.

After obtaining the four characteristics for each sentence, the weight of the sentence SentW is calculated using the following equation:

SentW = aC + bK + cT + dL

where a, b, c, and d are constant positive integers and C is the Cue weight, K the Key weight, T the Title weight, and L the Location weight.

The N highest-weighted sentences are retrieved to construct the final abstract. Generally the preferred compression rate of the summary is 25% of the original text, so N is calculated as a function of 25% of the size of the original document and the number of sentences it contains.
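As a small worked sketch of this linear weighting in Java (the constants a, b, c, d and the feature scores below are illustrative assumptions; Edmundson tuned his constants against manually produced extracts):

```java
// A minimal sketch of Edmundson's linear weighting SentW = aC + bK + cT + dL.
public class EdmundsonWeight {
    static final int a = 1, b = 2, c = 2, d = 1; // constant positive integer weights (illustrative)

    static int sentenceWeight(int cue, int key, int title, int location) {
        return a * cue + b * key + c * title + d * location;
    }

    public static void main(String[] args) {
        // e.g. a sentence with cue score 3, key score 5, title score 1, location score 2:
        System.out.println(sentenceWeight(3, 5, 1, 2)); // 1*3 + 2*5 + 2*1 + 1*2 = 17
    }
}
```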

2.2 Corpus-based Approaches

The most severe limitation of location and cue phrases is their dependence on the text genre. Each text genre has its own style of writing, so techniques relying on such formal clues can be seen as high risk [19].


Another limitation is that merging the sentences that are eligible to constitute the summary may result in an incoherent summary [19].

Methods that rely more on content do not suffer from this problem. Their drawback is that a detailed semantic representation must be created and a domain-specific knowledge base must be available. [19]

2.2.1 Morris and Hirst's Approach

The simplest method to ensure coherence of the summary is lexical cohesion. Coherence is a term for making sense; it means that the text makes sense as a whole. Using lexical chains in text summarization is efficient, because these relations are easily identifiable within the source text and very vast knowledge bases are not necessary for the computations. [23]

By using lexical chains, we can statistically find the most important concepts by looking at structure in the document rather than deep semantic meaning. All that is required to calculate these is a generic knowledge base containing nouns and their associations (a thesaurus). These associations capture concept relations such as synonymy (a word having the same or nearly the same meaning as another word), antonymy (a word having a meaning opposite to that of another word), and hypernymy (the is-a relation). [4]

Chains are created by taking a new text word and finding a related chain for it according to relatedness criteria.

Morris and Hirst defined a methodology for using lexical chains in the abstraction process in their paper.


Generally, a procedure for constructing lexical chains follows three steps:

1. Select a set of candidate words.

2. For each candidate word, find an appropriate chain, relying on a relatedness criterion among the members of the chains. More specifically, it must be specified exactly what counts as a cohesive relationship between words. This can be done using a thesaurus: according to Morris and Hirst's suggestions, two words can be considered related if they are connected in the thesaurus in one or more of the following five ways:

   a. Their index entries point to the same thesaurus category, or point to adjacent categories.
   b. The index entry of one contains the other.
   c. The index entry of one points to a thesaurus category that contains the other.
   d. The index entry of one points to a thesaurus category that in turn contains a pointer to the category pointed to by the index entry of the other.
   e. The index entries of both point to thesaurus categories that in turn contain a pointer to the same category.

3. If an appropriate chain is found, insert the word into the chain and update it accordingly.
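The following Java sketch illustrates this three-step procedure. A real implementation would back the relatedness test with the five thesaurus-based criteria above; the predicate used here is only a stand-in, and all names are hypothetical:

```java
import java.util.*;
import java.util.function.BiPredicate;

// A sketch of the chaining procedure: take each candidate word, look for a
// chain containing a related member, and insert it there or start a new chain.
public class LexicalChainer {

    public static List<List<String>> buildChains(List<String> candidates,
                                                 BiPredicate<String, String> related) {
        List<List<String>> chains = new ArrayList<>();
        for (String word : candidates) {                       // step 1: candidate words
            List<String> home = null;
            for (List<String> chain : chains) {                // step 2: find a related chain
                if (chain.stream().anyMatch(m -> related.test(m, word))) { home = chain; break; }
            }
            if (home != null) home.add(word);                  // step 3: insert and update
            else chains.add(new ArrayList<>(List.of(word)));   // otherwise start a new chain
        }
        return chains;
    }

    public static void main(String[] args) {
        // Toy relatedness: words sharing their first three letters (a stand-in for a thesaurus).
        BiPredicate<String, String> rel = (x, y) -> x.regionMatches(0, y, 0, 3);
        System.out.println(buildChains(List.of("telephone", "telegraph", "mail", "mailbox"), rel));
        // [[telephone, telegraph], [mail, mailbox]]
    }
}
```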

2.2.2 A Trainable Document Summarizer

To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. Abstracts are sometimes used as full document surrogates: they can be accessed easily and provide an easily digested intermediate point between a document's title and its full text that is useful for rapid relevance assessment. [14]

This paper focused on the construction of document extracts using new discrete sentence-scoring features: [14]

Sentence Length Cut-off Feature: Short sentences tend not to be included in summaries, so a threshold on the number of words in a sentence is specified.

Fixed-Phrase Feature: A list of fixed phrases indicates that the sentences that follow are significant.

Paragraph Feature: This discrete feature focuses on the first ten paragraphs and the last five paragraphs, considering them to contain substantial information about the document.


Thematic Word Feature: This feature focuses on the words that occur most frequently.

Uppercase Word Feature: Uppercase words are often important. They are scored like the previous feature, with the most frequent uppercase words significantly credited, under the constraints that the word is not sentence-initial and is not an abbreviated unit of measurement (like F, C, Kg, etc.); such abbreviations are discarded.
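To make these features concrete, here is a hedged Java sketch reducing each to a boolean, as a trainable summarizer would. The thresholds, the fixed-phrase list and the unit list are illustrative assumptions, not the paper's exact values:

```java
import java.util.*;

// A sketch of the discrete sentence features described above.
public class SentenceFeatures {
    static final int LENGTH_CUTOFF = 5;                        // assumed word-count threshold
    static final Set<String> FIXED_PHRASES = Set.of("in conclusion", "this paper", "in summary");
    static final Set<String> UNITS = Set.of("F", "C", "KG");   // abbreviated units to discard

    static boolean lengthCutoff(List<String> words) {          // Sentence Length Cut-off Feature
        return words.size() > LENGTH_CUTOFF;
    }

    static boolean fixedPhrase(String sentence) {              // Fixed-Phrase Feature
        String s = sentence.toLowerCase();
        return FIXED_PHRASES.stream().anyMatch(s::contains);
    }

    static boolean paragraphFeature(int index, int total) {    // first ten or last five paragraphs
        return index < 10 || index >= total - 5;
    }

    static boolean thematic(List<String> words, Set<String> mostFrequent) { // Thematic Word Feature
        return words.stream().anyMatch(mostFrequent::contains);
    }

    static boolean uppercase(List<String> words) {             // Uppercase Word Feature
        for (int i = 1; i < words.size(); i++) {               // skip the sentence-initial word
            String w = words.get(i);
            if (w.length() > 1 && w.equals(w.toUpperCase()) && !UNITS.contains(w)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(fixedPhrase("In conclusion, extracts are cheap.")); // true
        System.out.println(uppercase(List.of("NASA", "sent", "a", "probe")));  // false: sentence-initial
    }
}
```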


3 Automatic Text Summary Evaluation

The goal of automatic summarization is to take an information source, extract content from it, and present the most important content to the user in a condensed form, in a manner sensitive to the user’s or application’s needs [18].

The evaluation of any system is always a key point for any research or development effort. Evaluation has long been of interest in automatic summarization, with extensive evaluations carried out as early as the 1960s.

This chapter gives an overview of automatic text summary evaluation. Section 3.1 gives an overview of text summarization evaluation and previous research evaluations; Section 3.2 describes the evaluation methods applied to automatic summaries.

3.1 Introduction

Summarization is a fast-developing research area and therefore needs good evaluation methodologies. The evaluation of automatic summaries faces many challenges that can render it useless. Evaluation is still not automated and therefore requires human effort, which increases its expense. Just as the automatic text summarization process has several factors that must be taken into consideration, so does the evaluation process; for example, since summarization involves compression, evaluation should differ according to the compression ratio of the summary.


During the evaluation of a summary, two properties must be measured:

Compression Ratio: The ratio between the summary’s length and the original document’s length, denoted CR:

CR = length of summary / length of full text

Retention Ratio: The amount of information retained in the summary, denoted RR (sometimes referred to as the Omission Ratio):

RR = information in summary / information in original text

For example, a 250-word summary of a 1,000-word document has CR = 250/1000 = 0.25; a good summary keeps CR low while keeping RR as close to 1 as possible. Any evaluation system for a summarizer must use these two properties. [20]

3.2 Evaluation Methods

Methods for evaluating text summarizers can be classified into two categories: intrinsic evaluation, which compares the output to some gold standard, and extrinsic evaluation, which measures the system’s performance in a particular task. [20]

Extrinsic Evaluation measures the efficiency and acceptability of the generated summaries in some task. The quality of a summary evaluated extrinsically is judged against a set of external criteria, e.g. whether the summary retains the information needed to satisfy an information need. Extrinsic evaluation can, for example, be used for reading comprehension or relevance assessment.


Intrinsic Evaluation means that the quality of a summary is judged only by analysis of its textual structure and by comparison of the summary text to other summaries. These reference summaries should be a gold standard, produced either by a reference summarization system or by hand.

Most evaluation systems in current use take the intrinsic approach. An intrinsic evaluation usually focuses on two concepts, coherence and informativeness [18].

Summary Coherence: Automatic extracts are constructed by selecting the most significant sentences and combining them. Such extracts can suffer from coherence problems, where there are many gaps in the information flow between the sentences.

Summary Informativeness: This measures how much information from the source is preserved in the summary. It can also be measured by comparing the information covered in a reference summary with the information covered in the automatic summary.


Section 2: Text Extraction Automation

Automatic text extraction is widely used in the summarization process: the most important segments of the document are selected. The better structured the document, the better the output of the extraction process. A structured document is one whose segments are well identified. A text segment is a block of information that can describe an entire topic; segments can be units, chapters, sections, pages, paragraphs or sentences. The segmentation of a document also helps in text understanding and in information retrieval (IR). The main difficulty in identifying text segments is automatic termination, i.e. determining the number of topic boundaries in a document. [4]

There have been many studies on what is best identified as a segment: sections, pages, paragraphs or sentences? Linguistic theories and work in IR suggest that a coherent text segment is represented by paragraphs; other studies suggest that a sentence best represents a segment.

Chapter 4 is concerned with the technical view of this study, describing the technical reasons for the technologies, tools and evaluation used. Chapter 5 discusses text extraction with sentences as the segments, and Chapter 6 discusses text extraction with paragraphs as the segments.


4 Software Development

This chapter discusses the technical part of the dissertation. Three different algorithms are implemented, two based on paragraph extraction and one on sentence extraction. In addition, the technologies and tools used for the summarizers are described.

4.1 Introduction

For a comparison between systems to be fair, certain criteria must be fulfilled: the comparison must be like-to-like, meaning that the systems should be constructed with the same manufacturer, tools and design. Fulfilling these criteria eliminates external factors that could affect the result of the comparison.

This dissertation therefore implements the three algorithms using the same tools and functionality, so that the methodology of each algorithm is the only factor determining the output and style of the summary. The algorithms implemented are:

• Sentence extraction based on Latent Semantic Analysis.
• Paragraph extraction based on a depth-first node path.
• Paragraph extraction based on a bushy node path.

The first algorithm is explained in Chapter 5; the other two are explained in Chapter 6.


4.2 Technologies and Tools

This part is concerned mainly with the technologies and tools used to construct the summarizers.

The programming language used is Java, owing to the strong text-processing capabilities of the Java class library. In addition, Java is platform independent, giving it the advantage of running on any platform.

The IDE (Integrated Development Environment) used is NetBeans 5.0. The NetBeans IDE is a robust, free, open-source Java IDE that provides the developer with everything needed to create cross-platform desktop, web and mobile applications straight out of the box. NetBeans derives its power from being composed of extensible plug-ins; the IDE itself can be extended to provide new customized development environments.

The summarizers are web based: a client-server application centralizing the whole functionality on the server side. Centralization is considered more secure. In addition, all the computing is done on the server side, which limits the computing power needed on the client side; this matters because the summarizers perform a great deal of text processing, which demands considerable computing power. Another advantage is that centralized systems are much more scalable, as there is only one source, and any change to that source is reflected immediately to the users. Moreover, making the system web based increases its accessibility. However, we cannot ignore the reliability problem of centralized systems: any failure of the server means that the whole system is down.


The web tool used is Java Servlets, owing to its simplicity compared to web frameworks like Struts, Spring or JavaServer Faces (JSF); the summarizers do not need such extensible frameworks. The web server used is Tomcat 5.5.9, which is bundled with NetBeans 5.0.

The next two chapters explain the algorithms used, describe how they are implemented, and provide an example of each.


5 Sentence Extraction

5.1 Introduction

Sentence extraction is the oldest extraction technique, first introduced by Luhn in 1958. The weight of a sentence is calculated from the weights of the words it contains, and the sentences with the highest weights are selected.

This chapter discusses an implementation of a summarizer based on sentence extraction. Sentence extraction can be performed in many ways; examples of the methods used are Word Frequency, Cue, Key, Title, Heading, Latent Semantic Analysis (LSA), and others. This study performs sentence extraction based on LSA.

5.2 Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text [15].

Let there be a set of documents or text segments S = {s1, s2, …, sm} and a vocabulary W = {w1, w2, …, wn}. LSA transforms these two sets into a relationship between the terms and concepts, and a relationship between the documents and the same concepts; the terms and documents are thus related indirectly through the concepts.

Latent Semantic Analysis uses a term-document matrix (TDM) which describes the occurrence of terms in documents, passages, or excerpts [16]. A term-document matrix is a matrix in which each row is identified by a document and each column is identified by a word; the value in cell (i, j) (row i, column j) is the relation between document i and word j. An example of a term-document matrix is shown in Table 5.1. The value of cell (1, 1) in Table 5.1 is 3, meaning that the relation between the word "Automatic" and "Document 1" is 3 (whatever that relation is). LSA provides the relationship between terms and concepts, producing measures of word-word, word-passage, and passage-passage relations [16].

             Automatic   Text   Summarization   Extraction
Document 1       3         2          6             5
Document 2       5         2          3             4
Document 3       4         1          2             7

Table 5.1 An example of a term-document matrix.

It is important to note that the similarity estimates derived by LSA are not simply adjacency frequencies, co-occurrence counts, or correlations in usage; LSA depends on a powerful mathematical analysis capable of correctly inferring much deeper relations, and as a consequence its estimates are often more precise.

Applications of LSA:

• Comparing documents in the concept space.
• Finding similar documents across languages, after analyzing a base set of translated documents.
• Finding relations between terms (synonymy and polysemy).
• Given a query of terms, translating it into the concept space and finding matching documents (IR); in this case it is called LSI, Latent Semantic Indexing.
• Recently, LSA has been used in auto-tutoring systems for students.
• LSA has been used for grading essays; it was applied to TOEFL (Test of English as a Foreign Language) [6].


LSA is closely related to neural net models, but it is based on SVD. SVD, Singular Value Decomposition, is an important factorization of a rectangular real or complex matrix. SVD can be seen as a generalization of the spectral theorem, which says that normal matrices can be diagonalized using a basis of eigenvectors, to arbitrary, not necessarily square, matrices. SVD allows the space to be rearranged to reflect the major associative patterns in the data and to ignore the smaller, less important influences. [28]

Steps of LSA:

1. Represent the text as a matrix (the term-document matrix). Each row stands for a term and each column stands for a passage, excerpt, or other context; the value of cell (i, j) is the frequency of term i in document j. In this step the term weighting is constructed, and many term-weighting methodologies can be used, for example:
   a. Luhn's method
   b. Edmundson's method

2. The cells are subjected to a preparatory transformation: each cell frequency is weighted by a function expressing both the word's importance in the particular passage and its importance in general (dispersion). This can be done in various ways; one methodology is

   weight(i, j) = fij × L(i, j) × G(i)  [11]

   where fij is the number of times term i appears in document j, G(i) is the global weighting for term i, and L(i, j) is the local weighting for term i in document j.

3. LSA applies SVD to the matrix to decrease the redundancy.
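As an illustration of step 2, here is a Java sketch of the cell weighting. The step admits many weighting functions, so the particular local (log-damped frequency) and global (inverse-document-frequency-style) weights below are assumptions:

```java
// A sketch of weight(i, j) = f_ij * L(i, j) * G(i) with illustrative L and G.
public class CellWeighting {

    // freq[i][j] = raw frequency of term i in document j
    static double[][] weight(int[][] freq) {
        int terms = freq.length, docs = freq[0].length;
        double[][] w = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
            int docsWithTerm = 0;
            for (int j = 0; j < docs; j++) if (freq[i][j] > 0) docsWithTerm++;
            double g = Math.log((double) docs / Math.max(1, docsWithTerm)); // global weight G(i)
            for (int j = 0; j < docs; j++) {
                double l = Math.log(1.0 + freq[i][j]);                      // local weight L(i, j)
                w[i][j] = freq[i][j] * l * g;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        int[][] f = { { 3, 0, 1 }, { 2, 2, 2 } };
        // term 0 appears in 2 of 3 documents: G = ln(1.5); L(0, 0) = ln(4)
        System.out.println(weight(f)[0][0]); // 3 * ln(4) * ln(1.5) ≈ 1.69
    }
}
```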


5.3 Analysis and Design

The steps above are the standard steps for creating a summary based on LSA. In practice, a summarizer based on LSA does not follow each step precisely; the steps are customized according to the user requirements and the system's capability. Our sentence extractor likewise does not perform each step exactly: it uses alternative methods and some additional functionality. For example, our algorithm uses a method other than SVD to decrease the redundancy of the TDM; this method is described in detail in the next part.

In order to perform sentence extraction, we need to keep track of the words and sentences in the document. Each word is specified by its Id, value, global frequency, and the Ids of the sentences the word appears in; the class diagram of the class Word is shown in Figure 5.1. Each sentence is specified by its Id, value, and the words representing it (WordFrequency objects); the class diagram of the class Sentence is shown in Figure 5.2.

Figure 5.1 A class diagram for the class Word


Figure 5.2 A class diagram for the class Sentence

The decision whether to include a sentence in the summary thus depends on the words that represent it. Each word representing a sentence is specified by the word's Id and its frequency in that sentence. The class diagram of the class WordFrequency is shown in Figure 5.3.

Figure 5.3 A class diagram for the class WordFrequency
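Read together, the three class descriptions suggest data classes along the following lines. This is a sketch with assumed field names, not the dissertation's actual source (which is given in Appendix A):

```java
import java.util.*;

// A minimal sketch of the three data classes described above.
class Word {
    int id;                                         // the word's Id
    String value;                                   // the word itself
    int globalFrequency;                            // occurrences across the whole document
    List<Integer> sentenceIds = new ArrayList<>();  // Ids of the sentences containing it
}

class WordFrequency {
    int wordId;                                     // which word
    int frequency;                                  // how often it occurs in one sentence
}

class Sentence {
    int id;                                         // the sentence's Id (its document order)
    String value;                                   // the sentence text
    List<WordFrequency> words = new ArrayList<>();  // the words representing it
}
```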

Any extraction process goes through certain major steps. These steps are therefore gathered in a class called Extractor, which contains the basic functionality of any extractor, whatever the scope of the segment. In addition, the Extractor class contains the basic functionality for handling HTTP requests, and therefore extends the built-in Java servlet class HttpServlet. Figure 5.4 shows a class diagram of the Extractor class. The source code of these classes (Word, WordFrequency, Sentence, Paragraph, and Extractor) is given in Appendix A.
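A minimal sketch of how such an Extractor base class can sit on top of HttpServlet follows. Apart from the HttpServlet override, the method and parameter names are assumptions:

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

// Shared servlet plumbing lives in the base class; each concrete extractor
// (sentence- or paragraph-based) supplies only its own extraction step.
public abstract class Extractor extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String document = req.getParameter("document");          // the text to summarize
        int n = Integer.parseInt(req.getParameter("segments"));  // desired summary length
        resp.setContentType("text/html");
        resp.getWriter().println(extract(document, n));
    }

    // Implemented differently by each extractor subclass.
    protected abstract String extract(String document, int segmentCount);
}
```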


Figure 5.4 Class diagram of the Extractor

The central class is the SentenceExtractor class, which extends the Extractor class, uses the classes mentioned above, and applies its functions to them. Figure 5.5 shows the class diagram of the SentenceExtractor class.


Figure 5.5 A class diagram for the class SentenceExtractor

Figure 5.6 shows a class diagram for our sentence extractor.


Figure 5.6 The class diagram of the sentence extractor.

5.4 Implementation

The basic functionality of our sentence extractor is:

1. Identify the words.
2. Identify the sentences.
3. Estimate the words' positions and global frequencies.
4. Estimate the sentences' weights from the words representing them.
5. Create a term-document matrix.
6. Minimize the size of the TDM by eliminating irrelevant words and sentences.
7. Construct the summary.

Our sentence extractor can perform two services on a text. First, it extracts the most important sentences in the text. Second, it displays the term-document matrix of the document, which maps the relation between the words and the sentences; this matrix can be useful for other studies in text extraction or information retrieval.

The points above are the basic steps of our algorithm; a more detailed explanation follows. The source code of the sentence extractor is shown in Appendix B.

1. Read the file: The contents of the file are read. An important point is that this summarizer works only on text data, not on figures or tables.

2. Parse the sentences:
   a. A sentence is identified as a sequence of characters terminated by a full stop (.), an exclamation mark (!), or a question mark (?).
   b. Previous studies have shown that short sentences are not important, so such sentences are discarded from the beginning. As there are no established suggestions for the sentence length cut-off, a sentence length threshold has been chosen that best suits the system.

3. Parse the words: The words are extracted from each sentence.


4. Eliminate unimportant words: Remove the noise words to produce a refined word set.
   a. Some words occur too frequently to be significant. These words are not eligible to represent the document because they do not convey any important concept (like any, this, that, etc.). To solve this problem we can either use a stop list containing these noise words or use a high-frequency cut-off (threshold) above which words are ignored. This sentence extractor uses a stop list.
   b. Most words of fewer than 3 characters (like it, is, or, etc.) are considered unimportant, so these words are discarded from the beginning.

5. Estimate the global frequency of each word in the word set.

6. Perform another refinement on the words: Since the global frequency of each word is now available as an indication of its importance, words with a global frequency below a specific value are discarded.

7. As discussed above, each word has certain properties: Id, value, and the sentences containing it. In this step the Ids of the sentences containing each word are obtained.

8. Likewise, each sentence has certain properties: Id, value, the words in the sentence, and the frequency of each word in the sentence. In this step the algorithm goes through each sentence and finds the frequency of each word it contains.


9. Obtain the number of sentences to be included in the summary. Usually this number is derived from the number of sentences in the original document.

10. Construct the term-document matrix: The columns represent the words and the rows represent the sentences; the value of the cell in row i and column j is the frequency of word j in sentence i. Figure 5.7 shows an example of a term-document matrix for a document on encryption.

Figure 5.7 The TDM of a document on encryption.


11. Perform the intended service: As discussed above, the summarizer can either display the TDM or summarize the document. If the intended service is to display the TDM, the matrix is exported to an Excel sheet as shown above, and this is the end of the algorithm. If the intended service is to summarize, continue to the next step.

12. Sort the sentences: The sentences are sorted according to their importance, as follows:
   a. For each row (sentence) in the TDM, the values in the cells of that row are summed.
   b. The sentences are sorted according to this sum.

13. Construct the summary: The most important sentences are selected for inclusion:
   a. With the sentences sorted in descending order, the first N sentences are selected, where N is the number of sentences required in the summary.
   b. The selected sentences are re-sorted according to their order in the original document.
   c. The summary is constructed by concatenating these sorted sentences.
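Steps 12 and 13 reduce to a few lines of Java. The following sketch (with hypothetical names) ranks sentences by the sum of their TDM row, keeps the top N, and restores document order:

```java
import java.util.*;

// A sketch of steps 12-13: rank sentences by TDM row sum, keep the top N,
// and re-sort the chosen sentences into their original document order.
public class SentenceSelector {

    // tdm[i][j] = frequency of word j in sentence i; returns indices of chosen sentences
    static List<Integer> select(int[][] tdm, int n) {
        Integer[] order = new Integer[tdm.length];
        for (int i = 0; i < tdm.length; i++) order[i] = i;
        // step 12: sort sentence indices by descending row sum (sentence weight)
        Arrays.sort(order, Comparator.comparingInt((Integer i) -> -rowSum(tdm[i])));
        // step 13a: take the first N sentences...
        List<Integer> chosen =
                new ArrayList<>(Arrays.asList(order).subList(0, Math.min(n, order.length)));
        // step 13b: ...and re-sort them into their order in the original document
        Collections.sort(chosen);
        return chosen;
    }

    static int rowSum(int[] row) { int s = 0; for (int v : row) s += v; return s; }

    public static void main(String[] args) {
        int[][] tdm = { { 3, 2, 6 }, { 5, 2, 3 }, { 0, 1, 0 }, { 4, 1, 2 } };
        System.out.println(select(tdm, 2)); // sentences 0 (sum 11) and 1 (sum 10) -> [0, 1]
    }
}
```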

5.5 Example

This part shows an example of a document summarized by our sentence extractor. The document is about e-commerce; Figure 5.8 shows the original document.


Figure 5.8 The original document about E-commerce

The term-document matrix of this document was shown in Figure 5.7. The result of sentence extraction on the E-commerce document is shown in Figure 5.9; the sentences are separated by a delimiter for readability.


Figure 5.9 A summary based on sentence extraction


6 Paragraph Extraction

This chapter discusses the process of paragraph extraction, in which the algorithm extracts the most eligible paragraphs. 6.1 provides an introduction to paragraph extraction. 6.2 introduces a new concept used for paragraph extraction, the text relationship map, also called a document-document matrix. 6.3 explains the different methodologies for selecting the paragraphs to include in the summary. 6.4 explains the analysis of the system and the design of the summarizer. 6.5 explains the algorithm used to implement the summarizer with respect to each method. Finally, 6.6 provides two examples, one for each method.

6.1 Introduction

The idea of extracting paragraphs instead of sentences was first introduced in 1997 [22]. Since a paragraph is a relatively self-contained part of the text, it was expected that the readability and coherence problems seen in summaries generated by sentence extraction would be reduced. It is agreed that a paragraph can address multiple topics and is motivated by context, writing style, and presentation.

6.2 Text Relationship Maps

Usually in information retrieval each text segment or excerpt is represented by a vector of weighted terms:

Di = (di1, di2, di3, …, dit)

where Di is document i and dik is the importance weight of term tk in document i, for k = 1..t. The terms tk may be words or phrases derived from the document texts by an automatic indexing procedure.

The vector similarity can be computed as the inner product of corresponding vector elements:

Sim(Di, Dj) = Σ (k = 1..t) dik · djk  [22]

The similarity function may be normalized to lie between 0 for disjoint vectors and 1 for completely identical vectors.

To decide which paragraphs are eligible, we want to determine how the paragraphs are related to each other. This is done using a text relationship map: the nodes (paragraphs) are joined by links based on a numerical similarity computed for each pair of texts using the equation above, and all pairs of paragraphs whose similarity exceeds a specific threshold are connected by links [22].

The importance of a paragraph within the text is likely to be related to the number of links incident on the corresponding node. Figure 6.1 demonstrates a text relationship map for the paragraphs of the article "Telecommunications" from the Funk and Wagnalls Encyclopedia. The paragraphs are denoted by nodes, and paragraphs that are sufficiently similar are joined by a link; the similarity threshold used in this map is 0.12.


Figure 6.1 A text relationship map for the article on Telecommunications

Figure 6.2 shows the relationship map for the article on Telecommunications at a

similarity threshold of 0.12 with links between distant paragraphs deleted.


Figure 6.2 Text relationship map of the Telecommunications article after refinement

Important information about the document can be drawn from a text relationship map.

A text relationship map could be useful in:

1. Identifying related passages covering particular topic areas.

2. Providing information about the homogeneity of the text under consideration.

3. Decomposing a document into segments.

A segment is a contiguous piece of text that is linked internally, but largely disconnected from the adjacent text. Segments are an automatic approximation to sectioning when a text does not have well-defined sections [4].


To create the extract, we have to select the paragraphs that are considered the most important for inclusion in the summary. The next part will discuss the different methodologies for selecting these paragraphs.

6.3 Paragraph Extraction Methodologies

The process of extracting paragraphs using a text relationship map can be accomplished by automatically identifying the important paragraphs (nodes) and traversing the selected nodes in their text order to construct the extract, or path. The point here is how to construct the path, as the path obviously determines the quality and style of the extract.

There are mainly four types of paths used to construct an extract [22]:

1. Bushy path: The bushiness of a node is determined by the number of links connecting it to the other nodes. Such nodes are good overview paragraphs and can be used in the summary. Suppose we need N paragraphs in the summary; a bushy path is then constructed from the N bushiest nodes on the map. The order of the nodes in the summary is the same as their order in the original text.

2. Depth-first path: Bushy nodes are the nodes that have the most connections to other nodes in the text relationship map. However, bushy nodes are not necessarily connected to each other, so they may be only loosely related. This can produce a summary that covers the article well but lacks coherence, and the readability of the summary might be poor. The solution is to use the depth-first path instead. The depth-first path is constructed as follows:


a. Start at an important node (a highly bushy node or the first node in the original text).
b. Visit the next most similar node.
c. Repeat step b until you reach the limit of the summary length.

Since each paragraph is similar to the node after it in the path, the summary will be coherent. The summary consists of the nodes that fall within the path. The summary contents are controlled by the contents of the first paragraph chosen; therefore, all aspects of the document may not be covered by a depth-first path.

3. Segmented bushy path: Some nodes can be well connected to each other but not connected to the rest of the nodes; for example, each group of nodes may be interconnected, with only a few connections between the groups. These node groups are called segments. Using either of the two previous paths will not construct a good summary in this case, but a segmented bushy path could be the solution: a bushy path is constructed for each segment, according to its text order in the original text. At least one paragraph is selected from each segment, and the remainder of the extract is constructed by picking more bushy nodes from each segment; the longer a segment is in the original text, the more paragraphs are chosen from it. Since all segments are represented in the extract, this algorithm should enhance the comprehensiveness of the extract.

4. Augmented segmented bushy path: Usually authors describe their work in the first couple of paragraphs, so the introductory part and the concepts of the text that follow it can be considered a segment. A segmented bushy system might ignore the introductory part, as it is not very bushy, and instead go to the middle of the text and pick a bushier node. This could affect the readability of the summary; besides, the introductory paragraph is too rich a part to be ignored. The augmented segmented bushy path therefore does what the previous method does, but in addition it always chooses the introductory paragraph from a segment, and then picks the bushiest paragraphs according to the size required for the summary.

6.4 Analysis and Design

An extract based on paragraph extraction is attained by constructing a path from the text relationship map. Our text relationship map is not represented as a graph, as shown previously; it is represented as a document-document matrix, a term used extensively in IR. Both the rows and the columns of the matrix represent the documents of the corpus. The value in cell(i,j) (in row i and column j) is the similarity of document j to document i. Table 6.1 shows an example of a document-document matrix. The similarity values range from 0 to 1, where 0 shows that the two documents are completely disjoint and 1 means the two documents are identical or almost identical. For example, the value of cell(1,3) is 0.6, which means that Document 3 is similar to Document 1 with a similarity of 0.6.

             Document 1   Document 2   Document 3   Document 4
Document 1      1            0.2          0.6          0.5
Document 2      0.5          1            0.13         0.4
Document 3      0.02         0.444        1            0.22
Document 4      0.4          0.1          0.2          1

Table 6.1. A document-document matrix


There are some important notes about the previous table:

1. The similarity function used in the table is non-commutative: the similarity of document X to document Y is not the same as the similarity of document Y to document X. An illustration of this in Table 6.1 is that cell(1,2) ≠ cell(2,1).

2. The value of cell(i,j) always equals 1 when i = j, as this is the relation between a document and itself.

In order to perform paragraph extraction, we need to keep track of the words, sentences, and paragraphs of the document. We discussed words and sentences in the previous chapter; the class diagrams of the Word and Sentence classes are illustrated in Figure 5.1 and Figure 5.2 in Chapter 5. Each paragraph in the document is specified by its Id, value, number of sentences, and the words representing it (WordFrequency). Figure 6.3 shows the class diagram of the Paragraph class.

Figure 6.3 Class diagram of the Paragraph class

We also mentioned in Chapter 5 the Extractor class, which performs the basic functionality of an extractor whatever the scope of the text segment is. A class diagram of


the Extractor class is shown in Figure 5.4. The ParagraphExtractor class is the controller

class of our paragraph extractor. A class diagram of the Paragraph Extractor is shown in

Figure 6.4.

Figure 6.4 Class diagram of the ParagraphExtractor class

Figure 6.5 shows a class diagram of the whole paragraph extractor.


Figure 6.5 A class diagram of the paragraph extractor

6.5 Implementation

Our paragraph extractor can perform two services on a text. First, it extracts the most important paragraphs in the text by constructing a path; two different paths are implemented, the depth-first path and the bushy node path. Second, it displays the document-document matrix of the document, which maps the relations between the paragraphs. This matrix could be useful for other studies in text extraction or information retrieval.


The basic functionality of our paragraph extractor is:

1. Identify paragraphs.
2. Identify words.
3. Identify sentences.
4. Estimate words' position and global frequency.
5. Estimate paragraphs' weight by obtaining the words representing them.
6. Create a document-document matrix.
7. Construct the path of the summary through the paragraphs (whether a depth-first or a bushy node path).
8. Construct the summary.

Some parts of this algorithm were explained in the previous chapter. The previous

points are mainly the basic steps of our algorithm. The source code of the paragraph

extractor is shown in Appendix C. The following is a more detailed explanation of the

algorithm:

1. Read the file: The contents of the file are read. An important point is that this summarizer works only on textual data, not on figures or tables.

2. Parse the paragraphs: A paragraph is terminated by a skipped (blank) line. However, paragraph identification can be tricky, as each style of writing has a different way of terminating a paragraph.

3. Parse the sentences:

a. A sentence is identified as a sequence of characters that are terminated by

a full stop (.), an exclamation mark (!), or a question mark (?).


b. Previous studies have shown that short sentences are not important; therefore, we neglect these sentences from the beginning. There have been no concrete suggestions for the sentence-length cut-off, so we specified a sentence-length threshold that best suits our system.
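As an illustration of this step (a simplified sketch only; the system's own parser is the parseSentences method in Appendix A, which keeps only sentences longer than 25 characters):

import java.util.ArrayList;
import java.util.List;

// A simplified sketch of step 3: split on the three sentence terminators
// and drop sentences at or below a minimum length.
public class SentenceSplitSketch {
    static List<String> splitSentences(String text, int minLength) {
        List<String> sentences = new ArrayList<String>();
        // A sentence is a sequence of characters terminated by '.', '!' or '?'.
        for (String s : text.split("(?<=[.!?])")) {
            String trimmed = s.trim();
            if (trimmed.length() > minLength) // drop short sentences
                sentences.add(trimmed);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> result = splitSentences(
            "Short one. This sentence is long enough to be kept by the extractor!",
            25);
        System.out.println(result);
    }
}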

4. Parse the words: The words are extracted from each sentence.

5. Eliminate unimportant words: Remove the noise words to produce a refined word

set.

a. Some words occur too frequently to be significant. These words are not eligible to represent the document, as they do not convey any important concept (like any, this, that, etc.). To solve this problem we can either use a stop list that contains these noise words, or use a high-frequency cut-off (threshold) where words that occur more often than the cut-off are ignored. Our extractors use a stop list.

b. Most words with fewer than three characters (like it, is, or, etc.) are considered unimportant; therefore, we neglect these words from the beginning.
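A minimal sketch of this step follows (the system's real stop list is hard-coded in the inStopList method in Appendix A and contains many more entries; only a few are repeated here):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A minimal sketch of step 5: drop very short words and stop-list words.
public class NoiseWordFilterSketch {
    static final Set<String> STOP_LIST = new HashSet<String>(Arrays.asList(
            "want", "then", "that", "this", "which", "with", "from", "about"));

    static List<String> filter(String[] words) {
        List<String> kept = new ArrayList<String>();
        for (String w : words) {
            if (w.length() <= 3) continue;                     // drop very short words
            if (STOP_LIST.contains(w.toLowerCase())) continue; // drop noise words
            kept.add(w);
        }
        return kept;
    }
}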

6. This step collects some information on the paragraphs:

a. Get number of sentences in the paragraph.

b. Obtain the words that best represent the paragraph. A word that is eligible

to represent a paragraph should occur several times. We keep track of the

word's Id and its frequency within the paragraph.

7. Perform a refinement on the paragraphs: Each paragraph is represented by some words, and the number of words representing a paragraph is an indication of its importance. Therefore, paragraphs with fewer representative words than a certain threshold are ignored.

8. We have discussed before that each word has some properties: id, value, and the sentences containing it. In this step we find the Ids of the sentences that contain each word.

9. Sort the words representing the paragraph by importance.

10. Obtain the number of characters intended to be included in the summary. The

number of characters in the summary is directly derived from the number of

characters in the original document.

11. Construct the document-document matrix: The documents here are the paragraphs of the document. Figure 6.6 is an example of a document-document matrix for a document about encryption.

Figure 6.6 A document-document matrix of a document on Encryption
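The exact similarity computation is performed by the getDocumentToDocumentMatrix method (Appendix C) and is not reproduced in this chapter, so the following Java sketch only illustrates the idea under an assumed asymmetric overlap measure, chosen to be consistent with the non-commutativity noted for Table 6.1:

import java.util.Map;

// A sketch of step 11 under an ASSUMED similarity measure (the actual
// formula in getDocumentToDocumentMatrix may differ): sim(i,j) is the
// weight of representative words of paragraph i that also appear in
// paragraph j, divided by the total weight of paragraph i. This is
// non-commutative, matching the note on Table 6.1.
public class DocDocMatrixSketch {

    // paragraphWordFreqs[i] maps word id -> frequency within paragraph i.
    static double[][] build(Map<Integer, Integer>[] paragraphWordFreqs) {
        int n = paragraphWordFreqs.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            int totalI = 0;
            for (int f : paragraphWordFreqs[i].values()) totalI += f;
            for (int j = 0; j < n; j++) {
                int shared = 0;
                for (Map.Entry<Integer, Integer> e : paragraphWordFreqs[i].entrySet())
                    if (paragraphWordFreqs[j].containsKey(e.getKey()))
                        shared += e.getValue(); // weight of the word in paragraph i
                m[i][j] = totalI == 0 ? 0 : (double) shared / totalI;
            }
        }
        return m; // diagonal entries equal 1, as required
    }
}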


12. Perform the intended service: If the intended service was to view the document-document matrix, the matrix is exported to an Excel sheet as shown above in Figure 6.6. If the intended service was to summarize using a depth-first node path, go to step 13; if it was to summarize using a bushy node path, go to step 14.

13. Construct the depth-first node path: The depth-first path starts with the first node "1" and searches for the node most similar to it; it then searches for the node most similar to that node, and so on (Figure 6.7 shows an illustration). This continues until the chosen paragraphs reach the limit of the summary. The selected nodes constitute the summary.

Figure 6.7 An illustration of the depth-first node path algorithm
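A sketch of step 13 follows (under assumed data structures; the dissertation's own implementation is the createDepthFirstPath method in Appendix C):

import java.util.ArrayList;
import java.util.List;

// A sketch of step 13: start at node 0, repeatedly jump to the most
// similar unvisited node, and stop when the summary length limit is hit.
public class DepthFirstPathSketch {
    static List<Integer> build(double[][] sim, int[] paragraphLengths, int limit) {
        boolean[] visited = new boolean[sim.length];
        List<Integer> path = new ArrayList<Integer>();
        int current = 0, lengthSoFar = 0;
        while (lengthSoFar + paragraphLengths[current] <= limit) {
            visited[current] = true;
            path.add(current);
            lengthSoFar += paragraphLengths[current];
            int next = -1;
            for (int j = 0; j < sim.length; j++)   // most similar unvisited node
                if (!visited[j] && (next == -1 || sim[current][j] > sim[current][next]))
                    next = j;
            if (next == -1) break;                 // all nodes visited
            current = next;
        }
        return path;
    }
}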

14. Construct the bushy node path: The bushiest nodes are selected.

a. Using the document-document matrix shown in Figure 6.6, we sum the similarity values of each row. For example, the summation of row 1 is 1+0.66667+0+0+0+0+1+0.33333+0 = 3.

b. The rows (nodes) are sorted in a descending order with respect to their

summation values.

c. The sorted nodes are selected one by one until they reach the limit.

d. The selected nodes are sorted again with respect to their position in the

original document.

e. The sorted selected nodes finally compose the summary.
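The following sketch mirrors steps a-e above (under assumed data structures; the dissertation's own implementation is the createBushyNodePath method in Appendix C):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// A sketch of steps a-e: rank nodes by their row sums in the
// document-document matrix, pick the bushiest until the limit is
// reached, then restore the original text order.
public class BushyNodePathSketch {
    static List<Integer> build(double[][] sim, int[] paragraphLengths, int limit) {
        List<Integer> ids = new ArrayList<Integer>();
        final double[] weight = new double[sim.length];
        for (int i = 0; i < sim.length; i++) {
            ids.add(i);
            for (double s : sim[i]) weight[i] += s;   // step a: row summation
        }
        // Step b: sort nodes by descending weight.
        ids.sort(Comparator.comparingDouble((Integer i) -> weight[i]).reversed());
        // Step c: select nodes one by one until the summary limit is reached.
        List<Integer> selected = new ArrayList<Integer>();
        int lengthSoFar = 0;
        for (int id : ids) {
            if (lengthSoFar + paragraphLengths[id] > limit) break;
            selected.add(id);
            lengthSoFar += paragraphLengths[id];
        }
        Collections.sort(selected);   // step d: back to original text order
        return selected;              // step e: these nodes compose the summary
    }
}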

6.6 Example

This part shows an example of a document summarized using our paragraph extractor. The document to be summarized is about e-commerce and is titled Consumer perceptions of Internet retail service quality. The original document is shown in Figure 5.8 in Chapter 5.

6.6.1 Depth-First Node Path

Figure 6.8 shows the summary of the document Consumer perceptions of Internet retail service quality, shown in Figure 5.8.


Figure 6.8 A summary based on the depth-first node path algorithm


6.6.2 Bushy Node Path

Figure 6.9 shows the summary of the document Consumer perceptions of Internet retail service quality, shown in Figure 5.8.

Figure 6.9 A summary based on the bushy node path algorithm


Section 3: Evaluation and results

7 Evaluation

This chapter will be concerned with evaluating the three extractors implemented, in order to produce guidance on which one to use in which circumstances.

7.1 Evaluation Procedures

This evaluation will take place as follows:

1. Specify the aspects of the summarization process.
2. Prepare the test data to perform the evaluation on.
3. Bring in testers (participants) with different backgrounds to test the system.
4. Present a survey to these testers asking them some questions.
5. Obtain the results.
6. Perform some analysis on the survey results.

7.2 Aspects of the extraction process

We have discussed before in Chapter 3 that the compression rate of the

summarization process controls the quality and style of the summary. Studies have shown

that the best compression rate of the summary would be 25% of the original document

[12]. All the text extractors implemented in this dissertation use this compression rate.


We have also discussed in Chapter 1 the factors that affect the quality and style of the summary (intent, background, coverage, and focus). We will fix the intent, coverage, and focus factors, so that the background factor is the only variable.

The intent of the summary will be informative, the coverage will be single-document, the focus will be generic, and the text genre is academic papers. The user's knowledge will be the variable in this study; a user's knowledge of a subject area falls into one of the following levels: novice, shallow, medium, or expert.

7.3 Test Data

The test data for the evaluation process was carefully chosen to be well representative of the academic-paper corpus. It was drawn from different fields, such as e-commerce, e-learning, image processing, NLP, IR, and GIS (Geographical Information Systems).

7.4 Participant Users

The users chosen for this survey were selected to represent the full range of user knowledge, so that we can test the summaries with users of different backgrounds: novice, shallow, medium, or expert.

7.5 Evaluation Techniques


A user chooses an algorithm and a test document and then asks the system to perform the summarization of the chosen document using the specified algorithm. A summary is then constructed.

After the user reads the summary, the evaluation process begins. We use empirical methods to perform the evaluation [25]. Empirical methods are evaluation methods that require the involvement of participants. There are many empirical methods used for evaluation, such as co-discovery, user workshops, think-aloud protocols, field observation, and questionnaires. We have chosen two empirical methods for our evaluation:

1. Questionnaires: A group of questions presented to the users. The questionnaire type used in this study is a fixed-response questionnaire, where the users are asked to register their opinions on a Likert scale (a scale from 1 to 5, where one represents the lowest weight and five represents the highest). The results of the questionnaire should tell us what the user thinks of the summary's informativeness, readability, and coherence; these are the properties examined by any evaluation process and are discussed in detail in Chapter 3. The user is then asked to grade the summarizer.

2. Field Observation: It involves watching how users interact with the system. We

will observe the user's reactions while reading the summary and while answering the

questionnaire.

7.6 Evaluation Equipment

This part will be concerned with discussing the equipment used to perform the

evaluation, which are the user interface and the evaluation sheet.


7.6.1 User Interface

A user interface of the system was constructed for evaluation purposes. The user

can select the extraction method and the test data then he submits to view the

summary. Figure 5.1 shows a snapshot of the interface while selecting the extraction

method.

Figure 7.1 Interface of the extractor system where the user chooses the method

Figure 7.2 shows a snapshot of the interface while choosing the test data to perform

the extraction on.


Figure 7.2 Interface of the system where the user chooses the document to perform the extraction on

Figure 7.3 shows the result page: at the top there is a link to the original document, in case the user wants to go back to it, and then the summary is presented.


Figure 7.3 The result page of our extractor where a summary is provided for the document

7.6.2 Evaluation Sheet

After the user reads the summary, he fills in a form. The form first asks some general

questions like the user's name, the document title, the subject area of the document, and

the user's knowledge about the subject area. Then some questions are asked about the

summary. A sample of the questionnaire is shown in Appendix D.


7.7 Evaluation Results

About 90 evaluation sheets were filled in, 30 for each method. The questionnaires were completed by users with different backgrounds: within each method, 10 were filled in by experts, 13 by medium users, and 8 by novice users. The numerical results of this evaluation are shown in Appendix F.

7.7.1 Expert Users Results

The users found that the sentence extractor algorithm and the bushy node path algorithm equally retain the information of the original document, with a mean of 3.5. However, the depth-first node path algorithm was found to be slightly below medium in retaining the information of the original document, with a mean of 2.8.

The users found the two paragraph extractors more misleading than the sentence extractor. However, the misleadingness ratings of the paragraph extractors are below the medium, which diminishes the risk.

The users found that the sentence extractor provides the most elegant summary, with a mean of 3.33. The depth-first comes second with a mean of 2.88, and the bushy comes last with a mean of 2.5.

The users found that the depth-first algorithm requires the least background knowledge about the subject area compared with the other algorithms. The users felt that the sentence extractor requires slightly more background about the subject area, and the bushy node algorithm is viewed as the one requiring the most background knowledge.


The users graded the bushy node path to be the best algorithm, with a mean of 4. Next is the sentence extractor with a mean of 3.8. Finally, the depth-first algorithm comes last with a mean of 3.

Figure 7.4 shows the chart of the expert users.

Figure 7.4 A chart of the questionnaire answers of the expert users. The vertical axis represents the mean and the horizontal axis represents the questions. Series 2: the sentence extractor algorithm. Series 3: the depth-first node path extractor algorithm. Series 4: the bushy node path extractor algorithm.


7.7.2 Medium Users Results

The users found that the sentence extractor algorithm and the depth-first node path

algorithm equally retain the information of the original document with a mean of 3.77.

However, the bushy node path algorithm was found to be slightly below the other two

algorithms in retaining the information of the original document with a mean of 3.58.

The users found that the bushy node algorithm is the most misleading, with a mean of 2.5. The other paragraph extractor is considered less misleading, with a mean of 2.36. The sentence extractor is considered the least misleading, with a mean of 2.

The users found that the bushy node algorithm provides the most elegant summary, with a mean of 4. Right behind it comes the depth-first with a mean of 3.92, and the sentence extractor comes last with a mean of 3.15.

The users found that the sentence extractor requires the least background knowledge about the subject area compared with the other algorithms, that the depth-first algorithm requires slightly more background, and that the bushy node algorithm comes last, requiring the most. The differences between the three means are equal.

The users graded the bushy node path as the last algorithm, with a mean of 3.5 (which is still good, as it is above the medium). The sentence extractor is considered the best by these users, with a mean of 3.84, followed by the depth-first algorithm with a mean of 3.6.

Figure 7.5 shows a chart of the medium users.


Figure 7.5 A chart of the questionnaire answers of the medium users. The vertical axis represents the mean and the horizontal axis represents the questions. Series 2: the sentence extractor algorithm. Series 3: the depth-first node path extractor algorithm. Series 4: the bushy node path extractor algorithm.

7.7.3 Beginner Users Results

Novice users consider the depth-first algorithm to be the best at retaining the information of the original document, with a mean of 3.625. The other two algorithms are almost the same in information retention, with a mean of 3.2.


The users found that the bushy node path is the most misleading with a mean of 3

(which is exactly the medium). The other two algorithms are rated slightly below the

medium.

The depth-first algorithm was the most elegant, with a mean of 3.875. The other two algorithms have the exact same elegance rating.

Again, the bushy node path algorithm requires the user to have a large amount of background knowledge, with a mean of 4. Next is the sentence extractor with a mean of 3.25, and the depth-first algorithm comes last with a mean of 2.25.

The depth-first algorithm is graded the best, with a mean of 3.875. The other two algorithms have the exact same mean of 3.

Figure 7.6 shows a chart of the beginner (novice) users.


Figure 7.6 A chart of the questionnaire answers of the novice users. The vertical axis represents the mean and the horizontal axis represents the questions. Series 2: the sentence extractor algorithm. Series 3: the depth-first node path extractor algorithm. Series 4: the bushy node path extractor algorithm.


Section 4: Discussion

8 Discussion

In Chapter 1, I stated my aims and objectives for this thesis. I succeeded in achieving the following:

• I implemented an automatic summarizer that is based on sentence extraction.

• I implemented an automatic summarizer that is based on bushy node path

paragraph extraction.

• I implemented an automatic summarizer that is based on depth-first node path

paragraph extraction.

• I have enabled the public to contribute and share their opinions on the automatic summaries by asking them some questions.

• I presented many questions to them, although I did not use them all; the results of these questions could be analyzed in the future.

• I evaluated the text summarizers with regard to the background aspect.

• I have been able to test how informative the summarizers are.

• I have been able to determine which of the summarizers used is the safest (least misleading).

• I performed some analysis on the results of the questionnaire, which yielded interesting observations. More extensive analysis could be performed on these results, from which further interesting conclusions could be obtained.

8.1 will discuss the conclusions I reached and the interesting observations I made. 8.2 will discuss the future work I intend to do.


8.1 Conclusions

Interesting conclusions were found in this study. These conclusions are backed up by

the results of the evaluation and the personal experience in testing the algorithms. The

most important and positive conclusions are shown below.

The depth-first algorithm proved to be the most appropriate algorithm for novice users, the sentence extractor algorithm proved to be the most appropriate for medium users, and the bushy node algorithm proved to be the most appropriate for expert users. The bushy node method in particular was given very high grades in the questionnaire, which reinforces the finding that expert users prefer a bushy node algorithm. This conclusion is logical: an individual who is an expert in a specific subject area would prefer to have the basic (most important) points of the document, which is what a bushy node algorithm provides.

The bushy node algorithm is the algorithm that requires the user to have the most background knowledge about the subject area; this is backed up by the results of the questionnaire, and could be a reason why novice users gave this summarizer very low grades. The sentence extractor algorithm proved to be the safest algorithm, as it is the least misleading. The depth-first node algorithm proved to construct the most elegant summaries.


These are some guidelines for which algorithm to use, backed up by the evaluation results and my personal experience in testing the algorithms:

• It is recommended to use a bushy node algorithm for experts.
• It is recommended to use a depth-first node algorithm for users who are new to the subject area.
• Students who need a summary of their course material the night before an exam would find the sentence extractor best for them, as it provides small notes that are easy to remember.
• If the document is not very large, it is recommended not to use either of the paragraph extractors.
• If the document is small, i.e. a page or two, it is recommended not to use any of the extractors.
• If the document to be summarized is a critical document, it is recommended to use the sentence extractor algorithm.
• No specific algorithm can be recommended if you are looking for elegance, because the summarizers' elegance grades are very similar; however, they all provide elegant summaries.


8.2 Future Work

Automatic text summarization will always be a very rich field for software development, as it faces many challenges and can provide a massive number of opportunities.

The summarizers could be enhanced to construct more coherent summaries by adding the functionality of lexical chains to the system and by using pronominal resolution, i.e. resolving which preceding expression in the text a pronoun refers to [8].

More statistics could be computed on the results of the evaluation of the three algorithms; as mentioned before, these results could provide interesting information if analyzed more deeply.

The algorithms could also be applied to different text genres and to different focuses (query-relevant or generic).

Video summarization has become a very interesting field that faces many problems and challenges; performing summarization on video would be a very interesting project.


References

[1] Choi, F. Y. Y., Wiemer-Hastings, P. and Moore, J., 2001, Latent semantic analysis for text segmentation. In Proceedings of EMNLP (Pittsburgh, USA), pp. 109-117.

[2] Edmundson, H., 1969, New methods in automatic extracting. Journal of the ACM 16(2):264-285.

[3] Hovy, E. and Marcu, D., 1998, Automatic Text Summarization Tutorial, http://www.isi.edu/~marcu/acl-tutorial.ppt, Accessed (23/6/2006).

[4] Silber, G. and McCoy, K. F., 2000, Efficient Text Summarization Using Lexical Chains.

[5] Firmin, T. and Chrzanowski, M. J., 1999, An Evaluation of Automatic Text Summarization Systems.

[6] Graesser, A. C., et al., 2001, Intelligent tutoring systems with conversational dialogue. AI Magazine, 22(4).

[7] Halliday, M. and Hasan, R., 1976, Cohesion in English. Longman, London.

[8] Hassel, M., 2000, Pronominal Resolution in Automatic Text Summarisation. Master Thesis, University of Stockholm, Department of Computer and Systems Sciences (DSV).

[9] Dalianis, H., 2003, Automatic Text Summarization, www.gslt.hum.gu.se/courses/ia/OHSummarizeSept2003.pdf, Accessed (23/8/2006).

[10] Hovy, E. H. and Lin, C.-Y., 1998, Automated Text Summarization in SUMMARIST. In M. Maybury and I. Mani (eds), Intelligent Scalable Text Summarization. Forthcoming.

[11] Hu, X., Cai, Z., Louwerse, M., Olney, A., Penumatsa, P., Graesser, A. C. and TRG, 2003, A revised algorithm for Latent Semantic Analysis. In Proceedings of the 2003 International Joint Conference on Artificial Intelligence, pp. 1489-1491.

[12] Jing, H., Barzilay, R., McKeown, K. and Elhadad, M., 1998, Summarization Evaluation Methods: Experiments and Analysis. Working Notes of the AAAI-98 Spring Symposium on Intelligent Text Summarization, pp. 60-68.

[13] Jing, H., 2000, Sentence reduction for automatic text summarization. In Proceedings of ANLP.

[14] Kupiec, J., Pedersen, J. and Chen, F., 1995, A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68-73, Seattle, Washington.

[15] Landauer, T. K. and Dumais, S. T., 1997, A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.

[16] Landauer, T. K., Foltz, P. W. and Laham, D., 1998, An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

[17] Luhn, H., 1958, The automatic creation of literature abstracts. IBM Journal of Research and Development 2(2).

[18] Mani, I., 2001, Summarization Evaluation: An Overview. In Proceedings of the Second NTCIR Workshop on Research in Chinese

[19] Mani, I. and Maybury, M. (eds), 1999, Advances in Automatic Text Summarization. MIT Press.

[20] Hassel, M. and Dalianis, H., 2004, Generation of Reference Summaries, http://www.nada.kth.se/~xmartin/papers/ltc_026_hassel_dalianis_final.pdf, Accessed (22/7/2006).

[21] Hassel, M. and Dalianis, H., 2005, Generation of reference summaries. In Proceedings of the 2nd Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 21-23.

[22] Mitra, M., Singhal, A. and Buckley, C., 1997, Automatic text summarization by paragraph extraction. In Proceedings of the ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization, pp. 31-36, Madrid, Spain.

[23] Morris, J. and Hirst, G., 1991, Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics 17(1):21-48.

[24] Pachantouris, G., 2005, GreekSum - A Greek Text Summarizer. Master Thesis, Department of Computer and Systems Sciences, KTH-Stockholm University.

[25] Jordan, P. W., 1998, An Introduction to Usability. Taylor & Francis Ltd.

[26] Barzilay, R. and Elhadad, M., 1997, Using lexical chains for text summarization. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain.

[27] Silber, H. G. and McCoy, K. F., 2000, Efficient text summarization using lexical chains. In Proceedings of Intelligent User Interfaces 2000.

[28] Wikipedia, http://en.wikipedia.org/wiki/Singular_Value_Decomposition, Accessed (18/8/2006).


Appendices

Appendix A: Basic classes of the system

This part contains the basic classes used in the system, which are Extractor, Paragraph, Sentence, Word, and WordFrequency.

First, the Paragraph class:

//////////////////////////////////////////Paragraph//////////////////////////////////////////////////////////////
/*
 * Paragraph.java
 *
 * Created on August 28, 2006, 7:28 PM
 *
 * @author Omar Azzam
 */
public class Paragraph {

    int id;
    String value;
    int numOfSentences;
    int numOfChars;
    WordFrequency[] representativeWordsFrequency;

    /** Creates a new instance of Paragraph */
    public Paragraph() {
    }
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Second, the WordFrequency class:

//////////////////////////////////////////WordFrequency//////////////////////////////////////////////////////////
/*
 * WordFrequency.java
 *
 * Created on August 24, 2006, 12:53 PM
 *
 * @author Omar Azzam
 */
public class WordFrequency {

    int id, frequency;

    /** Creates a new instance of WordFrequency */
    public WordFrequency(int id, int frequency) {
        this.id = id;
        this.frequency = frequency;
    }
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Third, the Sentence class:

//////////////////////////////////////////Sentence////////////////////////////////////////////////////////////////
import java.util.ArrayList;

/*
 * Sentence.java
 *
 * Created on August 24, 2006, 12:48 PM
 *
 * @author Omar Azzam
 */
public class Sentence {

    Integer id;
    String value;
    ArrayList<WordFrequency> wordFrequency = new ArrayList<WordFrequency>();
    WordFrequency sentenceWordFrequency[];

    /** Creates a new instance of Sentence */
    public Sentence(int id, String value) {
        this.id = new Integer(id);
        this.value = value;
    }
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////


Fourth the Word Class: ///////////////////////////////////////Word////////////////////////////////////////////////////////////////////////////////// /* * Paragraph.java * * Created on August 28, 2006, 7:28 PM * * To change this template, choose Tools | Template Manager * and open the template in the editor. */ /** * * @author Omar Azzam */ public class Paragraph { int id; String value; int numOfSentences; int numOfChars; WordFrequency[] representativeWordsFrequency; /** Creates a new instance of Paragraph */ public Paragraph() { } } /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// Fifth the Extractor class: ///////////////////////////////////////Extractor///////////////////////////////////////////////////////////////////////////// /* * Extractor.java * * Created on August 31, 2006, 3:52 PM

Page 96: Evaluation of Techniques for Automatic Text Extraction

Page 95

*/ import java.io.*; import java.net.*; import java.util.ArrayList; import javax.servlet.*; import javax.servlet.http.*; /** * * @author Omar Azzam * @version */ public abstract class Extractor extends HttpServlet { /** Processes requests for both HTTP <code>GET</code> and <code>POST</code> methods. * @param request servlet request * @param response servlet response */ // <editor-fold defaultstate="collapsed" desc="HttpServlet methods. Click on the + sign on the left to edit the code."> /** Handles the HTTP <code>GET</code> method. * @param request servlet request * @param response servlet response */ FileInputStream fis;// byte b[]; ArrayList<String> a = new ArrayList<String>(); Sentence[] sentence; int sentenceCapacity = 0; String documentData; String words[]; Word[] preEnhancedWords; Word[] enhancedWords; int preEnhancedWordsCapacity; int enhancedWordsCapacity; int wordCapacity = 0; protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

Page 97: Evaluation of Techniques for Automatic Text Extraction

Page 96

} /** Handles the HTTP <code>POST</code> method. * @param request servlet request * @param response servlet response */ protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { } public String readData(String fileName) { try { fis = new FileInputStream(fileName); b = new byte[fis.available()]; fis.read(b); } catch (FileNotFoundException ex) { ex.printStackTrace(); } catch (IOException ex) { ex.printStackTrace(); } return new String(b); } public void parseSentences() { int cursor = 0; String tempSentence; String restOfText; documentData = documentData.replaceAll("\r",""); documentData = documentData.replaceAll("\n",""); documentData = documentData.replaceAll("\t",""); String data = new String(documentData); cursor = data.indexOf(".");

Page 98: Evaluation of Techniques for Automatic Text Extraction

Page 97

if(cursor> data.indexOf("?") && data.indexOf("?")!=-1) cursor = data.indexOf("?"); if(cursor> data.indexOf("!") && data.indexOf("!")!=-1) cursor = data.indexOf("!"); do { tempSentence = data.substring(0,cursor+1); try { data = data.substring(cursor+1); } catch(StringIndexOutOfBoundsException strIndExc) { break; } if(tempSentence.length()>25) { a.add(sentenceCapacity,tempSentence); sentenceCapacity++; } cursor = data.indexOf("."); if(cursor> data.indexOf("?") && data.indexOf("?")!=-1) cursor = data.indexOf("?"); if(cursor> data.indexOf("!") && data.indexOf("!")!=-1) cursor = data.indexOf("!"); } while(cursor>=0); sentence = new Sentence[sentenceCapacity]; for(int z=0;z<sentenceCapacity;z++) { sentence[z] = new Sentence(z,a.get(z)); } } public void inStopList() {

Page 99: Evaluation of Techniques for Automatic Text Extraction

Page 98

preEnhancedWords = new Word[words.length]; preEnhancedWordsCapacity = 0; boolean exists = false; for(int i=0;i<words.length;i++) { try{ if(words[i].length()>3) //Stop List if( words[i].equalsIgnoreCase("want") || words[i].equalsIgnoreCase("then") || words[i].equalsIgnoreCase("that") || words[i].equalsIgnoreCase("when") || words[i].equalsIgnoreCase("this") || words[i].equalsIgnoreCase("where") || words[i].equalsIgnoreCase("which") || words[i].equalsIgnoreCase("when") || words[i].equalsIgnoreCase("with") || words[i].equalsIgnoreCase("these") || words[i].equalsIgnoreCase("those") || words[i].equalsIgnoreCase("know") || words[i].equalsIgnoreCase("have") || words[i].equalsIgnoreCase("from") || words[i].equalsIgnoreCase("about") || words[i].equalsIgnoreCase("your") || words[i].equalsIgnoreCase("what") || words[i].equalsIgnoreCase("between") || words[i].equalsIgnoreCase("using") || words[i].equalsIgnoreCase("different") || words[i].equalsIgnoreCase("like") || words[i].equalsIgnoreCase("very") || words[i].equalsIgnoreCase("other") || words[i].equalsIgnoreCase("part") || words[i].equalsIgnoreCase("just") || words[i].equalsIgnoreCase("don't") || words[i].equalsIgnoreCase("th ey") || words[i].equalsIgnoreCase("used") || words[i].equalsIgnoreCase("there") || words[i].equalsIgnoreCase("also") || words[i].equalsIgnoreCase("than") || words[i].equalsIgnoreCase("such") || words[i].equalsIgnoreCase("more") || words[i].equalsIgnoreCase("is") || words[i].equalsIgnoreCase("many") || words[i].equalsIgnoreCase("of") || words[i].equalsIgnoreCase("and") || words[i].equalsIgnoreCase("at") || words[i].equalsIgnoreCase("an") || words[i].equalsIgnoreCase("a") || words[i].equalsIgnoreCase("all") || words[i].equalsIgnoreCase("on") || words[i].equalsIgnoreCase("no") || words[i].equalsIgnoreCase("one") || words[i].equalsIgnoreCase("two") || words[i].equalsIgnoreCase("three") || words[i].equalsIgnoreCase("four") || words[i].equalsIgnoreCase("five") || words[i].equalsIgnoreCase("six") || words[i].equalsIgnoreCase("seven") || words[i].equalsIgnoreCase("eight") || words[i].equalsIgnoreCase("nine") || words[i].equalsIgnoreCase("ten") || words[i].equalsIgnoreCase("has") || words[i].equalsIgnoreCase("to ") || words[i].equalsIgnoreCase("yet") || words[i].equalsIgnoreCase("we") || words[i].equalsIgnoreCase("make") || words[i].equalsIgnoreCase("been") || words[i].equalsIgnoreCase("based") || words[i].equalsIgnoreCase("are") || words[i].equalsIgnoreCase("able") || words[i].equalsIgnoreCase("or") || words[i].equalsIgnoreCase("for") || words[i].equalsIgnoreCase("after") ||

Page 100: Evaluation of Techniques for Automatic Text Extraction

Page 99

words[i].equalsIgnoreCase("be") || words[i].equalsIgnoreCase("same") || words[i].equalsIgnoreCase("can") || words[i].equalsIgnoreCase("even") || words[i].equalsIgnoreCase("find") || words[i].equalsIgnoreCase("it") || words[i].equalsIgnoreCase("in") || words[i].equalsIgnoreCase("his") || words[i].equalsIgnoreCase("her") || words[i].equalsIgnoreCase("own") || words[i].equalsIgnoreCase("the") || words[i].equalsIgnoreCase("most") || words[i].equalsIgnoreCase("would") || words[i].equalsIgnoreCase("could") || words[i].equalsIgnoreCase("into") || words[i].equalsIgnoreCase("however") || words[i].equalsIgnoreCase("will") || words[i].equalsIgnoreCase("they") || words[i].equalsIgnoreCase("were") || words[i].equalsIgnoreCase("only") || words[i].equalsIgnoreCase("here") || words[i].equalsIgnoreCase("made")) { } else { exists = false; for(int j=0;j<preEnhancedWordsCapacity;j++) { try{ if(words[i].equalsIgnoreCase(preEnhancedWords[j].value)) { exists = true; break; } }catch(NullPointerException er) { int z = 44; } } if(!exists) { preEnhancedWords[preEnhancedWordsCapacity] = new Word(preEnhancedWordsCapacity,words[i]); preEnhancedWordsCapacity++; } }}catch(NullPointerException rr) { int jkh =33; }

Page 101: Evaluation of Techniques for Automatic Text Extraction

Page 100

} } public void getGlobalFrequencyOfWords() { int frequency=0; String tempData; String tempWord; int startIndex,endIndex; documentData = documentData.toLowerCase(); tempData = documentData; for(int i=0;i<preEnhancedWordsCapacity;i++) { tempData = documentData; ///Get the global frequency of the word frequency=0; tempWord = preEnhancedWords[i].value; tempWord = tempWord.toLowerCase(); do { startIndex = tempData.indexOf(tempWord); if(startIndex==-1) break; else { tempData = tempData.substring(startIndex+tempWord.length()); frequency++; } }while(startIndex!=-1); preEnhancedWords[i].globalFrequency = frequency; } } public void parseWords() { String tempwordsOfEachSentence[]; ArrayList <String[]>wordSent = new ArrayList<String[]>();

Page 102: Evaluation of Techniques for Automatic Text Extraction

Page 101

int i=0,j=0; int index=0; int questionMarkPosition; String tempSentence; for(i=0;i<sentence.length;i++) { tempSentence = new String(sentence[i].value); tempSentence = tempSentence.substring(0,tempSentence.length()-1); tempSentence = tempSentence.replaceAll("!"," "); tempSentence = tempSentence.replaceAll(","," "); tempSentence = tempSentence.replaceAll(";"," "); tempSentence = tempSentence.replaceAll(":"," "); tempwordsOfEachSentence = tempSentence.split(" "); wordCapacity+=tempwordsOfEachSentence.length; wordSent.add(i,tempwordsOfEachSentence); } words = new String[wordCapacity]; tempwordsOfEachSentence = wordSent.get(j++); while(true) { for(int k=0;k<tempwordsOfEachSentence.length;k++) { words[index] = tempwordsOfEachSentence[k]; index++; } try { tempwordsOfEachSentence = wordSent.get(j); } catch(IndexOutOfBoundsException rf) { break; } j++; } } public void removeNoiseWords() { enhancedWordsCapacity = 0; enhancedWords = new Word[preEnhancedWordsCapacity]; for(int i=0;i<preEnhancedWordsCapacity;i++) { if(preEnhancedWords[i].globalFrequency>2)

Page 103: Evaluation of Techniques for Automatic Text Extraction

Page 102

{ enhancedWords[enhancedWordsCapacity] = new Word(enhancedWordsCapacity,preEnhancedWords[i]); enhancedWordsCapacity++; } } } public abstract void exportToExcelSheet(); /** Returns a short description of the servlet. */ public String getServletInfo() { return "Short description"; } // </editor-fold> } /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// Appendix B: The sentence extractor program /* * SentenceExtractor.java * * Created on August 28, 2006, 11:54 AM */ import java.io.*; import java.net.*; import java.util.ArrayList; //import java. import javax.servlet.*; import javax.servlet.http.*; /** * * @author Omar Azzam * @version

Page 104: Evaluation of Techniques for Automatic Text Extraction

Page 103

*/ public class SentenceExtractor extends Extractor { int numOfSentencesRequiredInSummary; int termDocumentMatrix[][]; PrintWriter out; Sentence enhancedSentence[]; int sortedSentencesID[]; int enhancedSentenceCapacity; String fileName = null;// = "C:/Trial.txt"; Word sortedEnhancedWords[]; boolean export = false; // <editor-fold defaultstate="collapsed" desc="HttpServlet methods. Click on the + sign on the left to edit the code."> /** Handles the HTTP <code>GET</code> method. * @param request servlet request * @param response servlet response */ protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { out = response.getWriter(); export = false; fileName = null; documentData = null; enhancedSentence = null; enhancedSentenceCapacity = 0; numOfSentencesRequiredInSummary = 0; preEnhancedWords = null; preEnhancedWordsCapacity = 0; sentence = null; sentenceCapacity = 0; sortedEnhancedWords = null; sortedSentencesID = null; termDocumentMatrix = null; wordCapacity = 0; words = null;

Page 105: Evaluation of Techniques for Automatic Text Extraction

Page 104

if(request.getParameter("service").equalsIgnoreCase("export")) { response.setContentType("application/vnd.ms-excel"); export = true; } fileName = request.getParameter("file"); documentData = readData(fileName); parseSentences(); parseWords(); inStopList(); getPreEnhancedWordsLength(); getGlobalFrequencyOfWords(); removeNoiseWords(); fillinWordClass(); fillInWordFrequencyClass(); enhancedSentence = sentence; enhancedSentenceCapacity = sentenceCapacity; numOfSentencesRequiredInSummary = getNumberOfSentencesRequiredInSummary(); termDocumentMatrix = createTermDocumentMatrix(); if(export) exportToExcelSheet(); else { // Set the outlook of the web page. int inn = fileName.lastIndexOf("/"); String tempFileName = fileName.substring(inn+1); tempFileName = tempFileName.replace(".txt",""); String title = tempFileName.replaceAll("/",""); String tt = null; boolean problema = false; try { tt = fileName.substring(0,inn); } catch(Exception e){ problema = true; } if(!problema) { tt+="/OriginalData/"+tempFileName+"OriginalDocument.htm"; title = title.replaceAll("\"","");

Page 106: Evaluation of Techniques for Automatic Text Extraction

Page 105

title = title.replaceAll("C:",""); tt.replaceAll("/","\""); out.println("<title>"+title+"</title>"); out.println("<br><a href='"+tt+"'>Full Document</a>"); } out.println("<br><br><h4>Summary</h4><br><br>"); String summarySentencesIds = ""; int temp; int []sortedSentencesId = sortSentences(); int selectedSentencedId[] = new int[numOfSentencesRequiredInSummary]; for(int i=0;i<numOfSentencesRequiredInSummary;i++) { selectedSentencedId[i] = sortedSentencesId[i]; } for(int i=0;i<numOfSentencesRequiredInSummary-1;i++) { for(int j=i+1;j<numOfSentencesRequiredInSummary;j++) { if(enhancedSentence[selectedSentencedId[i]].id>enhancedSentence[selectedSentencedId[j]].id) { temp = selectedSentencedId[i]; selectedSentencedId[i] = selectedSentencedId[j]; selectedSentencedId[j] = temp; } } } for(int i=0;i<numOfSentencesRequiredInSummary;i++) { out.println(enhancedSentence[selectedSentencedId[i]].value); out.println("<br>------<br>"); } } } /** Handles the HTTP <code>POST</code> method. * @param request servlet request * @param response servlet response */

Page 107: Evaluation of Techniques for Automatic Text Extraction

Page 106

protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { } /** Returns a short description of the servlet. */ public String getServletInfo() { return "Short description"; } public void removeNoiseSentences() { enhancedSentence = new Sentence[sentence.length]; enhancedSentenceCapacity = 0; boolean eligible = true; for(int i=0;i<sentence.length;i++) { if(sentence[i].sentenceWordFrequency.length<3) { enhancedSentence[enhancedSentenceCapacity] = sentence[i]; enhancedSentence[enhancedSentenceCapacity].id = enhancedSentenceCapacity; enhancedSentenceCapacity++; } } } public int getNumberOfSentencesRequiredInSummary() { return (int)(Math.floor(enhancedSentence.length/4)); } public void fillinWordClass() { int cursor; int wordFrequency; String tempSentence; String tempWord; WordFrequency tempWordFrequency; for(int i=0;i<enhancedWordsCapacity;i++) {

Page 108: Evaluation of Techniques for Automatic Text Extraction

Page 107

tempWord = enhancedWords[i].value; enhancedWords[i].numberOfSentencesContainingIt = 0; for(int j=0;j<sentence.length;j++) { tempSentence = sentence[j].value; wordFrequency=0; if((cursor=tempSentence.indexOf(tempWord))>=0) { do { wordFrequency++; tempSentence = tempSentence.substring(cursor+tempWord.length()); cursor = tempSentence.indexOf(tempWord); } while(cursor>=0); enhancedWords[i].numberOfSentencesContainingIt++; enhancedWords[i].sentenceContainingIt.add( new Integer(sentence[j].id)); tempWordFrequency = new WordFrequency( enhancedWords[i].id,wordFrequency); sentence[j].wordFrequency.add(tempWordFrequency); } } } for(int i=0;i<enhancedWordsCapacity;i++) { enhancedWords[i].sentencesIdContainingIt = new int[enhancedWords[i].numberOfSentencesContainingIt]; for(int j=0;j< enhancedWords[i].numberOfSentencesContainingIt;j++) { enhancedWords[i].sentencesIdContainingIt[j] = enhancedWords[i].sentenceContainingIt.get(j).intValue(); } } }

Page 109: Evaluation of Techniques for Automatic Text Extraction

Page 108

public void getPreEnhancedWordsLength() { for(int i=0;i<preEnhancedWords.length;i++) { if(preEnhancedWords[i]==null) { preEnhancedWordsCapacity = i-1; break; } } } public void fillInWordFrequencyClass() { ArrayList<WordFrequency> tempWordFrequency; Sentence tempSentence; int wordFrequencyCounter; for(int i=0;i<sentenceCapacity;i++) { tempSentence = sentence[i]; tempWordFrequency = tempSentence.wordFrequency; wordFrequencyCounter=0; while(true) { try{ tempWordFrequency.get(wordFrequencyCounter); }catch(IndexOutOfBoundsException igg){ break; } wordFrequencyCounter++; } tempSentence.sentenceWordFrequency = new WordFrequency[wordFrequencyCounter]; for(int j=0;j<wordFrequencyCounter;j++) tempSentence.sentenceWordFrequency[j] = (WordFrequency)tempWordFrequency.get(j);

Page 110: Evaluation of Techniques for Automatic Text Extraction

Page 109

} } public int[][] createTermDocumentMatrix() { int tdm[][] = new int[enhancedSentenceCapacity][enhancedWordsCapacity]; int counter,wordFrequencyCounter; Sentence tempSentence; Word tempWord; WordFrequency[] tempWordFrequency; for(int i=0;i<enhancedSentenceCapacity;i++) { counter = 0; tempSentence = enhancedSentence[i]; tempWordFrequency = tempSentence.sentenceWordFrequency; for(int j=0;j<tempWordFrequency.length;j++) { tdm[i][tempWordFrequency[j].id] = tempWordFrequency[j].frequency; } } return tdm; } public void exportToExcelSheet() { String wordValues = new String(); String sentenceValues = new String(); for(int i=0;i<enhancedWordsCapacity;i++) { wordValues = wordValues +enhancedWords[i].value+"\t"; } out.println(wordValues); for(int i=0;i<enhancedSentenceCapacity;i++) { sentenceValues =""; for(int j=0;j<enhancedWordsCapacity;j++) { sentenceValues = sentenceValues+termDocumentMatrix[i][j]+"\t";

Page 111: Evaluation of Techniques for Automatic Text Extraction

Page 110

} out.println(sentenceValues); } } public int[] sortSentences() { int tempTDM[][] = termDocumentMatrix; int tempSentencesWeight[] = new int[enhancedSentenceCapacity]; int tempSentencesId[] = new int[enhancedSentenceCapacity]; int temp; int sum; for(int i=0;i<enhancedSentenceCapacity;i++) { sum = 0; for(int j=0;j<enhancedWordsCapacity;j++) { sum+=tempTDM[i][j]; } tempSentencesWeight[i] = sum; tempSentencesId[i] = i; } for(int i=0;i<tempSentencesId.length-1;i++) { for(int j=i+1;j<tempSentencesId.length;j++) { if(tempSentencesWeight[i]<tempSentencesWeight[j]) { temp = tempSentencesWeight[i]; tempSentencesWeight[i] = tempSentencesWeight[j]; tempSentencesWeight[j] = temp; temp = tempSentencesId[i]; tempSentencesId[i] = tempSentencesId[j]; tempSentencesId[j] = temp; } } } return tempSentencesId; }

Page 112: Evaluation of Techniques for Automatic Text Extraction

Page 111

} Appendix C: The paragraph extractor program /* * ParagraphExtractor.java * * Created on August 28, 2006, 7:00 PM */ import java.io.*; import java.net.*; import java.util.ArrayList; import javax.servlet.*; import javax.servlet.http.*; /** * * @author Omar Azzam * @version */ public class ParagraphExtractor extends Extractor { PrintWriter out; String documentParagraphs[]; String fileName; int enhancedParagraphsCapacity=0; float documentSize; double summaryLimit; static int dd=0; boolean export = false; boolean summarize = false; double upperRatioLimit = 0.3; double lowerRatioLimit = 0.2; double[][] documentToDocumentMatrix; String service = new String(); String method = new String();


    Paragraph[] paragraphs;
    Paragraph[] enhancedParagraph;

    /**
     * Handles the HTTP <code>GET</code> method.
     * @param request servlet request
     * @param response servlet response
     * @throws javax.servlet.ServletException
     * @throws java.io.IOException
     */
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Reset all per-request state.
        documentData = null;
        documentParagraphs = null;
        documentToDocumentMatrix = null;
        enhancedParagraph = null;
        enhancedParagraphsCapacity = 0;
        enhancedWords = null;
        enhancedWordsCapacity = 0;
        export = false;
        fileName = null;
        method = null;
        paragraphs = null;
        preEnhancedWords = null;
        preEnhancedWordsCapacity = 0;
        sentence = null;
        sentenceCapacity = 0;
        summarize = false;
        service = null;
        wordCapacity = 0;
        words = null;

        method = request.getParameter("method");


        fileName = null;
        service = null;
        out = response.getWriter();
        fileName = request.getParameter("file");
        service = request.getParameter("service");
        fileNames = getFileNames();
        int counterr = 0;
        // Obtain the physical name of the file.
        while (counterr < NUM_OF_TEST_DATA) {
            if (fileName.equalsIgnoreCase(fileNames[counterr][0])) {
                fileName = fileNames[counterr][1];
                break;
            }
            counterr++;
        }
        if (service.equalsIgnoreCase("Summarize"))
            summarize = true;
        else {
            export = true;
            response.setContentType("application/vnd.ms-excel");
        }
        if (method.equalsIgnoreCase("Sentence Extraction")) {
            // Redirect to the sentence extractor servlet ("SentenceExtractor").
            if (summarize)
                response.sendRedirect("http://localhost:8084/trial2/trial?file=" + fileName + "&service=summarize");
            else
                response.sendRedirect("http://localhost:8084/trial2/trial?file=" + fileName + "&service=export");
            return;
        }


        documentData = readData(fileName);
        documentSize = documentData.length();
        parseParagraphs();
        parseSentences();
        parseWords();
        inStopList();
        getPreEnhancedWordsLength();
        getGlobalFrequencyOfWords();
        removeNoiseWords();
        fillInParagraphClass();
        removeIneligibleParagraphs();
        sortWordFrequenciesOfParagraphs();
        documentToDocumentMatrix = getDocumentToDocumentMatrix();
        if (export) {
            exportToExcelSheet();
            return;
        }
        // Set up the look of the web page.
        int inn = fileName.lastIndexOf("/");
        String tempFileName = fileName.substring(inn + 1);
        tempFileName = tempFileName.replace(".txt", "");
        String title = tempFileName.replaceAll("/", "");
        String tt = fileName.substring(0, inn);
        tt += "/OriginalData/" + tempFileName + "OriginalDocument.htm";
        title = title.replaceAll("\"", "");
        title = title.replaceAll("C:", "");
        tt.replaceAll("/", "\"");
        out.println("<title>" + title + "</title>");
        out.println("<br><a href='" + title + "'>Full Document</a>");
        out.println("<br><br><h4>Summary</h4><br><br>");
        // Select the path the summary will go through.
        if (method.equalsIgnoreCase("Paragraph Extraction using Depth First Nodes path"))
            createDepthFirstPath();
        else if (method.equalsIgnoreCase("Paragraph Extraction using Bushy Nodes path"))
            createBushyNodePath();
        else
            out.println("Malformed URL");
    }

    public int getNumberOfSentences(String par) {
        int cursor = 0;
        int sentenceCounter = 0;
        // Treat '.', '?' and '!' as sentence delimiters, taking the earliest.
        // Note: the trailing fragment after the last delimiter is also counted.
        do {
            cursor = par.indexOf(".");
            if (cursor > par.indexOf("?") && par.indexOf("?") != -1)
                cursor = par.indexOf("?");
            if (cursor > par.indexOf("!") && par.indexOf("!") != -1)
                cursor = par.indexOf("!");
            par = par.substring(cursor + 1);
            sentenceCounter++;
        } while (cursor != -1);
        return sentenceCounter;
    }

    /*
     * Creates a path for the summary based on the bushy node path, which
     * selects the bushiest nodes.
     */
    public void createBushyNodePath() {
        summaryLimit = Math.floor((double) documentData.length() / 4);
        int currentCapacity = 0;
        double nodesWeights[] = new double[enhancedParagraphsCapacity];
        int bushiestNodes[] = new int[enhancedParagraphsCapacity];
        double sum;
        // A node's weight is the sum of its similarities to all other nodes.
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            sum = 0;
            for (int j = 0; j < enhancedParagraphsCapacity; j++) {
                sum += documentToDocumentMatrix[i][j];
            }
            nodesWeights[i] = sum;
        }
        bushiestNodes = sort(nodesWeights);
        currentCapacity = 0;
        // Take the bushiest nodes until the summary size limit is reached.
        int numOfParagraphsIncludedInSummary;
        for (numOfParagraphsIncludedInSummary = 0;
             numOfParagraphsIncludedInSummary < enhancedParagraphsCapacity;
             numOfParagraphsIncludedInSummary++) {
            currentCapacity += enhancedParagraph[bushiestNodes[numOfParagraphsIncludedInSummary]].value.length();
            if (currentCapacity > summaryLimit)
                break;
        }
        // Re-order the chosen paragraphs into their original document order.
        int finalParagraphs[] = sortChosenParagraphs(bushiestNodes, numOfParagraphsIncludedInSummary);
        for (int i = 0; i < numOfParagraphsIncludedInSummary; i++) {
            out.println(enhancedParagraph[finalParagraphs[i]].value);
            out.println("<br>-------------------------------------------------------------------<br>");
        }
    }

    public int[] sortChosenParagraphs(int[] selectedParagraphs, int paragraphsInSummary) {
        int[] sortedParagraphs = new int[paragraphsInSummary];
        for (int i = 0; i < sortedParagraphs.length; i++) {
            sortedParagraphs[i] = selectedParagraphs[i];
        }
        return sort(sortedParagraphs);
    }

    public int[] sort(int[] array) {
        int temp;
        for (int i = 0; i < array.length - 1; i++) {
            for (int j = i + 1; j < array.length; j++) {
                if (array[i] > array[j]) {
                    temp = array[i];
                    array[i] = array[j];
                    array[j] = temp;
                }
            }
        }
        return array;
    }

    /*
     * Sorts the eligible nodes (paragraphs) by their weight and returns an
     * array containing the ids of the paragraphs in descending order of weight.
     */
    public int[] sort(double[] array) {
        int[] bushiestNodes = new int[enhancedParagraphsCapacity];
        int tempBushyNode;
        double tempArray;
        for (int i = 0; i < bushiestNodes.length; i++)
            bushiestNodes[i] = i;
        for (int i = 0; i < array.length - 1; i++) {
            for (int j = i + 1; j < array.length; j++) {
                if (array[i] < array[j]) {
                    tempBushyNode = bushiestNodes[i];
                    bushiestNodes[i] = bushiestNodes[j];
                    bushiestNodes[j] = tempBushyNode;
                    tempArray = array[i];
                    array[i] = array[j];
                    array[j] = tempArray;
                }
            }
        }
        return bushiestNodes;
    }
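    // A worked toy example, not part of the original thesis code: for the
    // hypothetical 3x3 document-to-document matrix below, the node weights
    // (row sums) are 1.75, 2.20 and 1.80, so paragraph 1 is the bushiest
    // node and would be the first candidate for the summary.
    static int toyBushiestNode() {
        double[][] d2d = { { 1.0, 0.50, 0.25 },
                           { 0.40, 1.0, 0.80 },
                           { 0.20, 0.60, 1.0 } };
        int bushiest = -1;
        double best = -1;
        for (int i = 0; i < d2d.length; i++) {
            double weight = 0;
            for (int j = 0; j < d2d[i].length; j++)
                weight += d2d[i][j];
            if (weight > best) { best = weight; bushiest = i; }
        }
        return bushiest; // returns 1
    }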


    /*
     * Creates a path for the summary based on the depth-first node path, which
     * begins with the first node, selects the node most similar to it, and so on.
     */
    public void createDepthFirstPath() {
        summaryLimit = Math.floor((double) documentData.length() / 4);
        String summaryPath = "0;";
        double similarity = 0.0;
        int position = -1;
        int counter = 0;
        int ii = 0;
        int jj = 0;
        while (counter <= enhancedParagraphsCapacity && ii < enhancedParagraphsCapacity) {
            position = -1;
            similarity = 0.0;
            // Find the paragraph most similar to the current one.
            for (jj = 0; jj < enhancedParagraphsCapacity; jj++) {
                if (ii == jj)
                    continue;
                if (documentToDocumentMatrix[ii][jj] > similarity) {
                    similarity = documentToDocumentMatrix[ii][jj];
                    position = jj;
                }
            }
            if (position != -1) {
                if (summaryPath.indexOf(position + ";") == -1) {
                    // Follow the most similar node if it is not already on the path.
                    ii = position;
                    summaryPath += position + ";";
                    counter++;
                } else {
                    // Otherwise fall through to the next unvisited paragraph.
                    ii++;
                    if (ii == enhancedParagraphsCapacity)
                        break;
                    summaryPath += ii + ";";
                    counter++;
                }
            } else {
                ii++;
            }
        }
        // Emit paragraphs along the path until the summary size limit is reached.
        String eligibleParagraphs[] = summaryPath.split(";");
        Integer eligibleParagraphsId[] = new Integer[eligibleParagraphs.length];
        for (int i = 0; i < eligibleParagraphs.length; i++)
            eligibleParagraphsId[i] = new Integer(eligibleParagraphs[i]);
        double currentCapacity = 0;
        for (int i = 0; i < eligibleParagraphsId.length; i++) {
            currentCapacity += enhancedParagraph[eligibleParagraphsId[i].intValue()].value.length();
            if (currentCapacity > summaryLimit) {
                out.println(enhancedParagraph[eligibleParagraphsId[i].intValue()].value);
                break;
            }
            out.println(enhancedParagraph[eligibleParagraphsId[i].intValue()].value);
            out.println("<br>----------------------------------------------- <br>");
        }
    }
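    // An illustrative trace, not thesis output: consider a hypothetical
    // four-paragraph document in which paragraph 0 is most similar to 2,
    // paragraph 2 to 3, and paragraph 3 to 2. The walk starts at 0 and jumps
    // to 2, then to 3; from 3 the most similar node (2) is already on the
    // path, so ii advances to 4, which equals the capacity, and the loop
    // ends. summaryPath is then "0;2;3;" and paragraphs are emitted in that
    // order until the quarter-length limit is reached.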

    /*
     * Gets the words that can represent this text excerpt, together with
     * their frequencies in this part.
     */
    public WordFrequency[] getRepresentativeWords(String par) {
        ArrayList<WordFrequency> arrWordFrequency = new ArrayList<WordFrequency>();
        WordFrequency[] representativeWords;
        int numOfRepresentativeWords = 0;
        int wordOccurence;
        for (int i = 0; i < enhancedWordsCapacity; i++) {
            wordOccurence = getWordOccurence(enhancedWords[i].value, par);


            // A word represents the paragraph if it occurs more than twice in it.
            if (wordOccurence > 2) {
                arrWordFrequency.add(new WordFrequency(i, wordOccurence));
                numOfRepresentativeWords++;
            }
        }
        representativeWords = new WordFrequency[numOfRepresentativeWords];
        for (int i = 0; i < numOfRepresentativeWords; i++) {
            representativeWords[i] = arrWordFrequency.get(i);
        }
        return representativeWords;
    }

    /*
     * Returns the number of occurrences of the parameter word in the
     * parameter paragraph.
     */
    public int getWordOccurence(String word, String paragraph) {
        String tempParagraph = new String(paragraph);
        int wordCounter = 0;
        int cursor;
        while (true) {
            cursor = tempParagraph.indexOf(word);
            if (cursor == -1)
                break;
            tempParagraph = tempParagraph.substring(cursor + word.length());
            wordCounter++;
        }
        return wordCounter;
    }

    // Fill in the properties of the paragraph class.
    public void fillInParagraphClass() {
        paragraphs = new Paragraph[documentParagraphs.length];


        for (int i = 0; i < documentParagraphs.length; i++) {
            paragraphs[i] = new Paragraph();
            paragraphs[i].id = i;
            paragraphs[i].value = documentParagraphs[i];
            paragraphs[i].numOfChars = documentParagraphs[i].length();
            paragraphs[i].numOfSentences = getNumberOfSentences(documentParagraphs[i]);
            paragraphs[i].representativeWordsFrequency = getRepresentativeWords(documentParagraphs[i]);
        }
    }

    /*
     * Removes paragraphs that are considered ineligible: paragraphs with
     * fewer than two representative words are not eligible for the summary.
     */
    public void removeIneligibleParagraphs() {
        enhancedParagraph = new Paragraph[paragraphs.length];
        enhancedParagraphsCapacity = 0;
        for (int i = 0; i < paragraphs.length; i++) {
            if (paragraphs[i].representativeWordsFrequency.length > 1) {
                enhancedParagraph[enhancedParagraphsCapacity++] = paragraphs[i];
            }
        }
    }

    /*
     * Parses the document into an array of strings, one paragraph per item;
     * paragraphs are mostly separated by a blank line.
     */
    public void parseParagraphs() {
        documentParagraphs = documentData.split("\r\n\r\n");
    }
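    // A minimal sketch, not part of the original thesis code: splitting on
    // "\r\n\r\n" assumes Windows line endings, so a document saved with Unix
    // newlines would come through as a single paragraph. A more tolerant
    // split could accept any run of blank lines:
    static String[] splitParagraphsTolerantly(String documentData) {
        // "(\r?\n){2,}" matches both "\n\n" and "\r\n\r\n" separators, e.g.
        // "a.\n\nb.\r\n\r\nc." splits into three paragraphs.
        return documentData.split("(\\r?\\n){2,}");
    }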


    public void getPreEnhancedWordsLength() {
        for (int i = 0; i < preEnhancedWords.length; i++) {
            if (preEnhancedWords[i] == null) {
                preEnhancedWordsCapacity = i - 1;
                break;
            }
        }
    }

    /** Returns a short description of the servlet. */
    public String getServletInfo() {
        return "Short description";
    }

    // Within each paragraph, sort the representative words in descending order of frequency.
    public void sortWordFrequenciesOfParagraphs() {
        Paragraph tempParagraph;
        WordFrequency[] tempWordFrequencyArray;
        WordFrequency tempWordFrequency;
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            tempParagraph = enhancedParagraph[i];
            tempWordFrequencyArray = tempParagraph.representativeWordsFrequency;
            for (int j = 0; j < tempWordFrequencyArray.length - 1; j++) {
                for (int k = j + 1; k < tempWordFrequencyArray.length; k++) {
                    if (tempWordFrequencyArray[j].frequency < tempWordFrequencyArray[k].frequency) {


                        tempWordFrequency = tempWordFrequencyArray[j];
                        tempWordFrequencyArray[j] = tempWordFrequencyArray[k];
                        tempWordFrequencyArray[k] = tempWordFrequency;
                    }
                }
            }
            enhancedParagraph[i].representativeWordsFrequency = tempWordFrequencyArray;
        }
    }

    /*
     * Gets the document-to-document matrix mapping the relations between
     * each pair of paragraphs.
     */
    public double[][] getDocumentToDocumentMatrix() {
        double tempMatrix[][] = new double[enhancedParagraphsCapacity][enhancedParagraphsCapacity];
        WordFrequency[] wordFrequencyI, wordFrequencyJ;
        double similarity;
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            wordFrequencyI = enhancedParagraph[i].representativeWordsFrequency;
            for (int j = 0; j < enhancedParagraphsCapacity; j++) {
                if (i == j)
                    tempMatrix[i][j] = 1;
                else {
                    wordFrequencyJ = enhancedParagraph[j].representativeWordsFrequency;
                    similarity = findSimilarity(wordFrequencyI, wordFrequencyJ);
                    tempMatrix[i][j] = similarity / (double) wordFrequencyI.length;
                }
            }
        }
        return tempMatrix;
    }

    /*
     * Finds the similarity between two paragraphs by counting the
     * representative words that the two paragraphs have in common.
     */
    public double findSimilarity(WordFrequency[] wordFrequencyI, WordFrequency[] wordFrequencyJ) {
        double similarity = 0;
        for (int i = 0; i < wordFrequencyI.length; i++)
            for (int j = 0; j < wordFrequencyJ.length; j++)
                if (wordFrequencyI[i].id == wordFrequencyJ[j].id)
                    similarity++;
        return similarity;
    }

    public void exportToExcelSheet() {
        // Writes the document-to-document matrix as tab-separated values.
        String excelSheet;
        for (int i = 0; i < enhancedParagraphsCapacity; i++) {
            excelSheet = "";
            for (int j = 0; j < enhancedParagraphsCapacity; j++) {
                excelSheet += documentToDocumentMatrix[i][j] + "\t";
            }
            out.println(excelSheet);
        }
    }
}
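A note on the similarity measure implemented above: the similarity of paragraph i to paragraph j is the number of representative words the two paragraphs share, divided by the number of representative words of i. With toy numbers, if paragraph i has four representative words and shares two of them with paragraph j, which has five, the matrix holds 2/4 = 0.5 at position (i, j) but 2/5 = 0.4 at (j, i). Because the divisor comes from the row paragraph, the document-to-document matrix is in general not symmetric.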

Appendix D: Evaluation Sheet

Evaluation sheet

Document Title: …………………….        Subject Area: ……………………….

User Name: …………………………        Knowledge about subject area:   Novice / Shallow / Medium / Expert

Please answer the following questions by choosing one of the check boxes. The check boxes range from 1 (very low) to 5 (very high). Please choose only one answer per question.

Question (tick one box per row: 1  2  3  4  5; space is provided for comments)

1. To what extent can the user answer all the questions of the original document by only reading the summary?

2. To what extent is the summary misleading?

3. To what extent is the summary elegant?

4. To what extent does the summary require the user to have background knowledge of the subject area?

5. How would you grade the system overall?

Appendix E: Evaluation Samples

[The evaluation samples were included as full-page images in the original dissertation and are not reproducible in this text version.]

Appendix F: Evaluation Results

Results of the Expert Users:

The mean values of the results of the expert users from the evaluation [table not reproduced in this text version]

Results of the Medium Users:


The mean values of the results of the medium users from the evaluation [table not reproduced in this text version]

Results of the Novice Users:

The mean values of the results of the novice users from the evaluation [table not reproduced in this text version]