A System for Building Corpus Annotated
With Semantic Roles
Sanaz Rahimi Rastgar
Niloufar Razavi
MASTER THESIS 2013 INFORMATICS
Postal address: Box 1026, 551 11 Jönköping
Visiting address: Gjuterigatan 5
Telephone: 036-10 10 00 (vx)
This thesis work has been carried out at the School of Engineering in Jönköping within the subject area of informatics. The work is part of the master's programme with specialization in information technology and management. The authors are themselves responsible for the opinions, conclusions and results presented.
Supervisor: He Tan
Examiner: Vladimir Tarasov
Credits: 30 hp (D-level)
Date: 8 February 2013
Archive number:
Abstract
Semantic role labelling (SRL) is a natural language processing (NLP) technique that maps sentences to semantic representations, and it can be used in many NLP tasks. The goal of this master thesis is to investigate how to support the novel method proposed by He Tan [1] for building a corpus annotated with semantic roles. This goal provides the context for developing a general framework for the work and, based on that framework, implementing a supporting system. The implementation is done in Java. The features of the system reflect the use of frame semantics in understanding and explaining the meaning of lexical items [2]. The prototype system has been evaluated using a biomedical corpus as its dataset. Our supporting environment can create frames with all related associations through XML, update frames and their related information, including definitions, elements and example sentences, and finally annotate the example sentences of a frame. The output of the annotation is a semi-structured schema in which the tokens of a sentence are labelled. We evaluated our system by means of two surveys. The evaluation results showed that our framework and system fulfilled the expectations of users and satisfied them to a good degree. Feedback from users has also identified new areas of improvement for the supporting environment.
Acknowledgements
We would like to thank our supervisor, Dr. He Tan, who gave us excellent support during this master thesis work with her wise advice and guidance, and also our examiner, Dr. Vladimir Tarasov, for his useful advice and discussions on our thesis work.
We would also like to kindly thank our families and friends, whose moral support encouraged us to successfully deliver this thesis.
Niloufar Razavi
Sanaz Rahimi Rastgar
Key words
Corpus Construction, Semantic Role Labelling, Semantic Roles, System
Development, Frame Semantics
Contents
1 Introduction
  1.1 Background
  1.2 Purpose/Objectives
  1.3 Limitations
  1.4 Thesis Outline
2 Theoretical Background
  2.1 NLP and Text Mining Applications
  2.2 Semantic Role Labelling
    2.2.1 Semantic Roles
  2.3 Corpus Annotated with Semantic Roles
    2.3.1 FrameNet
    2.3.2 PropBank
    2.3.3 Semantic Role Labeling for Biomedical Domain
  2.4 A Novel Method for Corpus Construction
  2.5 Related Work
3 Research Methods
  3.1 Awareness of the Problem: Literature Review
  3.2 Suggestion
  3.3 Development: XP Methodology
  3.4 Evaluation: Data Collection and Survey
  3.5 Conclusion Phase
4 Framework and System
  4.1 Framework
    4.1.1 Framing
    4.1.2 Annotation
  4.2 Method of System Implementation
    4.2.1 User Requirements: User Stories
    4.2.2 Software Development Environment
    4.2.3 Interface Design
    4.2.4 Java Class Design
    4.2.5 The Corpus Database
    4.2.6 System Requirements
5 Results and Discussion
  5.1 Theoretical Results
  5.2 Practical Results
    5.2.1 Implementation Results
    5.2.2 Evaluation Result and Discussion
6 Conclusion and Future Work
7 References
8 Appendix
List of Figures
Figure 2-1: FrameNet frame example
Figure 2-2: Annotation layers
Figure 2-3: PropBank frame file example
Figure 2-4: Example of frame definition
Figure 2-5: Available arguments for an example frame
Figure 3-1: Evaluation model
Figure 4-1: General framework of system
Figure 4-2: "Framing" process overview
Figure 4-3: "Annotating" process overview
Figure 4-4: Comparison between different XML schemas
Figure 4-5: XML file containing data through frame semantics
Figure 4-6: Output file containing annotated tokens
Figure 5-1: Adding new frame and its definition
Figure 5-2: Evaluation result regarding easiness of system
Figure 5-3: Evaluation result regarding easiness of system
Figure 8-1: Adding frame elements and definition to a frame
Figure 8-2: Editing a frame and its related options
Figure 8-3: Selection of a sentence for annotation purpose
Figure 8-4: Confirming the selected sentence to be annotated
Figure 8-5: Tokenisation of sentence
Figure 8-6: Setting roles to tokens
Figure 8-7: Saving the result through filled table
Figure 8-8: Annotate using POS tagger
Figure 8-9: Dividing tokens
Figure 8-10: Survey "Easiness"
Figure 8-11: Survey "Speed"
List of Abbreviations
ADVP: Adverbial Phrase
AI: Artificial Intelligence
DTD: Document Type Definition
DSD: Document Structure Description
DSR: Design Science Research
FE: Frame Element
GF: Grammatical Function
GUI: Graphical User Interface
IDE: Integrated Development Environment
IE: Information Extraction
IR: Information Retrieval
IS: Information System
LU: Lexical Unit
NLP: Natural Language Processing
NP: Noun Phrase
PAS: Predicate Argument Structure
POS: Part-Of-Speech
PP: Prepositional Phrase
PT: Phrase Type
SOX: Schema for Object-Oriented XML
SRL: Semantic Role Labelling
ST: Semantic Type
TM: Text Mining
XML: Extensible Markup Language
XDR: XML-Data Reduced
XP: Extreme Programming
1 Introduction
This chapter provides a general understanding of our work by discussing related concepts in the background section. It also presents the research questions that give direction towards reaching the research objective.
1.1 Background
The study of Semantic Role Labelling (SRL) is an important notion in the fields of text mining, information extraction (IE) and Natural Language Processing (NLP), as it helps interpret sentences on the semantic level [3]. SRL deals with identifying the semantic roles, or relationships, in a sentence structure within a semantic frame [4]. Informally, this is known as assigning "who" did something and "what" was done, "to whom, when, where, why, how, etc." [3]. During the past years, projects such as PASBio, BioProp and BioFrameNet have made great efforts to apply SRL in the biomedical domain. However, the development of SRL systems for the biomedical domain has been hampered by the lack of large corpora for the domain. The problems arise from difficulties in defining frames with their associated roles, in assigning example sentences to each semantic frame, and in collecting the sentences from databases [1].
Recently, a method was proposed by He Tan [1] for building a corpus labelled with semantic roles for the biomedical domain. The method makes use of domain knowledge provided by ontologies. Using this method, a corpus related to biological transport events has been built. In this master thesis we have reviewed similar concepts and systems to discover how to support this method of semi-automatic labelling. An important step towards fulfilling the objective is formulating the right research questions; we present them in the next section.
1.2 Purpose/Objectives
A method of building a corpus with frame semantics annotations, using domain knowledge provided by ontologies, was developed by He Tan [1]. Using the method, they successfully built a corpus of biological transport events based on the domain knowledge provided by the GO biological process ontology [1].
The purpose of this thesis work is formulated in three research questions as follows:
1. How can the method of building a corpus annotated with semantic roles using ontological knowledge be supported?
2. What general framework is needed to support the novel method?
3. How can a system based on the general framework be implemented to support this kind of semi-automatic corpus construction?
1.3 Limitations
Regarding the fulfilment of the objectives of the system, explained previously by means of the three research questions, we did not find any limitations. As long as the system delivers the expected goals, there are no limitations to discuss. Currently the system is based on data in the biomedical domain, but it can be used in other fields as well.
1.4 Thesis outline
This document is structured in six chapters:
- Chapter 1, which has covered the outline of the thesis, is the introduction, presenting semantic role labelling and the background and objectives of the work.
- Chapter 2 gives the definitions of the main concepts used in the system and reviews previous related approaches.
- Chapter 3 describes the research method followed to reach the thesis goals.
- Chapter 4 introduces the framework and gives a system overview, as well as explaining the method used for the system implementation.
- Chapter 5 presents the results achieved during the thesis work.
- Chapter 6 consolidates the results and findings into conclusions and presents some ideas for further research.
2 Theoretical Background
This chapter covers the basic knowledge regarding the development of SRL systems, how they work and why they are important in text mining applications, so that the reader can understand the basics of the development process related to the objective of our thesis.
2.1 NLP and Text Mining Applications
Text mining is the process of discovering and extracting interesting information from unstructured text. It involves everything from information retrieval and lexical analysis to information extraction. The main objective of these applications is to turn text into data for analysis by means of NLP and analytical methods [5].
NLP methods try to extract a fuller meaning representation from text. One task on the semantic level can be described as finding out who did what to whom, where, when, how and why. In this light, SRL can be seen as an NLP task, which we describe in more detail in section 2.2. NLP makes it possible to use linguistic concepts, for instance part-of-speech (POS) categories (such as noun, verb, adjective, etc.) and grammatical structure [6]. In other words, NLP has developed different techniques that typically take their inspiration from linguistic concepts. An example is parsing a text syntactically using formal grammar or lexicon information, and then interpreting the resulting information semantically [7].
Working with linguistic concepts and grammatical structure inevitably means dealing with anaphora and ambiguity, where anaphora concerns "what previous noun does a pronoun or other back-referring phrase correspond to" and ambiguity concerns "both of words and of grammatical structure, such as being modified by a given word or prepositional phrase" [6]. For this reason it is important to draw on several knowledge representations, such as:
- a lexicon of words and their meanings (lexical units)
- grammatical properties
- a set of grammar rules
- a thesaurus of synonyms and abbreviations
- other resources, such as ontologies of entities and actions [6]
The tasks approached using text mining techniques split mainly into two groups. Some, such as information retrieval, text categorization and document clustering, operate on the document level, while others, such as document summarization, IE and question answering, operate on the sentence level [5]. Both groups are affected by the problem of "data sparsity" when modelling language accurately, with the emphasis on the latter group [5, 8]. The term data sparsity describes the phenomenon of not having enough data in a corpus to model the language accurately [8]. The lack of data causes problems in observing the true distribution and patterns of the language [8]. The nature of the text mining task, as well as the domain of interest, are other issues that need to be considered.
Text mining technology is broadly applied to various research needs. It has also led to the creation of different applications, such as biomedical or marketing applications. Text mining from biomedical text has grown to be one of the main topics in the bioinformatics field, and NLP methods have been used to increase the potential of text mining from biological text [9].
2.2 Semantic Role Labelling
Automatic semantic role labelling is the NLP task that maps free-text sentences to semantic representations. The task is simply to identify all parts of a sentence and label them with a semantic role for a given predicate [10]. The input of an SRL system is therefore a sentence and a predicate (or target) in that sentence; the output is the sentence labelled with semantic roles. To approach SRL, independently of one's background, an overall understanding of the theory of semantic roles is needed.
SRL is sometimes known as shallow semantic parsing, which consists of recognizing the semantic arguments associated with the predicate or verb of a sentence and classifying them into their specific roles [11]. We can clarify the concept of semantic role labelling with an example:
Assume the sentence "Anna sold the book to Marcus". The steps towards making the meaning of the sentence clear are:
- recognizing the verb "to sell" as representing the predicate
- recognizing "Anna" as representing the seller (agent)
- recognizing "the book" as representing the goods (theme)
- recognizing "Marcus" as representing the recipient
As shown above, SRL is a shallow semantic processing task that has become increasingly popular in the NLP community over the last few years. The task is to identify all parts of a sentence that represent arguments of a given predicate and then label each argument with a semantic role. Roughly speaking, SRL can be thought of as the task of finding the words that answer simple questions of the form who did what to whom, when and where. The input to an SRL system is a single sentence and a predicate in that sentence; the output is the same sentence with labelled semantic roles.
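The input/output contract just described can be sketched as a small data structure in Java, the thesis's implementation language. This is a hedged illustration only: the class SrlExample and its role strings are our own, hard-coded for the running example, and do not represent the thesis system's actual design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of SRL input/output: a sentence plus a predicate go in,
// and a mapping from argument phrases to semantic roles comes out.
// Class and role names are hypothetical, not taken from the thesis system.
public class SrlExample {

    // Maps each argument phrase of the sentence to its semantic role.
    public static Map<String, String> label(String sentence, String predicate) {
        Map<String, String> roles = new LinkedHashMap<>();
        // A real system would find these spans by parsing and classification;
        // here they are hard-coded for the running example.
        if (sentence.equals("Anna sold the book to Marcus") && predicate.equals("sold")) {
            roles.put("Anna", "Seller (agent)");
            roles.put("the book", "Goods (theme)");
            roles.put("Marcus", "Recipient");
        }
        return roles;
    }

    public static void main(String[] args) {
        Map<String, String> out = label("Anna sold the book to Marcus", "sold");
        out.forEach((span, role) -> System.out.println(span + " -> " + role));
    }
}
```

A real labeller would replace the hard-coded branch with argument identification and classification over a parse of the sentence; the surrounding shape of the task stays the same.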
The most important computational lexicons were created by the FrameNet and PropBank projects. A vast number of predicates and their corresponding roles were defined systematically by these lexicons. The first automatic semantic role labelling system, based on FrameNet, was developed by Daniel Gildea and Daniel Jurafsky [12].
2.2.1 Semantic Roles
The relationship that a syntactic constituent has with a predicate is called a semantic role. Agent, patient and instrument are typical semantic arguments [13]. Answering "WH" questions such as "who", "when", "what", "where" and "why" in information extraction, question answering and summarization requires recognizing and labelling semantic arguments. In general, labelling semantic arguments plays a key role in NLP tasks that involve some kind of semantic interpretation. There are different schemes for specifying semantic roles; the most commonly used are the PropBank annotation scheme and FrameNet [14]. PropBank is based on the Penn TreeBank: it adds manually created semantic role annotations to the Penn TreeBank corpus of Wall Street Journal texts. PropBank has been used by many automatic semantic role labelling systems as a training dataset, which helps them learn how to annotate new sentences automatically [11, 15]. The key concept of the FrameNet project is annotation using frame semantics, which supports creating a lexical resource [16].
Semantic roles, also known as thematic roles, are one of the oldest construct classes in linguistic theory. They indicate the role played by each entity in an event, apart from the linguistic encoding of that event [11]. For example, if someone named John hits someone named Bill, John is the agent and Bill is the patient of the hitting event. Agent and patient are the semantic roles in the following sentences:
John hit Bill.
Bill was hit by John.
In both sentences above, Bill has the semantic role of patient and John the role of agent. Although there is no consensus on a list of semantic roles, some basic roles such as agent, patient, theme, location, source and goal are used by all.
Correctly identifying the semantic roles of a sentence is a crucial part of sentence-level text mining applications. The following paraphrases show that, for a single predicate, semantic arguments can have multiple syntactic realizations:
John will meet with Mary.
John will meet Mary.
John and Mary will meet.
The theoretical status of semantic roles in linguistic theory is still unresolved. There is uncertainty about whether semantic roles should be regarded as syntactic or semantic entities. However, the most common view is that semantic roles are conceptual elements that provide a way of classifying the arguments of a sentence [17].
2.3 Corpus Annotated with Semantic Roles
There are different ways of annotating a corpus with semantic roles. Two related works are discussed here to demonstrate how these projects process documents by means of SRL. These literature reviews provide us with knowledge of how the text is processed, and a perspective from which to investigate ways of supporting the proposed method.
2.3.1 FrameNet
FrameNet is a lexical database, based on the theory of frame semantics, that labels words in sentences. A word is stored together with its meaning as a pair called a lexical unit (LU). Each predicate (target word) in a sentence, together with its arguments, is associated with a frame. The basic unit of this framework is the frame, defined as a type of event together with its participants, called frame elements (FEs). An example of a sentence annotated with FrameNet illustrates the concepts [17]:
[Cook Matilde] fried [Food the catfish] [Heating_instrument in a heavy iron skillet].
In this example, the target word "fried" evokes the frame "Apply_heat", which describes a situation involving a "Cook", some "Food" and a "Heating_instrument"; these are its frame elements. Frame-evoking words such as bake, boil, steam and fry are LUs of the "Apply_heat" frame, and any of them can be the target word of an annotated sentence.
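The frame/FE/LU relationships above can be sketched as a minimal data structure in Java. This is an illustrative sketch under our own naming: the class Frame and its fields are hypothetical and do not correspond to FrameNet's schema or the thesis system's class design.

```java
import java.util.List;

// Minimal sketch of a FrameNet-style frame: a name, its frame elements,
// and the lexical units that evoke it. Names are illustrative only.
public class Frame {
    final String name;
    final List<String> frameElements; // participants of the described situation
    final List<String> lexicalUnits;  // words that evoke this frame

    Frame(String name, List<String> fes, List<String> lus) {
        this.name = name;
        this.frameElements = fes;
        this.lexicalUnits = lus;
    }

    // A word evokes this frame if it is one of the frame's lexical units.
    boolean isEvokedBy(String word) {
        return lexicalUnits.contains(word);
    }

    // The Apply_heat example from the text.
    public static Frame applyHeat() {
        return new Frame("Apply_heat",
                List.of("Cook", "Food", "Heating_instrument"),
                List.of("bake", "boil", "steam", "fry"));
    }

    public static void main(String[] args) {
        Frame f = applyHeat();
        System.out.println(f.name + " evoked by 'fry': " + f.isEvokedBy("fry"));
    }
}
```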
To better represent a schematic view of semantic knowledge, another example of a FrameNet frame is shown in figure 2-1. In this example, the GIVING frame relates the frame elements of the verb give to the Donor, Recipient and Theme semantic roles. Other verbs that evoke the GIVING frame are listed as LUs.
Figure 2-1: FrameNet Frame Example [18]
The FrameNet database differs from other dictionaries and thesauri in several exclusive characteristics [17]. Its main corpus is the 100-million-word British National Corpus (BNC). Analysis of the English lexicon proceeds frame by frame rather than word by word, as is done in traditional dictionaries. It provides multiple annotated examples of each lexical unit, illustrating all the combinations of that lexical unit. Each lexical unit is related to a semantic frame, and thereby to the other words that activate that frame.
FrameNet provides a set of relations between frames, including Inheritance, Using, Subframe, Perspective_on, and others. However, the FrameNet database cannot be used as an ontology of things, since many nouns and artefacts are not annotated. The daily work consists of defining a frame with its FEs and LUs (the list of words evoking the frame), extracting example sentences related to the frame, and annotating them. The annotation is done by marking the realizations of FEs, phrase type (PT) and grammatical function (GF). FrameNet comprises three main parts [19]:
- a lexical unit database, containing pairs of a word and its related frame (used to capture the meaning of a word);
- a frame database, containing a set of frames, their associated frame elements, and the relations between frames;
- an example sentence database, containing a collection of lexical attestations of frames, used as a training set for labelling.
The frame development process begins by searching the corpus for attestations of a group of words that seem to have some semantic overlap. These attestations are then divided into groups, at a reasonable level of granularity, to form frames by target words, lexical units and frame elements. This idea is difficult to assess, since some exceptions need to be managed separately. The following criteria are used to form the groups of frames [17]:
- All LUs in a frame should have the same types of frame elements with the same set of transitions.
- The same frame elements must be outlined across all lexical units of a related frame.
- The same interrelations between frame elements should hold for all the LUs in the frame.
- The basic denotation of the target words in a frame should be similar.
- The specifications that the frame-evoking words give to the frame elements of a frame should be similar.
The routine work of FrameNet consists mainly of annotating sentences chosen from a corpus as examples of a particular lexical unit [17]. Initially, the emphasis of annotation was on what was most relevant to lexical descriptions, namely the core and peripheral frame elements of target words. The goal is to annotate words or phrases in a sentence that stand in a grammatical construction with the target word.
For each target word, there is a set of annotation layers for the FEs, phrase types, grammatical functions, etc. Each such set is represented by an entry in the annotation table. In addition to the FE, GF and PT layers, annotators also add labels on other layers, all of which are represented similarly. Certain syntactic information is represented by adding labels on the part-of-speech-specific layer [17]. In choosing the phrase types and grammatical functions, the major criterion was whether or not a particular label might figure in a description of the grammatical requirements of one of the target words.
The annotation starts with labelling parts of the example sentences with tags indicating relevant syntactic and semantic properties. Figure 2-2 shows the annotation layers of the following example sentence in the "Perception-passive" frame: "Helmut saw a tall, black figure against the shining snow."
A constituent of the sentence may express a particular frame element: "Helmut" expresses the FE "Perceiver-passive"; "a tall, black figure" the FE "Phenomenon"; and "against the shining snow" the FE "Ground". The next layer of annotation specifies the phrase type of each of these constituents. Further, the grammatical function with respect to the target word ("see" in the example) is described. These three independent layers are called FE, PT and GF [17].
(TEXT)  Helmut             saw   a tall, black figure   against the snow
FE      Perceiver-passive        Phenomenon              Ground
PT      NP                       NP                      PP
GF      Ext                      Obj                     COMP
Figure 2-2: Annotation layers. Adapted from [17]
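The layered annotation shown in figure 2-2 can be represented as a small structure that attaches one label per layer (FE, PT, GF) to each constituent. The sketch below, in Java, is our own illustration; the class AnnotationLayers and its method names are hypothetical, not FrameNet's actual storage format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of FrameNet-style annotation layers: each constituent of a sentence
// carries one label on each of the FE, PT and GF layers. The representation
// is an illustration, not the real FrameNet annotation table.
public class AnnotationLayers {

    // constituent -> (layer name -> label)
    final Map<String, Map<String, String>> labels = new LinkedHashMap<>();

    void tag(String constituent, String fe, String pt, String gf) {
        Map<String, String> layers = new LinkedHashMap<>();
        layers.put("FE", fe);
        layers.put("PT", pt);
        layers.put("GF", gf);
        labels.put(constituent, layers);
    }

    // The example of figure 2-2: "Helmut saw a tall, black figure against the snow"
    public static AnnotationLayers perceptionPassiveExample() {
        AnnotationLayers a = new AnnotationLayers();
        a.tag("Helmut", "Perceiver-passive", "NP", "Ext");
        a.tag("a tall, black figure", "Phenomenon", "NP", "Obj");
        a.tag("against the snow", "Ground", "PP", "COMP");
        return a;
    }

    public static void main(String[] args) {
        perceptionPassiveExample().labels.forEach(
                (constituent, layers) -> System.out.println(constituent + " " + layers));
    }
}
```

Keeping the layers independent, as FrameNet does, means new layers (for example part-of-speech-specific ones) can be added without changing how existing labels are stored.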
Below, the detailed concepts of the semantic annotation of natural language texts used in the FrameNet project are described.
2.3.1.1 Frame Semantics
Frame semantics starts from the assumption that in order to understand the meanings of the words in a language, we must first have knowledge of the background and motivation for their existence in the language, and for their use in discourse [20]. This knowledge is provided by conceptual structures, or semantic frames.
A frame-semantic view relates each of the relevant words to a background frame. In a technical language it is easy to support the association of a word with a frame, but in some lexical fields, for instance the biomedical domain, semantic theory is not enough to establish the relevance of terms to frames. Given the definitions above, the most important point about frame semantics is its task: understanding and explaining the meanings of lexical items as well as grammatical constructions [2].
As an extension of Charles J. Fillmore's case grammar [21], frame semantics relates linguistic semantics to encyclopaedic knowledge. In other words, its core assumption can be formulated as follows: understanding the meanings of the words of a language requires knowledge of the conceptual structures, or semantic frames, that underlie their usage. For example, one can only understand the meaning of the word "sell" if one knows about the situation of commercial transfer, which involves, among other things, a seller, a buyer, goods, money, the relation between the money and the goods, the relations between the seller and the goods and the money, and the relations between the buyer and the goods and the money [22].
According to the idea of frame semantics presented by Charles Fillmore [21], frames act as a kind of cognitive structuring device that provides the background knowledge and motivation for the existence of words in a language, as well as for understanding their use in discourse [2, 23].
The term frame semantics also covers a wide variety of approaches to the systematic description of natural language meanings. What these approaches have in common can be traced to Fillmore's statement that meanings have an internal structure which is determined relative to a background frame or scene. This common feature alone, however, does not sufficiently distinguish frame semantics from other frameworks of semantic description [24].
Two historical roots of frame semantics are available. First root centres on
linguistic syntax and semantics which is mainly about Fillmore’s case
grammar, the other one is in the field of Artificial Intelligence (AI) and centres
around the concept of frame introduced by Minsky [25].
In more detail, within case grammar a case frame was used to characterize a
small abstract scene, in order to identify the participants of the scene and,
consequently, the arguments of the predicates and sentences describing it. It
is assumed that, in order to understand a sentence, the language user has
mental access to such schematized scenes.
The second historical root concerns frame-based systems of knowledge
representation in AI. This root of frame semantics is a highly structured
approach to knowledge representation, whose goal is to arrange the collected
information about specific objects and events into a taxonomic hierarchy
similar to biological taxonomies [24].
2.3.1.2 Frame
A semantic frame describes an event, a situation or an object, together with
the participants (called frame elements (FE)) involved in it. A word evokes
the frame when its sense is based on the frame. The relations between frames
include is-a, using and subframe [23]. A frame can also be seen as a
collection of facts that identifies "characteristic features, attributes, and
functions of a denotatum, and its characteristic interactions with things
necessarily or typically associated with it" [26]. It can furthermore be
defined as a coherent structure of related concepts, such that understanding
any one of them requires knowledge of all the others.
Words do not only denote individual concepts; they also specify a certain
perspective from which the frame is viewed. For example, "sell" describes the
situation from the perspective of the seller and "buy" from the perspective of
the buyer.
2.3.1.3 Frame Elements
Frame elements are the participants, props and roles of a frame, including
agents and objects [27]. They also serve as role labels for the syntactic
dependents of a predicating word. Each FE is local to a single frame.
FEs are divided into core, peripheral and extra-thematic, according to how
central they are to a frame. A core FE is conceptually necessary to a frame,
given the situation the frame describes. A peripheral FE typically recurs in
many different frames, marking notions such as Time, Place and Means, and
therefore does not characterize a frame individually. Extra-thematic FEs
differ from peripheral ones in that they introduce an additional state or
event; they do not conceptually belong to the frame in which they appear and
have a somewhat independent status. FEs of this type may even evoke a larger
frame embedding the reported situation [4].
In the "sell" example introduced above, the frame is the commercial
transaction frame and the frame elements are Buyer, Seller, Goods and Money.
This is the most often cited of Fillmore's examples of frame semantics.
Lexical units belonging to this frame include verbs such as buy, sell, spend
and charge, nouns such as price, goods and money, and adjectives such as cheap
and expensive. While all of these lexical units belong to the same semantic
frame (the commercial transaction frame), the choice of a specific lexical
unit reveals the particular perspective from which the commercial transaction
frame is viewed [28].
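The frame structure just described can be sketched as a simple data structure.
This is only an illustration of the concepts: the class and accessor names are
our own choices, not FrameNet's actual schema, although the frame name, frame
elements and lexical units follow the commercial transaction example above.

```java
import java.util.List;

// Illustrative sketch of a semantic frame as a data structure.
public class FrameSketch {

    // A frame groups its core frame elements with the lexical units
    // that evoke it. (Record syntax requires Java 16 or later.)
    record Frame(String name, List<String> coreElements, List<String> lexicalUnits) {}

    static Frame commercialTransaction() {
        return new Frame(
                "Commercial_transaction",
                List.of("Buyer", "Seller", "Goods", "Money"),
                List.of("buy.v", "sell.v", "spend.v", "charge.v",
                        "price.n", "goods.n", "money.n",
                        "cheap.a", "expensive.a"));
    }

    public static void main(String[] args) {
        Frame f = commercialTransaction();
        System.out.println(f.name() + " has " + f.coreElements().size() + " core FEs");
    }
}
```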
2.3.1.4 Lexical Unit
A lexical unit (LU) is a pairing of a word with a meaning. A lexical unit is
thus different from a word, and each lexical unit is associated with a
semantic frame [17]. For example, if the word bake (which has the word forms
bake, bakes, baked and baking) is linked to three different frames,
Apply_heat, Cooking_creation and Absorb_heat, then the occurrences of bake in
these frames constitute three different lexical units (not three word forms).
In lexicographic work, annotation is done with respect to a lexical unit in
the sentence, which is then called the target word.
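The word/lexical-unit distinction can be made concrete with a small sketch.
Assuming the three frames named in the text, pairing the lemma bake with each
of them yields three distinct lexical units; the record and method names here
are illustrative only.

```java
import java.util.List;

// Illustrative sketch: a lexical unit is a (lemma, frame) pair.
public class LexicalUnitSketch {

    record LexicalUnit(String lemma, String frame) {}

    // One word linked to several frames yields one lexical unit per frame.
    static List<LexicalUnit> unitsFor(String lemma, List<String> frames) {
        return frames.stream()
                .map(frame -> new LexicalUnit(lemma, frame))
                .toList();   // Stream.toList() requires Java 16 or later
    }

    public static void main(String[] args) {
        List<LexicalUnit> bake = unitsFor("bake",
                List.of("Apply_heat", "Cooking_creation", "Absorb_heat"));
        System.out.println(bake.size() + " lexical units for the word bake");
    }
}
```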
2.3.1.5 Target Words
Given an example sentence, the word whose semantic and syntactic properties
are of interest is called the target word, or simply the target [17]. A target
word can belong to any of the major lexical categories: noun, verb, adjective,
adverb or preposition. In the annotation process, sentences containing a
predetermined target word are extracted from the texts of a corpus, and the
target word evokes a frame. In order to annotate a collection of example
sentences for a certain target word, the annotators need to understand the
frame linked with that word, which they do by consulting the provided frame
definition.
2.3.1.6 Example Sentence
The main work of FrameNet consists of annotating example sentences extracted
from a corpus for a specific lexical unit. A software tool is used to choose
example sentences for an LU. The sentences are presented to the annotators
grouped into patterns. The purpose of this grouping is to make annotation
easier and to ensure that a few examples of each distinct pattern are
annotated. Since there is a set of annotation layers for each target in an
example sentence, each such set is represented in the annotation file by
linking a sentence and an LU [29].
2.3.1.7 Phrase Type and Grammatical Function
The syntactic metalanguage used in the annotation process is called phrase
type (PT). When annotating words in a sentence, this notion is used to give
syntactic descriptions of constituents in relation to the target word.
Identifying the phrase type is important for distinguishing the frame
elements. Phrase types are assigned manually by the annotators during the
annotation process. What follows is a list of the phrase types used in the
system, complemented by some examples [17].
Noun Phrase (NP): Standard Noun Phrase that can fill core argument
slots.
[My neighbour] is a lot like my father.
[John] said so, too.
[You] want more ice-cream?
Prepositional Phrase (PP): assigned to prepositional phrases with an NP
object.
Scrape it back [into the microwave bowl].
Adjective Phrase (AJP): used for adjectives and phrases headed by
adjectives.
Philip has [bright green] eyes.
The light turned [red].
Adverb Phrase (AVP): used for adverbs.
All items at [greatly] reduced prices!
Verb Phrase (VP): a verb phrase, headed by a main verb possibly accompanied by auxiliaries.
This book [really stinks].
I didn’t expect you to [eat your sandwich so quickly].
When annotating example sentences, each constituent is tagged with a frame
element relative to a target word. The constituents tagged with frame elements
are also assigned a grammatical function with respect to that target word. The
grammatical function (GF) defines the way in which a constituent fulfils a
grammatical requirement of the target word. Examples of the grammatical
functions used in the system are [17]:
External Argument (Ext)
Object (Obj)
Dependent (Dep)
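How FE, PT and GF labels combine for the constituents of one example sentence
can be sketched as follows. The sentence and its labels are our own
illustration built from the commercial transaction example; the class names
are not FrameNet's actual annotation schema.

```java
import java.util.List;

// Illustrative sketch of FrameNet-style annotation layers: each tagged
// constituent carries a frame element (FE), a phrase type (PT) and a
// grammatical function (GF) relative to the target word.
public class AnnotationLayerSketch {

    record Constituent(String span, String fe, String pt, String gf) {}

    // "My neighbour sold the car to John", annotated against target "sold".
    static List<Constituent> annotate() {
        return List.of(
                new Constituent("My neighbour", "Seller", "NP", "Ext"),
                new Constituent("the car",      "Goods",  "NP", "Obj"),
                new Constituent("to John",      "Buyer",  "PP", "Dep"));
    }

    public static void main(String[] args) {
        for (Constituent c : annotate())
            System.out.println("[" + c.span() + "] " + c.fe()
                    + " / " + c.pt() + " / " + c.gf());
    }
}
```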
2.3.2 PropBank
The PropBank (Proposition Bank) project takes a practical approach to
semantic representation. Its aim is to add a layer of semantic annotation,
consisting of predicate-argument information or semantic role labels, to the
Penn Treebank [30, 31]. PropBank was created mainly to serve as training data
for machine learning-based semantic role labelling systems. Accordingly, all
arguments of verbs are required to be syntactic constituents, and different
meanings of a word are distinguished only if the differences bear on the
arguments [32].
PropBank focuses on the argument structure of verbs, which has made it known
as a verb-oriented resource [30]. It provides a complete corpus annotated with
semantic roles, where the roles are seen as arguments and adjuncts [30]. In
other words, the main feature that distinguishes PropBank from FrameNet is
that its annotation is based on verb-specific roles [31]. Since PropBank's
task is to annotate all verbs in a corpus, events or states of affairs
expressed by nouns are not annotated by PropBank. PropBank annotation also
stays close to the syntactic level [33].
The PropBank lexicon defines frame files for all verbs, with each verb owning
a unique frame file. A frame file consists of a specific role set for every
word sense of the verb. Verbs are referred to as predicates in PropBank, and
the grouping of a predicate and its related arguments is called a proposition
[31]. An example of a frame file for the verb give is shown below:
Figure 2-3: PropBank Frame File Example [18]
PropBank makes it possible to empirically determine the frequency of
syntactic variations, the problems they pose for natural language
understanding, and the strategies by which they may be handled [30].
Core arguments are numbered from 0 to 5 (ARG0, ARG1, ARG2, ARG3, ARG4 and
ARG5) and are therefore called numbered arguments. These arguments are
specific to each verb sense. Besides the numbered arguments, a verb can also
take a set of general arguments, called ARGMs (verb modifiers). ARGMs are
comparable to non-core frame elements in FrameNet, since they are not verb
specific.
The numbered-argument scheme was chosen for PropBank as a middle ground among
competing theories, because numbered arguments can be mapped consistently onto
any theory of argument structure [30]. PropBank makes use of Levin's verb
classes in order to label verbs consistently. To understand how this works, it
helps to look again at Fillmore's theory.
Fillmore states that a relation exists between theta roles (deep cases) and
grammatical functions: for example, the subject of a transitive non-passive
verb generally corresponds to the agent role, and the direct object to the
patient role:

[Anna Subject, Arg0, Agent] eats [the chocolate cake Direct object, Arg1,
Patient]
Note that the grammatical function of the patient role can change when the way
verbal arguments are grammatically expressed changes. Such changes are called
diathesis alternations:

Middle alternation: [The chocolate cake Subject, Arg1, Patient] smells perfect.

In the first example the direct object carries the Arg1 role, while in the
second the Arg1 role of smell is expressed by the subject.
In Levin's verb classification [34], verbs that share the same diathesis
alternations also share the same argument structure. In PropBank, care was
taken to ensure that verbs belonging to the same class are given consistent
role labels. The verb wonder can serve as an example of a frame definition in
PropBank; its frame definition is shown in Figure 2-4:
Figure 2-4: Example of frame definition [16]
The verb wonder takes two core arguments, Arg0 and Arg1. Additionally, like
any other verb, it can take any number of ARGMs. Figure 2-5 shows a summary
of the available ARGMs.
Figure 2-5: Available Arguments for an example frame [16]
The example below, taken from the PropBank corpus, shows the annotation of a
complete proposition:
[They ARG1] are [n’t ARGM−NEG] [accepted REL] [everywhere
ARGM−LOC], [however ARGM−DIS].
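The proposition above can be represented as a simple labelled structure. This
map-based form is our own simplification for illustration, not the actual
PropBank file format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of a PropBank-style proposition: the predicate (REL)
// plus its labelled arguments, using the sentence annotated above.
public class PropositionSketch {

    static Map<String, String> acceptedProposition() {
        Map<String, String> prop = new LinkedHashMap<>(); // keeps label order
        prop.put("ARG1",     "They");
        prop.put("ARGM-NEG", "n't");
        prop.put("REL",      "accepted");   // the predicate itself
        prop.put("ARGM-LOC", "everywhere");
        prop.put("ARGM-DIS", "however");
        return prop;
    }

    public static void main(String[] args) {
        acceptedProposition().forEach(
                (label, span) -> System.out.println(label + ": " + span));
    }
}
```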
The PropBank development process consists of two parts, framing and
annotation:

Framing: The first step in the framing process is to examine a sample of
corpus sentences containing the verb. The instances are then grouped into one
or more major senses, each of which later becomes a single frameset [30].

Annotation: The first step in the annotation process is to run a rule-based
argument tagger on the corpus; the tagger's output is then corrected manually.
PropBank corpus annotation is a two-pass process in which each verb is
annotated by two annotators, followed by an adjudication phase to resolve
differences between the two initial passes [31].
2.3.3 Semantic Role Labeling for Biomedical Domain
The ability to accurately identify the meanings of terms is an important step in
automatic text processing. It is necessary for applications such as information
extraction and text mining which are important in the biomedical domain.
Text in the biomedical domain differs significantly from the data in FrameNet
and PropBank. Like text in other domains, biomedical documents contain many
terms with more than one possible meaning, and these ambiguities form a
significant obstacle to the processing of biomedical texts. There are some
approaches to resolving this problem, but no large annotated corpus for the
domain exists.
Advances in biology have led to rapid growth in the amount of biomedical
literature. Automatic information retrieval (IR) and information extraction
(IE) methods are therefore becoming increasingly important to help researchers
keep up with the latest developments in the field. Current IR is still mostly
limited to keyword search, which is insufficient, for example, when the
relationship between two entities in a text must be inferred. Understanding
how the words in a sentence are related is an important factor in improving
both the quality of IE systems and the ability of IR systems to answer more
complex queries.
There are difficulties in adapting semantic role labelling technology to new
domains such as biomedicine. The problems fall into two main categories:
differences in text style and differences in predicates. The CoNLL 2005 shared
task [35] evaluated semantic role labelling systems that had been trained on
the Wall Street Journal on the Brown corpus. Compared with the results on the
Wall Street Journal data, they found that "all systems experienced a severe
drop in performance". The drop was caused mainly by the poorer performance of
sub-components such as part-of-speech taggers and syntactic parsers.
Researchers have found a similar performance drop when applying semantic role
labelling to nominal predicates.
Pradhan et al. [36] reached an F-measure of only 63.9 when evaluating their
models on nominal predicates from FrameNet and some manually annotated
nominalizations from the TreeBank. Jiang and Ng [37] achieved better results
on the NomBank corpus [38], but their F-measure of 72.7 was still more than 10
points below typical performance for verbs. These research efforts suggest
that adapting semantic role labelling to the biomedical domain involves
considerable challenges.
One SRL system that targets biomedical text is BIOSMILE [39]. BIOSMILE was
trained on the BioProp corpus [40], a biomedical proposition bank
semi-automatically annotated in the style of PropBank. However, BioProp, like
other biomedical corpora with predicate-argument structures such as the corpus
of Kogan and colleagues [41], covers only verbs; it annotates 30 biomedical
verbs in 500 abstracts. Our work differs significantly from BIOSMILE in its
corpus construction method: both the data and the algorithms used are
different. In BIOSMILE, semantic roles are only allowed to match full
syntactic units, because BioProp follows the PropBank style, whereas we
consider all data, including multi-word roles, in order to handle nominal
predicates describing Transport events. Because of the many differences
between biomedical text and text in other domains, we explored an alternative
to the syntactic constituent approach used by BIOSMILE. This allows us to
evaluate methods that do not rely on syntactic parses.
2.4 A Novel Method for Corpus Construction
Constructing a large frame-semantic corpus for domain-specific systems is
difficult. To ease the task, ontologies are used as a semantic representation
of domain knowledge [1].
He Tan [1] introduces a method for building a corpus labelled with semantic
roles for the domain of biomedicine. The method is based on the theory of
frame semantics and relies on domain knowledge provided by ontologies. Using
the method, a corpus for transport events was built, strictly following the
domain knowledge provided by the GO biological process ontology.
An ontology is a shared and common understanding of some domain, which can be
defined as a conceptualization supporting a specification; that is, an
ontology defines entities and the relationships among them. An ontology can
therefore be used to describe all possible events in a domain, which can then
be translated into frames.
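The idea of translating an ontology term into a frame can be sketched
minimally as below. The GO process name "transport" is taken from the text,
but the frame element names and the mapping function are purely illustrative
placeholders, not the frames actually defined by the method.

```java
import java.util.List;

// Illustrative sketch: deriving a frame definition from an ontology term.
public class OntologyToFrameSketch {

    record Frame(String name, List<String> coreElements) {}

    // Map an ontology term to a frame. The frame element names here are
    // placeholders, not the corpus's actual frame elements.
    static Frame frameFromTerm(String goTermName) {
        if (goTermName.equals("transport")) {
            return new Frame("Transport",
                    List.of("Theme", "Origin", "Destination", "Path"));
        }
        throw new IllegalArgumentException("no frame defined for " + goTermName);
    }

    public static void main(String[] args) {
        Frame f = frameFromTerm("transport");
        System.out.println(f.name() + " with FEs " + f.coreElements());
    }
}
```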
The successful corpus construction demonstrates that ontologies, as a formal
representation of domain knowledge, can guide and ease the tasks involved in
building this kind of corpus [1]. Furthermore, ontological domain knowledge
gives the corpus well-defined semantics, which is very valuable in text mining
applications.
In this thesis, we aim to develop a supporting environment with components for
parsing and visualizing the lexical properties of ontological terms, for
defining frame-semantic descriptions, and for the annotation task of this
corpus construction method.
2.5 Related Work
In this section, we discuss some differences and similarities between other
projects and ours. FrameNet and PropBank were explained in Section 2.3.
Understanding the features of these projects helped us identify the challenges
of our work, and led us to the decision to support the new method by improving
on the existing system used in the FrameNet project. This is explained below.
We have studied related work and reused some common tools from it. PropBank,
FrameNet and our project are similar in their goal, which is to provide a
semantic annotation layer for corpora. While the goal is the same, the way of
achieving it differs according to the problems involved. As discussed for SRL
systems, one problem is the lack of large corpora in the biomedical domain,
since biomedical text differs significantly from the data in FrameNet and
PropBank. For example, many words in the biomedical domain never appear in
general English, and biomedical documents contain a range of general English
terms with very specific meanings in the domain. These problems form a
significant obstacle to processing biomedical texts with FrameNet and
PropBank, which were developed for general English. There are some approaches
to resolving this problem, but no large corpus for the domain exists. The
challenge is addressed by considering all possible biomedical events. Our
supporting system does so with the help of frame semantics, which, using the
described method, makes all possible events available in the corpus. This is
achieved by defining new frames and their associations using domain knowledge
provided by ontologies. The use of frame semantics is a similarity between our
project and FrameNet. Although our semantic senses differ from FrameNet's, the
work presented here shows that FrameNet-style annotation can be integrated
with information from ontologies to discover and define frames. A further
point is that the method used in this supporting system can tokenise and
annotate all arguments as well. The advantage of our work is that it provides
a supporting environment for an ontology-based method, which means it can be
extended with ontologies.
Research Methods
19
3 Research Methods
In order to create new knowledge, or to deepen the knowledge about a subject,
choosing an appropriate research method is necessary. There are two main types
of research design: qualitative research and quantitative research.
Researchers choose one of these depending on the research problem to be
investigated or the research question to be answered. Different research
methods are discussed in different sources: for example, Williamson describes
methods such as action research and experimental research [42], and Ghauri and
Gronhaug describe methods such as exploratory research, descriptive research,
causal research and case studies [43]. The steps taken in conducting this
master thesis as a research project are explained in this chapter.
In this master thesis we have mainly used the design science methodology. In
addition, we used a literature review for the theoretical contribution, and an
implementation method to build a system related to the research subject.
Design science is a research methodology mainly used in the field of
information systems (IS). This kind of research focuses on the development and
performance of artifacts, with the clear intention of improving artifact
performance [44].
The design science paradigm was used in much of the early IS research, which
focused on systems development approaches and methods. Examples include the
socio-technical approach, defined by Bostrom and Heinen [45] and Mumford [46],
and the info-logical approach, defined by Langefors [47], Sundgren [48] and
Lundeberg et al. [49] [50]. Fundamentally, the design science paradigm is a
problem-solving paradigm [51] whose foundations lie in engineering and the
sciences of the artificial [52, 53].
The design process consists of several phases, which are defined by various
design science research (DSR) frameworks. The division into phases is made by
specifying a set of milestones in the design process. These DSR frameworks
usually adopt an iterative approach spanning the phases of the design process;
examples can be found in [54, 55, 56].
Different researchers disagree on the exact set of phases in a design science
process. A common understanding of the critical phases of such a process
therefore helps. Knowing these critical phases makes it possible to choose the
essential research activities for each phase. These activities include
inductive and deductive steps, which are essential for building design
principles for a practical problem.
Deductive steps move towards more concrete design decisions; they include
activities leading to an instantiated artifact, as well as methods leading to
a comprehensive evaluation concept aimed at permitting generalizations.
Inductive steps concern the underlying design principles and theories [50].
As noted, the roots of the design science paradigm lie in the sciences of the
artificial, which has directed researchers' attention towards the design of
artificial artifacts (i.e., IT artifacts) and the creation of things that do
not yet exist. Using the design science paradigm in a research project is
worthwhile because design is both a process (a set of activities) of creating
something new and a product (the artifact that results from this process)
[57, 58].
The characteristics of design science research can be described as follows:
The primary focus of a design science research project is on the design
research part (i.e. the creation of an IT artifact), as opposed to the
design science part (i.e. the generation of new knowledge).
The design science research process involves the search for a relevant
problem, the design and construction of an IT artifact, and its ex ante
and ex post evaluation.
Finding real-world problems and solving them practically is one of the
most important goals of design science research.
Design science research is a general research approach with a set of
defining characteristics and can be used in combination with different
research methods.
Design science research is conducted most frequently within a
positivistic epistemological perspective.
The outcome of design science research (i.e., the problem solution) is
mostly an individual or local solution, and the results cannot be readily
generalized to other settings [58].
We have used this methodology in this master thesis, following several steps
for designing and later implementing the defined method. The steps cover the
research context: recognizing and stating the problem, suggesting a solution,
and implementing the suggested solution. The last steps concern testing and
evaluating the output.
The steps of the design science research (DSR) method are as follows [59, 60]:
1. Awareness of the problem
2. Suggestion
3. Development
4. Evaluation
5. Conclusion
We have followed these steps in applying the methodology in our master thesis.
Below we describe how we translated the five steps into our work process;
Sections 3.1 to 3.5 explain how our work has been fitted to the steps of the
design science methodology.
3.1 Awareness of the Problem: Literature Review
Awareness of the problem is the first step in starting a research project.
Several different sources of information can lead to it, such as new
developments in industry or in a reference discipline [59, 60]. Recognizing
the problem leads to a proposal, formal or informal, for new research
[59, 60]. In our case, the awareness of the problem was obtained by reading
the literature related to our chosen subject, and we stated the research
questions accordingly.
A research project starts from a problem or research question introduced at
the beginning and answered during the research. The result of a literature
review can help formulate the problem and motivate the research work. Drawing
on a relevant theory is helpful when applying parts of it to the proposed
work; this requires reviewing past literature, which raises the question of
how the literature should be reviewed [43]. The most important point here is
to use relevant literature. A literature review can be defined as "the
selection of available documents (both published and unpublished) on the topic,
which contain information, ideas, data and evidence written from a particular
standpoint to fulfil certain aims or express certain views on the nature of the
topic and how it is to be investigated, and the effective evaluation of these
documents in relation to the research being proposed." [61].
With our research questions and the implementation of the proposed method in
mind, we reviewed relevant books and articles, carefully selected from recent
and well-cited sources and authors.
Since the role of a literature review is to develop a theoretical framework
and conceptual models, combining relevant elements from earlier studies is
helpful [43]. We have accordingly motivated our work by attending to the
research done in this field.
The literature review helps position the problem defined at the beginning of
the research, and helps in understanding the concepts of similar projects as a
guideline for implementing our new system. We reviewed many books and
articles, which are listed in the references section; these sources were
obtained from broadly cited authors in the fields of semantic roles and
biomedicine.
According to Ghauri and Gronhaug [43], the other roles of a literature review
can be listed as follows:
Structuring the research problem
Recognising relevant concepts, methods and facts
Bringing existing knowledge into the new research
Identifying these advantages encouraged us to use a literature review as a
research method. How we applied it in our study is described below.
To benefit from relevant research, we searched for materials that fit our
research area and used the most cited among them. Google Scholar and the
school's library website were a great help here. Among the many sources found,
we used only those with the most relevant titles and those that were most
accessible. Studying research on "FrameNet", "PropBank", "SRL", "software
development methods" and related topics gave us an understanding of the
concepts, which we have reflected in the theoretical background chapter.
Finally, by reviewing the structure of different systems in the biomedical
field, we developed the framework of our system, which is illustrated in
Figure 4-1 and described in Chapter 4.
3.2 Suggestion
The next step in a research project is the suggestion phase. This phase
follows the recognition of the problem in the research field and builds on the
proposal that results from problem recognition [59, 60]. Suggestions are
approaches, including methods and methodologies, that help the proposal solve
the stated problem. For example, problems of software system complexity can be
addressed by software development approaches focusing on operation support
systems, automation of the maintenance function, and development of a
high-level programming environment [59, 60]. In design science research, a
tentative design is an essential part of the proposal: "Tentative design is an
essentially creative step wherein new functionality is envisioned based on a
novel configuration of either existing or new and existing elements." [59, 60]
As noted above, suggestions consist of approaches, which may be methods and
methodologies. In our research field, we stated the problems as research
questions and, in order to answer them as an output of our research, we
clearly felt the lack of a suitable method.
With the definition of a suggestion in mind, the task of choosing an
appropriate method led us to the method proposed by He Tan [1], which is based
on the theory of frame semantics and has been used to build a corpus labelled
with semantic roles for the domain of biomedicine. We have supported this new
method by developing a supporting environment: the suggestion was to deliver a
new system that supports semi-automatic labelling of the corpus built with
this method.
Having defined the goal as "delivering a system with specified functions", we
decided to use Java as our programming language and NetBeans as our
programming environment. The development of the system and the methodology we
chose are described in the following sections, mainly 3.3 and 4.2. In this
step we also recognized the need for requirements in the system's development;
for this we defined an activity of writing user stories, where each story
clarifies the next steps in the process. These user requirements are the
objectives of our main solution, namely delivering a supporting environment
for the frame semantics method. Developing the suggestion is the next phase,
explained below.
3.3 Development: XP Methodology
This phase focuses on the development and implementation of the tentative
design described in the suggestion phase. Moving from a tentative design to a
complete design requires creative effort. Development and implementation
approaches differ depending on the artifact to be built; sometimes an
algorithm is needed as the development technique [59, 60].
In our thesis work, we developed an environment supporting the new corpus
construction method, which uses frame semantics to express the meaning of
natural language. The development environment is the NetBeans IDE (Integrated
Development Environment) and the programming language we chose is Java.
The system development process of this master thesis project was inspired by
the Extreme Programming (XP) methodology. In this section we review XP
concepts and state why XP fits our field of work. Extreme Programming is
based on the values of simplicity, communication, feedback and courage.
Following XP's values should lead to more responsiveness to customer needs
than traditional methods, and to software of better quality [62].
XP describes four basic activities that are performed within the software
development process: Coding, Testing, Listening and Designing [62]. Each of
these activities is described below.
Coding: The advocates of XP argue that the only truly important product of
software development is code that a computer can interpret. All code was
produced by the two of us working together at one workstation (pair
programming). Each of us is responsible for all the code and is allowed to
change any part of it. We followed Java coding standards to make the code
easier for other students to read and understand.
Testing: Acceptance tests were held during our regular meetings to verify
that the requirements were correctly understood and that the system satisfies
the user's actual requirements. We always worked on the latest version of the
software and uploaded our latest changes often. Test-driven development was
chosen to ensure that all code is properly tested before integration.
Listening: We as programmers must listen to what the customers expect the
system to do and what "business logic" is needed. To communicate with the
user we followed the Planning Game. Planning is divided into two parts,
release planning and iteration planning. In release planning, the user and
the developers decide which requirements shall be included in coming
releases. Iteration planning assigns tasks to the developers; in our case it
was done by us together with our supervisor.
Designing: If all the previous activities are performed well, the result
should always be a system that works. In practice, however, designing cannot
be avoided. We chose simple design to make the code easier for others to
understand.
In our case, the requirements were defined in the beginning, but it was later
decided not to implement the system based on ontologies. Figure 4-1 shows the
general architecture of our system based on the requirements. "Knowledge
provided by ontologies" appears in the architecture, but we have not
implemented it or integrated the system with ontologies. As a result, the
requirements will change again once domain ontologies are used. Given what we
have said about XP, this methodology can better support extending the system
when the requirements change. By choosing XP for this project, ontology
support can later be added to the current system, thanks to the lower cost of
change and the possibility of improving the Java code.
Using Extreme Programming, we started by collecting user stories (described
in section 4.2.1) and building simple solutions in the first three weeks. We
then held a release planning meeting with our supervisor (fulfilling both the
user and client roles) and us (the developers) to create a schedule, leading
to weekly meetings. After that we started iterative development based on an
iteration plan that everyone agreed on.
We looked for a software development methodology after the project ran into
trouble: our requirements specification proved of little use, and changing
requirements forced us to recreate the schedule. This led us to XP. We solved
the problems first by replacing the requirements specification with user
stories, and then by adopting an iterative development process. We used unit
tests for integration bugs and acceptance tests for production bugs. Since
both of us owned the same core classes, we could have become bottlenecks for
each other; by applying pair programming we could change the core classes
whenever there was a need. Continuing this way, we added further practices:
we talked openly about problems and solutions to encourage each other and
improve communication. Finally our problems were solved and the project came
completely under control. The features of the XP methodology used in our
thesis work [63] are described below:
Spike Solutions: Simple and Focused Answers
When we faced programming or design problems, we tried to build spike
solutions to explore answers. A spike solution is a simple program built to
figure out potential solutions. Most of the time spikes are only good enough
to address the current problem, without consideration of other issues, so we
expected to throw them away. The main idea of creating spikes was to decrease
the risk of a programming problem and to increase the reliability of the user
story estimates [63]. Spikes were helpful when a technical problem threatened
to halt the development process: we reduced the risk by examining the problem
together as a pair, ignoring all other concerns.
User stories: user requirement specification
User stories are used to capture what the customers expect from the system,
instead of a large requirements specification [64]. They serve as the basis
for the time estimates in the release planning meeting, as well as for the
creation of the acceptance tests. Compared to traditional requirements
specifications, they provide only enough detail for the developers to
estimate how long a story will take to implement. Another important point is
that they contain no details of specific technologies or algorithms, since
the focus of stories is on user needs. We met our supervisor, as the
customer, regularly to receive a face-to-face description of the requirements
when it was time to implement a story. Each story took at least one week to
implement, depending on the level of the tasks. User stories are described in
detail in section 4.2.1.
Release planning: on time planning
The basic idea of release planning is that every project is quantified by
four main variables: scope, resources, time, and quality [64]. Scope is how
much work is to be done; resources are how many people are available; time is
when the project or release will be done; and quality is how good and how
well tested the software will be [64]. A release plan gives better control
over these variables, but no one can control all of them. The reason is
simple: when you change one, you cause another to change.
Because technical decisions must be made before implementation starts, a
release planning meeting is used to lay out the overall project and build a
release plan [64]. We set some rules to define a method for scheduling
priorities. Every user story had to be estimated by us in terms of how long
its programming would take. We agreed that the first release would consist of
four weeks of programming with nothing else to do. The customer (our
supervisor) then decided which user stories were more important and which
less. For example, adding an automatic POS tagger had lower priority in the
release plan than manual tagging in the annotation task. Together with our
supervisor we decided on the set of stories to be implemented in the first
and subsequent releases. A plan can be based on either time or scope; we
planned by time, estimating that 20 stories could be implemented before
1 October. We multiplied the iteration length by the number of iterations to
determine how many user stories could be completed within the five months of
the thesis work. These estimates fed into the iteration planning meetings.
Iteration planning: adding agility to the development process
An iteration planning meeting is called at the beginning of each iteration to
produce that iteration's plan of programming tasks [63]. Each iteration was
one to two weeks long. User stories are chosen for the iteration by our
supervisor from the release plan, in order of value, the most valuable to the
customer first. Failed acceptance tests to be fixed are also selected. The
customer selects user stories with estimates that total up to the project
velocity from the last iteration.
The user stories and failed tests are then broken down into the programming
tasks that will support them. Tasks are written down on index cards, like
user stories; but while user stories are in the customer's language, tasks
are in the developer's language. Duplicate tasks can be removed. These task
cards form the detailed plan for the iteration.
We signed up for tasks and then estimated how long our own tasks would take
to complete. It is important that whoever accepts a task is also the one who
estimates how long it will take to finish.
Acceptance tests: regular system testing
Acceptance tests are created from user stories [63]. During an iteration, the
user stories selected at the iteration planning meeting are translated into
acceptance tests. The customer specifies scenarios to test in order to show
that a user story has been correctly implemented. A story can have one or
many acceptance tests, whatever it takes to ensure the functionality works
[65].
Acceptance tests are black box system tests. Each acceptance test represents
some expected result from the system. Customers are responsible for verifying
the correctness of the acceptance tests and reviewing test scores to decide
which failed tests are of highest priority. Acceptance tests are also used as
regression tests prior to a production release.
3.4 Evaluation: Data Collection and Survey
Evaluation is an activity in software engineering to determine the quality of
the proposed software [66]. After developing the proposal, the output should
be evaluated. The evaluation phase focuses on evaluating the artifact against
criteria that are always implicit, and frequently made explicit, in the
proposal or in the awareness-of-the-problem phase [59, 60]. Additional
information gained from development, together with results from running the
artifact, is collected for another round of suggestion [59, 60]. The focus of
the evaluation phase is judging results according to the performance and
measurement of the algorithm or design technique used in the development
phase.
Evaluation is composed of two phases: pre-study and evaluation of the study.
The pre-study phase is mainly about data collection, gathering data from
interest groups. Interest groups are the people who provide the data for the
evaluation phase; depending on the goals of the evaluation, interest groups
may differ [66]. The evaluation model is shown in figure 3-1:
Figure 3-1: Evaluation Model (the interest group feeds the pre-study phase,
covering problem domains and data collection via survey, which feeds the
evaluation phase, producing an action proposal and an action plan)
The model describes the steps we took during evaluation in order to obtain
results. The pre-study phase defines the problems with the help of the
interest group, which played a role in collecting data. Reviewing the results
in the evaluation phase produces an action plan composed of different
activities; these activities can improve the results effectively.
In this project, the interest group that collaborated with us consists of
experts in the field of computer science (especially in the area of semantic
roles) who also have knowledge of biomedicine. Data was collected through the
surveys shown in figures 5-11 and 5-12.
In the majority of thesis projects, researchers need to decide what kind of
data collection method should be used to gather primary data in order to
answer the research questions. There are several options for collecting
primary data: observation, experiment, interview or survey [43]. The choice
depends on the research problem and the research design. For some research
problems, a researcher has to gather specific information from individual
respondents to carry out an analytic investigation, questioning domain
experts and recording their responses for analysis. In these cases the survey
approach can be the best choice of data collection method, which convinced us
to use surveys for our system's evaluation.
A survey is a method of collecting data that uses questionnaires or
interviews to record responses [43]. The great strength of the survey as a
primary data collection approach is that abstract information can be gathered
by questioning people. Once the researcher has determined that a survey is
the appropriate method, the research problem determines which type of survey
should be undertaken. The main types of surveys and questionnaires are
descriptive and analytical [43]. With descriptive surveys we can identify a
phenomenon we wish to describe, while analytic surveys are concerned with
testing a theory. Both types are often used to identify the population, which
provides all the responses needed to answer the research questions. The next
important issue is to construct a questionnaire: researchers should know what
information they need and who the respondents should be.
In our thesis project, after the research problem had been specified and an
appropriate research design developed, the next step was to select a data
collection instrument. Section 5.2.2 explains the survey method, and the
results obtained through this evaluation, in more detail.
3.5 Conclusion Phase
The conclusion phase is the last step of a design science research project.
The results address the usage of the new method of corpus construction. The
main contribution of the conclusion is to present results that are clearly
tied to the purpose or objective of the proposal. After the evaluation phase
with the domain experts and knowledge mentors, we conclude that the results
are authentic and that they truly map to the purpose of this thesis.
The analysis of the results taken from the surveys and the available data
gives an overall understanding of the system's usage. From the degree to
which the system responds to the user requirements, we can conclude how
accurate the supporting system is.
4 Framework and System
We present a general framework for semantic role labeling. The framework
combines a semi-automatic way of annotation with a frame specification
option, motivated by an effective approach to corpus construction. Within
this framework, we study the labeling of sentences in biomedical
applications.
4.1 Framework
The overview of the system is illustrated in figure 4-1, from the initial frame
description to the construction of the annotated file. In the sections that follow,
we will describe each step in detail.
Figure 4-1: General framework of the system (frame specification, supported
by knowledge provided by ontology; selection of examples; tokenization for
annotation; annotation, done manually; save formatting)
Frame Specification
We first create a section where the user can define and edit a frame,
including a list of frame elements developed from the domain ontology. Based
on the list of frame elements, a set of tags is prepared for use in the
annotation process. A detailed description of this phase is shown in figure
4-2.
Selection of Examples
This is the first step of the annotation process. The user can select an
example sentence from the existing options or input a new sentence.
Tokenization
Here the selected sentence is divided into tokens, which are shown in the
annotation table. The user can change any token if he does not agree with the
tokenization.
Annotation
Annotators select the frame element, phrase type and grammatical function for
each token in the sentence. A more detailed view of this process is given in
figure 4-3.
Save Formatting
The result of the annotation and the initial sentence are saved in a text
file.
Figure 4-2: "Framing" process overview (define a new frame or edit an
existing one; define frame elements, a frame definition and example
sentences; assign frame elements and target words to the sentences;
optionally continue to annotation)
Framework and System
33
Figure 4-3: "Annotating" process overview (choose a frame; select an example
sentence or input a new one; tokenize; display the tokens in table rows;
annotate from the list options; save the output as a text file)
Since the system consists of two portions, the lexicon of frames files and the
annotated example sentences (corpus), the process is similarly divided into
framing and annotation.
4.1.1 Framing
The process of creating the frame files, that is, the collection of framesets
for each target word, begins with the examination of a sample of the
sentences from the corpus containing the word under consideration. These
instances are grouped into a single frameset. To show all the possible
syntactic realizations of the frameset, many sentences from the corpus are
included in the frames file; these are called example sentences.
In some cases a particular usage will not be attested within the related
frame corpus; in these cases a constructed sentence is used, usually provided
by the user who wants to annotate it. During the framing process the user
must take care that related sentences receive the same framing, with the same
number of roles and the same descriptors on those roles.
4.1.2 Annotation
We begin the annotation process by running a POS tagger on the corpus. The
tagger incorporates an extensive lexicon, currently encoded in Java. Although
the tagger achieved high accuracy on the data, its output is corrected
manually to define better mappings between grammatical and semantic
structure.
Annotators are presented with an interface which gives them access to both
the frameset descriptions and the full tokenization of any sentence, and
allows them to select tokens in the annotation table for labelling as
arguments of the selected predicate. For any sentence they can examine both
the description of the associated frame and the example tagged sentences, as
presented in the system. The tagging is done on a token-by-token basis,
rather than as all-words annotation of running text.
For new sentences, annotators had to determine which frameset was appropriate
for a given usage in order to assign the correct argument structure. These
sentences are arranged in a classic file distribution, and their annotations
were stored in stand-off notation, referring to frames within the system
without replicating any of the lexical material or structure of the corpus.
Both the role labelling decisions and the choice of frameset were adjudicated
by an annotator.
The annotators themselves were drawn from a variety of backgrounds, from
undergraduates to holders of doctorates, including linguists, computer
scientists, and others. Undergraduates have the advantage of being
inexpensive but tend to work for only a few months each, so they require
frequent training. Linguists make the best overall judgments, although
several of our non-linguist annotators also had excellent skills. The
learning curve for the annotation task depended heavily on the annotator's
background, but becoming comfortable with the process took no more than an
hour of work.
4.2 Method of System Implementation
4.2.1 User Requirements: User Stories
A user story is a form of software user requirement that has become quite
popular in agile methodologies such as Extreme Programming and Scrum. Unlike
more traditional methods such as a system requirements specification or use
case diagrams, the emphasis in these methodologies is on simplicity and
changeability. We have therefore designed our user stories to be easily
described and understood, and more importantly easily changed by the end user
during the project [67].
User stories are short, simple descriptions of a feature told from the
perspective of the person who desires the new capability, usually a user or
customer of the system. They typically follow a simple template [67]:
As a <type of user>, I want <some goal> so that <some reason>.
In our system the four main user stories are listed below:
As a user, I want to add a new frame, including its name and definition.
As a user, I want to modify a predefined frame definition but not the
associated frame elements.
As a user, I want to add a new example sentence and annotate it.
As a user, I want to edit an existing example sentence and annotate it.
It is quite difficult to obtain a large number of user stories at once; they
emerge over time. We decided to collect the user stories, analyse them,
design and document, then implement and test. Proceeding in this fashion we
had iteration reports, which helped us keep complete and updated information
about the system and its features as the project progressed.
4.2.1.1 User Story 1: Scenario
The objective is adding a new frame to the corpus. The scenario starts in the
"Frame Definition Section", moving to the "Frame Name and Frame Elements" tab
in the application. The user can name the frame and save the changes in a
file. The story continues in the "Frame Definition" tab, where the user can
set a definition for the desired frame selected in the dropdown list.
4.2.1.2 User Story 2: Scenario
The objective is modifying a predefined frame definition without changing any
related frame elements. The scenario starts in the "Frame Editing Section" by
loading the desired XML file or dataset as the corpus. The story continues by
choosing the desired frame from the table on the left side. Clicking the
"Edit" button loads all the associations and relations into the table on the
right side. The user can change the content of the fields he wants; in this
case, the target field is "Frame Definition". The modification is applied by
clicking the "Save" button.
4.2.1.3 User Story 3: Scenario
The objective is adding a new example sentence to a frame. It is assumed that
the user wants to use a predefined corpus and add an example sentence to a
predefined frame. The scenario starts in the "Frame Definition Section" by
loading the desired XML file or dataset as the corpus. In this panel the user
can select the frame to which the new example sentence will be added. Writing
the example sentence and clicking the "Save" button completes the story.
For annotating the example sentence, the scenario continues in the
"Annotation Section" by loading the desired XML file or dataset as the
corpus. In this panel the user can select the desired example sentence for
the annotation task by choosing the proper frame. Clicking the "Annotate"
button tokenizes the sentence and loads the tokens into the table. Selecting
a row of the table gives the opportunity to edit or delete that token.
Editing is done by clicking the "Edit" button, choosing the proper roles and
finally clicking the "Enter" button. This applies to all tokens, and by
clicking the "Save" button the user saves the result of the annotation task
in a file. The user's annotation result can be compared to the POS tagger's
result by clicking the "Tagger" button.
4.2.1.4 User Story 4: Scenario
The objective is editing an existing example sentence of a frame. The
scenario starts in the "Frame Editing Section" by loading the desired corpus.
All the frames available in the corpus are loaded into the frames table. The
user can select a frame and load its related example sentences by clicking
the "Search" button. The user story is finished by modifying the example
sentence and clicking the "Update" button.
For annotating the example sentence, the scenario continues as described in
user story 3.
4.2.2 Software Development Environment
4.2.2.1 Programming Language for Development
First of all, an object-oriented programming language had to be chosen for
the realization of the described system. The selection of programming tools
is one of the most serious steps in the development cycle and depends on a
few points, mentioned below [68]:
Speed of programming development
Convenience of user interface
Possibility to change/improve the code easily
Performance of software
Possibility to create useful documentation automatically
Based on the advantages listed below, the Java programming language was
chosen for the practical implementation [69]:
Portability: Java runs on most hardware and software platforms.
Theoretically, by using this language, we make the system compatible with
other platforms.
Reliability: The Java runtime performs multiple checks of the byte-code to
avoid inconsistency and to verify the correctness of code. Compared to C and
C++, some features (like pointers and automatic type conversion) were removed
from the Java language.
Support of multithreading: This feature can be defined as "the ability of a
program or an operating system process to manage its use by more than one
user at a time, and to manage multiple requests by the same user without the
need for multiple copies of the program running in the computer". In our case
multithreading is an advantage for GUI creation.
Robustness: It includes early checking of possible errors; the Java compiler
is able to detect many problems. Java provides a runtime exception handling
feature, which means that it can catch and respond to an exceptional
situation so that the program can continue its normal execution, or terminate
gracefully when a runtime error occurs.
As a final point, in the technical estimation made before implementation, the
advantages that the Java tools provide outweighed those of the other tools we
considered.
4.2.2.2 Integrated Development Environment
Among the different integrated development environments, we found that the
NetBeans platform fits our needs. It is a cross-platform open source IDE for
Java that comes with a syntax-highlighting code editor supporting code
completion, annotations, macros, auto-indentation, etc. It includes visual
design tools (wizards) for code generation, and it integrates with numerous
compilers, debuggers, Java virtual machines and other tools.
Before the final acceptance of the chosen IDE, a small comparison of the
candidates' capabilities and properties was made. Since NetBeans supports a
more modular structure than Eclipse and IntelliJ IDEA, we decided to pick it
for the development process.
4.2.3 Interface Design
The developed software contains a graphical user interface module, which
allows end users to define a new frame or to annotate. To create the GUI in
Java we make use of the Swing package, javax.swing. This package contains the
classes that create GUI components for us.
Our first objective is creating a frame window. Once we have the frame, we
can add other components to it. Next we look at what we want to achieve: the
window has a title and a specific size, it is visible, and it can be closed.
Knowing this, we check which methods the JFrame class provides.
The container is the area where we can put components such as buttons. To get
access to this container we use the method getContentPane() from the JFrame
class. To configure the layout of the container area we use the setLayout()
method. When we use a GridLayout, components are added from left to right and
top to bottom.
We added labels using pane.add(label). We use a JTextField so that the user
can enter data. To give the user the ability to start an action, we give the
user a button; for this component we use the JButton class, passing the text
that will appear on the button as a parameter.
4.2.4 Java Class Design
This section gives a short description of the package and the classes
developed to realize the functionality. As mentioned before, two major
classes, named Framing and Annotation, are defined in the package.
The Annotation class includes the table for annotating an example sentence,
which is done manually by the user. We wrote an algorithm to tokenize the
sentence the user wants to annotate. It allows the user to change the tokens
if he does not agree with the result written into the table rows. The output
of the process is saved in the text file together with the related frame.
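A minimal tokenizer of this kind could look like the following sketch. It is
a simplification for illustration, not the exact algorithm used in the
system: runs of word characters become one token and each punctuation mark
becomes its own token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceTokenizer {

    // A token is either a run of word characters or a single
    // non-space punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Splits a sentence into tokens so that each token can fill
    // one row of the annotation table.
    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [The, enzyme, binds, DNA, .]
        System.out.println(tokenize("The enzyme binds DNA."));
    }
}
```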
Creating a new frame and all its related frame elements is made possible in
the Framing class. Through it, you can add, remove, or edit the frame
definition and frame elements. Example sentences are also defined and related
to the frame definition in this class. The output of the class, with all the
relations, is saved in an XML file for easy access and reading. The Framing
class mainly uses the TransformerFactory class for the "save" method and the
DocumentBuilderFactory class for creating the XML file. The data of this XML
file is filled with whatever the user defines in the Frame Definition Section
of the system. The file is also updated when the user updates fields in the
Frame Editing Section or Frame Deleting Section. This process follows a fixed
procedure for adding the desired tag names and data to the XML file. For
tagging, in other words labelling, the nodes in the XML file, the solution is
to treat the input of each textbox as a text node; the related elements are
then created and appended to the node. For adding data to the nodes, and
basically to each tag, we used the method getElementsByTagName.
Finally, the design of the classes represents a compromise between the need
for the user to annotate sentences, find the related frame, and define frames
and elements on the one hand, and an elegant object-oriented design on the
other.
The system can be started by clicking on the jar file, after which the GUI of
the system appears. Details about the GUI and the functionality of the system
are given in section 5.2.1.
4.2.5 The Corpus Database
The corpus should be a structured collection of data, stored as a database so
that it can be retrieved when needed. This database can be loaded through the
system we developed, or a new corpus can be saved with the same database
structure. In this work, the data (information on frames, their related
associations, and the annotated sentences) is organised in two kinds of
files, described below:
4.2.5.1 XML File
As mentioned, the data on frames and their related associations is stored in
an XML file. XML, the Extensible Markup Language, describes a class of data
objects called XML documents. XML documents have two structures: a physical
structure, in which the document is composed of entities, and a logical
structure, in which the document is composed of declarations, comments,
elements, character references and processing instructions. What matters is
composing XML documents that are well-formed with respect to both structures
[70]. XML data can be further described by different XML schema languages
such as XML DTD, XML Schema, XDR, SOX, DSD, etc. [71]. A comparison between
these schema languages is shown in figure 4-4 below:
Figure 4-4: Comparison between different XML schema languages (Schematron,
DSD, DTD, XDR, XML Schema, SOX) along the dimensions usage-oriented versus
definition-oriented, pattern-based versus grammar-based, and
constraint-oriented versus structure-oriented
Framework and System
40
We chose XML DTD for these three reasons:
1. It has survived for a long time and, being supported by considerable
organisations, it has a high chance of continued use in the future.
2. Many known applications use this schema language.
3. It is easy to learn, despite its proprietary syntax.
Figure 4-5 shows a piece of the XML file used as the corpus database.
Figure 4-5: XML file containing data through frame semantics
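The exact contents of figure 4-5 are not reproduced here, but as an illustration, the following sketch shows how such a corpus file could be read with Java's standard DOM API. The XML fragment and the tag layout are hypothetical stand-ins; only the "FrameDefinitionSections" root is taken from the description of the actual file.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FrameCorpusLoader {
    /** Extracts every frame name from a corpus XML string. */
    public static List<String> frameNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList names = doc.getElementsByTagName("name");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < names.getLength(); i++) {
                result.add(names.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A minimal stand-in for the corpus database; the real tag names
        // and nesting may differ from this sketch.
        String xml = "<FrameDefinitionSections>"
                   + "<Frame><name>Transport</name>"
                   + "<definition>Movement of a substance.</definition></Frame>"
                   + "</FrameDefinitionSections>";
        System.out.println(frameNames(xml)); // prints [Transport]
    }
}
```

Reading the file through a standard parser also gives well-formedness checking for free, which matches the requirement quoted above from [70].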
4.2.5.2 Text File
A text file serves as the database for annotated sentences. The user chooses
where to save the file, and the data is stored in a structure that starts with
the original example sentence, followed by the annotated tokens. Figure 4-6
shows a piece of this file.
Figure 4-6: Output file containing annotated tokens
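As a sketch of how such an output file can be produced, the following minimal Java example formats one annotated sentence: the sentence itself first, then one token/label line per token. The tab separator and the example labels are assumptions for illustration, not necessarily the system's actual format.

```java
import java.util.StringJoiner;

public class AnnotationWriter {
    /** Formats one annotated sentence: the original sentence first,
     *  then one "token<TAB>label" line per token. */
    public static String format(String sentence, String[] tokens, String[] labels) {
        StringJoiner out = new StringJoiner(System.lineSeparator());
        out.add(sentence);
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i] + "\t" + labels[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical frame-element labels, only for demonstration.
        System.out.println(format("RavA transports protons.",
                new String[] {"RavA", "transports", "protons"},
                new String[] {"Transporter", "TARGET", "Entity"}));
    }
}
```

The resulting string can then be written to the text-file database with any standard I/O class, e.g. `java.nio.file.Files.writeString`.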
4.2.6 System Requirements
The prerequisites for using this system are few. Anyone who finds the system
useful for his or her work and domain of interest can use it; in particular,
anyone working in biomedical text mining can use this supporting system to
recognise and annotate biomedical terms. There are no installation
constraints: the application is packaged as a jar file that runs on any
computer with a Java runtime environment. The only notable requirement is that
the corpus is stored in XML format, which means an XML document must exist in
order to load a corpus into the system. The work is delivered as a package
containing the jar file and an XML file serving as the corpus database for
testing the system. We have tested the system in a modest environment with at
least the following characteristics: any operating system, a 500 MHz or faster
processor, 100 MB of available disk space and 256 MB of memory. Any computer
meeting these minimum requirements is expected to run the system properly.
Results and Discussion
43
5 Results and Discussion
This section presents the results of the work carried out during this master
thesis and states how the objective of the thesis was reached through the
development process. The objectives are those formulated as research questions
and as the user stories described in section 4.2.2. In other words, the result
is a system for building a corpus annotated with semantic roles. As described
in section 1.2, the supporting system was developed around a corpus of
biological transport events, whose construction method was previously proposed
by He Tan.
For a better understanding and a clearer presentation, we have divided the
results into two categories, theoretical and practical, so that each result
can be related to a specific research question. We reviewed various similar
works in a literature review, as part of the constructive research, which led
to the theoretical findings. The practical results were achieved
systematically using the DSR method; for their development we applied
development tools such as the NetBeans IDE version 7.2, together with Java
code and Java libraries, to implement the proposed solution. These results are
described in more detail in the following parts. The system requirements are
defined in section 4.2.1, and the way the system supports semi-automatic
labelling in corpus construction is shown by means of screenshots in section
5.2.1. Finally, the results are analysed through the two surveys presented in
the evaluation section.
5.1 Theoretical Results
The first research question, "How can the method of building a corpus
annotated with semantic roles be supported?", is answered in this section, as
is the second question, "How is the general framework developed?". These
results are the output of our research through the literature review and are
also reflected in the theoretical background section. From our study of
different projects, there is a need both to address the lack of large corpora
and to resolve ambiguities in the possible meanings of terms in a domain such
as the biomedical domain. We decided to make use of frame semantics in order
to take all definitions and contexts into account in a uniform way.
Describing all possible events can be done by means of an ontology, since an
ontology can completely define the entities and relationships of a domain.
Consequently, the system should be able to parse and visualise the lexical
properties of ontological terms, define frame-semantic descriptions, and
support the annotation task based on this method. Identifying these features
constitutes the theoretical result of the work. For this purpose, a general
framework is proposed, described in detail in section 4.1. The general
framework uses the knowledge provided by the ontology to inject information
through frame semantics. The framework comprises two phases, framing and
annotation, illustrated in the flow charts in figures 4-1 and 4-2.
5.2 Practical Results
In this section, the third research question, "How can a system be implemented
based on the framework?", is answered. Our practical results allow us to
assess how efficiently the system supports the frame semantics method.
5.2.1 Implementation Results
In this section we explain the functionality of the system by means of figures
showing its graphical user interface (GUI); they illustrate how the system
makes the user's work easier. Discussing the required components brings us
back to the structure of the method and of the supporting system. The
definition and description of the frames in the corpus, which is part of this
system's task, strictly follows the domain knowledge delivered by the Gene
Ontology, which describes the events. The core structure of a frame is
similar to FrameNet's: a description of the scenario evoked by the frame is
given, together with a list of the frame elements (FEs) and their definitions.
Using this supporting environment, the user can work with frame semantics and
carry out tasks such as adding a new frame with a name and description, adding
frame elements with their names and descriptions, and adding example sentences
to the new frame. It is also possible to edit a frame and its associations in
a corpus through the system. The stored data is presented in a
semi-structured, human-readable schema; in other words, the system saves the
definitions and descriptions of the frames and the related items in an XML
file. This was discussed in section 4.2.5.1 and illustrated in figure 4-5,
where all items are placed under the "FrameDefinitionSections" tag, which
consists of different subtags and sub-subtags. Figure 5-1 shows the user
interface for defining a new frame and its definition. Further figures
covering the other functionalities are presented in appendix 1.
Figure 5-1: Adding new frame and its definition
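The data entered through dialogs like the one in figure 5-1 can be pictured as a small object model. The class below is a hypothetical simplification, not the system's actual implementation; it only mirrors the fields discussed above (name, definition, frame elements, example sentences):

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal frame model: a name, a definition, its frame elements
 *  and its example sentences, mirroring the fields of the dialog. */
public class Frame {
    private final String name;
    private final String definition;
    private final List<String> elements = new ArrayList<>();
    private final List<String> examples = new ArrayList<>();

    public Frame(String name, String definition) {
        this.name = name;
        this.definition = definition;
    }
    public void addElement(String elementName) { elements.add(elementName); }
    public void addExample(String sentence)    { examples.add(sentence); }
    public String getName()           { return name; }
    public String getDefinition()     { return definition; }
    public List<String> getElements() { return elements; }
    public List<String> getExamples() { return examples; }
}
```

An object of this kind is what would ultimately be serialised into the "FrameDefinitionSections" structure of the XML corpus file.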
The system can produce output in two ways. In the first, the user selects a
sentence and annotates the separated tokens manually: the system works with
the frames defined in it, and the user annotates the given sentences with
respect to those frames. The output is a text file containing the annotated
tokens, obtained by saving the annotation label attached to each token; this
was shown in figure 4-6. In the second way, annotation is assisted by a
tagger: a POS tagger suggests a label for each token, and the user can apply
these suggestions to the annotation work and then save the result. The text
file is again filled with the sentence itself followed by its tokenised parts.
For this mode, we added a Java library to our code that provides the POS
tagger; the tagger used in this system is the Stanford POS Tagger.
5.2.2 Evaluation Result and Discussion
Once the implementation of the system was finished, we chose a data collection
method for its evaluation. To record the verbal behaviour of the system's
users, we used a survey, an effective tool for collecting opinions and the
most popular method for measuring expectations [72]. Since our research
problem was formulated and our study clearly defined, this governed the type
of survey to use. The focus is on a representative sample of system users
rather than on an analytical survey; a descriptive survey was therefore more
appropriate, since we wanted to obtain user attitudes towards working with our
system. To construct the survey, a review of earlier literature played a key
role in determining what types of questions to include in the questionnaire.
In developing the survey, we first specified what information was required as
feedback from the users. Second, we decided that only domain experts should be
our respondents, and accordingly contacted two experts in the biomedical
domain. Finally, we followed some guidelines for constructing the questions:
for example, questions should be simple, concise, realistic and specific.
Applying this method requires questionnaires, and since different goals call
for different types of questions, several surveys are needed to deliver the
results. We therefore created two surveys: one addressing the goal of ease of
use, and one addressing the goal of speed.
Our surveys were created with the help of Google Docs, and the links to them
were published on the internet to different users. Their task was to work with
the system and fill in the questionnaire as feedback. The surveys are shown as
figures in appendix 2: figure 8-10 shows the survey for the goal "easiness"
and figure 8-11 the survey for the goal "speed".
The completed surveys were analysed by the survey provider, in this case
Google Docs. The analysis results can be shown as charts and graphs presenting
the distribution of the feedback in percentages. The derived results show to
what extent the system is easy to use and meets the needs of its users. The
general results are shown below: figure 5-2 is a pie chart showing the
percentage of satisfaction with respect to ease of use, and figure 5-3 is a
bar chart showing the percentage of satisfaction with respect to speed.
Figure 5-2: Evaluation Result Regarding Easiness of System
Figure 5-3: Evaluation Result Regarding Speed of System
The comments received from the interest group led to an action plan discussing
possible improvements. This action plan contains proposals such as:
- Automatically filling in the POS suggestions produced by the tagger,
instead of starting the tagger manually to produce suggestions
- Removing the need to restart the system when re-opening the corpus or
opening another corpus
Conclusion and Future Work
49
6 Conclusion and Future Work
This thesis work can be seen as a step towards a better integration of a
theoretical method of corpus construction with practical semantic role
labelling. In this thesis, we support a method that aims to build corpora
labelled with semantic roles for the biomedical domain by proposing a general
framework. The proposed framework can be tailored to any domain and is not
prescriptive of particular tools and techniques. Based on this framework, a
system was developed to support the method.
The system comprises a biomedical corpus organised in terms of semantic
frames built on ontology-based domain knowledge. In particular, we developed a
supporting environment with components that allow the user to define
frame-semantic descriptions and frame elements and to add lexical units and
example sentences. The system also supports retrieving the defined example
sentences for the annotation task, annotating sentences and saving them in a
specific format. It is available as a software environment for building
annotated corpora. While it includes support for common frames in the
biomedical domain, it is also easily extended: existing frames can be
augmented and new ones added using the features the system provides.
The developed system provides an annotation scheme for biomedical texts and
produces an associated corpus. The corpus is unique within the biomedical
field in that ontologies act as the knowledge basis for creating
domain-specific semantic frames. It is hoped that the corpus will boost
research into other areas of domain-specific SRL. However, our results show
that human interaction in semi-automatic systems may cause some difficulties,
possibly due to the different expertise levels of the annotators within the
domain. A solution would be to analyse the domain knowledge of the annotators
in greater detail and, where appropriate, provide extra training in the
assignment of tags in the annotation task. Further validation of the system is
foreseen by applying it to texts from other biomedical domains and evaluating
the results with suitable data collection methods.
As described in the general framework of our work, the system can later be
integrated with ontological knowledge. Following our results, future work
could consider an ontology plugin for integrating the system with ontologies.
Another subject for future work is the choice between alternative taggers: in
our project we only made use of the Stanford POS tagger, and this part of the
work could later be replaced by a plugin generating annotation suggestions
using various POS taggers.
References
51
7 References
[1] Tan, H., Kaliyaperumal, R. & Benis, N. (2012). Ontology-Driven
Construction of Domain Corpus with Frame Semantics Annotations. 13th
International Conference on Intelligent Text Processing and Computational
Linguistics (CICLing 2012), New Delhi, India, (p.54-65).
[2] Boas, Hans C. (2002). On the role of semantic constraints in resultative
constructions, Linguistics on the way into the new millennium. Vol.1, (p.35-
44).
[3] Liu, Y. (2009). Semantic Role Labeling Using Lexicalized Tree Adjoining
Grammar. Doctoral thesis, Simon Fraser University.
[4] Tonelli, S. (2010). Semi-automatic techniques for extending the FrameNet
lexical database to new languages, PhD thesis, Dept. of Language Sciences,
Università Ca’ Foscari, Venezia.
[5] Bretonnel Cohen, K. & Hunter, L. (2008). Getting Started in Text Mining.
Plos Computational Biology 4.
[6] Kao, A. & Poteet, S.R. (2007). Natural Language Processing and Text
Mining. Springer: London.
[7] Kao, A. & Poteet, S.R. (2005). Text Mining and Natural Language
Processing: Introduction for special issue. ACM SIGKDD Explorations
Newsletter, Vol.7, issue 1.
[8] Allison, B. & Guthrie, L. (2006). Another look at the data sparsity
problem. 9th International Conference on Text, Speech and Dialogue, Berlin,
(p.327-334).
[9] Rodriguez, R. (2009). Biomedical Text Mining and Its Applications. PLoS
Comput Biol 5(12).
[10] Gildea, D. & Jurafsky, D. (2002). Automatic labeling of semantic roles.
Computational Linguistics 28(3), (p.245-288).
[11] Wikipedia. (2011, November 11). Semantic role labeling. Retrieved 2012-
10-30 from http://en.wikipedia.org/wiki/Semantic_role_labeling
[12] Gildea, D. & Jurafsky, D. (October 2000). Automatic Labeling of Semantic
Roles. 38th Annual Conference of the Association for Computational
Linguistics (ACL-00), Hong Kong, (p.512-520).
[13] Carreras, X. & Màrquez, L. (2005). CoNLL-2004 and CoNLL-2005 Shared
Tasks: Semantic Role Labeling. Retrieved 2012-10-30 from
http://www.lsi.upc.edu/~srlconll/
[14] Palmer, M., Gildea, D., Kingsbury, P. (2005). The proposition bank: an
annotated corpus of semantic roles. Computational Linguistics 31, (p.71–105).
[15] Hung, S.H., Lin, C.H. & Hong, J. (2010). Web mining for event-based
commonsense knowledge using lexico-syntactic pattern matching and semantic
role labeling. Expert Systems With Applications, Vol.37(1), (p.341-347).
[16] Stevens, G. (2006). Automatic Semantic Role Labeling in a Dutch Corpus.
Master thesis, Utrecht University, Netherlands.
[17] Ruppenhofer, J., Ellsworth, M., Petruck, M.R.L., Johnson, C.R. &
Scheffczyk, J. (2005). FrameNet II: Extended theory and practice. Tech. rep.,
ICSI.
[18] Monachesi, P. (2009). Annotation of semantic roles. Utrecht University,
Netherlands.
[19] Baker, C.F. & Sato, H. (2006). The FrameNet Data and Software.
[20] Beck, K. (2000). Extreme Programming Explained: Embrace Change.
Addison-Wesley.
[21] Fillmore, C.J. (1985). Frames and the semantics of understanding.
Quaderni di Semantica.
[22] Wikipedia. (2012, September 27). Frame Semantics. Retrieved 2012-11-01,
from http://en.wikipedia.org/wiki/Frame_semantics_(linguistics).
[23] Tan, H., Kaliyaperumal, R., Benis, N. (2011). Building frame-based
corpus on the basis of ontological domain knowledge, Proceedings of the 2011
Workshop on BioNLP , Portland, Oregon, USA (p.74-82).
[24] Hamm & Fritz. (2007). Frame Semantics. University of Stuttgart
publications.
[25] Minsky, M. (1975). A Framework for Representing Knowledge. In The
Psychology of Computer Vision, ed. P. H. Winston, New York: McGraw-Hill,
(p.211 – 277).
[26] Allan, K. (2001). Natural Language Semantics. Blackwell Publishers Ltd,
Oxford, (p.251).
[27] Baker, C. & Sato, H. (2003). The FrameNet data and software, ACL
Companion The Association for computer linguistics, (p.161-164).
[28] Boas, Hans C. (2001). Frame Semantics as a framework for describing
polysemy and syntactic structures of English and German motion verbs in
contrastive computational lexicography. Proceedings of Corpus Linguistics
2001, Lancaster, U.K., (p.64-73).
[29] Ruppenhofer, J., Ellsworth, M., Petruck, M.R.L., Johnson, C.R. &
Scheffczyk, J. (2005). FrameNet II: Extended theory and practice. Tech. rep.,
ICSI.
[30] Palmer, M. & Gildea, D. & Kingsbury, P. (2005). The Proposition Bank:
An Annotated Corpus of Semantic Roles, Computational Linguistics, Vol. 31,
No. 1, (p. 71-106).
[31] Stevens, G. (2006). Master thesis, Universiteit Utrecht, Faculty of arts.
[32] Loper, E., Yi, S. & Palmer, M. (2007). Combining Lexical Resources:
Mapping Between PropBank and VerbNet. Proceedings of the 7th International
Workshop on Computational Semantics, Tilburg, the Netherlands.
[33] Wikipedia. (2012, October 12). PropBank. Retrieved 2012-11-02 from
http://en.wikipedia.org/wiki/PropBank.
[34] Levin, B. (1993). English Verb Classes and Alternations: A Preliminary
Investigation. University of Chicago Press, Chicago, USA.
[35] Carreras, X. & Màrquez, L. (2005). Introduction to the CoNLL-2005 Shared
Task: Semantic Role Labeling.
[36] Pradhan, S., Sun, H., Ward, W., Martin, JH., Jurafsky, D. (2004). Parsing
Arguments of Nominalizations in English and Chinese. The Human Language
Technology Conference/North American chapter of the Association for
Computational Linguistics annual meeting.
[37] Jiang, Z. (2006). Semantic role labeling of NomBank: A maximum entropy
approach. Conference on Empirical Methods in Natural Language Processing
(EMNLP) , (p. 138-145).
[38] Meyers A., Reeves R., Macleod C., Szekely R., Zielinska V., Young B.,
Grishman R. (2004). Annotating Noun Argument Structure for NomBank.
Conference on Language Resources and Evaluation (LREC).
[39] Chou W., Tsai R., Su Y., Ku W., Sung T., Hsu W. (2007). BIOSMILE: a
semantic role labeling system for biomedical verbs using a maximum-entropy
model with automatically generated template features.
[40] Chou W., Tsai R., Su Y., Ku W., Sung T., Hsu W. (2006). A Semi-
Automatic Method for Annotating a Biomedical Proposition Bank. Proceedings
of the Workshops on Frontiers in Linguistically Annotated Corpora 2006, (p. 5-
12).
[41] Kogan, Y., Collier, N., Pakhomov, S. & Krauthammer, M. (2005). Towards
Semantic Role Labeling & IE in the Medical Literature.
[42] Williamson, K. (2002) Research methods for students, academics and
professionals: information management and systems, Wagga Wagga NSW:
Centre for Information Studies.
[43] Ghauri, P.N., Gronhaug, K. (2005). Research methods in business studies:
a practical guide, Prentice Hall.
[44] Wikipedia. (2012, September 14). Design Science Research. Retrieved
2012-11-13 from http://en.wikipedia.org/wiki/Design_science_research
[45] Bostrom, R.P. and Heinen, J.S. (1977). MIS problems and failures: A
socio-technical perspective: Part I: The causes. MIS Quarterly, 1(3), (p.17-32).
[46] Mumford, E. (1983). Designing Human Systems for New Technology,
The ETHICS method, Manchester Business School, Manchester.
[47] Langefors, B. (1966). Theoretical Analysis of Information Systems,
Studentlitteratur, Sweden, Lund.
[48] Sundgren, B. (1973). An Infological Approach to Data Bases. Ph.D. Diss.,
Skriftserie utgiven av Statistiska Centralbyrån, Nummer 7, Statistiska
Centralbyrån, Stockholm.
[49] Lundeberg, M., Goldkuhl, G., and Nilsson, A. (1978). Systemering,
Studentlitteratur, Sweden, Lund.
[50] Koppenhagen, N., Gass, O., Müller, B. & Maedche, A. (2012). Design
Science Research in Action: Anatomy of Success Critical Activities for Rigor
and Relevance. Proceedings of the 20th European Conference on Information
Systems (ECIS 2012), Poster Presentation, Barcelona, Spain.
[51] Rittel, H. and M. Webber (1984) Planning problems are wicked problems,
in Developments in Design Methodology, N. Cross (ed.), John Wiley & Sons,
New York, (p. 135–144).
[52] Simon, H. (1996). The Sciences of the Artificial, 3rd edn., MIT Press,
Cambridge, MA.
[53] Hevner, A., Chatterjee, S. (2010). Design Research in Information
Systems: Theory and Practice Integrated Series in Information Systems. Vol
22, (p.9-20).
[54] Peffers, K., Tuunanen, T., Rothenberger, M., and Chatterjee, S. (2008) A
design science research methodology for information systems research, Journal
of Management Information Systems 24 (3), (p. 45–77).
[55] Takeda, H., Veerkamp, P., Tomiyama, T., and Yoshikawa, H. (1990).
Modeling Design Processes. AI Magazine 11, 4, (p.37-48).
[56] Sein, M. K., Henfridsson, O., Purao, S., Rossi, M. and Lindgren, R.
(2011). Action Design Research. MIS Quarterly, (35:1), (p.37-56).
[57] Walls, J. G., Widmeyer, G. R. and El Sawy, O. A. (1992). Building an
Information System Design Theory for Vigilant EIS. Information Systems
Research, 3(1), (p.36-59).
[58] Gregory, R.W. (2010). Design Science Research and the Grounded
Theory Method: Characteristics, Differences, and Complementary Uses.
Proceedings of the 18th European Conference on Information System (ECIS
2010), Pretoria, South Africa.
[59] Vaishnavi V.K, Kuechler Jr. W., (2008) Design Science Research Methods
and Patterns: Innovating Information and Communication Technology,
Auerbach Publications, Taylor and Francis Group, New York, USA.
[60] Shareef, M. I., Rawi, A.W. (2012). The Customized Database
Fragmentation Technique in Distributed Database Systems: A Case Study,
Jönköping University, School of Engineering, JTH, Computer and Electrical
Engineering.
[61] Hart, C. (1998). Doing a Literature Review: Releasing the Social Science
Research Imagination. London: Sage, (p.14).
[62] Beck, K. (2000). Extreme Programming Explained: Embrace Change.
Addison-Wesley.
[63] Jeffries, R. (2001, November 8). What is Extreme Programming?
Retrieved 2012-11-10, from http://www.xprogramming.com/xpmag/whatisxp.
[64] Wells, D. (2009, September 9). Extreme Programming: A gentle
introduction. Retrieved 2012-11-04, from http://extremeprogramming.org/.
[65] Agile only. Retrieved 2012-10-28, from http://agile-only.com/master-
thesis/software-dm/agile-s-dm/xp.
[66] Sochacki, G. (2002). Evaluation of Software Projects, a recommendation
for implementation: The iterating evaluation model. A master thesis in:
Blekinge Institute of Technology, Sweden.
[67] Steinberg, D. & Palmer, D.W. (2004). Extreme Software Engineering.
Pearson Education, Inc., (p.208-294).
[68] Eckel, B. (2006). Thinking in Java (4th Edition) Stockholm:PED AB,
Pearson Education.
[69] Liang, S. (2007). Java(TM) Native Interface: Programmer's Guide and
Specification. Baltimore: Addison-Wesley.
[70] Extensible Markup Language (XML) 1.0 (Fifth Edition). Retrieved 2013-
01-27, from http://www.w3.org/TR/REC-xml/#wf-entities
[71] Lee, D. & Chu, W.W. (2000). Comparative Analysis of Six XML Schema
Languages. ACM SIGMOD Record 29(3), Department of Computer Science,
University of California, USA.
[72] Brendtsson, M. (2008). Developing your Objectives and Choosing
Methods. In Springer (Ed.), Planning and Implementing your Computing
Project – with Success! (p.54-71).
Appendix
57
8 Appendix
Appendix 1: Practical Results (Screenshots of system)
Figure 8-1: Adding frame elements and definition to a frame
Figure 8-2: Editing a frame and its related options
Figure 8-3: Selecting a sentence for annotation
Figure 8-4: Confirming the selected sentence to be annotated
Figure 8-5: Tokenisation of sentence
Figure 8-6: Setting roles to tokens
Figure 8-7: Saving the result through filled table
Figure 8-8: Annotating using the POS Tagger
Figure 8-9: Dividing tokens
Appendix 2: Surveys
Figure 8-10: Survey "Easiness"
Figure 8-11: Survey "Speed"