A Rule-based Approach to External Context Extraction from
Biomedical Literature: URL and Role Extraction
A dissertation submitted to The University of Manchester for the degree of
Master of Science in Informatics
In the Faculty of Engineering and Physical Sciences
2010
Azad Dehghan
School of Computer Science
Table of Contents
Table of Contents .......................................................................................................................... 2
List of Tables ................................................................................................................................. 4
List of Figures ................................................................................................................................ 6
List of Abbreviations ..................................................................................................................... 7
Abstract ........................................................................................................................................ 8
Declaration ................................................................................................................................... 9
Copyright Statement ..................................................................................................................... 9
Dedication .................................................................................................................................. 10
Acknowledgement ...................................................................................................................... 10
1. Introduction ........................................................................................................................ 11
1.1. Motivation ................................................................................................................... 11
1.2. Project Aims ................................................................................................................ 12
1.2.1. Conceptualisation of Project Specific Terminology ............................................... 12
1.3. Project Objectives ........................................................................................................ 13
1.4. Availability ................................................................................................................... 13
1.5. Overview of Chapters .................................................................................................. 14
2. Background ......................................................................................................................... 15
2.1. Text Mining.................................................................................................................. 15
2.1.1. Information Retrieval ........................................................................................... 15
2.1.2. Natural Language Processing ................................................................................ 16
2.2. Information Extraction ................................................................................................. 17
2.2.1. Rule-based and Statistical-based Approaches to IE ............................................... 18
2.2.2. IE Application Development Tools/Software......................................................... 18
2.3. NLM Journal Archiving and Publishing DTDs ................................................................. 19
2.4. Related Work ............................................................................................................... 21
2.5. Summary of Chapter .................................................................................................... 24
3. Software Requirements ....................................................................................................... 26
3.1. Description of Main Tasks ............................................................................................ 26
3.1.1. URL Extraction...................................................................................................... 26
3.1.2. Acknowledgement Extraction ............................................................................... 27
3.2. Functional User and System Requirements .................................................................. 27
3.2.1. Functional User Requirements and Use Case Diagram .......................................... 27
3.2.2. Functional System Requirements ......................................................................... 29
3.2.3. Requirement Traceability Matrix .......................................................................... 33
3.3. Non-Functional Requirements ..................................................................................... 34
4. System Design and Analysis ................................................................................................. 35
4.1. Generic System Architecture ........................................................................................ 35
4.2. Description of External Context Extraction ................................................................... 36
4.2.1. URL Module ......................................................................................................... 36
4.2.2. IE Module............................................................................................................. 39
4.3. System Architecture..................................................................................................... 41
4.3.1. Subsystems Architecture ...................................................................................... 41
4.4. System Design ............................................................................................................. 42
4.4.1. Database Layer..................................................................................................... 43
4.4.2. Application Layer ................................................................................................. 44
4.4.3. Presentation Layer ............................................................................................... 47
5. Implementation................................................................................................................... 48
5.1. Tools & Implementation Environment ......................................................................... 48
5.2. Implementation of URL Module ................................................................................... 48
5.2.1. Extraction of URLs ................................................................................................ 49
5.2.2. Checking Resource Availability ............................................................................. 49
5.2.3. Determining Resource Type ................................................................................. 50
5.3. Implementation of IE Module ...................................................................................... 53
5.3.1. GATE .................................................................................................................... 53
5.3.2. Java Annotation Pattern Engine............................................................................ 53
5.3.3. Implementation of IE Module Described .............................................................. 54
5.3.4. Information Extraction ......................................................................................... 60
6. Evaluation ........................................................................................................................... 63
6.1. URL Extraction ............................................................................................................. 63
6.1.1. Discussions........................................................................................................... 65
6.2. Role Extraction ............................................................................................................ 66
6.2.1. Discussions........................................................................................................... 68
6.3 System Limitations ....................................................................................................... 70
7. Conclusion ........................................................................................................................... 72
7.1. Limitations and Future Work........................................................................................ 73
References .................................................................................................................................. 74
Appendix A – System Architecture and Design ............................................................................ 77
Appendix B – Implementation ..................................................................................................... 80
Appendix C – Evaluation Data...................................................................................................... 81
List of Tables
Table 1 – Relevant XML Tags 20
Table 2 – Most Acknowledged Funding Organisations 23
Table 3 – Ideal Results from URL Extraction Process 26
Table 4 - Ideal Results of TM Process 27
Table 5 – Description of Actor (AC) 28
Table 6 – Description of Use Cases 28
Table 7 – Mapping between Projects Objective and Implementation Objectives 29
Table 8 – Implementation Objective 1 30
Table 9 – Implementation Objective 2 30
Table 10 – Implementation Objective 3 30
Table 11 – Implementation Objective 4 31
Table 12 – Implementation Objective 5 31
Table 13 – Implementation Objective 6 32
Table 14 – Implementation Objective 7 32
Table 15 – Implementation Objective 8 33
Table 16 – Requirement Traceability Matrix 33
Table 17 – Ideal Results from URL Extraction Process 37
Table 18 – HTTP Response Codes 38
Table 19 – Examples of REs for Collaborators and Funders 39
Table 20 - Results of TM Process 40
Table 21 – Regular Expressions for URL Validation 49
Table 22 – Sample of Keywords 50
Table 23 – Distributed Score of Soft Decision Algorithm 51
Table 24 – Result by Soft Decision Algorithm 52
Table 25 – Sample of One-Word Role Expression Lists 56
Table 26 – Sample of Multi-Word Role Expression Lists 56
Table 27 - Results of Role Extraction 61
Table 28 – Evaluation Terms Described 63
Table 29 – Total Resource Type Referenced 63
Table 30 – Resource Availability by Year 64
Table 31 – True Positives: Role Extraction 67
Table 32 – Most Acknowledged Funding Organisation 67
Table 33 – Description of RE Transducers Rule 69
Table 34 - Development and Evaluation Environment 70
Table 35 – Accomplished Project Aims 72
Table 36 – List Keywords for Resource Type Identification 80
Table 37 – URL Extraction Data 81
Table 38 – Role Extraction Data 82
Table 39 – Role Expression Extraction Data 83
Table 40 – Name Entity Extraction Data 80
List of Figures
Figure 1 - URL Decay (Wren, 2008) 24
Figure 2 - Use Case Diagram 28
Figure 3 – High-Level System Architecture 35
Figure 4 – URL Module Overview 37
Figure 5 - Generic NLP/IE Pipeline 40
Figure 6 - ExtConX2 Layered Subsystems 42
Figure 7 - ExtConX2 Database Layer 43
Figure 8 - Relational Database Schema 44
Figure 9 - ExtConX2 Application Layer 45
Figure 10 - ExtConX2 Presentation Layer 47
Figure 11 - IE Application Pipeline 55
Figure 12 – URL Decay 64
Figure 13 - System Db EER Diagram 77
Figure 14 - ExtConX2 Architectural Design 78
Figure 15 - ANNIE Default IE Modules (www.gate.ac.uk) 79
List of Abbreviations
A Nearly-New Information Extraction System ANNIE
Simple API for XML SAX
Common Pattern Specification Language CPSL
Data Mining DM
Digital Object Identifier DOI
Document Object Model DOM
Graphical User Interface GUI
Human Computer Interaction HCI
Hypertext Transfer Protocol HTTP
Information Extraction IE
Information Retrieval IR
Integrated Development Environment IDE
Java Annotation Pattern Engine JAPE
Java Virtual Machine JVM
Left-hand-side LHS
Model-View-Controller MVC
National Center for Biotechnology Information NCBI
National Institutes of Health NIH
National Library of Medicine NLM
Natural Language Processing NLP
Object Oriented Programming OOP
PubMed Central PMC
Relational Database Management System RDBMS
Right-hand-side RHS
Role Expression RE
Separation of Concern SoC
Software Development Processes SDP
Software Requirements Engineering SRE
Software Requirements Specification SRS
Text Mining TM
Abstract
With a huge number of publications within the biomedical domain, there is an increasing number
of references to URLs, and acknowledgements of individuals and funding organisations. This
project was motivated by providing an analysis of the scope of the problem of URL decay, and by
exploring and uncovering facts such as the most active funding organisations, relationships
between funding agencies and research themes, and between scientists and research themes.
EXTernal CONtext eXtractor 2 (ExtConX2) was developed in order to aid with this aim. Rule-
based approaches were adopted in order to extract URLs and acknowledgements from PubMed
Central documents. From the entire PMC dataset of roughly 190,000 PMC documents processed,
147,133 URLs and 194,539 roles were extracted.
Using this data, we have analysed some trends in URL decay and acknowledgements. For example,
we found that URL decay can be described as a function of publication year: the older the
publication, the less accessible the resources referenced within it. We also found that most
funding acknowledgements were associated with the National Institutes of Health, the National
Science Foundation, and the Wellcome Trust, in that order.
The adopted approach for URL extraction achieved a precision of 98.6% and a recall of 96%. The
role extraction task achieved a recall of 67.6% and a precision of 92.6%.
Declaration
No portion of the work referred to in the dissertation has been submitted in support of an
application for another degree or qualification of this or any other university or other institute of
learning.
Copyright Statement
i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns any copyright in it (the "Copyright") and he has given The University of
Manchester the right to use such Copyright for any administrative, promotional, educational
and/or teaching purposes.
ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance
with the regulations of the John Rylands University Library of Manchester. Details of these
regulations may be obtained from the Librarian. This page must form part of any such
copies made.
iii. The ownership of any patents, designs, trademarks and any and all other intellectual
property rights except for the Copyright (the "Intellectual Property Rights") and any
reproductions of copyright works, for example graphs and tables ("Reproductions"), which
may be described in this dissertation, may not be owned by the author and may be owned
by third parties. Such Intellectual Property Rights and Reproductions cannot and must not
be made available for use without the prior written permission of the owner(s) of the
relevant Intellectual Property Rights and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and exploitation
of this dissertation, the Copyright and any Intellectual Property Rights and/or
Reproductions described in it may take place is available from the Head of the School of
Computer Science.
Dedication
This project is first and foremost dedicated to Science. I hope that the excellence of science and
reason will continue to prevail! The earth is round indeed!
Secondly, I would also like to dedicate this project to my family: my parents Siavash Dehghan and
Shahnaz Gharehjani, and my brother Arash for his support.
Acknowledgement
I am grateful to Dr. Goran Nenadic for helpful comments and suggestions. I would also like to acknowledge the gnTeam for providing the PubMed Central dataset.
1. Introduction
The presence of overwhelming amounts of unstructured textual information within scientific
literature has made the need for machine-supported analysis of text ever more important to aid
scientists with scientific hypothesis generation and knowledge discovery (Ananiadou & McNaught,
2006; Ananiadou et al., 2005; Uramoto et al., 2004). A specific problem domain is that of the
biological sciences, reflected by the sheer volume of academic publications. For instance, in the
previous year alone (2009), over 710,000 approved references were added to MEDLINE®/
PubMed®, or between 60,000 and 120,000 references each month (NLM 2008; NLM 2009). This
sheer number of publications is simply not digestible by any individual scientist.
This domain in particular has made the application of text mining (TM) techniques to analyse huge
quantities of unstructured information a vital means to extend and further scientific/knowledge
discovery (Ananiadou & McNaught, 2006). The implications of attempting traditional knowledge
discovery, or of generating scientific hypotheses, without the aid of TM techniques should be
evident.
With a huge number of publications within the biomedical domain, (1) there is an increasing
number of references to URLs or online resources (e.g., publications, software, and so on), and (2)
acknowledgements of individuals and funding organisations. The aim of this dissertation may be
described as discovery-oriented (see Fayyad et al., 1996), i.e., to uncover previously unknown facts
or knowledge in regards to relationships/patterns involving these aspects using TM techniques.
1.1. Motivation
The unprecedented growth of biomedical literature has been coupled with the increasing practice
of referencing online resources (URLs) that become inaccessible over time (i.e., URL decay). This
project is motivated by providing an analysis of the scope of this problem. While previous studies
(Wren, 2004; Wren, 2008) have confirmed the issue of URL decay, this project will extend
previous research by providing a more holistic conclusion through the analysis of a broader
dataset.
Another motivation is similarly and partly derived from the unprecedented quantity of research
and publication within the biomedical domain. As biomedical research attracts billions of pounds
of research grants and investment from governmental, commercial, and academic sources
worldwide each year, it will be interesting to explore and uncover patterns such as the most active
funding agencies or institutions, relationships between funding agencies and research themes, and
between scientists and research themes.1
1.2. Project Aims
The aim of this project is to design and implement a system to enable the analysis of trends such as
URL decay (i.e., the phenomenon of inaccessible online resources), type of online resources most
often referenced, and exploration of acknowledgements: of individuals and organisations and their
respective roles in relation to the research/article where acknowledged. Therefore, the system must
enable extraction of so called external context from biomedical research: (1) URLs and (2)
acknowledgments. This software system will be referred to as EXTernal CONtext eXtractor 2 or
ExtConX2 hereafter.2
Moreover, ExtConX2 may be described as two systems in one: (1) URL extractor and (2)
acknowledgement extractor. Description of these subsystems follows:
(1) URL Extractor
The URL Extractor must enable (1) extraction of URLs, (2) for each URL extracted, the system
must determine the type of resource referenced (i.e., Document, Databank, Software, or
Organisation), and (3) determine if the URL is accessible or not.
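The three URL Extractor tasks above can be sketched in code. The following is a minimal illustrative sketch only, not the system's actual implementation (described in Chapter 5): the class and method names, the simplified URL pattern, and the choice of an HTTP HEAD request are all assumptions made here for illustration, and task (2), resource typing, is omitted.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractorSketch {
    // A deliberately simple URL pattern; real validation rules (see Table 21)
    // must also handle trailing punctuation, "www."-only forms, and so on.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[\\w.-]+(?:/[\\w./%#?&=-]*)?");

    // Task (1): extract all URL-like strings from document text.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Task (3): check accessibility via an HTTP HEAD request; any 2xx/3xx
    // response code is treated here as "accessible".
    public static boolean isAccessible(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (Exception e) {
            return false; // unreachable host, malformed URL, timeout, ...
        }
    }
}
```

The interpretation of HTTP response codes for the availability check is discussed later (Table 18).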
(2) Acknowledgement Extractor
The Acknowledgement Extractor must enable the identification and extraction of (1) name entities
(NEs) such as persons and organisations, (2) role expressions (REs), i.e., the acknowledged role of
a given NE, and (3) relations or associations between an NE and its corresponding RE.
1.2.1. Conceptualisation of Project Specific Terminology
Various project specific terminologies are used throughout this dissertation. This section provides
conceptualisation of these terms for easy referencing:
1 Apart from providing practical applications as described in section 1.2.1, biomedical research can at times be
controversial (e.g., stem-cell research; health risks of cigarettes); hence, uncovering patterns between funding
organisations and research could be important to maintain scientific and academic integrity.
2 The "2" indicates the number of tasks the system handles: (1) URL extraction and (2) acknowledgement extraction.
(1) Conceptualisation of Role Entities:
i. Collaborator – any NE (person or organisation), apart from the author(s), that provides any
non-financial support (e.g., editorial, conceptual, technical, and so on).
ii. Funder – any NE that provides financial support to the corresponding research.
iii. Role Expression – the literal role of a collaborator or funder.
Note that collaborator / contributor, and sponsor / funder will be used interchangeably throughout
this report.
(2) Conceptualisation of Resource Types:
i. Databank – any database or repository of information which may facilitate dynamic
information retrieval.
ii. Document – any article, report, book, or any static information resource.
iii. Organisation – any organisation or institute (literal definition).
iv. Software – any computer program or application (literal definition).
1.3. Project Objectives
This project will aim to achieve the following objectives:
1. Design and implement a relational database (Db) schema to store extracted data.
2. Design and implement a module to extract URLs from documents, determine if the given
URL is accessible or not, determine type of resource (or URL) extracted/referenced and
insert this data into a database.
3. Design and implement a module to identify and extract funders and collaborators (i.e.,
persons/organisations and their respective roles) from acknowledgements and insert this
data into a database.
4. Design and implement a GUI that will facilitate exploration of system functionalities and
which provides general statistics.
5. Evaluate the proposed methodology.
1.4. Availability
The PubMed Central dataset will be available from gnode1 (gnode1.mib.man.ac.uk) for use
within this project.
1.5. Overview of Chapters
The remainder of this dissertation is organised as follows:
Chapter 2 – Background: provides a general description of the project background, such as Text
Mining (TM) processes and concepts, and a review of related work.
Chapter 3 – Software Requirements: provides a high-level description of the main requirements
of ExtConX2, and further defines functional and non-functional requirements.
Chapter 4 – System Design and Analysis: illustrates and discusses the overall system design and
individual software components of ExtConX2.
Chapter 5 – Implementation: discusses the implementation of the system by analysing selected implementation components.
Chapter 6 – Evaluation: presents and discusses the results of the knowledge discovery stage of the dissertation and an evaluation of the adopted methods.
Chapter 7 – Conclusion: concludes the dissertation by reflecting on the project aims, the
limitations of the system, and suggestions for future work.
2. Background
2.1. Text Mining
TM generally involves the application of techniques such as Information Retrieval (IR), Natural
Language Processing (NLP), Information Extraction (IE), and Data Mining (DM) (JISC, 2006;
Uramoto et al., 2004) to unstructured text. Hearst (2003) summarises the general notion of TM as:
the discovery by computer of new, previously unknown information, by automatically [or
semi-automatically] extracting information from different written resources. A key element
is the linking together of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation.
While TM is often an iterative process, its techniques/stages are generally applied in an ordered
manner; TM, or knowledge discovery, is a process-oriented activity. Further, because TM is a
relatively new research field, the concepts used are not always consistent across the literature (see
Hotho et al., 2005; Fayyad et al., 1996). While it is not within the scope of this report to discuss
this issue further, it is important to acknowledge. Hence, this section will briefly review the
processes, techniques, and concepts involved in TM. This ought to clarify the conceptual
foundation and aid the understanding of the description of the overall project that follows.
2.1.1. Information Retrieval
Information retrieval is a discipline concerned with the finding of
documents/information (Hotho et al., 2005). IR covers a wide variety of research areas such as
document classification and categorisation, data visualisation, filtering, modelling, and so forth
(Baeza-Yates & Ribeiro-Neto, 1999). Often-referenced IR systems are search engines such as
Yahoo3 and Google4, which identify documents/information according to the user's search queries
(JISC, 2006). IR systems within the biomedical domain include Entrez PubMed and PubMed
Central (PMC). PubMed® is a free resource which provides access to MEDLINE® (Medical
Literature Analysis and Retrieval System Online), the U.S. National Library of Medicine's (NLM)
database of citations and abstracts. Currently, PubMed contains over 19 million references from
approximately 5,400 biomedical journals published worldwide (NLM, 2010a). PubMed Central is
the corresponding (free) full-text digital archive developed and managed by the U.S. National
Institutes of Health's (NIH) National Center for Biotechnology Information (NCBI).
3 www.yahoo.co.uk 4 www.google.co.uk
Moreover, within the context of TM or knowledge discovery process, IR refers to the process of
finding and retrieving appropriate documents relevant to some particular problem (JISC, 2006).
While IR is considered a sub-process of NLP by some researchers (e.g., Polajnar, 2006), within
this project IR will be regarded as a separate process antecedent to NLP.
2.1.2. Natural Language Processing
Natural language processing is concerned with the problem of understanding natural language (NL)
by the use of computers (JISC, 2006; Hotho et al., 2005). Due to the inherent ambiguity of NL,
analysing it by machine is evidently complex. Thus, NLP is commonly divided into several layers
of processing (Hahn & Wermter, 2006): the lexical, syntactic, and semantic levels. Lexical-level
processing deals with how words can be recognised, analysed, and
identified to enable further processing (Hahn & Wermter, 2006). The syntactic level analysis deals
with identification of structural relationships between groups of words in sentences, and the
semantic level is concerned with the content-oriented perspective or the meaning attributed to the
various entities identified within the syntactic level (Hahn & Wermter, 2006).
(1) Lexical Level Processing
The tokenisation process or the segmentation of text into individual meaningful elements is the
initial stage of lexical level processing. Tokens such as words, acronyms, abbreviations, numbers,
and so on are linguistically identified (Hahn & Wermter, 2006). Other interrelated sub-processes
associated with lexical level processing include (Hahn & Wermter, 2006):
Part-Of-Speech (POS) tagging, which is considered the core of this level of processing
Morphological analysis (the association/linking of varied forms of lexical elements to their
canonical base form)
Unknown word handling
Acronym detection
Name Entity Recognition (NER)
An example of a widely used and reliable POS tagger within the biomedical domain is GENIA
Tagger v3.0 (Tsuruoka et al., 2005). Computational lexicons (e.g., BioThesaurus) are also utilised
at this stage to aid with the overall lexical level processing. While lexicons often vary depending
upon the domain/task, in general, and at a bare minimum, computational lexicons contain lexical
elements such as full or canonical base forms of words together with additional linguistic
information (e.g., part-of-speech category and morphological information), and so on.
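The initial lexical-level step, tokenisation, can be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the token pattern, and the three token categories (word, number, punctuation) are choices made here for illustration, not the behaviour of any tool cited above; real tokenisers (e.g., in the GENIA Tagger) handle many more cases such as abbreviations and hyphenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokeniser {
    // Match a run of letters (word), a number with optional decimal part,
    // or any single non-whitespace symbol (punctuation).
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z]+|\\d+(?:\\.\\d+)?|[^\\sA-Za-z\\d]");

    // Segment text into individual meaningful elements (tokens).
    public static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For example, tokenising "POS tagging, v3.0" yields the elements POS, tagging, the comma, v, and 3.0, on which POS tagging and morphological analysis could then operate.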
(2) Syntactic Level Processing
Common methods applied within the syntactic level processing are chunkers and parsers.
Chunkers partition or label sentences into phrasal units (i.e., noun, preposition, verb, or adjective
phrases) (see Hahn & Wermter, 2006, p.23 for details), and parsers identify clauses such as word
sequences containing a subject and a predicate (Hahn & Wermter, 2006, p.25). An example of
a domain-specific (i.e., biomedical) shallow parser is the GENIA Tagger. Moreover, the application
of a name entity recogniser (NER) at this level of processing has proven beneficial within biological
text mining, as most name entities are contained within noun or prepositional phrases (Hahn &
Wermter, 2006). Some examples of NER systems include ANNIE for, e.g., person and organisation
name recognition (Cunningham et al., 2010), LINNAEUS for species name recognition (Gerner et
al., 2010), and TerMine for technical terms recognition.
Resources commonly utilised to aid with the overall syntactic level process are grammars and
treebanks. Treebanks are annotated text corpora with syntactic annotations at sentence level (i.e.,
POS tags and syntactic structures), and grammars contain some subset of linguistic syntax,
commonly rules or constraints which characterise morpho-syntactic and nonterminal grammar
categories (see Hahn & Wermter, 2006, p.21). An example of a widely used treebank (within the
biomedical domain) is GENIA Treebank v1.0, which is based upon annotated PubMed abstracts
(Kim & Tsujii, 2006; Tateisi, 2004).
(3) Semantic Level Processing
The semantic level analysis consists of linking terms or concepts to form logical/knowledge
propositions (Hahn & Wermter, 2006). This level of processing builds directly upon the
combination of the lexical and syntactic level analyses. For instance, within the scope of this
project, the semantic level processing involves the linking of NEs and their respective roles.
2.2. Information Extraction
Information extraction may be described as a subsequent stage of NLP. IE is the process of
automatically or semi-automatically extracting predefined data from unstructured text (JISC, 2006)
and inserting this data into forms or templates (see McNaught & Black, 2006, p.143), which
subsequently turn the data into factual information (Hotho et al., 2005). As defined by
the Message Understanding Conference (MUC), tasks commonly associated with IE are:
Recognition and classification of words denoting names of persons, organisations, and locations,
and numeric and temporal expressions (i.e., name entity task).
Identifying linked references to extracted entities (i.e., coreference task)
Extracting identifying and descriptive attributes of name entities (i.e., template element
task).
Extracting relationships between name entities (i.e., template relation task).
Extracting events in combination with either template element/relation tasks (McNaught
and Black, 2006, p.147).
Moreover, a commonly used method to aid the overall NER process is the use of gazetteers
(i.e., lists defining NEs such as persons, organisations, etc.).
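The idea of gazetteer-based lookup can be sketched as follows. This is an illustrative sketch only: the class name and the three example entries are assumptions made here, and a production gazetteer (e.g., as used by ANNIE) holds many thousands of entries matched with a finite-state machine rather than naive substring search.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class GazetteerSketch {
    // A tiny example list of organisation NEs; real gazetteers are loaded
    // from curated list files, one list per entity type.
    private final Set<String> organisations = new HashSet<>(Arrays.asList(
            "National Institutes of Health",
            "National Science Foundation",
            "Wellcome Trust"));

    // Return every listed organisation that appears verbatim in the text.
    public Set<String> lookup(String text) {
        Set<String> found = new HashSet<>();
        for (String org : organisations) {
            if (text.contains(org)) {
                found.add(org);
            }
        }
        return found;
    }
}
```

A lookup over the sentence "This work was funded by the Wellcome Trust." would return the single entry "Wellcome Trust".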
Data mining refers to the process of identifying patterns in (often large) structured datasets
(such as a database). Within the TM process, DM techniques are typically applied to facts extracted
during the IE stage in order to identify patterns and discover new knowledge (JISC, 2006).
2.2.1. Rule-based and Statistical-based Approaches to IE
Methods which may be used for IE tasks include rule-based (e.g., Common Pattern Specification
Language; Java Annotated Pattern Engine) and statistical-based (e.g., Support Vector Machines;
Hidden Markov Models) approaches. Both types of methods have their strengths and weaknesses.
For instance, statistical-based methods tend to require more computing resources than rule-based
methods, which tend to be more lightweight (and thus faster). On the other hand, the rule-based
(knowledge engineering) approach is domain or even task dependent, while the statistical
(automatic training) approach is relatively domain independent (Appelt & Israel, 1999). Hence,
domain portability is quite straightforward with statistical-based approaches (Appelt & Israel,
1999). While both methods can be equally labour and time intensive, they differ in their inherent
way of designing an IE application. The rule-based approach often requires domain knowledge and a
skilled knowledge engineer to implement effective rules for the IE task. On the other hand, the
statistical-based approach requires annotator(s) with some knowledge of the domain and task to
annotate a training corpus that models the information sought (Appelt & Israel, 1999).
2.2.2. IE Application Development Tools/Software
Many tools/software are available to aid scientists and developers to create IE applications, e.g.,
CAFETIERE (see Black et al., 2005), LingPipe,5 MinorThird,6 and GATE (General Architecture
5 http://alias-i.com/lingpipe/ 6 http://sourceforge.net/apps/trac/minorthird/wiki
for Text Engineering).7 A common denominator across the latter three tools is that they provide
Java APIs for use within custom-built standalone applications.
(1) CAFETIERE (or Conceptual Annotation for Facts, Events, Terms, Individual Entities, and
RElations) is a rule-based information extraction system for the various IE tasks spelled out in
its name. CAFETIERE provides various NLP components, such as tokenisers, POS taggers, NERs, etc.,
for text pre-processing, and a customised rule-based language that may be used for semantic level
processing of text (Black et al., 2005). Further, CAFETIERE provides a graphical user interface
(GUI) (i.e., the Analyser and Annotation Editor) which supports viewing and editing annotations
(useful for iterative development of IE rules).
(2) LingPipe may be described as a toolkit for processing text using computational linguistics and
primarily contains Java APIs for NER, POS, classification, and so on.
(3) MinorThird is another toolkit containing a collection of Java APIs for various NLP and IE
tasks. In contrast to LingPipe, MinorThird also provides a GUI for invoking APIs and debugging or
manipulating annotations.
(4) GATE may be considered more mature than the latter two tools, due to its extensive
documentation and user-friendly GUI. GATE is in essence an integrated development environment
providing reusable processing resources that enable the development and deployment of customised
applications to solve NLP problems/tasks (Cunningham et al., 2010). Processing resources are
individual NLP processing components, such as tokenisers, POS taggers, NERs, etc., which may be
applied to individual documents or a corpus in a customised order to create an IE application.8
These resources are collectively known as a Collection of REusable Objects for Language
Engineering (CREOLE). GATE may be used to create annotations over documents (for instance, to
be used with statistical-based approaches) or to create IE applications which may be used apart
from the GATE interface via APIs (GATE Embedded)9 (Cunningham et al., 2010).
2.3. NLM Journal Archiving and Publishing DTDs
Both PubMed and PubMed Central (PMC) documents are provided in XML format (defined by the
NLM Journal Archiving and Publishing DTDs) as an alternative to the common Portable Document
Format (PDF). As previously mentioned, PubMed contains citations and abstracts, and PMC is the
7 http://www.Gate.ac.uk 8 Java APIs from LingPipe, Google, Yahoo (and many more) for NLP/IE are provided as processing resources. 9 GATE API to integrate the IE application into a Java application.
corresponding full-text digital archive. The dataset from PMC, which contains approximately
190,000 documents, will be used in this project.
While the NLM Journal Archiving and Interchange Tag Suite was created to provide a common
format for publishers and archives to exchange journal content (NLM, 2010b), its usefulness for TM
applications has been widely appreciated. The Tag Suite defines elements and attributes to describe
full article contents such as metadata, acknowledgements, abstract, article body, citations, URLs,
and so on. This has proven beneficial to researchers who may only be interested in particular
section(s) of articles, e.g., abstracts or acknowledgements. For instance, instead of using regular
expressions over a whole document to identify particular sections of interest, a researcher could
use an XML parser10 to parse documents and extract the relevant sections. This has at least a
couple of advantages over the use of regular expressions. Provided that a tag set exists for the
particular document content of interest, using XML tags to extract this content is often more
accurate than using regular expressions (hence improving results). In addition, when designing a
TM application, which often processes huge amounts of documents, the opportunity to parse
documents only for specific content rather than process whole documents can significantly improve
performance (i.e., response time and use of computing resources).
Currently there exist seven different Tag Suite versions or Document Type Definitions (DTDs)11
for PMC articles. However, these versions are consistent with regard to the tags used for the
content of interest to this project, namely acknowledgements and URLs.
Table 1 describes XML tags which will be used in the implementation of ExtConX2 (NLM,
2010c):
Table 1 – Relevant XML Tags
(1a) <ext-link> </ext-link>: Tag defining an external resource outside the scope of an article.
(1b) ext-link-type="uri": Tag (1a) must contain the attribute ext-link-type with the value uri, indicating that the tag contains a URL.
(1c) xlink:href: Finally, within the tag element, a third attribute (1c) must identify the external link.
(2) <ack> </ack>: Tag defining the acknowledgement content/section.
Below is a simplified XML skeleton in the NLM Archiving and Interchange format. Samples of the
tags described in Table 1 may be found at lines 28 and 34 in the following example:
10 XML parser generally refers to an API that enables one to programmatically read XML files and extract content of
interest. Common APIs used for XML parsing in Java include the Document Object Model and the Simple API for XML.
11 Tag Suite versions include: 1.0, 1.1, 2.0, 2.1, 2.2, 2.3, and 3.0 (current).
1 <article>
2 <front>
3 <journal-meta>
4 <journal-id>Journal Acronym</journal-id>
5 ...
6 </journal-meta>
7 <article-meta>
8 ...
9 <contrib id="A1" contrib-type="author">
10 <name>
11 <surname>Last</surname>
12 <given-names>First</given-names>
13 </name>
14 </contrib>
15
16 <abstract> ... </abstract>
17 </article-meta>
18 </front>
19 <body>
20 <sec> <title>Introduction</title>
21 <p> … </p>
22 </sec>
23 <sec sec-type="method"> <title> Methods </title>
24 <p> … </p>
25 </sec>
26 </body>
27 <back>
28 <ack> We would like to thank Armand Seguin for his support of the
project and for many stimulating discussions. </ack>
29
30 <ref-list>
31 <ref id="A1">
32 <citation citation-type="other">
33 <article-title>An Online Resource</article-title>
34 <ext-link ext-link-type="uri"
xlink:href="http://www.web.com"/>
35 </citation>
36 </ref>
37 </ref-list>
38 </back>
39 </article>
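To make the advantage over regular expressions concrete, a skeleton like the one above could be parsed with the standard Java DOM API roughly as follows. This is an illustrative sketch, not the project's implementation; it assumes only the tag and attribute names from Table 1:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Extracts the <ack> text and ext-link URIs from an NLM-style XML string.
class NlmExtractor {
    // Returns { acknowledgement text, concatenated URI hrefs }.
    static String[] extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            // Acknowledgement section: first <ack> element, if any.
            NodeList acks = doc.getElementsByTagName("ack");
            String ack = acks.getLength() > 0
                    ? acks.item(0).getTextContent().trim() : "";
            // URLs: <ext-link> elements whose ext-link-type is "uri".
            StringBuilder urls = new StringBuilder();
            NodeList links = doc.getElementsByTagName("ext-link");
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                if ("uri".equals(link.getAttribute("ext-link-type"))) {
                    urls.append(link.getAttribute("xlink:href"));
                }
            }
            return new String[] { ack, urls.toString() };
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

For a collection as large as the PMC dataset, a streaming parser (SAX) would consume less memory than DOM, since whole documents need not be held as trees; the DOM version is shown only for brevity.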
2.4. Related Work
Giles and Councill (2004) developed a system for acknowledgement extraction from Information
Science literature.12 Based upon their analysis of the extracted data, a classification scheme of
six categories of acknowledgements was identified: (1) moral support, (2) financial support, (3)
editorial support, (4) presentational support (i.e., presenting the paper at a conference), (5)
instrumental/technical support, and (6) conceptual support, or peer interactive communication
(PIC) as coined by Giles and Councill. They justified their classification scheme on the basis of
12 The IR system utilised for document retrieval: the CiteSeer digital library - http://www.citeseer.ist.psu.edu
the significance of acknowledgements. For instance, conceptual and technical support is arguably
more noteworthy as an academic contribution than moral support (Giles & Councill, 2004).
Nevertheless, their argument was never reflected in their results.13
Giles and Councill's method is inherently an NER system, as actual roles were only determined by
post-extraction analysis. For instance, they provide a table which partly includes acknowledged
companies and funding agencies. However, it cannot be concluded beyond doubt whether these
acknowledged entities provided funding, material, or even intellectual support. Giles and
Councill's conclusion is based on prior knowledge of the names of funding organisations and
analysis of a subset of the most acknowledged entities. Thus, acknowledgements of funding agencies
and companies can only be assumed to represent financial support (see Giles & Councill 2004,
p.17601). ExtConX2 will be more sophisticated in that respect, as NEs and their respective roles
will be identified and extracted from acknowledgements. Hence, this task will be slightly more
challenging than Giles and Councill's, and as the nature of the evaluation metrics will differ,
good metrics will be more challenging to obtain.
The methodology adopted by Giles and Councill (2004) is a combination of rule-based and
statistical-based approaches. Initially, regular expressions were used to identify sections which
most likely contained acknowledgements, specifically, section headings labelled acknowledgment. In
addition, the authors also identified acknowledgement passages within unmarked sections of
articles, typically within the document header (i.e., before the abstract/introduction or on the
first page) or footnotes (i.e., before the references or first appendix). Hence, all text on the
first page of the document and on the last page, before the reference section or the appendix, was
processed using an SVM to identify sentences containing acknowledgements. Subsequently, a
rule-based parser was applied to extract acknowledged named entities. Through extensive testing
involving 1,800 manually labelled documents the method achieved 78.45% precision and 89.55% recall.
Table 2 is an excerpt from Giles and Councill's (2004, p.17602) results for the most acknowledged
funding agencies.
Table 2 – Most Acknowledged Funding Organisations
Funding Agencies No. of acknowledgements
National Science Foundation 12,287
Defence Advanced Research Projects Agency 4,712
Office of Naval Research 3,080
Deutsche Forschungsgemeinschaft 2,780
National Aeronautics and Space Administration 2,408
Engineering and Physical Sciences Research Council 2,007
Air Force Office of Scientific Research 1,657
13 Apart from financial support, no other category was presented in their results.
Natural Sciences and Engineering Research Council of Canada 1,422
Department of Energy 1,054
Australian Research Council 1,010
European Union Information Technologies Program 825
National Institutes of Health 709
Army Research Office 666
Netherlands Organization for Scientific Research 646
Science and Engineering Research Council 489
Another piece of research related to one of the applications of ExtConX2 is Wren's (2004, 2008)
study of URL decay within MEDLINE/PubMed citations. Wren justified his motivation by the growth
in electronic references and the assumption that online resources are unreliable compared to
traditional printed journals. This was confirmed by the results of his study. The methodology used
by Wren within the knowledge discovery process was straightforward. Wren used Visual Basic as the
programming language of choice and regular expressions to identify and extract URLs from XML
documents (containing the citations). Additional heuristic rules and manual editing were applied
to handle/correct human errors such as mistyped URLs. However, neither the heuristic rules nor the
regular expressions were provided. Nevertheless, commonly encountered errors discussed were
inappropriate spaces within URLs, the use of backward slashes instead of forward slashes,
non-alphanumeric characters, and the inclusion of erroneous characters (see Wren, 2004, p.669).
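Since Wren did not publish his rules, the error classes he describes can only be sketched; the clean-up steps below are illustrative assumptions in Java, not his code:

```java
// Illustrative URL clean-up heuristics for the error classes Wren reports:
// stray spaces, backslashes instead of forward slashes, trailing junk.
// The exact rules here are assumptions, not Wren's published method.
class UrlCleaner {
    static String clean(String raw) {
        String url = raw.trim();
        url = url.replace(" ", "");           // inappropriate spaces within URLs
        url = url.replace('\\', '/');         // backward slashes -> forward slashes
        url = url.replaceAll("[),.;]+$", ""); // erroneous trailing punctuation
        return url;
    }
}
```

A production version would also validate the result (e.g., against a URL grammar) and flag strings that cannot be repaired automatically for manual editing, as Wren did.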
Wren's (2004) initial study involved 1,630 URLs extracted from nearly 13 million PubMed
citations. These URLs were programmatically checked for availability over a four-week period
using Microsoft Component Objects Internet Transfer Control (API). A URL was considered
inaccessible if it did not respond within 60 seconds or if the response code received indicated
that the resource was inaccessible (e.g., 404 not found, file not found, etc.). In addition, if 25
consecutive tries failed, a URL was considered inaccessible. URLs that were accessible in at least
90% of the checks were considered active. This method is appropriate, as web servers do not tend
to have 100% up-time (or be available 100% of the time); hence it maximises the accuracy of the
availability statistics.
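The decision rule just described (inaccessible after 25 consecutive failed tries; active when accessible in at least 90% of checks) can be expressed as a small pure function. The thresholds are those reported by Wren; the function itself is an illustrative reconstruction:

```java
// Reconstruction of Wren's availability rule. The 25-failure and 90%
// thresholds come from the text; the encoding as a function is assumed.
class AvailabilityRule {
    // checks[i] is true if the URL responded on check i.
    static boolean isActive(boolean[] checks) {
        int ok = 0, consecutiveFailures = 0;
        for (boolean up : checks) {
            if (up) { ok++; consecutiveFailures = 0; }
            else if (++consecutiveFailures >= 25) return false; // 25 straight misses
        }
        // Active if accessible in at least 90% of the checks made.
        return checks.length > 0 && ok * 10 >= checks.length * 9;
    }
}
```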
Wren's (2008) follow-up study used practically the same method as described above. URLs were
extracted/surveyed in the years following the initial study (except 2006): 2004 (a total of 2,294
URLs surveyed), 2005 (3,327 URLs), and 2007 (6,154 URLs). Both studies (Wren, 2004; Wren,
2008) showed time-dependent decay of URLs. More specifically, URL decay could be described as
a function of publication year: the older the publication, the fewer accessible resources it
contained. Below is a graph representing the URL decay results from Wren's studies (2004, 2008):
Figure 1 - URL Decay (Wren, 2008)
While Wren's approach was focused solely on abstracts, ExtConX2 will be applied to full-text
articles, thus covering a larger scope. This also means that a more holistic conclusion can be
drawn regarding URL decay. In addition, as previously stated, URLs will be classified into four
different categories, enabling a broader analysis of the nature of the resources referenced.
Nevertheless, Wren's research/results provide an excellent benchmark for post-research evaluation
and comparison. For instance, I would hypothesise that URL decay will be more severe within
full text as opposed to citations.
2.5. Summary of Chapter
The aim of this project is to develop a system (ExtConX2) to enable discovery of specific trends
within the biomedical domain. Specifically: (1) the exploration of acknowledgements of
individuals and organisations, and (2) analysis of URL decay and most often referenced resources.
The dataset which will be utilised within this project is full-text XML articles from PubMed
Central.
TM techniques will be used to achieve the main aims defined. In particular, NLP processing at
the lexical, syntactic, and semantic levels will be utilised to enable role extraction. In
addition, the XML tags provided by the NLM Archiving and Interchange DTDs will also be used for
extraction of URLs (not exclusively) and to aid the initial extraction of acknowledgement text from
PMC articles.
While prior research has had applications similar to ExtConX2, this project looks to extend their
scope by analysing larger datasets and adopting more sophisticated approaches. For instance,
Wren's (2004, 2008) study of URL decay was confined solely to PubMed citations. In contrast,
ExtConX2 will enable the analysis of URL decay within full-text articles. This will enable us to
draw a more holistic conclusion regarding the implications of URL decay and the types of resources
most often referenced within the biomedical domain. Moreover, acknowledgement extraction has yet
to be applied within the biomedical domain; ExtConX2 is the first system to do so. Giles and
Councill's (2004) research on acknowledgement extraction is concerned with publications within the
CiteSeer digital library. Their approach can at best be described as an NER system, as semantic
level processing is never applied. For instance, their results for the most acknowledged funding
agencies and companies are based on an assumption and on analysis of a subset of articles. In
contrast, ExtConX2 will enable us to determine whether extracted NEs have in fact provided funding
by extracting the NEs' corresponding roles as acknowledged in the text.
3. Software Requirements
The initial part of this chapter (Section 3.1) provides a high-level description of ExtConX2's
main requirements: (1) URL extraction and (2) role extraction. Subsequently, detailed descriptions
of functional user and system requirements, and non-functional system requirements, are provided
(Sections 3.2 and 3.3). These requirements have been derived from the project's objectives and
the software requirements engineering (SRE) process during the initial stages of this dissertation.
These requirements constitute the foundation of ExtConX2.
3.1. Description of Main Tasks
This section provides a brief high-level description of the main functional requirements of
ExtConX2: (1) URL extraction and related processes and (2) acknowledgement extraction. Some
details have been deliberately omitted for the sake of simplicity (e.g., the use of XML
documents).
3.1.1. URL Extraction
As previously described, ExtConX2 must enable the extraction of URLs from biomedical
publications. For each URL extracted, the system must determine the type of resource referenced
(refer to Section 1.2.1) and whether the given URL is accessible (URL Status: see Table 3). For
instance, given these hypothetical examples:
1. R-Project (http://www.r-project.org) was used for statistical processing of data.
2. The data was collected using GenBank (http://www.ncbi.nlm.nih.gov).
The ideal results of subsequent processing of these sentences (inserted into a database) ought to be
(Table 3):
Table 3 – Ideal Results from URL Extraction Process
URL Type of Resource URL Status Date Checked
(1) http://www.r-project.org Software Active/Inactive 2010-09-01
(2) http://www.ncbi.nlm.nih.gov Databank Active/Inactive 2010-09-01
3.1.2. Acknowledgement Extraction
Acknowledgement extraction involves the extraction of NEs and their respective REs from
acknowledgement sections. The ideal results of processing the acknowledgements given below
(inserted into a database) should be (see Table 4):
(1) Financial support was obtained from the Swedish Research Council.
(2) The authors thank Ms. Maureen Stoddard Marlow for editing.
Table 4 - Ideal Results of TM Process
(1) Named Entity: Swedish Research Council
Role (enumeration): Funder
Role Expression: Financial support
(2) Named Entity: Ms. Maureen Stoddard Marlow
Role (enumeration): Collaborator
Role Expression: Editing
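One way to realise such a mapping is a lexico-syntactic rule that captures the role expression and the NE in a single pattern. The rule below is a hypothetical Java regex illustration of the idea, not the grammar ExtConX2 implements:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One illustrative rule: "<role expression> was obtained from <NE>."
// maps the NE to the Funder role. Pattern and role mapping are
// invented for illustration; they are not the system's rule set.
class RoleRule {
    static final Pattern FUNDER = Pattern.compile(
            "(Financial support|Funding) was obtained from (?:the )?([A-Z][\\w ]+?)\\.");

    // Returns "NE|Role|RoleExpression" or null if the rule does not fire.
    static String apply(String sentence) {
        Matcher m = FUNDER.matcher(sentence);
        if (!m.find()) return null;
        return m.group(2) + "|Funder|" + m.group(1);
    }
}
```

A real rule set would first run NER so that patterns anchor on annotated entity spans rather than raw capitalised words; this sketch only shows how a role expression and an NE can be linked by a single rule.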
3.2. Functional User and System Requirements
3.2.1. Functional User Requirements and Use Case Diagram
[R1]. The user shall be able to initiate extraction of URLs from PMC XML documents (stored in
the Shared Database) and insert this data and respective attributes into the System
Database.14
a. Attributes for each URL include:
(1) URL status: if link is active or inactive,
(2) type of resource (i.e., Databank, Document, Organisation, or Software),
(3) decision data: data used to determine type of resource, and (4) date checked.
[R2]. The user shall be able to initiate role extraction (i.e., extraction of NEs and their respective
REs) from full-text XML documents and insert this data and additional attribute into the
system database.
a. Attribute for each set of roles include: (1) the acknowledgement text where role(s)
has been extracted.
[R3]. The user shall be able to view general statistics:
14 The System Database (Db) refers to the Db specifically designed for ExtConX2, used to store processed data. The
Shared Db is provided by the gnTeam (http://gnode1.mib.man.ac.uk/) and contains the PMC dataset.
a. (1) Number of documents processed, (2) number of URLs extracted, including
descriptive statistics of URL status (i.e., by year; in total), and (3) number of roles
extracted.
[R4]. The user shall be able to set parameters, e.g., the number of documents to be processed by
the IE processes (i.e., R1 and R2).
A use-case diagram derived from the functional user requirements is provided below (Figure 2):
Figure 2 - Use Case Diagram
Description of Use Case Diagram:
Table 5 – Description of Actor (AC)
AC01 User System user.
Table 6 – Description of Use Cases
UC01 URL Extraction AC01 may initiate URL Extraction and related processes to
determine URL status, determine type of resource, compose decision data, and insert this data (including the date inserted)
into the System Database.
UC02 Role Extraction AC01 may initiate Role Extraction and insert this data
(including the acknowledgement text) into the system database.
UC03 View Statistics AC01 will be able to view statistics of IE processes: (1)
number of documents processed, (2) number of URLs
extracted, (2a) descriptive statistics of URL status (i.e., by
year; in total), and (3) number of roles extracted.
UC04 Set Parameters AC01 can set system parameters: e.g., number of documents to
be processed for IE processes (i.e., UC01 and UC02).
3.2.2. Functional System Requirements
This section describes the functional system requirements and related processes by implementation
objective (Tables 8-15).15 The project objectives have been refined into implementation objectives
to reflect the architectural design of the system, e.g., database operations have been separated
into a separate objective (implementation objective 6). See Table 7 for the mapping between
project objectives and implementation objectives.
Table 7 – Mapping between Projects Objective and Implementation Objectives
Project Objectives Implementation Objectives
1 1 (Table 8)
2 2-4; (6) (Tables 9-11 and 13)
3 5; (6) (Tables 12 and 13)
4 7 (Table 14)
5 8 (Table 15)
(1) Conceptualisation of Terms:
Conceptualisation of the terms used in the following tables (Tables 8-15):
Risk – refers to the degree of risk in completing a module/task and is based on several
factors such as time constraints, difficulty, dependency on other modules/tasks, and
external dependencies. The level of risk is based on a subjective estimate of these
factors.
External Dependency – refers to dependency on external factors, e.g., IR
system(s), database(s), software, and so on.
Shared Database (Db) – refers to the database containing PMC articles in XML
format (i.e., gnode1).
System Db – refers to the database designed and implemented to store
extracted/processed data.
15 Evaluation (Table 15) is also included for the sake of completeness even though it is not a functional requirement.
Table 8 – Implementation Objective 1
1. Design and implement a relational database schema to store extracted data (i.e., the System Db).
Functional Requirement: N/A
Risk: Low.
External Dependency: None.
Priority: High.
Pre-condition: Installed relational database management system (RDBMS), such
as MySQL.
Post-condition: Skeleton or empty Db schema: System Db.
Difficulty: Easy
Processes: 1. Design an Enhanced Entity Relationship (EER) diagram.
2. Translate the EER diagram to a relational schema.
3. Implement the relational schema.
Table 9 – Implementation Objective 2
2. Design and implement a module to extract URLs from PMC XML documents
Functional Requirement: [R5]. The module shall be able to identify and extract URLs
from PMC XML documents.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: Intermediate.
Pre-condition: Objective 1, and Objective 6 (A)
Post-condition: A set of extracted URLs.
Difficulty: Intermediate
Process overview:
1. Objective 6, process A (Table 13).
2. Parse document and extract URL(s).
Table 10 – Implementation Objective 3
3. Design and implement a module to determine type of resource (or URL)
extracted/referenced.
Functional Requirement: [R6]. The module shall be able to identify the type of online
resource referenced: Databank, Document, Organisation, or Software.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return type of resource or URL referenced (i.e., Databank,
Document, Organisation, or Software).
Difficulty: Intermediate
Process overview:
1. Get the URL context.
2. Determine the resource type by: a. keyword(s) within the URL string, b. keyword(s) within
the URL reference context (i.e., the title and/or description of the reference), or
c. keyword(s) within the article body where the URL is cited.
3. Return resource type.
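The keyword cascade in this process overview might look roughly as follows. This is a Java sketch only: the keyword lists and the fallback category are assumptions for illustration, not the module's actual rules:

```java
import java.util.Locale;

// Cascaded keyword checks: URL string first, then the reference context.
// Keyword lists and the fallback are illustrative assumptions.
class ResourceTypeClassifier {
    static String classify(String url, String context) {
        String u = url.toLowerCase(Locale.ROOT);
        String c = (url + " " + context).toLowerCase(Locale.ROOT);
        if (u.contains("genbank") || c.contains("database") || c.contains("databank"))
            return "Databank";
        if (c.contains("software") || c.contains("toolkit") || c.contains("package"))
            return "Software";
        if (u.endsWith(".pdf") || c.contains("report") || c.contains("manual"))
            return "Document";
        return "Organisation"; // fallback when no keyword fires
    }
}
```

The order of the checks matters: more specific evidence (a known databank name in the URL) is consulted before weaker contextual cues, and the least informative category serves as the default.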
Table 11 – Implementation Objective 4
4. Design and implement a module to determine URL status: active or inactive link
Functional Requirement: [R7]. The module shall be able to determine if a URL is active or
inactive (accessible or not).
Risk: Low.
External Dependency: No direct dependency, see pre-condition.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return URL status: 0/FALSE if inaccessible or 1/TRUE if
accessible.
Difficulty: Easy
Process overview:
1. Get URL to be checked (see Obj. 2).
2. Check if URL is active/inactive: if inactive return
0/FALSE, else (if active) return 1/TRUE.
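The check itself can be sketched with Java's standard HttpURLConnection API. The timeout value and the range of response codes counted as accessible are assumptions for illustration; the tables above do not specify them:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the URL status check. The timeout and the "accessible"
// code range (below 400) are illustrative choices, not the module's spec.
class UrlStatusChecker {
    // HTTP response codes below 400 are treated as accessible here.
    static boolean isAccessibleCode(int code) {
        return code >= 200 && code < 400;
    }

    // Returns TRUE if the URL responds with an accessible code within the
    // timeout; any I/O failure or timeout counts as inactive (FALSE).
    static boolean checkUrl(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return isAccessibleCode(conn.getResponseCode());
        } catch (Exception e) {
            return false;
        }
    }
}
```

A HEAD request is used so that only headers are transferred; some servers reject HEAD, so a production checker might fall back to GET before declaring a URL inactive.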
Table 12 – Implementation Objective 5
5. Design and implement a module to identify and extract sponsors and contributors (NEs such as persons/organisations and their respective roles) from acknowledgements
Functional Requirements: [R8]. The module shall be able to identify NEs, such as persons
and organisations/institutions.
[R9]. The module shall be able to identify REs (i.e., sponsors/funders or collaborators/contributors).
[R10]. The module shall be able to link NEs to their respective
REs. [R11]. The module shall be able to extract NEs and their
respective roles from annotated documents.
Risk: High. Main reasons for risk level:
Dependent upon the use of appropriate methodology and efficient use of tools (i.e., GATE 5.2.1).
Time constraint: approaching project deadline.
External Dependency: GATE 5.2.1 (see Section 2.2.2).
Priority: High.
Pre-condition: Objective 1, and Objective 6 (A).
Post-condition: Return NEs and corresponding REs identified.
Difficulty: Hard
Process overview:
1. Implementation objective 6, process A (see Table 13).
2. Parse document and extract the acknowledgement passage.
3. Process the acknowledgement passage through the text processing application designed with
GATE 5.2.1 (which returns a GATE XML document with tags representing annotated entities:
NEs and corresponding REs).
4. Parse the GATE XML document.
5. Extract annotated NEs and their respective roles.
Table 13 – Implementation Objective 6
6. Design and implement a module to handle database operations: (1) ensure synchronisation
of retrieval of documents for processing and documents already processed, (2) insert extracted/processed data into the system database.
Functional Requirements: [R12]. The module shall be able to synchronise retrieval of
documents for processing (from the Shared Db) and documents already processed (in the System Db).
[R13]. The module shall be able to insert given (tuple) of data
into the system database.
Risk: Low.
External Dependency: -
Priority: High.
Pre-condition: Implementation objectives 2-4, or 5.
Post-condition: Relevant data is inserted into the System Db.
Difficulty: Easy
Process overview:
This module is separated into two tasks: (A) synchronisation between processed documents (in the
System Db) and the retrieval of documents (from the Shared Db) for processing, and (B) data
insertion into the System Db.
A. Check the last document processed for role extraction / URL extraction: a. if none, get the
first document from the Shared Db (documents may be retrieved in ascending order, enabled by
the auto-incremented keys of records in the Shared Db);16 b. else, get the auto-incremented
id of the last document processed in the System Db and start the retrieval process from the
Shared Db at the last document processed + 1.
B. Either get URL data (implementation objectives 2-4) or role data (implementation objective 5)
and insert this data into the System Db.
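The synchronisation decision in step A reduces to picking the next auto-incremented key to fetch from the Shared Db; a minimal sketch of that decision follows (method and parameter names are invented for illustration, and the surrounding SQL would depend on the actual schema):

```java
// Decide which Shared Db document id to fetch next, given the highest
// id already processed in the System Db (null when none processed yet).
// Names are illustrative; the real module would wrap JDBC queries.
class DocumentCursor {
    static long nextDocumentId(Long lastProcessedId, long firstIdInSharedDb) {
        if (lastProcessedId == null) {
            return firstIdInSharedDb;  // fresh session: start at the beginning
        }
        return lastProcessedId + 1;    // resume from last processed + 1
    }
}
```

Keeping this decision in one place means a crashed or interrupted session resumes exactly where it stopped, provided each processed id is committed to the System Db before the next document is fetched.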
Table 14 – Implementation Objective 7
7. Design and implement a GUI that will facilitate exploration of system functionalities and
provides general statistics.
Functional Requirements:
[R14]. The module shall be able to display general statistics upon user request, such as: (1) number of documents processed,
(2) number of URLs extracted, (2a) descriptive statistics
of URL status (i.e., by year; in total), and (3) number of roles extracted.
[R15]. The module shall be able to invoke user parameters for
the number of documents to be processed.
Risk: Intermediate. Main reasons for risk level:
Time constraint: approaching project deadline.
Dependent on successful completion of previous modules.
External Dependency: No direct dependency, see pre-condition.
16 The implementation will take advantage of the available auto-incremented key within the Shared Db (and the
corresponding foreign key in the System Db) to keep track of documents processed, or documents to be processed when a new session is initiated.
Priority: Intermediate.
Pre-condition: Implementation objectives 1-6.
Post-condition: Interactive GUI.
Difficulty: Intermediate.
Process overview: See Use Case Diagram (Figure 2).
Table 15 – (Implementation) Objective 8
8. Evaluation of the proposed methodology
Functional Requirement: N/A
Risk: Intermediate.
1. Time constraint: approaching project deadline.
2. Dependent upon the successful completion of system modules.
External Dependency: No direct dependency, see pre-condition.
Priority: High.
Pre-condition: Completion of 1-4
Post-condition: -
Difficulty: Easy
Process overview: 1. Choose a random sample of results derived from previous
steps and apply evaluation metrics (see Chapter 6)
3.2.3. Requirement Traceability Matrix
Requirement Traceability Matrix (Table 16) by User and System Functional Requirements versus
project objectives:
Table 16 – Requirement Traceability Matrix
Obj. 2 Obj. 3 Obj. 4
[R01] X
[R02] X
[R03] X
[R04] X
[R05] X
[R06] X
[R07] X
[R08] X
[R09] X
[R10] X
[R11] X
[R12] X X
[R13] X X
[R14] X
[R15] X
3.3. Non-Functional Requirements
In addition to the functional requirements, a set of non-functional requirements has been derived
from the (SRE process or) requirements elicitation and analysis stage. While non-functional
requirements typically include product, external, and organisational requirements (Sommerville,
2004), this dissertation focuses solely on product requirements, specifically, system properties
to guide the architectural design and implementation of ExtConX2.
1. Extensibility
Within software engineering, extensibility refers to the design/implementation of a system that
takes into consideration potential future extension of system functionalities (Wikipedia, 2009).
Extensibility may also be described as a system architecture designed to accommodate future
changes with minimal effort. For instance, a system architecture based upon modularity or
compartmentalisation, in which software functions/components are separated by concern (SoC),17
may address this requirement. Use of an Object Oriented Programming (OOP) language may also help
achieve this end.
2. Maintainability
The notion of maintainability is similar to extensibility in some respects, as the approaches to
accommodating these requirements may intersect. Nevertheless, the aim of this requirement is to
accommodate effortless maintenance of the system, to ease amendment of features in the
implementation, and to help locate hidden software bugs. The use of an OOP language, SoC, and
detailed documentation may be used to fulfil this requirement.
3. Reusability
The system ought to enable reusability of modules to the extent possible. This facilitates
both extensibility and maintainability, in addition to providing software components which
may be used within future (unrelated) applications/research. The application of SoC at
class level may be used to fulfil this requirement.
17. Separation of concerns (SoC) refers to a logical separation of system functionalities. For instance, an analogy may be drawn from the Model-View-Controller (MVC) paradigm often used in web applications.
4. System Design and Analysis
This chapter is divided into two general sections:
a) Generic overview of the system architecture/design which describes high-level approaches
to extraction of external context (i.e., URLs and acknowledgements).
b) Detailed description of the system architecture and design.
4.1. Generic System Architecture
A high-level overview of ExtConX2 is provided below (Figure 3, see footnotes for description of
arrows). Brief description follows (Figure 3):
Figure 3 – High-Level System Architecture18
1. The Database Module is responsible for (1) synchronisation between the Shared Database
(containing PMC XML documents) and the System Database, (2) retrieval of documents
(Db Traverser) for processing, and (3) insertion of extracted/processed data (Data Inserter)
into the System Database.
2. The URL Module is responsible for (1) parsing PMC documents and extracting URLs
(URL Extractor), (2) determining whether a given URL is accessible (URL Status), and (3)
determining the type of resource referenced (Resource Type).
3. The IE Module is responsible for role extraction (IE Application). This module
encapsulates text pre-processing and IE task required to identify and extract NEs and
respective REs.
18. Solid arrows represent data flow; dashed arrows may be read as "sub-module of", with the arrowhead pointing toward the super-module.
4. The Parser Module encapsulates the XML parser. In addition, it handles NLM Journal
Archiving and Interchange DTDs, which are needed to parse PMC documents. The
DTDResolver redirects the XML System IDs to a local repository where the DTDs are
stored.
The ExtConX2 architecture is guided by the design principle of SoC at the system level: the
Database Module (including the Shared Db and System Db) encapsulates database operations
(i.e., the Database Layer), while the URL Module and IE Module (including the Parser Module)
encapsulate application logic (i.e., the Application Layer). This approach is termed a subsystems
architecture, where each subsystem represents a different level of abstraction (Bennett et al.,
2006).19 It may be considered one approach to fulfilling the non-functional requirements
previously defined (Section 3.3).
4.2. Description of External Context Extraction
This section provides a high-level description of external context extraction based upon the generic
system design (Figure 3).
4.2.1. URL Module
The URL Module (refer to Figure 3) performs three main tasks: (1) extraction of URLs from PMC
documents, (2) determination of the resource type for each URL extracted, and (3) determination of
whether a URL is active or inactive (i.e., whether the resource is accessible).
An approach to process a given sentence containing a citation to an online resource is illustrated
below (Figure 4).
19. The system is divided by SoC: the Database Layer deals solely with retrieving documents and inserting data (this includes the RDBMS), while the Application Layer is solely responsible for application logic.
Figure 4 – URL Module Overview
Given the following sentence:
1. The report was provided by World Health Organisation (http://www.who.int).
The output (Processed Data) of the given process (Figure 4) ought to be as follows (Table 17):
Table 17 – Ideal Results from URL Extraction Process
URL Type of Resource URL Status Date Inserted
(1) http://www.who.int Document Active 2010-09-01
A more detailed description follows: (a) extraction of URLs and determination of URL status, and
(b) determination of the resource type (from the extracted URL context).
a) URL Extraction
As PMC documents are provided in the NLM Archiving and Interchange format (XML), the unique
tag provided for identifying URLs may be used to extract them. For instance, consider the following
hypothetical example of a URL within a PMC document (disregarding any context):
1 <ext-link ext-link-type="uri" xlink:href="http://www.who.int">
2 http://www.who.int
3 </ext-link>
The approach that may be adopted to extract the given URL follows:
Get URL:
1. Parse the given document using an XML Parser.
2. Traverse through the parsed XML document to find the XML tag identifying URLs (i.e.,
ext-link): see line 1 in the example above.
a. Ensure that ext-link contains the attribute ext-link-type and that this attribute
equals uri (i.e., ext-link-type="uri").20 This indicates that the XML tag contains an
external URL.21
3. Subsequently, either (a) extract the URL between the ext-link start and end tags (line 2
in the example above), or (b) extract the value of the attribute xlink:href (which also
contains the URL: line 1).
The XML tag pattern discussed above is consistent across all NLM Archiving and
Interchange DTDs used for PMC documents. Thus, this single approach ought to be sufficient to
extract URLs from differently formatted PMC documents.
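The three steps above can be sketched in Java, the implementation language of ExtConX2. This is a minimal illustration rather than the actual ExtConX2 code: the class and method names are assumptions, and only the ext-link element and its attributes come from the NLM Tag Suite. It follows option (b) of step 3, reading the URL from the xlink:href attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExtLinkUrlExtractor {

    /** Parses an XML fragment (step 1) and returns the URLs found in
     *  ext-link elements whose ext-link-type attribute equals "uri"
     *  (steps 2 and 3b above). */
    public static List<String> extractUrls(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> urls = new ArrayList<>();
        NodeList links = doc.getElementsByTagName("ext-link");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            // Step 2a: only elements explicitly typed as URIs are treated
            // as external URLs here (footnote 20 notes ftp is also valid).
            if ("uri".equals(link.getAttribute("ext-link-type"))) {
                // Step 3(b): the xlink:href attribute holds the URL even
                // when the element has no visible link text.
                urls.add(link.getAttribute("xlink:href"));
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><ext-link ext-link-type=\"uri\""
                + " xlink:href=\"http://www.who.int\">http://www.who.int"
                + "</ext-link></article>";
        System.out.println(extractUrls(xml)); // [http://www.who.int]
    }
}
```

Because the attribute is read with a namespace-unaware DOM parser, the qualified name "xlink:href" is matched literally, which also handles self-closing ext-link elements (see Section 5.1).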
Determine URL Status:
URL status may be determined programmatically from Hypertext Transfer Protocol (HTTP)
response codes. For instance, common response codes returned when trying to establish a
connection (either through a browser or programmatically) include those listed in Table 18
(Berners-Lee et al., 1996):
Table 18 – HTTP Response Codes
HTTP Response Code Description
HTTP/1.0 200 OK The request was successful: URL accessible
HTTP/1.0 401 Unauthorized Unauthorised access: inaccessible
HTTP/1.0 404 Not Found The resource could not be found: inaccessible
Determine Resource Type:
For each URL extracted, the system must determine the type of resource referenced: for instance, is
the URL a reference to a Databank, Document, Organisation, or Software? (Refer to Section 1.2.1
for the conceptualisation of these terms.)
A potential approach to determining the resource type of a given URL is a mix of rules and
keyword lists, where each list corresponds to a specific resource type. Consider the following
hypothetical example:
1 <ref id="CR9">
2 <citation citation-type="other">
3 The report was provided by World Health Organisation (
4 <ext-link ext-link-type="uri" xlink:href="http://www.who.int/report">
5 http://www.who.int/report
6 </ext-link>
7 ).
8 </citation>
9 </ref>
20. Another valid value for ext-link-type is ftp (File Transfer Protocol).
21. An external URL refers to resources/URLs outside the scope of the article. There exist other (internal) URLs within PMC documents which serve various XML-specific purposes (e.g., namespace declarations); these are not valid external URLs.
A potential solution to determining the referenced resource type is:
1. Analyse the extracted URL string for keywords that characterise specific URL
classes (e.g., report could be used as a keyword indicating the Document resource
type); if unable to determine the resource type, try the next step:
2. Get the URL context (example lines 3-7):
The report was provided by World Health Organisation (http://www.who.int/report).
3. Subsequently, analyse this context (word by word) for keywords, starting from the
location of the URL within the string until the start of the sentence (see bold text in
the example given above).
In this example, report could be used as a keyword to determine the resource type (Document). For
each of the URL types, a list of characteristic keywords will be constructed and used.
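The steps above can be sketched as follows. This is a minimal illustration, not the ExtConX2 implementation: the class and method names are assumptions, and the keyword lists are abbreviated from Table 22.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceTypeGuesser {

    /** Illustrative keyword-to-type list (abbreviated; the full lists are
     *  in Appendix B). Insertion order decides which keyword wins when
     *  several could match. */
    private static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("databank", "Databank");
        KEYWORDS.put("report", "Document");
        KEYWORDS.put("organisation", "Organisation");
        KEYWORDS.put("software", "Software");
    }

    /** Step 1: look for keywords within the URL string itself. */
    public static String fromUrlString(String url) {
        String lower = url.toLowerCase();
        for (Map.Entry<String, String> e : KEYWORDS.entrySet()) {
            if (lower.contains(e.getKey())) return e.getValue();
        }
        return null; // undetermined
    }

    /** Steps 2-3: scan the sentence word by word, from the URL position
     *  backwards to the start of the sentence. */
    public static String fromContext(String sentence, String url) {
        int urlPos = sentence.indexOf(url);
        if (urlPos < 0) return null;
        String[] words = sentence.substring(0, urlPos).split("\\s+");
        for (int i = words.length - 1; i >= 0; i--) {
            String type = fromUrlString(words[i]); // reuse keyword lookup
            if (type != null) return type;
        }
        return null;
    }
}
```

For the running example, `fromUrlString("http://www.who.int/report")` already resolves to Document via the keyword report, so step 1 succeeds and the context scan is not needed.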
4.2.2. IE Module
The IE Module encapsulates the IE application which is responsible for role extraction.
Specifically, given an acknowledgement sentence, the IE Module must enable the identification and
extraction of NEs and their respective REs.
a) Acknowledgement Extraction
A rule-based approach in conjunction with gazetteers may be adopted for role extraction. Apart
from common TM stages previously discussed (see Section 2.1), some notable highlights are:
1. The use of gazetteers to define:
i. NEs: persons and organisations
ii. REs: collaborators and funders (Table 19)
Table 19 – Examples of REs for Collaborators and Funders
Collaborator Roles Funder Roles
Editorial support Financial support
Reviewing the manuscript Grant-in-aid
Helpful comments Grant
Helpful suggestions Funding
2. A rule-based approach applied at the semantic processing level (see Section 2): linking
NEs to their respective REs (Role Matcher: Figure 5).
3. Subsequently, programmatically extract these sets of NEs and corresponding REs (IE) and
insert them into a predefined template/database.
The generic NLP/IE pipeline is given in Figure 5.
Figure 5 - Generic NLP/IE Pipeline
For instance, consider the following acknowledgements:
1. The authors are grateful to John Dough for reviewing the manuscript.
2. This research was funded by BBSRC.
The NLP/IE process is as follows:
a) Get NEs
i. Person NE: John Dough
ii. Organisation NE: BBSRC
b) Get REs
i. Collaborator RE: reviewing the manuscript
ii. Funder RE: funded
c) Identify the respective RE for each NE:
Patterns indicating an association between an NE and an RE, identified from the examples
above, are:
1. NE for RE (collaborator)
2. RE by NE (funder)
Hence, for the given examples, the application of rules identifying these patterns is
sufficient at the semantic processing level.
d) Insert this data into predefined template/database:
Table 20 - Results of TM Process
(1) Named Entity: John Dough
Role (enumeration): Collaborator
Role Expression: reviewing the manuscript
(2) Named Entity: BBSRC
Role (enumeration): Funder
Role Expression: funded
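The two association patterns above can be illustrated outside GATE with plain Java regular expressions. This is only a sketch of the idea: the real Role Matcher is a JAPE transducer operating over annotations, and the class name, patterns, and output format below are assumptions for the two example sentences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoleMatcherSketch {

    // Pattern 1, "NE for RE" -> Collaborator:
    // e.g. "John Dough for reviewing the manuscript"
    private static final Pattern COLLABORATOR = Pattern.compile(
            "(?<ne>[A-Z]\\w+(?: [A-Z]\\w+)*) for (?<re>[a-z][\\w ]+)");

    // Pattern 2, "RE by NE" -> Funder:
    // e.g. "funded by BBSRC"
    private static final Pattern FUNDER = Pattern.compile(
            "(?<re>funded|supported) by (?<ne>[A-Z]\\w+(?: [A-Z]\\w+)*)");

    /** Returns "NE=...;RE=...;ROLE=..." for the first pattern that fires,
     *  or null when neither association pattern is present. */
    public static String match(String sentence) {
        Matcher m = COLLABORATOR.matcher(sentence);
        if (m.find()) {
            return "NE=" + m.group("ne") + ";RE=" + m.group("re")
                    + ";ROLE=Collaborator";
        }
        m = FUNDER.matcher(sentence);
        if (m.find()) {
            return "NE=" + m.group("ne") + ";RE=" + m.group("re")
                    + ";ROLE=Funder";
        }
        return null;
    }
}
```

On sentence (1) the collaborator pattern binds NE = John Dough and RE = reviewing the manuscript; on sentence (2) the funder pattern binds NE = BBSRC and RE = funded, mirroring Table 20.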
4.3. System Architecture
System architecture is the organisation of a system in terms of its software components, including
subsystems and the relationships and interactions among them, and the principles that guide the
design of that software system (Bennett et al., 2006, p.340). System architecture can directly
influence the non-functional features of a system (Bennett et al., 2006). For instance, a subsystems
architecture is known for advantages such as maximising reusability and improving maintainability,
among other things (Bennett et al., 2006). Therefore, the non-functional requirements previously
defined (Section 3.3) have been a central factor in the architectural design and implementation of
ExtConX2.
4.3.1. Subsystems Architecture
The design of ExtConX2 is based on a subsystems architecture, i.e., SoC at the system level, or
subdivision into software components which share some common properties (Bennett et al., 2006).
This means that the system is subdivided into different layers of abstraction, or layers of service,
which are responsible for different aspects of the functionality of the system as a whole (Bennett et
al., 2006, p.350). This approach has several known advantages:
- Maximises reusability
- Helps developers handle complexity
- Improves maintainability
- Aids portability
ExtConX2 has three layers of abstraction:
1. Presentation Layer
The presentation layer is the topmost layer and is responsible for human-computer
interaction (HCI). This layer enables interaction between the user and system
functionalities through a graphical user interface (GUI). A user is able to control/initiate
system functionalities (encapsulated by layer 2, the application layer) through input
parameters, and view the output resulting from the processing of the application layer. The
presentation layer satisfies functional user requirements 1-4 and functional system
requirements 14-15 (refer to Section 3.2).
2. Application Layer
The application layer is responsible for domain logic or domain specific functionalities of
ExtConX2: the core functional requirements of the system (i.e., functional system
requirements 5-11).
3. Database Layer
The database layer encapsulates the relational database management system (RDBMS) and
system specific database operations such as synchronisation between Shared DB and
System DB (i.e., between processed documents and PMC documents available for
processing), retrieval of documents to be processed, and insert data into the System DB.
The database layer satisfies functional system requirements 12-13.
The architecture of ExtConX2 is based on layered subsystems (see Bennett et al. 2006, p.351): any
layer N can only use the services provided by the layer immediately below it (N -1). For instance,
the presentation layer cannot directly use any services provided by the database layer (see Figure
6). This level of abstraction minimises dependencies among layers (and software components) and
facilitates extensibility and maintainability of the system (Bennett et al., 2006).
Figure 6 - ExtConX2 Layered Subsystems
4.4. System Design
This section provides detailed description of the system design, such as: database, application, and
presentation layers. All illustrations provided are based on class implementations. Complete system
designed is provided in Appendix A, Figure 14.
4.4.1. Database Layer
The database layer encapsulates system functionalities, or services, responsible for database
operations. This layer provides services for the application layer directly above it (N + 1).
Figure 7 illustrates the main components of the database layer.
Figure 7 - ExtConX2 Database Layer
a) Description of Database Layer
1. Db Manager - The Db Manager is responsible for maintaining synchronisation between
the Shared Db (containing PMC XML documents) and the System Db. This is achieved by
two methods: one determines the last existing PMC document in the Shared Db, and the
other determines the last processed PMC document stored in the System Db.22
2. Db Traverser - The Db Traverser is responsible for retrieving data from the Shared Db. In
addition, Db Manager is utilised by Db Traverser to ensure synchronisation.
3. Data Inserter - The Data Inserter encapsulates methods to insert processed data into the
System Db.
b) Relational System Schema
Below is the Relational Database Schema used by ExtConX2; the EER diagram may be viewed in
Appendix A, Figure 13. The Shared Db (in part)23 and the System Db are both represented in
Figure 8.
PMC Articles contains PMC articles in XML format, and is linked from the Shared Db.
The System Db contains four relations: Meta Data, URL, Role, and Acknowledgement.
22. Both methods rely on the auto-incremented key and foreign key in the Shared Db and System Db, respectively.
23. Only the relevant relation (PMC-Articles) and attributes of the Shared Db are included in the Relational/EER diagram.
Figure 8 - Relational Database Schema24
4.4.2. Application Layer
The application layer encapsulates domain logic: functional system requirements 5-11. This layer is
further subdivided into three separate modules (see Figure 9):
- URL Module, which contains classes for URL extraction and related processes.
- IE Module, which contains classes for role extraction and related processes.
- Parser Module, which encapsulates classes for parsing and for handling NLM Journal
Archiving and Interchange DTDs.
This subdivision of the application layer into further refined SoC is another example (in addition to
the subdivision at system level) of architectural design which addresses non-functional
requirements of ExtConX2.
24. The different types of arrows are used only for visibility.
Figure 9 - ExtConX2 Application Layer
a) URL Module
The URL Module is responsible for extracting URLs from PMC documents,25 checking whether
each extracted URL is accessible, and determining the type of resource referenced. The URL
Module contains the following classes:
1. URL - The URL class may be described as a super-class; its responsibility includes
extraction of URLs from PMC documents and invoking other operations (i.e., URL Status
and Resource Type). In addition, URL acts as a gateway between the database layer and
application layer (i.e., retrieving PMC documents and returning processed data).
2. URL Status - URL Status checks if a given URL is accessible or not.
3. URL Identifier - URL Identifier is responsible for syntactically validating URLs, and to
identify URL protocols if any (i.e., http:// and ftp://). The latter functionality is used by
URL Status.
25. Not including URLs which are part of the article metadata, i.e., the corresponding prepublication paper and licence (http://creativecommons.org).
4. Resource Type - Resource Type is responsible for collecting the possible types of resource
referenced (i.e., Databank, Document, Organisation, or Software). Refer to Section 4.2.1
for further description.
5. Soft Decision - Soft Decision may be described as a sub-class of Resource Type which
contains a method to determine the most likely URL resource type from a set of collected
possibilities (refer to Section 4.2.1 for description).
b) IE Module
The IE Module encapsulates the TM application which handles role extraction: specifically,
pre-processing of acknowledgement text (i.e., NLP) and subsequent IE (extraction of collaborators
and funders, and their respective REs).
1. IE - The IE class is the super-class within the IE Module; it extracts acknowledgement
text from PMC documents and invokes the IE Application and Role Extractor in order to
complete the acknowledgement extraction sequence.
2. IE Application - The IE Application encapsulates the TM application (designed with
GATE). This class handles the pre-processing of acknowledgement text (including
providing annotations over NEs and their respective REs). Further description is provided in
Section 5.3.
3. Role Extractor - The Role Extractor extracts NEs and their corresponding roles from pre-
processed acknowledgement text.
c) Parser Module
The Parser Module encapsulates the parser and a class to handle NLM Journal Archiving and
Interchange DTDs.
1. Parser - The Parser encapsulates the Document Object Model (DOM) parser used to parse
PMC documents.
2. DTD Resolver - DTD Resolver is responsible for redirecting XML System IDs26 to the
local directory where the NLM Journal Archiving and Interchange DTDs are stored. This
class is needed due to the variety of DTDs required for parsing PMC documents.
26. The System ID is the URI/URL pointing to a given XML document's DTD.
4.4.3. Presentation Layer
The presentation layer encapsulates methods for HCI (Figure 10). It includes the following classes:
Figure 10 - ExtConX2 Presentation Layer
1. Function Panel - This class constructs the function panel or buttons to initiate various
functionalities (e.g., initiating URL extraction and role extraction).
2. Entry Panel - This class constructs the entry panel: e.g., text fields for user input such as
parameters for number of documents to be processed etc.
3. Quitable Frame - This class is responsible for the popup dialog box that asks the user to
confirm before exiting the program.
4. GUI - This class constructs the GUI by invoking other classes.
5. InvokeApp - Acts as a gateway to the application layer, initiating application logic from
user input (see Appendix A, Figure 14).
5. Implementation
This chapter describes the implementation of the main functional requirements of ExtConX2: URL
Module and IE Module (refer to Figure 9). However, these descriptions are not comprehensive, as
only a few of the more noteworthy aspects are included. Other materials not provided in this
dissertation are available on the project website (http://gnode1.mib.man.ac.uk/projects/ExtConX2/).
5.1. Tools & Implementation Environment
Tools used to implement the various components of ExtConX2 include:
1. Java Standard Edition 6 & Java Platform Enterprise Edition 6
Due to wide availability of tools and APIs for TM, Java was used as the main
programming language.
2. Eclipse IDE
Eclipse was used as the development environment.
3. Xerces Java Parser 2.6.1 – Document Object Model Parser
The DOM API is used by ExtConX2 to parse XML documents. While the Simple API for
XML (SAX) uses fewer resources and outperforms the DOM parser in terms of speed
(Frankling, 2010), DOM provides greater flexibility for the tasks required. For instance,
within some PMC documents, and all GATE XML documents (see Section 5.3.3.1),
certain XML tags lack separate closing tags, e.g.:
1 <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov" />
In these cases, SAX does not recognise these tags and is therefore unable to extract these
URLs. However, DOM provides the functionality required.
Descriptions of other tools used are provided by relevant implementation modules/components.
5.2. Implementation of URL Module
This section provides a detailed description of the implementation of the URL Module: specifically,
the method adopted for URL extraction, the method adopted to determine the type of resource
referenced (including soft decision), and a brief description of the implemented process of
extracting a URL and determining its resource type.
5.2.1. Extraction of URLs
Extraction of URLs from PMC documents is achieved through the use of the inherent NLM Journal
Archiving and Interchange Tag Suite together with regular expressions. The two methods
complement each other and achieve better recall and precision than either method on its own. An
analysis of roughly 100 documents showed that it is becoming common practice to provide hyperlink
text within XML documents rather than a visible URL (see the example below). Thus, the sole use of
regular expressions on printable text resulted in poor recall.
1 <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov"> hyperlink</ext-link>
In addition, a clear majority of documents providing visible URLs also include the URLs as
attribute values within the XML tags. Therefore, URL extraction may be achieved solely through
the use of XML tags. However, regular expressions were still used to syntactically validate the
extracted URLs to accommodate human error; this helped improve precision.
The implemented process to extract URLs is described below:
1. Parse PMC document using the DOM parser.
2. Traverse through the parsed document to find XML tags defining URLs (i.e., ext-link).
a. Ensure the latter tag includes the attribute ext-link-type, and that this attribute has
the value uri (i.e., ext-link-type="uri"). This indicates that the XML element
contains a URL.
3. Extract the attribute value of xlink:href which, by inference, ought to be a URL.
4. Finally, the value of xlink:href is syntactically validated as a URL by applying regular
expressions (see Table 21). This step is applied as a precaution against potential
human error.27
Table 21 – Regular Expressions for URL Validation
(((http|https|ftp)://)|(www\\.))+([\\d\\D&&[^\\s(@]])*\\.([\\d\\D&&[^\\s@)]]\\.?)+
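Step 4 can be sketched by compiling the expression from Table 21 exactly as written (the double backslashes are Java string escapes; the class intersection `[\d\D&&[^\s(@]]` means "any character except whitespace, '(' and '@'"). The class and method names below are illustrative, not the ExtConX2 code.

```java
import java.util.regex.Pattern;

public class UrlValidator {

    // The expression from Table 21: a scheme or "www." prefix, then any
    // non-whitespace run containing at least one dot-separated segment.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(((http|https|ftp)://)|(www\\.))+"
            + "([\\d\\D&&[^\\s(@]])*\\.([\\d\\D&&[^\\s@)]]\\.?)+");

    /** Returns true when the whole candidate string is syntactically a URL. */
    public static boolean isValidUrl(String candidate) {
        return URL_PATTERN.matcher(candidate).matches();
    }
}
```

A string such as http://www.who.int validates, while a stray identifier like doi:10.1000/example is rejected because it lacks the required scheme or www. prefix (compare footnote 27 on DOIs placed in URL tags).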
5.2.2. Checking Resource Availability
The Java API URLConnection (specifically, its sub-class HttpURLConnection) is used to check
whether extracted URLs are accessible. A connection request is sent for each extracted URL with a
connection timeout of 10 seconds. The URL is considered accessible/active if an HTTP 200 OK
response code is received (see Table 18). If no response code is returned within 10 seconds, or if
any other response code is received, the URL is considered inaccessible.
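The check described above may be sketched as follows. The class and method names are assumptions; the decision logic is factored into a small pure method so that only HTTP 200 counts as accessible, exactly as stated.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlStatusChecker {

    private static final int TIMEOUT_MS = 10000; // 10-second timeout

    /** Only HTTP 200 OK counts as accessible; every other response code
     *  marks the URL inactive. */
    public static boolean isAccessible(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_OK; // 200
    }

    /** Sends a connection request and maps the outcome to active/inactive. */
    public static boolean checkUrl(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(TIMEOUT_MS);
            conn.setReadTimeout(TIMEOUT_MS);
            return isAccessible(conn.getResponseCode());
        } catch (IOException e) {
            return false; // timeout or connection failure: inactive
        }
    }
}
```

Catching IOException covers both the timeout case and unreachable hosts, so any failure to obtain a response code within 10 seconds yields inactive.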
27. For instance, a common error found is the inclusion of a Digital Object Identifier (DOI) instead of a URL within tags defined for URLs.
5.2.3. Determining Resource Type
For each URL extracted, the system must determine the type of resource referenced (refer to
Section 1.2.1 for the conceptualisation of resource types). The approach used to achieve this end is
rule-based, in conjunction with lists containing keywords (and URLs). The choice of keywords is
based upon iterative testing and analysis of roughly 100 PMC documents, with keywords carefully
chosen to reflect the relevant resource type. Table 22 shows a subset of five keywords used for each
resource type; the full list is provided in Appendix B, Table 36.
Table 22 – Sample of Keywords
Databank Document Organisation Software
data bank .doc organisation software
databank .pdf organization sourceforge
database journal institute program
genBank report international agency application
ncbi.nlm.nih.gov/protein facts - system
Moreover, all keywords are loaded as regular expressions. This has advantages such as:
1. Keywords can easily be matched case-insensitively; separate uppercase and lowercase
spellings of each word are not needed.
2. The grammatical root form of each keyword is sufficient.28 Hence, shorter keyword lists
suffice to fulfil this function.
5.2.3.1. Soft Decision
Soft decision is a method/algorithm used to determine the most likely resource type for each URL
extracted. Up to four instances of the resource type may be determined for each URL mention
through analysis of the URL context (see also the implementation description in Section 5.2.3.2):
1. By keywords identified within the URL string.
2. By keywords identified within the parent node of the URL tag.
Typically, within the reference list the parent node contains the title and/or
description of a reference.
3. By keywords identified within the parent-parent node of the URL tag.
See the previous description. This is needed due to inconsistent use of nodes within
XML documents: some reference titles/descriptions are not contained within the
first parent node, but within the parent-parent node.
4. By keywords identified within the citation context of the article (i.e., the actual sentence
where the resource is cited within the article body).
28. For instance, separate singular and plural forms of each keyword are not needed.
Once all instances have been collected, the data is processed by soft decision.
The soft decision algorithm distributes a total weight of 1 across the four decision instances.
The resource type with the largest accumulated weight is identified as the most likely type; if two
types have equal weight, the first identified is returned as the likely type. The distribution of
weights is based upon an iterative analysis of which decision instance is most reliable, and is
defined as follows:
Table 23 – Distributed Score of Soft Decision Algorithm
Distributed Score Description
1 0.400 Keyword identified within the URL string.
2 0.225 Keyword identified within the parent node.
3 0.225 Keyword identified within the parent-parent node.
4 0.150 Keyword identified by the citation reference within the article body.
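The weighting scheme of Table 23 can be sketched as follows. This is a minimal illustration of the algorithm as described, not the ExtConX2 implementation; the class name and array representation are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SoftDecision {

    // Weights from Table 23, indexed by decision instance 1-4.
    private static final double[] WEIGHTS = {0.400, 0.225, 0.225, 0.150};

    /**
     * Sums the weight contributed by each decision instance per resource
     * type and returns the type with the largest total. A null instance
     * (no keyword found) contributes nothing; on a tie the first type
     * identified wins, which the insertion-ordered LinkedHashMap ensures.
     *
     * @param instances resource types from the four decision instances, in
     *                  order: URL string, parent node, parent-parent node,
     *                  article-body citation (null = undetermined)
     */
    public static String decide(String[] instances) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (int i = 0; i < instances.length; i++) {
            if (instances[i] == null) continue;
            totals.merge(instances[i], WEIGHTS[i], Double::sum);
        }
        String best = null;
        double bestWeight = 0.0;
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            if (e.getValue() > bestWeight) { // strict '>' keeps first on ties
                best = e.getKey();
                bestWeight = e.getValue();
            }
        }
        return best;
    }
}
```

For the worked example in Section 5.2.3.2, three Software instances and one undetermined instance give Software a total weight of 0.400 + 0.225 + 0.225 = 0.85, so `decide` returns Software.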
5.2.3.2. Implementation of URL Module Described
Consider this hypothetical example:
1 <ref id="CR9">
2 <citation citation-type="other">
3 MZmine 2 – software for mass-spectrometry was used in this research(
4 <ext-link ext-link-type="uri" xlink:href="http://www.mzm.sourceforge.net">
5 http://www.mzm.sourceforge.net
6 </ext-link>
7 ); to process the data presented in the results section.
8 </citation>
9 </ref>
The implemented process adopted to determine the referenced resource type is as follows:
1. Parse the document using DOM parser.
2. Traverse through the parsed document to extract the URL.
a. Analyse URL string for keywords (see bold text below). Save the result for
analysis by soft decision.
http://www.mzm.sourceforge.net
b. (1) Get the parent node's (i.e., citation) context (all the text between the citation
start and end tags). (2) Analyse this context (word by word) for keywords, starting
from the location of the URL within this string until the start of the sentence (see
bold text below).
MZmine 2 – software for mass-spectrometry was
used in this research
(http://www.mzm.sourceforge.net); to process
the data presented in the results section.
If unable to determine a resource type, (3) analyse whole citation context (see bold
text below) starting from the beginning of the sentence to the end.
MZmine 2 – software for mass-spectrometry was
used in this research
(http://www.mzm.sourceforge.net); to process
the data presented in the results section.
Save the result for analysis by soft decision.
c. Do the same as in the previous step, but with the parent-parent node context (in this
example the parent-parent node is the ref tag, and the analysis of its context gives
an identical result to the previous step). Save the result for analysis by soft decision.
d. (1) Get the ref element's id attribute value (i.e., CR9), if it exists (if not, return
null). (2) Find this citation within the article body by the reference id (CR9). (3)
Finally, analyse the sentence word by word for keywords, starting from the
location of the citation until the start of that sentence/paragraph. Save the result for
analysis by soft decision.
3. Determine the most likely resource type by soft decision.
a. The soft decision data derived from the example above, based on the keywords
provided in Table 22, would be (Table 24):
Table 24 – Result by Soft Decision Algorithm
Instance Weight Resource Type Description
1 0.40 Software By keyword: sourceforge within the URL string
2 0.225 Software By keyword: software within the parent context
3 0.225 Software By keyword: software within the parent-parent context
4 0.150 null Assuming no identifiable keywords within the article-body
citation.
Hence, the Software resource type would have a total weight of 0.85, so even if the last instance
were identified as any other resource type, Software would still be returned by soft decision as the
most likely resource type.
5.3. Implementation of IE Module
This section provides a detailed description of a subset of the implementation of the IE Module
(refer to Figure 9). It presents the methods adopted for identification and extraction of NEs, REs,
and the semantic level processing.
5.3.1. GATE
GATE was used to develop the IE Application for the extraction of acknowledgements. While
many alternatives exist, such as LingPipe or MinorThird, GATE was chosen for its extensive
documentation, user-friendly IDE for debugging and development, and easy integration
with Java.
GATE's default IE system, A Nearly-New Information Extraction System (ANNIE), was used as
a starting point for the development of the IE Application. ANNIE contains a set of default
processing resources, mostly based on the Java Annotation Pattern Engine (JAPE)29 (see the
default ANNIE pipeline in Appendix A, Figure 15), which were amended and extended as
required to meet the requirements of this module.
5.3.2. Java Annotation Pattern Engine
The Java Annotation Pattern Engine (JAPE) is a rule-based language which provides finite state
transduction over annotations (Cunningham et al., 2010), enabling various IE tasks through the
manipulation of existing annotations and the creation of new ones. A JAPE grammar may be split
into a set of phases, consisting of patterns and action rules, that run sequentially (Cunningham et
al., 2010) in a customised, defined order. The ability to create sequential pattern/action rules
enables complex extraction patterns to be simplified into incremental rules (see Section 5.3 for an
example).
A JAPE rule consists of two primary parts: the left-hand side (LHS) and the right-hand side (RHS).
The LHS consists of rule-based pattern description(s), and the RHS consists of action rules, or
annotation manipulation statements. The JAPE syntax used for pattern descriptions is quite similar
to the regular expression syntax of most programming languages, hence no description of the
syntax is provided here (refer to Cunningham et al., 2010, Chapter 8). The following example is a
simplified JAPE rule which identifies the pattern of two consecutive, upper-initial proper nouns and
subsequently labels them as Person (the syntax is described in the comments):
[29] JAPE is based on the Common Pattern Specification Language (CPSL).
Phase: AnnotatePerson  // Phase name or identifier for the rule

// The input annotation type must be defined (e.g., annotated by the
// POS tagger) so that it can be used by the pattern description
Input: Token

Rule: Person1  // Rule name
(
  // Pattern: NNP NNP (with uppercase initials)
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
):temp  // Temporary label

-->  // Everything above this symbol is the LHS; everything below, the RHS

// Convert the temporary label to a permanent annotation/label: Person
:temp.Person = {rule = "Person1"}
5.3.3. Implementation of IE Module Described
Similarly to Giles and Councill (2004), the NLP/IE process is not applied to entire PMC documents, but only to the extracted acknowledgement sections. The general process for role extraction is as follows:
1. PMC documents are parsed using a DOM parser.
2. The acknowledgement section is extracted using the NLM Journal Archiving and Interchange DTD tag: ack.
3. Subsequently, this text is processed using the IE Application developed with GATE (refer to Figure 9). The output of this process is a GATE XML document containing a dump of the annotations (i.e., NEs and their respective REs) in XML format.
4. The GATE XML document is programmatically parsed to extract the NEs and their respective REs, which are then inserted into the System Db.
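Step 2 above can be sketched as follows. This is a minimal illustration assuming a well-formed XML input; the class and method names are hypothetical, not the system's actual implementation:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class AckExtractor {
    // Returns the text content of the first <ack> element, or null if absent.
    static String extractAck(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList acks = doc.getElementsByTagName("ack");
        return acks.getLength() > 0 ? acks.item(0).getTextContent().trim() : null;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<article><back><ack><p>We thank Dr Melvin Simon.</p>"
                      + "</ack></back></article>";
        System.out.println(extractAck(sample)); // We thank Dr Melvin Simon.
    }
}
```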
Figure 11 - IE Application Pipeline
5.3.3.1. Description of IE Application
Of the eight processing resources used for the text pre-processing and IE tasks (Figure 11), four are custom designed: the Gazetteer (partially), the NE-Extended Transducer, the Role Expression Transducer, and the Role Context Transducer. The latter three are developed using JAPE. A description of these processing resources, with some implementation examples, follows:
1. Gazetteer - The ANNIE gazetteer, which is used for named entity recognition (by default), is further extended to accommodate role extraction.[30] In particular:
i. The organisations list is extended with known funding organisations.[31]
ii. Role Expression lists are added, containing collaboration and funder roles (see Tables 25 and 26). Each type of role has two separate lists: (1) a multi-word and (2) a one-word list. This enables the prioritisation of multi-word roles at the semantic level of processing, which yields better evaluation results (one-word roles tend to result in only partial identification of roles).

[30] Extended lists are available on the project website (http://gnode1.mib.man.ac.uk/projects/ExtConX2/)
[31] Resources used for collecting research funding organisation names include: Wikipedia (2010), NIH (2010), and Giles and Councill (2004).
Table 25 – Sample of One-Word Role Expression Lists
Funding Roles Collaboration Roles
Grant-In-Aid advice
grants assistance
sponsor discussions
sponsored comments
sponsors encouragement
Table 26 – Sample of Multi-Word Role Expression Lists
Funding Roles Collaboration Roles
financially supported assistance and comments
fellowship award critically reading
financial support critically reviewing the manuscript
research fund helpful comments
research funds technical assistance
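The prioritisation of multi-word over one-word role lists can be sketched as a longest-match-first lookup. The following is a minimal illustration only; the class, method, and the exact list entries are assumptions, not the system's gazetteer implementation:

```java
import java.util.List;

public class RoleLookup {
    // Hypothetical gazetteer lists mirroring Tables 25 and 26.
    static final List<String> MULTI = List.of("financial support", "helpful comments");
    static final List<String> ONE   = List.of("support", "comments", "advice");

    // Multi-word entries are tried first, so that "financial support"
    // wins over the partial one-word match "support".
    static String matchRole(String text) {
        for (String r : MULTI) if (text.contains(r)) return r;
        for (String r : ONE)   if (text.contains(r)) return r;
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchRole("thanked for financial support")); // financial support
    }
}
```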
2. NE-Extended Transducer - The ANNIE Gazetteer provides annotation of named entities (e.g., persons: first and last names); subsequently, the ANNIE NE Transducer, which is based on JAPE, contains rules that manipulate these annotations further to create, e.g., person full names (i.e., the linking of first and last names). The NE-Extended Transducer is required to complement the ANNIE Gazetteer and NE Transducer. Initial testing of the ANNIE system showed that a considerable number of NEs were missed, in particular non-English names. This resource was needed to improve the performance of the semantic processing resource (i.e., the Role Context Transducer), which links NEs to their respective REs.[32] Below is a simplified version of a rule used; see the inline comments for descriptions:
Rule: PersonExt1  // Rule name or identifier

/* VB - verb, base form: subsumes imperatives, infinitives and
 * subjunctives.
 * VBP - verb, non-3rd person singular present.
 * Target: e.g., "thank", "grateful", and so on
 */
({Token.kind==word, Token.category==VB}|
 {Token.kind==word, Token.category==VBP})

// Any word token that is not a Person or Organization.
// Target: e.g., 'to', 'for', and so on.
({Token.kind==word, !Token.orth==upperInitial,
  !Person, !Organization})?

/* NNP - proper noun, singular: all names are typically
 * capitalised.
 * Create a temporary label over the following pattern, given
 * that the preceding patterns are true.
 */
(
 {Token.kind==word, Token.category==NNP,
  Token.orth==upperInitial, !Person, !Organization}
 {Token.kind==word, Token.category==NNP,
  Token.orth==upperInitial, !Person, !Organization}
):temp  // Temporary label

-->  // LHS --> RHS

/* Convert the temporary label, "temp", to the permanent label
 * "Person" with the given features:
 * "rule = PersonExt1" and "rule1 = PersonFull"
 */
:temp.Person = {rule = "PersonExt1", rule1 = "PersonFull"}

[32] This is because the role association rules depend on the correct functioning of the NER system.
The above rule annotates two consecutive proper nouns with uppercase initials (NNP) as Person, given that the NNPs have not already been annotated by the default ANNIE resources and that they are preceded by a verb (either the base form, which subsumes imperatives, infinitives and subjunctives (VB), or the non-3rd-person singular present (VBP)). Relevant VB/VBP tokens include: thank, grateful, etc. For instance, given the following sentences:
i. We are grateful to Jong Zang...
ii. We thank Youm Dom...
Jong Zang and Youm Dom will be annotated as Person by the rule provided above.
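As a rough analogue, the behaviour of this rule can be approximated with a regular expression (an approximation only — the real rule operates over POS-tagged GATE annotations, not raw text, and the pattern below is an assumption):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PersonPattern {
    // Rough regex analogue of PersonExt1: a trigger verb, an optional
    // intervening function word, then two capitalised tokens.
    static final Pattern P = Pattern.compile(
        "\\b(?:thank|grateful)\\b(?: \\w+)? ([A-Z][a-z]+ [A-Z][a-z]+)");

    static String person(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(person("We are grateful to Jong Zang for help.")); // Jong Zang
        System.out.println(person("We thank Youm Dom for support."));         // Youm Dom
    }
}
```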
3. Role Expression Transducer - In addition to the use of lists (gazetteer) to identify collaboration roles, a JAPE grammar is used. In fact, using a JAPE grammar to identify REs yields better performance than lists, because there are far too many varieties of collaboration roles to enumerate in lists. For instance, consider the following acknowledgement:
i. We thank Youm Dom for constructive feedback and providing GEO212.
The following rule (RoleExpression1), assuming that Youm Dom is annotated as Person, annotates the RE over the following part of the sentence (see bold text):
ii. We thank Youm Dom for constructive feedback and providing GEO212.
The latter annotation would also have been produced by the gazetteer. However, this is only a partial identification of the RE (i.e., providing GEO212 is missing). A description follows:
Rule: RoleExpression1  // Rule name
(
 {Person}  // Annotated NE: Person

 // The NE may be (note the use of "?") followed by '( ..... )',
 // typically containing associations
 ({Token.string=="("} ({Token})* {Token.string==")"})?

 // There might exist additional words between {Person} and
 // ['for'|'who']
 ({Token.kind==word, !Person}{Token.kind==word, !Person}|
  {Token.kind==word, !Person}{Token.kind==word, !Person}{Token.kind==word, !Person})?

 // The NE must be followed by 'for' or 'who' - indicating the
 // beginning of a role expression
 ({Token.string=="for"}|{Token.string=="who"})

 // PRP$ - possessive pronoun. Target cases: his, her,
 // or their (may exist)
 ({Token.category=="PRP$"})?
)

/* Annotate the following tokens/words as a role (with a temporary
 * label).
 * End the annotation if the negation cases are true: [.,;] or 'and'
 */
(
 ({Token, !Token.string==~"[.,;]", !Token.string=="and"})*
):role  // Temporary label

-->  // LHS --> RHS

// Convert the temporary label to the permanent label RoleExpression1
// with the given features.
:role.RoleExpression1 = {kind = "PersonCollab", rule = "CollabRule1"}
The acknowledgement example given above involves two challenges: (1) the RE extends over a conjunction (i.e., and), which could also indicate the end of an RE, and (2) the RE itself cannot be enumerated prior to processing the text (as previously discussed). JAPE provides the facility to split rules into a set of separate rules/phases in order to handle this sort of complexity. This approach has been adopted for annotating REs.[33]
The following rule (RoleExpression3) is applied to the text after RoleExpression1; hence, continuing the prior example, the result of the RE annotation would be as follows (see bold text):
iii. We thank Youm Dom for constructive feedback and providing GEO212.

[33] The use of phases has been adopted for all three processing resources developed: the NE-Extended Transducer, the Role Expression Transducer, and the Role Context Transducer.
Rule: RoleExpression3  // Rule name
(
 // Annotated Role Expression (derived from the previous rule)
 ({RoleExpression1})

 // Ensure RoleExpression1 is not followed by a new
 // acknowledgement
 ({Token.string=="and"}{!Person})
 ({Token.kind==word, !Person, !Token.string=="and"})?
 ({Token, !Token.string=="and", !Person, !Organization,
   !Token.string==~"[.,;]"})*
):temp  // Temporary label

-->  // LHS --> RHS

:temp.RoleExpression3 = {kind = "PersonCollab", rule = "CollabRule3"}
4. Role Context Transducer - The Role Context Transducer is responsible for the semantic level processing: the linking of NEs to their respective REs. This is the last processing resource applied before the extraction of the annotated roles. Below are examples of rules which link NEs (i.e., organisations) to their corresponding REs.
Consider the following example:
i. National Institute of Health provided funding for this research.
The application of the following rule (OrgFund1) results in the annotation of the NE and the RE (which is annotated using the customised RE lists discussed earlier) as a role context (see bold text):
ii. National Institute of Health provided funding for this research.
Rule: OrgFund1  // Rule name
(
 {Organization}  // Annotated NE: Organisation

 // There might exist a word between the NE and the RE (e.g., provided)
 ({Token.kind==word, !Token.string==","})?

 /* Find Gazetteer-annotated REs: Funder/Sponsor,
  * priority given to 'multi-word' roles.
  */
 ({Lookup.majorType==role_fund, Lookup.minorType==multi_word}|
  {Lookup.majorType==role_fund, Lookup.minorType==one_word})
):temp  // Temporary label

-->  // LHS --> RHS

/* Create a new annotation from the temporary label */
:temp.roleContext = {rule="OrgFund1"}
Another useful feature of JAPE, utilised by the Role Context Transducer, is that it enables the prioritisation of rules which are applied sequentially or in a cascade. For instance, if rules may overlap, prioritisation weights may be applied accordingly, prioritising one rule or set of annotations over another. The following rule (OrgFund2), which is applied after the previous rule, identifies a funder RE (as annotated by the gazetteer) and annotates the whole sentence as a role context. This method is practically feasible as acknowledgements of funders are typically separated by sentence.
Rule: OrgFund2
(
 /* Find Gazetteer-annotated roles: Funder/Sponsor,
  * priority given to 'multi-word' roles.
  */
 ({Lookup.majorType==role_fund, Lookup.minorType==multi_word}|
  {Lookup.majorType==role_fund, Lookup.minorType==one_word})

 /* Label all tokens until the end of the sentence */
 ({Token, !Split})*
):temp  // Temporary label

-->  // LHS --> RHS

/* Create a new annotation from the temporary label */
:temp.roleContext = {rule="OrgFund2"}
5.3.4. Information Extraction
Following the text pre-processing, the IE Application (refer to Figure 9) returns a GATE XML document (see the sample GATE document at the end of this section). The process adopted to extract roles is as follows:
1. Parse the GATE XML document using DOM parser.
2. Find annotation type: Role Context and store its StartNode and EndNode values (e.g., 9 and
87 respectively in the example below) for later reference:
28 <Annotation Id="1924" Type="roleContext" StartNode="9" EndNode="87">
3. Get annotation type Role Expression within the range of Role Context StartNode and
EndNode:
34 <Annotation Id="1923" Type="RoleExp" StartNode="29" EndNode="87">
35 <Feature>
36 <Name className="java.lang.String">rule </Name>
37 <Value className="java.lang.String">CollabRule3</Value>
38 </Feature>
39 <Feature>
40 <Name className="java.lang.String">kind</Name>
41 <Value className="java.lang.String">PersonCollab</Value>
42 </Feature>
43 </Annotation>
61 | P a g e
4. Determine the type of Role Expression by appropriate child node. In this example (see
above: line 41), it is identified as a Person Collaboration role (hence, NE: Person and RE:
Collaboration). In addition, store Role Expression StartNode and EndNode values (i.e., 29
and 87 respectively) for later reference.
5. Get annotation type Person within the range of Role Context, and store its StartNode and
EndNode values (i.e., 9 and 24 respectively) for later reference:
44 <Annotation Id="1921" Type="Person" StartNode="9" EndNode="24">
45 <Feature>
46 <Name className="java.lang.String">rule</Name>
47 <Value className="java.lang.String">PersonFinal</Value>
48 </Feature>
49 </Annotation>
6. Extract NE: Person and Role Expression by previously stored node values from the
document content area (which contains the acknowledgement text with serialised nodes
corresponding to annotations):
10 <TextWithNodes>
11 <Node id="0" />We<Node id="2" />
12 <Node id="3" />thank<Node id="8" />
13 <Node id="9" />Dr<Node id="11" />
14 <Node id="12" />Melvin<Node id="18" />
15 <Node id="19" />Simon<Node id="24" />
16 <Node id="25" />for<Node id="28" />
17 <Node id="29" />critical<Node id="37" />
18 <Node id="38" />reading<Node id="45" />
19 <Node id="46" />of<Node id="48" />
20 <Node id="49" />the<Node id="52" />
21 <Node id="53" />manuscript<Node id="63" />
22 <Node id="64" />and<Node id="67" />
23 <Node id="68" />helpful<Node id="75" />
24 <Node id="76" />discussions<Node id="87" />.<Node id="88" />
25 </TextWithNodes>
7. The result of role extraction from the given example is provided in Table 27. The full GATE XML document used in this example follows.
Table 27 - Results of Role Extraction
(1) Name Expression: Dr Melvin Simon
Role (enumeration): Collaborator
Role Expression: Critical reading of the manuscript and helpful
discussions
Gate XML Document:
1 <GateDocument>
2 <!-- The document's features-->
3 <GateDocumentFeatures>
4 <Feature>
5 <Name className="java.lang.String">MimeType</Name>
6 <Value className="java.lang.String">text/plain</Value>
7 </Feature>
8 </GateDocumentFeatures>
9 <!-- The document content area with serialised nodes -->
10 <TextWithNodes>
11 <Node id="0" />We<Node id="2" />
12 <Node id="3" />thank<Node id="8" />
13 <Node id="9" />Dr<Node id="11" />
14 <Node id="12" />Melvin<Node id="18" />
15 <Node id="19" />Simon<Node id="24" />
16 <Node id="25" />for<Node id="28" />
17 <Node id="29" />critical<Node id="37" />
18 <Node id="38" />reading<Node id="45" />
19 <Node id="46" />of<Node id="48" />
20 <Node id="49" />the<Node id="52" />
21 <Node id="53" />manuscript<Node id="63" />
22 <Node id="64" />and<Node id="67" />
23 <Node id="68" />helpful<Node id="75" />
24 <Node id="76" />discussions<Node id="87" />.<Node id="88" />
25 </TextWithNodes>
26 <!-- The default annotation set -->
27 <AnnotationSet>
28 <Annotation Id="1924" Type="roleContext" StartNode="9" EndNode="87">
29 <Feature>
30 <Name className="java.lang.String">rule</Name>
31 <Value className="java.lang.String">PersonCollab1</Value>
32 </Feature>
33 </Annotation>
34 <Annotation Id="1923" Type="RoleEntity" StartNode="29" EndNode="87">
35 <Feature>
36 <Name className="java.lang.String">rule </Name>
37 <Value className="java.lang.String">CollabRule3</Value>
38 </Feature>
39 <Feature>
40 <Name className="java.lang.String">kind</Name>
41 <Value className="java.lang.String">PersonCollab</Value>
42 </Feature>
43 </Annotation>
44 <Annotation Id="1921" Type="Person" StartNode="9" EndNode="24">
45 <Feature>
46 <Name className="java.lang.String">rule</Name>
47 <Value className="java.lang.String">PersonFinal</Value>
48 </Feature>
49 </Annotation>
50 </AnnotationSet>
51 </GateDocument>
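The extraction steps above rely on the fact that GATE node ids are character offsets into the document text, so once an annotation's StartNode and EndNode values have been read from the XML, its span can be recovered with a simple substring. A minimal sketch (the class and method names are hypothetical), using the offsets from the sample document:

```java
public class GateRoleParser {
    // GATE node ids are character offsets into the document text,
    // so an annotation's span is simply text.substring(start, end).
    static String span(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "We thank Dr Melvin Simon for critical reading of the "
                    + "manuscript and helpful discussions.";
        // Offsets taken from the sample annotations above:
        // Person (9, 24) and RoleEntity (29, 87).
        System.out.println(span(text, 9, 24));  // Dr Melvin Simon
        System.out.println(span(text, 29, 87)); // critical reading ... discussions
    }
}
```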
6. Evaluation
This chapter presents and discusses the evaluation of the methods adopted and the results obtained from the facts analysed during the knowledge discovery stage of the dissertation. The chapter is subdivided into three main sections: (1) the evaluation of URL Extraction, (2) the evaluation of Role Extraction, and (3) a discussion of system issues.
The ExtConX2 IE tasks were evaluated using customary recall- and precision-based metrics. Table 28 defines the evaluation terms used in the subsequent definitions of Precision (P), Recall (R), and F-measure (F).
Table 28 – Evaluation Terms Described
              Relevant               Non-relevant
Extracted     True positives (tp)    False positives (fp)
Not Extracted False negatives (fn)   True negatives (tn)

P = tp / (tp + fp)    R = tp / (tp + fn)    F = (2 × P × R) / (P + R)
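These metrics can be computed directly from the counts defined in Table 28. A minimal sketch (the counts used in main are illustrative only, not taken from the evaluation data):

```java
public class Metrics {
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double f1(double p, double r)    { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        // Illustrative counts only.
        double p = precision(90, 7);
        double r = recall(90, 43);
        System.out.printf("P=%.3f R=%.3f F=%.3f%n", p, r, f1(p, r));
    }
}
```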
6.1. URL Extraction
Roughly 190,000 PMC documents were processed; of these, 47,644 contained a total of 147,133 URLs (95,799 unique). Based on the evaluation of a random sample of 50 documents (222 URLs), the adopted approach achieved 98.6% precision and 96% recall for URL extraction (see Appendix C for the evaluation data). In addition, the soft decision algorithm achieved a recall of 81.1% and a precision of 88.7% for the classification of resources.
The most referenced resource types, by total number of extracted URLs, are presented below (Table 29). As expected, the Document resource type is referenced most of all. However, an interesting discovery is the percentage of references to the Software type (see Chapter 7).
Table 29 – Total Resource Type Referenced
URL Resource Type    Total Identified URLs    % of Total
None                 16,865                   11.46%
Databank             15,409                   10.47%
Document             72,197                   49.07%
Organisation         7,353                    5.00%
Software             35,309                   24.00%
TOTAL:               147,133                  100%
Table 30 provides a summary of accessible/inaccessible online resources by the URLs' year of publication:[34]
Table 30 – Resource Availability by Year
Year    Total URLs    Accessible URLs    Inaccessible URLs    % Inaccessible by Year
2010    1,382         1,248              134                  9.70
2009    42,995        38,251             4,744                11.03
2008    37,790        32,242             5,548                14.68
2007    26,133        21,874             4,259                16.30
2006    16,669        13,390             3,279                19.67
2005    9,932         7,561              2,371                23.87
2004    6,745         4,910              1,835                27.21
2003    2,561         1,827              734                  28.66
2002    1,659         1,179              480                  28.93
2001    729           470                259                  35.53
2000    251           172                79                   31.47
1999    186           115                71                   38.17
Table 30 is illustrated by the following figure (Figure 12):
Figure 12 – URL Decay
As is evident from Figure 12, the notion of URL decay, as found in citations (see Wren 2004; Wren 2008), applies equally to full-text journals. The trend may be described as a function of publication year: the older the publication, the fewer accessible resources exist within it.

[34] The summary is based on a total of 147,032 URLs, not 147,133, because the metadata for the articles containing the remaining 101 URLs was not extracted.
[Figure 12 plots the percentage of accessible resources (y-axis: % of Accessible Resources, 0-100) against publication year, 2010 to 1999.]
6.1.1. Discussions
This section provides some discussions of facts presented in regards to URLs within PMC
documents and the underlying implementation of ExtConX2 which may have affected those facts.
Potential suggestions or improvements are also provided.
(1) FTPs
One limitation of the data presented is that FTP URIs were not checked for availability. However, analysis of the extracted data showed that of the 147,133 URLs extracted, only 791 (or 0.5%) were FTP URIs; hence the impact on the statistics presented is minimal.
(2) Resource Availability
The method adopted to check resource availability has some weaknesses. As availability is only checked once, before insertion into the database, the accuracy of the availability results may be affected. For instance, web servers do not have 100% up-time or unlimited capacity for online traffic, and either factor may have influenced the results. A better approach to maximise the accuracy of URL availability would be to implement an additional module which crawls the database and updates the URL status appropriately. For instance, Wren's (2004, 2008) approach would be ideal: URLs were checked every day over a four-week period, and any URL which was accessible over 90% of the time was deemed an active resource.
In addition, due to the project time constraint, the implementation for checking URL availability had a 10-second time-out limit.[35] As some web servers take longer to respond to HTTP requests, this limit may have affected the results presented.[36]
(3) Soft Decision and Resource Identification
Approximately 10.5% of the resources identified were incorrectly classified, and 8.5% were not identified at all. A manual review of these documents (and others) shows two primary issues with the implementation. The use of keywords to identify resources failed due to (1) a lack of keywords within the citation context indicating the type of resource, and (2) resource types not accounted for in the implementation, e.g., laboratory tools and equipment. The latter limitation may be addressed by creating a new list of keywords that characterises laboratory tools and equipment and making some minor amendments to the implementation to facilitate an additional resource type.
[35] Wren's (2004, 2008) implementation had a 60-second time-out limit, which is probably a more appropriate value. However, as the URL data was loaded into the database during the last 10 days of the dissertation, a 60-second timeout would have taken around 12 days (considering the existing system issues: see Section 6.3).
[36] Testing of the implementation confirmed cases which took more than 10 seconds to confirm the accessibility of URLs.
Moreover, both the soft decision algorithm (i.e., the distributed weights applied to instances) and the method used for resource classification could be further improved. For instance, consider the following generic citation, similar to examples found in the manually analysed documents, which the soft decision algorithm failed to classify:
1. James [1] proved that the method has good performance.
This example does not include any keywords per se enabling classification of the referenced resource. However, the citation style (James [1]) indicates the Document type. Thus, regular expressions matching the pattern 'NE [NUMBER]' could be applied as an additional method alongside the keyword lists.
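The suggested pattern could be implemented with a simple regular expression. The following sketch is illustrative only; the class name and the exact pattern are assumptions, not part of the implemented system:

```java
import java.util.regex.Pattern;

public class CitationPattern {
    // Hypothetical pattern: a capitalised name followed by a bracketed
    // reference number, e.g. "James [1]", taken to indicate a Document resource.
    static final Pattern CITE = Pattern.compile("\\b[A-Z][a-z]+ \\[\\d+\\]");

    static boolean looksLikeDocumentCitation(String s) {
        return CITE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDocumentCitation(
            "James [1] proved that the method has good performance.")); // true
    }
}
```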
6.2. Role Extraction
The adopted rule-based approach to role extraction (i.e., extraction of NEs and corresponding REs)
achieved a recall of 67.6% and precision of 92.6% and F-score of 77.7%. The NER achieved a
recall of 69.9% and precision of 95%, and the extraction of REs achieved a recall of 75% and
precision of 97.6%. The evaluation was based on a random sample of 50 documents. From the
whole PMC dataset processed 86,751 acknowledgements were extracted, 71,615 of these were
identified as containing roles.
(1) Evaluation Principles
The evaluation was guided by the following principles:
- Acknowledgements of NEs with no roles were not considered.
- Acknowledgements of entities that were neither individuals nor organisations (e.g., laboratory staff, teams/groups, etc.) were not considered.
In addition, some acknowledgements, in particular of organisations, could contain two valid REs. In such cases, either role extracted was considered a true positive. For instance, in the following example both supported and grant are considered true positives:[37]
1. This work was supported by NIH grant

[37] For the evaluation of extracted REs, this example would be considered as containing one RE, and either one extracted would be counted as a true positive.
Acknowledgements of multiple NEs with an identical RE were considered as separate acknowledgements. For instance, the following acknowledgement would be considered to contain three separate roles (see Table 31):
2. We like to thank John Dough, Jim Baker, Zoe Zindan for reviewing the manuscript.
Table 31 – True Positives: Role Extraction
(1) Name Entity: John Dough
Role Expression: reviewing the manuscript
(2) Name Entity: Jim Baker
Role Expression: reviewing the manuscript
(3) Name Entity: Zoe Zindan
Role Expression: reviewing the manuscript
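The principle above — one RE fanned out over several NEs — can be sketched as follows. This is a minimal illustration; the class name and the splitting logic are assumptions, not the system's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class RoleFanOut {
    // Distributes one role expression over several comma/'and'-separated NEs,
    // producing one (NE, RE) pair per name, as in Table 31.
    static List<String[]> fanOut(String names, String role) {
        List<String[]> out = new ArrayList<>();
        for (String n : names.split(",| and ")) {
            n = n.trim();
            if (!n.isEmpty()) out.add(new String[]{n, role});
        }
        return out;
    }

    public static void main(String[] args) {
        fanOut("John Dough, Jim Baker, Zoe Zindan", "reviewing the manuscript")
            .forEach(p -> System.out.println(p[0] + " -> " + p[1]));
    }
}
```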
(2) Extracted Facts
Table 32 shows the most acknowledged funding organisations within PMC. As the role extraction system does not handle acronyms prior to IE (i.e., organisations and their corresponding acronyms are extracted as separate roles), additional manual analysis was needed to produce this result. In addition, some organisations have identical names in different countries; for instance, a National Cancer Institute exists in both the US and Canada. This was not taken into consideration. However, the other organisations presented (Table 32) are unique, either by country or globally.
Table 32 – Most Acknowledged Funding Organisation
Name of Funding Organisations    Total Nr. of Acknowledgements
1 National Institutes of Health 10,613
2 National Science Foundation 3,099
3 Wellcome Trust 2,287
4 European Union 1,443
5 Deutsche Forschungsgemeinschaft 1,301
6 National Cancer Institute (US and Canada) 1,114
7 Canadian Institutes of Health Research 928
8 Biotechnology and Biological Sciences Research Council (BBSRC) 829
9 European Commission 746
10 National Health and Medical Research Council (NHMRC) 663
11 National Natural Science Foundation of China 548
12 Swedish Research Council 538
13 Swiss National Science Foundation 467
6.2.1. Discussions
The overall performance of the IE task was quite poor in terms of recall, due to a combination of factors, the most notable being the performance of the NER. As both the RE Transducer and the Role Context Transducer (refer to Section 4.4.2) rely on the good performance of the NER, a domino effect led to the overall poor performance. A description of the NER, the RE Transducer, and the Role Context Transducer follows:
(1) NER
A couple of issues with the NER processing resources were non- or partial recognition of (1) non-English names and (2) multi-word organisation NEs.
NEs that did not adhere to the customary orthographical rules of English name spelling (i.e., capitalised initials of NNPs) accounted for a significant number of cases. Common examples included Italian names, e.g., Marco de Bartol (note the lowercase particle), and Chinese names, which often adhere to English orthography but include two-letter NNPs, e.g., Hurng-Yi Wang, which were not recognised by the NER.
Another issue was the non-recognition of multi-word organisations. Some examples from the extracted data include:
i. Ministry of Health, Labour and Welfare of Japan
ii. Ministry of Education, Science, Sports and Culture of Japan
iii. Mental Illness Research, Education and Clinical Centre
A potential approach to this issue would be lexical level processing, such as the expansion of the gazetteer. While around 150 organisation names were added during the development process, this was clearly inadequate.
(2) RE Transducer
Factors affecting the performance of the RE Transducer (the labelling of collaboration and funder roles) include (1) the poor performance of the NER system and (2) limitations in the variety of rules used.
The sole pattern used for labelling collaboration roles was (see Table 33 for an explanation):[38]
i. [Person] [for|who|provided] [PRP]? [ROLE]

[38] The given pattern is somewhat simplified, but represents the generic rule applied in the RE Transducer.
Table 33 – Description of RE Transducer Rule
Pattern Description
[Person] NE: person
[for|who|provided] Word token: for, who, or provided
[PRP]? Possessive pronoun: his, her, their, etc. (may or may not exist).
[ROLE] The role being labelled, if and only if the preceding patterns were matched.
Thus, roles that did not adhere to the above pattern were ignored. Below is a common example
identified during the evaluation of the system (NEs are in bold):
i. We like to thank Jim Dough, John Stew, and John Crow from Manchester University, UK,
for helping with the laboratory work.
Here, as no NE immediately precedes the relevant RE (i.e., helping with the laboratory work), the processing resource fails to identify the RE. See the discussion of the Role Context Transducer for an example of the RE Transducer failing to identify an RE due to the poor performance of the NER.
(3) Role Context Transducer
The performance of the Role Context Transducer is almost entirely dependent on preceding
resources, in particular, the NER and RE Transducer. The semantic level processing uses an
identical pattern used by the RE Transducer. However, in contrast, a NE or consecutive NEs which
are followed by a RE (identified by prior processing resource) are collectively labelled as Role
Context. Given that the NER and RE Transducer have correctly identified existing NEs and a RE,
the following example illustrates the ideal result of the application of the Role Context Transducer
(see highlighted text):
i. We are indebted to Brian Boyle, Mark Andersen, and Jeffrey Dean for critically
reviewing the manuscript.
However, due to the domino effect initiated by the poor performance of the NER, the performance of the Role Context Transducer, and therefore the evaluation results, were affected. The following examples illustrate a couple of common results observed during the evaluation stage (identified NEs are in bold and the identified RE is in bold and underlined):
i. We are indebted to Michel Cusson, Pierre Fobert, Frédéric Vigneault, Brian Boyle, Mark
Andersen, and Jeffrey Dean for critically reviewing the manuscript.
ii. We are indebted to Michel Cusson, Brian Boyle, Mark Andersen, and especially Jeffrey
Dean for critically reviewing the manuscript.
In the first example, Mark Andersen is not identified as an NE by the NER process. Therefore, as the Role Context Transducer relies on either consecutive NEs[39] or a single NE followed by an RE, only 1 out of 6 roles is identified by the Role Context Transducer.
In the second example, the NER processing has failed to identify Jeffrey Dean; hence, the RE Transducer is unable to identify any RE, and subsequently the Role Context Transducer fails to identify any roles.
This domino effect initiated by the poor performance of the NER was one of the most significant issues of the IE application. This limitation may be addressed by expanding the gazetteer and adding rules for the recognition of non-English NEs.
6.3. System Limitations
The following environment (Table 34) was used during the development and evaluation of
ExtConX2:
Table 34 - Development and Evaluation Environment
Nr. Environment Value
1 Operating System Windows 7 Home Edition 32-bit
2 Database Server MySQL 5.0
3 Processor Intel Core2 Solo 1.4Ghz
4 Memory Ram 2GB
5 JVM Maximum Memory 512MB
The following sections discuss a couple of specific software issues uncovered during the evaluation stage:
(1) URL Module
The current implementation of the URL-availability check contains a bug inherited from
the Java API used to open connections (i.e., HttpURLConnection). While the cause has not
been definitively confirmed, it appears to be caused by servers that do not allow programmatic
HTTP connections; this is assumed because none of the URLs checked manually were unavailable
or syntactically invalid. Furthermore, the API freezes when trying to obtain a response code
from the host to determine whether the URL is accessible. This issue can be solved by the use of
threads: if no response is received within a certain amount of time, the thread can safely be
terminated (without affecting any concurrent processes) and the URL marked for a manual
check.
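A thread-based timeout along these lines could be sketched as follows. This is a minimal illustration and not the ExtConX2 implementation; the class name, status strings, and timeout value are invented for the example:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class UrlChecker {

    // Run a probe on a worker thread and give up if it does not answer in time.
    public static String checkWithTimeout(Callable<Integer> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Integer> future = executor.submit(probe);
        try {
            int code = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return code < 400 ? "AVAILABLE" : "UNAVAILABLE";
        } catch (TimeoutException e) {
            future.cancel(true);   // terminate the probe thread only
            return "MANUAL_CHECK"; // flag the URL for a manual check
        } catch (Exception e) {
            return "UNAVAILABLE";  // connection failed outright
        } finally {
            executor.shutdownNow();
        }
    }

    // Probe a URL via HttpURLConnection, bounded by the same timeout.
    public static String checkAvailability(String url, long timeoutMillis) {
        return checkWithTimeout(() -> {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout((int) timeoutMillis);
            conn.setReadTimeout((int) timeoutMillis);
            return conn.getResponseCode();
        }, timeoutMillis);
    }
}
```

Because a non-responding server only stalls its own worker thread, concurrent checks of other URLs are unaffected.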
39 Consecutive NEs must be separated by commas or the word token "and".
(2) IE Module
The IE application, which handles the text pre-processing, is unable to process acknowledgement
paragraphs over 200 words in the environment used: a java.lang.OutOfMemoryError: Java heap
space exception is thrown because the Java Virtual Machine (JVM) heap size is insufficient.
This is a known issue with the GATE API (Cunningham et al. 2010, p.35). In the environment
used, the JVM maximum memory could not be increased; to address this issue, the Java maximum
heap size needs to be set to 768MB or more.
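On hardware with sufficient memory, the heap limit can be raised when launching the application; the jar name below is illustrative only:

```shell
# Raise the JVM maximum heap size to 768 MB (jar name is hypothetical)
java -Xmx768m -jar ExtConX2.jar
```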
7. Conclusion
The aim of this project was to develop a text mining system (ExtConX2) to enable:
(1) the exploration of acknowledgements of individuals and organisations, and
(2) analysis of URL decay and most often referenced online resources.
Table 35 summarises the project aims, which have all been fully met.
Table 35 – Accomplished Project Aims
Project Aims
1 Design and implement a relational database (Db) schema to store extracted data.
2 Design and implement a module to extract URLs from documents, determine if the given
URL is accessible or not, determine type of resource (or URL) extracted/referenced and
insert this data into a database.
3 Design and implement a module to identify and extract funders and collaborators (i.e.,
persons/organisations and their respective roles) from acknowledgements and insert this
data into a database.
4 Design and implement a GUI that will facilitate exploration of system functionalities and which provides general statistics.
5 Evaluation of the proposed methodology.
TM techniques were used to achieve the main functional requirements of the system. In particular,
lexical-, syntactic-, and semantic-level NLP processing was used for
acknowledgement extraction. In addition, a rule-based approach (JAPE) was used for semantic-
level processing to enable the IE task of role extraction. We differentiated between two classes of
roles: funders and contributors. Finally, a combination of regular expressions and lists containing
keywords were used for extraction of URLs and classification of these resources into four classes
(i.e., Databank, Document, Organisation, and Software).
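This last step can be illustrated with a much-simplified sketch of regex-based URL extraction followed by keyword classification. The pattern and keyword lists here are abbreviated stand-ins for the fuller ones used by the system (the complete keyword lists are given in Appendix B), and the class name is invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlClassifier {

    // Simplified URL pattern; the real system uses a more elaborate expression.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[\\w.-]+(?:/[\\w./-]*)?");

    // Abbreviated keyword lists per resource class (see Appendix B for the full lists).
    private static final Map<String, List<String>> KEYWORDS = Map.of(
            "Databank", List.of("genbank", "ncbi.nlm.nih.gov", "database"),
            "Document", List.of(".pdf", "journal", "article"),
            "Software", List.of("sourceforge", "software", "tool"),
            "Organisation", List.of("institute", "foundation", "organisation"));

    // Pull every URL-shaped token out of free text.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Assign the first resource class whose keyword occurs in the URL.
    public static String classify(String url) {
        String lower = url.toLowerCase();
        for (Map.Entry<String, List<String>> entry : KEYWORDS.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (lower.contains(keyword)) {
                    return entry.getKey();
                }
            }
        }
        return "Unknown"; // no keyword matched
    }
}
```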
As part of the project, we have processed a set of 190,000 full-text journal articles from PubMed
Central.40
A subset of 50 documents was manually checked to evaluate ExtConX2's performance.
For URL extraction, the system achieved 98.6% precision and 96% recall. For URL resource
classification, the system was able to correctly classify 81.1% of URLs (recall) with precision of
88.7%. For role extraction, the system achieved 92.7% precision, 67.6% recall and an F measure of
77.7%.
Using this data, we have analysed some trends in URL decay and acknowledgements. For example,
we found that URL decay can be described as a function of publication year: the older the
publication, the less accessible the resources it contains. We also found that most
funding acknowledgements were associated with National Institutes of Health.
40 However, the full dataset was not available in XML format; hence, roughly 120,000-130,000 documents were processed.
While prior research has produced applications similar to ExtConX2, this project has extended the scope
of that research by analysing larger datasets and adopting more sophisticated approaches. For
instance, Wren's (2004, 2008) studies were confined to PubMed citations, while ExtConX2 has
enabled the analysis of URL decay within full-text articles. This has allowed us to draw more
holistic conclusions regarding the scope of URL decay within the biomedical domain. In
addition, ExtConX2 is the first system to enable acknowledgement extraction within PMC.
7.1. Limitations and Future Work
The following list defines ExtConX2‘s limitations and provides suggestions for future
enhancements:
1. The URL Module is currently only able to check HTTP URLs (i.e., http:// and https://) for
availability. Additional implementation is needed for the File Transfer Protocol (FTP).
2. The IE Module extracts an organisation name and its abbreviation as separate NEs,
resulting in two separate roles. This could be handled by implementing an additional step for
acronym detection.
3. The soft-decision rules and keywords used for resource classification may be further studied and
improved. For instance, an additional category, laboratory tools and equipment, ought to
be added.
4. Concurrent processing should be implemented to speed up resource-availability checks and to
handle non-responding URLs, addressing the system issues discussed.
5. Currently the implementation analyses only acknowledgements within explicitly defined
acknowledgement sections. However, other sections may also contain acknowledgement text.
6. The facts presented are quite limited; with the data already extracted, other
patterns/relationships may be uncovered, e.g., (1) the resource types and journals
most affected by URL decay, and (2) the relationship between funding organisations and
the research disciplines most often sponsored.
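The acronym-detection step suggested in point 2 could, under a simple initial-letter-matching assumption, look like the following. This helper is hypothetical and not part of ExtConX2; it only illustrates how an acronym and its long form might be paired so the two NEs can be merged:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcronymDetector {

    // Map each "(ABC)"-style acronym to the long form whose initials spell it.
    public static Map<String, String> detect(String text) {
        Map<String, String> acronyms = new LinkedHashMap<>();
        Matcher m = Pattern.compile("\\(([A-Z]{2,})\\)").matcher(text);
        while (m.find()) {
            String acronym = m.group(1);
            String before = text.substring(0, m.start()).trim();
            if (before.isEmpty()) {
                continue;
            }
            String[] words = before.split("\\s+");
            // Walk backwards, matching word initials against the acronym and
            // skipping short lowercase fillers such as "of" and "for".
            int i = words.length - 1;
            int k = acronym.length() - 1;
            int start = -1;
            while (i >= 0 && k >= 0) {
                char initial = Character.toUpperCase(words[i].charAt(0));
                if (initial == acronym.charAt(k)) {
                    start = i;
                    k--;
                    i--;
                } else if (words[i].length() <= 3
                        && Character.isLowerCase(words[i].charAt(0))) {
                    i--; // filler word, keep scanning backwards
                } else {
                    break; // initials no longer line up
                }
            }
            if (k < 0 && start >= 0) {
                acronyms.put(acronym,
                        String.join(" ", Arrays.copyOfRange(words, start, words.length)));
            }
        }
        return acronyms;
    }
}
```

A downstream step could then merge the long-form NE and the acronym NE into a single role rather than two.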
In addition, other topics of interest were identified during the course of this project:
1. Document representation seems to be changing: more and more documents do not provide
visible/printable URLs; instead, hyperlinks encapsulating URL strings are provided.
2. It would be interesting to analyse the type of applications referenced within PMC. For
instance, what types of software are referenced and what are their uses?
References
Ananiadou, S. & McNaught, J., 2006. Text Mining for Biology and Biomedicine. Artech House: London.
Ananiadou, S. et al., 2005. The National Centre for Text Mining: Aim and Objectives. Ariadne, [online] 30 Jan., (42). Available at: http://www.ariadne.ac.uk/issue42/ananiadou/ [Accessed 13
April 2010].
Appelt, E.D. & Israel, J.D., 1999. Introduction to Information Extraction Technology: A Tutorial Prepared for IJCAI-99. [Online] Available at: http://user.phil-fak.uni-
duesseldorf.de/~rumpf/SS2005/ Informationsextraktion/Pub/AppIsr99.pdf [Accessed 1 May 2010].
Automatic Content Extraction (ACE), 2004. Automatic Content Extraction 2004 Evaluation
(ACE04). [Online] Available at: http://www.itl.nist.gov/iad/mig//tests/ace/2004/ [Accessed 10 May
2010].
Baeza-Yates, R. & Ribeiro-Neto, B., 1999. Modern Information Retrieval. Pearson
Education Limited. ACM Press, New York.
Bennet, S., McRobb, S. & Farmer, R., 2006. Object-Oriented Systems Analysis and Design, 3rd ed. McGraw-Hill: London.
Berners-Lee, T., Fielding, R. & Frystyk, H., 1996. Hypertext Transfer Protocol -- HTTP/1.0.
[Online] Available at: http://www.ietf.org/rfc/rfc1945.txt [Accessed 4 September 2010].
Black, J.W. et al., 2005. CAFETIERE: Conceptual Annotation for Facts, Events, Terms, Individual Entities, and Relations. Parmenides Technical Report TR-U4.3.1. [Online] Available at:
http://ilk.uvt.nl/~kzervanou/dwn/TRU431.pdf [Accessed 4 September 2010].
Chinchor, N. & Sundheim, B., 1993. MUC-5 Evaluation Metrics. Proceedings of the 5th Conference on Message Understanding. Baltimore, Maryland, USA, 25-27 August 1993. [Online]
Available at: http://www.aclweb.org/anthology-new/M/M93/M93-1007.pdf [Accessed 9 May 2010].
Cunningham, H. et al., 2010. Developing Language Processing Components with GATE Version 5
(a User Guide). [Online] Available at: http://Gate.ac.uk/sale/tao/tao.pdf [Accessed 9 May 2010].
Cunningham, H., 2006. Information Extraction, Automatic. In: Brown, K., ed. Encyclopedia of Language & Linguistics, 2nd ed. Oxford: Elsevier.
Fayyad, U. Piatetsky-Shapiro, G. & Smyth, P., 1996. Knowledge Discovery and Data Mining:
Towards a Unifying Framework. Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining. Portland, Oregon, USA, 2-4 August 1996. [Online] Available at: http://www.aaai.org/Papers/KDD/1996/KDD96-014.pdf [Accessed 21 April 2010].
Frankling, S., 2010. XML Parser: DOM and SAX Put to the Test. [Online] Available at: http://www.devx.com/xml/Article/16922/1954 [Accessed 27 August 2010].
Frantzi, K., Ananiadou, S. & Mima, H., 2000. Automatic Recognition of Multi-word Terms. International Journal of Digital Libraries, 3(2), pp.117-132.
Gerner, M. Nenadic, G. & Bergman, C. M., 2010. An Exploration of Mining Gene Expression
Mentions and their Anatomical Locations from Biomedical Text. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Uppsala, Sweden, 15 July 2010. [Online]
Available at: http://www.aclweb.org/anthology/W/W10/W10-1909.pdf [Accessed 4 September
2010].
Giles, C.L. & Councill, G.I., 2004. Who gets acknowledged: Measuring scientific contribution
through automatic acknowledgment indexing. PNAS, 101(51), pp.599-604.
Hahn, U. & Wermter, J., 2006. Levels of Natural Language Processing for Text Mining. In:
Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House:
London.
Hearst, M.A., 1999. Untangling Text Data Mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. College Park, Maryland, USA, 20-26 June 1999. [Online] Available at: http://www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html [Accessed 14 April 2010].
Hotho, A. Nurnberger, A. & Paaß, G., 2005. A Brief Survey of Text Mining. LDV-Forum, 20(1), pp.19-62.
JISC, 2006. Text Mining: Briefing Paper. [Online] Available at:
http://www.jisc.ac.uk/media/documents/publications/textminingbp.pdf [Accessed 16 April 2010].
Kim, J. & Tsujii, J., 2006. Corpora and Their Annotation. In: Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.
Hearst, M.A., 2003. What is Text Mining? [Online] Available at: http://www.ischool.berkeley.edu/~hearst/text-mining.html [Accessed 14 April 2010].
McNaught, J. & Black, W.J., 2006. Information Extraction. In: Ananiadou, S. &
McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.
National Institutes of Health (NIH), 2010. [Online] Available at: http://www.nih.gov/icd/ [Accessed 6 August
2010].
National Library of Medicine (NLM), 2010a. Fact Sheet. [Online] Available at:
http://www.nlm.nih.gov/pubs/factsheets/pubmed.html [Accessed 13 April 2010].
National Library of Medicine (NLM), 2010b. http://dtd.nlm.nih.gov/publishing/ [Accessed 25 August 2010].
National Library of Medicine (NLM), 2010c. http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/ tagging-guidelines/article/tags.html [Accessed 25 August 2010].
National Library of Medicine (NLM), 2009. Key MEDLINE® Indicators. [Online] Available at: http://www.nlm.nih.gov/bsd/bsd_key.html [Accessed 13 April 2010].
National Library of Medicine (NLM), 2008. Fact Sheet: MEDLINE®. [Online] Available at:
http://www.nlm.nih.gov/pubs/factsheets/medline.html [Accessed 13 April 2010].
Polajnar, T., 2006. Survey of Text Mining of Biomedical Corpora. [Online] Available at:
http://www.dcs.gla.ac.uk/~tamara/surveyoftm.pdf [Accessed 10 May 2010].
Sommerville, I., 2004. Software Engineering. 7th ed. London: Pearson.
Tateisi, Y., 2004. GENIA Corpus. [Online] Available at: http://www-tsujii.is.s.u-
tokyo.ac.jp/~genia/topics/Corpus/ [Accessed 13 May 2010].
Tsuruoka, Y. et al., 2005. Developing a Robust Part-of-Speech Tagger for Biomedical Text.
Advances in Informatics: 10th Panhellenic Conference on Informatics. Volas, Greece 11-13
November 2005. [Online] Available at:
http://www.springerlink.com/content/3275150j32h61345/fulltext.pdf [Accessed 14 May 2010].
Uramoto, N. et al., 2004. A text-mining System for Knowledge Discovery from Biomedical
Documents. IBM Systems Journal, 43(3), pp.516-533.
Wikipedia, 2009. Extensibility. [Online] Available at: http://en.wikipedia.org/wiki/Extensibility
[Accessed 22 August 2010].
Wikipedia, 2010. Research Funding. [Online] Available at:
http://en.wikipedia.org/wiki/Research_funding [Accessed 6 August 2010].
Wren, D.J., 2004. 404 not found: the stability and persistence of URLs published in MEDLINE.
Bioinformatics, 20(5), pp.668-672.
Wren, D.J., 2008. URL decay in MEDLINE—a 4-year follow-up study. Bioinformatics, 24(11),
pp.1381-1385.
Zelenko, D., Aone, C. & Richardella, A., 2003. Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 3, pp.1083-1106.
Zhou, G., Su, J., Zhang, J. & Zhang, M., 2005. Exploring Various Knowledge in Relation Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Ann Arbor, Michigan, USA, 25-30 June 2005. [Online] Available at: http://www.aclweb.org/anthology-new/P/P05/P05-1053.pdf [Accessed 10 May 2010].
Appendix A – System Architecture and Design
Figure 13 - System Db EER Diagram
System Architecture
Figure 14 - ExtConX2 Architectural Design
Default ANNIE Modules
Figure 15 - ANNIE Default IE Modules (www.gate.ac.uk)
Appendix B – Implementation
Table 36 – List of Keywords for Resource Type Identification
Databank Document Software Organisation
Annotate .doc Algorithm Organisation
Data bank .pdf Application Organization
Databank .txt BLAST Institute
Database Article Interface Foundation
Genbank Artikel Program International Agency
geneontology.org Biomedcentral.com r-project.org
ncbi.nlm.nih.gov/biosystems/ Book Software
ncbi.nlm.nih.gov/cancerchromo Chapter Sourceforge
ncbi.nlm.nih.gov/cdd Conclu System
ncbi.nlm.nih.gov/dbEST Content Tool
ncbi.nlm.nih.gov/dbvar Data
ncbi.nlm.nih.gov/domains Dictionary
ncbi.nlm.nih.gov/epigenomics Doc
ncbi.nlm.nih.gov/gap Document
ncbi.nlm.nih.gov/gds dx.doi.org
ncbi.nlm.nih.gov/Genbank/ Elsevie
ncbi.nlm.nih.gov/gene Facts
ncbi.nlm.nih.gov/genome/ Genomebilogogy
ncbi.nlm.nih.gov/genomes/FLU/ Gudeline
ncbi.nlm.nih.gov/geo Icmje
ncbi.nlm.nih.gov/homologene Info
ncbi.nlm.nih.gov/nuccore Interscience/wiley
ncbi.nlm.nih.gov/nucest Issue
ncbi.nlm.nih.gov/nucgss Journal
ncbi.nlm.nih.gov/omia Molvis.org
ncbi.nlm.nih.gov/omim News
ncbi.nlm.nih.gov/pcassay Overview
ncbi.nlm.nih.gov/pccompound Paper
ncbi.nlm.nih.gov/pcsubstance Publication
ncbi.nlm.nih.gov/pcsubstance Report
ncbi.nlm.nih.gov/peptidome Result
ncbi.nlm.nih.gov/popset Review
ncbi.nlm.nih.gov/probe statistic
ncbi.nlm.nih.gov/projects/CCDS/ stats
ncbi.nlm.nih.gov/projects/gensat/ table
ncbi.nlm.nih.gov/projects/sky/ Vol
ncbi.nlm.nih.gov/projects/SNP Volume
ncbi.nlm.nih.gov/protein Wikipedia.org
ncbi.nlm.nih.gov/proteinclusters
ncbi.nlm.nih.gov/RefSeq/
ncbi.nlm.nih.gov/SNP
ncbi.nlm.nih.gov/Structure/
ncbi.nlm.nih.gov/Structure/VAST/
ncbi.nlm.nih.gov/taxonomy
ncbi.nlm.nih.gov/unigene
ncbi.nlm.nih.gov/unists
ncbi.nlm.nih.gov/VecScreen/
pubchem.ncbi.nlm.nih.gov/
Appendix C – Evaluation Data
Table 37 – URL Extraction Data
PMCID | Total Nr. URLs | Extracted URLs | Duplicate URLs | Correct Resource Type Identified of Extracted URLs
PMC2413013 4 4 0 2
PMC2761731 4 3 1 1
PMC1988857 9 9 0 1
PMC2752617 4 3 0 3
PMC2764095 9 9 0 9
PMC2661364 3 4 1 3
PMC2111041 2 2 0 2
PMC1919404 41 41 0 40
PMC2533341 2 2 0 2
PMC1779804 6 4 0 4
PMC1525208 3 3 0 1
PMC2768983 6 6 0 3
PMC2731543 4 4 0 4
PMC2801496 2 2 0 0
PMC1624845 1 1 0 1
PMC1839892 1 1 0 1
PMC2206495 4 3 0 3
PMC2239252 1 1 0 1
PMC2685015 6 4 0 1
PMC2440928 2 2 0 2
PMC2478650 5 5 0 4
PMC2793031 1 1 0 1
PMC1994066 3 3 0 3
PMC2515323 2 2 0 2
PMC2765943 4 4 0 3
PMC1599749 5 5 0 4
PMC2570968 8 9 1 9
PMC2787492 7 7 0 7
PMC2806257 5 5 0 3
PMC1805747 2 2 0 2
PMC2276520 7 7 0 6
PMC2600755 4 4 0 4
PMC2071966 2 2 0 1
PMC1266361 1 1 0 1
PMC2755136 4 4 0 2
PMC2600409 2 2 0 2
PMC2405930 1 1 0 1
PMC1851970 2 2 0 2
PMC1698487 6 5 0 5
PMC2671451 2 2 0 2
PMC2759026 4 4 0 2
PMC2627827 1 1 0 1
PMC441568 6 6 0 5
PMC1797064 6 6 0 4
PMC2657239 4 4 0 4
PMC151303 3 3 0 3
PMC2018828 10 10 0 10
PMC1790700 4 4 0 1
PMC2791112 2 2 0 1
PMC2740322 1 1 0 1
Table 38 – Role Extraction Data
PMCID | Nr. Relevant Roles | True Positives | Nr. Partially Extracted Roles | False Positives
PMC2750102 5 0 1 2
PMC2761731 3 2 0 0
PMC2246224 2 2 0 0
PMC519127 14 13 0 0
PMC2293642 4 2 0 0
PMC2759026 2 2 0 1
PMC2688212 2 2 0 0
PMC2588630 7 2 0 0
PMC2718519 5 2 1 0
PMC1885552 5 2 1 0
PMC545072 4 1 0 0
PMC1940049 3 3 0 0
PMC2528195 7 4 0 0
PMC1819381 5 5 0 0
PMC1805747 2 2 0 0
PMC2442612 7 5 1 0
PMC1712367 8 7 0 0
PMC2453772 13 8 0 0
PMC2672046 5 5 0 0
PMC2734341 1 1 0 0
PMC2779906 2 2 0 0
PMC2291575 2 2 0 0
PMC2533119 9 8 0 0
PMC2764095 6 4 0 0
PMC2082466 9 2 0 0
PMC2709726 8 1 0 0
PMC102553 3 2 1 0
PMC2121139 8 4 0 0
PMC2658886 4 4 0 0
PMC2734340 2 1 0 0
PMC2186343 2 0 1 0
PMC166148 5 2 0 0
PMC1616969 5 4 0 0
PMC222959 5 5 0 0
PMC2246224 2 2 0 0
PMC2702309 3 2 0 0
PMC1379658 3 2 1 1
PMC102419 4 3 0 0
PMC2391254 5 4 0 0
PMC2751461 3 3 0 1
PMC128935 4 1 0 0
PMC2427038 4 4 0 0
PMC546163 8 5 0 0
PMC2759976 1 1 0 0
PMC2714901 4 4 0 0
PMC2532720 5 4 1 0
PMC1481595 6 3 0 0
PMC2671166 3 3 0 0
PMC1459217 4 4 0 0
PMC2738522 5 5 0 0
Table 39 – Role Expression Extraction Data
PMCID | Nr. Relevant REs | True Positives | Nr. Partially Extracted REs | False Positives
PMC2750102 4 3 0 0
PMC2761731 3 2 0 0
PMC2246224 2 2 0 0
PMC519127 5 4 0 0
PMC2293642 4 2 0 1
PMC2759026 2 3 0 0
PMC2688212 2 2 0 0
PMC2588630 3 2 0 0
PMC2718519 5 3 0 0
PMC1885552 4 3 0 0
PMC545072 4 1 0 0
PMC1940049 3 3 0 0
PMC2528195 7 2 0 0
PMC1819381 5 5 0 0
PMC1805747 2 2 0 0
PMC2442612 6 5 0 0
PMC1712367 4 3 0 0
PMC2453772 7 5 0 0
PMC2672046 2 2 0 0
PMC2734341 1 1 0 0
PMC2779906 2 2 0 0
PMC2291575 2 2 0 0
PMC2533119 3 2 0 0
PMC2764095 3 1 0 0
PMC2082466 3 1 0 0
PMC2709726 1 1 0 0
PMC102553 3 2 1 0
PMC2121139 4 2 0 0
PMC2658886 1 1 0 0
PMC2734340 2 1 0 0
PMC2186343 1 0 1 0
PMC166148 3 2 0 0
PMC1616969 4 3 0 0
PMC222959 2 2 0 0
PMC2246224 1 1 0 0
PMC2702309 1 1 0 0
PMC1379658 2 1 0 0
PMC102419 3 3 0 0
PMC2391254 4 3 0 0
PMC2751461 1 1 0 0
PMC128935 1 1 0 0
PMC2427038 3 3 0 0
PMC546163 6 5 0 0
PMC2759976 1 1 0 0
PMC2714901 3 3 0 0
PMC2532720 4 3 0 0
PMC1481595 5 3 0 0
PMC2671166 2 2 0 0
PMC1459217 3 3 0 0
PMC2738522 2 2 0 0
Table 40 – Named Entity Extraction Data
PMCID | Nr. Relevant NEs | True Positives | Nr. Partially Extracted NEs | False Positives
PMC2750102 5 0 1 3
PMC2761731 3 2 0 0
PMC2246224 2 2 0 0
PMC519127 14 13 0 0
PMC2293642 4 2 0 0
PMC2759026 2 2 0 1
PMC2688212 2 2 0 0
PMC2588630 7 2 0 0
PMC2718519 5 3 0 0
PMC1885552 5 2 1 0
PMC545072 4 1 0 0
PMC1940049 3 3 0 0
PMC2528195 7 4 0 0
PMC1819381 5 5 0 0
PMC1805747 2 2 0 0
PMC2442612 7 6 0 0
PMC1712367 8 7 0 0
PMC2453772 13 8 0 0
PMC2672046 5 5 0 0
PMC2734341 1 1 0 0
PMC2779906 2 2 0 0
PMC2291575 2 2 0 0
PMC2533119 9 8 0 0
PMC2764095 6 4 0 0
PMC2082466 9 1 0 1
PMC2709726 8 1 0 0
PMC102553 3 3 0 0
PMC2121139 8 4 0 0
PMC2658886 4 4 0 0
PMC2734340 2 1 0 0
PMC2186343 2 1 0 0
PMC166148 5 2 0 0
PMC1616969 5 4 0 0
PMC222959 5 5 0 0
PMC2246224 2 2 0 0
PMC2702309 3 2 0 0
PMC1379658 3 3 0 1
PMC102419 4 3 0 0
PMC2391254 5 4 0 0
PMC2751461 3 3 0 1
PMC128935 4 1 0 0
PMC2427038 4 4 0 0
PMC546163 8 5 0 0
PMC2759976 1 1 0 0
PMC2714901 4 4 0 0
PMC2532720 5 5 0 0
PMC1481595 6 3 0 0
PMC2671166 3 3 0 0
PMC1459217 4 4 0 0
PMC2738522 5 5 0 0