A Rule-based Approach to External Context Extraction from
Biomedical Literature: URL and Role Extraction
A dissertation submitted to The University of Manchester for the degree of
Master of Science in Informatics
In the Faculty of Engineering and Physical Sciences
2010
Azad Dehghan
School of Computer Science
Table of Contents
Table of Contents .......................................................................................................................... 2
List of Tables ................................................................................................................................. 4
List of Figures ................................................................................................................................ 6
List of Abbreviations ..................................................................................................................... 7
Abstract ........................................................................................................................................ 8
Declaration ................................................................................................................................... 9
Copyright Statement ..................................................................................................................... 9
Dedication .................................................................................................................................. 10
Acknowledgement ...................................................................................................................... 10
1. Introduction ........................................................................................................................ 11
1.1. Motivation ................................................................................................................... 11
1.2. Project Aims ................................................................................................................ 12
1.2.1. Conceptualisation of Project Specific Terminology ............................................... 12
1.3. Project Objectives ........................................................................................................ 13
1.4. Availability ................................................................................................................... 13
1.5. Overview of Chapters .................................................................................................. 14
2. Background ......................................................................................................................... 15
2.1. Text Mining.................................................................................................................. 15
2.1.1. Information Retrieval ........................................................................................... 15
2.1.2. Natural Language Processing ................................................................................ 16
2.2. Information Extraction ................................................................................................. 17
2.2.1. Rule-based and Statistical-based Approaches to IE ............................................... 18
2.2.2. IE Application Development Tools/Software......................................................... 18
2.3. NLM Journal Archiving and Publishing DTDs ................................................................. 19
2.4. Related Work ............................................................................................................... 21
2.5. Summary of Chapter .................................................................................................... 24
3. Software Requirements ....................................................................................................... 26
3.1. Description of Main Tasks ............................................................................................ 26
3.1.1. URL Extraction...................................................................................................... 26
3.1.2. Acknowledgement Extraction ............................................................................... 27
3.2. Functional User and System Requirements .................................................................. 27
3.2.1. Functional User Requirements and Use Case Diagram .......................................... 27
3.2.2. Functional System Requirements ......................................................................... 29
3.2.3. Requirement Traceability Matrix .......................................................................... 33
3.3. Non-Functional Requirements ..................................................................................... 34
4. System Design and Analysis ................................................................................................. 35
4.1. Generic System Architecture ........................................................................................ 35
4.2. Description of External Context Extraction ................................................................... 36
4.2.1. URL Module ......................................................................................................... 36
4.2.2. IE Module............................................................................................................. 39
4.3. System Architecture..................................................................................................... 41
4.3.1. Subsystems Architecture ...................................................................................... 41
4.4. System Design ............................................................................................................. 42
4.4.1. Database Layer..................................................................................................... 43
4.4.2. Application Layer ................................................................................................. 44
4.4.3. Presentation Layer ............................................................................................... 47
5. Implementation................................................................................................................... 48
5.1. Tools & Implementation Environment ......................................................................... 48
5.2. Implementation of URL Module ................................................................................... 48
5.2.1. Extraction of URLs ................................................................................................ 49
5.2.2. Checking Resource Availability ............................................................................. 49
5.2.3. Determining Resource Type ................................................................................. 50
5.3. Implementation of IE Module ...................................................................................... 53
5.3.1. GATE .................................................................................................................... 53
5.3.2. Java Annotation Pattern Engine............................................................................ 53
5.3.3. Implementation of IE Module Described .............................................................. 54
5.3.4. Information Extraction ......................................................................................... 60
6. Evaluation ........................................................................................................................... 63
6.1. URL Extraction ............................................................................................................. 63
6.1.1. Discussions........................................................................................................... 65
6.2. Role Extraction ............................................................................................................ 66
6.2.1. Discussions........................................................................................................... 68
6.3 System Limitations ....................................................................................................... 70
7. Conclusion ........................................................................................................................... 72
7.1. Limitations and Future Work........................................................................................ 73
References .................................................................................................................................. 74
Appendix A – System Architecture and Design ............................................................................ 77
Appendix B – Implementation ..................................................................................................... 80
Appendix C – Evaluation Data...................................................................................................... 81
List of Tables
Table 1 – Relevant XML Tags 20
Table 2 – Most Acknowledged Funding Organisations 23
Table 3 – Ideal Results from URL Extraction Process 26
Table 4 - Ideal Results of TM Process 27
Table 5 – Description of Actor (AC) 28
Table 6 – Description of Use Cases 28
Table 7 – Mapping between Projects Objective and Implementation Objectives 29
Table 8 – Implementation Objective 1 30
Table 9 – Implementation Objective 2 30
Table 10 – Implementation Objective 3 30
Table 11 – Implementation Objective 4 31
Table 12 – Implementation Objective 5 31
Table 13 – Implementation Objective 6 32
Table 14 – Implementation Objective 7 32
Table 15 – Implementation Objective 8 33
Table 16 – Requirement Traceability Matrix 33
Table 17 – Ideal Results from URL Extraction Process 37
Table 18 – HTTP Response Codes 38
Table 19 – Examples of REs for Collaborators and Funders 39
Table 20 - Results of TM Process 40
Table 21 – Regular Expressions for URL Validation 49
Table 22 – Sample of Keywords 50
Table 23 – Distributed Score of Soft Decision Algorithm 51
Table 24 – Result by Soft Decision Algorithm 52
Table 25 – Sample of One-Word Role Expression Lists 56
Table 26 – Sample of Multi-Word Role Expression Lists 56
Table 27 - Results of Role Extraction 61
Table 28 – Evaluation Terms Described 63
Table 29 – Total Resource Type Referenced 63
Table 30 – Resource Availability by Year 64
Table 31 – True Positives: Role Extraction 67
Table 32 – Most Acknowledged Funding Organisation 67
Table 33 – Description of RE Transducers Rule 69
Table 34 - Development and Evaluation Environment 70
Table 35 – Accomplished Project Aims 72
Table 36 – List Keywords for Resource Type Identification 80
Table 37 – URL Extraction Data 81
Table 38 – Role Extraction Data 82
Table 39 – Role Expression Extraction Data 83
Table 40 – Name Entity Extraction Data 80
List of Figures
Figure 1 - URL Decay (Wren, 2008) 24
Figure 2 - Use Case Diagram 28
Figure 3 – High-Level System Architecture 35
Figure 4 – URL Module Overview 37
Figure 5 - Generic NLP/IE Pipeline 40
Figure 6 - ExtConX2 Layered Subsystems 42
Figure 7 - ExtConX2 Database Layer 43
Figure 8 - Relational Database Schema 44
Figure 9 - ExtConX2 Application Layer 45
Figure 10 - ExtConX2 Presentation Layer 47
Figure 11 - IE Application Pipeline 55
Figure 12 – URL Decay 64
Figure 13 - System Db EER Diagram 77
Figure 14 - ExtConX2 Architectural Design 78
Figure 15 - ANNIE Default IE Modules (www.gate.ac.uk) 79
List of Abbreviations
A Nearly-New Information Extraction System ANNIE
Simple API for XML SAX
Common Pattern Specification Language CPSL
Data Mining DM
Digital Object Identifier DOI
Document Object Model DOM
Graphical User Interface GUI
Human Computer Interaction HCI
Hypertext Transfer Protocol HTTP
Information Extraction IE
Information Retrieval IR
Integrated Development Environment IDE
Java Annotation Pattern Engine JAPE
Java Virtual Machine JVM
Left-hand-side LHS
Model-View-Controller MVC
National Center for Biotechnology Information NCBI
National Institutes of Health NIH
National Library of Medicine NLM
Natural Language Processing NLP
Object Oriented Programming OOP
PubMed Central PMC
Relational Database Management System RDBMS
Right-hand-side RHS
Role Expression RE
Separation of Concern SoC
Software Development Processes SDP
Software Requirements Engineering SRE
Software Requirements Specification SRS
Text Mining TM
Abstract
With a huge number of publications within the biomedical domain, there is an increasing number
of references to URLs, and acknowledgements of individuals and funding organisations. This
project was motivated by providing an analysis of the scope of the problem of URL decay, and by
exploring and uncovering facts such as the most active funding organisations, relationships
between funding agencies and research themes, and between scientists and research themes.
EXTernal CONtext eXtractor 2 (ExtConX2) was developed in order to aid with this aim. Rule-
based approaches were adopted in order to extract URLs and acknowledgements from PubMed
Central documents. From the entire PMC dataset of roughly 190,000 PMC documents processed,
147,133 URLs and 194,539 roles were extracted.
Using this data, we have analysed some trends in URL decay and acknowledgements. For example,
we found that URL decay can be described as a function of publication year: the older the
publication, the less accessible the resources referenced within it. We also found that most
funding acknowledgements were associated with the National Institutes of Health, the National
Science Foundation, and the Wellcome Trust, in that order.
The adopted approach for URL extraction achieved a precision of 98.6% and a recall of 96%. The
role extraction task achieved a recall of 67.6% and a precision of 92.6%.
Declaration
No portion of the work referred to in the dissertation has been submitted in support of an
application for another degree or qualification of this or any other university or other institute of
learning.
Copyright Statement
i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns any copyright in it (the "Copyright") and he has given The University of
Manchester the right to use such Copyright for any administrative, promotional, educational
and/or teaching purposes.
ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance
with the regulations of the John Rylands University Library of Manchester. Details of these
regulations may be obtained from the Librarian. This page must form part of any such
copies made.
iii. The ownership of any patents, designs, trademarks and any and all other intellectual
property rights except for the Copyright (the "Intellectual Property Rights") and any
reproductions of copyright works, for example graphs and tables ("Reproductions"), which
may be described in this dissertation, may not be owned by the author and may be owned
by third parties. Such Intellectual Property Rights and Reproductions cannot and must not
be made available for use without the prior written permission of the owner(s) of the
relevant Intellectual Property Rights and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and exploitation
of this dissertation, the Copyright and any Intellectual Property Rights and/or
Reproductions described in it may take place is available from the Head of the School of
Computer Science.
Dedication
This project is first and foremost dedicated to Science. I hope that the excellence of science and
reason will continue to prevail! The earth is round indeed!
Secondly, I would also like to dedicate this project to my family: my parents Siavash Dehghan and
Shahnaz Gharehjani, and my brother Arash for his support.
Acknowledgement
I am grateful to Dr. Goran Nenadic for helpful comments and suggestions. I would also like to acknowledge the gnTeam for providing the PubMed Central dataset.
1. Introduction
The presence of overwhelming amounts of unstructured textual information within scientific
literature has made the need for machine-supported analysis of text ever more important to aid
scientists with scientific hypothesis generation and knowledge discovery (Ananiadou & McNaught,
2006; Ananiadou et al., 2005; Uramoto et al., 2004). A specific problem domain is that of the
biological sciences, reflected by the sheer volume of academic publications. For instance, in the
previous year alone (2009), over 710,000 approved references were added to MEDLINE®/
PubMed®, or between 60,000 and 120,000 references each month (NLM 2008; NLM 2009). This
sheer number of publications is simply not digestible by any individual scientist.
This domain in particular has made the application of text mining (TM) techniques to analyse huge
quantities of unstructured information a vital means to extend and further scientific/knowledge
discovery (Ananiadou & McNaught, 2006). The implications of attempting traditional knowledge
discovery, or of generating scientific hypotheses, without the aid of TM techniques should be
evident.
With a huge number of publications within the biomedical domain, (1) there is an increasing
number of references to URLs or online resources (e.g., publications, software, and so on), and (2)
acknowledgements of individuals and funding organisations. The aim of this dissertation may be
described as discovery-oriented (see Fayyad et al., 1996), i.e., to uncover previously unknown facts
or knowledge in regards to relationships/patterns involving these aspects using TM techniques.
1.1. Motivation
The unprecedented growth of biomedical literature has been coupled with the increasing practice
of referencing online resources (URLs) that become inaccessible over time (i.e., URL decay). This
project is motivated by providing an analysis of the scope of this problem. While previous studies
(Wren, 2004; Wren, 2008) have confirmed the issue of URL decay, this project will extend
previous research by providing a more holistic conclusion through the analysis of a broader
dataset.
Another motivation is similarly and partly derived from the unprecedented quantity of research
and publication within the biomedical domain. As biomedical research attracts billions of pounds
of research grants and investment from governmental, commercial, and academic sources
worldwide each year, it will be interesting to explore and uncover patterns such as the most active
funding agencies or institutions, relationships between funding agencies and research themes, and
between scientists and research themes.1
1.2. Project Aims
The aim of this project is to design and implement a system to enable the analysis of trends such as
URL decay (i.e., the phenomenon of inaccessible online resources), type of online resources most
often referenced, and exploration of acknowledgements: of individuals and organisations and their
respective roles in relation to the research/article where acknowledged. Therefore, the system must
enable extraction of so called external context from biomedical research: (1) URLs and (2)
acknowledgments. This software system will be referred to as EXTernal CONtext eXtractor 2 or
ExtConX2 hereafter.2
Moreover, ExtConX2 may be described as two systems in one: (1) URL extractor and (2)
acknowledgement extractor. Description of these subsystems follows:
(1) URL Extractor
The URL Extractor must enable (1) extraction of URLs, (2) for each URL extracted, the system
must determine the type of resource referenced (i.e., Document, Databank, Software, or
Organisation), and (3) determine if the URL is accessible or not.
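The three URL Extractor tasks above can be sketched in code. The following is a minimal illustrative sketch only, not the system's actual implementation (described in Chapter 5): the class and method names, the simplified URL pattern, and the choice of an HTTP HEAD request are all assumptions made here for illustration, and task (2), resource typing, is omitted.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractorSketch {
    // A deliberately simple URL pattern; real validation rules (see Table 21)
    // must also handle trailing punctuation, "www."-only forms, and so on.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[\\w.-]+(?:/[\\w./%#?&=-]*)?");

    // Task (1): extract all URL-like strings from document text.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Task (3): check accessibility via an HTTP HEAD request; any 2xx/3xx
    // response code is treated here as "accessible".
    public static boolean isAccessible(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (Exception e) {
            return false; // unreachable host, malformed URL, timeout, ...
        }
    }
}
```

The interpretation of HTTP response codes for the availability check is discussed later (Table 18).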
(2) Acknowledgement Extractor
The Acknowledgement Extractor must enable the identification and extraction of (1) name entities
(NEs) such as persons and organisations, (2) role expressions (REs), i.e., the acknowledged role of
a given NE, and (3) relations or associations between an NE and its corresponding RE.
1.2.1. Conceptualisation of Project Specific Terminology
Various project specific terminologies are used throughout this dissertation. This section provides
conceptualisation of these terms for easy referencing:
1 Apart from providing practical applications as described in section 1.2.1, biomedical research can at times be
controversial (e.g., stem-cell research; health risks of cigarettes); hence, uncovering patterns between funding
organisations and research could be important to maintain scientific and academic integrity.
2 The "2" indicates the number of tasks the system handles: (1) URL extraction and (2) acknowledgement extraction.
(1) Conceptualisation of Role Entities:
i. Collaborator – any NE (person or organisation), apart from the author(s), that provides any
non-financial support (e.g., editorial, conceptual, technical, and so on).
ii. Funder – any NE that provides financial support to the corresponding research.
iii. Role Expression – the literal role of a collaborator or funder.
Note that collaborator / contributor, and sponsor / funder will be used interchangeably throughout
this report.
(2) Conceptualisation of Resource Types:
i. Databank – any database or repository of information which may facilitate dynamic
information retrieval.
ii. Document – any article, report, book, or any static information resource.
iii. Organisation – any organisation or institute (literal definition).
iv. Software – any computer program or application (literal definition).
1.3. Project Objectives
This project will aim to achieve the following objectives:
1. Design and implement a relational database (Db) schema to store extracted data.
2. Design and implement a module to extract URLs from documents, determine if the given
URL is accessible or not, determine type of resource (or URL) extracted/referenced and
insert this data into a database.
3. Design and implement a module to identify and extract funders and collaborators (i.e.,
persons/organisations and their respective roles) from acknowledgements and insert this
data into a database.
4. Design and implement a GUI that will facilitate exploration of system functionalities and
which provides general statistics.
5. Evaluate the proposed methodology.
1.4. Availability
The PubMed Central dataset will be available from gnode1 (gnode1.mib.man.ac.uk) for use
within this project.
1.5. Overview of Chapters
The remainder of this dissertation is organised as follows:
Chapter 2 – Background: provides a general description of the project background, such as Text
Mining (TM) processes and concepts, and a review of related work.
Chapter 3 – Software Requirements: provides a high-level description of the main requirements
of ExtConX2, and further defines functional and non-functional requirements.
Chapter 4 – System Design and Analysis: illustrates and discusses the overall system design and
individual software components of ExtConX2.
Chapter 5 – Implementation: discusses the implementation of the system by analysing selected implementation components.
Chapter 6 – Evaluation: presents and discusses the results of the knowledge discovery stage of the dissertation and an evaluation of the adopted methods.
Chapter 7 – Conclusion: concludes the dissertation by reflecting on the project aims, the
limitations of the system, and suggestions for future work.
2. Background
2.1. Text Mining
TM generally involves the application of techniques such as Information Retrieval (IR), Natural
Language Processing (NLP), Information Extraction (IE), and Data Mining (DM) (JISC, 2006;
Uramoto et al., 2004) to unstructured text. Hearst (2003) summarises the general notion of TM as:
the discovery by computer of new, previously unknown information, by automatically [or
semi-automatically] extracting information from different written resources. A key element
is the linking together of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation.
While TM is often an iterative process, its techniques/stages are generally applied in an ordered
manner; TM, or knowledge discovery, is a process-oriented activity. Further, because TM is a
relatively new research field, the concepts used are not always consistent across the literature (see
Hotho et al., 2005; Fayyad et al., 1996). While it is not within the scope of this report to discuss
this issue further, it is important to acknowledge. Hence, this section will briefly review the
processes, techniques, and concepts involved in TM. This ought to clarify the conceptual
foundation and aid the understanding of the description of the overall project that follows.
2.1.1. Information Retrieval
Information retrieval is a discipline concerned with the finding of
documents/information (Hotho et al., 2005). IR covers a wide variety of research areas such as
document classification and categorisation, data visualisation, filtering, modelling, and so forth
(Baeza-Yates & Ribeiro-Neto, 1999). Often-referenced IR systems are search engines such as
Yahoo3 and Google4, which identify documents/information according to the user's search queries
(JISC, 2006). IR systems within the biomedical domain include Entrez PubMed and PubMed
Central (PMC). PubMed® is a free resource which provides access to MEDLINE® (Medical
Literature Analysis and Retrieval System Online), the U.S. National Library of Medicine's (NLM)
database of citations and abstracts. Currently, PubMed contains over 19 million references from
approximately 5,400 biomedical journals published worldwide (NLM, 2010a). PubMed Central is
the corresponding (free) full-text digital archive developed and managed by the U.S. National
Institutes of Health's (NIH) National Center for Biotechnology Information (NCBI).
3 www.yahoo.co.uk 4 www.google.co.uk
Moreover, within the context of TM or knowledge discovery process, IR refers to the process of
finding and retrieving appropriate documents relevant to some particular problem (JISC, 2006).
While IR is considered a sub-process of NLP by some researchers (e.g., Polajnar, 2006), within
this project IR will be regarded as a separate process antecedent to NLP.
2.1.2. Natural Language Processing
Natural language processing is concerned with the problem of understanding natural language (NL)
by the use of computers (JISC, 2006; Hotho et al., 2005). Due to the inherent ambiguity of NL,
analysing it by machine is evidently complex. Thus, NLP is commonly divided into several layers
of processing (Hahn & Wermter, 2006): the lexical, syntactic, and semantic levels. Lexical-level
processing deals with how words can be recognised, analysed, and
identified to enable further processing (Hahn & Wermter, 2006). The syntactic level analysis deals
with identification of structural relationships between groups of words in sentences, and the
semantic level is concerned with the content-oriented perspective or the meaning attributed to the
various entities identified within the syntactic level (Hahn & Wermter, 2006).
(1) Lexical Level Processing
The tokenisation process or the segmentation of text into individual meaningful elements is the
initial stage of lexical level processing. Tokens such as words, acronyms, abbreviations, numbers,
and so on are linguistically identified (Hahn & Wermter, 2006). Other interrelated sub-processes
associated with lexical level processing include (Hahn & Wermter, 2006):
Part-Of-Speech (POS) tagging, which is considered the core of this level of processing
Morphological analysis (the association/linking of varied forms of lexical elements to their
canonical base form)
Unknown word handling
Acronym detection
Name Entity Recognition (NER)
An example of a widely used and reliable POS tagger within the biomedical domain is GENIA
Tagger v3.0 (Tsuruoka et al., 2005). Computational lexicons (e.g., BioThesaurus) are also utilised
at this stage to aid with the overall lexical level processing. While lexicons often vary depending
upon the domain/task, in general, and at a bare minimum, computational lexicons contain lexical
elements such as full or canonical base forms of words together with additional linguistic
information (e.g., part-of-speech category and morphological information), and so on.
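The initial lexical-level step, tokenisation, can be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the token pattern, and the three token categories (word, number, punctuation) are choices made here for illustration, not the behaviour of any tool cited above; real tokenisers (e.g., in the GENIA Tagger) handle many more cases such as abbreviations and hyphenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokeniser {
    // Match a run of letters (word), a number with optional decimal part,
    // or any single non-whitespace symbol (punctuation).
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z]+|\\d+(?:\\.\\d+)?|[^\\sA-Za-z\\d]");

    // Segment text into individual meaningful elements (tokens).
    public static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For example, tokenising "POS tagging, v3.0" yields the elements POS, tagging, the comma, v, and 3.0, on which POS tagging and morphological analysis could then operate.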
(2) Syntactic Level Processing
Common methods applied within the syntactic level processing are chunkers and parsers.
Chunkers partition or label sentences into phrasal units (i.e., noun, preposition, verb, or adjective
phrases) (see Hahn & Wermter, 2006, p.23 for details), and parsers identify clauses such as word
sequences containing a subject and a predicate (Hahn & Wermter, 2006, p.25). An example of
a domain-specific (i.e., biomedical) shallow parser is the GENIA Tagger. Moreover, the application
of a name entity recogniser (NER) at this level of processing has proven beneficial within biological
text mining, as most name entities are contained within noun or prepositional phrases (Hahn &
Wermter, 2006). Some examples of NER systems include ANNIE for, e.g., person and organisation
name recognition (Cunningham et al., 2010), LINNAEUS for species name recognition (Gerner et
al., 2010), and TerMine for technical terms recognition.
Resources commonly utilised to aid with the overall syntactic level process are grammars and
treebanks. Treebanks are annotated text corpora with syntactic annotations at sentence level (i.e.,
POS tags and syntactic structures), and grammars contain some subset of linguistic syntax,
commonly rules or constraints which characterise morpho-syntactic and nonterminal grammar
categories (see Hahn & Wermter, 2006, p.21). An example of a widely used treebank (within the
biomedical domain) is GENIA Treebank v1.0, which is based upon annotated PubMed abstracts
(Kim & Tsujii, 2006; Tateisi, 2004).
(3) Semantic Level Processing
The semantic level analysis consists of linking terms or concepts to form logical/knowledge
propositions (Hahn & Wermter, 2006). This level of processing builds directly upon the
combination of the lexical and syntactic level analyses. For instance, within the scope of this
project, the semantic level processing involves the linking of NEs and their respective roles.
2.2. Information Extraction
Information extraction may be described as a subsequent stage of NLP. IE is the process of
automatically or semi-automatically extracting predefined data from unstructured text (JISC, 2006)
and inserting this data into forms or templates (see McNaught & Black, 2006, p.143), which
subsequently turn the data into factual information (Hotho et al., 2005). As defined by
the Message Understanding Conference (MUC), tasks commonly associated with IE are:
Recognition and classification of words denoting names of persons, organisations, and locations,
and numeric and temporal expressions (i.e., name entity task).
Identifying linked references to extracted entities (i.e., coreference task)
Extracting identifying and descriptive attributes of name entities (i.e., template element
task).
Extracting relationships between name entities (i.e., template relation task).
Extracting events in combination with either template element/relation tasks (McNaught
and Black, 2006, p.147).
Moreover, a commonly used method to aid the overall NER process is the use of gazetteers
(i.e., lists defining NEs such as persons, organisations, etc.).
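The idea of gazetteer-based lookup can be sketched as follows. This is an illustrative sketch only: the class name and the three example entries are assumptions made here, and a production gazetteer (e.g., as used by ANNIE) holds many thousands of entries matched with a finite-state machine rather than naive substring search.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class GazetteerSketch {
    // A tiny example list of organisation NEs; real gazetteers are loaded
    // from curated list files, one list per entity type.
    private final Set<String> organisations = new HashSet<>(Arrays.asList(
            "National Institutes of Health",
            "National Science Foundation",
            "Wellcome Trust"));

    // Return every listed organisation that appears verbatim in the text.
    public Set<String> lookup(String text) {
        Set<String> found = new HashSet<>();
        for (String org : organisations) {
            if (text.contains(org)) {
                found.add(org);
            }
        }
        return found;
    }
}
```

A lookup over the sentence "This work was funded by the Wellcome Trust." would return the single entry "Wellcome Trust".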
Data mining refers to the process of identifying patterns in (often large) structured datasets
(such as a database). Within the TM process, DM techniques are typically applied to facts extracted
during the IE stage in order to identify patterns and discover new knowledge (JISC, 2006).
2.2.1. Rule-based and Statistical-based Approaches to IE
Methods which may be used for IE tasks include rule-based (e.g., Common Pattern Specification
Language; Java Annotated Pattern Engine) and statistical-based (e.g., Support Vector Machines;
Hidden Markov Models) approaches. Both types of methods have their strengths and weaknesses.
For instance, statistical-based methods tend to require more computing resources than rule-based
methods, which tend to be more lightweight (and thus faster). On the other hand, the rule-based
(knowledge engineering) approach is domain or even task dependent, while the statistical
(automatic training) approach is relatively domain independent (Appelt & Israel, 1999). Hence,
domain portability is quite straightforward with statistical-based approaches (Appelt & Israel,
1999). While both methods can be equally labour and time intensive, they differ in their inherent
way of designing an IE application. The rule-based approach often requires domain knowledge and a
skilled knowledge engineer to implement effective rules for the IE task. On the other hand, the
statistical-based approach requires annotator(s) with some knowledge of the domain and task to
annotate a training corpus that models the information sought (Appelt & Israel, 1999).
2.2.2. IE Application Development Tools/Software
Many tools/software are available to aid scientists and developers to create IE applications, e.g.,
CAFETIERE (see Black et al., 2005), LingPipe,5 MinorThird,6 and GATE (General Architecture
5 http://alias-i.com/lingpipe/ 6 http://sourceforge.net/apps/trac/minorthird/wiki
for Text Engineering).7 A common denominator across the latter three tools is that they provide
Java APIs for use within custom-built standalone applications.
(1) CAFETIERE (or Conceptual Annotation for Facts, Events, Terms, Individual Entities, and
RElations) is a rule-based information extraction system for the various IE tasks spelled out in
its name. CAFETIERE provides various NLP components, such as tokenisers, POS taggers, NERs, etc.,
for text pre-processing, and a customised rule-based language that may be used for semantic level
processing of text (Black et al., 2005). Further, CAFETIERE provides a graphical user interface
(GUI) (i.e., the Analyser and Annotation Editor) which supports viewing and editing annotations
(useful for iterative development of IE rules).
(2) LingPipe may be described as a toolkit for processing text using computational linguistics and
primarily contains Java APIs for NER, POS, classification, and so on.
(3) MinorThird is another toolkit containing a collection of Java APIs for various NLP and IE
tasks. In contrast to LingPipe, MinorThird also provides a GUI for invoking APIs and debugging or
manipulating annotations.
(4) GATE may be considered more mature than the latter two tools, due to its extensive
documentation and user-friendly GUI. GATE is in essence an integrated development environment
providing reusable processing resources that enable the development and deployment of customised
applications to solve NLP problems/tasks (Cunningham et al., 2010). Processing resources are
individual NLP processing components, such as tokenisers, POS taggers, NERs, etc., which may be
applied to individual documents or a corpus in a customised order to create an IE application.8
These resources are collectively known as a Collection of REusable Objects for Language
Engineering (CREOLE). GATE may be used to create annotations over documents (for instance, to
be used with statistical-based approaches) or to create IE applications which may be used apart
from the GATE interface via APIs (GATE Embedded)9 (Cunningham et al., 2010).
2.3. NLM Journal Archiving and Publishing DTDs
Both PubMed and PubMed Central (PMC) documents are provided in XML format (defined by the
NLM Journal Archiving and Publishing DTDs) as an alternative to the common Portable Document
Format (PDF). As previously mentioned, PubMed contains citations and abstracts, and PMC is the
7 http://www.Gate.ac.uk 8 Java APIs from LingPipe, Google, Yahoo (and many more) for NLP/IE are provided as processing resources. 9 GATE API to integrate the IE application into a Java application.
corresponding full-text digital archive. The dataset from PMC, which contains approximately
190,000 documents, will be used in this project.
While the NLM Journal Archiving and Interchange Tag Suite was created to provide a common
format for publishers and archives to exchange journal content (NLM, 2010b), its usefulness for TM
applications has been widely appreciated. The Tag Suite defines elements and attributes to describe
full article contents such as metadata, acknowledgements, abstract, article body, citations, URLs,
and so on. This has proven beneficial to researchers who may only be interested in particular
section(s) of articles, e.g., abstracts or acknowledgements. For instance, instead of using regular
expressions over a whole document to identify particular sections of interest, a researcher could
use an XML parser10 to parse documents and extract the relevant sections. This has at least a
couple of advantages over the use of regular expressions. Provided that a tag set exists for the
particular document content of interest, using XML tags to extract this content is often more
accurate than using regular expressions (hence improving results). In addition, when designing a
TM application, which often processes huge amounts of documents, the opportunity to parse
documents only for specific content rather than process whole documents can significantly improve
performance (i.e., response time and use of computing resources).
Currently there exist seven different Tag Suite versions or Document Type Definitions (DTDs)11
for PMC articles. However, these versions are consistent with regard to the tags used for the
content of interest to this project, namely acknowledgements and URLs.
Table 1 describes XML tags which will be used in the implementation of ExtConX2 (NLM,
2010c):
Table 1 – Relevant XML Tags
(1a) <ext-link> </ext-link>: Tag defining an external resource outside the scope of an article.
(1b) ext-link-type="uri": Tag (1a) must contain the attribute ext-link-type with the value uri, indicating that the tag contains a URL.
(1c) xlink:href: Finally, within the tag element, a third attribute (1c) must identify the external link.
(2) <ack> </ack>: Tag defining the acknowledgement content/section.
Below is a simplified XML skeleton in the NLM Archiving and Interchange format. Samples of the
tags described in Table 1 may be found at lines 28 and 34 in the following example:
10 XML parser generally refers to an API that enables one to programmatically read XML files and extract content of
interest. Common APIs used for XML parsing in Java include the Document Object Model and the Simple API for XML.
11 Tag Suite versions include: 1.0, 1.1, 2.0, 2.1, 2.2, 2.3, and 3.0 (current).
1 <article>
2 <front>
3 <journal-meta>
4 <journal-id>Journal Acronym</journal-id>
5 ...
6 </journal-meta>
7 <article-meta>
8 ...
9 <contrib id="A1" contrib-type="author">
10 <name>
11 <surname>Last</surname>
12 <given-names>First</given-names>
13 </name>
14 </contrib>
15
16 <abstract> ... </abstract>
17 </article-meta>
18 </front>
19 <body>
20 <sec> <title>Introduction</title>
21 <p> … </p>
22 </sec>
23 <sec sec-type="method"> <title> Methods </title>
24 <p> … </p>
25 </sec>
26 </body>
27 <back>
28 <ack> We would like to thank Armand Seguin for his support of the
project and for many stimulating discussions. </ack>
29
30 <ref-list>
31 <ref id="A1">
32 <citation citation-type="other">
33 <article-title>An Online Resource</article-title>
34 <ext-link ext-link-type="uri"
xlink:href="http://www.web.com"/>
35 </citation>
36 </ref>
37 </ref-list>
38 </back>
39 </article>
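To make the advantage over regular expressions concrete, a skeleton like the one above could be parsed with the standard Java DOM API roughly as follows. This is an illustrative sketch, not the project's implementation; it assumes only the tag and attribute names from Table 1:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Extracts the <ack> text and ext-link URIs from an NLM-style XML string.
class NlmExtractor {
    // Returns { acknowledgement text, concatenated URI hrefs }.
    static String[] extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            // Acknowledgement section: first <ack> element, if any.
            NodeList acks = doc.getElementsByTagName("ack");
            String ack = acks.getLength() > 0
                    ? acks.item(0).getTextContent().trim() : "";
            // URLs: <ext-link> elements whose ext-link-type is "uri".
            StringBuilder urls = new StringBuilder();
            NodeList links = doc.getElementsByTagName("ext-link");
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                if ("uri".equals(link.getAttribute("ext-link-type"))) {
                    urls.append(link.getAttribute("xlink:href"));
                }
            }
            return new String[] { ack, urls.toString() };
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

For a collection as large as the PMC dataset, a streaming parser (SAX) would consume less memory than DOM, since whole documents need not be held as trees; the DOM version is shown only for brevity.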
2.4. Related Work
Giles and Councill (2004) developed a system for acknowledgement extraction from Information
Science literature.12 Based upon their analysis of the extracted data, a classification scheme of
six categories of acknowledgements was identified: (1) moral support, (2) financial support, (3)
editorial support, (4) presentational support (i.e., presenting the paper at a conference), (5)
instrumental/technical support, and (6) conceptual support, or peer interactive communication
(PIC) as coined by Giles and Councill. They justified their classification scheme on the basis of
12 The IR system utilised for document retrieval: the CiteSeer digital library - http://www.citeseer.ist.psu.edu
the significance of acknowledgements. For instance, conceptual and technical support is arguably
more noteworthy as an academic contribution than moral support (Giles & Councill, 2004).
Nevertheless, their argument was never reflected in their results.13
Giles and Councill's method is inherently an NER system, as actual roles were only determined by
post-extraction analysis. For instance, they provide a table which partly includes acknowledged
companies and funding agencies. However, it cannot be concluded beyond doubt whether these
acknowledged entities provided funding, material, or even intellectual support. Giles and
Councill's conclusion is based on prior knowledge of the names of funding organisations and
analysis of a subset of the most acknowledged entities. Thus, acknowledgements of funding agencies
and companies can only be assumed to represent financial support (see Giles & Councill 2004,
p.17601). ExtConX2 will be more sophisticated in that respect, as NEs and their respective roles
will be identified and extracted from acknowledgements. Hence, this task will be slightly more
challenging than Giles and Councill's, and as the nature of the evaluation metrics will differ,
good metrics will be more challenging to obtain.
The methodology adopted by Giles and Councill (2004) is a combination of rule-based and
statistical-based approaches. Initially, regular expressions were used to identify sections which
most likely contained acknowledgements, specifically, section headings labelled acknowledgment. In
addition, the authors also identified acknowledgement passages within unmarked sections of
articles, typically within the document header (i.e., before the abstract/introduction or on the
first page) or footnotes (i.e., before the references or first appendix). Hence, all text on the
first page of the document and on the last page, before the reference section or the appendix, was
processed using an SVM to identify sentences containing acknowledgements. Subsequently, a
rule-based parser was applied to extract acknowledged named entities. Through extensive testing
involving 1,800 manually labelled documents the method achieved 78.45% precision and 89.55% recall.
Table 2 is an excerpt from Giles and Councill's (2004, p.17602) results for the most acknowledged
funding agencies.
Table 2 – Most Acknowledged Funding Organisations
Funding Agencies No. of acknowledgements
National Science Foundation 12,287
Defence Advanced Research Projects Agency 4,712
Office of Naval Research 3,080
Deutsche Forschungsgemeinschaft 2,780
National Aeronautics and Space Administration 2,408
Engineering and Physical Sciences Research Council 2,007
Air Force Office of Scientific Research 1,657
13 Apart from financial support, no other category was presented in their results.
Natural Sciences and Engineering Research Council of Canada 1,422
Department of Energy 1,054
Australian Research Council 1,010
European Union Information Technologies Program 825
National Institutes of Health 709
Army Research Office 666
Netherlands Organization for Scientific Research 646
Science and Engineering Research Council 489
Another piece of research related to one of the applications of ExtConX2 is Wren's (2004, 2008)
study of URL decay within MEDLINE/PubMed citations. Wren justified his motivation by the growth
in electronic references and the assumption that online resources are unreliable compared to
traditional printed journals. This was confirmed by the results of his study. The methodology used
by Wren within the knowledge discovery process was straightforward. Wren used Visual Basic as the
programming language of choice and regular expressions to identify and extract URLs from XML
documents (containing the citations). Additional heuristic rules and manual editing were applied
to handle/correct human errors such as mistyped URLs. However, neither the heuristic rules nor the
regular expressions were provided. Nevertheless, commonly encountered errors discussed were
inappropriate spaces within URLs, the use of backward slashes instead of forward slashes,
non-alphanumeric characters, and the inclusion of erroneous characters (see Wren, 2004, p.669).
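Since Wren did not publish his rules, the error classes he describes can only be sketched; the clean-up steps below are illustrative assumptions in Java, not his code:

```java
// Illustrative URL clean-up heuristics for the error classes Wren reports:
// stray spaces, backslashes instead of forward slashes, trailing junk.
// The exact rules here are assumptions, not Wren's published method.
class UrlCleaner {
    static String clean(String raw) {
        String url = raw.trim();
        url = url.replace(" ", "");           // inappropriate spaces within URLs
        url = url.replace('\\', '/');         // backward slashes -> forward slashes
        url = url.replaceAll("[),.;]+$", ""); // erroneous trailing punctuation
        return url;
    }
}
```

A production version would also validate the result (e.g., against a URL grammar) and flag strings that cannot be repaired automatically for manual editing, as Wren did.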
Wren's (2004) initial study involved 1,630 URLs extracted from nearly 13 million PubMed
citations. These URLs were programmatically checked for availability over a four-week period
using Microsoft Component Objects Internet Transfer Control (API). A URL was considered
inaccessible if it did not respond within 60 seconds or if the response code received indicated
that the resource was inaccessible (e.g., 404 not found, file not found, etc.). In addition, if 25
consecutive tries failed, a URL was considered inaccessible. URLs that were accessible in at least
90% of the checks were considered active. This method is appropriate, as web servers do not tend
to have 100% up-time (or be available 100% of the time); hence it maximises the accuracy of the
availability statistics.
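The decision rule just described (inaccessible after 25 consecutive failed tries; active when accessible in at least 90% of checks) can be expressed as a small pure function. The thresholds are those reported by Wren; the function itself is an illustrative reconstruction:

```java
// Reconstruction of Wren's availability rule. The 25-failure and 90%
// thresholds come from the text; the encoding as a function is assumed.
class AvailabilityRule {
    // checks[i] is true if the URL responded on check i.
    static boolean isActive(boolean[] checks) {
        int ok = 0, consecutiveFailures = 0;
        for (boolean up : checks) {
            if (up) { ok++; consecutiveFailures = 0; }
            else if (++consecutiveFailures >= 25) return false; // 25 straight misses
        }
        // Active if accessible in at least 90% of the checks made.
        return checks.length > 0 && ok * 10 >= checks.length * 9;
    }
}
```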
Wren's (2008) follow-up study used practically the same method as described above. URLs were
extracted/surveyed in the years following the initial study (except 2006): 2004 (a total of 2,294
URLs surveyed), 2005 (3,327 URLs), and 2007 (6,154 URLs). Both studies (Wren, 2004; Wren,
2008) showed time-dependent decay of URLs. More specifically, URL decay could be described as
a function of publication year: the older the publication, the fewer accessible resources it
contained. Below is a graph representing the URL decay results from Wren's studies (2004, 2008):
Figure 1 - URL Decay (Wren, 2008)
While Wren's approach was focused solely on abstracts, ExtConX2 will be applied to full-text
articles, thus covering a larger scope. This also means that a more holistic conclusion can be
drawn regarding URL decay. In addition, as previously stated, URLs will be classified into four
different categories, enabling a broader analysis of the nature of the resources referenced.
Nevertheless, Wren's research/results provide an excellent benchmark for post-research evaluation
and comparison. For instance, I would hypothesise that URL decay will be more severe within
full text as opposed to citations.
2.5. Summary of Chapter
The aim of this project is to develop a system (ExtConX2) to enable discovery of specific trends
within the biomedical domain. Specifically: (1) the exploration of acknowledgements of
individuals and organisations, and (2) analysis of URL decay and most often referenced resources.
The dataset which will be utilised within this project is full-text XML articles from PubMed
Central.
TM techniques will be used to achieve the main aims defined. In particular, NLP processing at
the lexical, syntactic, and semantic levels will be utilised to enable role extraction. In
addition, the XML tags provided by the NLM Archiving and Interchange DTDs will also be used for
extraction of URLs (not exclusively) and to aid the initial extraction of acknowledgement text from
PMC articles.
While prior research has had applications similar to ExtConX2, this project looks to extend their
scope by analysing larger datasets and adopting more sophisticated approaches. For instance,
Wren's (2004, 2008) study of URL decay was confined solely to PubMed citations. In contrast,
ExtConX2 will enable the analysis of URL decay within full-text articles. This will enable us to
draw a more holistic conclusion regarding the implications of URL decay and the types of resources
most often referenced within the biomedical domain. Moreover, acknowledgement extraction has yet
to be applied within the biomedical domain; ExtConX2 is the first system to do so. Giles and
Councill's (2004) research on acknowledgement extraction is concerned with publications within the
CiteSeer digital library. Their approach can at best be described as an NER system, as semantic
level processing is never applied. For instance, their results for the most acknowledged funding
agencies and companies are based on an assumption and on analysis of a subset of articles. In
contrast, ExtConX2 will enable us to determine whether extracted NEs have in fact provided funding
by extracting the NEs' corresponding roles as acknowledged in the text.
3. Software Requirements
The initial part of this chapter (Section 3.1) provides a high-level description of ExtConX2's
main requirements: (1) URL extraction and (2) role extraction. Subsequently, detailed descriptions
of functional user and system requirements, and non-functional system requirements, are provided
(Sections 3.2 and 3.3). These requirements have been derived from the project's objectives and
the software requirements engineering (SRE) process during the initial stages of this dissertation.
These requirements constitute the foundation of ExtConX2.
3.1. Description of Main Tasks
This section provides a brief high-level description of the main functional requirements of
ExtConX2: (1) URL extraction and related processes and (2) acknowledgement extraction. Some
details have been deliberately omitted for the sake of simplicity (e.g., the use of XML
documents).
3.1.1. URL Extraction
As previously described, ExtConX2 must enable the extraction of URLs from biomedical
publications. For each URL extracted, the system must determine the type of resource referenced
(refer to Section 1.2.1) and whether the given URL is accessible (URL Status: see Table 3). For
instance, given these hypothetical examples:
1. R-Project (http://www.r-project.org) was used for statistical processing of data.
2. The data was collected using GenBank (http://www.ncbi.nlm.nih.gov).
The ideal results of subsequent processing of these sentences (inserted into a database) ought to be
(Table 3):
Table 3 – Ideal Results from URL Extraction Process
URL Type of Resource URL Status Date Checked
(1) http://www.r-project.org Software Active/Inactive 2010-09-01
(2) http://www.ncbi.nlm.nih.gov Databank Active/Inactive 2010-09-01
3.1.2. Acknowledgement Extraction
Acknowledgement extraction involves the extraction of NEs and their respective REs from
acknowledgement sections. The ideal results of processing the acknowledgements given below
(inserted into a database) should be (see Table 4):
(1) Financial support was obtained from the Swedish Research Council.
(2) The authors thank Ms. Maureen Stoddard Marlow for editing.
Table 4 - Ideal Results of TM Process
(1) Named Entity: Swedish Research Council
Role (enumeration): Funder
Role Expression: Financial support
(2) Named Entity: Ms. Maureen Stoddard Marlow
Role (enumeration): Collaborator
Role Expression: Editing
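One way to realise such a mapping is a lexico-syntactic rule that captures the role expression and the NE in a single pattern. The rule below is a hypothetical Java regex illustration of the idea, not the grammar ExtConX2 implements:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One illustrative rule: "<role expression> was obtained from <NE>."
// maps the NE to the Funder role. Pattern and role mapping are
// invented for illustration; they are not the system's rule set.
class RoleRule {
    static final Pattern FUNDER = Pattern.compile(
            "(Financial support|Funding) was obtained from (?:the )?([A-Z][\\w ]+?)\\.");

    // Returns "NE|Role|RoleExpression" or null if the rule does not fire.
    static String apply(String sentence) {
        Matcher m = FUNDER.matcher(sentence);
        if (!m.find()) return null;
        return m.group(2) + "|Funder|" + m.group(1);
    }
}
```

A real rule set would first run NER so that patterns anchor on annotated entity spans rather than raw capitalised words; this sketch only shows how a role expression and an NE can be linked by a single rule.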
3.2. Functional User and System Requirements
3.2.1. Functional User Requirements and Use Case Diagram
[R1]. The user shall be able to initiate extraction of URLs from PMC XML documents (stored in
the Shared Database) and insert this data and respective attributes into the System
Database.14
a. Attributes for each URL include:
(1) URL status: if link is active or inactive,
(2) type of resource (i.e., Databank, Document, Organisation, or Software),
(3) decision data: data used to determine type of resource, and (4) date checked.
[R2]. The user shall be able to initiate role extraction (i.e., extraction of NEs and their respective
REs) from full-text XML documents and insert this data and additional attribute into the
system database.
a. Attribute for each set of roles include: (1) the acknowledgement text where role(s)
has been extracted.
[R3]. The user shall be able to view general statistics:
14 The System Database (Db) refers to the Db specifically designed for ExtConX2, used to store processed data. The
Shared Db is provided by the gnTeam (http://gnode1.mib.man.ac.uk/) and contains the PMC dataset.
a. (1) Number of documents processed, (2) number of URLs extracted, including
descriptive statistics of URL status (i.e., by year; in total), and (3) number of roles
extracted.
[R4]. The user shall be able to set parameters, e.g., the number of documents to be processed by
the IE processes (i.e., R1 and R2).
A use-case diagram derived from the functional user requirements is provided below (Figure 2):
Figure 2 - Use Case Diagram
Description of Use Case Diagram:
Table 5 – Description of Actor (AC)
AC01 User System user.
Table 6 – Description of Use Cases
UC01 URL Extraction AC01 may initiate URL Extraction and related processes to
determine URL status, determine type of resource, compose decision data, and insert this data (including the date inserted)
into the System Database.
UC02 Role Extraction AC01 may initiate Role Extraction and insert this data
(including the acknowledgement text) into the system database.
UC03 View Statistics AC01 will be able to view statistics of IE processes: (1)
number of documents processed, (2) number of URLs
extracted, (2a) descriptive statistics of URL status (i.e., by
year; in total), and (3) number of roles extracted.
UC04 Set Parameters AC01 can set system parameters: e.g., number of documents to
be processed for IE processes (i.e., UC01 and UC02).
3.2.2. Functional System Requirements
This section describes the functional system requirements and related processes by implementation
objective (Tables 8-15).15 The project objectives have been refined into implementation objectives
to reflect the architectural design of the system, e.g., database operations have been separated
into a separate objective (implementation objective 6). See Table 7 for the mapping between
project objectives and implementation objectives.
Table 7 – Mapping between Projects Objective and Implementation Objectives
Project Objectives Implementation Objectives
1 1 (Table 8)
2 2-4; (6) (Tables 9-11 and 13)
3 5; (6) (Tables 12 and 13)
4 7 (Table 14)
5 8 (Table 15)
(1) Conceptualisation of Terms:
Conceptualisation of the terms used in the following tables (Tables 8-15):
Risk – refers to the degree of risk in completing a module/task and is based on several
factors such as time constraints, difficulty, dependency on other modules/tasks, and
external dependencies. The level of risk is based on a subjective estimate of these
factors.
External Dependency – refers to dependency on external factors, e.g., IR
system(s), database(s), software, and so on.
Shared Database (Db) – refers to the database containing PMC articles in XML
format (i.e., gnode1).
System Db – refers to the database designed and implemented to store
extracted/processed data.
15 Evaluation (Table 15) is also included for the sake of completeness even though it is not a functional requirement.
Table 8 – Implementation Objective 1
1. Design and implement a relational database schema to store extracted data (i.e., the System Db).
Functional Requirement: N/A
Risk: Low.
External Dependency: None.
Priority: High.
Pre-condition: Installed relational database management system (RDBMS), such
as MySQL.
Post-condition: Skeleton or empty Db schema: System Db.
Difficulty: Easy
Processes: 1. Design an Enhanced Entity Relationship (EER) diagram.
2. Translate the EER diagram to a relational schema.
3. Implement the relational schema.
Table 9 – Implementation Objective 2
2. Design and implement a module to extract URLs from PMC XML documents
Functional Requirement: [R5]. The module shall be able to identify and extract URLs
from PMC XML documents.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: Intermediate.
Pre-condition: Objective 1, and Objective 6 (A)
Post-condition: A set of extracted URLs.
Difficulty: Intermediate
Process overview:
1. Objective 6, process A (Table 13).
2. Parse document and extract URL(s).
Table 10 – Implementation Objective 3
3. Design and implement a module to determine type of resource (or URL)
extracted/referenced.
Functional Requirement: [R6]. The module shall be able to identify the type of online
resource referenced: Databank, Document, Organisation, or Software.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return type of resource or URL referenced (i.e., Databank,
Document, Organisation, or Software).
Difficulty: Intermediate
Process overview:
1. Get the URL context.
2. Determine the resource type by: a. keyword(s) within the URL string, b. keyword(s) within
the URL reference context (i.e., the title and/or description of the reference), or
c. keyword(s) within the article body where the URL is cited.
3. Return resource type.
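The keyword cascade in this process overview might look roughly as follows. This is a Java sketch only: the keyword lists and the fallback category are assumptions for illustration, not the module's actual rules:

```java
import java.util.Locale;

// Cascaded keyword checks: URL string first, then the reference context.
// Keyword lists and the fallback are illustrative assumptions.
class ResourceTypeClassifier {
    static String classify(String url, String context) {
        String u = url.toLowerCase(Locale.ROOT);
        String c = (url + " " + context).toLowerCase(Locale.ROOT);
        if (u.contains("genbank") || c.contains("database") || c.contains("databank"))
            return "Databank";
        if (c.contains("software") || c.contains("toolkit") || c.contains("package"))
            return "Software";
        if (u.endsWith(".pdf") || c.contains("report") || c.contains("manual"))
            return "Document";
        return "Organisation"; // fallback when no keyword fires
    }
}
```

The order of the checks matters: more specific evidence (a known databank name in the URL) is consulted before weaker contextual cues, and the least informative category serves as the default.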
Table 11 – Implementation Objective 4
4. Design and implement a module to determine URL status: active or inactive link
Functional Requirement: [R7]. The module shall be able to determine if a URL is active or
inactive (accessible or not).
Risk: Low.
External Dependency: No direct dependency, see pre-condition.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return URL status: 0/FALSE if inaccessible or 1/TRUE if
accessible.
Difficulty: Easy
Process overview:
1. Get URL to be checked (see Obj. 2).
2. Check if URL is active/inactive: if inactive return
0/FALSE, else (if active) return 1/TRUE.
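The check itself can be sketched with Java's standard HttpURLConnection API. The timeout value and the range of response codes counted as accessible are assumptions for illustration; the tables above do not specify them:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the URL status check. The timeout and the "accessible"
// code range (below 400) are illustrative choices, not the module's spec.
class UrlStatusChecker {
    // HTTP response codes below 400 are treated as accessible here.
    static boolean isAccessibleCode(int code) {
        return code >= 200 && code < 400;
    }

    // Returns TRUE if the URL responds with an accessible code within the
    // timeout; any I/O failure or timeout counts as inactive (FALSE).
    static boolean checkUrl(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return isAccessibleCode(conn.getResponseCode());
        } catch (Exception e) {
            return false;
        }
    }
}
```

A HEAD request is used so that only headers are transferred; some servers reject HEAD, so a production checker might fall back to GET before declaring a URL inactive.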
Table 12 – Implementation Objective 5
5. Design and implement a module to identify and extract sponsors and contributors (NEs such as persons/organisations and their respective roles) from acknowledgements
Functional Requirements: [R8]. The module shall be able to identify NEs, such as persons
and organisations/institutions.
[R9]. The module shall be able to identify REs (i.e., sponsors/funders or collaborators/contributors).
[R10]. The module shall be able to link NEs to their respective
REs. [R11]. The module shall be able to extract NEs and their
respective roles from annotated documents.
Risk: High. Main reasons for risk level:
Dependent upon the use of appropriate methodology and efficient use of tools (i.e., GATE 5.2.1).
Time constraint: approaching project deadline.
External Dependency: GATE 5.2.1 (see Section 2.2.2).
Priority: High.
Pre-condition: Objective 1, and Objective 6 (A).
Post-condition: Return NEs and corresponding REs identified.
Difficulty: Hard
Process overview:
1. Implementation objective 6, process A (see Table 13).
2. Parse document and extract the acknowledgement passage.
3. Process the acknowledgement passage through the text processing application designed with
GATE 5.2.1 (which returns a GATE XML document with tags representing annotated entities:
NEs and corresponding REs).
4. Parse the GATE XML document.
5. Extract annotated NEs and their respective roles.
Table 13 – Implementation Objective 6
6. Design and implement a module to handle database operations: (1) ensure synchronisation
of retrieval of documents for processing and documents already processed, (2) insert extracted/processed data into the system database.
Functional Requirements: [R12]. The module shall be able to synchronise retrieval of
documents for processing (from the Shared Db) and documents already processed (in the System Db).
[R13]. The module shall be able to insert given (tuple) of data
into the system database.
Risk: Low.
External Dependency: -
Priority: High.
Pre-condition: Implementation objectives 2-4, or 5.
Post-condition: Relevant data is inserted into the System Db.
Difficulty: Easy
Process overview:
This module is separated into two tasks: (A) synchronisation between processed documents (in the
System Db) and the retrieval of documents (from the Shared Db) for processing, and (B) data
insertion into the System Db.
A. Check the last document processed for role extraction / URL extraction: a. if none, get the
first document from the Shared Db (documents may be retrieved in ascending order, enabled by
the auto-incremented keys of records in the Shared Db);16 b. else, get the auto-incremented
id of the last document processed in the System Db and start the retrieval process from the
Shared Db at the last document processed + 1.
B. Either get URL data (implementation objectives 2-4) or role data (implementation objective 5)
and insert this data into the System Db.
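The synchronisation decision in step A reduces to picking the next auto-incremented key to fetch from the Shared Db; a minimal sketch of that decision follows (method and parameter names are invented for illustration, and the surrounding SQL would depend on the actual schema):

```java
// Decide which Shared Db document id to fetch next, given the highest
// id already processed in the System Db (null when none processed yet).
// Names are illustrative; the real module would wrap JDBC queries.
class DocumentCursor {
    static long nextDocumentId(Long lastProcessedId, long firstIdInSharedDb) {
        if (lastProcessedId == null) {
            return firstIdInSharedDb;  // fresh session: start at the beginning
        }
        return lastProcessedId + 1;    // resume from last processed + 1
    }
}
```

Keeping this decision in one place means a crashed or interrupted session resumes exactly where it stopped, provided each processed id is committed to the System Db before the next document is fetched.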
Table 14 – Implementation Objective 7
7. Design and implement a GUI that will facilitate exploration of system functionalities and
provides general statistics.
Functional Requirements:
[R14]. The module shall be able to display general statistics upon user request, such as: (1) number of documents processed,
(2) number of URLs extracted, (2a) descriptive statistics
of URL status (i.e., by year; in total), and (3) number of roles extracted.
[R15]. The module shall be able to invoke user parameters for
the number of documents to be processed.
Risk: Intermediate. Main reasons for risk level:
Time constraint: approaching project deadline.
Dependent on successful completion of previous modules.
External Dependency: No direct dependency, see pre-condition.
16 The implementation will take advantage of the available auto-incremented key within the Shared Db (and the
corresponding foreign key in the System Db) to keep track of documents processed, or documents to be processed when a new session is initiated.
Priority: Intermediate.
Pre-condition: Implementation objectives 1-6.
Post-condition: Interactive GUI.
Difficulty: Intermediate.
Process overview: See Use Case Diagram (Figure 2).
Table 15 – (Implementation) Objective 8
8. Evaluation of the proposed methodology
Functional Requirement: N/A
Risk: Intermediate.
1. Time constraint: approaching project deadline.
2. Dependent upon the successful completion of system modules.
External Dependency: No direct dependency, see pre-condition.
Priority: High.
Pre-condition: Completion of 1-4
Post-condition: -
Difficulty: Easy
Process overview: 1. Choose a random sample of results derived from previous
steps and apply evaluation metrics (see Chapter 6)
3.2.3. Requirement Traceability Matrix
Requirement Traceability Matrix (Table 16) by User and System Functional Requirements versus
project objectives:
Table 16 – Requirement Traceability Matrix
Obj. 2 Obj. 3 Obj. 4
[R01] X
[R02] X
[R03] X
[R04] X
[R05] X
[R06] X
[R07] X
[R08] X
[R09] X
[R10] X
[R11] X
[R12] X X
[R13] X X
[R14] X
[R15] X
3.3. Non-Functional Requirements
In addition to the functional requirements, a set of non-functional requirements has been derived
from the (SRE process or) requirements elicitation and analysis stage. While non-functional
requirements typically include product, external, and organisational requirements (Sommerville,
2004), this dissertation focuses solely on product requirements, specifically, system properties
to guide the architectural design and implementation of ExtConX2.
1. Extensibility
Within software engineering, extensibility refers to the design/implementation of a system that
takes into consideration potential future extension of system functionalities (Wikipedia, 2009).
Extensibility may also be described as a system architecture designed to accommodate future
changes with minimal effort. For instance, a system architecture based upon modularity or
compartmentalisation, in which software functions/components are separated by concern (SoC),17
may address this requirement. Use of an Object Oriented Programming (OOP) language may also help
achieve this end.
2. Maintainability
The notion of maintainability is similar to extensibility in some respects, as the approaches to
accommodating these requirements may intersect. Nevertheless, the aim of this requirement is to
accommodate effortless maintenance of the system, to ease amendment of features in the
implementation, and to help locate hidden software bugs. The use of an OOP language, SoC, and
detailed documentation may be used to fulfil this requirement.
3. Reusability
The system ought to enable reusability of modules to the extent possible. This facilitates
both extensibility and maintainability, in addition to providing software components which
may be used within future (unrelated) applications/research. The application of SoC at
class level may be used to fulfil this requirement.
17. Separation of concerns (SoC) refers to a logical separation of system functionalities. For instance, an analogy may be drawn from the Model-View-Controller (MVC) paradigm often used in web applications.
4. System Design and Analysis
This chapter is divided into two general sections:
a) Generic overview of the system architecture/design which describes high-level approaches
to extraction of external context (i.e., URLs and acknowledgements).
b) Detailed description of the system architecture and design.
4.1. Generic System Architecture
A high-level overview of ExtConX2 is provided below (Figure 3, see footnotes for description of
arrows). Brief description follows (Figure 3):
Figure 3 – High-Level System Architecture18
1. The Database Module is responsible for (1) synchronisation between the Shared Database
(containing PMC XML documents) and the System Database, (2) retrieval of documents
(Db Traverser) for processing, and (3) insertion of extracted/processed data (Data Inserter)
into the System Database.
2. The URL Module is responsible for (1) parsing PMC documents and extracting URLs
(URL Extractor), (2) determining whether a given URL is accessible (URL Status), and (3)
determining the type of resource referenced (Resource Type).
3. The IE Module is responsible for role extraction (IE Application). This module
encapsulates text pre-processing and IE task required to identify and extract NEs and
respective REs.
18. Solid arrows represent data flow; dashed arrows may be read as "sub-module of", with the arrowhead pointing toward the super-module.
4. The Parser Module encapsulates the XML parser. In addition, it handles NLM Journal
Archiving and Interchange DTDs, which are needed to parse PMC documents. The
DTDResolver redirects the XML System IDs to a local repository where the DTDs are
stored.
The ExtConX2 architecture is guided by the design principle of SoC at the system level: the
Database Module (including the Shared Db and System Db) encapsulates database operations
(i.e., the Database Layer), while the URL Module and IE Module (including the Parser Module)
encapsulate application logic (i.e., the Application Layer). This approach is termed a subsystems
architecture, where each subsystem represents a different level of abstraction (Bennett et al.,
2006).19 It may be considered one approach to fulfilling the non-functional requirements
previously defined (Section 3.3).
4.2. Description of External Context Extraction
This section provides a high-level description of external context extraction based upon the generic
system design (Figure 3).
4.2.1. URL Module
The URL Module (refer to Figure 3) performs three main tasks: (1) extraction of URLs from PMC
documents, (2) determination of the resource type for each URL extracted, and (3) determination of
whether a URL is active or inactive (i.e., whether the resource is accessible).
An approach to process a given sentence containing a citation to an online resource is illustrated
below (Figure 4).
19. The system is divided by SoC: the Database Layer deals solely with retrieving documents and inserting data (this includes the RDBMS), while the Application Layer is solely responsible for application logic.
Figure 4 – URL Module Overview
Given the following sentence:
1. The report was provided by World Health Organisation (http://www.who.int).
The output (Processed Data) of the given process (Figure 4) ought to be as follows (Table 17):
Table 17 – Ideal Results from URL Extraction Process
URL Type of Resource URL Status Date Inserted
(1) http://www.who.int Document Active 2010-09-01
A more detailed description follows: (a) extraction of URLs and determination of URL status, and
(b) determination of the resource type (from the extracted URL context).
a) URL Extraction
As PMC documents are provided in the NLM Archiving and Interchange format (XML), the unique
tag provided for identifying URLs may be used to extract them. For instance, consider the following
hypothetical example of a URL within a PMC document (disregarding any context):
1 <ext-link ext-link-type="uri" xlink:href="http://www.who.int">
2 http://www.who.int
3 </ext-link>
The approach that may be adopted to extract the given URL follows:
Get URL:
1. Parse the given document using an XML Parser.
2. Traverse through the parsed XML document to find the XML tag identifying URLs (i.e.,
ext-link): see line 1 in the example above.
a. Ensure that ext-link contains the attribute ext-link-type and that this attribute
equals uri (i.e., ext-link-type="uri").20 This indicates that the XML tag contains an
external URL.21
3. Subsequently, either (a) extract the URL between the ext-link start and end tags (line 2
in the example above), or (b) extract the value of the attribute xlink:href (which also
contains the URL: line 1).
The XML tag pattern discussed above is consistent across all NLM Archiving and
Interchange DTDs used for PMC documents. Thus, this single approach ought to be sufficient to
extract URLs from differently formatted PMC documents.
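The three steps above can be sketched in Java, the implementation language of ExtConX2. This is a minimal illustration rather than the actual ExtConX2 code: the class and method names are assumptions, and only the ext-link element and its attributes come from the NLM Tag Suite. It follows option (b) of step 3, reading the URL from the xlink:href attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExtLinkUrlExtractor {

    /** Parses an XML fragment (step 1) and returns the URLs found in
     *  ext-link elements whose ext-link-type attribute equals "uri"
     *  (steps 2 and 3b above). */
    public static List<String> extractUrls(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> urls = new ArrayList<>();
        NodeList links = doc.getElementsByTagName("ext-link");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            // Step 2a: only elements explicitly typed as URIs are treated
            // as external URLs here (footnote 20 notes ftp is also valid).
            if ("uri".equals(link.getAttribute("ext-link-type"))) {
                // Step 3(b): the xlink:href attribute holds the URL even
                // when the element has no visible link text.
                urls.add(link.getAttribute("xlink:href"));
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><ext-link ext-link-type=\"uri\""
                + " xlink:href=\"http://www.who.int\">http://www.who.int"
                + "</ext-link></article>";
        System.out.println(extractUrls(xml)); // [http://www.who.int]
    }
}
```

Because the attribute is read with a namespace-unaware DOM parser, the qualified name "xlink:href" is matched literally, which also handles self-closing ext-link elements (see Section 5.1).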
Determine URL Status:
URL status may be determined programmatically from Hypertext Transfer Protocol (HTTP)
response codes. For instance, common response codes returned when trying to establish a
connection (either through a browser or programmatically) include those listed in Table 18
(Berners-Lee et al., 1996):
Table 18 – HTTP Response Codes
HTTP Response Code Description
HTTP/1.0 200 OK The request was successful: URL accessible
HTTP/1.0 401 Unauthorized Unauthorised access: inaccessible
HTTP/1.0 404 Not Found The resource could not be found: inaccessible
Determine Resource Type:
For each URL extracted, the system must determine the type of resource referenced: for instance, is
the URL a reference to a Databank, Document, Organisation, or Software? (Refer to Section 1.2.1
for the conceptualisation of these terms.)
A potential approach to determining the resource type of a given URL is a mix of rules and
keyword lists, where each list corresponds to a specific resource type. Consider the following
hypothetical example:
1 <ref id="CR9">
2 <citation citation-type="other">
3 The report was provided by World Health Organisation (
4 <ext-link ext-link-type="uri" xlink:href="http://www.who.int/report">
5 http://www.who.int/report
6 </ext-link>
7 ).
8 </citation>
9 </ref>
20. Another valid value for ext-link-type is ftp (File Transfer Protocol).
21. An external URL refers to resources/URLs outside the scope of the article. There exist other (internal) URLs within PMC documents which serve various XML-specific purposes (e.g., namespace declarations); these are not valid external URLs.
A potential solution to determining the referenced resource type is:
1. Analyse the extracted URL string for keywords that characterise specific URL
classes (e.g., report could be used as a keyword indicating the Document resource
type); if unable to determine the resource type, try the next step:
2. Get the URL context (example lines 3-7):
The report was provided by World Health Organisation (http://www.who.int/report).
3. Subsequently, analyse this context (word by word) for keywords, starting from the
location of the URL within the string until the start of the sentence (see bold text in
the example given above).
In this example, report could be used as a keyword to determine the resource type (Document). For
each of the URL types, a list of characteristic keywords will be constructed and used.
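The steps above can be sketched as follows. This is a minimal illustration, not the ExtConX2 implementation: the class and method names are assumptions, and the keyword lists are abbreviated from Table 22.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceTypeGuesser {

    /** Illustrative keyword-to-type list (abbreviated; the full lists are
     *  in Appendix B). Insertion order decides which keyword wins when
     *  several could match. */
    private static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("databank", "Databank");
        KEYWORDS.put("report", "Document");
        KEYWORDS.put("organisation", "Organisation");
        KEYWORDS.put("software", "Software");
    }

    /** Step 1: look for keywords within the URL string itself. */
    public static String fromUrlString(String url) {
        String lower = url.toLowerCase();
        for (Map.Entry<String, String> e : KEYWORDS.entrySet()) {
            if (lower.contains(e.getKey())) return e.getValue();
        }
        return null; // undetermined
    }

    /** Steps 2-3: scan the sentence word by word, from the URL position
     *  backwards to the start of the sentence. */
    public static String fromContext(String sentence, String url) {
        int urlPos = sentence.indexOf(url);
        if (urlPos < 0) return null;
        String[] words = sentence.substring(0, urlPos).split("\\s+");
        for (int i = words.length - 1; i >= 0; i--) {
            String type = fromUrlString(words[i]); // reuse keyword lookup
            if (type != null) return type;
        }
        return null;
    }
}
```

For the running example, `fromUrlString("http://www.who.int/report")` already resolves to Document via the keyword report, so step 1 succeeds and the context scan is not needed.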
4.2.2. IE Module
The IE Module encapsulates the IE application which is responsible for role extraction.
Specifically, given an acknowledgement sentence, the IE Module must enable the identification and
extraction of NEs and their respective REs.
a) Acknowledgement Extraction
A rule-based approach in conjunction with gazetteers may be adopted for role extraction. Apart
from common TM stages previously discussed (see Section 2.1), some notable highlights are:
1. The use of gazetteers to define:
i. NEs: persons and organisations
ii. REs: collaborators and funders (Table 19)
Table 19 – Examples of REs for Collaborators and Funders
Collaborator Roles Funder Roles
Editorial support Financial support
Reviewing the manuscript Grant-in-aid
Helpful comments Grant
Helpful suggestions Funding
2. A rule-based approach applied at the semantic processing level (see Section 2): linking
NEs to their respective REs (Role Matcher: Figure 5).
3. Subsequently, programmatically extract these sets of NEs and corresponding REs (IE) and
insert them into a predefined template/database.
The generic NLP/IE pipeline is given in Figure 5.
Figure 5 - Generic NLP/IE Pipeline
For instance, consider the following acknowledgements:
1. The authors are grateful to John Dough for reviewing the manuscript.
2. This research was funded by BBSRC.
The NLP/IE process is as follows:
a) Get NEs
i. Person NE: John Dough
ii. Organisation NE: BBSRC
b) Get REs
i. Collaborator RE: reviewing the manuscript
ii. Funder RE: funded
c) Identify the respective RE for each NE:
Patterns indicating an association between an NE and an RE, identified from the examples
above, are:
1. NE for RE (collaborator)
2. RE by NE (funder)
Hence, for the given examples, the application of rules identifying these patterns is
sufficient at the semantic processing level.
d) Insert this data into predefined template/database:
Table 20 - Results of TM Process
(1) Named Entity: John Dough
Role (enumeration): Collaborator
Role Expression: reviewing the manuscript
(2) Named Entity: BBSRC
Role (enumeration): Funder
Role Expression: funded
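The two association patterns above can be illustrated outside GATE with plain Java regular expressions. This is only a sketch of the idea: the real Role Matcher is a JAPE transducer operating over annotations, and the class name, patterns, and output format below are assumptions for the two example sentences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoleMatcherSketch {

    // Pattern 1, "NE for RE" -> Collaborator:
    // e.g. "John Dough for reviewing the manuscript"
    private static final Pattern COLLABORATOR = Pattern.compile(
            "(?<ne>[A-Z]\\w+(?: [A-Z]\\w+)*) for (?<re>[a-z][\\w ]+)");

    // Pattern 2, "RE by NE" -> Funder:
    // e.g. "funded by BBSRC"
    private static final Pattern FUNDER = Pattern.compile(
            "(?<re>funded|supported) by (?<ne>[A-Z]\\w+(?: [A-Z]\\w+)*)");

    /** Returns "NE=...;RE=...;ROLE=..." for the first pattern that fires,
     *  or null when neither association pattern is present. */
    public static String match(String sentence) {
        Matcher m = COLLABORATOR.matcher(sentence);
        if (m.find()) {
            return "NE=" + m.group("ne") + ";RE=" + m.group("re")
                    + ";ROLE=Collaborator";
        }
        m = FUNDER.matcher(sentence);
        if (m.find()) {
            return "NE=" + m.group("ne") + ";RE=" + m.group("re")
                    + ";ROLE=Funder";
        }
        return null;
    }
}
```

On sentence (1) the collaborator pattern binds NE = John Dough and RE = reviewing the manuscript; on sentence (2) the funder pattern binds NE = BBSRC and RE = funded, mirroring Table 20.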
4.3. System Architecture
System architecture is the organisation of a system in terms of its software components, including
subsystems and the relationships and interactions among them, and the principles that guide the
design of that software system (Bennett et al., 2006, p.340). System architecture can directly
influence the non-functional features of a system (Bennett et al., 2006). For instance, a subsystems
architecture is known for advantages such as maximising reusability and improving maintainability,
among other things (Bennett et al., 2006). Therefore, the non-functional requirements previously
defined (Section 3.3) have been a central factor in the architectural design and implementation of
ExtConX2.
4.3.1. Subsystems Architecture
The design of ExtConX2 is based on a subsystems architecture, i.e., SoC at the system level, or
subdivision into software components which share some common properties (Bennett et al., 2006).
This means that the system is subdivided into different layers of abstraction, or layers of service,
which are responsible for different aspects of the functionality of the system as a whole (Bennett et
al., 2006, p.350). This approach has several known advantages:
- Maximises reusability
- Helps developers handle complexity
- Improves maintainability
- Aids portability
ExtConX2 has three layers of abstraction:
1. Presentation Layer
The presentation layer is the topmost layer and is responsible for human-computer
interaction (HCI). This layer enables interaction between the user and system
functionalities through a graphical user interface (GUI). A user is able to control/initiate
system functionalities (encapsulated by layer 2, the application layer) through input
parameters, and view the output resulting from the processing of the application layer. The
presentation layer satisfies functional user requirements 1-4 and functional system
requirements 14-15 (refer to Section 3.2).
2. Application Layer
The application layer is responsible for domain logic or domain specific functionalities of
ExtConX2: the core functional requirements of the system (i.e., functional system
requirements 5-11).
3. Database Layer
The database layer encapsulates the relational database management system (RDBMS) and
system specific database operations such as synchronisation between Shared DB and
System DB (i.e., between processed documents and PMC documents available for
processing), retrieval of documents to be processed, and insert data into the System DB.
The database layer satisfies functional system requirements 12-13.
The architecture of ExtConX2 is based on layered subsystems (see Bennett et al. 2006, p.351): any
layer N can only use the services provided by the layer immediately below it (N -1). For instance,
the presentation layer cannot directly use any services provided by the database layer (see Figure
6). This level of abstraction minimises dependencies among layers (and software components) and
facilitates extensibility and maintainability of the system (Bennett et al., 2006).
Figure 6 - ExtConX2 Layered Subsystems
4.4. System Design
This section provides detailed description of the system design, such as: database, application, and
presentation layers. All illustrations provided are based on class implementations. Complete system
designed is provided in Appendix A, Figure 14.
4.4.1. Database Layer
The database layer encapsulates system functionalities, or services, responsible for database
operations. This layer provides services for the application layer directly above it (N + 1).
Figure 7 illustrates the main components of the database layer.
Figure 7 - ExtConX2 Database Layer
a) Description of Database Layer
1. Db Manager - The Db Manager is responsible for maintaining synchronisation between
the Shared Db (containing PMC XML documents) and the System Db. This is achieved by
two methods: one determines the last existing PMC document in the Shared Db, and the
other determines the last processed PMC document stored in the System Db.22
2. Db Traverser - The Db Traverser is responsible for retrieving data from the Shared Db. In
addition, Db Manager is utilised by Db Traverser to ensure synchronisation.
3. Data Inserter - The Data Inserter encapsulates methods to insert processed data into the
System Db.
b) Relational System Schema
Below is the Relational Database Schema used by ExtConX2; the EER diagram may be viewed in
Appendix A, Figure 13. The Shared Db (in part)23 and the System Db are both represented in
Figure 8.
PMC Articles contains PMC articles in XML format, and is linked from the Shared Db.
The System Db contains four relations: Meta Data, URL, Role, and Acknowledgement.
22. Both methods rely on the auto-incremented key and foreign key in the Shared Db and System Db, respectively.
23. Only the relevant relation (PMC-Articles) and attributes of the Shared Db are included in the Relational/EER diagram.
Figure 8 - Relational Database Schema24
4.4.2. Application Layer
The application layer encapsulates domain logic: functional system requirements 5-11. This layer is
further subdivided into three separate modules (see Figure 9):
- URL Module, which contains classes for URL extraction and related processes.
- IE Module, which contains classes for role extraction and related processes.
- Parser Module, which encapsulates classes for parsing and for handling NLM Journal
Archiving and Interchange DTDs.
This subdivision of the application layer into further refined SoC is another example (in addition to
the subdivision at system level) of architectural design which addresses non-functional
requirements of ExtConX2.
24. The different types of arrows are used only for visibility.
Figure 9 - ExtConX2 Application Layer
a) URL Module
The URL Module is responsible for extracting URLs from PMC documents,25 checking whether
each extracted URL is accessible, and determining the type of resource referenced. The URL
Module contains the following classes:
1. URL - The URL class may be described as a super-class; its responsibility includes
extraction of URLs from PMC documents and invoking other operations (i.e., URL Status
and Resource Type). In addition, URL acts as a gateway between the database layer and
application layer (i.e., retrieving PMC documents and returning processed data).
2. URL Status - URL Status checks if a given URL is accessible or not.
3. URL Identifier - URL Identifier is responsible for syntactically validating URLs, and to
identify URL protocols if any (i.e., http:// and ftp://). The latter functionality is used by
URL Status.
25. Not including URLs which are part of the article metadata, i.e., the corresponding prepublication paper and licence (http://creativecommons.org).
4. Resource Type - Resource Type is responsible for collecting the possible types of resource
referenced (i.e., Databank, Document, Organisation, or Software). Refer to Section 4.2.1
for further description.
5. Soft Decision - Soft Decision may be described as a sub-class of Resource Type which
contains a method to determine the most likely URL resource type from a set of collected
possibilities (refer to Section 4.2.1 for description).
b) IE Module
The IE Module encapsulates the TM application which handles role extraction: specifically,
pre-processing of acknowledgement text (i.e., NLP) and subsequent IE (extraction of collaborators
and funders, and their respective REs).
1. IE - The IE class is the super-class within the IE Module; it extracts acknowledgement
text from PMC documents and invokes the IE Application and Role Extractor in order to
complete the acknowledgement extraction sequence.
2. IE Application - The IE Application encapsulates the TM application (designed with
GATE). This class handles the pre-processing of acknowledgement text (including
providing annotations over NEs and their respective REs). Further description is provided in
Section 5.3.
3. Role Extractor - The Role Extractor extracts NEs and their corresponding roles from pre-
processed acknowledgement text.
c) Parser Module
The Parser Module encapsulates the parser and a class to handle NLM Journal Archiving and
Interchange DTDs.
1. Parser - The Parser encapsulates the Document Object Model (DOM) parser used to parse
PMC documents.
2. DTD Resolver - DTD Resolver is responsible for redirecting XML System IDs26 to the
local directory where the NLM Journal Archiving and Interchange DTDs are stored. This
class is needed due to the variety of DTDs required for parsing PMC documents.
26. The System ID is the URI/URL pointing to a given XML document's DTD.
4.4.3. Presentation Layer
The presentation layer encapsulates methods for HCI (Figure 10). It includes the following classes:
Figure 10 - ExtConX2 Presentation Layer
1. Function Panel - This class constructs the function panel or buttons to initiate various
functionalities (e.g., initiating URL extraction and role extraction).
2. Entry Panel - This class constructs the entry panel: e.g., text fields for user input such as
parameters for number of documents to be processed etc.
3. Quitable Frame - This class is responsible for the popup dialog box that asks the user to
confirm before exiting the program.
4. GUI - This class constructs the GUI by invoking other classes.
5. InvokeApp - Acts as a gateway to the application layer, initiating application logic from
user input (see Appendix A, Figure 14).
5. Implementation
This chapter describes the implementation of the main functional requirements of ExtConX2: URL
Module and IE Module (refer to Figure 9). However, these descriptions are not comprehensive, as
only a few of the more noteworthy aspects are included. Other materials not provided in this
dissertation are available on the project website (http://gnode1.mib.man.ac.uk/projects/ExtConX2/).
5.1. Tools & Implementation Environment
Tools used to implement the various components of ExtConX2 include:
1. Java Standard Edition 6 & Java Platform Enterprise Edition 6
Due to wide availability of tools and APIs for TM, Java was used as the main
programming language.
2. Eclipse IDE
Eclipse was used as the development environment.
3. Xerces Java Parser 2.6.1 – Document Object Model Parser
The DOM API is used by ExtConX2 to parse XML documents. While the Simple API for
XML (SAX) uses fewer resources and outperforms the DOM parser in terms of speed
(Frankling, 2010), DOM provides greater flexibility for the tasks required. For instance,
within some PMC documents, and all GATE XML documents (see Section 5.3.3.1),
certain XML tags lack separate closing tags, e.g.:
1 <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov" />
In these cases, SAX does not recognise these tags and is therefore unable to extract these
URLs. However, DOM provides the functionality required.
Descriptions of other tools used are provided by relevant implementation modules/components.
5.2. Implementation of URL Module
This section provides a detailed description of the implementation of the URL Module: specifically,
the method adopted for URL extraction, the method adopted to determine the type of resource
referenced (including soft decision), and a brief description of the implemented process of
extracting a URL and determining its resource type.
5.2.1. Extraction of URLs
Extraction of URLs from PMC documents is achieved through the use of the inherent NLM Journal
Archiving and Interchange Tag Suite together with regular expressions. The two methods
complement each other and achieve better recall and precision than either method on its own. An
analysis of roughly 100 documents showed that it is becoming common practice to provide hyperlink
text within XML documents rather than a visible URL (see the example below). Thus, the sole use of
regular expressions on printable text resulted in poor recall.
1 <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov"> hyperlink</ext-link>
In addition, a clear majority of documents providing visible URLs also include the URLs as
attribute values within the XML tags. Therefore, URL extraction may be achieved solely through
the use of XML tags. However, regular expressions were still used to syntactically validate the
extracted URLs to accommodate human error; this helped improve precision.
The implemented process to extract URLs is described below:
1. Parse PMC document using the DOM parser.
2. Traverse through the parsed document to find XML tags defining URLs (i.e., ext-link).
a. Ensure the latter tag includes the attribute ext-link-type, and that this attribute has
the value uri (i.e., ext-link-type="uri"). This indicates that the XML element
contains a URL.
3. Extract the attribute value of xlink:href which, by inference, ought to be a URL.
4. Finally, the value of xlink:href is syntactically validated as a URL by applying regular
expressions (see Table 21). This step is applied as a precaution against potential
human error.27
Table 21 – Regular Expressions for URL Validation
(((http|https|ftp)://)|(www\\.))+([\\d\\D&&[^\\s(@]])*\\.([\\d\\D&&[^\\s@)]]\\.?)+
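Step 4 can be sketched by compiling the expression from Table 21 exactly as written (the double backslashes are Java string escapes; the class intersection `[\d\D&&[^\s(@]]` means "any character except whitespace, '(' and '@'"). The class and method names below are illustrative, not the ExtConX2 code.

```java
import java.util.regex.Pattern;

public class UrlValidator {

    // The expression from Table 21: a scheme or "www." prefix, then any
    // non-whitespace run containing at least one dot-separated segment.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(((http|https|ftp)://)|(www\\.))+"
            + "([\\d\\D&&[^\\s(@]])*\\.([\\d\\D&&[^\\s@)]]\\.?)+");

    /** Returns true when the whole candidate string is syntactically a URL. */
    public static boolean isValidUrl(String candidate) {
        return URL_PATTERN.matcher(candidate).matches();
    }
}
```

A string such as http://www.who.int validates, while a stray identifier like doi:10.1000/example is rejected because it lacks the required scheme or www. prefix (compare footnote 27 on DOIs placed in URL tags).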
5.2.2. Checking Resource Availability
The Java API URLConnection (specifically, its sub-class HttpURLConnection) is used to check
whether extracted URLs are accessible. A connection request is sent for each extracted URL with a
connection timeout of 10 seconds. The URL is considered accessible/active if an HTTP 200 OK
response code is received (see Table 18). If no response code is returned within 10 seconds, or if
any other response code is received, the URL is considered inaccessible.
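The check described above may be sketched as follows. The class and method names are assumptions; the decision logic is factored into a small pure method so that only HTTP 200 counts as accessible, exactly as stated.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlStatusChecker {

    private static final int TIMEOUT_MS = 10000; // 10-second timeout

    /** Only HTTP 200 OK counts as accessible; every other response code
     *  marks the URL inactive. */
    public static boolean isAccessible(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_OK; // 200
    }

    /** Sends a connection request and maps the outcome to active/inactive. */
    public static boolean checkUrl(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(TIMEOUT_MS);
            conn.setReadTimeout(TIMEOUT_MS);
            return isAccessible(conn.getResponseCode());
        } catch (IOException e) {
            return false; // timeout or connection failure: inactive
        }
    }
}
```

Catching IOException covers both the timeout case and unreachable hosts, so any failure to obtain a response code within 10 seconds yields inactive.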
27. For instance, a common error found is the inclusion of a Digital Object Identifier (DOI) instead of a URL within tags defined for URLs.
5.2.3. Determining Resource Type
For each URL extracted, the system must determine the type of resource referenced (refer to
Section 1.2.1 for the conceptualisation of resource types). The approach used to achieve this end is
rule-based, in conjunction with lists containing keywords (and URLs). The choice of keywords is
based upon iterative testing and analysis of roughly 100 PMC documents, with keywords carefully
chosen to reflect the relevant resource type. Table 22 shows a subset of five keywords used for each
resource type; the full list is provided in Appendix B, Table 36.
Table 22 – Sample of Keywords
Databank Document Organisation Software
data bank .doc organisation software
databank .pdf organization sourceforge
database journal institute program
genBank report international agency application
ncbi.nlm.nih.gov/protein facts - system
Moreover, all keywords are loaded as regular expressions. This has advantages such as:
1. Keywords can easily be matched case-insensitively; separate uppercase and lowercase
spellings of each word are not needed.
2. The grammatical root form of each keyword is sufficient.28 Hence, shorter keyword lists
suffice to fulfil this function.
5.2.3.1. Soft Decision
Soft decision is a method/algorithm used to determine the most likely resource type for each URL
extracted. Up to four instances of the resource type may be determined for each URL mention
through analysis of the URL context (see also the implementation description in Section 5.2.3.2):
1. By keywords identified within the URL string.
2. By keywords identified within the parent node of the URL tag.
Typically, within the reference list the parent node contains the title and/or
description of a reference.
3. By keywords identified within the parent-parent node of the URL tag.
See the previous description. This is needed due to inconsistent use of nodes within
XML documents: some reference titles/descriptions are not contained within the
first parent node, but within the parent-parent node.
4. By keywords identified within the citation context of the article (i.e., the actual sentence
where the resource is cited within the article body).
28. For instance, separate singular and plural forms of each keyword are not needed.
Once all instances have been collected, the data is processed by soft decision.
The soft decision algorithm distributes a total weight of 1 across the four decision instances.
The resource type with the largest accumulated weight is identified as the most likely type; if two
types have equal weight, the first identified is returned as the likely type. The distribution of
weights is based upon an iterative analysis of which decision instance is most reliable, and is
defined as follows:
Table 23 – Distributed Score of Soft Decision Algorithm
Distributed Score Description
1 0.400 Keyword identified within the URL string.
2 0.225 Keyword identified within the parent node.
3 0.225 Keyword identified within the parent-parent node.
4 0.150 Keyword identified by the citation reference within the article body.
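The weighting scheme of Table 23 can be sketched as follows. This is a minimal illustration of the algorithm as described, not the ExtConX2 implementation; the class name and array representation are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SoftDecision {

    // Weights from Table 23, indexed by decision instance 1-4.
    private static final double[] WEIGHTS = {0.400, 0.225, 0.225, 0.150};

    /**
     * Sums the weight contributed by each decision instance per resource
     * type and returns the type with the largest total. A null instance
     * (no keyword found) contributes nothing; on a tie the first type
     * identified wins, which the insertion-ordered LinkedHashMap ensures.
     *
     * @param instances resource types from the four decision instances, in
     *                  order: URL string, parent node, parent-parent node,
     *                  article-body citation (null = undetermined)
     */
    public static String decide(String[] instances) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (int i = 0; i < instances.length; i++) {
            if (instances[i] == null) continue;
            totals.merge(instances[i], WEIGHTS[i], Double::sum);
        }
        String best = null;
        double bestWeight = 0.0;
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            if (e.getValue() > bestWeight) { // strict '>' keeps first on ties
                best = e.getKey();
                bestWeight = e.getValue();
            }
        }
        return best;
    }
}
```

For the worked example in Section 5.2.3.2, three Software instances and one undetermined instance give Software a total weight of 0.400 + 0.225 + 0.225 = 0.85, so `decide` returns Software.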
5.2.3.2. Implementation of URL Module Described
Consider this hypothetical example:
1 <ref id="CR9">
2 <citation citation-type="other">
3 MZmine 2 – software for mass-spectrometry was used in this research(
4 <ext-link ext-link-type="uri" xlink:href="http://www.mzm.sourceforge.net">
5 http://www.mzm.sourceforge.net
6 </ext-link>
7 ); to process the data presented in the results section.
8 </citation>
9 </ref>
The implemented process adopted to determine the referenced resource type is as follows:
1. Parse the document using DOM parser.
2. Traverse through the parsed document to extract the URL.
a. Analyse URL string for keywords (see bold text below). Save the result for
analysis by soft decision.
http://www.mzm.sourceforge.net
b. (1) Get the parent node's (i.e., citation) context (all the text between the citation
start and end tags). (2) Analyse this context (word by word) for keywords, starting
from the location of the URL within this string until the start of the sentence (see
bold text below).
MZmine 2 – software for mass-spectrometry was
used in this research
(http://www.mzm.sourceforge.net); to process
the data presented in the results section.
If unable to determine a resource type, (3) analyse whole citation context (see bold
text below) starting from the beginning of the sentence to the end.
MZmine 2 – software for mass-spectrometry was
used in this research
(http://www.mzm.sourceforge.net); to process
the data presented in the results section.
Save the result for analysis by soft decision.
c. Do the same as in the previous step, but with the parent-parent node context (in this
example the parent-parent node is the ref tag, and the analysis of its context gives
an identical result to the previous step). Save the result for analysis by soft decision.
d. (1) Get the ref element's id attribute value (i.e., CR9), if it exists (if not, return
null). (2) Find this citation within the article body by the reference id (CR9). (3)
Finally, analyse the sentence word by word for keywords, starting from the
location of the citation until the start of that sentence/paragraph. Save the result for
analysis by soft decision.
3. Determine the most likely resource type by soft decision.
a. The soft decision data derived from the example above, based on the keywords
provided in Table 22, would be (Table 24):
Table 24 – Result by Soft Decision Algorithm
Instance Weight Resource Type Description
1 0.40 Software By keyword: sourceforge within the URL string
2 0.225 Software By keyword: software within the parent context
3 0.225 Software By keyword: software within the parent-parent context
4 0.150 null Assuming no identifiable keywords within the article-body
citation.
Hence, the Software resource type would have a total weight of 0.85, so even if the last instance
were identified as any other resource type, Software would still be returned by soft decision as the
most likely resource type.
5.3. Implementation of IE Module
This section provides a detailed description of a subset of the implementation of the IE Module
(refer to Figure 9). It presents the methods adopted for identification and extraction of NEs, REs,
and the semantic level processing.
5.3.1. GATE
GATE was used to develop the IE Application for the extraction of acknowledgements. While
many alternatives exist, such as LingPipe or MinorThird, GATE was chosen for its extensive
documentation, user-friendly IDE for debugging and development, and easy integration
with Java.
GATE's default IE system, A Nearly-New Information Extraction System (ANNIE), was used as
a starting point for the development of the IE Application. ANNIE contains a set of default
processing resources, mostly based on the Java Annotation Pattern Engine (JAPE)29 (see the
default ANNIE pipeline in Appendix A, Figure 15), which were amended and extended as
required to meet the requirements of this module.
5.3.2. Java Annotation Pattern Engine
The Java Annotation Pattern Engine (JAPE) is a rule-based language which provides finite state
transduction over annotations (Cunningham et al., 2010), enabling various IE tasks through the
manipulation of existing annotations and the creation of new ones. A JAPE grammar may be split
into a set of phases, consisting of patterns and action rules, that run sequentially (Cunningham et
al., 2010) in a customised, defined order. The ability to create sequential pattern/action rules
enables complex extraction patterns to be simplified into incremental rules (see Section 5.3 for an
example).
A JAPE rule consists of two primary parts: the left-hand side (LHS) and the right-hand side (RHS).
The LHS consists of rule-based pattern description(s), and the RHS consists of action rules, or
annotation manipulation statements. The JAPE syntax used for pattern descriptions is quite similar
to the regular expression syntax of most programming languages, hence no description of the
syntax is provided here (refer to Cunningham et al., 2010, Chapter 8). The following example is a
simplified JAPE rule which identifies the pattern of two consecutive, upper-initial proper nouns and
subsequently labels them as Person (the syntax is described in the comments):
[29] JAPE is based on the Common Pattern Specification Language (CPSL).
Phase: AnnotatePerson  // Phase name or identifier for the rule

// The input annotation type must be defined (e.g., annotated by the
// POS tagger) so that it can be used by the pattern description
Input: Token

Rule: Person1  // Rule name
(
  // Pattern: NNP NNP (with uppercase initials)
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
  {Token.kind==word, Token.category==NNP, Token.orth==upperInitial}
):temp  // Temporary label

-->  // Everything above this symbol is the LHS; everything below, the RHS

// Convert the temporary label to a permanent annotation/label: Person
:temp.Person = {rule = "Person1"}
5.3.3. Implementation of IE Module Described
Similarly to Giles and Councill (2004), the NLP/IE process is not applied to entire PMC documents, but only to the extracted acknowledgement sections. The general process for role extraction is as follows:
1. PMC documents are parsed using a DOM parser.
2. The acknowledgement section is extracted using the NLM Journal Archiving and Interchange DTD tag: ack.
3. Subsequently, this text is processed using the IE Application developed with GATE (refer to Figure 9). The output of this process is a GATE XML document containing a dump of the annotations (i.e., NEs and their respective REs) in XML format.
4. The GATE XML document is programmatically parsed to extract the NEs and their respective REs, which are then inserted into the System Db.
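Step 2 above can be sketched as follows. This is a minimal illustration assuming a well-formed XML input; the class and method names are hypothetical, not the system's actual implementation:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class AckExtractor {
    // Returns the text content of the first <ack> element, or null if absent.
    static String extractAck(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList acks = doc.getElementsByTagName("ack");
        return acks.getLength() > 0 ? acks.item(0).getTextContent().trim() : null;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<article><back><ack><p>We thank Dr Melvin Simon.</p>"
                      + "</ack></back></article>";
        System.out.println(extractAck(sample)); // We thank Dr Melvin Simon.
    }
}
```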
Figure 11 - IE Application Pipeline
5.3.3.1. Description of IE Application
Of the eight processing resources used for the text pre-processing and IE tasks (Figure 11), four are custom designed: the Gazetteer (partially), the NE-Extended Transducer, the Role Expression Transducer, and the Role Context Transducer. The latter three are developed using JAPE. A description of these processing resources, with some implementation examples, follows:
1. Gazetteer - The ANNIE gazetteer, which is used for named entity recognition (by default), is further extended to accommodate role extraction.[30] In particular:
i. The organisations list is extended with known funding organisations.[31]
ii. Role Expression lists are added, containing collaboration and funder roles (see Tables 25 and 26). Each type of role has two separate lists: (1) a multi-word and (2) a one-word list. This enables the prioritisation of multi-word roles at the semantic level of processing, which yields better evaluation results (one-word roles tend to result in only partial identification of roles).

[30] Extended lists are available on the project website (http://gnode1.mib.man.ac.uk/projects/ExtConX2/)
[31] Resources used for collecting research funding organisation names include: Wikipedia (2010), NIH (2010), and Giles and Councill (2004).
Table 25 – Sample of One-Word Role Expression Lists
Funding Roles Collaboration Roles
Grant-In-Aid advice
grants assistance
sponsor discussions
sponsored comments
sponsors encouragement
Table 26 – Sample of Multi-Word Role Expression Lists
Funding Roles Collaboration Roles
financially supported assistance and comments
fellowship award critically reading
financial support critically reviewing the manuscript
research fund helpful comments
research funds technical assistance
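The prioritisation of multi-word over one-word role lists can be sketched as a longest-match-first lookup. The following is a minimal illustration only; the class, method, and the exact list entries are assumptions, not the system's gazetteer implementation:

```java
import java.util.List;

public class RoleLookup {
    // Hypothetical gazetteer lists mirroring Tables 25 and 26.
    static final List<String> MULTI = List.of("financial support", "helpful comments");
    static final List<String> ONE   = List.of("support", "comments", "advice");

    // Multi-word entries are tried first, so that "financial support"
    // wins over the partial one-word match "support".
    static String matchRole(String text) {
        for (String r : MULTI) if (text.contains(r)) return r;
        for (String r : ONE)   if (text.contains(r)) return r;
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchRole("thanked for financial support")); // financial support
    }
}
```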
2. NE-Extended Transducer - The ANNIE Gazetteer provides annotation of named entities (e.g., persons: first and last names); subsequently, the ANNIE NE Transducer, which is based on JAPE, contains rules that manipulate these annotations further to create, e.g., person full names (i.e., the linking of first and last names). The NE-Extended Transducer is required to complement the ANNIE Gazetteer and NE Transducer. Initial testing of the ANNIE system showed that a considerable number of NEs were missed, in particular non-English names. This resource was needed to improve the performance of the semantic processing resource (i.e., the Role Context Transducer), which links NEs to their respective REs.[32] Below is a simplified version of a rule used; see the inline comments for descriptions:
Rule: PersonExt1  // Rule name or identifier

/* VB - verb, base form: subsumes imperatives, infinitives and
 * subjunctives.
 * VBP - verb, non-3rd person singular present.
 * Target: e.g., "thank", "grateful", and so on
 */
({Token.kind==word, Token.category==VB}|
 {Token.kind==word, Token.category==VBP})

// Any word token that is not a Person or Organization.
// Target: e.g., 'to', 'for', and so on.
({Token.kind==word, !Token.orth==upperInitial,
  !Person, !Organization})?

/* NNP - proper noun, singular: all names are typically
 * capitalised.
 * Create a temporary label over the following pattern, given
 * that the preceding patterns are true.
 */
(
 {Token.kind==word, Token.category==NNP,
  Token.orth==upperInitial, !Person, !Organization}
 {Token.kind==word, Token.category==NNP,
  Token.orth==upperInitial, !Person, !Organization}
):temp  // Temporary label

-->  // LHS --> RHS

/* Convert the temporary label, "temp", to the permanent label
 * "Person" with the given features:
 * "rule = PersonExt1" and "rule1 = PersonFull"
 */
:temp.Person = {rule = "PersonExt1", rule1 = "PersonFull"}

[32] This is because the role association rules depend on the correct functioning of the NER system.
The above rule annotates two consecutive proper nouns with uppercase initials (NNP) as Person, given that the NNPs have not already been annotated by the default ANNIE resources and that they are preceded by a verb (either the base form, which subsumes imperatives, infinitives and subjunctives (VB), or the non-3rd-person singular present (VBP)). Relevant VB/VBP tokens include: thank, grateful, etc. For instance, given the following sentences:
i. We are grateful to Jong Zang...
ii. We thank Youm Dom...
Jong Zang and Youm Dom will be annotated as Person by the rule provided above.
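As a rough analogue, the behaviour of this rule can be approximated with a regular expression (an approximation only — the real rule operates over POS-tagged GATE annotations, not raw text, and the pattern below is an assumption):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PersonPattern {
    // Rough regex analogue of PersonExt1: a trigger verb, an optional
    // intervening function word, then two capitalised tokens.
    static final Pattern P = Pattern.compile(
        "\\b(?:thank|grateful)\\b(?: \\w+)? ([A-Z][a-z]+ [A-Z][a-z]+)");

    static String person(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(person("We are grateful to Jong Zang for help.")); // Jong Zang
        System.out.println(person("We thank Youm Dom for support."));         // Youm Dom
    }
}
```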
3. Role Expression Transducer - In addition to the use of lists (gazetteer) to identify collaboration roles, a JAPE grammar is used. In fact, using a JAPE grammar to identify REs yields better performance than lists, because there are far too many varieties of collaboration roles to enumerate in lists. For instance, consider the following acknowledgement:
i. We thank Youm Dom for constructive feedback and providing GEO212.
The following rule (RoleExpression1), assuming that Youm Dom is annotated as Person, annotates the RE over the following part of the sentence (see bold text):
ii. We thank Youm Dom for constructive feedback and providing GEO212.
The latter annotation would also have been produced by the gazetteer. However, this is only a partial identification of the RE (i.e., providing GEO212 is missing). A description follows:
Rule: RoleExpression1  // Rule name
(
 {Person}  // Annotated NE: Person

 // The NE may be (note the use of "?") followed by '( ..... )',
 // typically containing associations
 ({Token.string=="("} ({Token})* {Token.string==")"})?

 // There might exist additional words between {Person} and
 // ['for'|'who']
 ({Token.kind==word, !Person}{Token.kind==word, !Person}|
  {Token.kind==word, !Person}{Token.kind==word, !Person}{Token.kind==word, !Person})?

 // The NE must be followed by 'for' or 'who' - indicating the
 // beginning of a role expression
 ({Token.string=="for"}|{Token.string=="who"})

 // PRP$ - possessive pronoun. Target cases: his, her,
 // or their (may exist)
 ({Token.category=="PRP$"})?
)

/* Annotate the following tokens/words as a role (with a temporary
 * label).
 * End the annotation if the negation cases are true: [.,;] or 'and'
 */
(
 ({Token, !Token.string==~"[.,;]", !Token.string=="and"})*
):role  // Temporary label

-->  // LHS --> RHS

// Convert the temporary label to the permanent label RoleExpression1
// with the given features.
:role.RoleExpression1 = {kind = "PersonCollab", rule = "CollabRule1"}
The acknowledgement example given above involves two challenges: (1) the RE extends over a conjunction (i.e., and), which could also indicate the end of an RE, and (2) the RE itself cannot be enumerated prior to processing the text (as previously discussed). JAPE provides the facility to split rules into a set of separate rules/phases in order to handle this sort of complexity. This approach has been adopted for annotating REs.[33]
The following rule (RoleExpression3) is applied to the text after RoleExpression1; hence, continuing the prior example, the result of the RE annotation would be as follows (see bold text):
iii. We thank Youm Dom for constructive feedback and providing GEO212.

[33] The use of phases has been adopted for all three processing resources developed: the NE-Extended Transducer, the Role Expression Transducer, and the Role Context Transducer.
Rule: RoleExpression3  // Rule name
(
 // Annotated Role Expression (derived from the previous rule)
 ({RoleExpression1})

 // Ensure RoleExpression1 is not followed by a new
 // acknowledgement
 ({Token.string=="and"}{!Person})
 ({Token.kind==word, !Person, !Token.string=="and"})?
 ({Token, !Token.string=="and", !Person, !Organization,
   !Token.string==~"[.,;]"})*
):temp  // Temporary label

-->  // LHS --> RHS

:temp.RoleExpression3 = {kind = "PersonCollab", rule = "CollabRule3"}
4. Role Context Transducer - The Role Context Transducer is responsible for the semantic level processing: the linking of NEs to their respective REs. This is the last processing resource applied before the extraction of the annotated roles. Below are examples of rules which link NEs (i.e., organisations) to their corresponding REs.
Consider the following example:
i. National Institute of Health provided funding for this research.
The application of the following rule (OrgFund1) results in the annotation of the NE and the RE (which is annotated using the customised RE lists discussed earlier) as a role context (see bold text):
ii. National Institute of Health provided funding for this research.
Rule: OrgFund1  // Rule name
(
 {Organization}  // Annotated NE: Organisation

 // There might exist a word between the NE and the RE (e.g., provided)
 ({Token.kind==word, !Token.string==","})?

 /* Find Gazetteer-annotated REs: Funder/Sponsor,
  * priority given to 'multi-word' roles.
  */
 ({Lookup.majorType==role_fund, Lookup.minorType==multi_word}|
  {Lookup.majorType==role_fund, Lookup.minorType==one_word})
):temp  // Temporary label

-->  // LHS --> RHS

/* Create a new annotation from the temporary label */
:temp.roleContext = {rule="OrgFund1"}
Another useful feature of JAPE, utilised by the Role Context Transducer, is that it enables the prioritisation of rules which are applied sequentially or in a cascade. For instance, if rules may overlap, prioritisation weights may be applied accordingly, prioritising one rule or set of annotations over another. The following rule (OrgFund2), which is applied after the previous rule, identifies a funder RE (as annotated by the gazetteer) and annotates the whole sentence as a role context. This method is practically feasible as acknowledgements of funders are typically separated by sentence.
Rule: OrgFund2
(
 /* Find Gazetteer-annotated roles: Funder/Sponsor,
  * priority given to 'multi-word' roles.
  */
 ({Lookup.majorType==role_fund, Lookup.minorType==multi_word}|
  {Lookup.majorType==role_fund, Lookup.minorType==one_word})

 /* Label all tokens until the end of the sentence */
 ({Token, !Split})*
):temp  // Temporary label

-->  // LHS --> RHS

/* Create a new annotation from the temporary label */
:temp.roleContext = {rule="OrgFund2"}
5.3.4. Information Extraction
Following the text pre-processing, the IE Application (refer to Figure 9) returns a GATE XML document (see the sample GATE document at the end of this section). The process adopted to extract roles is as follows:
1. Parse the GATE XML document using DOM parser.
2. Find annotation type: Role Context and store its StartNode and EndNode values (e.g., 9 and
87 respectively in the example below) for later reference:
28 <Annotation Id="1924" Type="roleContext" StartNode="9" EndNode="87">
3. Get annotation type Role Expression within the range of Role Context StartNode and
EndNode:
34 <Annotation Id="1923" Type="RoleExp" StartNode="29" EndNode="87">
35 <Feature>
36 <Name className="java.lang.String">rule </Name>
37 <Value className="java.lang.String">CollabRule3</Value>
38 </Feature>
39 <Feature>
40 <Name className="java.lang.String">kind</Name>
41 <Value className="java.lang.String">PersonCollab</Value>
42 </Feature>
43 </Annotation>
61 | P a g e
4. Determine the type of Role Expression by appropriate child node. In this example (see
above: line 41), it is identified as a Person Collaboration role (hence, NE: Person and RE:
Collaboration). In addition, store Role Expression StartNode and EndNode values (i.e., 29
and 87 respectively) for later reference.
5. Get annotation type Person within the range of Role Context, and store its StartNode and
EndNode values (i.e., 9 and 24 respectively) for later reference:
44 <Annotation Id="1921" Type="Person" StartNode="9" EndNode="24">
45 <Feature>
46 <Name className="java.lang.String">rule</Name>
47 <Value className="java.lang.String">PersonFinal</Value>
48 </Feature>
49 </Annotation>
6. Extract NE: Person and Role Expression by previously stored node values from the
document content area (which contains the acknowledgement text with serialised nodes
corresponding to annotations):
10 <TextWithNodes>
11 <Node id="0" />We<Node id="2" />
12 <Node id="3" />thank<Node id="8" />
13 <Node id="9" />Dr<Node id="11" />
14 <Node id="12" />Melvin<Node id="18" />
15 <Node id="19" />Simon<Node id="24" />
16 <Node id="25" />for<Node id="28" />
17 <Node id="29" />critical<Node id="37" />
18 <Node id="38" />reading<Node id="45" />
19 <Node id="46" />of<Node id="48" />
20 <Node id="49" />the<Node id="52" />
21 <Node id="53" />manuscript<Node id="63" />
22 <Node id="64" />and<Node id="67" />
23 <Node id="68" />helpful<Node id="75" />
24 <Node id="76" />discussions<Node id="87" />.<Node id="88" />
25 </TextWithNodes>
7. The result of role extraction from the given example is provided in Table 27. The full GATE XML document used in this example follows.
Table 27 - Results of Role Extraction
(1) Name Expression: Dr Melvin Simon
Role (enumeration): Collaborator
Role Expression: Critical reading of the manuscript and helpful
discussions
Gate XML Document:
1 <GateDocument>
2 <!-- The document's features-->
3 <GateDocumentFeatures>
4 <Feature>
5 <Name className="java.lang.String">MimeType</Name>
6 <Value className="java.lang.String">text/plain</Value>
7 </Feature>
8 </GateDocumentFeatures>
9 <!-- The document content area with serialised nodes -->
10 <TextWithNodes>
11 <Node id="0" />We<Node id="2" />
12 <Node id="3" />thank<Node id="8" />
13 <Node id="9" />Dr<Node id="11" />
14 <Node id="12" />Melvin<Node id="18" />
15 <Node id="19" />Simon<Node id="24" />
16 <Node id="25" />for<Node id="28" />
17 <Node id="29" />critical<Node id="37" />
18 <Node id="38" />reading<Node id="45" />
19 <Node id="46" />of<Node id="48" />
20 <Node id="49" />the<Node id="52" />
21 <Node id="53" />manuscript<Node id="63" />
22 <Node id="64" />and<Node id="67" />
23 <Node id="68" />helpful<Node id="75" />
24 <Node id="76" />discussions<Node id="87" />.<Node id="88" />
25 </TextWithNodes>
26 <!-- The default annotation set -->
27 <AnnotationSet>
28 <Annotation Id="1924" Type="roleContext" StartNode="9" EndNode="87">
29 <Feature>
30 <Name className="java.lang.String">rule</Name>
31 <Value className="java.lang.String">PersonCollab1</Value>
32 </Feature>
33 </Annotation>
34 <Annotation Id="1923" Type="RoleEntity" StartNode="29" EndNode="87">
35 <Feature>
36 <Name className="java.lang.String">rule </Name>
37 <Value className="java.lang.String">CollabRule3</Value>
38 </Feature>
39 <Feature>
40 <Name className="java.lang.String">kind</Name>
41 <Value className="java.lang.String">PersonCollab</Value>
42 </Feature>
43 </Annotation>
44 <Annotation Id="1921" Type="Person" StartNode="9" EndNode="24">
45 <Feature>
46 <Name className="java.lang.String">rule</Name>
47 <Value className="java.lang.String">PersonFinal</Value>
48 </Feature>
49 </Annotation>
50 </AnnotationSet>
51 </GateDocument>
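The extraction steps above rely on the fact that GATE node ids are character offsets into the document text, so once an annotation's StartNode and EndNode values have been read from the XML, its span can be recovered with a simple substring. A minimal sketch (the class and method names are hypothetical), using the offsets from the sample document:

```java
public class GateRoleParser {
    // GATE node ids are character offsets into the document text,
    // so an annotation's span is simply text.substring(start, end).
    static String span(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "We thank Dr Melvin Simon for critical reading of the "
                    + "manuscript and helpful discussions.";
        // Offsets taken from the sample annotations above:
        // Person (9, 24) and RoleEntity (29, 87).
        System.out.println(span(text, 9, 24));  // Dr Melvin Simon
        System.out.println(span(text, 29, 87)); // critical reading ... discussions
    }
}
```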
6. Evaluation
This chapter presents and discusses the evaluation of the methods adopted and the results obtained from the facts analysed during the knowledge discovery stage of the dissertation. The chapter is subdivided into three main sections: (1) the evaluation of URL Extraction, (2) the evaluation of Role Extraction, and (3) a discussion of system issues.
The ExtConX2 IE tasks were evaluated using customary recall- and precision-based metrics. Table 28 defines the evaluation terms used in the subsequent definitions of Precision (P), Recall (R), and F-measure (F).
Table 28 – Evaluation Terms Described
              Relevant               Non-relevant
Extracted     True positives (tp)    False positives (fp)
Not Extracted False negatives (fn)   True negatives (tn)

P = tp / (tp + fp)    R = tp / (tp + fn)    F = (2 × P × R) / (P + R)
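These metrics can be computed directly from the counts defined in Table 28. A minimal sketch (the counts used in main are illustrative only, not taken from the evaluation data):

```java
public class Metrics {
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double f1(double p, double r)    { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        // Illustrative counts only.
        double p = precision(90, 7);
        double r = recall(90, 43);
        System.out.printf("P=%.3f R=%.3f F=%.3f%n", p, r, f1(p, r));
    }
}
```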
6.1. URL Extraction
Roughly 190,000 PMC documents were processed; of these, 47,644 contained a total of 147,133 URLs (95,799 unique). Based on the evaluation of a random sample of 50 documents (222 URLs), the adopted approach achieved 98.6% precision and 96% recall for URL extraction (see Appendix C for the evaluation data). In addition, the soft decision algorithm achieved a recall of 81.1% and a precision of 88.7% for the classification of resources.
The most referenced resource types, by total number of extracted URLs, are presented below (Table 29). As expected, the Document resource type is referenced most of all. However, an interesting discovery is the percentage of references to the Software type (see Chapter 7).
Table 29 – Total Resource Type Referenced
URL Resource Type    Total Identified URLs    % of Total
None                 16,865                   11.46%
Databank             15,409                   10.47%
Document             72,197                   49.07%
Organisation         7,353                    5.00%
Software             35,309                   24.00%
TOTAL:               147,133                  100%
Table 30 provides a summary of accessible/inaccessible online resources by the URLs' year of publication:[34]
Table 30 – Resource Availability by Year
Year    Total URLs    Accessible URLs    Inaccessible URLs    % Inaccessible by Year
2010    1,382         1,248              134                  9.70
2009    42,995        38,251             4,744                11.03
2008    37,790        32,242             5,548                14.68
2007    26,133        21,874             4,259                16.30
2006    16,669        13,390             3,279                19.67
2005    9,932         7,561              2,371                23.87
2004    6,745         4,910              1,835                27.21
2003    2,561         1,827              734                  28.66
2002    1,659         1,179              480                  28.93
2001    729           470                259                  35.53
2000    251           172                79                   31.47
1999    186           115                71                   38.17
Table 30 is illustrated by the following figure (Figure 12):
Figure 12 – URL Decay
As is evident from Figure 12, the notion of URL decay, as found in citations (see Wren 2004; Wren 2008), applies equally to full-text journals. The trend may be described as a function of publication year: the older the publication, the fewer accessible resources exist within it.

[34] The summary is based on a total of 147,032 URLs, not 147,133, because the metadata for the articles containing the remaining 101 URLs was not extracted.
[Figure 12 plots the percentage of accessible resources (y-axis: % of Accessible Resources, 0-100) against publication year, 2010 to 1999.]
6.1.1. Discussions
This section provides some discussions of facts presented in regards to URLs within PMC
documents and the underlying implementation of ExtConX2 which may have affected those facts.
Potential suggestions or improvements are also provided.
(1) FTPs
One limitation of the data presented is that FTP URIs were not checked for availability. However, analysis of the extracted data showed that of the 147,133 URLs extracted, only 791 (or 0.5%) were FTP URIs; hence the impact on the statistics presented is minimal.
(2) Resource Availability
The method adopted to check resource availability has some weaknesses. As availability is only checked once, before insertion into the database, the accuracy of the availability results may be affected. For instance, web servers do not have 100% up-time or unlimited capacity for online traffic, and either factor may have influenced the results. A better approach to maximise the accuracy of URL availability would be to implement an additional module which crawls the database and updates the URL status appropriately. For instance, Wren's (2004, 2008) approach would be ideal: URLs were checked every day over a four-week period, and any URL which was accessible over 90% of the time was deemed an active resource.
In addition, due to the project time constraint, the implementation for checking URL availability had a 10-second time-out limit.[35] As some web servers take longer to respond to HTTP requests, this limit may have affected the results presented.[36]
(3) Soft Decision and Resource Identification
Approximately 10.5% of the resources identified were incorrectly classified, and 8.5% were not identified at all. A manual review of these documents (and others) shows two primary issues with the implementation. The use of keywords to identify resources failed due to (1) a lack of keywords within the citation context indicating the type of resource, and (2) resource types not accounted for in the implementation, e.g., laboratory tools and equipment. The latter limitation may be addressed by creating a new list of keywords that characterises laboratory tools and equipment and making some minor amendments to the implementation to facilitate an additional resource type.
[35] Wren's (2004, 2008) implementation had a 60-second time-out limit, which is probably a more appropriate value. However, as the URL data was loaded into the database during the last 10 days of the dissertation, a 60-second timeout would have taken around 12 days (considering the existing system issues: see Section 6.3).
[36] Testing of the implementation confirmed cases which took more than 10 seconds to confirm the accessibility of URLs.
Moreover, both the soft decision algorithm (i.e., the distributed weights applied to instances) and the method used for resource classification could be further improved. For instance, consider the following generic citation, similar to examples found in the manually analysed documents, which the soft decision algorithm failed to classify:
1. James [1] proved that the method has good performance.
This example does not include any keywords per se enabling classification of the referenced resource. However, the citation style (James [1]) indicates the Document type. Thus, regular expressions matching the pattern 'NE [NUMBER]' could be applied as an additional method alongside the keyword lists.
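The suggested pattern could be implemented with a simple regular expression. The following sketch is illustrative only; the class name and the exact pattern are assumptions, not part of the implemented system:

```java
import java.util.regex.Pattern;

public class CitationPattern {
    // Hypothetical pattern: a capitalised name followed by a bracketed
    // reference number, e.g. "James [1]", taken to indicate a Document resource.
    static final Pattern CITE = Pattern.compile("\\b[A-Z][a-z]+ \\[\\d+\\]");

    static boolean looksLikeDocumentCitation(String s) {
        return CITE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDocumentCitation(
            "James [1] proved that the method has good performance.")); // true
    }
}
```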
6.2. Role Extraction
The adopted rule-based approach to role extraction (i.e., extraction of NEs and corresponding REs)
achieved a recall of 67.6% and precision of 92.6% and F-score of 77.7%. The NER achieved a
recall of 69.9% and precision of 95%, and the extraction of REs achieved a recall of 75% and
precision of 97.6%. The evaluation was based on a random sample of 50 documents. From the
whole PMC dataset processed 86,751 acknowledgements were extracted, 71,615 of these were
identified as containing roles.
(1) Evaluation Principles
The evaluation was guided by the following principles:
- Acknowledgements of NEs with no roles were not considered.
- Acknowledgements of entities that were neither individuals nor organisations (e.g., laboratory staff, teams/groups, etc.) were not considered.
In addition, some acknowledgements, in particular of organisations, could contain two valid REs. In such cases, either role extracted was considered a true positive. For instance, in the following example both supported and grant are considered true positives:[37]
1. This work was supported by NIH grant

[37] For the evaluation of extracted REs, this example would be considered as containing one RE, and either one extracted would be counted as a true positive.
Acknowledgements of multiple NEs with an identical RE were considered as separate acknowledgements. For instance, the following acknowledgement would be considered to contain three separate roles (see Table 31):
2. We like to thank John Dough, Jim Baker, Zoe Zindan for reviewing the manuscript.
Table 31 – True Positives: Role Extraction
(1) Name Entity: John Dough
Role Expression: reviewing the manuscript
(2) Name Entity: Jim Baker
Role Expression: reviewing the manuscript
(3) Name Entity: Zoe Zindan
Role Expression: reviewing the manuscript
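The principle above — one RE fanned out over several NEs — can be sketched as follows. This is a minimal illustration; the class name and the splitting logic are assumptions, not the system's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class RoleFanOut {
    // Distributes one role expression over several comma/'and'-separated NEs,
    // producing one (NE, RE) pair per name, as in Table 31.
    static List<String[]> fanOut(String names, String role) {
        List<String[]> out = new ArrayList<>();
        for (String n : names.split(",| and ")) {
            n = n.trim();
            if (!n.isEmpty()) out.add(new String[]{n, role});
        }
        return out;
    }

    public static void main(String[] args) {
        fanOut("John Dough, Jim Baker, Zoe Zindan", "reviewing the manuscript")
            .forEach(p -> System.out.println(p[0] + " -> " + p[1]));
    }
}
```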
(2) Extracted Facts
Table 32 shows the most acknowledged funding organisations within PMC. As the role extraction system does not handle acronyms prior to IE (i.e., organisations and their corresponding acronyms are extracted as separate roles), additional manual analysis was needed to produce this result. In addition, some organisations have identical names in different countries; for instance, a National Cancer Institute exists in both the US and Canada. This was not taken into consideration. However, the other organisations presented (Table 32) are unique, either by country or globally.
Table 32 – Most Acknowledged Funding Organisation
Name of Funding Organisations    Total Nr. of Acknowledgements
1 National Institutes of Health 10,613
2 National Science Foundation 3,099
3 Wellcome Trust 2,287
4 European Union 1,443
5 Deutsche Forschungsgemeinschaft 1,301
6 National Cancer Institute (US and Canada) 1,114
7 Canadian Institutes of Health Research 928
8 Biotechnology and Biological Sciences Research Council (BBSRC) 829
9 European Commission 746
10 National Health and Medical Research Council (NHMRC) 663
11 National Natural Science Foundation of China 548
12 Swedish Research Council 538
13 Swiss National Science Foundation 467
6.2.1. Discussions
The overall performance of the IE task was quite poor in terms of recall, due to a combination of factors, the most notable being the performance of the NER. As both the RE Transducer and the Role Context Transducer (refer to Section 4.4.2) rely on the good performance of the NER, a domino effect led to the overall poor performance. A description of the NER, the RE Transducer, and the Role Context Transducer follows:
(1) NER
A couple of issues with the NER processing resources were non- or partial recognition of (1) non-English names and (2) multi-word organisation NEs.
NEs that did not adhere to the customary orthographical rules of English name spelling (i.e., capitalised initials of NNPs) accounted for a significant number of cases. Common examples included Italian names, e.g., Marco de Bartol (note the lowercase particle), and Chinese names, which often adhere to English orthography but include two-letter NNPs, e.g., Hurng-Yi Wang, which were not recognised by the NER.
Another issue was the non-recognition of multi-word organisations. Some examples from the extracted data include:
i. Ministry of Health, Labour and Welfare of Japan
ii. Ministry of Education, Science, Sports and Culture of Japan
iii. Mental Illness Research, Education and Clinical Centre
A potential approach to this issue would be lexical level processing, such as the expansion of the gazetteer. While around 150 organisation names were added during the development process, this was clearly inadequate.
(2) RE Transducer
Factors affecting the performance of the RE Transducer (the labelling of collaboration and funder roles) include (1) the poor performance of the NER system and (2) limitations in the variety of rules used.
The sole pattern used for labelling collaboration roles was (see Table 33 for an explanation):[38]
i. [Person] [for|who|provided] [PRP]? [ROLE]

[38] The given pattern is somewhat simplified, but represents the generic rule applied in the RE Transducer.
Table 33 – Description of RE Transducer Rule
Pattern Description
[Person] NE: person
[for|who|provided] Word token: for, who, or provided
[PRP]? Possessive pronoun: his, her, their, etc. (may or may not exist).
[ROLE] The role being labelled, if and only if the preceding patterns were matched.
Thus, roles that did not adhere to the above pattern were ignored. Below is a common example
identified during the evaluation of the system (NEs are in bold):
i. We like to thank Jim Dough, John Stew, and John Crow from Manchester University, UK,
for helping with the laboratory work.
Here, as no NE immediately precedes the relevant RE (i.e., helping with the laboratory work), the processing resource fails to identify the RE. See the discussion of the Role Context Transducer for an example of the RE Transducer failing to identify an RE due to the poor performance of the NER.
(3) Role Context Transducer
The performance of the Role Context Transducer is almost entirely dependent on preceding
resources, in particular, the NER and RE Transducer. The semantic level processing uses an
identical pattern used by the RE Transducer. However, in contrast, a NE or consecutive NEs which
are followed by a RE (identified by prior processing resource) are collectively labelled as Role
Context. Given that the NER and RE Transducer have correctly identified existing NEs and a RE,
the following example illustrates the ideal result of the application of the Role Context Transducer
(see highlighted text):
i. We are indebted to Brian Boyle, Mark Andersen, and Jeffrey Dean for critically
reviewing the manuscript.
However, due to the domino effect initiated by the poor performance of the NER, the performance of the Role Context Transducer, and therefore the evaluation results, were affected. The following examples illustrate a couple of common results observed during the evaluation stage (identified NEs are in bold and the identified RE is in bold and underlined):
i. We are indebted to Michel Cusson, Pierre Fobert, Frédéric Vigneault, Brian Boyle, Mark
Andersen, and Jeffrey Dean for critically reviewing the manuscript.
ii. We are indebted to Michel Cusson, Brian Boyle, Mark Andersen, and especially Jeffrey
Dean for critically reviewing the manuscript.
In the first example, Mark Andersen is not identified as an NE by the NER process. Therefore, as the Role Context Transducer relies on either consecutive NEs[39] or a single NE followed by an RE, only 1 out of 6 roles is identified by the Role Context Transducer.
In the second example, the NER processing has failed to identify Jeffrey Dean; hence, the RE Transducer is unable to identify any RE, and subsequently the Role Context Transducer fails to identify any roles.
This domino effect initiated by the poor performance of the NER was one of the most significant issues of the IE application. This limitation may be addressed by expanding the gazetteer and adding rules for the recognition of non-English NEs.
6.3. System Limitations
The following environment (Table 34) was used during the development and evaluation of
ExtConX2:
Table 34 - Development and Evaluation Environment
Nr. Environment Value
1 Operating System Windows 7 Home Edition 32-bit
2 Database Server MySQL 5.0
3 Processor Intel Core2 Solo 1.4Ghz
4 Memory Ram 2GB
5 JVM Maximum Memory 512MB
The following sections discuss a couple of specific software issues uncovered during the evaluation stage:
(1) URL Module
The current implementation of the URL-availability check contains a bug inherited from
the Java API used to open connections (i.e., HttpURLConnection). While the cause has not
been definitively confirmed, it appears to be caused by servers that do not allow programmatic
HTTP connections; this is assumed because none of the URLs checked manually were unavailable
or syntactically invalid. Furthermore, the API freezes when trying to obtain a response code
from the host to determine whether the URL is accessible. This issue can be solved by the use of
threads: if no response is received within a certain amount of time, the thread can safely be
terminated (without affecting any concurrent processes) and the URL marked for a manual
check.
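A thread-based timeout along these lines could be sketched as follows. This is a minimal illustration and not the ExtConX2 implementation; the class name, status strings, and timeout value are invented for the example:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class UrlChecker {

    // Run a probe on a worker thread and give up if it does not answer in time.
    public static String checkWithTimeout(Callable<Integer> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Integer> future = executor.submit(probe);
        try {
            int code = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return code < 400 ? "AVAILABLE" : "UNAVAILABLE";
        } catch (TimeoutException e) {
            future.cancel(true);   // terminate the probe thread only
            return "MANUAL_CHECK"; // flag the URL for a manual check
        } catch (Exception e) {
            return "UNAVAILABLE";  // connection failed outright
        } finally {
            executor.shutdownNow();
        }
    }

    // Probe a URL via HttpURLConnection, bounded by the same timeout.
    public static String checkAvailability(String url, long timeoutMillis) {
        return checkWithTimeout(() -> {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout((int) timeoutMillis);
            conn.setReadTimeout((int) timeoutMillis);
            return conn.getResponseCode();
        }, timeoutMillis);
    }
}
```

Because a non-responding server only stalls its own worker thread, concurrent checks of other URLs are unaffected.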
39 Consecutive NEs must be separated by commas or the word token "and".
(2) IE Module
The IE application, which handles the text pre-processing, is unable to process acknowledgement
paragraphs over 200 words in the environment used: a java.lang.OutOfMemoryError: Java heap
space exception is thrown because the Java Virtual Machine (JVM) heap size is insufficient.
This is a known issue with the GATE API (Cunningham et al. 2010, p.35). In the environment
used, the JVM maximum memory could not be increased; to address this issue, the Java maximum
heap size needs to be set to 768MB or more.
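On hardware with sufficient memory, the heap limit can be raised when launching the application; the jar name below is illustrative only:

```shell
# Raise the JVM maximum heap size to 768 MB (jar name is hypothetical)
java -Xmx768m -jar ExtConX2.jar
```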
7. Conclusion
The aim of this project was to develop a text mining system (ExtConX2) to enable:
(1) the exploration of acknowledgements of individuals and organisations, and
(2) analysis of URL decay and most often referenced online resources.
Table 35 summarises the project aims, which have all been fully met.
Table 35 – Accomplished Project Aims
Project Aims
1 Design and implement a relational database (Db) schema to store extracted data.
2 Design and implement a module to extract URLs from documents, determine if the given
URL is accessible or not, determine type of resource (or URL) extracted/referenced and
insert this data into a database.
3 Design and implement a module to identify and extract funders and collaborators (i.e.,
persons/organisations and their respective roles) from acknowledgements and insert this
data into a database.
4 Design and implement a GUI that will facilitate exploration of system functionalities and which provides general statistics.
5 Evaluation of the proposed methodology.
TM techniques were used to achieve the main functional requirements of the system. In particular,
lexical-, syntactic-, and semantic-level NLP processing was used for
acknowledgement extraction. In addition, a rule-based approach (JAPE) was used for semantic-
level processing to enable the IE task of role extraction. We differentiated between two classes of
roles: funders and contributors. Finally, a combination of regular expressions and lists containing
keywords were used for extraction of URLs and classification of these resources into four classes
(i.e., Databank, Document, Organisation, and Software).
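This last step can be illustrated with a much-simplified sketch of regex-based URL extraction followed by keyword classification. The pattern and keyword lists here are abbreviated stand-ins for the fuller ones used by the system (the complete keyword lists are given in Appendix B), and the class name is invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlClassifier {

    // Simplified URL pattern; the real system uses a more elaborate expression.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[\\w.-]+(?:/[\\w./-]*)?");

    // Abbreviated keyword lists per resource class (see Appendix B for the full lists).
    private static final Map<String, List<String>> KEYWORDS = Map.of(
            "Databank", List.of("genbank", "ncbi.nlm.nih.gov", "database"),
            "Document", List.of(".pdf", "journal", "article"),
            "Software", List.of("sourceforge", "software", "tool"),
            "Organisation", List.of("institute", "foundation", "organisation"));

    // Pull every URL-shaped token out of free text.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Assign the first resource class whose keyword occurs in the URL.
    public static String classify(String url) {
        String lower = url.toLowerCase();
        for (Map.Entry<String, List<String>> entry : KEYWORDS.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (lower.contains(keyword)) {
                    return entry.getKey();
                }
            }
        }
        return "Unknown"; // no keyword matched
    }
}
```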
As part of the project, we have processed a set of 190,000 full-text journal articles from PubMed
Central.40
A subset of 50 documents was manually checked to evaluate ExtConX2's performance.
For URL extraction, the system achieved 98.6% precision and 96% recall. For URL resource
classification, the system was able to correctly classify 81.1% of URLs (recall) with precision of
88.7%. For role extraction, the system achieved 92.7% precision, 67.6% recall and an F measure of
77.7%.
Using this data, we have analysed some trends in URL decay and acknowledgements. For example,
we found that URL decay can be described as a function of publication year: the older the
publication, the less accessible the resources it contains. We also found that most
funding acknowledgements were associated with National Institutes of Health.
40 However, the full dataset was not available in XML format; hence, roughly 120,000-130,000 documents were processed.
While prior research has produced applications similar to ExtConX2, this project has extended the scope
of that research by analysing larger datasets and adopting more sophisticated approaches. For
instance, Wren's (2004, 2008) studies were confined to PubMed citations, while ExtConX2 has
enabled the analysis of URL decay within full-text articles. This has allowed us to draw more
holistic conclusions regarding the scope of URL decay within the biomedical domain. In
addition, ExtConX2 is the first system to enable acknowledgement extraction within PMC.
7.1. Limitations and Future Work
The following list defines ExtConX2‘s limitations and provides suggestions for future
enhancements:
1. The URL Module is currently only able to check HTTP URLs (i.e., http:// and https://) for
availability. Additional implementation is needed for the File Transfer Protocol (FTP).
2. The IE Module extracts an organisation name and its abbreviation as separate NEs,
resulting in two separate roles. This could be handled by implementing an additional step for
acronym detection.
3. The soft-decision rules and keywords used for resource classification may be further studied and
improved. For instance, an additional category, laboratory tools and equipment, ought to
be added.
4. Concurrent processing should be implemented to speed up resource-availability checks and to
handle non-responding URLs, addressing the system issues discussed.
5. Currently the implementation analyses only acknowledgements within explicitly defined
acknowledgement sections. However, other sections may also contain acknowledgement text.
6. The facts presented are quite limited; with the data already extracted, other
patterns/relationships may be uncovered, e.g., (1) the resource types and journals
most affected by URL decay, and (2) the relationship between funding organisations and
the research disciplines most often sponsored.
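The acronym-detection step suggested in point 2 could, under a simple initial-letter-matching assumption, look like the following. This helper is hypothetical and not part of ExtConX2; it only illustrates how an acronym and its long form might be paired so the two NEs can be merged:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcronymDetector {

    // Map each "(ABC)"-style acronym to the long form whose initials spell it.
    public static Map<String, String> detect(String text) {
        Map<String, String> acronyms = new LinkedHashMap<>();
        Matcher m = Pattern.compile("\\(([A-Z]{2,})\\)").matcher(text);
        while (m.find()) {
            String acronym = m.group(1);
            String before = text.substring(0, m.start()).trim();
            if (before.isEmpty()) {
                continue;
            }
            String[] words = before.split("\\s+");
            // Walk backwards, matching word initials against the acronym and
            // skipping short lowercase fillers such as "of" and "for".
            int i = words.length - 1;
            int k = acronym.length() - 1;
            int start = -1;
            while (i >= 0 && k >= 0) {
                char initial = Character.toUpperCase(words[i].charAt(0));
                if (initial == acronym.charAt(k)) {
                    start = i;
                    k--;
                    i--;
                } else if (words[i].length() <= 3
                        && Character.isLowerCase(words[i].charAt(0))) {
                    i--; // filler word, keep scanning backwards
                } else {
                    break; // initials no longer line up
                }
            }
            if (k < 0 && start >= 0) {
                acronyms.put(acronym,
                        String.join(" ", Arrays.copyOfRange(words, start, words.length)));
            }
        }
        return acronyms;
    }
}
```

A downstream step could then merge the long-form NE and the acronym NE into a single role rather than two.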
In addition, other topics of interest were identified during the course of this project:
1. Document representation seems to be changing: more and more documents do not provide
visible/printable URLs; instead, hyperlinks encapsulating URL strings are provided.
2. It would be interesting to analyse the type of applications referenced within PMC. For
instance, what types of software are referenced and what are their uses?
References
Ananiadou, S. & McNaught, J., 2006. Text Mining for Biology and Biomedicine. Artech House: London.
Ananiadou, S. et al., 2005. The National Centre for Text Mining: Aim and Objectives. Ariadne, [online] 30 Jan., (42). Available at: http://www.ariadne.ac.uk/issue42/ananiadou/ [Accessed 13
April 2010].
Appelt, E.D. & Israel, J.D., 1999. Introduction to Information Extraction Technology: A Tutorial Prepared for IJCAI-99. [Online] Available at: http://user.phil-fak.uni-
duesseldorf.de/~rumpf/SS2005/ Informationsextraktion/Pub/AppIsr99.pdf [Accessed 1 May 2010].
Automatic Content Extraction (ACE), 2004. Automatic Content Extraction 2004 Evaluation
(ACE04). [Online] Available at: http://www.itl.nist.gov/iad/mig//tests/ace/2004/ [Accessed 10 May
2010].
Baeza-Yates, R. & Ribeiro-Neto, B., 1999. Modern Information Retrieval. Pearson
Education Limited. ACM Press, New York.
Bennet, S., McRobb, S. & Farmer, R., 2006. Object-Oriented Systems Analysis and Design, 3rd ed. McGraw-Hill: London.
Berners-Lee, T., Fielding, R. & Frystyk, H., 1996. Hypertext Transfer Protocol -- HTTP/1.0.
[Online] Available at: http://www.ietf.org/rfc/rfc1945.txt [Accessed 4 September 2010].
Black, J.W. et al., 2005. CAFETIERE: Conceptual Annotation for Facts, Events, Terms, Individual Entities, and Relations. Parmenides Technical Report TR-U4.3.1. [Online] Available at:
http://ilk.uvt.nl/~kzervanou/dwn/TRU431.pdf [Accessed 4 September 2010].
Chinchor, N. & Sundheim, B., 1993. MUC-5 Evaluation Metrics. Proceedings of the 5th Conference on Message Understanding. Baltimore, Maryland, USA, 25-27 August 1993. [Online]
Available at: http://www.aclweb.org/anthology-new/M/M93/M93-1007.pdf [Accessed 9 May 2010].
Cunningham, H. et al., 2010. Developing Language Processing Components with GATE Version 5
(a User Guide). [Online] Available at: http://Gate.ac.uk/sale/tao/tao.pdf [Accessed 9 May 2010].
Cunningham, H., 2006. Information Extraction, Automatic. In: Brown, K., ed. Encyclopedia of Language & Linguistics, 2nd ed. Oxford: Elsevier.
Fayyad, U. Piatetsky-Shapiro, G. & Smyth, P., 1996. Knowledge Discovery and Data Mining:
Towards a Unifying Framework. Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining. Portland, Oregon, USA, 2-4 August 1996. [Online] Available at: http://www.aaai.org/Papers/KDD/1996/KDD96-014.pdf [Accessed 21 April 2010].
Frankling, S., 2010. XML Parser: DOM and SAX Put to the Test. [Online] Available at: http://www.devx.com/xml/Article/16922/1954 [Accessed 27 August 2010].
Frantzi, K., Ananiadou, S. & Mima, H., 2000. Automatic Recognition of Multi-word Terms. International Journal of Digital Libraries, 3(2), pp.117-132.
Gerner, M. Nenadic, G. & Bergman, C. M., 2010. An Exploration of Mining Gene Expression
Mentions and their Anatomical Locations from Biomedical Text. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Uppsala, Sweden, 15 July 2010. [Online]
Available at: http://www.aclweb.org/anthology/W/W10/W10-1909.pdf [Accessed 4 September
2010].
Giles, C.L. & Councill, G.I., 2004. Who gets acknowledged: Measuring scientific contribution
through automatic acknowledgment indexing. PNAS, 101(51), pp.599-604.
Hahn, U. & Wermter, J., 2006. Levels of Natural Language Processing for Text Mining. In:
Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House:
London.
Hearst, M.A., 1999. Untangling Text Data Mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. College Park, Maryland, USA, 20-26 June 1999. [Online] Available at: http://www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html [Accessed 14 April 2010].
Hotho, A. Nurnberger, A. & Paaß, G., 2005. A Brief Survey of Text Mining. LDV-Forum, 20(1), pp.19-62.
JISC, 2006. Text Mining: Briefing Paper. [Online] Available at:
http://www.jisc.ac.uk/media/documents/publications/textminingbp.pdf [Accessed 16 April 2010].
Kim, J. & Tsujii, J., 2006. Corpora and Their Annotation. In: Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.
Hearst, M.A., 2003. What is Text Mining? [Online] Available at: http://www.ischool.berkeley.edu/~hearst/text-mining.html [Accessed 14 April 2010].
McNaught, J. & Black, W.J., 2006. Information Extraction. In: Ananiadou, S. &
McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.
National Institutes of Health (NIH), 2010. [Online] Available at: http://www.nih.gov/icd/ [Accessed 6 August
2010].
National Library of Medicine (NLM), 2010a. Fact Sheet. [Online] Available at:
http://www.nlm.nih.gov/pubs/factsheets/pubmed.html [Accessed 13 April 2010].
National Library of Medicine (NLM), 2010b. http://dtd.nlm.nih.gov/publishing/ [Accessed 25 August 2010].
National Library of Medicine (NLM), 2010c. http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/ tagging-guidelines/article/tags.html [Accessed 25 August 2010].
National Library of Medicine (NLM), 2009. Key MEDLINE® Indicators. [Online] Available at: http://www.nlm.nih.gov/bsd/bsd_key.html [Accessed 13 April 2010].
National Library of Medicine (NLM), 2008. Fact Sheet: MEDLINE®. [Online] Available at:
http://www.nlm.nih.gov/pubs/factsheets/medline.html [Accessed 13 April 2010].
Polajnar, T., 2006. Survey of Text Mining of Biomedical Corpora. [Online] Available at:
http://www.dcs.gla.ac.uk/~tamara/surveyoftm.pdf [Accessed 10 May 2010].
Sommerville, I., 2004. Software Engineering. 7th ed. London: Pearson.
Tateisi, Y., 2004. GENIA Corpus. [Online] Available at: http://www-tsujii.is.s.u-
tokyo.ac.jp/~genia/topics/Corpus/ [Accessed 13 May 2010].
Tsuruoka, Y. et al., 2005. Developing a Robust Part-of-Speech Tagger for Biomedical Text.
Advances in Informatics: 10th Panhellenic Conference on Informatics. Volas, Greece 11-13
November 2005. [Online] Available at:
http://www.springerlink.com/content/3275150j32h61345/fulltext.pdf [Accessed 14 May 2010].
Uramoto, N. et al., 2004. A text-mining System for Knowledge Discovery from Biomedical
Documents. IBM Systems Journal, 43(3), pp.516-533.
Wikipedia, 2009. Extensibility. [Online] Available at: http://en.wikipedia.org/wiki/Extensibility
[Accessed 22 August 2010].
Wikipedia, 2010. Research Funding. [Online] Available at:
http://en.wikipedia.org/wiki/Research_funding [Accessed 6 August 2010].
Wren, D.J., 2004. 404 not found: the stability and persistence of URLs published in MEDLINE.
Bioinformatics, 20(5), pp.668-672.
Wren, D.J., 2008. URL decay in MEDLINE—a 4-year follow-up study. Bioinformatics, 24(11),
pp.1381-1385.
Zelenko, D., Aone, C. & Richardella, A., 2003. Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 3, pp.1083-1106.
Zhou, G., Su, J., Zhang, J. & Zhang, M., 2005. Exploring Various Knowledge in Relation Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Ann Arbor, Michigan, USA, 25-30 June 2005. [Online] Available at: http://www.aclweb.org/anthology-new/P/P05/P05-1053.pdf [Accessed 10 May 2010].
Appendix A – System Architecture and Design
Figure 13 - System Db EER Diagram
System Architecture
Figure 14 - ExtConX2 Architectural Design
Default ANNIE Modules
Figure 15 - ANNIE Default IE Modules (www.gate.ac.uk)
Appendix B – Implementation
Table 36 – List of Keywords for Resource Type Identification
Databank Document Software Organisation
Annotate .doc Algorithm Organisation
Data bank .pdf Application Organization
Databank .txt BLAST Institute
Database Article Interface Foundation
Genbank Artikel Program International Agency
geneontology.org Biomedcentral.com r-project.org
ncbi.nlm.nih.gov/biosystems/ Book Software
ncbi.nlm.nih.gov/cancerchromo Chapter Sourceforge
ncbi.nlm.nih.gov/cdd Conclu System
ncbi.nlm.nih.gov/dbEST Content Tool
ncbi.nlm.nih.gov/dbvar Data
ncbi.nlm.nih.gov/domains Dictionary
ncbi.nlm.nih.gov/epigenomics Doc
ncbi.nlm.nih.gov/gap Document
ncbi.nlm.nih.gov/gds dx.doi.org
ncbi.nlm.nih.gov/Genbank/ Elsevie
ncbi.nlm.nih.gov/gene Facts
ncbi.nlm.nih.gov/genome/ Genomebilogogy
ncbi.nlm.nih.gov/genomes/FLU/ Gudeline
ncbi.nlm.nih.gov/geo Icmje
ncbi.nlm.nih.gov/homologene Info
ncbi.nlm.nih.gov/nuccore Interscience/wiley
ncbi.nlm.nih.gov/nucest Issue
ncbi.nlm.nih.gov/nucgss Journal
ncbi.nlm.nih.gov/omia Molvis.org
ncbi.nlm.nih.gov/omim News
ncbi.nlm.nih.gov/pcassay Overview
ncbi.nlm.nih.gov/pccompound Paper
ncbi.nlm.nih.gov/pcsubstance Publication
ncbi.nlm.nih.gov/pcsubstance Report
ncbi.nlm.nih.gov/peptidome Result
ncbi.nlm.nih.gov/popset Review
ncbi.nlm.nih.gov/probe statistic
ncbi.nlm.nih.gov/projects/CCDS/ stats
ncbi.nlm.nih.gov/projects/gensat/ table
ncbi.nlm.nih.gov/projects/sky/ Vol
ncbi.nlm.nih.gov/projects/SNP Volume
ncbi.nlm.nih.gov/protein Wikipedia.org
ncbi.nlm.nih.gov/proteinclusters
ncbi.nlm.nih.gov/RefSeq/
ncbi.nlm.nih.gov/SNP
ncbi.nlm.nih.gov/Structure/
ncbi.nlm.nih.gov/Structure/VAST/
ncbi.nlm.nih.gov/taxonomy
ncbi.nlm.nih.gov/unigene
ncbi.nlm.nih.gov/unists
ncbi.nlm.nih.gov/VecScreen/
pubchem.ncbi.nlm.nih.gov/
Appendix C – Evaluation Data
Table 37 – URL Extraction Data
PMCID | Total Nr. URLs | Extracted URLs | Duplicate URLs | Correct Resource Type Identified of Extracted URLs
PMC2413013 4 4 0 2
PMC2761731 4 3 1 1
PMC1988857 9 9 0 1
PMC2752617 4 3 0 3
PMC2764095 9 9 0 9
PMC2661364 3 4 1 3
PMC2111041 2 2 0 2
PMC1919404 41 41 0 40
PMC2533341 2 2 0 2
PMC1779804 6 4 0 4
PMC1525208 3 3 0 1
PMC2768983 6 6 0 3
PMC2731543 4 4 0 4
PMC2801496 2 2 0 0
PMC1624845 1 1 0 1
PMC1839892 1 1 0 1
PMC2206495 4 3 0 3
PMC2239252 1 1 0 1
PMC2685015 6 4 0 1
PMC2440928 2 2 0 2
PMC2478650 5 5 0 4
PMC2793031 1 1 0 1
PMC1994066 3 3 0 3
PMC2515323 2 2 0 2
PMC2765943 4 4 0 3
PMC1599749 5 5 0 4
PMC2570968 8 9 1 9
PMC2787492 7 7 0 7
PMC2806257 5 5 0 3
PMC1805747 2 2 0 2
PMC2276520 7 7 0 6
PMC2600755 4 4 0 4
PMC2071966 2 2 0 1
PMC1266361 1 1 0 1
PMC2755136 4 4 0 2
PMC2600409 2 2 0 2
PMC2405930 1 1 0 1
PMC1851970 2 2 0 2
PMC1698487 6 5 0 5
PMC2671451 2 2 0 2
PMC2759026 4 4 0 2
PMC2627827 1 1 0 1
PMC441568 6 6 0 5
PMC1797064 6 6 0 4
PMC2657239 4 4 0 4
PMC151303 3 3 0 3
PMC2018828 10 10 0 10
PMC1790700 4 4 0 1
PMC2791112 2 2 0 1
PMC2740322 1 1 0 1
Table 38 – Role Extraction Data
PMCID | Nr. Relevant Roles | True Positives | Nr. Partially Extracted Roles | False Positives
PMC2750102 5 0 1 2
PMC2761731 3 2 0 0
PMC2246224 2 2 0 0
PMC519127 14 13 0 0
PMC2293642 4 2 0 0
PMC2759026 2 2 0 1
PMC2688212 2 2 0 0
PMC2588630 7 2 0 0
PMC2718519 5 2 1 0
PMC1885552 5 2 1 0
PMC545072 4 1 0 0
PMC1940049 3 3 0 0
PMC2528195 7 4 0 0
PMC1819381 5 5 0 0
PMC1805747 2 2 0 0
PMC2442612 7 5 1 0
PMC1712367 8 7 0 0
PMC2453772 13 8 0 0
PMC2672046 5 5 0 0
PMC2734341 1 1 0 0
PMC2779906 2 2 0 0
PMC2291575 2 2 0 0
PMC2533119 9 8 0 0
PMC2764095 6 4 0 0
PMC2082466 9 2 0 0
PMC2709726 8 1 0 0
PMC102553 3 2 1 0
PMC2121139 8 4 0 0
PMC2658886 4 4 0 0
PMC2734340 2 1 0 0
PMC2186343 2 0 1 0
PMC166148 5 2 0 0
PMC1616969 5 4 0 0
PMC222959 5 5 0 0
PMC2246224 2 2 0 0
PMC2702309 3 2 0 0
PMC1379658 3 2 1 1
PMC102419 4 3 0 0
PMC2391254 5 4 0 0
PMC2751461 3 3 0 1
PMC128935 4 1 0 0
PMC2427038 4 4 0 0
PMC546163 8 5 0 0
PMC2759976 1 1 0 0
PMC2714901 4 4 0 0
PMC2532720 5 4 1 0
PMC1481595 6 3 0 0
PMC2671166 3 3 0 0
PMC1459217 4 4 0 0
PMC2738522 5 5 0 0
Table 39 – Role Expression Extraction Data
PMCID | Nr. Relevant REs | True Positives | Nr. Partially Extracted REs | False Positives
PMC2750102 4 3 0 0
PMC2761731 3 2 0 0
PMC2246224 2 2 0 0
PMC519127 5 4 0 0
PMC2293642 4 2 0 1
PMC2759026 2 3 0 0
PMC2688212 2 2 0 0
PMC2588630 3 2 0 0
PMC2718519 5 3 0 0
PMC1885552 4 3 0 0
PMC545072 4 1 0 0
PMC1940049 3 3 0 0
PMC2528195 7 2 0 0
PMC1819381 5 5 0 0
PMC1805747 2 2 0 0
PMC2442612 6 5 0 0
PMC1712367 4 3 0 0
PMC2453772 7 5 0 0
PMC2672046 2 2 0 0
PMC2734341 1 1 0 0
PMC2779906 2 2 0 0
PMC2291575 2 2 0 0
PMC2533119 3 2 0 0
PMC2764095 3 1 0 0
PMC2082466 3 1 0 0
PMC2709726 1 1 0 0
PMC102553 3 2 1 0
PMC2121139 4 2 0 0
PMC2658886 1 1 0 0
PMC2734340 2 1 0 0
PMC2186343 1 0 1 0
PMC166148 3 2 0 0
PMC1616969 4 3 0 0
PMC222959 2 2 0 0
PMC2246224 1 1 0 0
PMC2702309 1 1 0 0
PMC1379658 2 1 0 0
PMC102419 3 3 0 0
PMC2391254 4 3 0 0
PMC2751461 1 1 0 0
PMC128935 1 1 0 0
PMC2427038 3 3 0 0
PMC546163 6 5 0 0
PMC2759976 1 1 0 0
PMC2714901 3 3 0 0
PMC2532720 4 3 0 0
PMC1481595 5 3 0 0
PMC2671166 2 2 0 0
PMC1459217 3 3 0 0
PMC2738522 2 2 0 0
Table 40 – Named Entity Extraction Data
PMCID | Nr. Relevant NEs | True Positives | Nr. Partially Extracted NEs | False Positives
PMC2750102 5 0 1 3
PMC2761731 3 2 0 0
PMC2246224 2 2 0 0
PMC519127 14 13 0 0
PMC2293642 4 2 0 0
PMC2759026 2 2 0 1
PMC2688212 2 2 0 0
PMC2588630 7 2 0 0
PMC2718519 5 3 0 0
PMC1885552 5 2 1 0
PMC545072 4 1 0 0
PMC1940049 3 3 0 0
PMC2528195 7 4 0 0
PMC1819381 5 5 0 0
PMC1805747 2 2 0 0
PMC2442612 7 6 0 0
PMC1712367 8 7 0 0
PMC2453772 13 8 0 0
PMC2672046 5 5 0 0
PMC2734341 1 1 0 0
PMC2779906 2 2 0 0
PMC2291575 2 2 0 0
PMC2533119 9 8 0 0
PMC2764095 6 4 0 0
PMC2082466 9 1 0 1
PMC2709726 8 1 0 0
PMC102553 3 3 0 0
PMC2121139 8 4 0 0
PMC2658886 4 4 0 0
PMC2734340 2 1 0 0
PMC2186343 2 1 0 0
PMC166148 5 2 0 0
PMC1616969 5 4 0 0
PMC222959 5 5 0 0
PMC2246224 2 2 0 0
PMC2702309 3 2 0 0
PMC1379658 3 3 0 1
PMC102419 4 3 0 0
PMC2391254 5 4 0 0
PMC2751461 3 3 0 1
PMC128935 4 1 0 0
PMC2427038 4 4 0 0
PMC546163 8 5 0 0
PMC2759976 1 1 0 0
PMC2714901 4 4 0 0
PMC2532720 5 5 0 0
PMC1481595 6 3 0 0
PMC2671166 3 3 0 0
PMC1459217 4 4 0 0
PMC2738522 5 5 0 0